
Tucker Congruence coefficient calculation


VITAS


Please read the whole text and also take a look at the link below before you post something. Thank you :)

 

Hi,

I'm working on some real-world science stuff and have to write a PHP function that compares datasets using the Tucker congruence coefficient.

I can't get it to work like it should and welcome any help in fixing my code.

You can see more details under the link below.

Thank you in advance.

More Details

 

 

A more detailed Explanation:

Some vocabulary:

Wavelength: a data point. This is about fluorescence spectrometry, aka measuring the intensity at different wavelengths of light.

Component: a dataset containing measurements for a range of wavelengths

Model: sort of a folder containing multiple components

Goal:

The user wants to find similar components to the ones contained in the model he selected.

how it (should) work(s):

The tuckercalc function calculates a value between 1 and 0 that represents the similarity between all wavelengths (aka measurements) of one component (aka dataset) and the wavelengths of another component (dataset).

1 means 100% identical and 0 means 0%.
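For reference, Tucker's congruence coefficient between two equal-length vectors is usually written as below. The symbols x_i, y_i and n are notation introduced here for illustration (intensities of the two components at the same n wavelengths), not names from the original post.

```latex
\phi(x, y) = \frac{\sum_{i=1}^{n} x_i \, y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2} \; \sum_{i=1}^{n} y_i^{2}}}
```

In general the value lies between -1 and 1. It stays between 0 and 1 when no value is negative, which matches the range described in the post.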

But we can only compare ranges of data we have for both components.

So we have to a) start at the smallest wavelength both components have and stop at the largest one both have, and b) interpolate missing data points that are in the source component (the one we want to find matches for) from the existing data of the target component, using linear interpolation.

What I do is first get the source component data from the DB and dump it into an array.

Then I cycle through all the data points in the DB that don't belong to the source component, trying to extract the needed wavelengths or interpolate them for each component of each model.

so what i should end up with is two arrays:

one source component array and one with the correct target wavelengths (without the unneeded ones and with the missing ones interpolated)
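A minimal sketch of how the two matching arrays could be built, assuming each spectrum is a PHP array keyed by integer wavelength (wavelength => intensity). The function name resampleOntoSource() and that array shape are hypothetical, not taken from the code behind the link.

```php
<?php
// Hypothetical helper: both spectra are assumed to be arrays of
// integer wavelength => intensity. Not the original poster's code.
function resampleOntoSource(array $source, array $target): array
{
    if (count($source) === 0 || count($target) === 0) {
        return [[], []];
    }
    ksort($source);
    ksort($target);
    $tWl = array_keys($target);

    // Only the wavelength range covered by both spectra can be compared.
    $lo = max(array_key_first($source), $tWl[0]);
    $hi = min(array_key_last($source), $tWl[count($tWl) - 1]);

    $sVals = [];
    $tVals = [];
    $j = 0; // index of the target point at or just below the current wavelength
    foreach ($source as $wl => $sVal) {
        if ($wl < $lo || $wl > $hi) {
            continue; // outside the shared range
        }
        while ($j + 1 < count($tWl) && $tWl[$j + 1] <= $wl) {
            $j++;
        }
        if ($tWl[$j] === $wl) {
            $tVal = $target[$wl]; // target has this wavelength, no interpolation
        } else {
            // Linear interpolation between the two neighbouring target points.
            $x0 = $tWl[$j];
            $x1 = $tWl[$j + 1];
            $tVal = $target[$x0]
                + ($target[$x1] - $target[$x0]) * ($wl - $x0) / ($x1 - $x0);
        }
        $sVals[] = $sVal;
        $tVals[] = $tVal;
    }
    return [$sVals, $tVals];
}

[$s, $t] = resampleOntoSource(
    [300 => 1.0, 310 => 2.0, 320 => 3.0, 330 => 4.0],
    [305 => 10.0, 315 => 20.0, 325 => 30.0]
);
var_dump($s, $t);
```

Both returned arrays always have the same length and line up index by index, which is what the coefficient calculation needs. Fractional wavelengths would need a different layout, since PHP array keys cannot be floats.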

once the loop finds out that a new component started it calls the tucker calc function, gets the similarity value and dumps the result into an array that is later used to make a bulk db insert after all components of all models have been checked.
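Detecting "a new component started" inside the loop is easy to get subtly wrong (for example, the last component is never flushed unless there is an extra call after the loop). One possible alternative is to group the rows per component first and only then loop over complete spectra. The row shape used here (component_id, wavelength, value) is an assumption for illustration, not the actual table layout.

```php
<?php
// Assumed (hypothetical) row shape:
// ['component_id' => int, 'wavelength' => int, 'value' => float]
function groupByComponent(array $rows): array
{
    $spectra = [];
    foreach ($rows as $row) {
        // One wavelength => value array per component.
        $spectra[$row['component_id']][$row['wavelength']] = $row['value'];
    }
    foreach ($spectra as &$spectrum) {
        ksort($spectrum); // ascending wavelength
    }
    unset($spectrum); // break the reference left over from the loop
    return $spectra;
}

$rows = [
    ['component_id' => 7, 'wavelength' => 310, 'value' => 0.5],
    ['component_id' => 8, 'wavelength' => 300, 'value' => 0.1],
    ['component_id' => 7, 'wavelength' => 300, 'value' => 0.2],
];
print_r(groupByComponent($rows));
```

Each entry of the result is one complete spectrum, so the similarity can be computed once per entry and collected for the bulk insert.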

my problem:

the result is always 0

what i know:

The Tucker calculation is correct until I start to compare the results of the source and the target.

And I also found that more wavelengths are compared than needed.

My guess:

My array preparation code (in the main function) isn't matching the source and target arrays, so each contains wavelengths the other doesn't, and because PHP tends to return zero instead of null for nonexistent variables, it messes up the tuckercalc function.
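A small demonstration of what probably happens with a wavelength that exists in only one of the arrays. Strictly speaking, PHP hands back null for a missing key, and that null then behaves like 0 in arithmetic. The $target array is made-up example data.

```php
<?php
$target = [300 => 1.5, 310 => 2.5]; // made-up data: wavelength => intensity

// Reading a missing key yields null (PHP 8 also emits a warning);
// the ?? operator reads it without the warning.
$missing = $target[320] ?? null;
var_dump($missing === null);              // bool(true)

// In arithmetic null counts as 0, so a missing point silently
// contributes nothing to a sum of products.
var_dump($missing * 2.0 === 0.0);         // bool(true)

// array_key_exists() tells "no such wavelength" apart from a real 0.0.
var_dump(array_key_exists(320, $target)); // bool(false)
```

If many target values are missing this way, every product with them is 0, which would be consistent with a result that is always 0.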

what i think is needed:

Someone to revisit my hot code mess in the main function, with the goal of making sure the arrays that are sent to the tuckercalc function are in order (as defined above).

The future:

I'm open to improvements that make the code faster and/or more compact after it's working as intended.

 

I know i repeat myself but i hope this is clearer :)

 

And another disclaimer for anybody new:

Yes, it's PHP, and no, this is not the place or time to voice your opinion about it or suggest different mathematical solutions.
This has to be done the way I described, with PHP. Getting this to work is my main priority. Making it beautiful and fast comes after that.

Edited by VITAS

Maybe I am writing nonsense, and I know nothing about PHP, only a little C and C++. Analyzing other people's code is too tedious for me :-)

 

A little specification would help, I think. How should it work? What exactly is the structure of the data you put into the algorithm, and what exactly happens to it?

If the algorithm is the formula you posted in the Wikipedia link, then that is trivial. If not, a thorough description of it would help.

What happens to the data? Shall the algorithm simply put out a value from -1 to 1? I understand somehow that 0/null values aren't interpreted correctly, but what would be correct?

 

I am sure programmers will show up soon ... sorry if that didn't help ...

 

Edit: oh, I see you did. I simply don't understand the jargon. Never mind ...

 

Edited by Green Baron

Yes, my explanation is a bit sloppy. I spent a week on it and I've just got no energy left in me to finish it. (But I need to finish it by the end of next week.)

It isn't that complicated, I would say. I'm just in a state of "huh?!"

The second part is to display the best matches and so on. I will simply get them from the cache tables afterwards.
I do that because there's no point in calculating everything over and over each time someone wants to find the best matches.

Btw, I use Laravel and that's another problem: while I was still looking at that framework I was tossed into this project and had to use it while learning it. So I would say my Laravel is crap but PHP is no problem.

(And just as a disclaimer: this has to be done in PHP (+Laravel) and using those algorithms, so "use x, it's better" won't help.)


Can't test as I don't use Laravel, but several remarks about calcTucker function:

$sT and $tT are both used as 2d-arrays, but you init them:

    $sT = [[],[],[],[]];

    $tT = [[],[],[]];


, i.e. as 4d and 3d arrays correspondingly, but then use them as 2d, assigning to $sT[][] and $tT[][] a scalar value (while every 2nd-level item is inited as an array).

Try to reduce dimensions in the declaration.

 

Both fragments
 


    //
    $sT = [[],[],[],[]];

    $sTs = [];

    foreach( $sdata as $sWl)
        $sT[0][] = pow($sWl,2);

    $sTs[0] = array_sum($sT[0]);

    foreach( $sT[0] as $sWl)
        $sT[1][] = $sWl/$sTs[0];

    $sTs[1] = array_sum($sT[1]);

    foreach( $sT[1] as $sWl)
        $sT[2][] = pow($sWl,2);

    $sTs[2] = sqrt(array_sum($sT[2]));

and


    $tT = [[],[],[]];

    $tTs = [];

    foreach( $tdata as $tWl)
        $tT[0][] = pow($tWl,2);

    $tTs[0] = array_sum($tT[0]);
    
    foreach( $tT[0] as $tWl)
        $tT[1][] = $tWl/$tTs[0];            

    $tTs[1] = array_sum($tT[1]);            

    foreach( $tT[1] as $tWl)
        $tT[2][] = pow($tWl,2);                

    $tTs[2] = sqrt(array_sum($tT[2]));

look absolutely similar, but you write this code twice.

Cut it off into a smaller function returning an array.
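One way the duplicated fragment could be pulled into a single function, as a sketch only. The name tuckerTerms() and the $sdata example values are made up here. The steps deliberately mirror the posted fragments (squares, squares divided by their sum, squares of those), without judging whether those steps are the right ones.

```php
<?php
// Hypothetical helper mirroring the three steps of the duplicated fragments.
// Returns [list of per-step arrays, list of per-step totals].
function tuckerTerms(array $data): array
{
    $squares = array_map(fn($v) => $v * $v, $data);
    $sumSquares = array_sum($squares);
    // Note: PHP 8 throws DivisionByZeroError here if all values are 0.
    $normalized = array_map(fn($v) => $v / $sumSquares, $squares);
    $normSquares = array_map(fn($v) => $v * $v, $normalized);

    return [
        [$squares, $normalized, $normSquares],
        [$sumSquares, array_sum($normalized), sqrt(array_sum($normSquares))],
    ];
}

$sdata = [3.0, 4.0]; // made-up example values
[$sT, $sTs] = tuckerTerms($sdata);
var_dump($sTs[0]); // float(25)
```

The same call with $tdata would replace the second fragment, so any fix only has to be made in one place.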

 

Check whether these two points affect your result.

 

No "pow($sWl,2)". Use "$sWl * $sWl", as pow is 1) expensive 2) inaccurate. "Pow" is for fractional or high degrees.

Edited by kerbiloid

$sT[0][] = pow($sWl,2); <- isn't that 2D?

I wrote it that long so I can find bugs. You're right that I can massively compact and speed it up once it's working.

I tried "$sWl ^2" and that didn't work, but yes, maybe * would. I'll keep that in mind once the code itself works.

 

Thank you for your input :)
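A likely reason the ^ attempt did not work: in PHP, ^ is the bitwise XOR operator, not a power operator. Squaring can be written with * or with the ** operator. A quick check:

```php
<?php
var_dump(5 ^ 2);    // int(7): bitwise XOR of binary 101 and 010
var_dump(5 ** 2);   // int(25): exponentiation
var_dump(5 * 5);    // int(25): plain multiplication
var_dump(2.5 ** 2); // float(6.25)
```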

 


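private function calcTucker($t_data, $s_data){
    $s_squares = [];
    $t_squares = [];
    $st_factors = [];
    
    // Both arrays must hold values for the same wavelengths, in the same order.
    // "\Exception" (with backslash) also resolves inside a namespaced class.
    if (sizeof($t_data) != sizeof($s_data)){
        throw new \Exception("Factor vectors must be same size");
    }
    
    $n = sizeof($t_data);
    
    for($i=0; $i<$n; $i++){
        $s_val = $s_data[$i];
        $t_val = $t_data[$i];
        
        // Strict comparison: a loose "== null" would also skip valid zeros.
        if($s_val === null || $t_val === null) continue;
        
        $s_squares[] = $s_val * $s_val;
        $t_squares[] = $t_val * $t_val;
        $st_factors[] = $s_val * $t_val;
    }
    
    $s_square_sum = array_sum($s_squares);
    $t_square_sum = array_sum($t_squares);
    $st_factor_sum = array_sum($st_factors);
    
    // The coefficient is undefined for an all-zero vector;
    // returning 0 in that case is an assumption, adjust as needed.
    $denominator = sqrt($s_square_sum * $t_square_sum);
    if ($denominator == 0) return 0;
    
    return $st_factor_sum / $denominator;
}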

Hope it helps :)

Uh, I had written a couple paragraphs, but then went and pasted my code over it :mad: 

My point was: your calcTucker() is waaaay overthought, and your algorithm is going places I don't follow (e.g. why are you dividing the squares by their sum? And why on Earth are you then summing over that array? The sum will always be 1!)

I also don't understand how you're using it (tho honestly I get really lazy following code when the variable names don't tell me anything about what they are...), and I think it's possible you are understanding the concept wrong. If your data is shaped with rows as observations (i.e., runs of an experiment) and the columns are the variables (results of each run), then the Tucker congruence measures similarity between two columns (kinda like correlation does). Is that what you're trying here? (I don't understand what your components and wavelengths are, in this statistical context of observations and variables.)

1 hour ago, VITAS said:

$sT[0][] = pow($sWl,2); <- isn't that 2D?

Yes, that is 2D. @kerbiloid (mentioned the wrong guy, gimme slack, I need some sleep) is probably thinking of dimensions in terms of vector spaces, and that's not what's usually meant by "2D array", which is kind of like "an array of arrays".

Edited by monstah

Oh, another thing. Like I said, I didn't quite follow how you used the function. The one I wrote here accepts two numerical arrays of the same length as inputs; your compare() function should then prepare the data, extracting each desired column as an array, and inputting those two to calcTucker()
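To make that contract concrete, here is a standalone sketch with a usage example. The function name tuckerCongruence() and the choice to throw on an all-zero vector are assumptions for illustration, not the calcTucker() method from this thread.

```php
<?php
// Standalone sketch (hypothetical name): takes two numeric arrays of the
// same length and returns their congruence coefficient.
function tuckerCongruence(array $x, array $y): float
{
    $x = array_values($x); // re-index so positions line up
    $y = array_values($y);
    if (count($x) !== count($y)) {
        throw new InvalidArgumentException('Vectors must have the same length');
    }
    $xy = 0.0;
    $xx = 0.0;
    $yy = 0.0;
    foreach ($x as $i => $xi) {
        $xy += $xi * $y[$i];
        $xx += $xi * $xi;
        $yy += $y[$i] * $y[$i];
    }
    if ($xx == 0.0 || $yy == 0.0) {
        // Assumption: treat an all-zero vector as an error instead of guessing.
        throw new DomainException('Congruence is undefined for an all-zero vector');
    }
    return $xy / sqrt($xx * $yy);
}

var_dump(tuckerCongruence([1, 2, 3], [2, 4, 6])); // float(1): same shape, different scale
var_dump(tuckerCongruence([1, 0], [0, 1]));       // float(0): nothing in common
```

The first call shows that only the shape matters, not the scale. Whatever prepares the data only has to deliver two arrays whose positions refer to the same wavelengths.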

Also: it's been over a decade since I wrote any PHP, and I've been doing Python for awhile now. My syntax might be all wrong, but I tried :sticktongue: 


I sum the array to add a check later. If it isn't 1, something went wrong.

I think the problem isn't the tuckercalc function but the compare function.

My hunch is that it has to do with PHP returning 0 instead of null for nonexistent values, and that the function tries to compare source values with target values that don't exist.

(So it fails if the target dataset has a lower maximum wavelength than the source.)

 

Your function certainly makes the tuckercalc function a lot more streamlined. (I have to test it.) Thank you :)

 

p.s. I'm sure you know how it is: you can't concentrate and get more confused by the second, but have to get stuff done. That's when you end up with code like mine :D

 

Edited by VITAS

I'm confused, what's this, then? :huh: 

6 hours ago, VITAS said:

it has to do with PHP returning 0 instead of null for nonexistent values

(and yeah, I didn't read through the data dump... :rolleyes:)

Anyway, if the zeros are valid data, then the algorithm shouldn't have a problem
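One caveat if zeros are valid data: it matters how "missing" is detected. A loose comparison against null also matches 0, so a check such as `$value == null` (with $value standing in for any data point) would skip legitimate zero intensities. A quick check of the difference:

```php
<?php
var_dump(0 == null);    // bool(true): loose comparison treats 0 like null
var_dump(0.0 == null);  // bool(true)
var_dump(0 === null);   // bool(false): strict comparison keeps them apart
var_dump(is_null(0.0)); // bool(false)
```

Using `=== null` or is_null() keeps real zeros in the calculation and only drops values that are actually absent.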

