Cubytus

August 4th, 2012, 09:05 AM

Hi, community,

I'm trying to solve this issue using free software on Mac OS X and Ubuntu, namely PSPP (SPSS-compatible), Gnumeric, LibreOffice, SOFA statistics (http://www.sofastatistics.com), and of course, R, although I would like to avoid the latter since it is not interactive and may have trouble importing very large CSV files.

Optional note: group1 to n are individuals, data1 to 4 are a timing measure on different trials, and dataA is a manual ability assessment made for each individual. Timings were classified according to a unique method, and two hypotheses are tested: 1- whether there is a link between the motor assessment and the timing observed, and 2- whether the classification used for the timing measure has a valid operational background. Data is taken from a sub-clinical and borderline population, and as such can't be divided further.

I've read the relevant chapters in 3 excellent manuals (Scherrer, Biostatistique, 1984; Sokal & Rohlf, Biometry, 1995; Legendre & Legendre, Numerical Ecology, 1998), but the case I'm about to describe doesn't seem to appear there. My background is in neuroscience with only one elementary statistics course, so please excuse my ignorance as I try to explain the problem without unnecessary details.

The experimental situation is as follows (shortened to keep this post readable):

Group1
    data1    dataA
    data2
    data3
    data4

Group2
    data1    dataA
    data2
    data3
    data4

Group3
    ...

data1 to data4 are continuous variables, with an n around 1600 (24 in each group). The first issue is that this variable isn't normally distributed (not even close, either inside each group or across groups), no matter which transformation is applied to it. Therefore, I can only use non-parametric statistics there.

dataA is a discrete variable. n1 is different from nA.

data1 to data4 are paired to dataA, respectively, in each group. For this experiment and to simplify the model, Group1, Group2, Group3 are independent. Data1 to data4 are not of the same order of magnitude as dataA.

To eliminate the difference in order of magnitude between the data, I thought about categorizing them. At first, this was intended to make a Chi-square test feasible, but using only 5 categories (arbitrary number), 92%+ of the data fell in category 1. I can't remove outliers since they may be the most meaningful, biologically speaking. I then tried to find a way to choose an appropriate number of classes and likened the concept to that of "bins" in histograms. As Sturges' and Yule's rules are supposedly quite inaccurate for large n, I found a paper that supposedly describes the ideal number of bins to use (equal width), but I just couldn't understand it (too many formulas I don't understand, chained one after another):

Birgé & Rozenholc (2006), "How many bins should be put in a regular histogram", ESAIM: Probability and Statistics [DOI: 10.1051/ps:2006001]
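To make the binning problem concrete, here is a minimal sketch of Sturges' rule (k = ceil(log2 n) + 1) and of equal-width categorization. It does not implement the Birgé & Rozenholc criterion. The function names are hypothetical, and the timings are simulated, strongly right-skewed numbers, not my real data; they only illustrate how a few extreme values stretch the range so that most observations land in the first class.

```python
import math
import random


def sturges_bins(n):
    """Sturges' rule: k = ceil(log2(n)) + 1 equal-width classes."""
    return math.ceil(math.log2(n)) + 1


def equal_width_counts(values, k):
    """Count how many values fall into each of k equal-width classes."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        # The maximum value belongs to the last class, not a (k+1)-th one.
        idx = min(int((v - lo) / width), k - 1)
        counts[idx] += 1
    return counts


# Assumption: simulated log-normal timings standing in for the real ones.
random.seed(0)
timings = [random.lognormvariate(0, 1.5) for _ in range(1600)]

k = sturges_bins(len(timings))           # number of classes Sturges suggests
counts = equal_width_counts(timings, 5)  # the arbitrary 5 categories
share_first = counts[0] / len(timings)   # fraction of data in category 1
print(k, counts, round(share_first, 3))
```

With skewed data like this, adding more equal-width classes does not necessarily help much, since the class width is still set by the most extreme values.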

From what I understood from the three manuals, I could use either the Wilcoxon rank-sum test or the sign test. As much as I want to compare data1 to data4 with dataA across all groups, I want to make sure possible links between data1 to data4 within each group are accounted for and do not "contaminate" inter-group results.

Of course, this would be much easier if I could simply replace data1 to data4 with their mean, but since the data are highly non-normally distributed, I think this would increase the type II error: concluding that data1 to data4 have no relationship to dataA in group1 when in fact they do.
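As a small illustration of why the mean worries me here, the sketch below uses made-up timings for one individual (three typical trials and one extreme trial; these are not real measurements). It only shows how a single extreme trial dominates the mean while the median barely moves; it is not a proposal for the final analysis.

```python
import statistics

# Assumption: hypothetical timings for one individual, in arbitrary units.
trials = [0.41, 0.39, 0.44, 6.80]

mean_all = statistics.mean(trials)          # pulled up by the extreme trial
median_all = statistics.median(trials)      # stays close to the typical trials
mean_typical = statistics.mean(trials[:3])  # mean without the extreme trial

print(mean_all, median_all, mean_typical)
```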

***

A friend of mine suggested I answer the hypotheses using a 2-step process, but ran out of time to explain it further: 1- is there a link between data1 to 4 and dataA? 2- What link (if any) between data1 to 4 is NOT explained by the test?

As far as my limited knowledge of statistics goes, I understand that non-parametric tests such as the sign test and the Wilcoxon signed-ranks test allow for different n in each column, and that these tests require reordering the data. As such, how can I make sure data1 to data4 from group1 are not compared to dataA from group2, for example?
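To show what I mean by keeping the pairing intact, here is a minimal sketch in which every timing carries its group label and is matched to dataA through that label only, so reordering cannot mix groups. The record layout, the helper names and all numbers are hypothetical. Spearman's rank correlation is used purely as a stand-in for a rank-based measure of association; it is not one of the tests named above, and I am not claiming it is the right choice.

```python
import math

# Assumption: made-up "long" records, one row per (group, trial, timing).
records = [
    ("group1", "data1", 0.42), ("group1", "data2", 0.51),
    ("group1", "data3", 0.39), ("group1", "data4", 0.47),
    ("group2", "data1", 0.95), ("group2", "data2", 1.10),
    ("group2", "data3", 0.88), ("group2", "data4", 1.02),
    ("group3", "data1", 0.61), ("group3", "data2", 0.58),
    ("group3", "data3", 0.70), ("group3", "data4", 0.66),
]
# Assumption: one made-up dataA score per group (individual).
data_a = {"group1": 3, "group2": 7, "group3": 5}


def pairs_for_trial(records, data_a, trial):
    """Return (group, timing, dataA) triples for one trial, matched on group."""
    return [(g, t, data_a[g]) for g, name, t in records if name == trial]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# One coefficient per trial, each built from within-group pairs only.
rho_by_trial = {}
for trial in ("data1", "data2", "data3", "data4"):
    triples = pairs_for_trial(records, data_a, trial)
    timings = [t for _, t, _ in triples]
    scores = [a for _, _, a in triples]
    rho_by_trial[trial] = spearman(timings, scores)
print(rho_by_trial)
```

Treating each trial separately like this keeps one pair per group in every comparison, but it does not by itself address the possible links among data1 to data4 within a group.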
