3. Results
3.2. Characterization of retina cells in RT-DC
3.2.2. Comparing retina cell types using statistical tests
Student’s t-test 129, the Mann-Whitney rank test 130 and the Kruskal-Wallis H-test 131 are examples of tests that can be used to compare two independent samples. Figure 3.7 shows the application of these tests to artificial and experimental data. The two artificial datasets were each obtained by drawing 1500 values from a random number generator that follows a normal distribution with a mean of zero and a standard deviation of one. The histogram on the left in Figure 3.7 shows both distributions (red and blue), which overlap strongly. Let us assume that sampling and comparing using statistical tests is repeated 100 times. Then, one would expect to get a p-value below 0.05 in about 5 cases due to the definition of the p-value 125. Such an example of a significant difference despite drawing values from an identical random number generator is shown in Figure 3.7, and the table (Figure 3.7, bottom) shows the corresponding p-values. The histograms on the right in Figure 3.7 show area distributions of two GFP+ rod photoreceptor samples, which were measured individually using RT-DC. Again, the distributions overlap strongly, but the resulting p-values when comparing these distributions indicate a significant difference. Using these statistical tests, one is essentially asking how probable a difference at least as large as the observed one would be if both samples were obtained by sampling from the same distribution (null hypothesis). In the case of the Nrl-GFP data, cells of different mice were measured, which means that the cells originate from different populations. Therefore, it makes sense that the tests return low p-values. Even when measuring identical cells from the same donor twice, there could be slight differences due to experimental noise (cells aged, room temperature changed, …). Such small differences are typically not resolved when measuring, for example, 30 cells, but in RT-DC the sample size is normally on the order of thousands of cells, which allows detecting minute differences.
Assuming that the second sample of Nrl-GFP+ cells had, for example, been drug-treated, it would not be possible to decide whether the statistically significant difference arose from an effect of the treatment or from experimental noise.
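The expected rate of false positives when both samples come from the same random number generator can be checked with a short simulation. The following is a minimal sketch, not the analysis used for Figure 3.7: it assumes a two-sample z-test as a large-sample stand-in for Student's t-test (reasonable for 1500 values per sample), and the function names, the number of repeats and the seed are chosen here for illustration only.

```python
import random
from statistics import NormalDist, fmean, variance


def two_sample_p_value(a, b):
    """Two-sided p-value of a two-sample z-test.

    Assumption: large samples, so the normal distribution is used as an
    approximation of the t-distribution.
    """
    standard_error = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(repeats=200, n=1500, alpha=0.05, seed=0):
    """Fraction of repeats with p < alpha although both samples are drawn
    from the same normal distribution (mean 0, standard deviation 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sample_p_value(a, b) < alpha:
            hits += 1
    return hits / repeats


if __name__ == "__main__":
    # The printed fraction should scatter around alpha = 0.05.
    print(false_positive_rate())
```

With a significance level of 0.05, the returned fraction is expected to lie close to 0.05, i.e. about 5 out of 100 comparisons appear significant by chance alone.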
Figure 3.7 Application of three statistical tests to artificial and experimental data
The histogram on the left shows two Gaussian distributions (red and blue) with 1500 data points each, produced by a random number generator. The histogram on the right shows the population of GFP+ cells from Nrl-GFP retina cells at P04 for two biological replicates (red and blue). The distributions in each histogram are compared and tested for significant differences using Student’s t-test, the Mann-Whitney rank test, and the Kruskal-Wallis H-test. The table states the corresponding p-values. Despite strongly overlapping populations, the resulting p-values are all below 0.05, indicating significant differences, especially for the Nrl-GFP data.
To compute meaningful significance levels for such large sample sizes, it is important to consider not only the difference between two populations but also how reliably this difference can be measured. Section 2.4 described a method based on linear mixed models and a likelihood ratio test that allows considering biological variation and reproducibility of the effect. To use this model, data from biological replicates is required. Hence, the GFP+ as well as the GFP- fraction was obtained for three biological replicates by FACS sorting. Three biological replicates of each developmental stage were measured using RT-DC. Figure 3.8 shows boxplots for area and deformation for each measurement. Green and gray boxes show data from GFP+ and GFP- samples, respectively, and the p-values (obtained using the LMM-based approach) in each plot indicate whether the difference between GFP+ and GFP- is significant. Since GFP+ cells are consistently smaller than GFP- cells for each replicate, the p-values indicate a significant difference for area at each developmental stage. At E15.5, the deformation of GFP+ cells is considerably lower than that of GFP- cells for the first and second replicate, but not for the third, resulting in a p-value above 0.05, which implies a non-significant difference (due to insufficient reproducibility). In contrast, the difference in deformation between GFP+ and GFP- is very small for each replicate at P04, but this small difference is so similar across all replicates that the LMM-based significance test returns a p-value of 0.0073. At P10, the differences are larger and similarly consistent, resulting in a p-value of 0.0019. At P20, the differences are also large for the first and second replicate, but not for the third, resulting in a p-value of 0.0276.
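For orientation, a linear mixed model with a likelihood ratio test can be written in the following generic form. This is a sketch under stated assumptions and not necessarily the exact specification of Section 2.4: the symbols and the choice of a random intercept and a random slope per replicate are introduced here for illustration.

```latex
% Generic LMM with random intercept and random slope per replicate j
% (illustrative assumption; the parameterization in Section 2.4 may differ)
y_{ij} = (\beta_0 + u_{0j}) + (\beta_1 + u_{1j})\, x_{ij} + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)

% Likelihood ratio test of the fixed effect \beta_1
\Lambda = 2 \left[ \ln L_{\text{full}} - \ln L_{\text{null}\,(\beta_1 = 0)} \right],
\qquad
\Lambda \sim \chi^2_{1} \quad \text{(approximately, under the null hypothesis)}
```

Here, $y_{ij}$ denotes the measured feature (e.g. area or deformation) of cell $i$ in replicate $j$, and $x_{ij}$ encodes the group (0 for GFP-, 1 for GFP+). The random slope $u_{1j}$ describes how much the group difference varies between replicates: if this variation is small relative to $\beta_1$, even a small difference yields a large $\Lambda$ and thus a low p-value, whereas a large but inconsistent difference does not.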
Figure 3.8 Boxplots for area and deformation and statistical analysis using LMM
Boxplots show area and deformation for triplicate measurements of GFP+ (green boxes) and GFP- (gray boxes) samples at four developmental stages. Subscript numbers on the x-axis indicate the replicate number. Deformation and area tend to be smaller for GFP+ cells compared to GFP- cells, and the p-values in each plot indicate whether this difference is significant. The p-values were computed using a test based on linear mixed models, which allows taking the reproducibility of a measurement into account by considering replicates. Boxplots show the median, interquartile range and range of the data, as introduced in Figure 2.7.
3.2.3. Discussion
RT-DC measurements of Nrl-GFP retina samples from mice at different maturation stages reveal a continuous change of morphological and mechanical properties, which is expected since the retina develops rapidly at the chosen ages. Using FACS sorting, the GFP+ rods were isolated and measured individually in RT-DC, resulting in a narrow distribution of cell sizes for each maturation stage. Such a population of cells with a narrow area range was also found in unsorted samples. Using a 2D GMM, I showed that it is possible to predict the location of the GFP+ fraction in an unsorted sample. Predicting the location of the GFP+ cells is easier for maturation stages P10 and P20, as there are clearly distinguishable populations, but for transplantation, samples from maturation stage P04 are of particular interest due to a higher transplantation success 2. Despite the narrow area distribution of GFP+ cells, this population is overlapped by GFP- cells in an unsorted sample at P04. To fully discriminate GFP+ and GFP- cells, more label-free features are required. Therefore, deformation was assessed, which shows a significant difference between GFP+ and GFP- cells (at P04). For the significance analysis, a test based on linear mixed models was used, which allows biological replicates to be included. Other approaches (see Figure 3.7) such as the t-test tend to return lower p-values for larger sample sizes. Therefore, a low p-value can be obtained even for a very small effect by simply measuring more cells. In contrast, the LMM-based approach considers whether an effect is reproducible across biological replicates. This approach appears to be robust and was used in multiple RT-DC related publications 13,15,22,23,85,97,101–106. Like many other statistical tests, the LMM-based test requires data to follow a normal distribution, but it was shown that the test is robust and yields useful results also for considerably skewed distributions 132. Especially for deformation, one could alternatively use a generalized linear mixed model, which uses a log-link function to account for the lognormal behavior of deformation 133. I implemented this alternative and it was integrated into ShapeOut (courtesy of Paul Müller). Furthermore, LMM requires equal variances of the residuals of the compared distributions (homoscedasticity), which is certainly not fulfilled in most biological cases. Recently, approaches that are robust to heteroscedasticity were published 134,135. Therefore, I implemented a test using the more robust Bayesian hierarchical models (BHM) and compared the p-values resulting for several scenarios (several experiments and artificial datasets) to the p-values from the analogous LMM-based test.
In general, the p-values were very similar in all cases, but the computational time for BHM was orders of magnitude longer (while LMM took seconds, BHM required multiple minutes), rendering the application of BHM unfavorable, especially when dealing with large amounts of data or many experiments.
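The dependence of the p-value on sample size mentioned above for tests such as the t-test can be made explicit with a small calculation. The sketch below assumes two normally distributed groups of equal size and equal standard deviation and uses a large-sample z-test instead of the t-test; the function name and the chosen effect size of 0.1 standard deviations are illustrative assumptions.

```python
from statistics import NormalDist


def p_value_for_shift(effect_size, n_per_group):
    """Two-sided p-value of a two-sample z-test when the observed mean
    difference equals exactly `effect_size` standard deviations.

    Assumption: equal group sizes and equal standard deviations, so that
    z = effect_size * sqrt(n_per_group / 2).
    """
    z = abs(effect_size) * (n_per_group / 2.0) ** 0.5
    return 2.0 * (1.0 - NormalDist().cdf(z))


if __name__ == "__main__":
    # Same small effect, increasing number of measured cells per group.
    for n in (30, 300, 3000):
        print(n, p_value_for_shift(0.1, n))
```

For the same shift of 0.1 standard deviations, the p-value stays above 0.05 for 30 cells per group but falls far below 0.05 for 3000 cells per group, although the effect itself has not changed.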
The LMM-based significance test can also result in a very low p-value even for very small differences, if the effect is highly reproducible. This shows that the p-value only indicates that there is a difference between two states (e.g. between GFP+ and GFP- samples), but not whether the difference is large enough to distinguish single cells from those populations when the samples are mixed. Since the goal of this thesis is to find parameters that allow distinguishing rod precursor cells from other retina cells in mixed samples, the next section presents more advanced methods for classification.
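The gap between a significant difference and the ability to tell single cells apart can be quantified for an idealized case. The following sketch assumes two equally frequent, normally distributed populations with equal variance, which is a simplification and not a model of the measured data; the function name is hypothetical. Under these assumptions, the best possible single-feature classifier places a threshold midway between the two means.

```python
from statistics import NormalDist


def best_single_cell_accuracy(effect_size):
    """Best achievable accuracy when assigning a single cell to one of two
    equally frequent normal populations with equal variance whose means
    differ by `effect_size` standard deviations.

    The optimal threshold lies at the midpoint between the two means, so
    the accuracy equals the standard normal CDF at effect_size / 2.
    """
    return NormalDist().cdf(abs(effect_size) / 2.0)


if __name__ == "__main__":
    # Small, medium and large separations in units of standard deviations.
    for d in (0.1, 1.0, 4.0):
        print(d, best_single_cell_accuracy(d))
```

A difference of 0.1 standard deviations, which can be highly significant for large or very reproducible datasets, allows only slightly better than chance assignment of individual cells, whereas a separation of several standard deviations is needed for reliable classification based on a single feature.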