Once the n × m matrix is collated, we have obtained our first feature: the absolute cell count for each node (cell population) in a single FCS file. That is, FCS file $i$ can be described by the cell count feature, a vector of cell counts $x^{\text{count}'}_i = \{x^{\text{count}'}_{i1}, \ldots, x^{\text{count}'}_{ij}, \ldots, x^{\text{count}'}_{im}\}$, where each element $x^{\text{count}'}_{ij}$ is the cell count of node (cell population) $v_j$.
However, before the cell count features can be compared across FCS files, they must be normalized. Normalization is typically done per sample, by converting each cell count to a proportion of the sample's total:
$$x^{\text{countProp}}_i = \frac{x^{\text{count}'}_i}{x^{\text{countTotal}}_i}$$
where $x^{\text{countProp}}_i$ is the normalized version of $x^{\text{count}'}_i$, obtained by dividing the original cell counts $x^{\text{count}'}_i$ by the total number of cells in FCS file $i$, $x^{\text{countTotal}}_i$, yielding proportion values. Although popular, percentage values may be misleading when analyzing cell production changes between different classes of FCS files [44].
To illustrate the issue with using proportions, Figure 3.8 shows a hypothetical scenario where we sample 20 cells from a WT mouse and a KO mouse. In this scenario, the true effect of knocking out the gene is that the mouse's immune system doubles its production of total immune cells, via a tripling of the CD8+ cell population. If the same number of cells is sampled from both mice and we use the associated proportion values, we may misinterpret the knockout as also decreasing the production of cell populations CD11b+ and CD8+Ly6C+.
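The effect can be reproduced with a few lines of arithmetic. The following is a minimal sketch; the population names and counts are hypothetical values chosen to match the description above (total doubles because CD8+ triples) and are not the values shown in Figure 3.8.

```python
def proportions(counts):
    """Convert absolute cell counts into proportions of the sample total."""
    total = sum(counts.values())
    return {pop: n / total for pop, n in counts.items()}

# Hypothetical absolute cell counts (assumed for illustration only).
wt_true = {"CD8+": 10, "CD11b+": 6, "other": 4}   # 20 cells in total
ko_true = {"CD8+": 30, "CD11b+": 6, "other": 4}   # 40 cells: CD8+ tripled

wt_prop = proportions(wt_true)
ko_prop = proportions(ko_true)

# Expected composition of a 20-cell sample drawn from the KO mouse.
ko_sample = {pop: 20 * p for pop, p in ko_prop.items()}

for pop in wt_true:
    print(pop, wt_true[pop], ko_true[pop], wt_prop[pop], ko_prop[pop])
```

In this sketch the absolute CD11b+ count is identical in both mice, yet its proportion halves in the KO mouse, which is exactly the kind of apparent decrease that proportions can suggest.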
To prevent such misinterpretations, we use a modified version of the cell count normalization method called the trimmed mean of M-values (TMM) [86]: instead of basing it on percentage values, we base it on the absolute cell counts.
Figure 3.9: FCM Data Processing Pipeline: Cell Count Normalization (Example 2)
First, a ‘reference’ file $x^{\text{count}'}_{\text{ref}}$ is chosen; in this case, an FCS file from a control sample with a median total cell count. Each non-reference file $i$ then divides its absolute cell counts $x^{\text{count}'}_i$ element-wise by those of the reference file to obtain a vector of ratios $x^{\text{count}'}_i / x^{\text{count}'}_{\text{ref}}$. We then convert these ratios into log scale:
$$t_i = \log_2\left(\frac{x^{\text{count}'}_i}{x^{\text{count}'}_{\text{ref}}}\right)$$
A sample visualization of these ratios can be seen in Figure 3.9.
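The reference selection and log-ratio step can be sketched as follows. This is an illustrative sketch, not the implementation used in this thesis: the function names `pick_reference` and `log_ratios` are hypothetical, the lower median is used so that the reference is always an actual file, and returning `None` for populations with a zero count in either file is an assumption about how undefined ratios are handled.

```python
import math
from statistics import median_low

def pick_reference(totals, is_control):
    """Index of the control file whose total cell count is the (lower) median."""
    controls = [i for i, flag in enumerate(is_control) if flag]
    med = median_low([totals[i] for i in controls])
    return next(i for i in controls if totals[i] == med)

def log_ratios(counts, ref_counts):
    """Element-wise log2(count / reference count); None where undefined."""
    ratios = []
    for x, r in zip(counts, ref_counts):
        # Assumption: a zero count in either file leaves the ratio undefined.
        ratios.append(math.log2(x / r) if x > 0 and r > 0 else None)
    return ratios

ref_index = pick_reference([100, 300, 200, 50], [True, True, True, False])
t = log_ratios([8, 4, 0], [4, 4, 2])
```

Here the third file is selected as the reference because its total is the median among the three control files, and a population with twice the reference count receives a log ratio of 1.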
Before obtaining the TMM from these ratios, we weight them by their cell counts. In FCM, very small cell counts can sometimes be attributed to noise or minor errors in previous steps of the pipeline. For instance, because we are unable to sample large numbers of cells from a rare cell population, we see larger variance in its cell count (see Figure 3.9). Hence, we are less confident when statistically analyzing the cell counts of rare cell populations. Gating is another source of error. In Figure 3.5, the cells are spread out as a distribution rather than forming precise clusters, so a slight change in the gates could cause the cells near the gates to be categorized into completely different cell populations. A small number of misclassified cells would not affect larger cell populations as much as it would affect rarer ones. Hence, we use an optional weight vector, the inverse asymptotic variance $w_i$, to reduce the influence of rare cell populations on the normalization factor:
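$$w_i = \frac{(x^{\text{countTotal}}_{\text{ref}} - x^{\text{count}'}_{\text{ref}}) \cdot x^{\text{count}'}_{\text{ref}}}{x^{\text{countTotal}}_{\text{ref}}} + \frac{(x^{\text{countTotal}}_i - x^{\text{count}'}_i) \cdot x^{\text{count}'}_i}{x^{\text{countTotal}}_i}$$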
We also compute an additional weight vector $z_i$ from the cell counts:
$$z_i = \frac{1}{2} \cdot \left[\log_2\left(x^{\text{countProp}}_i\right) + \log_2\left(x^{\text{countProp}}_{\text{ref}}\right)\right]$$
In this thesis, we disregard the cell counts of all populations $j$ with $z_{ij} < \alpha$, where $\alpha = -10$ by default.
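The two weight vectors can be sketched per population as below. This is a hedged sketch: the function names are hypothetical, the expression for $w_i$ follows our reading of the formula above, and assigning $-\infty$ to $z_{ij}$ when either count is zero (so that the population is always disregarded) is an assumption.

```python
import math

def variance_weights(counts, total, ref_counts, ref_total):
    """w_ij: sum of the (N - x) * x / N terms for the reference and for file i."""
    return [(ref_total - r) * r / ref_total + (total - x) * x / total
            for x, r in zip(counts, ref_counts)]

def mean_log_proportions(counts, total, ref_counts, ref_total):
    """z_ij: average of the log2 proportions in file i and in the reference."""
    z = []
    for x, r in zip(counts, ref_counts):
        if x > 0 and r > 0:
            z.append(0.5 * (math.log2(x / total) + math.log2(r / ref_total)))
        else:
            # Assumption: empty populations are always below the cutoff.
            z.append(-math.inf)
    return z

def keep_mask(z, alpha=-10.0):
    """True for populations that are kept, i.e. those with z_ij >= alpha."""
    return [zj >= alpha for zj in z]

w = variance_weights([50], 100, [20], 100)
z = mean_log_proportions([25, 0], 100, [25, 10], 100)
keep = keep_mask(z)
```

Under this reading, a population's weight grows with its cell count, so rare populations contribute less to the weighted mean, and populations with very small proportions in both files are dropped entirely.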
Finally, we combine the weights and the ratios to obtain the TMM. The idea is to assume that most nodes' cell counts are not significantly different between FCS files. In other words, the number of cells produced for most cell populations does not differ significantly, and an experiment correlates with a significant change in the production of only a minority of cell populations. The cell counts that are approximately the same as those of the reference sample must then be the ones whose ratios occur at the highest frequency across all nodes. The TMM $f_i$ captures this ratio and is obtained from a weighted mean $g_i$ of the log ratios $t_i$, taken over the populations that were not disregarded:
$$g_i = \frac{\sum_{\{j \mid z_{ij} \ge \alpha\}} t_{ij} w_{ij}}{\sum_{\{j \mid z_{ij} \ge \alpha\}} w_{ij}}, \qquad f_i = 0.5^{g_i}$$
As such, we obtain a TMM for each FCS file. To normalize the cell counts, for each non-reference file i, we multiply all of its cell counts with its TMM fi (see the blue line in Figure 3.9):
$$x^{\text{count}}_i = f_i \cdot x^{\text{count}'}_i$$
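The final step can be sketched as a weighted mean followed by a rescaling. This is a minimal sketch under stated assumptions: the function names `tmm_factor` and `normalize` are hypothetical, the log ratios, weights, and keep mask are taken as precomputed inputs, and the factor is read as $f_i = 0.5^{g_i}$ (equivalently $2^{-g_i}$), so that a file with systematically more cells than the reference is scaled down.

```python
def tmm_factor(t, w, keep):
    """Weighted mean g of the kept log ratios, converted to the factor 0.5 ** g."""
    # Entries with keep == False are skipped, so their t value is never used.
    num = sum(tj * wj for tj, wj, k in zip(t, w, keep) if k)
    den = sum(wj for wj, k in zip(w, keep) if k)
    g = num / den
    return 0.5 ** g

def normalize(counts, f):
    """Multiply every cell count of a file by its normalization factor."""
    return [f * x for x in counts]

# Most populations have twice the reference count (log ratio 1);
# the third population is disregarded by the mask.
f = tmm_factor([1.0, 1.0, 3.0], [1.0, 1.0, 2.0], [True, True, False])
normalized = normalize([8, 4], f)
```

In this example the kept populations agree on a log ratio of 1, so the factor is 0.5 and the file's counts are halved to match the scale of the reference.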
Note: if the cell count of a single node $v_j$ changes, this change also affects the cell counts of its parent nodes (and the parent nodes of those parent nodes, i.e. ancestors), since each parent's cell count is the sum of its child nodes' cell counts. Likewise, the same change implies a change in cell count among its child nodes (and the child nodes of those child nodes, i.e. descendants). As a result, a change in a single node's cell count means a change in cell count for a maximum of $l^2$ ancestor nodes and $(L - l)^3$ descendant nodes, where $l$ is the cell hierarchy layer on which $v_j$ resides. Taking this further, we can also say that all cell production changes occur only in cell population nodes on the last layer of the cell hierarchy (i.e. the leaf nodes). In turn, all cell count changes in nodes on higher layers are simply results of leaf node cell count changes. Therefore, the assumption that the production of cells for most cell populations is not affected by the experiment still holds when the total number of affected leaf nodes, along with their $L^2$ ancestor nodes each, remains a minority of the total number of cell population nodes in the cell hierarchy.
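The propagation from a leaf node to its ancestors can be illustrated on a toy hierarchy. This is only a sketch: the two markers A and B and their counts are hypothetical, and representing each node as a tuple of marker labels whose count is the sum of all leaves containing those labels is an assumption made for illustration.

```python
from itertools import combinations

def node_counts(leaf_counts):
    """Count of every node = sum of the leaf counts whose labels contain the node's labels."""
    counts = {}
    for leaf, n in leaf_counts.items():
        # Every subset of a leaf's labels (including the empty root) is an ancestor or the leaf itself.
        for size in range(len(leaf) + 1):
            for node in combinations(leaf, size):
                counts[node] = counts.get(node, 0) + n
    return counts

# Hypothetical leaf populations defined by two markers, A and B.
leaves = {("A+", "B+"): 5, ("A+", "B-"): 3, ("A-", "B+"): 2, ("A-", "B-"): 10}
before = node_counts(leaves)

# Change the production of a single leaf population.
leaves_after = dict(leaves)
leaves_after[("A+", "B+")] = 15
after = node_counts(leaves_after)

changed = sorted(node for node in before if before[node] != after[node])
```

Only the altered leaf, the nodes A+ and B+, and the root change their counts; A- and B- are untouched, which is why a small number of affected leaf nodes can still leave most of the hierarchy unchanged.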