Cell Count Normalization - Feature-based Comparison of Flow Cytometry Data

Once the n × m matrix is collated, we have obtained our first feature: the absolute cell count for each node (cell population) in a single FCS file. That is, FCS file i can be described by the cell count feature, a vector of cell counts $x_i^{count0} = \{x_{i1}^{count0}, \ldots, x_{ij}^{count0}, \ldots, x_{im}^{count0}\}$, where each element $x_{ij}^{count0}$ of this vector is the cell count of node (cell population) $v_j$.

However, before the cell count features can be compared across FCS files, they must be normalized. Normalization is usually done per sample, by converting each cell count to a proportion of the file's total:

$$x_i^{countProp} = \frac{x_i^{count0}}{x_i^{countTotal}}$$

where $x_i^{countProp}$ is the normalized version of $x_i^{count0}$, obtained by dividing the original cell counts $x_i^{count0}$ by the total number of cells in FCS file i, $x_i^{countTotal}$, yielding proportion values. Although popular, proportion values may be misleading when analyzing changes in cell production between different classes of FCS files [44].

To illustrate the issue with using proportions, Figure 3.8 shows a hypothetical scenario in which we sample 20 cells from a WT mouse and 20 cells from a KO mouse. Suppose that, in reality, knocking out the gene causes the mouse's immune system to double its production of total immune cells, via a tripling of the CD8+ cell population. If the same number of cells is sampled from both mice and we use the associated proportion values, we may misinterpret the effect of the knockout as also including a decrease in the production of cell populations CD11b+ and CD8+Ly6C+.
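The pitfall can be made concrete with a small numeric sketch. The counts below are hypothetical values chosen to mimic the scenario (they are not read off Figure 3.8), and the population labels `A`, `B`, `C` and the helper `proportions` are illustrative names only. Population A triples in the KO mouse, while B and C are unchanged in absolute terms.

```python
def proportions(counts):
    """Divide each population's cell count by the file's total cell count."""
    total = sum(counts.values())
    return {pop: n / total for pop, n in counts.items()}

# Hypothetical absolute counts (assumed for illustration only).
wt = {"A": 10, "B": 6, "C": 4}   # total of 20 cells
ko = {"A": 30, "B": 6, "C": 4}   # A tripled, so the total doubled to 40

wt_prop = proportions(wt)
ko_prop = proportions(ko)

# Columns: population, WT count, KO count, WT proportion, KO proportion
for pop in wt:
    print(pop, wt[pop], ko[pop], round(wt_prop[pop], 3), round(ko_prop[pop], 3))
```

In this sketch the proportion of B halves (0.3 to 0.15) even though its absolute count is identical in both mice, which is exactly the kind of apparent decrease described above.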

To prevent such misinterpretations, we use a modified version of a cell count normalization method called the trimmed mean of M-values (TMM) [86]; instead of basing it on percentage values, we base it on the absolute cell counts.

Figure 3.9: FCM Data Processing Pipeline: Cell Count Normalization (Example 2)

First, a 'reference' file $x_{ref}^{count0}$ is chosen; in this case, an FCS file from a control with a median total cell count. Each non-reference file i then divides its absolute cell counts $x_i^{count0}$ element by element by those of the reference file to obtain a vector of ratios $x_i^{count0} / x_{ref}^{count0}$. We then convert these ratios to log scale:

$$t_i = \log_2 \frac{x_i^{count0}}{x_{ref}^{count0}}$$

A sample visualization of these ratios can be seen in Figure 3.9.
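As a minimal sketch of this step, the log ratios can be computed element by element. The function name `log2_ratios`, the example counts, and the decision to return `None` when either count is zero are assumptions made for illustration; the text does not say how zero counts are handled.

```python
import math

def log2_ratios(counts, ref_counts):
    """Element-wise log2(counts[j] / ref_counts[j]).

    Populations with a zero count in either file get None, because the
    log ratio is undefined there (this handling is an assumption).
    """
    ratios = []
    for x, r in zip(counts, ref_counts):
        if x > 0 and r > 0:
            ratios.append(math.log2(x / r))
        else:
            ratios.append(None)
    return ratios

ref = [1000, 400, 50, 0]      # hypothetical reference-file counts
sample = [2000, 800, 25, 3]   # hypothetical counts for file i
t = log2_ratios(sample, ref)
print(t)  # [1.0, 1.0, -1.0, None]
```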

Before obtaining the TMM from these ratios, we weight them by their cell counts. In FCM, very small cell counts can sometimes be attributed to noise or to minor errors in previous steps of the pipeline. For instance, because we are unable to sample large numbers of cells from a rare cell population, we see larger variance in its cell count (see Figure 3.9). Hence, we are less confident when statistically analyzing the cell counts of rare cell populations. Gating is another source of error. In Figure 3.5, the cells are spread out as a distribution rather than in precise clusters, so a slight change in the gates could cause cells near the gates to be assigned to completely different cell populations. A small number of misclassified cells would not affect larger cell populations as much as it would affect rarer ones. Hence, we use an optional weight vector, the inverse asymptotic variance $w_i$, to reduce the influence of rare cell populations on the normalization factor:

$$w_i = \frac{(x_{ref}^{countTotal} - x_{ref}^{count0}) \cdot x_{ref}^{count0}}{x_{ref}^{countTotal}} + \frac{(x_i^{countTotal} - x_i^{count0}) \cdot x_i^{count0}}{x_i^{countTotal}}$$
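The following sketch evaluates the weight for a single population. It assumes the weight is read as the sum, over the reference file and file i, of (total − count) · count / total; that reading, the function name `count_weight`, and the example counts are assumptions made for illustration.

```python
def count_weight(x_i, total_i, x_ref, total_ref):
    """Weight for one population, read as the sum over both files of
    (total - count) * count / total. This reading is an assumption."""
    ref_term = (total_ref - x_ref) * x_ref / total_ref
    file_term = (total_i - x_i) * x_i / total_i
    return ref_term + file_term

# Hypothetical counts: an abundant and a rare population, 10,000 cells per file.
w_abundant = count_weight(2000, 10000, 2000, 10000)
w_rare = count_weight(10, 10000, 10, 10000)
print(w_abundant, w_rare)
```

Under this reading the weight grows with the cell count, so the rare population contributes far less to the weighted mean than the abundant one, in line with the stated purpose of the weights.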

We also compute an additional weight vector $z_i$ from the cell counts:

$$z_i = \frac{1}{2} \left[ \log_2(x_i^{countProp}) + \log_2(x_{ref}^{countProp}) \right]$$

In this thesis, we disregard the cell counts of all populations j with $z_{ij} < \alpha$, where $\alpha$ has a default value of $\alpha = -10$.
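A minimal sketch of this filter follows. The function names and the example counts are hypothetical, and dropping populations with a zero count (where the logarithm is undefined) is an assumption the text does not spell out.

```python
import math

ALPHA = -10  # default cutoff stated in the text

def mean_log_proportion(x_i, total_i, x_ref, total_ref):
    """z_ij: mean of the log2 proportions in file i and in the reference."""
    return 0.5 * (math.log2(x_i / total_i) + math.log2(x_ref / total_ref))

def keep_population(x_i, total_i, x_ref, total_ref, alpha=ALPHA):
    """True if the population is retained, i.e. z_ij is not below alpha.

    Zero counts are dropped up front because log2(0) is undefined; that
    handling is an assumption.
    """
    if x_i <= 0 or x_ref <= 0:
        return False
    return mean_log_proportion(x_i, total_i, x_ref, total_ref) >= alpha

total = 2 ** 20  # hypothetical file size, chosen so the logs are whole numbers
print(keep_population(4096, total, 4096, total))  # True  (z = -8)
print(keep_population(256, total, 256, total))    # False (z = -12)
```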

Finally, we combine the weights and the ratios to obtain the TMM. The idea is to assume that most nodes' cell counts are not significantly different between FCS files. In other words, the number of cells produced for most cell populations does not differ significantly, and an experiment correlates with a significant change in production for only a minority of cell populations. The cell counts that are approximately the same as in the reference sample must then be the ones whose ratio occurs at the highest frequency across all nodes. This ratio is captured by the TMM $f_i$, which is obtained by taking a weighted mean $g_i$ of the ratios $t_i$ over the populations that are not disregarded by the cutoff $\alpha$:

$$g_i = \frac{\sum_{\{j \mid z_{ij} \geq \alpha\}} t_{ij} w_{ij}}{\sum_{\{j \mid z_{ij} \geq \alpha\}} w_{ij}}, \qquad f_i = 0.5^{g_i}$$

As such, we obtain a TMM for each FCS file. To normalize the cell counts of each non-reference file i, we multiply all of its cell counts by its TMM $f_i$ (see the blue line in Figure 3.9):

$$x_i^{count} = f_i \cdot x_i^{count0}$$
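The whole normalization can be sketched end to end as below. This is an illustration under stated assumptions rather than the thesis implementation: it takes the weighted mean exactly as written (no separate trimming step), reads the weight as (total − count) · count / total summed over both files, takes the factor as 0.5 raised to the weighted mean, keeps populations whose mean log proportion is not below the cutoff, and skips populations with a zero count. The function names and the data are hypothetical.

```python
import math

def tmm_factor(counts, total, ref_counts, ref_total, alpha=-10.0):
    """Return f_i = 0.5 ** g_i, where g_i is the weighted mean log2 ratio."""
    num = 0.0
    den = 0.0
    for x, r in zip(counts, ref_counts):
        if x <= 0 or r <= 0:
            continue  # log ratio undefined; skipping is an assumption
        z = 0.5 * (math.log2(x / total) + math.log2(r / ref_total))
        if z < alpha:
            continue  # population too rare to be trusted
        t = math.log2(x / r)
        w = (ref_total - r) * r / ref_total + (total - x) * x / total
        num += t * w
        den += w
    if den == 0:
        raise ValueError("no usable populations for the weighted mean")
    return 0.5 ** (num / den)

def normalize(counts, factor):
    """Multiply every cell count of a file by its normalization factor."""
    return [factor * x for x in counts]

# Hypothetical data: file i holds twice as many cells as the reference,
# and only the last population changed beyond that.
ref = [4000, 2000, 1000, 100]
ref_total = 8000
sample = [8000, 4000, 2000, 800]
sample_total = 16000

f = tmm_factor(sample, sample_total, ref, ref_total)
print(f, normalize(sample, f))
```

In this example the factor lands a little below 0.5: the three unchanged populations pull it toward 0.5, while the changed but low-weight population shifts it only slightly.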

Note: if the cell count of a single node $v_j$ changes, this change also affects the cell counts of its parent nodes (and the parent nodes of those parent nodes, i.e. its ancestors), since each parent's cell count is the sum of its child nodes' cell counts. Likewise, the same change implies a change in cell count among its child nodes (and the child nodes of those child nodes, i.e. its descendants). As a result, a change in a single node's cell count means a change in cell count for at most $l^2$ ancestor nodes and $(L - l)^3$ descendant nodes, where $l$ is the layer of the cell hierarchy on which $v_j$ resides. Taking this further, we can say that all cell production changes occur only in the cell population nodes on the last layer of the cell hierarchy (i.e. the leaf nodes). In turn, all cell count changes in nodes on higher layers are simply the result of leaf node cell count changes. Therefore, the assumption that the production of cells for most cell populations is not affected by the experiment still holds when the affected leaf nodes, along with at most $L^2$ ancestor nodes each, remain a minority of the cell population nodes in the cell hierarchy.
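To illustrate how a leaf-level change propagates upward, the sketch below collects the ancestors of a node in a toy hierarchy. The two-marker hierarchy, the node names, and the `parents` mapping are hypothetical and only serve to show the traversal.

```python
def ancestors(node, parents):
    """All nodes reachable by repeatedly following parent links."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

# Hypothetical two-marker hierarchy: each node lists its parent nodes.
parents = {
    "A+": ["root"],
    "B+": ["root"],
    "A+B+": ["A+", "B+"],
    "A+B-": ["A+"],
    "A-B+": ["B+"],
}

# A change in the leaf A+B+ also changes the counts of all its ancestors.
affected = ancestors("A+B+", parents)
print(sorted(affected))  # ['A+', 'B+', 'root']
```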
