Deep profiling of multitube flow cytometry data
Supplemental information
Kieran O’Neill et al.
Table S1: Markers in simulated multitube data. The data was split into three tubes, each containing CD3, CD4 and CD8 in addition to FSC and SSC. The remaining nine markers were distributed across the tubes, three per tube.
Marker type                 Tube 1   Tube 2   Tube 3
Common (scatter)            FSC      FSC      FSC
Common (scatter)            SSC      SSC      SSC
Common (fluorescent)        CD3      CD3      CD3
Common (fluorescent)        CD4      CD4      CD4
Common (fluorescent)        CD8      CD8      CD8
Phenotyping (fluorescent)   KI67     CD57     CD27
Phenotyping (fluorescent)   CD28     CCR5     CCR7
Figure S1: Overview of the flowBin pipeline, applied to one multitube sample. 1) FCM data from individual aliquot tubes is quantile normalised in terms of the common population markers present in every tube. 2) The tubes are then binned in terms of these population markers, using either K-means or flowFP. 3) The bins from the first tube are mapped to the other tubes (by nearest-neighbour mapping for K-means bins, or directly for flowFP bins). 4) The expression of each bin in terms of each phenotyping marker (those markers differing across tubes) is measured. This may be done by taking median fluorescent intensity, normalised median fluorescent intensity, or proportion of cells exceeding the 98th percentile of a negative control. The final result is a high-dimensional matrix containing expression levels for each bin in terms of each unique marker.
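Steps 2 to 4 of the pipeline can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the flowBin implementation: it uses a plain Lloyd's K-means with a deterministic farthest-point start, skips quantile normalisation, maps bins to the other tubes by assigning each cell to the nearest tube-1 bin centre, and summarises expression by the median. The function names (`sq_dist`, `nearest`, `kmeans`, `bin_expression`) and the data layout are hypothetical.

```python
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to point."""
    return min(range(len(centres)), key=lambda k: sq_dist(point, centres[k]))


def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centres = [points[0]]
    while len(centres) < k:
        # Add the point that is farthest from all centres chosen so far.
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # Move each centre to the mean of its group; keep it if the group is empty.
        centres = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centres[i]
            for i, g in enumerate(groups)
        ]
    return centres


def bin_expression(tubes, n_common, k):
    """Return a bins x phenotyping-markers matrix of median intensities.

    tubes: list of tubes; each tube is a list of cells; each cell is a tuple
    whose first n_common values are the common markers and whose remaining
    values are that tube's phenotyping markers (assumed layout).
    """
    # Step 2: bin the first tube on the common markers.
    centres = kmeans([cell[:n_common] for cell in tubes[0]], k)
    rows = [[] for _ in range(k)]
    for tube in tubes:
        n_pheno = len(tube[0]) - n_common
        per_bin = [[[] for _ in range(n_pheno)] for _ in range(k)]
        # Step 3: map every cell to its nearest tube-1 bin.
        for cell in tube:
            b = nearest(cell[:n_common], centres)
            for j in range(n_pheno):
                per_bin[b][j].append(cell[n_common + j])
        # Step 4: median intensity per bin and marker (None if the bin is empty).
        for b in range(k):
            rows[b].extend(median(v) if v else None for v in per_bin[b])
    return rows
```

Each row of the returned matrix is one bin, and its columns are the phenotyping markers of tube 1, then tube 2, and so on, which mirrors the "high-dimensional matrix" described in the caption.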
[Figure S2 diagram. Bone marrow: aliquot, stain, run flow cytometry. FCS data: 129 patients, 7-10 tubes/patient, 20,000 cells, 3-4 markers/tube. Combine tubes using flowBin. Cell clusters: 128 clusters/patient, 17 markers each; matrix columns FSC, SSC, CD45, HLA-DR, CD13, CD34, CD20, CD19, CD10, CD61, CD56, CD33, CD64, CD117, CD14, CD7, CD2, CD4, CD3. Type clusters using flowType (1:6-combinations). Cell type proportions: 616,285 cell types per patient. Wilcoxon rank sum vs patient NPM1 with Holm correction. 801 cell types associated with NPM-1.]
Figure S2: Pipeline used to determine NPM1-associated immunophenotypes in AML. Steps taken are denoted by arrows, while the data consumed/produced is indicated in boxes. FCM was performed in the clinic historically; all other steps were computational. The end result was a list of 801 cell types which showed a significant
Figure S3: One-, two- and three-dimensional representations of quantile normalisation of population markers. Empirical cumulative distribution function (ECDF) plots are shown for all tubes and for forward scatter (FS), the most variable marker. Following normalisation, the ECDF for all tubes is identical, as is expected from quantile normalisation. Two-dimensional scatter plots for representative tubes show visually the improvement in two-dimensional registration. Lastly, flowFP plots show the improvement in three-dimensional registration, measured by the standard deviation of the number of cells falling within each bin, after bins have been fitted to the consensus of all tubes.
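Why the ECDFs coincide after normalisation can be seen from a small sketch of quantile normalisation for a single marker. It assumes every tube has the same number of events and ignores ties; the function name `quantile_normalise` is hypothetical and this is not the code used to produce the figure.

```python
def quantile_normalise(tubes):
    """Quantile-normalise one marker across tubes.

    tubes: list of equal-length lists, one list of intensities per tube
    (equal event counts per tube are an assumption of this sketch).
    """
    n = len(tubes[0])
    # Reference distribution: the mean across tubes of the value at each rank.
    ref = [sum(vals) / len(tubes) for vals in zip(*(sorted(t) for t in tubes))]
    out = []
    for t in tubes:
        # Indices of this tube's events ordered from lowest to highest value.
        order = sorted(range(n), key=lambda i: t[i])
        normalised = [0.0] * n
        for rank, i in enumerate(order):
            # Each event takes the reference value for its rank.
            normalised[i] = ref[rank]
        out.append(normalised)
    return out
```

After this step every tube holds exactly the same set of values, only in a different event order, so the per-tube ECDFs are identical by construction.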
Figure S4: The two options for binning within flowBin: K-means and flowFP, as applied to a 7-tube sample. a. and b. show comparisons between the bin labels themselves. K-means creates roughly spherical bins, which conform to the location of cell populations. flowFP creates grid-like bins, which may not conform to the true underlying shape of cell populations. c. shows the number of cells per bin across all tubes, for every bin. flowFP has approximately the same mean variation in bin density across tubes as K-means (mean SD: 24.6 vs 28.5). However, flowFP has a much more nearly constant number of cells per bin across bins (SD of means: 0.07 vs 255).
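The two summary statistics quoted in the caption can be read as follows; this is one plausible interpretation, sketched under the assumption that "mean SD" is the per-bin standard deviation across tubes averaged over bins, and "SD of means" is the standard deviation across bins of the per-bin mean count. The function name and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def bin_density_summary(counts):
    """Summarise bin occupancy.

    counts[t][b] is the number of cells from tube t that fall in bin b.
    Returns (mean SD across tubes, SD of per-bin means).
    """
    per_bin = list(zip(*counts))  # one tuple of per-tube counts for each bin
    # How much a bin's count varies from tube to tube, averaged over bins.
    mean_sd = mean(stdev(b) for b in per_bin)
    # How much the average bin count varies from bin to bin.
    sd_of_means = stdev(mean(b) for b in per_bin)
    return mean_sd, sd_of_means
```

Under this reading, a low "SD of means" indicates bins of nearly equal size, while a low "mean SD" indicates that each bin captures a similar number of cells in every tube.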
Figure S5: Comparison between nearest-neighbours merging and flowBin for two tubes computationally sampled from a real data set. a. Raw data (compensated, transformed and filtered for debris), gated for CD3+ cells, and showing the true CD4 and CD8 distribution. b. The two sampled tubes, one containing CD4 and the other CD8. The CD4+ population has slightly higher average CD3 than the CD8+, but both have substantially overlapping CD3 distributions. c. Results of merging by nearest neighbours and by flowBin, including the proportion of resulting “cells” falling within each quadrant. The nearest-neighbours merging created a substantial CD4+CD8+ population not present in the original sample. Both nearest neighbours and flowBin slightly overestimate the CD4−CD8− population. flowBin is more accurate at reproducing the CD4+CD8− and CD4−CD8+ populations than nearest neighbours. d. and e. This analysis was repeated 100 times for each number of bins, with a separate sampling of 5,000 events each time. d. Representative results (those with median RMSD) for selected numbers of bins. e. All results for all numbers of bins and for NN merging. The best result (lowest RMSD) was for 128 bins, whereafter increasing bin number caused RMSD to tend towards that of NN.
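The comparison in panels c to e rests on quadrant proportions and an RMSD between the true and merged distributions. The caption does not define the RMSD precisely, so the sketch below assumes it is the root-mean-square deviation over the four CD4/CD8 quadrant proportions; the gate thresholds and function names are hypothetical.

```python
from math import sqrt


def quadrant_proportions(events, cd4_gate, cd8_gate):
    """Proportion of events in each CD4/CD8 quadrant.

    events: list of (cd4, cd8) intensities.
    Returns proportions in the order CD4+CD8+, CD4+CD8-, CD4-CD8+, CD4-CD8-.
    """
    counts = [0, 0, 0, 0]
    for cd4, cd8 in events:
        # Index 0..3 built from the two gate decisions.
        idx = (0 if cd4 >= cd4_gate else 2) + (0 if cd8 >= cd8_gate else 1)
        counts[idx] += 1
    return [c / len(events) for c in counts]


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))
```

For example, a merge that invents double-positive and double-negative "cells" from a sample containing only single-positive cells is penalised in every quadrant, which is the pattern the caption describes for nearest-neighbours merging.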
Figure S6: nu-SVM separation of normal and abnormal cell populations in AML samples. a. Heatmap of all populations within the AML samples that were predicted to be normal. Most can readily be identified as having the properties of common blood and bone marrow cell populations: myeloid cells expressing CD16 and/or CD64, lymphoid cells (dominated by CD3-expressing T-lymphocytes/precursors), and erythroid cells not expressing any of the markers in the panel, including CD45. b. Heatmap of all populations predicted to be abnormal. In contrast to the cells predicted to be normal, many of these express CD34 and CD117, primitive markers typical of stem cells and of AML.
[Figure S7 diagram. a. Training: per-population training data from each patient, with every population labelled with its patient's class (AML or healthy), are passed to the classifier training algorithm to give a trained classifier. b. Prediction: the trained classifier predicts a class for each population in the test data (all bins from one patient), and the patient's predicted class is taken by vote.]
[Figure S8 diagram. a. Training: the training data and classes are subsampled (sample 1, sample 2, ...), and the classifier training algorithm is run on each subsample to give trained classifiers 1, 2, ... b. Prediction: each trained classifier predicts classes for the test data; final classes are taken by vote, and the patient class is taken by a further vote.]
Figure S8: Schema for a voting classifier for flowBin output incorporating balanced bagging. a. Training. This is similar to the base classifier (Fig. S7), except that multiple classifiers are trained, each on a bootstrap subsample of patients. Each bootstrap sample is set to contain equal numbers of patients from each class. b. Prediction. To predict the class of a new patient, predictions for each bin from that patient are made by each of the trained classifiers. Final per-bin predictions are taken by majority vote of those predictions. Then, the prediction for the patient is made based on a majority vote of the per-bin predictions.
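The balanced bagging and two-level voting scheme can be sketched as below. To keep the example dependency-free, a nearest-centroid learner stands in for the SVM base classifier, so this illustrates the bagging and voting logic only and is not the classifier used in the paper. All function names, the number of models and the tie-handling (first most common label wins) are assumptions.

```python
import random
from collections import Counter


def train_centroid(bins, labels):
    """Stand-in base learner: one mean vector (centroid) per class."""
    groups = {}
    for x, y in zip(bins, labels):
        groups.setdefault(y, []).append(x)
    return {y: [sum(c) / len(xs) for c in zip(*xs)] for y, xs in groups.items()}


def predict_centroid(model, x):
    """Class whose centroid is closest to the bin's expression vector x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def train_balanced_bagging(patients, n_models=5, seed=0):
    """patients: list of (bins, patient_class); each bin inherits its patient's class."""
    rng = random.Random(seed)
    by_class = {}
    for p in patients:
        by_class.setdefault(p[1], []).append(p)
    n_per_class = min(len(ps) for ps in by_class.values())
    models = []
    for _ in range(n_models):
        # Bootstrap patients with replacement, the same number from every class.
        sample = [p for ps in by_class.values() for p in rng.choices(ps, k=n_per_class)]
        bins = [b for p_bins, _ in sample for b in p_bins]
        labels = [cls for p_bins, cls in sample for _ in p_bins]
        models.append(train_centroid(bins, labels))
    return models


def predict_patient(models, bins):
    """Majority vote across models for each bin, then across bins for the patient."""
    per_bin = [
        Counter(predict_centroid(m, b) for m in models).most_common(1)[0][0]
        for b in bins
    ]
    return Counter(per_bin).most_common(1)[0][0]
```

Sampling patients rather than bins keeps all bins from one patient together, and drawing equal numbers per class prevents the more common class from dominating each bootstrap sample.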
[Figure S9 plot: RchyOptimyx hierarchy running from All Cells through CD34+, CD34+CD61−, CD34+CD61−CD14−, CD34+CD10−CD61−CD14−, CD34+CD20−CD61−CD14− and CD34+CD20−CD10−CD61−CD14− to CD34+CD20−CD10−CD61−CD14−CD3+. Edge labels: CD34+, CD61−, CD14−, CD10−, CD20−, CD3+. Scale: -log10(P-value), 1 to 9.]
Figure S9: An example of RchyOptimyx analysis of one cluster of cell types. As 801 cell types are too many to visualise meaningfully with RchyOptimyx, we clustered the cell types and visualised each cluster in turn. In this example, the addition of CD10− or CD20− makes little difference to the P-value of the cell type CD34+CD61−CD14−. As this was a general trend and in line with reported AML biology, we chose to exclude cell types defined over these markers from further analysis.
[Figure S10 plots: proportion of all cells (0 to 1) for wt versus mt patients, one plot per cell type.
a. CD34− (P>0.05); CD34−CD13+ (P=0.00221); CD34−CD33+ (P=0.00225); CD13+CD34−CD33+ (P=0.00138).
b. CD34−CD2− (P=0.0265); CD34−CD2−CD4+ (P=0.000149); CD13+CD34−CD2−CD4+ (P=1.94e-06).
c. CD34+CD61−CD14− (P=0.015); CD34+CD61−CD14−CD2+ (P=0.000114); CD34+CD61−CD2+CD4− (P=0.00548).
d. HLA+CD34+CD33−CD64− (P=0.0235); HLA+CD34+CD4−CD64− (P=0.0119).]
Figure S10: Selected classes of cell types showing significant differences in abundance between NPM1-mt and NPM1-wt. P-values are given after Holm correction. a. Gating for the presence of myeloid lineage markers CD13 and CD33 within the CD34− compartment yields much stronger differences in abundance between NPM1-wt and NPM1-mt than CD34− alone. b. Gating for CD2− within the CD34− compartment yields a slightly better separation than CD34− alone, but gating down further to CD4− and CD13+ yields a cell type that, while present in most NPM1-mt, is absent or below 20% abundance in nearly all NPM1-wt. c. Gating for CD61− and CD14− within the CD34+ compartment leads to a cell type which is common in NPM1-wt but almost entirely absent in NPM1-mt. d. Gating for HLA-DR+ and CD64− within the CD34+ compartment leads to a cell type that occurs in a subset of NPM1-wt but is entirely absent in