Feature-based Comparison of Flow Cytometry Data

by

Alice Yue

B.Sc., University of Victoria, 2015

Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science in the

School of Computing Science Faculty of Applied Science

© Alice Yue 2017

SIMON FRASER UNIVERSITY Summer 2017

However, in accordance with the Copyright Act of Canada, this work may be reproduced without authorization under the conditions for “Fair Dealing.” Therefore, limited reproduction of this work for the purposes of private study, research, education, satire, parody, criticism, review and news reporting is likely

Approval

Name: Alice Yue

Degree: Master of Science (Computing Science)

Title: Feature-based Comparison of Flow Cytometry Data

Examining Committee:

Chair: Joseph Peters, Professor

Cedric Chauve, Senior Supervisor, Professor, Department of Mathematics, Simon Fraser University

Ryan Brinkman, Co-supervisor, Professor, Department of Medical Genetics, University of British Columbia

Leonid Chindelevitch, Supervisor, Assistant Professor, School of Computing Science, Simon Fraser University

Kay C. Wiese, Internal Examiner, Associate Professor, School of Computing Science, Simon Fraser University

Abstract

Flow cytometry (FCM) bioinformatics is a sub-field of bioinformatics aimed at developing effective and efficient computational tools to store, organize, and analyze high-throughput, high-dimensional FCM data. Flow cytometers are capable of analyzing thousands of cells per second for up to 40 features. These features primarily signal the presence of different proteins on cells in the bloodstream. Flow cytometers hence contribute large amounts of data towards the big biological data paradigm. The data that a flow cytometer outputs from a biological sample is called a FCS file.

The International Mouse Phenotyping Consortium (IMPC) is a collaboration between 23 international institutions and funding organizations. Its aim is to decipher the function of 20,000 mouse genes. IMPC is doing so by breeding mice with a certain gene knocked out (KO), cancelling the function of that gene. In turn, FCM is used to measure the immunological changes correlated to this knockout.

Many tools exist to classify FCS files. However, there is a lack of tools to conduct unsupervised clustering of FCS files. One goal of IMPC is to compare and contrast KO genes; hence, IMPC is a prime motivation for this problem.

As such, this thesis outlines a data processing pipeline used to isolate features for each FCS file. We then test the different types of features extracted on a benchmark data set from the FlowCAP-II challenge, containing data from healthy persons and patients with AML (acute myeloid leukemia). We then evaluate how well these features separate out FCS files of different origin (i.e. healthy vs AML).

Dedication

Acknowledgements

I would like to show my gratitude to my senior and co-supervisors Dr. Cedric Chauve and Dr. Ryan Brinkman. Thank you for the opportunity to learn and solve meaningful problems together – taking the time for many thought provoking discussions, and your most generous support and patience throughout my masters’ education. I am honoured to work with you, as my supervisors and as life mentors.

I would also like to express my appreciation to my defence committee for your invaluable feedback. Thank you to my supervisory committee member, Dr. Leonid Chindelevitch, for always providing inspiring food for thought during the bioinformatics course and program meetings, Dr. Kay Wiese for your encouragement and for agreeing to examine my super long MSc thesis, and Dr. Joseph Peters for being the defence chair and for your patience towards my un-finish-able list of algorithm questions.

Last but not least, thank you to my family, friends, and lab-mates for your companionship and support.

List of Tables

Table 2.1 Examples of Cell Population Identification Tools categorized into three types: Supervised Classification, Unsupervised Clustering, and Automated Gating

Table 3.1 Co-occurrence True/False Positive/Negative Definition for any two FCS files

Table C.1 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: K-NN scores

Table C.2 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Spectral Clustering scores

Table C.3 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Louvain Clustering scores

Table C.4 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: Agglomerative Clustering scores

Table C.5 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the Panels variable: K-medoid Clustering scores

Table C.6 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores

Table C.7 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Trimmed Absolute Features)

Table C.8 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Trimmed phenodeviant Features)

Table C.9 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Un-trimmed Absolute Features)

Table C.10 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores for average of All Panels (Un-trimmed phenodeviant Features)

Table C.11 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Trimmed Absolute Features)

Table C.12 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Trimmed phenodeviant Features)

Table C.13 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Un-trimmed Absolute Features)

Table C.14 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Spectral Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)

Table C.15 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Trimmed Absolute Features)

Table C.16 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Trimmed phenodeviant Features)

Table C.17 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Un-trimmed Absolute Features)

Table C.18 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)

Table C.19 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Trimmed Absolute Features)

Table C.20 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Trimmed phenodeviant Features)

Table C.21 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Un-trimmed Absolute Features)

Table C.22 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Agglomerative Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)

Table C.23 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Trimmed Absolute Features)

Table C.24 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Trimmed phenodeviant Features)

Table C.25 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Un-trimmed Absolute Features)

Table C.26 Internal and External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: K-medoid Clustering scores for average of All Panels (Un-trimmed phenodeviant Features)

Table C.27 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Trimmed Absolute Features)

Table C.28 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Trimmed phenodeviant Features)

Table C.29 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Un-trimmed Absolute Features)

Table C.30 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: Distance Matrix scores for average of All Panels (Un-trimmed phenodeviant Features)

List of Figures

Figure 1.1 FCM Machinery and How FCM Analyzes Biological Samples – top half of schematic inspired by [53]

Figure 1.2 Applications of FCM Bioinformatics

Figure 1.3 Applications of FCM Bioinformatics on the IMPC

Figure 2.1 Data Processing Pipeline

Figure 3.1 Data Processing Pipeline

Figure 3.2 Sample FlowCAP-II and IMPC FCS files and their Cell Populations Visualized using t-SNE and FlowSOM (FlowCAP-II FCS file cell populations identified using gating strategies [26, 95, 96])

Figure 3.3 FCM Data Processing Pipeline: Pre-processing; Cells of a FCS file are plotted on markers CD5 and CD11b, with the colours representing the density

Figure 3.4 FCM Data Processing Pipeline: Quality Control

Figure 3.5 FCM Data Processing Pipeline: the gating strategy

Figure 3.6 FCM Data Processing Pipeline: Automated gating and its ability to reduce variance caused by residual and unknown variables [1]

Figure 3.7 Cell Hierarchy: A Representation of the FCS file

Figure 3.8 FCM Data Processing Pipeline: Cell Count Normalization (Example 1)

Figure 3.9 FCM Data Processing Pipeline: Cell Count Normalization (Example 2)

Figure 3.10 FCM Data Processing Pipeline: Feature Design (Examples for Absolute Features)

Figure 3.11 FCM Data Processing Pipeline: Feature Design (Examples for phenodeviant Features)

Figure 3.12 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: Louvain Clustering scores averaged across All Panels plotted for Layer 7, Manhattan Distance Metric, & All Features (see legend) (Proportion of longest distance edges deleted vs F Measure (see Section 3.7.2.1)) – see Section 4: ’How result plots are organized’ for more details on the plot.

Figure 4.1 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Features vs F Measure)

Figure 4.2 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Layers vs F Measure)

Figure 4.3 External Validation of Classification Results for the FlowCAP-II data set on the Panels variable: Distance Matrix scores plotted for All Distance Metrics (Distance Metrics vs F Measure)

Figure 4.4 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Features vs F Measure)

Figure 4.5 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Layers vs F Measure)

Figure 4.6 External Validation of Classification Results for the FlowCAP-II data set on the AML vs Healthy variable: K-NN scores plotted for All Panels & Distance Metrics (Distance Metrics vs F Measure)

Figure 4.7 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs F Measure)

Figure 4.8 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Median Silhouette Index)

Figure 4.9 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Pearson Gamma)

Figure 4.10 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs F Measure)

Figure 4.11 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Median Silhouette Index)

Figure 4.12 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Pearson Gamma)

Figure 4.13 External Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs F Measure)

Figure 4.14 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Median Silhouette Index)

Figure 4.15 Internal Validation of Clustering Results for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Pearson Gamma)

Figure 4.16 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs NCA)

Figure 4.17 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Median Silhouette Index)

Figure 4.18 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Features vs Pearson Gamma)

Figure 4.19 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs NCA)

Figure 4.20 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Median Silhouette Index)

Figure 4.21 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Layers vs Pearson Gamma)

Figure 4.22 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable – for All Panels Distance Matrix vs NCA)

Figure 4.23 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Median Silhouette Index)

Figure 4.24 Internal Validation of Distance Matrices for the FlowCAP-II data set on the AML vs Healthy variable: scores plotted for All Panels & Distance Metrics (Distance Metrics vs Pearson Gamma)

Figure B.1 External Validation of Clustering Results for the Panels variable: K-NN scores for all Distance Metrics (Features vs F Measure)

Figure B.2 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Features vs Median Silhouette Index)

Figure B.3 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Features vs Pearson Gamma)

Figure B.4 External Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs F Measure)

Figure B.5 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs Median Silhouette Index)

Figure B.6 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Layers vs Pearson Gamma)

Figure B.7 External Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs F Measure)

Figure B.8 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs Median Silhouette Index)

Figure B.9 Internal Validation of Clustering Results for the Panels variable: scores for all Distance Metrics (Distance Metrics vs Pearson Gamma)

Figure B.10 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs NCA)

Figure B.11 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs Median Silhouette Index)

Figure B.12 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Features vs Pearson Gamma)

Figure B.13 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs NCA)

Figure B.14 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs Median Silhouette Index)

Figure B.15 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Layers vs Pearson Gamma)

Figure B.16 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs NCA)

Figure B.17 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs Median Silhouette Index)

Figure B.18 Internal Validation of Distance Matrices for the FlowCAP-II data set on the Panels variable: Distance Matrix scores for all Distance Metrics (Distance Metrics vs Pearson Gamma)

Chapter 1

Introduction and Background

The life science community has the ability to create large amounts of biological data at drastically decreasing costs [68]. This has created an abundance of biological data and a lack of tools to interpret them, which led to the rise of bioinformatics. Bioinformatics is an interdisciplinary field of research that aims to develop computational tools to mine useful information out of biological data.

The biological data we will be focusing on in this thesis is flow cytometry (FCM) data.

1.1 Flow Cytometry (FCM)

A cytometer is a high-throughput apparatus capable of simultaneously measuring more than 40 features per single cell, for thousands of cells per second [4, 19, 20]. With its ability to produce big biological data, cytometers have become a routine part of research and diagnosis of diseases of the immune system, such as leukaemia and lymphoma [106]. As such, we will discuss cytometry in the context of the immune system (i.e. the biological samples processed by the FCM here will contain different types of immune cells from organs such as the spleen, bone marrow, and blood [54]).

The cytometer was first invented in 1953, paving the way for cell sorting [41, 31]. This is a process where one identifies what cell type or cell population each cell belongs to, based on the proteins it contains – assuming each cell population can be identified by a unique combination of proteins. In other words, a cell population in a FCS file is defined as a group of cells containing the same group or subgroup of proteins.

Flow cytometers detect these proteins via fluorescence. To do so, a biological sample is first mixed with several different types of markers. Each marker attaches to a certain target protein on the cells. Each marker has an associated fluorochrome that emits fluorescence, light of a certain wavelength or colour, under stimulation. Hence, the markers, or coloured lights, present or absent on a cell indicate the types of proteins the cell contains. For example, suppose the fluorochrome of marker A emits a yellow fluorescence and marker A only attaches to protein 1. If a solution of marker A is mixed with a sample of cells, then only the cells with protein 1 would emit a yellow fluorescence.

Figure 1.1: FCM Machinery and How FCM Analyzes Biological Samples – top half of schematic inspired by [53]

This processed cell sample, suspended in fluid, is then analyzed by the flow cytometer. Once in the machine, the cells are first aligned into a single-file stream by sheath fluid in a component of the machine called the flow cell. After the cells are focused, they pass through a laser beam one at a time. In reaction to the laser, the markers on the cell emit fluorescence, which is detected by an array of photomultiplier tubes (PMTs), each measuring light within a certain range of wavelengths. Finally, whether or not a marker is present is determined by the fluorescence intensity (FI), the brightness of the detected fluorescence. For example, if marker A’s FI value is high on a cell, then that cell has marker A and is A+ (marker A positive); if marker A’s FI value is low, then that cell does not have marker A and is A- (marker A negative). Therefore, an alternative definition of a cell population is a group of cells that have similar FI values for a group of markers.
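The positive/negative rule above can be illustrated with a minimal sketch. It assumes a single fixed FI cutoff per marker, which is a simplification made only for illustration; the marker names, FI values, cutoffs, and the helper `phenotype` are all hypothetical and are not taken from the thesis.

```python
from collections import Counter

def phenotype(cell_fi, thresholds):
    """Label each marker as + or - by comparing its FI to a cutoff."""
    return "".join(
        f"{marker}{'+' if cell_fi[marker] >= thresholds[marker] else '-'}"
        for marker in sorted(thresholds)
    )

# Hypothetical FI values for four cells measured on two markers.
cells = [
    {"A": 950.0, "B": 20.0},
    {"A": 30.0, "B": 15.0},
    {"A": 870.0, "B": 640.0},
    {"A": 910.0, "B": 35.0},
]

# Hypothetical cutoffs separating "low" from "high" FI for each marker.
thresholds = {"A": 500.0, "B": 300.0}

# Cells sharing the same marker signature form one cell population.
populations = Counter(phenotype(cell, thresholds) for cell in cells)
print(populations)
```

Grouping cells by their marker signature mirrors the definition of a cell population as a group of cells with similar FI values for a group of markers. Choosing the cutoffs themselves is a separate problem that this sketch does not address.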

At the same time, flow cytometers also detect a cell’s physical characteristics – namely its size (forward scatter FS) and granularity (side scatter SS).

In summary, for each sample, the machine outputs a file in the Flow Cytometry Standard (FCS) format [100]. This FCS file includes an R × L matrix, where R is the number of cells, L is the number of markers, and each value in the matrix is a FI value. The matrix also includes two or more columns for the physical characteristics of the cells (FS, SS), but for simplicity we will presume the features of a cell to be marker FIs. Usually, a biologist would analyze a single biological sample using different sets of markers, or different panels. This would result in multiple such matrices per sample, one per panel. This thesis will simply refer to a single matrix as a FCS file.
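As a rough illustration of this data layout, the sketch below stores one matrix per panel as a plain list of rows. The marker names, FI values, panel names, and helper functions are hypothetical placeholders, and the scatter columns (FS, SS) are omitted, following the simplification made in the text.

```python
# Hypothetical panel with L = 3 markers; the names are placeholders.
markers = ["A", "B", "C"]

# R = 4 cells; each row holds one cell's FI value for every marker.
fcs_matrix = [
    [950.0, 20.0, 410.0],
    [30.0, 15.0, 388.0],
    [870.0, 640.0, 402.0],
    [910.0, 35.0, 395.0],
]

def shape(matrix):
    """Return (R, L): the number of cells and the number of markers."""
    n_cells = len(matrix)
    n_markers = len(matrix[0]) if matrix else 0
    return n_cells, n_markers

def marker_column(matrix, marker_names, name):
    """Return the FI values of one marker across all cells."""
    j = marker_names.index(name)
    return [row[j] for row in matrix]

# One biological sample analyzed with two panels yields two matrices.
sample = {
    "panel_1": {"markers": markers, "matrix": fcs_matrix},
    "panel_2": {"markers": ["A", "D"], "matrix": [[940.0, 12.0], [25.0, 700.0]]},
}

for panel, data in sample.items():
    print(panel, shape(data["matrix"]))
```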

1.1.1 Mass Cytometry

Mass cytometry is a variation of FCM first commercialized by DVS Sciences in 2009. While FCM detects a spectrum of light emitted by the fluorochromes on the markers of a cell, mass cytometers use mass spectrometry, a technique that detects the mass-to-charge ratio of ionized chemical species on a cell. These chemical species serve the same function as the markers (i.e. a specific mass-to-charge ratio represents a specific chemical species, which in turn reflects whether or not a certain protein is present on the cell). The mass cytometer outputs its data in the FCS format, so they can be analyzed using the same computational tools that analyze the outputs of FCM. Theoretically, mass spectrometry is more precise in that a single type of chemical species is detected as a single mass-to-charge ratio value, in contrast to the range of wavelengths over which the fluorochrome on a single marker can be detected. In addition, it can analyze over 40 such chemical species features per cell [102]. As for its drawbacks, mass cytometry can only analyze hundreds of cells per second in practice, as opposed to a theoretical one thousand cells per second [69].

1.2 Flow Cytometry (FCM) Bioinformatics

Given the array of FCM and mass cytometry improvements, FCM bioinformatics came into the research scene as a new sub-field of bioinformatics in the early 2000s [89, 74]. FCM bioinformatics is focused on computationally storing, organizing, and analyzing high-dimensional FCS files. Recent advances in FCS file standards [100], dissemination routes [99], analytical platforms [98, 23], and benchmark data sets [3, 5, 43] have driven an increasing number of automated tools to complement and possibly replace manual FCS file analysis.

Another large driver in automated tool development is the variability that arises through manual analysis [15]. Manual analysis is the current norm in the FCM community, leaving room for human errors, subjectivity, and bias [89, 51, 74, 65]. The FlowCAP initiative reports that analysis done manually can have up to 94% more variability than analysis done computationally [37]. Manual analysis is also inefficient given the large amount of time and labour needed to go through each FCS file. Together, these factors are driving a trend in the FCM community to move from manual analysis to automated analysis of FCS files.

1.2.1 Applications

FCM bioinformatics serves to extract information for two application settings: clinical and research [85, 74]. Examples of these can be seen in Figure 1.2.

1.2.1.1 Clinical Diagnosis Applications

In clinical environments, a typical use case for FCM data is to differentiate between healthy persons and diseased persons at different stages. In other words, the FCS files need to be classified into the different classes, diseased or healthy, using a classifier. This provides key information for diagnosis. Examples of tools used to do so are provided in Section 2.2.1.

1.2.1.2 Clinical Research Applications

In research settings, one goal would be to create those classification models. A standard scenario is when there are FCS files that have already been labelled with classes. For the purposes of this thesis, we will call these classes ‘control’ and ‘experiment(s)’. In the case that we do know the corresponding effects (control FCS files are from healthy persons and experiment FCS files are from diseased persons), the goal is to:

1. Use the FCS files as input into supervised classification model(s) to create a classifier that would be able to differentiate between FCS files of different classes, and

2. Identify the main features, or bio-markers, that differentiate between those classes. For example, once the cells in each FCS file have been sorted into their appropriate cell populations, if a certain cell population has significantly more cells in the FCS file from a sick patient than in one from a healthy individual, then that cell population can act as a bio-marker to indicate whether or not a person is sick. Examples of tools used to identify these bio-markers are in Section 2.2.2.
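The bio-marker idea in the second step can be sketched as follows. This is only an illustrative toy and not the method of the thesis or of any tool it cites: it flags a cell population when its mean proportion shifts by more than an arbitrary amount between classes, whereas a real analysis would use a proper statistical test. All counts, population names, and the `min_shift` parameter are hypothetical.

```python
from statistics import mean

def proportions(counts):
    """Convert per-population cell counts into proportions of the file."""
    total = sum(counts.values())
    return {pop: n / total for pop, n in counts.items()}

def mean_proportion(files, pop):
    """Average proportion of one cell population across a group of files."""
    return mean(proportions(f).get(pop, 0.0) for f in files)

def candidate_biomarkers(control, experiment, min_shift):
    """Return populations whose mean proportion shifts by at least min_shift."""
    pops = sorted({p for f in control + experiment for p in f})
    shifts = {
        p: mean_proportion(experiment, p) - mean_proportion(control, p)
        for p in pops
    }
    return {p: s for p, s in shifts.items() if abs(s) >= min_shift}

# Hypothetical cell counts per population for each FCS file.
healthy = [
    {"A+B-": 200, "A-B-": 700, "A+B+": 100},
    {"A+B-": 220, "A-B-": 680, "A+B+": 100},
]
sick = [
    {"A+B-": 600, "A-B-": 300, "A+B+": 100},
    {"A+B-": 640, "A-B-": 260, "A+B+": 100},
]

flagged = candidate_biomarkers(healthy, sick, min_shift=0.2)
print(flagged)
```

Here the A+B- and A-B- populations change markedly between classes while A+B+ does not, so only the first two are flagged as candidates.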

1.2.1.3 Exploratory Research Applications

In the circumstance that we do not know what effects the experiment has on the immune cells, we need to conduct biologically meaningful unsupervised clustering of FCS files to understand whether the experiment(s) has:

1. No effect (are the same as control FCS files),

2. Significant effect(s), or

3. Significant effect(s) that are similar or different from other experiments and the bio-markers that correlate with these effects.

The effect referred to in this thesis is a change in the immune system, usually in response to an experiment. To generalize the terminology for all data sets, an experiment is defined as a stimulus or a change in a health condition (e.g. a knocked out gene). To the best of our knowledge, there has not been a formal exploration into this problem, thus this problem serves as the motivation for this thesis.

1.3 Motivation and Contributions

More formally, the motivation and problem of this thesis are as follows. Given a set of FCS files, extract features based on properties of the immune cell populations within those files. These features should then be able to produce meaningful distances between the FCS files for effective unsupervised clustering.
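To make the problem statement concrete, the sketch below turns one feature vector per FCS file into a pairwise distance matrix, which is the usual input to unsupervised clustering. It uses the Manhattan distance, one of the distance metrics named in this thesis; the file names and feature values are hypothetical, and the feature design itself is not represented here.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(features):
    """Return file names and the symmetric matrix of pairwise distances."""
    names = sorted(features)
    matrix = [
        [manhattan(features[a], features[b]) for b in names]
        for a in names
    ]
    return names, matrix

# Hypothetical feature vectors, e.g. cell population proportions per file.
features = {
    "file_1": [0.21, 0.69, 0.10],
    "file_2": [0.22, 0.68, 0.10],
    "file_3": [0.62, 0.28, 0.10],
}

names, dist = distance_matrix(features)
for name, row in zip(names, dist):
    print(name, [round(d, 2) for d in row])
```

If the features are meaningful, files from the same class (here file_1 and file_2) should end up closer to each other than to files from a different class (file_3).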

The motivating scenario for this problem comes from a research initiative, the International Mouse Phenotyping Consortium (IMPC). The IMPC [16] is a collaborative effort to decipher the function behind a target list of 20,000 genes. IMPC is doing so by raising colonies of genetically identical inbred lab mice, setting aside some as controls (wildtype; WT) and the rest as experiments (knockout; KO). Knockout refers to the fact that the experimental mice have had a certain gene knocked out, that is, had its function cancelled. The KO mice are further divided into groups (i.e. the mice in one group all have the same gene knocked out, the mice in another group have another gene knocked out, and so on). All the mice are then raised for 16 weeks before they are euthanized. One assay obtained after their lifetime is FCM, as the immune cells processed through FCM are able to reflect the state of the mice’s immune system, or their immunophenotype. This is the first time FCS files are being generated on such a large, industrial scale: upon completion of the project, the total will be 77,000 FCS files generated in 17 laboratories worldwide over more than 5 years.

The challenge in the IMPC data set is that it is unknown what function many of these genes serve in the immune system. Therefore, one cannot assume that these FCS files belong to known classes – many genes could in fact cause no changes to the immune system at all. This provides a situation where we must rely only on the cell features of each FCS file in order to understand whether or not there are actually any differences between two files of different experiments. If there are, another challenge is to isolate the features associated with these differences. See an example of this in Figure 1.3.

To answer these questions, this thesis is organized as follows.

1. Chapter 2 explains the different steps involved in the FCM data analysis pipeline and the different tools that have been developed for them.

2. Chapter 3 introduces the data processing pipeline we utilized.

3. We also elaborate on how we analyze each FCS file as a cell hierarchy as defined in [75], a representation of the FCS file in the form of a structured graph incorporating all possible cell populations and the relations between them. We hypothesize that using the cell hierarchy representation of a FCS file would allow us to mine information from the files such that they can be clustered more accurately according to the available ground truth class labels of the FCS files.

4. From this, we go into the design of multiple features extracted using the cell hierarchy, and how we derived a distance matrix for each feature describing the distance between each pair of FCS files.

5. We then evaluate how well the distance matrices derived from each feature facilitate clustering of the FCS files into their respective classes.

6. Finally, Chapters 4 and 5 present the evaluation results.

7. Chapter 6 provides concluding remarks and perspectives.

Additionally, this thesis outlines several novel contributions:

1. We propose a new problem to cluster FCS files in a completely unsupervised fashion (Sections 1.2.1.3 and 1.3),

2. We design novel features from each FCS file that incorporate properties of the cell hierarchy (Section 3.5), and finally

3. We show the efficacy of these features on facilitating accurate clustering of benchmark FCS files of known classes (Chapter 4).

This work is a preliminary step in furthering our capacity to analyze which genes are functionally similar and may be involved in similar processes through interactions. The significance behind this ultimate goal of understanding the relation between gene functions is that it allows us to understand the immunophenotypic bio-markers and thus model the effects of a certain KO gene. Given the significant insight these models provide into the function of each mouse gene, they provide us with opportunities to investigate human immunodeficiency diseases of unknown origin.

Chapter 2

Flow Cytometry (FCM) Bioinformatics

2.1 FCM Data Processing Pipeline

This chapter briefly introduces the research areas and open questions in FCM bioinformatics that are involved in the FCM data analysis pipeline shown in Figure 2.1. This pipeline consists of steps to pre-process a FCS file and identify its cell populations, which can in turn be put to use in real world applications as shown in Figures 1.2 and 1.3.

2.1.1 Pre-Processing

In FCM, analysts often opt to first pre-process the FCS files to simplify downstream analysis and amplify data signals. Broadly, the pre-processing stage involves compensation, transformation, quality control, and normalization [113]. As a reminder, the input here is a FCS file, which contains an R × L matrix where R is the number of cells, L is the number of markers, and the values are the FI of each marker.

2.1.1.1 Compensation

Compensation is a standard procedure to ‘un-mix’ fluorescence from different markers [87, 11]. As the fluorescence emitted by the fluorochromes on different markers can be detected as light belonging to overlapping wavelength ranges, the machine may mistakenly categorize these markers. For example, marker A may be defined by a fluorochrome that emits a fluorescence belonging to a range of wavelengths perceived as the colour yellow. However, if its wavelength is near the edge of the yellow range that borders orange, it may be mistakenly detected as being within the range of wavelengths reflecting the colour orange, that is, within the range of wavelengths that the fluorochrome on marker B would emit. The marker A signals that are mistakenly detected as marker B are false positive observations called ‘spill-over’. These can be accounted for by directly subtracting all spill-over from the marker B observations [11]. Figure 3.3 shows the effect of compensation on cells through a 2D scatterplot.
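The subtraction described above can be sketched for a single pair of markers. This is an illustrative Python sketch (the thesis pipeline itself is implemented in R), and it assumes that a fixed fraction of marker A's signal leaks into marker B's channel; the function name `compensate_b`, the spill-over fraction, and the FI values are all hypothetical.

```python
def compensate_b(cells, spill_a_into_b):
    """Subtract marker A's spill-over from the marker B channel.

    cells: list of (fi_a, fi_b) tuples of observed fluorescence intensities.
    spill_a_into_b: assumed fraction of marker A's signal that is detected
        in marker B's channel (a hypothetical, pre-measured constant).
    """
    return [(fi_a, fi_b - spill_a_into_b * fi_a) for fi_a, fi_b in cells]


# Hypothetical observations: the first cell carries only marker A,
# yet it shows a signal of 100 in the marker B channel.
observed = [(400.0, 100.0), (0.0, 500.0)]
compensated = compensate_b(observed, spill_a_into_b=0.25)
```

With more than two markers, the same idea is usually expressed with a full spill-over matrix rather than a single fraction; the pairwise form is kept here only because it mirrors the marker A and marker B example.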

2.1.1.2 Transformation

Many forms of transformation have been proposed for better analysis of FI values in FCS files. Transformation attempts to mitigate several challenges with analyzing FCM data. These challenges include, but are not limited to, the following [38].

1. FI values associated with different markers can be on immensely different scales hence making it hard to systematically analyze cell populations across markers. For example, cell population A+ may show FI values 10,000 larger than those of A-. However, cell population B+ may show FI values just 1,000 larger than those of B-.

2. Outlier events, or rows of the FCS file matrix, occur frequently. These may be debris, mishandled cells, air bubbles, etc.

As FI values are in logarithmic scale [92], a log transform is traditionally applied to FCS files in order to expand the FI values into linear scale, such that the algorithms used downstream in the pipeline are better able to recognize the data signals present in each file. Commonly used transformations in FCM data analysis include the log transform, arcsinh, logicle, and its generalization, the biexponential transform [38]. Furthermore, [38] proposes the use of maximum likelihood to optimize the parameters of these transformations using the flowTrans package. This thesis expands on the logicle transformation and how its parameters are chosen in Section 3.2.2.
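To illustrate why alternatives to the plain log transform are listed, the following illustrative Python sketch (the thesis pipeline itself is implemented in R) compares a log transform with arcsinh on a few hypothetical FI values. It assumes that FI values can be zero or negative, for example after spill-over has been subtracted during compensation; the `cofactor` scaling constant is likewise an assumption and not a value taken from this thesis.

```python
import math


def arcsinh_transform(fi, cofactor=150.0):
    """arcsinh transform; `cofactor` is an assumed scaling constant."""
    return math.asinh(fi / cofactor)


def log_transform(fi):
    """Base-10 log transform; undefined for FI values <= 0."""
    if fi <= 0:
        raise ValueError("log transform is undefined for non-positive FI values")
    return math.log10(fi)


# Hypothetical FI values spanning negative, zero, and large positive values.
values = [-50.0, 0.0, 50.0, 1000.0, 100000.0]
arcsinh_values = [arcsinh_transform(v) for v in values]
```

arcsinh is defined for every value in `values`, is symmetric around 0, and grows like a logarithm for large FI values, whereas `log_transform` raises an error for the first two entries.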

2.1.1.3 Quality Control

Another pre-processing step is to control for quality by checking for and removing technical anomalies in FCS files. As FCM measures cells in a single-file fashion, a steady flow is essential for accurate measurements. However, artifacts, such as air bubbles, clogging, insufficient flow, change in speed, or unwanted particles, can disturb fluorescence signals. Although the flow may quickly return to normal, the anomalous signals would need to be removed. FlowCore [47] provides convenient functions to visually monitor these anomalies over time for each file. Additional packages such as FlowQ [58] and FlowClean [40] are able to remove these time points by using change-point analyses over a linear time trajectory to find time sequences with outlying mean and variance.

2.1.1.4 Normalization

Finally, it may be necessary to normalize the FCS files, and thus remove batch effects, in the pre-processing phase. Batch effects are biological and technical artifacts that arise when FCS files are created at different facilities, come from different biological samples, are made on machines using different settings, etc. Tools to remove these unwanted effects include per-marker distribution normalization via flowStats [45], GaussNorm [46], and FdaNorm [36]. These tools work to align FI distribution peaks that are found to be in common between the FCS files. Other methods such as variance stabilization have also been proposed [9], taking inspiration from traditional statistical analysis and RNAseq data normalization. In addition, many cell population identification tools, such as FLOCK [80], use their own normalization methods to maximize the data signals they are able to mine. Building custom normalization methods into downstream analysis tools provides analysts with a simple one-step solution. For instance, the user may input the raw FCS file and directly receive the cell populations within the file as output, rather than having to work with multiple tools and packages.

2.1.2 Cell Population Identification

The process of identifying cell populations in a FCS file is one where the cells in a file are sorted into their different cell populations (as defined in Section 1.1).

Identifying cell populations is one of the most important steps in the data analysis pipeline, with over 50 tools developed for this purpose [57]. Taking the pre-processed FCS file (R × L matrix) as input, the tools that can identify cell populations, or clusters of cells, are broadly categorized into three types (see Table 2.1):

1. Supervised classification,

2. Unsupervised clustering, and

3. Automated gating, termed a parent type of unsupervised clustering of cells in FCM bioinformatics.

2.1.2.1 Supervised Classification

Supervised classification of cell populations is an open question with various challenges. The formal problem is: given training FCS files with all of their cells labelled with a ground truth cell population class label, train a classifier such that it can classify cells of unknown FCS files into their respective cell populations – as accurately as possible.

The first challenge is a lack of standard panels, that is, standard sets of markers used to analyze FCS files. Currently, most panels are tested and created per experiment per laboratory. With different panels, the FI values and the unique combination of markers that define a certain cell population can vary greatly. To overcome this, organizations are starting to create standard panels (e.g. for leukaemic diagnosis [106], the multi-national project IMPC [1], and the Human Immunophenotyping Consortium (HIPC) [64]). Nevertheless, panel designs, especially in experimental settings, are still largely up to the biologist conducting the experiment and the biological sample in hand. Thus, aside from past experiments in a local laboratory, no public sets of training files are available for specific experiments.

Tool | Type
DeepCyTOF [60] | Supervised Classification
FlowGM [24], HDPGMM [28], ImmunoCLUST [97], SWIFT [72], FLAME [79], FlowClust [62] | Unsupervised Clustering (Mixed-model-based)
FLOCK [80], FlowPeaks [42], ACCENSE [94], ClusterX [21], DensVM [13] | Unsupervised Clustering (Density-based)
X-shift [90], PhenoGraph [59], CLARA [101], SamSPECTRAL [114] | Unsupervised Clustering (Graph-based)
BayesFlow [55], ASPIRE [32] | Unsupervised Clustering (Bayesian-based)
FlowDensity [66], OpenCyto [35] | Automated Gating

Table 2.1: Examples of Cell Population Identification Tools categorized into three types: Supervised Classification, Unsupervised Clustering, and Automated Gating

A second challenge is how easily FCS files can vary depending on external factors such as batch effects. As mentioned previously, FI values are largely influenced by things such as machine settings and sample handling. Tools exist to normalize small systematic changes (as introduced in Section 2.1.1.4). Alignment, sensitivity, and fluidic quality control beads can also be used to normalize mean FI’s and standardize machine settings.

However, it is still difficult to normalize large changes (e.g. the testing FCS files could be analyzed on different flow cytometers, in a different laboratory). The variability between FCS files can pose a problem when training a classification model, as the model tends to overfit on the few training files provided. Tools such as DeepCyTOF [60] (CyTOF being mass cytometers by DVS Sciences) are starting to emerge in an attempt to mitigate these challenges. DeepCyTOF is a deep learning solution that first uses a deep learning network model to unify the distributions of all FCS files (an alternative methodological reference for this step can be found in [93]). It then trains another deep model on a single manually gated FCS file, which then classifies the cells from the FCS files whose distributions were modified by the first model.

2.1.2.2 Unsupervised Clustering

Another approach in cell population identification is unsupervised clustering of cells. Again, unsupervised clustering clusters the cells in a FCS file into cell populations.

Clustering Methodologies

Mixed-model-based Clustering: One clustering procedure in FCM is to fit some type of mixture model to the cells and then proceed with clustering. FlowGM [24], HDPGMM [28], ImmunoCLUST [97], SWIFT [72], FLAME [79], and FlowClust [62] all model the FI values of cells as a variant of the Gaussian mixture model (GMM). They then use expectation maximization (EM) to optimize the GMM to generate initial cell populations.
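The GMM and EM combination mentioned above can be sketched on a single marker. This is an illustrative Python sketch (the thesis pipeline itself is implemented in R) of textbook EM for a one-dimensional Gaussian mixture; it is not the algorithm of any of the cited tools, which use their own mixture variants. The function names, the starting means, and the FI values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, init_means, n_iter=50):
    """Fit a 1D Gaussian mixture with EM, starting from `init_means`."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each cell.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing component
    # Hard assignment: each cell joins its most responsible component.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, labels


# Hypothetical transformed FI values of ten cells on one marker.
fi = [0.8, 1.0, 1.2, 0.9, 1.1, 3.9, 4.0, 4.2, 4.1, 3.8]
means, variances, weights, labels = fit_gmm_1d(fi, init_means=[0.0, 5.0])
```

The two fitted components play the role of initial cell populations: here, a marker-negative and a marker-positive group.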

Density-based Clustering: Another property used is the density distribution of the FI on each marker. FLOCK [80] and FlowPeaks [42] both directly use those density distributions on the original data set to define the shape of the clusters. Misty Mountain [103] also looks at the density distributions but it does so by shedding down density contours. As these density-based procedures, such as density contouring and local maxima searching, can become intractable in higher dimensions, density-based tools, such as ACCENSE [94], ClusterX [21], and DensVM [13], pre-process the FCS file by lowering the number of dimensions using the t-SNE (t-Distributed Stochastic Neighbor Embedding) [63] projection of the original data. This pre-processing allows for faster and more accurate density-based clustering results on specific data sets [110].

Graph-based Clustering: X-shift [90] ties in both density-based clustering and graph-based clustering. It first defines the clusters via density local maxima on a weighted K-nearest neighbour density estimation (K-NN DE) and then connects those clusters on a graph. The clusters that are close together on this graph, the ‘communities’, are subsequently merged. Similarly, PhenoGraph [59] also clusters the cells based on communities, except that it skips the initial clustering phase and directly lays out each cell as a node on a K-NN graph and then connects the cells. CLARA [101] is another method that represents each cell as a node on a graph. It utilizes a force-directed weighted graph with edge weights based on cosine distances between the median FI of cells. It then clusters these cells using scaffold maps. In addition, spectral clustering of cells has also been implemented as a tool called SamSPECTRAL [114].

Bayesian-based Clustering: A few methods have proposed the use of Bayesian statistics in cell clustering. For instance, BayesFlow [55] uses a Bayesian hierarchical model, with or without explicit priors, to create many small cell clusters, which are then merged. Similarly, ASPIRE [32] uses a non-parametric Bayesian approach to cluster cells.

Rare Cell Population Identification

Rare cell populations are often missed during clustering because of how few cells they contain. This is because many methodologies pass rare cell populations off as outliers, debris, or as a part of a larger cell population [74]. Splitting and then merging cell populations as a refinement procedure to obtain accurate rare cell populations is a tactic that can be seen in many methodologies. One such method is SWIFT [72]. SWIFT uses an iterative approach where it takes a user input k as the number of clusters it should expect. It samples cells from the original FCS file to reduce the input size, and it does so repeatedly to avoid missing rare cell populations. It then models these sampled cells as a GMM and optimizes this model until all clusters are fixed and a better parameter k is generated. It further refines this parameter by multi-modality splitting and agglomerative merging of the said clusters. Finally, it uses the refined k and conducts another round of soft clustering, extracting rare cell populations via a Hierarchical Dirichlet process model, similar to HDPGMM [28]. Other methods using this strategy of splitting and merging cell populations include FlowClust/FlowMerge [34], BayesFlow [55], and immunoCLUST [97].

Cell Population Matching and Labelling

After the cells in each FCS file have been clustered into their respective cell populations, it may be necessary to match and then label these cell populations across several FCS files to make them comparable. Clustering tools such as HDPGMM [28], ASPIRE [32], FlowGM [24], ImmunoCLUST [97], and SWIFT [72] simultaneously cluster cells and match them across FCS files. Other tools, such as flowMatch [10], map cell populations across FCS files using the mixed edge cover algorithm after clustering has already been done for all the files.

2.1.2.3 Automated Gating

Gating is the process of manually identifying cell populations within a FCS file. As it is difficult for the human eye to analyze more than 3 dimensions at once, the cells are drawn out on multiple 2D scatterplots analyzed on two markers at a time. The order in which these markers are analyzed is laid out in an instruction manual called the gating strategy. In manual gating, a human expert follows the gating strategy, lays out the cells on the instructed scatterplots, and gates the cells. Here, to gate means to draw borders around regions of cells plotted on the 2D scatterplots. These borders encircle cells that belong to the same cell population (i.e. cells with similar FI values). Automated gating utilizes this expert-created gating strategy and gates on the same 2D scatterplots as one would do manually. But instead of a human drawing borders around target cell populations, automated gating uses FI density distribution patterns to find those borders. Tools created for this category include FlowDensity [66] and OpenCyto [35]; a more detailed explanation of gating and FlowDensity is in Section 3.3.1.
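One way to picture how a border can be found from a density pattern is to place a threshold at the valley between two peaks of a marker's FI histogram. The following illustrative Python sketch (the thesis pipeline itself is implemented in R) does this in a deliberately crude way; it is not the algorithm of FlowDensity or OpenCyto, and the function name, the bin count, and the FI values are hypothetical.

```python
def valley_threshold(values, n_bins=20):
    """Return an FI threshold at the emptiest histogram bin lying between
    the tallest bin of the lower half and the tallest bin of the upper half."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all FI values are identical; there is no valley to find")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    half = n_bins // 2
    left_peak = max(range(half), key=counts.__getitem__)
    right_peak = max(range(half, n_bins), key=counts.__getitem__)
    valley = min(range(left_peak, right_peak + 1), key=counts.__getitem__)
    return lo + (valley + 0.5) * width  # centre of the valley bin


# Hypothetical transformed FI values: a negative group near 1, a positive group near 5.
fi = [0.9, 1.0, 1.1, 1.0, 1.2, 0.8, 1.0, 4.9, 5.0, 5.1, 5.0, 5.2, 4.8]
threshold = valley_threshold(fi, n_bins=10)
positive = [v for v in fi if v > threshold]
```

Cells above `threshold` would form the marker-positive population on this marker and the rest the marker-negative population.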

2.1.2.4 Cell Population Visualization

Techniques have also been proposed to visualize the distribution of cells in a FCS file by reducing the dimensionality down to a human-analyzable 2 or 3 dimensions. Traditional dimensionality reduction methods used include PCA [111] and t-SNE [63] (as a part of viSNE [8] and one-SENSE [25]). Another way to visualize these cells is to first cluster the cells into cell populations and then arrange those cell populations on a 2D surface. Tools that do so include SPADE [81] and FlowSOM [107]. SPADE agglomeratively clusters a down-sampled set of cells and then organizes them as a minimum spanning tree (MST), while FlowSOM uses self-organizing maps to organize the orientation of the cells, with options to display clusters of cells connected in a MST. Clustering tools such as PhenoGraph [59] and CLARA [101], and post-clustering analysis tools such as the flowType/RchyOptimyx pipeline [75] and FloReMi [108], also have visualization capabilities.

Figure 3.2 in Section 3.1 shows examples of visualization outputs from t-SNE and FlowSOM.

2.2 Applications

This section goes over the current state of literature aimed at applying FCM data to the application scenarios described in Section 1.2.1 and in Figure 1.2.

2.2.1 FCS file Classification

A FCS file classifier (as used in Section 1.2.1.1 and as described in Section 1.2.1.2) can be trained on raw FCS files, or on processed FCS files and their cell populations found by tools from Section 2.1.2. An example tool of the former type is Team21 [5]. Tools belonging to the latter type of classifier include the FlowCore/FlowStats statistical pipeline [45] and flowMatch [10]. FlowMatch creates a hierarchical model by agglomeratively meta-clustering, or matching, the cell populations across the training FCS files of a given class. This step outputs a classifier containing one hierarchical model per class. When given a processed FCS file and its cell populations, flowMatch calculates a similarity index between it and the cell population model of each class. FlowMatch then labels this FCS file with the same class as the model it is most similar to. Most, if not all, classification tools follow this two-step process of first creating a classifier and then classifying files. The difference between the tools is the model that is used to define the classifier.

2.2.2 Biomarker Cell Population Identification

Not only do we want to classify the FCS files, we also need to understand the bio-markers, or cell populations, that correlate with and signal a difference between the different classes of files (as described in Section 1.2.1.2). In other words, given that the cell counts of each cell population represent the features of a FCS file, the bio-markers are those cell populations whose cell counts are significantly different between files of different classes. The goal of feature selection here is to find those populations. This step can also be done after cell populations have been identified. Pipelines that have integrated this process include SamSPECTRAL [114], FloReMi [108], gEM/GANN [105], the flowType/RchyOptimyx pipeline, the flowType/FeaLect (Feature Selection for Sample Classification) pipeline [75], Citrus (hierarchical clustering) [17], and COMPASS [61]. The latter four find rare cell populations or possible bio-markers by first over-clustering the cells. They then pick out and/or merge the cell populations whose features correlate with FCS file class labels and highlight them as bio-markers. Another tactic is to use survival or competitive-based models, where cell populations that do not correlate with any class labels are eliminated from the list of candidate bio-markers. Methodologies that incorporate such a model include the previously mentioned COMPASS [61], Competitive SWIFT [83], and the flowClust/survival-model pipeline [12].
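As a minimal sketch of what finding populations with significantly different cell counts can look like, the following illustrative Python code (the thesis pipeline itself is implemented in R) ranks cell populations by Welch's t statistic between two classes of FCS files. Welch's t is swapped in here as a generic choice and is not the method of any pipeline cited above; the population names and cell counts are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        # Degenerate case: no spread within either class.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def rank_populations(control, experiment):
    """Rank cell populations by |t| between experiment and control files.

    control / experiment: dict mapping a population name to a list of
    cell counts, one count per FCS file of that class.
    """
    scores = {pop: welch_t(experiment[pop], control[pop]) for pop in control}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical cell counts for two populations in three files per class.
control = {"CD34+": [600, 720, 660], "CD3+": [90000, 96000, 84000]}
experiment = {"CD34+": [12000, 15000, 13200], "CD3+": [87000, 93000, 90000]}
ranked = rank_populations(control, experiment)
```

The population at the top of `ranked` is the strongest candidate bio-marker in this toy data; a real analysis would also need to correct for testing many populations at once.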

2.3 Remarks

FCM bioinformatics is a growing sub-field of bioinformatics. It caters to the needs of, and answers questions posed by, the life science community in an automated and more efficient manner. The research areas in this field are focused on creating tools that come together in a FCM data analysis pipeline. These tools pre-process FCS files and identify the cell populations defining the immunophenotype of each FCS file. There are also tools available for conducting analyses on those results to provide biologically meaningful insight into the immunophenotype of the FCS file(s) at hand. Chapter 3 describes one such pipeline, whose purpose is to extract informative features from those immunophenotypes.

Chapter 3

Data and Methods

This chapter covers the implementation of the generic data processing pipeline introduced in Chapter 2. This pipeline is used to process the IMPC data set from [1] and is displayed in Figure 3.1. We also apply a similar pipeline to process a benchmark data set from FlowCAP-II [5].

We then give a definition of the cell hierarchy [75], and elaborate on how we use it to extract features from each processed FCS file. Afterwards, we describe how a distance matrix and clustering of FCS files is derived per feature. These are then used to evaluate how well each feature can facilitate clustering of FCS files into informative clusters.

It is also important to keep in mind the ultimate goal of extracting informative features and subsequently, clusters of FCS files in the context of our overarching motivation – to identify KO genes from IMPC that are similar or different from each other, which may indicate possible gene interactions.

All of the tools mentioned are freely available on the Bioconductor [43] platform. All methods described are implemented in the R language.

3.1 Data

The two data sets experimented on are from FlowCAP-II (Critical Assessment of Population Identification) [5] and IMPC (from data generated in the Sanger Centre [1]).

3.1.1 IMPC

The data processing methodology prior to feature extraction is put together for the IMPC data set from the Sanger Centre [1], on standard IMPC panels. The example images displayed in this thesis contain FCS files analyzed on ‘Panel 2’, from biological samples extracted from the mouse spleen. Panel 2 contains 10 markers. In total, there are 2506 FCS files, one FCS file per mouse, each file having about 300,000 cells.

Figure 3.2: Sample FlowCAP-II and IMPC FCS files and their Cell Populations Visualized using t-SNE and FlowSOM (FlowCAP-II FCS file cell populations identified using gating strategies [26, 95, 96])

Here, we define the term, variable, as an attribute of the biological sample that a FCS file is analyzed for. The main variables influencing the FCS files are:

1. Gene: Each FCS file is from a mouse that had a single gene (or no gene) knocked out. 564 FCS files are from control WT mice, and 1942 files are from experiment mice with different genes knocked out (KO). Each KO gene has at most 6 FCS files, usually distributed over different days and genders.

2. Date: The initial set of KO and WT FCS files used in this thesis were created between 2014-06-09 and 2016-01-27. Note that there are days where KO FCS files were created but WT FCS files were not.

3. Gender: There is clear evidence of gender dimorphism in the mouse’s immune system [1].

4. Centre of FCS file creation: Files generated at different centres can be drastically different depending on how the mice are raised, the time zone, how the biological samples are transported, preserved etc. In fact, the centre of file creation is the largest source of variability between FCS files.

The variable that we are most interested in is gene. The problem here is to cluster the FCS files into their respective KO gene or WT clusters, with the exception that FCS files from mice whose KO gene has no effect on the immune system should be clustered with the FCS files from WT mice. As such, we obtain a multi-cluster clustering problem.

In this thesis, we will not be including results for the IMPC data set beyond the pre-processing and the cell population identification pipeline. What makes evaluating feature designs here more difficult is that we do not know what effects, if any, each KO gene has on the immune system. Therefore, we do not know how many real clusters there should be. Even if we did know how many clusters there are, we would not be certain of how homogeneous and how well separated the clusters should be. Our goal for IMPC, however, is not necessarily to obtain high clustering or classification accuracy against a given ground truth. Rather, what we want is simply to obtain a distance matrix that can be used to get an idea of how KO genes relate to each other. Nevertheless, in this thesis, we want to first ensure our pipeline can obtain reliable distances before it is applied on the IMPC data set.

In the following sections, we will refer to the different types of FCS files on these different variables as FCS files of different classes. For example, if we are referring to the gender variable, FCS files from female mice are of one class and FCS files from male mice are of another class. Furthermore, we will refer to the WT and KO FCS files as control and experiment FCS files respectively.

3.1.2 FlowCAP-II AML

The FlowCAP-II AML data set [5] is the benchmark data set we will use to extract and test features on. It is available on FlowRepository [99]. The data processing pipeline used on FlowCAP-II consists of the same steps as the one used on the IMPC data set. Within the data set, 8 files are extracted per human individual, one of which acts as a control (which will not be considered in this thesis), while the other 7 are analyzed on different panels, mixed with different sets of 5 markers and FS/SS (note: referenced papers may also refer to these panels as ‘tubes’). The FlowCAP-II data set contains a total of 2,513 FCS files (7 panels for each of 359 individuals) from 43 acute myeloid leukemia (AML) positive patients and 316 healthy individuals. Each file consists of approximately 60,000 cells. In summary, the ground truth variables this data set contains are:

1. Panels: the different panels on which the files are analyzed. These are not comparable and thus different from each other.

2. AML positive vs Healthy individuals: the FCS files from patients with and without AML should be different from each other. An important note is that AML positive patients have a larger CD34+ cell population than healthy individuals. CD34 is a marker usually associated with stem cells, which give rise to all cells within the immune system, and is expressed on AML positive blast cells [109].

This provides us with the problem of generating features that facilitate accurate clustering of the FCS files according to these two variables.

Again, the FCS files of different types (i.e. analyzed on different panels, from an AML positive or healthy patient) are referred to as being FCS files of different classes. For consistency, we will refer to the healthy individuals’ files and AML positive patients’ files as control and experiment FCS files respectively.

3.2 Pre-Processing

Before the FCS files can be analyzed, several pre-processing steps are performed. For these steps, the input is a FCS file, or the R × L matrix containing raw FI values. The output is of the same dimensions, but with pre-processed FI values.

3.2.1 Compensation

Compensation is a standard procedure that occurs as a first step in the data analysis pipeline, previously described in Section 2.1.1.1. As shown in Figure 3.3 between steps 2 and 3, compensation helps to un-mix cells whose markers may have been detected incorrectly.

Figure 3.3: FCM Data Processing Pipeline: Pre-processing; Cells of a FCS file are plotted on markers CD5 and CD11b, with the colours representing the density

3.2.2 Transformation

After the FCS file has been compensated, the cells with maximum or negative FS and SS values are first removed. This is because ‘cells’ with abnormally small or large FS and SS values can be interpreted as debris or large non-biological particles.
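The scatter filter described above can be sketched as follows. This is an illustrative Python sketch (the thesis pipeline itself is implemented in R); the dictionary keys "FS" and "SS" and the limits `fs_max` and `ss_max` are assumptions. In particular, the text does not state whether the maximum is the detector's saturation value or the largest value observed in a file, so it is passed in as a parameter.

```python
def remove_scatter_outliers(cells, fs_max, ss_max):
    """Drop events whose FS or SS value is negative or sits at the maximum.

    cells: list of dicts holding at least the keys "FS" and "SS".
    fs_max, ss_max: the maximum FS and SS values (assumed to be known).
    """
    return [
        cell for cell in cells
        if 0 <= cell["FS"] < fs_max and 0 <= cell["SS"] < ss_max
    ]


# Hypothetical events: the second is saturated on FS, the third is negative on SS.
events = [
    {"FS": 52000.0, "SS": 31000.0},
    {"FS": 262143.0, "SS": 40000.0},
    {"FS": 48000.0, "SS": -12.0},
]
kept = remove_scatter_outliers(events, fs_max=262143.0, ss_max=262143.0)
```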

Data transformation is a procedure that changes the distribution of data (FI values). This is done to allow downstream analysis tools to detect signals that may otherwise have been hidden. In this pipeline, we use logicle transform [76] – whose effects can be seen in steps 1 and 2 of Figure 3.3. This step is done only for the FI values and not the FS and SS values.

The logicle transform is another name for the parametrized biexponential function:

$$S(x; a, b, c, d, f) = a \cdot \exp(bx) - c \cdot \exp(-dx) + f,$$

a generalization of the hyperbolic sine function:

$$\sinh(x) = \frac{\exp(x) - \exp(-x)}{2},$$

where x is the FI value to be transformed. The logicle transform has the useful property that it spreads data out like a log transform while maintaining a near-linear scale around 0 [76].
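The relation between the two functions above can be made explicit with a short derivation, using only the symbols a, b, c, d, f and x already defined. Choosing a = c = 1/2, b = d = 1 and f = 0 recovers the hyperbolic sine:

$$S\left(x; \tfrac{1}{2}, 1, \tfrac{1}{2}, 1, 0\right) = \tfrac{1}{2}\exp(x) - \tfrac{1}{2}\exp(-x) = \sinh(x).$$

For small x, the first-order approximations $\exp(bx) \approx 1 + bx$ and $\exp(-dx) \approx 1 - dx$ give

$$S(x; a, b, c, d, f) \approx (a - c + f) + (ab + cd)\,x,$$

which is linear in x. For large x (with b, d > 0), the term $a \cdot \exp(bx)$ dominates, so the inverse of S grows like a logarithm. This is one way to see where the near-linear behaviour around 0 and the log-like behaviour elsewhere come from.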

In flow cytometry, there are several considerations that can be used to simplify the parametrized biexponential function. Using the FCS files available, the following can be defined:

1. T is the maximum value of the original FI value x to be analyzed.

2. m indicates the upper bound of the transformed FI value (e.g. m = 4 ln(10) means that the transformed data will fall between 0 and 4).

3. w adjusts the strength of linearization around 0.

Plugging these into the biexponential function we get:

$$S(x; w) = T \cdot \exp(-(m - w)) \cdot \left( \exp(x - w) - p^{2} \exp\left(-\frac{x - w}{p}\right) + p^{2} - 1 \right), \quad x \geq w,$$

where p can be derived exclusively from parameter w via $w = \frac{2p \ln(p)}{p + 1}$. For this thesis, w is derived from a ‘global frame’ (see details in [76]), created by sampling 1,000 cells from each FCS file. The same parameters are then used to transform all FCS files.
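Since the relation between w and p has no closed-form solution for p, it has to be solved numerically. The following illustrative Python sketch (the thesis pipeline itself is implemented in R) solves it by bisection and then evaluates S(x; w) for x ≥ w. The function names and the example values of T and w are assumptions, not the parameters used in this thesis.

```python
import math


def solve_p(w, tol=1e-12):
    """Solve w = 2 p ln(p) / (p + 1) for p >= 1 by bisection."""
    if w < 0:
        raise ValueError("w must be non-negative")

    def residual(p):
        return 2.0 * p * math.log(p) / (p + 1.0) - w

    lo, hi = 1.0, 2.0
    while residual(hi) < 0:  # widen the bracket until it contains the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def biexponential(x, T, m, w):
    """Evaluate S(x; w) as defined above, for x >= w."""
    if x < w:
        raise ValueError("this branch of S is only defined for x >= w")
    p = solve_p(w)
    bracket = math.exp(x - w) - p ** 2 * math.exp(-(x - w) / p) + p ** 2 - 1.0
    return T * math.exp(-(m - w)) * bracket


# Hypothetical parameters: T = 2^18, m = 4 ln(10), w = 0.5.
T, m, w = 262144.0, 4.0 * math.log(10.0), 0.5
p = solve_p(w)
s_at_w = biexponential(w, T, m, w)  # S vanishes at x = w
s_at_m = biexponential(m, T, m, w)  # S is close to T at x = m
```

Bisection is used because the right-hand side of the relation increases monotonically in p for p ≥ 1, so the root is unique and always bracketed.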

Figure 3.4: FCM Data Processing Pipeline: Quality Control

3.2.3 Quality Control

As mentioned in Section 2.1.1.3, quality control is a necessary step to delete anomalies in the FCS files.

In this pipeline, we use a cleaning tool developed at the Brinkman Lab, Terry Fox Lab, BC Cancer Research Agency.

The pipeline takes the transformed FCS file as input. In this thesis, we divide the total time used to analyze a FCS file into 500 equally long time intervals. We term these time intervals ‘bins’.

As a precursor step, we identify time bins in which the number of cells the flow cytometer measured is less than 10% of the number of cells measured during the time bin that measured the most cells. These time bins usually occur for a brief amount of time at the beginning, at the end, or during errors in the midst of the flow process. The cells measured during these time bins are removed.
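This precursor step can be sketched as follows. It is an illustrative Python sketch (the thesis pipeline itself is implemented in R, using the cleaning tool named above), assuming each cell comes with a recorded acquisition time; the function names and the toy data are hypothetical.

```python
def low_count_bins(times, n_bins=500, min_fraction=0.10):
    """Return the set of sparse time bins and the bin index of every cell.

    A bin is sparse when it holds fewer than `min_fraction` of the number
    of cells found in the fullest bin.
    """
    t0, t1 = min(times), max(times)
    if t1 == t0:
        raise ValueError("all cells share one time stamp; cannot form bins")
    width = (t1 - t0) / n_bins
    bin_of = [min(int((t - t0) / width), n_bins - 1) for t in times]
    counts = [0] * n_bins
    for b in bin_of:
        counts[b] += 1
    cutoff = min_fraction * max(counts)
    sparse = {b for b, count in enumerate(counts) if count < cutoff}
    return sparse, bin_of


def drop_sparse_bins(cells, times, n_bins=500, min_fraction=0.10):
    """Remove the cells that were measured during sparse time bins."""
    sparse, bin_of = low_count_bins(times, n_bins, min_fraction)
    return [cell for cell, b in zip(cells, bin_of) if b not in sparse]


# Toy example with 4 bins instead of 500: the flow nearly stops between t = 2 and t = 3.
times = (
    [i / 20 for i in range(20)]
    + [1 + i / 20 for i in range(20)]
    + [2.5]
    + [3 + i / 20 for i in range(20)]
    + [4.0]
)
cells = list(range(len(times)))  # stand-ins for rows of the FCS matrix
sparse, _ = low_count_bins(times, n_bins=4)
kept = drop_sparse_bins(cells, times, n_bins=4)
```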

For each bin, the 5th, 20th, 50th, 80th, and 95th percentiles are recorded along with the mean fluorescence value and the 2nd and 3rd central moments. All of these values are then tested for outliers separately. Here, an outlier is any bin with a value more than three standard deviations above or below the value with the maximum frequency. In general, this value would be around the mean. For any time bin, if more than 4 of the above values are marked as outliers for any marker, all cells measured during that time bin are removed. This step is shown in Figure 3.4, where the cells are plotted for each marker over time (at which a cell is analyzed by the machine). The colour represents the density while the axis shows the FI values for the corresponding markers.
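The outlier rule above can be sketched for a single marker. This is an illustrative Python sketch (the thesis pipeline itself is implemented in R), not the cleaning tool's actual code. Two assumptions are made and marked in the comments: the value with the maximum frequency is approximated by the median across bins, and percentiles use a simple nearest-rank rule. All names and data are hypothetical.

```python
import statistics


def percentile(sorted_vals, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[min(len(sorted_vals) - 1, max(0, idx))]


def bin_summary(fi_values):
    """The eight per-bin statistics listed above, for one marker."""
    vals = sorted(fi_values)
    mu = statistics.fmean(vals)
    m2 = statistics.fmean([(v - mu) ** 2 for v in vals])  # 2nd central moment
    m3 = statistics.fmean([(v - mu) ** 3 for v in vals])  # 3rd central moment
    return [percentile(vals, q) for q in (5, 20, 50, 80, 95)] + [mu, m2, m3]


def outlier_bins(bins, n_sd=3.0, max_flags=4):
    """Indices of time bins with more than `max_flags` outlying statistics."""
    summaries = [bin_summary(b) for b in bins]
    flags = [0] * len(bins)
    for s in range(len(summaries[0])):
        column = [summary[s] for summary in summaries]
        centre = statistics.median(column)  # assumption: stand-in for the mode
        sd = statistics.pstdev(column)
        for i, value in enumerate(column):
            if sd > 0 and abs(value - centre) > n_sd * sd:
                flags[i] += 1
    return [i for i, count in enumerate(flags) if count > max_flags]


# Toy data: 29 well-behaved bins and one bin whose FI values are shifted upwards.
normal_bin = [float(v) for v in range(100)]
shifted_bin = [v + 500.0 for v in normal_bin]
flagged = outlier_bins([normal_bin] * 29 + [shifted_bin])
```

In the full procedure this check would be repeated for every marker, and a time bin would be removed as soon as it is flagged for any one of them.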

3.3 Cell Population Identification

3.3.1 Automated Gating

Taking a cleaned FCS file as input, the next step is to cluster each cell into its respective cell population using the automated gating tool flowDensity [66].

Again, gating is the process of manually identifying cell populations within an FCS file on multiple 2D scatterplots of cells analyzed on two markers at a time. The order in which these markers are analyzed is laid out in a gating strategy.

Figure 3.5 illustrates a gating strategy and its importance using a toy example. Note that while this one has only two steps, a gating strategy usually consists of many steps, or 2D scatterplots, before all the cell population borders are defined. In this thesis, the definition of a border is equivalent to that of a gate – an FI threshold value for a marker. Threshold values are defined such that they separate cells into cell populations containing cells with an FI value either greater than or lower than the threshold(s). Let us suppose we are given a first 2D scatterplot on marker CD11b and its threshold b. The cells here

Figure 3.6: FCM Data Processing Pipeline: Automated gating and its ability to reduce variance caused by residual and unknown variables [1]

would be divided based on whether their marker CD11b FI value is greater or less than the threshold. If a cell has an FI value greater than b, it is marker CD11b positive, i.e. part of the ‘CD11b+’ population; otherwise it is marker CD11b negative, i.e. part of the ‘CD11b-’ population. After the first gating, only CD11b+ cells are used to form a second scatterplot on markers CD8 and Ly6C. This scatterplot is used to establish thresholds a and c. In this case, the human expert designed the gating strategy this way because they determined a biologically valid threshold a to be the valley in the second scatterplot (plotting cells in population CD11b+) of Figure 3.5. If the gating strategy is not used, then a′ (as shown by the red dashed line) on the first scatterplot can easily be misinterpreted as the threshold.

Therefore, this project opts to use a gating strategy in order to incorporate expert knowledge and allow for comparability across files. The panel 2 gating strategy used for the IMPC Sanger Centre data is the one used in [1].
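The two-step toy gating strategy described above can be sketched as follows. The FI values and thresholds are invented for illustration, the helper `gate` is hypothetical, and the assignment of threshold a to CD8 and c to Ly6C is an assumption, since the text only states that both are set on the second scatterplot. Cells exactly at a threshold are placed in the negative population here, which is also an assumption.

```python
def gate(cells, marker, threshold):
    """Split cells into (positive, negative) by one FI threshold."""
    positive = [c for c in cells if c[marker] > threshold]
    negative = [c for c in cells if c[marker] <= threshold]
    return positive, negative


# Invented toy FI values; real values come from a cleaned FCS file.
cells = [
    {"CD11b": 3.2, "CD8": 0.4, "Ly6C": 2.9},
    {"CD11b": 2.8, "CD8": 2.6, "Ly6C": 0.7},
    {"CD11b": 0.5, "CD8": 1.4, "Ly6C": 1.1},
    {"CD11b": 0.3, "CD8": 1.6, "Ly6C": 0.2},
]
a, b, c = 1.5, 1.0, 1.5  # hypothetical thresholds

# Step 1: gate all cells on CD11b using threshold b.
cd11b_pos, cd11b_neg = gate(cells, "CD11b", b)

# Step 2: only CD11b+ cells are carried into the second scatterplot.
cd8_pos, cd8_neg = gate(cd11b_pos, "CD8", a)
ly6c_pos, ly6c_neg = gate(cd11b_pos, "Ly6C", c)
```

Note that the two CD11b- cells have CD8 values near the threshold a; because they are excluded before the second step, they cannot influence where that threshold is placed, which is the point of following the gating order.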

Instead of gating by hand, we follow the gating strategy and find the thresholds using the R package flowDensity [66] – the ‘automated’ part of automated gating. flowDensity is configured according to the gating strategy so that it finds each threshold based on common density distribution scenarios. The three most common density distribution scenarios and their thresholds are:

1. Bimodal (or bimodal after smoothing and/or selection of two target modes, i.e. peaks on a density distribution): set a threshold on the valley.
