3.5 Feature Design
3.5.2 Phenodeviant features
Phenodeviant features are defined as features containing information on how significantly different the experiment FCS files’ cell population nodes or edges are from those of the control FCS files. These features help identify the immunophenotypic differences between experiment FCS files and control FCS files, and can act as a filter to remove insignificant feature attributes (i.e. nodes/edges).
To illustrate, consider the phenodeviant node feature (LogFold), which uses the natural logarithm ln. Suppose that for node A the experiment file has a normalized cell count of 10, while the control files have a mean normalized cell count of 50. Then the LogFold feature value for node A of the experiment FCS file would be ln(10/50) = -1.61. Phenodeviant features are not needed for a control FCS file, because a control is not expected to differ from the other controls. More features are presented later in this section.
IMPC: For the IMPC Sanger Centre data, the controls are the FCS files from the WT mice, while there are multiple types of experiments, each consisting of mice with a different gene knocked out.
In the IMPC data set, several confounding factors need to be accounted for before extracting the phenodeviant features. One of these is the date on which a file was created. An example of how to reduce its effect is as follows. We first separate the FCS files based on date: the WT files are plotted by date of creation and then segmented into groups, with segment boundaries placed on dates where the total normalized cell count of the FCS files changes drastically. These changepoints are detected by the method of [22], with the modified Bayes information criterion (MBIC) [116]. To create a KO file’s phenodeviant features, we use at most 70 WT files: those that lie in the same date segment as the KO file and were created on the dates closest to the KO file’s creation date. As the phenodeviant features reflect how different an experiment FCS file is from the control FCS files, we only produce phenodeviant features for the experiment files. In other words, each experiment (KO) FCS file obtains phenodeviant features in addition to its absolute features.
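The control-selection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the changepoint dates are taken as already computed, a changepoint date is assumed to start a new segment, and the names `segment_index`, `select_controls`, and the example file names are hypothetical.

```python
from bisect import bisect_right
from datetime import date

MAX_WT = 70  # cap on the number of WT files, as stated in the text


def segment_index(d, changepoints):
    """Return the index of the date segment containing d.

    ASSUMPTION: `changepoints` is a sorted list of dates, and each
    changepoint date is the first day of a new segment.
    """
    return bisect_right(changepoints, d)


def select_controls(ko_date, wt_files, changepoints, max_wt=MAX_WT):
    """Pick the WT files used as controls for one KO file.

    wt_files is a list of (file_name, creation_date) pairs. Only WT files
    in the same date segment as the KO file are kept, and of those at most
    `max_wt` with creation dates closest to `ko_date` are returned.
    """
    seg = segment_index(ko_date, changepoints)
    same_segment = [
        (name, d) for name, d in wt_files
        if segment_index(d, changepoints) == seg
    ]
    # Closest creation dates first (sorted() is stable, so ties keep input order).
    ranked = sorted(same_segment, key=lambda item: abs((item[1] - ko_date).days))
    return [name for name, _ in ranked[:max_wt]]


# Hypothetical example: one changepoint, four WT files, one KO file.
changepoints = [date(2015, 6, 1)]
wt_files = [
    ("wt_a.fcs", date(2015, 5, 1)),
    ("wt_b.fcs", date(2015, 5, 20)),
    ("wt_c.fcs", date(2015, 6, 10)),  # different segment, never selected
    ("wt_d.fcs", date(2015, 5, 28)),
]
selected = select_controls(date(2015, 5, 25), wt_files, changepoints, max_wt=2)
```

Restricting the candidates to the KO file's segment before ranking by date distance ensures that a WT file just across a changepoint is never used, even if it is closer in time than any WT file in the same segment.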
FlowCAP-II: To keep the terminology consistent, recall that the controls in the FlowCAP-II data set are the FCS files from healthy individuals and the experiments are the FCS files from AML-positive patients. Since there is only one type of experiment, we simulate the IMPC scenario of multiple experiments by randomly assigning half of the healthy individuals as the controls and the other half as an experiment that should be different from the AML FCS files. As such, the phenodeviant features are extracted for half of the healthy individuals and all of the AML patients.
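The random assignment of healthy individuals could look like the sketch below. The function name `split_healthy`, the fixed seed, and the sample identifiers are assumptions made for illustration; the text does not say how an odd number of individuals is handled, so here the extra individual falls into the second half.

```python
import random


def split_healthy(healthy_ids, seed=0):
    """Randomly split healthy individuals into controls and a pseudo-experiment.

    ASSUMPTION: a seeded generator is used so the split is reproducible.
    """
    rng = random.Random(seed)
    shuffled = list(healthy_ids)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    controls = shuffled[:half]
    pseudo_experiment = shuffled[half:]
    return controls, pseudo_experiment


# Hypothetical identifiers for ten healthy individuals.
healthy = [f"healthy_{n:02d}" for n in range(10)]
controls, pseudo_experiment = split_healthy(healthy, seed=42)
```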
Additionally, each experiment is only compared against the control FCS files that are analyzed on the same set of markers, or panel, as the experiment FCS file in question.
Examples of phenodeviant features can be found in Figure 3.11.
Figure 3.11: FCM Data Processing Pipeline: Feature Design (Examples for phenodeviant Features)
The following describes node-based phenodeviant features $x^{\langle\text{feature}\rangle i} = \{x^{\langle\text{feature}\rangle i}_j\}$. For each node $v_j$ in experiment FCS file $i$, its features $x^{\langle\text{feature}\rangle i}_j$ compared against the same node in the corresponding control FCS files $x^{\langle\text{feature}\rangle \text{ref}}_j$ are as follows.
1. (Pval) is a single-sample p value $x^{\text{pval}\,i}_j$ representing how different the normalized cell count in the experiment FCS file $x^{\text{count}\,i}_j$ is from that of the control FCS files $x^{\text{ref}}_j$. This allows us to extract information on cell populations that are affected by the experiment. The single-sample t statistic for obtaining the p value is calculated as:
$$t^{\text{pval}\,i}_j = \frac{x^{\text{count}\,i}_j - \rho_j}{\delta_j}, \qquad \rho_j = \frac{\sum \{x^{\text{ref}}_j\}}{|x^{\text{ref}}_j|}$$
where $\delta_j$ is the standard deviation of $x^{\text{ref}}_j$. From this t statistic, we then obtain a p value, which is doubled to perform a two-tailed test. Finally, we apply a ($-\ln$) (or ($\ln$) for cell population nodes in the experiment FCS file whose value is lower than $\rho_j$) to the p values to produce $x^{\text{pval}\,i}_j$.
2. (LogFold) is the ln of the ratio between the experiment FCS file and the mean of the control files, calculated as:
$$x^{\text{lratio}}_j = \ln\left(\frac{x^{\text{count}}_j}{\rho_j}\right)$$
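The two node features above can be sketched for a single node as follows. This is an illustration under explicit assumptions, not the original implementation: the standard deviation is taken as the sample standard deviation, and, to keep the sketch free of third-party dependencies, the tail probability is computed from a standard normal approximation rather than from the t distribution. The function name `node_features` and the example counts are hypothetical.

```python
import math
from statistics import mean, stdev


def node_features(x_count, ref_counts):
    """Return (t, p, pval_feature, logfold) for one node of one experiment file.

    x_count    : normalized cell count of the node in the experiment file
    ref_counts : normalized cell counts of the same node in the control files
    """
    rho = mean(ref_counts)     # mean of the controls
    delta = stdev(ref_counts)  # ASSUMPTION: sample standard deviation
    t = (x_count - rho) / delta

    # ASSUMPTION: the one-tailed probability uses a normal approximation;
    # a t distribution would be substituted here in practice.
    one_tailed = 0.5 * math.erfc(abs(t) / math.sqrt(2))
    p = min(1.0, 2.0 * one_tailed)  # doubled for a two-tailed test
    p = max(p, 1e-300)              # guard against log(0) on underflow

    # -ln(p) when the count is at or above the control mean, ln(p) when below,
    # so the sign of the feature encodes the direction of the change.
    pval_feature = -math.log(p) if x_count >= rho else math.log(p)

    # LogFold: ln of experiment count over control mean (zero counts not handled).
    logfold = math.log(x_count / rho)
    return t, p, pval_feature, logfold


# Hypothetical control counts with mean 50 and sample standard deviation 10.
controls = [40.0, 50.0, 60.0]
high = node_features(80.0, controls)  # count above the control mean
low = node_features(20.0, controls)   # count below the control mean
```

In this example both calls give the same p value, but the Pval feature is positive for `high` and negative for `low`, matching the sign convention described in item 1.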
For each edge $e_{jk}$ in experiment FCS file $i$, its features $x^{\langle\text{feature}\rangle i}_{jk}$ compared against the same edge in the corresponding control FCS files $x^{\langle\text{feature}\rangle \text{ref}}_{jk}$ derive edge-based phenodeviant features $x^{\langle\text{feature}\rangle i} = \{x^{\langle\text{feature}\rangle i}_{jk}\}$:
1. (Parent_contrib) calculates a child node’s change as a proportion of its parent’s change from the control FCS files. This is a way to represent a child node’s contribution to its parent node’s phenodeviant change. One scenario where this is important is when a node has a significant change in normalized cell count, but only because one of its child nodes changed in the same direction while all of its other child nodes did not change. In this case, we would want to analyze that single child rather than the node in question. In another scenario, a node’s change is distributed equally among all of its child nodes. In this case, the node itself should draw more attention and be investigated further. This contribution feature is calculated as:
$$y^{\text{contr}}_{jk} = \frac{x^{\text{count}}_k - \rho_k}{x^{\text{count}}_j - \rho_j}$$
2. Useful in similar scenarios, (Parent_effort) describes the effort a node exerts on its parent node’s phenodeviant change. The difference here is that effort is calculated in proportion to the nodes’ normalized cell counts:
$$y^{\text{effort}}_{jk} = x^{\text{lratio}}_k - x^{\text{lratio}}_j$$
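The two edge features can be sketched for a single parent-child edge as follows, with j as the parent and k as the child. The function names, the example counts, and the choice to return NaN when the parent shows no change are assumptions made for illustration; the text does not specify how a zero denominator is handled.

```python
import math


def log_fold(x_count, rho):
    """LogFold of a node: ln of experiment count over control mean."""
    return math.log(x_count / rho)


def parent_contrib(x_child, rho_child, x_parent, rho_parent):
    """Child's change from the controls as a proportion of its parent's change."""
    parent_change = x_parent - rho_parent
    if parent_change == 0:
        # ASSUMPTION: the ratio is treated as undefined when the parent is unchanged.
        return float("nan")
    return (x_child - rho_child) / parent_change


def parent_effort(x_child, rho_child, x_parent, rho_parent):
    """Difference between the child's and the parent's LogFold values."""
    return log_fold(x_child, rho_child) - log_fold(x_parent, rho_parent)


# Hypothetical edge: the parent rises from a control mean of 60 to 100,
# and the child rises from a control mean of 20 to 50.
contrib = parent_contrib(50.0, 20.0, 100.0, 60.0)  # 30 / 40
effort = parent_effort(50.0, 20.0, 100.0, 60.0)    # ln(2.5) - ln(100 / 60)
```

Here the child accounts for three quarters of the parent's absolute change, while the positive effort value indicates that the child changed more than the parent relative to its own control mean.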