1.3. Unsupervised learning to discover hidden expression states
1.4.2. Unsupervised learning models require interpretation of compressed features
Compression algorithms applied to transcriptome data output features with different combinations of gene weights, or importance scores, that can be interpreted to represent biological processes. There are many mechanisms by which ranked gene lists can be interpreted, including overrepresentation pathway analysis and gene set enrichment analysis (102). However, interpreting compressed features in gene expression space raises many open questions. When trained on the same data set, the
distribution of feature importance scores differs in skewness and kurtosis across algorithms (Figure 1.4A). Therefore, it is not clear that interpreting compressed features is equivalent across algorithms. Furthermore, with the exception of NMF, which learns only non-negative values, the algorithms learn both positive and negative signatures. It is not apparent if these values represent one general feature, two independent features, or something else. It is also not clear if the compressed features are learning single
sources of variation, entangled sources of variation, or noise associated with technical artifacts. Thus far, researchers have attempted to interpret compressed features from a variety of algorithms in several ways (Figure 1.4B). For example, one can set a cutoff on gene importance scores at two or three standard deviations above or below the mean (87, 103). Another strategy consists of sequentially removing the top weighted genes from the positive and negative tails and performing a Lilliefors test of normality until the compressed feature resembles a normal distribution (77, 104). The removed genes form a ranked list of feature-specific genes. Another strategy is to use counterfactual analysis to observe which genes are strongly associated with covariates and to weight their importance to the biological source (105). In Chapter 7, we introduce a network projection approach that considers the full distribution of compressed gene expression features. We build gene set networks from publicly available gene set compendia and determine enrichment of gene sets compared to permuted networks.
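The standard deviation cutoff described above can be sketched as follows. This is a minimal illustration, not the procedure used in the cited studies: the function name `genes_beyond_cutoff`, the simulated weights, and the gene labels are all hypothetical, and the threshold is a free parameter.

```python
import numpy as np

def genes_beyond_cutoff(weights, gene_names, n_std=2.0):
    """Return genes in the positive and negative tails of one compressed feature.

    A gene is selected when its weight lies more than `n_std` standard
    deviations above or below the mean weight of the feature.
    """
    weights = np.asarray(weights, dtype=float)
    gene_names = np.asarray(gene_names)
    z = (weights - weights.mean()) / weights.std()

    pos_idx = np.where(z > n_std)[0]
    neg_idx = np.where(z < -n_std)[0]

    # Rank each tail from most to least extreme weight.
    pos_idx = pos_idx[np.argsort(-z[pos_idx])]
    neg_idx = neg_idx[np.argsort(z[neg_idx])]
    return gene_names[pos_idx].tolist(), gene_names[neg_idx].tolist()

# Simulated weights for a single feature (hypothetical data): mostly noise,
# with a few genes planted in each tail.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 1.0, size=1000)
weights[:3] = [8.0, 7.0, 6.0]
weights[3:5] = [-9.0, -6.5]
genes = [f"gene_{i}" for i in range(1000)]

positive, negative = genes_beyond_cutoff(weights, genes, n_std=3.0)
print(positive[:3], negative[:2])
```

Keeping the two tails separate reflects the open question raised above: the positive and negative signatures of a feature may or may not represent the same biology, so each list can be passed to a pathway analysis on its own.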
Another important question concerns how many compressed features exist. In other words, how many sources of variation in a population contain important biology and can be compressed? Researchers using a gene expression compendium of over 5,000 human tissues determined that only the first three principal components of a PCA contained biologically relevant information (106). However, a follow-up study using the same data extracted additional biologically meaningful features and reported that the low
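[Figure 1.4 image. Panel A: weight distributions of one encoded feature for PCA, ICA, NMF, DAE, and VAE, shown alongside a schematic labeled Samples, Genes, Original, Reconstructed, and Latent Space. Panel B: compressed feature interpretation by Standard Deviation, Progressive Normal, and Full Distribution.]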
Figure 1.4: Interpretation of compressed gene expression features
(A) A single randomly selected encoded feature from each of five compression algorithms illustrates the heterogeneity of the feature importance distributions. The input data are from The Cancer Genome Atlas PanCanAtlas gene expression data from 33 different tissue types spanning over 10,000 patients. (B) Defining genes that contribute to compressed features. These genes can be extracted in different ways. After the feature-associated genes are defined, the compressed features can be interpreted with several pathway- and network-based approaches. Abbreviations: DAE, denoising autoencoder; ICA, independent components analysis; NMF, non-negative matrix factorization; PCA, principal components analysis; VAE, variational autoencoder.
number of relevant compressed features was a sampling bias effect (107). Furthermore, an application of ICA to over 9,000 microarray samples revealed 423 components significantly associated with GO terms (51). A more recent analysis applying ICA to over 97,000 microarray samples revealed a total of 139 reproducible transcriptome modules (108). An issue common to many compression algorithms is the requirement to set an internal dimensionality. Mao et al. included extra capacity in the bottleneck layer to pool
technical artifacts in regions without prior biological knowledge constraints (83). In fact, it has been posited that gene expression consists of a series of compressed composite measurements (109). Nevertheless, it is clear that compression algorithms extract sources of variation in the underlying biology that are dependent on the strength of the signal, the number of samples that contain the biology, the assumptions of the model (e.g., linear versus nonlinear), and the predefined internal dimensionality. In Chapter 7, we investigate the dimensionality of gene expression data by serially compressing input matrices with an increasing bottleneck layer. More specifically, we compress the data into 2 dimensions, 3 dimensions, 4 dimensions, and so on up to 200. We project gene set networks onto the compressed features to quickly determine enriched gene sets captured in these features and determine how the bottleneck layer contributes to their identification.
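The idea of serially compressing a matrix at increasing dimensionality can be illustrated with a small sketch. It makes several assumptions: PCA (computed through a singular value decomposition) stands in for the full set of algorithms, reconstruction error stands in for the gene set enrichment readout, the data are simulated with five planted sources of variation, and the sweep stops at 10 dimensions rather than 200. The function name `reconstruction_error_by_dimension` is hypothetical.

```python
import numpy as np

def reconstruction_error_by_dimension(X, dims):
    """Compress X (samples x genes) to each k in `dims` and report the error.

    PCA is used as the compression model: the data are centered, projected
    onto the top k principal components, and reconstructed. The returned
    dictionary maps k to the mean squared reconstruction error.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    errors = {}
    for k in dims:
        Z = U[:, :k] * s[:k]        # compressed representation, samples x k
        X_hat = Z @ Vt[:k]          # reconstruction in gene space
        errors[k] = float(np.mean((Xc - X_hat) ** 2))
    return errors

# Simulated expression matrix (hypothetical): five latent sources plus noise.
rng = np.random.default_rng(0)
n_samples, n_genes, true_k = 300, 500, 5
latent = rng.normal(size=(n_samples, true_k))
loadings = rng.normal(size=(true_k, n_genes))
X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_genes))

errors = reconstruction_error_by_dimension(X, range(2, 11))
for k, err in errors.items():
    print(k, round(err, 4))
```

In this toy setting the error drops sharply until the bottleneck matches the number of planted sources and then flattens. Real expression data rarely show such a clean elbow, which is one reason the features themselves, and not only the reconstruction error, need to be examined at each dimensionality.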
Lastly, the stability of unsupervised learning solutions is of utmost importance. Because many unsupervised models are trained through an iterative process, the solutions identified will differ depending on internal conditions such as the random initialization. Therefore, it is important to recognize stable patterns identified across various initializations. To this end, a method called stability NMF evaluates solutions from multiple starting points and considers basis vectors, or principal patterns, stable if they are consistently identified and correlated across runs (110). Ensemble models have been used to aggregate solutions into a single model (87). Other methods have also been proposed to assess the stability of solutions, including adding dropout to NN models at test time (111). Nevertheless, interpreting machine learning models, investigating model stability, and associating compressed features with real biology are of paramount importance.
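A simple way to probe stability across initializations is to refit a model from several random starting points and ask whether each learned component reappears in the other runs. The sketch below is an illustration under stated assumptions and is not the stability NMF method of reference 110: it uses a basic NMF with standard multiplicative updates, simulated block-structured data, and a hypothetical helper, `component_stability`, that scores a component by its best Pearson correlation with any component from another run.

```python
import numpy as np

def nmf(X, k, seed, n_iter=300, eps=1e-9):
    """Factor a non-negative matrix X (samples x genes) as W @ H.

    Uses multiplicative updates from a random, seed-dependent start.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def component_stability(H_ref, H_other):
    """Best correlation of each reference component with any other-run component.

    Taking the maximum over components makes the score insensitive to the
    arbitrary ordering of components in each run.
    """
    k = H_ref.shape[0]
    corr = np.corrcoef(H_ref, H_other)[:k, k:]
    return corr.max(axis=1)

# Simulated data (hypothetical): three gene modules, each active in its own
# group of samples, plus a small positive background.
rng = np.random.default_rng(0)
k, n_samples, n_genes = 3, 60, 90
H_true = np.zeros((k, n_genes))
W_true = np.zeros((n_samples, k))
for j in range(k):
    H_true[j, 30 * j:30 * (j + 1)] = rng.uniform(0.5, 1.5, size=30)
    W_true[20 * j:20 * (j + 1), j] = rng.uniform(0.5, 1.5, size=20)
X = W_true @ H_true + 0.01 * rng.random((n_samples, n_genes))

# Fit from five initializations and compare every run to the first one.
runs = [nmf(X, k, seed)[1] for seed in range(5)]
per_run = np.array([component_stability(runs[0], H) for H in runs[1:]])
stability = per_run.min(axis=0)   # worst-case agreement per component
print(stability)
```

Components whose worst-case correlation stays close to 1 are recovered regardless of the starting point, whereas low values flag components that depend on the initialization. The choice of the first run as the reference is arbitrary; comparing all pairs of runs would be a more thorough variant.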