
6.4 Methods

6.4.3 MeDeCom element III: parameter selection

The mixture model (6.1) and the fitting algorithm (Algorithm 1) involve two free parameters to be provided by the user. The inner dimension k of the matrix product TA, k ≤ min{m, n} in (6.1), equals the number of DNA methylation prototypes used to model the given data. The regularization parameter λ determines how strongly the entries of $\hat{T}$ are encouraged to take values in {0, 1}. While the choice of k can be guided to some extent by prior knowledge about the composition of the underlying mixture, we developed a cross-validation procedure to select suitable values of both k and λ in a data-driven manner.

Cross-validation

Cross-validation is typically used in supervised settings, so its use for matrix factorization, which is a problem in unsupervised learning, requires additional explanation. A conceptual difference arises from the fact that as opposed to

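Algorithm 1 Alternating minimization algorithm for objective (6.3)

Denote $C_T = \{T \in \mathbb{R}^{m \times k} : 0 \le T_{is} \le 1,\ i = 1, \dots, m,\ s = 1, \dots, k\}$,
$C_A = \{A \in \mathbb{R}^{k \times n} : A_{sj} \ge 0,\ \sum_{s=1}^{k} A_{sj} = 1,\ s = 1, \dots, k,\ j = 1, \dots, n\}$,
$g(T, A) = \|D - TA\|_F^2$, $h(T) = \lambda \sum_{i=1}^{m} \sum_{s=1}^{k} \omega(T_{is})$, and $f(T, A) = g(T, A) + h(T)$.

Initialize $T^0 \in C_T$ and $A^0 \in C_A$; fix numerical tolerance $\epsilon > 0$.
$t \leftarrow 0$, $f^t \leftarrow f(T^t, A^t)$.
repeat
    Update T:
    $t \leftarrow t + 1$, $\bar{T} \leftarrow T^{t-1}$
    repeat
        Linearize $h(T)$ around $T = \bar{T}$ to obtain a function $\tilde{h}(T) = h(T^{t-1}) + \sum_{i=1}^{m} \sum_{s=1}^{k} \omega'(T_{is}^{t-1}) (T_{is} - T_{is}^{t-1})$.
        $\bar{T} \leftarrow \operatorname{argmin}_{T \in C_T}\, g(T, A^{t-1}) + \tilde{h}(T)$    (optT)
    until $(f(\bar{T}, A^{t-1}) - f^{t-1}) / f^{t-1} < \epsilon$
    $T^t \leftarrow \bar{T}$.
    Update A:
    $A^t \leftarrow \operatorname{argmin}_{A \in C_A}\, g(T^t, A)$.    (optA)
    $f^t \leftarrow f(T^t, A^t)$.
until $(f^t - f^{t-1}) / f^{t-1} < \epsilon$.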

the standard supervised setting, where the object to be predicted is a vector (one-dimensional array), one now has to deal with a matrix (two-dimensional array). There are multiple ways of generalizing the principle of sequentially leaving out different portions of the given data when moving from the vector to the matrix case, such as (a) leaving out columns, (b) leaving out rows, (c) leaving out both rows and columns [Owen and Perry, 2009]. We here use (a), mainly because it leads to a straightforward scheme as displayed in Algorithm 2. For each fold, a subset of the samples is left out. The resulting column-reduced data matrix $D_{\mathrm{in}}$ is factorized as if one were given the full matrix. The resulting left factor $\hat{T}_{\mathrm{in}}$ is used to fit the left-out columns in $D_{\mathrm{out}}$ as $D_{\mathrm{out}} \approx \hat{T}_{\mathrm{in}} \hat{A}_{\mathrm{out}}$. The squared error of that approximation, the cross-validation error (CVE), is saved and finally combined with the errors from the other folds.
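The column-wise scheme described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the implementation used here: the function names are hypothetical, and `factorize` is a simple projected-gradient stand-in that ignores the regularizer (λ = 0) rather than running Algorithm 1. The proportions of the left-out samples are fitted under the non-negativity and sum-to-one constraints by projected gradient with a Euclidean projection onto the simplex.

```python
import numpy as np

def project_simplex_columns(A):
    """Euclidean projection of each column of A onto the probability simplex."""
    k, n = A.shape
    U = -np.sort(-A, axis=0)                      # columns sorted in descending order
    css = np.cumsum(U, axis=0) - 1.0
    idx = np.arange(1, k + 1)[:, None]
    cond = U - css / idx > 0
    rho = k - 1 - np.argmax(cond[::-1], axis=0)   # last row index where cond holds
    theta = css[rho, np.arange(n)] / (rho + 1)
    return np.maximum(A - theta, 0.0)

def fit_proportions(D, T, n_iter=500):
    """Minimise ||D - T A||_F^2 over column-stochastic A by projected gradient."""
    k = T.shape[1]
    A = np.full((k, D.shape[1]), 1.0 / k)
    step = 1.0 / (np.linalg.norm(T, 2) ** 2 + 1e-12)   # inverse squared spectral norm of T
    for _ in range(n_iter):
        A = project_simplex_columns(A - step * (T.T @ (T @ A - D)))
    return A

def factorize(D, k, n_outer=30, seed=0):
    """Assumed stand-in for problem (6.3) with lambda = 0; returns the left factor T."""
    rng = np.random.default_rng(seed)
    T = rng.uniform(size=(D.shape[0], k))
    for _ in range(n_outer):
        A = fit_proportions(D, T, n_iter=50)
        step = 1.0 / (np.linalg.norm(A, 2) ** 2 + 1e-12)
        for _ in range(50):
            # gradient step on T followed by projection onto the box [0, 1]
            T = np.clip(T - step * ((T @ A - D) @ A.T), 0.0, 1.0)
    return T

def cross_validation_error(D, k, n_folds, seed=0):
    """Sum over folds of the squared error on the left-out columns."""
    rng = np.random.default_rng(seed)
    n = D.shape[1]
    folds = np.array_split(rng.permutation(n), n_folds)   # fold sizes differ by at most one
    total = 0.0
    for held_out in folds:
        keep = np.ones(n, dtype=bool)
        keep[held_out] = False
        T_in = factorize(D[:, keep], k, seed=seed)         # fit on retained columns only
        A_out = fit_proportions(D[:, held_out], T_in)      # fit left-out columns with T fixed
        total += np.linalg.norm(D[:, held_out] - T_in @ A_out, "fro") ** 2
    return total
```

Calling `cross_validation_error(D, k=3, n_folds=5)` on an m × n matrix `D` with entries in [0, 1] returns one scalar per parameter setting, which can then be compared across a grid of candidate values.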

Selecting k

The choice of k is canonical as long as the composition of the cell populations is known to a good extent, as is the case e.g. for synthetic mixtures. Cell populations sampled from human tissue tend to be considerably more complex. Prior knowledge about the number of cell types present in the samples may not be available, and even if it is, each cell type may not necessarily correspond to a perfectly homogeneous subpopulation. As a result, multiple similar, yet not identical methylation profiles may exist per cell type, reflecting a hierarchy of cell types and subpopulations. Furthermore, (sub)clusters can emerge from individual-specific DNA methylation effects, like allele-specific methylation and imprinting, or phenotypic effects, e.g. the influence of age, gender, or disease status. It is not feasible to capture such fine-grained structure given a small to moderate number of samples, which are in addition contaminated by noise. As a rule, k should be chosen such that the estimation error and the approximation error in

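Algorithm 2 Column-based L-fold cross-validation scheme for validation of model (6.1)

Choose an integer $L \in \{1, \dots, \lfloor n/2 \rfloor\}$.
Let $I = \{1, \dots, n\}$. Randomly partition $I$ into disjoint subsets $I_\ell$ so that $\lfloor n/L \rfloor \le |I_\ell| \le \lceil n/L \rceil$ and $\sum_{\ell=1}^{L} |I_\ell| = n$.
for $\ell \in \{1, \dots, L\}$ do
    Form $D_{\mathrm{in}} = D_{:, I \setminus I_\ell}$, $D_{\mathrm{out}} = D_{:, I_\ell}$.
    Solve the matrix factorization problem (6.3) with $D_{\mathrm{in}}$ in place of $D$ and $\lambda = \lambda_g$. Denote the minimizing $T$ by $\hat{T}_{\mathrm{in}}$.
    Obtain $\hat{A}_{\mathrm{out}}$ as the minimizer of
        $\min_{A} \|D_{\mathrm{out}} - \hat{T}_{\mathrm{in}} A\|_F^2$ subject to $A_{sj} \ge 0\ \forall s, j$, $\sum_{s=1}^{k} A_{sj} = 1\ \forall j$.
    $\mathrm{err}_g^{(\ell)} \leftarrow \|D_{\mathrm{out}} - \hat{T}_{\mathrm{in}} \hat{A}_{\mathrm{out}}\|_F^2$
end for
return $\mathrm{err}_g \leftarrow \sum_{\ell=1}^{L} \mathrm{err}_g^{(\ell)}$.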

model (6.1) are roughly balanced. The former results from noise and is incurred when fitting the model to the data, while the latter is a consequence of model misspecification, which, as discussed above, is inevitable for limited k given the many possible sources of diversity among methylation profiles.

From a more statistical perspective, the issue of choosing k is related to determining the number of components in principal component analysis (PCA). In fact, the matrix factorization model (6.1) can be seen as a method of linear dimension reduction applied to D. A common computational approach to PCA is the singular value decomposition (SVD), which yields a rank-k factorization of D by discarding all singular vectors not corresponding to the top k singular values. A notable advantage of our model (6.1) over the truncated SVD / PCA is its direct interpretability at a biological level, which is achieved by putting suitable constraints on the two factors T and A.
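For comparison, the rank-k truncated SVD mentioned above can be written in a few lines of NumPy. The sketch below uses a random placeholder matrix in place of real methylation data, and the helper name `truncated_svd` is hypothetical. It also checks the standard identity that the squared Frobenius residual equals the sum of the squared discarded singular values, and shows that the SVD factors are unconstrained, so they may contain negative entries, unlike T and A in model (6.1).

```python
import numpy as np

def truncated_svd(D, k):
    """Return rank-k factors (left, right) with D ~ left @ right, plus all singular values."""
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k, :], s

rng = np.random.default_rng(0)
D = rng.uniform(size=(50, 8))        # placeholder data with entries in [0, 1]
k = 3
left, right, s = truncated_svd(D, k)

# Squared residual of the rank-k approximation.
residual = np.linalg.norm(D - left @ right, "fro") ** 2
print("residual matches discarded singular values:",
      np.isclose(residual, np.sum(s[k:] ** 2)))

# The SVD factors are not restricted to [0, 1] or to the simplex.
print("left factor contains negative entries:", bool(left.min() < 0))
```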

For a fixed value of the parameter λ, the data-fitting term of the factorization problem (6.3) decreases as k increases. The approximation error of the factorization model decreases since with more columns in T one has a better chance of capturing differences between the cluster methylomes. At the same time, the estimation error increases as the additional degrees of freedom favour over-adaptation to noise. A suitable choice of k balances both effects. The use of cross-validation is intended to achieve this balance by tracing the cross-validation error over a grid of values for k and selecting the one corresponding to the minimum. The final choice of k was made by combining visual inspection of the cross-validation results and available prior information about the most likely number of underlying methylation signatures.

Selecting λ

As illustrated by the example in Figure 6.1B, the regularization parameter λ, which balances the trade-off between the data fidelity term and the data-independent regularization term, has a crucial influence on the solution of the factorization problem (6.1) delivered by Algorithm

biological level, we rely on the cross-validation error, as for the parameter k. Determining the value of λ that achieves the minimum cross-validation error is more difficult because λ takes values in a continuous domain, namely the non-negative real line. We perform a two-stage grid search, starting with a coarse grid and then concentrating on a smaller range covered by a finer grid. Details of the procedure are outlined in Algorithm 3. At the beginning of each of the two rounds of grid search, Algorithm 3 is run for each grid point of λ using multiple (≈50) random initializations. As the solutions corresponding to nearby grid points can be expected to be similar, we complement random initializations with a smoothing scheme in which the solutions of the five preceding and the five subsequent grid points are used for initialization.
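The coarse-then-fine search over λ can be sketched as follows. This is only an illustration of the two-stage idea, not Algorithm 3 itself: the function name, the grid values, the linear spacing of the fine grid, and the toy error curve are all assumptions, and the sketch leaves out both the random restarts and the neighbour-based initialization described above. The argument `cve` stands for any function that maps a value of λ to its cross-validation error.

```python
import numpy as np

def two_stage_grid_search(cve, coarse_grid, n_fine=11):
    """Pick the best coarse grid point, then refine between its two neighbours."""
    coarse_grid = np.sort(np.asarray(coarse_grid, dtype=float))
    coarse_err = np.array([cve(lam) for lam in coarse_grid])
    best = int(np.argmin(coarse_err))

    # The fine grid spans the interval between the neighbours of the coarse minimiser.
    lo = coarse_grid[max(best - 1, 0)]
    hi = coarse_grid[min(best + 1, len(coarse_grid) - 1)]
    fine_grid = np.linspace(lo, hi, n_fine)
    fine_err = np.array([cve(lam) for lam in fine_grid])
    return float(fine_grid[int(np.argmin(fine_err))])

# Toy error curve with its minimum at lambda = 0.03 (purely illustrative).
def toy_cve(lam):
    return (lam - 0.03) ** 2

coarse = [0.0, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]   # assumed log-spaced coarse grid
lam_best = two_stage_grid_search(toy_cve, coarse)
print(lam_best)
```

In practice each call to `cve` would run the full cross-validation for one value of λ, so the number of grid points in each stage directly determines the computational cost.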