Methodology and implementation of cluster comparison on RNA-Seq data

RNA-Seq data

Here, the methodology used for cluster comparison (using Equation 3.1) on the three RNA-Seq datasets is described.

The entire methodology is repeated for each dataset. For each, it is performed for each number of clusters k, tested systematically from 4 to 10. First, for each k, a clustering is created using all the samples and this is considered the reference clustering of size k. In other words, there is a different reference sample for each number of clusters. After that, the clustering method is applied to different random subsets of the samples. The random subsets are varied from 3 samples to the total number of samples, m (see Figure 3.1). Then the following is done N times (in our case, N = 100): take a random subset of the samples of the appropriate size, cluster with K-means algorithm, then calculate the distance (with a CCD) to the reference sample of size k (using Equation 3.1). It is then possible to use different subsets of the samples to generate clusterings, and see how they can differ in terms of the number of genes. Clustering is completed using the (heuristic) K-means clustering implementation with the VLFeat (MATLAB) program [80]. Here two distances (with a CD) of K-means are used. In version 1 (V1), the Euclidean distance function is applied to calculate the distance between cluster objects, and for version 2 (V2), Manhattan distance function is used.

Using Equation 3.1, the distance between the reference clustering of size k (based on all the samples) and each random subset of the samples is calculated. All distance measurement values are stored in a matrix,

called SampleDisM atrix [Table 3.2], which stores the distances between the different random subsets of the samples and the reference.

Table 3.2: SampleDisMatrix measures the distance between clusterings and the reference. The rows indicate the different number of samples (RSS) that were randomly chosen. The columns represent each iteration with that number of samples. The entries contain distance values to the reference.

TS1 TS2 TS3 ... TSN RSS3 D31 D32 D33 ... D3N RSS4 D41 D42 D43 ... D4N RSS5 D51 D52 D53 ... D5N ... ... ... ... ... ... RSSm Dm1 Dm2 Dm3 ... DmN

Let Y be a subset of the samples. Let expY be a function from X that maps each gene onto its expression

in sample set Y , where expY(g) maps each gene to an |Y |-tuple expression vector. Further, let expALL be

the function based on all samples. Pseudocode of the method is given in Algorithm 1.

Algorithm 1: Cluster comparison

1 calculate GREF . reference clustering

2 for each number i up to m do

3 for each iteration itr up to N do

4 Y ← random subset of samples of size i

5 calculate GY . clustering using Y

6 calculate D (GY, GREF) . using Equation 3.1

7 store in sampleDistance table in row i and column itr

8 end 9 end

10 Output SampleDisMatrix

To calculate the CCD D (GY, GREF), three techniques are possible and all three were implemented. First,

the best permutation θ can be calculated with a brute force search technique [63]. In computer science, brute force is a general search technique that iterates through each possible combination of solutions to find the optimal one. Brute force is usually a time intensive computational technique. However, it guarantees to find the best match between a reference clustering and a random subset of samples by trying every possible permutation. Second, to speed up the computation time, branch-and-bound can also be applied. Branch-and- bound considers a tree of all possible combinations of candidate solutions [18, 45] and skips entire subtrees

if every resulting permutation is a guaranteed to give a worse score than the best score found so far. For example, say there are two clusterings: G1and G2, and each has 4 clusters: P1, P2, P3, P4and Q1, Q2, Q3, Q4.

For observing the mapping between one clustering (G1) to another (G2), using branch-and-bound can possibly

reduce the number of permutations that need to be considered. If any combination of the clustering set G1

starting with P1compared to Q1fails to reach optimal criteria (because their distance is already larger than

the optimal found so far), then branch-and-bound skips those (all that map P1to Q1) combinations to speed

up the computation time (python implementation is given in the Appendix A).

Additionally, a graph bipartite matching technique can be used to speed up the computation time in a similar fashion as mentioned by Meil˘a [56] for CE. In graph theory, a bipartite graph is an undirected graph with a set of vertices, V , which can be divided into two independent and disjoint sets, V1 and V2 [9]. In a

bipartite graph, there are no edges from a vertex in V1to one in V1, and no edges from one in V2to one in V2.

For this application, a weighted bipartite graph is used where each cluster from both clusterings is a vertex, the edges are placed on the connections between clusters from one clustering only to each cluster of the other clustering, and the weights are distances between the clusters i.e. D(Pi, Qj). The goal is to find a perfect

matching (the permutation θ) between the two clusterings with the minimum sum of the distances. On the graphs, the goal is to find a matching between vertices of the first set with the second that minimizes the sum of the distances. The maximal bipartite matching finds a maximal such matching in a graph, and this can be used to find a minimal matching by negating the weights. A python package — Munkres (version 1.0.12) [3] was applied which implements the Hungarian algorithm [42]. The purpose of the Hungarian algorithm is to solve the assignment problem (determining the permutation that minimizes distance) in polynomial time. The Munkres module takes as input the k by k matrix of the distances between clusters in G1with G2, and

In document Sample Size Evaluation and Comparison of K-Means Clusterings of RNA-Seq Gene Expression Data (Page 33-37)