Model selection using simulated annealing

We then seek the mixture-component parameters that maximize the likelihood, using a suitable search algorithm over assignments of data points to clusters. If extra parameters are not penalized, maximum likelihood (ML) can always increase the likelihood by adding parameters. Because the number of parameters grows linearly with the number of clusters, this would end with each data point assigned to its own cluster. Penalizing additional parameters discourages this over-estimation of the number of parameters. We therefore optimize the models using information criteria (IC), which penalize extra parameters.

We compared several model selection criteria for choosing the best approximating model, since no single criterion is best for every underlying data structure. Moreover, as the number of data points, 𝑁_𝑔, increases, different information criteria imply different trade-offs between goodness of fit and model complexity. With reference to Equation 5, the number of parameters in our method is 𝑘 = 𝑁_𝑚 (1 + 𝑛_𝑟 + 2𝑁_𝑒) − 1, which accounts for the 𝑁_𝑚 − 1 independent mixing coefficients together with the Bernoulli parameters and the Normal distribution parameters of each model.
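As an illustration of how the parameter count enters a penalized score, the following minimal sketch computes 𝑘 from the formula above and plugs it into the textbook forms of AIC and BIC. Equation 5 is not reproduced in this section, so using these two criteria is an assumption, as are the function names and the reading of 2𝑁_𝑒 as one mean and one variance per continuous variable.

```python
import math


def n_parameters(n_m: int, n_r: int, n_e: int) -> int:
    """k = N_m (1 + n_r + 2 N_e) - 1, the count given in the text."""
    return n_m * (1 + n_r + 2 * n_e) - 1


def aic(log_likelihood: float, k: int) -> float:
    """Textbook AIC (assumed form): 2k - 2 ln L. Lower is better."""
    return 2.0 * k - 2.0 * log_likelihood


def bic(log_likelihood: float, k: int, n_g: int) -> float:
    """Textbook BIC (assumed form): k ln N_g - 2 ln L. Lower is better."""
    return k * math.log(n_g) - 2.0 * log_likelihood


if __name__ == "__main__":
    # Hypothetical sizes: 3 clusters, 2 binary inputs, 4 continuous inputs.
    k = n_parameters(n_m=3, n_r=2, n_e=4)
    print(k, aic(-100.0, k), bic(-100.0, k, n_g=100))
```

Because the BIC penalty scales with ln 𝑁_𝑔 while the AIC penalty does not, the two criteria weigh complexity differently as the number of data points grows, which is the trade-off described above.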

A simulated annealing (SA) algorithm with Monte Carlo moves is used to cluster simulated data by optimizing the objective function, and different objective functions are tested on data of different levels of complexity (see Table 2.1 in Section 2.1.3). The pseudo-code of the SA algorithm is as follows:

Algorithm 1: Monte Carlo simulated annealing for cluster optimization

 1: Normalize expression (genes)*
 2: Clusters = Initialize clusters (agglomerative/divisive)
 3: Best clusters = [list]
 4: Old clusters = [list]
 5: Score = 0.0
 6: Old score = 0.0
 7: Best score = 0.0
 8: Difference = 0.0
 9: Count temperature = 0
10: Temperature = Start temperature
11: While Count temperature < Maximum temperature cycle:
12:     Temperature = Temperature * Temperature decreasing factor
13:     Score = Calculate clusters score (Clusters)
14:     beta = 1.0/Temperature
15:     Old clusters = Clusters
16:     While Iteration < Maximum iterations:
17:         If random > Merge and split probability:
18:             Change (Clusters)
19:         Else:
20:             If random > 0.5:
21:                 Merge (Clusters)
22:             Else: Split (Clusters)
23:         Old score = Score
24:         Score = Calculate clusters score (Clusters)
25:         Difference = (Score - Old score)
26:         If Difference < 0.0 or random < exponent(-beta*Difference):
27:             ‘accept’
28:             If Score < Best score:
29:                 Best score = Score
30:                 Best clusters = Clusters
31:         Else:
32:             ‘reject’
33:             Clusters = Old clusters
34:             Score = Old score
35:     End while
36:     Count temperature = Count temperature + 1
37:     If Best score = Old score:
38:         Count score = Count score + 1
39:         If Count score == MaxRepIters:
40:             break
41: End while
42: Return Best score, Best clusters
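The pseudo-code above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: the names `accept`, `anneal`, `score_fn` and `propose_fn` are hypothetical, the score is assumed to be minimized, a rejected move is assumed to restore the state held just before that move, the best score is seeded with the starting score rather than 0.0, and the repetition counter is assumed to reset whenever the best score improves.

```python
import math
import random


def accept(difference, beta, rng):
    """Acceptance rule of line 26: always accept an improvement,
    accept a worse score with probability exp(-beta * difference)."""
    return difference < 0.0 or rng.random() < math.exp(-beta * difference)


def anneal(clusters, score_fn, propose_fn, start_temperature=1.0,
           cooling=0.9, max_cycles=50, max_iterations=100,
           max_rep_iters=5, seed=0):
    """Anneal a list of cluster numbers; returns (best score, best clusters).

    score_fn(clusters) -> float is the objective (lower is better).
    propose_fn(clusters, rng) -> new list is one Monte Carlo move.
    Both are supplied by the caller (hypothetical interfaces).
    """
    rng = random.Random(seed)
    clusters = list(clusters)
    score = score_fn(clusters)
    best_score, best_clusters = score, list(clusters)
    temperature = start_temperature
    count_score = 0
    for _ in range(max_cycles):                  # outer loop, line 11
        temperature *= cooling                   # line 12
        beta = 1.0 / temperature                 # line 14
        best_before = best_score
        for _ in range(max_iterations):          # inner loop, line 16
            candidate = propose_fn(clusters, rng)
            candidate_score = score_fn(candidate)
            if accept(candidate_score - score, beta, rng):
                clusters, score = candidate, candidate_score
                if score < best_score:           # lines 28 to 30
                    best_score, best_clusters = score, list(clusters)
            # On rejection the current clusters and score are simply kept.
        # Early stop when the best score has stopped improving.
        count_score = count_score + 1 if best_score == best_before else 0
        if count_score >= max_rep_iters:
            break
    return best_score, best_clusters
```

Passing the objective and the move generator as arguments keeps the loop independent of any particular information criterion or move set, so different objective functions can be swapped in without touching the annealing schedule.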

The first step of the algorithm is to read the binary and continuous inputs and store local copies of them. If normalization of the continuous inputs is selected, they are normalized to zero mean and unit standard deviation, also known as the z-score (see line 1 in Algorithm 1). This is followed by the initialization, or starting point, of the clustering. We represent a clustering solution as an integer array of length 𝑁_𝑔, where each element of the array holds the ‘cluster number’ to which the corresponding gene is assigned. Cluster numbers are thus integers in the range 1 to 𝑁_𝑔. The initialization schedule is either agglomerative or divisive: under the agglomerative schedule each data point is given a different cluster number, whereas under the divisive schedule all data points are given the same cluster number. Below is a representation of the cluster initialization for 10 genes:

Divisive: 1 1 1 1 1 1 1 1 1 1

Agglomerative: 1 2 3 4 5 6 7 8 9 10

The initial clustering solution (see line 2 in Algorithm 1) is stored in an array as shown above (here for 10 data points). Next, we initialize the arrays that store the solutions (see lines 3 and 4) and set the scores and the score difference to 0.0 (lines 5 to 8). We then initialize the temperature of the simulated system, as in lines 9 and 10.
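The normalization and initialization steps described above can be sketched as follows. The function names are hypothetical, the use of the population standard deviation is an assumption, and the choice of label 1 for the divisive start and labels 1 to 𝑁_𝑔 in order for the agglomerative start is a convention assumed here rather than taken from the text.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Scale one vector of continuous inputs to zero mean and unit
    standard deviation (population standard deviation assumed)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0.0:
        # A constant vector carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def initialize_clusters(n_g, schedule):
    """Return the starting array of cluster numbers for n_g data points."""
    if schedule == "agglomerative":
        # Every data point starts in its own cluster.
        return list(range(1, n_g + 1))
    if schedule == "divisive":
        # All data points start in a single cluster.
        return [1] * n_g
    raise ValueError("schedule must be 'agglomerative' or 'divisive'")


if __name__ == "__main__":
    print(initialize_clusters(10, "divisive"))
    print(initialize_clusters(10, "agglomerative"))
```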

The main loop of the simulated annealing starts at line 11. While the temperature count is less than the maximum number of temperatures to be simulated (Maximum temperature cycle), the program performs Monte Carlo moves (lines 17 to 22) and evaluates the acceptance probability (line 26) as in Equation 6. Three Monte Carlo moves are available, namely change, merge and split, and we chose an arbitrary probability threshold to select among them. It works as follows. If a uniformly distributed random number, 𝑟, is greater than the specified merge-and-split probability (MergeSplitProbability is 0.25, as in Table 2.2), then a random data point in a cluster is moved to another cluster. Otherwise, a merge or a split is performed: if a random number is greater than 0.5, two random clusters are merged into one cluster; if not, a random cluster is split into two new clusters. In every iteration of Monte Carlo moves (line 16), a score is calculated and the move is accepted or rejected (line 26); if an accepted score is lower than the best score so far, it becomes the best score and the corresponding clusters become the best clusters (lines 28 to 30). The steps in the temperature loop are repeated until the maximum number of temperature cycles has been reached or the same best score has been repeated MaxRepIters times (line 40).
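The three moves and the rule for choosing among them can be sketched as below. All function names are hypothetical, and several details are assumptions made for the illustration: the change move sends a gene to another cluster number that is already in use, the split move divides the chosen cluster at a random cut point, a move that cannot be applied returns the clustering unchanged, and a fresh random number is drawn for the merge-or-split decision.

```python
import random


def change(clusters, rng):
    """Move one random gene to a different cluster number already in use."""
    new = list(clusters)
    i = rng.randrange(len(new))
    others = sorted(set(new) - {new[i]})
    if others:
        new[i] = rng.choice(others)
    return new


def merge(clusters, rng):
    """Merge two random clusters into one."""
    labels = sorted(set(clusters))
    if len(labels) < 2:
        return list(clusters)
    keep, drop = rng.sample(labels, 2)
    return [keep if c == drop else c for c in clusters]


def split(clusters, rng):
    """Split one random cluster into two at a random cut point."""
    used = set(clusters)
    target = rng.choice(sorted(used))
    members = [i for i, c in enumerate(clusters) if c == target]
    if len(members) < 2:
        return list(clusters)
    # A cluster with two or more members implies an unused number exists.
    new_label = next(c for c in range(1, len(clusters) + 1) if c not in used)
    rng.shuffle(members)
    cut = rng.randrange(1, len(members))
    new = list(clusters)
    for i in members[:cut]:
        new[i] = new_label
    return new


def propose(clusters, rng, merge_split_probability=0.25):
    """Choose a move as in lines 17 to 22 of the pseudo-code."""
    if rng.random() > merge_split_probability:
        return change(clusters, rng)
    if rng.random() > 0.5:
        return merge(clusters, rng)
    return split(clusters, rng)
```

With the threshold at 0.25, roughly three quarters of the proposals are single-gene changes and the remainder are divided evenly between merges and splits.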

For each subsequent solution following Monte Carlo randomization moves and simulated annealing, two genes are in the same cluster if and only if they have the same cluster number. There is no requirement that all cluster numbers be used in a solution; for example, the following instance of a clustering solution for 10 genes

Solution: 1 1 1 3 4 4 4 3 4 1

represents a cluster solution with 3 clusters (labelled 1,3,4) where cluster 1 is genes 1,2,3 and 10, cluster 3 is genes 4 and 8, and cluster 4 is genes 5,6,7 and 9. Clearly this solution could also be represented using cluster numbers 1, 2, 3 as

Solution: 1 1 1 2 3 3 3 2 3 1

but for algorithmic reasons it is easier to allow both representations to be used in the search procedure, only renumbering on completion.
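The renumbering applied on completion can be sketched as a single pass that maps each cluster number to the next unused consecutive integer. The function name `renumber` and the choice to number clusters in order of first appearance are assumptions made for this illustration.

```python
def renumber(clusters):
    """Relabel cluster numbers as 1, 2, 3, ... in order of first appearance,
    leaving the grouping of genes unchanged."""
    mapping = {}
    relabelled = []
    for c in clusters:
        if c not in mapping:
            mapping[c] = len(mapping) + 1
        relabelled.append(mapping[c])
    return relabelled


if __name__ == "__main__":
    # The 10-gene example from the text, using cluster numbers 1, 3 and 4.
    print(renumber([1, 1, 1, 3, 4, 4, 4, 3, 4, 1]))
    # prints [1, 1, 1, 2, 3, 3, 3, 2, 3, 1]
```

Since two genes share a cluster exactly when they share a cluster number, the relabelled array describes the same partition as the original one.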

The resulting cluster configuration and its parameters are then passed to the EM algorithm for further local optimization.
