2. Machine learning algorithms and descriptions
2.3. A Proof of Concept example of the effectiveness of a Support Vector Ma-
genetic structure
Before the Support Vector Machine (SVM) algorithm was applied to the schizophrenia datasets, an example experiment was carried out to confirm that the algorithm was capable of handling genotyped information from multiple SNPs. This initial experiment is referred to as a Proof of Concept, because the expected performance on it was 100% accuracy. The dataset used in this experiment was taken from the International HapMap project, which was introduced in chapter 1 (Gibbs et al., 2003). The original raw dataset contained information on 76,035 SNPs from 483 individuals. The pruning method in PLINK was used to retain only those SNPs that were not in LD with each other. This was done using the following parameters:
• Window size 500 Kilo-Bases (KBs)
• Movement of window 250 KBs (resulting in 50% overlap)
• r² threshold 0.2
This procedure left a total of 66,396 SNPs. The data then had to be prepared in the form required as input to the SVM, namely the minor-allele count for each SNP. For this study, the minor-allele was identified solely from the distribution of the alleles in the dataset provided to PLINK: the less frequent allele was denoted the minor-allele, with no additional information used. This is perfectly viable in this study, but for the additional studies carried out in the experimental chapters that make use of GWAS data, it is important that the identified minor-allele is the same as that reported in the GWAS. The procedure to do this is described in the respective methods sections for all of the experiments. The recoding was performed using the --recodeA command in PLINK, which produces a file containing a large matrix of numbers in the set {0, 1, 2}, representing the possible number of minor-alleles at each SNP. This data could then be used in the SVMs.
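The recoding rule described above can be illustrated with a minimal sketch. It is not PLINK's implementation: the function name `minor_allele_counts` and the two-character genotype strings are hypothetical, and missing genotypes, monomorphic SNPs and ties in allele frequency are deliberately not handled.

```python
from collections import Counter

def minor_allele_counts(genotypes):
    """Recode one SNP's genotypes as minor-allele counts in {0, 1, 2}.

    `genotypes` is a list of two-character strings such as "AG", one per
    individual (a hypothetical input format, not PLINK's own file layout).
    """
    # Count every allele observed at this SNP across all individuals.
    allele_freq = Counter(allele for g in genotypes for allele in g)
    # The minor-allele is taken to be the less frequent one in this dataset.
    minor = min(allele_freq, key=allele_freq.get)
    # Each individual carries 0, 1 or 2 copies of the minor-allele.
    return [g.count(minor) for g in genotypes]

snp = ["AA", "AG", "GG", "AA", "AG", "AA"]
counts = minor_allele_counts(snp)
print(counts)
```

In this toy SNP the allele G is the less frequent one, so each entry of `counts` is the number of G copies carried by that individual.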
The target, or outcome, variable in this case was not a case/control status for any disorder, but the ethnic population group to which each individual belongs. One of the achievements of the HapMap project was to identify genomic differences that occur between different populations, so the samples were carefully selected to avoid any possible effects of admixture between populations. This task was therefore a multi-class problem for the SVMs to solve, and not a binary one. As such, the binary AUC metric could not be applied directly, but as can be seen in the results, this was not necessary. Of the ethnic populations included in the HapMap study, the five listed in table 2.1 were used here.
Table 2.1.: Table showing the different populations used in the positive example study, together with their respective HapMap codes, and the number of samples used.
Population                                                    HapMap Code    Number of Samples
Han Chinese in Beijing, China                                 CHB            84
Japanese in Tokyo, Japan                                      JPT            86
Utah residents with Northern and Western European Ancestry    CEU            165
Yoruba in Ibadan, Nigeria                                     YRI            113
Gujarati Indians in Houston, Texas                            GIH            88
For this study, the information from the first two populations was combined to make one group of 170 samples, with the code ASI: Asian. This was done because these populations were very similar to each other, as can be seen in figure 2.23. The main task here was to provide the SVM algorithm with a problem that is known to be easy to solve, to see how it performs. If these two populations had been kept separate, then it might not have been possible to attribute any lack of performance to an insufficiency in the algorithm for this type of data, rather than to the similarity of the two groups.
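The merging of the two labels can be sketched as a simple relabelling step. The sample counts are those listed in table 2.1; the variable names and the dictionary-based mapping are illustrative assumptions rather than the procedure actually used.

```python
from collections import Counter

# Sample counts per HapMap code, as listed in table 2.1.
samples_per_code = {"CHB": 84, "JPT": 86, "CEU": 165, "YRI": 113, "GIH": 88}

# Hypothetical mapping: CHB and JPT both receive the merged label ASI.
merge_map = {"CHB": "ASI", "JPT": "ASI"}

# One label per individual, then relabelled; unmapped codes are kept as-is.
labels = [code for code, n in samples_per_code.items() for _ in range(n)]
merged = [merge_map.get(code, code) for code in labels]

class_sizes = Counter(merged)
print(class_sizes)
```

This leaves four classes of unequal size, which is the setting in which the classifier described later was trained without any class-weight adjustment.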
In order to gain a meaningful graphical representation of the distribution of the genetic variations in this dataset, a Principal Component Analysis (PCA) was carried out to identify the two top components that explain most of the variation seen in the data. A plot of how the samples map onto these two components can be seen in figure 2.23.
Figure 2.23.: Scatter plot showing the distribution of the different populations across the two top principal components (axes: PC 1 and PC 2; legend: Population ASI, CEU, GIH, YRI)
This plot shows that the four different populations do cluster into separate regions, even when the information from all 66,396 SNPs has been combined into the top two principal components. It also shows the need to combine the two Asian populations, as they overlap completely in two dimensions. It is possible that taking into account information from higher order principal components would separate these two, but this exercise was not relevant to the task at hand.
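The projection onto the two top principal components can be sketched as follows. This is a minimal illustration using a singular value decomposition in NumPy on synthetic genotype counts; the helper name `top_two_components`, the synthetic data and the choice to centre (but not rescale) each SNP are assumptions, since the exact PCA settings are not stated here.

```python
import numpy as np

def top_two_components(X):
    """Project the rows of X onto the two leading principal components."""
    Xc = X - X.mean(axis=0)  # centre each SNP column
    # Thin SVD: the rows of Vt are the principal axes, ordered by variance.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T   # coordinates on PC 1 and PC 2
    explained = (S[:2] ** 2) / np.sum(S ** 2)  # variance ratio of each PC
    return scores, explained

rng = np.random.default_rng(0)
# Synthetic stand-in for the genotype matrix: 60 individuals, 200 SNPs,
# three groups with different minor-allele frequencies (illustrative only).
freqs = rng.uniform(0.05, 0.5, size=(3, 200))
groups = np.repeat(np.arange(3), 20)
X = rng.binomial(2, freqs[groups]).astype(float)

scores, explained = top_two_components(X)
print(scores.shape, explained)
```

Plotting the two columns of `scores` against each other, with one colour per group, gives a scatter plot of the same kind as figure 2.23.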
The next stage was to fit the SVM to this problem, to see if it could identify differences between the populations. In this task, a linear SVM was built, and provided with all of the 66,396 SNP features. The only parameter for a linear kernel is C, and this was kept at the scikit-learn default value of 1. No adjustments for the different class sizes were made. The inputs were scaled as part of a pipelined procedure, so that at each split the means and standard deviations estimated from the training points were used to transform the data in the validation set.
For the CV procedure, a stratified shuffle split was made, with splits of 75/25% train/test proportions, for a total of ten iterations. As was mentioned earlier, because this is a multi-class problem the AUC metric could not be used. Instead, the proportion of correct answers was reported for the held-out samples at each split. The slight variation in the algorithm for a multi-class problem is that the model is built several times using a “one-vs-rest” method, which iterates through all the labels and treats each one as the “positive” class in its respective turn.
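The cross-validation set-up described above can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the code used for the experiment: the choice of `LinearSVC` (which uses a one-vs-rest scheme by default) is an assumption, as are the random seeds, the raised `max_iter` and the synthetic genotype matrix.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-in for the data: 4 classes of 30 individuals each and
# 300 SNP features coded as minor-allele counts (illustrative only).
freqs = rng.uniform(0.05, 0.5, size=(4, 300))
y = np.repeat(np.arange(4), 30)
X = rng.binomial(2, freqs[y]).astype(float)

# Scaling sits inside the pipeline, so the means and standard deviations
# are re-estimated from the training portion of every split.
model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))

# Ten stratified shuffle splits with 75/25% train/test proportions.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.25, random_state=0)

# Accuracy is the proportion of correctly labelled held-out samples.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

Keeping the scaler inside the pipeline means that no information from the held-out samples leaks into the transformation applied at each split.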
The outcome of these trials was that the linear SVM obtained 100% prediction accuracy in identifying all four class labels. It is important to note that, while the image in figure 2.23 might suggest that the classes are not linearly separable, as they resemble the clustering patterns seen in the XOR example, this image shows only the two top principal components. For the algorithm to display the perfect performance that it did, these data points must be completely linearly separable in their original feature space.
The result showed that the SVM algorithm is capable of working with minor-allele information from genotyped datasets, and was suitable for use in the experimental chapters.