3.5 Experimental studies
3.5.1 Methods and parameter tuning
To tune the parameters of each method in our experiments, we performed a grid search over predefined parameter values using 10-fold cross-validation, and we report the best result.
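The tuning protocol above can be sketched as follows. This is a minimal illustration and not the code used in the experiments: `kfold_indices`, `grid_search` and the `score_fn` callback are hypothetical names, and `score_fn` stands in for training a method on the training folds and returning its accuracy on the held-out fold. The folds here are contiguous and unshuffled, which is a simplification.

```python
import itertools
import statistics


def kfold_indices(n_instances, n_folds=10):
    """Split indices 0..n_instances-1 into n_folds (train, test) pairs.

    Folds are contiguous and unshuffled (a simplification for this sketch).
    """
    fold_sizes = [n_instances // n_folds + (1 if i < n_instances % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_instances) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def grid_search(param_grid, score_fn, n_instances, n_folds=10):
    """Return (best_params, best_mean_score) over all parameter combinations.

    score_fn(params, train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns the accuracy on test_idx.
    """
    names = sorted(param_grid)
    folds = kfold_indices(n_instances, n_folds)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, train, test) for train, test in folds]
        mean_score = statistics.mean(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Every combination in `param_grid` is scored on the same folds, so the comparison between parameter settings is not affected by differences in the data split.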
For the SGT algorithm, we performed a broader search than suggested in Joachims [2003], covering all combinations of the following values: the number of neighbours was searched in k ∈ {10, 50, 100}; the number of first eigenvectors in d ∈ {10, 40, 80, 100}; and the error parameter in c ∈ {10^0, 10^2, 10^3, 10^4}. Although the parameter c was set between 3200 and 12800 in Joachims [2003], our preliminary experiments generated better results with our setting.
Footnote 1: We denote ClusterReg as the general algorithm, and ClusterReg-MLP and ClusterReg-RBFN as specific instantiations of ClusterReg with MLP and RBFN, respectively.
Like ClusterReg, TSVM relies on the cluster assumption. Therefore, if the dataset has a meaningful cluster structure, we expect ClusterReg to deliver more accurate results. If such a structure is not present, both algorithms may have similar performance. For the parameters in TSVM, we followed Chapelle et al. [2006]. We used an RBF kernel whose width was set to the median of the pairwise distances between instances [Chapelle et al., 2006]. Unlike Chapelle et al. [2006], we performed a broader search of the soft margin parameter C (which controls the trade-off between margin size and misclassified training instances), with C ∈ {10^0, 10^1, 10^2, 10^3}. In preliminary experiments, lower values of C increased the computational time and reduced the generalisation accuracy of TSVM, and higher values did not improve results. We performed a grid search with all combinations of these parameters and selected the ones with the best result for each dataset.
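The median heuristic for the RBF kernel width can be illustrated with a short sketch. The function names and the toy data are hypothetical, and the kernel form exp(-d^2 / (2 * width^2)) is an assumption, since the text does not state which parameterisation of the RBF kernel was used.

```python
import itertools
import math
import statistics


def median_pairwise_distance(instances):
    """Median Euclidean distance over all unordered pairs of instances."""
    distances = [math.dist(a, b)
                 for a, b in itertools.combinations(instances, 2)]
    return statistics.median(distances)


def rbf_kernel(x, y, width):
    """Gaussian RBF kernel; the exp(-d^2 / (2 * width^2)) form is assumed."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * width ** 2))


# Toy data, for illustration only.
instances = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
width = median_pairwise_distance(instances)  # pairwise distances: 5, 10, 5
similarity = rbf_kernel(instances[0], instances[1], width)
```

Setting the width from the data in this way removes one parameter from the grid, leaving only C to be searched.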
Since Multi-Class Semi-Supervised Boosting (MCSSB) [Valizadegan et al., 2008] uses all three SSL assumptions, we expect ClusterReg to outperform MCSSB only on datasets where the cluster assumption holds, that is, datasets with a clear cluster structure that relates to the class distribution. MCSSB would deliver better results on datasets with an unclear cluster structure or none at all. As its base classifier, we chose SVM, since it delivered the best results in our preliminary experiments. We fixed the parameter C = 10000 (see footnote 1). The ratio of the range of distances used for kernel construction was searched in σ ∈ {0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.5, 0.8, 1}. We set the sample size as a ratio in {0.1, 0.5, 0.8, 1} of the total number of instances for transductive and inductive contexts. The number of base learners was searched in {20, 50}.
For RegBoost, Chen and Wang [2011] also suggested a grid search for the best combination of parameters. The number of iterations was searched in {20, 50} and the number of neighbours in {3, 4, 5, 6}. The resampling rate was set to 0.1 in the first iteration and searched in {0.1, 0.25, 0.5} for the remaining iterations. Following Chen and Wang [2011] and our preliminary experiments, we chose SVM as the base classifier.
For ClusterReg, the parameter λ controls the amount of regularisation in the algorithm. Since we do not know in advance whether the data satisfy the cluster assumption, we performed a grid search in {0.2, 0.4, 0.6, 0.8, 1}. It is advisable to set this value between 0 and 1.
Footnote 1: As demonstrated in Valizadegan et al. [2008] and confirmed by our preliminary experiments, this value should be set to 10000. Lower and higher values did not improve the performance.
Our preliminary experiments showed that the number of neighbours V can be set to 30 for most datasets used in this work. For datasets with no more than 1500 instances, this number might represent a comprehensive search for labels in the neighbourhood of an instance. With too few neighbours, ClusterReg may not capture the correct label structure of the neighbourhood. For datasets with more than 1500 instances, V could be set to 2% of the number of instances.
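The rule of thumb for V can be written as a small helper. The function name and its keyword arguments are hypothetical, and rounding 2% of the dataset size to the nearest integer is an assumption, since the text does not say how fractional values are handled.

```python
def choose_num_neighbours(n_instances, default=30, threshold=1500, fraction=0.02):
    """Pick V from the dataset size, following the rule of thumb in the text.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    if n_instances <= threshold:
        return default
    return round(fraction * n_instances)
```

For example, a dataset of 345 instances would use V = 30, while a dataset of 5000 instances would use V = 100.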
We employed four clustering methods from different clustering approaches: k-means, STSC, GMM and Fuzzy GK. We also selected the clustering algorithm by grid search, since the performance of these algorithms varies depending on the real underlying class structure in the dataset and the type of partition that each method attempts to find (see footnote 1).
However, our experiments demonstrated that ClusterReg with STSC usually obtains good generalisation ability for most datasets. This fact might indicate that most of these datasets have clusters with arbitrary shapes that other algorithms might not be able to find. Therefore, we suggest the use of STSC as the clustering algorithm for ClusterReg.
For the number of clusters K, we recommend a value of at least the number of classes. We intend to generate clusters that are as compact as possible. If the class structure is not captured by the clustering algorithm, we can increase the number of clusters, so that one class is composed of multiple clusters. ClusterReg will avoid dividing these clusters and, therefore, may be able to place the decision boundary outside the class. According to our preliminary experiments, we recommend, in general, setting K to twice the number of classes (or a greater multiple of the number of classes).
The parameter κ controls the importance of each neighbour according to its similarity (as determined by the clustering algorithm) to an instance. With a larger κ, we relax the cluster assumption by allowing the decision boundary to cut through relatively distant neighbours. It regulates the size of the portion of a cluster that we allow the decision boundary to traverse. According to our preliminary experiments, it should be set between 1 and 12; values in the middle of this range often deliver good performance. The performance of ClusterReg degrades, for all datasets, with values outside this range. Thus, we fixed κ = 5, although further tuning might produce better results.
Footnote 1: k-means tends to generate hyperspherical clusters [Xu and Wunsch, 2005]. GMM and Fuzzy GK are able to obtain elliptical clusters, whereas STSC is capable of finding clusters with arbitrary shapes.
Parameter             Setting
Clustering algorithm  Grid search with k-means, GMM, STSC or Fuzzy GK
λ                     Grid search in {0.2, 0.4, 0.6, 0.8, 1}
K                     Grid search in {1, 2, 3, 4} times the number of classes
Centre widths         Grid search with ratio in {0.2, 0.5} of the median of pairwise distances
Table 3.1: Summary of tuned parameters for ClusterReg.
For ClusterReg-MLP specifically, the number of hidden nodes was fixed at 15: in our preliminary experiments, larger numbers did not improve generalisation due to overfitting, and smaller numbers did not produce sufficiently complex networks for our datasets. The number of epochs in the SCG algorithm was 50.
In ClusterReg-RBFN, the centres (hidden nodes) of the RBFN coincide with the instances of the entire dataset, except when the number of instances is larger than 1000, in which case we randomly select 100 instances as centres. The width of the centres was calculated as a ratio of the median of all pairwise Euclidean distances between instances. This ratio was searched in {0.2, 0.5}, as other values produced lower generalisation accuracy. The parameter α for weight regularisation was fixed at α = 0.5 for both ClusterReg-MLP and ClusterReg-RBFN.
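The centre selection and width computation described above could look like the following sketch. The function names are hypothetical, sampling without replacement is an assumption, and the fixed seed is used only to make the sketch reproducible.

```python
import itertools
import math
import random
import statistics


def select_centres(instances, max_all=1000, n_sampled=100, seed=0):
    """Use every instance as a centre, or a random subset for large datasets."""
    if len(instances) <= max_all:
        return list(instances)
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    # Sampling without replacement is an assumption of this sketch.
    return rng.sample(list(instances), n_sampled)


def centre_width(instances, ratio):
    """Width as a ratio of the median pairwise Euclidean distance."""
    distances = [math.dist(a, b)
                 for a, b in itertools.combinations(instances, 2)]
    return ratio * statistics.median(distances)
```

With the two ratios searched in the text, `centre_width(instances, 0.2)` and `centre_width(instances, 0.5)` would give the two candidate widths for a dataset.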
In Figure 3.6, we show the behaviour of the generalisation error for different values of λ, V, K and κ across three different percentages of labelled data in the BUPA dataset [Frank and Asuncion, 2010]. For our experiments, we selected only a subset of the values that yielded roughly good performance in Figure 3.6. Table 3.1 summarises the selection of each tuned parameter in ClusterReg. Further tuning might improve generalisation accuracy.
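To give a sense of the size of the search summarised in Table 3.1, the tuned parameters can be enumerated as below. This is only a sketch: the dictionary keys and the function name are hypothetical, the centre-width ratio applies to ClusterReg-RBFN, and the two-class example is chosen purely for illustration.

```python
import itertools

# Values taken from the text; the names are illustrative only.
CLUSTERERS = ["k-means", "GMM", "STSC", "Fuzzy GK"]
LAMBDAS = [0.2, 0.4, 0.6, 0.8, 1.0]
K_MULTIPLIERS = [1, 2, 3, 4]      # K = multiplier * number of classes
WIDTH_RATIOS = [0.2, 0.5]         # centre widths (ClusterReg-RBFN)
FIXED = {"kappa": 5, "alpha": 0.5}  # parameters fixed rather than tuned


def clusterreg_grid(n_classes):
    """List every tuned-parameter combination for a dataset with n_classes."""
    configs = []
    for clusterer, lam, mult, ratio in itertools.product(
            CLUSTERERS, LAMBDAS, K_MULTIPLIERS, WIDTH_RATIOS):
        config = dict(FIXED, clusterer=clusterer, lam=lam,
                      n_clusters=mult * n_classes, width_ratio=ratio)
        configs.append(config)
    return configs


configs = clusterreg_grid(n_classes=2)  # e.g. a two-class dataset
```

For a two-class dataset this yields 4 × 5 × 4 × 2 = 160 configurations, each of which would be evaluated with 10-fold cross-validation.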
[Figure 3.6 (plots not reproduced): four panels showing generalisation error (%) for 5%, 10% and 20% labelled data against (a) λ, (b) number of nearest neighbours V, (c) number of clusters K and (d) κ.]
Figure 3.6: Generalisation error from 10-fold cross-validation with different values of λ, V, K and κ across three different percentages of labelled data (5%, 10% and 20% in relation to the total number of instances) in the BUPA dataset [Frank and Asuncion, 2010].