In this section we will implement the smallest enclosing hypersphere as a procedure to pre-screen or filter the training data before applying the criteria given in Section 5.3. In practical situations we are often faced with large data sets and calculating the values of the criteria upon omission of each of the cases may become computationally prohibitive. For this reason we propose the following strategy to reduce the number of cases that should be evaluated by the criteria on a leave-one-out basis. We propose using the smallest enclosing hypersphere (Section 6.4) in a preliminary step to identify a subset of cases. In a subsequent step, only the cases in this subset will then be individually evaluated on a leave-one-out basis as cases that potentially may have a detrimental effect on the generalization performance of the KFD classifier.
We argue that the cases that potentially have a detrimental effect on the KFD classifier are those cases in a class that deviate most from the typical cases in that class. If these cases can be identified, only they need to be evaluated by each criterion and not the entire training data set. As illustrated in the following examples, the hypersphere can be used to obtain a subset of cases that deviate from the rest of the cases in a class.
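The two-step strategy described above can be sketched as follows. This is only an illustration of the bookkeeping: the function `screened_leave_one_out`, the index list `support_vector_idx` and the placeholder `dummy_criterion` are hypothetical names, and the placeholder merely stands in for one of the criteria of Section 5.3, which are not reproduced here.

```python
from typing import Callable, Dict, Iterable


def screened_leave_one_out(
    candidates: Iterable[int],
    criterion: Callable[[int], float],
) -> Dict[int, float]:
    """Step 2: evaluate a leave-one-out criterion only for pre-screened cases."""
    return {i: criterion(i) for i in candidates}


# Hypothetical stand-ins, for illustration only.
n_cases = 400
support_vector_idx = [3, 57, 120, 199, 204, 388]  # would come from step 1
calls = []


def dummy_criterion(omitted_case: int) -> float:
    """Placeholder for a criterion recomputed with `omitted_case` left out."""
    calls.append(omitted_case)
    return 0.0


scores = screened_leave_one_out(support_vector_idx, dummy_criterion)
print(f"criterion evaluations: {len(calls)} instead of {n_cases}")
```

The saving comes entirely from the size of the candidate subset, since each criterion evaluation requires recomputing the criterion with one case omitted.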
6.5.1 Illustrative examples
The first example, presented in Figure 6.5, contains two classes with equal covariance matrices but different locations, i.e. Σ1 = Σ2 = I and μ1 ≠ μ2. For each class, 200 observations were generated from a bivariate normal distribution. The mean vector for class 1 is μ1 = 2.5·1, where 1 denotes the vector of ones, and class 2 has mean vector μ2 = 0. An enclosing hypersphere was obtained for each class separately using a Gaussian kernel with parameter γ = 0.1. The support regions corresponding to these hyperspheres are plotted in Figure 6.5. From the figure we observe that all the cases lie inside the support regions and that the boundary of each support region passes through the support vectors. Six support vectors were obtained for class 1 (green squares) and five for class 2 (red circles). These support vectors have non-zero α-values, which are plotted as spikes in Figure 6.6 for each class separately. The rest of the data points have zero α-values. From Figure 6.5, we see that the observations that deviate most from the bulk of the data in a class lie on the boundary of the support region, i.e. they form part of the support vectors. We propose that only the support vectors should be evaluated by the criteria from Section 5.3 as cases that may have a detrimental effect on the KFD generalization performance. In this example there are 400 observations in the training data set and the subset (support vectors) contains 11 observations. Thus, in our proposed two-step procedure for identification of influential cases, only these 11 cases, instead of all 400, will be evaluated on a leave-one-out basis. This leads to a considerable reduction in the number of computations that have to be performed.
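To make the first step concrete, the following is a minimal sketch of how the α-values of an enclosing hypersphere could be computed for each class. It rests on several assumptions that are not confirmed by the text: the Gaussian kernel is taken to be k(x, z) = exp(−γ‖x − z‖²), the hypersphere is fitted without slack variables (the formulation of Section 6.4 may differ), and a simple pairwise-update solver is used purely for illustration. All function names are hypothetical, and because the data are freshly simulated the support vector counts need not equal those reported for Figure 6.5.

```python
import math
import random


def gaussian_kernel_matrix(points, gamma):
    """Gram matrix with K[i][j] = exp(-gamma * ||x_i - x_j||^2) (assumed form)."""
    n = len(points)
    K = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            K[i][j] = K[j][i] = math.exp(-gamma * d2)
    return K


def hypersphere_alphas(K, tol=1e-6, max_iter=50_000):
    """Minimise a'Ka subject to sum(a) = 1, a >= 0 (valid when all K[i][i] are equal)."""
    n = len(K)
    alpha = [0.0] * n
    alpha[0] = 1.0
    Ka = [row[0] for row in K]  # Ka[i] = sum_j K[i][j] * alpha[j]
    for _ in range(max_iter):
        # Squared distance to the centre is K[i][i] - 2*Ka[i] + const; for a
        # Gaussian kernel K[i][i] = 1, so the smallest Ka is the farthest case.
        i = min(range(n), key=Ka.__getitem__)
        j = max((k for k in range(n) if alpha[k] > 0.0), key=Ka.__getitem__)
        gap = Ka[j] - Ka[i]
        if gap < tol:  # every case with alpha > 0 lies (almost) on the boundary
            break
        curvature = K[i][i] + K[j][j] - 2.0 * K[i][j]
        step = min(alpha[j], gap / max(curvature, 1e-12))
        # Move weight from case j to case i; the sum of the alphas stays 1.
        alpha[i] += step
        alpha[j] = 0.0 if step == alpha[j] else alpha[j] - step
        for k in range(n):
            Ka[k] += step * (K[k][i] - K[k][j])
    return alpha


def sample_normal(rng, n, mean):
    """n observations with independent unit-variance normal coordinates."""
    return [[rng.gauss(m, 1.0) for m in mean] for _ in range(n)]


rng = random.Random(0)
classes = {
    1: sample_normal(rng, 200, [2.5, 2.5]),
    2: sample_normal(rng, 200, [0.0, 0.0]),
}
for label, points in classes.items():
    alpha = hypersphere_alphas(gaussian_kernel_matrix(points, gamma=0.1))
    support = [i for i, a in enumerate(alpha) if a > 0.0]
    print(f"class {label}: {len(support)} support vectors out of {len(points)}")
```

Only the indices collected in `support` would be passed on to the leave-one-out criteria. Because the solver stops at a finite tolerance, the count of positive α-values can slightly overstate the number of support vectors at the exact optimum.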
The second example is illustrated in Figures 6.7 and 6.8. The scatter plot in Figure 6.7 contains lognormal training data with equal covariance matrices (Σ1 = Σ2 = I) but location differences between the two classes. The class 1 mean vector is μ1 = 1 (the vector of ones) and the class 2 mean vector is μ2 = 0. Again, there are 200 observations in each class. The smallest enclosing hyperspheres for the two classes were obtained using a Gaussian kernel with γ = 0.1 and the corresponding support regions are shown in Figure 6.7. For this example, the number of support vectors in each class was 6. Again we observe that the observations that deviate from the bulk of the data are support vectors and that these lie on the boundaries of the support regions. In Figure 6.8 the α-values of the support vectors for each class are plotted as spikes, while the rest of the data cases have zero α-values. Implementing our two-step procedure in this example implies that 12 cases, instead of 400, will be evaluated on a leave-one-out basis.
FIGURE 6.5: Illustrative example of the support regions for normal training data. The Gaussian kernel with γ = 0.1 was used. The green squares represent class 1 and the red circles represent class 2.
FIGURE 6.6: Index plot of the α-values of the support vectors for each class. These plots correspond to the same data used in Figure 6.5. There were 6 support vectors in class 1 and 5 in class 2.
FIGURE 6.7: Illustrative example of the support regions for the lognormal training data. The
FIGURE 6.8: Index plot of the α-values of the support vectors for each class. These plots correspond to the same data used in Figure 6.7. There were 6 support vectors in class 1 as well as in class 2.
From these examples it is clear that the observations that deviate most from the bulk of the data in a class form part of the support vectors and therefore have α*_i > 0. Referring back to our argument, we will therefore consider the support vectors as the cases most likely to have a detrimental effect on the KFD generalization performance. In the next section we will investigate the relationship between the parameter γ of the Gaussian kernel and the number of support vectors.
6.5.2 Relationship between γ and the number of support vectors
In this section we will demonstrate how the number of support vectors in each class can be controlled by varying the γ-value. In Figures 6.5 and 6.7 we obtained the support regions by using γ = 0.1. For these examples only a few support vectors were needed to construct the support region, so the subset was quite small. A natural question is whether one can control the size of the subset, i.e. change the number of support vectors. The following shows how the support region and the number of support vectors change as γ changes. Using the same data that were used in Figures 6.5 and 6.7, we selected γ = 5 and constructed the new support regions. The resulting support regions are shown in Figures 6.9 and 6.10. The shapes of the support regions changed markedly for both the normal and the lognormal data. The support regions for γ = 5 are not as smooth as those obtained with γ = 0.1. The increase in the number of support vectors is also dramatic. For the normal data the number of support vectors in the training data increased to 264, and for the lognormal data the number of support vectors increased to 96. Note that in Figures 6.9 and 6.10 there are some points falling outside of the support regions. These points are also support vectors having α* > 0.
It is clear from Figures 6.5 and 6.7 as well as 6.9 and 6.10 that the choice of the γ-value has a significant effect on the number of support vectors. A simulation study was performed as a further investigation of the relationship between the γ-value and the number of support vectors.
We used the following configurations:
- Normal and lognormal distributions were used.
- Five uncorrelated variables were used for both distributions.
- The training data consisted of 400 observations (n1 = n2 = 200).
- The two populations differed with respect to location. We used μ2 = 0 and μ1 = c·1, where c = 2.5 was used for normal data and c = 1 for lognormal data. The identity matrix was used as covariance matrix for both populations (Σ1 = Σ2 = I).
- Twenty equally spaced γ-values were selected between 0.01 and 10.
For each of these γ-values the number of support vectors in each class, as well as in the entire training data set, was obtained. These counts were then used to calculate the fraction of the data corresponding to support vectors in each class as well as in the entire training data set. For each of the configurations described above, 100 simulation repetitions were performed and the average fraction of support vectors was calculated over these repetitions. The results are presented in Figure 6.11 for the normal and lognormal data separately.
Figure 6.11 shows an interesting relationship between the γ-values and the fraction of support vectors. For small γ-values the fraction of support vectors is also small, but as γ increases, the fraction of support vectors also increases. For large γ-values, all the training data cases become support vectors. Comparing the normal with the lognormal case reveals another interesting pattern. For the normal data, the curve shows a rapid increase in the fraction of support vectors for 0 < γ < 2. For γ > 2, all the training data are support vectors (the fraction of support vectors is equal to one). In the case of the lognormal data, the curve increases at a lower rate than for the normal data. It seems that all the training data cases become support vectors only for γ > 10. Since we observe a clear relationship between the γ-values and the number of support vectors, we can use γ to control the number of support vectors.
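The limiting behaviour for large γ can be made plausible with a short derivation. The sketch below assumes that the Gaussian kernel is parametrised as k(x, z) = exp(−γ‖x − z‖²) and that the hypersphere is obtained from the usual dual problem without slack variables. Neither assumption is confirmed by the text above, and the symbols K and n are introduced here for illustration only.

```latex
% Assumed kernel: k(x,z) = \exp(-\gamma \|x - z\|^2), so that k(x_i, x_i) = 1.
% Assumed dual of the smallest enclosing hypersphere for a class of n cases:
\min_{\alpha}\; \alpha^{\top} K \alpha
\quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i = 1,\; \alpha_i \ge 0,
\qquad K_{ij} = \exp\!\left(-\gamma \|x_i - x_j\|^2\right).

% For distinct cases, K_{ij} \to 0 (i \ne j) as \gamma \to \infty, so K \to I and
\alpha^{\top} K \alpha \;\to\; \sum_{i=1}^{n} \alpha_i^2 ,

% which is minimised over the constraint set by
\alpha_i^{*} = \frac{1}{n} > 0 \qquad (i = 1, \dots, n).
```

Under these assumptions every case has a positive α-value in the limit, which is consistent with the fraction of support vectors approaching one in Figure 6.11. If the formulation includes an upper bound C on the α-values, the same argument holds whenever C ≥ 1/n.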
FIGURE 6.9: Support regions for the same normal training data as in Figure 6.5. A Gaussian kernel with γ = 5 was used. Compared to Figure 6.5 the number of support vectors increases (from 11 to 264 for the entire data set), which also changes the shape of the support regions dramatically.
FIGURE 6.10: Support regions for the same lognormal training data as in Figure 6.7. A Gaussian kernel with γ = 5 was used. Compared to Figure 6.7 the number of support vectors increases dramatically (from 12 to 96 for the entire data set). The shapes of the support regions change accordingly.
FIGURE 6.11: Relationship between the parameter γ and the fraction of the data corresponding to support vectors for the normal and lognormal training data. The vertical blue line represents γ = 1/p, which corresponds to a number of support vectors of about 15% to 20% of the training data.
As mentioned in the previous section, we want to use the support vectors as a subset of cases that should be evaluated by the leave-one-out criteria. By selecting an appropriate γ-value, we are able to control the size of the subset. The question that now arises is which γ-value is appropriate. In the next section a simulation study will be conducted where the smallest enclosing hypersphere is applied as a filter to obtain a subset of cases, which is then evaluated by each leave-one-out criterion to identify the cases that are most detrimental to the KFD generalization performance. We will use γ = 1/p to construct the hypersphere in the simulation study. For the configurations considered above, this γ-value corresponds to a subset size of about 15% to 20% of the training data, as represented by the vertical line in Figure 6.11.