Experiments with Handwritten Digit Data - Expectation Propagation for NBSBC

5.5.2 Experiments with Handwritten Digit Data

In this section we evaluate the performance of SVM, NBSVM, SBC, GL and NBSBC on the problem of automatic classification of handwritten digits. In particular, we focus on discriminating between the digits 7 and 9 in the MNIST dataset (LeCun et al., 1998). This is a challenging problem because the digits 7 and 9 are similar in shape. In MNIST, each digit is centered and normalized in size in a 28 × 28 black and white image. The pixel intensities range from 0 to 255. Additionally, the background pixel intensities are constant and equal to 0. This means that most pixels could be directly ignored by a feature selection method because their value is 0 for all training instances. To consider a more difficult problem, in which all pixels are potential predictors for the class label, noise is added to the digitized images: each pixel with intensity equal to 0 is replaced by a pixel whose intensity is a random number uniformly distributed between 0 and 128.
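The noise-injection step described above can be sketched as follows. This is only an illustration, not the code used in the experiments: the function name `add_background_noise`, the toy image and the choice of a real-valued (rather than integer) uniform draw are assumptions.

```python
import random

def add_background_noise(image, rng, high=128.0):
    """Return a copy of `image` in which every zero pixel is replaced by noise.

    `image` is a list of rows of pixel intensities in [0, 255].
    Assumption: the noise is a real number drawn from U(0, high).
    """
    return [
        [rng.uniform(0.0, high) if pixel == 0 else pixel for pixel in row]
        for row in image
    ]

# Hypothetical 3 x 3 toy image; actual MNIST digits are 28 x 28.
toy_image = [
    [0, 0, 255],
    [0, 200, 0],
    [90, 0, 0],
]
noisy_image = add_background_noise(toy_image, random.Random(0))
```

Pixels that belong to the digit keep their original intensity, so after this step no pixel can be discarded simply because it is constant across the training instances.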

Figure 5.4 shows two sample digits from each of the two classes. The MNIST dataset contains 7293 instances of class "7" and 6958 instances of class "9". To incorporate dependencies among features, we consider classifiers in which contiguous pixels in the images tend to be either both excluded or both included in the prediction model. Empirical support for this assumed dependence structure is given in the top-middle plot in Figure 5.5. This plot displays the absolute value of the linear correlation coefficient between each of the features and the class label, estimated using the complete dataset. The network of feature dependencies is generated by connecting each pixel to its four nearest neighbors in the image. To avoid spurious boundary effects, the network forms a torus, in which pixels close to a given boundary are connected to pixels on the opposite boundary (see Figure 5.2). The experiments are repeated for 100 independent random partitions into a training set with 150 instances and a test set of size 14,101.
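A minimal sketch of how such a toroidal four-neighbor network could be built is given below. The row-by-row pixel indexing and the function name `torus_grid_edges` are assumptions made for illustration.

```python
def torus_grid_edges(n_rows=28, n_cols=28):
    """Undirected edges of a 4-neighbour pixel grid with wrap-around boundaries.

    Assumption: pixel (r, c) has index r * n_cols + c.
    """
    edges = set()
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            # Right and lower neighbours; the modulo wraps around the border.
            right = r * n_cols + (c + 1) % n_cols
            down = ((r + 1) % n_rows) * n_cols + c
            edges.add((min(i, right), max(i, right)))
            edges.add((min(i, down), max(i, down)))
    return sorted(edges)

edges = torus_grid_edges()
```

Adding only the right and lower neighbor of each pixel is enough, because the left and upper links are created when the neighboring pixels are visited. On the 28 × 28 torus every pixel ends up with exactly four neighbors, including the pixels on the image border.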

Table 5.3 summarizes the results obtained by each method in the handwritten digit dataset. The best technique in terms of test error is NBSBC, followed by SBC. The differences between these two methods are statistically significant according to a paired Wilcoxon test, with a p-value lower than 2 · 10⁻¹⁶. NBSVM is the third best method, closely followed by SVM. Finally, GL obtains the worst performance. In terms of the ability to select relevant features, the best method is NBSBC and GL is the worst one. The top-left plot of Figure 5.5 displays the average of IFSQ(k) for each method. Note that the curve for NBSBC is always above those for the other methods.

Chapter5. Network-based Sparse Bayesian Classification 91

Figure 5.4: Each plot shows a sample digit from each class, that is, "7" and "9".

Table 5.3: Results for each method in the handwritten digit dataset.

Measure                 SVM           NBSVM            GL            SBC          NBSBC
Avg. Test Error in %    10.32±0.015   10.23±0.013      11.18±0.012   9.18±0.009   8.35±0.009
Avg. Area under IFSQ    -             41.78±9.28       31.75±3.57    54.87±5.38   61.61±5.59
Avg. Training Time      35.80±0.61    1992.51±211.95   29.93±2.47    0.56±0.04    21.32±8.49

The training time of NBSBC is similar to that of GL and shorter than that of SVM. In contrast, training NBSVM is about 100 times slower than training NBSBC. Figure 5.5 displays the relevance assigned by SBC, NBSVM, GL and NBSBC to each feature (image pixel) in a particular realization of the handwritten digit classification problem. The relevance map given by NBSBC is composed of a few uniform patches. By contrast, NBSVM and GL tend to select individual features or small clusters of features, which are in most cases disconnected from each other.

5.5.3 Experiments with Precipitation Data

We now evaluate the accuracy of SVM, NBSVM, SBC, NBSBC and GL in the task of modeling precipitation data. In particular, we attempt to build a classifier that discriminates between days with zero rainfall and days with positive rainfall at a target meteorological station, given the rainfall measurements collected at other stations on the same day. The data correspond to daily precipitation measurements gathered at 223 meteorological stations in the former USSR from 1881 until 2001 (Razuvaev et al., 2008). This is the same dataset used for the experiments of Section 3.4.3. The 223 stations are displayed in the right part of Figure 5.6.

The first task considered consists in predicting whether or not it rained in Moscow. Further experiments show that the results for this particular station are similar to those obtained when other target stations are considered. The identification number assigned to the Moscow station by the World Meteorological Organization (WMO) is 27,612. The instance features for the problem are the precipitation measurements collected at the other 222 meteorological stations. In the original dataset, rainfall measurements are available at all the stations for 4543 days. Of these, 2217 days were dry in Moscow, leaving a total of 2326 days with positive precipitation at that station.

Figure 5.5: Top left, plots of the average of IFSQ(k) for NBSBC, SBC, NBSVM and GL in the handwritten digit dataset. Top middle, estimate of the actual feature relevance in the handwritten digit dataset. Top right, bottom left, bottom middle and bottom right, relevance assigned to each feature by SBC, NBSVM, GL and NBSBC, respectively, when these methods are executed on the first training set of the handwritten digit dataset. The most relevant feature is colored in black and the most irrelevant feature is colored in white.

Figure 5.6: Left, average of IFSQ(k) for each feature selection method in the precipitation dataset. Right, meteorological stations in the former USSR. Each node corresponds to a different rainfall station. The arrow points to Moscow and represents the location of the target precipitation station with WMO number 27,612. The edges correspond to a Delaunay triangulation of all the precipitation stations except the target station. Links between stations that are more than 1000 km away from each other have been removed.

Table 5.4: Results for each method in the precipitation dataset.

Measure                 SVM          NBSVM          GL           SBC          NBSBC
Avg. Test Error in %    38.12±0.02   36.69±0.03     32.31±0.03   35.16±0.03   33.17±0.03
Avg. Area under IFSQ    -            14.14±5.97     14.52±4.51   21.15±4.94   36.71±9.38
Avg. Training Time      9.82±0.16    254.37±19.18   14.74±0.97   0.31±0.12    8.36±2.92

To construct the network of feature dependencies, we assume that two nearby stations should be either both excluded or both included in the classification model. The network used (Figure 5.6) results from a Delaunay triangulation (Renka, 1997) of the different meteorological stations, removing links between stations that are more than 1000 km apart. This type of triangulation is the dual graph of a Voronoi diagram. Voronoi diagrams are commonly used for the interpolation of scattered data in earth sciences (Sen, 2009). The experiments involve 100 independent realizations of a training set with 150 instances and a test set of size 4393.
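The pruning of long links can be illustrated with the sketch below. It assumes that the edges of the triangulation have already been computed and that distances are measured along great circles with the haversine formula; the text does not state how distances were obtained, and the station labels and coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(a, b):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def prune_long_edges(edges, coords, max_km=1000.0):
    """Keep only the edges whose endpoints are at most `max_km` apart."""
    return [(i, j) for i, j in edges
            if great_circle_km(coords[i], coords[j]) <= max_km]

# Hypothetical stations and triangulation edges, for illustration only.
coords = {
    "A": (55.75, 37.62),
    "B": (59.94, 30.31),
    "C": (55.03, 82.92),
}
triangulation_edges = [("A", "B"), ("A", "C"), ("B", "C")]
kept = prune_long_edges(triangulation_edges, coords)
```

In this toy example only the link between stations A and B survives, since station C is several thousand kilometers away from the other two.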

Table 5.4 summarizes the results obtained by each method. The lowest test error is obtained by GL, followed by NBSBC. The differences between these two techniques are statistically significant according to a paired Wilcoxon test (p-value = 0.003). SBC is the third best method, followed by NBSVM and SVM. Regarding the ability to select relevant features, NBSBC is the best technique. The left part of Figure 5.6 shows plots of the average of IFSQ(k) for each method. The curve for NBSBC is generally above the curves for the other methods. Building the classifier using NBSBC is faster than GL by a factor of ≈ 1.6 and faster than NBSVM by a factor of ≈ 30. These experiments are also repeated for 50 different randomly selected target stations. The results show that the ranking GL (best), NBSBC, SBC, NBSVM, SVM (worst) is fairly robust: the average ranks for these methods are 1.16, 2.54, 3.02, 3.48, and 4.80, respectively.
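For reference, average ranks of this kind can be computed as in the following sketch. The function name `average_ranks`, the handling of ties (tied methods share the mean of the ranks they span) and the error values for the two target stations are all hypothetical; they are not the values obtained in the experiments.

```python
def average_ranks(errors_by_task):
    """Average rank of each method over tasks (rank 1 = lowest test error).

    Assumption: tied methods share the mean of the ranks they span.
    """
    methods = list(errors_by_task[0])
    totals = {m: 0.0 for m in methods}
    for errors in errors_by_task:
        ordered = sorted(methods, key=lambda m: errors[m])
        pos = 0
        while pos < len(ordered):
            # Extend `end` over the run of methods tied with ordered[pos].
            end = pos
            while (end + 1 < len(ordered)
                   and errors[ordered[end + 1]] == errors[ordered[pos]]):
                end += 1
            shared_rank = (pos + end) / 2 + 1
            for m in ordered[pos:end + 1]:
                totals[m] += shared_rank
            pos = end + 1
    return {m: totals[m] / len(errors_by_task) for m in methods}

# Made-up test errors (in %) for two hypothetical target stations.
errors_by_task = [
    {"SVM": 38.0, "NBSVM": 37.0, "GL": 32.0, "SBC": 35.0, "NBSBC": 33.0},
    {"SVM": 39.0, "NBSVM": 36.0, "GL": 31.0, "SBC": 34.0, "NBSBC": 30.0},
]
ranks = average_ranks(errors_by_task)
```

An average rank close to an integer, such as the 1.16 reported for GL, indicates that the method occupies the same position for almost all target stations.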

The good results obtained by GL in this dataset are probably due to the characteristics of the feature selection process implemented by this method. Specifically, the features selected by GL correspond to edges that need not be connected. By contrast, NBSBC and NBSVM favor the selection of connected components from the original network. In this particular domain, it may be necessary to reflect other geographical information in the network beyond the distances between the stations (for example, the existence of geographical barriers) to provide a sufficiently accurate description of the feature dependencies. Since the sparsity pattern imposed by GL is looser, in the sense that the selected edges need not be close to each other, it is possible that the limitations of a network based exclusively on distances affect GL less severely than the other methods. Nevertheless, the differences in performance between GL and NBSBC are fairly small.

In document Balancing flexibility and robustness in machine learning: semi-parametric methods and sparse linear models (Page 104-107)