5.4 Experiments and Results with Binary-Binary Restricted Boltzmann Ma-
5.4.2 Experiment 2 with All Other Binary Data Sets
5.4.2.1 CalTech-101 Data Set
The CalTech 101 Silhouettes data set (Marlin et al., 2010) is derived from the original CalTech 101 database (Fei-Fei et al., 2007) of distinct objects. To obtain the silhouettes, the primary object in each image is first outlined with a high-quality polygon, then centered and scaled to render on a 28 × 28 pixel image plane. The final image is a filled black polygon on a white background, as shown in Figure 5.7. The CalTech 101 Silhouettes data set is very different from MNIST: it contains a significantly larger number of classes (101 in total) but far fewer samples per class. The train/validation/test split is therefore a stratified sample, to handle the class imbalance in the data set. Given N_c instances from each class c, we put min(35 × N_c, 100) instances from that class into the training set, so each of the 101 classes has at most 100 training instances. The minimum number of training instances per class is around 20. The remaining instances are split evenly between the validation and test sets, which hold between 6 and 400 instances per class. As with the standard CalTech data set, it makes sense to report class-balanced prediction accuracy, since the test and validation sets are badly imbalanced and some classes may be much easier to predict than others.
Figure 5.7: A subset of the CalTech 101 Silhouettes data set, with the silhouettes projected onto a 28 × 28 image plane.
(a) Classification performance (b) Overall computation complexity
Figure 5.8: Comparison of the classification performance achieved by the RBM generative model (η = 0.005), Fisher kernel RBM (η = 0.005) and ClassRBM (η = 0.05) on the CalTech 101 Silhouettes data set. The overall computation time taken by these techniques is shown alongside.
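The capped per-class training count and the class-balanced accuracy described above can be sketched as follows. This is a minimal illustration and not the code used in the experiments: the function names, the `multiplier` argument and the rounding by `int()` are assumptions, and only the cap of 100 is taken from the text.

```python
from collections import Counter, defaultdict


def capped_train_count(n_c, multiplier, cap=100):
    """Training instances drawn from a class that has n_c instances.

    `multiplier` is a hypothetical per-class factor; cap=100 follows the text.
    """
    return min(int(multiplier * n_c), cap)


def class_balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, so every class counts equally."""
    total = Counter(y_true)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: class "a" dominates the test set.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = class_balanced_accuracy(y_true, y_pred)
print(plain, balanced)  # 0.9 0.75
```

In the toy example the plain accuracy is dominated by the large class, while the balanced figure exposes the weaker result on the small class, which is the reason given above for preferring it on imbalanced test sets.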
The 28 × 28 binary silhouettes serve as observations for the visible layer of the RBM, giving 784 visible units. The RBM has one hidden layer, which was tested with different numbers of units to capture the distinctive features of different objects. The number of epochs for the generative model was fixed to 10 for the experiments shown in Figure 5.8 and Figure 5.15. Other parameters that are significant for building and training this generative model are the learning rate (0.005), initial momentum (0.5), final momentum (0.9), weight decay penalty (0.0002) and batch size. We used the contrastive divergence (CD-1) algorithm to approximate the gradient of the RBM likelihood function.
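A minimal NumPy sketch of CD-1 training for a binary-binary RBM with the hyperparameters listed above is given below. It is illustrative only, not the implementation used in the experiments. The batch size (100), the epoch at which the momentum switches from its initial to its final value (5), the weight initialisation and the use of reconstruction probabilities in the negative phase are all assumptions, since the text does not state them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.005, weight_decay=0.0002,
                  initial_momentum=0.5, final_momentum=0.9,
                  momentum_switch_epoch=5, batch_size=100, seed=0):
    """Train a binary-binary RBM with one-step contrastive divergence."""
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # assumed init
    b = np.zeros(n_visible)  # visible biases
    c = np.zeros(n_hidden)   # hidden biases
    dW, db, dc = np.zeros_like(W), np.zeros_like(b), np.zeros_like(c)

    for epoch in range(epochs):
        # Assumed schedule: switch momentum once, after a few epochs.
        if epoch < momentum_switch_epoch:
            momentum = initial_momentum
        else:
            momentum = final_momentum
        order = rng.permutation(len(data))
        for start in range(0, len(data), batch_size):
            v0 = data[order[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            h0_prob = sigmoid(v0 @ W + c)
            h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
            # Negative phase: a single Gibbs step (the "1" in CD-1).
            v1_prob = sigmoid(h0 @ W.T + b)
            h1_prob = sigmoid(v1_prob @ W + c)
            # CD-1 estimate of the likelihood gradient.
            grad_W = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
            grad_b = (v0 - v1_prob).mean(axis=0)
            grad_c = (h0_prob - h1_prob).mean(axis=0)
            # Momentum update; weight decay applies to the weights only.
            dW = momentum * dW + lr * (grad_W - weight_decay * W)
            db = momentum * db + lr * grad_b
            dc = momentum * dc + lr * grad_c
            W += dW
            b += db
            c += dc
    return W, b, c


# Toy usage on random binary "images" with 784 pixels.
toy = (np.random.default_rng(1).random((50, 784)) < 0.3).astype(float)
W, b, c = train_rbm_cd1(toy, n_hidden=100, epochs=2, batch_size=25)
print(W.shape, b.shape, c.shape)  # (784, 100) (784,) (100,)
```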
We compare the obtained classification results with some baseline and state-of-the-art techniques on the CalTech 101 Silhouettes data set. The classification performance we achieve with Fisher kernels is competitive with the state-of-the-art results shown in Table 5.6. Note that we confine the comparison to methods that use the silhouettes rather than the colored images in the original CalTech 101 database (Fei-Fei et al., 2007). Once again, the Fisher kernel shows the best classification accuracy at the 100-hidden-unit level, compared with the ClassRBM's best performance at 500 hidden units and the generative model's performance at all scales. From the results in Table 5.6, it is also clear that the classification performance achieved by the Fisher kernel RBM is competitive with that of the two-layer DBN. This result points to the computational benefit of Fisher kernels over the popular deep models, which require a lot of parameter tuning before they can classify the data.
(a) Zoomed image of train time complexity (b) Training complexity
Figure 5.9: Comparison of the computational complexity incurred by each algorithm for training the generative models and the SVM optimizer. The data set used is CalTech-101.
(a) Zoomed image of test time complexity (b) Testing complexity
Figure 5.10: Comparison of the computational complexity of each algorithm for the testing phase. The data set used is CalTech-101.
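To make the Fisher kernel results discussed above more concrete, the sketch below computes Fisher scores for a binary-binary RBM using the standard definition of the score as the gradient of log p(v) with respect to the parameters. It is a hedged illustration, not the construction used in the experiments: the function name, the optional `model_stats` argument and the choice of an identity Fisher information matrix (a plain inner-product kernel) are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def rbm_fisher_scores(V, W, c, model_stats=None):
    """Fisher scores of a binary-binary RBM, one flattened row per sample.

    V: (n, n_visible) binary data; W: (n_visible, n_hidden) weights;
    c: (n_hidden,) hidden biases.
    model_stats: optional (E[v h^T], E[v], E[h]) under the model. This is a
    hypothetical argument: these expectations are intractable and would
    have to be estimated. They are the same for every sample.
    """
    H = sigmoid(V @ W + c)                  # p(h_j = 1 | v)
    g_W = V[:, :, None] * H[:, None, :]     # data term v_i * p(h_j = 1 | v)
    g_b = V.copy()                          # data term for visible biases
    g_c = H                                 # data term for hidden biases
    if model_stats is not None:
        E_vh, E_v, E_h = model_stats
        g_W, g_b, g_c = g_W - E_vh, g_b - E_v, g_c - E_h
    return np.concatenate([g_W.reshape(len(V), -1), g_b, g_c], axis=1)


# Toy usage: two 3-pixel samples, two hidden units, zero parameters.
V = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
U = rbm_fisher_scores(V, np.zeros((3, 2)), np.zeros(2))
K = U @ U.T  # kernel matrix, assuming an identity Fisher information matrix
print(U.shape, K.shape)  # (2, 11) (2, 2)
```

With 784 visible and 100 hidden units this gives 784 × 100 + 784 + 100 = 79,284 features per image, which can then be passed to an SVM or a nearest-neighbour classifier.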
Table 5.6: Performance achieved by state-of-the-art methods on the CalTech 101 Silhouettes data set.
Models | Performance (% Accuracy)
Support Vector Machines (Fisher Kernel; hidden units=100) | 63.82 ± 1.5
Support Vector Machines (Linear Kernel; Input=Image Pixels) | 70.32 ± 0.11
Support Vector Machines (Gaussian Kernel; Input=Image Pixels) | 68.57 ± 0.12
K-Nearest Neighbor (k=1; Input=Fisher scores from RBM with hidden units=100) | 59.92 ± 0.08
K-Nearest Neighbor (k=1; Input=Image Pixels) | 64.29 ± 0.16
Condensed Nearest Neighbour (Input=Image Pixels) (@55% retrieved rate) | 62.40 ± 0.42
Convolutional Deep Belief Networks (DBN) (Lee et al., 2009) (2 layers) | 65.4 ± 0.5
ClassRBM (hidden units=500, η=0.05) | 59.37 ± 1.18
Restricted Boltzmann Machine (Marlin et al., 2010) (550 class relevant and class irrelevant hidden units, Persistent CD learning) [6] | 71.4
[6] This model differs from the classical RBM that forms the core of the DBN and is used throughout our experiments. Its performance figure is mentioned here only for completeness alongside the other state-of-the-art methods.
Note that on this data set, the optimization algorithms used for SVM training and prediction are sequential minimal optimization (SMO) and stochastic gradient descent (SGD). To solve the multi-class classification problem, the SMO implementation uses the one-versus-one method, whereas the SGD implementation uses the one-against-all method. Empirically, SMO offers better classification accuracy, close to the state-of-the-art performance shown in Table 5.6, whereas SGD offers comparable accuracy at a lower computational cost. Note that this lower computational cost is not due to the one-against-all methodology but to the SVM optimization algorithm, which learns from the data in an online fashion. If one is interested in building a fast classification system, then SGD is a better choice than SMO for SVM optimization. See Figure 5.11 for an analysis of the time and performance of all the competing methods on the CalTech-101 data set.
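The two multi-class reductions and the online nature of SGD mentioned above can be illustrated with a short sketch. It counts the binary classifiers each reduction needs for 101 classes and trains a one-against-all linear SVM by SGD on the hinge loss. This is an assumed, generic formulation and not the SVM implementation used in the experiments: the function names, the constant step size `lr` and the regularisation strength `lam` are hypothetical.

```python
import numpy as np


def n_binary_classifiers(n_classes):
    """Number of binary SVMs each multi-class reduction has to train."""
    return {
        "one_vs_one": n_classes * (n_classes - 1) // 2,
        "one_vs_all": n_classes,
    }


def train_ova_sgd_svm(X, y, n_classes, epochs=30, lr=0.05, lam=0.001, seed=0):
    """One-against-all linear SVM, SGD on the L2-regularised hinge loss."""
    rng = np.random.default_rng(seed)
    W = np.zeros((n_classes, X.shape[1]))
    bias = np.zeros(n_classes)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):   # one example at a time (online)
            # +1 for the example's own class, -1 for every other class.
            targets = np.where(np.arange(n_classes) == y[i], 1.0, -1.0)
            margins = targets * (W @ X[i] + bias)
            active = margins < 1.0          # classifiers with non-zero hinge loss
            W *= 1.0 - lr * lam             # gradient step on the L2 penalty
            W[active] += lr * targets[active][:, None] * X[i]
            bias[active] += lr * targets[active]
    return W, bias


def predict_ova(X, W, bias):
    """Pick the class whose binary classifier gives the highest score."""
    return np.argmax(X @ W.T + bias, axis=1)


print(n_binary_classifiers(101))  # {'one_vs_one': 5050, 'one_vs_all': 101}

# Toy usage: three well-separated classes in three dimensions.
rng = np.random.default_rng(0)
y_toy = np.repeat(np.arange(3), 20)
X_toy = 3.0 * np.eye(3)[y_toy] + 0.1 * rng.standard_normal((60, 3))
W_toy, bias_toy = train_ova_sgd_svm(X_toy, y_toy, n_classes=3)
toy_accuracy = np.mean(predict_ova(X_toy, W_toy, bias_toy) == y_toy)
```

Each SGD update touches a single example, so the cost of a pass grows linearly with the number of training examples. That is consistent with the observation above that the speed advantage comes from the optimizer rather than from the one-against-all reduction.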
[Scatter plot of mean accuracy versus time for the CalTech-101 data set. Axes: performance (% correct) and CPU time (seconds). Legend: RBM generative model, Fisher kernel (LIBSVM), ClassRBM, Fisher kernel (SGD).]
Figure 5.11: Scatter plot of performance and time of all the competitive techniques; SVM with SGD using Fisher kernel again outclasses the other methods on the computational complexity frontier, yet its performance is not the best as achieved by the
Table 5.7: Performance achieved by state-of-the-art methods on the USPS data set.
Models | Performance (% Accuracy)
Support Vector Machines (Fisher Kernel; hidden units=1) | 87.39 ± 0.1
K-Nearest Neighbor (k=1; Input=Fisher scores from RBM with hidden units=1) | 78.02 ± 1.67
Support Vector Machines (Linear Kernel; Input=Image Pixels) | 94.47
Support Vector Machines (Gaussian Kernel; Input=Image Pixels) | 93.52
K-Nearest Neighbor (k=1; Input=Image Pixels) | 94.37
Condensed Nearest Neighbour (Input=Image Pixels) | 91.88