ISSN: 2005-4238 IJAST 135
Copyright ⓒ 2019 SERSC
Object Classification: Performance Evaluation of CNN based Classifiers using Standard Object Detection Datasets
Akhil Kumar1, Arvind Kalia2
1,2Himachal Pradesh University, Shimla, India
Abstract
Object classification is the process of recognizing and classifying the objects in a given image. At present, objects in images are mostly classified using Convolutional Neural Networks (CNNs). CNNs extract features from a given image and, using a Softmax layer, perform the classification task. In recent years, several CNN based classifiers have been introduced. CNN architectures such as AlexNet, VGG Net, ZF Net, GoogLeNet (Inception), Microsoft ResNet, DenseNet and Spatial Transformer Networks are used as image classifiers. In this work, the authors use CNN based image classifiers to perform classification on standard image datasets: Kaggle Dog v/s Cat, MNIST, CIFAR-10 and Tiny ImageNet. The authors present a comparison of the CNN based image classifiers evaluated on these datasets by measuring accuracy, error rate, precision and recall. The results provide insight into selecting a CNN based image classifier on the basis of the characteristics of the image dataset.
Keywords: Object Classification, CNNs, Object Detection Dataset
1. Introduction
Object classification is a classification problem in the area of image processing and object detection. Its goal is to assign each object in a given image to the category it belongs to. In previous studies, the solutions proposed for object classification fall into two approaches. The first approach is based on traditional methods and the second on evolutionary methods such as Deep Learning. The traditional approach uses shape, motion, color and texture as features to classify the object in a given image. Its limitation is that it applies only to static imagery, objects with few or no deformations, and small datasets. The evolutionary approach employs methods such as Convolutional Neural Networks (CNNs), which are based on gradient-based learning. CNN based methods can perform classification on large datasets, on images with object deformations, and on multi-class object categories. A CNN is made up of several layers; each layer extracts features of the object from the image and transforms them so that the object can be assigned to an object class. In recent years, several CNN based classifiers have been proposed: AlexNet [3], ZF Net [4], VGG Net [5], GoogLeNet [6], Microsoft ResNet [7], DenseNet [8] and Spatial Transformer Networks [9]. These classifiers differ in the depth of their convolution layers, max-pooling layers and classification layers, and they act as backbone architectures in present-day object detection methods.
The basic idea of CNNs is based on neural networks and gradient-based learning [1, 2]. CNNs are made up of neurons with learnable weights and biases. Each neuron takes its inputs, computes a weighted sum of them plus a bias, and passes the result through an activation function to produce its output. The input to a CNN is a multi-channel image. The activation functions used by CNNs are Sigmoid, Tanh and ReLU.
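The neuron computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from this work: the function name `neuron_output` and all numeric values are assumptions chosen for the example, and Tanh is used as the default activation only because it is available in the standard library.

```python
import math


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Return activation(sum_i w_i * x_i + b) for a single neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)


# Illustrative values only; they are not taken from the paper.
pixels = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.125]
bias = 0.25

# Identity activation exposes the raw weighted sum plus bias.
pre_activation = neuron_output(pixels, weights, bias, activation=lambda z: z)
# Tanh squashes the same sum into the interval (-1, 1).
output = neuron_output(pixels, weights, bias)
```

In a convolution layer the same weights are applied to every local patch of the image, so the `inputs` list would correspond to the pixel values of one patch.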
In this work, CNN based classifiers are evaluated on benchmark image datasets: the MNIST [12] handwritten digit dataset, the CIFAR-10 [11] multi-object category dataset, Tiny ImageNet [10] and the Kaggle Dog v/s Cat [13] dataset. The CNN classifiers discussed in this work are trained and evaluated on these datasets to determine which classifier to choose based on the characteristics of the image dataset.
This paper is divided into the following sections: 1. Introduction, 2. CNN architectures used, 3. Datasets used, 4. Experiment design, 5. Results and 6. Discussion and conclusion.
2. CNN Architectures Used
2.1. AlexNet
This CNN architecture is proposed in [3]. The network model consists of 8 layers. The first 5 layers are convolution layers, some of which are followed by max-pooling layers. The last 3 layers are fully connected layers. The activation function used by this network model is the non-saturating ReLU.
2.2. ZF Net
This CNN architecture is proposed in [4]. The network model consists of 8 layers. The first 6 layers are convolution layers, some of which are followed by max-pooling layers. The last 2 layers are fully connected layers. The model uses ReLU as the activation function and cross-entropy loss as the error function, and it is trained with Stochastic Gradient Descent (SGD).
2.3. VGG Net
This architecture is proposed in [5]. The network model consists of 16 layers. The first 13 layers are convolution layers and the last 3 layers are fully connected layers. The model uses convolution filters of size 3×3 and max-pooling windows of size 2×2. A ReLU layer follows each convolution layer, and the network is trained with mini-batch gradient descent.
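The effect of the 3×3 convolutions and 2×2 max-pooling on feature-map size can be worked out with the standard output-size formula. The sketch below assumes stride 1 and padding 1 for the convolutions and stride 2 for the pooling, which are the settings reported for the original VGG networks; the 64×64 input size is an assumption made for illustration, and the helper names are hypothetical.

```python
def conv_output_size(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2, stride=2):
    """Spatial size after max-pooling with the given window and stride."""
    return (size - window) // stride + 1


def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


size = 64                                 # assumed input width/height
after_conv = conv_output_size(size)       # 3x3, padding 1: size is preserved
after_pool = pool_output_size(after_conv) # 2x2, stride 2: size is halved

rf_two = stacked_receptive_field(2)       # two 3x3 layers see a 5x5 region
rf_three = stacked_receptive_field(3)     # three 3x3 layers see a 7x7 region
```

Stacking small filters in this way covers the same region as one larger filter while using fewer weights, which is the design choice that lets the network go deeper.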
2.4. GoogLeNet
This CNN architecture is proposed in [6]. The model is a 22-layer architecture built around an inception module that extracts fine-grained information from the feature maps. The inception module is inspired by the use of Gabor filters [19] at several sizes: it convolves its input at multiple scales in parallel to capture detail at each scale. Unlike fixed Gabor filters, the filters used by the inception module are learned during training.
2.5. Microsoft ResNet
This CNN architecture is proposed in [7]. The model consists of 152 layers. Its architecture is based on identity shortcut connections and residual learning. Instead of learning the complete feature mapping, each block of layers learns only a residual, i.e. the difference between the desired output features and the input of the block. The identity shortcut connection then adds the input of the block back to the learned residual.
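Residual learning can be written as y = F(x) + x, where x is the input of a block, F(x) is the residual learned by its layers and the addition is the identity shortcut. The following sketch illustrates only this arithmetic on plain Python lists; the function names and the two toy residual functions are assumptions for the example and stand in for real convolution layers.

```python
def residual_block(x, residual_fn):
    """Apply an identity shortcut: return F(x) + x element by element."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the residual must have the same shape as the input")
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """Hypothetical layers that have learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def half_residual(x):
    """Hypothetical layers that learned a small correction: F(x) = 0.5 * x."""
    return [0.5 * v for v in x]


features = [1.0, -2.0, 4.0]
unchanged = residual_block(features, zero_residual)   # block acts as identity
corrected = residual_block(features, half_residual)   # input plus correction
```

When the residual is zero the block simply passes its input through, which is why adding more such blocks does not have to degrade what earlier layers have already learned.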
2.6. Dense Net
This network model, proposed in [8], connects each layer of the network to every other layer in a feed-forward manner. For each layer, the feature maps of all preceding layers are taken as input, and the feature maps of the present layer are fed as input to all subsequent layers. This model has several advantages over previous models: it reduces the vanishing gradient problem, strengthens feature propagation, encourages feature reuse and substantially reduces the number of parameters [8]. Because all layers are connected, no layer has to relearn redundant feature maps, which lowers the number of parameters needed. DenseNet layers are narrow (12 filters per layer), so each layer adds only a small set of new feature maps. In traditional feed-forward networks, the long path that information and gradients must travel increases the training time. DenseNet reduces the training time, since each layer has direct access to the gradients from the loss function and to the input image.
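The dense connectivity pattern can be made concrete by counting feature maps. In the sketch below each layer adds 12 new feature maps, as stated above; the number of input channels (16) and the block length (4 layers) are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def dense_layer_input_channels(initial_channels, growth_rate, num_layers):
    """Feature maps received by each layer: the block input plus the
    outputs of all preceding layers in the block."""
    return [initial_channels + growth_rate * i for i in range(num_layers)]


def dense_block_output_channels(initial_channels, growth_rate, num_layers):
    """Feature maps leaving the block after all layers are concatenated."""
    return initial_channels + growth_rate * num_layers


def direct_connections(num_layers):
    """Number of direct connections in a densely connected block of L layers."""
    return num_layers * (num_layers + 1) // 2


# Assumed block: 16 input feature maps, 4 layers, 12 filters per layer.
inputs_per_layer = dense_layer_input_channels(16, 12, 4)
block_output = dense_block_output_channels(16, 12, 4)
connections = direct_connections(4)
```

Each layer only has to produce 12 new maps because everything computed earlier is still available to it through concatenation.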
2.7. Spatial Transformer Networks
This network model is introduced in [9]. It adds a spatial transformer module to conventional neural networks. The transformer module transforms the input image in such a way that the subsequent layers of the network can perform the classification task more easily. In traditional CNNs, the max-pooling layer handles spatial invariance, whereas in this model the spatial transformer acts on each input sample to obtain spatial invariance. The module can transform the input image as well as entire feature maps. The transformations applied include scaling, cropping, rotations and non-rigid deformations. This allows spatial transformer networks [9] not only to select the regions of an image that are most relevant but also to transform them accurately. The network can be trained with backpropagation. Spatial transformers can be incorporated into CNNs to benefit multiple tasks such as classification, co-localization and spatial attention.
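One way to picture the transformation step is the sampling grid of an affine spatial transformer: every output pixel is mapped, through a 2×3 matrix, to the location in the input from which its value is taken. The sketch below covers only this grid computation for the affine case, using normalised coordinates in [-1, 1]; the function name `affine_grid`, the 3×3 grid size and the two example matrices are assumptions for illustration. The network that predicts the matrix and the interpolation that reads pixel values at the sampled locations are omitted.

```python
def affine_grid(theta, height, width):
    """For each output pixel, return the (x, y) source coordinate in [-1, 1]
    obtained by applying the 2x3 affine matrix theta."""
    grid = []
    for row in range(height):
        y_t = -1.0 + 2.0 * row / (height - 1) if height > 1 else 0.0
        grid_row = []
        for col in range(width):
            x_t = -1.0 + 2.0 * col / (width - 1) if width > 1 else 0.0
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            grid_row.append((x_s, y_s))
        grid.append(grid_row)
    return grid


# Identity transform: the output samples the input unchanged.
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
# Scaling by 0.5: the output samples only the central half of the input,
# which corresponds to cropping and zooming in on the centre.
zoom = [[0.5, 0.0, 0.0],
        [0.0, 0.5, 0.0]]

identity_grid = affine_grid(identity, 3, 3)
zoom_grid = affine_grid(zoom, 3, 3)
```

Because the source coordinates are a differentiable function of the matrix entries, the module can be trained with backpropagation together with the rest of the network.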
The detailed configuration of the CNN architectures used in this work is presented in Table 1 in Section 4 of this paper.
3. Datasets Used
The datasets selected to evaluate the performance of the CNN architectures are collections of handwritten digits, images of dogs and cats, and multiple object categories such as car, bicycle, airplane, person and truck. The datasets used in this work are part of open-source data repositories [10, 11, 12, 13].
3.1. Kaggle Dog v/s Cat dataset
This dataset is a collection of 25,000 images of dogs and cats of different breeds, captured in different positions [13]. It is used specifically for the task of object classification.
3.2. CIFAR 10
This dataset is a collection of 10 object categories namely, airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck [11]. This dataset is a collection of 60,000 images belonging to 10 object classes. The dataset is divided into 50,000 training images and 10,000 test images.
3.3. Tiny ImageNet
This dataset is a subset of the ImageNet dataset [10]. It contains 200 object categories, such as person, train, car, truck, bicycle and airplane. Each object category has 500 training images, 50 validation images and 50 test images. The dataset provides object class level annotations, class labels and object bounding boxes.
[Block labels of Figure 1 (experiment design, Section 4): Labelled Multi-Class Data; Feature Representation; Training; Feature Extraction using CNN; Trained Classifier Model; Classification; Predicted Class]
3.4. MNIST
This dataset is a collection of 70,000 images of hand written digits [12]. The dataset is divided into 60,000 training images and 10,000 test images.
4. Experiment Design
To implement the CNN classifiers, the open-source APIs TensorFlow [17] and Keras [18] are used. The programming platform is Python 2.7. The scheme of the experiment design is shown in Fig. 1.
Figure 1. Experiment design (final block: Performance Metric Evaluation of Accuracy / Error Rate / Precision / Recall)
The detailed configuration of CNN based classifiers with their number of layers, activation function used to get output of the node and type of classification layer is presented in Table 1.
Table 1. Classifier model configuration
Model | Number of Layers | Activation Function | Classification Layers
CNN with 2 Conv Layers | 02 | Sigmoid | 2 way Softmax Layer
CNN with 4 Conv Layers | 04 | Sigmoid | 4 way Softmax Layer
AlexNet | 05 Conv Layers and 03 Fully Connected Layers | ReLU | 1000 way Softmax Layer
ZF Net | 06 Conv Layers and 02 Fully Connected Layers | ReLU | C class Softmax Layer
VGG Net | 13 Conv Layers and 3 Fully Connected Layers | ReLU | Softmax Layer
GoogLeNet | 22 Conv Layers with Inception Module | ReLU | 03 Softmax Layers
ResNet | 152 Conv Layers with Residual Connection | ReLU | 1000 way Fully Connected Layer with Softmax Layer at end
Spatial Transformer Networks | 02 Conv Layers, 01 Fully Connected Layer and 01 STN Layer | ReLU | 01 Fully Connected Layer with 01 Spatial Transformation Layer
DenseNet | 03 Conv Layers, 12 Dense Blocks for feature reduction | ReLU | 1000 D Fully Connected Layer and Softmax Layers
In this work, the activation functions used are Sigmoid and ReLU. The Sigmoid activation function is used when the probability of an output is to be predicted; its output lies between 0 and 1. The ReLU activation function returns zero for negative inputs and the input itself otherwise, so its output ranges from 0 to infinity.
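The two activation functions and their output ranges can be illustrated with a short Python sketch. It uses the standard definitions sigmoid(z) = 1 / (1 + e^(-z)) and ReLU(z) = max(0, z); the function names and test inputs are chosen for the example and are not taken from the implementation used in this work.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relu(z):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, z)


sigmoid_values = [sigmoid(z) for z in (-5.0, 0.0, 5.0)]
relu_values = [relu(z) for z in (-3.0, 0.0, 2.5)]
```

The Sigmoid values stay strictly between 0 and 1 and can be read as probabilities, while the ReLU values are unbounded above, which is what "non-saturating" refers to.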
The configuration of the system used for implementing the CNN architectures is shown in Table 2.
Table 2. Configuration of the system
Processor | Intel Core i5 7th Generation
RAM | 12 GB
GPU | NVIDIA MX 130 (2 GB)
5. Results
5.1. Evaluation Parameters
To evaluate the performance of the CNN architectures, objective metrics are employed: accuracy, error rate, precision and recall. These metrics are computed in terms of the correct and incorrect classifications of the object categories present in the standard object detection datasets.
For object classification by a CNN architecture on a given standard object detection dataset, accuracy is the fraction of correctly classified instances (true positives and true negatives) out of the total number of instances examined. The formula of accuracy is:
$$A = \frac{TP + TN}{P + N} \qquad \text{(Eq. 1)}$$
Here, true positives (TP) are the instances of the positive (P) class that are correctly classified, and true negatives (TN) are the instances of the negative (N) class that are correctly classified.
Error rate is the ratio of wrongly classified instances, i.e. false positives (FP) and false negatives (FN), to the total number of instances examined. It is computed as:
$$E = \frac{FP + FN}{P + N} \qquad \text{(Eq. 2)}$$
Here, FP denotes negative-class instances misclassified as positive, and FN denotes positive-class instances misclassified as negative.
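Equations 1 and 2 can be evaluated directly from the four counts, using P = TP + FN and N = TN + FP so that P + N is the total number of examined instances. The sketch below is illustrative only: the function names are hypothetical and the counts are invented for the example, not measured in this work.

```python
def accuracy(tp, tn, fp, fn):
    """Eq. 1: (TP + TN) / (P + N), with P + N = TP + TN + FP + FN."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no instances were examined")
    return (tp + tn) / total


def error_rate(tp, tn, fp, fn):
    """Eq. 2: (FP + FN) / (P + N)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no instances were examined")
    return (fp + fn) / total


# Invented counts for a 100-image binary test set.
tp, tn, fp, fn = 40, 45, 5, 10
acc = accuracy(tp, tn, fp, fn)
err = error_rate(tp, tn, fp, fn)
```

Since every instance is either classified correctly or incorrectly, accuracy and error rate always add up to 1, which is why each accuracy and error rate pair in Table 3 sums to 100%.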
Precision is the ratio of correctly classified positive instances to all instances classified as positive. The formula of precision is:
$$P = \frac{TP}{TP + FP} \qquad \text{(Eq. 3)}$$
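Precision (Eq. 3) and recall (Eq. 4, defined next) are both computed from the true positive count, with different denominators. For datasets with more than two classes, one common approach is to treat each class in turn as the positive class (one-vs-rest). The sketch below follows that approach as an assumption; the paper does not state how the multi-class values were aggregated. The function names, the labels and the convention of returning 0.0 for an empty denominator are likewise choices made for the example.

```python
def precision(tp, fp):
    """Eq. 3: TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. 4: TP / (TP + FN); returns 0.0 when there are no positive instances."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def one_vs_rest_counts(y_true, y_pred, label):
    """Count TP, FP and FN when `label` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


# Invented labels for five test images.
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

dog_tp, dog_fp, dog_fn = one_vs_rest_counts(y_true, y_pred, "dog")
dog_precision = precision(dog_tp, dog_fp)
dog_recall = recall(dog_tp, dog_fn)
```

Averaging the per-class values over all classes would give a single precision and recall figure per classifier; whether that matches the procedure used for Table 3 is not specified in the text.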
Recall is the fraction of positive-class instances in the image dataset that are correctly classified. Recall is measured as:
$$R = \frac{TP}{TP + FN} \qquad \text{(Eq. 4)}$$
5.2. Performance Evaluation
In the first experiment, two plain CNNs are implemented, one with 2 convolution layers and one with 4 convolution layers. In between the convolution layers, 2 max-pooling layers are applied. The activation function used is Sigmoid. The objective metrics evaluated are accuracy, error rate, precision and recall. The results of this experiment indicate that, for a plain convolutional network with Sigmoid activation, the accuracy, precision and recall values increase as the number of convolution layers increases. This experiment is performed on the Kaggle Dog v/s Cat image dataset.
In the second experiment, the performance of the CNN architectures AlexNet, ZF Net, VGG Net, GoogLeNet, Microsoft ResNet, Spatial Transformer Networks and DenseNet is evaluated using the objective metrics (accuracy, error rate, precision and recall). AlexNet and VGG Net are evaluated on the Tiny ImageNet dataset; ZF Net, GoogLeNet and Microsoft ResNet are evaluated on the CIFAR-10 image dataset; and Spatial Transformer Networks and DenseNet are evaluated on the MNIST handwritten digits image dataset.
The performance of CNN architectures based on evaluation of objective metrics is shown in Table 3 and a comparative graph is presented in Fig. 2.
Table 3. Performance evaluation of CNN classifiers
CNN Architecture | Dataset | Accuracy | Error rate | Precision | Recall
CNN with 2 Conv layers | Kaggle Dog vs Cat | 74.41% | 25.59% | 74.92% | 75.88%
CNN with 4 Conv layers | Kaggle Dog vs Cat | 80.11% | 19.89% | 80.62% | 81.58%
AlexNet | Tiny ImageNet | 52.82% | 47.18% | 53.33% | 54.29%
VGG Net | Tiny ImageNet | 74.56% | 25.44% | 75.07% | 76.03%
ZF Net | CIFAR-10 | 75.42% | 24.58% | 75.93% | 76.89%
GoogLeNet | CIFAR-10 | 89.77% | 10.23% | 90.28% | 91.24%
ResNet | CIFAR-10 | 93.71% | 6.29% | 94.22% | 95.18%
Spatial Transformer Networks | MNIST | 98.12% | 1.88% | 98.63% | 99.59%
DenseNet | MNIST | 89.37% | 10.63% | 89.88% | 90.84%
Figure 2. Comparative graph of performance of CNN architectures [chart omitted: recall, precision, error rate and accuracy plotted for each CNN architecture and dataset on an axis from 0% to 120%; the plotted values are those listed in Table 3]
6. Discussion and Conclusion
In this work, the performance of CNN architectures is evaluated on the basis of objective metrics (accuracy, error rate, precision and recall). The results of the first experiment show that a CNN with 4 convolution layers and Sigmoid activation performs better than a CNN with 2 convolution layers on the object classification task for the Kaggle Dog v/s Cat image dataset. The results of the second experiment indicate that, for object classification on the Tiny ImageNet dataset, i.e. on multiple object categories, the VGG network architecture shows higher accuracy, precision and recall than the AlexNet architecture. For object classification on ten object categories, i.e. airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck, the 152-layer Microsoft ResNet with residual connections shows better accuracy, precision and recall than the ZF Net and GoogLeNet architectures. For classification on the handwritten digits dataset, the Spatial Transformer Network architecture shows better accuracy, precision and recall than the DenseNet architecture.
From the results obtained in this work, it can be concluded that the VGG network architecture can be used for real-world classification problems involving multiple object categories. For classifying the ten object categories airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck, the Microsoft ResNet architecture can be used in real-world applications. For digit classification, Spatial Transformer Networks can be used. The advantage of Spatial Transformer Networks is their ability to provide spatial invariance for the images passed to them.
7. References
[1] Fukushima, Kunihiko, “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”, Biological Cybernetics, Volume 36 (4), Pages 193-202. 1980.
[2] LeCun Y., Haffner P., Bottou L. and Bengio Y., “Object Recognition with Gradient-Based Learning”, In: Shape, Contour and Grouping in Computer Vision, Lecture Notes in Computer Science, Volume 1681. Springer, Berlin, Heidelberg. 1999.
[3] Krizhevsky, A., Sutskever, I., and Hinton, G. E., “ImageNet classification with deep convolutional neural networks”, In NIPS, Pages 1106–1114. 2012.
[4] Zeiler, M. D. and Fergus, R., “Visualizing and understanding convolutional networks”. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV. 2014.
[5] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition”, arXiv:1409.1556. 2014.
[6] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., et al., "Going deeper with convolutions", Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
[7] He, Kaiming, et al. "Deep residual learning for image recognition", Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[8] Huang, Gao, et al. "Densely connected convolutional networks", arXiv preprint arXiv:1608.06993. 2016.
[9] Jaderberg, M., Simonyan, K., Zisserman, A. et al., “Spatial Transformer Networks”, arXiv:1506.02025. 2016.
[10] Tiny ImageNet database: https://tiny-imagenet.herokuapp.com/
[11] CIFAR-10 database: https://www.cs.toronto.edu/~kriz/cifar.html
[12] MNIST database: http://yann.lecun.com/exdb/mnist/
[13] Kaggle Dog vs Cat: https://www.kaggle.com/c/dogs-vs-cats/data
[14] Hubel, D. H., & Wiesel, T. N., Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology, 148(1), 574–591. 1959.
[15] Yu, D., Wang, H., Chen, P., & Wei, Z., Mixed pooling for convolutional neural networks. In Proceedings of the 9th International Conference on Rough Sets and Knowledge Technology, Pages 364–375. Berlin: Springer. 2014.
[16] LeCun, Y., Bengio, Y., & Hinton, G., Deep learning. Nature, Vol. 521, Pages. 436–444. 2015.
[17] TensorFlow, https://www.tensorflow.org/
[18] Keras, https://keras.io/
[19] Jain, A., Ratha, N., and Lakshmanan, S., "Object Detection using Gabor Filters", Pattern Recognition, Vol. 30(2), Pages 295-309. 1997.