Object Classification: Performance Evaluation of CNN based Classifiers using Standard Object Detection Datasets


ISSN: 2005-4238 IJAST 135


Akhil Kumar¹, Arvind Kalia²

¹,²Himachal Pradesh University, Shimla, India

Abstract

Object classification is the process of recognizing and classifying objects in a given image. At present, the classification of objects in images is performed using Convolutional Neural Networks (CNNs). CNNs can extract features from a given image and, using a Softmax layer, perform the classification task. In recent years, several CNN based classifiers have been introduced. CNN architectures such as AlexNet, VGG Net, ZF Net, GoogLeNet (Google Inception Net), Microsoft ResNet, DenseNet and Spatial Transformer Networks are used as image classifiers. In this work, the authors use CNN based image classifiers to perform the classification task on standard image datasets: Kaggle Dog v/s Cat, MNIST, CIFAR-10 and Tiny ImageNet. The authors present a comparison of CNN based image classifiers evaluated on different object detection datasets by measuring their accuracy, error rate, precision and recall. The results provide insight for selecting a particular CNN based image classifier on the basis of image dataset characteristics.

Keywords: Object Classification, CNNs, Object Detection Dataset

1. Introduction

Object classification is a classification problem in the area of image processing and object detection. Its goal is to assign each object in a given set to the category it belongs to. In previous studies, the solutions proposed for object classification fall into two approaches. The first approach is based on traditional methods and the second on evolutionary methods such as deep learning. The traditional approach uses shape, motion, color and texture as features to classify an object in a given image. Its limitation is that it applies only to static imagery, objects with few or no deformations, and small datasets. The evolutionary approach employs methods such as Convolutional Neural Networks (CNNs), which are based on gradient based learning. CNN based methods can perform classification on large datasets, on images with object deformations and over multi-class object categories. A CNN is made up of several layers; each layer extracts features of the object from the image and transforms them so that objects can be classified into object classes. In recent years, several CNN based classifiers have been proposed: AlexNet [3], ZF Net [4], VGG Net [5], GoogLeNet [6], Microsoft ResNet [7], DenseNet [8] and Spatial Transformer Networks [9]. These classifiers differ in the depth of their convolution layers, max-pooling layers and classification layers, and they act as backbone architectures in present-day object detection methods.

The basic idea of CNNs is based on neural networks and gradient based learning [1, 2]. CNNs are made up of neurons with learnable weights and biases. Each neuron takes its inputs, computes a weighted sum of them (plus a bias) and passes the result through an activation function to produce its output. The input to a CNN is a multi-channel image. The activation functions used by CNNs are Sigmoid, Tanh and ReLU.
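The neuron computation described above can be illustrated with a minimal Python sketch. The input values, weights and bias below are hypothetical and are not taken from the experiments in this paper; the function name `neuron_output` is likewise only illustrative.

```python
import math

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical example values (not taken from the paper).
pixel_values = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

# Pre-activation: z = 0.2 - 0.3 - 0.4 + 0.1 = -0.4
tanh_out = neuron_output(pixel_values, weights, bias, math.tanh)
relu_out = neuron_output(pixel_values, weights, bias, lambda z: max(0.0, z))
```

Passing the activation as a function keeps the weighted-sum step separate from the choice of Sigmoid, Tanh or ReLU.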

In this work, CNN based classifiers are evaluated on benchmark image datasets: the MNIST [12] handwritten digit dataset, the CIFAR-10 [11] multi-object category dataset, Tiny ImageNet [10] and the Kaggle Dog v/s Cat [13] dataset. The CNN classifiers discussed in this work are trained and evaluated on these datasets in order to determine which CNN classifier to choose based on the characteristics of the image dataset.

This paper is divided into the following sections: 1. Introduction, 2. CNN architectures used, 3. Datasets used, 4. Experiment design, 5. Results and 6. Discussion and conclusion.

2. CNN Architectures Used

2.1. AlexNet

This CNN architecture was proposed in [3]. The network model consists of 8 layers. The first 5 layers are convolution layers, some of which are followed by max-pooling layers. The last 3 layers are fully connected layers. The activation function used by this network model is the non-saturating ReLU.

2.2. ZF Net

This CNN architecture was proposed in [4]. The network model consists of 8 layers. The first 6 layers are convolution layers, some of which are followed by max-pooling layers. The last 2 layers are fully connected layers. The network uses ReLU as the activation function and cross-entropy loss as the error function, and it is trained with Stochastic Gradient Descent (SGD).

2.3. VGG Net

This architecture was proposed in [5]. The network model consists of 16 layers. The first 13 layers are convolution layers and the last 3 layers are fully connected layers. The network uses convolution filters of size 3×3 and max-pooling windows of size 2×2. A ReLU layer follows each convolution layer, and the network is trained with mini-batch gradient descent.
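The building blocks named above (a 3×3 convolution, a ReLU and a 2×2 max-pooling step) can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions: a single-channel image, stride 1 and no padding for the convolution, and a hand-picked kernel instead of learned weights. As in most deep learning libraries, the "convolution" here is computed without flipping the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2-D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu_map(feature_map):
    """Apply ReLU to every element of a feature map."""
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max pooling; odd trailing rows/columns are dropped."""
    h = len(feature_map) // 2 * 2
    w = len(feature_map[0]) // 2 * 2
    return [
        [max(feature_map[i][j], feature_map[i][j + 1],
             feature_map[i + 1][j], feature_map[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]

# Hypothetical 6x6 image whose intensity grows from left to right,
# and a hand-picked 3x3 kernel that responds to that horizontal gradient.
image = [[float(c) for c in range(6)] for _ in range(6)]
kernel = [[-1.0, 0.0, 1.0]] * 3

features = relu_map(conv2d_valid(image, kernel))   # 4x4 feature map
pooled = max_pool_2x2(features)                    # 2x2 after pooling
```

Without padding, each 3×3 convolution shrinks the feature map by two pixels per dimension, and each 2×2 pooling step halves it.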

2.4. GoogLeNet

This CNN architecture was proposed in [6]. It is a 22-layer architecture with an inception module that extracts fine-grained information from the feature maps. The inception module in this architecture is based on the Gabor filter [19], which convolves images at multiple scales to capture fine detail. The filters used by the inception module are learned during training.

2.5. Microsoft ResNet

This CNN architecture was proposed in [7]. The model consists of 152 layers. Its architecture is based on identity shortcut connections and residual learning. Instead of learning the full feature mapping directly, each block of layers learns only a residual. The residual is the difference between the desired output of the block and the input of the block. The identity shortcut connection adds the input back to the learned residual to form the output of the block.
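A minimal sketch of the residual idea, assuming a block whose stacked layers are represented by an arbitrary function `residual_fn` (a hypothetical stand-in, not the actual ResNet layers): the block output is the learned residual plus the unchanged input.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where F is the residual computed by the stacked layers."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("identity shortcut needs matching shapes")
    return [f + xi for f, xi in zip(fx, x)]

# Hypothetical stand-in for the stacked layers: a fixed element-wise function.
def toy_layers(x):
    return [0.1 * xi for xi in x]

x = [1.0, 2.0, -3.0]
y = residual_block(x, toy_layers)

# If the stacked layers output zeros, the block reduces to the identity mapping.
identity = residual_block(x, lambda v: [0.0] * len(v))
```

The last line shows why very deep stacks remain trainable in this formulation: a block that has learned nothing still passes its input through unchanged.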


2.6. Dense Net

This network model is built on dense connectivity: each layer of the network is connected to every subsequent layer in a feed-forward manner. Each layer takes the feature maps of all preceding layers as input, and its own feature maps are fed as input to all subsequent layers. This model has several advantages over previous models: it reduces the vanishing gradient problem, strengthens feature propagation, encourages feature reuse and substantially reduces the number of parameters [8]. By connecting all the layers in a feed-forward manner, DenseNet [8] reduces the number of parameters required to learn the feature maps, because redundant feature maps do not have to be relearned. DenseNet layers are narrow (for example, 12 filters per layer), so each layer adds only a small set of new feature maps. In traditional feed-forward networks, the long path that information and gradients must travel increases the training time. DenseNet reduces the training time because each layer has direct access to the gradients from the loss function and to the input image.
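The dense connectivity pattern can be sketched as follows. Feature maps are modelled as flat lists of channel values, the 12 filters per layer come from the text above, and the 16 initial channels and the toy layer (which simply averages its inputs) are hypothetical choices for illustration only.

```python
GROWTH_RATE = 12  # filters added by each layer, as mentioned in the text

def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Input channel count of each layer when every layer sees all earlier outputs."""
    return [initial_channels + growth_rate * i for i in range(num_layers)]

def dense_block_forward(x, layers):
    """Each layer receives the concatenation of the block input and all earlier outputs."""
    features = [x]
    for layer in layers:
        concatenated = [v for fmap in features for v in fmap]
        features.append(layer(concatenated))
    return [v for fmap in features for v in fmap]

seen_input_sizes = []  # records how many channels each toy layer received

def make_toy_layer(growth_rate):
    # Hypothetical layer: emits `growth_rate` channels, each the mean of its inputs.
    def layer(channels):
        seen_input_sizes.append(len(channels))
        mean = sum(channels) / len(channels)
        return [mean] * growth_rate
    return layer

block_input = [1.0] * 16  # 16 input channels (assumed value)
layers = [make_toy_layer(GROWTH_RATE) for _ in range(4)]
block_output = dense_block_forward(block_input, layers)
```

With these assumed sizes, the four layers receive 16, 28, 40 and 52 channels, and the block emits 16 + 4 × 12 = 64 channels, even though each layer itself only produces 12.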

2.7. Spatial Transformer Networks

This network model was introduced in [9]; it adds a spatial transformer module to conventional neural networks. The transformer module transforms the input image so that the subsequent layers of the network can perform the classification task more easily. In traditional CNNs, the max-pooling layer provides spatial invariance, whereas in this model the spatial transformer acts on each input sample to achieve spatial invariance. The module can transform the input image and also entire feature maps. The transformations applied include scaling, cropping, rotation and non-rigid deformations. This allows spatial transformer networks [9] not only to select the most relevant regions of an image but also to transform them to a canonical pose that simplifies recognition in the following layers. This network model can be trained with backpropagation. Spatial transformers can be incorporated into CNNs to benefit multiple tasks such as classification, co-localization and spatial attention.
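The transformation step can be illustrated with a small sketch of an affine sampling grid followed by bilinear sampling. This is a simplified illustration, not the implementation used in this work: in [9] the transformation parameters are predicted by a localization network, whereas here the 2×3 matrix `theta` is fixed by hand, the image is single-channel, and the function names are hypothetical.

```python
import math

def affine_grid(theta, height, width):
    """Map each output pixel to source coordinates in [-1, 1] via a 2x3 matrix."""
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid

def bilinear_sample(image, grid):
    """Sample the image at the grid locations; points outside the image read as 0."""
    h, w = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < h and 0 <= c < w else 0.0

    out = []
    for grid_row in grid:
        out_row = []
        for xs, ys in grid_row:
            # Convert normalized coordinates back to fractional pixel indices.
            col = (xs + 1.0) * (w - 1) / 2.0
            row = (ys + 1.0) * (h - 1) / 2.0
            c0, r0 = math.floor(col), math.floor(row)
            dc, dr = col - c0, row - r0
            value = (pixel(r0, c0) * (1 - dr) * (1 - dc)
                     + pixel(r0, c0 + 1) * (1 - dr) * dc
                     + pixel(r0 + 1, c0) * dr * (1 - dc)
                     + pixel(r0 + 1, c0 + 1) * dr * dc)
            out_row.append(value)
        out.append(out_row)
    return out

# Hypothetical 3x3 image and two hand-chosen transformations.
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
identity_theta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
flip_theta = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # mirror left-right

same = bilinear_sample(image, affine_grid(identity_theta, 3, 3))
flipped = bilinear_sample(image, affine_grid(flip_theta, 3, 3))
```

Because the sampled value is a weighted sum of neighbouring pixels, the output varies smoothly with `theta`, which is what allows such a module to be trained with backpropagation.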

The detailed configuration of the CNN architectures used in this work is presented in Table 1 in Section 4 of this paper.

3. Datasets Used

The datasets selected to evaluate the performance of the CNN architectures are collections of handwritten digits, images of dogs and cats, and multi-object categories such as car, bicycle, airplane, person and truck. The datasets used in this work are available from open source data repositories [10, 11, 12, 13].

3.1. Kaggle Dog v/s Cat dataset

This dataset is a collection of images of dogs and cats [13] and is used specifically for the task of object classification. It contains 25,000 images of different breeds of dogs and cats, captured in different positions.

3.2. CIFAR-10

This dataset covers 10 object categories, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck [11]. It contains 60,000 images belonging to these 10 object classes, divided into 50,000 training images and 10,000 test images.

3.3. Tiny ImageNet


This dataset is a subset of the ImageNet dataset [10] and contains 200 object categories, such as person, train, car, truck, bicycle and airplane. Each object category has 500 training images, 50 validation images and 50 test images. The dataset provides object class level annotations, class labels and object bounding boxes.
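The per-class counts above imply the overall split sizes, which can be checked with a few lines of Python (the variable names are illustrative only):

```python
NUM_CLASSES = 200
IMAGES_PER_CLASS = {"train": 500, "validation": 50, "test": 50}

# Total images in each split = number of classes x images per class.
split_sizes = {split: NUM_CLASSES * n for split, n in IMAGES_PER_CLASS.items()}
total_images = sum(split_sizes.values())
```

This gives 100,000 training images, 10,000 validation images and 10,000 test images, or 120,000 images in total.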

3.4. MNIST

This dataset is a collection of 70,000 images of handwritten digits [12]. The dataset is divided into 60,000 training images and 10,000 test images.

4. Experiment Design

To implement the CNN classifiers, the open source APIs TensorFlow [17] and Keras [18] are used. The programming platform used is Python 2.7. The scheme of the experiment design is shown in Fig. 1.

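Figure 1. Experiment design (blocks shown: Labelled Multi-Class Data, Feature Representation, Training, Feature Extraction using CNN, Trained Classifier Model, Classification, Predicted Class, Performance Metric Evaluation (Accuracy/Error Rate/Precision/Recall))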

The detailed configuration of the CNN based classifiers, with their number of layers, the activation function used to compute the output of each node and the type of classification layer, is presented in Table 1.

Table 1. Classifier model configuration

| Model | Number of Layers | Activation Function | Classification Layers |
|---|---|---|---|
| CNN with 2 Conv Layers | 02 | Sigmoid | 2 way Softmax Layer |
| CNN with 4 Conv Layers | 04 | Sigmoid | 4 way Softmax Layer |
| AlexNet | 05 Conv Layers and 03 Fully Connected Layers | ReLU | 1000 way Softmax Layer |
| ZF Net | […] Fully Connected Layers | ReLU | C class Softmax Layer |
| VGG Net | […] Fully Connected Layers | ReLU | Softmax Layer |
| GoogLeNet | 22 Conv Layers with Inception Module | ReLU | 03 Softmax Layers |
| ResNet | 152 Conv Layers with Residual Connection | ReLU | 1000 way Fully Connected Layer with Softmax Layer at end |
| Spatial Transformer Networks | 02 Conv Layers, 01 Fully Connected Layer and 01 STN Layer | ReLU | 01 Fully Connected Layer with 01 Spatial Transformation Layer |
| DenseNet | 03 Conv Layers, 12 Dense Blocks for feature reduction | ReLU | 1000 D Fully Connected Layer and Softmax Layers |

In this work, the activation functions used are Sigmoid and ReLU. The Sigmoid activation function is used when the probability of an output is to be predicted; its output lies between 0 and 1. The ReLU activation function has an output range from 0 to infinity.
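The two output ranges can be seen in a minimal sketch using the standard definitions of the two functions, sigmoid(z) = 1 / (1 + e^(-z)) and ReLU(z) = max(0, z). The sample inputs are arbitrary.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)); output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit max(0, z); output lies in [0, infinity)."""
    return max(0.0, z)

sample_inputs = [-5.0, -1.0, 0.0, 1.0, 5.0]
sigmoid_outputs = [sigmoid(z) for z in sample_inputs]
relu_outputs = [relu(z) for z in sample_inputs]
```

Sigmoid squashes every input into a value that can be read as a probability, while ReLU passes positive inputs through unchanged and maps negative inputs to zero.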

The configuration of the system used for implementing the CNN architectures is shown in Table 2.

Table 2. Configuration of the system

| Component | Specification |
|---|---|
| Processor | Intel Core i5, 7th Generation |
| RAM | 12 GB |
| GPU | NVIDIA MX 130 (2 GB) |

5. Results

5.1. Evaluation Parameters

To evaluate the performance of the CNN architectures, objective metrics are employed: accuracy, error rate, precision and recall. These metrics are computed from the correct and incorrect classifications of the object categories present in the standard object detection datasets.


For object classification by a CNN architecture on a given standard object detection dataset, accuracy is the fraction of correctly classified objects (True Positives and True Negatives) out of the total number of objects (P + N) in the dataset. The formula of accuracy is:

$$A = \frac{TP + TN}{P + N} \qquad \text{(Eq. 1)}$$

Here, true positives (TP) are the correctly classified objects of the positive (P) class, and true negatives (TN) are the correctly classified objects of the negative (N) class.

Error rate is an objective metric that represents the ratio of wrongly classified objects, i.e. False Positives (FP) and False Negatives (FN), to the total number of examined objects. Error rate is computed as:

$$E = \frac{FP + FN}{P + N} \qquad \text{(Eq. 2)}$$

Here, FP is the number of N class objects misclassified as the positive class, and FN is the number of P class objects misclassified as the negative class.

Precision is the ratio of correctly classified positive objects (TP) to all objects classified as positive (TP + FP). The formula of precision (denoted P, not to be confused with the positive class count P in Eq. 1) is:

$$P = \frac{TP}{TP + FP} \qquad \text{(Eq. 3)}$$

Recall is the fraction of the objects that actually belong to the positive class (TP + FN) that are correctly classified as positive (TP). Recall is measured as:

$$R = \frac{TP}{TP + FN} \qquad \text{(Eq. 4)}$$

5.2. Performance Evaluation
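As a concrete illustration of Eqs. 1 to 4 of Section 5.1, the four metrics can be computed from binary confusion counts as sketched below. The sketch assumes Python 3 division semantics (the experiments in this paper used Python 2.7, where `/` on integers truncates), the counts are hypothetical and are not results from this paper, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, error rate, precision and recall from binary confusion counts."""
    total = tp + tn + fp + fn  # equals P + N in Eqs. 1 and 2
    if total == 0:
        raise ValueError("no samples")
    return {
        "accuracy": (tp + tn) / total,                       # Eq. 1
        "error_rate": (fp + fn) / total,                     # Eq. 2
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,   # Eq. 3
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,      # Eq. 4
    }

# Hypothetical counts for a two-class test set (not results from the paper).
metrics = classification_metrics(tp=40, tn=35, fp=15, fn=10)
```

Since every sample is either correctly or incorrectly classified, accuracy and error rate always sum to 1, which is also visible in each row of Table 3.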

In this work, first, two plain CNNs are implemented, one with 2 convolution layers and one with 4 convolution layers. In between the convolution layers, 2 max-pooling layers are applied. The activation function used is Sigmoid. The objective metrics evaluated are accuracy, error rate, precision and recall. The results of this experiment indicate that, for a plain convolutional network with the Sigmoid activation function, accuracy, precision and recall increase as the number of convolution layers increases. This experiment is performed on the Kaggle Dog v/s Cat image dataset.

In the second experiment, the performance of the CNN architectures AlexNet, ZF Net, VGG Net, GoogLeNet, Microsoft ResNet, Spatial Transformer Networks and DenseNet is evaluated using the same objective metrics (accuracy, error rate, precision and recall). AlexNet and VGG Net are evaluated on the Tiny ImageNet dataset; ZF Net, GoogLeNet and Microsoft ResNet are evaluated on the CIFAR-10 image dataset; and Spatial Transformer Networks and DenseNet are evaluated on the MNIST handwritten digits image dataset.

The performance of CNN architectures based on evaluation of objective metrics is shown in Table 3 and a comparative graph is presented in Fig. 2.


Table 3. Performance evaluation of CNN classifiers

| CNN Architecture | Dataset | Accuracy | Error rate | Precision | Recall |
|---|---|---|---|---|---|
| CNN with 2 Conv layers | Kaggle- Dog vs Cat | 74.41% | 25.59% | 74.92% | 75.88% |
| CNN with 4 Conv layers | Kaggle- Dog vs Cat | 80.11% | 19.89% | 80.62% | 81.58% |
| AlexNet | Tiny ImageNet | 52.82% | 47.18% | 53.33% | 54.29% |
| VGG Net | Tiny ImageNet | 74.56% | 25.44% | 75.07% | 76.03% |
| ZF Net | CIFAR-10 | 75.42% | 24.58% | 75.93% | 76.89% |
| GoogLeNet | CIFAR-10 | 89.77% | 10.23% | 90.28% | 91.24% |
| ResNet | CIFAR-10 | 93.71% | 6.29% | 94.22% | 95.18% |
| Spatial Transformer Networks | MNIST | 98.12% | 1.88% | 98.63% | 99.59% |
| DenseNet | MNIST | 89.37% | 10.63% | 89.88% | 90.84% |

Figure 2. Comparative graph of performance of CNN architectures (chart of recall, precision, error rate and accuracy for each CNN architecture and dataset, on an axis from 0% to 120%; the plotted values are those listed in Table 3)


6. Discussion and Conclusion

In this work, the performance of CNN architectures is evaluated on the basis of objective metrics (accuracy, error rate, precision and recall). The results of the first experiment show that a CNN with 4 convolution layers and the Sigmoid activation function performs better than a CNN with 2 convolution layers for the object classification task on the Kaggle Dog v/s Cat image dataset (80.11% versus 74.41% accuracy). The results of the second experiment indicate that, for object classification on the Tiny ImageNet dataset, i.e. on multiple object categories, the VGG Net architecture shows higher accuracy, precision and recall than the AlexNet architecture. For object classification on the ten CIFAR-10 object categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck), the 152-layer Microsoft ResNet with residual connections shows better accuracy, precision and recall than the ZF Net and GoogLeNet architectures. For classification on the MNIST handwritten digits dataset, the Spatial Transformer Network architecture shows better accuracy, precision and recall than the DenseNet architecture.

From the results obtained in this work, it can be concluded that, for multi-object category classification, the VGG Net architecture can be used for real-world object classification problems. For classifying the ten object categories airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck, the Microsoft ResNet architecture can be used in real-world applications. For digit classification, Spatial Transformer Networks can be used. The advantage of Spatial Transformer Networks is their ability to provide spatial invariance for the images passed to them.
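The per-dataset comparison behind these conclusions can be reproduced from the accuracy column of Table 3 with a short sketch. The selection rule used here (highest accuracy within each dataset) is a simplification for illustration, and the function name is hypothetical.

```python
# Accuracy values (percent) copied from Table 3.
TABLE_3_ACCURACY = [
    ("CNN with 2 Conv layers", "Kaggle Dog vs Cat", 74.41),
    ("CNN with 4 Conv layers", "Kaggle Dog vs Cat", 80.11),
    ("AlexNet", "Tiny ImageNet", 52.82),
    ("VGG Net", "Tiny ImageNet", 74.56),
    ("ZF Net", "CIFAR-10", 75.42),
    ("GoogLeNet", "CIFAR-10", 89.77),
    ("ResNet", "CIFAR-10", 93.71),
    ("Spatial Transformer Networks", "MNIST", 98.12),
    ("DenseNet", "MNIST", 89.37),
]

def best_per_dataset(rows):
    """Return {dataset: (model, accuracy)} for the most accurate model per dataset."""
    best = {}
    for model, dataset, accuracy in rows:
        if dataset not in best or accuracy > best[dataset][1]:
            best[dataset] = (model, accuracy)
    return best

best = best_per_dataset(TABLE_3_ACCURACY)
for dataset, (model, accuracy) in best.items():
    print("{}: {} ({:.2f}%)".format(dataset, model, accuracy))
```

Note that the comparison is only made within a dataset, since each architecture was evaluated on a single dataset in this work.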

7. References

[1] Fukushima, Kunihiko, “Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position”, Biological Cybernetics, Volume 36 (4), Pages 193-202. 1980.

[2] LeCun Y., Haffner P., Bottou L. and Bengio Y., “Object Recognition with Gradient-Based Learning”, In: Shape, Contour and Grouping in Computer Vision, Lecture Notes in Computer Science, Volume 1681, Springer, Berlin, Heidelberg. 1999.

[3] Krizhevsky, A., Sutskever, I., and Hinton, G. E., “ImageNet classification with deep convolutional neural networks”, In NIPS, Pages 1106–1114. 2012.

[4] Zeiler, M. D. and Fergus, R., “Visualizing and understanding convolutional networks”. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV. 2014.

[5] Simonyan, K. and Zisserman, A., “Very deep convolutional networks for large-scale image recognition”, arXiv:1409.1556. 2014.

[6] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., et al., "Going deeper with convolutions", Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

[7] He, Kaiming, et al. "Deep residual learning for image recognition", Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[8] Huang, Gao, et al. "Densely connected convolutional networks", arXiv preprint arXiv:1608.06993. 2016.

[9] Jaderberg, M., Simonyan, K., Zisserman, A. et al., “Spatial Transformer Networks”, arXiv:1506.02025. 2016.

[10] Tiny ImageNet database: https://tiny-imagenet.herokuapp.com/

[11] CIFAR-10 database: https://www.cs.toronto.edu/~kriz/cifar.html

[12] MNIST database: http://yann.lecun.com/exdb/mnist/

[13] Kaggle Dog vs Cat: https://www.kaggle.com/c/dogs-vs-cats/data

[14] Hubel, D. H., & Wiesel, T. N., Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology, 148(1), 574–591. 1959.


[15] Yu, D., Wang, H., Chen, P., & Wei, Z., Mixed pooling for convolutional neural networks. In Proceedings of the 9th International Conference on Rough Sets and Knowledge Technology, Pages 364–375. Berlin: Springer. 2014.

[16] LeCun, Y., Bengio, Y., & Hinton, G., Deep learning. Nature, Vol. 521, Pages. 436–444. 2015.

[17] TensorFlow, https://www.tensorflow.org/

[18] Keras, https://keras.io/

[19] Jain, A., Ratha, N., and Lakshmanan, S., "Object Detection using Gabor Filters", Pattern Recognition, Vol. 30(2), Pages 295-309. 1997.