REAL TIME HAND GESTURES RECOGNITION USING CNN


Avaneet Kumar*1, Anuj Kumar Singh*2, Birendra Kumar*3, Swapnil Kumar*4, Mr. Faisal Shameem (Guide) and Dr. P.K.S Nain (Guide)

Department of Mechanical Engineering, Galgotias University, Greater Noida, Uttar Pradesh, India

ABSTRACT

Gesture recognition plays an important role in communication through sign language. It is a fast-growing domain within computer vision and has attracted significant research due to its wide social impact. One use of gesture recognition is to establish an interaction system between human and machine, so that a person can communicate with the machine without a remote or any other controlling device. It can also be used to tackle the difficulties faced by people with hearing impairment: there is a pressing need for a system that translates sign language into text that can be easily recognized and used in different areas. In this paper, a hand gesture recognition system is developed using a Convolutional Neural Network and the results are discussed.

Keywords: Sign Recognition, Gesture Recognition, Computer Vision, Convolutional Neural Networks.

I. INTRODUCTION

Bobick and Wilson have given a definition of gesture. According to them, a gesture is a motion of the body that is intended to communicate with other agents [3]. Communication is the transmission, sharing and conveyance of knowledge, news, concepts and feelings. Sign language is an important means of non-verbal communication that is gaining impetus and a strong foothold because of its applications in a large number of fields. The most prominent application is its use by differently abled persons such as deaf and mute individuals. With it, they can communicate with people who do not know sign language without the help of a translator or interpreter. Other applications are in the automotive sector, the transit sector, the recreation sector and also in unlocking a smartphone [1]. Sign gesture recognition is done in two ways: static gesture and dynamic gesture [2]. While communicating, the static gesture makes use of hand shapes, whereas the dynamic gesture makes use of the movements of the hand [2]. Our paper focuses on a real-time hand gesture recognition system. Hand gesture recognition is a way of understanding and then classifying the movements of the hands. However, human hands have very complex articulations, and thus a lot of errors can arise. It is therefore hard to recognize hand gestures. Our paper focuses on detecting and recognizing hand gestures using different methods and finding out the accuracy achieved by those methods. We also look at the performance, convenience and issues associated with each method. Currently, many methods and technologies are used for sign and gesture recognition. Among them, the most common are hand glove based analysis, Microsoft Kinect based analysis, Support Vector Machines and Convolutional Neural Networks.
One of the objectives of these methods is to bridge the communication gap between speech and hearing impaired people and other people, and also to achieve the successful and smooth integration of differently abled people into our society. In this paper, we build a real-time communication system using advancements in machine learning. Existing systems either work on a small dataset and achieve stable accuracy, or work on a large dataset with unstable accuracy. We attempt to resolve this problem by applying a Convolutional Neural Network (CNN) to a reasonably large dataset to achieve good and stable accuracy.

II. RELATED WORKS

Gesture detection using an ANN has also been developed [6]. In this system, images were segmented based on skin color. The features selected for the ANN were the changes of pixels through cross sections, the boundary, and scalar descriptors such as the aspect ratio and edge ratio. After these feature vectors were established, they were fed to the ANN for training. The accuracy was around 98%. A statistical method based on Haar-like features to detect gestures was proposed in [7]. In this system, the AdaBoost algorithm was used to learn the model. The whole work was divided into two levels. At the higher level, a stochastic context-free grammar was used to detect gestures. At the lower level, postures were detected. A terminal string was generated for every input according to the grammar. The probability associated with each rule was calculated, and the rule with the highest probability for the given string was selected. The gesture associated with this rule was returned as the gesture of the input. However, manual feature extraction has some drawbacks. The extraction process is tedious, and not every possible feature may be extracted. Moreover, the extraction may become human-biased. Automated feature engineering, by contrast, is neither tedious nor human-biased, and it can capture most of the features. Useful features can be extracted from structured data by a CNN. A move to automated feature engineering was therefore made, and deep learning, i.e. CNN, began to emerge. An algorithm to recognize hand gestures using a 3D CNN was proposed in [8]. In this system, recognition was based on challenging depth and intensity images. The authors also used data augmentation and achieved an accuracy of about 77.5% on the VIVA challenge dataset [8]. Another system that detects gestures using a CNN and is robust to variations in scale, rotation, translation, illumination, noise and background was proposed in [9]. Peruvian Sign Language (LSP) was used as the dataset. The authors achieved 96.20% accuracy on the LSP dataset.

III. CNN

CNN is a class of deep neural networks. CNNs are regularized versions of multilayer perceptrons (MLPs). Multilayer perceptrons are usually fully connected networks, in which each neuron in one layer is connected to all neurons in the next layer. CNNs use relatively little pre-processing compared to other image classification algorithms. A large number of features is required for classification and recognition. Conventional models for pattern recognition cannot process natural data in raw form [4]. A CNN is built from neurons, an idea inspired by the working of the neurons of the animal visual cortex.

The idea is that an image is converted into a 2D array of numbers and given to the computer, which outputs the probability of the image belonging to each class (0.80 for a cat, 0.15 for a dog, 0.05 for a bird, etc.). This is similar to how the human brain works. When we look at a picture of a dog, we can classify it as such if the picture has identifiable features such as paws or four legs. In a similar way, the computer performs image classification by looking for low-level features such as edges and curves and then building up to more abstract concepts through a series of convolutional layers. The computer uses the low-level features obtained at the initial levels to generate high-level features such as paws or eyes, which identify the object.
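As an illustration of how a low-level feature such as an edge can be picked up, the following minimal sketch slides a small filter over a toy image. The 4x4 image, the 2x2 filter values and the function name correlate2d are hypothetical and chosen only for this example; in a trained CNN the filter values are learned from data.

```python
def correlate2d(image, kernel):
    """Slide the kernel over the image without padding (stride 1).

    This is the cross-correlation that CNN layers call "convolution".
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy image (assumption): dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hypothetical filter that responds to a dark-to-bright vertical edge.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = correlate2d(image, vertical_edge)
for row in feature_map:
    print(row)  # each row is [0, 2, 0]: the response peaks at the edge
```

The output is non-zero only in the column where the dark region meets the bright region, which is the sense in which a filter "detects" an edge.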

Pooling layers

A pooling layer reduces the dimensions of the data. It combines the outputs of a cluster of neurons at one layer into a single neuron in the next layer. There are two types: max pooling and average pooling. Max pooling uses the maximum value from the cluster, and average pooling uses the average value from the cluster.
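The two pooling types can be sketched in a few lines. This is a minimal illustration, assuming non-overlapping 2x2 windows (stride equal to the window size); the function name pool2d and the 4x4 feature map are made up for the example.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Pool non-overlapping size x size windows of a 2D feature map."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the cluster of values covered by this window.
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            if mode == "max":
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [6, 0, 4, 0],
]

max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="average")
print(max_pooled)  # [[7, 2], [6, 8]]
print(avg_pooled)  # [[4.0, 1.0], [2.0, 4.0]]
```

In both cases the 4x4 input is reduced to 2x2, so each pooling step halves the height and width of the data.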

Fully Connected layer

This is the final stage of a CNN. Every neuron of one layer is connected to every neuron of the next layer. The feature maps are flattened into a vector, which is passed to the fully connected layer for classification.
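A minimal sketch of this stage follows. The weights, biases and the tiny 2x2 feature map are hypothetical, and the softmax step that turns scores into class probabilities is an assumption of this example; the text above does not state which output function is used.

```python
import math


def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    """Fully connected layer: every input feeds every output neuron."""
    return [sum(w * x for w, x in zip(neuron_weights, inputs)) + b
            for neuron_weights, b in zip(weights, biases)]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (assumed output step)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# One hypothetical 2x2 feature map and a two-class layer.
feature_maps = [[[1, 0], [0, 1]]]
weights = [[1, 0, 0, 1],   # neuron for class 0
           [0, 1, 1, 0]]   # neuron for class 1
biases = [0, 0]

flattened = flatten(feature_maps)
scores = dense(flattened, weights, biases)
probabilities = softmax(scores)
print(flattened, scores, probabilities)
```

Here the flattened vector matches the first neuron's weights, so class 0 receives the larger score and therefore the larger probability.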

Figure: Typical CNN architecture. Source: https://en.wikipedia.org/wiki/Convolutional_neural_network#/media/File:Typical_cnn.png

IV. METHODOLOGY

1) The first step is to create the dataset for the CNN. We captured 600 images of each hand sign using a laptop webcam. The hand signs are left, stop, up, down, back, right, forward and start, shown below in that order. Since a CNN does not operate on strings, each sign is assigned a numerical value: up: 0, down: 1, right: 2, left: 3, forward: 4, backward: 5, start: 6 and stop: 7. All images of the up sign are stored in a folder named 0, and similarly for the rest.
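The labelling scheme of step 1 can be sketched as follows. The dictionary values and the 600 images per sign come from the text; the root folder name "dataset" and the helper class_folder are hypothetical.

```python
import os

# Numerical value assigned to each hand sign (from step 1).
LABELS = {
    "up": 0, "down": 1, "right": 2, "left": 3,
    "forward": 4, "backward": 5, "start": 6, "stop": 7,
}
IMAGES_PER_SIGN = 600

# Reverse lookup, used later to turn a predicted index back into a sign name.
INDEX_TO_SIGN = {index: sign for sign, index in LABELS.items()}


def class_folder(dataset_root, sign):
    """Folder that holds all images of one sign, named after its label."""
    return os.path.join(dataset_root, str(LABELS[sign]))


total_images = IMAGES_PER_SIGN * len(LABELS)
print(class_folder("dataset", "up"))  # folder "0" inside the assumed root
print(total_images)                   # 4800
```

With eight signs and 600 images each, the full dataset holds 4800 images.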

2) The next step is to train the CNN on the created dataset. For training we use the TensorFlow Sequential model. The model consists of the following layers:

> The first layer is a Conv2D layer with 400 filters, kernel size = 4x4 and input image size = 100x100.

> The second layer is an Activation layer, which uses ReLU.

> The third layer is a MaxPooling layer with pool size = 2x2.

> A second Conv2D layer with 200 filters and kernel size = 3x3 follows.

> Then an Activation layer is used.

> Then a MaxPooling layer is used.
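To see how the data shrinks through these layers, the sketch below works out the output shape and parameter count of each one. It assumes a single-channel (grayscale) 100x100 input, stride 1 with no padding for the convolutions, and a pooling stride of 2. These are common defaults but are not stated in the text, so the numbers are illustrative only.

```python
def conv_output(height, width, kernel, filters):
    """Output shape of a convolution with stride 1 and no padding (assumed)."""
    return height - kernel + 1, width - kernel + 1, filters


def pool_output(height, width, channels, pool=2):
    """Output shape of non-overlapping pooling; leftover rows/columns are dropped."""
    return height // pool, width // pool, channels


def conv_params(kernel, in_channels, filters):
    """Trainable weights plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


input_shape = (100, 100, 1)                # grayscale input (assumed 1 channel)
conv1 = conv_output(100, 100, 4, 400)      # (97, 97, 400)
pool1 = pool_output(*conv1)                # (48, 48, 400)
conv2 = conv_output(pool1[0], pool1[1], 3, 200)  # (46, 46, 200)
pool2 = pool_output(*conv2)                # (23, 23, 200)

flattened_size = pool2[0] * pool2[1] * pool2[2]  # 105800
params_conv1 = conv_params(4, 1, 400)            # 6800
params_conv2 = conv_params(3, 400, 200)          # 720200

for name, shape in [("conv1", conv1), ("pool1", pool1),
                    ("conv2", conv2), ("pool2", pool2)]:
    print(name, shape)
print(flattened_size, params_conv1, params_conv2)
```

Under these assumptions most of the convolutional parameters sit in the second Conv2D layer, because each of its 200 filters spans all 400 input channels.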

3) The last step is to read images from the webcam and classify them. First, an image is captured and converted to grayscale. The image is then resized to 100x100 and passed to the model for classification. Predictions are generated for the input image and sorted in descending order. The prediction with the highest score is displayed on the screen.
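The final ranking step can be sketched as below. The sign order follows the label values from step 1; the probability values are made up for illustration and do not come from the trained model, and the helper name rank_predictions is hypothetical.

```python
# Signs listed by their numerical label (index 0 = up, ..., index 7 = stop).
SIGNS = ["up", "down", "right", "left", "forward", "backward", "start", "stop"]


def rank_predictions(probabilities):
    """Pair each sign with its score and sort in descending order of score."""
    pairs = zip(SIGNS, probabilities)
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical model output for one webcam frame.
probabilities = [0.05, 0.02, 0.10, 0.60, 0.08, 0.05, 0.03, 0.07]

ranked = rank_predictions(probabilities)
top_sign, top_score = ranked[0]
print(top_sign, top_score)  # left 0.6
```

Only the first entry of the sorted list is shown to the user; the remaining entries could be used to report runner-up gestures.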

V. RESULT

In this paper we have discussed a method of recognizing hand gestures using a CNN. Although the accuracy is not very high, it can be improved further. The main task in controlling a drone is to identify the gesture provided by the user. The dataset plays an important role in image classification and identification: the more variation in the dataset and the larger the number of images, the higher the accuracy. A larger dataset also requires more time to process, a problem that a PC with a more powerful GPU and more RAM can solve. Our dataset contains only 600 images of each hand sign because our PC is not powerful enough to process a larger dataset; even if the program ran, it would take too long. Nevertheless, we obtained an accuracy of 59.50%, which is reasonable for the current dataset.

[3] A. D. Wilson and A. F. Bobick, “Learning visual behavior for gesture analysis,” in Proceedings of International Symposium on Computer Vision-ISCV. IEEE, 1995, pp. 229–234.

[4] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[5] E. Stergiopoulou and N. Papamarkos, “Hand gesture recognition using a neural network shape fitting technique,” Engineering Applications of Artificial Intelligence, vol. 22, no. 8, pp. 1141–1158, 2009.

[6] T.-N. Nguyen, H.-H. Huynh, and J. Meunier, “Static hand gesture recognition using artificial neural network,” Journal of Image and Graphics, vol. 1, no. 1, pp. 34–38, 2013.

[7] Q. Chen, N. D. Georganas, and E. M. Petriu, “Hand gesture recognition using haar-like features and a stochastic context-free grammar,” IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 8, pp. 1562–1571, 2008.

[8] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3d convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 1–7.