Chapter 2 Background and Related Studies
2.3 Deep learning
2.3.2 Deep Learning Algorithms
Commonly, deep learning algorithms can be divided into four groups based on the elementary method that they apply (Figure 2.7): auto-encoder, Restricted Boltzmann Machine (RBM), sparse coding, and Convolutional Neural Networks (CNN).
Figure 2.7 Category and Representative Examples of Deep Learning Algorithms
The auto-encoder is often implemented for learning efficient encodings through reconstructing its own inputs rather than predicting a target value for the provided input (Liou et. al., 2014). A typical auto-encoder contains an encoder as well as a decoder to accomplish the reconstruction of inputs. A deep auto-encoder is formed by a series of typical auto-encoders with deterministic network architectures, in which the code learnt from the previous auto-encoder is transferred to the next auto-encoder. Trained with variants of back-propagation, the deep auto- encoder can efficiently obtain more discriminative and characteristic features than the typical auto- encoder (Zhang et. al., 2014). However, deep auto-encoder sometimes can be ineffective especially when first few layers of this model yield errors and deliver them to posterior layers (Guo et. al., 2016).
Different from auto-encoders, which have deterministic network architectures, a RBM is a generative stochastic directionless graphical model designed by Hinton (1986). A classic RBM has a visible layer and a hidden layer; there is no connection within the hidden layer or the input layer (Zhu et. al., 2017). Although the feature representation ability of a single RBM is limited,
efficient deep models such as Deep Belief Networks (DBN) can be composed using RBMs as learning modules. A DBN is a probabilistic generative model applying a greedy learning approach. Nonetheless, as training a DBN involves the training of numerous RBMs, the implementation of the DBN model is computationally costly (Bengio et. al., 2013).
The sparse coding aims to describe the input data through learning a series of elementary functions (Olshausen and Field, 1997). This model has various benefits. Firstly, since the sparse coding utilizes multiple bases, it can positively restructure the descriptor and establish the relationships between similar descriptors identified by sharing bases. Also, noticeable characteristics of the data can be well described due to the sparsity of the sparse coding. Furthermore, data with sparse features can be more linearly separated.
The CNN, the most famous and commonly used deep learning method especially in the field of computer vision, is composed by convolutional layers, pooling layers, and fully connected layers. The CNN has a hierarchical architecture where convolutional layers alternate with pooling layers followed by fully connected layers. The convolutional layer is the main calculating part of a CNN, utilizing numerous kernels to convolve both the input image and the intermediate feature maps to produce multiple feature maps (Guo et. al., 2016). The convolution operation of CNN benefits deep learning process considerably especially in computer vision domains. It reduces the number of parameters due to the weight sharing mechanism, improves the understanding of correlations among neighbouring pixels, and does not change the location of the object (Zeiler, 2013). The pooling layer can be simply regarded to a down-sampling process which combines the outputs of neuron clusters at one layer into a single neuron in the next layer (Ciresan et. al., 2011). To gradually diminish the spatial size of the representation, decrease the number of parameters and calculation, and consequently control overfitting, a pooling layer is usually inserted in- between successive convolutional layers. The fully-connected layer transforms the 2D feature maps into a one-dimensional (1D) feature vector which is similar to a layer in the traditional neural network with about 90% of the parameters in the CNN (Guo et. al., 2016). The vector can be considered as a feature vector for further processing (Girshick et. al., 2014). The training process of CNN consists of forward steps and backward steps. The forward step aims to generate feature maps in each layer based on the current parameters such as weights and bias. The prediction output of this forward process and the given ground truth labels are utilized to calculate the loss cost. Then, a backward step is applied to calculate the gradient of each parameter. These gradients are
used to update all the parameters such as weights and bias in each layer. With these updated parameters the system can proceed to the next forward calculation. The circulation can be stopped when the loss cost of the model or the number of iterations of the forward and backward stages reaches a specified threshold.
Figure 2.8, established by Guo et. al. (2016), summarized the benefits and drawbacks in terms of various properties. To be more specific, ׳Generalization׳ indicates whether the method has fine performance in various media and applications. ‘Real-time’ emphasizes the efficiency of the approach. ‘Invariance’ evaluates if the method has robustness in transformation such as scale, rotation, and translation (Guo et. al., 2016). In conclusion, the CNN performs best in automatic feature learning; the auto-encoder and the sparse coding are more efficient in training especially when the training datasets are small; the sparse coding is more suitable for biological studies and more invariant towards transformation.
Figure 2.8 Comparisons among Four Categories of Deep Learning Algorithms.