
FEATURE ENCODING

5.2.1.1 THE HINT OF THE NEURAL NETWORK

The building blocks of a neural network are neurons, and their operation was described in detail in Section 2.3.3. However, a brief overview is presented here to keep the discussion self-contained. The mathematical expression of a neuron is given in (5.1).

$$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right) \qquad (5.1)$$

where $y$ is the neuron’s single-valued output, $x_i$ are the single-valued inputs, $w_i$ are the weights associated with each input, $b$ is a scalar called the bias, and $f(\cdot)$ is the nonlinear squash function that maps its input into a constrained range. If we consider the neuron’s input vector $\mathbf{x}$ to be a single point in $\mathbb{R}^n$, then the sum-of-products in (5.1) is the scaled distance from this point to the hyperplane defined by $\mathbf{w}$ and $b$. The scale factor is the length $\|\mathbf{w}\|$ of the vector $\mathbf{w}$. The bias $b$ shifts this hyperplane along the direction of the normal vector $\mathbf{w}$. One of the most frequently used squash functions is the nonlinear ReLU function introduced in [13] and defined in (5.2).

$$f(z) = \max(z, 0) \qquad (5.2)$$

From the perspective of data topology, (5.1) expresses an operation that divides the data space into two halves with a hyperplane and then discards one half through the ReLU function in (5.2); this process is tuned iteratively during the learning procedure. The objective of learning is to find a good direction for the plane (by learning the weight values) and to place it at an appropriate location in the space (by learning the bias value).
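The geometric reading above can be checked numerically. The following is a minimal sketch, not the thesis implementation: the names `w`, `x`, `b` and the 2-D example values are assumptions chosen for illustration. It shows that the sum-of-products equals the signed distance to the hyperplane multiplied by the length of the weight vector, and that ReLU zeroes every point on the negative side.

```python
import math

def neuron_pre_activation(w, x, b):
    """Sum-of-products w.x + b from (5.1), before the squash function."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(z):
    """ReLU squash function from (5.2)."""
    return max(z, 0.0)

def signed_distance(w, x, b):
    """Signed Euclidean distance from point x to the hyperplane w.x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return neuron_pre_activation(w, x, b) / norm_w

# Hypothetical 2-D example: hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, b = [3.0, 4.0], -5.0
x_pos = [3.0, 4.0]   # lies on the positive side of the hyperplane
x_neg = [0.0, 0.0]   # lies on the negative side of the hyperplane

z_pos = neuron_pre_activation(w, x_pos, b)   # 9 + 16 - 5
z_neg = neuron_pre_activation(w, x_neg, b)   # 0 + 0 - 5

print(z_pos, signed_distance(w, x_pos, b), relu(z_pos))
print(z_neg, signed_distance(w, x_neg, b), relu(z_neg))
```

Points on the negative side all collapse to the same output of zero, which is the sense in which one half of the space is discarded.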

AlexNet, introduced by Krizhevsky et al. [13], is one of the first deep neural networks for image classification. The network was trained on over 1.2 million images from the ImageNet database [28] representing 1000 classes. As illustrated in Figure 5.3, AlexNet is a deep CNN [108] constructed as a combination of convolutional and fully connected layers. It consists of an input layer followed by five convolutional layers (conv1-conv5) and three fully connected layers (fc6-fc8). The ReLU function is used as the squash function in all neurons. Some layers are followed by extra processing units such as Local Response Normalization (LRN) and Max Pooling (Pool). The output from the last layer is passed through the normalized exponential (Softmax) function, which maps a vector of real values to values in the range (0, 1] that sum to one. These values represent the probabilities of each class.
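As a small illustration of the final normalization step, the sketch below implements the normalized exponential in plain Python. The three input scores are hypothetical and stand in for the 1000 outputs of the last layer.

```python
import math

def softmax(scores):
    """Normalized exponential: maps real scores to positive values that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]   # hypothetical last-layer outputs for three classes
probs = softmax(logits)
print(probs, sum(probs))
```

Larger scores receive larger probabilities, and the ordering of the classes is preserved.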

Table 5.1 shows the neuron’s input size and the number of output channels at each layer of AlexNet. It can be observed that a large number of dimensions is discarded as images are processed through successive layers. For example, for the first layer (conv1) the input data space contains image patches of size 11×11×3 (which are data points in a 363-dimensional space), while the outputs have only 64 dimensions. While it is well known that pixels in a local neighbourhood are highly correlated, it is unclear whether 64 channels are enough to capture the most important information from the original 363-dimensional space, and how unique the choice of these 64 channels is.
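The conv1 figures quoted above follow from simple arithmetic, sketched here. The helper name `flattened_patch_dim` is illustrative. Only the conv1 values stated in the text are used; the remaining rows of Table 5.1 are not reproduced.

```python
def flattened_patch_dim(height, width, channels):
    """Dimension of an image patch once it is flattened into a single vector."""
    return height * width * channels

conv1_in = flattened_patch_dim(11, 11, 3)   # input dimension seen by each conv1 neuron
conv1_out = 64                              # number of conv1 output channels (from the text)

# A linear map from conv1_in to conv1_out dimensions has rank at most conv1_out,
# so at least this many input directions are mapped to zero before the ReLU.
min_discarded = conv1_in - conv1_out

print(conv1_in, conv1_out, min_discarded)
```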

Figure 5.1 - Histogram of pairwise cosine distance between filters in each layer of the pre-trained AlexNet.

Given that pixels in local neighbourhoods of an arbitrary image are highly correlated [2], filters applied at the low-level layers of a neural network should show a certain degree of inter-dependency to be able to exploit this correlation. As the data becomes less and less correlated when progressing upwards through the network, the higher layers are more likely to explore the independent dimensions of the data space. To verify this hypothesis, histograms of the pairwise cosine distance between filters in each layer of the pre-trained AlexNet are shown in Figure 5.1. In this figure, the cosine distance between each filter and all other filters in the same network layer was computed, and a histogram of the distance values is drawn for each network layer to illustrate the mutual relationship among its filters. It can be seen in the figure that filters in higher-level layers tend to be more mutually orthogonal (i.e., the cosine distance between each pair of filter vectors is closer to zero; note the scale of the vertical axis). In other words, filters in high-level layers tend to look for more independent patterns in the data space. It is almost impossible to guess the directions of filter vectors in the low-level layers without prior knowledge of the data distribution. However, it is easy to emulate the orthonormality of higher-layer filters by using random orthogonal projections.
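The per-layer measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions: each filter is flattened into a vector, the measure is taken to be the cosine of the angle between two filters (consistent with values near zero meaning orthogonal), and the Gaussian random `filters` are a hypothetical stand-in for one layer, not actual AlexNet weights.

```python
import math
import random

def cosine(u, v):
    """Cosine of the angle between two flattened filter vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosines(filters):
    """Cosine for every unordered pair of distinct filters in one layer."""
    return [cosine(filters[i], filters[j])
            for i in range(len(filters))
            for j in range(i + 1, len(filters))]

def histogram(values, n_bins=20, lo=-1.0, hi=1.0):
    """Count values into n_bins equal-width bins covering [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, n_bins - 1))   # clamp rounding spill at the edges
        counts[idx] += 1
    return counts

rng = random.Random(0)
# Hypothetical layer: 16 filters, each flattened to 363 values (an 11x11x3 patch).
filters = [[rng.gauss(0.0, 1.0) for _ in range(363)] for _ in range(16)]

cosines = pairwise_cosines(filters)
counts = histogram(cosines)
print(counts)
```

With real weights, each layer's filter tensor would first be reshaped so that one row holds one filter; a histogram concentrated around zero then indicates near-orthogonal filters.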

Figure 5.2 – Evolution of histogram of pairwise cosine distance between filters in each layer of AlexNet during training.

To verify this further, AlexNet was trained from scratch on ImageNet images and the weights of its neurons were recorded over the first several epochs. The histograms of pairwise cosine distance between neurons over the training epochs are plotted in Figure 5.2. Right after initialization, all layers showed a high degree of orthogonality between neurons, but the network could not perform any meaningful classification. As discussed, this is due to the correlation of pixels in the image, which requires a certain level of correlation between neurons in early layers to pick out the subtle discriminative information. As training continued, the orthogonality in the early layers quickly decreased, while the orthogonality in the high-level layers increased slightly before settling down. After about 3 epochs, the histogram of pairwise cosine distance is almost indistinguishable from that of the fully trained network, so no further plots are required.

A neural network performs its own form of low-dimensional encoding, targeted at retaining information that benefits the classification process. Principal Component Analysis (PCA) [10] reduces the data dimension by estimating a low-rank matrix that captures the span of the input data. Both approaches require the data matrix to be stored in memory, which is often not possible. In contrast, random approaches such as sparse random projection (SRP) [206] do not require that matrix: the projection bases are generated randomly without knowledge of the data distribution, so little memory is needed to store data samples. Sparse random projection aims to preserve the pairwise distance between data points, and thus appears to be particularly applicable to image classification tasks, as the majority of classifiers rely on pairwise distance between samples in feature space.
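To make the idea concrete, the sketch below builds a sparse random projection matrix without looking at any data and compares one pairwise distance before and after projection. It assumes an Achlioptas-style construction in which each entry is +c, 0, or -c with probabilities 1/(2s), 1 - 1/s, and 1/(2s), where c = sqrt(s/k); the exact variant used in [206] is not specified in this section, and the sizes `d`, `k`, and `s` are illustrative.

```python
import math
import random

def sparse_random_matrix(k, d, s=3, seed=0):
    """k x d matrix with entries in {+c, 0, -c}, c = sqrt(s / k); about 1/s are non-zero."""
    rng = random.Random(seed)
    c = math.sqrt(s / k)
    rows = []
    for _ in range(k):
        row = []
        for _ in range(d):
            u = rng.random()
            if u < 1.0 / (2 * s):
                row.append(c)
            elif u < 1.0 / s:
                row.append(-c)
            else:
                row.append(0.0)
        rows.append(row)
    return rows

def project(matrix, x):
    """Multiply the projection matrix by one data point."""
    return [sum(r * xi for r, xi in zip(row, x)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

d, k = 500, 200                       # illustrative original and reduced dimensions
R = sparse_random_matrix(k, d)        # generated without access to any data

rng = random.Random(1)
x1 = [rng.gauss(0.0, 1.0) for _ in range(d)]   # hypothetical feature vectors
x2 = [rng.gauss(0.0, 1.0) for _ in range(d)]

ratio = distance(project(R, x1), project(R, x2)) / distance(x1, x2)
nonzero_fraction = sum(1 for row in R for v in row if v != 0.0) / (k * d)
print(ratio, nonzero_fraction)
```

The scaling by sqrt(s/k) makes the expected squared length of a projected vector equal to that of the original, so the ratio is expected to be close to one, more tightly so as k grows.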

In this study, inspired by the orthogonality of filters in the pre-trained network, a new method is proposed to create a random projection matrix using orthonormal bases generated by the QR factorization [207], specifically for object recognition. The proposed approach also works regardless of the number of available data samples, does not require storing all data points in memory, and is thus suitable for cases with limited computational resources. It is important to emphasize that although the proposed method is inspired by the deep neural network, it does not involve any change to the pre-trained neural network. The network is used as a “black-box” feature extractor, and the proposed method consists only of the dimensionality reduction phase performed on these extracted features. Experiments described in Section 5.2.3 show that the QR-generated filter matrix works better than its sparse random counterpart, and in some conditions, surpasses the effectiveness of the filters acquired through intensive DNN training.
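One common way to obtain random orthonormal bases is to orthonormalize a Gaussian random matrix. The sketch below does this with modified Gram-Schmidt, which in exact arithmetic yields the Q factor of a QR factorization; it is written in plain Python to stay dependency-free, whereas in practice a library routine such as `numpy.linalg.qr` would normally be used. The function name, the sizes, and the Gaussian initialization are assumptions, and the exact construction used in this study is not detailed in this section.

```python
import math
import random

def random_orthonormal_rows(k, d, seed=0):
    """Return k orthonormal vectors in R^d (k <= d) from Gaussian random vectors."""
    if k > d:
        raise ValueError("cannot have more than d orthonormal vectors in R^d")
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Modified Gram-Schmidt: remove the component along each earlier basis vector.
        for q in basis:
            coeff = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - coeff * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    return basis

def reduce_dimension(basis, x):
    """Project one feature vector onto the k orthonormal directions."""
    return [sum(qi * xi for qi, xi in zip(q, x)) for q in basis]

Q = random_orthonormal_rows(k=8, d=32)              # illustrative sizes
feature = [float(i) for i in range(32)]             # hypothetical extracted feature
reduced = reduce_dimension(Q, feature)
print(len(reduced))
```

Like the sparse random matrix, this projection is generated without access to the data; unlike it, the rows are exactly orthonormal up to rounding error, mirroring the behaviour observed in the higher layers of the trained network.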
