Function Set - Learning Discriminative Feature Representations for Visual Categorization


4.3.3 Function Set

A key concept for MOGP is the function set (i.e., the internal nodes of the tree), which is the alphabet of the programs and is typically driven by the nature of the problem domain. Commonly, the choice of functions is based on the following principles:

(1) The set must contain functions which can extract meaningful information.

(2) To minimize the total running time of MOGP, all the operators in the function set need to be relatively simple and efficient.

Functions for Filtering Layer

The filtering layer is the most significant part of our MOGP procedure: it extracts effective features from the input data. We construct the filtering layer from 19 unary functions and 5 binary functions, shown in Table 4.2. All the filtering functions listed here operate on images represented as pixel matrices.

In the proposed MOGP architecture, we adopt the Gaussian filter series, the Gabor filter series and some basic arithmetic functions. Since Gaussian and Gaussian derivative filters are

4.3 Methodology 59

Table 4.3 Max-pooling Layer functions

Operator Name | Input | Function Description | Operator Type
Maxpooling2 | 1 Image | 2D max-pooling with pooling window size 2×2 on the input image | Filter
Maxpooling4 | 1 Image | 2D max-pooling with pooling window size 4×4 on the input image | Filter
Maxpooling6 | 1 Image | 2D max-pooling with pooling window size 6×6 on the input image | Filter
Maxpooling8 | 1 Image | 2D max-pooling with pooling window size 8×8 on the input image | Filter
Maxpooling10 | 1 Image | 2D max-pooling with pooling window size 10×10 on the input image | Filter

widely used to reduce noise and smooth input images, we adopt them into our function set to achieve denoising and extract meaningful contour information. 2D Laplacian filters are used for separating signals into different spectral sub-bands. Gabor filters are regarded as an effective method to obtain orientation information. To mimic the biological mechanism of the human visual cortex, we follow Riesenhuber and Poggio's work [126] to define our GBO-θs (i.e., θs = 0°, 45°, 90° and 135°) filter in two steps: first, convolving the input image with Gabor filters at six different scales (7×7, 9×9, 11×11, 13×13, 15×15 and 17×17) with a certain orientation θs; second, using the max operation to pick the maximum value across all six convolved images in each orientation. Fig. 4.6 illustrates our multi-scale-max Gabor filter. The pooling equation among different scales is defined below:

$$\text{GBO-}\theta_s = \max_{(x,y)} \left[ I_{7\times 7}(x,y,\theta_s),\; I_{9\times 9}(x,y,\theta_s),\; \ldots,\; I_{15\times 15}(x,y,\theta_s),\; I_{17\times 17}(x,y,\theta_s) \right] \tag{4.1}$$

where GBO-θs is the output of a multi-scale-max Gabor filter and $I_{i\times i}(x,y,\theta_s)$ denotes the convolved image with scale i×i and orientation θs.
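The two steps behind Eq. (4.1) can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in this work: the function names (`gabor_kernel`, `filter_same`, `gbo`) are hypothetical, and the Gabor parameters (`sigma`, `wavelength`, `gamma`) are assumed defaults that the text does not specify. Only the six kernel sizes, the four orientations and the pixel-wise max across scales come from the description above.

```python
import math

def gabor_kernel(size, theta_deg, gamma=0.3):
    """Real-valued size x size Gabor kernel.

    sigma, wavelength and gamma are illustrative assumptions,
    not values taken from the text.
    """
    sigma = 0.4 * size
    wavelength = 0.5 * size
    theta = math.radians(theta_deg)
    half = size // 2
    kernel = []
    for v in range(-half, half + 1):
        row = []
        for u in range(-half, half + 1):
            # Rotate the coordinates by the chosen orientation.
            xr = u * math.cos(theta) + v * math.sin(theta)
            yr = -u * math.sin(theta) + v * math.cos(theta)
            envelope = math.exp(-(xr * xr + (gamma * yr) ** 2) / (2 * sigma * sigma))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def filter_same(image, kernel):
    """Zero-padded 2D filtering whose output has the same size as the input.

    The kernel above is symmetric under a 180-degree rotation, so
    correlation and convolution give the same result here.
    """
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j, krow in enumerate(kernel):
                yy = y + j - half
                if 0 <= yy < h:
                    for i, kval in enumerate(krow):
                        xx = x + i - half
                        if 0 <= xx < w:
                            acc += kval * image[yy][xx]
            out[y][x] = acc
    return out

def gbo(image, theta_deg, scales=(7, 9, 11, 13, 15, 17)):
    """Multi-scale-max Gabor response for one orientation (cf. Eq. 4.1)."""
    # Step 1: filter the image at each of the six scales.
    responses = [filter_same(image, gabor_kernel(s, theta_deg)) for s in scales]
    # Step 2: keep, at every pixel, the maximum response across scales.
    h, w = len(image), len(image[0])
    return [[max(r[y][x] for r in responses) for x in range(w)] for y in range(h)]
```

Calling `gbo(image, 45)` on a list-of-lists grayscale image would return an image of the same size, which is what allows the operator to sit anywhere in the filtering layer.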

In addition, basic arithmetic functions are chosen to allow the MOGP process to compute features from images more naturally. The binary arithmetic functions also play a significant role in extending the MOGP tree by splitting it into branches and in making the individual programs more varied and diverse, qualities that significantly influence the final results. To ensure operator closure [39], we have only used functions which map one or two 2D images to a single 2D image of identical size (i.e., the input and the output of each function node have the same size). In this way, a MOGP tree can be an unrestricted composition of function nodes and still always produce a semantically legal structure.
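The closure property can be illustrated with a toy tree evaluator. The sketch below is only an assumption-laden example: the three operators (`blur`, `add`, `absdiff`) are hypothetical stand-ins rather than entries of Table 4.2, and the nested-tuple tree encoding is chosen purely for brevity. What it shows is that, as long as every node maps one or two same-size images to one image of that size, any composition evaluates to a valid image.

```python
def box_blur(a):
    """3x3 mean filter with edge replication (hypothetical unary operator)."""
    h, w = len(a), len(a[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return a[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[sum(px(y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)) / 9.0
             for x in range(w)] for y in range(h)]

def add(a, b):
    """Pixel-wise sum of two same-size images (hypothetical binary operator)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def absdiff(a, b):
    """Pixel-wise absolute difference (hypothetical binary operator)."""
    return [[abs(p - q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

# Operator name -> (function, arity). Every entry preserves the image size.
FUNCTIONS = {"blur": (box_blur, 1), "add": (add, 2), "absdiff": (absdiff, 2)}

def evaluate(node, image):
    """Recursively evaluate a tree given as nested tuples, e.g. ("add", "img", "img")."""
    if node == "img":  # terminal: the input image itself
        return image
    op, *children = node
    fn, arity = FUNCTIONS[op]
    if len(children) != arity:
        raise ValueError(f"{op} expects {arity} argument(s)")
    return fn(*(evaluate(child, image) for child in children))
```

For example, `evaluate(("add", ("blur", "img"), ("absdiff", "img", ("blur", "img"))), image)` returns an image with the same height and width as `image`, regardless of how the sub-trees are arranged.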

Functions for Max-pooling Layer

The max-pooling function set is listed in Table 4.3. All the functions in this set are per-


Table 4.4 Concatenation Layer functions

Operator Name | Input | Function Description | Operator Type
Cat1 | 1 Image and 1 coefficient | 'Flatten' the 2D input image to a 1D vector with coefficient | Fusion
Cat2 | 2 Images and 2 coefficients | 'Flatten' the two 2D input images to 1D vectors and concatenate these two vectors into a long vector with coefficients | Fusion
Cat3 | 3 Images and 3 coefficients | 'Flatten' the three 2D input images to 1D vectors and concatenate these three vectors into a long vector with coefficients | Fusion
Cat4 | 4 Images and 4 coefficients | 'Flatten' the four 2D input images to 1D vectors and concatenate these four vectors into a long vector with coefficients | Fusion

incremental step of 2 pixels both horizontally and vertically. This max-like feature selection operation is a key mechanism for object recognition in the human brain cortex and provides a robust response in the case of recognition with clutter or multiple stimuli in the receptive field. This kind of mechanism achieves invariance to image-plane transforms such as translation and scaling. One noteworthy point here is that the output of a max-pooling filter is inevitably smaller than the input along the spatial dimensions. To ensure the closure property mentioned above, we further resize each output calculated from the max-pooling filters to the same size as the input using linear interpolation. In this way, the inputs and outputs of our max-pooling filters have exactly the same size.
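A size-preserving max-pooling operator of this kind might look like the following sketch. The step of 2 pixels and the resize by linear interpolation follow the description above; the corner-aligned bilinear scheme, the function names and the list-of-lists image format are assumptions made only for illustration. The input is assumed to be at least as large as the pooling window.

```python
def max_pool(image, window, stride=2):
    """Max over window x window patches, moving `stride` pixels in each direction."""
    h, w = len(image), len(image[0])
    return [[max(image[yy][xx]
                 for yy in range(y, y + window)
                 for xx in range(x, x + window))
             for x in range(0, w - window + 1, stride)]
            for y in range(0, h - window + 1, stride)]

def resize_bilinear(image, out_h, out_w):
    """Corner-aligned bilinear resize (one possible form of linear interpolation)."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for y in range(out_h):
        fy = y * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        ty = fy - y0
        row = []
        for x in range(out_w):
            fx = x * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            tx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = image[y0][x0] * (1 - tx) + image[y0][x1] * tx
            bottom = image[y1][x0] * (1 - tx) + image[y1][x1] * tx
            row.append(top * (1 - ty) + bottom * ty)
        out.append(row)
    return out

def max_pool_same_size(image, window, stride=2):
    """Max-pool, then resize back to the input size to preserve closure."""
    pooled = max_pool(image, window, stride)
    return resize_bilinear(pooled, len(image), len(image[0]))
```

With this shape, `max_pool_same_size(image, 4)` would play the role of Maxpooling4 in Table 4.3: the pooled map is smaller than the input, and the resize restores the original height and width.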

Functions for Concatenation Layer

After the filtering and max-pooling layers, the tree-based MOGP programs have extracted the corresponding features from the original input images. We first convert the 2D images to 1D vectors and then concatenate the 1D vectors calculated from different sub-trees into a weighted linear concatenation vector, with coefficients randomly selected from the terminal set, as shown in Fig. 4.7, where C1 and C2 denote the corresponding coefficients automatically obtained by MOGP and '⊕' means concatenation. To cope with different numbers of sub-trees, our architecture includes four concatenation functions with different numbers of inputs. The relevant functions are listed in Table 4.4. Unlike the other three functions in this set, Cat1 involves no concatenation: it only 'flattens' the 2D input image to a 1D vector and multiplies each element of the vector by the corresponding coefficient. We also restrict the concatenation layer to a depth of one. Since this layer is the last layer in our program structure, its output is the final feature descriptor extracted from the input image.
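The concatenation functions of Table 4.4 can be sketched as a single helper. This is an illustrative reading, not the original implementation: the names `flatten` and `cat` are hypothetical, and the sketch assumes that each flattened vector is scaled element-wise by its own coefficient before the vectors are joined, which the text states explicitly only for Cat1.

```python
def flatten(image):
    """Turn a 2D image (list of rows) into a 1D list in row-major order."""
    return [pixel for row in image for pixel in row]

def cat(images, coefficients):
    """Weighted concatenation covering Cat1 to Cat4.

    Assumption: sub-vector k is multiplied by coefficients[k] before joining.
    """
    if len(images) != len(coefficients):
        raise ValueError("one coefficient is required per image")
    if not 1 <= len(images) <= 4:
        raise ValueError("only Cat1 to Cat4 are defined")
    vector = []
    for image, coefficient in zip(images, coefficients):
        vector.extend(coefficient * pixel for pixel in flatten(image))
    return vector
```

With one image the helper reduces to Cat1 (flatten and scale, no concatenation); with two to four images it corresponds to Cat2 to Cat4, and the returned list stands for the final feature descriptor.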

Each function in the function set is regarded as a tree node in evolved programs, which connects the outputs of lower-level functions or inputs. Note that, in our proposed MOGP architecture, not all the functions listed in the function set have to be used in a given structure, and the same function can be used more than once. The topology of our MOGP structure is essentially unrestricted. Besides, the functions in the function set are closely tied to our problem domain. In this work, we aim to construct novel discriminative feature descriptors for image classification, so the function set consists of effective filters for feature extraction, i.e., Gaussian filters, Gabor filters, Laplacian filters, etc. We expect the whole learned architecture to be consistent with the physical structure of the human visual cortex.

Fig. 4.7 An illustration of Concatenation Layer.

4.3.4 Fitness Function

The basis of evolutionary methods is to maximize the performance of individual solutions as gauged by an appropriate fitness function. In our approach, we apply multi-objective genetic programming (MOGP) to evolve each individual program generation by generation and finally select the best-performing individual as the optimal solution.

Two Defined Fitness Objectives

Classification error rate: To evaluate a candidate MOGP-evolved feature descriptor, we estimate its classification error rate Er using a linear support vector machine (SVM) on the learning set. We take the output of the MOGP tree and first apply principal component analysis (PCA) to reduce the high-dimensional vector to a low-dimensional one, which forms the input of the SVM. To obtain a more reliable fitness evaluation, for each candidate MOGP tree we estimate the classification error rate with the SVM by using n-fold
