1.5 Thesis Outline
2.1.2 Local Feature Descriptor
2.1.2.2 Learning-Based Feature Descriptors
More recently, learning-based feature descriptors, which train a dedicated encoding function on massive training data, have been widely developed to boost descriptor performance and robustness.
Earlier learning-based works learn shallow projections to obtain local descriptors. For example, Linear Discriminant Analysis Hashing (LDAHash) [177] uses linear projections combined with linear discriminant analysis to generate binary descriptors. Discriminative BRIEF (D-BRIEF) [184] produces descriptors by projecting the training data into a latent subspace. To deal with nonlinear data structure, BinBoost [183] learns a set of nonlinear classifiers to encode the data, jointly applying a boosting algorithm to make the learned binary codes more discriminative. Online learning is adopted in Binary Online Learned Descriptor (BOLD) [4], which selects binary intensity tests that produce low intra-class and high inter-class distances in the code learning. However, these methods generally adopt simple binary intensity tests, so some critical cues of a patch cannot be captured in the learned descriptor. Subsequently, Coupled Compact Binary Face Descriptor (C-CBFD) [124] generates binary codes under three complementary learning objectives: high variance for information preservation, low quantization error, and an even distribution at each bit. A one-stage learning strategy is utilized in Simultaneous Local Binary Feature Learning and Encoding (SLBFLE) [123], where the binary codes and the encoding codebook are jointly optimized for local face patches. These works were subsequently extended as Context-Aware
Local Binary Feature Learning (CA-LBFL) [40] and Rotation-Invariant Local Binary Descriptor (RI-LBD) [39], which learn robust local binary descriptors to further improve the efficiency and accuracy of face recognition.
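The three objectives listed for C-CBFD (high variance, low quantization error, and an even distribution at each bit) can be made concrete with a small NumPy sketch. The function name `binary_code_objectives`, the linear projection `W`, and the exact form of each term are assumptions chosen for illustration; they are not claimed to reproduce the formulation in [124].

```python
import numpy as np


def binary_code_objectives(X, W):
    """Evaluate three illustrative terms for binary codes obtained from X @ W.

    X: (n, d) array of real-valued patch features (hypothetical input).
    W: (d, k) linear projection, one column per bit (hypothetical parameter).
    """
    Z = X @ W                         # real-valued projections, shape (n, k)
    B = (Z > 0).astype(float)         # binary codes in {0, 1}

    # Information preservation: total variance of the bits (to be maximized).
    variance = B.var(axis=0).sum()

    # Quantization error between centered codes and projections (to be minimized).
    quantization_error = np.sum(((B - 0.5) - Z) ** 2)

    # Even distribution: each bit should be 1 for about half of the samples,
    # so the summed centered codes should be close to zero (to be minimized).
    imbalance = np.sum(np.sum(B - 0.5, axis=0) ** 2)

    return {
        "variance": variance,
        "quantization_error": quantization_error,
        "imbalance": imbalance,
    }
```

In a learning setting, a weighted combination of these terms would be optimized with respect to `W`; the weights and the optimizer are left out here because the text does not specify them.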
With the development of deep learning techniques, more recent works apply CNNs and deep features to learn the local feature descriptor. For example, Dosovitskiy et al. [37] train a CNN by optimizing a classification loss and use the output vectors before the classification layer as patch descriptors. In particular, data augmentation is applied to the training data to avoid overfitting in the classification task, with the augmented patches generated by adding random variations and noise. Instead of merely optimizing the classification loss, DeepDesc [162] introduces a Siamese loss in network training, where the patch pairs fed to the network are selected by an aggressive search strategy. A central-surround two-stream network structure is utilized in [237] to improve the matching performance of the learned feature descriptor, where the center of a patch is used as input and the similarity between patch pairs is computed through a Siamese network. HardNet [130] proposes a triplet loss that mines hard examples within each batch to mimic the matching procedure, where at least one positive pair is guaranteed in building each triplet input. Descriptors Optimized for Average Precision (DOAP) [70] trains the deep network by directly optimizing a new loss function based on Average Precision (AP), which improves ranking-based retrieval performance. Wei et al. [209] introduce a novel pooling method termed Subspace Pooling in the code learning, which is claimed to make the learned feature descriptor robust against a range of geometric deformations.
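The batch-wise hard mining described for HardNet can be sketched as a triplet margin loss that picks, for every matching pair, the closest non-matching descriptor in the same batch. This is a minimal NumPy illustration under stated assumptions: the function name, the default margin, and the L2 normalization step are chosen here for the example and are not taken from [130].

```python
import numpy as np


def hardest_in_batch_triplet_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with the hardest negative mined inside the batch.

    Row i of `anchors` and row i of `positives` describe the same patch, so
    each row contributes one guaranteed positive pair; every other row in the
    batch is a candidate negative.
    """
    # L2-normalize the descriptors so that all distances are bounded.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Pairwise Euclidean distances between every anchor and every positive.
    dist = np.sqrt(np.maximum(2.0 - 2.0 * (a @ p.T), 0.0))

    # Matching pairs lie on the diagonal.
    d_pos = np.diag(dist).copy()

    # Put +inf on the diagonal so a matching pair is never chosen as a negative.
    masked = dist + np.diag(np.full(len(dist), np.inf))

    # Hardest negative: closest non-matching positive for anchor i, or
    # closest non-matching anchor for positive i, whichever is nearer.
    d_neg = np.minimum(masked.min(axis=1), masked.min(axis=0))

    return float(np.mean(np.maximum(0.0, margin + d_pos - d_neg)))
```

The loss is zero once every matching pair is closer than its hardest in-batch negative by at least `margin`, which is why the mining step matters: easy negatives would satisfy the margin early and stop contributing to training.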
More recently, further works learn the binary descriptor within deep frameworks. For example, Deep Hashing (DH) [43] optimizes the binary descriptor under independence and even-distribution constraints, while Deep Supervised Hashing (DSH) [111] jointly optimizes a distance loss and a Siamese loss to improve the binary descriptor quality. Subsequently, L2-NET [182] trains a Siamese network on patch pairs and produces binary codes by directly quantizing the real-valued outputs, with different regularization terms applied to the intermediate layer outputs to improve the code quality. Going beyond pairwise inputs, a triplet loss is incorporated in the objective function of [233] to further ensure the discriminativeness of the codes. Deep Binary Descriptor with Multi-Quantization (DBD-MQ) [41] adopts a multi-quantization strategy that reduces the quantization errors within the K-AutoEncoders (KAEs) networks. GraphBit [42] integrates reinforcement learning with binary
code learning, where the uncertainty of binary codes is minimized by maximizing the mutual information between the real-valued inputs and the corresponding bits. Building on the Generative Adversarial Network (GAN) [52], BinGAN [252] learns compact binary descriptors from patches by optimizing two additional losses, a distance matching regularizer and an entropy regularizer. GANs have also been employed in [169] to facilitate image retrieval and compression. More recently, Compact Discriminative binary descriptor (CDbin) [231] generates binary descriptors by jointly optimizing four complementary loss functions in an end-to-end manner. In such cases, dedicated prior knowledge (e.g., labels) is required, which is usually impractical in real application scenarios. Moreover, despite the great success achieved by those descriptors, the transformation-invariant nature of the local feature descriptor is not considered in the training process. To address this, DeepBit [106] learns compact binary descriptors by optimizing several loss functions in network training, one of which minimizes the Hamming distances between the binary codes of each original patch and its transformed versions in a pairwise manner. Although this encodes transformation invariance to some extent, minimizing the Euclidean distances between the codes of the original data and their transformed sets in the binary space does not make the learned binary codes identical.
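One way to read the limitation noted for pairwise invariance terms is that a small distance between real-valued network outputs does not guarantee identical bits after quantization. The sketch below illustrates this with a plain threshold quantizer and a Hamming distance; the helper names, the zero threshold, and the feature values are all hypothetical and are not taken from DeepBit [106].

```python
import numpy as np


def binarize(features, threshold=0.0):
    """Quantize real-valued outputs to bits by thresholding (assumed quantizer)."""
    return (np.asarray(features) > threshold).astype(np.uint8)


def hamming_distance(code_a, code_b):
    """Number of bit positions in which two binary codes differ."""
    return int(np.count_nonzero(code_a != code_b))


# Made-up outputs for a patch and for a transformed (e.g., rotated) copy of it.
original = np.array([0.05, -0.80, 0.60, -0.02])
transformed = np.array([-0.05, -0.75, 0.55, 0.03])

real_gap = float(np.linalg.norm(original - transformed))  # small real-valued gap
bit_gap = hamming_distance(binarize(original), binarize(transformed))
```

Here the two real-valued vectors are close, yet the entries near the threshold flip sign, so the resulting codes still differ in some bits.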
Recently, such learning-based binary descriptors have also been widely adopted in other applications such as palmprint and object recognition [45, 145]. For example, Discriminant Direction Binary Code (DDBC) [45] learns a simple mapping function that projects the convolution difference vectors to the neighboring directions of the templates, while [145] proposes a stacked convolutional autoencoder structure to generate compact binary codes for accurate object detection. In these applications, the learning-based binary descriptors are the main contributors to the performance gains on the specific tasks.