
LEARNING VISUAL TASKS WITH SELECTIVE ATTENTION

BY

KEVIN JONATHAN SHIH

DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science

in the Graduate College of the

University of Illinois at Urbana-Champaign, 2017

Urbana, Illinois

Doctoral Committee:

Associate Professor Derek Hoiem, Chair
Associate Professor Svetlana Lazebnik
Professor David Forsyth


ABSTRACT

Knowing where to look in an image can significantly improve performance in computer vision tasks by eliminating irrelevant information from the rest of the input image and by breaking down complex scenes into simpler, more familiar sub-components. We show that a framework for identifying multiple task-relevant regions can be learned within current state-of-the-art deep network architectures, resulting in significant gains in several visual prediction tasks. We demonstrate both directly and indirectly supervised models for selecting image regions and show how they improve performance over baselines by focusing on the right areas.


ACKNOWLEDGMENTS

I would like to start by thanking my advisor, Derek Hoiem, as I wouldn’t have made it so far without his guidance. Throughout my time here, he has provided remarkably keen insight to the problems I was working on, and ultimately taught me how to be a better researcher. I would also like to thank other members of the computer vision faculty here: David Forsyth, Svetlana Lazebnik, and more recently Alexander Schwing for the numerous useful and informative discussions I’ve had with them throughout the years. Next, I would like to thank all of my labmates, past and present, for creating a great environment for improving myself as a grad student. Thanks to Ian Endres for showing me the ropes when I was getting started. Daphne Tsatsoulis always offered great words of encouragement (and baked goods) when things weren’t quite working out. Arun Mallya and Saurabh Singh both helped significantly in getting my feet wet with deep learning, and they also contributed significantly to the methods described in this thesis. I would also like to thank Tanmay Gupta for our collaboration – the last chapter would not have been possible without him, Liwei Wang for the many late-night chats we’ve had in our shared office, Bryan Plummer for always providing great feedback for research ideas and for lecturing me about the world of hip hop, Aditya Deshpande and Arun (again) for taking over the server management duties from me, and honorary vision lab member Yonatan Bisk for offering sage advice and for sending me a bottle of ghost pepper vodka that I have yet to figure out how to safely consume.

Finally, I would like to thank my friends and family. Special thanks go to Jonathan Ligo for letting me pick his brain on various topics ranging from academic to culinary, Pooya Khorrami for always being supportive and for the delicious bucket of pineapple cotton candy, Andrew Murphy for being a great roommate for the last few years, and Chris Cervantes for sharing my interest in games despite the lack of time to play them. I would especially like


to thank Nadia Danienta for supporting me through everything over the last two years. Most importantly, I would like to thank my family for supporting me through my incredibly long schooling process. I’m not entirely sure how they managed to financially support my undoubtedly expensive education, but I’m very thankful for the sacrifices they made to make it happen.


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
    1.1 Contributions
    1.2 Original Publications
CHAPTER 2 BACKGROUND
    2.1 Visual Attention and Saliency
    2.2 Vision Tasks
CHAPTER 3 LOCALIZING KEYPOINTS
    3.1 Method
    3.2 Results
    3.3 Conclusion
CHAPTER 4 FINE GRAINED CLASSIFICATION WITH ALIGNED PARTS
    4.1 From Keypoints to Regions
    4.2 Fine-Grained Classification
    4.3 Conclusion
CHAPTER 5 LATENT ATTENTION FOR VISUAL QUESTION ANSWERING
    5.1 Introduction
    5.2 Approach
    5.3 Experiments
    5.4 Conclusion
CHAPTER 6 VISUAL ATTENTION FOR VISUAL QUESTION ANSWERING WITH SUPERVISED PHRASE GROUNDING
    6.1 Method
    6.2 Experiments
CHAPTER 7 CONCLUSIONS
CHAPTER 8 REFERENCES


LIST OF TABLES

3.1 Localization and Visibility Prediction Performance of various methods without using the ground truth Bounding Box

3.2 Comparison of per-part PCP with Liu et al 2013 [1] and Liu et al 2014 [2]. The abbreviated part names from left to right stand for back, beak, belly, breast, crown, forehead, eye, leg, wing, nape, tail, and throat.

4.1 Comparison of Part Localization Performance: Our method based on keypoint prediction from Edge Boxes shows significant improvement over previous work.

4.2 Comparison of our classification with other works

5.1 Overall accuracy comparison on Validation. Our region selection model outperforms our own baselines, demonstrating the benefits of selective region weighting.

5.2 Accuracy comparison on VQA test sets.

5.3 Accuracies by type of question on the validation set. Percent accuracy is shown for each subset for our region-based approach, classification using only text, text with a whole-image feature vector, and text with salient attention (attention based only on the image). Overall, our region selection scheme outperforms use of whole images by 2% and text-only features by 5%. The learned salient attention model performed surprisingly well. Most notably, it had a similar performance boost over the whole-image baseline on scene questions. The proposed region-selection model still outperforms all baselines on color questions, suggesting that attention for color-identification cannot be easily learned via saliency.

5.4 Language model comparison. The 2-bin model is the concatenation of the question and answer averages. The parsed model uses the Stanford dependency parser to further split

6.1 Inductive transfer from VR to VQA through SVLR in joint training: We evaluate the performance of our model with the SVLR module trained jointly with VR and VQA supervision (provided by Genome and VQA datasets respectively) on the validation set of the multiple-choice VQA task. We compare this jointly-trained model to a model trained on only VQA data. We also compare to a traditional multitask learning setup that is jointly trained on VQA and VR and shares visual features but does not use the object and attribute word embeddings for recognition. While multitask learning outperforms the VQA-only model, using the SVLR module doubles the improvement. Our model is most suited for the question types in bold that require visual recognition without specialized skills like counting or reading. In this setting we train on Genome VR data and apply to VQA val. Details in Sec 6.2.2.

6.2 VQA performance on val and test sets: Because these systems vary in many ways, our internal comparisons are more instructive, but we include these for reference. For test accuracy, it is unclear whether FDA uses val in training. The MLP results were obtained using the implementation provided by [3]. The original MLP implementation [4] using Resnet-101 yields 64.9 and 65.2 on test-dev and test-std respectively. MCB reports only test-dev accuracy


LIST OF FIGURES

2.1 Top salient Edge Boxes [5] on an example image. Region proposals allow us to avoid unnecessarily processing uninteresting background regions.

3.1 The pipeline of our keypoint localization process: Given an input image, we extract multiple edge boxes. Using each edge box, we make predictions for the location of each of the 15 keypoints, along with their visibility confidences. We then find the best predicted location by performing confidence thresholding and finding the medoid. The process is illustrated for the right eye keypoint (black edge boxes without associated dots make predictions with confidences below the set threshold, and green is an outlier with a high confidence score).

3.2 Qualitative results for a subset of the keypoints. Predictions for most of the images cluster tightly. Therefore, simple prediction methods such as medoids work well. Medoid shift adds to the robustness, leading to further improvements (second last column). The primary failure mode is when visibility thresholding fails to rule out clusters of false positives (bottom right).

4.1 Examples of good (left) and failed (right) localization results: The ground truth boxes are in solid black. The head, torso, and whole body boxes are in green, blue and red respectively. The head is correctly localized in most of the above examples. In the top row middle example, even though the whole body box IOU is low, most of the missed area is actually background due to the bird extending its wings. In the bad examples, we show that we mostly fail in rare close-ups and when there are multiple instances.

4.2 Comparison of classification accuracies obtained using

5.1 Our goal is to identify the correct answer for a natural language question, such as "What color is the walk light?" or "Is it raining?" We focus on the problem of learning where to look. The above figure shows example attention regions produced by our model.

5.2 Examples from VQA ([6]). From left to right, the above examples require focused region information to pinpoint the dots, whole image information to determine the weather, and abstract knowledge regarding relationships between children and stuffed animals.

5.3 Overview of our network for the example question-answer pairing: "What color is the fire hydrant? Yellow." Question and answer representations are concatenated, fed through the network, then combined with selectively weighted image region features to produce a score.

5.4 Example parse-based binning of questions. Each bin is represented with the average of the word2vec vectors of its members. Empty bins are represented with a zero-vector.

5.5 Comparison of salient attention, conditioned on only the image, and the proposed attention model that considers both image and query. Many images have predictable saliency, in that it is easy to predict what any question about the image will be about. In the top row of this figure, the salient object is the plane and is predicted by both models with and without considering the query text. In more complex cases such as the bottom two rows, where there are multiple foreground objects, the salient model does a decent job of identifying those over the background, but fails to produce the correct attention map when the query refers to only one of the many possible foreground objects.

5.6 Comparison of attention regions generated by various question-answer pairings for the same question. Each visualization is labeled with its corresponding answer choice and returned confidence. We show the highlighted regions for the top multiple choice answers and some unrelated ones. Notice that in the first example, while the model clearly identified a green region within the image to match the "green" option, the corresponding confidence was significantly lower than that of the correct options, showing that the model does more than just match answer choices with image regions.

5.7 Example image with corresponding region weighting. Red boxes correspond to manual annotation of regions relevant

5.8 Comparison of qualitative results from Val. The larger image shows the selection weights overlaid on the original image (smaller). L: Word only model; I: Word+Whole Image; R: Region Selection. The scores shown are ground truth confidence minus top incorrect. Note that the first row shows successful examples in which tight region localization allowed for an accurate color detection. In the third row, we show examples of how weighting varies on the same image due to differing language components.

5.9 Plot of color-based question accuracy with varying number of regions sampled at every 10. The experiment was run on a 10% held-out set on train. We look at using the weighted average of only the top K scoring regions, as well as only the Kth. We include the whole image baseline's accuracy in this category for comparison.

6.1 Sharing image-region and word representations across multiple vision-language domains: The SVLR module projects images and words into a shared representation space. The resulting visual and textual embeddings are then used for tasks like Visual Recognition and VQA. The models for individual tasks are formulated in terms of inner products of region and word representations, enforcing an alignment between them in the shared space.

6.2 Joint Training on Visual Recognition (VR) and Visual Question Answering (VQA) with the proposed SVLR Module: The figure depicts sharing of image and word representations through the SVLR module during joint training on object recognition, attribute recognition, and VQA. The recognition tasks use object and attribute labelled regions from Visual Genome while VQA uses images annotated with questions and answers from the VQA dataset. The benefit of joint training is that while the VQA dataset does not provide region groundings of nouns and adjectives in the QA (e.g. "fluffy", "dog"), this complementary supervision is provided by the Genome recognition dataset. Models for each task involve image and word embeddings produced by the SVLR module or their inner

6.3 Inference in our VQA model: The image is first broken down into Edge Box region proposals [5]. Each region R is represented by visual category scores s(R) = [s_o(R), s_a(R)] obtained using the visual recognition model. Using the SVLR module, the regions are also assigned an attention score using the inner products of region features with representations of nouns and adjectives in the question and answer. The region features are then pooled using the relevance scores as weights to construct the attended image representation. Finally, the image and question/answer representations are combined and passed through a neural network to produce a score for the input question-image-answer triplet.

6.4 Interpretable inference in VQA: Our model produces interpretable intermediate computation for region relevance and object/attribute predictions for the most relevant regions. Our region relevance explicitly grounds nouns and adjectives from the Q/A input in the image. In addition to attention, we show object and attribute predictions for the most relevant region identified for a few correctly answered questions. The relevant regions are visualized by applying a mask generated from relevance scores projected back to their source pixel locations.

6.5 Inductive Transfer from VQA to Object Recognition: Each cell's color reflects the average accuracy improvement for classes within the corresponding frequency ranges of both datasets from training on Genome-only to training on Genome and VQA. Most gains are in rare Genome nouns with higher frequency in the VQA dataset (top left corner), suggesting that the weak supervision provided by training VQA attention helped to augment performance via the SVLR. The numbers in each cell show the Genome-only mean accuracy +/- the change due to SVLR multitask training, followed by the number of classes in the

6.6 Failure modes: Our model cannot count or read, though it will still identify the relevant regions. It is blind to relations and thus fails to recognize that birds, while present in the image, are not drinking water. The model may give a low score to the correct answer despite accurate visual recognition. For instance, the model observes asphalt but predicts concrete, likely due to language bias. A clear example of an error due to language bias is in the top-left image, as it believes the lady is holding a baby rather than a dog, even though visual recognition confirms evidence for dog. Finally, our model fails to answer questions that require complex reasoning involving comparison of multiple regions.

6.7 Synthetic center-focused image baseline provided by the authors of Das et al [7]. This image was used to represent a baseline attention model that always focuses on the center of the image. By computing the correlation between the human attention maps and this one, we are able to identify low correlation subsets of the dataset in which the human subjects looked away from the image center.

6.8 Qualitative comparison of attention maps from various models. Saliency generally corresponds pretty well with what questions ask about. Compared to the WTL model, the SVLR model's attention is typically much more focused. Regions deemed irrelevant by the SVLR seem to be more readily downweighted than in the WTL and Salient cases. Note that Gaussian smoothing was used on the attention masks for Salient, WTL, and SVLR for visualization

6.9 Mean Spearman rank-correlation coefficients between model attention and human attention at various thresholds. The threshold points define subsets of the dataset for which the human attention correlation with the synthetic center heatmap is below the current threshold value. For example: the first sample point of each curve is the mean correlation of each model with human attention, measured on a subset in which the human attention's correlation with the center heatmap is less than or equal to 0. WTL and Salient are the proposed model and salient attention baseline from the previous chapter. The Center baseline is the correlation of the center heatmap measured against all examples in the current subset. As can be seen, the attention of the proposed SVLR model significantly outperforms those of the models from the previous chapter at all threshold levels. The WTL slightly outperforms its corresponding strong salient baseline up to the threshold at 0.6. As the threshold approaches 1, the synthetic center heatmap baseline outperforms all proposed models, confirming that the majority of the questions are asking about something in the center of the image. Note that there were only 11 examples

LIST OF ABBREVIATIONS

VQA Visual Question Answering
VR Visual Recognition
HOG Histogram of Oriented Gradients
SVM Support Vector Machines
CNN Convolutional Neural Networks
LSTM Long Short-Term Memory
LDA Linear Discriminant Analysis


CHAPTER 1

INTRODUCTION

Consider what happens when people attempt to recognize a face. Do they observe every visible component of the face with the same amount of attention? Or do they spend more time looking for distinctive features such as a mole on the cheek, the shape of the eyes, or even the contour of the jawline? Visual attention is the focusing of the visual system on what is most informative and relevant for the task at hand. The following work addresses the problem of training computer vision models capable of exploiting visual attention to improve their own performance.

Visual attention in computer vision systems is strongly motivated by the way the human visual system works. In 1967, Alfred Yarbus [8] tracked the gaze of his human subjects as they observed certain images, noting that most of the attention was directed towards parts of the image that the subject considered to be most informative. While computer vision systems are designed for performance rather than to simulate their analogs in biology, the main takeaway is that not all of the visual input is equally important. Attention-like behavior can be adopted for computational reasons (much of the input image can be ignored) or to improve accuracy (irrelevant parts of the image may distract the model).

To date, incorporating visual attention has produced many successful models in the field of computer vision. One such example is fine-grained image recognition. The human attention analog for fine-grained recognition can be observed when one tries to identify the difference between two similar objects. Consider the act of comparing two different species of magpies. In order to compare, the human gaze will likely dart back and forth between analogous parts on both birds, comparing the eyes, beaks, breast patterns, wings, etc. to pick up minute differences. The state of the art in fine-grained image recognition has taken a similar approach to comparing similar objects. When comparing birds, analogous regions such as the head and torso are first


localized in all instances, then part-specific classifiers are trained to classify the birds based on just the head or the torso. In this setting, the specialized classifiers are able to learn the minute details that distinguish the head of one species of bird from another's – something that would have otherwise been difficult to learn from whole images of birds in various poses and orientations. Another successful application of attention is in the recognition of complex objects and scenes. Consider the task of recognizing a restroom. On the one hand, one could go through the difficult process of attempting to model the full visible appearance of a restroom, including the 3D geometry and all possible positionings of the sink, bathtub, and toilet, as well as the color of the tiles and walls. Alternatively, one could note that some of the above attributes are more important than others – knowing the color of the wall says little about whether you are looking at a restroom, but the presence of a toilet alone is a strong and often sufficient indicator. Additionally, identifying the presence of a toilet in a bathroom or a bookshelf in a bookstore requires only a detector, significantly simplifying the approach. As such, modeling recognition as the detection of a few highly discriminative visual patterns has seen widespread adoption. It has also spawned an interesting line of work dealing with the discovery of these discriminative parts and patches.

An interesting complication arises when we deal with a collection of different tasks, each requiring different visual attention behavior. This is an important problem to consider, as the push to develop more human-like AI will necessitate models that can adapt to new tasks and situations. The latter half of this work specifically addresses the problem of visual attention for visual question answering, in which questions about images may pose a variety of different tasks. In this setting, it is up to the model to adjust its attention behavior to best handle the currently posed question.

While our computer vision models will ultimately process whatever we show them, being selective about where we make them look can be advantageous in many scenarios. The goal of this work is to demonstrate several ways of incorporating selective visual attention into models for various computer vision tasks.


1.1 Contributions

We propose trainable models capable of identifying image regions relevant to their respective tasks. Our models are applied to part localization, fine-grained image recognition, and visual question answering (VQA), the last of which can be considered a meta-vision task. With the identification of task-relevant image regions as the broader picture, the problems we tackle in the following chapters are as follows:

Localizing Parts and Keypoints with Multiple Crops: Part and keypoint localization can be conditioned on the nearby context within the image. For example, the necessary information to localize an eye would lie on the face. In order for a localization model to pin the location down accurately, we would ideally feed it an input image with as much resolution as possible. However, because some CNN-based architectures have a fixed input resolution (e.g. 224x224 pixels), we would ideally feed the minimum necessary context into the model at as high a resolution as possible, as any unnecessary context would come at the cost of reduced resolution for the necessary context. In chapter 3, we introduce a sampling-based approach to identifying the best image regions from which to make localization predictions. By conducting the prediction task on multiple random crops from the image, we expect at least some of the crops to be close to the optimal context region. We then propose a simple scoring scheme that, combined with outlier rejection, allows us to identify a robust set of candidate predictions from which to predict the final keypoint location. Further, our candidate identification also allows us to accurately predict when a keypoint is not present in the image at all.

Part-aligned Fine-Grained Classification: As demonstrated in previous works, aligning analogous regions across images is an effective strategy for fine-grained classification. We demonstrate that a keypoint-localization method that can accurately predict both position and visibility leads to very accurate alignments and, by extension, better classification.

Conditioning Visual Attention on Language: Building machines that can interface with natural language instructions is an end-goal of human-robot interaction. Leveraging advancements in deep learning frameworks, we are interested in directing the visual attention of a vision system with natural language. Specifically, given natural queries such as "What color is the car?" or "Is there a cat on the bed?", the system should focus on question-relevant regions of the image to answer the queries. The main complication is that question-relevant region annotation may not exist. As such, we are interested in training such a model using only question-answer supervision. By incorporating question-relevant region selection as a latent task, we expect the model to learn region-question relevance as a means of improving its question answering accuracy.

Language-based Visual Attention with Phrase-Level Supervision:

While question-level attention is hard to supervise, it is certainly feasible to provide supervision for individual phrases within the question. Phrase-level attention cannot always tell you exactly where to look, specifically if the target object is never directly mentioned. For example, to answer "Is something sitting on the chair?", we cannot train a model to localize "something" in the image directly. However, we can first localize the mentioned "chair" and use that to aid the search process. Further, phrase-level attention can be trained using existing datasets for object detection and phrase-localization, allowing us to introduce additional supervision into the visual question answering task at a lower level.

1.2 Original Publications

The chapters in this work are based on the following original publications and tech reports:

• Shih, Kevin J., Arun Mallya, Saurabh Singh, and Derek Hoiem. "Part localization using multi-proposal consensus for fine-grained categorization." BMVC 2015. (Chapters 3 and 4) As primary author, my contributions include the design and experimentation of the outlier removal and consensus method from multiple predictions. Co-author Arun Mallya contributed significantly to the implementation of the CNN whereas co-author Saurabh Singh proposed the use of medoid-shift over simple medoids.

• Shih, Kevin J., Saurabh Singh, and Derek Hoiem. "Where to look: Focus regions for visual question answering." CVPR 2016. (Chapter 5) As primary author of this work, my contributions include the majority of the implementation for both the model and experimentation. The formulation of the attention model was jointly derived with co-authors.

• Gupta, Tanmay, Kevin Shih, Saurabh Singh, and Derek Hoiem. "Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks." arXiv preprint arXiv:1704.00260 (2017). (Chapter 6) Tanmay Gupta is the primary author of this work. My specific contributions involve design decisions regarding the loss functions, the design of the word to region embedding mechanism, inductive transfer analysis from VQA to object recognition, and the additional human attention comparison.


CHAPTER 2

BACKGROUND

This chapter provides background for the technical concepts in this work. We begin with an overview of visual attention in existing literature, followed by background for the specific tasks addressed in our work.

2.1 Visual Attention and Saliency

Our work focuses on applying the concept of visual attention and saliency to various computer vision tasks. In brief, visual attention refers to selectively attending to relevant parts of the input, and saliency is the extent to which something in the input stands out or will be attended to.

Visual attention has long been an important topic of study in human cognition. In 1967, Yarbus [8] studied how people's eyes moved when perceiving complex objects by attaching measurement devices to the eye. He noted that "When looking at a human face, an observer usually pays most attention to the eyes, the lips, and the nose. The other parts of face are given much more cursory consideration." In other words, the human visual system will specifically focus on the most salient and informative parts of the visual input.

Later works attempted to model the human attention system. Two of the most influential works in this field are the Feature Integration Theory (FIT) of Treisman and Gelade [9] and the Guided Search model of Wolfe et al [10, 11]. The FIT suggests a bottom-up pipeline such that: "features are registered early, automatically, and in parallel across the visual field, while objects are identified separately and only at a later stage, which requires focused attention." The Guided Search model was later proposed to address some issues with the FIT, specifically that top-down information can guide the parallel feature registration process to specifically activate task-relevant features. This is in contrast to the FIT pipeline in which the feature registration process is purely bottom-up and therefore task-agnostic.

In computer vision, methods that incorporate some form of visual attention do so for performance reasons rather than for simulating human cognition. Nevertheless, the general pipeline used by attention-based vision models closely resembles the theoretical frameworks of FIT and Guided Search. First, bottom-up features are used to transform the input image representation, creating the feature map. Next, salient regions are identified within the feature map based on a task-dependent metric (e.g. likelihood of being an object for object detection). Finally, a second stage processes the information from the salient regions to complete the inference. We direct interested readers to Frintrop et al [12] for a more detailed overview of computational visual attention.

In this section, we will look at several forms of visual attention in computer vision. We will first look at region proposals, which model object saliency for the object detection task. Next, we look at part discovery and discriminative patches, which identify salient visual patterns for many recognition tasks. Finally, we include a brief overview of soft-attention networks.

2.1.1 Region Proposals

Region proposal methods identify regions within an image with the goal of capturing all objects within the image in as few proposals as possible. They specifically model a form of visual saliency for object detection, directing the detector where to evaluate in the image as efficiently as possible. Popular methods include Objectness [13], Category Independent Object Proposals [14], Selective Search [15], Edge Boxes [5], CPMC [16], and RPN [17]. An example of where this would be extremely beneficial can be seen in figure 2.1. An exhaustive sliding window approach would need to run the full model on the image at all locations and scales – a process that may be prohibitively slow and expensive for large models. Preprocessing with a region-proposal method is much cheaper, relying only on low-level image cues, and directly returns a manageable selection of image regions at the appropriate locations and scales.

Figure 2.1: Top salient Edge Boxes [5] on an example image. Region proposals allow us to avoid unnecessarily processing uninteresting background regions.

Some proposal methods approach this as a segmentation problem, proposing multiple possible object segmentations and ranking them by likelihood of being an object. Methods such as Objectness [13] and Edge Boxes [5] avoid producing full segmentations to reduce computation, as object detection methods tend to operate on entire bounding boxes. Specifically, Objectness directly scores boxes by looking for high color-contrast with the exterior of the box, low levels of superpixel straddling, and high edge density near the borders. Edge Boxes similarly looks for edge contours contained wholly within a bounding box, using the Structured Edge detector from Dollár et al [18] to generate edge maps.

In recent years, focus has shifted to unified deep network frameworks in which region proposing and the object detection task can be trained simultaneously within the same architecture. Examples include OverFeat [19], YOLO [20], Faster RCNN [17], and SSD [21]. In these frameworks, pre-defined anchor boxes are exhaustively generated at multiple scales, aspect ratios, and locations. The deep-network architectures then predict offsets for the box coordinates to reshape the boxes onto overlapping objects, based on feature maps generated from several convolutional layers.


2.1.2 Part Discovery Pre-CNN

As previously discussed, one important application of visual attention involves identifying discriminative components for the model to focus on. For example, the eyes, nose, and mouth would be some of the more discriminative and informative parts of the face as compared to a random patch of skin on the forehead or the cheek. However, identifying the discriminative visual patterns and components of a new object is not always straightforward. Not only should the patterns be discriminative of the category that they represent, they should also be easily identifiable and diverse with respect to each other to increase coverage over the example space.

Automatic part discovery for object and scene recognition is an important area of research, largely due to the difficulty of defining an appearance model. The identification of discriminative parts and patches greatly simplifies the task of object and scene recognition. Instead of focusing on the entire input frame, we now focus on detecting a set of smaller and less varied image patches that are strongly indicative of a target category.

Previous works largely focus on identifying discriminative parts or patches for training detectors using HOG representations [22]. The deformable part model [23] automatically determines the high resolution parts of an object category by first training a coarse whole-object HOG filter and then greedily partitioning areas as parts based on the magnitude of filter weights within the area. In Juneja et al [24] as well as in our previous Boosted Collections of Parts work [25, 26], a large number of part detectors are quickly initialized by training on a single positive patch using exemplar SVMs [27] or the faster exemplar LDA [28]; then a discriminative and diverse subset of the detectors is identified and iteratively refined by mining for additional positive patches from the training set. Singh et al [29] initializes part detectors by clustering similar patches and iteratively refines across two splits of the training data. Their selection criteria focus on cluster purity and discriminativeness. Doersch et al [30] explores similar criteria, but uses an extension of mean-shift mode-seeking over density ratios of positive and negative data to identify discriminative parts. Sun and Ponce [31] also initialize part detectors by clustering patches, but they enforce diverse part selection using a sparse regularization term.


2.1.3 Automated Part Discovery in CNNs

Convolutional Neural Networks (CNNs) refer to a family of neural network architectures with the explicit property of being shift-invariant. On image data, a CNN architecture typically contains a series of convolutional layers applied to the image (or any tensor with height, width, and depth). Each convolutional layer can be seen as a sliding window filter that applies a linear transformation to the elements within its current window as it strides across the width and height dimensions. For example, let x be an input image of dimensions H × W × D, and let the filter have a spatial window of size M × N. When the filter's top left corner is located at x_{i,j}, it outputs the following K'-dimensional vector:

y_{i',j',k'} = \sum_{i''=1}^{M} \sum_{j''=1}^{N} \sum_{k=1}^{D} w_{i'',j'',k,k'} \, x_{i+i''-1,\, j+j''-1,\, k} \qquad \text{for } k' = 1, \ldots, K'    (2.1)

The full output after convolving over the entire input is an H' × W' × K' feature map. Here, H' and W' are determined by the horizontal and vertical strides of the convolutional filter as it slides across the input x. As the output is another tensor with width, height, and depth, we can easily chain a series of convolutional layers, leading to "deep" architectures. It is worth noting that we can reduce a convolutional layer to a traditional multi-layer perceptron layer (fully-connected layer) by setting the window's width and height to exactly match those of its input.
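As a concrete illustration, the following NumPy sketch computes the output vector of Eq. (2.1) for a single window position; the function and variable names are illustrative assumptions rather than any implementation used in this work.

import numpy as np

def conv_window(x, w, i, j):
    # x: (H, W, D) input tensor; w: (M, N, D, K') filter bank.
    # Returns the K'-dimensional output for the window whose top-left corner is at (i, j),
    # i.e. the inner product of the M x N x D receptive field with each of the K' filters.
    M, N, D, K = w.shape
    window = x[i:i + M, j:j + N, :]
    return np.einsum("mnd,mndk->k", window, w)

# Example: a 5x5x3 filter bank with 8 output channels applied at the top-left corner.
x = np.random.rand(224, 224, 3)
w = np.random.rand(5, 5, 3, 8)
y = conv_window(x, w, 0, 0)  # shape (8,)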

CNNs have drastically improved benchmark performance on many traditional computer vision tasks. One of the earliest successful applications of CNNs was LeCun et al [32] in optical character recognition, in which the authors propose LeNet-5, a 7-layer CNN to recognize hand-written digits. More recently, Krizhevsky et al [33] popularized the AlexNet architecture, which made significant improvements over existing methods in the large-scale image classification challenge ImageNet [34]. Following the success of AlexNet, more accurate CNN architectures have been proposed, including VGG [35], Inception [36], and ResNet [37]. The feature representations of these networks, after being pre-trained on the ImageNet classification challenge, have been shown to be very effective in nearly all related computer vision tasks, including but not limited to scene recognition [38], object detection [39, 40], and semantic/instance segmentation [39, 41, 42].


As noted in LeCun et al [32], the stacking of convolutional layers with sub-sampling every few layers "ensures some degree of shift, scale, and distortion invariance." This is similar to the previously popular HOG feature pyramids generated from running HOG filters at multiple scales of the image. One can think of CNN filters as much more expressive HOG filters that are end-to-end-trainable with the main task. While end-to-end training will not always outperform a compositional approach (training the model one sub-problem at a time), it benefits from being much easier to set up and learns its own internal representations, which are jointly optimized for the task with the rest of the architecture's components.

An interesting result of training CNNs end-to-end is that the filters will naturally learn to detect patterns that benefit the main task – arguably a form of automatic part discovery. CNN visualization works such as Zeiler et al [43] suggest that the model starts by learning low-level edge-like cues at the bottom and becomes increasingly abstract as layers are stacked on. From the bottom up, low level edge filters are pooled to create various shape filters. These are combined to form simple parts such as wheels and eyes – parts which are further pooled to capture entire viewpoints of vehicles or faces of animals. With the automatic representation learning and part discovery due to end-to-end CNN training, it is no longer necessary to manually engineer feature representations or to manually identify salient image patterns from which to train part detectors. Further, the parts and patterns determined by the CNN's training may be better choices than manually engineered solutions (given sufficient data), as their selection was driven directly by the model's task performance as opposed to human intuition. Our work in latent attention will exploit similar behavior in end-to-end training of neural network architectures, allowing the model to self-identify task-relevant image regions.

2.1.4 Soft Attention Networks

Up until now, we have looked at examples of visual attention in which the task's objective is well-known beforehand. We now look at a more general framework for visual attention in which the attention behavior may be adapted to a different objective on the fly.


Soft attention refers to a soft, differentiable alternative to the argmax selection:

\hat{v} = \arg\max_{v_i \in V} s(v_i)    (2.2)

where we wish to select the vector \hat{v} from a set of N vectors v_i ∈ V based on their respective scores s(v_i). As this is non-differentiable, we approximate the hard selection with a weighted average:

\hat{v} = \sum_{i=1}^{N} g(s(v_i)) \, v_i \quad \text{s.t.} \quad \sum_{i=1}^{N} g(s(v_i)) = 1    (2.3)

Here, g(s(v_i)) is the normalization function over selection scores. It is most commonly a softmax distribution over all vectors v_i ∈ V:

g(s(v_i)) = \frac{\exp s(v_i)}{\sum_{j=1}^{N} \exp s(v_j)}    (2.4)

Note that as the softmax distribution approaches 1-hot, soft attention approaches argmax selection.
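A minimal NumPy sketch of the weighted selection in Eqs. (2.3)-(2.4), assuming the scores s(v_i) have already been computed; the names here are illustrative only.

import numpy as np

def soft_attention(V, scores):
    # V: (N, d) matrix of candidate vectors v_i; scores: (N,) array of s(v_i).
    g = np.exp(scores - scores.max())   # softmax weights (shifted for numerical stability)
    g = g / g.sum()                     # weights sum to 1, as required by Eq. (2.3)
    return g @ V                        # weighted average approximating argmax selection

# Example: attend over 4 three-dimensional candidates.
V = np.random.rand(4, 3)
scores = np.array([0.1, 2.0, -1.0, 0.5])
v_hat = soft_attention(V, scores)       # shape (3,)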

Soft attention has seen applications in numerous deep network architectures to tackle various tasks. Bahdanau et al [44] uses soft attention as a soft alignment between a source sentence and its target translation. Xu et al [45] similarly uses this technique to align different parts of the image with the next word to predict in an image captioning framework. Sukhbaatar et al [46] learns to predict a soft distribution over a set of previously made statements to respond to a natural language query.

The significance of soft attention to our work is the abstraction of the scoring function s(v_i). Specifically, if we parameterize the scoring function as s(v_i, θ), then we can adjust the visual attention behavior on the fly by predicting the appropriate θ. We address this in our chapters on attention for visual question answering, in which we try to vary the behavior of visual attention for different questions about the image.

2.2 Vision Tasks

The field of computer vision spans a diverse range of tasks requiring some form of visual perception. In our work, we focus on incorporating the ability to learn task-driven visual attention in three main tasks: keypoint localization, fine-grained image recognition, and visual question answering. We provide an overview of each of the tasks in the following section.

2.2.1 Keypoint Localization and Regression in CNNs

In the following work, we refer to the task of localizing annotated pixel locations (e.g. the center of the eye or nose) as keypoint localization. The keypoint/part localization task is strongly related to object detection in that it was used to model object detectors capable of capturing various poses. The use of pose in object detection can be seen in the line of work deriving from the Pictorial Structure models [47, 48], in which recognition was modeled as localizing rigid parts arranged in a deformable configuration. Popular datasets for keypoint localization include Leeds Sports [49, 50], Poselets on Pascal [51], UCSD birds [52], and more recently MSCOCO [53].

Due to the recent advancements in deep learning, keypoint localization methods have shifted from classical approaches that focus on localizing various part-based templates ([54, 55, 56]) to models based on end-to-end trained CNNs. Most relevant to our work are CNN architectures that attempt to regress to the target coordinates. Prior to our work, the most notable applications of deep regression networks to keypoint localization are Toshev et al [57] and Sun et al [58], which use cascades of deep network based regressors for human pose estimation and facial keypoint localization respectively. At each stage of the cascades, the network uses a region around the previous prediction to acquire higher resolution inputs. This allows the models to slowly adjust their prediction context in a coarse-to-fine fashion. The cascade addresses the problem in which CNNs expect a fixed-size input – feeding in the entire image will require downsampling, whereas feeding in smaller regions of the image would involve knowing where to crop and how much context is necessary. Instead of cascades, our work as described in Chapter 3 relies on multiple regions sampled with Edge Boxes from the image and simultaneously predicts all keypoints. Varying sized regions provide varying resolution and context, and we achieve more robust predictions from multiple regions with statistical outlier removal.


2.2.2 Fine-Grained Image Recognition

Fine-grained visual recognition refers to classification between visually similar and closely related categories. Differences may be as minute as feather color or beak shape between birds [52, 59], petal shapes between various plants [60], or even fur patterns between various types of dogs [61]. Prior work in this field focuses on localizing informative parts of objects and then extracting features from them for classification. Using pairs of localized keypoints, Berg et al [62] learn a set of highly discriminative features for fine-grained classification. Farrell et al [63] and Branson et al [64] use pose normalized representations of birds and their regions (head, torso, entire bird) followed by feature extraction for classification. Liu et al [1] extend the exemplar based model of [65] with pose information for keypoint localization and subsequent classification of birds. Based on the very successful framework of the RCNN [66], Zhang et al [67] perform bird classification using three localized bird regions: head, torso, and full body.

The above mentioned methods are highly dependent on accurate keypoint and bird region localization. In fact, [62, 63] rely on the groundtruth bird bounding box at test time to localize keypoints and to perform classification. Our work overcomes this bottleneck of localization and we demonstrate state-of-the-art classification performance using the framework of [67] along with our localized regions.

2.2.3 Visual Question Answering

Visual question answering (VQA) is the task of answering a natural language question about an image. VQA includes many challenges in language representation and grounding, recognition, common sense reasoning, and specialized tasks such as counting objects and reading signs. To some degree, VQA benchmarks were proposed as a vision-language task with a less ambiguous evaluation than one such as image captioning. It is much easier to identify correct and incorrect responses to a question about an image than to determine whether a random caption in a dataset is a valid match for an image. Further, models tackling various vision-language tasks are often similar in that they contain a mechanism for comparing vision and language feature representations. As such, improvements in the VQA task will likely transfer to other related vision-language tasks as well.

Our work experiments on the VQA dataset of Antol et al [6] due to the open-ended nature of its question and answer annotations. Questions are collected by asking annotators to pose difficult problems for a smart robot, and multiple answers are collected for each question. We experiment on the multiple-choice setting as its evaluation is less ambiguous than that of open-ended response evaluation. Most other visual question answering datasets [68, 69] are based on reformulating existing object annotations into questions, which provides an interesting visual task but limits the scope of visual and abstract knowledge required. Accompanying approaches tend to use recurrent networks to model language and predict answers [68, 6, 69, 70]. We find a fixed-length representation for vision and language to be highly effective, and our approach differs at a high level in our focus on learning where to look. Simple bag-of-words models have been shown to perform roughly as well as, if not better than, sequence-based LSTMs [68, 6]. Further, Yu et al [69] propose a Visual Madlibs dataset for fill-in-the-blank and question answering and focus their approach on learning latent embeddings, finding normalized CCA [71] to outperform recurrent networks for embedding.


CHAPTER 3

LOCALIZING KEYPOINTS

The most common approach to keypoint localization is to learn a set of keypoint detectors to model appearance and an associated spatial model [67, 2, 1, 64] to capture their spatial relations. The keypoint detectors generate a set of likely candidates per part and a spatial model is used to infer the most likely configuration. Keypoint detectors typically model local appearance and thus an approach has to rely on expressive spatial models to capture long range dependencies. Alternatively, the keypoint detectors can condition their predictions on larger spatial support and jointly predict several keypoints [72], reducing the need to explicitly model inter-keypoint relationships.

In this chapter, we describe a method for learning a keypoint localization model that relies on larger spatial support to jointly localize several keypoints and predict their respective visibilities. Leveraging recent developments in Convolutional Neural Networks (CNNs), we introduce a framework that outperforms the state of the art for localizing bird keypoints such as eyes and beaks on the CUB dataset. Further, while CNN-based methods suffer from a loss of image resolution due to the fixed-sized inputs of the networks, we introduce a simple sampling scheme that allows us to work around the issue without the need to train cascades of coarse-to-fine localization networks [57, 58].

Our approach to keypoint localization mainly draws inspiration from the use of regression in the MultiBox approach by Erhan et al [73]. The authors train a deep network which regresses a small number of bounding boxes (∼100) as object bounding box proposals, along with a confidence value for each bounding box.

Our work is applied to the Caltech-UCSD Birds dataset. The most closely related work on that dataset is from Liu et al [1, 2]. Their works achieve remarkable performance on both keypoint localization and visibility prediction using ensembles of pose exemplars and part-pair detectors. We compare our performance with theirs using metrics defined in their work.


3.1 Method

We design our model to simultaneously predict keypoint locations and their visibilities for a given image patch. To share information across categories, our model is trained in a category agnostic manner. At test time, we efficiently sample each image with Edge Boxes, make predictions from each Edge Box, and reach a consensus by thresholding for visibility and reporting the medoid.

3.1.1 Training Convolutional Neural Networks for Keypoint Regression

Our network is based on AlexNet ([33]), but modified to simultaneously predict all keypoint locations and their visibilities for any given image patch. AlexNet is an architecture with 5 convolutional layers and 3 fully connected layers. Henceforth, we refer to the 3 fully connected layers as fc6, fc7, and fc8. We replace the final fc8 layer with two separate output layers for keypoint localization and visibility respectively. Our network is trained on Edge Box ([5]) crops extracted from each image and is initialized with a pre-trained AlexNet ([33]) trained on the ImageNet ([34]) dataset. Each Edge Box is warped to 227×227 pixels before it can be fed through the network. We apply padding to each Edge Box such that the warped 227×227 pixel crop has 16 pixels of buffer in each direction.
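The original models were trained with Caffe; purely as an illustrative sketch (not the exact architecture or framework used), the two-headed modification described above could look as follows in PyTorch, where the module and variable names are assumptions.

import torch
import torch.nn as nn
import torchvision

N_KEYPOINTS = 15

class KeypointNet(nn.Module):
    def __init__(self):
        super().__init__()
        # AlexNet trunk; in practice the weights would be initialized from ImageNet pre-training.
        alexnet = torchvision.models.alexnet(weights=None)
        alexnet.classifier[6] = nn.Identity()  # drop the original fc8 classification layer
        self.trunk = alexnet
        # Two separate output heads replacing fc8: visibility and (x, y) locations.
        self.vis_head = nn.Linear(4096, N_KEYPOINTS)
        self.loc_head = nn.Linear(4096, 2 * N_KEYPOINTS)

    def forward(self, x):
        # x: (B, 3, 227, 227) batch of warped, padded Edge Box crops.
        f = self.trunk(x)
        v_hat = torch.sigmoid(self.vis_head(f))  # visibility confidences in [0, 1]
        l_hat = self.loc_head(f)                 # normalized keypoint coordinates
        return v_hat, l_hat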

Given N keypoints of interest, we train a network to output an N-dimensional vector \hat{v} and a 2N-dimensional vector \hat{l} corresponding to the visibility and location estimates of each of the keypoints k_i, i ∈ {1, ..., N}, respectively. The corresponding groundtruth targets during training are v and l. We define v to consist of indicator variables v_i ∈ {0, 1} such that v_i = 1 if keypoint k_i is visible in the given Edge Box image before padding is performed, and 0 otherwise. The groundtruth location vector l is of length 2N and consists of pairs (l_{x_i}, l_{y_i}), which are the normalized (\tilde{x}, \tilde{y}) coordinates of keypoint k_i with respect to the un-padded Edge Box image. Output predicted from the network, \hat{v}_i ∈ [0, 1], acts as a measure of confidence of keypoint visibility, and the 2D locations predicted by the network are denoted by \hat{l}_i.

We use the Caffe framework [74] for training our deep networks. To train a network optimized for both tasks simultaneously, we define our losses as follows:

L_{vis} = \|v - \hat{v}\|_2^2 \quad \text{and} \quad L_{loc} = \sum_{i=1}^{N} v_i \cdot \left[ (l_{x_i} - \hat{l}_{x_i})^2 + (l_{y_i} - \hat{l}_{y_i})^2 \right]    (3.1)

L_{net} = L_{vis} + L_{loc}    (3.2)

The visibility loss L_{vis} is the squared Euclidean distance between the ground truth visibility label vector v and the predicted visibility vector \hat{v}. The values in \hat{v} always lie between 0 and 1, as they are obtained after squashing network outputs with a sigmoid function. The keypoint localization loss L_{loc} is a modified Euclidean loss, in which we set the loss between the prediction and the target to be 0 if v_i = 0, i.e. if the keypoint k_i is absent in the given image. The final training loss L_{net} is given by the sum of the two losses.
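A minimal sketch of the losses in Eqs. (3.1)-(3.2), assuming the tensor shapes described above; it is written in PyTorch for illustration and is not the Caffe loss setup actually used.

import torch

def keypoint_loss(v_hat, l_hat, v, l):
    # v_hat, v: (B, N) predicted / ground-truth visibilities; l_hat, l: (B, 2N) normalized (x, y) pairs.
    B, N = v.shape
    L_vis = ((v - v_hat) ** 2).sum(dim=1)                    # squared Euclidean distance, Eq. (3.1)
    sq_err = (l - l_hat).reshape(B, N, 2).pow(2).sum(dim=2)  # per-keypoint squared coordinate error
    L_loc = (v * sq_err).sum(dim=1)                          # zero out keypoints with v_i = 0
    return (L_vis + L_loc).mean()                            # L_net, Eq. (3.2), averaged over the batch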

To construct our training set for predicting keypoint visibility and locations, we extract up to 3000 Edge Boxes per image. To train a robust predictor, we need a collection of training images with high variability in which different subsets of keypoints are visible. We generate examples that satisfy these criteria by retaining the subset of Edge Boxes which have at least 50% of their area contained inside the groundtruth bounding box and have at least 20% intersection over union (IOU) overlap with the groundtruth bounding box. We also included up to 50 random boxes per image from outside the bounding box as negative background examples. We augment our dataset with left/right flips. After flipping, appropriate changes were applied to the label vectors. This consisted of swapping orientation-sensitive keypoints such as "left eye" and "left wing" with "right eye" and "right wing", and updating their respective coordinates and visibility indicators. We first train our model on 25 images per class and tune algorithmic and learning rate parameters on a held-out validation set comprising the remaining 4-5 images per class. Finally, we re-train using the entire training set before running our model on the test set.
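For concreteness, a hedged sketch of the two box-filtering criteria described above (at least 50% of the Edge Box area inside the ground truth box, and at least 20% IOU); the corner-format box representation and function names are assumptions, not the original implementation.

def box_area(box):
    # box = (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def keep_for_training(edge_box, gt_box, min_contained=0.5, min_iou=0.2):
    ix1, iy1 = max(edge_box[0], gt_box[0]), max(edge_box[1], gt_box[1])
    ix2, iy2 = min(edge_box[2], gt_box[2]), min(edge_box[3], gt_box[3])
    inter = box_area((ix1, iy1, ix2, iy2))
    union = box_area(edge_box) + box_area(gt_box) - inter
    contained = inter / max(box_area(edge_box), 1e-12)  # fraction of the Edge Box inside the GT box
    iou = inter / max(union, 1e-12)
    return contained >= min_contained and iou >= min_iou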

3.1.2 Combining Multiple Keypoint Predictions

Figure 3.1: The pipeline of our keypoint localization process: Given an input image, we extract multiple edge boxes. Using each edge box, we make predictions for the location of each of the 15 keypoints, along with their visibility confidences. We then find the best predicted location by performing confidence thresholding and finding the medoid. The process is illustrated for the right eye keypoint (black edge boxes without associated dots make predictions with confidences below the set threshold, and green is an outlier with a high confidence score).

Our algorithm for dealing with predictions from multiple Edge Boxes at test time is illustrated in Fig. 3.1. Due to the variance from making predictions from multiple unique subcrops of the image, we need to form a consensus from the multiple predictions. In our experiments, we found that after removing predictions with low visibility confidences, the remaining predictions had a peaky distribution around the ground truth. We use the medoid as a robust estimator for this peak and found it to be effective in most cases (Fig. 3.2). For the task of localizing part regions around keypoints (described in chapter 4), we found on our train/val split that we achieved better localization performance if we kept a set of good predictions (referred to as inliers) instead of using only the medoid. We now describe our procedure for obtaining a tight set of inliers and our choice of parameters. For the keypoint prediction task, we only use the visibility thresholds and report the medoid.

Case 1: Ground Truth Object Box Given:

We first describe our method in the case that the ground truth object boxes are given. Using the ground truth object box, we retain the generated Edge Boxes that are mostly contained within it and have an IOU of at least 0.2 with it. This results in roughly 50-200 remaining Edge Box subcrops per image. Each subcrop is then independently fed through our keypoint prediction network, returning a set of normalized keypoint predictions and visibilities.

Because each subcrop is expected to cover less than the whole object and contain only a subset of the keypoints, we drop any prediction whose corresponding visibility is below 0.6. Because we make use of multiple overlapping subcrops, it is very likely that at least one of them will lead to a prediction with a sufficiently high visibility score, thereby allowing us to be much more aggressive with the false positive filtering.

Given multiple remaining keypoint predictions per keypoint with sufficiently high visibility scores, we then proceed to remove outliers. To do so, we threshold on a modified Z-score based on a description given by Iglewicz and Hoaglin ([75]). The modified Z-score is redefined using medoids and medians in place of means, as these estimates are more robust to outliers.

Let $p_i$, where $i = 1, \cdots, M$, be the set of $M$ surviving un-normalized keypoint predictions (for a given keypoint) in $(x, y)$ image coordinates. We first define $\bar{p}$ to be the medoid prediction such that:

\[
\bar{p} = \underset{p_j}{\arg\min} \sum_{i=1}^{M} \|p_j - p_i\|_2, \qquad j \in \{1, \ldots, M\} \tag{3.3}
\]

In other words, $\bar{p}$ is the prediction whose total Euclidean distance to all other predictions for that keypoint is minimal. While this optimization is costly at a large scale, we typically deal with only 10-20 predictions at a time after thresholding on visibility scores. To compute the modified Z-score we use:

\[
Z_i = \frac{\lambda \, \|p_i - \bar{p}\|_2}{\operatorname{median}_i \left( \|p_i - \bar{p}\|_2 \right)}, \qquad i \in \{1, \ldots, M\} \tag{3.4}
\]

Here, the denominator is the median absolute deviation, or simply the median distance from the medoid $\bar{p}$. We use the recommended $\lambda = 0.6745$. The above procedure is computed separately for all 15 sets of keypoint prediction candidates. Finally, we drop any keypoint prediction with $Z_i > 0.35$, a threshold that was experimentally determined on the held-out set.
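A compact sketch of the full consensus step (visibility filtering, medoid, and modified Z-score pruning) might look as follows. The thresholds correspond to the Case 1 settings, and the guard against a zero median deviation is our own addition.

```python
import numpy as np

def medoid(points):
    """Point minimizing the summed Euclidean distance to all other points (Eq. 3.3)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return points[np.argmin(dists.sum(axis=1))]

def consensus(preds, vis, vis_thresh=0.6, z_thresh=0.35, lam=0.6745):
    """preds: (M, 2) keypoint predictions, vis: (M,) visibility confidences."""
    pts = preds[vis >= vis_thresh]            # drop low-visibility predictions
    if len(pts) == 0:
        return None, None                     # keypoint judged not visible
    p_bar = medoid(pts)
    dev = np.linalg.norm(pts - p_bar, axis=1)
    mad = np.median(dev) + 1e-8               # median absolute deviation (guarded)
    z = lam * dev / mad                       # modified Z-score (Eq. 3.4)
    inliers = pts[z <= z_thresh]              # surviving "inlier" predictions
    return p_bar, inliers
```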

Case 2: Ground Truth Object Box Not Given:

The scenario in which the ground truth object box is not given requires little change from the above case. Using the Edge Box ranking, we found that most of our "good" Edge Boxes fell within the top 600 Edge Boxes per image, saving us a lot of computation. Tuning parameters on our train/val split, we found that an even more aggressive visibility threshold of 0.94 and a Z-score threshold of 0.3 gave the best results.

Medoid-Shift:

While simple Z-score thresholding combined with the medoid already achieves excellent results, as we will demonstrate in the results section, we were able to further improve our results by using medoid shift ([76]). We use the medoid of the largest output cluster from the algorithm instead of the medoid computed over all of the visibility-filtered predictions.
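The sketch below shows a simplified version of the medoid-shift step we rely on; the Gaussian kernel and its bandwidth are assumptions on our part, and the real implementation follows [76].

```python
import numpy as np

def medoid_shift_peak(points, bandwidth=10.0):
    """Return the medoid of the largest medoid-shift cluster (simplified sketch of [76])."""
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    W = np.exp(-(D ** 2) / (2 * bandwidth ** 2))   # kernel weights between predictions
    # One medoid-shift step: each point i moves to the point j minimizing
    # the kernel-weighted sum of distances sum_k D(j, k) * W(k, i).
    nxt = np.argmin(D @ W, axis=0)
    for _ in range(len(points)):                   # follow shift links to their fixed points
        nxt = nxt[nxt]
    modes, counts = np.unique(nxt, return_counts=True)
    return points[modes[np.argmax(counts)]]        # medoid of the largest cluster
```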

3.2 Results

We evaluate our keypoint prediction model on the Caltech UCSD-Birds dataset by Wah et al. This dataset contains 200 bird categories, with 15 keypoint locations and visibility labels annotated for each of its 11,788 images. We first evaluate our keypoint localization and visibility predictions against other top-performing methods.

3.2.1 Keypoint Localization and Visibility Prediction

Table 3.1 reports our keypoint and visibility performance without using any ground truth bounding box information. Our medoid method reports the medoid of predictions above a visibility threshold, as seen in the red star in Fig. 3.2. Our "mdshift" method reports the new medoid computed using medoid shift, which is the blue circle in Fig. 3.2. We used the evaluation code provided by the authors of [1] to measure our performance using the metrics defined in their work. In short, PCP (Percent Correct Parts) is the percentage of keypoints localized within 1.5 times the annotator standard deviation. We received the pre-computed standard deviations and evaluation code from the authors of [1] to avoid any discrepancies during evaluation. AE (Average Error) is the mean Euclidean prediction error, capped at 5 pixels, computed across examples where a prediction was made and a ground truth location exists. FVR and FIR refer to False Visibility Rate and False Invisibility Rate, respectively.
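For reference, PCP can be computed roughly as in the sketch below; the actual numbers in Table 3.1 come from the evaluation code of [1], and the per-keypoint standard deviations sigma are the annotator statistics mentioned above.

```python
import numpy as np

def pcp(pred, gt, sigma, visible, alpha=1.5):
    """Percent Correct Parts: fraction of visible keypoints predicted within
    alpha times the annotator standard deviation of the ground truth location."""
    dist = np.linalg.norm(pred - gt, axis=1)        # (N,) localization errors
    correct = (dist <= alpha * sigma) & visible     # only visible keypoints count
    return 100.0 * correct.sum() / max(visible.sum(), 1)
```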


Figure 3.2: Qualitative results for a subset of the keypoints (columns: original image, beak, breast, tail; legend: outliers, inliers, mean, median, medoid, medoid shift, ground truth). Predictions for most of the images cluster tightly, so simple prediction methods such as the medoid work well. Medoid shift adds to the robustness, leading to further improvements (second-to-last column). The primary failure mode is when visibility thresholding fails to rule out clusters of false positives (bottom right).

Method              PCP     AE     FVR     FIR
Poselets ([77])     24.47   2.89   47.9    17.15
Consensus ([65])    48.70   2.13   43.9    6.72
Exemplar ([1])      59.74   1.80   28.48   4.52
Ours (medoid)       68.7    1.4    17.1    5.2
Ours (mdshift)      69.1    1.39   17.1    5.2
Human ([1])         84.72   1.00   20.72   6.03

Table 3.1: Localization and visibility prediction performance of various methods without using the ground truth bounding box.

The additional methods for comparison are the same as those listed in [1].

Compared to the top-performing methods that also predict visibility, our method achieves the best numbers in three out of four metrics. Our PCP and AE metrics outperform the other methods in the table, with our medoid-shift variant performing slightly better. Our FIR is higher because we are using the visibility threshold tuned on the part-localization task. A slightly lowered threshold would lower the FIR and raise the FVR without significantly affecting the PCP.

The highest reported PCP is 66.7%, due to [2], which also predicts visibilities but did not report them. We compare against their PCP in Table 3.2.


PCP       Ba    Bk    Be    Br    Cr    Fh    Ey    Le    Wi    Na    Ta    Th    Total
Liu '13   62.1  49.0  69.0  67.0  72.9  58.5  55.7  40.7  71.6  70.8  40.2  70.8  59.7
Liu '14   64.5  61.2  71.7  70.5  76.8  72.0  70.0  45.0  74.4  79.3  46.2  80.0  66.7
Ours      74.9  51.8  81.8  77.8  77.7  67.5  61.3  52.9  81.3  76.1  59.2  78.7  69.1

Table 3.2: Comparison of per-part PCP with Liu et al. 2013 [1] and Liu et al. 2014 [2]. The abbreviated part names from left to right stand for back, beak, belly, breast, crown, forehead, eye, leg, wing, nape, tail, and throat.

Because our method differs significantly from theirs, we outperform them in only 7 of the listed part categories despite having a better overall PCP. This suggests that further improvements could be obtained by targeting the differences in the two models' behaviors.

3.3 Conclusion

We presented a method for obtaining state-of-the-art keypoint predictions on the Caltech UCSD-Birds dataset. We demonstrated that conditioning the predictions on multiple object proposals for sufficient image support can reliably improve localization without requiring a cascade of coarse-to-fine networks. We addressed the problem of fixed-size network inputs by sampling predictions from several boxes and determining the "peak" of the predictions with medoids. In the next chapter, we will look at applying these keypoint predictions to part-aligned fine-grained image classification.


CHAPTER 4

FINE GRAINED CLASSIFICATION WITH ALIGNED PARTS

Fine-grained image categorization is the task of accurately separating categories where the distinguishing features may be as minute as a different fur pattern, shorter horns, or a smaller beak. The widely accepted approach to such a task is intuitive: align analogous regions and compare. The alignment process allows the model to compare apples to apples, and oranges to oranges. One set of parameters can focus exclusively on learning the minute differences between beak shapes, whereas a different set can focus on wing patterns. In this chapter, we describe how we use the keypoint prediction results from the previous chapter to conduct region-aligned classification.

4.1 From Keypoints to Regions

In order to align analogous regions for fine-grained classification, we must first map our pixel-level keypoint predictions to alignable image regions from which we can extract features. To do this, we use the keypoint-to-region mapping from the works of Zhang et al. [67, 78]. Using the keypoints, three regions are identified for each bird: head, torso, and whole body. The head is defined as the tightest box surrounding the beak, crown, forehead, eyes, nape, and throat. Similarly, the torso is the box around the back, breast, wings, tail, throat, belly, and legs. The whole-body bounding box is the object bounding box provided in the annotations.
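The mapping from keypoints to regions reduces to taking the tightest box around a fixed subset of visible keypoints, as in the sketch below. The keypoint names and dictionary layout are illustrative, not the dataset's exact naming.

```python
import numpy as np

HEAD_KPS  = ['beak', 'crown', 'forehead', 'left eye', 'right eye', 'nape', 'throat']
TORSO_KPS = ['back', 'breast', 'left wing', 'right wing', 'tail', 'throat',
             'belly', 'left leg', 'right leg']

def tightest_box(kp_xy, kp_vis, subset):
    """kp_xy: dict name -> (x, y); kp_vis: dict name -> bool. Returns (x1, y1, x2, y2)."""
    pts = np.array([kp_xy[name] for name in subset if kp_vis[name]])
    if len(pts) == 0:
        return None                      # no visible keypoint in this subset
    x1, y1 = pts.min(axis=0)
    x2, y2 = pts.max(axis=0)
    return (x1, y1, x2, y2)

# head_box  = tightest_box(kp_xy, kp_vis, HEAD_KPS)
# torso_box = tightest_box(kp_xy, kp_vis, TORSO_KPS)
```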

To handle the case where the ground truth bounding box is not given at test time, we use an overlap heuristic based on the predicted head and torso boxes. We first find the tightest box around the predicted head and torso boxes. While this initial box will do well for birds in their canonical poses, it will result in an undersized box in many cases because the keypoints do not always capture the full extent of the bird.


Method                          Head    Torso   Whole Body

GT Bbox
  Part-Based RCNN ([67])        68.2    79.8    N/A
  Deep LAC ([79])               74.0    96.0    N/A
  Ours (single GT bbox)         75.6    90.2    N/A
  Ours (multiple)               88.8    93.9    N/A
  Ours (multiple, mdshift)      88.9    94.3    N/A

No GT Bbox
  Part-Based RCNN ([67])        61.4    70.7    88.3
  Exemplar ([1])                79.9    78.3    N/A
  Ours (multiple)               87.8    89.0    84.5
  Ours (multiple, mdshift)      88.0    88.7    84.6

Table 4.1: Comparison of part localization performance: our method based on keypoint prediction from Edge Boxes shows significant improvement over previous work.

We then assume that there exists an Edge Box with a high edge score that better captures the whole bird. To let the box expand to capture more of the object, we identify the Edge Boxes such that the tightest box is at least 90% contained within them and has at least 50% IOU overlap with them. The final whole-body bounding box is the Edge Box that passes both criteria and has the highest Edge Box object score. If no Edge Box passes the overlap test, we fall back to the initial tightest box.
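This expansion heuristic can be sketched as follows; the box format and helper names are assumptions for illustration, not the actual implementation.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter(a, b):
    return area([max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])])

def iou(a, b):
    i = inter(a, b)
    return i / (area(a) + area(b) - i)

def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def whole_body_box(head, torso, edge_boxes, edge_scores):
    """Expand the head+torso union to the best-scoring Edge Box that contains it."""
    start = union_box(head, torso)
    candidates = [(s, b) for b, s in zip(edge_boxes, edge_scores)
                  if inter(start, b) / area(start) >= 0.9     # start box >= 90% inside b
                  and iou(start, b) >= 0.5]                   # and sufficient IOU overlap
    if not candidates:
        return start                                          # fall back to the start box
    return max(candidates, key=lambda sb: sb[0])[1]           # highest-scoring Edge Box
```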

The results in Table 4.1 demonstrate that our keypoint predictions are useful in generating accurate part boxes. The lower performance of our single GT Bbox method suggests that our use of multiple predictions from Edge Boxes allows for more accurate localization. Further, we also computed head and torso boxes using the keypoint predictions from [1], shown in the "Exemplar" row. Based on their accuracy, their boxes should also be able to improve the results of [67].

Next, given bounding boxes for the head, torso, and whole body, we use the same SVM classification framework as [67] to conduct part-aligned fine-grained classification. Specifically, AlexNet fc6 features are extracted from each of the localized regions, concatenated into a feature vector of length 4096×3, and used for 200-way linear one-vs-all SVM classification.
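The classification stage then reduces to concatenating per-region features and training linear one-vs-all SVMs, for instance with scikit-learn as sketched below. This is only an illustration; the feature arrays are assumed to be pre-extracted AlexNet fc6 activations.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_part_aligned_classifier(fc6_head, fc6_torso, fc6_body, labels, C=1.0):
    """Each fc6_* is an (n_images, 4096) array of region features."""
    feats = np.concatenate([fc6_head, fc6_torso, fc6_body], axis=1)  # (n, 4096*3)
    clf = LinearSVC(C=C)              # one-vs-rest by default -> 200-way linear SVM
    clf.fit(feats, labels)
    return clf
```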


Figure 4.1: Examples of good (left) and failed (right) localization results. The ground truth boxes are in solid black. The head, torso, and whole-body boxes are in green, blue, and red, respectively. The head is correctly localized in most of the above examples. In the top-row middle example, even though the whole-body box IOU is low, most of the missed area is actually background due to the bird extending its wings. The failure examples show that we mostly fail on rare close-ups and when there are multiple instances.

4.2 Fine-Grained Classification

We now test our part-predictions in a fine-grained classification setting. These results are shown in Table 4.2. To do this, we train three networks to re-implement the three-part framework of [67] as described in the previous section. The oracle performance refers to classification assuming ground truth keypoints at test time. While [67] reports an oracle accuracy of 82.0%, we compare with the highest we were able to achieve with our implementation: 81.5%. This is likely due to minor differences in network training parameters. We also tried both fc6 and fc7 features and found that fc6 performed a little better. Although [67] and [64] noted that their drops in accuracy from using ground truth parts to predicted parts were surprisingly small, our relative improvements suggest that it is still worthwhile to focus on better localization. Further, we perform at least as well as the contemporary Deep LAC model ([79]), likely due to our better localization of the more discriminative head regions.

In Fig. 4.2, we show how our accuracy changes from the ideal case with ground truth keypoints (Oracle), to the use of predicted keypoints (GT Bbox), and finally to the setting with the GT Bbox removed (No GT Bbox). Unsurprisingly, better localization at test time results in a significantly smaller drop as annotations are removed.

The same plot also shows an ablation test of individual parts. It appears that the bulk of our performance comes from discriminating localized bird heads.


Figure 4.2: Comparison of classification accuracies obtained using varying combinations of parts localized under different conditions. Accuracies (%) — H+T+B: Oracle 81.5, GT Bbox 80.3, No GT Bbox 78.3; H+T: 81.0, 78.8, 77.7; H: 71.5, 68.3, 68.2; T: 61.0, 60.4, 58.7.

Method                          Acc.

Oracle
  Oracle Parts + SVM            81.5

GT Bbox
  DPD ([78])                    51.0
  Symbiotic ([80])              59.4
  Alignment ([81])              62.7
  DeCAF ([82])                  65.0
  POOF ([62])                   56.8
  Part-Based RCNN ([67])        76.4
  Deep LAC ([79])               80.3
  Ours (mult, medoid)           80.3
  Ours (mult, mdshft)           80.3

No GT Bbox
  Pose Norm ([64])              75.7
  Part-Based RCNN ([67])        73.9
  Ours (mult, medoid)           78.2
  Ours (mult, mdshft)           78.3

Table 4.2: Comparison of our classification accuracy with other works.

This is also supported by [64], which observed that of their learned poses, the one that corresponded to the head was the most discriminative. This suggests that most of our improvement over our base method of [67] comes from significantly improving our head part localization (shown in Table 4.1).

4.3 Conclusion

We presented an extension of our keypoint prediction work to fine-grained classification. We demonstrated the importance of keypoint prediction with accurate visibility prediction in robustly localizing image regions. Using our part-localization approach, we improved upon existing work in both localizing head and torso regions, and subsequently improved the overall classification accuracy.


CHAPTER 5

LATENT ATTENTION FOR VISUAL QUESTION ANSWERING

5.1 Introduction

Visual question answering (VQA) is the task of answering a natural language question about an image. VQA includes many challenges in language representation and grounding, recognition, common sense reasoning, and specialized tasks like counting and reading. In this chapter, we focus on a key problem for VQA and other visual reasoning tasks: knowing where to look. Consider Figure 5.1. It is easy to answer "What color is the walk light?" if the light bulb is localized, while answering whether it is raining may be dealt with by identifying umbrellas, puddles, or cloudy skies. We want to learn where to look to answer questions supervised by only images and question/answer pairs. For example, if we have several training examples for "What time of day is it?" or similar questions, the system should learn what kind of answer is expected and where in the image it should base its response.

Learning where to look from question-image pairs has many challenges. Questions such as “What sport is this?” might be best answered using the full image. Other questions such as “What is on the sofa?” or “
