Visual Material Recognition


Submitted to the Faculty of

Drexel University

by

Gabriel Schwartz

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

December 2017


First and foremost I want to thank my advisor, Dr. Ko Nishino. Without his support and patience I would not be where I am today. He was a constant source of guidance, insight, and motivation throughout my graduate student career. He knew what I was capable of even when sometimes I did not.

My parents have always supported me and I’ve never doubted that they would continue to do so regardless of where life took me. For that I can’t thank them enough. Their hard work and dedication have afforded me countless opportunities; my education is just one of many.

I’m extremely grateful for my labmates, especially Geoff and Steve. I could always turn to them for advice, technical discussions, or just to talk about video games. Simply knowing that they faced the same obstacles and succeeded was a source of hope. I’d also like to thank all of my friends and climbing partners over the years. There’s no better way to relax than a trip into the desert or a long day on the rocks.

This thesis was brought to you in part by the National Science Foundation and the Office of Naval Research. Thanks to their support I’m able to emerge from the Ph.D. program debt-free, which these days is something to be thankful for.


Contents

List of Tables . . . vi

List of Figures . . . vii

Abstract . . . xiii

1. Introduction . . . 1

1.1 Contributions . . . 5

2. Related Work . . . 6

2.1 Material Recognition . . . 6

2.2 Attributes . . . 8

2.2.1 Fully-Supervised Attributes . . . 9

2.2.2 Weakly-Supervised Attributes . . . 9

2.3 Material Perception and Convolutional Neural Networks . . . 11

2.4 Dense Prediction . . . 12

2.5 Context in Visual Recognition . . . 13

3. Visual Material Traits . . . 14

3.1 Representing Material Traits . . . 17

3.1.1 Convolutional Material Trait Features . . . 18

3.1.2 Supplemental Features . . . 22

3.1.3 Groupwise Feature Selection . . . 22

3.2 Recognizing Material Traits . . . 23

3.3 Using Visual Material Traits . . . 28


4.1 Perceptual Distance between Materials . . . 37

4.2 Defining the Material Attribute Space . . . 41

4.3 Training a Material Attribute Classifier . . . 43

4.4 Analysis of Discovered Attributes . . . 46

4.5 From Discovered Attributes to Materials . . . 50

5. Perceptual Material Attributes in Convolutional Neural Networks . . . 54

5.1 Perceptual Material Attributes from Local Material Recognition . . . 55

5.1.1 Finding Material Attributes in a Material Recognition CNN . . . 56

5.1.2 Material Attribute-Category CNN . . . 56

5.2 Local Material Database . . . 59

5.2.1 Material Category Hierarchy . . . 59

5.2.2 Data Collection and Annotation . . . 61

5.3 Perceptual Material Attributes Discovered in the MAC-CNN . . . 64

5.3.1 Properties of the Perceptual Material Attributes . . . 64

5.3.2 Local Material Recognition . . . 68

5.4 Novel Material Category Recognition . . . 70

6. Integrating Local Materials with Global Context . . . 74

6.1 Role of Context in Material Recognition . . . 77

6.1.1 Object Context . . . 78

6.1.2 Place Context . . . 78

6.2 Integrating Context in a Material Segmentation CNN . . . 80


6.3.1 Material Segmentation Comparisons . . . 84

6.3.2 Ablation Studies . . . 86

6.3.3 Qualitative Examples . . . 91

7. Conclusion . . . 94

7.1 Visual Material Traits . . . 94

7.2 Perceptual Material Attributes . . . 95

7.3 MAC-CNN and Local Materials Database . . . 95

7.4 Integrating Materials and Context . . . 96

7.5 Future Work . . . 96

Bibliography . . . 98

List of Tables

3.1 Selected features for material traits. As “fuzziness” is characterized by fine edge patterns, oriented filters and LBP are useful. Since we define “shiny” only on areas that exhibit specular highlights, it follows that color histograms and learned convolutional filters are important features for this material trait. . . . 23

3.2 Performance breakdown. FS: feature selection, SF: supplemental features, CAE: convolutional auto-encoder features. For the first row we performed direct material category recognition using the concatenation of all feature sets. This shows that the trait representation is indeed providing crucial information. . . . 31

6.1 Material segmentation scores for same-dataset experiments (each method trained and tested only on the given dataset). For MINC models, the table shows the score of the best scoring single model without ensembles. Note that the OpenSurfaces [7] dataset is a subset of MINC, and the FMD is a subset of our local materials database. . . . 84

6.2 Material segmentation scores for cross-dataset experiments (each model trained on one dataset and tested on another). Models are tested without retraining, on overlapping categories only, in order to highlight generalization performance. . . . 86

6.3 Accuracy for varying sources of context. Object and place categories each contribute significantly to the overall accuracy, and the combination of the two is even more accurate. As an example of this, sinks (an object) are often metal or ceramic, and bathrooms (a place) often contain metals and ceramics. Bathroom sinks, however, are typically ceramic. Objects and places together can provide information that is not available given either alone. . . . 88

6.4 Accuracy for varying levels of context granularity. Fine-grained places may not appear in many images, but coarse grained categories may offer little in the way of material recognition cues. We find that the finest category granularity offers the best material segmentation performance. In this case, the 205 place categories are both fine-grained and sufficiently well-distributed across training examples. . . . 88

6.5 Accuracy with context introduced at varying levels. We introduce context at each of the above layers and compute material segmentation accuracy. The accuracy increases as the context is introduced at higher layers in the network, showing that the best level for context introduction is in the upper layers of the network. . . . 91


List of Figures

3.1 Materials like the plastic in these images exhibit a wide range of appearances depending on the object and scene, making extraction of material information without the use of object information challenging. We propose to locally recognize visual material traits, distinct appearances of material properties such as “translucent,” to provide contextual cues for challenging vision tasks including material category recognition and segmentation. . . . 15

3.2 When adapted to use aggregated features from local image patches, methods that perform well on full images quickly lose accuracy. This suggests that they are relying heavily on context, including object shape cues, to recognize materials. . . . 16

3.3 Successfully recognized material traits. These image patches were recognized by our framework as exhibiting the indicated material traits. Even at the patch level, we can see the characteristic visual appearances of each material trait. . . . 17

3.4 These 7×7 px convolution filters learned by the CAE represent the top three filters for the listed material traits, ranked by average presence in the testing images. The filters represent characteristic local texture and color patterns. The six filters on the right do not rank in the top three for any material trait. They exhibit significantly less texture variation than the top filters. . . . 21

3.5 Example material trait recognition. Non-masked pixels in (b) and (c) correspond to pixels with high probability (p > 0.5) of exhibiting the given trait. Note that the recognized material traits appear consistently across regions of related materials. . . . 24

3.6 Visual material trait recognition accuracy. Material traits are recognized via binary classification on a balanced training and testing set, thus random chance accuracy is 50%. Most traits are recognized well. Difficult material traits, such as metallic and transparent, are challenging due to their object- and environment-dependent appearances. Average accuracy is 78.4%. . . . 25

3.7 Material trait frequency distributions. We compute the class-conditional distributions for appearance frequency of each material trait given each material category. These are stored as histograms, examples of which are shown above. Plastic is most often smooth, while stone is very rarely smooth. . . . 26

3.8 Our framework produced false-positive detections of material traits in these patches. For the challenging metallic trait, it is clear that color plays a strong role. The misclassifications generally have a metallic color even though the material is not metal. In some rare cases such as “smooth” there are missing annotations and thus the false positives are actually true positives. . . . 27


of glass to create characteristic local distortions. . . . 29

3.10 Three misclassified ImageNet images, with the true class for each prediction in parentheses. The left two are a result of confusing appearances (striped and translucent are more often associated with wood and plastic respectively) while the rightmost is due to the bounding box poorly fitting the object. . . . 31

3.11 Comparing segmentation with and without material traits. Images on the left were segmented using the original NCuts algorithm, while those on the right were segmented with our modified version. Material traits can indicate the difference between fuzzy grass in the foreground and rocks in the background, despite the fact that they have similar colors. . . . 33

4.1 Sample material image patches. Each column contains patches containing the same material. We would like to obtain a set of attributes that describe what makes each material look distinct. Asking annotators to simply describe the patches, however, is an ambiguous question. Patches may look similar even though the annotator cannot find a concrete word to identify the similarity. In this chapter, we show that we can probe the human perception of materials by asking only for binary visual similarity decisions: “Do these two patches look similar?” . . . 36

4.2 Example projections of materials into a 2D similarity subspace. The locations of the two material categories corresponding to the axes are marked. We would expect that, in this case, water would lie furthest along the “water” axis and likewise with leather. Materials with common visual properties, such as the smoothness of plastic and glass, lie close to each other. Materials with distinct visual properties, such as woven fabric and shiny metal, do not. . . . 40

4.3 t-SNE [53] embedding of materials from the raw feature space (a) and from our discovered attributes (b). We embed a set of material image patches into 2D space via t-SNE using raw features and predicted attribute probabilities as the input space for the embeddings. Though t-SNE has been shown to perform well in high-dimensional input spaces, it fails to separate material categories from the raw feature space. Material categories are, however, clearly more separable with our attribute space. . . . 45

4.4 Per-pixel discovered attribute probabilities for four attributes (one per column). These images show that the discovered attributes exhibit patterns similar to those of known material traits. The first attribute, for example, appears consistently within the woven hat and the koala; the second attribute tends to indicate smooth regions. The last two columns show we are discovering attributes that can appear both sparsely and densely in an image, depending on the context. These are all properties shared with visual material traits. . . . 47


4.5 Typical per-pixel attribute probabilities based on a random attribute matrix. Unlike the predictions for attributes derived from human perception, these attributes appear randomly within a region and do not reflect any local visual properties. . . . 48

4.6 Correlation between discovered attribute predictions and material traits. Groups of attributes can collectively indicate the presence of a material trait. Metallic, for example, correlates positively with attribute 0 and negatively with attribute 8. . . . 49

4.7 Confusion matrix for material recognition on FMD images. Well-recognized categories, such as foliage, correspond with categories that appeared distinct in human annotations for perceptual distance. Annotators regularly selected foliage patches as appearing different from all other categories. . . . 51

4.8 Accuracy vs. training set size. Accuracy does not continue to increase as we use larger training datasets. This shows that we have successfully extracted as much local information as possible from human perception. . . . 52

5.1 Material Attribute-Category CNN (MAC-CNN) Architecture: We introduce auxiliary fully-connected attribute layers to each spatial pooling layer, and combine the per-layer predictions into a final attribute output via an additional set of weights. The loss functions attached to the attribute layers encourage the extraction of attributes that match the human material representation encoded in perceptual distances. The first set of attribute layers acts as a set of weak learners to extract attributes wherever they are present. The final layer combines them to form a single prediction. . . . 57

5.2 Our proposed material category hierarchy. Categories at the top level (red) separate materials with notable differences in physical properties. Mid-level categories (green) are visually distinct. The lowest level of categories (blue) are fine-grained and may require both physical and visual properties and expert knowledge to distinguish them. In our local materials database, we collect annotations for mid-level categories only, as they correspond to names likely to be familiar to a non-technical audience. We make one exception for concrete and asphalt, as those names are more familiar than the term “composite”. We also add supplemental categories for food, water, and non-water liquids. . . . 60

5.3 Local material patches extracted as the final step in our database creation process. These patches are used to compute human perceptual distances, and also form the training input for our combined material attribute-category CNN. . . . 62

5.4 Example annotation results. Annotators did not hesitate to take advantage of the ability to draw multiple regions, and most understood the guidelines concerning regions crossing object boundaries. As a result, we have a rich database of segmented local material regions. . . . 63


form separate regions in the space. . . . 65

5.6 Each column after the first (the input image) shows per-pixel probabilities for an extracted perceptual attribute. The attributes form clearly delineated regions, similar to semantic attributes, and their distributions match as well. . . . 66

5.7 By performing logic regression from our MAC-CNN extracted attributes to semantic material traits, we are able to extract semantic information from our non-semantic attributes. We can apply logic regression to material attribute predictions on patches in a sliding window to obtain per-pixel semantic material trait information. The per-pixel trait predictions show crisp regions that correspond well with their associated semantic traits. Traits are independent, and thus the maps contain mixed colors. Fuzzy and organic in the lower right image, for example, creates a yellow tint. . . . 67

5.8 Local material recognition accuracy, by category. Average accuracy is 60.2%. It is clear that some categories, such as metal and glass, are significantly more challenging to recognize locally. . . . 68

5.9 Images in each column share true material categories. The first three rows are correct predictions, and incorrect predictions (bottom two rows) are shown under the corresponding images. Glass and metal, for example, are both materials whose appearance depends heavily on the surrounding environment. Asphalt and concrete are both common paving materials and it is sensible that they are often confused. . . . 69

5.10 Applying the MAC-CNN in a sliding-window fashion leads to a set of material category probability maps. These material maps show that we may obtain coherent regions using only small local patches as input. The foliage predictions in the bottom right image are reasonable, as the local appearance is indeed a flower. In the upper right image, the local appearance of the fence resembles lace (a fabric). . . . 71

5.11 Graphs of novel category recognition accuracy vs. training set size for various held-out categories. The rapid plateau shows that we need only a small number of examples to define a previously-unseen category. The accuracy difference between feature sets shows that the attributes are contributing novel information. . . . 72


6.1 Material segmentation methods based on large-patch CNNs implicitly rely on the context present in the patch to classify materials. When the context is ambiguous, this leads to errors that can be resolved using the local appearance information. Here, for example, the object is a house in an outdoor scene, but the area surrounding the windows is a painted surface, not glass. Since existing methods do not cleanly separate local appearance and context, they cannot resolve such ambiguities. . . . 75

6.2 The image above, output from our MAC-CNN, shows material category probabilities for three materials: wood, foliage, and fabric, in the RGB channels respectively. The MAC-CNN uses only local information; as a result the foliage pattern on the sofa is misclassified as actual foliage. This is an example where scene context is vital in resolving an otherwise ambiguous local material appearance. . . . 76

6.3 The conditional distributions of materials given ground-truth object categories (top row) and predicted places (bottom row) are highly discriminative. Many context categories exhibit only a small set of materials. Some outliers are inevitable as the ground-truth COCO segmentation masks do not perfectly conform to actual object boundaries in the image. . . . 79

6.4 Distribution of the ratio of probabilities for predicted vs. true categories given that the prediction was incorrect. We can see that incorrect categories, the ones we would like to change via the use of context, can have much higher probabilities than that of the true category. As a result, simple multiplication with context-conditional distributions will rarely change the classifier’s output for the better. . . . 81

6.5 Material segmentation CNN architecture. Our network takes an input image, object category probability map, and a place category probability vector as inputs. Horizontal lines represent additive skip connections, with appropriate zero-padding on the channel axis. During training, the network only sees 48×48 px image patches to ensure we are separating local material appearance from context. At test time, we may input an image of arbitrary size. . . . 83

6.6 Accuracy vs. training set size on the MINC database (1.0 ≈ 2.5 million patches). We can clearly see that by separating local material appearance from context, we are able to recognize materials more accurately from fewer examples. . . . 87

6.7 These examples show that context helps disambiguate materials when local information is not sufficient. In the first set of insets, the water has a local appearance similar to asphalt. Global context suggests that this is unlikely. In the second set, we see that the airplane body is incorrectly recognized due to the lack of characteristic specular reflection that locally identifies metal. Again, context fixes this error. Sky is not a material and in this case has the local appearance of water, hence the prediction for those pixels in the second row. . . . 89


pearance. The stone statue image contains few contextual cues, but we are able to make reasonable predictions based on the local appearance. . . . 91

6.9 Additional examples of dense material recognition with context. It is important to note that neither skin nor sky are considered materials within our hierarchy. Skin is a unique case of material that is visible only on one object category (people) in most databases, and the sky is not a material. . . . 92

6.10 Additional examples of the output of our dense per-pixel material recognition


Abstract

Visual Material Recognition

Gabriel Schwartz

Ko Nishino, Ph.D.

Materials inform many of our interactions with everyday objects. Knowing that a cup is ceramic, we handle it more gently. When sidewalks are covered with snow and ice, we walk differently so as not to slip. If we aim to create an autonomous system, such as a robot, that can manipulate a wide variety of objects or traverse the many different surfaces it may encounter, we will need to be able to provide this material information algorithmically. Visual material recognition is the process of identifying the presence of materials, such as plastic, glass, or metal, in ordinary images. By recognizing these materials, we can obtain valuable cues for general image understanding. Doing so, however, is a challenging problem, as a single material may exhibit many different visual appearances. We can recognize an object based on its characteristic shape, but materials do not have such a singular distinguishing property. In this thesis, we study the problem of visual material recognition by breaking the recognition process down into fundamental and separable components. Our key observation is that the appearance variation which makes materials so challenging to recognize arises from the context in which the materials appear. A smooth white surface does not on its own provide many cues as to the material in question, but when combined with the fact that the surface is on a mug, we may infer that the material is likely ceramic or plastic. In order to take advantage of this observation, we must be able to separate material appearance from the context in which it appears. As a first step, we demonstrate that it is possible to recognize materials from small image patches. These small patches contain only


as “shiny” or “translucent”, as an intermediate representation for the materials themselves. We refer to these properties as visual material traits. Though they prove useful, obtaining annotations for these traits is a challenging and time-consuming process. To address this, we derive an automatic perceptual attribute discovery method that generates classifiers for a set of unknown attributes. By probing the human perception of materials through easily-obtained binary annotations, we may measure the visual similarity of materials and discover attributes that serve the same function as material traits. Finally, having shown that material appearance may be isolated in small local image patches, we introduce a convolutional neural network (CNN)-based framework that integrates local material appearance with global contextual cues. By cleanly separating and combining the material appearance and context, we can take advantage of the strong material cues we show are present in that context to accurately recognize materials with far fewer examples than past attempts at material recognition.

Chapter 1: Introduction

Material recognition – identifying the presence of materials, such as glass or metal, in images – can provide valuable cues for autonomous interaction. Knowing the composition of an object can strongly influence how a robot or other autonomous system may handle it: a plastic knife, for example, can tolerate much less force than a metal one. Materials are also a key component in image understanding and visual-question-answering [3], enabling a robot to, “Pick up the glass cup on the table,” or answer the question, “How many wooden toys are there?” Object recognition can identify the cups, tables, and toys, but to be more specific we need material recognition. We must be able to algorithmically recognize the presence of materials in ordinary images if we are to provide such information to any system.

Recognizing materials has proven to be a challenging problem. Early work, such as that of Liu et al. [37], focused on simple images (one primary material and object of interest, uncluttered scenes, closeup views) and material categories. Even so, the accuracy of their resulting material predictions was relatively low (44.6%). The challenge in recognizing materials visually is largely due to the wide variety of appearances which each material may exhibit. Unlike objects, where, for example, cars tend to exhibit a characteristic shape, materials have no such simple distinguishing properties. One material, such as plastic, may appear in a number of different colors, textures, and reflectances.

The unifying observation in this thesis is that the challenging variation in material appearance arises due to the different contexts in which materials appear. A single material may appear as part of many different objects, and each of those objects may in turn appear


in that scene both strongly constrain and influence the presence and appearance of materials in the image. Metal and ceramic, for example, are two challenging materials to recognize: metal due to the fact that its appearance often depends on its environment, and ceramic due to its lack of distinguishing features. If, however, we know an object is a sink, then we may infer that it is likely made of metal or ceramic. Likewise we may also observe that kitchen sinks are typically metal while bathroom sinks are often ceramic. We refer to such object and scene categories as “context” when they are used to inform material recognition.

We show that we can use our observations concerning materials and context to greatly reduce the number of examples required to accurately recognize a material. In order to achieve this, however, we must first be able to separate material appearance from the surrounding context. Existing material recognition methods do not do so, and instead build frameworks that rely on an entangled combination of material appearance and context with no clear delineation. These methods require very large training datasets to achieve reasonable accuracy. Since they cannot separate the effects of context on material appearance, such methods depend too heavily on the contextual cues and must see all combinations of material and context.
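The sink example above can be read as a conditional distribution of materials given context. As a minimal sketch (the annotation tuples below are invented toy data, and `conditional_material_distributions` is a hypothetical helper, not the estimator used in this thesis), the following Python code counts co-occurrences to estimate P(material | context). It illustrates how an object category alone can leave the material ambiguous while the object and place together are more informative.

```python
from collections import Counter, defaultdict


def conditional_material_distributions(observations):
    """Estimate P(material | context) from (context, material) pairs by counting."""
    counts = defaultdict(Counter)
    for context, material in observations:
        counts[context][material] += 1
    return {
        context: {m: n / sum(c.values()) for m, n in c.items()}
        for context, c in counts.items()
    }


# Hypothetical toy annotations of the form ((object, place), material).
observations = [
    (("sink", "kitchen"), "metal"),
    (("sink", "kitchen"), "metal"),
    (("sink", "kitchen"), "ceramic"),
    (("sink", "bathroom"), "ceramic"),
    (("sink", "bathroom"), "ceramic"),
    (("sink", "bathroom"), "metal"),
]

# Context = object and place together.
joint = conditional_material_distributions(observations)

# Context = object only (the place is discarded).
object_only = conditional_material_distributions(
    [(obj, material) for (obj, _place), material in observations]
)

print(object_only["sink"])            # metal and ceramic are equally likely
print(joint[("sink", "kitchen")])     # metal dominates
print(joint[("sink", "bathroom")])    # ceramic dominates
```

With these toy counts, knowing only that the object is a sink leaves metal and ceramic equally likely, whereas adding the place shifts the distribution toward one material.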

As a first step towards a full integration of untangled materials and context, we demonstrate that we can recognize materials independent of context using small image patches. We refer to this process as local material recognition. Recognizing materials using only the local information contained in a small image patch appears to be a daunting task. Looking at materials closely, it becomes clear how much even our own recognition process can rely on context. Despite this, humans are able to identify visual properties of materials even when we can’t see the surrounding context. We can look at a smooth plastic surface, for example, and see that it is translucent and possibly shiny regardless of the object involved. We refer to


local information and thus that we can recognize materials independent of external context.

Though visual material traits form a useful object-independent intermediate representation for materials, a number of challenges arise when attempting to apply material traits to larger datasets. First, material traits rely on a single, manually-defined set of trait names for annotation and recognition. This is acceptable when dealing with small datasets which may be annotated by a single annotator following their own internal definitions of the traits. If, however, we aim to increase the size of the dataset in question, then this assumption no longer holds. Some of the material traits are intuitive and challenging to precisely define, something that would be required if multiple annotators are to be able to provide consistent annotations. Furthermore, it is difficult to evaluate whether or not any given set of manually-defined material traits is complete. We show that we may address both of these issues by automatically discovering useful visual material attributes. We use the term attributes to highlight the distinction between named material traits and the unnamed properties we discover. We derive a method to define an attribute space that faithfully encodes our own human perceptual representation of materials while simultaneously serving as an intermediate representation for material recognition. Our method produces attributes with the same desirable properties as visual material traits using only a small amount of easily-obtained weak supervision.

Our automatic perceptual attribute discovery method requires only simple supervision and eliminates the need to manually define a set of material traits. The training process is, however, relatively slow and does not scale well to larger datasets. Working well with small amounts of training data is a benefit, but we would ideally like to leverage recent


advances in large-scale end-to-end learning as well. As a step towards this goal, we show that our perceptual material attributes can in fact be discovered within a Convolutional Neural Network (CNN) framework focused on local material recognition (the Material Attribute/Category CNN, MAC-CNN). This enables us to take advantage of potentially larger material datasets. We also find interesting parallels with the material representation in the human material recognition process as observed in neuroscience [25, 22]. In contrast to the intermediate representations formed by our previous attribute methods, the human material recognition process (as well as our MAC-CNN) produces a perceptual representation (material attributes) as a side-product of material category recognition. Our results show that we are able to discover similar perceptual attributes using the MAC-CNN, and we additionally demonstrate the usefulness of perceptual material attributes for transfer learning. To support these experiments, we introduce a new material database focused on local material recognition.

Finally, having shown that we may separate material appearance from context using small local image patches, we introduce a novel material recognition framework that integrates local material appearance and global scene context, in the form of object and place category probabilities, to accurately recognize materials given far fewer examples than required by existing methods. Specifically, we propose a fully-convolutional full-resolution CNN that combines local material appearance and global context to generate per-pixel material category predictions. Our method achieves state-of-the-art accuracy scores on multiple material recognition datasets. Furthermore, we quantitatively investigate the informative properties of various forms of contextual cues as they pertain to the recognition of materials, and evaluate the impact of each form of context we introduce to the recognition process.


nition. These contributions include:

Methods

• A framework for recognizing local visual material properties (visual material traits)

• An attribute discovery method that automatically builds a set of classifiers for attributes which encode the human perception of materials

• An end-to-end trainable CNN-based framework (MAC-CNN) for unifying discovered attributes and material recognition

• A dense per-pixel material recognition method which integrates local appearance and global context to accurately recognize materials from fewer training examples

Datasets

• Visual Material Traits (trait mask annotations)

• Material Patch Similarities (pairwise binary similarity annotations)

• Local Material Recognition Database (images and associated material masks)

– A three-level hierarchy of material categories from which material datasets may be built


Chapter 2: Related Work

Our overall goal is to predict the presence of material categories (e.g., fabric, metal, plastic) in natural images. Here, we will review prior work involving general material recognition methods and relevant image understanding tools, such as semantic/non-semantic attributes and Convolutional Neural Networks (CNN).

2.1 Material Recognition

Textures are visual patterns associated with a specific combination of material, illumination, and surface geometry. Though textures are not materials, texture recognition methods formed the basis for early material recognition methods. Leung and Malik [35] first introduced textons to describe and classify images of textures. A texton represents a particular set of responses for a fixed hand-designed filter bank applied to an image. Texture recognition methods focused on using the distribution of exemplar textons within images to represent texture categories. Later methods, such as that of Varma and Zisserman [54], achieved extremely high accuracy scores (90-100%) on the databases available at that time. These databases, however, typically consisted of extremely specific texture categories like “crumpled paper” or “ribbed paper”, and contained images of flat surfaces, exhibiting solely the texture in question, taken under controlled laboratory conditions. The one exception is a database with only 37 images labeled with 6 categories: air, building, car, road, vegetation, trunk.

Adelson [2] first suggested materials as a distinct concept from objects or simple textures when discussing “things vs. stuff”. “Things” refers to objects, which have been the focus of


shape or fixed spatial extent. Ice cream is one example of “stuff” that is not an object but is still a recognizable concept in images. While materials are not equivalent to the “stuff” discussed in his work, the work does lay the foundation for material recognition as a vision problem.

The first collection of material category images for classification originated in Sharan et al. [49], where they introduced a new image database (the Flickr Materials Database, or FMD) containing images from the photo sharing website Flickr. The FMD contains a set of images each with a single material annotation and corresponding mask identifying the presence of that material. Building on the FMD, Liu et al. [37] created a framework to recognize these material categories using a modified LDA probabilistic topic model. Hu et al. [26] improved upon the state-of-the-art FMD accuracy using kernel descriptors and large-margin nearest neighbor distance metric learning. Their experiments showed that providing explicit object detection information to material category recognition results in a large improvement in accuracy. Sharan et al. [48] later showed that without information associated with objects (such as the object shape), performance degrades significantly (from 57.1% to 42.6%). Specifically, they note that their material category recognition method depends heavily on non-local features such as edge contours. It is this dependency which we wish to either remove or make explicit with our proposed local material recognition methods. Zhang et al. [60] have shown further-improved performance on the FMD, but they require an auxiliary training dataset which contains a number of images that are extremely similar to those in the FMD.


image. This inherently assumes that there is only one material of interest in the image, a very restrictive assumption. To relax this assumption, recent work focuses on dense prediction: providing a material category for each pixel in the input image. Bell et al. introduced the OpenSurfaces [7] and MINC [6] datasets to aid in the training of dense material recognition models. With MINC they also describe a simple modification of the VGG CNN architecture of Simonyan and Zisserman [52] to predict their material categories at each pixel. Zhang et al. [59] improved the state-of-the-art accuracy for a subset of the MINC dataset, MINC-2500, using their deep texture encoding network, but their method is limited to single per-patch predictions. Cimpoi et al. [11] aggregate texture descriptors within region proposals, similar to R-CNN [21], for material recognition. They refine their predictions with a dense CRF, but if the region proposals fail to separate two materials their method cannot recover. Wang et al. [55] also demonstrate accurate dense per-pixel material predictions using 4D light field images. These datasets and models have inherent drawbacks involving their category selection and training procedures which we will discuss in later chapters.

2.2 Attributes

Attributes, as used in machine learning and computer vision, are distinctive properties (visual properties, in the context of computer vision) of categories in a classification problem. Attributes are often shared across a sparse subset of the associated categories. In the case of materials, these attributes include visual properties like “shiny” or “smooth”. As part of our early investigation of local material recognition, we introduce two forms of material attributes as intermediate representations: fully-supervised (Chapter 3) and weakly-supervised (Chapter 4) visual material properties.


but largely at the image or scene level. Ferrari and Zisserman [19] introduced a generative model for certain pattern and color attributes, such as “dots” or “stripes”. The attributes described in their model focus on texture and color, but are not material attributes. A paper cup, for example, may have stripes painted on it, but “striped” is not a property of the paper itself. Kumar et al. [29] proposed a face search engine with their attribute-based FaceTracer framework. FaceTracer uses SVM and AdaBoost to recognize attributes within fixed facial regions. Such fixed regions are not present in materials, which may take on an arbitrary shape unlike the objects which they make up. Farhadi et al. [17] applied attributes to the problem of object recognition. Their results showed an improvement in accuracy over a basic approach using texture features. Lampert et al. [30] also showed that attributes transfer information between disjoint sets of classes. These results suggest that attributes can serve as an intermediate representation for recognition of the categories which exhibit them. Patterson and Hays [42] showed that they could recognize a variety of visual attributes, some of which happen to be general material categories. Their work, however, was not an explicit attempt at recognizing materials.

2.2.2 Weakly-Supervised Attributes

The attributes described above were all fully-supervised or “semantic” attributes. A semantic attribute is one to which we can assign a name like “round” or “transparent”. While these attributes are useful, it is difficult to quantify the completeness and consistency of any given attribute set: does the set of attributes contain everything that could help recognize the target categories, and can the appearance (for visual attributes) be agreed upon by a variety of annotators? Semantic attributes are also task-specific and must be manually defined for each new recognition task.

To address the issues inherent to semantic attributes, a number of unsupervised or weakly-supervised attribute discovery methods have been proposed. Berg et al. [8] described a framework for automatically learning object attributes from web data (images and associated text). This approach learns some localized attributes (as we would require for local material recognition). The required text annotations are, however, image-wide and do not guarantee locality. Patterson and Hays [42] also proposed a process to discover and recognize scene-wide attributes in natural images. While they are able to discover a large number of attributes, their learned attributes are not local. Rastegari et al. [43] learn a binary attribute representation (binary codes) for images. As with most existing methods, however, these attributes are image-wide and not local. Cimpoi et al. [10] demonstrated a method for learning an arbitrary set of describable texture attributes based on terms derived from psychological studies. As noted by Adelson [2], texture is only one component of material appearance, and cannot alone describe our perception of materials. Though their results demonstrate impressive performance on the FMD, their learned attributes apply only globally. Most relevant to the work discussed in this thesis are the attribute discovery methods of Akata et al. [4] and Yu et al. [58]. Akata et al. [4] formulated attribute discovery as a label embedding problem. Yu et al. [58] proposed a two-step procedure for discovering and classifying attributes based on a similarity matrix. They computed a distance matrix using Euclidean distances in the raw feature space of labeled image patches. In contrast, we embed the material categories in an attribute space derived from our own human visual perception of material similarity.


attributes within Convolutional Neural Networks (CNNs). We also formulate the integration of material appearance and context as a CNN. Introduced by LeCun et al. [31] for handwritten digit classification, the convolutional neural network model is a general non-linear model which applies a set of convolution kernels to an image in a hierarchical fashion to generate a category probability vector. The kernel weights are model parameters that are set via non-linear optimization (generally Stochastic Gradient Descent) to attempt to maximize the likelihood of a set of training data.

Recently, Shankar et al. [47] proposed a modified CNN training procedure to improve attribute recognition. Their “deep carving” algorithm provides the CNN with attribute pseudo-label targets, updated periodically during training. This causes the resulting network to be better-suited for attribute prediction. Escorcia et al. [15] show that known semantic attributes can also be extracted from a CNN. They show that attributes depend on features in all layers of the CNN, which will be particularly relevant to our investigation of perceptual material attributes in CNNs (Chapter 5). ConceptLearner, proposed by Zhou et al. [63], uses weak supervision, in the form of images with associated text content, to discover semantic attributes. These attributes correspond to terms within the text that appear in the images. All of these frameworks predict a single set of attributes for an entire image, as opposed to the per-pixel attributes which we introduce in this thesis.

At the intersection of neuroscience and computer vision, Yamins et al. [57] find that feature responses from high-performing CNNs can accurately model the neural response of the human visual system in the inferior temporal (IT) cortex (an area of the human brain that responds to complex visual stimuli). They perform a linear regression from CNN feature outputs to IT neural response measurements and find that the CNN features are good predictors of neural responses despite the fact that the CNN was not explicitly trained to match the neural responses. Their work focuses on object recognition CNNs, not materials. Hiramatsu et al. [25] take functional magnetic resonance imaging (fMRI) measurements and investigate their correlation with both direct visual information and perceptual material properties (similar to the material traits we introduce in Chapter 3) at various areas of the human visual system. They find that pairwise material dissimilarities derived from fMRI data correlate best with direct visual information (analogous to pixels) at the lower-order areas and with perceptual attributes at higher-order areas. Goda et al. [22] obtain similar findings in non-human primates. These studies suggest the existence of perceptual attributes in human material recognition, but do not actually derive a process to extract them from novel images.

2.4 Dense Prediction

Dense prediction, outputting a value or category prediction for each pixel, has been extensively studied in the context of object recognition and object semantic segmentation. Object recognition datasets, such as ImageNet [46] or MS COCO [36], often contain many (80-1,000) categories. Despite this, state-of-the-art semantic segmentation methods such as DeepLab [9] focus on only a small subset of coarse-grained categories. While we might gain some small contextual cues from such coarse categories, intuitively we would expect that the more detailed the context categories are, the more they will be able to inform material recognition. We show that this is indeed true in Chapter 6. A notable and relevant exception is the recent ADE20k dataset, scene parsing challenge, and associated models [65]. The dataset contains many fully-segmented images, and the challenge defines a set of 150 categories for semantic segmentation. We find the ADE20k models to be ideal sources of


The use of context as a means to reduce ambiguity, whether in materials or other cases, appears promising. Hu et al. [26] showed that a simple addition of object category predictions as features could potentially improve material recognition. Iizuka et al. [27] use scene place category predictions to improve the accuracy of greyscale image colorization. Shrivastava and Gupta [51] investigate the use of semantic segmentation to augment Faster R-CNN. In this case, the semantic segmentation network is trained with R-CNN in a multi-task learning fashion. The semantic segmentation network provides an additional signal for object recognition, but this is not the same thing as context: the semantic segmentation network is producing output for the same type of category (objects) as the main network. Our work, in contrast to these previous methods, takes advantage of multiple sources of context that are not merely additional forms of material recognition.


Chapter 3: Visual Material Traits

In Chapter 6, we show that separating materials from their surrounding context allows us to combine them with accurately-recognized information about said context for improved accuracy. Prior to doing so, we must first show that we can indeed recognize materials in the absence of global context like object shape or scene properties. We refer to such recognition in the absence of context as local material recognition.

Recognizing materials is an inherently challenging problem, made more so by our goal of local material recognition. As Figure 3.2 shows, previous material recognition frameworks rely heavily on context cues present in large image patches. As the patch size is reduced, materials become more difficult for these frameworks to recognize. One contributing factor to this difficulty is the intra-class appearance variation present in typical material categories. A car, for example, often has a very distinct boundary shape that allows for its identification as an object. On the other hand, metal, a material present in most cars, can take on a variety of appearances depending on the surroundings. Figure 3.1 contains a visual example of such variation. Each image contains a sample of plastic material, but the material appearance varies based on the object and the surrounding scene conditions.

Looking at the images in Figure 3.1, one can see that plastic tends to have properties that are associated with a distinct visual appearance, such as “smooth” and “translucent”. Our key observation is that these visual properties are recognizable even when the surrounding objects and scenes are not visible. We can use these properties to tackle the challenging variations in material appearance and recognize materials independent of context. In general, material properties can include tactile ones such as “hard,” or purely visual ones such as


Figure 3.1: Materials like the plastic in these images exhibit a wide range of appearances depending on the object and scene, making extraction of material information without the use of object information challenging. We propose to locally recognize visual material traits, distinct appearances of material properties such as "translucent," to provide contextual cues for challenging vision tasks including material category recognition and segmentation.

[Figure 3.2 plot: Accuracy vs. Patch Size. Axes: Patch Size (px), Accuracy (%). Series: Cimpoi et al. [5], Sharan et al. [18], Proposed Method.]

Figure 3.2: When adapted to use aggregated features from local image patches, methods that perform well on full images quickly lose accuracy. This suggests that they are relying heavily on context, including object shape cues, to recognize materials.

[Figure 3.3 patch panels (excerpt): Shiny, Fuzzy, Metallic, Soft, Smooth, Liquid, Rough, Woven.]

Figure 3.3: Successfully recognized material traits. These image patches were recognized by our framework as exhibiting the indicated material traits. Even at the patch level, we can see the characteristic visual appearances of each material trait.

“shiny.” We model the local visual appearance of these characteristic material properties as a novel intermediate representation: visual material traits.

Experimental results show that visual material traits can be recognized accurately from small (32×32) image patches, with per-trait accuracy as high as 93.1% and an average accuracy of 78.4%. To express more complex concepts, such as material categories, we may treat the distribution of material traits in a region as an image descriptor and generate a per-image material category prediction. Furthermore, material traits learned from one dataset can be recognized and used to extract material information from an entirely different set. This is in contrast with past methods [48, 26] that train and test on images taken from a single source. These results show that the representation generalizes well. We also demonstrate the use of material traits in mid-level image understanding tasks by augmenting segmentation algorithms with per-pixel material information.

3.1 Representing Material Traits

Figure 3.3 shows examples of the visual material traits recognized by our framework. Even at the local level of the example images, each visual material trait corresponds to the appearance of a characteristic material property. Ideally, recognition of these material traits will enable us to extract crucial material information from any image.

The key contribution of our material traits is their ability to encode per-pixel material information without relying on object-specific features. Material traits provide a compact, local, and discriminative encoding of material properties. To obtain a representation for these material traits, we must avoid introducing any dependence on object information in the recognition process. We accomplish this by learning the best convolutional features to describe material trait patches in an unsupervised setting. Convolutional features are ideal for this purpose as they can be applied at any point in an image, and do not encode object boundary contours. We supplement these unsupervised features with selected low-level features to describe appearance patterns that cannot be learned by the unsupervised model.

3.1.1 Convolutional Material Trait Features

Expressing the appearance of material traits poses a challenge. While intuitive, traits such as “fuzzy” can be hard to quantify. While we may attempt to do so using only existing designed features, the space of images that may be represented using these features is incomplete (as shown by our feature selection results).

Rather than rely solely on handcrafted features, we determine features associated with each material trait through unsupervised feature learning. Unsupervised learning builds a generative model for images by finding simple components that can be combined to reproduce them. Constraints, such as sparsity, force optimal model components to also act as discriminative features for classification.

Our goal is to recognize per-pixel, object-independent visual material traits. To this end, we choose to learn convolutional features so that we may extract them at any pixel in an


undesired object-dependent features in previous frameworks [48, 26].

We build upon the convolutional auto-encoder (CAE) model [39] to learn the feature kernels. The model defines images as the weighted sum of convolution kernel responses. Optimal filters under our model are defined by the following objective function:

$$C = T_r + \alpha T_w + \beta T_s. \tag{3.1}$$

The objective contains three terms: a reconstruction error term $T_r$, a weight-decay (smoothness) term $T_w$, and a sparsity term $T_s$. The weight-decay and sparsity terms have corresponding weights $\alpha$ and $\beta$, and each term acts as a constraint to help produce useful features. Reconstruction error for $N$ images is the squared difference between the input images $I$ and their reconstructions $R$ using the learned features,

$$T_r = \frac{1}{N} \sum_{i=1}^{N} \| I_i - R_i \|_2^2. \tag{3.2}$$


of the encoding in feature space $E_i$ by

$$E_i = h(W * I_i + b_e), \tag{3.3}$$

$$R_i = W' * E_i + b_r, \tag{3.4}$$

$$h(x_i) = \begin{cases} 0 & \text{if } x_i < 0 \\ x_i & \text{if } 0 \le x_i \le 1 \\ 1 & \text{if } x_i > 1 \end{cases} \tag{3.5}$$

with $*$ representing convolution with a set of filters $W$, along with bias terms $b_e$ and $b_r$ for the encoding and reconstruction, respectively. Some formulations force the reconstruction filters $W'$ to be the transpose of the encoding filters $W$. We, however, found that allowing them to be separately optimized resulted in more diverse features.
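As a rough illustration of Equations 3.3 to 3.5, the following sketch encodes and reconstructs a one-dimensional signal in plain Python. It is a minimal sketch under stated assumptions, not the implementation used in this work: the signal is 1-D rather than a 2-D image, borders are zero-padded, and the names `clamp01`, `convolve_same`, `encode`, and `reconstruct` are hypothetical.

```python
from typing import List

Signal = List[float]


def clamp01(x: float) -> float:
    """Activation h of Equation 3.5: linear on [0, 1], clipped outside."""
    if x < 0.0:
        return 0.0
    if x > 1.0:
        return 1.0
    return x


def convolve_same(signal: Signal, kernel: Signal) -> Signal:
    """1-D convolution with zero padding; output has the input's length.

    Assumes an odd-length kernel centred on its middle tap.
    """
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i - (k - half)  # convolution flips the kernel
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def encode(image: Signal, filters: List[Signal], b_e: float) -> List[Signal]:
    """E = h(W * I + b_e): one feature map per encoding filter."""
    return [[clamp01(v + b_e) for v in convolve_same(image, w)] for w in filters]


def reconstruct(encoding: List[Signal], filters: List[Signal], b_r: float) -> Signal:
    """R = W' * E + b_r: sum the responses of all decoding filters."""
    recon = [b_r] * len(encoding[0])
    for feature_map, w in zip(encoding, filters):
        for i, v in enumerate(convolve_same(feature_map, w)):
            recon[i] += v
    return recon
```

The encoding and decoding filters are passed separately, mirroring the choice above to optimize the two filter sets independently rather than tying one to the transpose of the other.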

The non-linear encoding function $h(x_i)$ in Equation 3.3 contains a linear region between 0 and 1. If allowed, the combination of small encoding weights and large decoding weights could force any inputs to encode solely into this linear region. Such an encoding would result in a trivially perfect reconstruction. Weight decay, $T_w = \|W\|_2^2 + \|W'\|_2^2$, is a term that prevents this trivial solution by ensuring that the weights do not take on exceedingly large values.

By definition, discriminative image features do not appear everywhere in an image. Figure 3.3 shows that certain material traits, particularly “shiny,” exhibit strong local appearance cues. Sparsity constraints express this property well. Sparse features are features that are only present in a small fraction of the possible locations in each image, as measured by their presence in the encoding $E_i$. As in Lee et al. [33], we enforce sparsity by penalizing

[Figure 3.4 filter panels (excerpt): Soft, Smooth, Liquid, Organic, Low Ranking.]

Figure 3.4: These 7×7 px convolution filters learned by the CAE represent the top three filters for the listed material traits, ranked by average presence in the testing images. The filters represent characteristic local texture and color patterns. The six filters on the right do not rank in the top three for any material trait. They exhibit significantly less texture variation than the top filters.

the difference between mean filter activations and a small constant $p$:

$$T_s = \left\| p - \frac{1}{N} \sum_{i=1}^{N} E_i \right\|_2^2. \tag{3.6}$$
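To make the structure of the objective in Equation 3.1 concrete, the sketch below evaluates its three terms on flattened arrays. It is illustrative only and rests on assumptions: images, reconstructions, and encodings are given as flat lists of floats, the sparsity term reads Equation 3.6 literally (an element-wise mean of the encodings across images, compared against the scalar p), and all function names are hypothetical. No optimizer is included.

```python
from typing import List, Sequence

Vector = Sequence[float]


def reconstruction_error(images: List[Vector], recons: List[Vector]) -> float:
    """T_r (Eq. 3.2): mean over images of the squared L2 distance."""
    total = 0.0
    for img, rec in zip(images, recons):
        total += sum((a - b) ** 2 for a, b in zip(img, rec))
    return total / len(images)


def weight_decay(enc_weights: Vector, dec_weights: Vector) -> float:
    """T_w: squared L2 norm of the encoding plus the decoding weights."""
    return sum(w * w for w in enc_weights) + sum(w * w for w in dec_weights)


def sparsity_penalty(encodings: List[Vector], p: float) -> float:
    """T_s (Eq. 3.6): squared L2 norm of (p - mean encoding over images)."""
    n = len(encodings)
    dim = len(encodings[0])
    mean_act = [sum(e[d] for e in encodings) / n for d in range(dim)]
    return sum((p - m) ** 2 for m in mean_act)


def cae_objective(images: List[Vector], recons: List[Vector],
                  encodings: List[Vector], enc_weights: Vector,
                  dec_weights: Vector, alpha: float, beta: float,
                  p: float) -> float:
    """C = T_r + alpha * T_w + beta * T_s (Eq. 3.1)."""
    return (reconstruction_error(images, recons)
            + alpha * weight_decay(enc_weights, dec_weights)
            + beta * sparsity_penalty(encodings, p))
```

Keeping the three terms as separate functions makes it easy to inspect how each constraint contributes to the total for a given choice of the weights alpha and beta.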

To further constrain the learning process and obtain a discriminative feature set, we force a fixed number of the features to be oriented first-order Gaussian filters. Learning these filters alone will satisfy both sparsity and reconstruction constraints, but their discriminative power is limited. As shown in Table 3.1, edge filters are selected roughly half as often as the CAE-learned features.
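For reference, an oriented first-order Gaussian filter can be generated as the directional derivative of a 2-D Gaussian. The sketch below is a hedged illustration: the 7×7 size matches the learned filters in Figure 3.4, but the value of sigma, the number of orientations, and the function names are assumptions, not values taken from this work.

```python
import math
from typing import List

Kernel = List[List[float]]


def oriented_gaussian_derivative(size: int = 7, sigma: float = 1.5,
                                 theta: float = 0.0) -> Kernel:
    """First derivative of a 2-D Gaussian along the direction theta (radians)."""
    half = size // 2
    c, s = math.cos(theta), math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            g = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            u = x * c + y * s  # coordinate along the filter orientation
            row.append(-u / (sigma * sigma) * g)
        kernel.append(row)
    return kernel


def edge_filter_bank(n_orientations: int = 4, size: int = 7,
                     sigma: float = 1.5) -> List[Kernel]:
    """Filters at evenly spaced orientations in [0, pi)."""
    return [oriented_gaussian_derivative(size, sigma, k * math.pi / n_orientations)
            for k in range(n_orientations)]
```

Each kernel is antisymmetric about its centre, so it responds to intensity edges perpendicular to its orientation and gives no response on constant regions.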

We optimize the full objective function using L-BFGS with automatically-generated symbolic gradient evaluation.

Figure 3.4 shows a selection of the top convolution filters learned by the CAE, ranked by average presence in the corresponding material trait images. The filters were learned from whitened material trait image patches. The top filters appear to represent the presence or absence of specific local texture patterns. For comparison, the non-ranked features on the right exhibit far less texture variation.

3.1.2 Supplemental Features

Cybenko [12] showed that artificial neural networks, including auto-encoders such as the CAE, are capable of approximating any continuous function defined on $\mathbb{R}^n$. There are, however, local features such as HOG that are not continuous and thus cannot be learned by the CAE. These discrete features may encode important properties of material traits, such as the strong local patterns in woven material. To address this, we supplement the learned features with Local Binary Patterns (LBP), HOG features and color histograms. We do not use other low-level features, such as the edge slices and ribbons of Sharan et al. [48], as they encode object-specific information and cannot be extracted on a per-pixel basis.
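To illustrate why such features are discontinuous, the sketch below computes the basic 8-neighbour Local Binary Pattern: each neighbour is thresholded against the centre pixel, so an arbitrarily small change in intensity can flip a bit of the code. This is the textbook LBP variant in plain Python; which LBP variant and parameters were used in this work is not stated here, and the function names are hypothetical.

```python
from typing import List

Patch = List[List[float]]

# Clockwise neighbour offsets (row, column), starting at the top-left pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(patch: Patch, r: int, c: int) -> int:
    """8-bit code: bit k is set when neighbour k is >= the centre pixel."""
    center = patch[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_OFFSETS):
        if patch[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(patch: Patch) -> List[float]:
    """Normalized 256-bin histogram of LBP codes over interior pixels."""
    hist = [0.0] * 256
    count = 0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            hist[lbp_code(patch, r, c)] += 1.0
            count += 1
    return [h / count for h in hist] if count else hist
```

The histogram, rather than the individual codes, would serve as the patch descriptor, since it summarizes local patterns without reference to where in the patch they occur.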

The results of our feature selection process show that these additional features supplement rather than replace the CAE-learned features. As will be shown in Table 3.1 in the following analysis of feature selection, CAE features are selected on average as often as any of the supplemental features. Furthermore, our analysis in Table 3.2 shows that the CAE features play a crucial role in the application of material traits.

3.1.3 Groupwise Feature Selection

We would like to obtain a feature set that generalizes well to new datasets. To avoid over-fitting and improve generalization, we perform feature subset selection on the supplemental and CAE-learned features. Our final feature set contains a small number of groups of conceptually related features. Rather than separate the groups into individual elements, we select the best combination of groups to recognize each trait. This process takes advantage of the fact that two individually useless features can have predictive power when grouped together [23]. We are able to exhaustively evaluate all combinations of groups (CAE features, oriented edges, HOG, LBP, color histograms), selecting those that maximize performance on a validation set. Feature groups are not further divided; thus, for example, either all HOG features are included or none are.

Trait | CAE | Oriented | HOG | LBP | Color Histograms
Shiny • •
Fuzzy • •
Transparent • • •
· · · (13 Material Traits)
Total Uses | 7 | 4 | 6 | 9 | 7
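The exhaustive search over feature groups can be sketched as follows. With five groups there are only 31 non-empty combinations, which is why exhaustive evaluation is feasible. This is a minimal sketch: the group labels, the function names, and the toy scoring function are hypothetical stand-ins; in practice the score would come from training a trait classifier on the chosen groups and measuring accuracy on a validation set.

```python
from itertools import combinations
from typing import Callable, Iterator, Sequence, Tuple

FEATURE_GROUPS = ("CAE", "oriented", "HOG", "LBP", "color_hist")


def all_group_subsets(groups: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty combination of whole feature groups."""
    for k in range(1, len(groups) + 1):
        yield from combinations(groups, k)


def select_feature_groups(
    groups: Sequence[str],
    validation_score: Callable[[Tuple[str, ...]], float],
) -> Tuple[Tuple[str, ...], float]:
    """Return the subset with the highest validation score (first one on ties)."""
    best_subset: Tuple[str, ...] = ()
    best_score = float("-inf")
    for subset in all_group_subsets(groups):
        score = validation_score(subset)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score


def toy_validation_score(subset: Tuple[str, ...]) -> float:
    """Hypothetical scorer: HOG and LBP only help when used together."""
    base = 0.9 if {"HOG", "LBP"} <= set(subset) else 0.5
    return base - 0.01 * len(subset)  # mild preference for smaller sets
```

With the toy scorer, `select_feature_groups(FEATURE_GROUPS, toy_validation_score)` picks the pair `("HOG", "LBP")`, an instance of two groups that are uninformative alone but predictive together.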

Table 3.1 shows the results of our feature selection process. Features are selected fairly evenly and, as the full table shows, in disjoint sets. A particular case of note is the “shiny” material trait. Since we focus on recognizing visual material traits without dependence on object-specific information, “shiny” is synonymous with specular highlights. This may be seen clearly in Figure 3.3. While there are visual cues, such as contoured reflections on a car body, that may lead an observer to call a material “shiny,” these features are specific to the object and do not directly indicate the material trait. As a result of this, color histograms and learned convolutional filters prove to be more useful features for this material trait.

3.2 Recognizing Material Traits

For training and testing, we annotate images in the Flickr Materials Database (FMD) [49] with masks indicating regions that exhibit each material trait. From these regions, we extract 45,500 annotated patches. We use balanced sets of positive and negative examples

[Figure 3.5 panels: Input Image, Organic, Fuzzy.]

Figure 3.5: Example material trait recognition. Non-masked pixels in (b) and (c) correspond to pixels with high probability (p > 0.5) of exhibiting the given trait. Note that the recognized material traits appear consistently across regions of related materials.

to train randomized decision forest (RDF) classifiers for each material trait. Though we use the same dataset as methods that include object information, our feature set and recognition process explicitly avoid object dependence.

Figure 3.5 shows the recognition results for two material traits on an image from the Berkeley Segmentation Dataset (BSDS) [38]. Note that the main object in the image, a koala, was not present in the Flickr dataset. The FMD does not, in fact, contain any animals or any examples of animal fur. Despite this, characteristic properties of the fur and plants are accurately recognized.

Figure 3.6 contains recognition accuracies for each of the 13 material traits. Since we

[Figure 3.6 bar chart. Axes: Material Trait, Accuracy. Traits: Fuzzy, Shiny, Smooth, Soft, Striped, Metallic, Organic, Translucent, Transparent, Rough, Liquid, Woven, Manmade.]

Figure 3.6: Visual material trait recognition accuracy. Material traits are recognized via binary classification on a balanced training and testing set, thus random chance accuracy is 50%. Most traits are recognized well. Difficult material traits, such as metallic and transparent, are challenging due to their object- and environment-dependent appearances. Average accuracy is 78.4%.

[Figure 3.7 histograms. Material categories: Plastic, Stone, Foliage. Material traits: Smooth, Rough, Organic, Soft.]

Figure 3.7: Material trait frequency distributions. We compute the class-conditional distributions for appearance frequency of each material trait given each material category. These are stored as histograms, examples of which are shown above. Plastic is most often smooth, while stone is very rarely smooth.

predict material traits independently, and the training and testing data are balanced, random chance performance is 50% accuracy. Most material traits are recognized very accurately; some, however, are challenging. “Metallic” and “transparent” have the two lowest recognition rates (66.4% and 67.0%). The appearance of these material properties depends heavily on the environment surrounding the object. In the case of a reflective metal surface or a clear glass sphere, the appearance is determined entirely by the object and its environment. As we explicitly avoid object dependence, we cannot expect to model these particular material traits with the same level of accuracy as others. Despite this, “metallic” and “transparent” are still recognized better than chance.
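The 50% chance level follows directly from balancing. The sketch below, which is illustrative only, subsamples the larger class so that positives and negatives are equal in number and then shows that a classifier which always answers "positive" scores exactly 50%. The function names and the fixed seed are assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def balanced_examples(positives: Sequence[T], negatives: Sequence[T],
                      seed: int = 0) -> List[Tuple[T, int]]:
    """Subsample the larger class so both labels have the same count."""
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    pos = rng.sample(list(positives), n)
    neg = rng.sample(list(negatives), n)
    return [(x, 1) for x in pos] + [(x, 0) for x in neg]


def accuracy(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(1 for a, b in zip(labels, predictions) if a == b)
    return correct / len(labels)
```

On such a set, any predictor that ignores its input, whether constant or a fair coin flip in expectation, reaches 50%, so the reported 66.4% and 67.0% for the hardest traits are above that baseline.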

Material traits, as a form of visual attribute, should represent a discriminative set of appearances. To investigate this, we compute the class-conditional distributions of material traits given material categories. We use the ten categories of the FMD for this

[Figure 3.8 patch panels (excerpt): Shiny, Fuzzy, Metallic, Soft, Smooth, Liquid, Rough, Woven.]

Figure 3.8: Our framework produced false-positive detections of material traits in these patches. For the challenging metallic trait, it is clear that color plays a strong role. The misclassifications generally have a metallic color even though the material is not metal. In some rare cases such as “smooth” there are missing annotations and thus the false positives are actually true positives.

test. For each image in each category, we sample material traits uniformly across the masked material region in the image. Figure 3.7 shows selected distributions from the set $\{p(t_i \mid m_j) \mid i \in 1 \ldots 13,\ j \in 1 \ldots 10\}$. The resulting distributions do, in fact, represent the characteristic properties of their respective material categories. Stone is often rough but very rarely smooth (there are a small number of polished stone examples in the training data), plastic is smooth, and foliage is organic. As material traits are purely visual, they can occasionally produce false positives, as seen in p(soft|stone). While stone is not soft, porous stones may have a soft appearance.
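One simple way to estimate the class-conditional values discussed above is as relative frequencies: for each material category, the fraction of sampled patches in which each trait was detected. The sketch below assumes each sample is a pair of a material label and the set of traits detected in that patch; this input format, the function name, and the frequency estimate itself are assumptions for illustration and may differ from how the stored histograms were actually built.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple


def trait_distributions(
    samples: Iterable[Tuple[str, Set[str]]],
    traits: Sequence[str],
) -> Dict[str, Dict[str, float]]:
    """Estimate p(trait | material) as a detection frequency per category."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for material, detected in samples:
        totals[material] += 1
        for trait in detected:
            counts[material][trait] += 1
    return {
        material: {t: counts[material][t] / totals[material] for t in traits}
        for material in totals
    }
```

Because traits are predicted independently, the values for one category need not sum to one; each entry is a separate detection rate, and a patch can exhibit several traits at once.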

Figure 3.8 shows a set of false positive material trait recognition results. “Shiny,” with its characteristic bright highlights, is prone to be recognized in over-exposed image regions. Results for “metallic” show that color is a strong cue for this material trait. Though the patches are metallic in color, the material is not in fact metallic. These are limitations of the representation. There are a few cases where the material trait annotations are incomplete, generally for the pervasive “smooth” material trait.


3.3 Using Visual Material Traits

Our analysis shows that we may accurately recognize material traits. The material trait distributions also show that material traits encode discriminative material information. Each material category exhibits characteristic class-conditional material trait distributions. From these results, we expect to be able to inform higher-level processes with material information from material traits. Material trait distributions allow us to recognize material categories in arbitrary images without dependence on prior object knowledge. We also demonstrate a preliminary application of material traits to the problem of segmentation.

3.3.1 Material Categories from Visual Material Traits

Sharan et al. [48] showed that material category recognition depends on object-specific information. Despite this, our class-conditional trait distributions suggest that the information encoded in material traits does provide a discriminative set of features for material category recognition. We rely on these visual material trait distributions to encode and recognize material categories.

We recognize material categories from material traits by training an SVM classifier on the material trait distributions. Distributions are computed from material traits recognized in uniformly sampled random patches within material regions. We select features and train material trait classifiers using half of the FMD for training, then predict their class-conditional distributions. We further supplement the distributions, in a cascade fashion, with the output of an RDF classifier trained to directly predict the material category of a patch using our feature set. The cascade process is responsible for improvements in the more recognizable categories such as foliage (11% improvement), with minor changes in other categories. Accuracy without the cascade process is 46.5%.
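The cascade step can be pictured as a simple feature concatenation. In the sketch below, an image-level descriptor is formed by averaging per-patch trait detections into a trait distribution and appending the average of per-patch category probabilities from a direct patch classifier. The function names, the averaging of patch outputs, and the input layout are assumptions for illustration only; the exact way the RDF output is combined with the trait distributions may differ.

```python
def column_means(rows):
    """Average a list of equal-length numeric rows column by column."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cascade_descriptor(patch_trait_detections, patch_category_probs):
    """Build one image-level feature vector (hypothetical layout).

    patch_trait_detections: per-patch 0/1 trait detections.
    patch_category_probs:   per-patch category probabilities from a
                            direct patch classifier (e.g. a random forest).
    """
    trait_distribution = column_means(patch_trait_detections)
    category_scores = column_means(patch_category_probs)
    # Cascade: trait distribution followed by the direct classifier output.
    return trait_distribution + category_scores


# Toy example: 2 patches, 3 traits, 2 categories.
features = cascade_descriptor(
    [[1, 0, 1], [1, 1, 0]],
    [[0.75, 0.25], [0.25, 0.75]],
)
```

The resulting vector has one entry per trait followed by one entry per category, and would then serve as the input to the category-level classifier.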


[Figure 3.9 panels: (a) Flickr, (b) ImageNet. Axes are labeled with material categories (Fabric, Foliage, Glass, Leather, Metal, Paper, Plastic, Stone, Water, Wood); color scale from 0.0 to 0.6.]

Figure 3.9: Confusion matrices showing true class vs. predicted class on the Flickr Material Database and ImageNet images. Average accuracy is 49.2% in (a) and 60.5% in (b). Though metal and glass both have an appearance that is environment-dependent, glass is more accurately classified. This is likely due to the tendency of glass to create characteristic local distortions.

Using the computed class-conditional distributions, we train an SVM classifier with a histogram intersection kernel to recognize material categories. The histogram intersection kernel, defined as

$$k(\mathbf{x}, \mathbf{y}) = \sum_i \min(x_i, y_i), \tag{3.7}$$

for histogram feature vectors $\mathbf{x}$ and $\mathbf{y}$ with elements $x_i$ and $y_i$, measures the similarity between two normalized histograms [5]. As the material trait distributions are histograms, they are ideally suited for the histogram intersection kernel SVM.
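To make the kernel of Equation 3.7 concrete, the following minimal Python sketch evaluates it for a pair of histograms and builds the matrix of kernel values between two sets of histograms. The function names and toy histograms are hypothetical and chosen only for illustration.

```python
def histogram_intersection(x, y):
    """Histogram intersection kernel of Equation 3.7: sum_i min(x_i, y_i)."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(rows, cols):
    """Kernel values between every histogram in rows and every one in cols."""
    return [[histogram_intersection(r, c) for c in cols] for r in rows]


# Toy normalized histograms (each sums to 1).
h1 = [0.5, 0.5, 0.0]
h2 = [0.25, 0.25, 0.5]
K = gram_matrix([h1, h2], [h1, h2])
```

For normalized histograms, the kernel value of a histogram with itself is 1, and it decreases as the two histograms overlap less. A matrix of this form can be passed to any SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`, named here only as one possible option.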

Figure 3.9 shows the average and per-class accuracy for our method on the FMD. We split the dataset of 1000 images in half for training and testing. Our accuracy (49.2%) does not surpass the final results of Sharan et al. (57.1%), but again, their method relies heavily on features that encode the shape of the objects. We do find that our method achieves higher accuracy than theirs (42.6%) when object context is removed. These results show that material traits provide important information to the material recognition process.

To demonstrate the ability of material traits to generalize well between datasets, we collected a second set of material images from a different source: ImageNet [13]. ImageNet obtains images from a variety of sources; they are thus more diverse than solely Flickr images. We collected 3480 images from ImageNet via searches for each material category. Images without bounding boxes were discarded.

To evaluate the use of material traits for material recognition on this ImageNet dataset, we first train material trait classifiers on the full set of FMD images. We then split the ImageNet images evenly into training and test sets and compute the distributions of recognized material traits on the training and test sets. We train an SVM classifier with the histogram intersection kernel of Equation 3.7 using the distribution of material traits on the training set.

Figure 3.9 shows the average accuracy for our method on this dataset. The average ac-curacy of 60.5% on ImageNet images shows that material traits encode material information that depends on neither the particular type of object exhibiting a material, nor the scene context in which that material appears. While Huet al. [26] do not provide an exact value, visual inspection of their results indicates an accuracy of roughly 60% as well.

Figure 3.10 contains three misclassification examples from ImageNet images. The stone in the first image has brown color stripes characteristic of wood. The glass in the second image looks translucent due to condensation, and translucent is a trait associated with plastic more than glass. The final image is a misclassification due to localization. The ImageNet database only provides object bounding boxes, not masks. This box contains mostly smooth regions and light colors, traits representative of paper.


Figure 3.10: Three misclassified ImageNet images, labeled with the predicted class and the true class in parentheses: Wood (Stone), Plastic (Glass), Paper (Foliage). The left two result from confusing appearances (striped and translucent are more often associated with wood and plastic, respectively), while the rightmost is due to the bounding box poorly fitting the object.

Table 3.2: Performance breakdown. FS: feature selection, SF: supplemental features, CAE: convolutional auto-encoder features. For the first row we performed direct material category recognition using the concatenation of all feature sets. This shows that the trait representation is indeed providing crucial information.

FS Traits SF CAE Accuracy
• • 34.2%
• • • 43.5%
• • 42.5%
• • • • 49.2%

We ran a set of tests, summarized in Table 3.2, to examine the impact of each major component of the material trait and category recognition process. The first row, accuracy when performing direct category recognition with all features but without material traits, shows that the trait representation provides crucial information for the material recognition process. By excluding either CAE-learned features or supplemental features (HOG, LBP, Color Histograms) from the trait recognition process, we see that both feature sets are necessary in order to best represent material categories.

3.3.2 Segmenting Images with Visual Material Traits

Segmenting images is a challenging process partially because the concept of a good segmentation is subjective. In the Berkeley Segmentation Dataset (BSDS) benchmark of Martin et al. [38], evaluation relies on multiple human segmentations as ground truth, since each one is a potentially correct solution. Visual material traits, with their accurate encoding of characteristic and intuitive material properties, should contribute valuable contextual cues to this process.

As an investigation of the potential for image segmentation via material traits, we augment the Normalized Cuts (NCuts) algorithm of Shi and Malik [50] with material trait information. In their method, they treat image segmentation as a graph partitioning problem and show that the optimal solution can be obtained from the solution to a generalized eigensystem, specifically, the eigenvector $y_2$ corresponding to the second-smallest eigenvalue (the smallest eigenvalue is trivially 0 due to the properties of the matrices involved):

$$(D - W)\,y = \lambda D y, \tag{3.8}$$

where $W$ is a matrix of weights representing pairwise pixel similarities and $D$ is a diagonal matrix containing the sum of all weights for a given pixel. We add an additional term,

$$\exp\left(-\frac{\|t_i - t_j\|_2^2}{\sigma_T}\right), \tag{3.9}$$

to the similarity score function used to obtain $W$. Here $t_i$ represents the predicted per-trait probabilities for pixel $i$ in the image and $\sigma_T$ is a scaling parameter. This term should cause pixels that exhibit similar material traits to be grouped together in the segmentation.

Figure 3.11: Comparing segmentation with and without material traits. Images on the left were segmented using the original NCuts algorithm, while those on the right were segmented with our modified version. Material traits can indicate the difference between fuzzy grass in the foreground and rocks in the background, despite the fact that they have similar colors.

Figure 3.11 shows images segmented using the original NCuts algorithm and our modified version. The first example shows that material traits can help discriminate between regions exhibiting different material properties (fuzzy grass and rocks). The expanded border around the penguin in the second segmentation is likely due to the fact that the traits are recognized in part using learned convolution kernels. Th