Attribute Learning Models - Attribute Learning for Image/Video Understanding


2.1.1 Attribute Learning Models

We briefly review several of the most commonly used attribute learning models in this section. Generally speaking, a key advantage of attribute learning models is that they provide an intuitive mechanism for multi-task learning (Salakhutdinov et al. [STT11]) and transfer learning (Hwang et al. [HSG11]): sharing attributes across classes enables learning with few or zero instances of each class, i.e., N-shot or zero-shot learning. In particular, the challenge of zero-shot recognition (as illustrated in Figure 2.2) is to recognize unseen visual object categories without any training exemplars of the unseen class. This requires transferring knowledge, in the form of additional semantic information, from auxiliary classes with example images to unseen target classes.

Figure 2.2: To recognise novel classes, zero-shot learning transfers knowledge from classes with examples to novel classes. Images from Dr. Christoph Lampert’s slides for [LNH09].

Attribute learning models have been explored for images and, to a lesser extent, video (Liu et al. [LKS11] and Fu et al. [FHXG12, FHXG13] as well as Chapter 3). Applications include modeling the properties of human actions (Liu et al. [LKS11]), animals (Lampert et al. [LNH09, LNH13]), faces (Kumar et al. [KBBN09]), scenes (Hwang et al. [HSG11]), and objects (Farhadi

Figure 2.3: High-level attributes allow the transfer of knowledge between object categories [LNH09]: the visual appearance of each attribute is learned independently from training examples across different categories; an object class without any training images can then be detected based on which attribute description a test image fits best. Images from [LNH09].

has been manually specified.

Generative models for visual attributes. The earliest work on attributes, by Ferrari and Zisserman [FZ07], studied elementary properties such as colour or geometric pattern. From human annotations, they proposed a generative model for learning simple colour and texture attributes. Specifically, a model $M$ is used to explain a whole image $I$, which is represented by a set of segments $\{s\}$. Each segment is associated with a latent variable $f$ that marks it as foreground ($f = 1$) or background ($f = 0$). The values of $f$ for all segments of $I$ are grouped into a vector $F$. The likelihood of the image is then

$$p(I \mid M; F, a) = \prod_{s \in I} p(s \mid M; f, a)^{N_s} \tag{2.1}$$

where $N_s$ is the number of pixels in segment $s$. Different types of attributes lead to distinct probability formulations, which are specified by the model $M$. For example, as illustrated in Figure 2.1, an attribute can be viewed either as unary (e.g., red colour or round texture) or as binary (e.g., black/white stripes).
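As a minimal sketch of how Eq. (2.1) might be evaluated, the snippet below works in log space, assuming the per-segment likelihoods $p(s \mid M; f, a)$ have already been computed and that $N_s$ is the pixel count of segment $s$. The function name and the example numbers are hypothetical and are not taken from [FZ07].

```python
import math

def image_log_likelihood(segment_probs, segment_pixel_counts):
    """Return log p(I|M; F, a) = sum over segments of N_s * log p(s|M; f, a).

    segment_probs: per-segment likelihoods, each in (0, 1] (assumed precomputed).
    segment_pixel_counts: number of pixels N_s of each segment.
    """
    if len(segment_probs) != len(segment_pixel_counts):
        raise ValueError("need one probability and one pixel count per segment")
    return sum(n * math.log(p) for p, n in zip(segment_probs, segment_pixel_counts))

# Hypothetical image with three segments.
probs = [0.5, 0.25, 0.5]
pixels = [2, 1, 3]
log_lik = image_log_likelihood(probs, pixels)
```

Summing logarithms rather than multiplying raw probabilities avoids numerical underflow when segments contain many pixels, since each factor is raised to the power $N_s$.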

Some later work (Parikh et al. [PG11b], Kovashka et al. [KPG12] and Berg et al. [BBS10]) extended the unary/binary attributes to compoundable attributes, which makes them extremely

image and video understanding. The framework proposed in Chapter 3 belongs to the category of generative models. In that framework, we have different prior knowledge for user-defined and data-driven attributes.

IAP and DAP models. Lampert et al. [LNH09, LNH13] studied the problem of recognizing object categories for which no training examples are available. To solve this problem, they introduced attribute-based classification, which performs object detection based on an intermediate-level semantic attribute representation. As illustrated in Figure 2.3, such attributes transcend the specific learning task and are pre-learned independently across different categories, thus allowing knowledge transfer. Specifically, for zero-shot learning tasks, they proposed two probabilistic frameworks: Direct Attribute Prediction (DAP), shown in Figure 2.4(b), and Indirect Attribute Prediction (IAP), shown in Figure 2.4(c). These models integrate human knowledge into the recognition of unseen classes through category-level class-attribute associations.

• DAP model. Assume the relation between the known classes $y_1, \ldots, y_K$, the unseen classes $z_1, \ldots, z_L$ and the descriptive attributes $a_1, \ldots, a_M$ is given by a matrix of binary association values $a_m^y$ and $a_m^z$. Such a matrix encodes the status of each attribute for each class. Extra knowledge is needed to define this association matrix; it can come, for instance, from human experts (Lampert et al. [LNH09, LNH13]), from a concept ontology (Fu et al. [FHXG13]), or from semantic relatedness measured between class and attribute concepts (Rohrbach et al. [RSS12]). In the training stage, the attribute classifiers are trained on the attribute annotations of the known classes $y_1, \ldots, y_K$. At the test stage, the posterior probability $p(a_m \mid x)$ can be inferred for each individual attribute $a_m$ in an image $x$. The class label of an unseen object class $z$ is then predicted by

$$p(z \mid x) = \sum_{a \in \{0,1\}^M} p(z \mid a)\, p(a \mid x) = \frac{p(z)}{p(a^z)} \prod_{m=1}^{M} p(a_m \mid x)^{a_m^z} \tag{2.2}$$

• IAP model. The DAP model learns attribute classifiers directly from the known classes, while the IAP model builds attribute classifiers by combining the probabilities of all associated known classes. It is also introduced as the direct similarity-based model in Rohrbach et al. [RSS12]. In the training step, a probabilistic multi-class classifier is learned to estimate $p(y_k \mid x)$ for all training classes $y_1, \ldots, y_K$. In the testing step, the attribute posteriors are predicted as

$$p(a_m \mid x) = \sum_{k=1}^{K} p(a_m \mid y_k)\, p(y_k \mid x) \tag{2.3}$$

Once $p(a \mid x)$ is estimated, it is used in the same way as in DAP for zero-shot classification.
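The two models can be illustrated with a small sketch that first forms the IAP attribute posteriors of Eq. (2.3) and then scores unseen classes with the DAP product of Eq. (2.2). This is an illustration under stated assumptions, not the implementation of [LNH09]: the function names are hypothetical, the toy numbers are invented, and the prior ratio $p(z)/p(a^z)$ is supplied by the caller and defaults to 1 for every class.

```python
def iap_attribute_posteriors(class_posteriors, class_attribute_probs):
    """Eq. (2.3): p(a_m|x) = sum_k p(a_m|y_k) * p(y_k|x).

    class_posteriors: p(y_k|x) for each known class k.
    class_attribute_probs: one row per known class, holding p(a_m|y_k).
    """
    num_attributes = len(class_attribute_probs[0])
    return [
        sum(p_y * row[m] for p_y, row in zip(class_posteriors, class_attribute_probs))
        for m in range(num_attributes)
    ]

def dap_class_scores(attribute_posteriors, unseen_signatures, prior_ratios=None):
    """Eq. (2.2): score(z) = p(z)/p(a^z) * prod_m p(a_m|x) ** a_m^z.

    unseen_signatures: one binary attribute vector a^z per unseen class.
    prior_ratios: p(z)/p(a^z) per unseen class; assumed 1.0 when omitted.
    """
    if prior_ratios is None:
        prior_ratios = [1.0] * len(unseen_signatures)
    scores = []
    for signature, ratio in zip(unseen_signatures, prior_ratios):
        score = ratio
        for p_am, a_zm in zip(attribute_posteriors, signature):
            score *= p_am ** a_zm  # attributes with a_m^z = 0 contribute a factor of 1
        scores.append(score)
    return scores

# Toy setting: two known classes, three attributes, two unseen classes.
p_y_given_x = [0.75, 0.25]
p_a_given_y = [[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]]
p_a_given_x = iap_attribute_posteriors(p_y_given_x, p_a_given_y)

signatures = [[1, 0, 1],
              [0, 1, 1]]
scores = dap_class_scores(p_a_given_x, signatures)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

In a DAP pipeline the list passed as `attribute_posteriors` would instead come from per-attribute classifiers trained on the known classes; the scoring step is the same in both cases.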

PST model. Rohrbach et al. [RES13] exploited the manifold structure of the instances in the novel classes to help attribute-based transfer learning for zero-shot and N-shot learning. To this end, they proposed a graph-based semi-supervised learning algorithm, the PST model. Specifically, they constructed a k-NN graph using the low-level features of the testing data. The distance between any pair of data points $(x_i, x_j)$ is

$$d(x_i, x_j) = \sum_{d=1}^{D} \left| x_{i,d} - x_{j,d} \right|$$

where $D$ is the dimensionality of the low-level feature space. Once the k-NN graph is computed, they replace the original distance with the semantic distance between attribute vectors, given by

$$d(x_i, x_j) = \sum_{m=1}^{M} \left| p(a_m \mid x_i) - p(a_m \mid x_j) \right|$$

where $M \ll D$, and the similarity over the whole graph is measured by an RBF kernel. The label set is initialized according to the nearest-neighbour distance of each testing instance to the prototypes of each novel class.
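The graph construction described above can be sketched as follows. This is a rough illustration only: neighbours are chosen by L1 distance in the low-level feature space, and each edge is then weighted by an RBF kernel on the L1 distance between attribute posteriors. The kernel form $\exp(-d^2 / 2\sigma^2)$, the bandwidth `sigma`, the function names and the toy data are all assumptions that the text does not specify; label initialization and propagation are left out.

```python
import math

def l1_distance(u, v):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pst_style_graph(features, attribute_scores, k, sigma=1.0):
    """Return {i: {j: weight}} linking each instance to its k nearest neighbours.

    Neighbours are selected by L1 distance between low-level features;
    edge weights use an RBF kernel on the L1 distance between attribute
    posteriors p(a_m|x). The kernel form and sigma are assumptions.
    """
    graph = {}
    for i in range(len(features)):
        by_feature_distance = sorted(
            (l1_distance(features[i], features[j]), j)
            for j in range(len(features)) if j != i
        )
        graph[i] = {}
        for _, j in by_feature_distance[:k]:
            d_sem = l1_distance(attribute_scores[i], attribute_scores[j])
            graph[i][j] = math.exp(-d_sem ** 2 / (2.0 * sigma ** 2))
    return graph

# Hypothetical test instances: D = 3 low-level features, M = 2 attributes.
features = [[0.0, 0.0, 1.0], [0.0, 0.5, 1.0], [5.0, 5.0, 0.0], [5.0, 4.0, 0.0]]
attribute_scores = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
graph = pst_style_graph(features, attribute_scores, k=1)
```

Because the attribute vectors have only $M$ entries, the edge weights are computed in a much lower-dimensional space than the one used to pick the neighbours.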
