Summary - Attribute Learning for Image/Video Understanding

The preceding discussions have covered the essential issues and studies in the literature on attribute learning for image and video understanding. Several widely used attribute learning models were introduced. In particular, we discussed and compared binary vs. relative attributes, user-defined vs. data-driven attributes, and image vs. video attributes, as well as the associated low-level features and datasets. We also reviewed other semantic representations beyond attributes, and machine learning work related to this thesis.

Existing methods have shown promising results in attribute learning for image and video understanding. Nevertheless, several open problems and limitations remain. Firstly, user-defined attributes are of limited use in analysing complex image and video data. They are defined using extra knowledge from either expert users or a concept ontology, and are therefore intrinsically affected by sparse, incomplete and ambiguous annotations. Secondly, existing attribute learning models suffer from the projection domain-shift problem, the prototype sparsity problem, and an inability to combine multiple semantic representations. Thirdly, how to learn from noisy annotations of relative attributes remains an unsolved problem.

The subsequent chapters of this thesis address these limitations as follows: learning latent attributes in Chapter 3 to overcome the limitations of user-defined attributes; transductive multi-view embedding in Chapter 4 to tackle the problems of projection domain-shift, prototype sparsity and the inability to combine multiple semantic representations; and robust learning of relative attributes in Chapter 4 to learn from noisy annotations of relative attributes.

Learning Latent Attributes

In this chapter, we are interested in the automatic classification and annotation of unstructured group social activities and complex image classes. In particular, we focus on home videos of social occasions such as graduation ceremonies, birthday parties and wedding receptions from the USSA dataset (Section 2.1.6.6), which feature the activities of groups of people ranging from a handful to hundreds (Fig. 1). By classification, we aim to categorise each video/image into a class; by annotation, we aim to predict what is present in the video/image. This implies a wide range of multi-modal annotation types, including object (e.g. group of people, cake, balloon), action (e.g. clapping hands, hugging, taking photos), scene (e.g. indoor, garden, street) and sound (e.g. birthday song, dancing music). We consider the problems of classification and annotation to be inter-related, and argue that they should be tackled together.

We propose to solve these problems using an attribute learning framework, in which annotation becomes the problem of attribute prediction and image/video classification is aided by a learned attribute model. Attributes describe the characteristics that embody an instance or a class. Essentially, attributes answer the question of how to describe a class or instance, in contrast to the typical (classification) question of how to name an instance. The attribute description of an instance or category is useful as a semantically meaningful intermediate representation that bridges the gap between low-level features and high-level classes. By expressing classes in terms of well-known attributes, attributes thus facilitate transfer and zero-shot learning, alleviating the lack of labelled training data.

However, user-defined attributes may be limited when used to explore complex multi-modal visual data, since they are defined using extra knowledge from either expert users or concept ontologies, and the definition process has no direct link to the visual recognition task. The possibly poor annotation quality of user-defined attributes may further degrade attribute learning algorithms. In most cases, the annotations of user-defined attributes are sparse, incomplete and ambiguous.

These problems are particularly prominent when attribute learning is applied to complex consumer videos. Such videos depict unstructured social group activity, i.e. an unconstrained space of objects, events and interactions. The casual nature of this data makes it difficult to extract good features, since the videos are typically captured with low resolution, poor lighting, occlusion, clutter, camera shake and background noise.

To this end, we propose a framework that jointly learns user-defined and latent attributes. This chapter systematically formulates a semi-latent attribute space learning framework, which learns multi-modal user-defined and latent attributes for the automatic classification and annotation of unstructured group social activity. In contrast to existing work on attribute learning for image object classes or simple human action classification, this work tackles, for the first time, the problem of attribute learning for understanding group social activities with sparse and incomplete labels. In particular, we focus on videos of social group activities, which are particularly challenging and topical examples of this task because of their multi-modal content and their complex, unstructured nature relative to the density of annotations.

The main content of this chapter has been previously published in:

1. Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong, “Attribute Learning for Understanding Unstructured Social Activity”, European Conference on Computer Vision (ECCV), 2012;

2. Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong, “Learning Multi-modal Latent Attributes”, IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), 36(2), 303-316, Feb 2014.

$$F = S(L(\cdot)), \qquad L : X^d \to Y^p, \qquad S : Y^p \to Z, \tag{3.1}$$

where $L$ maps the raw data to an intermediate representation $Y^p$ (typically with $p \ll d$) and $S$ then maps the intermediate representation to the final class $Z$. Examples of this approach include dimensionality reduction via PCA (where $L$ is chosen to explain the variance of $x$ and $Y^p$ is the space of orthogonal principal components of $x$), or linear discriminants and multi-layer neural networks (where $L$ is optimised to predict $Z$).
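As a minimal sketch of the two-stage form in Eq. (3.1), the following Python composes a linear map L with a nearest-centroid classifier S. The projection matrix, the centroids and the class names are hypothetical values hand-picked for illustration; in practice L would be fitted (for example by PCA) and S would be trained on labelled data.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def make_linear_map(W: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], Vector]:
    """Return L: X^d -> Y^p for a p-by-d matrix W (one projection direction per row)."""
    def L(x: Sequence[float]) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in W]
    return L


def make_nearest_centroid(centroids: Dict[str, Sequence[float]]) -> Callable[[Sequence[float]], str]:
    """Return S: Y^p -> Z that picks the class whose centroid is closest to y."""
    def S(y: Sequence[float]) -> str:
        return min(
            centroids,
            key=lambda z: sum((a - b) ** 2 for a, b in zip(y, centroids[z])),
        )
    return S


def compose(S: Callable[[Vector], str], L: Callable[[Sequence[float]], Vector]) -> Callable[[Sequence[float]], str]:
    """Build F = S(L(.)) as in Eq. (3.1)."""
    return lambda x: S(L(x))


# Hypothetical setting: d = 4 raw features, p = 2 intermediate dimensions.
W = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]]
centroids = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}

L = make_linear_map(W)
S = make_nearest_centroid(centroids)
F = compose(S, L)

print(F([1.0, 0.8, 0.1, 0.0]))  # class_a
print(F([0.0, 0.2, 0.9, 1.0]))  # class_b
```

Keeping L and S as separate callables mirrors the point made in the text: the intermediate space can be swapped (principal components, discriminant directions, or semantic attributes) without changing how the final classifier is applied.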

Attribute learning [LNH09, PHPM09] exploits the idea of requiring $Y^p$ to be a semantic attribute space. $L$ and $S$ are then learned by direct supervision with instance, attribute vector and class tuples $D = \{(x_i, y_i, z_i)\}_{i=1}^{n}$. This has benefits for sparse-data learning, including multi-task, N-shot and zero-shot learning. In multi-task learning [STT11], the statistical strength of the whole dataset can be shared to learn $L$, even if only the subsets corresponding to particular classes can be used to learn each class in $S$. In N-shot transfer learning, the mapping $L$ is first learned on a large “source/auxiliary” dataset $D$. We can then effectively learn from a much smaller “target” dataset $D^* = \{(x_i, z^*_i)\}_{i=1}^{m}$, $m \ll n$, containing novel classes $z^*$, by transferring the attribute mapping $L$ to the target task, leaving only the parameters of $S$ to be learned from the new dataset $D^*$. The key unique feature of attribute learning is that it allows zero-shot learning: the recognition of novel classes without any training examples, $F : X^d \to Z^*$ ($Z^* \in Z$), via the learned attribute mapping $L$ and a manually specified attribute description $S^*$ of the novel class.
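The zero-shot setting described above can be sketched as follows. This is an illustrative toy under stated assumptions, not the model developed in this thesis: the attribute mapping L is assumed to be a plain linear least-squares regressor fitted by gradient descent, the novel-class description S* is assumed to be a table of attribute prototypes matched by nearest neighbour, and all data, class names and hyperparameters are hypothetical. Source class labels are omitted because only the attribute vectors are needed to fit L.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


def fit_attribute_map(X: List[Vector], Y: List[Vector],
                      lr: float = 0.5, steps: int = 200) -> List[Vector]:
    """Fit L as a p-by-d matrix W minimising the mean squared error of W x_i vs y_i."""
    n, d, p = len(X), len(X[0]), len(Y[0])
    W = [[0.0] * d for _ in range(p)]
    for _ in range(steps):
        for a in range(p):  # each attribute is an independent regression
            residuals = [dot(W[a], x) - y[a] for x, y in zip(X, Y)]
            grad = [2.0 / n * sum(r * x[j] for r, x in zip(residuals, X))
                    for j in range(d)]
            W[a] = [w - lr * g for w, g in zip(W[a], grad)]
    return W


def zero_shot_classify(W: List[Vector], prototypes: Dict[str, Vector],
                       x: Sequence[float]) -> str:
    """Predict attributes with L, then pick the novel class with the nearest prototype."""
    y_hat = [dot(row, x) for row in W]
    return min(prototypes,
               key=lambda z: sum((a - b) ** 2 for a, b in zip(y_hat, prototypes[z])))


# Hypothetical source data: two features, two binary attributes.
X_source = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2],   # instances with attributes (1, 0)
            [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]   # instances with attributes (0, 1)
Y_source = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

W = fit_attribute_map(X_source, Y_source)

# Manually specified attribute descriptions S* of two novel classes,
# neither of which has any training example.
S_star = {"novel_both": [1.0, 1.0], "novel_neither": [0.0, 0.0]}

print(zero_shot_classify(W, S_star, [0.9, 0.9]))  # novel_both
print(zero_shot_classify(W, S_star, [0.1, 0.1]))  # novel_neither
```

The N-shot case would reuse the same fitted W and train only the second stage on the small target set, rather than specifying the prototypes by hand.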
