A Ranking based Model for Automatic Image Annotation in a Social Network

(1)

A Ranking based Model for Automatic Image Annotation in a Social Network

Ludovic Denoyer and Patrick Gallinari

University Pierre et Marie Curie - LIP6 104 avenue du prsident Kennedy

75016 PARIS FRANCE

Abstract

We propose a relational ranking model for learning to tag im-ages in social media sharing systems. This model learns to associate a ranked list of tags to unlabeled images, by consid-ering simultaneously content information (visual or textual) and relational information among the images. It is able to handle implicit relations like content similarities, and explicit ones like friendship or authorship. It can be used either for fully automatic image labeling or for helping the user with a ranked list of candidate tags. The model itself is based on a transductive algorithm thats learns from both labeled and unlabeled data. Experiments on a real corpus extracted from Flickr show the effectiveness of this model, particularly when using authorship and friendship relations.

Introduction

We consider the problem of image annotation in large so-cial media sharing web sites. These systems allow users to upload and share pictures and videos over the Web. Popu-lar systems like Flickr1_{or Youtube}2_{contain billions of}

im-ages/videos. Retrieving relevant information in such large collaborative systems is usually handled via manual tagging of the corpus resources. Textual tags then allow using simple and fast keyword based search engines.

Tagging large collections is often prohibitive and man-ual tags are known to be imprecise, ambiguous, inconsis-tent and subject to many variations (Golder and Huberman 2006),(Matusiak 2006). Automatic and collaborative tag-ging methods, aimed at improving the quality of annotations and at helping users to tag their resources have been the sub-ject of an intensive recent research. However, for most ap-plications, performance is still quite deceiving.

The most common approaches attack the tagging prob-lem as a supervised classification probprob-lem, where a visual signal has to be associated to a set of known key words ((Hi-ronobu, Takahashi, and Oka 1999), (Chang et al. 2003), (Li and Wang 2008),...). Depending on the application, different types of features may be considered, ranging from simple histograms for Web image search to more complex process-ing for professional applications. A recent work proposes to

1

http://www.flickr.com

2

http://www.youtube.com

exploit also the visual correlation between images (Wu et al. 2009). Some methods propose to use the dependencies be-tween tags as an additionnal information (Liu et al. 2009). Several unsupervised methods, exploiting the co-occurrence of image cues and keywords, have also been proposed, e.g. (Barnard et al. 2003). Other methods rely on tag propaga-tion: starting from a partially tagged collection, labels are propagated in order to tag unlabeled images or to enrich ex-isting tags. Note that many of the models in this family only propagate tag information without incorporating visual in-formation.

All these methods have strong limitations. Classification or latent variable techniques for example can only learn a correspondence between visual information and correspond-ing keywords and cannot cope with more specific or abstract annotations. For example, considering the top left picture in figure 1, sky and river could certainly be inferred easily using visual information while tags Budapest or 2006 will not. This last tag will not be found without knowing that the photographer certainly spent his holidays in Hungary during summer 2006. The tag 2006 could only be found know-ing some extra information about the user that shared this picture. This example shows that the use of users’ informa-tion, and globally, of the information contained in the social network of a sharing system could be very relevant for tag-ging systems. The existing models cannot handle multiple sources of information such as GPS information, time, au-thorship and other metadata which often come with recent image collections. Last, they have not been designed for exploiting the rich information present in cooperative Web sites such as social annotations or virtual friendship net-works.

(2)

general tagging.

We introduce here a new relational learning model de-signed for image annotation in a social media. It is able to exploit simultaneously visual and textual content, metadata, and relational information, either implicit like word or im-age similarities, or explicit, like friendship links of a social network. In particular the model is able to exploit the rich folksonomy present in social sites.

The specific task the model is used for, is the ranking of tags associated to images in a social media collection. More precisely, we consider a collection composed of both anno-tated images and non annoanno-tated ones. Explicit social links and metadata are also available. The goal is to provide a ranking of keywords for any non annotated image and to rank and enrich the tags set of annotated ones.

In our experiments we consider friendship links and au-thorship metadata which were available from the Flickr data. If provided, additional relational and metadata information could be incorporated in the model as well.

The contribution are the following:

• We propose a new graph based relational model for learn-ing to rank keywords associated to images.

• This model is able to exploit a variety of informations present in a social media for tagging semantic resources. • All information sources are fully integrated during the

learning process via the optimization of a global objec-tive function.

• The method is validated on Flickr data.

Besides, the system being based on a relational ranking algorithm, tags associated to an image will be ranked by rel-evance and may serve either for fully automatic annotation or for suggesting ordered lists of keywords to a user.

The paper is organized as follows:We first introduce a baseline algorithm for learning to rank tags using only vi-sual information. This algorithm will serve as a baseline for comparison and as a brick for our graph based method. We then present our new algorithm for ranking images or-ganized according to a graph of relations. In the experimen-tal section we provide results obtained from three datasets and compare different instances of the model to the baseline ranking methods. These instances rely on the use of differ-ent types of relations and data features. We show that the best results are obtained when this model is used on a graph extracted from the social network of the sharing system.

Note that both the software and the datasets are available on the authors website3and allow readers to launch exactly the same set of experiments.

Ranking Model for Image Annotation

Notations and Definitions

Vectors are denoted in bold; subscripts correspond to com-ponents of a vector while superscripts correspond to indexes. For example, xj_k corresponds to the k-th component of the j-th vector x. We consider:

3

http://[blind review]

• The set of images that are currently in the sharing system : I = (i1_{, ..., i}n_{) where n is the number of images.}

• A feature function ψ : I → RN_{. This function will map}

an image onto a feature space of size N . Different ψ func-tions will be defined in section . For simplicity, we will replace ψ(ij_{) by the identifier x}j_{in the following. x}j

kthus

corresponds to the k-th component of the feature vector of image ij_.

• The set of possible tags is T = (1, .., T ) where T is the number of tags.

We define a ranking function f as: f : [1..n] × T → R

(j, t) 7→ f (j, t) = yj_t (1) where ytjis the score of tag t for image ij. The higher the

score, the more relevant the tag is.

The set of images in I is composed of two subsets: • A labeled set Il = (i1, ..., i`) composed of ` images

and their associated annotations (y1_{, ..., y}l_{) provided}

by users where y_kj = 1 if k is a tag associated to ij _and

0 otherwise

• An unlabeled set Iu = (i`+1, ...., i`+u) of size u

corre-sponding to images not been labeled by any user4_.

The annotation problem amounts at computing from the set of images I and the set of manual annotations (y1_{, ..., y}l_{), a ranking score f (i}j_{, t) for all images and tags}

in Iu5

In order to describe the relational ranking model, we pro-ceed in two steps. In the next section, we introduce a simple ranking model which operates only on the content of im-ages. It will be used in the experiments section as a baseline for performance comparison. It is also a component of the relational ranking model. The latter is described in section and is presented as a relational extension of the content only ranking model.

Content only Annotation with Pairwise Ranking

Model

Learning problem Several ranking algorithms have been proposed in the machine learning community, and recently, their use has become popular in the IR field leading to a large literature in this domain. The algorithm below is formally similar to RankSVM (Joachims 2002). The only difference is that in the relational learning model of section , all mod-ules are jointly trained using gradient descent since classical SVM optimization algorithms do not easily extend to this relational setting.

We define the pairwise ranking risk for model θ over a labeled image ik_{∈ I}

lwith annotation ykas:

4

Note that ` + u = n

5

(3)

Figure 1: A social network of users and pictures. Circles correspond to labelled images and the goal is to automatically find the tags of the other pictures. While river or sky can certainly be found using only visual information, this figure shows that the 2006 tag of the second image of John, which cannot be found by visual content, can be infered by propagating the tag of the first picture of this person - using authorship link. The friendship relations allows us to propage the flower tag through the community composed of Georges, Patrick and Melissa. This example shows why it is important to use both the content information but also the social network information between pictures.

∆θ(ik, yk) =

X

(t,t0_):yk t>ykt0

h(f_θP R(i, t) − f_θP R(i, t0)) (2)

where h(.) is the classical hinge loss function defined by:

h(x) = max(0, 1 − x) (3)

and fP R

θ (i, t) is the annotation function defined in

equa-tion 1. This risk is known to be an upper bound of the er-ror over pairs of examples to rank and its minimization has been shown to be very effective for the minimization of the ranking error, and particularly for the minimization of the average precision (see section ).

We introduce a L2-regularization term with hyperparame-ter λreg6for avoiding overfitting. We then define the ranking

objective function as: LP R(θ) =

X

ik_∈I_l

∆θ(ik, yk) + λreg||θ||2 (4)

Note that only labeled images are considered at this step. The learning problem thus corresponds at finding parameters θ∗that minimize this ranking loss:

θ∗= argmin_θLP R(θ) (5)

This model is denoted PR for Pairwise Ranking in the fol-lowing.

6

λregis usually fixed by cross-validation

Ranking function We will use a linear function fθ(k, t)

for computing the score of an image ikfor tag t:

f_θP R(k, t) = hθ, Φ(xk, t)i (6) where Φ(xk, t) is a feature vector for the couple (image ik ;tag t). h., .i is the classical dot product. The feature function Φ, used here is the multiclass feature function (Har-Peled, Roth, and Zimak 2002) defined as:

Φ(xk, t) = 0 · · · 0 · · · xk_{· · · 0 · · · 0} !

(7)

Φ(xk, t) is a block vector with sub-vector xk at position N ∗ t + 1 in RN ∗T _{and 0 anywhere else. This notation}

al-lows expressing the ranking function as a simple dot product over an input space: it can be seen that one linear model per tag will be learned, which amounts at T independent linear models in all.

Annotation in a Social Network

In this section, we introduce a new ranking model called GPR - for Graph Pairwise Ranking - designed for labeling objects (images) based on the object content itself, on exter-nal information sources like metadata, and on different types of relations shared by the objects in the collection.

(4)

directly from the content of the images or the surrounding text as in (Tong et al. 2006) (implicit relations) or build over a social network (explicit relations e.g. authors friendship). In section , we will explore the influence of both types of relations. In the presentation below we will not make this distinction and will consider a general relation.

Let us denote R a set of relations between images. El-ements of R are scalars: R = {wj,k}, j × k ∈ [1..n]2}

where wj,k > 0 is the weight of the relation between j and

k. The couple (I, R) thus corresponds to a weighted graph of images.

Proposed Model

GPR model is based on two assumptions. It considers that a good ranking model is both:

• A model that correctly ranks the tags of all images I. (Classical Ranking Assumption)

• A model that exploits the graph structure in order to ex-tract new regularities over the image collection. In par-ticular, we will exploit here relations which will reinforce the proximity of tag lists, for images that share a strong positive connection. (Regularity Assumption)

Typically, if the structure of the graph is extracted from a friendship network of people sharing common interests, knowing that two authors are friends will probably indicate some important relation about their image collections and tag lists. If the network is inferred from some similarity function between images or surrounding texts, a similar con-clusion will held. The first assumption is captured by the ob-jective function of the PR model (section ), the second one is introduced through a relational term:

LGP R(θ) = X ik_∈I_l ∆θ(xk, yk) + λreg||θ||2 +λREL X t∈[1..T ] X (j,k)∈[1..n]2 wj,k fθGP R(j, t) − f GP R θ (k, t) 2 =LP R(θ) + λRELLREL(θ) (8) where f_θGP R(k, t) is the scoring function of the GPR model with parameters θ.

The term LREL(θ) will force to increase tag scores

sim-ilarity for connected pictures according to the connection weight. Such regularity assumptions over a graph struc-ture are commonly used in the context of semi-supervised learning for classification (Abernethy, Chapelle, and Castillo 2008).

GPR Ranking Function In the relational model, we want the score of a node to depend on several available informa-tion sources in the social network. In this model, the score of a tag for an image will depend both on:

• the visual information in the image itself

• relational information gathered from neighboring images in the graph

We define the GPR scoring function as: f_θ,ξGP R(k, t) = hθ, Φ(xk, t)i + ξk,t

= f_θP R(k, t) + ξk,t

(9) where θ ∈ RNdenotes as before the parameters of a content ranking function, ξ ∈ R[1..n]×[1..T ]are additional slack vari-ables. There is one slack variable per image and tag. Glob-ally considering the n images and T tags, they add n ∗ T degrees of freedom w.r.t. the initial PR model. They will allow the model to adjust the tag scores as a function of the neighboring images influence. Note that all the parameters of this new scoring function, i.e. both the θs and ξs, will be learned together according to the global objective function 8. The scoring function will then use all available informa-tion present in the social sharing system (content, metadata, relations) in an optimal way, according to the objective func-tion.

GPR Objective Function Inserting this new ranking function in the general objective function 8, we obtain the detailed form of the objective function for our relational model: LGP R(θ, ξ) = X ik∈Il X (t,t0_):yk t>yt0k h(hθ, Φ(xk, t)i + ξk,t− hθ, Φ(xk, t0)i − ξk,t0) + λREL X t X (j,k):wj,k>0 wj,k(hθ, Φ(xj, t)i + ξj,t− hθ, Φ(xk, t)i − ξk,t)2 + λreg||θ||2 + λslack X k,t∈[1..`]×[1..T ] ξ2k,t (10) where:

• the first term of the sum is the pairwise hinge loss as de-fined in section , where f_θ,ξGP R has been substituted to fP R

θ ,

• the second term is the graph regularity term introduced in section ,

• the third term is the L2 regularization over the content parameters θ,

• the last term is the L2 regularization term over the slack variables. λslackis an hyper-parameter fixed by cross

idation or by hand that penalizes high slack variables val-ues and encourages final scores to be close to the content-based score f_θP R(k, t). Typically, if λslack = +∞ ,

∀k, t, ξk,t = 0 and the GPR model will rank tags using

only the content information.

This objective function is minimized through gradient de-scent simultaneously for θ and ξ. It should be observed that since this is not a convex function, the algorithm may converge to a local minimum. The experiments presented in section show that it is stable and provides good perfor-mance.

(5)

friendship or authorship) than dense ones like similarities. Moreover, the experiments show that sparse relations work much better.

Gradient Descent Learning

In order to minimize the GPR objective function described in Equation 10, we have used a classical gradient descent algorithm with a -step as described in Algorithm 1. This algorithm is used with a manually fixed number of iterations but could also be used by using a validation set. It consists in iteratively updating the θ and ξ parameters according to the gradient of the objective function LGP R(θ, ξ).

Algorithm 1 Gradient descent algorithm

1: procedure GRADIENT(I, R, (y1, ..., yl_{), )}

2: θ ← random

3: ξ ← random

4: while stop criterion do . Here: fixed number of iterations 5: θ ← θ + ∇θLGP R(θ, ξ) 6: ξ ← ξ + ∇ξLGP R(θ, ξ) 7: end while 8: return {fGP R θ,ξ (k, t)}∀i∈Iu,∀t∈[1..T ] 9: end procedure

Experiments

The experiments have been made on different corpora ex-tracted from the Flickr7 website. Each corpus is composed by a set of images and a set of users. Each image is de-scribed by both its visual content (.jpg files) and by meta-informations (author, date, comments, ...). Each user in the social network is described by his name, his location, his list of friends and the list of his own images. We have collected 3 corpora that have different characteristics as described in Table 2:

• Corpus C1 is composed of a small set of images and a small set of tags. It allows us to perform a large set of ex-periments and compare the performances obtained with different configurations of the model (textual or visual content, implicit or explicit relations, ...).

• Corpus C2 is composed of a larger set of possible tags and allows us to evaluate the ability of the model to deal with complicated tagging tasks.

• Corpus C3 is composed of more than 3 thousands of pic-tures and is used to evaluate how the model scales to larger databases.

All these corpus have been collected through the Flick api8 _{by taking random connected users (that are friends, or}

friends of friends, etc..) and by then getting randomly some of their pictures. For the experiments, the collection was di-vided into two sets: a labeled set and a so called unlabeled

7

http://www.flickr.com

8

http://www.flickr.com/services/api/

Figure 2: Graph built with visual similarities. Image with strong visual similarities are connected. Bold lines corre-spond to hard similarities with a high weights, normal lines correspond to soft similarities. For example, this graph al-lows us to propagate the budapest tag from the top-left pic-ture to the bottom right picpic-ture.

set (i.e. the labels of these images were not considered dur-ing traindur-ing). Traindur-ing was performed usdur-ing both labeled and unlabeled sets according to the algorithm. Evaluation was performed on the unlabeled set using as targets the tags associated to these images.

From these collections, we have extracted two different feature vectors that correspond respectively to the visual content of the images (ψimage_{) and to the textual}

represen-tation associated with the images (ψtext). The latter directly comes from the meta informations available in Flickr. The feature functions are detailed in table 1. We have deliber-ately used simple histograms as image features instead of more complex representations, since the focus here is on the role of relational social data and not on the absolute perfor-mance of the system. We have extracted 4 types of relations between pictures described in table 1. Two of them (wtext and wimage_{) correspond to implicit relations extracted from}

the content of the images as in (Tong et al. 2006). Both are proportional to the text and image similarities (see table 1). wauthor and wf riends correspond to explicit relations extracted from the Flickr social network (table 1). For the implicit relations which are very dense, with have used a threshold in order to keep only the strongest weights and re-duce the number of edges in the image graph. Figures 2 and 3 the graph obtained by using implicit or explicit relations onto the example presented in figure 1.

These datasets contain heterogeneous tags and tagging is difficult. For example, some of the tags are: beach, califor-nia, water, 2008, canon, nyc, flower, 2007, ....2007,2008,nyc or canon9_{are almost impossible to find using classical visual}

9

(6)

Possible Features Description

ψimage Normalized RGB histograms with 48 different colors

ψtext _{Normalized TF-IDF vectors computed over the titles and description of}

the pictures. The dimension depends on the corpus. Relations Weight values (0 if no relation)

(Implicit) Content relations

wimage _wimage j,k =< ψ

image_(ij_{); ψ}image_(ik_{) >. This relation corresponds to a}

visual similarity of images. wtext _wtext

j,k =< ψtext(ij); ψtext(ik) >. This relation corresponds to a textual

similarity of images titles and descriptions. (Explicit)

Social relations

wauthor _wauthor j,k =

1 if image j and k have the same authors. 0 otherwise

wf riends _wf riends

j,k = 1 if the authors of image j and k are friends

Table 1: Feature functions and relations

Corpus C1 C2 C3

Number of images 519 801 3 183

Number of authors 100 100 1 000

Number of tags 32 326 25

Size of ψtextvectors 990 990 4 460

Number of wimagerelations ≈ 120 000 ≈ 260 000 ≈ 2 millions Number of wtext_relations _{≈ 90 000} _{≈ 140 000} _{≈ 1.3 millions}

Number of wauthorrelations ≈ 9 000 ≈ 20 000 ≈ 100 000 Number of wf riends_relations _{≈ 12 000} _{≈ 30 000} _{≈ 320 000}

Table 2: Statistics about the datasets. The implicit relations have been thresholded so that only the largest are kept.

Figure 3: Graph built with friendship relations. The images of two friends are connected. This graph allows us to prop-agate the flower tag among friends.

based systems.

Methodology

We have performed experiments with different hyper-parameters values for both the PR and the GPR models. These hyper-parameters are the number of gradient descent iterations, the gradient descent step, the proportion of unla-beled images denoted p with p = _l+uu and the regulariza-tion parameters λreg, λRELand λslack. We have launched

three runs for each experiment. We have then computed, for each image, the average precision (APR) obtained with the ordered list of tags returned by the learning machine. We have then computed the average of these values over all the images and all the runs.

We have compared the results obtained for various config-urations of feature functions and relations, using as baselines the content-only PR model. We have performed experiments with all the possible combinations of feature ψ and relations w and different meta-parameters values. This corresponds to more than a thousand runs. Due to a lack of space, we only report here, in table 3 the most significant results obtained.

Results

(7)

Corpus: C1 C2

Test Size (p): 25% 50 % 75 % 25% 50 % 75 %

Features Edges Model Average Precision (.. %)

ψimage Content only 27 25.6 23.4 8.6 8.1 7.5 wauthor _GPR _55.7 _59.3 ₄₅ _39.7 _33.5 _24.3 wf riends _GPR _51.5 _49.3 ₄₂ _25.6 _21.3 _16.6 wimage _GPR _28.3 _26.9 _24.7 _8.4 _7.9 _7.8 wtext _GPR _29.9 _26.6 _24.7 _8.8 _8.2 _8.1 ψtext Content only 41.5 38.5 34.4 20.6 18.7 15.3 wauthor GPR 59.7 56.8 51.5 41.4 39.2 32 wf riends _GPR ₅₉ _58.8 ₅₂ _43.2 _38.9 _31.5 wimage GPR 32 27.6 27.3 15.9 13.1 12.1 wtext _GPR ₃₄ _35.4 ₃₄ _15.6 _16.8 _15.4 Corpus: C3 Test Size (p): 25% 50 % 75 %

Features Edges Model Average Precision (.. %)

ψtext

Content only 33.2 31.7 30.4

wauthor _GPR _40.5 _36.1 _33.7

wf riends _GPR _39.1 _37.2 _35.3

Table 3: Performances with Structure on the Datasets. The percentage (e.g. 25%) indicates the proportion of unlabeled images used.

performed 500 learning iterations and 3000 for C3. The best results have been obtained using λreg = 0.1, λslack = 1

and λREL= 0.1 for social relations (author and friend) and

0.01 for content relations (image and text).

First, we can see that the textual features (ψtext) most often outperform the visual features (ψimage). This is be-cause words share much more semantic than global visual informations encoded in image histograms do. Moreover, the textual description of an image may contain important information that allow the model to find difficult tags like dates or places for example or more abstract or specific tags. This information cannot be inferred from visual informa-tion. Second, the use of implicit relations through similari-ties does not really improve the performance of the baseline model, and sometimes degrades this performance. For ex-ample, for the C1 corpus with 25% of unlabelled images, the use of these relations increases the average precision by 2.9% when using image features and text similarities rela-tions, whereas with text features and text relarela-tions, the per-formance decrease is 7.5%. At last, the use of explicit so-cial relations greatly improves the performance: for the C1 corpus with visual features, using the authorship relation al-most increases the performance by a factor 2. This effect is very important in the C2 corpus which contains many pos-sible tags and where the performance grow from about 8% to about 40% by using social relations. When using tex-tual features, this effect is weaker than with visual features, but the use of authorship or friendship also greatly improves the performance. Note that while the author relation works better than friendship relations with visual features, the lat-ter connections are equally important for text features. Note that although the increase is smaller for the large set C3, it is still very significant (from 5 % to 7%).

Related Work

(8)

Conclusion

We have proposed a model that is able to automatically an-notate images. This method handles both the content of im-ages and also the relations between imim-ages which can be either implicit (visual similarities for example) or explicitly build upon a social network, while state-of-the-art methods either consider content or structure. We have shown that this model greatly improves a baseline ranking method when us-ing authorship or friendship relations. This model leaves open some questions. It considers only one relation at a time and cannot deal with all the possible relations simul-taneously. Moreover, its complexity is still high for now. These two points are currently being investigated.

References

Abernethy, J.; Chapelle, O.; and Castillo, C. 2008. Web spam identification through content and hyperlinks. In AIRWeb ’08: Proceedings of the 4th international work-shop on Adversarial information retrieval on the web. New York, NY, USA: ACM.

Barnard, K.; Duygulu, P.; Forsyth, D.; de Freitas, N.; Blei, D. M.; and Jordan, M. I. 2003. Matching words and pic-tures. J. Mach. Learn. Res. 3:1107–1135.

Bilgic, M.; Namata, G.; and Getoor, L. 2007. Combin-ing collective classification and link prediction. In ICDM Workshops, 381–386.

Cao, L.; Luo, J.; and Huang, T. S. 2008. Annotating photo collections by label propagation according to multiple sim-ilarity cues. In MM ’08: Proceeding of the 16th ACM in-ternational conference on Multimedia, 121–130.

Chang, E.; Goh, K.; Sychay, G.; and Wu, G. 2003. Cbsa: Content-based soft annotation for multimodal image re-trieval using bayes point machines. IEEE Transactions on Circuits and Systems for Video Technology13:26–38. Frome, A.; Singer, Y.; Sha, F.; and Jitendra, M. 2007. Learning globally-consistent local distance functions for shape-based image retrieval and classification. In Proceed-ings of IEEE 11th International Conference on Computer Vision, 1–8.

Golder, S. A., and Huberman, B. A. 2006. Usage patterns of collaborative tagging systems. J. Inf. Sci. 32(2):198– 208.

Guan, Z.; Bu, J.; Mei, Q.; Chen, C.; and Wang, C. 2009. Personalized tag recommendation using graph-based rank-ing on multi-type interrelated objects. In SIGIR ’09: Pro-ceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 540–547. New York, NY, USA: ACM.

Har-Peled, S.; Roth, D.; and Zimak, D. 2002. Constraint classification: A new approach to multiclass classification. In ALT, 365–379.

Hironobu, Y. M.; Takahashi, H.; and Oka, R. 1999. Image-to-word transformation based on dividing and vector quan-tizing images with words. In in Boltzmann machines, Neu-ral Networks, 405409.

Joachims, T. 2002. Optimizing search engines using click-through data. In KDD.

Li, J., and Wang, J. Z. 2008. Real-time computerized anno-tation of pictures. IEEE Transactions on Pattern Analysis and Machine Intelligence30(6):985–1002.

Liu, D.; Hua, X.-S.; Yang, L.; Wang, M.; and Zhang, H.-J. 2009. Tag ranking. In WWW, 351–360.

Maes, F.; Peters, S.; Denoyer, L.; and Gallinari, P. 2009. Simulated iterative classification a new learning procedure for graph labeling. In ECML/PKDD (2), 47–62.

Matusiak, K. K. 2006. Towards user-centered indexing in digital image collections. OCLC Systems & Services 22(4):283–298.

Stone, Z.; Zickler, T.; and Darrell, T. 2008. Autotagging facebook: Social network context improves photo annota-tion. In InterNet08, 1–8.

Tong, H.; He, J.; Li, M.; Ma, W.-Y.; Zhang, H.-J.; and Zhang, C. 2006. Manifold-ranking-based keyword propa-gation for image retrieval. EURASIP J. Appl. Signal Pro-cess.

Wu, L.; Yang, L.; Yu, N.; and Hua, X.-S. 2009. Learning to tag. In WWW, 361–370.

Zhou, D.; Bousquet, O.; Lal, T. N.; Weston, J.; and Sch¨olkopf, B. 2003. Learning with local and global con-sistency. In NIPS.