Multiclass-Multilabel Classification with More Classes than Examples


Ofer Dekel Ohad Shamir Microsoft Research

One Microsoft Way Redmond, WA, 98052, USA

School of Computer Science and Engineering The Hebrew University

Jerusalem, 91904, Israel

Abstract

We discuss multiclass-multilabel classification problems in which the set of classes is extremely large. Most existing multiclass-multilabel learning algorithms expect to observe a reasonably large sample from each class, and fail if they receive only a handful of examples per class. We propose and analyze the following two-stage approach: first use an arbitrary (perhaps heuristic) classification algorithm to construct an initial classifier, then apply a simple but principled method to augment this classifier by removing harmful labels from its output. A careful theoretical analysis allows us to justify our approach under some reasonable conditions (such as label sparsity and power-law distribution of class frequencies), even when the training set does not provide a statistically accurate representation of most classes. Surprisingly, our theoretical analysis continues to hold even when the number of classes exceeds the sample size. We demonstrate the merits of our approach on the ambitious task of categorizing the entire web using the 1.5 million categories defined on Wikipedia.

1 Introduction

In multiclass-multilabel classification, the goal is to assign one or more labels to each instance in an instance space. Each label associates an instance with one of k possible classes. An example of a multiclass-multilabel problem is document categorization, which is the problem of associating each document in a corpus with one or more topics (e.g. McCallum, 1999). Multiclass-multilabel problems are also common in other fields, such as computer vision (Boutell

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

et al., 2004) and computational biology (Barutcuoglu et al., 2006). If a training set of m labeled examples is available, a multiclass-multilabel classifier can be learned using a supervised machine learning algorithm. Typically, learning algorithms for multiclass-multilabel problems are developed and analyzed under the assumption that k is held constant as m grows. In this paper, we consider a different version of the multiclass-multilabel problem, where the set of classes grows with the number of examples (i.e. k ≥ Ω(m)). For example, this situation occurs when the set of classes is a folksonomy, a set of classes that emerges from a collaborative tagging or a social tagging scheme. The concrete problem that motivates this work is the problem of categorizing the entire web using the categories defined by the Wikipedia website. At the bottom of every Wikipedia article there is a short list of categories, and we define our set of classes to be the union of these lists. The Wikipedia articles themselves can be used as training examples, since they are labeled web pages. When new articles are added to Wikipedia, they often introduce new categories. At the time of writing this paper, Wikipedia contains 2.9 million articles and almost 1.5 million categories. The Internet provides many other examples where k grows with m. For instance, photo sharing websites allow users to annotate their photos with keywords. The implied classification task is to recommend keywords whenever new photos are uploaded. Assuming that the site does not impose any restrictions on the keywords that may be used, the set of distinct keywords is likely to grow as more and more photos are uploaded to the site.

For such datasets, applying standard techniques is problematic for two reasons. First, since k scales with m, many classes occur only a handful of times in the training set, so we do not have a statistically reasonable sample from each class. Second, the concrete values of m and k we deal with are very large, to the point that most existing multiclass-multilabel learning algorithms become computationally intractable. For example, the most common approach to multiclass-multilabel learning is to train a separate binary classifier for each class. For the datasets we have in mind, such an approach is both statistically


untenable (due to the small number of examples per class) and computationally impractical (as it requires maintaining millions of hypotheses for all the classes). Similar problems are encountered for other standard approaches, such as those based on ranking (e.g. Amit et al., 2007; Crammer and Singer, 2003; Lebanon and Lafferty, 2002).

In practice, the only realistic solution is to turn to much simpler classification algorithms, such as nearest neighbor methods. For example, consider once again the problem of categorizing the entire web using Wikipedia categories, and assume that we have access to the log of an Internet search engine. We can use the log to construct a click graph, a bipartite graph whose vertices include all web pages and all queries ever issued to the search engine. An edge is drawn between a query Q and a web page W if enough users issued the query Q and then clicked on W. We can use the graph distance induced by the click graph to define a metric over web pages. Given a set of labeled Wikipedia pages, we can label the entire web using a nearest neighbor type algorithm over this metric. In other words, labels are propagated from the labeled Wikipedia pages along the edges of the click graph to the rest of the web. Such simple algorithms can be implemented in a way that is almost entirely insensitive to the number of classes, but their simplicity often comes at the cost of lower classification accuracy.

One way to improve the accuracy of a simple classification algorithm is to refine its output with a separate post-learning step. Taking such a two-stage approach is common in other areas of machine learning. For example, in ranking problems it is common to run a simple and fast algorithm to obtain an initial ranking, and then to run a more accurate ranking algorithm only on the top few results. In this paper, we focus on a post-learning step that modifies a classifier by pruning certain labels from the output. In other words, the original classifier associates an instance with a set of classes and the post-learning step deletes certain classes from this set. The intuition behind this approach is that in such massive multiclass-multilabel datasets many classes are inherently hard to learn, and attempting to predict them decreases the overall accuracy of the classifier.

We propose and analyze, both theoretically and experimentally, a simple label-pruning method. The method is based on comparing the number of true-positives and false-positives in each class, and discarding classes where the ratio of false positives to true positives exceeds a certain threshold. Returning to the example of categorizing the web, the initial nearest neighbor algorithm is likely to find that many web pages on classical composers turn out to be close neighbors of the Wikipedia article on Mozart. The nearest neighbor classifier indiscriminately assigns all of the Wikipedia categories that are associated with the article on Mozart to all of these pages. One of these categories is People Born in 1756, which is likely to have many false positives across

the validation set. Intuitively, the category People Born in 1756 is incompatible with the click-graph based metric we have chosen, namely, our metric is unlikely to put different web pages from this class in close proximity to each other. In this situation, our label-pruning method removes this class from the set of classes the classifier is allowed to output.

While our method is simple and straightforward to implement, its analysis is quite tricky, since it is based on premises that might initially appear to be statistically unacceptable. After all, our basic assumption is that most classes are very rare, so the decision to drop a label may be based on statistically insufficient evidence. Say that we see two false-positives and one true-positive of a given class in our validation set: can we confidently decide to remove labels based on this empirical evidence? The key to the formal analysis of our technique is to think of its overall effect rather than considering its effect on each individual class. Indeed, we cannot conclusively evaluate each class and our technique will most certainly mistake some good classes for bad ones. Nevertheless, we can show that our pruning criterion removes more bad labels than good ones, and overall improves the accuracy of our classifier, under mild conditions that often hold in practice. Concretely, we assume that the class frequencies follow a power-law distribution and that every example only has a rather small number of labels.

To our knowledge, our theoretical approach is unique and quite distinct from previous analyses of multiclass-multilabel learning algorithms. Most previous such analyses build on techniques originally developed for the analysis of binary classification algorithms, and therefore require at least some degree of class-wise convergence.

We conclude our paper with a set of experiments, in which we validate our approach on the task of categorizing web pages using the set of 1.5 million Wikipedia categories. The simplicity of our approach enables us to perform these experiments on a single server, without requiring a large cluster computer.

Related Work

The work in (Zhang, 2004) deals with multiclass classification and bears similarities to our work, in that the space of possible classes can be very large compared to the size of the dataset. However, the analysis there is specific to multiclass rather than multiclass-multilabel learning (i.e. each instance is assigned only a single label), and focuses on large margin classifiers with a particular rule for choosing the label of each instance.

A more closely related paper is (Hsu et al., 2009), which also deals with massive multiclass-multilabel classification. It proposes a clever method, where a predictor is


trained on a compressed representation of the original label vectors. The original labels are reconstructed using techniques from compressed sensing. The problem setting and some of the assumptions made (such as label sparsity) are similar to our work. However, the approach of (Hsu et al., 2009) applies only to learning algorithms that regress on a real-valued compressed label vector. This is often not the case with algorithms designed for massive datasets, such as the click-graph based approach described earlier. In contrast, our approach makes no assumptions about the learning algorithm.

2 Setting and Notation

We assume that the learning task at hand is a supervised multiclass-multilabel problem. Formally, let $\mathcal{X}$ be an arbitrary measurable space, let $\mathcal{Y} = \{0,1\}^k$, and let $\mathcal{D}$ be an unknown distribution on the product space $\mathcal{X} \times \mathcal{Y}$. Each element $(x, y)$ in this space is composed of an instance $x$ and a label vector $y$, which is a vector of indicators $y = (y_1, \ldots, y_k)$ that specifies which classes are associated with $x$. We assume that label vectors sampled from $\mathcal{D}$ are sparse, namely, that $\Pr(\sum_j y_j \le s) = 1$ for some constant $s$. A classifier is a function $h : \mathcal{X} \to \mathcal{Y}$ that maps an instance $x$ to a label vector $\hat{y} = h(x)$. We restrict our discussion to classifiers that output sparse label vectors, namely $\sum_j \hat{y}_j \le s$. We evaluate the accuracy of a classifier using a loss function $\ell(h(x), y)$, which measures the disparity between the predicted label set and the actual label set. In this paper, we focus on a simple weighted loss function that is parameterized by $\gamma \in (0,1)$ and defined as

$$\frac{1}{s} \sum_{j=1}^{k} \Big[ (1-\gamma)\, \mathbb{1}(\hat{y}_j = 0,\ y_j = 1) + \gamma\, \mathbb{1}(\hat{y}_j = 1,\ y_j = 0) \Big], \qquad (1)$$

where $\mathbb{1}(\cdot)$ is the indicator function. The parameter $\gamma$ controls the importance of false negatives vs. false positives, and the normalization by $s$ ensures that the loss is always bounded in $[0,1]$. Our ultimate goal is to obtain a classifier $h$ with a small risk, which is defined as $E_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]$.
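The loss of Eq. (1) is simple enough to state in a few lines of code. The following is a minimal sketch, assuming label vectors are plain Python lists of 0/1 values; the function name `weighted_loss` and the toy vectors are illustrative and do not come from the paper.

```python
def weighted_loss(y_pred, y_true, gamma, s):
    """Loss of Eq. (1) for two 0/1 label vectors of equal length k."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie in (0, 1)")
    # False negative: label is present in the truth but not predicted.
    false_neg = sum(1 for p, t in zip(y_pred, y_true) if p == 0 and t == 1)
    # False positive: label is predicted but absent from the truth.
    false_pos = sum(1 for p, t in zip(y_pred, y_true) if p == 1 and t == 0)
    return ((1.0 - gamma) * false_neg + gamma * false_pos) / s


# Toy example with k = 6 classes and sparsity bound s = 2 (made-up values).
y_true = [1, 0, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0]
loss = weighted_loss(y_pred, y_true, gamma=0.1, s=2)

# Worst case under the sparsity assumption: s misses and s spurious labels.
worst = weighted_loss([0, 0, 1, 1], [1, 1, 0, 0], gamma=0.1, s=2)
```

In the toy example there is one false negative and one false positive, so the loss is ((1 − γ) + γ)/s = 1/2. The worst case has s of each, which gives exactly 1 and illustrates why normalizing by s keeps the loss in [0,1].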

We distinguish between two phases of the learning process. In the learning phase, we use some learning algorithm to obtain an initial classifier $h$. We then perform a post-learning phase, where we find a label transformation function $\phi : \mathcal{Y} \to \mathcal{Y}$, such that the final classifier is the composition $\phi \circ h$. In this paper, we focus on the post-learning phase, and make no assumptions on the learning phase or on the quality of the initial classifier. From the perspective of the post-learning phase, the initial classifier $h$ is simply a predefined function. For simplicity, we assume that the data used to train $h$ is independent from the data used in the post-learning phase to train $\phi$.

In principle, the label transformation function $\phi$ can be arbitrarily complex. In this paper, we focus on the simple set of label pruning rules. A label pruning rule $\phi_{\tilde{y}}$ corresponds to an element $\tilde{y} \in \{0,1\}^k$, and $\phi_{\tilde{y}}(y)$ is the vector resulting from removing the labels represented by $\tilde{y}$ from the set of labels represented by $y$. Such rules are simple to implement and are particularly useful in massive multilabel problems, where many labels are both inherently noisy and very rare. In such cases, refraining from predicting these labels can actually improve the final classifier's performance.

The four basic quantities we work with are the risk and the empirical risk of $h$ and of $\phi \circ h$. Letting $S$ denote an i.i.d. sample of size $m$ from $\mathcal{D}$, we define the initial empirical risk $\hat{R}_0 = \frac{1}{m} \sum_{(x,y)\in S} \ell(h(x), y)$, the initial risk $R_0 = E_{(x,y)}[\ell(h(x), y)]$, the final empirical risk $\hat{R}_\phi = \frac{1}{m} \sum_{(x,y)\in S} \ell(\phi \circ h(x), y)$, and the final risk $R_\phi = E_{(x,y)}[\ell(\phi \circ h(x), y)]$. Our goal is to find a pruning rule $\phi$ such that $R_\phi$ is as small as possible.

For the analysis, we need to describe these quantities in an alternative form, as specified in the following easy-to-prove lemma:

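Lemma 1. For a given classifier $h(\cdot)$, define

$$\hat{p}_{j,11} = \frac{1-\gamma}{m} \sum_{(x,y)\in S} \mathbb{1}\big(h(x)_j = y_j = 1\big),$$

$$\hat{p}_{j,10} = \frac{\gamma}{m} \sum_{(x,y)\in S} \mathbb{1}\big(h(x)_j = 1,\ y_j = 0\big),$$

$$\hat{p}_{j,\neq} = \frac{1}{m} \sum_{(x,y)\in S} \Big[ (1-\gamma)\, \mathbb{1}\big(h(x)_j = 0,\ y_j = 1\big) + \gamma\, \mathbb{1}\big(h(x)_j = 1,\ y_j = 0\big) \Big].$$

Let $p_{j,\neq}$, $p_{j,11}$, and $p_{j,10}$ be the expected values (over the sample $S$) of $\hat{p}_{j,\neq}$, $\hat{p}_{j,11}$, and $\hat{p}_{j,10}$ respectively. Also, for a fixed pruning rule $\phi(\cdot)$, let $\mathbb{1}(\text{label } j \text{ pruned})$ be an indicator that equals 1 if and only if the pruning rule $\phi(\cdot)$ removes label $j$. Then it holds that

$$\hat{R}_0 = \frac{1}{s} \sum_{j=1}^{k} \hat{p}_{j,\neq}, \qquad R_0 = \frac{1}{s} \sum_{j=1}^{k} p_{j,\neq},$$

$$\hat{R}_\phi = \frac{1}{s} \sum_{j=1}^{k} \Big[ \hat{p}_{j,\neq} + \mathbb{1}(\text{label } j \text{ pruned})\, (\hat{p}_{j,11} - \hat{p}_{j,10}) \Big],$$

$$R_\phi = \frac{1}{s} \sum_{j=1}^{k} \Big[ p_{j,\neq} + \mathbb{1}(\text{label } j \text{ pruned})\, (p_{j,11} - p_{j,10}) \Big].$$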
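The empirical part of Lemma 1 can be checked numerically on a small sample. The sketch below is illustrative only: the sample, the pruned set, and the helper names `empirical_quantities` and `direct_risk` are assumptions made for this example. It computes the final empirical risk once directly from Eq. (1) and once through the per-label decomposition.

```python
def empirical_quantities(sample, k, gamma):
    """Per-label statistics of Lemma 1, returned as lists (p11, p10, pneq).

    `sample` is a list of (prediction, truth) pairs of 0/1 vectors of length k,
    where each prediction is the output h(x) of the initial classifier.
    """
    m = len(sample)
    p11, p10, pneq = [0.0] * k, [0.0] * k, [0.0] * k
    for pred, true in sample:
        for j in range(k):
            if pred[j] == 1 and true[j] == 1:      # true positive
                p11[j] += (1 - gamma) / m
            elif pred[j] == 1 and true[j] == 0:    # false positive
                p10[j] += gamma / m
                pneq[j] += gamma / m
            elif pred[j] == 0 and true[j] == 1:    # false negative
                pneq[j] += (1 - gamma) / m
    return p11, p10, pneq


def direct_risk(sample, pruned, gamma, s):
    """Average loss of Eq. (1) after removing the labels in `pruned`."""
    total = 0.0
    for pred, true in sample:
        for j, (p, t) in enumerate(zip(pred, true)):
            p = 0 if j in pruned else p
            total += (1 - gamma) * (p == 0 and t == 1) + gamma * (p == 1 and t == 0)
    return total / (s * len(sample))


# Made-up sample with k = 3 labels, sparsity s = 2, and label 1 pruned.
sample = [([1, 1, 0], [1, 0, 0]), ([1, 1, 0], [0, 0, 1]), ([0, 1, 1], [0, 1, 1])]
gamma, s, k, pruned = 0.3, 2, 3, {1}

p11, p10, pneq = empirical_quantities(sample, k, gamma)
via_lemma = sum(pneq[j] + (p11[j] - p10[j]) * (j in pruned) for j in range(k)) / s
direct = direct_risk(sample, pruned, gamma, s)
```

Both computations agree: pruning label j turns its true positives into false negatives (adding the empirical value of p_{j,11}) and removes its false positives (subtracting the empirical value of p_{j,10}).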

3 The Pruning Method

Recall that our goal is to reduce the final risk $R_\phi$. The expression for $R_\phi$ given in Lemma 1 suggests that $R_\phi$ can be reduced by removing those labels for which $p_{j,10} > p_{j,11}$. Unfortunately, $p_{j,11}$ and $p_{j,10}$ are unknown quantities that depend on $\mathcal{D}$, and we must resort to using their empirical counterparts $\hat{p}_{j,11}$ and $\hat{p}_{j,10}$. Specifically, our simple label pruning procedure proceeds as follows: given a sample $S$, calculate $\hat{p}_{j,11}$ and $\hat{p}_{j,10}$, and choose the label pruning rule $\phi$ that removes all labels for which $\hat{p}_{j,11} < \hat{p}_{j,10}$. In other words, this procedure prunes any label for which the ratio of false positives to true positives exceeds $(1-\gamma)/\gamma$. Notice that this makes $\phi$ a random function that depends on the randomness of the sample $S$. For the theoretical analysis, it will be convenient to view $R_\phi$ and $\hat{R}_\phi$ as random variables, which depend on the random draw of $S$.

This algorithm essentially attempts to decrease the final empirical risk $\hat{R}_\phi$ in lieu of $R_\phi$. However, notice that in our setting (where $k$ scales with $m$), we cannot assume that each and every $\hat{p}_{j,11}, \hat{p}_{j,10}$ is an accurate estimate of $p_{j,11}, p_{j,10}$. In fact, our analysis shows that $\hat{R}_\phi$ is generally not a good estimator of $R_\phi$. Nevertheless, we can prove that our method reduces the final risk $R_\phi$ compared to the initial risk $R_0$ with high probability, under mild conditions.
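The pruning procedure can be sketched compactly when predictions and true labels are stored as sets, so that the work is proportional to the number of predicted labels rather than to k. This is a minimal illustration under that assumption; the function names and the toy validation data are hypothetical.

```python
from collections import Counter


def choose_pruned_labels(validation, gamma):
    """Return the labels whose empirical p_{j,11} is below the empirical p_{j,10}.

    `validation` is a list of (predicted_labels, true_labels) pairs of sets.
    """
    true_pos, false_pos = Counter(), Counter()
    for predicted, true in validation:
        for label in predicted:
            if label in true:
                true_pos[label] += 1
            else:
                false_pos[label] += 1
    # The common factor 1/m cancels, so raw counts suffice. A label with no
    # false positives can never satisfy the strict inequality.
    return {label for label in false_pos
            if (1 - gamma) * true_pos[label] < gamma * false_pos[label]}


def prune(predicted, pruned):
    """Apply the pruning rule to one predicted label set."""
    return predicted - pruned


# Made-up validation data echoing the Mozart example from the introduction.
validation = [
    ({"Composers", "People born in 1756"}, {"Composers"}),
    ({"Composers", "People born in 1756"}, {"Composers"}),
    ({"Composers", "People born in 1756"}, {"Composers", "People born in 1756"}),
]
pruned_half = choose_pruned_labels(validation, gamma=0.5)
pruned_small = choose_pruned_labels(validation, gamma=0.1)
```

With γ = 1/2 the threshold (1 − γ)/γ equals 1, and the label with two false positives and one true positive is pruned. With γ = 0.1 the threshold is 9, so the same label is kept.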

4 Theoretical Analysis

Our pruning procedure works by making the empirical risk $\hat{R}_\phi$ as small as possible. In this section, we show that this is also likely to make $R_\phi$ smaller than $R_0$. The straightforward theoretical approach would be to show that for reasonably large samples, $\hat{R}_0$ is close to $R_0$ and $\hat{R}_\phi$ is close to $R_\phi$. While the first premise is easy to show via a large deviation inequality, it turns out that $\hat{R}_\phi$ does not necessarily converge to $R_\phi$ when the number of classes grows with the number of examples. This is implied by the following theorem and the discussion which follows. Its proof is a simple consequence of the definitions, and is omitted due to lack of space.

Theorem 1. $E[R_\phi - \hat{R}_\phi]$ is lower bounded by

$$\frac{1}{s} \sum_{j=1}^{k} \Pr(\text{label } j \text{ pruned})\, (p_{j,11} - p_{j,10}).$$

If we were to assume that $k$ is fixed, we might expect $\hat{p}_{j,11}, \hat{p}_{j,10}$ to converge to $p_{j,11}, p_{j,10}$ uniformly for all $j = 1, \ldots, k$. Since our method prunes labels for which $\hat{p}_{j,11} < \hat{p}_{j,10}$, we would have that $\Pr(\text{label } j \text{ pruned})\,(p_{j,11} - p_{j,10})$ converges to a non-positive quantity uniformly for any $j$, and thus our lower bound would converge to a non-positive number. However, when we assume that $k$ grows with $m$, $\hat{p}_{j,11}, \hat{p}_{j,10}$ need not converge uniformly to $p_{j,11}, p_{j,10}$, and the correlation between $\Pr(\text{label } j \text{ pruned})$ and the sign of $(p_{j,11} - p_{j,10})$ can remain weak regardless of the sample size. To give a concrete example, if we take $\gamma = 1/2$, $s = 10$ and assume that $p_{j,11} = s/3k$, $p_{j,10} = s/6k$ for all $j$, then we have by the theorem above that

$$E[R_\phi - \hat{R}_\phi] \ge \frac{1}{6k} \sum_{j=1}^{k} \Pr(\text{label } j \text{ pruned}).$$

It can be shown that when $m, k \to \infty$ but (say) $m/k = 3$, the right hand side above converges to a strictly positive constant. Therefore, it is possible that our lower bound will remain larger than some positive constant regardless of sample size, which implies that $\hat{R}_\phi$ does not converge to $R_\phi$ in such cases.
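The claim that the pruning probability does not vanish can be checked numerically for the concrete example. With γ = 1/2, the definitions in Lemma 1 give a true-positive probability of 2s/3k and a false-positive probability of s/3k per example, and a label is pruned exactly when its false-positive count exceeds its true-positive count. The sketch below computes this probability exactly from the binomial distribution; the helper names and the truncation parameter `n_max` are choices made for this illustration and are not part of the paper.

```python
import math


def binom_pmf(n, i, p):
    """Binomial(n, p) probability of exactly i successes, via log-gamma."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)


def prune_probability(m, k, s=10, n_max=200):
    """Pr(false positives > true positives) for one label when gamma = 1/2."""
    p_tp = 2 * s / (3 * k)      # Pr(h(x)_j = y_j = 1) = p_{j,11} / (1 - gamma)
    p_fp = s / (3 * k)          # Pr(h(x)_j = 1, y_j = 0) = p_{j,10} / gamma
    p_pos = p_tp + p_fp         # probability that label j is predicted at all
    q = p_fp / p_pos            # chance that a positive prediction is wrong
    total = 0.0
    # n = number of examples on which label j is predicted; terms beyond
    # n_max are negligible here because the mean of n is m * s / k = 30.
    for n in range(1, min(m, n_max) + 1):
        tail = sum(binom_pmf(n, f, q) for f in range(n // 2 + 1, n + 1))
        total += binom_pmf(m, n, p_pos) * tail
    return total


small = prune_probability(m=300, k=100)
large = prune_probability(m=30000, k=10000)
lower_bound_small, lower_bound_large = small / 6, large / 6
```

Since all labels are identical in this example, the lower bound of Theorem 1 equals one sixth of the per-label pruning probability. Increasing m and k a hundredfold at a fixed ratio m/k = 3 leaves that probability essentially unchanged, which is the behavior described in the text.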

This observation precisely captures the difficulty of working with a sample that does not sufficiently represent many of the individual classes in the problem, and is the reason why most existing algorithms are inadequate when the number of classes is not fixed. Nevertheless, we can show that it is possible to analyze the behavior of $R_\phi$ directly. Specifically, we prove that $R_\phi$ is well behaved when the training set is large enough, even when $k$ is very large and grows with $m$. Namely, although the empirical quantities do not necessarily correspond to their expected values, we can still provide high probability guarantees that our pruning method reduces the overall risk of the classifier. In a nutshell, the analysis consists of proving that $|R_\phi - E[R_\phi]|$ is small with high probability (where the expectations are taken over the random draw of the sample $S$), and then directly proving that $E[R_\phi]$ is strictly smaller than $R_0$, under mild conditions.

The first part of the proposed approach is formalized in the following theorem. Informally, it states that when $m$ is large enough, $R_\phi$ is arbitrarily close to its expectation with arbitrarily high probability. Note that this bound does not depend at all on $k$, the number of classes.

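Theorem 2. For any fixed $\epsilon > 0$, it holds that

$$\Pr\left( |R_\phi - E[R_\phi]| > \frac{2 m^{-1/6+\epsilon}}{\gamma(1-\gamma)} + m^{2/3} \exp\left(-m^{2\epsilon}\right) \right) \le 2 s\, m^{2/3} \exp\left(-m^{2\epsilon}\right).$$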

The proof is presented in the appendix. Intuitively, the idea is to distinguish between labels for which $|p_{j,11} - p_{j,10}|$ is large, and labels for which this difference is small. The first type of labels are more common in the data, and thus we can reliably estimate $p_{j,11} - p_{j,10}$ and decide whether to prune them or not. On the other hand, there cannot be too many such labels, because $\sum_j (p_{j,11} + p_{j,10})$ is a bounded quantity. This effectively limits the dimensionality of the problem regardless of the parameter $k$. Whenever $|p_{j,11} - p_{j,10}|$ is small, the pruning process is noisy and prone to errors, but it can be shown that these cases do not influence $R_\phi$ too much. A careful formalization of these ideas, using Bernstein and McDiarmid's large deviation bounds, allows us to show that $R_\phi$ concentrates around its expectation with high probability, regardless of $k$.

Next, we need to show that $R_0 - E[R_\phi]$ is strictly positive, to prove that our method indeed reduces the final risk. It turns out that the exact value of $R_0 - E[R_\phi]$ is highly dependent on the specific values of $p_{j,11}$ and $p_{j,10}$ for each label $j$. If labels for which $p_{j,10} > p_{j,11}$ are pruned with high probability and labels for which $p_{j,10} \le p_{j,11}$ are pruned with low probability, we expect $R_0 - E[R_\phi]$ to be large. Although it is possible to provide positive lower bounds on $R_0 - E[R_\phi]$ in terms of these quantities, they are not particularly enlightening. Instead, the theorem below will allow us to characterize a mild condition, under which we can expect $R_0 - E[R_\phi]$ to be strictly positive. A proof appears in the appendix.

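Theorem 3. The difference $R_0 - E[R_\phi]$ is at least

$$\frac{1}{s} \sum_{j \,:\, p_{j,10} \ge p_{j,11}} (p_{j,10} - p_{j,11}) \;-\; \frac{1}{s} \sum_{j=1}^{k} \sqrt{\frac{p_{j,11} + p_{j,10}}{m}}.$$

Moreover, if we assume that $p_{j,10} + p_{j,11}$ are sorted in descending order, and there exists some $r \ne 2$ such that $\Pr(h(x)_j = 1) \le O(j^{-r})$ for all $j$, then $R_0 - E[R_\phi]$ is at least

$$\frac{1}{s} \sum_{j \,:\, p_{j,10} \ge p_{j,11}} (p_{j,10} - p_{j,11}) \;-\; O\left( \sqrt{\frac{k^{\max\{2-r,\,0\}}}{m}} \right).$$

The requirement that $r \ne 2$ is for technical reasons and can easily be treated separately.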

What does this theorem tell us? The non-negative term $\sum_{j : p_{j,10} \ge p_{j,11}} (p_{j,10} - p_{j,11})$ can be arbitrarily small, but we can expect it to be lower bounded by a positive constant (independent of $m, k$) if a fixed fraction of the labels are such that $p_{j,10} \ge p_{j,11}$, and if $p_{j,10} - p_{j,11}$ is proportional to $p_{j,10} + p_{j,11}$. So we turn our attention to the term

$$\frac{1}{s} \sum_{j=1}^{k} \sqrt{\frac{p_{j,11} + p_{j,10}}{m}},$$

which can indeed be large in the regime where $k$ scales with $m$. For example, if $p_{j,11} + p_{j,10} = s/k$ for all $j$, the above equals $\sqrt{k/(sm)} \ge \Omega(1)$, and Thm. 3 may become vacuous. Luckily, assuming that $p_{j,11} + p_{j,10}$ is equal for all $j$ is unrealistic. By definition, $p_{j,11} + p_{j,10}$ is upper bounded by $\Pr(h(x)_j = 1)$, or the probability that our learned hypothesis labels a random instance with label $j$. If the marginal class distribution of the classifier is similar to the marginal class distribution of the data, then this distribution is often observed to follow a power law, which corresponds to the assumption that $\Pr(h(x)_j = 1) \le O(j^{-r})$ for all $j$. Under this assumption, we obtain the second statement in Thm. 3. This power-law behavior, sometimes known as Zipf's law, is a very well known and well documented phenomenon for many rank vs. frequency datasets (see examples in Manning and Schütze, 2002; Adamic and Huberman, 2002; Gabaix, 1999), and in particular for the applications we have in mind. We verify this property in our experiments, presented below.
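The contrast between equal and power-law class masses can be illustrated numerically. The sketch below evaluates the subtracted term of Thm. 3 for both cases when k = m. It is an illustration under assumed values (s = 2, r = 1.5, total mass equal to s), and the helper names are hypothetical.

```python
import math


def penalty(masses, m, s):
    """(1/s) * sum_j sqrt(mass_j / m), the subtracted term in Thm. 3."""
    return sum(math.sqrt(p / m) for p in masses) / s


def uniform_masses(k, total):
    """Every label carries the same mass total / k."""
    return [total / k] * k


def power_law_masses(k, r, total):
    """Masses proportional to j^(-r), rescaled to sum to `total`."""
    weights = [j ** -r for j in range(1, k + 1)]
    z = sum(weights)
    return [total * w / z for w in weights]


s, r = 2, 1.5  # assumed values for this illustration
results = {}
for size in (100, 10000):          # k = m = size
    results[size] = (
        penalty(uniform_masses(size, s), size, s),
        penalty(power_law_masses(size, r, s), size, s),
    )
```

With equal masses the term equals sqrt(k/(sm)), which stays constant as k and m grow together. With power-law masses and r = 1.5 the same term shrinks as k = m increases, in line with the rate sqrt(k^{2−r}/m) stated in the theorem.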

Overall, this lower bound implies that if we let $m, k \to \infty$, we can expect $R_0 - E[R_\phi]$ to be positive whenever $m$ grows faster than $k^{2-r}$. In particular, if $r > 1$ (which happens quite often in practice, including in our experiments), we obtain the interesting result that the lower bound remains meaningful, even when the number of classes $k$ grows faster than the number of examples $m$.

5 Experiments

We applied our technique to the task of categorizing web pages using the 1.5 million categories defined in Wikipedia. As mentioned in the introduction, we first used search engine logs to create a click graph, which is a bipartite graph between queries and web pages. A link between query Q and web page W indicates that a sufficiently large number of users issued the query Q and then clicked on the link to page W. Next, we randomly split the set of Wikipedia articles into three sets: 50% training, 30% validation, and 20% test. Each Wikipedia article is associated with a set of categories and also corresponds to a node in the click graph. Next, we propagated the categories from each Wikipedia training article along the edges of the click graph, to all of the web pages that have a query in common with that article (namely, to all web pages whose distance to the training article is 2). We call the resulting labeling of the web labeling A. The rationale behind this labeling procedure is the assumption that two web pages that were clicked on (by different people, at different times) after the same query are likely to share many topics. Next, we propagated the categories along the edges of the click graph a second time, extending the reach of each category to all pages with graph distance 4 from the original article. We call this labeling B.
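One propagation step over the click graph can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the dictionary-based graph representation, the function name `propagate`, and the choice to let labeled pages keep their own labels are all assumptions.

```python
from collections import defaultdict


def propagate(page_labels, page_queries):
    """Copy labels from each labeled page to every page sharing a query with it.

    page_labels:  dict page -> set of categories (labeled pages only)
    page_queries: dict page -> set of queries adjacent to the page in the
                  bipartite click graph
    Returns a dict page -> set of categories covering all pages at graph
    distance 2 from a labeled page, plus the labeled pages themselves.
    """
    query_pages = defaultdict(set)
    for page, queries in page_queries.items():
        for query in queries:
            query_pages[query].add(page)

    result = defaultdict(set)
    for page, labels in page_labels.items():
        result[page] |= labels
        for query in page_queries.get(page, ()):
            for neighbor in query_pages[query]:
                result[neighbor] |= labels
    return dict(result)


# Toy click graph with made-up pages and queries.
page_queries = {
    "wiki/Mozart": {"mozart"},
    "page1": {"mozart", "salzburg"},
    "page2": {"salzburg"},
    "page3": {"cats"},
}
train_labels = {"wiki/Mozart": {"Composers", "People born in 1756"}}

labeling_a = propagate(train_labels, page_queries)   # reach: distance 2
labeling_b = propagate(labeling_a, page_queries)     # reach: distance 4
```

Applying the step twice reaches pages at distance 4, which mirrors how labeling B is obtained from labeling A. In the toy graph, `page2` receives labels only in the second pass and `page3` never does.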

We repeated the process described above a second time, this time seeded with a larger set of labels per Wikipedia training article. We used the fact that Wikipedia categories are themselves categorized by higher-level categories. For example, the Wikipedia article on Dogs is associated with the category Domesticated Animals, and the latter is associated with the category Animals. We added all of these second-order categories to each Wikipedia article. We propagated the extended category sets along the edges of the click graph as before, to obtain labeling C. We then performed a second iteration of label-propagation to obtain labeling D.

We applied our label-pruning technique independently to each of the four initial labellings. Namely, we revealed the true categories of the Wikipedia validation articles and compared them to the propagated labels in the four versions of our experiment. For each class we counted true and false positives, and decided which labels to prune.

The set of Wikipedia categories is problematic in that it is over-complete. Many categories have duplicates or near-duplicates; some articles are labeled by one category while other articles are labeled by its near-duplicate category.


[Figure 1 plot: four panels, one per labeling (A, B, C, D). Horizontal axis: γ, from 0.02 to 0.1. Vertical axis: test loss ratio, from 0 to 1. Curves: our algorithm, original, random.]

Figure 1: Ratios between the best attainable test loss and the test loss attained by three different techniques, on four initial labellings.

Also, the false-positives in all four labellings significantly outnumber the true-positives. For these reasons, false-positives should be treated with great suspicion. When we see a false-positive, either our classification is wrong or the Wikipedia editor may have simply neglected to add this category. Spot-checking reveals that many false-positives are actually quite reasonable. On the other hand, false-negatives should always be taken seriously: a human editor explicitly added a category to the article and our algorithm concluded that it is not relevant. To correct this imbalance, we set γ in Eq. (1) to give more weight to false-negatives. Specifically, we set γ to values between 0.01 and 0.1.

After using the validation set to identify and remove harmful labels, we revealed the categories in the Wikipedia test set, and evaluated the performance of our algorithm. For each of the four labellings and for each value of γ, we also calculated an oracle pruning which provides a lower bound on the test loss of any possible pruning algorithm. This was done by cheating and finding the best pruning on the test set (in terms of each γ-weighted loss). The loss attained by the oracle varies greatly with γ, so it is meaningless to plot absolute loss values for different values of γ on the same figure. To get a coherent visualization of our results, we plotted the ratio between the oracle loss and the loss of our algorithm. The performance of our algorithm is shown in solid lines in Fig. 1. Values close to 1 indicate that our test loss is very close to the loss of the ideal pruning.

For comparison, the plots in Fig. 1 also show the performance of two other simple algorithms. The first is the algorithm that performs no pruning and just keeps the initial labeling. The second is an algorithm that uses our method to determine how many labels to remove, and then removes labels randomly. These experimental results clearly show the amount of improvement achieved by our algorithm. Despite the statistical challenge of generalizing with only a handful of examples per class, our algorithm performs very well across a wide range of γ.
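The ratio plotted in Fig. 1 can be computed along the following lines. Because the loss decomposes per label (Lemma 1), the best pruning on the test set is obtained by applying the same per-label comparison to the test counts. The sketch assumes set-valued predictions and labels; the function names and toy data are hypothetical, and the random baseline is omitted for brevity.

```python
from collections import Counter


def label_counts(data):
    """Per-label true-positive and false-positive counts."""
    tp, fp = Counter(), Counter()
    for predicted, true in data:
        for label in predicted:
            (tp if label in true else fp)[label] += 1
    return tp, fp


def pruning_for(data, gamma):
    """Labels whose weighted false positives outweigh weighted true positives."""
    tp, fp = label_counts(data)
    return {label for label in fp if (1 - gamma) * tp[label] < gamma * fp[label]}


def total_loss(data, pruned, gamma):
    """Unnormalized gamma-weighted loss; the factor 1/(s*m) cancels in ratios."""
    loss = 0.0
    for predicted, true in data:
        kept = predicted - pruned
        loss += (1 - gamma) * len(true - kept) + gamma * len(kept - true)
    return loss


def loss_ratio(validation, test, gamma):
    """Oracle test loss divided by the test loss of validation-based pruning."""
    ours = total_loss(test, pruning_for(validation, gamma), gamma)
    oracle = total_loss(test, pruning_for(test, gamma), gamma)
    return oracle / ours if ours > 0 else 1.0  # zero loss treated as ratio 1


# Made-up validation and test sets over labels a, b, c.
validation = [({"a", "b"}, {"a"}), ({"a", "b"}, {"a"}), ({"b", "c"}, {"b", "c"})]
test = [({"a", "b"}, {"a", "b"}), ({"a", "b"}, {"a"}), ({"c"}, {"a"})]
ratio = loss_ratio(validation, test, gamma=0.5)
unpruned_ratio = total_loss(test, pruning_for(test, 0.5), 0.5) / total_loss(test, set(), 0.5)
```

The ratio lies in [0, 1] because no pruning can beat the oracle on the test set. The same function with an empty pruned set gives the curve for the original, unpruned labeling.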

Finally, using a simple least-squares fitting technique, we validated that all four datasets satisfy the power-law assumption used in our theoretical analysis (see Thm. 3 and the discussion which follows). Namely, when we sort the classes by frequency in the data, we see that the frequency of class $j$ is proportional to $j^{-r}$, with $r \approx 1.3$ for labeling A; $r \approx 1.6$ for labeling B; $r \approx 1.9$ for labeling C; and $r \approx 2.3$ for labeling D.
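The paper does not spell out the fitting procedure. One plausible reading is an ordinary least-squares line fitted to log-frequency against log-rank, whose negated slope estimates r. The sketch below follows that assumption, with a hypothetical function name and synthetic frequencies.

```python
import math


def fit_power_law_exponent(frequencies):
    """Estimate r in frequency ~ rank^(-r) by least squares on log-log axes."""
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive frequencies")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return -cov / var  # slope of the fitted line is -r


# Synthetic class frequencies that follow an exact power law with r = 1.6.
synthetic = [1000.0 * j ** -1.6 for j in range(1, 501)]
r_hat = fit_power_law_exponent(synthetic)
```

On exact power-law data the fit recovers the exponent up to rounding error. On real class counts the low-frequency tail is noisy, so the estimate depends on how that tail is handled.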

6 Conclusions

In this paper, we studied the problem of massive multiclass-multilabel learning, where the set of classes scales with the number of available training examples. This setting is very relevant when the set of classes results from a collaborative tagging scheme, such as Wikipedia categories or keywords in media hosting websites. In this regime, the standard assumption of a fixed set of classes is too simplistic, and straightforward generalizations of methods for binary classification (such as multiclass SVM) may be impractical. Motivated by the computational issues faced by practitioners in this area, we proposed and analyzed a post-learning method on top of any desired learning algorithm, which for our purposes can be treated as a black-box. Our experiments demonstrate that the method works quite well on real-world, large scale data.

Theoretically, this setting poses a challenge, since we cannot hope to get a lot of data on each and every class. As far as we know, this setting violates the assumptions underlying most previous theoretical work on multiclass-multilabel learning. Nevertheless, a careful analysis allows us to justify our approach, using some non-trivial but mild sufficient conditions, such as sparsity of labels per instance and a power-law behavior of the class frequencies.

While our approach seems to work in practice, and has some interesting theoretical properties, the algorithm we have focused on is obviously a very simple one, and several extensions immediately come to mind. One direction is to utilize additional knowledge about class dependencies, rather than treating each class separately. Also, we have dealt only with very simple label transformation rules, which prune a subset of labels (i.e. "if label A appears, remove it"). However, it is possible to envision more complex rules, such as "if labels A and B appear, but not label C, replace label D by label E". Understanding how to implement these extensions effectively and in a theoretically justified manner, even when there are as many classes as examples, remains a topic for future research.


References

L.A. Adamic and B.A. Huberman (2002). Zipf’s law and the internet. Glottometrics, 3:143–150.

Y. Amit, O. Dekel, and Y. Singer (2007). A boosting al-gorithm for label covering in multilabel problems. In Pro-ceedings of Artifical Intelligence and Statistics 2007. Z. Barutcuoglu, R.E. Schapire, and O.G. Troyanskaya (2006). Hierarchical multi-label prediction of gene func-tion. Bioinformatics, 22(7):830–836.

S. Boucheron, G. Lugosi, and O. Bousquet (2004). Con-centration inequalities. In O. Bousquet, U. von Luxburg, and G. R¨atsch, editors, Advanced Lectures on Machine Learning, Springer LNCS 3176, pages 208–240.

M.R. Boutell, J. Luo, X. Shen, and C.M. Brown (2004). Learning multi-label scene classification. Pattern Recogni-tion, 37(9):1757–1771.

K. Crammer and Y. Singer (2003). A family of additive online algorithms for category ranking. Journal of Machine Learning Research, 3:1025–1058.

X. Gabaix (1999). Zipf’s law for cities: An explanation. The Quarterly Journal of Economics, 114(3):739–767. D. Hsu, S. Kakade, J. Langford, and T. Zhang (2009). Multi-label prediction via compressed sensing. In Ad-vances in Neural Information Processing Systems 22. G. Lebanon and J.D. Lafferty (2002). Conditional models on the ranking poset. In Advances in Neural Information Processing Systems 15, pages 415–422.

C.D. Manning and H. Schütze (2002). Foundations of Statistical Natural Language Processing. MIT Press.

A. McCallum (1999). Multi-label text classification with a mixture model trained by EM. In AAAI99 Workshop on Text Learning.

T. Zhang (2004). Class-size independent generalization analysis of some discriminative multi-category classification. In Advances in Neural Information Processing Systems 17, pages 1625–1632.

A Technical Proofs

A.1 Proof of Thm. 2

We need the following two lemmas. The first lemma follows directly from Bernstein’s inequality (see for instance Boucheron et al., 2004). We note that using an inequality that relies on variance is crucial to obtain a non-trivial bound with our proof technique. The second lemma follows directly from the definitions. The proofs are omitted due to lack of space.

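Lemma 2. For any $j$, if $p_{j,10} \le p_{j,11}$, then $\Pr(\hat p_{j,10} > \hat p_{j,11})$ is at most

$$\exp\left(-\frac{m\,(p_{j,10}-p_{j,11})^2}{2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}+|p_{j,10}-p_{j,11}|/3\right)}\right).$$

A similar bound holds on $\Pr(\hat p_{j,10} \le \hat p_{j,11})$ if $p_{j,10} \ge p_{j,11}$.

Lemma 3. It holds that

$$\sum_{j=1}^{k} |p_{j,11}-p_{j,10}| \;\le\; \sum_{j=1}^{k} \left(p_{j,11}+p_{j,10}\right) \;\le\; s.$$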
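To get a feel for how the bound of Lemma 2 behaves, it can be evaluated numerically. The following is a minimal sketch only: the function name `lemma2_bound` and the sample values of $m$, $p_{j,10}$, $p_{j,11}$ and $\gamma$ are illustrative assumptions, not quantities taken from the paper's experiments.

```python
import math

def lemma2_bound(m, p10, p11, gamma):
    """Evaluate the exponential bound stated in Lemma 2.

    m     : sample size
    p10   : p_{j,10}, p11 : p_{j,11} (true probabilities for label j)
    gamma : the parameter gamma in (0, 1) appearing in the lemma
    """
    gap = abs(p10 - p11)
    denom = 2.0 * ((1.0 - gamma) * p11 + gamma * p10 + gap / 3.0)
    if denom == 0.0:
        # Both probabilities are zero, so the bound is vacuous.
        return 1.0
    return math.exp(-m * gap ** 2 / denom)

if __name__ == "__main__":
    # Hypothetical values: a rare label with a small gap between p10 and p11.
    for m in (100, 1000, 10000):
        print(m, lemma2_bound(m, p10=0.01, p11=0.03, gamma=0.5))
```

Because the denominator scales with the probabilities themselves rather than with a constant, the bound can be small even for rare labels, which is the reason a variance-sensitive inequality is used.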

Let $\alpha > 0$ be an arbitrary parameter to be specified later, and define the label subsets $J_1 = \{j : |p_{j,11}-p_{j,10}| \le \alpha\}$, $J_2 = \{1,\ldots,k\} \setminus J_1$. We have by definition of the pruning procedure and Lemma 1 that $|R_\phi - \mathbb{E}[R_\phi]|$ is at most

$$\begin{aligned}
&\frac{1}{s}\left|\sum_{j\in J_1}(p_{j,11}-p_{j,10})\left(\mathbf{1}_{\hat p_{j,10}>\hat p_{j,11}}-\Pr(\hat p_{j,10}>\hat p_{j,11})\right)\right|\\
&+\frac{1}{s}\sum_{j\in J_2}|p_{j,11}-p_{j,10}|\,\left|\mathbf{1}_{\hat p_{j,10}>\hat p_{j,11}}-\Pr(\hat p_{j,10}>\hat p_{j,11})\right|.
\end{aligned}\tag{2}$$

Focusing on the first line in the expression, note that if we change any single instance in our sample, at most $2s$ terms will change by at most $|p_{j,11}-p_{j,10}| \le \alpha$. Therefore, the expression in the first line will change by at most $2\alpha$. Applying McDiarmid’s inequality, and noting that the expectation of what’s inside the absolute value is zero, we get that with probability of at least $1-\delta$, it is upper bounded by

$$\sqrt{2m\alpha^2\log(1/\delta)}.\tag{3}$$

Turning to the second line in Eq. (2), and applying Lemma 2, we get that for any $j$, with probability of at least $1-g(m,p_{j,11},p_{j,10})$, it holds that

$$\left|\mathbf{1}_{\hat p_{j,10}>\hat p_{j,11}}-\Pr(\hat p_{j,10}>\hat p_{j,11})\right| \le g(m,p_{j,11},p_{j,10}),$$

where $g(m,p_{j,11},p_{j,10})$ equals

$$\exp\left(-\frac{m\,(p_{j,11}-p_{j,10})^2}{2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}+|p_{j,10}-p_{j,11}|/3\right)}\right).$$

Let $c > 0$ be another parameter to be determined later. If $c\,((1-\gamma)p_{j,11}+\gamma p_{j,10}) \le |p_{j,10}-p_{j,11}|$, we can upper bound $g(m,p_{j,11},p_{j,10})$ by

$$\exp\left(-\frac{m\,c^2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}\right)^2}{2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}+|p_{j,10}-p_{j,11}|/3\right)}\right).$$

Dividing the numerator and denominator of the fraction in the exponent by $(1-\gamma)p_{j,11}+\gamma p_{j,10}$, and using the easily verified fact that for any $a > 0$, $b > 0$, $\gamma \in (0,1)$ it holds


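that $|a-b|/((1-\gamma)a+\gamma b) \le 1/(\gamma(1-\gamma))$, we get the upper bound

$$\exp\left(-\frac{m\,c^2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}\right)}{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}\right).\tag{4}$$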
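The elementary fact used in this step can be spot-checked numerically. The sketch below evaluates the ratio $|a-b|/((1-\gamma)a+\gamma b)$ on a small grid and compares it with $1/(\gamma(1-\gamma))$; the helper names and the grid values are assumptions chosen for illustration, and a grid check is of course not a proof.

```python
import itertools

def ratio(a, b, gamma):
    """Left-hand side |a - b| / ((1 - gamma) * a + gamma * b) of the fact."""
    return abs(a - b) / ((1.0 - gamma) * a + gamma * b)

def fact_holds(a, b, gamma):
    """Check the inequality ratio <= 1 / (gamma * (1 - gamma)) for one triple."""
    return ratio(a, b, gamma) <= 1.0 / (gamma * (1.0 - gamma))

# Hypothetical grid: positive a, b spanning several orders of magnitude,
# and gamma values across the open interval (0, 1).
values = [1e-6, 1e-3, 0.1, 0.5, 1.0]
gammas = [0.01, 0.25, 0.5, 0.75, 0.99]

all_ok = all(
    fact_holds(a, b, g)
    for a, b, g in itertools.product(values, values, gammas)
)
print(all_ok)
```

The ratio is largest when one of $a$, $b$ is negligible relative to the other, where it approaches $1/\gamma$ or $1/(1-\gamma)$; both are at most $1/(\gamma(1-\gamma))$.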

On the other hand, we always have

$$\left|\mathbf{1}_{\hat p_{j,10}>\hat p_{j,11}}-\Pr(\hat p_{j,10}>\hat p_{j,11})\right| \le 1\tag{5}$$

with probability 1. Applying Eq. (4) and Eq. (5) on the second line of Eq. (2), we get a probabilistic upper bound for it, of the form

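$$\sum_{j\in J_{2,1}}\frac{|p_{j,11}-p_{j,10}|}{s}\exp\left(-\frac{m\,c^2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}\right)}{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}\right)+\frac{1}{s}\sum_{j\in J_{2,2}}|p_{j,11}-p_{j,10}|,\tag{6}$$

where $J_{2,1}=\left\{j\in J_2 : c \le \frac{|p_{j,10}-p_{j,11}|}{(1-\gamma)p_{j,11}+\gamma p_{j,10}}\right\}$, and $J_{2,2}=J_2\setminus J_{2,1}$. By a union bound, Eq. (6) holds with probability at least

$$1-\sum_{j\in J_{2,1}}\exp\left(-\frac{m\,c^2\left((1-\gamma)p_{j,11}+\gamma p_{j,10}\right)}{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}\right).\tag{7}$$

We now make four observations. First, by Lemma 3, $\sum_{j\in J_2}|p_{j,11}-p_{j,10}| \le s$, and since every term of this sum exceeds $\alpha$, we have $|J_2| \le s/\alpha$. Second, if $j \in J_2$, then $(1-\gamma)p_{j,11}+\gamma p_{j,10} > \alpha\gamma(1-\gamma)$. Third, for any $j \in J_{2,2}$, $|p_{j,11}-p_{j,10}| < c\,((1-\gamma)p_{j,11}+\gamma p_{j,10})$. Fourth, $\sum_{j\in J_{2,2}}\left((1-\gamma)p_{j,11}+\gamma p_{j,10}\right) \le s$ by Lemma 3 and the fact that $\gamma \in (0,1)$. Applying these four observations on Eq. (6) and Eq. (7), we can weaken this bound to the form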

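$$\frac{1}{\alpha}\exp\left(-\frac{m\,c^2\alpha\gamma(1-\gamma)}{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}\right)+c,$$

which holds with probability of at least

$$1-\frac{s}{\alpha}\exp\left(-\frac{m\,c^2\alpha\gamma(1-\gamma)}{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}\right).$$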
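The weakened bound and its probability can be evaluated for concrete parameters. The sketch below is illustrative only: the function names are hypothetical, and the numeric values of $m$, $s$, $\gamma$ and $\epsilon$ are assumptions picked to show the mechanics rather than realistic settings.

```python
import math

def tail_constant(gamma):
    """The factor 2 * (1 + 1 / (3 * gamma * (1 - gamma))) in the exponent."""
    return 2.0 * (1.0 + 1.0 / (3.0 * gamma * (1.0 - gamma)))

def exponent(m, alpha, c, gamma):
    """The quantity m c^2 alpha gamma (1 - gamma) / tail_constant(gamma)."""
    return m * c ** 2 * alpha * gamma * (1.0 - gamma) / tail_constant(gamma)

def second_line_bound(m, s, alpha, c, gamma):
    """Return (upper bound, probability lower bound) for the weakened form."""
    tail = math.exp(-exponent(m, alpha, c, gamma))
    return tail / alpha + c, 1.0 - (s / alpha) * tail

if __name__ == "__main__":
    # Hypothetical values; the bound is asymptotic and loose for small m.
    m, s, gamma, eps = 10 ** 12, 5, 0.5, 0.06
    alpha = m ** (-2.0 / 3.0)
    c = m ** (-1.0 / 6.0 + eps) * math.sqrt(
        tail_constant(gamma) / (gamma * (1.0 - gamma))
    )
    print(exponent(m, alpha, c, gamma), m ** (2 * eps))
    print(second_line_bound(m, s, alpha, c, gamma))
```

With $\alpha = m^{-2/3}$ and $c$ proportional to $m^{-1/6+\epsilon}$, as chosen in the remainder of the proof, the exponent collapses to $m^{2\epsilon}$, so the exponential term vanishes faster than any polynomial in $m$ while $c$ itself decays polynomially.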

To get the theorem statement, we combine this with the bound in Eq. (3), substitute into Eq. (2), choose $\alpha = m^{-2/3}$, $\delta = s\,m^{2/3}\exp(-m^{2\epsilon})$ (for some $\epsilon > 0$), let

$$c = m^{-1/6+\epsilon}\sqrt{\frac{2\left(1+\frac{1}{3\gamma(1-\gamma)}\right)}{\gamma(1-\gamma)}},$$

and perform some straightforward simplifications.

A.2 Proof of Thm. 3

We have that $R_0 - \mathbb{E}[R_\phi]$ equals

$$\frac{1}{s}\sum_{j=1}^{k}(p_{j,10}-p_{j,11})\Pr(\hat p_{j,10}>\hat p_{j,11}).\tag{8}$$

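For any $j$, if $p_{j,10}-p_{j,11} \ge 0$, we have by Lemma 2 that $\Pr(\hat p_{j,10} > \hat p_{j,11})$ is lower bounded by

$$\begin{aligned}
&1-\exp\left(-\frac{m\,(p_{j,10}-p_{j,11})^2}{2\left((1-\gamma)p_{j,10}+\gamma p_{j,11}+|p_{j,10}-p_{j,11}|/3\right)}\right)\\
&\ge 1-\exp\left(-\frac{m\,(p_{j,10}-p_{j,11})^2}{2\left(p_{j,10}+p_{j,11}+(p_{j,10}+p_{j,11})/3\right)}\right)\\
&= 1-\exp\left(-\frac{3m\,(p_{j,10}-p_{j,11})^2}{8\,(p_{j,10}+p_{j,11})}\right).
\end{aligned}$$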
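The middle inequality in this chain rests on two elementary bounds, spelled out here as a short worked step. For readability we write $p_{10}$ and $p_{11}$ for $p_{j,10}$ and $p_{j,11}$; this shorthand is ours and is not used elsewhere in the paper.

```latex
% Both probabilities are nonnegative and \gamma \in (0,1), hence
\begin{align*}
(1-\gamma)\,p_{10} + \gamma\,p_{11} &\le p_{10} + p_{11},
&
|p_{10} - p_{11}| &\le p_{10} + p_{11}.
\end{align*}
% Plugging both into the denominator of the exponent gives
\begin{align*}
2\Bigl((1-\gamma)\,p_{10} + \gamma\,p_{11} + \tfrac{1}{3}|p_{10}-p_{11}|\Bigr)
  \;\le\; 2\Bigl(1 + \tfrac{1}{3}\Bigr)(p_{10}+p_{11})
  \;=\; \tfrac{8}{3}\,(p_{10}+p_{11}),
\end{align*}
% so enlarging the denominator can only increase the exponential term:
\begin{align*}
\exp\!\left(-\frac{m\,(p_{10}-p_{11})^2}
  {2\bigl((1-\gamma)p_{10}+\gamma p_{11}+|p_{10}-p_{11}|/3\bigr)}\right)
  \;\le\;
\exp\!\left(-\frac{3m\,(p_{10}-p_{11})^2}{8\,(p_{10}+p_{11})}\right).
\end{align*}
```

The resulting bound no longer depends on $\gamma$, which is what allows the remainder of the argument to be stated purely in terms of $p_{j,10}+p_{j,11}$.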

If $p_{j,10}-p_{j,11} \le 0$, we have by Lemma 2 in a similar manner that

$$\Pr(\hat p_{j,10}>\hat p_{j,11}) \le \exp\left(-\frac{3m\,(p_{j,10}-p_{j,11})^2}{8\,(p_{j,10}+p_{j,11})}\right).$$

Substituting these results into Eq. (8), we get that $R_0 - \mathbb{E}[R_\phi]$ is lower bounded by

$$\begin{aligned}
&\frac{1}{s}\sum_{j\,:\,p_{j,10}\ge p_{j,11}}(p_{j,10}-p_{j,11})\\
&-\frac{1}{s}\sum_{j=1}^{k}|p_{j,10}-p_{j,11}|\exp\left(-\frac{3m\,(p_{j,10}-p_{j,11})^2}{8\,(p_{j,10}+p_{j,11})}\right).
\end{aligned}\tag{9}$$

In order to upper bound the second line in the expression (with something which does not depend on $p_{j,10}-p_{j,11}$), it is enough to upper bound for any $j$ the expression

$$\max_{|p_{j,10}-p_{j,11}|}\;|p_{j,10}-p_{j,11}|\exp\left(-\frac{3m\,(p_{j,10}-p_{j,11})^2}{8\,(p_{j,10}+p_{j,11})}\right).\tag{10}$$

For that, it is sufficient to find the maximal value of the function $f(x) = x\exp(-3mx^2/8p)$, where $p := p_{j,11}+p_{j,10}$, for any $x \in [0,p]$. It can be verified that this function is maximized at $x = \sqrt{4p/3m}$. Substituting this value for $|p_{j,10}-p_{j,11}|$ in Eq. (10), we get an upper bound of the form $\sqrt{4(p_{j,10}+p_{j,11})/3m}\,\exp(-1/2)$. Substituting this bound in Eq. (9), and simplifying by noting that $\sqrt{4/3}\,\exp(-1/2) \approx 0.7 < 1$, we get the required lower bound

$$\frac{1}{s}\sum_{j\,:\,p_{j,10}\ge p_{j,11}}(p_{j,10}-p_{j,11})-\frac{1}{s}\sum_{j=1}^{k}\sqrt{\frac{p_{j,11}+p_{j,10}}{m}}$$

on $R_0 - \mathbb{E}[R_\phi]$. To derive from it the second inequality in the theorem, notice that under the assumptions stated there, $\sum_{j=1}^{k}\sqrt{p_{j,11}+p_{j,10}} \le \sum_{j=1}^{k}\sqrt{\Pr(h(x)_j = 1)}$ is at most $C\sum_{j=1}^{k} j^{-r/2}$ for some constant $C$. This sum is $O(k^{1-r/2})$ if $r < 2$, $O(\log(k))$ if $r = 2$, and $O(1)$ if $r > 2$. Ignoring the case $r = 2$ for simplicity, we upper bound the different cases by $O\left(\sqrt{k}^{\,\max\{2-r,0\}}\right)$, and the inequality