
Supervised Document Indexing

(L1-regularized linear regression) suggests moderate improvements on both tasks. However, sLDA inherits many of the disadvantages of LDA, including its computational cost and the need to choose an optimal number of topics.

2.4 Supervised Document Indexing

Term weighting is a critical part of document indexing in the VSM. The goal of term weighting is to assign, for each term t_j in the indexing vocabulary and each document d_i, a weight w_{i,j} that represents how much t_j contributes to the discriminative semantics of d_i. Because the traditional tf-idf term weighting scheme is unsupervised, it is unlikely to be optimal for text classification. In particular, the suitability of idf for text classification has been challenged (Debole & Sebastiani 2003). The aim of idf is to assign higher weight to terms that better distinguish the small set of documents likely to be relevant to a given query from the much larger set of irrelevant documents in the collection. This assumption is more intuitive for information retrieval, where a large heterogeneous collection of documents is typically expected to cater for a diverse multitude of user information needs or topics. In text classification, however, the topics (i.e. classes) are far fewer (in many cases just two) and are explicitly labelled in the training collection. Thus, a number of supervised document indexing approaches have proposed replacing the idf component of the tf-idf weighting scheme with a supervised alternative that better captures the class distribution of terms, as presented in equation 2.22 (Debole & Sebastiani 2003, Deng, Tang, Yang, Li & Xie 2004, Lan, Tan & Low 2006).

$w_{i,j} = tf_{i,j} \times \delta(t_j)$ (2.22)

where w_{i,j} is the weight of term t_j in document d_i, tf_{i,j} is the term frequency of t_j in d_i, and δ(t_j) is a function that returns the supervised weight of t_j. In practice, δ(t_j) is typically obtained using a supervised feature selection metric, e.g. Chi-squared, Information Gain, Gain Ratio or Mutual Information. For example, supervised weighting with χ² using the approach presented in equation 2.22 is as shown in equation 2.23.
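As a minimal sketch of equation 2.22, the snippet below multiplies raw term frequencies by a precomputed supervised score. The function name `supervised_weights`, the toy scores in `delta` and the choice to give unscored terms a weight of zero are all assumptions made for illustration; they are not taken from the cited papers.

```python
from collections import Counter


def supervised_weights(doc_tokens, delta):
    """Return {term: tf * delta(term)} for one tokenised document."""
    tf = Counter(doc_tokens)
    # Assumption of this sketch: a term with no supervised score gets weight 0.
    return {term: count * delta.get(term, 0.0) for term, count in tf.items()}


# Hypothetical supervised scores delta(t_j), e.g. produced by a
# feature selection metric on a labelled training collection.
delta = {"goal": 2.5, "match": 1.0, "the": 0.01}
doc = ["the", "goal", "the", "match", "goal", "goal"]
weights = supervised_weights(doc, delta)
```

Here the frequent but uninformative term "the" ends up with a far smaller weight than "goal", even though it occurs twice in the document.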

Given the entire vocabulary V of a document collection, feature selection is a technique for selecting a subset U ⊂ V of the most important terms for use as an optimised indexing vocabulary. This involves computing, for each term t_j ∈ V, a statistical score of term importance which is used to rank all terms in V. Terms that rank below a certain threshold are subsequently excluded from the new indexing vocabulary U. Note that this score of term importance can also be used as a term weight, so that more important terms make a greater contribution to the document representation.
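The ranking-and-cutoff step can be sketched as follows. This assumes the threshold is a rank cutoff k (keep the k best-scoring terms); a cutoff on the score itself would be an equally reasonable reading. The function name `select_top_k` and the scores are hypothetical.

```python
def select_top_k(scores, k):
    """Return the k highest-scoring terms, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]


# Hypothetical importance scores for the full vocabulary V.
scores = {"goal": 7.2, "match": 3.1, "the": 0.02, "and": 0.01}
reduced_vocabulary = select_top_k(scores, k=2)  # the subset U
```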

Feature selection approaches can be categorised as supervised or unsupervised. For the purpose of this discussion, we focus exclusively on supervised feature selection metrics. Many supervised feature selection techniques have been proposed in the literature, including Information Gain (IG), Chi-squared (χ²), Mutual Information (MI), Gain Ratio (GR) and Odds Ratio (OR). The mathematical formulations of these feature selection metrics are given in table 2.1, where p(t_j, c_k) is the probability that a document contains the term t_j and belongs to the class c_k, p(t_j) is the probability that a document contains the term t_j, and p(c_k) is the probability that a document belongs to class c_k.

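Function              Formula

Chi-squared           $\chi^2(t_j, c_k) = \frac{\left(p(t_j, c_k)\,p(\bar{t}_j, \bar{c}_k) - p(t_j, \bar{c}_k)\,p(\bar{t}_j, c_k)\right)^2}{p(t_j)\,p(\bar{t}_j)\,p(c_k)\,p(\bar{c}_k)}$

Information Gain      $IG(t_j, c_k) = \sum_{c \in \{c_k, \bar{c}_k\}} \sum_{t \in \{t_j, \bar{t}_j\}} p(t, c) \log \frac{p(t, c)}{p(t)\,p(c)}$

Gain Ratio            $GR(t_j, c_k) = \frac{\sum_{c \in \{c_k, \bar{c}_k\}} \sum_{t \in \{t_j, \bar{t}_j\}} p(t, c) \log \frac{p(t, c)}{p(t)\,p(c)}}{-\sum_{c \in \{c_k, \bar{c}_k\}} p(c) \log p(c)}$

Mutual Information    $MI(t_j, c_k) = \log \frac{p(t_j, c_k)}{p(t_j)\,p(c_k)}$

Odds Ratio            $OR(t_j, c_k) = \frac{df(t_j, c_k) / df(\bar{t}_j, c_k)}{df(t_j, \bar{c}_k) / df(\bar{t}_j, \bar{c}_k)}$

Table 2.1: Supervised feature selection metrics.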
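The probability-based metrics in table 2.1 can be sketched in a few lines for a single term and class. This is an illustration rather than a reference implementation: the function names are hypothetical, natural logarithms are assumed, cells with zero probability are skipped in the sums, and the Gain Ratio denominator is taken as the class entropy (with a leading minus sign) so that the ratio is non-negative.

```python
import math


def joint_cells(p_tc, p_t, p_c):
    """Joint probabilities for (term present?, in class?) from three inputs."""
    return {
        (True, True): p_tc,
        (True, False): p_t - p_tc,
        (False, True): p_c - p_tc,
        (False, False): 1.0 - p_t - p_c + p_tc,
    }


def chi_squared(p_tc, p_t, p_c):
    cells = joint_cells(p_tc, p_t, p_c)
    num = (cells[True, True] * cells[False, False]
           - cells[True, False] * cells[False, True]) ** 2
    return num / (p_t * (1.0 - p_t) * p_c * (1.0 - p_c))


def information_gain(p_tc, p_t, p_c):
    marg_t = {True: p_t, False: 1.0 - p_t}
    marg_c = {True: p_c, False: 1.0 - p_c}
    total = 0.0
    for (t, c), p in joint_cells(p_tc, p_t, p_c).items():
        if p > 0:  # treat 0 * log(0) as 0
            total += p * math.log(p / (marg_t[t] * marg_c[c]))
    return total


def gain_ratio(p_tc, p_t, p_c):
    # Class entropy, written with the minus sign so it is positive.
    entropy = -sum(p * math.log(p) for p in (p_c, 1.0 - p_c) if p > 0)
    return information_gain(p_tc, p_t, p_c) / entropy


def mutual_information(p_tc, p_t, p_c):
    return math.log(p_tc / (p_t * p_c))


# A term that occurs in exactly the documents of the class.
perfect = (0.5, 0.5, 0.5)
# A term that is statistically independent of the class.
independent = (0.25, 0.5, 0.5)
```

For the independent term all four metrics evaluate to zero, while for the perfectly associated term χ² and GR both reach 1 and IG and MI both equal log 2.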

Let N be the number of documents in the collection, N_{c_k} the total number of documents that belong to class c_k, df(t_j) the number of documents in the collection that contain the term t_j, and df(t_j, c_k) the number of documents belonging to class c_k that contain term t_j. Then, the probabilities above can be estimated as follows:

$p(t_j, c_k) = df(t_j, c_k) / N$

$p(t_j) = df(t_j) / N$

$p(c_k) = N_{c_k} / N$
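A small worked sketch of these estimates is given below, together with the Odds Ratio computed directly from document frequencies. The counts are invented, the function names are hypothetical, and no smoothing is applied, so `odds_ratio` would fail with a division by zero if any of the complementary counts were zero.

```python
def estimate_probabilities(n_docs, n_class, df_t, df_tc):
    """Maximum-likelihood estimates from document counts."""
    return {
        "p_tc": df_tc / n_docs,   # p(t_j, c_k)
        "p_t": df_t / n_docs,     # p(t_j)
        "p_c": n_class / n_docs,  # p(c_k)
    }


def odds_ratio(n_docs, n_class, df_t, df_tc):
    """Odds Ratio from document frequencies (unsmoothed)."""
    df_nt_c = n_class - df_tc                   # in class, term absent
    df_t_nc = df_t - df_tc                      # outside class, term present
    df_nt_nc = n_docs - n_class - df_t + df_tc  # outside class, term absent
    return (df_tc / df_nt_c) / (df_t_nc / df_nt_nc)


# Invented counts: 100 documents, 40 in the class, the term occurs in
# 30 documents, 24 of which belong to the class.
probs = estimate_probabilities(100, 40, 30, 24)
ratio = odds_ratio(100, 40, 30, 24)
```

With these counts the estimates are p(t_j, c_k) = 0.24, p(t_j) = 0.3 and p(c_k) = 0.4, and the odds of seeing the term inside the class are 13.5 times the odds outside it.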

A comparative analysis of feature selection techniques for text classification found χ² and IG to give the best performance (Yang & Pedersen 1997). These results are supported by another comparative study of a larger set of feature selection metrics, where IG and χ² were found to give the best performance in terms of precision (Forman 2003). IG measures the information available for category prediction from knowing whether a term is present or absent in a document. Thus, the higher the IG value of a term t_j, the more important t_j is for class prediction. On the other hand, χ² measures the lack of independence between a term t_j and a class c_k. Accordingly, the higher the χ² score of a term t_j, the more important t_j is for class prediction. Despite the differences in the fundamental approach of IG and χ² to feature selection, both techniques are good at measuring the predictiveness of terms, hence their good performance on feature selection. This suggests that both IG and χ² are likely to produce good results when used to provide supervised term weights. Indeed, this intuition is supported by the results of a comparative study of tf-idf and supervised term weighting approaches presented in (Deng et al. 2004). The supervised weights considered are tf-CHI, which combines tf with χ², and tf-OddsRatio, which combines tf with Odds Ratio; in both cases, tf is combined with the supervised weight as shown in equation 2.22. A comparative evaluation on text classification using SVM showed tf-CHI to outperform the other weighting schemes, with tf-OddsRatio second best.
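To make the tf-CHI idea concrete, the sketch below computes a χ² score per term from a tiny labelled collection and then multiplies it by term frequency as in equation 2.22. It is not the implementation of Deng et al. (2004): the corpus, the function names, the restriction to a single target class and the choice to score a term as zero when it occurs in every document are all assumptions of this example.

```python
from collections import Counter


def chi2_scores(docs, labels, positive):
    """Chi-squared score of every term with respect to one class."""
    n = len(docs)
    n_c = sum(1 for y in labels if y == positive)
    df, df_c = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not term frequency
            df[term] += 1
            if y == positive:
                df_c[term] += 1
    scores = {}
    for term, df_t in df.items():
        p_t, p_c, p_tc = df_t / n, n_c / n, df_c[term] / n
        p_t_nc = p_t - p_tc                # term present, other class
        p_nt_c = p_c - p_tc                # term absent, target class
        p_nt_nc = 1.0 - p_t - p_c + p_tc   # term absent, other class
        den = p_t * (1.0 - p_t) * p_c * (1.0 - p_c)
        # Assumption: a degenerate term (or class) scores 0 instead of failing.
        scores[term] = 0.0 if den == 0 else (p_tc * p_nt_nc - p_t_nc * p_nt_c) ** 2 / den
    return scores


def tf_chi(tokens, scores):
    """tf multiplied by the supervised chi-squared score."""
    tf = Counter(tokens)
    return {term: count * scores.get(term, 0.0) for term, count in tf.items()}


docs = [["goal", "match", "the"], ["goal", "the"],
        ["vote", "the"], ["vote", "party", "the"]]
labels = ["sport", "sport", "politics", "politics"]
scores = chi2_scores(docs, labels, positive="sport")
weights = tf_chi(["goal", "goal", "the"], scores)
```

Note that χ² is symmetric: "vote", which never occurs in the sport class, scores as highly as "goal", which always does, because both are equally informative about class membership.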

A more extensive comparative analysis using three different classifiers (Rocchio, SVM and KNN) is presented in (Debole & Sebastiani 2003). Here too, the authors use the approach of equation 2.22 for supervised term weighting with χ², Gain Ratio (GR) and Information Gain (IG), and compare these with standard tf-idf. Of the three supervised weighting approaches, GR produced the best results across all three classifiers, followed by χ². Supervised weighting with IG was found to produce rather disappointing results. The results also show that supervised weighting does not always produce improvements, as all three supervised approaches were outperformed by tf-idf on a number of datasets.
