CS590I: Information Retrieval. Text Categorization (I)


Luo Si

Department of Computer Science Purdue University

Text Categorization (I)

Outline

• Introduction to the task of text categorization
  ➢ Manual vs. automatic text categorization
• Text categorization applications
• Evaluation of text categorization
• K nearest neighbor text categorization method

Text Categorization

• Tasks
  ➢ Assign predefined categories to text documents/objects
• Motivation
  ➢ Provide an organizational view of the data
• Large cost of manual text categorization
  ➢ Millions of dollars spent on manual categorization in companies, governments, public libraries, and hospitals
  ➢ Manual categorization is almost impossible for some large-scale applications (e.g., classification of Web pages)

Text Categorization

• Automatic text categorization
  ➢ Learn an algorithm to automatically assign predefined categories to text documents/objects
  ➢ Automatic or semi-automatic
• Procedures
  ➢ Training: given a set of categories and labeled document examples, learn a method to map a document to the correct category (categories)
  ➢ Testing: predict the category (categories) of a new document
• Automatic or semi-automatic categorization can significantly reduce manual effort

Text Categorization: Examples

News Categories

Text Categorization: Examples

Categories


Text Categorization: Examples

Medical Subject Headings (Categories)

Example: U.S. Census in 1990

• Included 22 million responses
• Needed to be classified into industry categories (200+) and occupation categories (500+)
• Would cost $15 million if conducted by hand
• Two alternative automatic text categorization methods have been evaluated
  ➢ Knowledge engineering (expert system)
  ➢ Machine learning (K nearest neighbor method)

Example: U.S. Census in 1990

• A Knowledge-Engineering Approach
  ➢ Expert system (designed by domain experts)
  ➢ Hand-coded rules (e.g., if “Professor” and “Lecturer” -> “Education”)
  ➢ Development cost: 2 experts, 8 years (192 person-months)
  ➢ Accuracy = 47%
• A Machine Learning Approach
  ➢ K Nearest Neighbor (KNN) classification: details later; find your language by what language your neighbors speak
  ➢ Fully automatic
  ➢ Development cost: 4 person-months
  ➢ Accuracy = 60%

Many Applications!

• Web page classification (Yahoo-like category taxonomies)
• News article classification (more formal than most Web pages)
• Automatic email sorting (spam detection; sorting into different folders)
• Word sense disambiguation (Java programming vs. Java in Indonesia)
• Gene function classification (find the functions of a gene from the articles talking about the gene)
• What are your favorite applications?...

Techniques Explored in Text Categorization

• Rule-based expert system (Hayes, 1990)
• Nearest neighbor methods (Creecy’92; Yang’94)
• Decision symbolic rule induction (Apte’94)
• Naïve Bayes (language model) (Lewis’94; McCallum’98)
• Regression method (Fuhr’92; Yang’92)
• Support Vector Machines (Joachims’98)
• Boosting or bagging (Schapire’98)
• Neural networks (Wiener’95)
• ……

Text Categorization: Evaluation

Performance of different algorithms on the Reuters-21578 corpus: 90 categories, 7769 training docs, 3019 test docs (Yang, JIR 1999)


Text Categorization: Evaluation

Contingency Table Per Category (for all docs)

                      Truth: True    Truth: False
Predicted Positive    a              b               a+b
Predicted Negative    c              d               c+d
                      a+c            b+d             n=a+b+c+d

a: number of true-positive docs
b: number of false-positive docs
c: number of false-negative docs
d: number of true-negative docs
n: total number of test documents
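The four cells of the table above can be counted directly from parallel lists of true and predicted labels. This is a minimal sketch for one category; the function name `contingency_counts` and the toy labels are assumptions made for illustration.

```python
def contingency_counts(truth, predicted):
    """Return (a, b, c, d) for one category from parallel lists of booleans."""
    pairs = list(zip(truth, predicted))
    a = sum(1 for t, p in pairs if t and p)          # true positives
    b = sum(1 for t, p in pairs if not t and p)      # false positives
    c = sum(1 for t, p in pairs if t and not p)      # false negatives
    d = sum(1 for t, p in pairs if not t and not p)  # true negatives
    return a, b, c, d


# Toy labels (hypothetical): True means the doc belongs to the category.
truth = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]

a, b, c, d = contingency_counts(truth, predicted)
n = a + b + c + d
print(a, b, c, d, n)
```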

Text Categorization: Evaluation

Contingency Table Per Category (for all docs)

[Figure: diagram of the regions a, b, c, d; n: total number of docs]

Sensitivity: a/(a+c), the true-positive rate; the larger the better
Specificity: d/(b+d), the true-negative rate; the larger the better
Both depend on the decision threshold; there is a trade-off between the two values
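To make the threshold trade-off concrete, the sketch below recomputes both rates at several thresholds. It assumes each doc has a real-valued score and is predicted positive when the score exceeds the threshold; the scores, labels, and the name `sensitivity_specificity` are hypothetical.

```python
def sensitivity_specificity(scores, truth, threshold):
    """Return (sensitivity, specificity) when score > threshold means 'positive'."""
    predicted = [s > threshold for s in scores]
    pairs = list(zip(truth, predicted))
    a = sum(1 for t, p in pairs if t and p)          # true positives
    b = sum(1 for t, p in pairs if not t and p)      # false positives
    c = sum(1 for t, p in pairs if t and not p)      # false negatives
    d = sum(1 for t, p in pairs if not t and not p)  # true negatives
    sensitivity = a / (a + c) if (a + c) else 0.0
    specificity = d / (b + d) if (b + d) else 0.0
    return sensitivity, specificity


# Toy classifier scores and true labels (hypothetical).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
truth = [True, True, False, True, False, False]

for threshold in (0.2, 0.5, 0.7):
    sens, spec = sensitivity_specificity(scores, truth, threshold)
    print(f"t={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy data, lowering the threshold raises sensitivity at the expense of specificity, and raising it does the opposite.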

Text Categorization: Evaluation

Recall: $r = a/(a+c)$, the true-positive rate (percentage of positive docs detected)
Precision: $p = a/(a+b)$, how accurate the predicted positive docs are

F-measure: $F_\beta = \frac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r}$, and for $\beta = 1$: $F_1 = \frac{2pr}{p + r}$

Harmonic average: $\frac{1}{\frac{1}{2}\left(\frac{1}{x_1} + \frac{1}{x_2}\right)}$

Accuracy: $(a+d)/n$, how accurate all the predictions are
Error: $(b+c)/n$, the error rate of the predictions
Accuracy + Error = 1
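The measures on this slide follow directly from the contingency counts a, b, c, d. The sketch below is one way to compute them; the function name `precision_recall_f` and the example counts are assumptions, and the zero-division guards are a convention chosen here, not something the slides specify.

```python
def precision_recall_f(a, b, c, beta=1.0):
    """Return (precision, recall, F_beta) from contingency counts a, b, c."""
    p = a / (a + b) if (a + b) else 0.0  # precision
    r = a / (a + c) if (a + c) else 0.0  # recall
    if p == 0.0 and r == 0.0:
        return p, r, 0.0                 # assumed convention when nothing is found
    f = (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
    return p, r, f


# Hypothetical counts for one category.
a, b, c, d = 40, 10, 20, 30
n = a + b + c + d

p, r, f1 = precision_recall_f(a, b, c)
accuracy = (a + d) / n
error = (b + c) / n

# F1 is the harmonic average of precision and recall.
harmonic = 1.0 / (0.5 * (1.0 / p + 1.0 / r))

print(f"p={p:.3f} r={r:.3f} F1={f1:.3f} accuracy={accuracy:.2f} error={error:.2f}")
```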

Text Categorization: Evaluation

• Micro F1-Measure
  ➢ Calculate a single contingency table pooled over all categories and calculate the F1 measure from it
  ➢ Treats each prediction with equal weight; favors algorithms that work well on large categories
• Macro F1-Measure
  ➢ Calculate a separate contingency table for every category, calculate the F1 measure for each, and average the values
  ➢ Treats each category with equal weight; favors algorithms that work well on many small categories
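The difference between the two averages shows up when category sizes are unbalanced. The sketch below uses two made-up categories, one large and well classified and one small and poorly classified; the counts and function names are illustrative assumptions.

```python
def f1_from_counts(a, b, c):
    """F1 from true positives a, false positives b, false negatives c.

    Algebraically equal to 2pr/(p+r) with p = a/(a+b) and r = a/(a+c).
    """
    denom = 2 * a + b + c
    return 2 * a / denom if denom else 0.0


def micro_f1(tables):
    """Pool the counts of all categories into one table, then compute F1."""
    a = sum(t[0] for t in tables.values())
    b = sum(t[1] for t in tables.values())
    c = sum(t[2] for t in tables.values())
    return f1_from_counts(a, b, c)


def macro_f1(tables):
    """Compute F1 per category, then average with equal weight per category."""
    scores = [f1_from_counts(*t) for t in tables.values()]
    return sum(scores) / len(scores)


# Hypothetical per-category counts (a, b, c).
tables = {"large_category": (90, 10, 10), "small_category": (1, 4, 4)}

print(f"micro F1 = {micro_f1(tables):.3f}")
print(f"macro F1 = {macro_f1(tables):.3f}")
```

Here the micro average is dominated by the large category, while the macro average is pulled down by the small one.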

K-Nearest Neighbor Classifier

• Also called “instance-based learning” or “lazy learning”
  ➢ Low/no cost in “training”, high cost in online prediction
• Commonly used in pattern recognition (for 5 decades)
• Theoretical error bound analyzed by Duda & Hart (1957)
• Applied to text categorization in the 1990s
• Among the top-performing text categorization methods

K-Nearest Neighbor Classifier

• Keep all training examples
• Find the k examples that are most similar to the new document (“neighbor” documents)
• Assign the category that is most common in these neighbor documents (neighbors vote for the category)
• Can be improved by considering the distance of a neighbor (a closer neighbor has more weight/influence)
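The steps above can be sketched as a small voting routine. This is an illustration only: it assumes Euclidean distance on toy 2-D points and inverse-distance weights for the weighted variant, and the name `knn_vote` and the data are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(u, v):
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def knn_vote(train, x, k, weighted=False):
    """Return the label chosen by the k nearest training examples.

    train: list of (vector, label) pairs, all kept in memory (no training step).
    weighted: if True, each neighbor votes with weight 1/distance (an assumed scheme).
    """
    neighbors = sorted(train, key=lambda ex: euclidean(ex[0], x))[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        weight = 1.0 / (euclidean(vec, x) + 1e-9) if weighted else 1.0
        votes[label] += weight
    return max(votes, key=votes.get)


# Toy data (hypothetical): one very close red point, three blue points further out.
train = [
    ((0.5, 0.0), "red"),
    ((2.0, 0.0), "blue"),
    ((0.0, 2.0), "blue"),
    ((2.0, 2.0), "blue"),
    ((3.0, 3.0), "red"),
]
x = (0.0, 0.0)

print(knn_vote(train, x, k=1))                 # nearest neighbor only
print(knn_vote(train, x, k=5))                 # unweighted majority vote
print(knn_vote(train, x, k=5, weighted=True))  # closer neighbors count more
```

In this toy set the unweighted 5-neighbor vote and the weighted vote disagree, which is the effect the last bullet describes.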


K-Nearest Neighbor Classifier

• Idea: find your language by what language your neighbors speak
• Use K nearest neighbors to vote

[Figure: neighborhoods of a test point for k=1, k=5, and k=10]

1-NN: Red; 5-NN: Blue; 10-NN: ?; Weighted 10-NN: Blue

K Nearest Neighbor: Technical Elements

• Document representation
• Document distance measure: closer documents should have similar labels; neighbors speak the same language
• Number of nearest neighbors (value of K)
• Decision threshold

K Nearest Neighbor: Framework

Training data: $D = \{(x_i, y_i)\}$, with docs $x_i \in \mathbb{R}^M$ and labels $y_i \in \{0, 1\}$

Test data: $x \in \mathbb{R}^M$

The neighborhood is $D_k(x) \subseteq D$

Scoring function: $\hat{y}(x) = \frac{1}{k} \sum_{x_i \in D_k(x)} \mathrm{sim}(x, x_i)\, y_i$

Classification: $1$ if $\hat{y}(x) - t > 0$, $0$ otherwise

Document representation: $x_i$ uses tf.idf weighting for each dimension
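The framework on this slide (a similarity-weighted score over the k nearest training docs, followed by a threshold t) can be sketched as follows. Cosine similarity is an assumed choice of sim, and the vectors stand in for tf.idf weights; all names and numbers are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm = math.sqrt(sum(ui * ui for ui in u)) * math.sqrt(sum(vi * vi for vi in v))
    return dot / norm if norm else 0.0


def knn_score(train, x, k):
    """Score y_hat(x): average of sim(x, x_i) * y_i over the k most similar docs."""
    neighborhood = sorted(train, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    return sum(cosine(x, xi) * yi for xi, yi in neighborhood) / k


def classify(train, x, k, t):
    """Return 1 if the score exceeds the decision threshold t, else 0."""
    return 1 if knn_score(train, x, k) - t > 0 else 0


# Toy training set (hypothetical tf.idf-like vectors with binary labels).
train = [
    ([1.0, 0.0], 1),
    ([0.8, 0.6], 1),
    ([0.0, 1.0], 0),
    ([0.6, 0.8], 0),
]
x = [1.0, 0.0]

print(knn_score(train, x, k=3))
print(classify(train, x, k=3, t=0.5), classify(train, x, k=3, t=0.7))
```

Because negative examples have y_i = 0, only positive neighbors contribute to the score, so the threshold t controls how much similar positive evidence is required.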

Choices of Similarity Functions

Dot product: $x_1 \cdot x_2 = \sum_v x_{1v}\, x_{2v}$

Euclidean distance: $d(x_1, x_2) = \sqrt{\sum_v (x_{1v} - x_{2v})^2}$

Kullback-Leibler distance: $d(x_1, x_2) = \sum_v x_{1v} \log \frac{x_{1v}}{x_{2v}}$

Cosine similarity: $\cos(x_1, x_2) = \frac{\sum_v x_{1v}\, x_{2v}}{\sqrt{\sum_v x_{1v}^2}\, \sqrt{\sum_v x_{2v}^2}}$

Kernel functions: $k(x_1, x_2) = e^{-d(x_1, x_2)/2\sigma^2}$ (Gaussian kernel)

Automatic learning of the metrics
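The similarity and distance choices listed above can be written as short functions over dense vectors. This is a sketch under stated assumptions: the Kullback-Leibler distance expects both vectors to be probability distributions with the second positive wherever the first is, and the Gaussian kernel plugs the Euclidean distance into the exponent as written on the slide (many texts use the squared distance instead).

```python
import math


def dot(x1, x2):
    return sum(a * b for a, b in zip(x1, x2))


def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def kl_distance(x1, x2):
    """Kullback-Leibler distance; terms with x1v == 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(x1, x2) if a > 0)


def cosine(x1, x2):
    norm = math.sqrt(dot(x1, x1)) * math.sqrt(dot(x2, x2))
    return dot(x1, x2) / norm if norm else 0.0


def gaussian_kernel(x1, x2, sigma=1.0):
    """exp(-d(x1, x2) / (2 sigma^2)) with d the Euclidean distance (assumed)."""
    return math.exp(-euclidean(x1, x2) / (2 * sigma ** 2))


# Hypothetical vectors for a quick comparison.
u, v = [1.0, 2.0], [2.0, 4.0]
print(dot(u, v), euclidean(u, v), cosine(u, v), gaussian_kernel(u, v))
print(kl_distance([0.9, 0.1], [0.5, 0.5]))
```

Note that the dot product, cosine, and the kernel grow with similarity, while the Euclidean and Kullback-Leibler distances shrink, so the neighbor ranking must sort in the matching direction.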

Choices of Number of Neighbors (K)

Trade-off between a small number of neighbors and a large number of neighbors

• Find the desired number of neighbors by cross validation
  ➢ Choose a subset of the available data as training data, the rest as validation data
  ➢ Find the desired number of neighbors on the validation data
  ➢ The procedure can be repeated for different splits; find the number that is consistently good across the splits
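The selection procedure above can be sketched with a few interleaved train/validation splits. The example uses 1-D points, majority-vote KNN, and accuracy as the validation measure; these choices, the toy data, and the names `knn_predict` and `choose_k` are assumptions made for illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D data)."""
    neighbors = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def validation_accuracy(train, valid, k):
    hits = sum(1 for x, y in valid if knn_predict(train, x, k) == y)
    return hits / len(valid)


def choose_k(data, candidate_ks, n_splits):
    """Return (best_k, mean accuracy per k) averaged over n_splits splits."""
    totals = {k: 0.0 for k in candidate_ks}
    for s in range(n_splits):
        # Every n_splits-th example starting at s is held out for validation.
        valid = data[s::n_splits]
        train = [ex for i, ex in enumerate(data) if i % n_splits != s]
        for k in candidate_ks:
            totals[k] += validation_accuracy(train, valid, k)
    mean_acc = {k: total / n_splits for k, total in totals.items()}
    best_k = max(candidate_ks, key=lambda k: mean_acc[k])
    return best_k, mean_acc


# Toy 1-D data (hypothetical): class 0 on the left, class 1 on the right,
# with one noisy label at x = 4.0.
data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1), (11.0, 1)]

best_k, mean_acc = choose_k(data, candidate_ks=[1, 3, 5], n_splits=3)
print(best_k, mean_acc)
```

Averaging over several splits is what the last bullet asks for: a value of K that looks good on one split only is less trustworthy than one that does well across all of them.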


Characteristics of KNN

Pros
• Simple and intuitive, based on the local-continuity assumption
• Widely used; provides a strong baseline in TC evaluation
• No training needed, low training cost
• Easy to implement; can use standard IR techniques (e.g., tf.idf)

Cons
• Heuristic approach, no explicit objective function
• Difficult to determine the number of neighbors
• High online cost in testing; finding nearest neighbors has high time complexity

Text Categorization (I)

Outline

• Introduction to the task of text categorization
  ➢ Manual vs. automatic text categorization
• Text categorization applications
• Evaluation of text categorization
• K nearest neighbor text categorization method
  ➢ Lazy learning: no training
  ➢ Local-continuity assumption: find your language by what language your neighbors speak