CS590I: Information Retrieval
CS-590I Information Retrieval
Text Categorization (I)g ( )
Luo Si
Department of Computer Science Purdue University
Text Categorization (I)
Outline
zIntroduction to the task of text categorization
¾Manual v.s. automatic text categorization
zText categorization applications E l ti f t t t i ti
zEvaluation of text categorization
zK nearest neighbor text categorization method
Text Categorization
zTasks
¾Assign predefined categories to text documents /objects
zMotivation
¾Provide an organizational view of the data
L t f l t t t i ti
zLarge cost of manual text categorization
¾Millions of dollars spent for manual categorization in companies, governments, public libraries, hospitals
¾Manual categorization is almost impossible for some large scale application (Classification or Web pages)
Text Categorization
zAutomatic text categorization
¾Learn algorithm to automatically assign predefined categories to text documents /objects
¾automatic or semi-automatic
zProceduresProcedures
¾Training:Given a set of categories and labeled document examples; learn a method to map a document to correct category (categories)
¾Testing:Predict the category (categories) of a new document
zAutomatic or semi-automatic categorization can significantly reduce the manual efforts
Text Categorization: Examples
News Categories
Text Categorization: Examples
Categories
Text Categorization: Examples
Medical Subject Headings (Categories) (Categories)
Example: U.S. Census in 1990
zIncluded 22 million responses
zNeeded to be classified into industry categories (200+) and occupation categories (500+)
zWould cost $15 millions if conduced by hand
zTwo alternative automatic text categorization methods have been evaluated
¾Knowledge-Engineering (Expert System)
¾Machine Learning (K nearest neighbor method)
Example: U.S. Census in 1990
zA Knowledge-Engineering Approach
¾Expert System (Designed by domain expert)
¾Hand-Coded rule (e.g., if “Professor” and “Lecturer” -> “Education”)
¾Development cost: 2 experts, 8 years (192 Person-months)
¾Accuracy = 47%
¾Accuracy = 47%
zA Machine Learning Approach
¾K Nearest Neighbor (KNN) classification: details later; find your language by what language your neighbors speak
¾Fully automatic
¾Development cost: 4 Person-months
¾Accuracy = 60%
Many Applications!
zWeb page classification (Yahoo-like category taxonomies)
zNews article classification (more formal than most Web pages)
zAutomatic email sorting (spam detection; into different folders))
zWord sense disambiguation (Java programming v.s. Java in Indonesia)
zGene function classification (find the functions of a gene from the articles talking about the gene)
zWhat is your favorite applications?...
Techniques Explored in Text Categorization
zRule-based Expert system (Hayes, 1990)
zNearest Neighbor methods(Creecy’92; Yang’94)
zDecision symbolic rule induction (Apte’94)
zNaïve Bayes(Language Model) (Lewis’94; McCallum’98)
zRegression method(Furh’92; Yang’92)
zSupport Vector Machines(Joachims’98) zBoosting or Bagging (Schapier’98)
zNeural networks (Wiener’95)
z……
Text Categorization: Evaluation
Performance of different algorithms on Reuters-21578 corpus: 90 categories, 7769 Training docs, 3019 test docs, (Yang, JIR 1999)
Text Categorization: Evaluation
Contingency Table Per Category (for all docs)Truth: True Truth: False Predicted
Positive a b a+b
Predicted
Negative c d c+d
a+c b+d n=a+b+c+d
a: number of truly positive docs b: number of false-positive docs c: number of false negative docs d: number of truly-negative docs n: total number of test documents
Text Categorization: Evaluation
Contingency Table Per Category (for all docs)a c b
n: total number of docs
d
Sensitivity: a/(a+c) truly-positive rate, the larger the better Specificity: d/(b+d) truly-negative rate, the larger the better They depend on decision threshold, trade off between the values
Text Categorization: Evaluation
F 2pr 1)pr F (β
2+
Recall: r=a/(a+c) truly-positive (percentage of positive docs detected) Precision: p=a/(a+b) how accurate is the predicted positive docs Accuracy: (a+d)/n how accurate is all the predicted docs
r p F p r p β
)p
Fβ (β2 1
= +
= + F-measure:
Harmonic average:
⎟⎟⎠
⎜⎜ ⎞
⎝
⎛ +
2
1 x
1 x
1 2 1
1
Accuracy: (a+d)/n how accurate is all the predicted docs Error: (b+c)/n error rate of predicted docs Accuracy+Error=1
Text Categorization: Evaluation
zMicro F1-Measure
¾Calculate a single contingency table for all categories and calculate F1 measure
¾Treat each prediction with equal weight; better for algorithms that work well on large categories algorithms that work well on large categories
zMacro F1-Measure
¾Calculate a single contingency table for every category calculate F1 measure separately and average the values
¾Treat each category with equal weight; better for algorithms that work well on many small categories
K-Nearest Neighbor Classifier
zAlso called “Instance-based learning” or “lazy learning”
¾low/no cost in “training”, high cost in online prediction
zCommonly used in pattern recognition (5 decades)
zTheoretical error bound analyzed by Duda & Hart (1957)
zApplied to text categorization in 1990’s
zAmong top-performing text categorization methods
K-Nearest Neighbor Classifier
zKeep all training examples
zFind k examples that are most similar to the new document (“neighbor” documents)
zAssign the category that is most common in these neighbor documents (neighbors vote for the category)
documents (neighbors vote for the category)
zCan be improved by considering the distance of a neighbor ( A closer neighbor has more weight/influence)
K-Nearest Neighbor Classifier
zIdea: find your language by what language your neighbors speak
(k=1) (k=5)
zUse K nearest neighbors to vote
1-NN: Red; 5-NN: Blue; 10-NN: ?; Weight 10-NN: Blue (k=10) ?
K Nearest Neighbor: Technical Elements
zDocument representation
zDocument distance measure: closer documents should have similar labels; neighbors speak the same language
zNumber of nearest neighbor (value of K)
zDecision threshold
K Nearest Neighbor: Framework
{0,1}
y docs, , R x )}, y , {(x
D= i i i∈ M i∈
RM
x∈
∈
∑
=
(x) D x
i i k i
)y x sim(x, k (x) 1 yˆ Training data Test data
Scoring Function
The neighbor hood is Dk∈D
Classification: 1 if y(x) t 0ˆ 0 otherwise
− >
⎧⎨
⎩
Document Representation: Xiuses tf.idf weighting for each dimension
Choices of Similarity Functions
∑
∗=
∗ 2 x1v x2v
1 x
x Dot product
∑
−= v 1v 2v 2
2
1,x ) (x x )
x d(
Euclidean distance
∑
= v
2v 1v 2 1v
1 x
logx x ) x , x Kullback Leibler d(
distance
∑
v v
v 2
2 1
p 1
∑
∑ ∑
∗=
v 2 v 2v
2 1v
v 1v 2v
2
1 x x
x ) x
x , x cos(
Cosine Similarity
Kernel) (Gaussian e
) x , x
k( 1 2 = −d(x1,x2)/2σ2
Kernel functions
Automatic learning of the metrics
Choices of Number of Neighbors (K)
Trade off between small number of neighbors and large number of neighbors
z Find desired number of neighbors by cross validation
¾Choose a subset of available data as training data, the rest as validation data
¾Find the desired number of neighbors on the validation data
Choices of Number of Neighbors (K)
data
¾The procedure can be repeated for different splits; find the consistent good number for the splits
Pros
z Simple and intuitive, based on local-continuity assumption
z Widely used and provide strong baseline in TC Evaluation
z No training needed, low training cost
Characteristics of KNN
z Easy to implement; can use standard IR techniques (e.g., tf.idf)
Cons
z Heuristic approach, no explicit objective function
z Difficult to determine the number of neighbors
z High online cost in testing; find nearest neighbors has high time complexity
Text Categorization (I)
Outline
zIntroduction to the task of text categorization
¾Manual v.s. automatic text categorization
zText categorization applications E l ti f t t t i ti
zEvaluation of text categorization
zK nearest neighbor text categorization method
¾Lazy learning: no training
¾Local-continuity assumption: find your language by what language your neighbors speak