AI:
Natural Language Processing
Dr. Muhammad Humayoun Assistant Professor
University of Central Punjab, Lahore.
Slides of Prof. Dr. Dan Jurafsky and Prof. Dr. Chris Manning, Stanford
Text Classification and Naïve Bayes
The Task of Text Classification
Dan Jurafsky
Is this spam? Features
Authorship attribution
Who wrote which Federalist papers?
• 1787-8: anonymous essays try to convince New York to ratify the U.S. Constitution; the authors were Jay, Madison, and Hamilton.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
James Madison Alexander Hamilton
Later gave rise to the use of naïve Bayes
Gender Identification
Male or female author?
1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China;
the central area with its imperial capital at Hue was the protectorate of Annam…
2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the
mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…
S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp. 321–346
More use of facts and determiners
More use of pronouns
Sentiment Analysis
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
Very important commercial application. We’ll do it in Weka.
Topic Extraction
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
MEDLINE Article → ? → category in the MeSH Subject Category Hierarchy
Text Classification Tasks
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language Identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
• a document d
• a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
• Output: a predicted class $c \in C$
Classification Methods:
Hand-coded rules
• Rules based on combinations of words or other features
• spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
• If rules carefully refined by expert
• But building and maintaining these rules is expensive
• These rules may be used as part of a larger system
• In which rules are combined with more general methods from ML
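The spam rule above can be written directly as a predicate. This is only a minimal sketch: the blacklist contents, the substring matching, and the names `BLACKLIST` and `is_spam_by_rule` are illustrative assumptions, not part of the slides.

```python
# Hypothetical blacklist of sender addresses (illustrative only).
BLACKLIST = {"promo@example.com"}

def is_spam_by_rule(sender: str, body: str) -> bool:
    """Rule: black-list-address OR ("dollars" AND "have been selected")."""
    text = body.lower()
    on_blacklist = sender.lower() in BLACKLIST
    phrase_match = "dollars" in text and "have been selected" in text
    return on_blacklist or phrase_match

# The first message matches both phrases; the second only mentions "dollars".
print(is_spam_by_rule("friend@example.org", "You have been selected to win a million dollars"))
print(is_spam_by_rule("friend@example.org", "Lunch costs ten dollars"))
```

Every new spam pattern needs another hand-written condition, which is why maintaining such rules is expensive.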
Classification Methods:
Supervised Machine Learning
• Input:
• a document d
• a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
• A training set of m hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
• Output:
• a learned classifier $\gamma: d \to c$
• $\gamma$ is a function that takes d as input and returns c as output
Illustrative Example
(Figure: each example is a feature vector paired with a target label Y.)
Another example
Features (x1 to x4) and target label Y (price):
Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
x1             x2                   x3                 x4                    Y
2104           5                    1                  45                    460
1416           3                    2                  40                    232
1534           3                    2                  30                    315
852            2                    1                  36                    178
…              …                    …                  …                     …
Movie review data example
From text to vectors
(Figure: each review becomes a feature vector paired with a target label Y.)
Supervised Learning
• We train a function f on particular known inputs $X_m$.
• We may then test it on unknown inputs X that we want to classify or predict.
Supervised Learning
Training Set → Learning Algorithm → h
x (size of house) → h (hypothesis) → estimated value of y (estimated price)
h is a function
h maps from x’s to y’s
A training set of m
• Hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
• Assume we have two classes: +ve, -ve
• Assume we have documents containing one comment each
• Dataset m
• “This movie is unbelievably disappointing”, -ve
• “Full of zany characters and richly applied satire, and some great plot twists”, +ve
• “this is the greatest screwball comedy ever filmed”, +ve
• “It was pathetic. The worst part …”, -ve
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines (SVM)
• k-Nearest Neighbors
• …
Text Classification and Naïve Bayes
Naïve Bayes (I)
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
• Bag of words
• Similar to unigram, it loses context information
The bag of words representation
I love this movie! It's sweet, but with satirical humor. The dialogue is great and the
adventure scenes are fun… It manages to be whimsical and
romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm
always happy to see it again whenever I have a friend who hasn't seen it yet.
γ(d) = c, where d is the review above
The bag of words representation:
using a subset of words
x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx
γ(the words kept above) = c
Stop-word removal
Like unigram, it loses context information
The bag of words representation
γ(the word counts below) = c
great 2
love 2
recommend 1
laugh 1
happy 1
... ...
All we care about are counts
Vector of words and corresponding counts
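A bag of words can be built with a counter over tokens. The sketch below is only an illustration: the lowercase regex tokenizer and the name `bag_of_words` are assumptions, and the sample text is a shortened piece of the review shown earlier.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

review = "I love this movie! I would recommend it to just about anyone. I've seen it several times"
bag = bag_of_words(review)
print(bag["it"], bag["love"], bag["recommend"])

# Word order is discarded: both orderings give the same bag.
print(bag_of_words("great movie") == bag_of_words("movie great"))
```

The last line shows the loss of context: any reordering of the same words yields an identical representation.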
Bag of words for document classification: more than two classes
Test document: parser, language, label, translation, …
Which class (c1 to c5)?
• Machine Learning: learning, training, algorithm, shrinkage, network, ...
• NLP: parser, tag, training, translation, language, ...
• Garbage Collection: garbage, collection, memory, optimization, region, ...
• Planning: planning, temporal, reasoning, plan, language, ...
• GUI: ...
Text Classification and Naïve Bayes
Formalizing the Naïve Bayes Classifier
Bayes’ Rule Applied to Documents and Classes
• For a document d and a class c
$$P(c \mid \text{document}) = \frac{P(\text{document} \mid c) \times P(c)}{P(\text{document})}$$
$$P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}$$
Naïve Bayes Classifier (I)
MAP is “maximum a posteriori” = most likely class
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(c \mid d)$$
Bayes Rule
$$= \underset{c \in C}{\operatorname{argmax}}\; \frac{P(d \mid c)\,P(c)}{P(d)}$$
Dropping the denominator
$$= \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c)$$
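Dropping the denominator is safe because P(d) is the same for every class, so it cannot change which class wins. A small numeric check, using made-up likelihoods and priors (all numbers and class names here are hypothetical):

```python
# Hypothetical values of P(d|c) and P(c) for a single document d.
likelihood = {"pos": 0.002, "neg": 0.0005}
prior = {"pos": 0.4, "neg": 0.6}

# P(d) = sum over classes of P(d|c) * P(c); identical for every class.
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Full posterior P(c|d) versus the unnormalized score P(d|c) * P(c).
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
unnormalized = {c: likelihood[c] * prior[c] for c in prior}

c_map_full = max(posterior, key=posterior.get)
c_map_short = max(unnormalized, key=unnormalized.get)
print(c_map_full, c_map_short)
```

Both argmax computations pick the same class; only the scale of the scores differs.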
Naïve Bayes Classifier (II)
Document d represented as features $x_1, x_2, \ldots, x_n$
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(d \mid c)\,P(c)$$
$$= \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
Naïve Bayes Classifier (IV)
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
• $P(x_1, x_2, \ldots, x_n \mid c)$: $O(|X|^n \cdot |C|)$ parameters. Could only be estimated if a very, very large number of training examples was available.
• $P(c)$: How often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naïve Bayes Independence Assumptions
• Bag of Words assumption: Assume position doesn’t matter
• Conditional Independence: Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class c.
$$P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdot \ldots \cdot P(x_n \mid c)$$
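The two assumptions shrink the model a great deal. A rough count, assuming one stored probability per combination for the full joint model and one per (feature value, class) pair for Naïve Bayes; the function names and the example sizes are illustrative, not from the slides.

```python
def full_joint_params(num_values: int, n: int, num_classes: int) -> int:
    """One probability per combination of n feature values, per class: |X|^n * |C|."""
    return num_values ** n * num_classes

def naive_bayes_params(num_values: int, num_classes: int) -> int:
    """One probability P(x | c) per feature value and class: |X| * |C|."""
    return num_values * num_classes

# Hypothetical sizes: 1000 possible words, documents of 5 positions, 2 classes.
print(full_joint_params(1000, 5, 2))
print(naive_bayes_params(1000, 2))
```

Even for very short documents the full joint table is far too large to estimate from data, while the Naïve Bayes table grows only linearly in the vocabulary size.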
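Multinomial Naïve Bayes Classifier
$$c_{MAP} = \underset{c \in C}{\operatorname{argmax}}\; P(x_1, x_2, \ldots, x_n \mid c)\,P(c)$$
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{x \in X} P(x \mid c_j)$$
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j)\,\big[ P(x_1 \mid c_j) \cdot P(x_2 \mid c_j) \cdot \ldots \cdot P(x_n \mid c_j) \big]$$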
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in text document
Classes: {c1, c2}
$$c_{NB} = \underset{c_j \in C}{\operatorname{argmax}}\; P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
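The decision rule above can be sketched as a loop over classes and word positions. One deliberate change, which is not on the slide: the sketch sums log probabilities instead of multiplying probabilities, since the log is monotonic (same argmax) and long products of small numbers can underflow. The probability tables are hypothetical, and every token is assumed to have a nonzero probability in every class.

```python
import math

def classify(tokens, priors, cond_probs):
    """Return argmax over classes of log P(c) + sum over positions of log P(x_i | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:  # one term per word position
            score += math.log(cond_probs[c][tok])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical model with two classes c1 and c2.
priors = {"c1": 0.5, "c2": 0.5}
cond_probs = {
    "c1": {"great": 0.10, "boring": 0.01},
    "c2": {"great": 0.02, "boring": 0.08},
}
print(classify(["great", "great", "boring"], priors, cond_probs))
print(classify(["boring", "boring", "great"], priors, cond_probs))
```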
Text Classification and Naïve Bayes
Naïve Bayes: Learning
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
• simply use the frequencies in the data
Sec.13.3
$$\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}$$
$$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
Parameter estimation
• Create mega-document for topic j by concatenating all docs in this topic (e.g. Doc_Pos, Doc_Neg)
• Use frequency of w in mega-document
$$\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}$$
= fraction of times word $w_i$ appears among all words in documents of topic $c_j$
$$\hat{P}(\text{``fantastic''} \mid \text{positive}) = \frac{\text{count}(\text{``fantastic''}, \text{positive})}{\sum_{w \in V} \text{count}(w, \text{positive})}$$
$$\hat{P}(\text{``fantastic''} \mid \text{negative}) = \frac{\text{count}(\text{``fantastic''}, \text{negative})}{\sum_{w \in V} \text{count}(w, \text{negative})}$$
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence!
Sec.13.3
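$$\hat{P}(\text{``fantastic''} \mid \text{positive}) = \frac{\text{count}(\text{``fantastic''}, \text{positive})}{\sum_{w \in V} \text{count}(w, \text{positive})} = 0$$
$$c_{MAP} = \underset{c}{\operatorname{argmax}}\; \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$$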
Laplace (add-1) smoothing for Naïve Bayes
in training phase
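$$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c) + 1}{\sum_{w \in V} \big(\text{count}(w, c) + 1\big)} = \frac{\text{count}(w_i, c) + 1}{\Big(\sum_{w \in V} \text{count}(w, c)\Big) + |V|}$$
Compare with the unsmoothed estimate:
$$\hat{P}(w_i \mid c) = \frac{\text{count}(w_i, c)}{\sum_{w \in V} \text{count}(w, c)}$$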
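The effect of add-1 smoothing on an unseen word can be checked on a toy example. The counts and vocabulary below are hypothetical, and the sketch assumes every counted word belongs to the vocabulary V.

```python
from collections import Counter

def mle(word, counts):
    """Unsmoothed estimate: count(w, c) / sum over w' of count(w', c)."""
    return counts[word] / sum(counts.values())

def laplace(word, counts, vocab):
    """Add-1 estimate: (count(w, c) + 1) / (sum over w' of count(w', c) + |V|)."""
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

# Hypothetical word counts for the positive class; "fantastic" never occurs.
positive_counts = Counter({"great": 3, "fun": 2})
vocab = {"great", "fun", "fantastic"}

print(mle("fantastic", positive_counts))             # zero: wipes out the whole product
print(laplace("fantastic", positive_counts, vocab))  # small but nonzero
print(sum(laplace(w, positive_counts, vocab) for w in vocab))
```

The smoothed estimates still sum to 1 over the vocabulary, so they remain a valid distribution.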
Multinomial Naïve Bayes: Learning
• From training corpus, extract Vocabulary, i.e. the list of words
• Calculate $P(c_j)$ terms
• For each $c_j$ in C do
docs_j ← all docs with class = $c_j$
$$P(c_j) \leftarrow \frac{|docs_j|}{|\text{total \# documents}|}$$
• Calculate $P(w_k \mid c_j)$ terms
• Text_j ← single doc containing all docs_j
• For each word $w_k$ in Vocabulary
n_k ← # of occurrences of $w_k$ in Text_j
$$P(w_k \mid c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$$
where n is the total number of word tokens in Text_j
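The learning steps above can be sketched in a few lines. This is one possible reading of the procedure, not a reference implementation: the name `train_nb`, the input format (token list plus label), and the tiny dataset are assumptions made for illustration.

```python
from collections import Counter

def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, class_label). Returns (vocab, priors, cond_probs)."""
    vocab = {w for tokens, _ in docs for w in tokens}   # extract Vocabulary
    classes = {label for _, label in docs}
    priors, cond_probs = {}, {}
    for c in classes:
        docs_c = [tokens for tokens, label in docs if label == c]
        priors[c] = len(docs_c) / len(docs)             # |docs_j| / |total # documents|
        text_c = Counter(w for tokens in docs_c for w in tokens)  # counts in Text_j
        n = sum(text_c.values())                        # total tokens in Text_j
        cond_probs[c] = {
            w: (text_c[w] + alpha) / (n + alpha * len(vocab)) for w in vocab
        }
    return vocab, priors, cond_probs

# Hypothetical three-document training set.
docs = [(["great", "fun"], "+ve"), (["boring"], "-ve"), (["great"], "+ve")]
vocab, priors, cond = train_nb(docs)
print(priors["+ve"], cond["+ve"]["great"], cond["-ve"]["great"])
```

With alpha = 1 this is exactly add-1 smoothing; other values of alpha give add-α smoothing.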
Text Classification and Naïve Bayes
Multinomial Naïve Bayes: A Worked Example
$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{count}(c) + |V|}$$
           Doc   Words                                  Class
Training   1     Chinese Beijing Chinese                c1
           2     Chinese Chinese Shanghai               c1
           3     Chinese Macao                          c1
           4     Tokyo Japan Chinese                    j1
Test       5     Chinese Chinese Chinese Tokyo Japan    ?
Priors:
P(c1) = 3/4
P(j1) = 1/4
Conditional Probabilities:
P(Chinese|c1) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c1) = (0+1) / (8+6) = 1/14
P(Japan|c1) = (0+1) / (8+6) = 1/14
P(Chinese|j1) = (1+1) / (3+6) = 2/9
P(Tokyo|j1) = (1+1) / (3+6) = 2/9
P(Japan|j1) = (1+1) / (3+6) = 2/9
Choosing a class:
P(c1|d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j1|d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001
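The numbers in this worked example can be reproduced with exact fractions. The sketch below hard-codes the four training documents and the test document from the table; the helper names `prior`, `cond_prob`, and `score` are illustrative.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c1"),
    ("Chinese Chinese Shanghai", "c1"),
    ("Chinese Macao", "c1"),
    ("Tokyo Japan Chinese", "j1"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in train for w in text.split()}  # 6 distinct words

def prior(c):
    """P(c) = N_c / N."""
    return Fraction(sum(1 for _, label in train if label == c), len(train))

def cond_prob(w, c):
    """P(w|c) = (count(w, c) + 1) / (count(c) + |V|), i.e. add-1 smoothing."""
    counts = Counter(tok for text, label in train if label == c for tok in text.split())
    return Fraction(counts[w] + 1, sum(counts.values()) + len(vocab))

def score(c):
    """Unnormalized posterior: P(c) times the product of P(w|c) over test words."""
    result = prior(c)
    for w in test_doc.split():
        result *= cond_prob(w, c)
    return result

for c in ("c1", "j1"):
    print(c, score(c), float(score(c)))
```

The class c1 gets the higher score, so document 5 is assigned to c1.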
Naïve Bayes in Spam Filtering
• SpamAssassin Features:
• Mentions Generic Viagra
• Online Pharmacy
• Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
• Phrase: impress ...
• From: starts with many numbers
• Subject is all capitals
• HTML has a low ratio of text to image area
• One hundred percent guaranteed
• Claims you can be removed from the list
• 'Prestigious Non-Accredited Universities'
• http://spamassassin.apache.org/tests_3_3_x.html
Summary: Naive Bayes is Not So Naive
• Very Fast, low storage requirements
• Robust to Irrelevant Features
Irrelevant Features cancel each other without affecting results
• Very good in domains with many equally important features
Decision Trees suffer from fragmentation in such cases – especially if little data
• Optimal if the independence assumptions hold:
If the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good dependable baseline for text classification
• But we will see other classifiers that give better accuracy
Text Classification and Naïve Bayes
Precision, Recall, and the F measure
The 2-by-2 contingency table
                             Actual Class
Predicted Class by System    correct (SPAM)   not correct (Not SPAM)
selected (SPAM)              tp               fp
not selected (NS)            fn               tn
$$Acc = \frac{tp + tn}{tp + fp + fn + tn}$$
Not good for skewed classes
Not good for Skewed classes
                Actual Class
                correct (Shoe brand)   not correct (Anything else)
selected        tp = 0                 fp = 0
not selected    fn = 10                tn = 99,990
$$Acc = \frac{0 + 99{,}990}{0 + 0 + 10 + 99{,}990} = 99.99\%$$
(100,000 items in total, of which only 10 mention the shoe brand.)
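The accuracy of this skewed example can be verified directly from the four cell counts. A minimal sketch; the function name `accuracy` is illustrative.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Acc = (tp + tn) / (tp + fp + fn + tn)."""
    return (tp + tn) / (tp + fp + fn + tn)

# The classifier selects nothing: all 10 shoe-brand items are missed.
acc = accuracy(tp=0, fp=0, fn=10, tn=99_990)
print(f"{acc:.2%}")
```

The score looks excellent even though the system never finds a single relevant item, which is why accuracy is a poor metric for skewed classes.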
Precision and recall
• Precision: % of selected items that are correct
• Recall: % of correct items that are selected
                             Actual Class
Predicted Class by System    correct (SPAM)   not correct (Not SPAM)
selected (SPAM)              tp               fp
not selected (NS)            fn               tn
How to compare precision/recall numbers?
              Precision (P)   Recall (R)   Average   F1 Score
Algorithm 1   0.5             0.4          0.45      0.444
Algorithm 2   0.7             0.1          0.4       0.175
Algorithm 3   0.02            1.0          0.51      0.0392
Average: $\frac{P + R}{2}$
F1 Score (F score): $F_1 = \frac{2PR}{P + R}$
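The comparison table can be recomputed from the precision and recall columns. A short sketch; the function names are illustrative, and the zero-denominator guards are an added convention, not something stated on the slide.

```python
def precision(tp: int, fp: int) -> float:
    """Share of selected items that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of correct items that are selected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

algorithms = {
    "Algorithm 1": (0.5, 0.4),
    "Algorithm 2": (0.7, 0.1),
    "Algorithm 3": (0.02, 1.0),
}
for name, (p, r) in algorithms.items():
    print(name, "average:", round((p + r) / 2, 3), "F1:", round(f1(p, r), 4))
```

Algorithm 3 has the best plain average but by far the worst F1, because the harmonic mean is dominated by the smaller of the two values.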
Text Classification and Naïve Bayes
Text Classification: Evaluation
Development Test Sets and Cross-validation
• Metric: P/R/F1 or Accuracy
• Unseen test set
• avoid overfitting (‘tuning to the test set’)
• Independent estimate of performance
• Cross-validation over multiple splits
• Handle sampling errors from different datasets
• Pool results over each split
• Compute pooled dev set performance
(Figure: the data is split into Training Set, Development Test Set, and Test Set; in cross-validation the Dev Test portion rotates through different parts of the Training Set while the Test Set stays fixed.)
Cross-Validation
• Break up data into 10 folds
• (Equal positive and negative inside each fold?)
• For each fold
• Choose the fold as a temporary test set
• Train on 9 folds, compute performance on the test fold
• Report average performance of the 10 runs
(Figure: iterations 1 to 5, each using a different fold as Test and the remaining folds as Training.)
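The fold loop described above can be sketched in plain Python. This is only an illustration: `k_fold_splits` and `cross_validate` are hypothetical helper names, `evaluate` stands for any user-supplied function that trains on one list and returns a score on the other, and the folds are not stratified by class (the slide's question about equal positives and negatives per fold is left open).

```python
def k_fold_splits(n_items: int, k: int = 10):
    """Yield (train_indices, test_indices) once per fold."""
    indices = list(range(n_items))
    folds = [indices[i::k] for i in range(k)]  # every k-th item goes to the same fold
    for i in range(k):
        test = folds[i]                        # this fold is the temporary test set
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(items, evaluate, k: int = 10) -> float:
    """Average the score returned by evaluate(train, test) over the k runs."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(items), k):
        train = [items[j] for j in train_idx]
        test = [items[j] for j in test_idx]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# Dummy evaluation for demonstration: the score is just the test fold's share of the data.
data = list(range(20))
print(cross_validate(data, lambda train, test: len(test) / len(data), k=10))
```

Each item is used for testing exactly once and for training in the other nine runs.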
END