Natural Language Processing AI:

(1)

AI:

Natural Language Processing

Dr. Muhammad Humayoun Assistant Professor

University of Central Punjab, Lahore.

[email protected]

Slides of Prof. Dr. Dan Jurafsky and Prof. Dr. Chris Manning, Stanford

(2)

Text

Classification and Naïve

Bayes

The Task of Text

Classification

(3)

Dan Jurafsky

Is this spam? ^Features

(4)

Dan Jurafsky

Authorship attribution

Who wrote which Federalist papers?

• 1787-8: anonymous essays try to convince New York to ratify U.S Constitution: Jay, Madison, Hamilton.

• Authorship of 12 of the letters in dispute

• 1963: solved by Mosteller and Wallace using Bayesian methods

James Madison Alexander Hamilton

Later gave rise to the use of naïve Bayes

(5)

Dan Jurafsky

Gender Identification

Male or female author?

1. By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China;

the central area with its imperial capital at Hue was the protectorate of Annam…

2. Clara never failed to be astonished by the extraordinary felicity of her own name. She found it hard to trust herself to the

mercy of fate, which had managed over the years to convert her greatest shame into one of her greatest assets…

S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp.

321–346

More use of facts and determinants

More use of pronouns

(6)

Dan Jurafsky

Sentiment Analysis

Positive or negative movie review?

• unbelievably disappointing

• Full of zany characters and richly applied satire, and some great plot twists

• this is the greatest screwball comedy ever filmed

• It was pathetic. The worst part about it was the boxing scenes.

6

Very important commercial application We’ll do it in Weka

(7)

Dan Jurafsky

Topic Extraction

What is the subject of this article?

• Antagonists and Inhibitors

• Blood Supply

• Chemistry

• Drug Therapy

• Embryology

• Epidemiology

• …

7

MeSH Subject Category Hierarchy

?

MEDLINE Article

(8)

Dan Jurafsky

Text Classification Tasks

• Assigning subject categories, topics, or genres

• Spam detection

• Authorship identification

• Age/gender identification

• Language Identification

• Sentiment analysis

• …

(9)

Dan Jurafsky

Text Classification: definition

• Input:

• a document d

• a fixed set of classes C = {c

₁

, c

₂

,…, c

_J

}

• Output: a predicted class c  C

(10)

Dan Jurafsky

Classification Methods:

Hand-coded rules

• Rules based on combinations of words or other features

• spam: black-list-address OR (“dollars” AND “have been selected”)

• Accuracy can be high

• If rules carefully refined by expert

• But building and maintaining these rules is expensive

• These rules may be used as a part of a larger systems

• In which rules are combined with more general methods from ML

(11)

Dan Jurafsky

Classification Methods:

Supervised Machine Learning

• Input:

• a document d

• a fixed set of classes C = {c

₁

, c

₂

,…, c

_J

}

• A training set of m hand-labeled documents (d

₁

,c

₁

),....,(d

_m

,c

_m

)

• Output:

• a learned classifier :d  _c

• is a function that takes d as input and returns c as output

•

11

(12)

Illustrative Example

12

Target Label Y

Feature

vector

(13)

Another example

13

Size (feet²)

x1

Number of bedrooms

x2

Number of floors

x3

Age of home (years)

x4

Price ($1000)

2104 5 1 45 460

1416 3 2 40 232

1534 3 2 30 315

852 2 1 36 178

… … … … …

Target Label Y Features

(14)

Movie review data example

From text to vectors

14

Target Label Y Features

(15)

Target Label Feature

vector

• We train function f on particular known inputs X

_m

.

• We may test them for unknown inputs X for which we want to

classify or predict etc.

Supervised Learning

(16)

Supervised Learning

16

Training Set

Learning Algorithm

Size of h house

Estimated price

x Hypothesis Estimated

value of y

h is a function

h maps from x’s to y’s

(17)

Dan Jurafsky

A training set of m

• Hand-labeled documents (d

₁

,c

₁

),....,(d

_m

,c

_m

)

• Assume we have two classes: :+ve, :-ve

• Assume we are documents containing one comment each

• Dataset m

• “This movies is unbelievably disappointing ”, -ve

• “Full of zany characters and richly applied satire, and some great plot twists”, +ve

• “this is the greatest screwball comedy ever filmed”, +ve

• “It was pathetic. The worst part …”, -ve

•

17

(18)

Dan Jurafsky

Classification Methods:

Supervised Machine Learning

• Any kind of classifier

• Naïve Bayes

• Logistic regression

• Support-vector machines (SVM)

• k-Nearest Neighbors

• …

(19)

Text

Classification and Naïve

Bayes

Naïve Bayes (I)

(20)

Dan Jurafsky

Naïve Bayes Intuition

• Simple (“naïve”) classification method based on Bayes rule

• Relies on very simple representation of document

• Bag of words

• Similar to unigram, it loses context information

(21)

Dan Jurafsky

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the

adventure scenes are fun… It manages to be whimsical and

romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm

always happy to see it again whenever I have a friend who hasn't seen it yet.

γ

( )=c

(22)

Dan Jurafsky

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the

adventure scenes are fun… It manages to be whimsical and

romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm

always happy to see it again whenever I have a friend who hasn't seen it yet.

γ

( )=c

(23)

Dan Jurafsky

The bag of words representation:

using a subset of words

x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx

xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxx recommend xxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxx

γ

( )=c

Stop-word removal

Like unigram, it loses context information

(24)

Dan Jurafsky

The bag of words representation

γ

( )=c

great 2

love 2

recommend 1

laugh 1

happy 1

... ...

All we care about are counts

Vector of words and corresponding counts

(25)

Dan Jurafsky

Planning GUI

Garbage Collection Machine

Learning NLP

parser tag

training translation language...

learning training algorithm shrinkage network...

garbage collection memory

optimization region...

Test document

parser language label

translation

…

Bag of words for document classification More than two classes

planning ...

temporal reasoning plan

language...

c1 c2

?

c3 c4 c5

(26)

Text

Classification and Naïve

Bayes

Formalizing the Naïve

Bayes Classifier

(27)

Dan Jurafsky

Bayes’ Rule Applied to Documents and Classes

• For a document d and a class c

� ( _� | _{� ��} ) ₌ ^� ( ��∨� ) × � (�)

� ( �� )

P(c| d) = P(d| c)P(c)

P(d)

(28)

Dan Jurafsky

Naïve Bayes Classifier (I)

MAP is “maximum a posteriori” = most likely class

Bayes Rule Bayes Rule

Dropping the denominator Dropping the denominator

c

_MAP

= argmax

cC

P(c| d)

= argmax

cC

P(d| c)P(c) P(d)

= argmax

cC

P(d| c)P(c)

(29)

Dan Jurafsky

Naïve Bayes Classifier (II)

Document d represented as features Document d represented as features

c

_MAP

= argmax

cC

P(d| c)P(c)

) ( )

| ,

, ,

(

argmax P x

₁

x

₂

x

_n

c P c

C

c





=

(30)

Dan Jurafsky

Naïve Bayes Classifier (IV)

How often does this class occur?

O(|X|

ⁿ

•|C|) parameters O(|X|

ⁿ

•|C|) parameters

We can just count the relative frequencies in a corpus

Could only be estimated if a very, very large number of training examples was

available.

Could only be estimated if a very, very large number of training examples was

available.

) ( )

| ,

, ,

(

argmax P x

₁

x

₂

x c P c

c

_n

C

MAP c





=

(31)

Dan Jurafsky

Multinomial Naïve Bayes Independence Assumptions

• Bag of Words assumption: Assume position doesn’t matter

• Conditional Independence: Assume the feature

probabilities P(x

_i

|c

_j

) are independent given the class c.

)

| ,

, ,

( x

₁

x

₂

x c

P 

_n

)

| (

...

)

| ( )

| (

)

| ( )

| ,

,

( x

₁

x c P x

₁

c P x

₂

c P x

₃

c P x c

P 

_n

=    

_n

(32)

Dan Jurafsky

Multinomial Naïve Bayes Classifier

�

_��

=��

_�∈C

� ( ^�

�

) [ ^� ⁽ ^�

¹

^| ^�

^�

⁾ ^{∗ �} ⁽ ^�

²

^| ^�

^�

⁾ ^{∗ …∗ �} ⁽ ^�

^�

^| ^�

^�

⁾ ]

) ( )

| ,

, ,

(

argmax P x

₁

x

₂

x c P c

c

_n

C

MAP c





=





=

X x C j

NB c

P c P x c

c argmax ( ) ( | )

(33)

Dan Jurafsky

Applying Multinomial Naive Bayes Classifiers to Text Classification

positions  all word positions in text document

)

Classes: {c1, c2}





=



positions i

j i

C j

NB c

P c P x c

c argmax ( ) ( | )

j

(34)

Text

Classification and Naïve

Bayes

Naïve Bayes:

Learning

(35)

Dan Jurafsky

Learning the Multinomial Naïve Bayes Model

• First attempt: maximum likelihood estimates

• simply use the frequencies in the data

Sec.13.3

P(w ˆ

_i

| c

_j

) = count(w

_i

, c

_j

) count(w, c

_j

)

wV

å

doc

j

N

c C

document c

P ( )

)

ˆ ( =

=

(36)

Dan Jurafsky

• Create mega-document for topic j by concatenating all docs in this topic

• Use frequency of w in mega-document

Parameter estimation

fraction of times word w

_i

appears among all words in documents of topic c

_j

Doc_Pos Doc_Neg

P(w

ˆ

_i

|

c_j

) =

count(w_i

,

c_j

)

count(w, c_j

)

wV

å

) negative ,

(

) negative ,

fantastic"

"

) ( negative fantastic"

"

ˆ(

) positive ,

(

) positive ,

fantastic"

"

) ( positive fantastic"

"

ˆ(

å å



=

V w

w count count

P

w count count

P

(37)

Dan Jurafsky

Problem with Maximum Likelihood

• What if we have seen no training documents with the word fantastic and classified in the topic positive (thumbs-up)?

• Zero probabilities cannot be conditioned away, no matter the other evidence!

Sec.13.3

P("fantastic" positive) = ˆ count("fantastic", positive) count(w, positive

wV

å

⁾ ^{= 0}

0 )

| ˆ (

) ˆ (

argmax =

=

c



i i

MAP

P c P x c

c

(38)

Dan Jurafsky

Laplace (add-1) smoothing for Naïve Bayes

in training phase

P(w ˆ

_i

| c) = count(w

_i

, c)+1 count(w, c)+1

( )

wV

å

= count(w

_i

, c) +1 count(w, c

wV

å ⁾

æ

è çç ö

ø ÷÷ + V P(w ˆ

_i

| c) = count(w

_i

, c)

count(w, c)

( )

wV

å

(39)

Dan Jurafsky

Multinomial Naïve Bayes: Learning

•

Calculate P(c

_j

) terms

• For each c_jin C do

docs_j  all docs with class =c_j

•

Calculate P(w

_k | c_j

) terms

• Text_j  single doc containing all docs_j

• Foreach word w_k in Vocabulary

n_k  # of occurrences of w_k in Text_j

•

From training corpus, extract Vocabulary. i.e. list f words

P(w_k

|

c_j

) 

n_k

+ a

n+

a |

Vocabulary|

P(c_j

)  |

docs_j

|

| total # documents|

(40)

Text

Classification and Naïve

Bayes

Multinomial Naïve Bayes: A Worked

Example

(41)

Dan Jurafsky

Choosing a class:

P(c1|d5)

P(j1|d5) 1/4 * (2/9)³ * 2/9 * 2/9

≈ 0.0001

Doc Words Class

Training 1 Chinese Beijing Chinese c1

2 Chinese Chinese Shanghai c1

3 Chinese Macao c1

4 Tokyo Japan Chinese j1

Test 5 Chinese Chinese Chinese Tokyo Japan ?

41

Conditional Probabilities:

P(c1)=

P(j1)=

3 4 1

4

(5+1) / (8+6) = 6/14 = 3/7 (0+1) / (8+6) = 1/14

(1+1) / (3+6) = 2/9 (0+1) / (8+6) = 1/14 (1+1) / (3+6) = 2/9 (1+1) / (3+6) = 2/9

3/4 * (3/7)³ * 1/14 * 1/14

≈ 0.0003

P(w| c) =ˆ count(w, c)+1 count(c)+ |V |

P(c) =ˆ N_c N



(42)

Dan Jurafsky

Naïve Bayes in Spam Filtering

• SpamAssassin Features:

• Mentions Generic Viagra

• Online Pharmacy

• Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)

• Phrase: impress ...

• From: starts with many numbers

• Subject is all capitals

• HTML has a low ratio of text to image area

• One hundred percent guaranteed

• Claims you can be removed from the list

• 'Prestigious Non-Accredited Universities'

• http://spamassassin.apache.org/tests_3_3_x.html

(43)

Dan Jurafsky

Summary: Naive Bayes is Not So Naive

• Very Fast, low storage requirements

• Robust to Irrelevant Features

Irrelevant Features cancel each other without affecting results

• Very good in domains with many equally important features

Decision Trees suffer from fragmentation in such cases – especially if little data

• Optimal if the independence assumptions hold:

If assumed

independence is correct, then it is the Bayes Optimal Classifier for problem

• A good dependable baseline for text classification

• But we will see other classifiers that give better accuracy

(44)

Text

Classification and Naïve

Bayes

Precision, Recall, and

the F measure

(45)

Dan Jurafsky

The 2-by-2 contingency table

correct not correct

Selected(SPAM) tp fp

not selected(NS) fn tn

Predicted Class by System

Actual Class

SPAM Not SPAM

��= �� +��

�� +�� +�� +��

Not good for Skewed classes

(46)

Dan Jurafsky

Not good for Skewed classes

correct not correct

Selected tp fp

not selected Fn = 10 tn = 99,990 Actual Class

Shoe brand Anything else

��= 0+ 99,990

0 +0+ 10+99,900 =99.99 %

0 100,000

10 99,990

(47)

Dan Jurafsky

Precision and recall

• Precision: % of selected items that are correct

• Recall: % of correct items that are selected

correct not correct

Selected (SPAM) tp fp

not selected(NS) fn tn

Predicted Class by System

Actual Class

SPAM Not SPAM

(48)

Precision(P) Recall (R) Average F₁ Score

Algorithm 1 0.5 0.4 0.45 0.444

Algorithm 2 0.7 0.1 0.4 0.175

Algorithm 3 0.02 1.0 0.51 0.0392

F

₁

Score (F score)

How to compare precision/recall numbers?

Average:

F

₁

Score:

48

(49)

Text

Classification and Naïve

Bayes

Text Classification:

Evaluation

(50)

Dan Jurafsky

Development Test Sets and Cross-validation

• Metric: P/R/F1 or Accuracy

• Unseen test set

• avoid overfitting (‘tuning to the test set’)

• Independent estimate of performance

• Cross-validation over multiple splits

• Handle sampling errors from different datasets

• Pool results over each split

• Compute pooled dev set performance

Training set Development Test Set Test Set

Test Set Training Set

Training SetDev Test Training Set

Dev Test Dev Test

(51)

Dan Jurafsky

Cross-Validation

• Break up data into 10 folds

• (Equal positive and negative inside each fold?)

• For each fold

• Choose the fold as a temporary test set

• Train on 9 folds, compute performance on the test fold

• Report average

performance of the 10 runs

Training Test

Test

Test Training

Training Training

Training

Training Iteration

1

2

3

4

5

(52)

Dan Jurafsky

END

52

Natural Language Processing AI:

AI: