Data Mining Boot Camp 2. Text Mining

(1)

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Data Mining Boot Camp 2

Text Mining

Natasha Balac, Ph.D.

Director, Predictive Analytics Center of Excellence

San Diego Supercomputer Center

University of California, San Diego

(2)

Text Mining Definition

Discover useful and previously unknown “gems” of information in large text collections

Many definitions in the literature:

The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data

(3)

Text Mining Example

• Research objective:
  • Follow chains of causal implication to discover a relationship between migraines and biochemical levels.
• Data:
  • Medical research papers, medical news (unstructured text information)
• Key concept types:
  • Symptoms, drugs, diseases, chemicals…
• Medical research
  • Find causal links between symptoms or diseases and drugs or chemicals

(4)

Medical Research Example

• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability
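The chain of implication in these findings can be sketched as a search for intermediate concepts that are linked to both migraine and magnesium. This is only an illustration: the concept pairs are transcribed by hand from the statements above, and the function name `linking_concepts` is hypothetical. A real system would first have to extract such pairs from text.

```python
from collections import defaultdict

# Concept pairs transcribed by hand from the slide's findings (an assumption:
# a real pipeline would extract these from the medical literature).
findings = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blocker", "migraine"),
    ("magnesium", "calcium channel blocker"),
    ("spreading cortical depression", "migraine"),
    ("magnesium", "spreading cortical depression"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]


def linking_concepts(pairs, a, c):
    """Return concepts B that are linked to both a and c (A-B and B-C)."""
    neighbors = defaultdict(set)
    for x, y in pairs:
        neighbors[x].add(y)
        neighbors[y].add(x)
    return sorted((neighbors[a] & neighbors[c]) - {a, c})


links = linking_concepts(findings, "migraine", "magnesium")
print(links)
```

No single finding mentions migraine and magnesium together; the connection only appears once the shared intermediate concepts are collected.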

(5)

Data vs. Text Mining Comparison

Data Mining
• Identify data sets
• Select features
• Prepare data
• Identify causal relationship
• Structured numeric transaction data residing
• Analyze distribution

Text Mining
• Identify documents
• Extract features
• Diverse collections and formats
• Linguistic processing
• Select features by algorithm
• Prepare data
• Analyze distribution

(6)

• Information Retrieval
  • Indexing and retrieval of textual documents
• Information Extraction
  • Extraction of partial knowledge in the text
• Web Mining
  • Indexing/retrieval of textual docs and extraction of knowledge
• Document Classification
  • Classifying similar documents, paragraphs, etc.
• Document Clustering
  • Generating collections of similar text documents
• NLP – Natural Language Processing
• Concept Extraction - semantically similar grouping

(7)

“Search” vs. “Discover”

                          Search (goal-oriented)    Discover (opportunistic)
Structured Data           Data Retrieval            Data Mining
Unstructured Data (Text)  Information Retrieval     Text Mining

(8)

Data Retrieval

• Find records within a structured database

Database Type: Structured
Search Mode: Goal-driven
Atomic entity: Data Record
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true”
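The example query can be tried against a small in-memory table. This is a minimal sketch using Python's built-in `sqlite3` module; the `restaurants` schema and the four sample rows are invented for illustration, and `has_veg` is stored as 0/1 because SQLite has no separate boolean type.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Sakura House", "boston", "japanese", 1),
        ("Tokyo Grill", "boston", "japanese", 0),
        ("Green Leaf", "boston", "indian", 1),
        ("Kyoto Garden", "chicago", "japanese", 1),
    ],
)

# Data retrieval: every condition is matched exactly against a field.
rows = conn.execute(
    "SELECT name FROM restaurants WHERE city = ? AND type = ? AND has_veg = ?",
    ("boston", "japanese", 1),
).fetchall()
matches = [name for (name,) in rows]
print(matches)
conn.close()
```

A record either satisfies all three conditions or it does not; there is no notion of a partially relevant result.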

(9)

Information Retrieval

• Find relevant information in an unstructured information source (typically text)

Database Type: Unstructured
Search Mode: Goal-driven
Atomic entity: Document
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “Japanese restaurant Boston” or Boston->Restaurants->Japanese
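A keyword query of this kind can be sketched with a tiny inverted index that ranks documents by the number of query terms they contain. The three documents and the scoring rule are assumptions chosen for illustration; real retrieval engines use weighted scoring rather than a raw term count.

```python
from collections import defaultdict

# Made-up document collection (illustration only).
docs = {
    "d1": "Sakura House is a Japanese restaurant in Boston with vegetarian sushi",
    "d2": "A new Italian restaurant opened in Boston",
    "d3": "Japanese gardens to visit in Kyoto",
}

# Inverted index: term -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)


def search(query):
    """Rank documents by how many query terms they contain (no weighting)."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


ranking = search("Japanese restaurant Boston")
print(ranking)
```

Unlike the structured query, every document that shares at least one term is returned, ordered by a relevance estimate.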

(10)

Intelligent Information Retrieval

• Meaning of words
  • Synonyms: “buy” / “purchase”
  • Ambiguity: “bat” (baseball vs. mammal)
• Order of words in the query
  • hot dog stand in the amusement park
  • hot amusement stand in the dog park
• User dependency for the data
  • direct feedback
  • indirect feedback
• Authority of the source
  • IBM is more likely to be an authoritative source than my distant second cousin

(11)

• Given:
  • A source of textual documents
  • A well-defined, limited query (text based)
• Find:
  • Sentences with relevant information
  • Extract the relevant information and ignore non-relevant information (important!)
  • Link related information and output in a predetermined format

(12)

Information Extraction: Example

• Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.

• Incident Date: 19 Apr 89
• Incident Type: Bombing
• Perpetrator Individual ID: “urban guerillas”
• Human Target Name: “Roberto Garcia Alvarado”
• ...
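Filling such a template can be sketched with a few hand-written patterns. The three regular expressions below are assumptions tuned to this one passage and are not how a production extraction system works. The incident date is left out because it does not appear in the excerpt itself.

```python
import re

# Excerpt of the passage above (elided parts omitted).
text = (
    "Salvadoran President-elect Alfredo Cristiani condemned the terrorist "
    "killing of Attorney General Roberto Garcia Alvarado and accused the "
    "Farabundo Marti National Liberation Front (FMLN) of the crime. "
    "Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas "
    "on his vehicle exploded as it came to a halt at an intersection in "
    "downtown San Salvador."
)


def fill_template(passage):
    """Fill template slots with hand-written, passage-specific patterns."""
    template = {}
    if re.search(r"\bbomb\b", passage):
        template["Incident Type"] = "Bombing"
    # Capitalized words following the title form the target's name.
    target = re.search(
        r"killing of Attorney General ((?:[A-Z][a-z]+\s?)+)", passage
    )
    if target:
        template["Human Target Name"] = target.group(1).strip()
    # Lowercase phrase between "placed by" and "on" names the perpetrator.
    perp = re.search(r"bomb placed by ([a-z ]+?) on ", passage)
    if perp:
        template["Perpetrator Individual ID"] = perp.group(1)
    return template


template = fill_template(text)
print(template)
```

Everything in the passage that matches no pattern is ignored, which mirrors the "ignore non-relevant information" step.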

(13)

Text Mining

• Discover new knowledge through analysis of text

Database Type: Unstructured
Search Mode: Opportunistic
Atomic entity: Language feature or concept
Example Information Need: “Find the types of food poisoning most often associated with Japanese restaurants”
Example Query: Rank diseases found associated with

(14)

Motivation for Text Mining

• Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
• Information-intensive business processes demand that we move beyond simple document retrieval to “knowledge” discovery.

Structured Numerical or Coded Information: 10%
Unstructured or Semi-structured Information: 90%

(15)

Challenges of Text Mining

• Very high number of possible “dimensions”
  • All possible word and phrase types in the language!
• Unlike data mining:
  • records (= docs) are not structurally identical
  • records are not statistically independent
• Complex and subtle relationships between concepts in text
  • “AOL merges with Time-Warner”
  • “Time-Warner is bought by AOL”
• Ambiguity and context sensitivity
  • automobile = car = vehicle = Toyota
  • Apple (the company) or apple (the fruit)

(16)

Challenges in Text Mining

• Information is in unstructured textual form
• Not readily accessible to be used by computers
• Dealing with huge collections of documents
• Language is ambiguous and context dependent

(17)

Text Mining Challenges

• Homographs
  • Bat – a piece of sporting equipment in baseball
  • Bat – a winged animal associated with vampires
• Synonyms – different words, same meaning
• Polysemy – same word form, different meaning
• Hyponymy – concept hierarchy

(18)

Text Processing

• Statistical Analysis
  • Quantify text data
• Language or Content Analysis
  • Identifying structural elements
  • Extracting and codifying meaning

(19)

Two Mining Phases

• Knowledge Discovery: Extraction of codified information (features)
• Information Distillation: Analysis of the feature distribution

(20)

Statistical Analysis

• Use statistics to add a numerical dimension to unstructured text
  • Term frequency
  • Document length
  • Document frequency
  • Term proximity
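The four statistics can be computed directly from tokenized text. The sketch below uses three invented sentences and whitespace tokenization. The function names are hypothetical, and "proximity" is taken here to mean the smallest distance in token positions, which is one of several possible definitions.

```python
# Three made-up documents, tokenized on whitespace (an assumption).
docs = [
    "the bank raised the interest rate",
    "the river bank flooded",
    "interest in the river trail is rising",
]
tokenized = [d.split() for d in docs]


def term_frequency(tokens, term):
    """Number of times the term occurs in one document."""
    return tokens.count(term)


def document_frequency(corpus, term):
    """Number of documents in the corpus that contain the term."""
    return sum(1 for tokens in corpus if term in tokens)


def min_proximity(tokens, a, b):
    """Smallest distance in token positions between a and b, or None."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


print(term_frequency(tokenized[0], "the"))       # term frequency
print(len(tokenized[0]))                         # document length
print(document_frequency(tokenized, "bank"))     # document frequency
print(min_proximity(tokenized[0], "bank", "rate"))  # term proximity
```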

(21)

Content Analysis

• Lexical and Syntactic Processing
  • Recognizing “tokens” (terms)
  • Normalizing words
  • Language constructs (parts of speech, sentences, paragraphs)
• Semantic Processing
  • Extracting meaning
  • Named Entity Extraction (people names, company names, locations, etc.)
• Extra-semantic features
  • Identify feelings or sentiment in text

(22)

(23)

Text mining process

• Text preprocessing
  • Syntactic/Semantic text analysis
• Features Generation
  • Bag of words
• Features Selection
  • Simple counting
  • Statistics
• Text/Data Mining
  • Classification: Supervised learning
  • Clustering: Unsupervised learning
• Analyzing results

(24)

(25)

Feature Extraction

• To recognize and classify significant vocabulary items in unrestricted natural language texts
• Very fast processing to be able to deal with mass data

(26)

Bag-of-Tokens Approaches

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …

Feature Extraction

nation – 5
civil – 1
war – 2
men – 2
died – 4
people – 5
Liberty – 1
God – 1
…

Loses all order-specific information!
Severely limits context!
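A bag of tokens can be built with a counter over lowercased words. The sketch below runs only on the excerpt quoted above, so its counts are smaller than the slide's, which presumably refer to the full speech. The regular-expression tokenizer is a simplifying assumption.

```python
import re
from collections import Counter

# Only the excerpt shown on the slide, not the full speech.
text = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation"
)

# Lowercase the text and keep runs of letters as tokens.
tokens = re.findall(r"[a-z]+", text.lower())
bag = Counter(tokens)
print(bag["nation"], bag["civil"], bag["are"])

# Word order is lost: these two sentences produce the same bag.
same_bag = Counter("dog bites man".split()) == Counter("man bites dog".split())
print(same_bag)
```

The last comparison shows concretely what "loses all order-specific information" means: two sentences with opposite meanings are indistinguishable.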

(27)

Natural Language Processing (NLP)

A dog is chasing a boy on the playground

Lexical analysis (part-of-speech tagging):
A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun

Syntactic analysis (Parsing):
Noun Phrase (A dog), Complex Verb (is chasing), Noun Phrase (a boy), Prep Phrase (on the playground), Verb Phrase, Sentence

Semantic analysis:
Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Inference:
Scared(x) if Chasing(_,x,_). + Scared(b1)

Pragmatic analysis (speech act):
A person saying this may be reminding another person to get the dog back…

(28)

General NLP: Too Difficult!

(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)

• Word-level ambiguity
  • “design” can be a noun or a verb (Ambiguous POS)
  • “root” has multiple meanings (Ambiguous sense)
• Syntactic ambiguity
  • “natural language processing” (Modification)
  • “A man saw a boy with a telescope.” (PP Attachment)
• Anaphora resolution
  • “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)
• Presupposition
  • “He has quit smoking.” implies that he smoked before.

Humans rely on context to interpret (when possible).
This context may extend beyond a given document!

(29)

Shallow Linguistics

• Progress on Useful Sub-Goals:
  • English Lexicon
  • Text Normalization
    • Lower case
    • Typos, misspelled words
  • Syntactic analysis
  • Recognizing larger constructs
  • Part-of-Speech Tagging
  • Word Sense Disambiguation

(30)

Named Entity Extraction

• Identify and type language features
• Examples:
  • People names
  • Company names
  • Geographic location names
  • Dates
  • Monetary amounts

(31)

Canonical Forms

• Normalized forms of dates, numbers, …
• Allows applications to use information very easily
• Abstracts from different morphological variants of a single term

(32)

Canonical Names

President Bush
Mr. Bush
George Bush

Canonical Name: George Bush

• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document

(33)

Tokenization

• Convert streams of characters into “words”
  • Main clues (in English): white space
  • Words can contain special characters, such as these: . , ’ – etc.
• No single algorithm “works” always
  • Some languages do not have white space
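One possible tokenizer keeps periods, apostrophes, and hyphens when they sit between letters or digits and drops them elsewhere. The pattern below is an assumption chosen for this example. As the slide warns, it is not a general solution, and it does nothing for languages written without white space.

```python
import re

# A token is a run of letters/digits, optionally continued across an
# internal period, apostrophe, or hyphen (trailing punctuation is dropped).
TOKEN = re.compile(r"[A-Za-z0-9]+(?:[.'’\-][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into word tokens using the TOKEN pattern."""
    return TOKEN.findall(text)


tokens = tokenize("Dr. O'Neil's state-of-the-art U.S. lab didn't close.")
print(tokens)
```

The abbreviation "U.S." comes out without its final period, which shows how even a careful rule makes arguable choices.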

(34)

Stemming

• Normalizes / unifies variations of the same idea
  • “walking”, “walks”, “walked”, “walker” => “walk”
• Inflectional Stemming
  • Remove plurals
  • Normalize verb tenses
  • Remove other affixes
• Stemming to root
  • Reduce word to most basic element
  • More aggressive than inflectional
  • Examples
    • “denormalization” -> “norm”
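The "walk" example can be reproduced with a toy suffix stripper. This is a deliberately naive sketch with an invented four-suffix rule list. It is not the Porter stemmer or any other published algorithm, and it mishandles many words.

```python
# Toy rule list (an assumption): longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip one common suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [naive_stem(w) for w in ["walking", "walks", "walked", "walker", "walk"]]
print(stems)

# A failure case: the doubled consonant is not handled.
print(naive_stem("running"))
```

The second call returns "runn", which is why practical stemmers need many more rules than this.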

(35)

Data Mining - Volinsky - 2011 - Columbia University

Stop words

• Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
  • a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
  • Typically about 400 to 500 such words
  • For an application, an additional domain-specific stop word list may be constructed
• Why do we need to remove stop words?
  • Reduce indexing (or data) file size
    • Stop words account for 20-30% of total word counts
  • Improve efficiency
    • Stop words are not useful for searching or text mining
    • Stop words always have a large number of hits
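Stop word removal is a simple filter over the token list. The sketch below uses exactly the words listed on the slide. The example sentence and the function name `remove_stop_words` are illustrative assumptions.

```python
# Stop word list copied from the slide (a full list would be much longer).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = "the quick brown fox jumps over the lazy dog".split()
content = remove_stop_words(tokens)
print(content)
```

Note that "over" survives because it is not in the short list shown on the slide.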

(36)

• Part Of Speech (POS) tagging
  • Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
  • ~98% accurate
• Word sense disambiguation
  • Context based or proximity based
  • Very accurate
• Parsing
  • Generates a parse tree (graph) for each sentence
  • Each sentence is a stand-alone graph

(37)

Simple Entity Extraction

“The quick brown fox jumps over the lazy dog”

Noun phrase

Mammal

Canidae

Mammal

(38)

(39)

(40)

Document Frequency

• Number of documents the term occurs in

Assumptions:
• Terms that occur in fewer documents are more specific to a document and more descriptive of the content: rarity matters
• Terms that occur in most documents are common words, not as descriptive
  • Often true
• Sometimes rare terms just reflect textual variants (synonyms), regional differences, or personal style

(41)

Term Frequency

• Two-fold heuristics based on frequency
  • TF (Term frequency)
    • More frequent within a document → more relevant to semantics
    • e.g., “query” vs. “commercial”
  • IDF (Inverse document frequency)
    • Less frequent among documents → more discriminative
    • e.g., “algebra” vs. “science”

(42)

Feature Extraction And Reduction

• TF: Counts of keywords in field
• Inverse Document Frequency - IDF
  IDF = log(1 + NumDocs / (NumDocs with Term))
  Interested in: TF*IDF
• Multiple-word phrases: n-grams
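The weighting on this slide can be sketched in a few lines using the IDF formula exactly as written, log(1 + NumDocs / NumDocs with Term), with the natural logarithm as an assumption since the slide does not state a base. The three documents and the helper names are invented for illustration.

```python
import math
from collections import Counter

# Made-up corpus, tokenized on whitespace.
docs = [
    "algebra is a branch of science",
    "science news and science policy",
    "cooking is an art and a science",
]
tokenized = [d.split() for d in docs]


def idf(term, corpus):
    """IDF as on the slide: log(1 + NumDocs / NumDocs with Term)."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(1 + len(corpus) / df) if df else 0.0


def tf_idf(tokens, corpus):
    """Weight each term of one document by raw count times IDF."""
    counts = Counter(tokens)
    return {term: tf * idf(term, corpus) for term, tf in counts.items()}


def ngrams(tokens, n):
    """Contiguous n-word phrases from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


weights = tf_idf(tokenized[0], tokenized)
print(weights["algebra"], weights["science"])
print(ngrams(tokenized[1], 2))
```

In the first document "algebra" outweighs "science" even though both occur once, because "science" appears in every document.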

(43)

Data Mining - Volinsky - 2011 - Columbia University

Document Distance

• Pairwise distances between documents
• Image plots of cosine distance, Euclidean, and
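Two of the distances named here can be computed on raw term-count vectors as follows. This is a minimal sketch with three invented documents; cosine distance is taken to be one minus cosine similarity, which is a common convention but still an assumption about what the slide plotted.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Term-count vector for a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]


def cosine_distance(u, v):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def euclidean_distance(u, v):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


docs = ["football fan football", "football fan", "bank statement"]
vocab = sorted({t for d in docs for t in d.split()})
vectors = [vectorize(d, vocab) for d in docs]

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j,
              round(cosine_distance(vectors[i], vectors[j]), 3),
              round(euclidean_distance(vectors[i], vectors[j]), 3))
```

Cosine distance ignores document length, so the two football documents come out close even though their counts differ.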

(44)

Extra-semantic Information

• Extracting hidden meaning or sentiment based on use of language.
  • Examples:
    • “Customer is unhappy with their service!”
    • Sentiment = discontent
• Sentiment is:
  • Emotions: fear, love, hate, sorrow
  • Feelings: warmth, excitement
  • Mood, disposition, temperament, …
• Or even (someday)…

(45)

• Given: a collection of labeled records (training set)
  • Each record contains a set of features (attributes), and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
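The train/test procedure can be sketched without any library. In the example below the ten records are synthetic, the 30% test fraction and fixed seed are arbitrary choices, and the "model" is only a majority-class baseline standing in for a real learner.

```python
import random


def split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_majority(train):
    """Baseline model: always predict the most common training label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority


def accuracy(model, test):
    """Fraction of held-out records whose label the model predicts."""
    return sum(model(f) == y for f, y in test) / len(test)


# Synthetic labeled records: (features, label).
records = [({"id": i}, "Yes" if i % 3 else "No") for i in range(10)]
train, test = split(records)
model = train_majority(train)
acc = accuracy(model, test)
print(len(train), len(test), acc)
```

The key point is that the records used to compute `acc` were never seen by `train_majority`.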

(46)

• Supervised learning (classification)
  • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

(47)

• Given: a set of documents and a similarity measure among documents
• Find: clusters such that:
  • Documents in one cluster are more similar to one another
  • Documents in separate clusters are less similar to one another
• Goal:
  • Finding a correct set of documents

(48)

Classification: An Example

Training Set

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married                 Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England                  20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Test Set

Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
         Divorced        90K     ?
         Single          40K     ?
Italy    Married         80K     ?

Training Set → Learn Classifier → Model, which is then applied to the Test Set

(49)

Text Classification: An Example

Training Set

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Test Set

Text                                              Hooligan
A Danish football fan                             ?
Turkey is playing vs. France. The Turkish fans …  ?

Training Set → Learn Classifier → Model, which is then applied to the Test Set

(50)

Decision Tree: An Example

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Splitting Attributes: English, MarSt, Income

English?
  Yes: YES
  No: MarSt?
    Married: NO
    Single, Divorced: Income?
      > 80K: YES
      < 80K: NO

The splitting attribute at a node is determined based on a specific
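The tree can be written as nested conditions and checked against the ten training records. The code follows one reading of the flattened diagram (English first, then MarSt, then Income with > 80K leading to YES), which is an assumption. An income of exactly 80K is not covered by the slide and is treated as NO here.

```python
def classify(country, marital, income_k):
    """One reading of the slide's tree: English? -> MarSt -> Income."""
    if country == "England":
        return "Yes"
    if marital == "Married":
        return "No"
    # Single or Divorced: split on income (exactly 80K falls to "No").
    return "Yes" if income_k > 80 else "No"


# Training records from the table: (country, marital status, income in K, label).
training = [
    ("England", "Single", 125, "Yes"),
    ("England", "Married", 100, "Yes"),
    ("England", "Single", 70, "Yes"),
    ("Italy", "Married", 40, "No"),
    ("USA", "Divorced", 95, "No"),
    ("England", "Married", 60, "Yes"),
    ("England", "Divorced", 20, "Yes"),
    ("Italy", "Single", 85, "Yes"),
    ("France", "Married", 75, "No"),
    ("Denmark", "Single", 50, "No"),
]

predictions = [classify(c, m, k) for c, m, k, _ in training]
errors = [i + 1 for i, (p, row) in enumerate(zip(predictions, training)) if p != row[3]]
train_accuracy = 1 - len(errors) / len(training)
print(train_accuracy, errors)
```

Under this reading the tree reproduces nine of the ten training labels; record 5 (USA, Divorced, 95K) is the exception, so either the reading or the table differs slightly from what the original figure intended.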

(51)

Decision Tree: A Text Example

Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Splitting Attributes: English, MarSt, Income

English?
  Yes: YES
  No: MarSt?
    Married: NO
    Single, Divorced: Income?
      > 80K: YES
      < 80K: NO

The splitting attribute at a node is determined based on a specific

(52)

Decision Support using Bank Call Center Data

• The Needs:
  • Analysis of call records as input into decision-making process of Bank’s management
  • Quick answers to important questions
    • Which offices receive the most angry calls?
    • What products have the fewest satisfied customers?
    • (“Angry” and “Satisfied” are recognizable sentiments)
  • User friendly interface and visualization tools

(53)

Decision Support using Bank Call Center Data

• The Information Source:
  • Call center records
  • Example:
    AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the

(54)

Call Volume by Sentiment

[Chart: Negative Calls Related to Bank Statements, by city (Cleveland, New York, Boston); axis range 0 to 1000]

(55)

Movie Review Task

• Build a model from the movie review DB to classify positive from negative reviews

(56)

Movie Review Data

• 1000 positive movie reviews and 1000 negative review texts from

(57)

Sentiment Polarity Dataset Version 2.0

1000 positive movie reviews and 1000 negative review texts from:

Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79--86, 2002.

“Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”

(58)

Weka

• TextMining Data Set
• Use CLI

(59)

(60)

(61)

(62)
