
SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Data Mining Boot Camp 2

Text Mining

Natasha Balac, Ph.D.

Director, Predictive Analytics Center of Excellence

San Diego Supercomputer Center

University of California, San Diego


Text Mining Definition

Discover useful and previously unknown “gems” of information in large text collections

Many definitions in the literature:

The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data


© 2002, AvaQuest Inc.

Text Mining Example

Medical research: find causal links between symptoms or diseases and drugs or chemicals

Research objective:

• Follow chains of causal implication to discover a relationship between migraines and biochemical levels.

Data:

• medical research papers, medical news (unstructured text information)

Key concept types:

• symptoms, drugs, diseases, chemicals…


© 2002, AvaQuest Inc.

Medical Research Example

• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability


Data vs. Text Mining Comparison

Data Mining

• Identify data sets
• Select features
• Prepare data
• Identify causal relationship
• Structured numeric transaction data residing
• Analyze distribution

Text Mining

• Identify documents
• Extract features
• Diverse collections and formats
• Linguistic processing
• Select features by algorithm
• Prepare data
• Analyze distribution


Information Retrieval

• Indexing and retrieval of textual documents

Information Extraction

• Extraction of partial knowledge in the text

Web Mining

• Indexing/retrieval of textual docs and extraction of knowledge

Document Classification

• Classifying similar documents, paragraphs, etc.

Document Clustering

• Generating collections of similar text documents

NLP – Natural Language Processing

Concept Extraction – semantically similar grouping


© 2002, AvaQuest Inc.

“Search” vs. “Discover”

                           Search (goal-oriented)   Discover (opportunistic)
Structured Data            Data Retrieval           Data Mining
Unstructured Data (Text)   Information Retrieval    Text Mining


© 2002, AvaQuest Inc.

Data Retrieval

Find records within a structured database

Database Type: Structured

Search Mode: Goal-driven

Atomic entity: Data Record

Example Information Need:

Find a Japanese restaurant in Boston that serves vegetarian food.

Example Query:

SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true
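
The data-retrieval query on this slide can be tried with a minimal sketch using Python's built-in sqlite3 module. The table layout, the sample rows, and the function name find_restaurants are assumptions made for illustration; only the query shape comes from the slide.

```python
import sqlite3

def find_restaurants(rows, city, cuisine, vegetarian=True):
    """Load rows into an in-memory table and run the slide's goal-driven query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
    )
    conn.executemany("INSERT INTO restaurants VALUES (?, ?, ?, ?)", rows)
    # Parameter placeholders avoid quoting mistakes in string literals.
    cursor = conn.execute(
        "SELECT name FROM restaurants "
        "WHERE city = ? AND type = ? AND has_veg = ? ORDER BY name",
        (city, cuisine, int(vegetarian)),
    )
    names = [row[0] for row in cursor.fetchall()]
    conn.close()
    return names

# Hypothetical sample records (not from the slides).
ROWS = [
    ("Sakura", "boston", "japanese", 1),
    ("Tokyo Grill", "boston", "japanese", 0),
    ("Umami", "new york", "japanese", 1),
    ("Verde", "boston", "italian", 1),
]

matches = find_restaurants(ROWS, "boston", "japanese")
```

Every returned record satisfies all three conditions exactly, which is what distinguishes data retrieval from the relevance-ranked results of information retrieval.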


© 2002, AvaQuest Inc.

Information Retrieval

Find relevant information in an unstructured information source (typically text)

Database Type: Unstructured

Search Mode: Goal-driven

Atomic entity: Document

Example Information Need:

Find a Japanese restaurant in Boston that serves vegetarian food.

Example Query:

Japanese restaurant Boston

or

Boston->Restaurants->Japanese


Intelligent Information Retrieval

Meaning of words

• Synonyms: buy / purchase
• Ambiguity: bat (baseball vs. mammal)

Order of words in the query

• hot dog stand in the amusement park
• hot amusement stand in the dog park

User dependency for the data

• direct feedback
• indirect feedback

Authority of the source

• IBM is more likely to be an authoritative source than my distant cousin


Given:

• A source of textual documents

• A well-defined, limited query (text based)

Find:

• Sentences with relevant information
• Extract the relevant information and ignore non-relevant information (important!)
• Link related information and output in a predetermined format


Information Extraction: Example

Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.

Incident Date: 19 Apr 89

Incident Type: Bombing

Perpetrator Individual ID: urban guerillas

Human Target Name: Roberto Garcia Alvarado

...


© 2002, AvaQuest Inc.

Text Mining

Discover new knowledge through analysis of text

Database Type: Unstructured

Search Mode: Opportunistic

Atomic entity: Language feature or concept

Example Information Need:

Find the types of food poisoning most often associated with Japanese restaurants

Example Query:

Rank diseases found associated with


© 2002, AvaQuest Inc.

Motivation for Text Mining

Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)

Information-intensive business processes demand that we move beyond simple document retrieval to “knowledge” discovery.

[Pie chart: Unstructured or Semi-structured Information, 90%; Structured Numerical or Coded Information, 10%]


© 2002, AvaQuest Inc.

Challenges of Text Mining

Very high number of possible “dimensions”

• All possible word and phrase types in the language!!

Unlike data mining:

• records (= docs) are not structurally identical

• records are not statistically independent

Complex and subtle relationships between

concepts in text

• “AOL merges with Time-Warner”

• “Time-Warner is bought by AOL”

Ambiguity and context sensitivity

• automobile = car = vehicle = Toyota

• Apple (the company) or apple (the fruit)


Challenges in Text Mining

Information is in unstructured textual form

Not readily accessible to be used by computers

Dealing with huge collections of documents

Language is ambiguous and context dependent


Text Mining Challenges

Homographs

• Bat – a piece of sporting equipment in baseball

• Bat – a winged animal associated with vampires

Synonyms – different words, same meaning

Polysemy – same word form, different meaning

Hyponymy – concept hierarchy


© 2002, AvaQuest Inc.

Text Processing

Statistical Analysis

Quantify text data

Language or Content Analysis

Identifying structural elements

Extracting and codifying meaning


Two Mining Phases

Knowledge Discovery: Extraction of codified information (features)

Information Distillation: Analysis of the feature distribution


© 2002, AvaQuest Inc.

Statistical Analysis

Use statistics to add a numerical dimension to unstructured text

Term frequency

Document length

Document frequency

Term proximity


© 2002, AvaQuest Inc.

Content Analysis

Lexical and Syntactic Processing

• Recognizing “tokens” (terms)

• Normalizing words

• Language constructs (parts of speech, sentences, paragraphs)

Semantic Processing

• Extracting meaning

• Named Entity Extraction (people names, company names, locations, etc.)

Extra-semantic features

• Identify feelings or sentiment in text


Text mining process

Text preprocessing

• Syntactic/Semantic text analysis

Features Generation

• Bag of words

Features Selection

• Simple counting
• Statistics

Text/Data Mining

• Classification – Supervised learning
• Clustering – Unsupervised learning

Analyzing results


Feature Extraction

To recognize and classify significant vocabulary items in unrestricted natural language texts

Very fast processing to be able to deal with mass data


Bag-of-Tokens Approaches

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …

Feature Extraction

nation – 5
civil – 1
war – 2
men – 2
died – 4
people – 5
Liberty – 1
God – 1

Loses all order-specific information!

Severely limits context!
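
A bag-of-tokens representation can be sketched in a few lines of Python. This is a minimal illustration, assuming lowercase alphabetic tokens; the function name bag_of_tokens is hypothetical. It counts only the excerpt quoted on the slide, so its counts are smaller than the slide's figures, which appear to cover the whole speech.

```python
import re
from collections import Counter

def bag_of_tokens(text):
    """Lowercase the text, pull out alphabetic tokens, and count each one."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

EXCERPT = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation"
)

counts = bag_of_tokens(EXCERPT)

# Word order is discarded: both phrases map to the same bag.
same_bag = bag_of_tokens("dog chases boy") == bag_of_tokens("boy chases dog")
```

The last line shows the limitation named on the slide: two sentences with opposite meanings become indistinguishable once order is dropped.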


Natural Language Processing (NLP)

Sentence: A dog is chasing a boy on the playground

Lexical analysis (part-of-speech tagging):

A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun

Syntactic analysis (parsing):

[Sentence [Noun Phrase A dog] [Verb Phrase [Verb Phrase [Complex Verb is chasing] [Noun Phrase a boy]] [Prep Phrase on [Noun Phrase the playground]]]]

Semantic analysis:

Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).

Inference:

Scared(x) if Chasing(_,x,_). + the facts above => Scared(b1)

Pragmatic analysis (speech act):

A person saying this may be reminding another person to get the dog back…


General NLP—Too Difficult!

(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)

Word-level ambiguity

• “design” can be a noun or a verb (ambiguous POS)
• “root” has multiple meanings (ambiguous sense)

Syntactic ambiguity

• “natural language processing” (modification)
• “A man saw a boy with a telescope.” (PP attachment)

Anaphora resolution

• “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)

Presupposition

• “He has quit smoking.” implies that he smoked before.

Humans rely on context to interpret (when possible). This context may extend beyond a given document!


Shallow Linguistics

Progress on Useful Sub-Goals:

English Lexicon

Text Normalization

Lower case

Typos, misspelled words

Syntactic analysis

Recognizing larger constructs

Part-of-Speech Tagging

Word Sense Disambiguation


© 2002, AvaQuest Inc.

Named Entity Extraction

Identify and type language features

Examples:

• People names

• Company names

• Geographic location names

• Dates

• Monetary amount


Canonical Forms

Normalized forms of dates, numbers, …

Allows applications to use information very easily

Abstracts from different morphological variants of a single term


Canonical Names

Variants found in the document: President Bush, Mr. Bush, George Bush

Canonical Name: George Bush

The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document


Tokenization

Convert streams of characters into “words”

Main clues (in English): white space

Words can contain special characters, such as these: . , ’ – etc.

No single algorithm “works” always

• Some languages do not have white space
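
One way to handle special characters inside words is a regular expression that allows them between alphanumeric runs. The sketch below is an assumption-laden illustration for English-like text only (the pattern and the name tokenize are not from the slides), and, as the slide notes, no single rule works for every language.

```python
import re

# A token is a run of letters/digits, optionally continued by a period,
# apostrophe, or hyphen that is immediately followed by more letters/digits.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:[.'’-][A-Za-z0-9]+)*")

def tokenize(text):
    """Split text into tokens, keeping word-internal punctuation."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize("Mr. O'Neil paid $3.50 for state-of-the-art e-mail.")
```

Punctuation is kept only when it sits between alphanumeric characters, so sentence-final periods and the currency sign are dropped while "3.50" and "e-mail" survive intact.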


Stemming

Normalizes / unifies variations of the same idea

• “walking”, “walks”, “walked”, “walker” => “walk”

Inflectional Stemming

• Remove plurals

• Normalize verb tenses

• Remove other affixes

Stemming to root

• Reduce word to most basic element

• More aggressive than inflectional

• Example: “denormalization” -> “norm”
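
Inflectional stemming can be approximated by stripping a short list of suffixes. The sketch below is deliberately crude and is not the Porter algorithm or any specific published stemmer; the suffix list, the minimum stem length, and the name inflectional_stem are assumptions chosen so that the slide's “walk” example works.

```python
# Suffixes are tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "er", "es", "s")

def inflectional_stem(word, min_stem_length=3):
    """Strip one common suffix, provided enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

stems = [inflectional_stem(w) for w in ("walking", "walks", "walked", "walker")]
```

A rule list this small over-stems and under-stems freely (for example, it turns "running" into "runn"), which is why practical systems rely on carefully designed rule sets or dictionaries.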


Data Mining -Volinsky - 2011 - Columbia University


Stop words

Many of the most frequently used words in English are worthless in retrieval and text mining – these words are called stop words.

• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
• Typically about 400 to 500 such words
• For an application, an additional domain-specific stop word list may be constructed

Why do we need to remove stop words?

• Reduce indexing (or data) file size
  • stop words account for 20-30% of total word counts
• Improve efficiency
  • stop words are not useful for searching or text mining
  • stop words always have a large number of hits
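
Stop-word removal amounts to a set-membership filter. The sketch below uses exactly the 33 words listed on the slide; the function name remove_stop_words is hypothetical, and a real application would likely use a longer, possibly domain-specific list.

```python
# The stop words listed on the slide.
STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word set."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

kept = remove_stop_words("The quick brown fox jumps over the lazy dog".split())
```

Using a frozenset keeps each lookup constant-time, which matters when filtering large collections.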


Part-of-speech (POS) tagging

• Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
• ~98% accurate

Word sense disambiguation

• Context based or proximity based

• Very accurate

Parsing

• Generates a parse tree (graph) for each sentence

• Each sentence is a stand alone graph


© 2002, AvaQuest Inc.

Simple Entity Extraction

“The quick brown fox jumps over the lazy dog”

Noun phrase

Noun phrase

Mammal

Canidae

Mammal


Document Frequency

Number of documents the term occurs in

Assumptions:

• Terms that occur in fewer documents are more specific to a document and more descriptive of the content: rarity matters
• Terms that occur in most documents are common words, not as descriptive

These assumptions are often true

• Sometimes rarity just reflects textual variants (synonyms), regional differences, personal style


Term Frequency

Two-fold heuristics based on frequency

• TF (Term frequency)
  • More frequent within a document -> more relevant to semantics
  • e.g., “query” vs. “commercial”
• IDF (Inverse document frequency)
  • Less frequent among documents -> more discriminative
  • e.g., “algebra” vs. “science”


Feature Extraction And Reduction

TF: counts of keywords in a field

Inverse Document Frequency (IDF):

IDF = log(1 + NumDocs / NumDocsWithTerm)

Interested in: TF * IDF

Multiple-word phrases: n-grams
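
The TF*IDF weighting on this slide can be sketched directly from its formula, IDF = log(1 + NumDocs / NumDocsWithTerm). The code below assumes documents are already tokenized and uses the natural logarithm, since the slide does not state a base; the names tf_idf and ngrams and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    num_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(1 + num_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

def ngrams(tokens, n):
    """Return all contiguous n-token phrases as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus: "science" is in every document, "algebra" in only one.
DOCS = [["algebra", "science"], ["science", "query"], ["science", "science"]]
weights = tf_idf(DOCS)
bigrams = ngrams(["hot", "dog", "stand"], 2)
```

In the toy corpus the rare term "algebra" outweighs the ubiquitous "science" within the first document, matching the slide's point that rarer terms are more discriminative.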


Data Mining -Volinsky - 2011 - Columbia University


Document Distance

Pairwise distances between documents

Image plots of cosine distance, Euclidean, and
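
Two of the pairwise document distances named on this slide can be sketched as follows, assuming each document has already been turned into an equal-length numeric vector (for example, term counts or TF*IDF weights). The function names are illustrative.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an all-zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# A document and the same document repeated twice: same direction, different length.
short_doc = [1, 2]
long_doc = [2, 4]
cos_d = cosine_distance(short_doc, long_doc)
euc_d = euclidean_distance(short_doc, long_doc)
```

Cosine distance ignores vector length, so a document and its doubled copy are at distance zero, whereas Euclidean distance separates them; this is one reason cosine is a common choice for texts of differing lengths.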


© 2002, AvaQuest Inc.

Extra-semantic Information

Extracting hidden meaning or sentiment

based on use of language.

• Examples:

• “Customer is unhappy with their service!”

• Sentiment = discontent

Sentiment is:

• Emotions: fear, love, hate, sorrow

• Feelings: warmth, excitement

• Mood, disposition, temperament, …

Or even (someday)…


Given: a collection of labeled records (training set)

• Each record contains a set of features (attributes), and the true class (label)

Find: a model for the class as a function of the values of the features

Goal: previously unseen records should be assigned a class as accurately as possible

• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it
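
The train/test procedure described above can be sketched without any library. The split fraction, the fixed seed, and the names train_test_split and accuracy are assumptions for illustration; here a model is simply any function from features to a predicted label.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of (features, label) pairs for which the model predicts the label."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Toy labeled records: the label is whether the single feature is odd.
RECORDS = [((i,), i % 2) for i in range(10)]
training_set, test_set = train_test_split(RECORDS)
```

Keeping the test records out of model building is what makes the measured accuracy an estimate of performance on previously unseen records.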


Supervised learning (classification)

• Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
• New data is classified based on the training set

Unsupervised learning (clustering)

• The class labels of the training data are unknown
• Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


Given: a set of documents and a similarity measure among documents

Find: clusters such that:

• Documents in one cluster are more similar to one another
• Documents in separate clusters are less similar to one another

Goal:

• Finding a correct set of documents


Classification: An Example

Training Set

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married                 Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England                  20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Training Set -> Learn Classifier -> Model

Test Set

Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
         Divorced        90K     ?
         Single          40K     ?
Italy    Married         80K     ?


Text Classification: An Example

Training Set

Ex#  Text                                               Hooligan
1    An English football fan                            Yes
2    During a game in Italy                             Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Training Set -> Learn Classifier -> Model

Test Set

Text                                              Hooligan
A Danish football fan                             ?
Turkey is playing vs. France. The Turkish fans …  ?


Decision Tree: An Example

Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Splitting Attributes: English, MarSt, Income

English?
  Yes -> YES
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> Income?
                                 > 80K -> YES
                                 < 80K -> NO

The splitting attribute at a node is determined based on a specific
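
The decision tree on this slide can be written as nested conditions. The sketch below assumes the tree reads as follows: English? Yes gives YES; otherwise MarSt Married gives NO; otherwise Income above 80K gives YES and below gives NO. It also assumes that "English" means Country = England and that an income of exactly 80K falls on the NO side, which the slide leaves unspecified. The function name classify is hypothetical.

```python
def classify(country, marital_status, income_k):
    """Walk the slide's tree: country first, then marital status, then income."""
    if country == "England":
        return "Yes"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on income (80K itself treated as "No" by assumption).
    return "Yes" if income_k > 80 else "No"

# The ten training records from the slide's table.
TRAINING = [
    ("England", "Single", 125, "Yes"),
    ("England", "Married", 100, "Yes"),
    ("England", "Single", 70, "Yes"),
    ("Italy", "Married", 40, "No"),
    ("USA", "Divorced", 95, "No"),
    ("England", "Married", 60, "Yes"),
    ("England", "Divorced", 20, "Yes"),
    ("Italy", "Single", 85, "Yes"),
    ("France", "Married", 75, "No"),
    ("Denmark", "Single", 50, "No"),
]

correct = sum(
    classify(country, marital, income) == label
    for country, marital, income, label in TRAINING
)
```

Under this reading the tree agrees with nine of the ten training labels; record 5 (USA, Divorced, 95K) is the exception, which may indicate that the reading of the extracted tree is imperfect.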


Decision Tree: A Text Example

Ex#  Text                                               Hooligan
1    An English football fan                            Yes
2    During a game in Italy                             Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Splitting Attributes: English, MarSt, Income

English?
  Yes -> YES
  No  -> MarSt?
           Married          -> NO
           Single, Divorced -> Income?
                                 > 80K -> YES
                                 < 80K -> NO

The splitting attribute at a node is determined based on a specific


© 2002, AvaQuest Inc.

Decision Support using Bank Call Center Data

The Needs:

• Analysis of call records as input into decision-making process of Bank’s management
• Quick answers to important questions
  • Which offices receive the most angry calls?
  • What products have the fewest satisfied customers?
  • (“Angry” and “Satisfied” are recognizable sentiments)
• User friendly interface and visualization tools


© 2002, AvaQuest Inc.

Decision Support using Bank Call Center Data

The Information Source:

• Call center records
• Example:

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the


© 2002, AvaQuest Inc.

Call Volume by Sentiment

[Bar chart: Negative Calls Related to Bank Statements; series: Cleveland, New York, Boston; axis from 0 to 1000]


Movie Review Task

Build a model from the movie review DB to distinguish positive from negative reviews
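
To make the movie review task concrete, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-one smoothing. Naive Bayes is chosen here only as a simple, well-known baseline; the slides do not say which model to build. The four toy reviews and all function names are invented for illustration and are not taken from the movie review data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and extract word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())

def train_naive_bayes(labeled_reviews):
    """labeled_reviews: iterable of (text, label). Returns a model dict."""
    word_counts = {}
    doc_counts = Counter()
    vocab = set()
    for text, label in labeled_reviews:
        tokens = tokenize(text)
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocab": vocab}

def predict(model, text):
    """Return the label with the highest smoothed log-probability."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # class prior
        for token in tokenize(text):
            if token in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy reviews (not from the actual dataset).
TOY_REVIEWS = [
    ("a wonderful, moving film with great acting", "positive"),
    ("great story and wonderful cast", "positive"),
    ("a dull, boring plot and terrible acting", "negative"),
    ("terrible pacing, boring from start to finish", "negative"),
]
model = train_naive_bayes(TOY_REVIEWS)
```

With real data the same two functions would be trained on one portion of the reviews and evaluated on a held-out portion.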


Movie Review Data

• 1000 positive movie reviews and 1000 negative review texts from


Sentiment Polarity Dataset Version 2.0

1000 positive movie reviews and 1000 negative review texts from:

Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79--86, 2002.

“Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”


Weka

TextMining Data Set

Use CLI
