SAN DIEGO SUPERCOMPUTER CENTER
at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Data Mining Boot Camp 2
Text Mining
Natasha Balac, Ph.D.
Director, Predictive Analytics Center of Excellence
San Diego Supercomputer Center
University of California, San Diego
Text Mining Definition
Discover useful and previously unknown “gems” of information in large text collections
Many definitions in the literature:
The nontrivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data
© 2002, AvaQuest Inc.
Text Mining Example
• Research objective:
  • Follow chains of causal implication to discover a relationship between migraines and biochemical levels.
• Data:
  • Medical research papers, medical news (unstructured text information)
• Key concept types:
  • Symptoms, drugs, diseases, chemicals…
• Medical research
  • Find causal links between symptoms or diseases and drugs or chemicals
© 2002, AvaQuest Inc.
Medical Research Example
• stress is associated with migraines
• stress can lead to loss of magnesium
• calcium channel blockers prevent some migraines
• magnesium is a natural calcium channel blocker
• spreading cortical depression (SCD) is implicated in some migraines
• high levels of magnesium inhibit SCD
• migraine patients have high platelet aggregability
• magnesium can suppress platelet aggregability
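The chain of statements above can be followed mechanically: if concept A is linked to B, and B is linked to C, then A and C are a candidate pair worth investigating. The following is a minimal sketch of that idea, assuming the links have already been extracted; the `links` list is a hand-coded paraphrase of the slide, and `bridging_concepts` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-coded (concept, concept) pairs paraphrasing the statements on the slide.
# In a real system these would come from an extraction step; here they are assumed.
links = [
    ("stress", "migraine"),
    ("stress", "magnesium"),
    ("calcium channel blocker", "migraine"),
    ("magnesium", "calcium channel blocker"),
    ("SCD", "migraine"),
    ("magnesium", "SCD"),
    ("migraine", "platelet aggregability"),
    ("magnesium", "platelet aggregability"),
]

def bridging_concepts(links, a, c):
    """Return the concepts that are directly linked to both a and c."""
    neighbours = defaultdict(set)
    for x, y in links:
        # Treat every link as undirected for the purpose of finding bridges.
        neighbours[x].add(y)
        neighbours[y].add(x)
    return sorted(neighbours[a] & neighbours[c])

bridges = bridging_concepts(links, "migraine", "magnesium")
print(bridges)
```

No statement in the list mentions migraine and magnesium together, yet several intermediate concepts connect them, which is what makes the pair a candidate discovery.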
Data vs. Text Mining Comparison
Data Mining
• Identify data sets
• Select features
• Prepare data
• Identify causal relationship
• Structured numeric transaction data residing
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Diverse collections and formats
• Linguistic processing
• Select features by algorithm
• Prepare data
• Analyze distribution
• Information Retrieval
  • Indexing and retrieval of textual documents
• Information Extraction
  • Extraction of partial knowledge in the text
• Web Mining
  • Indexing and retrieval of textual documents and extraction of knowledge
• Document Classification
  • Classifying similar documents, paragraphs, etc.
• Document Clustering
  • Generating collections of similar text documents
• NLP
  • Natural language processing
• Concept Extraction
  • Semantically similar grouping
© 2002, AvaQuest Inc.
“Search” vs. “Discover”
                          Search (goal-oriented)   Discover (opportunistic)
Structured Data           Data Retrieval           Data Mining
Unstructured Data (Text)  Information Retrieval    Text Mining
© 2002, AvaQuest Inc.
Data Retrieval
• Find records within a structured database
Database Type: Structured
Search Mode: Goal-driven
Atomic entity: Data Record
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “SELECT * FROM restaurants WHERE city = 'boston' AND type = 'japanese' AND has_veg = true”
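To make the data retrieval case concrete, the sketch below runs an equivalent query against an in-memory SQLite database. The table layout, the restaurant rows, and the use of `1`/`0` for `has_veg` are assumptions made for illustration; only the query conditions come from the slide.

```python
import sqlite3

# In-memory database with an assumed schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Sakura House", "boston", "japanese", 1),
        ("Tokyo Grill", "boston", "japanese", 0),
        ("Green Leaf", "boston", "indian", 1),
        ("Umi", "chicago", "japanese", 1),
    ],
)

# Goal-driven search: every condition is stated exactly, and each
# result is one atomic data record.
rows = conn.execute(
    "SELECT name FROM restaurants WHERE city = ? AND type = ? AND has_veg = 1",
    ("boston", "japanese"),
).fetchall()
print(rows)
conn.close()
```

The query either matches a record or it does not; there is no notion of relevance ranking, which is the main contrast with information retrieval.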
© 2002, AvaQuest Inc.
Information Retrieval
• Find relevant information in an unstructured information source (typically text)
Database Type: Unstructured
Search Mode: Goal-driven
Atomic entity: Document
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “Japanese restaurant Boston” or Boston->Restaurants->Japanese
Intelligent Information Retrieval
• Meaning of words
  • Synonyms: “buy” / “purchase”
  • Ambiguity: “bat” (baseball vs. mammal)
• Order of words in the query
  • hot dog stand in the amusement park
  • hot amusement stand in the dog park
• User dependency for the data
  • Direct feedback
  • Indirect feedback
• Authority of the source
  • IBM is more likely to be an authoritative source than my distant cousin
• Given:
  • A source of textual documents
  • A well-defined, limited query (text based)
• Find:
  • Sentences with relevant information
  • Extract the relevant information and ignore non-relevant information (important!)
  • Link related information and output in a predetermined format
Information Extraction: Example
• Salvadoran President-elect Alfredo Cristiani condemned the terrorist killing of Attorney General Roberto Garcia Alvarado and accused the Farabundo Marti National Liberation Front (FMLN) of the crime. … Garcia Alvarado, 56, was killed when a bomb placed by urban guerillas on his vehicle exploded as it came to a halt at an intersection in downtown San Salvador. … According to the police and Garcia Alvarado’s driver, who escaped unscathed, the attorney general was traveling with two bodyguards. One of them was injured.
• Incident Date: 19 Apr 89
• Incident Type: Bombing
• Perpetrator Individual ID: “urban guerillas”
• Human Target Name: “Roberto Garcia Alvarado”
• ...
© 2002, AvaQuest Inc.
Text Mining
• Discover new knowledge through analysis of text
Database Type: Unstructured
Search Mode: Opportunistic
Atomic entity: Language feature or concept
Example Information Need: “Find the types of food poisoning most often associated with Japanese restaurants”
Example Query: Rank diseases found associated with
© 2002, AvaQuest Inc.
Motivation for Text Mining
• Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)
• Information-intensive business processes demand that we move beyond simple document retrieval to “knowledge” discovery.
Unstructured or Semi-structured Information: 90%
Structured Numerical or Coded Information: 10%
© 2002, AvaQuest Inc.
Challenges of Text Mining
• Very high number of possible “dimensions”
  • All possible word and phrase types in the language!!
• Unlike data mining:
  • Records (= docs) are not structurally identical
  • Records are not statistically independent
• Complex and subtle relationships between concepts in text
  • “AOL merges with Time-Warner”
  • “Time-Warner is bought by AOL”
• Ambiguity and context sensitivity
  • automobile = car = vehicle = Toyota
  • Apple (the company) or apple (the fruit)
Challenges in Text Mining
• Information is in unstructured textual form
• Not readily accessible to be used by computers
• Dealing with huge collections of documents
• Language is ambiguous and context dependent
Text Mining Challenges
• Homographs
  • Bat: a piece of sporting equipment in baseball
  • Bat: a winged animal associated with vampires
• Synonyms: different words, same meaning
• Polysemy: same word form, different meaning
• Hyponymy: concept hierarchy
© 2002, AvaQuest Inc.
Text Processing
• Statistical Analysis
  • Quantify text data
• Language or Content Analysis
  • Identifying structural elements
  • Extracting and codifying meaning
Two Mining Phases
• Knowledge Discovery: Extraction of codified information (features)
• Information Distillation: Analysis of the feature distribution
© 2002, AvaQuest Inc.
Statistical Analysis
• Use statistics to add a numerical dimension to unstructured text
  • Term frequency
  • Document length
  • Document frequency
  • Term proximity
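The four statistics listed above can each be computed with a few lines of code. This is a minimal sketch, assuming whitespace tokenization and a three-document toy corpus borrowed from the migraine example; the names `docs` and `min_distance` are made up for illustration.

```python
from collections import Counter

# Toy corpus (assumed); each document is a short lower-case sentence.
docs = {
    "d1": "stress is associated with migraines",
    "d2": "stress can lead to loss of magnesium",
    "d3": "magnesium can suppress platelet aggregability",
}
tokens = {name: text.lower().split() for name, text in docs.items()}

# Term frequency: how often each term occurs within one document.
term_frequency = {name: Counter(words) for name, words in tokens.items()}

# Document length: number of tokens in each document.
document_length = {name: len(words) for name, words in tokens.items()}

# Document frequency: number of documents each term occurs in.
document_frequency = Counter(
    term for words in tokens.values() for term in set(words)
)

def min_distance(words, a, b):
    """Term proximity: smallest token distance between a and b, or None."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

print(document_length)
print(document_frequency["stress"], document_frequency["magnesium"])
print(min_distance(tokens["d1"], "stress", "migraines"))
```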
© 2002, AvaQuest Inc.
Content Analysis
• Lexical and Syntactic Processing
  • Recognizing “tokens” (terms)
  • Normalizing words
  • Language constructs (parts of speech, sentences, paragraphs)
• Semantic Processing
  • Extracting meaning
  • Named Entity Extraction (people names, company names, locations, etc.)
• Extra-semantic features
  • Identify feelings or sentiment in text
Text mining process
• Text preprocessing
  • Syntactic/semantic text analysis
• Feature generation
  • Bag of words
• Feature selection
  • Simple counting
  • Statistics
• Text/Data Mining
  • Classification: supervised learning
  • Clustering: unsupervised learning
• Analyzing results
Feature Extraction
• To recognize and classify significant vocabulary items in unrestricted natural language texts
• Very fast processing to be able to deal with mass data
Bag-of-Tokens Approaches
Text:
“Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or …”
Feature extraction ->
• nation: 5
• civil: 1
• war: 2
• men: 2
• died: 4
• people: 5
• Liberty: 1
• God: 1
• …
Loses all order-specific information!
Severely limits context!
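A bag of tokens is just a table of counts, as the short sketch below shows. It assumes a simple letters-only tokenizer and uses only the excerpt quoted on the slide, so the counts are smaller than the whole-speech counts listed above; `bag_of_tokens` is a hypothetical helper name.

```python
import re
from collections import Counter

# Only the excerpt shown on the slide, not the full speech.
text = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to the "
    "proposition that all men are created equal. Now we are engaged in a "
    "great civil war, testing whether that nation"
)

def bag_of_tokens(text):
    """Lower-case the text, keep runs of letters, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

bag = bag_of_tokens(text)
print(bag["nation"], bag["civil"], bag["war"])

# Word order is lost: these two sentences produce the same bag.
print(bag_of_tokens("dog bites man") == bag_of_tokens("man bites dog"))
```

The final comparison illustrates the limitation named on the slide: two sentences with opposite meanings are indistinguishable once order is discarded.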
Natural Language Processing (NLP)
Example sentence: A dog is chasing a boy on the playground
• Lexical analysis (part-of-speech tagging)
  • A/Det dog/Noun is/Aux chasing/Verb a/Det boy/Noun on/Prep the/Det playground/Noun
• Syntactic analysis (parsing)
  • Noun Phrase: “A dog”
  • Complex Verb: “is chasing”
  • Noun Phrase: “a boy”
  • Prep Phrase: “on” + Noun Phrase “the playground”
  • Verb Phrase, Verb Phrase, Sentence
• Semantic analysis
  • Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1).
• Inference
  • Scared(x) if Chasing(_,x,_). + facts above -> Scared(b1)
• Pragmatic analysis (speech act)
  • A person saying this may be reminding another person to get the dog back…
General NLP—Too Difficult!
(Taken from ChengXiang Zhai, CS 397cxz – Fall 2003)
• Word-level ambiguity
  • “design” can be a noun or a verb (ambiguous POS)
  • “root” has multiple meanings (ambiguous sense)
• Syntactic ambiguity
  • “natural language processing” (modification)
  • “A man saw a boy with a telescope.” (PP attachment)
• Anaphora resolution
  • “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?)
• Presupposition
  • “He has quit smoking.” implies that he smoked before.
Humans rely on context to interpret (when possible).
This context may extend beyond a given document!
Shallow Linguistics
• Progress on Useful Sub-Goals:
  • English Lexicon
  • Text Normalization
    • Lower case
    • Typos, misspelled words
  • Syntactic analysis
    • Recognizing larger constructs
  • Part-of-Speech Tagging
  • Word Sense Disambiguation
© 2002, AvaQuest Inc.
Named Entity Extraction
• Identify and type language features
• Examples:
  • People names
  • Company names
  • Geographic location names
  • Dates
  • Monetary amounts
Canonical Forms
• Normalized forms of dates, numbers, …
• Allows applications to use information very easily
• Abstracts from different morphological variants of a single term
Canonical Names
Variants: President Bush, Mr. Bush, George Bush
Canonical Name: George Bush
• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document
Tokenization
• Convert streams of characters into “words”
  • Main clue (in English): white space
  • Words can contain special characters, such as these: . , ’ – etc.
• No single algorithm “works” always
  • Some languages do not have white space
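One way to see why no single rule always works is to try a simple one. The sketch below is an assumed regular-expression tokenizer, not a standard algorithm: it keeps apostrophes, periods, and hyphens only when they sit between letters or digits. The sample sentence is made up for illustration.

```python
import re

# A token is a run of letters/digits, optionally continued by
# (apostrophe | period | hyphen) followed by more letters/digits.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:['’.\-][A-Za-z0-9]+)*")

def tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN.findall(text)

tokens = tokenize("Mr. O'Neil's state-of-the-art U.S. lab isn't cheap.")
print(tokens)
```

The rule handles `O'Neil's`, `state-of-the-art`, and `isn't` sensibly, but it turns `U.S.` into `U.S`, because the final period looks the same as sentence punctuation. A different rule would fix that case and break another.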
Stemming
• Normalizes / unifies variations of the same idea
  • “walking”, “walks”, “walked”, “walker” => “walk”
• Inflectional stemming
  • Remove plurals
  • Normalize verb tenses
  • Remove other affixes
• Stemming to root
  • Reduce word to most basic element
  • More aggressive than inflectional
  • Example: “denormalization” -> “norm”
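The “walk” example can be reproduced with a very small suffix stripper. This is a toy sketch under stated assumptions: the suffix list and the minimum stem length of three characters are arbitrary choices for illustration, and `toy_stem` is not a published stemming algorithm such as Porter's.

```python
# Suffixes are tried in this order; the first one that applies is removed.
SUFFIXES = ("ing", "ed", "er", "s")

def toy_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["walking", "walks", "walked", "walker"]]
print(stems)
print(toy_stem("running"))  # doubled consonant is not handled
```

All four variants collapse to `walk`, but `running` becomes `runn`, which shows why practical stemmers need more rules than plain suffix removal.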
Data Mining -Volinsky - 2011 - Columbia University
Stop words
• Many of the most frequently used words in English are worthless in retrieval and text mining; these words are called stop words.
  • a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
  • Typically about 400 to 500 such words
  • For an application, an additional domain-specific stop word list may be constructed
• Why do we need to remove stop words?
  • Reduce indexing (or data) file size
    • Stop words account for 20-30% of total word counts
  • Improve efficiency
    • Stop words are not useful for searching or text mining
    • Stop words always have a large number of hits
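Stop word removal is a set lookup per token. The sketch below uses exactly the words listed on the slide as its stop list; the sample sentence is taken from the text classification example later in the deck, and `remove_stop_words` is a hypothetical helper name.

```python
# Stop list copied from the slide; real lists are typically longer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will",
    "with",
}

def remove_stop_words(tokens):
    """Keep only tokens whose lower-case form is not in the stop list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentence = "The game in London was horrific"
kept = remove_stop_words(sentence.split())
print(kept)
```

Half of the six tokens are dropped here, in line with the size reduction the slide describes.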
• Part-of-speech (POS) tagging
  • Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
  • ~98% accurate
• Word sense disambiguation
  • Context based or proximity based
  • Very accurate
• Parsing
  • Generates a parse tree (graph) for each sentence
  • Each sentence is a stand-alone graph
© 2002, AvaQuest Inc.
Simple Entity Extraction
“The quick brown fox jumps over the lazy dog”
[Figure: the two noun phrases in the sentence are marked “Noun phrase” and annotated with the concept labels Mammal, Canidae, Mammal]
Document Frequency
• Number of documents the term occurs in
Assumptions:
• Terms that occur in fewer documents are more specific to a document and more descriptive of the content: rarity matters
• Terms that occur in most documents are common words, not as descriptive
  • Often true
• Sometimes these just reflect textual variants (synonyms), regional differences, or personal style
Term Frequency
• Two-fold heuristics based on frequency
  • TF (term frequency)
    • More frequent within a document -> more relevant to semantics
    • e.g., “query” vs. “commercial”
  • IDF (inverse document frequency)
    • Less frequent among documents -> more discriminative
    • e.g., “algebra” vs. “science”
Feature Extraction And Reduction
• TF: counts of keywords in field
• Inverse Document Frequency (IDF)
  IDF = log(1 + NumDocs / NumDocs with Term)
  Interested in: TF*IDF
• Multiple-word phrases: n-grams
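The weighting above translates directly into code. This is a minimal sketch that uses the slide's formula, IDF = log(1 + NumDocs / NumDocs with Term), with the natural logarithm assumed; the three-document corpus is adapted from the text classification example, and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (assumed), already tokenized and lower-cased.
docs = [
    "an english football fan".split(),
    "italian football fans were cheering".split(),
    "an average usa salesman earns 75k".split(),
]

def idf(term, docs):
    """IDF = log(1 + NumDocs / NumDocs with Term); term must occur somewhere."""
    n_with_term = sum(1 for d in docs if term in d)
    return math.log(1 + len(docs) / n_with_term)

def tf_idf(doc, docs):
    """TF*IDF weight for every term of one document."""
    tf = Counter(doc)
    return {term: count * idf(term, docs) for term, count in tf.items()}

def ngrams(tokens, n):
    """Multiple-word phrases: all runs of n consecutive tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

weights = tf_idf(docs[0], docs)
print(weights)
print(ngrams(docs[0], 2))
```

In the first document, `english` (found in one document) receives a higher weight than `football` (found in two), which is the "rarity matters" assumption in action.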
Data Mining -Volinsky - 2011 - Columbia University
Document Distance
• Pairwise distances between documents
• Image plots of cosine distance, Euclidean, and
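Pairwise document distances can be computed from term-count vectors as sketched below. The sketch assumes raw counts as vector components (TF*IDF weights would work the same way) and a made-up three-document corpus; cosine distance is taken here as one minus cosine similarity.

```python
import math
from collections import Counter

def to_vector(tokens, vocabulary):
    """Represent a document as term counts in a fixed vocabulary order."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy corpus (assumed).
docs = [
    "football fan football game".split(),
    "football fan".split(),
    "salesman earns income".split(),
]
vocabulary = sorted({t for d in docs for t in d})
vectors = [to_vector(d, vocabulary) for d in docs]

# Pairwise cosine distance matrix: entry [i][j] compares document i with j.
pairwise = [[cosine_distance(u, v) for v in vectors] for u in vectors]
for row in pairwise:
    print([round(x, 3) for x in row])
```

The two football documents are close to each other, while the third shares no terms with them and sits at the maximum cosine distance of 1 for non-negative counts.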
© 2002, AvaQuest Inc.
Extra-semantic Information
• Extracting hidden meaning or sentiment based on use of language.
  • Examples:
    • “Customer is unhappy with their service!”
    • Sentiment = discontent
• Sentiment is:
  • Emotions: fear, love, hate, sorrow
  • Feelings: warmth, excitement
  • Mood, disposition, temperament, …
• Or even (someday)…
• Given: a collection of labeled records (training set)
  • Each record contains a set of features (attributes) and the true class (label)
• Find: a model for the class as a function of the values of the features
• Goal: previously unseen records should be assigned a class as accurately as possible
  • A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
• Supervised learning (classification)
  • Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  • New data is classified based on the training set
• Unsupervised learning (clustering)
  • The class labels of the training data are unknown
  • Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
• Given: a set of documents and a similarity measure among documents
• Find: clusters such that:
  • Documents in one cluster are more similar to one another
  • Documents in separate clusters are less similar to one another
• Goal:
  • Finding a correct set of documents
Classification: An Example
Training Set
Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married                 Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England                  20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Training Set -> Learn Classifier -> Model

Test Set
Country  Marital Status  Income  Hooligan
England  Single          75K     ?
Turkey   Married         50K     ?
England  Married         150K    ?
         Divorced        90K     ?
         Single          40K     ?
Italy    Married         80K     ?
Text Classification: An Example
Training Set
Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Training Set -> Learn Classifier -> Model

Test Set
Text                                              Hooligan
A Danish football fan                             ?
Turkey is playing vs. France. The Turkish fans …  ?
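The training and test flow in this example can be sketched with a small bag-of-words classifier. The slide does not say which learning algorithm is used; the sketch below swaps in multinomial Naive Bayes with add-one smoothing as one common choice. The training texts are the snippets from the table with the ellipses dropped, and `train` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter, defaultdict

# Training snippets from the slide (ellipses dropped), with their labels.
training_set = [
    ("An English football fan", "Yes"),
    ("During a game in Italy", "Yes"),
    ("England has been beating France", "Yes"),
    ("Italian football fans were cheering", "No"),
    ("An average USA salesman earns 75K", "No"),
    ("The game in London was horrific", "Yes"),
    ("Manchester city is likely to win the championship", "Yes"),
    ("Rome is taking the lead in the football league", "Yes"),
]

def train(examples):
    """Count documents per class and words per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary

def classify(text, model):
    """Return the class with the highest log-probability score."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        n_words = sum(word_counts[label].values())
        score = math.log(count / total)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log(
                (word_counts[label][word] + 1) / (n_words + len(vocabulary))
            )
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training_set)
prediction = classify("A Danish football fan", model)
print(prediction)
```

With only eight short training texts the prediction mostly reflects the class prior and a handful of shared words, so it should be read as a demonstration of the workflow rather than a meaningful result.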
Decision Tree: An Example
Ex#  Country  Marital Status  Income  Hooligan
1    England  Single          125K    Yes
2    England  Married         100K    Yes
3    England  Single          70K     Yes
4    Italy    Married         40K     No
5    USA      Divorced        95K     No
6    England  Married         60K     Yes
7    England  Divorced        20K     Yes
8    Italy    Single          85K     Yes
9    France   Married         75K     No
10   Denmark  Single          50K     No

Splitting Attributes: English, MarSt, Income
English?
  Yes -> YES
  No -> MarSt?
    Married -> NO
    Single, Divorced -> Income?
      > 80K -> YES
      < 80K -> NO

The splitting attribute at a node is
determined based on a specific
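A decision tree of this shape is equivalent to nested conditionals. The sketch below encodes one reading of the tree on this slide, which is an assumption because the diagram survives only as loose labels: English at the root, then marital status, then income with the higher-income branch predicting YES. The slide does not say what happens at exactly 80K; this sketch treats it as NO. The function name `classify_hooligan` is made up for illustration.

```python
def classify_hooligan(country, marital_status, income_k):
    """Walk the assumed tree: English? -> MarSt -> Income."""
    if country == "England":
        return "Yes"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on income (in thousands).
    return "Yes" if income_k > 80 else "No"

# The ten training records from the table on this slide.
training_set = [
    ("England", "Single", 125, "Yes"),
    ("England", "Married", 100, "Yes"),
    ("England", "Single", 70, "Yes"),
    ("Italy", "Married", 40, "No"),
    ("USA", "Divorced", 95, "No"),
    ("England", "Married", 60, "Yes"),
    ("England", "Divorced", 20, "Yes"),
    ("Italy", "Single", 85, "Yes"),
    ("France", "Married", 75, "No"),
    ("Denmark", "Single", 50, "No"),
]

correct = sum(
    classify_hooligan(country, marital, income) == label
    for country, marital, income, label in training_set
)
print(correct, "of", len(training_set), "training records classified correctly")
```

Under this reading the tree reproduces nine of the ten training labels; the exception is record 5 (USA, Divorced, 95K), which the tree labels Yes.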
Decision Tree: A Text Example
Ex#  Text                                               Hooligan
1    An English football fan …                          Yes
2    During a game in Italy …                           Yes
3    England has been beating France …                  Yes
4    Italian football fans were cheering …              No
5    An average USA salesman earns 75K                  No
6    The game in London was horrific                    Yes
7    Manchester city is likely to win the championship  Yes
8    Rome is taking the lead in the football league     Yes

Splitting Attributes: English, MarSt, Income
English?
  Yes -> YES
  No -> MarSt?
    Married -> NO
    Single, Divorced -> Income?
      > 80K -> YES
      < 80K -> NO

The splitting attribute at a node is
determined based on a specific
© 2002, AvaQuest Inc.
Decision Support using Bank Call Center Data
• The Needs:
  • Analysis of call records as input into the decision-making process of the bank’s management
  • Quick answers to important questions
    • Which offices receive the most angry calls?
    • What products have the fewest satisfied customers?
    • (“Angry” and “Satisfied” are recognizable sentiments)
  • User-friendly interface and visualization tools
© 2002, AvaQuest Inc.
Decision Support using Bank Call
Center Data
• The Information Source:
  • Call center records
  • Example:
    AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the
© 2002, AvaQuest Inc.
Call Volume by Sentiment
[Chart: Negative Calls Related to Bank Statements, by office (Cleveland, New York, Boston); axis from 0 to 1000]
Movie Review Task
• Build a model from the movie review DB to distinguish positive from negative reviews
Movie Review Data
• 1000 positive and 1000 negative movie review texts from
•
Sentiment Polarity Dataset Version 2.0
1000 positive and 1000 negative movie review texts from:
Thumbs up? Sentiment Classification using Machine Learning Techniques. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Proceedings of EMNLP, pp. 79-86, 2002.
“Our data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value (other conventions varied too widely to allow for automatic processing). Ratings were automatically extracted and converted into one of three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”
Weka
• TextMining Data Set
• Use CLI