• No results found

Text Mining with R. Rob Zinkov. October 19th, Rob Zinkov () Text Mining with R October 19th, / 38

N/A
N/A
Protected

Academic year: 2021

Share "Text Mining with R. Rob Zinkov. October 19th, Rob Zinkov () Text Mining with R October 19th, / 38"

Copied!
38
0
0

Loading.... (view fulltext now)

Full text

(1)

Text Mining with R

Rob Zinkov

October 19th, 2010

(2)

Outline

1 Introduction 2 Readability 3 Summarization 4 Topic Modeling 5 Sentiment Analysis 6 Entity Extraction 7 Demo
(3)

What is Text Mining?

Text mining is any process or program that: Raw human written text

Structured information

(4)

Introduction

R is very good for this

(5)

Themes

• R is a great glue language

• CRAN already has a lots of packages that work well together

(6)

Introduction

Caveats

• I use outside libraries more than necessary

• Many of these algorithms could be written completely in R

• There are nicer ways to integrate these libraries

• Text mining is a vast field that can’t be covered in 40 minutes

(7)

Readability

(8)

Readability

• Readability gives us an idea of the difficulty of the document

• It also gives a rough measure of the quality

(9)

Flesch-Kincaid readability test

Readability can be roughly measured with 206.876−1.015w s −84.6y w where w = total words y = total syllables s = total sentences

(10)

Readability

Score Notes

90.0-100.0 easily understandable by an average 11-year-old student

60.0-70.0 easily understandable by 13- to 15-year-old students

0.0-30.0 best understood by university graduates

(11)

Flesch-Kincaid Reading Age

(0.39∗ASL) + (11.8∗ASW)−15.59 where ASL = average sentence length where ASW = average syllables per word

(12)

Readability

(13)

system(paste("java -jar CmdFlesh.jar","test_review.txt"))

(14)

Readability

Demo!

(15)

Notes

This algorithm isn’t hard to implement.

Quick trick. Count vowel clusters in words to estimate syllables

(16)

Summarization

Summarization is about a succient distinct distillation of the relevant content in a document.

(17)
(18)

Summarization

Most approaches involve selecting out the most relevant sentences The simplest technique is to just look for sentences with popular terms

(19)

I use libots, as an example but there is much better work

(20)

Summarization

Demo!

(21)

Topic Modeling

Topic Modeling is a way to group and categorize documents Usually unsupervised approach

(22)

Topic Modeling

Topic Modeling - continued

CRAN includes a package for topicmodeling This package using LDA and CTM

(23)

LDA - Latent Dirchilet Allocation

θ

k|j

D

[

α

]

φ

w|k

D

[

β

]

z

ij

θ

k|j

x

ij

φ

w|zij
(24)

Topic Modeling

njkw = #{i :xij =w,zij =k}

(25)

CTM - Coorelated Topic Models

(26)

Topic Modeling

CTM - Coorelated Topic Models

θ

k|j

log

(

N

(

µ,

Σ))

φ

w|k

D

[

β

]

z

ij

θ

k|j

x

ij

φ

w|zij
(27)

Demo!

(28)

Sentiment Analysis

(29)

Sentiment analysis is about gauging mood based on the text.

(30)

Sentiment Analysis

Opinion corpus available at:

Wiebe’s corpora

http://www.cs.pitt.edu/mpqa/

Sentiwordnet:

http://sentiwordnet.isti.cnr.it/

(31)

For more sophistication

• Best solved using a Conditional Random Field

• This area is still new

• No R libraries

• Entity Extraction needed for more fine-grained sentiment

(32)

Sentiment Analysis

Demo!

(33)

Named Entity Recognition

The purpose of NER is to extract out and label phrases in a sentence

Bill Clinton arrived at the United NationsBuilding in Manhattan.

(34)

Entity Extraction

Challenges

• People and locations may be referred to in ambiguous ways.

• Entity may never have been seen before

• Entity may be referred to with pronouns

• Wikipedia and Capitalization heuristics aren’t good enough

(35)

I use the Illinois Named Entity Extractor

http://cogcomp.cs.illinois.edu/page/software view/4

(36)

Entity Extraction

Demo!

(37)

Conclusions

There are lots of interesting things you can do with text mining R is very good at integrating all of them.

(38)

Demo

Questions?

References

Related documents

FECA reform to include these practices, along with collection of COP in cases involving third-party liabilities, changes to the assessment of administrative fees, and the use

6.04.175 Dog control by responsible person. A.Dogs shall at all times be kept under the immediate control and direction of a competent, responsible person who is capable of

Based on wide variety of activities that it is consisted of, the rural tourism unites more than 19 possible kinds of tourism: tourism on a farm; tourism on

A com- parison of the mineralogical content determined in the ana- lysed attic dust samples (quartz, muscovite, Mg–Fe clinochlore, calcite and plagioclase; serpentine

We show that the arrival of news that costs will fall in the future causes financial contracts to be modified which leads to an immediate expansion in liquidity in the financial

Speakers: Sui Chung, First Vice President, AILA South Florida, AILA National ICE Liaison Committee Jordan Dollar, Adjunct Assistant Professor, FIU College of Law. *Alex

Giardia intestinalis and other zoonotic parasites: prevalence in adult dogs from the southern part of Mexico City. Frequency of intestinal parasites in domiciled

(See T ables 2 – 4 for de fi nitions of abnormal returns, absolute abnormal returns, abnormal trading volume, and forecast revisions.) V ariable de fi nitions for explanatory