
(1)

Text Analytics (Text Mining)

LSI (uses SVD), Visualization CSE 6242 / CX 4242


Apr 3, 2014

Duen Horng (Polo) Chau
 Georgia Tech

Some lectures are partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Le Song

(2)

Singular Value Decomposition (SVD):

Motivation

Problem #1: Text - LSI uses SVD to find "concepts"

Problem #2: Compression / dimensionality reduction

(3)

SVD - Motivation

Problem #1: text - LSI: find “concepts”

(4)

SVD - Motivation

Customer-product matrix, for recommendation systems:

[Figure: customer-product matrix with products (bread, lettuce, beef, tomatoes, chicken) as columns and customer groups (vegetarians, meat eaters) as rows]

(5)

SVD - Motivation

• Problem #2: compress / reduce dimensionality

(6)

Problem - Specification

~10^6 rows; ~10^3 columns; no updates;
random access to any cell(s); small error: OK


(9)

SVD - Definition

(reminder: matrix multiplication)

[3 x 2] x [2 x 1] = [3 x 1]

(14)

SVD - Definition

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T

A: n x m matrix (e.g., n documents, m terms)

U: n x r matrix (e.g., n documents, r concepts)

Λ: r x r diagonal matrix (r: rank of the matrix; strength of each 'concept')

V: m x r matrix (e.g., m terms, r concepts)
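As a concrete sketch of these shapes, the decomposition can be computed with NumPy's `np.linalg.svd`; the 7 x 5 matrix below is a small made-up document-term example, and `full_matrices=False` gives the economy-size factors matching the n x r / r x r / m x r layout above:

```python
import numpy as np

# Hypothetical document-term matrix: 7 documents (rows), 5 terms (columns).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

# full_matrices=False gives the "economy" SVD: U is n x r', Vt is r' x m,
# where r' = min(n, m); singular values come back sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)  # (7, 5) (5,) (5, 5)
```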

(15)

SVD - Definition

A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T

[Figure: A (n documents x m terms) = U (n documents x r concepts) x Λ (diagonal entries: concept strengths) x V^T (r concepts x m terms)]

(16)

SVD - Properties

THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Λ V^T, where:

U, Λ, V: unique, most of the time

U, V: column-orthonormal, i.e., their columns are unit vectors, orthogonal to each other:

U^T U = I

V^T V = I

(I: identity matrix)

Λ: diagonal matrix with non-negative diagonal entries, sorted in decreasing order
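These properties are easy to check numerically; a small sketch in NumPy, on an arbitrary made-up matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # any matrix decomposes

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column-orthonormality: U^T U = I and V^T V = I.
print(np.allclose(U.T @ U, np.eye(4)))   # True
print(np.allclose(Vt @ Vt.T, np.eye(4))) # True

# Singular values: non-negative, sorted in decreasing order.
print(np.all(s >= 0), np.all(np.diff(s) <= 0))  # True True

# And the product gives back A.
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True
```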

(17)

SVD - Example

A = U Λ V^T - example:

[Figure: term-document matrix A (terms: data, inf., retrieval, brain, lung; rows: CS documents and MD documents) = U x Λ x V^T]

U: the doc-to-concept similarity matrix; its columns correspond to the CS-concept and the MD-concept

Λ: its diagonal entries give the 'strength' of each concept (e.g., of the CS-concept)

V^T: the term-to-concept similarity matrix

(23)

SVD - Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

• U: document-to-concept similarity matrix

• V: term-to-concept similarity matrix

• Λ: its diagonal elements give the ‘strength’ of each concept

(24)

SVD – Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

Q: if A is the document-to-term matrix, what is A^T A?

A: ?

Q: and A A^T?

A: ?

(25)

SVD – Interpretation #1

‘documents’, ‘terms’ and ‘concepts’:

Q: if A is the document-to-term matrix, what is A^T A?

A: the term-to-term ([m x m]) similarity matrix

Q: and A A^T?

A: the document-to-document ([n x n]) similarity matrix
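This relationship can also be sanity-checked numerically: the eigenvectors of A^T A span the same directions as the columns of V, and its eigenvalues are the squared singular values. A sketch on a small made-up matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigen-decomposition of the symmetric m x m matrix A^T A.
eigvals, eigvecs = np.linalg.eigh(A.T @ A)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # eigh sorts ascending

# Eigenvalues of A^T A are the squared singular values of A.
print(np.allclose(eigvals, s**2))  # True

# Each eigenvector matches a column of V, up to sign.
V = Vt.T
print(np.allclose(np.abs(eigvecs.T @ V), np.eye(4), atol=1e-8))  # True
```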

(26)

SVD properties

V contains the eigenvectors of the covariance matrix A^T A

U contains the eigenvectors of the Gram (inner-product) matrix A A^T

Thus, SVD is closely related to PCA, and can be numerically more stable.

For more info, see:

http://math.stackexchange.com/questions/3869/what-is-the-intuitive-relationship-between-svd-and-pca

Ian T. Jolliffe, Principal Component Analysis (2nd ed), Springer, 2002.

Gilbert Strang, Linear Algebra and Its Applications (4th ed), Brooks Cole, 2005.

(27)

SVD - Interpretation #2

the best axis to project on
(‘best’ = minimizes the sum of squares of the projection errors, i.e., the RMS error)

[Figure: 2-D point cloud with v1, the first singular vector, as the best projection axis]

(28)

SVD - Interpretation #2

A = U Λ V^T - example:

[Figure: the same point cloud; the variance (‘spread’) of the projections lies along the v1 axis]

(29)

SVD - Interpretation #2

A = U Λ V^T - example:

U Λ gives the coordinates of the points on the projection axis

(30)

SVD - Interpretation #2

• More details

• Q: how exactly is dim. reduction done?

(31)

SVD - Interpretation #2

• More details

• Q: how exactly is dim. reduction done?

• A: set the smallest singular values to zero:
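A sketch of that truncation in NumPy, using a rank-1 approximation of the running CS/MD document-term example (re-entered here for self-containment):

```python
import numpy as np

# Running CS/MD example: 7 documents x 5 terms
# (terms: data, inf., retrieval, brain, lung).
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Dimensionality reduction: keep the k largest singular values,
# set the rest to zero.
k = 1
s_trunc = s.copy()
s_trunc[k:] = 0.0
A_k = U @ np.diag(s_trunc) @ Vt

# The stronger CS block survives; the weaker MD block is zeroed out.
print(np.round(A_k, 2))
```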

(32)

SVD - Interpretation #2

[Figure: A ≈ U Λ V^T with the smallest singular value zeroed; the corresponding columns of U and V, and the zeroed diagonal entry, can be dropped, shrinking the factors]

(36)

SVD - Interpretation #3

• finds non-zero ‘blobs’ in a data matrix

[Figure: block-structured data matrix decomposed as U x Λ x V^T]

(38)

SVD - Interpretation #3

• finds non-zero ‘blobs’ in a data matrix =

• ‘communities’ (bi-partite cores, here)

[Figure: bipartite core linking rows 1, 4, 5, 7 with columns 1, 3, 4]

(39)

SVD algorithm

• Numerical Recipes in C (free)

(40)

SVD - Interpretation #3

• Drill: find the SVD, ‘by inspection’!

• Q: rank = ??

[Figure: A = ?? x ?? x ??]

(41)

SVD - Interpretation #3

• A: rank = 2 (2 linearly independent rows/cols)

[Figure: A = ?? x ?? x ??, with the two independent rows/columns highlighted]

(42)

SVD - Interpretation #3

• A: rank = 2 (2 linearly independent rows/cols)

[Figure: candidate A = U x Λ x V^T - are the columns orthogonal?]

(43)

SVD - Interpretation #3

• column vectors: are orthogonal - but not unit vectors; normalized:

1/sqrt(3)   0
1/sqrt(3)   0
1/sqrt(3)   0
0           1/sqrt(2)
0           1/sqrt(2)

(44)

SVD - Interpretation #3

• and the singular values are:

[Figure: the diagonal entries of Λ, alongside the normalized U and V columns above]

(45)

SVD - Interpretation #3

• Q: How to check we are correct?


(46)

SVD - Interpretation #3

• A: use the SVD properties:

the matrix product should give back matrix A

matrix U should be column-orthonormal, i.e., its columns should be unit vectors, orthogonal to each other

ditto for matrix V

matrix Λ should be diagonal, with non-negative values

(47)

SVD - Complexity

O(n*m*m) or O(n*n*m) (whichever is less)

Faster versions exist if we just want the singular values,
or if we want only the first k singular vectors,
or if the matrix is sparse [Berry].

No need to write your own!

Available in most linear algebra packages (LINPACK, MATLAB, S-plus/R, Mathematica, ...)
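For the sparse / top-k case, a sketch using SciPy's `scipy.sparse.linalg.svds`, which computes only the k largest singular triplets (note it returns the singular values in ascending order):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# Sparse stand-in for a large document-term matrix.
A = csr_matrix(np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float))

# Only the k largest singular values/vectors are computed.
U, s, Vt = svds(A, k=2)

print(sorted(s, reverse=True))  # the two dominant singular values
```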

(48)

References

Berry, Michael: http://www.cs.utk.edu/~lsi/!

Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press.!

Press, W. H., S. A. Teukolsky, et al. (1992).

Numerical Recipes in C, Cambridge University Press.

(49)

Case study - LSI

Q1: How to do queries with LSI?

Q2: multi-lingual IR (English query, on Spanish text?)

(50)

Case study - LSI

Q1: How to do queries with LSI?

Problem: e.g., find documents containing ‘data’

[Figure: the CS/MD term-document example, A = U x Λ x V^T]

(51)

Case study - LSI

Q1: How to do queries with LSI?

A: map query vectors into ‘concept space’ - how?

[Figure: the CS/MD term-document example, A = U x Λ x V^T]

(52)

Case study - LSI

Q1: How to do queries with LSI?

A: map query vectors into ‘concept space’ - how?

[Figure: query q (over terms data, inf., retrieval, brain, lung) drawn in term space, with axes term1 and term2, alongside the concept vectors v1 and v2]

A: take the inner product (cosine similarity) of q with each ‘concept’ vector vi - e.g., q o v1

(55)

Case study - LSI

compactly, we have:

q_concept = q V

e.g.:

[Figure: q (containing only ‘data’) times the term-to-concept similarity matrix V gives q_concept, which loads on the CS-concept]
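A sketch of this query mapping in NumPy, with terms in the order data, inf., retrieval, brain, lung, using the running example matrix (V comes from its SVD):

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T[:, :2]  # keep the r = 2 concept columns

# Query containing just the term 'data'.
q = np.array([1, 0, 0, 0, 0], dtype=float)

q_concept = q @ V  # map into concept space
print(np.round(q_concept, 3))  # weight on the CS-concept, ~0 on the MD-concept
```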

(56)

Case study - LSI

Drill: how would the document (‘information’, ‘retrieval’) be handled by LSI?

(57)

Case study - LSI

Drill: how would the document (‘information’, ‘retrieval’) be handled by LSI? A: the SAME way:

d_concept = d V

e.g.:

[Figure: d (containing only ‘information’ and ‘retrieval’) times the term-to-concept similarity matrix V gives d_concept, which loads on the CS-concept]

(58)

Case study - LSI

Observation: the document (‘information’, ‘retrieval’) will be retrieved by the query (‘data’), although it does not contain ‘data’!

[Figure: d_concept and q_concept both load on the CS-concept]
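This latent-semantic match is easy to reproduce; a sketch in NumPy, on the same running matrix: q contains only ‘data’, d contains only ‘information’ and ‘retrieval’, yet their concept-space vectors are nearly parallel:

```python
import numpy as np

A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T[:, :2]

q = np.array([1, 0, 0, 0, 0], dtype=float)  # query: 'data'
d = np.array([0, 1, 1, 0, 0], dtype=float)  # doc: 'information', 'retrieval'

print(q @ d)  # 0.0 -- no overlap at all in term space

qc, dc = q @ V, d @ V
cos = (qc @ dc) / (np.linalg.norm(qc) * np.linalg.norm(dc))
print(round(cos, 3))  # ~1.0 -- near-perfect match in concept space
```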

(59)

Case study - LSI

Q1: How to do queries with LSI?

Q2: multi-lingual IR (English query, on Spanish text?)

(60)

Case study - LSI

• Problem:

given many documents, translated to both languages (e.g., English and Spanish)

answer queries across languages

(61)

Case study - LSI

• Solution: ~ LSI

[Figure: combined term-document matrix over the CS and MD documents, with both English terms (data, inf., retrieval, brain, lung) and the corresponding Spanish terms (datos, informacion, ...) as columns]
(62)

Switch Gear to Text Visualization

What comes to your mind?

What visualizations have you seen before?

(63)

Word/Tag Cloud (still popular?)

http://www.wordle.net

(64)

Word Counts (words as bubbles)

http://www.infocaptor.com/bubble-my-page


(65)

Word Tree

http://www.jasondavies.com/wordtree/


(66)

Phrase Net

http://www-958.ibm.com/software/data/cognos/manyeyes/page/Phrase_Net.html

Visualize pairs of words that satisfy a particular pattern, e.g., X and Y
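The underlying pattern matching is simple to sketch. Below is a hypothetical, minimal version in Python that finds word pairs joined by "and" (the "X and Y" pattern); the real Phrase Net supports user-defined patterns and weights graph edges by frequency:

```python
import re
from collections import Counter

text = ("bread and butter, war and peace, bread and butter; "
        "salt and pepper, war and peace, bread and butter")

# 'X and Y' pattern: two words joined by 'and'.
pairs = re.findall(r"\b(\w+)\s+and\s+(\w+)\b", text.lower())

# Edge weight = how often each pair occurs.
counts = Counter(pairs)
print(counts.most_common(2))
# [(('bread', 'butter'), 3), (('war', 'peace'), 2)]
```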
