Keeping Pace with Big Data

(1)

Keeping Pace with Big Data

- A Data Mining Perspective

Huan Liu

Data Mining and Machine Learning Lab Arizona State University, Tempe, AZ

http://www.public.asu.edu/~huanliu

(2)

•  Big Data is a good problem to have

•  Data mining is one way of approaching it

•  Together, we can harness it for better sci & eng

(3)

Keeping Pace with Big Data

•  Big data is not a new problem, but a persistent one

–  Why now?

•  We’re overwhelmed, we’re starting to appreciate the value of data, and data is generated ubiquitously (we’re part of the problem)

–  We have been dealing with it since we had data

•  Feature selection, as an example, to battle data explosion (mainly for attribute-value data)

•  Big data will only become bigger

–  Ubiquitous and fast growing linked data in the age of social media

•  Example continued: feature selection for linked data

•  Big data is a good problem to have

(4)

(5)

Begin with Attribute-Value Data

•  It is the most familiar form of data we encounter

–  Tables in Excel, Databases, …

–  Data is conveniently collected everywhere

•  Some typical challenges

–  Data overload (increasing in both width and length)

–  Data is collected for various reasons

–  Data accumulates at an unprecedented speed

–  Data itself does not offer any insight, but has potential

•  To make sense of massive amounts of data is to focus: using only relevant data

–  Data preprocessing is an important part of machine learning and data mining

(6)

Massive Data and High Dimensionality

•  Dimensionality of data has increased exponentially

[Figure: number of features (log scale, 1 to 10,000,000) by decade: 1980s, 1990s, 2000s]

(7)

•  Knowledge Discovery and Data Mining

•  Data mining

–  Applying analytical methods and tools to discover actionable patterns, construct statistical or predictive models, and identify relationships among massive data

(8)

Why Feature Selection?

•  Most machine learning and data mining techniques may not be effective for high-dimensional data

–  Curse of Dimensionality

–  Query accuracy and efficiency degrade rapidly as the dimensionality increases.

•  The intrinsic dimensionality may be small.

–  For example, the number of genes responsible

(9)

Classification

•  A process of predicting the classes of unseen instances based on patterns learned from available instances

•  Supervised learning with labeled data

[Figure: Classification Algorithm → Classification Rules (e.g., "If Hair = blonde and Location = no, then sunburned"), applied to Test Data and New Data]

(10)

Clustering

•  A process of grouping objects (or instances) into clusters so that objects are similar to one another within a cluster but dissimilar to objects in other clusters

•  Unsupervised learning with unlabeled data

•  Clustering tasks

(11)

Applications of Feature Selection

•  Customer relationship management

•  Text mining and visual analytics

•  Image retrieval

•  Microarray data analysis and protein classification

•  Face recognition and handwritten digit recognition

•  Intrusion detection

(12)

Online Document Classification

[Figure: document sources on the Internet: Digital Libraries (ACM Portal, IEEE Xplore, PubMed), Web Pages, Emails]

•  Task: To classify unlabeled documents into categories

•  Challenge: thousands of terms

•  Solution: to apply dimensionality reduction

Documents × Terms matrix (C is the category):

        T1   T2   …   TN   C
D1      12    0   …    6   Sports
D2       3   10   …   28   Travel
…        …    …   …    …   …
DM       0   11   …   16   Jobs
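A minimal sketch of how such a documents-by-terms count matrix could be built. The tokenizer (lowercase, split on whitespace), the function name `document_term_matrix` and the three toy documents are assumptions made for illustration, not part of the original slides.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term j in document i."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary T1..TN is the sorted set of all terms seen in any document.
    terms = sorted({t for tokens in tokenized for t in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)          # missing terms count as 0
        matrix.append([counts[t] for t in terms])
    return terms, matrix

# Hypothetical toy corpus with one category label per document.
docs = ["goal goal match", "flight hotel flight", "resume hiring match"]
labels = ["Sports", "Travel", "Jobs"]
terms, matrix = document_term_matrix(docs)
for row, label in zip(matrix, labels):
    print(row, label)
```

With thousands of distinct terms, each row becomes very wide and mostly zero, which is the situation where dimensionality reduction is applied before classification.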

(13)

Gene Expression Microarray Analysis

•  Task: To classify novel samples into known disease types (disease diagnosis)

•  Challenge: hundreds of thousands of genes, but only a few samples

•  Solution: Feature Selection

[Figure: Expression Microarray. Image courtesy of Affymetrix]

(14)

Other Types of High-Dimensional Data

(15)

Evaluation Measures for Ranking and Selecting Features

• The goodness of a feature or feature subset depends on the measure used

• Various measures

– Information measures

– Distance measures

– Dependence measures

– Consistency measures

– Accuracy measures

(16)

•  Entropy of variable X

•  Entropy of X after observing Y

•  Information Gain
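The formulas for these three quantities did not survive extraction. As a hedged sketch, the standard definitions are H(X) = -Σ P(x) log2 P(x), H(X|Y) = Σ P(y) H(X | Y = y), and IG(X|Y) = H(X) - H(X|Y). The Python below estimates them from observed value lists; the function names and the toy data are assumptions for illustration.

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) estimated from the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y): entropy of X within each value of Y, weighted by P(y)."""
    n = len(xs)
    total = 0.0
    for y, count in Counter(ys).items():
        subset = [x for x, yy in zip(xs, ys) if yy == y]
        total += (count / n) * entropy(subset)
    return total

def information_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

# Hypothetical toy data: one variable X and two candidate variables to observe.
x = ["yes", "yes", "no", "no"]
informative = ["b", "b", "r", "r"]    # determines X completely
uninformative = ["s", "t", "s", "t"]  # independent of X in this sample
print(information_gain(x, informative), information_gain(x, uninformative))
```

Ranking features by information gain with respect to the class is one way to use an information measure for feature selection.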

(17)

How to Validate Selection Results

• Direct evaluation (if we know a priori …)

– Often suitable for artificial data sets

– Based on prior knowledge about data

• Indirect evaluation (if we don’t know …)

– Often suitable for real-world data sets

– Based on a) number of features selected, b) performance on selected features (e.g., predictive accuracy, goodness of resulting clusters), and c) speed

(18)

Methods for Result Evaluation

•  Learning curves

– For results in the form of a ranked list of features

•  Before-and-after comparison

– For results in the form of a minimum subset

•  Comparison using different classifiers

– To avoid learning bias of a particular classifier

•  Repeating experimental results

– For non-deterministic results

[Figure: learning curve for one ranked list; x-axis: Number of Features, y-axis: Accuracy]
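A learning curve for a ranked list can be sketched as follows: evaluate a classifier on the top 1, top 2, ... features and record the accuracy at each step. The choice of a leave-one-out 1-nearest-neighbor classifier, the function names and the toy data are assumptions for illustration; any classifier and validation scheme could be substituted.

```python
def loo_1nn_accuracy(rows, labels, feature_idx):
    """Leave-one-out accuracy of 1-nearest-neighbor using only feature_idx."""
    correct = 0
    for i, row in enumerate(rows):
        best_j, best_d = None, float("inf")
        for j, other in enumerate(rows):
            if i == j:
                continue  # leave the query instance out
            d = sum((row[f] - other[f]) ** 2 for f in feature_idx)
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(rows)

def learning_curve(rows, labels, ranked_features):
    """Accuracy when using the top-k ranked features, for k = 1..len(ranking)."""
    return [loo_1nn_accuracy(rows, labels, ranked_features[:k])
            for k in range(1, len(ranked_features) + 1)]

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
rows = [[0.0, 5.0], [0.1, -5.0], [1.0, 5.1], [1.1, -5.1]]
labels = [0, 0, 1, 1]
curve = learning_curve(rows, labels, ranked_features=[0, 1])
print(curve)  # accuracy drops once the noisy feature is added
```

Repeating this with a second classifier gives the "comparison using different classifiers" check listed above.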

(19)

•  Six Chapters

1.  Data of High Dimensionality and Challenges

2.  Univariate Formulation of Spectral Feature Selection (SFS)

3.  Multivariate Formulations

4.  Connections to Existing Algorithms

5.  Large-Scale SFS

6.  Multi-Source SFS

Algorithms with software are available at dmml.asu.edu/sfs

(20)

From Attribute-Value Data to Linked Data

- We are living in an increasingly connected world

(21)

Traditional Media and Data

Broadcast Media: One-to-Many

(22)

Linked Data in the Age of Social Media

[Figure: Social Media: Social Networking, Blogs, Wikis, Forums, Content Sharing]

(23)

Social Media: Many-to-Many

• Everyone can be a media outlet or producer

• Disappearing communication barrier

• Distinct characteristics

–  User generated content: Massive, dynamic, extensive, instant, and noisy

–  Rich user interactions: Linked data

–  Collaborative environment: Wisdom of the crowd

–  Many small groups: The long tail phenomenon; and

–  Attention is hard to get

(24)

•  We often learn that:

– Noise should be removed before data mining; and

– “99% Twitter data is useless.”

•  “Had eggs, sunny-side-up, this morning”

•  Can we remove noise as we usually do in DM?

•  What is left after noise removal?

– Twitter data can be rendered useless after conventional noise removal

•  As we are certain there is noise in data and there

(25)

Linked Data and Attribute-Value Data

•  They exist for different purposes

–  Relations, Connections, or Links

–  Properties, Content, etc.

•  Classic machine learning and data mining methods assume “independent, identically distributed” or i.i.d. property for attribute-value data

•  Additional challenges with the confluence of attribute-value and linked data

–  User-generated

–  Large

–  Noisy, short, incomplete

–  Unstructured, or free form

(26)

Feature Selection for Social Media Data

•  Massive and high-dimensional social media data poses unique challenges to data mining tasks

– Scalability

– Curse of dimensionality

•  Social media data is inherently linked

– A key difference between social media data and

(27)

Arizona State University, NSF Workshop on Big Data Analytics, Beijing

Feature Selection of Social Media Data

•  Feature selection has been widely used to prepare large-scale, high-dimensional data for effective data mining

•  Traditional feature selection algorithms deal with only “flat” data (attribute-value data).

– Independent and Identically Distributed (i.i.d.)

•  We need to take advantage of linked data for feature selection

(28)

Representation for Social Media Data

User-post relations

[Figure: users u_1 to u_4 and posts p_1 to p_8, with post features up to f_m, class labels up to c_k, and binary relation matrices; user-post relations highlighted]

(29)

Representation for Social Media Data

User-user relations

[Figure: users u_1 to u_4 and posts p_1 to p_8, with post features up to f_m, class labels up to c_k, and binary relation matrices; user-user relations highlighted]

(30)

Representation for Social Media Data

Social Context

[Figure: users u_1 to u_4 and posts p_1 to p_8, with post features up to f_m, class labels up to c_k, and binary relation matrices; social context highlighted]

(31)

Problem Statement

•  Given labeled data X and its label indicator matrix Y, the dataset F, its social context including user-user following relationships S and user-post relationships P,

•  Select k most relevant features from m features on dataset F with its social context S and P

(32)

How to Use Link Information

•  The new question is how to proceed with additional information for feature selection

•  Two basic technical problems

– Relation extraction: What are distinctive relations that can be extracted from linked data

– Mathematical representation: How to use these relations in feature selection formulation

•  Do we have theories to guide us in this effort?

(33)

Relation Extraction

[Figure: users u_1 to u_4 and their posts p_1 to p_8, illustrating four relation types]

1. CoPost

2. CoFollowing

3. CoFollowed

4. Following
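The slide only names the four relation types. As a hedged sketch, the code below assumes the following readings: CoPost links posts by the same user, CoFollowing links posts of two users who follow the same third user, CoFollowed links posts of two users followed by the same third user, and Following links posts of a follower and a followee. The data layout (`posts_by_user`, `follows`), the function names and the toy network are hypothetical.

```python
from itertools import combinations

def cross_pairs(posts_by_user, u, v):
    """All (post of u, post of v) pairs."""
    return {(p, q) for p in posts_by_user.get(u, ())
                   for q in posts_by_user.get(v, ())}

def pairs_within_groups(posts_by_user, groups):
    """Post pairs between every two distinct users that share a group."""
    pairs = set()
    for group in groups.values():
        for u, v in combinations(sorted(group), 2):
            pairs |= cross_pairs(posts_by_user, u, v)
    return pairs

def extract_relations(posts_by_user, follows):
    """follows is a set of (a, b) pairs meaning 'a follows b'."""
    followers, followees = {}, {}
    for a, b in follows:
        followers.setdefault(b, set()).add(a)   # who follows b
        followees.setdefault(a, set()).add(b)   # whom a follows
    return {
        "CoPost": {pair for posts in posts_by_user.values()
                   for pair in combinations(sorted(posts), 2)},
        "CoFollowing": pairs_within_groups(posts_by_user, followers),
        "CoFollowed": pairs_within_groups(posts_by_user, followees),
        "Following": {pair for a, b in follows
                      for pair in cross_pairs(posts_by_user, a, b)},
    }

# Hypothetical toy network: u1 follows u2 and u4; u3 follows u4.
posts_by_user = {"u1": ["p1", "p2"], "u2": ["p3"], "u3": ["p4"], "u4": ["p5"]}
follows = {("u1", "u2"), ("u1", "u4"), ("u3", "u4")}
relations = extract_relations(posts_by_user, follows)
for name, pairs in relations.items():
    print(name, sorted(pairs))
```

Each relation yields a set of post pairs that a feature selection formulation can then constrain, for example by encouraging linked posts to have similar topics.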

(34)

Relations, Social Theories, Hypotheses

•  Social correlation theories suggest that the four relations may affect the relationships between posts

•  Social correlation theories

– Homophily: People with similar interests are more likely to be linked

– Influence: People who are linked are more likely to have similar interests

•  Thus, four relations lead to four hypotheses

(35)

Modeling CoFollowing Relation

•  Two co-following users have similar topics of interest

Users' topic interests:

\[ \hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^{\top} f_i}{|F_k|} \]

\[ \min_{W} \; \|X^{\top} W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2 \]
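To make the CoFollowing formulation concrete, here is a hedged sketch that evaluates an objective of this shape for a given weight matrix W: a squared fitting error between X^T W and Y, plus alpha times the sum of row norms of W (the 2,1-norm), plus beta times the squared distance between the averaged topic interests of co-following users. It assumes each post f_i is a column of X, that N_u is the set of users following u, and that each unordered pair of co-followers is counted once; the function names and toy values are hypothetical, and this only evaluates the objective, it does not minimize it.

```python
from itertools import combinations
from math import sqrt

def project(W, f):
    """W^T f: map an m-dimensional post f to k topic scores."""
    return [sum(W[r][c] * f[r] for r in range(len(f))) for c in range(len(W[0]))]

def topic_interest(W, X_cols, post_ids):
    """Average of W^T f_i over the posts of one user."""
    scores = [project(W, X_cols[i]) for i in post_ids]
    return [sum(col) / len(post_ids) for col in zip(*scores)]

def objective(W, X_cols, Y, posts_by_user, followers, alpha, beta):
    # Fitting term: squared Frobenius norm of X^T W - Y.
    fit = sum((p - y) ** 2
              for f, y_row in zip(X_cols, Y)
              for p, y in zip(project(W, f), y_row))
    # Sparsity term: 2,1-norm of W, i.e. the sum of the L2 norms of its rows.
    l21 = sum(sqrt(sum(w * w for w in row)) for row in W)
    # CoFollowing term: users who follow the same user should have similar
    # topic interests. Each unordered pair is counted once (an assumption).
    reg = 0.0
    for group in followers.values():
        for ui, uj in combinations(sorted(group), 2):
            ti = topic_interest(W, X_cols, posts_by_user[ui])
            tj = topic_interest(W, X_cols, posts_by_user[uj])
            reg += sum((a - b) ** 2 for a, b in zip(ti, tj))
    return fit + alpha * l21 + beta * reg

# Hypothetical toy problem: m = 2 features, n = 3 posts, k = 2 classes.
X_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # one feature vector per post
Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]        # label indicator rows
W = [[1.0, 0.0], [0.0, 1.0]]
posts_by_user = {"u1": [0], "u2": [2], "u3": [1]}
followers = {"u4": {"u1", "u3"}}                # u1 and u3 both follow u4
value = objective(W, X_cols, Y, posts_by_user, followers, alpha=0.5, beta=0.25)
print(value)
```

Rows of W whose norm is driven to zero by the 2,1-norm term correspond to features that are dropped, which is how a formulation of this kind selects features.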

(36)

(37)

(38)

Summary

•  LinkedFS is evaluated under varied circumstances to understand how it works.

– Link information can help feature selection for social media data.

•  Unlabeled data is more common in social media, so unsupervised learning is more sensible, but also more challenging.

Jiliang Tang and Huan Liu. "Unsupervised Feature Selection for Linked Social Media Data", the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.

(39)

Looking Ahead

•  New, rich data sources like social media present challenges and opportunities

– Feature selection is shown here for illustration

•  Challenges abound

– Data collection (sampling bias, is data enough?)

– Data preparation (what is noise?)

– Pattern discovery (content, context, networks)

– Evaluation (when without ground truth)

•  Big data allows more opportunities for researchers of different disciplines to conduct collaborative research

(40)

Thank You …

•  For this opportunity to share our research

•  Acknowledgments

– Grants from NSF, ONR, and ARO, among others

– DMML members and project leaders

– Collaborators

(41)

•  Big Data is a good problem to have

•  Data mining is one way of approaching it

•  Together, we can harness it for better sci & eng
