Keeping Pace with Big Data
- A Data Mining Perspective
Huan Liu
Data Mining and Machine Learning Lab Arizona State University, Tempe, AZ
http://www.public.asu.edu/~huanliu
• Big Data is a good problem to have
• Data mining is one way of approaching it
• Together, we can harness it for better sci & eng
Keeping Pace with Big Data
• Big data is not a new problem, but a persistent one
– Why now?
• We’re overwhelmed, we are starting to appreciate the value of data, and data is generated ubiquitously (we’re part of the problem)
– We have been dealing with it since we had data
• Feature selection, as an example, to battle data explosion (mainly for attribute-value data)
• Big data will only become bigger
– Ubiquitous and fast growing linked data in the age of social media
• Example continued: feature selection for linked data
• Big data is a good problem to have
Begin with Attribute-Value Data
• It is the most familiar form of data we encounter
– Tables in Excel, Databases, …
– Data is conveniently collected everywhere
• Some typical challenges
– Data overload (increasing in both width and length)
– Data is collected for various reasons
– Data accumulates at an unprecedented speed
– Data itself does not offer any insight, but has potential
• To make sense of massive amounts of data is to focus: use only relevant data
– Data preprocessing is an important part of machine learning and data mining
Massive Data and High Dimensionality
• Dimensionality of data has increased exponentially
[Figure: number of features (log scale, 1 to 10,000,000) by decade: 1980s, 1990s, 2000s]
• Knowledge Discovery and Data Mining
• Data mining
– Applying analytical methods and tools to discover actionable patterns, construct statistical or predictive models, and identify relationships among massive data
Why Feature Selection?
• Most machine learning and data mining techniques may not be effective for high-dimensional data
– Curse of Dimensionality
– Query accuracy and efficiency degrade rapidly as
the dimensionality increases.
• The intrinsic dimensionality may be small.
– For example, the number of genes responsible
Classification
• A process of predicting the classes of unseen instances based on patterns learned from available instances
• Supervised learning with labeled data
[Figure: a classification algorithm produces classification rules (e.g., If Hair = blonde and Location = no, then sunburned), which are applied to test data and new data]
Clustering
• A process of grouping objects (or instances) into clusters
so that objects are similar to one another within a cluster but dissimilar to objects in other clusters
• Unsupervised learning with unlabeled data
• Clustering tasks
Applications of Feature Selection
• Customer relationship management
• Text mining and visual analytics
• Image retrieval
• Microarray data analysis and protein classification
• Face recognition and handwritten digit recognition
• Intrusion detection
Online Document Classification
[Figure: document sources on the Internet: digital libraries (ACM Portal, IEEE Xplore, PubMed), web pages, emails]
• Task: to classify unlabeled documents into categories
• Challenge: thousands of terms
• Solution: to apply dimensionality reduction

Documents (rows) by terms (columns), with category C:
      T1   T2   …   TN   C
D1    12    0   …    6   Sports
D2     3   10   …   28   Travel
…      …    …   …    …   …
DM     0   11   …   16   Jobs
Gene Expression Microarray Analysis
• Task: to classify novel samples into known disease types (disease diagnosis)
• Challenge: hundreds of thousands of genes, but only a few samples
• Solution: feature selection
[Figure: expression microarray. Image courtesy of Affymetrix]
Other Types of High-Dimensional Data
Evaluation Measures for Ranking and Selecting Features
• The goodness of a feature or feature subset depends on the measure used
• Various measures
– Information measures
– Distance measures
– Dependence measures
– Consistency measures
– Accuracy measures
• Entropy of variable X
• Entropy of X after observing Y
• Information Gain
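The three information measures just listed can be sketched in a few lines of Python. The sketch assumes the standard definitions, which the slide names but does not spell out: H(X) = -Σ P(x) log2 P(x), H(X|Y) = Σ P(y) H(X | Y = y), and IG(X|Y) = H(X) - H(X|Y). The function names and the toy data are illustrative assumptions, not taken from the talk.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    """Shannon entropy H(X) in bits of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X|Y): entropy of X within each group of equal Y, weighted by P(y)."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[yi].append(xi)
    n = len(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def information_gain(x, y):
    """IG(X|Y) = H(X) - H(X|Y): how much observing Y reduces uncertainty in X."""
    return entropy(x) - conditional_entropy(x, y)

# Hypothetical toy data: a class label and two candidate features.
label = ["yes", "yes", "no", "no"]
hair = ["blonde", "blonde", "brown", "brown"]   # determines the label
height = ["tall", "short", "tall", "short"]     # tells nothing about the label

# Rank features by the information they provide about the label.
ranking = sorted(
    {"hair": hair, "height": height}.items(),
    key=lambda kv: information_gain(label, kv[1]),
    reverse=True,
)
print([name for name, _ in ranking])
```

Ranking features by information gain with respect to the class label is one simple way to turn an information measure into a feature ranking; other measures from the list would plug into the same `sorted` call.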
How to Validate Selection Results
• Direct evaluation (if we know a priori …)
– Often suitable for artificial data sets
– Based on prior knowledge about data
• Indirect evaluation (if we don’t know …)
– Often suitable for real-world data sets
– Based on a) number of features selected, b) performance on selected features (e.g., predictive accuracy, goodness of resulting clusters), and c) speed
Methods for Result Evaluation
• Learning curves
– For results in the form of a ranked list of features
• Before-and-after comparison
– For results in the form of a minimum subset
• Comparison using different classifiers
– To avoid learning bias of a particular classifier
• Repeating experimental results
– For non-deterministic results
[Figure: learning curve for one ranked list; x-axis: number of features, y-axis: accuracy]
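A learning curve for a ranked list can be sketched as follows: evaluate a classifier on the top-1, top-2, ... features and record the accuracy at each step. The classifier here (leave-one-out 1-nearest-neighbour), the function names, and the toy data are all assumptions chosen to keep the example self-contained; the talk does not prescribe them.

```python
def loo_1nn_accuracy(rows, labels, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature indices (squared Euclidean distance)."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[f] - rows[j][f]) ** 2 for f in features),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def learning_curve(rows, labels, ranked_features):
    """Accuracy using the top-1, top-2, ... features of a ranked list."""
    return [
        loo_1nn_accuracy(rows, labels, ranked_features[:k])
        for k in range(1, len(ranked_features) + 1)
    ]

# Hypothetical data: feature 0 separates the classes, feature 1 is noise
# on a much larger scale.
rows = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 1.2], [1.1, 5.1], [1.2, 9.1]]
labels = ["a", "a", "a", "b", "b", "b"]

curve = learning_curve(rows, labels, ranked_features=[0, 1])
for k, acc in enumerate(curve, start=1):
    print(f"top-{k} features: accuracy = {acc:.2f}")
```

On this toy data the accuracy falls once the noisy second feature is added, which is the kind of pattern a learning curve is meant to expose. Repeating the same loop with a different classifier gives the "comparison using different classifiers" check.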
• Six Chapters
1. Data of High Dimensionality and Challenges
2. Univariate Formulation of Spectral Feature Selection (SFS)
3. Multivariate Formulations
4. Connections to Existing Algorithms
5. Large-Scale SFS
6. Multi-Source SFS
Algorithms with software are available at dmml.asu.edu/sfs
From Attribute-Value Data to Linked Data
- We are living in an increasingly connected world
Traditional Media and Data
Broadcast Media: One-to-Many
Linked Data in the Age of Social Media
[Figure: social media: social networking, blogs, wikis, forums, content sharing]
Social Media: Many-to-Many
• Everyone can be a media outlet or producer
• Disappearing communication barrier
• Distinct characteristics
– User generated content: Massive, dynamic, extensive, instant, and noisy
– Rich user interactions: Linked data
– Collaborative environment: Wisdom of the crowd
– Many small groups: The long tail phenomenon
– Attention is hard to get
• We often learn that:
– Noise should be removed before data mining; and
– “99% of Twitter data is useless.”
• “Had eggs, sunny-side-up, this morning”
• Can we remove noise as we usually do in DM?
• What is left after noise removal?
– Twitter data can be rendered useless after conventional noise removal
• As we are certain there is noise in data and there
Linked Data and Attribute-Value Data
• They exist for different purposes
– Linked data: relations, connections, or links
– Attribute-value data: properties, content, etc.
• Classic machine learning and data mining methods assume the “independent, identically distributed” (i.i.d.) property for attribute-value data
• Additional challenges with the confluence of attribute-value and linked data
– User-generated
– Large
– Noisy, short, incomplete
– Unstructured, or free form
Feature Selection for Social Media Data
• Massive and high-dimensional social media data poses unique challenges to data mining tasks
– Scalability
– Curse of dimensionality
• Social media data is inherently linked
– A key difference between social media data and
Keeping Pace with Big Data
Arizona State University, NSF Workshop on Big Data Analytics, Beijing
Feature Selection of Social Media Data
• Feature selection has been widely used to prepare large-scale, high-dimensional data for effective data mining
• Traditional feature selection algorithms deal with only “flat” data (attribute-value data)
– Independent and Identically Distributed (i.i.d.)
• We need to take advantage of linked data for feature selection
Representation for Social Media Data
[Figure: social media data representation. Posts p1 … p8 with features … fm and class labels … ck; social context: user-post relations and user-user relations among users u1 … u4]
Problem Statement
• Given labeled data X and its label indicator matrix Y, the dataset F, and its social context, including user-user following relationships S and user-post relationships P,
• Select the k most relevant features from the m features of dataset F, using its social context S and P
How to Use Link Information
• The new question is how to proceed with additional information for feature selection
• Two basic technical problems
– Relation extraction: What are distinctive relations that can be extracted from linked data
– Mathematical representation: How to use these relations in feature selection formulation
• Do we have theories to guide us in this effort?
Relation Extraction
[Figure: users u1 … u4 and their posts p1 … p8, illustrating four relation types]
1. CoPost
2. CoFollowing
3. CoFollowed
4. Following
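As a rough sketch, the four relation types can be extracted from user-post and user-user links as sets of post pairs. The definitions used here are assumptions, since the slide only names the relations: CoPost links two posts by the same user, CoFollowing links posts of two users who follow a common third user, CoFollowed links posts of two users who share a common follower, and Following links posts of a follower and the user being followed. The function name, data layout, and toy network are hypothetical.

```python
from itertools import combinations

def extract_relations(posts_by_user, follows):
    """Return four sets of post pairs, one per relation type.

    posts_by_user: dict mapping user -> list of post ids by that user
    follows: set of (follower, followee) pairs, assumed free of self-follows
    """
    users = sorted(posts_by_user)
    followees = {u: {b for a, b in follows if a == u} for u in users}
    followers = {u: {a for a, b in follows if b == u} for u in users}

    def cross(u, v):
        # All post pairs with one post from u and one post from v.
        return {(p, q) for p in posts_by_user[u] for q in posts_by_user[v]}

    relations = {"CoPost": set(), "CoFollowing": set(),
                 "CoFollowed": set(), "Following": set()}
    for u in users:
        relations["CoPost"] |= set(combinations(posts_by_user[u], 2))
    for u, v in combinations(users, 2):
        if followees[u] & followees[v]:   # both follow a common user
            relations["CoFollowing"] |= cross(u, v)
        if followers[u] & followers[v]:   # both share a common follower
            relations["CoFollowed"] |= cross(u, v)
    for a, b in follows:                  # a follows b directly
        relations["Following"] |= cross(a, b)
    return relations

# Hypothetical toy network (not the one drawn in the talk).
posts_by_user = {"u1": ["p1", "p2"], "u2": ["p3"], "u3": ["p4"], "u4": ["p5"]}
follows = {("u1", "u3"), ("u2", "u3"), ("u4", "u1"), ("u4", "u3")}

relations = extract_relations(posts_by_user, follows)
for name, pairs in relations.items():
    print(name, sorted(pairs))
```

Representing every relation as post pairs keeps the output uniform, so any of the four sets could feed the same kind of pairwise term in a feature selection formulation.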
Relations, Social Theories, Hypotheses
• Social correlation theories suggest that the four relations may affect the relationships between posts
• Social correlation theories
– Homophily: People with similar interests are more likely to be linked
– Influence: People who are linked are more likely to have similar interests
• Thus, four relations lead to four hypotheses
Modeling CoFollowing Relation
• Two co-following users have similar topics of interest
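One way to read this hypothesis numerically: estimate a user's topic interest as the mean of W^T f_i over that user's posts f_i, and penalize the squared distance between the interests of co-following users. The sketch below computes only that penalty for a given weight matrix W; optimizing W is out of scope, and the function names and toy values are assumptions made for illustration.

```python
def topic_of_post(W, f):
    """T(f) = W^T f: map an m-dimensional post vector f to k topic scores.
    W is an m x k matrix given as a list of m rows."""
    k = len(W[0])
    return [sum(W[r][c] * f[r] for r in range(len(f))) for c in range(k)]

def user_topic_interest(W, posts):
    """Mean of W^T f_i over the posts f_i written by one user."""
    topics = [topic_of_post(W, f) for f in posts]
    return [sum(col) / len(posts) for col in zip(*topics)]

def cofollowing_penalty(W, posts_by_user, cofollowing_pairs):
    """Sum of squared Euclidean distances between the topic interests
    of each pair of co-following users."""
    t_hat = {u: user_topic_interest(W, p) for u, p in posts_by_user.items()}
    return sum(
        sum((a - b) ** 2 for a, b in zip(t_hat[u], t_hat[v]))
        for u, v in cofollowing_pairs
    )

# Hypothetical toy setup: m = 2 features, k = 1 topic, two co-following users.
posts_by_user = {
    "u1": [[1.0, 2.0], [3.0, 2.0]],
    "u2": [[0.0, 2.0]],
}
pairs = [("u1", "u2")]

W_first = [[1.0], [0.0]]    # weights only the first feature
W_second = [[0.0], [1.0]]   # weights only the second feature

penalty_first = cofollowing_penalty(W_first, posts_by_user, pairs)
penalty_second = cofollowing_penalty(W_second, posts_by_user, pairs)
print(penalty_first, penalty_second)
```

In this toy case the second feature is the one on which the two co-following users agree, so weighting it incurs no penalty, while weighting the first feature does. That is the intuition behind letting link information steer which features are kept.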
Users' topic interests:
\[ \hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^T f_i}{|F_k|} \]
Objective:
\[ \min_W \; \|X^T W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2 \]
Summary
• LinkedFS is evaluated under varied circumstances to understand how it works.
– Link information can help feature selection for social media data.
• Unlabeled data is more common in social media; unsupervised learning is more sensible, but also more challenging.
Jiliang Tang and Huan Liu. “Unsupervised Feature Selection for Linked Social Media Data”, the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
Looking Ahead
• New, rich data sources like social media present challenges and opportunities
– Feature selection is shown here for illustration
• Challenges abound
– Data collection (sampling bias, is data enough?)
– Data preparation (what is noise?)
– Pattern discovery (content, context, networks)
– Evaluation (when without ground truth)
• Big data allows more opportunities for researchers of different disciplines to conduct collaborative research
Thank You …
• For this opportunity to share our research
• Acknowledgments
– Grants from NSF, ONR, and ARO, among others
– DMML members and project leaders
– Collaborators
• Big Data is a good problem to have
• Data mining is one way of approaching it
• Together, we can harness it for better sci & eng