
(1)

Keeping Pace with Big Data

- A Data Mining Perspective

Huan Liu

Data Mining and Machine Learning Lab, Arizona State University, Tempe, AZ

http://www.public.asu.edu/~huanliu


(2)

• Big Data is a good problem to have

• Data mining is one way of approaching it

• Together, we can harness it for better science and engineering

(3)

Keeping Pace with Big Data

• Big data is not a new problem, but a persistent one

– Why now?

• We're overwhelmed, are starting to appreciate the value of data, and data is generated ubiquitously (we're part of the problem)

– We have been dealing with it since we had data

• Feature selection, as an example, to battle data explosion (mainly for attribute-value data)

• Big data will only become bigger

– Ubiquitous and fast-growing linked data in the age of social media

• Example continued: feature selection for linked data

• Big data is a good problem to have

(4)
(5)

Begin with Attribute-Value Data

• It is the most familiar form of data we encounter

– Tables in Excel, Databases, …

– Data is conveniently collected everywhere

• Some typical challenges

– Data overload (increasing in both width and length)

– Data is collected for various reasons

– Data accumulates at an unprecedented speed

– Data itself does not offer any insight, but has potential

• To make sense of massive amounts of data is to focus: using only relevant data

– Data preprocessing is an important part of machine learning and data mining

(6)

Massive Data and High Dimensionality

• Dimensionality of data has increased exponentially

[Figure: number of features (log scale, 1 to 10,000,000) by decade: 1980s, 1990s, 2000s]

(7)

• Knowledge Discovery and Data Mining

• Data mining

– Applying analytical methods and tools to discover actionable patterns, construct statistical or predictive models, and identify relationships among massive data

(8)

Why Feature Selection?

• Most machine learning and data mining techniques may not be effective for high-dimensional data

– Curse of Dimensionality

– Query accuracy and efficiency degrade rapidly as the dimensionality increases.

• The intrinsic dimensionality may be small.

– For example, the number of genes responsible

(9)

Classification

• A process of predicting the classes of unseen instances based on patterns learned from available instances

• Supervised learning with labeled data

[Figure labels: Classification Algorithm; Classification Rules (If Hair = blonde and Location = no, then sunburned); Test Data; New Data]

(10)

Clustering  

• A process of grouping objects (or instances) into clusters so that objects are similar to one another within a cluster but dissimilar to objects in other clusters

• Unsupervised learning with unlabeled data

• Clustering tasks

(11)

Applications of Feature Selection

• Customer relationship management

• Text mining and visual analytics

• Image retrieval

• Microarray data analysis and protein classification

• Face recognition and handwritten digit recognition

• Intrusion detection

(12)

Online Document Classification

[Figure labels: Internet; Digital Libraries (ACM Portal, IEEE Xplore, PubMed); Web Pages; Emails]

• Task: To classify unlabeled documents into categories

• Challenge: thousands of terms

• Solution: to apply dimensionality reduction

Term-document matrix (rows: documents D1 to DM; columns: terms T1 to TN; C: category)

Documents   T1   T2   ...   TN   C
D1          12   0    ...   6    Sports
D2          3    10   ...   28   Travel
...
DM          0    11   ...   16   Jobs
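A minimal sketch of the term-document representation on this slide, with one simple way to shrink the term dimension. The toy documents, the helper names `build_doc_term_matrix` and `select_by_document_frequency`, and the choice of document-frequency thresholding are illustrative assumptions; the slide only says "dimensionality reduction" and does not name a method.

```python
from collections import Counter


def build_doc_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def select_by_document_frequency(vocab, matrix, min_df):
    """Keep only the terms that occur in at least min_df documents."""
    keep = [j for j in range(len(vocab))
            if sum(1 for row in matrix if row[j] > 0) >= min_df]
    reduced = [[row[j] for j in keep] for row in matrix]
    return [vocab[j] for j in keep], reduced


# Hypothetical toy corpus: one short document per category.
docs = [
    "goal match team goal",
    "flight hotel team",
    "resume interview hotel",
]
vocab, matrix = build_doc_term_matrix(docs)
kept_vocab, reduced = select_by_document_frequency(vocab, matrix, min_df=2)
print(len(vocab), "terms before,", len(kept_vocab), "terms after:", kept_vocab)
```

Document-frequency thresholding is only one of many possible filters; any of the evaluation measures discussed later in the deck could be used to rank terms instead.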

(13)

Gene Expression Microarray Analysis

• Task: To classify novel samples into known disease types (disease diagnosis)

• Challenge: hundreds of thousands of genes, but a few samples

• Solution: Feature Selection

[Figure: expression microarray. Image courtesy of Affymetrix]

(14)

Other Types of High-Dimensional Data

(15)

Evaluation Measures for Ranking and Selecting Features

The goodness of a feature/feature subset is dependent on measures

Various measures

– Information measures

– Distance measures

– Dependence measures

– Consistency measures

– Accuracy measures

(16)

• Entropy of variable X

• Entropy of X after observing Y

• Information Gain
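A short sketch of the three quantities named on this slide, assuming the standard Shannon definitions: H(X) = -Σ P(x) log2 P(x), H(X|Y) = Σ P(y) H(X | Y = y), and IG(X|Y) = H(X) - H(X|Y). The function names and the toy data are illustrative, and probabilities are estimated by simple relative frequencies.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """Entropy of x after observing y: weighted entropy of x within each y group."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain(x, y):
    """Reduction in the entropy of x obtained by observing y."""
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical toy data: a class label and two candidate features.
label = ["a", "a", "b", "b"]
informative = [0, 0, 1, 1]    # determines the label completely
uninformative = [0, 1, 0, 1]  # tells us nothing about the label
print(information_gain(label, informative), information_gain(label, uninformative))
```

Ranking features by information gain with respect to the class label is one way to turn an information measure into a feature ranking.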

(17)

How to Validate Selection Results

Direct evaluation (if we know a priori …)

– Often suitable for artificial data sets

– Based on prior knowledge about data

Indirect evaluation (if we don't know …)

– Often suitable for real-world data sets

– Based on a) number of features selected, b) performance on selected features (e.g., predictive accuracy, goodness of resulting clusters), and c) speed


(18)

Methods for Result Evaluation

• Learning curves

– For results in the form of a ranked list of features

• Before-and-after comparison

– For results in the form of a minimum subset

• Comparison using different classifiers

– To avoid learning bias of a particular classifier

• Repeating experimental results

– For non-deterministic results

[Figure: learning curve for one ranked list; x-axis: number of features, y-axis: accuracy]
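The learning-curve idea can be sketched as follows: take a ranked list of features, train on the top 1, top 2, ... features, and record accuracy at each step. The nearest-centroid classifier, the function names, and the toy data are assumptions chosen to keep the example small; the slide does not prescribe a classifier.

```python
def nearest_centroid_accuracy(train_X, train_y, test_X, test_y, features):
    """Accuracy of a nearest-centroid classifier restricted to the given feature indices."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(row):
        # Pick the class whose centroid is closest in squared Euclidean distance.
        return min(centroids, key=lambda lab: sum(
            (row[j] - c) ** 2 for j, c in zip(features, centroids[lab])))

    hits = sum(predict(row) == label for row, label in zip(test_X, test_y))
    return hits / len(test_y)


def learning_curve(ranked_features, train_X, train_y, test_X, test_y):
    """Return [(k, accuracy using the top-k ranked features)] for k = 1..len(ranking)."""
    return [(k, nearest_centroid_accuracy(train_X, train_y, test_X, test_y,
                                          ranked_features[:k]))
            for k in range(1, len(ranked_features) + 1)]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is misleading.
train_X = [[0, 1], [1, 3], [10, 2], [11, 6]]
train_y = [0, 0, 1, 1]
test_X = [[0.5, 4], [10.5, 2]]
test_y = [0, 1]

good_curve = learning_curve([0, 1], train_X, train_y, test_X, test_y)
bad_curve = learning_curve([1, 0], train_X, train_y, test_X, test_y)
print("good ranking:", good_curve)
print("bad ranking:", bad_curve)
```

A good ranking reaches high accuracy with few features, while a poor ranking needs more features to get there, which is what the curve makes visible.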

(19)

• Six Chapters

1. Data of High Dimensionality and Challenges
2. Univariate Formulation of Spectral Feature Selection (SFS)
3. Multivariate Formulations
4. Connections to Existing Algorithms
5. Large-Scale SFS
6. Multi-Source SFS

Algorithms with software are available at dmml.asu.edu/sfs

(20)

From Attribute-Value Data to Linked Data

- We are living in an increasingly connected world

(21)

Traditional Media and Data

Broadcast Media: One-to-Many

(22)

Linked Data in the Age of Social Media

[Figure labels: Social Media; Social Networking; Blogs; Wikis; Forums; Content Sharing]

(23)

Social Media: Many-to-Many

Everyone can be a media outlet or producer

Disappearing communication barrier

Distinct characteristics

– User-generated content: massive, dynamic, extensive, instant, and noisy

– Rich user interactions: linked data

– Collaborative environment: wisdom of the crowd

– Many small groups: the long tail phenomenon

– Attention is hard to get

(24)

• We often learn that:

– Noise should be removed before data mining; and

– "99% Twitter data is useless."

• "Had eggs, sunny-side-up, this morning"

• Can we remove noise as we usually do in DM?

• What is left after noise removal?

– Twitter data can be rendered useless after conventional noise removal

• As we are certain there is noise in data and there

(25)

Linked Data and Attribute-Value Data

• They exist for different purposes

– Relations, Connections, or Links

– Properties, Content, etc.

• Classic machine learning and data mining methods assume the "independent, identically distributed" (i.i.d.) property for attribute-value data

• Additional challenges with the confluence of attribute-value and linked data

– User-generated

– Large

– Noisy, short, incomplete

– Unstructured, or free form

(26)

Feature Selection for Social Media Data

• Massive and high-dimensional social media data poses unique challenges to data mining tasks

– Scalability

– Curse of dimensionality

• Social media data is inherently linked

– A key difference between social media data and

(27)

Arizona State University, NSF Workshop on Big Data Analytics, Beijing

Feature Selection of Social Media Data

• Feature selection has been widely used to prepare large-scale, high-dimensional data for effective data mining

• Traditional feature selection algorithms deal with only "flat" data (attribute-value data).

– Independent and Identically Distributed (i.i.d.)

• We need to take advantage of linked data for feature selection


(28)

Representation for Social Media Data

[Figure: users u_1 to u_4, posts p_1 to p_8, features up to f_m, and class labels up to c_k; highlighted: user-post relations]

(29)

Representation for Social Media Data

[Figure: users u_1 to u_4, posts p_1 to p_8, features up to f_m, and class labels up to c_k; highlighted: user-user relations]

(30)

Representation for Social Media Data

[Figure: users u_1 to u_4, posts p_1 to p_8, features up to f_m, and class labels up to c_k; highlighted: social context]

(31)

Problem Statement

• Given labeled data X and its label indicator matrix Y, the dataset F, and its social context, including user-user following relationships S and user-post relationships P,

• Select the k most relevant features from the m features on dataset F with its social context S and P


(32)

How to Use Link Information

• The new question is how to proceed with additional information for feature selection

• Two basic technical problems

– Relation extraction: What are distinctive relations that can be extracted from linked data

– Mathematical representation: How to use these relations in feature selection formulation

• Do we have theories to guide us in this effort?

(33)

Relation Extraction

[Figure: users u_1 to u_4 and posts p_1 to p_8, with four relation types marked]

1. CoPost
2. CoFollowing
3. CoFollowed
4. Following
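A rough sketch of extracting the four relation types from a user-post mapping and a follower graph. The slide gives only the relation names, so the definitions used here are assumptions: CoPost links two posts by the same user, CoFollowing links two users who follow a common user, CoFollowed links two users who share a common follower, and Following is the direct follower-followee link. The function name and the toy data are hypothetical.

```python
from itertools import combinations


def extract_relations(posts_by_user, follows):
    """posts_by_user: dict user -> list of post ids.
    follows: set of (follower, followee) pairs over the same users."""
    users = sorted(posts_by_user)
    followees = {u: {b for a, b in follows if a == u} for u in users}
    followers = {u: {a for a, b in follows if b == u} for u in users}

    # CoPost: pairs of posts written by the same user.
    co_post = {pair for u in users
               for pair in combinations(sorted(posts_by_user[u]), 2)}
    # CoFollowing: pairs of users who follow at least one common user.
    co_following = {(u, v) for u, v in combinations(users, 2)
                    if followees[u] & followees[v]}
    # CoFollowed: pairs of users who have at least one common follower.
    co_followed = {(u, v) for u, v in combinations(users, 2)
                   if followers[u] & followers[v]}
    return {
        "CoPost": co_post,
        "CoFollowing": co_following,
        "CoFollowed": co_followed,
        "Following": set(follows),
    }


# Hypothetical toy network with four users and eight posts.
posts_by_user = {
    "u1": ["p1", "p2"],
    "u2": ["p3", "p4", "p5"],
    "u3": ["p6", "p7"],
    "u4": ["p8"],
}
follows = {("u1", "u2"), ("u1", "u4"), ("u3", "u4")}
relations = extract_relations(posts_by_user, follows)
for name, pairs in relations.items():
    print(name, sorted(pairs))
```

Each relation yields a set of pairs that can later be turned into a regularization term over the corresponding posts or users.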

(34)

Relations, Social Theories, Hypotheses

• Social correlation theories suggest that the four relations may affect the relationships between posts

• Social correlation theories

– Homophily: People with similar interests are more likely to be linked

– Influence: People who are linked are more likely to have similar interests

• Thus, four relations lead to four hypotheses

(35)

Modeling CoFollowing Relation

• Two co-following users have similar topics of interest

Users' topic interests:

$$\hat{T}(u_k) = \frac{\sum_{f_i \in F_k} T(f_i)}{|F_k|} = \frac{\sum_{f_i \in F_k} W^T f_i}{|F_k|}$$

$$\min_{W} \; \|X^T W - Y\|_F^2 + \alpha \|W\|_{2,1} + \beta \sum_{u} \sum_{u_i, u_j \in N_u} \|\hat{T}(u_i) - \hat{T}(u_j)\|_2^2$$
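A minimal sketch that evaluates the objective on this slide for a given W, to make the three terms concrete: the squared Frobenius fitting loss, the l2,1 norm of W (sum of row norms, which pushes whole feature rows to zero), and the co-following regularizer on users' averaged topic interests. The data layout (W as a list of m rows of length c, one feature vector per post), the function names, and the reading of N_u as a group of users co-following u are assumptions; the sketch only evaluates the objective and does not optimize it.

```python
import math


def project(W, f):
    """Return W^T f for W (m x c, list of rows) and a feature vector f of length m."""
    return [sum(W[r][k] * f[r] for r in range(len(W))) for k in range(len(W[0]))]


def user_topic_interest(W, posts):
    """Average of W^T f_i over one user's post vectors (T-hat on the slide)."""
    topics = [project(W, f) for f in posts]
    return [sum(col) / len(posts) for col in zip(*topics)]


def linked_objective(W, labeled_posts, Y, posts_by_user, co_following_groups,
                     alpha, beta):
    # Fitting term ||X^T W - Y||_F^2: each labeled post is one column of X.
    fit = sum((p - y) ** 2
              for f, y_row in zip(labeled_posts, Y)
              for p, y in zip(project(W, f), y_row))
    # l2,1 norm: sum of the Euclidean norms of the rows of W (one row per feature).
    l21 = sum(math.sqrt(sum(w * w for w in row)) for row in W)
    # Regularizer: squared distance between topic interests of co-following users.
    # Summing over ordered pairs counts each unordered pair twice, as in a literal
    # reading of the double sum over u_i, u_j in N_u.
    t_hat = {u: user_topic_interest(W, posts) for u, posts in posts_by_user.items()}
    link = 0.0
    for group in co_following_groups:
        for ui in group:
            for uj in group:
                link += sum((a - b) ** 2 for a, b in zip(t_hat[ui], t_hat[uj]))
    return fit + alpha * l21 + beta * link


# Hypothetical toy problem: m = 3 features, c = 2 classes, two labeled posts.
W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # third feature row is all zeros
labeled_posts = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
Y = [[1.0, 0.0], [0.0, 1.0]]
posts_by_user = {"u1": [[1.0, 0.0, 2.0]], "u3": [[0.0, 1.0, 3.0]]}
co_following_groups = [["u1", "u3"]]

value = linked_objective(W, labeled_posts, Y, posts_by_user, co_following_groups,
                         alpha=0.5, beta=0.25)
print(value)
```

In this toy case the fitting loss is zero, the all-zero third row of W contributes nothing to the l2,1 term, and the regularizer penalizes the two co-following users for having different topic interests.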

(36)
(37)
(38)

Summary

• LinkedFS is evaluated under varied circumstances to understand how it works.

– Link information can help feature selection for social media data.

• Unlabeled data is more common in social media, so unsupervised learning is more sensible, but also more challenging.

Jiliang Tang and Huan Liu. "Unsupervised Feature Selection for Linked Social Media Data", the Eighteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.

(39)

Looking Ahead

• New, rich data sources like social media present challenges and opportunities

– Feature selection is shown here for illustration

• Challenges abound

– Data collection (sampling bias, is data enough?)

– Data preparation (what is noise?)

– Pattern discovery (content, context, networks)

– Evaluation (when without ground truth)

• Big data allows more opportunities for researchers of different disciplines to conduct collaborative research

(40)

Thank You …

• For this opportunity to share our research

• Acknowledgments

– Grants from NSF, ONR, and ARO, among others

– DMML members and project leaders

– Collaborators

(41)

• Big Data is a good problem to have

• Data mining is one way of approaching it

• Together, we can harness it for better science and engineering
