Deep Learning for Big Data

Academic year: 2021

(1)

Deep Learning for Big Data

Yoshua Bengio

Département d'Informatique et Recherche Opérationnelle, U. Montréal

30 May 2013, Journée de la recherche, École Polytechnique, Montréal

(2)

Big Data & Data Science

• Super-hot buzzword
• Data deluge
• Two sides of the coin:
  1. Allowing computers to "understand" the data (perception)
  2. Allowing computers to "take decisions" (action)
• My research: 1.
• CERC in Data Science and Real-Time Decision-Making:
  • The necessity to combine 1 and 2.

(3)

Big Data — a Growing Torrent

Business executives are faced with a relentless and exponential growth of data that can be collected by their enterprises.

Figure: The Economist

1 exabyte = 1 billion gigabytes

• 40% projected growth in global data generated per year vs. 5% growth in global IT spending
• 5 billion mobile phones in use in 2010
• 30 billion pieces of content shared on Facebook every month

Data: McKinsey
(4)

Source:  McKinsey  

Big Data — Big Value

Making sense of this data could unleash substantial value across an array of industries.

• $300 billion potential annual value to US health care
• €250 billion potential annual value to Europe's public sector
• $600 billion potential annual consumer surplus from using personal location data globally
• 60% potential increase in retailers' operating margins possible with big data

(5)

Big Data: in the minds of executives

There are many reasons to believe that, since last year, turning data into a competitive advantage is becoming a top-of-mind C-level issue.

• The Economist Special Report, 2010
• McKinsey White Paper, 2011
• O'Reilly Strata Conference, twice-yearly event, started 2011
• #bigdata on Twitter

"The world of Big Data is on fire" (The Economist, Sept 2011)

(6)

Data Science: automatically extracting knowledge from data

From: Yann LeCun, Lecture 1 on Big Data, large scale machine learning, 2013

(7)

Decision Science + Machine Learning

• The topic of a successful CERC application
• Why?
  • Data deluge & real-time online learning
  • Learned models are used to take decisions on the fly
  • The data used to train depends on the decisions taken
  • Can't separate the learning from the decisions like in traditional OR & ML setups
• Examples:
  • Online advertising & recommendation systems
  • Online video games
  • Fraud detection, targeted marketing, etc.
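The coupling between learning and decisions described above can be sketched with an epsilon-greedy bandit, a standard technique chosen here purely for illustration (the slides do not name an algorithm). The function name, the click rates, and all parameter values are hypothetical. The point is that the learner only observes rewards for the options it chose, so its training data depends on its own decisions.

```python
import random

def run_epsilon_greedy(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy learner on made-up click rates.

    Returns how often each option was shown and its estimated click rate.
    """
    rng = random.Random(seed)
    n_arms = len(click_rates)
    counts = [0] * n_arms      # times each option was shown
    values = [0.0] * n_arms    # running mean reward per option
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick an option uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the option with the best current estimate.
            arm = max(range(n_arms), key=lambda a: values[a])
        # The reward is observed only for the option that was chosen.
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

# Hypothetical click rates for three ads or recommendations.
counts, values = run_epsilon_greedy([0.1, 0.5, 0.9])
print(counts, [round(v, 2) for v in values])
```

Because the estimates for rarely chosen options rest on few observations, the learning step cannot be evaluated separately from the decision rule that produced the data.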

(8)

Ultimate Goals for AI

• AI
  • Needs knowledge
  • Needs learning
  • Needs generalizing where probability mass concentrates
  • Needs to fight the curse of dimensionality
  • Needs disentangling the underlying explanatory factors ("making sense of the data")

(9)

Easy Learning

[Figure: examples (x, y), marked *, scattered along the true unknown function; the learned function gives prediction = f(x). Axes: x, y.]

(10)

Local Smoothness Prior: Locally Capture the Variations

[Figure: training examples, marked *, along the unknown true function; the learnt f(x) is interpolated between them and gives the prediction at a test point x. Axes: x, y.]

x ≈ x' ⇒ f(x) ≈ f(x')
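A minimal sketch of a learner that relies only on this smoothness prior, assuming a Gaussian kernel smoother (Nadaraya-Watson) as the interpolation rule. The slide does not name a method; the function name, the bandwidth, and the sine target are assumptions made for illustration.

```python
import numpy as np

def kernel_predict(x_train, y_train, x_test, bandwidth=0.05):
    """Predict at x_test as a weighted average of training targets.

    Weights are Gaussian in the distance to each training input, so
    nearby test points receive nearly identical predictions.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_test = np.atleast_1d(np.asarray(x_test, dtype=float))
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = (x_test[:, None] - x_train[None, :]) ** 2
    weights = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (weights @ y_train) / weights.sum(axis=1)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, size=200))
y_train = np.sin(2 * np.pi * x_train)   # stands in for the unknown true function
x_test = np.array([0.25, 0.251])        # two nearby test points
pred = kernel_predict(x_train, y_train, x_test)
print(pred)
```

The two predictions are close because the inputs are close. The same mechanism is why such a learner needs training examples near every test point.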

(11)

What We Are Fighting Against:

The Curse of Dimensionality

To generalize locally, we need representative examples for all relevant variations!
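The arithmetic behind this can be sketched by counting grid cells, assuming each input dimension is cut into a fixed number of bins (a simplification introduced here; the function names are hypothetical). A purely local learner needs at least one example per cell it is asked about.

```python
def cells_needed(bins_per_axis: int, n_dims: int) -> int:
    """Number of grid cells when each of n_dims axes is cut into bins_per_axis bins."""
    return bins_per_axis ** n_dims

def max_coverage(n_examples: int, bins_per_axis: int, n_dims: int) -> float:
    """Upper bound on the fraction of cells that contain at least one example."""
    return min(1.0, n_examples / cells_needed(bins_per_axis, n_dims))

# With 10 bins per axis, a million examples cover less and less of the space.
for d in (1, 2, 3, 10, 20):
    print(d, cells_needed(10, d), max_coverage(1_000_000, 10, d))
```

The number of cells grows exponentially with the dimension, so the fraction that any fixed dataset can cover collapses.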

(12)

Manifold Learning

Prior: examples concentrate near a lower-dimensional manifold

(13)

Putting Probability Mass where Structure is Plausible

• Empirical distribution: mass at training examples
• Smoothness: spread mass around
  • Insufficient
• Guess 'structure' and generalize accordingly

(14)

Representation Learning

• Good input features essential for successful ML (feature engineering = 90% of effort in industrial ML)
• Handcrafting features vs learning them
• Representation learning: guesses the features / factors / causes = good representation.

(15)

Deep Representation Learning

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction.

When the number of levels can be data-selected, this is Deep Learning.

[Figure: stack of representations x, h1, h2, h3, …]

(16)

A Modern Deep Architecture

 

Optional output layer
  Here predicting a supervised target

Hidden layers
  These learn more abstract representations as you head up

Input layer
  This has raw sensory inputs (roughly)
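The three kinds of layer listed here can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the layer sizes, the random untrained weights, the rectifier non-linearity, and the helper names are assumptions, not details given on the slide.

```python
import numpy as np

def relu(a):
    """Rectifier non-linearity applied element-wise."""
    return np.maximum(0.0, a)

def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def forward(x, hidden_layers, output_layer=None):
    """Return [x, h1, h2, ...], plus the output if an output layer is given."""
    reps = [x]
    for W, b in hidden_layers:
        # Each hidden layer re-represents the layer below it.
        reps.append(relu(reps[-1] @ W + b))
    if output_layer is not None:
        W_out, b_out = output_layer
        # Linear read-out predicting a supervised target.
        reps.append(reps[-1] @ W_out + b_out)
    return reps

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 16]   # input width followed by three hidden widths
hidden_layers = [init_layer(rng, m, n) for m, n in zip(sizes[:-1], sizes[1:])]
output_layer = init_layer(rng, sizes[-1], 3)

x = rng.normal(size=(5, 8))   # 5 examples of 8 raw inputs
reps = forward(x, hidden_layers, output_layer)
shapes = [r.shape for r in reps]
print(shapes)
```

Leaving `output_layer` as `None` returns only the input and the hidden representations, which matches the "optional" output layer on the slide.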

(17)

Google Image Search:

Different object types represented in the same space

Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS'2010, JMLR 2010, MLJ 2010)

(18)

How do humans generalize from very few examples?

• Brains may be born with 'generic' priors. Which ones?
• Humans transfer knowledge from previous learning:
  • Representations
  • Explanatory factors
• Previous learning from: unlabeled data + labels for other tasks

(19)

Learning multiple levels of representation

Theoretical evidence for multiple levels of representation
  Exponential gain for some families of functions

Biologically inspired learning
  Brain has a deep architecture
  Cortex seems to have a generic learning algorithm
  Humans first learn simpler concepts and then compose them to more complex ones

(20)

Learning multiple levels of representation

Successive model layers learn deeper intermediate representations

[Figure: Layer 1, Layer 2, Layer 3. Parts combine to form objects; high-level linguistic representations.]

(Lee, Largman, Pham & Ng, NIPS 2009) (Lee, Grosse, Ranganath & Ng, ICML 2009)

Prior: underlying factors & concepts compactly expressed w/ multiple levels of abstraction

(21)

main

sub1 sub2 sub3

subsub1 subsub2 subsub3

subsubsub1 subsubsub2

subsubsub3

“Deep” computer program

(22)

"Shallow" computer program

main
  subroutine1 includes subsub1 code and subsub2 code and subsubsub1 code
  subroutine2 includes subsub2 code and subsub3 code and subsubsub3 code and …
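One way to make the deep versus shallow contrast concrete is to count operations when intermediate results are reused versus recomputed. The example below is an illustration chosen here, not taken from the slides: it computes x raised to the power 2^k, once by reusing each intermediate result (a "deep" computation) and once as a flat product with no reuse (a "shallow" one). Function names are hypothetical.

```python
def deep_power(x, k):
    """Compute x ** (2 ** k) by squaring k times, reusing each intermediate result."""
    mults = 0
    for _ in range(k):
        x = x * x          # each level builds on the level below
        mults += 1
    return x, mults

def shallow_power(x, k):
    """Compute the same value as one flat product, with no reuse."""
    result, mults = x, 0
    for _ in range(2 ** k - 1):
        result = result * x
        mults += 1
    return result, mults

print(deep_power(3, 4))     # value and number of multiplications
print(shallow_power(3, 4))
```

Both functions return the same value, but the deep version needs k multiplications while the flat version needs 2^k - 1, an exponential gap of the kind the slides attribute to some families of functions.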

(23)

Major Breakthrough in 2006

• Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed
• Unsupervised feature learners:
  • RBMs
  • Auto-encoder variants
  • Sparse coding variants

[Map: Montréal (Bengio), Toronto (Hinton), New York (Le Cun)]

Empirical successes since then: 2 competitions, Google, Microsoft, IBM, Apple…
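Greedy layer-wise unsupervised learning can be sketched with plain auto-encoders in NumPy. This is a minimal sketch under stated assumptions, not the procedure of any cited paper: a sigmoid encoder, a linear decoder, full-batch gradient descent on squared reconstruction error, and random data are all choices made here, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_autoencoder(X, n_hidden, epochs=200, lr=0.1, seed=0):
    """Fit one auto-encoder layer; return encoder weights, biases, and loss history."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(scale=0.1, size=(d, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(scale=0.1, size=(n_hidden, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)      # hidden code
        R = H @ W_dec + b_dec               # linear reconstruction
        diff = R - X
        losses.append(0.5 * np.sum(diff ** 2) / n)
        # Back-propagate the reconstruction error through both layers.
        dR = diff / n
        dA = (dR @ W_dec.T) * H * (1.0 - H)
        W_dec -= lr * (H.T @ dR)
        b_dec -= lr * dR.sum(axis=0)
        W_enc -= lr * (X.T @ dA)
        b_enc -= lr * dA.sum(axis=0)
    return W_enc, b_enc, losses

def pretrain_stack(X, layer_sizes):
    """Train one auto-encoder per layer, each on the codes of the layer below."""
    params, inputs = [], X
    for i, n_hidden in enumerate(layer_sizes):
        W, b, _ = train_autoencoder(inputs, n_hidden, seed=i)
        params.append((W, b))
        inputs = sigmoid(inputs @ W + b)    # codes become the next layer's data
    return params, inputs

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # unlabeled data (random, for illustration)
params, top_codes = pretrain_stack(X, [8, 4])
_, _, losses = train_autoencoder(X, 8)
print(top_codes.shape, losses[0], losses[-1])
```

No labels are used anywhere. In the approach the slide describes, the encoder weights obtained this way would then initialize a deep network before supervised training.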

(24)

Deep Networks for Speech Recognition:

results from Google, IBM, Microsoft

Task                     Hours of training data   Deep net+HMM   GMM+HMM, same data   GMM+HMM, more data
Switchboard              309                      16.1           23.6                 17.1 (2k hours)
English Broadcast news   50                       17.5           18.8                 -
Bing voice search        24                       30.4           36.2                 -
Google voice input       5870                     12.3           -                    16.0 (lots more)
Youtube                  1400                     47.6           52.3                 -

(numbers taken from Geoff Hinton's June 22, 2012 Google talk)
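To read the table as relative improvements, the short calculation below compares the deep net+HMM figure with the GMM+HMM figure trained on the same data. It assumes the entries are error rates in percent (the slide does not label the unit), and the helper name is hypothetical. Google voice input is omitted because the table gives no same-data GMM+HMM figure for it.

```python
# (deep net+HMM, GMM+HMM on the same data), copied from the table above.
results = {
    "Switchboard": (16.1, 23.6),
    "English Broadcast news": (17.5, 18.8),
    "Bing voice search": (30.4, 36.2),
    "Youtube": (47.6, 52.3),
}

def relative_reduction(new, old):
    """Fraction of the old error removed by the new system."""
    return (old - new) / old

reductions = {task: relative_reduction(dnn, gmm) for task, (dnn, gmm) in results.items()}
for task, r in reductions.items():
    print(f"{task}: {100 * r:.1f}% relative reduction")
```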

(25)

Deep Sparse Rectifier Neural Networks

(Glorot, Bordes and Bengio, AISTATS 2011), following up on (Nair & Hinton 2010)

Neuroscience motivations: leaky integrate-and-fire model
Machine learning motivations: sparse representations, sparse gradients

Rectifier: f(x) = max(0, x)

Outstanding results by Krizhevsky et al 2012, killing the state-of-the-art on ImageNet 1000:

                   1st choice   Top-5
2nd best           -            27% err
Previous SOTA      45% err      26% err
Krizhevsky et al   37% err      15% err
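The rectifier f(x) = max(0, x) from this slide, and the sparsity it produces, can be sketched in a few lines of NumPy. The standard normal pre-activations are an assumption made here to have something to feed through the unit; function names are hypothetical.

```python
import numpy as np

def rectifier(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def rectifier_grad(x):
    """Derivative of the rectifier: 1 where x > 0, else 0."""
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)   # assumed zero-mean inputs
h = rectifier(pre_activations)

sparsity = np.mean(h == 0.0)                              # fraction of exact zeros in the output
grad_sparsity = np.mean(rectifier_grad(pre_activations) == 0.0)
print(sparsity, grad_sparsity)
```

With zero-mean inputs roughly half of the units output exactly zero, and the gradient is zero for those same units, which is the sense in which both the representation and the gradients are sparse.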

(26)

Learning Multiple Levels of Abstraction

• The big payoff of deep learning is to allow learning higher levels of abstraction
• Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer
• More abstract representations → successful transfer (domains, languages), 2 international competitions won

(27)

Challenges Ahead

• Big data + deep learning = underfitting, local minima, ill-conditioning, difficulty of using 2nd-order methods in the stochastic / online setting
• The challenge of inference with non-unimodal non-factorial posteriors (can we avoid this altogether?)
• Big data + deep learning + parallel computing → our current best training algorithms are highly sequential… big efforts @ Google in this respect (Dean et al ICML 2012, NIPS 2012)
• Much remains to be understood mathematically; (Alain & Bengio ICLR 2013) is one of the few works scratching the tip of the iceberg

(28)

Thank you! Questions?

LISA team:
