Deep Learning for Big Data

(1)

Deep Learning for Big Data

Yoshua Bengio

Département d’Informatique et Recherche Opérationnelle, U. Montréal

30 May 2013, Journée de la recherche École Polytechnique, Montréal

(2)

Big Data & Data Science

•  Super-hot buzzword

•  Data deluge

•  Two sides of the coin:

1.  Allowing computers to “understand” the data (perception)

2.  Allowing computers to “take decisions” (action)

•  My research: 1.

•  CERC in Data Science and Real-Time Decision-Making:

•  The necessity to combine 1 and 2.

(3)

Big Data — a Growing Torrent

Business executives face relentless, exponential growth in the data that their enterprises can collect

Figure: The Economist

1 Exabyte = 1 Billion Gigabytes

40% projected growth in global data generated per year vs. 5% growth in global IT spending

5 billion mobile phones in use in 2010

30 billion pieces of content shared on Facebook every month

Data: McKinsey

(4)

Source: McKinsey

Big Data — Big Value

Making sense of this data could unleash substantial value across an array of industries.

$300 billion potential annual value to US health care

€250 billion potential annual value to Europe’s public sector

$600 billion potential annual consumer surplus from using personal location data globally

60% potential increase in retailers’ operating margins possible with big data

(5)

Big Data: in the minds of executives

There are many reasons to believe that, since last year, turning data into a competitive advantage is becoming a top-of-mind C-level issue.

The Economist Special Report, 2010

McKinsey White Paper, 2011

O’Reilly Strata Conference, twice-yearly event, started 2011

“The world of Big Data is on fire” (The Economist, Sept 2011)

#bigdata on Twitter

(6)

Data Science: automatically extracting knowledge from data

From: Yann LeCun

Lecture 1 on Big Data, large scale machine learning, 2013

(7)

Decision Science + Machine Learning

•  The topic of a successful CERC application

•  Why?

•  Data deluge & real-time online learning

•  Learned models are used to take decisions on the fly

•  The data used to train depends on the decisions taken

•  Can’t separate the learning from the decisions like in traditional OR & ML setups

•  Examples:

•  Online advertising & recommendation systems

•  Online video games

•  Fraud detection, targeted marketing, etc.
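A toy online advertising loop can make the coupling between learning and decisions concrete. The sketch below is an illustration only and is not taken from the talk: it assumes an epsilon-greedy policy, three hypothetical ads, and made-up click rates. The point to notice is that feedback is observed only for the ad that was actually shown, so the training data depends on the decisions taken.

```python
import random


def run_epsilon_greedy(click_rates, steps=5000, epsilon=0.1, seed=0):
    """Online loop in which each decision determines which data is observed."""
    rng = random.Random(seed)
    n_arms = len(click_rates)
    counts = [0] * n_arms        # how often each ad was shown
    estimates = [0.0] * n_arms   # running mean reward per ad
    for _ in range(steps):
        # Decision: explore at random with probability epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Data: feedback is observed only for the ad that was shown.
        reward = 1.0 if rng.random() < click_rates[arm] else 0.0
        # Learning: incremental update of the running mean for that ad.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates


# Hypothetical click rates, chosen only for illustration.
counts, estimates = run_epsilon_greedy([0.02, 0.05, 0.10])
print(counts, [round(e, 3) for e in estimates])
```

Ads that are rarely shown keep noisy estimates, which is exactly why the learning problem cannot be solved offline and then handed to a separate decision module.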

(8)

Ultimate Goals for AI

•  AI

•  Needs knowledge

•  Needs learning

•  Needs generalizing where probability mass concentrates

•  Needs to fight the curse of dimensionality

•  Needs disentangling the underlying explanatory factors (“making sense of the data”)

(9)

Easy Learning

[Figure: training examples (x, y), marked *, lie along a true unknown function; the learned function gives prediction = f(x). Axes: x, y.]

(10)

Local Smoothness Prior: Locally Capture the Variations

[Figure: training examples (*) lie on a true, unknown function; at a test point x, the learnt prediction f(x) is interpolated from nearby training examples. Axes: x, y.]

x ≈ x’ ⇒ f(x) ≈ f(x’)
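The smoothness prior can be sketched with two classic local predictors. This is a minimal illustration under assumptions that are not in the slides: the "true function" is taken to be sin(x), the training grid is made up, and the kernel width of 0.5 is an arbitrary choice. Both predictors rely only on the idea that nearby inputs have nearby outputs.

```python
import math


def nearest_neighbor_predict(train, x):
    """Copy the label of the training example whose input is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def kernel_smoother_predict(train, x, width=0.5):
    """Nadaraya-Watson estimate: average the labels, weighted by closeness to x."""
    weights = [math.exp(-((xi - x) / width) ** 2) for xi, _ in train]
    total = sum(weights)
    if total == 0.0:
        # x is so far from every example that all weights underflow.
        return nearest_neighbor_predict(train, x)
    return sum(w * yi for w, (_, yi) in zip(weights, train)) / total


# Hypothetical training set: sin(x) sampled every 0.5 on [0, 6].
train = [(i / 2.0, math.sin(i / 2.0)) for i in range(13)]

x_test = 1.2
print(nearest_neighbor_predict(train, x_test))
print(kernel_smoother_predict(train, x_test))
print(math.sin(x_test))  # the value both predictors try to approximate
```

Both predictors work here because the test point sits between training examples that cover the local variation; neither can say anything useful about a region with no examples nearby.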

(11)

What We Are Fighting Against: The Curse of Dimensionality

To generalize locally, need representative examples for all relevant variations!
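A back-of-the-envelope count shows how quickly "all relevant variations" becomes unaffordable. The sketch assumes, purely for illustration, that each input dimension is cut into 10 bins and that a local learner needs at least one example per grid cell; the one-million-example budget is also a made-up figure.

```python
def cells_needed(bins_per_dim, n_dims):
    """Number of grid cells when each of n_dims axes is cut into bins."""
    return bins_per_dim ** n_dims


def max_coverage(n_examples, bins_per_dim, n_dims):
    """Upper bound on the fraction of cells that can contain an example."""
    return min(1.0, n_examples / cells_needed(bins_per_dim, n_dims))


for d in (1, 2, 3, 10):
    print(d, cells_needed(10, d), max_coverage(1_000_000, 10, d))
```

The number of cells grows exponentially with the dimension, so a fixed data budget covers a vanishing fraction of the input space.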

(12)

Manifold Learning

Prior: examples concentrate near a lower-dimensional manifold

(13)

Putting Probability Mass where Structure is Plausible

•  Empirical distribution: mass at training examples

•  Smoothness: spread mass around

•  Insufficient

•  Guess ‘structure’ and generalize accordingly
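The first two bullets can be contrasted in one dimension. The sketch below is an assumption-laden illustration, not the talk's method: it uses a Gaussian Parzen window with an arbitrary width of 0.3 and five made-up training points. The empirical distribution puts no mass between the two clusters, while the smoothed density spreads a little mass there, blindly and in every direction.

```python
import math


def empirical_mass(train, a, b):
    """Empirical distribution: fraction of training points falling in [a, b]."""
    return sum(1 for x in train if a <= x <= b) / len(train)


def parzen_density(train, x, sigma=0.3):
    """Smoothed density: average of Gaussian bumps centred on training points."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    bumps = (norm * math.exp(-0.5 * ((x - xi) / sigma) ** 2) for xi in train)
    return sum(bumps) / len(train)


# Hypothetical training set with two clusters.
train = [0.0, 0.1, 0.2, 2.0, 2.1]

print(empirical_mass(train, 0.9, 1.1))  # no training point in this interval
print(parzen_density(train, 1.0))       # small but non-zero density
print(parzen_density(train, 0.1))       # much higher near a cluster
```

Smoothing of this kind is isotropic: it knows nothing about which directions of variation are plausible, which is why the slide calls it insufficient and asks for structure instead.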

(14)

Representation Learning

•  Good input features essential for successful ML (feature engineering = 90% of effort in industrial ML)

•  Handcrafting features vs learning them

•  Representation learning: guesses the features / factors / causes = good representation.

(15)

Deep Representation Learning

Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction

When the number of levels can be data-selected, this is Deep Learning

[Figure: input x and successive representation levels h₁, h₂, h₃, …]

(16)

A Modern Deep Architecture

Optional output layer: here predicting a supervised target

Hidden layers: these learn more abstract representations as you head up

Input layer: this has raw sensory inputs (roughly)
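The three-part structure above (input layer, stacked hidden layers, optional output layer) can be sketched as a forward pass. Everything specific here is an assumption for illustration: the layer sizes, the tanh hidden units, the logistic output unit, and the random untrained weights are not taken from the talk.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def affine(layer, x):
    """Compute W x + b for one layer."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def forward(hidden_layers, output_layer, x):
    """Return every hidden representation plus the output prediction."""
    representations = []
    h = x  # input layer: the raw inputs
    for layer in hidden_layers:
        h = [math.tanh(a) for a in affine(layer, h)]
        representations.append(h)
    # Optional output layer: one logistic unit predicting a supervised target.
    logit = affine(output_layer, h)[0]
    return representations, 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
sizes = [4, 5, 3, 2]  # hypothetical: 4 inputs, then hidden layers of 5, 3, 2 units
hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = init_layer(sizes[-1], 1, rng)

reps, prediction = forward(hidden, output, [0.2, -1.0, 0.5, 0.0])
print([len(r) for r in reps], prediction)
```

Each entry of `reps` is one level of representation; with trained weights, the higher entries are the ones expected to be more abstract.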

(17)

Google Image Search: Different object types represented in the same space

Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS 2010, JMLR 2010, MLJ 2010)

(18)

How do humans generalize from very few examples?

•  Brains may be born with ‘generic’ priors. Which ones?

•  Humans transfer knowledge from previous learning:

•  Representations

•  Explanatory factors

•  Previous learning from: unlabeled data + labels for other tasks

(19)

Learning multiple levels of representation

Theoretical evidence for multiple levels of representation

Exponential gain for some families of functions

Biologically inspired learning

Brain has a deep architecture

Cortex seems to have a generic learning algorithm

Humans first learn simpler concepts and then compose them into more complex ones

(20)

Learning multiple levels of representation

Successive model layers learn deeper intermediate representations

Layer 1 Layer 2 Layer 3

High-level linguistic representations

(Lee, Largman, Pham & Ng, NIPS 2009) (Lee, Grosse, Ranganath & Ng, ICML 2009)

Prior: underlying factors & concepts compactly expressed with multiple levels of abstraction

Parts combine to form objects

(21)

main

sub1 sub2 sub3

subsub1 subsub2 subsub3

subsubsub1 subsubsub2

subsubsub3

“Deep” computer program

(22)

“Shallow” computer program

main

subroutine1 includes subsub1 code and subsub2 code and subsubsub1 code

subroutine2 includes subsub2 code and subsub3 code and subsubsub3 code and …
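The deep versus shallow program analogy can be made quantitative with a toy function. The example is an editorial illustration, not from the slides: it computes x raised to the power 2^depth either by reusing each intermediate result (the "deep" program) or by inlining everything into one flat chain of multiplications (the "shallow" program).

```python
def deep_power(x, depth):
    """'Deep' program: each level reuses the result of the level below."""
    multiplications = 0
    value = x
    for _ in range(depth):
        value = value * value  # square the previous level's result
        multiplications += 1
    return value, multiplications


def shallow_power(x, depth):
    """'Shallow' program: the same function with every level inlined."""
    multiplications = 0
    value = x
    for _ in range(2 ** depth - 1):
        value = value * x
        multiplications += 1
    return value, multiplications


print(deep_power(3, 4))     # 4 multiplications
print(shallow_power(3, 4))  # 15 multiplications for the same result
```

The deep version needs `depth` multiplications while the flattened one needs `2 ** depth - 1`, a small instance of the exponential gain that depth can give for some families of functions.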

(23)

Major Breakthrough in 2006

[Map: Bengio (Montréal), Hinton (Toronto), Le Cun (New York)]

•  Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed

•  Unsupervised feature learners:

•  RBMs

•  Auto-encoder variants

•  Sparse coding variants

Empirical successes since then: 2 competitions, Google, Microsoft, IBM, Apple…
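Layer-wise unsupervised learning can be sketched by stacking small auto-encoders: train one on the raw inputs, then train the next on the codes produced by the first. The code below is a simplified illustration under stated assumptions, not the 2006 algorithms themselves: it uses a plain auto-encoder (tanh encoder, linear decoder, squared error, per-example gradient steps) rather than an RBM, a made-up toy dataset, arbitrary hyperparameters, and it omits the supervised fine-tuning that would normally follow.

```python
import math
import random


def train_autoencoder(data, n_hidden, rng, epochs=200, lr=0.05):
    """Train one auto-encoder layer; return its encoder and the loss before/after."""
    n_in = len(data[0])
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [0.0] * n_hidden
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    c = [0.0] * n_in

    def encode(x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
                for row, bj in zip(W, b)]

    def reconstruct(h):
        return [sum(v * hj for v, hj in zip(row, h)) + ci
                for row, ci in zip(V, c)]

    def loss():
        total = 0.0
        for x in data:
            r = reconstruct(encode(x))
            total += sum((ri - xi) ** 2 for ri, xi in zip(r, x))
        return total / len(data)

    initial = loss()
    for _ in range(epochs):
        for x in data:
            h = encode(x)
            err = [ri - xi for ri, xi in zip(reconstruct(h), x)]
            # Backpropagate the reconstruction error to the hidden units.
            dh = [sum(V[i][j] * err[i] for i in range(n_in)) for j in range(n_hidden)]
            da = [dhj * (1.0 - hj * hj) for dhj, hj in zip(dh, h)]
            for i in range(n_in):          # decoder update
                for j in range(n_hidden):
                    V[i][j] -= lr * err[i] * h[j]
                c[i] -= lr * err[i]
            for j in range(n_hidden):      # encoder update
                for k in range(n_in):
                    W[j][k] -= lr * da[j] * x[k]
                b[j] -= lr * da[j]
    return encode, initial, loss()


rng = random.Random(0)
# Hypothetical data: 3-D points that really vary along one underlying factor t.
data = [[t / 4.0, (t / 4.0) ** 2, -t / 4.0] for t in range(-4, 5)]

# Greedy stacking: layer 2 is trained on the codes produced by layer 1.
encode1, before1, after1 = train_autoencoder(data, 2, rng)
codes1 = [encode1(x) for x in data]
encode2, before2, after2 = train_autoencoder(codes1, 1, rng)
codes2 = [encode2(h) for h in codes1]

print(before1, after1)
print(before2, after2)
```

Each layer is trained without labels and only sees the output of the layer below, which is the sense in which the procedure is greedy and layer-wise.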

(24)

Deep Networks for Speech Recognition:

results from Google, IBM, Microsoft

task | Hours of training data | Deep net+HMM | GMM+HMM same data | GMM+HMM more data
Switchboard | 309 | 16.1 | 23.6 | 17.1 (2k hours)
English Broadcast news | 50 | 17.5 | 18.8 | -
Bing voice search | 24 | 30.4 | 36.2 | -
Google voice input | 5870 | 12.3 | - | 16.0 (lots more)
Youtube | 1400 | 47.6 | 52.3 | -

(numbers taken from Geoff Hinton’s June 22, 2012 Google talk)
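The size of the gains in the table is easier to judge as a relative error reduction. The short calculation below uses only the rows that report both systems trained on the same data; it assumes the figures are error rates in percent, which the slide does not state explicitly.

```python
# (task, deep net + HMM, GMM + HMM trained on the same data), errors in percent
results = [
    ("Switchboard", 16.1, 23.6),
    ("English Broadcast news", 17.5, 18.8),
    ("Bing voice search", 30.4, 36.2),
    ("Youtube", 47.6, 52.3),
]


def relative_reduction(new_error, old_error):
    """Percentage of the old error that the new system removes."""
    return 100.0 * (old_error - new_error) / old_error


for task, deep, gmm in results:
    print(f"{task}: {relative_reduction(deep, gmm):.1f}% fewer errors")
```

Google voice input is left out because the slide gives no GMM+HMM figure on the same data for that task.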

(25)

Deep Sparse Rectifier Neural Networks

(Glorot, Bordes and Bengio, AISTATS 2011), following up on (Nair & Hinton 2010)

Neuroscience motivations: leaky integrate-and-fire model

Machine learning motivations: sparse representations, sparse gradients

Rectifier: f(x) = max(0, x)
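The rectifier and the two kinds of sparsity it induces fit in a few lines. The pre-activation values below are made up for illustration, and the derivative at exactly zero is set to 0 by convention.

```python
def rectifier(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def rectifier_grad(x):
    """Derivative of the rectifier; taken to be 0 at x = 0 by convention."""
    return 1.0 if x > 0 else 0.0


# Hypothetical pre-activations of five hidden units.
pre_activations = [-2.0, -0.5, 0.0, 0.3, 1.7]
activations = [rectifier(a) for a in pre_activations]
gradients = [rectifier_grad(a) for a in pre_activations]
sparsity = activations.count(0.0) / len(activations)

print(activations)  # inactive units output exactly zero: sparse representation
print(gradients)    # and pass no gradient back: sparse gradients
print(sparsity)
```

Units with a negative input are exactly off, so both the representation and the gradient that flows back through it are sparse.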

Outstanding results by Krizhevsky et al 2012 killing the state-of-the-art on ImageNet 1000:

Entry | 1st choice | Top-5
2nd best | - | 27% err
Previous SOTA | 45% err | 26% err
Krizhevsky et al | 37% err | 15% err

(26)

Learning Multiple Levels of Abstraction

•  The big payoff of deep learning is to allow learning higher levels of abstraction

•  Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer

•  More abstract representations → Successful transfer (domains, languages), 2 international competitions won

(27)

Challenges Ahead

•  Big data + deep learning = underfitting, local minima, ill-conditioning, difficulty of using 2nd-order methods in the stochastic / online setting

•  The challenge of inference with non-unimodal, non-factorial posteriors (can we avoid this altogether?)

•  Big data + deep learning + parallel computing → our current best training algorithms are highly sequential… big efforts @ Google in this respect (Dean et al ICML 2012, NIPS 2012)

•  Much remains to be understood mathematically; (Alain & Bengio ICLR 2013) is one of the few works scratching the tip of the iceberg

(28)

Thank you! Questions?

LISA team:
