Deep Learning for Big Data
Yoshua Bengio
Département d’Informatique et de Recherche Opérationnelle, U. Montréal
30 May 2013, Journée de la recherche École Polytechnique, Montréal
Big Data & Data Science
• Super-hot buzzword
• Data deluge
• Two sides of the coin:
1. Allowing computers to “understand” the data (perception)
2. Allowing computers to “take decisions” (action)
• My research: 1.
• CERC in Data Science and Real-Time Decision-Making:
• The necessity to combine 1 and 2.
Big Data — a Growing Torrent
Business executives are faced with a relentless and exponential growth of data that can be collected by their enterprises
Figure: The Economist
1 Exabyte = 1 Billion Gigabytes
40%: projected growth in global data generated per year, vs. 5% growth in global IT spending
5 billion: mobile phones in use in 2010
30 billion: pieces of content shared on Facebook every month
Data: McKinsey
Source: McKinsey
Big Data — Big Value
Making sense of this data could unleash substantial value across an array of industries.
$300 billion: potential annual value to US health care
€250 billion: potential annual value to Europe’s public sector
$600 billion: potential annual consumer surplus from using personal location data globally
60%: potential increase in retailers’ operating margins possible with big data
Big Data: in the minds of executives
There are many reasons to believe that, since last year, turning data into a competitive advantage has become a top-of-mind C-level issue.
The Economist Special Report, 2010
McKinsey White Paper, 2011
O’Reilly Strata Conference, twice-yearly event, started 2011
“The world of Big Data is on fire” (The Economist, Sept 2011)
#bigdata on Twitter
Data Science: automatically extracting knowledge from data
From: Yann LeCun
Lecture 1 on Big Data, large scale machine learning, 2013
Decision Science + Machine Learning
• The topic of a successful CERC application
• Why?
• Data deluge & real-time online learning
• Learned models are used to take decisions on the fly
• The data used to train depends on the decisions taken
• Can’t separate the learning from the decisions like in traditional OR & ML setups
• Examples:
• Online advertising & recommendation systems
• Online video games
• Fraud detection, targeted marketing, etc.
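The coupling between learning and decisions listed above can be sketched with a small epsilon-greedy bandit for choosing which ad to show. This is only an illustration under assumed values: the click probabilities, `epsilon`, and the function name `run_epsilon_greedy` are hypothetical and do not come from the talk.

```python
import random

def run_epsilon_greedy(click_probs, steps=5000, epsilon=0.1, seed=0):
    """Choose among ads online; only the ad that is shown yields a training signal."""
    rng = random.Random(seed)
    n_arms = len(click_probs)
    counts = [0] * n_arms        # how often each ad was shown
    estimates = [0.0] * n_arms   # running mean click rate per ad
    for _ in range(steps):
        # Decision: explore at random with probability epsilon, else exploit.
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])
        # Feedback exists only for the ad that was actually shown.
        reward = 1.0 if rng.random() < click_probs[arm] else 0.0
        # Learning: incremental update of the running mean for that ad.
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

# Hypothetical click probabilities for three ads.
counts, estimates = run_epsilon_greedy([0.02, 0.05, 0.10])
```

The training data (`counts`, `estimates`) is shaped by the decisions themselves: an ad that is rarely shown is also rarely learned about, which is why learning and decision-making cannot be treated separately here.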
Ultimate Goals for AI
• AI
• Needs knowledge
• Needs learning
• Needs generalizing where probability mass concentrates
• Needs to fight the curse of dimensionality
• Needs disentangling the underlying explanatory factors (“making sense of the data”)
Easy Learning
[Figure: training examples (x, y), marked *, scattered along the true unknown function; the learned function gives prediction = f(x). Axes: x, y.]
Local Smoothness Prior: Locally Capture the Variations
[Figure: training examples (*) on the true, unknown function; at a test point x, the learnt prediction f(x) is interpolated from neighbouring training examples. Axes: x, y.]
x ≈ x’ ⇒ f(x) ≈ f(x’)
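One simple learner that relies only on the local smoothness prior is a kernel smoother, which predicts a weighted average of the targets of nearby training examples. The sketch below assumes a Gaussian kernel, a 1-D input, and a toy sine target; the function name and the `bandwidth` value are illustrative choices, not taken from the talk.

```python
import numpy as np

def kernel_smoother(x_train, y_train, x_test, bandwidth=0.2):
    """Nadaraya-Watson estimate: a locally weighted average of training targets."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_test = np.atleast_1d(np.asarray(x_test, dtype=float))
    # Gaussian weights: training points close to a test point dominate its average.
    diffs = x_test[:, None] - x_train[None, :]
    weights = np.exp(-0.5 * (diffs / bandwidth) ** 2)
    return (weights @ y_train) / weights.sum(axis=1)

# Toy data (assumed): 50 examples of y = sin(x) on [0, 2*pi].
x_train = np.linspace(0.0, 2.0 * np.pi, 50)
y_train = np.sin(x_train)

# Two nearby test points receive nearly identical predictions.
pred = kernel_smoother(x_train, y_train, [1.0, 1.01])
```

Such a predictor works well only where training examples are dense around the test point, which is the limitation the next slide addresses.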
What We Are Fighting Against:
The Curse of Dimensionality
To generalize locally, we need representative examples for all relevant variations!
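A rough way to see the scale of the problem: assume each input dimension is cut into a fixed number of bins and that a purely local learner needs at least one example per grid cell. Both the grid view and the choice of 10 bins are simplifying assumptions made for illustration.

```python
def cells_needed(bins_per_dim, n_dims):
    """Number of grid cells when each of n_dims inputs is split into bins_per_dim bins."""
    return bins_per_dim ** n_dims

# The number of regions to cover grows exponentially with the input dimension.
for d in (1, 2, 10, 100):
    print(d, cells_needed(10, d))
```

With 10 bins per dimension, 2 dimensions give 100 cells, while 10 dimensions already give 10 billion.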
Manifold Learning
Prior: examples concentrate near a lower-dimensional manifold
Putting Probability Mass where Structure is Plausible
• Empirical distribution: mass at training examples
• Smoothness: spread mass around
• Insufficient
• Guess ‘structure’ and generalize accordingly
Representation Learning
• Good input features are essential for successful ML (feature engineering = 90% of effort in industrial ML)
• Handcrafting features vs. learning them
• Representation learning: guesses the features / factors / causes = good representation.
Deep Representation Learning
Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction
When the number of levels can be data-selected, this is Deep Learning
[Figure: input x and hidden layers h1, h2, h3, …]
A Modern Deep Architecture
Optional output layer
Here predicting a supervised target
Hidden layers
These learn more abstract representations as you head up
Input layer
This has raw sensory inputs (roughly)
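The layered structure on this slide can be sketched as a forward pass through a small fully connected network. The layer sizes, the tanh nonlinearity, the random initialization, and the helper names `init_layers` and `forward` are all assumptions made for illustration; the slide does not specify them.

```python
import numpy as np

def init_layers(sizes, seed=0):
    """Create (weights, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(x, layers):
    """Return the activations of every layer, from the input up to the output."""
    activations = [x]
    h = x
    for i, (W, b) in enumerate(layers):
        z = h @ W + b
        # Hidden layers use tanh; the final (output) layer is left linear.
        h = np.tanh(z) if i < len(layers) - 1 else z
        activations.append(h)
    return activations

layers = init_layers([8, 16, 16, 16, 1])   # input, three hidden layers, output
x = np.ones((4, 8))                        # batch of 4 placeholder inputs
acts = forward(x, layers)                  # acts[0] is the input, acts[-1] the output
```

Each entry of `acts` is the representation at one level; the output layer is optional in the sense that the hidden representations can also be used on their own.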
Google Image Search:
Different object types represented in the same space
Google: S. Bengio, J. Weston & N. Usunier (IJCAI 2011, NIPS 2010, JMLR 2010, MLJ 2010)
How do humans generalize from very few examples?
• Brains may be born with ‘generic’ priors. Which ones?
• Humans transfer knowledge from previous learning:
• Representations
• Explanatory factors
• Previous learning from: unlabeled data + labels for other tasks
Learning multiple levels of representation
Theoretical evidence for multiple levels of representation
Exponential gain for some families of functions
Biologically inspired learning
Brain has a deep architecture
Cortex seems to have a generic learning algorithm
Humans first learn simpler concepts and then compose them into more complex ones
Learning multiple levels of representation
Successive model layers learn deeper intermediate representations
[Figure: features learned at Layer 1, Layer 2 and Layer 3; high-level linguistic representations]
(Lee, Largman, Pham & Ng, NIPS 2009) (Lee, Grosse, Ranganath & Ng, ICML 2009)
Prior: underlying factors & concepts compactly expressed w/ multiple levels of abstraction
Parts combine to form objects
[Figure: “Deep” computer program: a call tree with main at the top, then sub1, sub2, sub3, then subsub1, subsub2, subsub3, then subsubsub1, subsubsub2, subsubsub3]
[Figure: “Shallow” computer program: main with subroutine1, which includes subsub1 code and subsub2 code and subsubsub1 code, and subroutine2, which includes subsub2 code and subsub3 code and subsubsub3 code and …]
Major Breakthrough in 2006
• Ability to train deep architectures by using layer-wise unsupervised learning, whereas previous purely supervised attempts had failed
• Unsupervised feature learners:
• RBMs
• Auto-encoder variants
• Sparse coding variants
[Figure: map locating Bengio (Montréal), Hinton (Toronto) and Le Cun (New York)]
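The idea of layer-wise unsupervised learning can be sketched as follows: train one feature learner on the raw data, then train the next one on the codes produced by the first, and so on. The sketch swaps in a plain tied-weight auto-encoder trained by batch gradient descent, which is a simplification and not the exact procedure of the 2006 papers; the data, layer sizes, learning rate, and function names are all assumed for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=100, lr=0.05, seed=0):
    """Fit a one-hidden-layer auto-encoder with tied weights by gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(0.0, 0.1, size=(d, n_hidden))
    b = np.zeros(n_hidden)   # encoder bias
    c = np.zeros(d)          # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)             # encode
        R = H @ W.T + c                    # decode (linear, tied weights)
        err = (R - X) / n                  # gradient of (1/2n) * sum ||R - X||^2 w.r.t. R
        dA = (err @ W) * H * (1.0 - H)     # backpropagate through the sigmoid
        W -= lr * (X.T @ dA + err.T @ H)   # encoder and decoder contributions
        b -= lr * dA.sum(axis=0)
        c -= lr * err.sum(axis=0)
    return W, b

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pretraining: each layer is trained on the codes of the one below."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        params.append((W, b))
        H = sigmoid(H @ W + b)   # codes become the next layer's input
    return params, H

rng = np.random.default_rng(1)
X = rng.random((100, 20))        # placeholder unlabeled data
params, codes = pretrain_stack(X, [10, 5])
```

In practice the pretrained `params` would then initialize a deep network that is fine-tuned with a supervised objective; that step is omitted here.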
Empirical successes since then: 2 competitions, Google, Microsoft, IBM, Apple…
Deep Networks for Speech Recognition:
results from Google, IBM, Microsoft
Word error rates (%):
Task                     Hours of training data   Deep net+HMM   GMM+HMM same data   GMM+HMM more data
Switchboard              309                      16.1           23.6                17.1 (2k hours)
English Broadcast news   50                       17.5           18.8                -
Bing voice search        24                       30.4           36.2                -
Google voice input       5870                     12.3           -                   16.0 (lots more)
Youtube                  1400                     47.6           52.3                -
(numbers taken from Geoff Hinton’s June 22, 2012 Google talk)
Deep Sparse Rectifier Neural Networks
(Glorot, Bordes and Bengio, AISTATS 2011), following up on (Nair & Hinton 2010)
Neuroscience motivations: leaky integrate-and-fire model
Machine learning motivations: sparse representations, sparse gradients
Rectifier: f(x) = max(0, x)
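A minimal sketch of the rectifier and of why it yields sparse representations and sparse gradients; the input values are arbitrary, and the gradient at exactly 0 is set to 0 by convention here.

```python
import numpy as np

def rectifier(x):
    """f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def rectifier_grad(x):
    """Subgradient of the rectifier: 1 where x > 0, else 0."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
h = rectifier(x)                # [0.0, 0.0, 0.0, 0.5, 2.0]
g = rectifier_grad(x)           # [0.0, 0.0, 0.0, 1.0, 1.0]
sparsity = np.mean(h == 0.0)    # fraction of exactly-zero activations: 0.6
```

Units with non-positive input output exactly zero and pass no gradient, so both the activations and the gradient are sparse.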
Outstanding results by Krizhevsky et al. 2012, beating the state of the art on ImageNet 1000 by a wide margin:
                   1st choice   Top-5
2nd best           -            27% err
Previous SOTA      45% err      26% err
Krizhevsky et al   37% err      15% err
Learning Multiple Levels of Abstraction
• The big payoff of deep learning is to allow learning higher levels of abstraction
• Higher-level abstractions disentangle the factors of variation, which allows much easier generalization and transfer
• More abstract representations → successful transfer (domains, languages), 2 international competitions won
Challenges Ahead
• Big data + deep learning = underfitting, local minima, ill-conditioning, difficulty of using 2nd-order methods in the stochastic / online setting
• The challenge of inference with non-unimodal non-factorial posteriors (can we avoid this altogether?)
• Big data + deep learning + parallel computing → our current best training algorithms are highly sequential… big efforts @ Google in this respect (Dean et al ICML 2012, NIPS 2012)
• Much remains to be understood mathematically; (Alain & Bengio ICLR 2013) is one of the few works scratching the tip of the iceberg
Thank you! Questions?
LISA team: