Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
1of 1
Introduction to Statistical Machine Learning
Cheng Soon Ong & Christian Walder
Machine Learning Research Group Data61 | CSIRO
and
College of Engineering and Computer Science The Australian National University
Canberra February – June 2017
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Part I
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
3of 1
Course information
Lecture Manning Clarke Center T6 Tuesday 4.00 pm - 5.30 pm Wednesday 4.00 pm - 5.30 pm
Tutorial CSIT ( building 108 ) No tutorial this week choose one
N113 - Wednesday 5.30 - 7.30 pm N111 - Thursday 11 am - 1 pm
Assignment Warm-up 2% and Two assignments, 19% each
Exam Written exam, 60%
Info Wattle and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Junk Mail Filtering
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
5of 1
gmail - Priority inbox
https://www.youtube.com/watch?v=5nt3gE9dGHQ
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Handwritten Digit Recognition
Given handwritten ZIP codes on letters, money amounts on cheques etc.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
7of 1
Galaxy classification
http://www.galaxyzoo.org/#/classify
Given images of the sky from SDSS and CTIO, and crowdsourced labels
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Backgammon
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
9of 1
Go
Uses a combination of deep neural networks and tree search.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
10of 1
Image Denoising
1 Learn astatisticsover patches from many images of
natural scenes
Learning High-Order MRF Priors of Color Images
2 4 6 8
0.01 0.015 0.02 0.025 0.03
5 1015 20 −0.01 0 0.01 0.02 0.03 0.04
20 40 60 −5
0 5 10 15x 10
−3
2 4 6 8
0.01 0.015 0.02 0.025 0.03
5 10 15 20 −0.01 0 0.01 0.02 0.03 0.04
20 40 60 −5
0 5 10 15x 10
−3
Figure 1.Above are the filters trained on 3x3 intensity, 3x3 color, and 5x5 color, respectively (the filter with highest
variance, which corresponds to a flat grey patch in all three cases, is excluded). The left column shows these filters sorted according to their eigenvalues (largest to smallest), while the right column shows them sorted by their importance after learning (α’s, smallest to largest). The three graphs on the left plot theα’s with respect to the rank of the patches in the left column. The three graphs on the right plot theα’s with respect to the rank of the patches in the right column. The fact that the plots are non-monotonic when sorted by eigenvalue demonstrates the necessity for learning theα’s by maximum-likelihood. The fact that the two rightmost graphs are approximately concave reveals that, although high-order
50 000patches (5×5×3) from the Berkeley Segmentation Database.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
11of 1
Image Denoising
2 Use this knowledge to a denoise ayet unseenimage
Original image Noise added Denoised
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Separating Audio Sources
Cocktail Party Problem (human brains may do it differently ;–)
Microphones
Audio Sources Audio Mixtures
10 20 30 40 50
�0.5 0.5
10 20 30 40 50
�0.5 0.5 1.0
10 20 30 40 50
�1.0
�0.5 0.5 1.0
10 20 30 40 50
�1.0
�0.5 0.5 1.0
(Jaakko Särelä et. al.,
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
13of 1
Predicting Solar Panel Output
Photovoltaics now very close to grid electricity in price Distributed system of generators
Energy market
Great Machine Learning Problem:Predict the solar energy
output (variability primarily due to clouds) for Australia
Pilot project in Canberra : Use cheap cameras to take
360◦sky photos in several location.
Learn to predict 3-D model of cloud movement.
Learn orientation and efficiency of solar panels for each house from time series of energy output.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Other applications of Machine Learning
autonomous robotics, bioinformatics,
detecting network intrusion, neuroscience,
medical diagnosis, stock market analysis, social network analysis,
traffic and infrastructure planning, ..
.
See more aten.wikipedia.org/wiki/Machine_
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
15of 1
What is common to these examples?
1 Given some data (e.g. hand written digits).
2 Possibly some extra information (e.g. which digit does this
number represent)
3 Goal: Built a machine which can learn from the given data
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Flavour of this course
Formalise intuitions about problems
Use language of mathematics to express models Geometry, vectors, linear algebra for reasoning Probabilistic models to capture uncertainty Calculus to identify good parameters Design and analysis of algorithms Numerical algorithms in python
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
17of 1
What is Machine Learning?
Definition (First Try)
Machine learning is concerned with the design and development of algorithms that allow machines to learn.
machines? computers? HAL? to learn?
need to quantify "learning"
to improve their performance over time
Definition (Second Try)
Machine learning is concerned with the design and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
What is Machine Learning?
Definition (First Try)
Machine learning is concerned with the design and development of algorithms that allow machines to learn.
machines? computers? HAL? to learn?
need to quantify "learning"
to improve their performance over time
Definition (Second Try)
Machine learning is concerned with the design and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
19of 1
What is Machine Learning?
Definition (First Try)
Machine learning is concerned with the design and development of algorithms that allow machines to learn.
machines? computers? HAL? to learn?
need to quantify "learning"
to improve their performance over time
Definition (Second Try)
Machine learning is concerned with the design and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
What is Machine Learning?
What is the source of the improved performance? New insights by the algorithm designer?
Definition (Final Version)
Machine learning is concerned with the design and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
21of 1
What is Machine Learning?
What is the source of the improved performance? New insights by the algorithm designer?
Definition (Final Version)
Machine learning is concerned with the design and
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
What is Machine Learning?
Definition (Mitchell, 1998)
A computer program is said to learn from experienceEwith
respect to some class of tasksTand performance measureP,
if its performance at tasks inT, as measured byP, improves
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
23of 1
What is the challenge?
Given only some examples.
Need to derive a relation for many more (possibly infinite) unseen examples.
Occam’s Razor (also written as Ockham’s razor, William Ockham, circa 1285 – 1349): “Plurality must never be posited without necessity”
“The simplest explanation or strategy tends to be the best one.”
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Statistical
Machine Learning - Some History
1960’s : symbolic AI; computers learn rules from data; analysis of the underlying statistics is seldom done. Perceptron (Rosenblatt, 1957), "Perceptrons" (Minsky and Papert, 1969)
1980’s : artificial neural networks
1990’s - 2000’s : statistical machine learning (kernel methods, decision trees, graphical models)
Why Statistical Machine Learning not earlier?
fastercomputers withlargermemory to represent statistical models have become available
numerical methods on the desktop computer (BLAS, LAPACK, Optimisation)
found new interesting classes of algorithms (e.g. on graphs) large amounts of data available which can be tapped into (flickr, social networks)
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
25of 1
Supervised learning
Machine learning is about prediction
Examples/features x1, . . . ,xn∼ X
Labels/annotations y1, . . . ,yn∼ Y
Predictor fw(x) :X → Y
Estimate best predictor = training = learning
Given data(x1,y1), . . . ,(xn,yn), find a predictorfw(·).
No mechanistic model of the phenomenon
There is relatively large amounts of data (examples,x
usuallyRd)
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Strategy in this course
Estimate best predictor = training = learning
Given data(x1,y1), . . . ,(xn,yn), find a predictorfw(·).
1 Identify the type of inputxand outputydata 2 Propose a (linear) mathematical model forfw 3 Design an objective function or likelihood 4 Calculate the optimal parameter (w)
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
27of 1
Related Fields
Artificial Intelligence - AI Statistics
Game Theory
Neuroscience, Psychology Data Mining
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Artificial Intelligence - AI
Artificial intelligence is the intelligence of machines and the branch of computer science which aims to create it.
The field was founded on theclaimthat human intelligence
can be so precisely described that it can be simulated by a machine.
Central areas: reasoning, knowledge, planning,learning,
communication, perception and the ability to move and manipulate objects (autonomous robotics).
Philosophical questions: Can a machine have a mind and consciousness? Are there limits to how intelligent
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
29of 1
Statistics
Descriptive Statistics: Summarize or describe a collection of data
Inferential Statistics: Draw inferences about a process taking randomness and uncertainty into account What can be inferred from data and some modelling assumptions?
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Neuroscience, Psychology
How do animals learn?
Modelling of human learning (e.g. Bayesian models of human inductive learning)
increasing interaction between Statistical Machine Learning and Neuroscience, Psychology
e.g. NIPS - Neural Information Processing Systems Conference 2009:
“Discriminative Network Models of Schizophrenia”, “Functional network reorganization in motor cortex can be explained by reward-modulated Hebbian learning”,
“Canonical Time Warping for Alignment of Human Behavior”
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
31of 1
Data Mining
searching through large data sets (databases) goal: extracting hidden patterns from data examples: bioinformatics, genetics, medicine
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Computer Science
"What can be (efficiently) automated?" Algorithms
Data Structures
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
33of 1
Adaptive Control Theory
consider systems with parameters changing (slowly) over time or being uncertain
Example: aircraft which changes its weight over time depending on fuel consumption (which in turn depends on the wind)
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Some Basic Notation - Data
The set of all input data is denoted asX. For instance,
X ={x|xis an image containing a handwritten digit}.
One data point withDelements :
x= x1 . . . xD
= (x1, . . . ,xD)T.
Data matrix : A set ofNdata pointsxi, wherei=1. . .N,
X=
x1,1. . .x1,D
x2,1. . .x2,D . . .
xN,1. . .xN,D
= xT1
. . .
xTN
.
(Note : Each data pointxiis a column vector, but appears
as a row vector inX.)
IfD=1,Xis a vector ofNscalar data points. We write
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
35of 1
Some Basic Notation - Targets
A target can be from a finite discrete set (’labels’) or from
R.
(Note: Can extend this idea tom-dimensional labels and
Rm.)
Set of TargetsT, e.g.
T ={one,two,three,four,five,six,seven,eight,nine,zero}.
An ordered set ofNscalar labelst=
t1
. . .
tN
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Supervised Learning
Given are pairs of dataxi∈ X and targetsti∈ T in the
form(xi,ti), wherei=1..N.
Learn a mapping between the dataXand the targett
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
37of 1
Supervised Learning
Given are pairs of dataxi∈ X and targetsti∈ T in the
form(xi,ti), wherei=1..N.
Learn a mapping between the dataXand the targett
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Unsupervised Learning
Given only the dataxi∈ X.
Discover (=learn) some interesting structure inherent in
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
39of 1
Unsupervised Learning
Given only the dataxi∈ X.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
41of 1
Reinforcement Learning
Example: Game playing. There is one reward at the end of the game (negative or positive).
Find suitable actions in a given environment with the goal of maximising some reward.
correct input/output pairs never presented Reward might only come after many actions.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Reinforcement Learning
observation1 Agent action1 choose action reward1 receive reward observation2 action2 reward2 Agent choose action receive reward Agent rewardi actioni observationi receive reward ... choose actionExploration versus Exploitation.
Well suited for problems with a long-term versus short-term reward trade-off.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
43of 1
Other Machine Learning Types
Active Learning
The algorithm may choose which dataxi∈ Xto select next when building the model.
The order of the data isactivelychosen by the algorithm at run-time.
Transduction
The algorithms is allowed to use the test data (but of course not labels!) when building a model.
Estimation with missing variables.
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Training Regimes
Batch Learning
All training dataX={x1, . . . ,xn}and targetst={t1, . . . ,tn}
are given.
Learn a mapping fromxitotiwhich can then be applied to
yet unseen dataX0={x01, . . . ,x
0
m}to findt0={t01, . . . ,t
0
m}.
Online Processing
Pairs of(xi,ti)become available one at a time.
At each step, learn and refine a mapping fromxitotiwhich
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
45of 1
Python
dynamically typed programming language (no declarations for variables)
supports object oriented, imperative, and functional programming style
many built-in data types (str, tuple, list, set, dict, . . . ) packages for scientific programming (numpy, scipy, pandas, matplotlib)
easily extensible to use code written in C and C++ (or FORTRAN for that matter)
Python runs on Windows, Linux/Unix, Mac OS X, Android, Raspberry Pi
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
Jupyter Notebooks
Literate programming environment
Markdown cells also understand LATEX
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University
47of 1
Introduction to Statistical Machine Learning
c
2016
Ong & Walder Data61 | CSIRO The Australian National
University