Principles of Machine Learning


Learning a Real-Valued Function

•  Consider a real-valued target function f

•  Training examples 〈x_i, d_i〉, where d_i is the noisy training value d_i = f(x_i) + e_i

•  e_i is a random variable (noise) drawn independently for each x_i according to a Gaussian distribution with zero mean

•  ⇒ d_i has mean f(x_i) and the same variance as e_i
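The noise model can be simulated in a few lines. This is a minimal sketch: the target function `f`, the noise level `sigma`, and the helper name `noisy_examples` are illustrative assumptions, not part of the slides.

```python
import random

def f(x):
    # Hypothetical target function, chosen only for illustration.
    return 2.0 * x + 1.0

def noisy_examples(xs, sigma, rng):
    # d_i = f(x_i) + e_i, with e_i ~ N(0, sigma^2) drawn independently per x_i.
    return [(x, f(x) + rng.gauss(0.0, sigma)) for x in xs]

rng = random.Random(0)
# Sample the same input many times to see that d_i averages out to f(x_i).
samples = noisy_examples([3.0] * 20000, sigma=0.5, rng=rng)
mean_d = sum(d for _, d in samples) / len(samples)
print(mean_d)  # close to f(3.0) = 7.0
```

Averaging many noisy values at one input recovers f(x_i), which is what "d_i has mean f(x_i)" states.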


$W = [W_0 \ldots W_n]^T$ is the weight vector

$X_p = [X_{0p} \ldots X_{np}]^T$ is the pth training sample

$y_p = \sum_i W_i X_{ip} = W \cdot X_p$ is the output of the neuron for input $X_p$

$d_p = f(X_p)$ is the desired output for input $X_p$

$e_p = (d_p - y_p)$ is the error of the neuron on input $X_p$

$S = \{(X_p, d_p)\}$ is a (multi)set of training examples

$E_S(W) = E_S(W_0, W_1, \ldots, W_n) = \frac{1}{2} \sum_p e_p^2$ is the estimated error of W on training set S

Goal: Find $W^* = \arg\min_W E_S(W)$
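The definitions of y_p, e_p, and E_S(W) translate directly into code. The sketch below assumes plain batch gradient descent as the minimizer and a toy data set generated from d = 1 + 2x; the function names, the learning rate, and the step count are illustrative choices.

```python
def output(W, X):
    # y_p = W . X_p
    return sum(w * x for w, x in zip(W, X))

def error(W, S):
    # E_S(W) = 1/2 * sum_p (d_p - y_p)^2
    return 0.5 * sum((d - output(W, X)) ** 2 for X, d in S)

def gradient(W, S):
    # dE_S/dW_i = -sum_p (d_p - y_p) * X_ip
    g = [0.0] * len(W)
    for X, d in S:
        e = d - output(W, X)
        for i, x in enumerate(X):
            g[i] -= e * x
    return g

def batch_gradient_descent(S, eta, steps):
    # Start from W = 0 and repeatedly step against the gradient.
    W = [0.0] * len(S[0][0])
    for _ in range(steps):
        g = gradient(W, S)
        W = [w - eta * gi for w, gi in zip(W, g)]
    return W

# Toy training set: X_0p = 1 acts as the bias input, d_p = 1 + 2 * X_1p.
S = [([1.0, x], 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
W_star = batch_gradient_descent(S, eta=0.05, steps=2000)
print(W_star, error(W_star, S))
```

On this noise-free data the minimizer is W = [1, 2], where the error is zero.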


Choice of learning rate

In theory, infinitesimally small (why?)

Batch update: $0 < \eta < \frac{1}{\lambda_{\max}}$

Per-sample update: $0 < \eta < \frac{2}{\|X_p\|^2}$

$\lambda_{\max}$ is the largest eigenvalue of the Hessian of $E_S$ (the matrix of second-order partial derivatives of $E_S$ with respect to the weights). Eigenvalues of a matrix A are given by the solutions of

$\det(A - \lambda I) = 0$
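For the squared error E_S(W) = 1/2 Σ_p (d_p − W·X_p)², the Hessian is Σ_p X_p X_pᵀ, so both learning-rate bounds (η < 1/λ_max for batch updates, η < 2/‖X_p‖² per sample) can be computed from the inputs alone. The sketch below uses power iteration for λ_max; that choice, the toy inputs, and all names are assumptions made for illustration.

```python
import math

def hessian(inputs):
    # Hessian of E_S for a linear neuron: sum_p X_p X_p^T.
    n = len(inputs[0])
    H = [[0.0] * n for _ in range(n)]
    for X in inputs:
        for i in range(n):
            for j in range(n):
                H[i][j] += X[i] * X[j]
    return H

def largest_eigenvalue(H, iters=500):
    # Power iteration; adequate for a symmetric positive semi-definite matrix.
    n = len(H)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        Hv = [sum(H[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(c * c for c in Hv))
        v = [c / lam for c in Hv]
    return lam

# Toy inputs with X_0p = 1 as the bias component.
inputs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
lam_max = largest_eigenvalue(hessian(inputs))
eta_batch_bound = 1.0 / lam_max
eta_sample_bounds = [2.0 / sum(x * x for x in X) for X in inputs]
print(lam_max, eta_batch_bound, eta_sample_bounds)
```

Here the Hessian is [[4, 6], [6, 14]], so λ_max is the larger root of (4 − λ)(14 − λ) − 36 = 0.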


Newton Method

•  Utilizing the second-order derivative

•  Expand the objective function to second order around x₀:

$f(x) \approx f(x_0) + a(x - x_0) + \frac{b}{2}(x - x_0)^2$, where $a = f'(x)|_{x = x_0}$ and $b = f''(x)|_{x = x_0}$

•  The minimum point is $x = x_0 - a/b$

•  Newton method for optimization:

$x_{\text{new}} \leftarrow x_{\text{old}} - \frac{f'(x)|_{x = x_{\text{old}}}}{f''(x)|_{x = x_{\text{old}}}}$

•  Guaranteed to converge when the objective function is convex (for the pure Newton step, provided the starting point is sufficiently close to the minimum)
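The Newton update is short enough to write out for the one-dimensional case. This is a sketch under stated assumptions: the objective f(x) = exp(x) − 2x, the starting point, and the stopping rule are illustrative and not taken from the slides.

```python
import math

def newton_minimize(df, d2f, x0, tol=1e-12, max_iter=50):
    # Repeats x_new <- x_old - f'(x_old) / f''(x_old) until the step is tiny.
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Hypothetical convex objective f(x) = exp(x) - 2x, minimized at x = ln 2.
x_min = newton_minimize(
    df=lambda x: math.exp(x) - 2.0,   # f'(x)
    d2f=lambda x: math.exp(x),        # f''(x)
    x0=0.0,
)
print(x_min)  # close to ln 2
```

Each iteration minimizes the local quadratic model exactly, which is why the step is −a/b with a = f'(x_old) and b = f''(x_old).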


•  Both DFP and BFGS methods have theoretical properties that guarantee a superlinear (fast) convergence rate and global convergence under certain conditions.

•  However, both methods could fail for general nonlinear problems.

•  DFP is highly sensitive to inaccuracies in line searches.

•  Both methods can get stuck on a saddle point. In Newton's method, a saddle point can be detected during modifications of the (true) Hessian. Therefore, search around the final point when using quasi-Newton methods.

•  The update of the Hessian becomes "corrupted" by round-off and other inaccuracies.

•  All kinds of "tricks", such as scaling and preconditioning, exist to boost the performance of the methods.
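To make the quasi-Newton idea concrete, here is a minimal BFGS sketch that maintains an approximation of the inverse Hessian and uses a backtracking line search. The test function, the Armijo constant, the tolerances, and the helper names are all assumptions for illustration; production code would normally rely on a tested library implementation.

```python
def f(p):
    # Hypothetical convex quadratic with minimum at (1, -2).
    x, y = p
    return (x - 1.0) ** 2 + 10.0 * (y + 2.0) ** 2

def grad(p):
    x, y = p
    return [2.0 * (x - 1.0), 20.0 * (y + 2.0)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def bfgs(f, grad, x0, tol=1e-10, max_iter=100):
    n = len(x0)
    x = list(x0)
    # H approximates the INVERSE Hessian; start from the identity.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        d = [-c for c in matvec(H, g)]  # quasi-Newton search direction
        # Backtracking line search with the Armijo sufficient-decrease test.
        t, fx, slope = 1.0, f(x), dot(g, d)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * slope:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 1e-12:  # curvature condition; skip the update otherwise
            rho = 1.0 / sy
            Hy = matvec(H, y)
            yHy = dot(y, Hy)
            # H <- (I - rho s y^T) H (I - rho y s^T) + rho s s^T, expanded.
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1.0 + rho * yHy) * rho * s[i] * s[j]
                                - rho * (Hy[i] * s[j] + s[i] * Hy[j]))
        x, g = x_new, g_new
    return x

x_opt = bfgs(f, grad, [0.0, 0.0])
print(x_opt)  # close to [1.0, -2.0]
```

Skipping the update when sᵀy is not safely positive is one simple guard against the "corrupted" updates mentioned above; it keeps the approximation positive definite.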


Free Software

•  http://www.ece.northwestern.edu/~nocedal/software.html


•  Let ϕ : ℜ → ℜ be a non-constant, bounded (hence non-linear), monotone, continuous function. Let I_N be the N-dimensional unit hypercube in ℜ^N.


Training multi-layer networks – Some Useful Tricks


•  Using a sigmoid function that satisfies ϕ(−z) = −ϕ(z) helps speed up convergence
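The symmetry condition ϕ(−z) = −ϕ(z) separates tanh-style sigmoids from the logistic function. In the sketch below, the scaled tanh with constants 1.7159 and 2/3 is a commonly quoted choice and is an assumption here, not something the slide specifies.

```python
import math

def logistic(z):
    # Standard logistic sigmoid: range (0, 1), so it is NOT odd-symmetric.
    return 1.0 / (1.0 + math.exp(-z))

def odd_sigmoid(z, a=1.7159, b=2.0 / 3.0):
    # Scaled hyperbolic tangent: range (-a, a) and phi(-z) = -phi(z).
    return a * math.tanh(b * z)

z = 0.7
print(odd_sigmoid(-z) + odd_sigmoid(z))  # zero up to rounding: odd symmetry holds
print(logistic(-z) + logistic(z))        # one up to rounding: logistic(-z) = 1 - logistic(z)
```

An odd activation produces outputs centered on zero, which keeps the inputs to the next layer from all sharing the same sign.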


Some Useful Tricks: Input and output encodings

•  Do not eliminate natural proximity in the input or output space

•  Do not normalize input patterns to be of unit length if the length is likely to be relevant for distinguishing between classes

•  Do not introduce unwarranted proximity as an artifact

•  Do not use log₂ M outputs to encode M classes; use M outputs instead to avoid spurious proximity in the output space
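The advice on output encodings can be checked by comparing pairwise distances between class codes. The sketch below uses Hamming distance and M = 4 as illustrative assumptions; the helper names are hypothetical.

```python
def one_hot(k, M):
    # M outputs: class k maps to a 1 at position k and 0 elsewhere.
    return [1 if i == k else 0 for i in range(M)]

def binary_code(k, M):
    # About log2(M) outputs: class k maps to its binary digits.
    bits = max(1, (M - 1).bit_length())
    return [(k >> i) & 1 for i in reversed(range(bits))]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

M = 4
pairs = [(i, j) for i in range(M) for j in range(M) if i != j]
one_hot_dists = {hamming(one_hot(i, M), one_hot(j, M)) for i, j in pairs}
binary_dists = {hamming(binary_code(i, M), binary_code(j, M)) for i, j in pairs}
print(one_hot_dists)  # {2}: every pair of classes is equally far apart
print(binary_dists)   # {1, 2}: some class pairs are spuriously closer
```

With the binary code, classes 0 and 1 differ in one output while classes 0 and 3 differ in two, a proximity structure that has nothing to do with the classes themselves.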


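φ-separability of patterns

Hidden layer representation: $\varphi(\mathbf{x}) = \langle \varphi_1(\mathbf{x}), \ldots, \varphi_H(\mathbf{x}) \rangle$, built from the hidden functions $\{\varphi_i(\mathbf{x})\}_{i=1}^{H}$

A (binary) partition, also called dichotomy, (C₁, C₂) of the training set C is φ-separable if there is a vector W of dimension H such that

$\mathbf{W} \cdot \varphi(\mathbf{X}) > 0$ for $\mathbf{X} \in C_1$ and $\mathbf{W} \cdot \varphi(\mathbf{X}) < 0$ for $\mathbf{X} \in C_2$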


Gaussian Radial Basis Function φ

φ is defined around a center; σ is a measure of how spread out the curve is.


In the feature (hidden) space:

When mapped into the feature space <z₁, z₂>, C₁ and C₂ become linearly separable.
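A small worked case shows how a Gaussian RBF mapping can make classes linearly separable. The setup below is an assumption chosen for illustration: the XOR patterns, two hidden units centered at (1, 1) and (0, 0), σ = 1, the common form φ(x) = exp(−‖x − center‖² / (2σ²)), and a hand-picked separating line.

```python
import math

def gaussian_rbf(x, center, sigma):
    # phi(x) = exp(-||x - center||^2 / (2 * sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Assumed setup: XOR inputs and two hidden units.
centers = [(1.0, 1.0), (0.0, 0.0)]
sigma = 1.0
C1 = [(0.0, 0.0), (1.0, 1.0)]  # XOR output 0
C2 = [(0.0, 1.0), (1.0, 0.0)]  # XOR output 1

def features(x):
    # Maps an input pattern to the feature space <z1, z2>.
    return [gaussian_rbf(x, c, sigma) for c in centers]

def score(x):
    # Hand-picked line z1 + z2 = 1.3 in feature space.
    z1, z2 = features(x)
    return z1 + z2 - 1.3

for x in C1 + C2:
    print(x, features(x), score(x) > 0)
```

In the input space no single line separates C₁ from C₂, but in <z₁, z₂> both C₂ patterns land on the same point and fall on the opposite side of the line from the C₁ patterns.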


RBF Learning Algorithm

•  Initialize the parameters -- centers of the hidden neurons are typically initialized to coincide with a subset of the training set

•  Use gradient descent to adjust the parameters using the training set


“motorcycle”


You see this:


Feature descriptors shown: SIFT, Spin image, HoG, RIFT, Textons, GLOH


[Roe et al., 1992]

Auditory cortex learns to see!


•  Also referred to as representation learning or unsupervised feature learning (with subtle distinctions)

•  Is there some way to extract meaningful features from data even without knowing the task to be performed?


V1 is the first stage of visual processing in the brain. Neurons in V1 are typically modeled as edge detectors:

Neuron #1 of visual cortex


Sparse coding (Olshausen & Field, 1996)

Input: Images $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ (each in $\mathbb{R}^{n \times n}$)


$x \approx 0.6 \, \varphi_{15} + 0.8 \, \varphi_{28} + 0.4 \, \varphi_{37}$

$x \approx 1.3 \, \varphi_{5} + 0.9 \, \varphi_{18} + 0.3 \, \varphi_{29}$

•  Method “invents” edge detection

•  Automatically learns to represent an image in terms of the edges that appear in it. Gives a more succinct, higher-level representation than the raw pixels.
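The idea of a succinct representation can be illustrated by storing only the nonzero coefficients and rebuilding the image from them. In this sketch the basis vectors are hypothetical 4-pixel stand-ins for learned edge filters, and only the coefficients 0.6, 0.8, 0.4 and the indices 15, 28, 37 come from the first example above.

```python
def reconstruct(code, basis):
    # x ~= sum_j a_j * phi_j, where only a few coefficients a_j are nonzero.
    n = len(next(iter(basis.values())))
    x = [0.0] * n
    for j, a in code.items():
        for k in range(n):
            x[k] += a * basis[j][k]
    return x

# Hypothetical 4-pixel basis vectors standing in for learned edge filters.
basis = {
    15: [1.0, 0.0, 0.0, 0.0],
    28: [0.0, 1.0, 0.0, 0.0],
    37: [0.0, 0.0, 1.0, -1.0],
}

# Sparse code: three nonzero coefficients instead of one value per pixel.
code = {15: 0.6, 28: 0.8, 37: 0.4}
patch = reconstruct(code, basis)
print(patch)
```

With a large learned dictionary, the sparse code (a handful of index and coefficient pairs) is the higher-level representation, and the pixels are recovered only when needed.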


• Autoencoders

• Deep belief networks (DBNs)

• Convolutional variants


Autoencoder figure (Andrew Ng): inputs x₁ … x₆ plus a +1 bias unit (Layer 1), a hidden layer plus a +1 bias unit (Layer 2), and outputs x₁ … x₆ (Layer 3).

Network is trained to output the input (learn the identity function).
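A tiny autoencoder with six inputs, three hidden units, and six outputs can be trained with per-sample gradient descent on the squared reconstruction error. Everything specific in this sketch is an assumption: sigmoid hidden units with a linear output layer, one-hot training data, the learning rate, and the epoch count.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    # Hidden activations a (the learned features), then the reconstruction.
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    x_hat = [sum(w * ai for w, ai in zip(row, a)) + b for row, b in zip(W2, b2)]
    return a, x_hat

def loss(data, params):
    # Mean over examples of 1/2 * ||x_hat - x||^2.
    total = 0.0
    for x in data:
        _, x_hat = forward(x, *params)
        total += 0.5 * sum((xh - xi) ** 2 for xh, xi in zip(x_hat, x))
    return total / len(data)

def train(data, n_hidden, eta=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n)]
    b2 = [0.0] * n
    for _ in range(epochs):
        for x in data:  # the training target is the input itself
            a, x_hat = forward(x, W1, b1, W2, b2)
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            # Backpropagate the error through the linear output layer.
            delta = [a[j] * (1.0 - a[j]) * sum(err[i] * W2[i][j] for i in range(n))
                     for j in range(n_hidden)]
            for i in range(n):
                for j in range(n_hidden):
                    W2[i][j] -= eta * err[i] * a[j]
                b2[i] -= eta * err[i]
            for j in range(n_hidden):
                for k in range(n):
                    W1[j][k] -= eta * delta[j] * x[k]
                b1[j] -= eta * delta[j]
    return W1, b1, W2, b2

data = [[1.0 if i == k else 0.0 for i in range(6)] for k in range(6)]
loss_before = loss(data, train(data, n_hidden=3, epochs=0))
loss_after = loss(data, train(data, n_hidden=3, epochs=500))
print(loss_before, loss_after)
```

Because the hidden layer is narrower than the input, the network cannot simply copy its input and has to find a compressed code in the hidden activations.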


Figure: Layer 1 inputs x₁ … x₆ with a +1 bias unit; Layer 2 units a₁, a₂, a₃ with a +1 bias unit.


Figure: inputs x₁ … x₆, units a₁, a₂, a₃, and a further layer of units b₁, b₂, b₃ (each layer with a +1 bias unit).


Figure: inputs x₁ … x₆, units a₁, a₂, a₃, units b₁, b₂, b₃, and a further layer of units c₁, c₂, c₃ (each layer with a +1 bias unit).

New representation for input.


Deep Belief Net (DBN) is another algorithm for learning a feature hierarchy.

Building block: 2-layer graphical model (Restricted Boltzmann Machine).


Input [x₁, x₂, x₃, x₄]

Layer 2. [a₁, a₂, a₃] (binary-valued)

MRF with joint distribution:

Use Gibbs sampling for inference.

Given observed inputs x, want maximum likelihood estimation:


Layer 2. [a₁, a₂, a₃] (binary-valued)

Gradient ascent on log P(x):

[x_i a_j]_obs comes from fixing x to the observed value and sampling a from P(a|x).

[x_i a_j]_prior comes from running Gibbs sampling to convergence.

Adding a sparsity constraint on the a_i's usually improves results.
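The two statistics can be tied together in a single update rule. The block below is a sketch that assumes a common RBM form without bias terms; the weight symbol $W_{ij}$, the normalizer $Z$, and the step size $\alpha$ are assumed notation, and the sign of the exponent is a convention.

```latex
P(x, a) = \frac{1}{Z} \exp\Big( \sum_{i,j} x_i W_{ij} a_j \Big)

\frac{\partial \log P(x)}{\partial W_{ij}}
  = [x_i a_j]_{\text{obs}} - [x_i a_j]_{\text{prior}}

W_{ij} \leftarrow W_{ij}
  + \alpha \big( [x_i a_j]_{\text{obs}} - [x_i a_j]_{\text{prior}} \big)
```

Under this convention the first term raises the probability of the observed data and the second lowers the probability of configurations the model currently favors, so the update stops changing when the two statistics agree.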


Layer 2. [a₁, a₂, a₃]

Layer 3. [b₁, b₂, b₃]

Stack RBMs on top of each other to get DBN.

Train with approximate maximum likelihood (often with
