• No results found

Principles of Machine Learning

N/A
N/A
Protected

Academic year: 2021

Share "Principles of Machine Learning"

Copied!
145
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)
(3)
(4)
(5)

Learning a Real Valued Func3on

•  Consider a real-valued target func3on f •  Training examples 〈xi, di〉 , where di is

noisy training value di = f(xi) + ei

•  ei is random variable (noise) drawn independently for each xi according to Gaussian distribu3on with zero mean •  ⇒ di has mean f(xi) and same variance

(6)
(7)
(8)
(9)

W = W[ 0...Wn]T is the weight vector

Xp = X

[

0 p....Xnp

]

T is the pth training sample

yp = Wi

i

Xip = W • Xp is the output of the neuron for input Xp

Xp = f X

( )

p is the desired output for input Xp

ep = d

(

p − yp

)

is the error of the neuron on input Xp

S = X

{

(

p,dp

)

}

is`a (multi) set of training examples

ES( )W = ES(W0,W1,...Wn) = 1

2 ep

2

p

∑ is the estimated

error of W on training set S Goal : Find W* = argmin

W

(10)
(11)
(12)
(13)

Choice of learning rate

max λ η 1 0 < < Batch Update 2

2

0

p

η

X

<

<

Per sample Update In theory, infinitesimally small (why?)

λmax is the largest Eigen value of the Hessian of ES (matrix of second

order par3al deriva3ves of ES with respect to the weights) Eigen values of a matrix A are given by solu3ons of

(14)
(15)
(16)

Newton Method

•  U3lizing the second order deriva3ve •  Expand the objec3ve func3on to the second order around x0 •  The minimum point is •  Newton method for op3miza3on •  Guaranteed to converge when the objec3ve func3on is convex 0 0 2 0 0 0 ( ) ( ) ( ) ( ) 2 '( ) x x , ''( ) x x b f x f x a x x x x a f x b f x = = ≈ + − + − = = 0 / x = xa b

xnew ← xold f '(x) x=xold f ''(x)

(17)
(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)
(26)
(27)
(28)
(29)
(30)
(31)
(32)
(33)

•  Both DFP and BFGS methods have theore3cal proper3es that guarantee superlinear (fast) convergence rate and global convergence under certain condi3ons.

•  However, both methods could fail for general nonlinear problems. •  DFP is highly sensi3ve to inaccuracies in line searches.

•  Both methods can get stuck on a saddle-point. In Newton's method, a saddle-point can be detected during modifica3ons of the (true) Hessian. Therefore, search around the final point when using quasi-Newton methods.

•  Update of Hessian becomes "corrupted" by round-off and other inaccuracies.

•  All kind of "tricks" such as scaling and precondi3oning exist to boost the performance of the methods.

(34)
(35)
(36)

Free Sopware

•  hHp://www.ece.northwestern.edu/~nocedal/sopware.html

(37)
(38)
(39)
(40)
(41)
(42)

•  Let ϕ : ℜàℜ be a non-constant, bounded (hence non-linear),

monotone, con3nuous func3on. Let IN be the N-dimensional unit hypercube in ℜN.

(43)
(44)
(45)
(46)
(47)
(48)
(49)
(50)
(51)
(52)
(53)
(54)
(55)
(56)
(57)
(58)
(59)
(60)
(61)
(62)

Training mul3-layer networks – Some Useful Tricks

(63)
(64)

Some Useful Tricks

•  Use sigmoid func3on which sa3sfies ϕ(–z)=–ϕ(z) helps speed up convergence

( )

(65)
(66)
(67)

Some Useful Tricks Input and output encodings

•  Do not eliminate natural proximity in the input or output space

•  Do not normalize input paHerns to be of unit length if the length is likely to be relevant for dis3nguishing between classes

•  Do not introduce unwarranted proximity as an ar3fact •  Do not use log2 M outputs to encode M classes, use M

outputs instead to avoid spurious proximity in the output space

(68)
(69)
(70)
(71)
(72)
(73)
(74)
(75)

Some Useful Tricks

(76)
(77)
(78)
(79)
(80)
(81)

φ-separability of paHerns

0

0

<

ϕ

>

ϕ

)

(

)

(

X

W

X

W

2 1 C C ∈ ∈ X X H i i i H 1 1 =

ϕ

ϕ

>

ϕ

ϕ

=<

ϕ

)}

(

{

)

(

),...,

(

)

(

x

x

x

x

Hidden layer representa3on A (binary) par33on, also called dichotomy, (C1,C2) of the

training set C is φ-separable if there is a vector w of dimension

(82)
(83)
(84)
(85)

Gaussian Radial Basis Func3on φ

center

φ :

σ is a measure of how spread the curve is:

(86)
(87)
(88)

In the feature (hidden) space:

When mapped into the feature space < z1 , z2 >, C1 and C2 become linearly

(89)
(90)
(91)
(92)
(93)
(94)
(95)
(96)
(97)
(98)

RBF Learning Algorithm

•  Ini3alize the parameters -- centers of the hidden neurons are typically ini3alized to coincide with a subset of the training set •  Use gradient descent to adjust the parameters using the training

(99)
(100)
(101)
(102)

“motorcycle”

(103)

You see this:

(104)
(105)
(106)
(107)
(108)

SIFT Spin image

HoG RIFT

(109)

SIFT Spin image

HoG RIFT

Textons GLOH

(110)

[Roe et al., 1992]

Auditory cortex learns to see!

(111)

•  Also referred to as representa3on learning or unsupervised feature learning (with subtle dis3nc3ons)

•  Is there some way to extract meaningful features from data

even without knowing the task to be performed?

(112)
(113)

V1 is the first stage of visual processing in the brain. Neurons in V1 typically modeled as edge detectors:

Neuron #1 of visual cortex

(114)

Sparse coding (Olshausen & Field,1996)

Input: Images x

(1)

, x

(2)

, …, x

(m)

(each in R

n x n

)

(115)
(116)

0.6 * + 0.8 * + 0.4 *

φ15 φ28 φ37 1.3 * + 0.9 * + 0.3 *

φ5 φ18 φ29 • Method “invents” edge detection

•  Automatically learns to represent an image in terms of the edges that appear in it. Gives a more succinct, higher-level representation than the raw pixels.

(117)
(118)
(119)
(120)
(121)
(122)
(123)
(124)
(125)
(126)

• 

Autoencoders

• 

Deep belief networks (DBNs)

• 

Convolu3onal variants

(127)

Andrew Ng x4 x5 x6 +1 Layer 1 Layer 2 1 x2 x3 x4 x5 x6 x1 x2 x3 +1 Layer 3 Network is trained to output the input (learn identify function).

(128)
(129)

Andrew Ng x4 x5 x6 +1 Layer 1 Layer 2 1 x2 x3 +1 a1 a2 a3

(130)
(131)

Andrew Ng x4 x5 x6 +1 1 x2 x3 +1 a1 a2 a3 +1 b1 b2 b3

(132)

Andrew Ng x4 x5 x6 +1 1 x2 x3 +1 a1 a2 a3 +1 b1 b2 b3

(133)

Andrew Ng x4 x5 x6 +1 1 x2 x3 +1 a1 a2 a3 +1 b1 b2 b3

(134)

Andrew Ng x4 x5 x6 +1 1 x2 x3 +1 a1 a2 a3 +1 b1 b2 b3

(135)
(136)
(137)

Andrew Ng x4 x5 x6 +1 1 x2 x3 +1 a1 a2 a3 +1 b1 b2 b3 +1 c1 c2 c3 New representation for input.

(138)

Andrew Ng

Deep Belief Net (DBN) is another algorithm

for learning a feature hierarchy.

Building block: 2-layer graphical model

(Restricted Boltzmann Machine).

(139)

Andrew Ng

Input [x1, x2, x3, x4]

Layer 2. [a1, a2, a3]

(binary-valued)

MRF with joint distribution:

Use Gibbs sampling for inference.

Given observed inputs x, want maximum likelihood estimation:

x

4

x

1

x

2

x

3

a

2

a

3

(140)

Andrew Ng

Input [x1, x2, x3, x4]

Layer 2. [a1, a2, a3]

(binary-valued)

Gradient ascent on log P(x) :

[xiaj]obs from fixing x to observed value, and sampling a from P(a|x). [xiaj]prior from running Gibbs sampling to convergence.

Adding sparsity constraint on ai’s usually improves results.

x

4

x

1

x

2

x

3

a

2

a

3

(141)

Andrew Ng

Input [x1, x2, x3, x4]

Layer 2. [a1, a2, a3]

Layer 3. [b1, b2, b3]

Stack RBMs on top of each other to get DBN.

Train with approximate maximum likelihood (often with

(142)

Andrew Ng

Input [x1, x2, x3, x4]

Layer 2. [a1, a2, a3]

Layer 3. [b1, b2, b3]

(143)

Andrew Ng

(144)

Andrew Ng

(145)

References

Related documents

While every program is custom-designed specifically for your organization, Scott engages his audience on the following topics and provides the following.. takeaways for

Wherever loads exceed these allowable loads during detail design, vendor shall be contacted to confirm the design of equipment for applicable nozzle loads Pressure vessel nozzles

the German Democratic Republic translated the books of Russian writers who followed the Soviet canon, whereas the Federal Republic of Germany.. translated Russian

Frontal plane movement of the pelvis and thorax during dynamic activities in individuals with and without anterior cruciate ligament injury..

According to the Table 6, the associated competitiveness impacts of alternative climate policy scenarios for European ETS_NBTA industries reflect the effects on the

PDA Seizure’s features include the ability to acquire a forensic image of Palm and Pocket PC devices, to perform examiner-defined searches on data contained within acquired

[2004] conclude that relatively lower emissions would be rectified due to the uncertainty of climate change damages (i.e. uncertainties in the response of the physical climate system

Biological methods, especially the traditionally activated sludge process have been widely used for the treatment of textile wastewaters because it can remove dyes from