• No results found

Least Squares Optimization and Gradient Descent Algorithm

N/A
N/A
Protected

Academic year: 2021

Share "Least Squares Optimization and Gradient Descent Algorithm"

Copied!
52
0
0

Loading.... (view fulltext now)

Full text

(1)

Least Squares Optimization and

Gradient Descent Algorithm

(2)

Example

Single Variable Linear Regression


estimate


̂yi = θ0+ θ1xi 12 20 25 35 40 50 y 20 40 10 20 60 65 x

(3)

ESTIMATING PARAMETERS:


(4)

SCATTER PLOT

Plot all (Xi, Yi) pairs, and plot your learned model

4

0

20

40

60

0

20

40

60

X

Y

(5)

QUESTION

How would you draw a line through the points? How do you determine which line “fits the best” …?


?????????

5

0

20

40

60

0

20

40

60

X

Y

[WF]

(6)

QUESTION

How would you draw a line through the points?

How do you determine which line “fits the best” ?????????

6

0

20

40

60

0

20

40

60

X

Y

Slope changed Intercept unchanged

(7)

QUESTION

How would you draw a line through the points?

How do you determine which line “fits the best” ?????????

7

0

20

40

60

0

20

40

60

X

Y

Slope unchanged Intercept changed [WF]

(8)

QUESTION

How would you draw a line through the points?

How do you determine which line “fits the best” ?????????

8

0

20

40

60

0

20

40

60

X

Y

Slope changed Intercept changed

(9)

LEAST SQUARES

Best fit: difference between the true (observed) Y-values and the

estimated Y-values is minimized:

• Positive errors offset negative errors … • … square the error!

Least squares minimizes the sum of the squared errors

9

[WF] n

i=1

(y

i

− ̂y

i

)

2

=

n i=1

ϵ

i2

(10)

10

LEAST SQUARES, GRAPHICALLY

ε

2

Y

X

ε

1

ε

3

ε

4

^

^

^

^

LS Minimizes 

n i=1

ϵ

i2

= ϵ

12

+ ϵ

22

+ … + ϵ

n2

y

2

= θ

0

+ θ

1

x

2

+ ϵ

2

̂y

i

= θ

0

+ θ

1

x

i

(11)

Example

Single Variable Linear Regression


estimate


̂yi = θ0+ θ1xi 1600 1400 2100 … …. 2400 Area(sq. ft.) y 220 180 350 … …. 500 Price (in 1000$) x

(12)

Multivariate Regression

Multi Linear Regression


Price (in 1000$) 220 180 350 … …. 500 Area(sq. ft.) 1600 1400 2100 … …. 2400 # Bathrooms # Bedrooms 2.5 1.5 3.5 … … 4 3 3 4 … … 5 ̂y i = θ0+ θ1xi1 + θ2xi2 + … + θmxim

y

x

1

x

2

x

3

(13)

Multivariate Regression

Multi Linear Regression


Price (in 1000$) 220 180 350 … …. 500 Area(sq. ft.) 1600 1400 2100 … …. 2400 # Bathrooms # Bedrooms 2.5 1.5 3.5 … … 4 3 3 4 … … 5 1400 1.5 3 yi xi xi1 xi2 xi3 ̂y i = θ0+ θ1xi1 + θ2xi2 + … + θmxim

(14)

Multivariate Regression

Multi Linear Regression


Price (in 1000$) 220 180 350 … …. 500 Area(sq. ft.) 1600 1400 2100 … …. 2400 # Bathrooms # Bedrooms 2.5 1.5 3.5 … … 4 3 3 4 … … 5 yi 1400 1.5 3 1 yi = θ0xi0+ θ1xi1 + θ2xi2 + … + θmxim xi xi1 xi2 x xi0 1 1 1 … …. 1

y

x

0

x

1

x

2

x

3

(15)

Multivariate Regression Model

Model:


feature 1 = x0 …. (constant, 1) feature 2 = x1 …. (area, sq. ft.) feature 3 = x2 …. (# of bedrooms) feature 4 = x3 …. (# of bathrooms) …. …. feature m = xm ̂y i = θ0xi0 + θ1xi1+ θ2xi2 + … + θmxim ̂y i = mj=0 θijxij

(16)

One Observation Model

Matrix Notation


For observation i


̂y i = mj=0 θijxij yi = 0 θ . . . …. θ 2 θ m θ 1 yi = XTθ xi0 xi1 xi2xim

(17)

All Observation Model

Matrix Notation


For all observations


. . . = . . . . . . . . . .. .. .. . . . .. .. .. .. . . . .. x3m . . . xnm . θm θ0 θ1 θ2 ̂Y = Xθ x10 x20 x12 x30 xn0 x11 xim x21 x22 x2m x31 x32 xn1 xn2 y1 y0 y2 yn

(18)

LEAST SQUARES OPTIMIZATION

Rewrite inputs:

Rewrite optimization problem:

Each row is a feature vector paired with a label for a single input

n labeled inputs m features

X =

(x

(1)

)

T

(x

(2)

)

T

(x

(n)

)

T

∈ ℝ

n×m

, y =

y

(1)

y

(2)

y

(n)

∈ ℝ

n

*Recall ||z||22 = zTz = ∑zi2

(19)

LEAST SQUARES OPTIMIZATION

Rewrite inputs:

Rewrite optimization problem:

Each row is a feature vector paired with a label for a single input

n labeled inputs m features

X =

(x

(1)

)

T

(x

(2)

)

T

(x

(n)

)

T

∈ ℝ

n×m

, y =

y

(1)

y

(2)

y

(n)

∈ ℝ

n

*Recall ||z||22 = zTz = ∑ i z2 i

⟹ minimize 

n i=1

(y

i

− ̂y

i

)

2

=

n

i=1

ϵ

2 i

(20)

ERROR FUNCTION

20

θ

n

i=1

(y

i

− ̂y

i

)

2

=

n

i=1

ϵ

2 i

ϵ

(21)

GRADIENTS

Minimizing a multivariate function involves finding a point

where the gradient is zero:

Points where the gradient is zero are

local

minima

• If the function is convex, also a

global

minimum

Let’s solve the least squares problem!

We’ll use the multivariate generalizations of


some concepts from MATH141/142 …

• Chain rule: 


• Gradient of squared ℓ

2

norm:

(22)

LEAST SQUARES

Recall the least squares optimization problem:

What is the gradient of the optimization objective ????????

22

Chain rule:

(23)

LEAST SQUARES

Recall: points where the gradient

equals zero

are minima.

So where do we go from here?????????

23

Solve for model parameters θ

(24)

LINEAR REGRESSION AS

OPTIMIZATION PROBLEM

Let’s consider linear regression that minimizes the sum of

squared error, i.e., least squares …

1. Hypothesis function: ????????

Linear

hypothesis function

2. Loss function: ????????

Squared

error loss

4. Optimization problem: ????????

24

min

θ n

i=1

T

x

(i)

− y

(i)

)

2

(25)

GRADIENT DESCENT

We used the gradient as a condition for optimality

It also gives the local

direction of steepest increase

for a

function:

Intuitive idea: take small steps

against

the gradient.

25

Image from Zico Kolter If there is no increase, gradient is zero = local minimum!

(26)

GRADIENT DESCENT

Algorithm for any* hypothesis function , loss

function , step size :

Initialize the parameter vector:

Repeat until satisfied (e.g., exact or approximate

convergence):

• Compute gradient:

• Update parameters:

26

*must be reasonably well behaved

g ← n

i=1

(27)

GRADIENT DESCENT

27

∂f(θ) ∂θ > 0 ∂f(θ) ∂θ < 0 ⋯θ⋯ θ := θ − α ∂f(θ)∂θ

(28)

EXAMPLE

Function: f(x,y) = x

2

+ 2y

2

Gradient: ??????????

Let’s take a gradient step

from (-2, +1):

Step in the direction (0.04,

-0.01), scaled by step size

Repeat until no movement

28

(29)

GRADIENT DESCENT

(30)

GRADIENT DESCENT

30

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x

(31)

GRADIENT DESCENT

31

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39

(32)

GRADIENT DESCENT

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175

(33)

GRADIENT DESCENT

33

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231

(34)

GRADIENT DESCENT

34

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215

(35)

GRADIENT DESCENT

35

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125

(36)

GRADIENT DESCENT

36

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495

(37)

GRADIENT DESCENT

37

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495 θ0 := θ0− αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α ni=1 ∂(SSE) ∂θ1

(38)

GRADIENT DESCENT

38

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495 θ0 := θ0− αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α ni=1 ∂(SSE) ∂θ1 α = 0.01

(39)

GRADIENT DESCENT

39

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495 θ0 := θ0− αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α ni=1 ∂(SSE) ∂θ1 α = 0.01 θ0 = 0.1 − 0.01 × (−1.132)

(40)

GRADIENT DESCENT

40

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495 θ0 := θ0− αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α ni=1 ∂(SSE) ∂θ1 α = 0.01 θ0 = 0.1 − 0.01 × (−1.132) θ = 0.11132

(41)

GRADIENT DESCENT

41

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495 θ0 := θ0− αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α ni=1 ∂(SSE) ∂θ1 α = 0.01 θ0 = 0.1 − 0.01 × (−1.132) θ0 = 0.11132 θθ11= 0.10495= 0.1 − 0.01 × (−0.495)

(42)

GRADIENT DESCENT

Algorithm for any* hypothesis function , loss

function , step size :

Initialize the parameter vector:

Repeat until satisfied (e.g., exact or approximate

convergence):

• Compute gradient:

• Update parameters:

42

*must be reasonably well behaved

θ0 := θ0− αm i=1 ∂(SSE) ∂θ0 θ1 := θ1 − α mi=1 ∂(SSE) ∂θ1

(43)

GRADIENT DESCENT -

MULTIVARIATE

43

θ0 := θ0 − αn i=1 ∂(SSE) ∂θ0 θ1 := θ1 − αn i=1 ∂(SSE) ∂θ1 θ2 := θ2 − αn i=1 ∂(SSE) ∂θ2

θm := θn− αn i=1 ∂(SSE) ∂θm

(44)

GRADIENT DESCENT -

MULTIVARIATE

44

θ0 := θ0− α 1n n i=1 (h(θ(xi) − yi)x0i

Repeat{ θj := θj − α 1n n i=1 (h(θ(xi) − yi)xji }

(update θj for all j = 1…m simultaneously

θ = 0 1n ni=1 (h(θ(xi) − yi)xji = ∂∂θ j f(θ) θ1 := θ1 − α 1n n i=1 (h(θ(xi) − yi)x1i θ2 := θ2− α 1n n i=1 (h(θ(xi) − yi)x2i θm := θm − α 1 n ni=1 (h(θ(xi) − yi)xmi obtain all partial derivatives w.r.t θ's first

(45)

PLOTTING LOSS OVER TIME

(46)

STOCHASTIC GRADIENT DESCENT

- MULTIVARIATE

46

Repeat{ θj := θj − α(h(θ(xi) − yi)xji }

(update θj for all j = 1…n

θ = 0

i = random index between 1 and m

STOCHASTIC GRADIENT DESCENT

Repeat{

θj := θj − α 1n n

i=1

(h(θ(xi) − yi)xji

}

(update θj for all j = 1…m simultaneously

θ = 0 1n ni=1 (h(θ(xi) − yi)xji = ∂∂θ j f(θ) obtain all partial derivatives w.r.t θ's first

(47)

GRADIENT DESCENT

47

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 12( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.2 0.44 0.31 0.45 0.26 0.123 0.75 0.39 0.12 0.131 0.145 0.175 0.0512 0.000032 0.183 0.0231 -0.32 0.008 -0.605 -0.215 -0.064 0.00248 -0.27225 -0.16125 -1.132 -0.495

(48)

GRADIENT DESCENT

48

θ0 = 0.1 θ1 = 0.1 ̂y = θ0 + θ1x x y ̂y = θ0 + θ1x 1 2( ̂y − y)2 SSE ∂(SSE) ∂θ0 ̂y − y ∂(SSE) ∂θ1 ( ̂y − y)x 0.31 0.123 0.131 0.000032 0.008 0.00248

...…..

(49)

STOCHASTIC

GRADIENT DESCENT

(50)

STOCHASTIC GRADIENT DESCENT

- MINI BATCH

50

Repeat {

}

(update θj for all j = 1…n

θ = 0

= random index between 1 and m

i

1

, …, i

l θj := θj − α 1 l li=1 (h(θ(xi) − yi)xji

(51)
(52)

GRADIENT DESCENT IN

PURE(-ISH) PYTHON

Implicitly using squared loss and linear hypothesis function

above; drop in your favorite gradient for kicks!

52

# Training data (X, y), T time steps, alpha step

def grad_descent(X, y, T, alpha):

m, n = X.shape # m = #examples, n = #features

theta = np.zeros(n) # initialize parameters

f = np.zeros(T) # track loss over time

for i in range(T):

# loss for current parameter vector theta

f[i] = 0.5*np.linalg.norm(X.dot(theta) – y)**2

# compute steepest ascent at f(theta)

g = np.transpose(X).dot(X.dot(theta) – y)

# step down the gradient

theta = theta – alpha*g return theta, f

References

Related documents

Thus levels completing forces player to learn new course chapters and solve level assignments, so the game scenario relates to the learning course (Figure 1).. For example,

Minnesota Artist Exhibition Program, Artist Panel, Minneapolis Institute of Arts, 2006 - 2008 WARM, Board of Directors, Women’s Art Registry of Minnesota, 2006 -

In each wing root, displacement of the trim screwjack drives the all speed aileron servo control input linkage through the artificial feel unit, whose spring rod remains at

The independent variables (reaction time and the weight ratio of catalyst/oil) were optimized to obtain the optimum biodiesel (fatty acid methyl ester) yield.. The results of

The effects of cooperative learning methods on students’ academic achievements in social psychology lessons.. Cooperative learning: teori, riset, dan

Various domestic economic factors including money supply, industrial activities, inflation, Islamic interbank rate and international issues (e.g., real effective exchange rate

Problems occur when we receive a load calculation based on Connected Load–the power used if every device, circuit, piece of equipment and system were operating at the same time

 Applicants to the Equivalency Process who were candidates in the former ODQ Equivalency Process, former ACFD Eligibility Examination candidates and NDEB candidates must pay the