Least Squares Optimization and Gradient Descent Algorithm

(1)


(2)

Example

■ Single Variable Linear Regression

Estimate: $\hat{y}_i = \theta_0 + \theta_1 x_i$

x    12   20   25   35   40   50
y    20   40   10   20   60   65

(3)

ESTIMATING PARAMETERS: 

(4)

SCATTER PLOT

Plot all (X_i, Y_i) pairs, and plot your learned model

[Figure: scatter plot of the (X, Y) pairs; both axes run from 0 to 60.]

(5)

QUESTION

How would you draw a line through the points? How do you determine which line “fits the best”?

[Figure: scatter plot of Y against X; both axes run from 0 to 60. [WF]]

(6)

QUESTION

How would you draw a line through the points?

How do you determine which line “fits the best”?

[Figure: scatter plot of Y against X (both axes 0 to 60) with a candidate line: slope changed, intercept unchanged.]

(7)

QUESTION

[Figure: scatter plot of Y against X (both axes 0 to 60) with a candidate line: slope unchanged, intercept changed. [WF]]

(8)

QUESTION

[Figure: scatter plot of Y against X (both axes 0 to 60) with a candidate line: slope changed, intercept changed.]

(9)

LEAST SQUARES

Best fit: the difference between the true (observed) Y-values and the estimated Y-values is minimized:

• Positive errors offset negative errors …
• … so square the error!

Least squares minimizes the sum of the squared errors [WF]:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2$$
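As a rough illustration of the criterion above, the sketch below scores two candidate lines by their sum of squared errors. The function name `sum_squared_errors` and the toy data are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def sum_squared_errors(xs, ys, theta0, theta1):
    """Sum of squared errors of the line y_hat = theta0 + theta1 * x."""
    return sum((y - (theta0 + theta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical toy data, not taken from the slides.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]

# Two candidate lines; the one with the smaller value fits better.
sse_a = sum_squared_errors(xs, ys, theta0=0.0, theta1=2.0)  # errors 0, 0, 1
sse_b = sum_squared_errors(xs, ys, theta0=1.0, theta1=1.0)  # errors 0, 1, 3
print(sse_a, sse_b)
```

Under this criterion the first line is preferred, since its squared errors add up to less.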

(10)

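LEAST SQUARES, GRAPHICALLY

[Figure: data points in the X-Y plane with the fitted line $\hat{y}_i = \theta_0 + \theta_1 x_i$. The vertical distances between the points and the line are the errors $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$; for example, $y_2 = \theta_0 + \theta_1 x_2 + \epsilon_2$.]

LS minimizes

$$\sum_{i=1}^{n} \epsilon_i^2 = \epsilon_1^2 + \epsilon_2^2 + \dots + \epsilon_n^2$$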

(11)

Example

■ Single Variable Linear Regression

Estimate: $\hat{y}_i = \theta_0 + \theta_1 x_i$

x: Area (sq. ft.)        1600   1400   2100   …   2400
y: Price (in $1000s)     220    180    350    …   500

(12)

Multivariate Regression

■ Multiple Linear Regression

y: Price (in $1000s)   x₁: Area (sq. ft.)   x₂: # Bathrooms   x₃: # Bedrooms
220                    1600                 2.5               3
180                    1400                 1.5               3
350                    2100                 3.5               4
…                      …                    …                 …
500                    2400                 4                 5

$$\hat{y}_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_m x_{im}$$

(13)

Multivariate Regression

■ Multiple Linear Regression

Same housing table (price, area, # bathrooms, # bedrooms). For a single observation i, here the second row:

y_i = 180, x_i = (x_{i1}, x_{i2}, x_{i3}) = (1400, 1.5, 3)

$$\hat{y}_i = \theta_0 + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_m x_{im}$$

(14)

Multivariate Regression

■ Multiple Linear Regression

Add a constant feature x₀ to the housing table, equal to 1 in every row (x₀ column: 1, 1, 1, …, 1). The columns are then y, x₀, x₁, x₂, x₃, and the second row becomes

y_i = 180, x_i = (x_{i0}, x_{i1}, x_{i2}, x_{i3}) = (1, 1400, 1.5, 3)

$$\hat{y}_i = \theta_0 x_{i0} + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_m x_{im}$$

(15)

Multivariate Regression Model

■ Model:

feature 1 = x₀ … (constant, 1)
feature 2 = x₁ … (area, sq. ft.)
feature 3 = x₂ … (# of bathrooms)
feature 4 = x₃ … (# of bedrooms)
…
feature m+1 = x_m

$$\hat{y}_i = \theta_0 x_{i0} + \theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_m x_{im}$$

$$\hat{y}_i = \sum_{j=0}^{m} \theta_j x_{ij}$$

(16)

One Observation Model

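■ Matrix Notation

For observation i:

$$\hat{y}_i = \sum_{j=0}^{m} \theta_j x_{ij} = \begin{bmatrix} x_{i0} & x_{i1} & x_{i2} & \cdots & x_{im} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_m \end{bmatrix} = x_i^T \theta$$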

(17)

All Observation Model

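■ Matrix Notation

For all observations:

$$\begin{bmatrix} \hat{y}_1 \\ \hat{y}_2 \\ \hat{y}_3 \\ \vdots \\ \hat{y}_n \end{bmatrix} = \begin{bmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1m} \\ x_{20} & x_{21} & x_{22} & \cdots & x_{2m} \\ x_{30} & x_{31} & x_{32} & \cdots & x_{3m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n0} & x_{n1} & x_{n2} & \cdots & x_{nm} \end{bmatrix} \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_m \end{bmatrix}$$

$$\hat{Y} = X\theta$$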
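The matrix form can be sketched in NumPy as follows. The feature rows are the four rows visible in the housing table; the parameter vector `theta` is purely hypothetical (the slides do not give fitted values) and is chosen only to make the arithmetic visible.

```python
import numpy as np

# Visible rows of the housing table: area (sq. ft.), # bathrooms, # bedrooms.
features = np.array([
    [1600.0, 2.5, 3.0],
    [1400.0, 1.5, 3.0],
    [2100.0, 3.5, 4.0],
    [2400.0, 4.0, 5.0],
])

# Prepend the constant feature x_0 = 1 so theta_0 acts as the intercept.
X = np.column_stack([np.ones(len(features)), features])

# Hypothetical parameters (assumption, not from the slides).
theta = np.array([10.0, 0.1, 20.0, 5.0])

y_hat_all = X @ theta         # all observations at once: Y_hat = X theta
y_hat_second = X[1] @ theta   # one observation: y_hat_i = x_i^T theta
print(y_hat_all, y_hat_second)
```

The single-observation product and the corresponding entry of the full product agree, which is the point of stacking the observations as rows of X.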

(18)

LEAST SQUARES OPTIMIZATION

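Rewrite inputs:

$$X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n)})^T \end{bmatrix} \in \mathbb{R}^{n \times m}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^{n}$$

Each row is a feature vector paired with a label for a single input: n labeled inputs, m features.

*Recall $\|z\|_2^2 = z^T z = \sum_i z_i^2$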

(19)

LEAST SQUARES OPTIMIZATION

Rewrite optimization problem, with the inputs stacked into $X \in \mathbb{R}^{n \times m}$ and $y \in \mathbb{R}^{n}$ as before:

*Recall $\|z\|_2^2 = z^T z = \sum_i z_i^2$

$$\Longrightarrow \; \underset{\theta}{\text{minimize}} \; \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2 = \|y - X\theta\|_2^2$$

The last equality uses $\hat{Y} = X\theta$ together with the recalled identity.

(20)

ERROR FUNCTION

[Figure: the error $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \epsilon_i^2$ plotted against the parameter θ.]

(21)

GRADIENTS

Minimizing a multivariate function involves finding a point where the gradient is zero: $\nabla_\theta f(\theta) = 0$

Points where the gradient is zero are candidates for local minima (in general they can also be maxima or saddle points).

• If the function is convex, such a point is also a global minimum

Let’s solve the least squares problem!

We’ll use the multivariate generalizations of some concepts from MATH141/142 …

• Chain rule
• Gradient of the squared ℓ₂ norm: $\nabla_z \|z\|_2^2 = 2z$

(22)

LEAST SQUARES

Recall the least squares optimization problem: minimize the sum of squared errors over θ.

What is the gradient of the optimization objective?

Apply the chain rule.
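One way to carry out this step, sketched under the assumption that the objective is written as $f(\theta) = \tfrac{1}{2}\|X\theta - y\|_2^2$. The factor ½ is a convenience that matches the loss tracked in the Python code at the end of these slides; it does not change the minimizer.

```latex
\begin{align*}
f(\theta) &= \tfrac{1}{2}\,\lVert X\theta - y \rVert_2^2
           = \tfrac{1}{2}\,(X\theta - y)^T (X\theta - y) \\
\nabla_z \tfrac{1}{2}\lVert z \rVert_2^2 &= z
  && \text{(gradient of the squared $\ell_2$ norm)} \\
\nabla_\theta f(\theta) &= X^T (X\theta - y)
  && \text{(chain rule with $z = X\theta - y$)}
\end{align*}
```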

(23)

LEAST SQUARES

Recall: for a convex objective such as least squares, points where the gradient equals zero are minima.

So where do we go from here?

Solve for model parameters θ
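A minimal sketch of that solve, assuming the gradient of the least squares objective is $X^T(X\theta - y)$ and that $X^T X$ is invertible. Setting the gradient to zero then gives the linear system $X^T X\,\theta = X^T y$. The helper name `least_squares_fit` and the data are hypothetical.

```python
import numpy as np

def least_squares_fit(X, y):
    """Solve X^T X theta = X^T y for theta (assumes X^T X is invertible)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data generated exactly by y = 1 + 2x, so the fit should recover (1, 2).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # constant feature x_0 = 1, then x
y = 1.0 + 2.0 * x

theta = least_squares_fit(X, y)
gradient = X.T @ (X @ theta - y)           # should vanish at the solution
print(theta, gradient)
```

Solving the linear system directly is preferred here over forming an explicit matrix inverse, which is a common numerical design choice rather than anything the slides prescribe.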

(24)

LINEAR REGRESSION AS

OPTIMIZATION PROBLEM

Let’s consider linear regression that minimizes the sum of

squared error, i.e., least squares …

1. Hypothesis function?
   • Linear hypothesis function: $h_\theta(x) = \theta^T x$

2. Loss function?
   • Squared error loss: $(h_\theta(x) - y)^2$

3. Optimization problem?

$$\min_{\theta} \sum_{i=1}^{n} \left(\theta^T x^{(i)} - y^{(i)}\right)^2$$

(25)

GRADIENT DESCENT

We used the gradient as a condition for optimality.

It also gives the local direction of steepest increase for a function.

Intuitive idea: take small steps against the gradient.

If there is no increase in any direction, the gradient is zero: for a convex function, that is a minimum!

[Figure: image from Zico Kolter]

(26)

GRADIENT DESCENT

Algorithm for any* hypothesis function h, loss function, and step size α:

Initialize the parameter vector θ

• Repeat until satisfied (e.g., exact or approximate convergence):
  • Compute gradient: g ← $\sum_{i=1}^{n}$ (gradient of the loss on example i with respect to θ)
  • Update parameters: θ ← θ − αg

*must be reasonably well behaved

(27)

GRADIENT DESCENT

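[Figure: a loss curve f(θ) plotted against θ, with one region marked $\frac{\partial f(\theta)}{\partial \theta} > 0$ and another marked $\frac{\partial f(\theta)}{\partial \theta} < 0$.]

$$\theta := \theta - \alpha \frac{\partial f(\theta)}{\partial \theta}$$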

(28)

EXAMPLE

Function: $f(x, y) = x^2 + 2y^2$

Gradient: $\nabla f(x, y) = (2x, 4y)$

Let’s take a gradient step from (-2, +1), where the gradient is (-4, 4):

Step in the direction (4, -4), scaled by the step size (for step size 0.01, the step is (0.04, -0.04))

Repeat until no movement
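The example can be traced in a few lines of Python. This is a sketch: the step size 0.01 and the stopping tolerance are assumptions made for illustration, and the helper names are hypothetical.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x**2 + 2*y**2."""
    return 2.0 * x, 4.0 * y

def gradient_step(point, alpha):
    """Move one step of size alpha against the gradient."""
    gx, gy = grad_f(*point)
    return point[0] - alpha * gx, point[1] - alpha * gy

alpha = 0.01                 # assumed step size
start = (-2.0, 1.0)
first = gradient_step(start, alpha)

# Repeat until the point (almost) stops moving.
current = start
for _ in range(10_000):
    nxt = gradient_step(current, alpha)
    moved = abs(nxt[0] - current[0]) + abs(nxt[1] - current[1])
    current = nxt
    if moved < 1e-12:        # assumed tolerance for "no movement"
        break
print(first, current)
```

The first step lands near (-1.96, 0.96), and the iterates approach the minimizer at the origin.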

(29)

GRADIENT DESCENT

(30)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1

ŷ = θ₀ + θ₁x

(31)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y
0.2    0.44
0.31   0.123
0.45   0.75
0.75   0.39

(32)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y       ŷ = θ₀ + θ₁x
0.2    0.44    0.12
0.31   0.123   0.131
0.45   0.75    0.145
0.75   0.39    0.175

(33)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y       ŷ       SSE: ½(ŷ − y)²
0.2    0.44    0.12    0.0512
0.31   0.123   0.131   0.000032
0.45   0.75    0.145   0.183
0.75   0.39    0.175   0.0231

(34)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y       ŷ       ∂(SSE)/∂θ₀ = ŷ − y
0.2    0.44    0.12    -0.32
0.31   0.123   0.131   0.008
0.45   0.75    0.145   -0.605
0.75   0.39    0.175   -0.215

(35)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      ∂(SSE)/∂θ₀ = ŷ − y   ∂(SSE)/∂θ₁ = (ŷ − y)x
0.2    -0.32                -0.064
0.31   0.008                0.00248
0.45   -0.605               -0.27225
0.75   -0.215               -0.16125

(36)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y       ŷ       SSE: ½(ŷ − y)²   ∂(SSE)/∂θ₀ = ŷ − y   ∂(SSE)/∂θ₁ = (ŷ − y)x
0.2    0.44    0.12    0.0512           -0.32                -0.064
0.31   0.123   0.131   0.000032         0.008                0.00248
0.45   0.75    0.145   0.183            -0.605               -0.27225
0.75   0.39    0.175   0.0231           -0.215               -0.16125
Sum                                     -1.132               -0.495

(37)

GRADIENT DESCENT

With the column sums $\sum \partial(\mathrm{SSE})/\partial\theta_0 = -1.132$ and $\sum \partial(\mathrm{SSE})/\partial\theta_1 = -0.495$, the updates are:

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_1}$$

(38)

GRADIENT DESCENT

Step size: α = 0.01

(39)

GRADIENT DESCENT

θ₀ = 0.1 − 0.01 × (−1.132)

(40)

GRADIENT DESCENT

θ₀ = 0.1 − 0.01 × (−1.132) = 0.11132

(41)

GRADIENT DESCENT

θ₀ = 0.1 − 0.01 × (−1.132) = 0.11132

θ₁ = 0.1 − 0.01 × (−0.495) = 0.10495
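The single update step of the worked example can be checked with a short script. This is only a sketch: the y values used are an assumption, namely the ones implied by the ŷ − y column (-0.32, 0.008, -0.605, -0.215).

```python
# Data and starting point of the worked example.
xs = [0.2, 0.31, 0.45, 0.75]
ys = [0.44, 0.123, 0.75, 0.39]   # assumption: values implied by the (y_hat - y) column
theta0, theta1 = 0.1, 0.1
alpha = 0.01

preds = [theta0 + theta1 * x for x in xs]
residuals = [p - y for p, y in zip(preds, ys)]

grad0 = sum(residuals)                             # sum of d(SSE)/d(theta0)
grad1 = sum(r * x for r, x in zip(residuals, xs))  # sum of d(SSE)/d(theta1)

new_theta0 = theta0 - alpha * grad0
new_theta1 = theta1 - alpha * grad1
print(round(grad0, 3), round(grad1, 3), round(new_theta0, 5), round(new_theta1, 5))
```

Up to rounding, this reproduces the column sums -1.132 and -0.495 and the updated parameters 0.11132 and 0.10495.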

(42)

GRADIENT DESCENT

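Algorithm for any* hypothesis function h, loss function, and step size α:

Initialize the parameter vector θ

• Repeat until satisfied (e.g., exact or approximate convergence):
  • Compute gradient
  • Update parameters

*must be reasonably well behaved

For the single-variable example, the update is:

$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_0}, \qquad \theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_1}$$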

(43)

GRADIENT DESCENT -

MULTIVARIATE

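$$\theta_0 := \theta_0 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_0}$$

$$\theta_1 := \theta_1 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_1}$$

$$\theta_2 := \theta_2 - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_2}$$

…

$$\theta_m := \theta_m - \alpha \sum_{i=1}^{n} \frac{\partial(\mathrm{SSE})}{\partial \theta_m}$$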

(44)

GRADIENT DESCENT -

MULTIVARIATE

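θ = 0

Obtain all partial derivatives w.r.t. the θ's first:

$$\frac{\partial}{\partial \theta_j} f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{ij}$$

Repeat {
    $\theta_j := \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{ij}$
}
(update θ_j for all j = 0…m simultaneously)

Written out per parameter:

$$\theta_0 := \theta_0 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{i0}$$

$$\theta_1 := \theta_1 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{i1}$$

$$\theta_2 := \theta_2 - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{i2}$$

…

$$\theta_m := \theta_m - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{im}$$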

(45)

PLOTTING LOSS OVER TIME

(46)

STOCHASTIC GRADIENT DESCENT

- MULTIVARIATE

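θ = 0

Repeat {
    i = random index between 1 and n
    $\theta_j := \theta_j - \alpha \left(h_\theta(x_i) - y_i\right) x_{ij}$
}
(update θ_j for all j = 0…m)

Compare with (batch) gradient descent, which uses all n examples in every update:

θ = 0

Repeat {
    $\theta_j := \theta_j - \alpha \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{ij}$
}
(update θ_j for all j = 0…m simultaneously)

Here $\frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x_i) - y_i\right) x_{ij} = \frac{\partial}{\partial \theta_j} f(\theta)$; obtain all partial derivatives w.r.t. the θ's first.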
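The stochastic update can be sketched as below for the squared loss with a linear hypothesis $h_\theta(x) = \theta^T x$. The function name `sgd`, the fixed number of steps, the seed, and the noiseless toy data are all assumptions made for illustration.

```python
import numpy as np

def sgd(X, y, alpha, num_steps, seed=0):
    """Stochastic gradient descent: one randomly chosen example per update."""
    rng = np.random.default_rng(seed)
    n, num_params = X.shape           # n examples; columns include the constant feature
    theta = np.zeros(num_params)      # theta = 0
    for _ in range(num_steps):
        i = rng.integers(n)           # random example index (0-based here)
        error = X[i] @ theta - y[i]   # h_theta(x_i) - y_i
        theta = theta - alpha * error * X[i]  # updates every theta_j at once
    return theta

# Hypothetical noiseless data from y = 1 + 2x, so SGD should approach (1, 2).
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = sgd(X, y, alpha=0.1, num_steps=20_000)
print(theta)
```

With noisy data the iterates would keep fluctuating around the minimizer unless the step size is decreased over time; the noiseless data here sidesteps that.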

(47)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

x      y       ŷ       SSE: ½(ŷ − y)²   ∂(SSE)/∂θ₀ = ŷ − y   ∂(SSE)/∂θ₁ = (ŷ − y)x
0.2    0.44    0.12    0.0512           -0.32                -0.064
0.31   0.123   0.131   0.000032         0.008                0.00248
0.45   0.75    0.145   0.183            -0.605               -0.27225
0.75   0.39    0.175   0.0231           -0.215               -0.16125
Sum                                     -1.132               -0.495

(48)

GRADIENT DESCENT

θ₀ = 0.1, θ₁ = 0.1, ŷ = θ₀ + θ₁x

A single row of the table:

x      y       ŷ       SSE: ½(ŷ − y)²   ∂(SSE)/∂θ₀ = ŷ − y   ∂(SSE)/∂θ₁ = (ŷ − y)x
0.31   0.123   0.131   0.000032         0.008                0.00248

…

(49)

STOCHASTIC

GRADIENT DESCENT

(50)

STOCHASTIC GRADIENT DESCENT

- MINI BATCH

θ = 0

Repeat {
    i₁, …, i_l = random indices between 1 and n
    $\theta_j := \theta_j - \alpha \frac{1}{l} \sum_{k=1}^{l} \left(h_\theta(x_{i_k}) - y_{i_k}\right) x_{i_k j}$
}
(update θ_j for all j = 0…m)
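A mini-batch variant might look like the sketch below, again for squared loss with a linear hypothesis. Sampling the indices with replacement, the batch size of 5, and the toy data are assumptions made for illustration; `minibatch_sgd` is a hypothetical name.

```python
import numpy as np

def minibatch_sgd(X, y, alpha, batch_size, num_steps, seed=0):
    """Mini-batch SGD: average the gradient over a few random examples."""
    rng = np.random.default_rng(seed)
    n, num_params = X.shape
    theta = np.zeros(num_params)                     # theta = 0
    for _ in range(num_steps):
        batch = rng.integers(n, size=batch_size)     # indices i_1..i_l, with replacement
        errors = X[batch] @ theta - y[batch]         # h_theta(x_i) - y_i on the batch
        gradient = X[batch].T @ errors / batch_size  # average over the l examples
        theta = theta - alpha * gradient
    return theta

# Hypothetical noiseless data from y = 1 + 2x.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = minibatch_sgd(X, y, alpha=0.1, batch_size=5, num_steps=20_000)
print(theta)
```

A batch size of 1 gives the single-example stochastic update, while a batch containing every example gives the full averaged gradient.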

(51)

(52)

GRADIENT DESCENT IN

PURE(-ISH) PYTHON

Implicitly using squared loss and linear hypothesis function below; drop in your favorite gradient for kicks!

import numpy as np

# Training data (X, y), T time steps, alpha step size
def grad_descent(X, y, T, alpha):
    m, n = X.shape          # m = #examples, n = #features
    theta = np.zeros(n)     # initialize parameters
    f = np.zeros(T)         # track loss over time
    for i in range(T):
        # loss for current parameter vector theta
        f[i] = 0.5 * np.linalg.norm(X.dot(theta) - y) ** 2
        # gradient of the loss at theta (direction of steepest ascent)
        g = np.transpose(X).dot(X.dot(theta) - y)
        # step down the gradient
        theta = theta - alpha * g
    return theta, f