A Distributed Method for Fitting Laplacian Regularized Stratiﬁed Models

(1)

A Distributed Method for Fitting Laplacian Regularized Stratified Models

Jonathan Tuck Shane Barratt Stephen Boyd

International Conference on Statistical Optimization and Learning

Beijing Jiatong University, June 19 2019

(2)

Outline

Stratified models

Data models

Regularization graphs Solution method Examples Conclusions

Stratified models 2

(3)

Data records

I Data records have form (z, x , y ) ∈ Z × X × Y

I z ∈ Z = {1, . . . , K } are categorical features (we’ll stratify over)

I x ∈ X are the other features

I y ∈ Y is the label/dependent variable

Stratified models 3

(4)

Stratified model

I Fit a model to data (z, x , y )

I Stratified model: fit different model for each value of z

I θ k is model parameter for z = k

I θ = (θ 1 , . . . , θ K ) ∈ Θ ⊆ R ^Kn parameterizes the stratified model

I Old idea [Kernan 99]

I Example: stratified regression model ˆ y = x ^T θ _z

Stratified models 4

(5)

Example

0.0 0.2 0.4 0.6 0.8 1.0

10 5 0 5

10 z=1

z=-1

ˆ

y = θ 1 + θ 2 x + θ 3 z, z ∈ {−1, 1}

Stratified models 5

(6)

Example

0.0 0.2 0.4 0.6 0.8 1.0

10 5 0 5 10

z=1 z=-1

ˆ y =

( θ −1,1 + θ −1,2 x z = −1 θ _1,1 + θ _1,2 x z = 1

Stratified models 6

(7)

Stratified model

I Stratified model is simple function of x (e.g., linear), arbitrary function of z

I Examples: separate models for – z ∈ {Male, Female}

– z ∈ {Monday, . . . , Sunday}

I If K is large, might not have enough training data to fit θ _k

I Extreme case: no training data for some values of z

I We’ll add regularization so nearby θ _k ’s are close

Stratified models 7

(8)

Stratified model with Laplacian regularization

I Choose θ = (θ 1 , . . . , θ K ) to minimize

N

X

i =1

l (θ _z

_i

, x _i , y _i ) +

K

X

k=1

r (θ _k ) +

K

X

u,v =1

W _uv kθ _u − θ _v k ²

I l is loss function, r is (local) regularization

I Last term is Laplacian regularization

I W _uv ≥ 0 are edge weights of graph with node set Z

I Convex problem when l , r convex in θ

Stratified models 8

(9)

Stratified model with Laplacian regularization

I Graph encodes prior that nearby values of z should have similar models

I Can be used to capture periodicities, other structure

I Examples:

– θ _male and θ _female should be close – θ _jan should be close to θ _feb and θ _dec

I Model for each value of z ‘borrows strength’ from its neighbors

I Works even when there’s no data for some values of z

I As W uv → 0, get traditional (unregularized) stratified model

I As W uv → ∞, get common model (θ ₁ = · · · = θ K )

Stratified models 9

(10)

Related work

I Network lasso [Hallac 15] )

I Pliable lasso [Tibshirani 17]

I Varying-coefficient models [Hastie 93, Fan 08]

I Multi-task learning [Caruana 97]

Stratified models 10

(11)

Outline

Stratified models Data models

Regularization graphs Solution method Examples Conclusions

Data models 11

(12)

Point estimate: Predict y

I Regression: X = R ⁿ , Y = R

– l (θ, x , y ) = p(x ^T θ − y ), p is penalty function – y = x ˆ ^T θ z

I Classification: X = R ⁿ , Y = {−1, 1}

– l (θ, x , y ) = p(yx ^T θ) – y = sign(x ˆ ^T θ z )

I M-class classification: X ∈ R ⁿ , Y = {1, . . . , M}

– l (θ, x , y ) = p y (x ^T θ), θ ∈ R ^n×M , p y is penalty function for class y – y = argmax(x ˆ ^T θ z )

Data models 12

(13)

Conditional distribution estimate: Predict prob(y | x, z)

I Multinomial logistic regression: X = R ⁿ , Y = {1, . . . , M}

I Conditional distribution:

prob(y | x , z) = exp(x ^T θ _z ) _y P M

j =1 exp(x ^T θ z ) j

, y = 1, . . . , M

I Loss function (convex in θ):

l (θ, x , y ) = log





M

X

j =1

exp(x ^T θ) j



 − (x ^T θ) y

Data models 13

(14)

Distribution estimate: Predict p(y | z)

I Gaussian distribution: Y = R ^m

I Density:

p(y | z) = (2π) ^−m/2 det(Σ) ^−1/2 exp(−1/2(y − µ) ^T Σ ⁻¹ (y − µ))

I Use natural parameter θ = (S , ν) = (Σ ⁻¹ , Σ ⁻¹ µ) (so Σ = S ⁻¹ , µ = S ⁻¹ ν)

I Loss function (convex in θ):

l (θ, y ) = − log det S + y ^T Sy − 2y ^T ν + ν ^T S ⁻¹ ν

Data models 14

(15)

Outline

Stratified models Data models

Regularization graphs

Solution method Examples Conclusions

Regularization graphs 15

(16)

Path graph

0 1 2 3 4 5

I Models time, distance, . . .

I Yields time-varying, distance-varying models

Regularization graphs 16

(17)

Cycle graph

M W Tu

Th

F

Sa Su

I Yields diurnal, weekly, seasonally-varying models

Regularization graphs 17

(18)

Tree graph

CEO

CTO

SVP1

COO

SVP2

VP1 VP2

SVP3

VP3

I Yields hierarchical models

Regularization graphs 18

(19)

Grid graph

0,0 0,1 0,2 0,3

1,0 1,1 1,2 1,3

2,0 2,1 2,2 2,3

3,0 3,1 3,2 3,3

I Yields (2D) space-varying models

Regularization graphs 19

(20)

Products of graphs

M,1

F,1

M,2

F,2

· · ·

M,99

F,99

M,100

F,100

I Z = {Male, Female} × {1, 2, . . . , 99, 100} (sex × age)

I Yields sex, age-varying models

Regularization graphs 20

(21)

Outline

Stratified models Data models

Regularization graphs Solution method

Examples Conclusions

Solution method 21

(22)

Fitting problem

I To fit stratified model, minimize `(θ) + r (θ) + L(θ)

I `(θ) = P K

k=1 ` _k (θ _k ) is loss, ` _k (θ _k ) = P

i :z

i

=k l (θ _k , x _i , y _i )

I r (θ) = P K

k=1 r (θ k ) is (local) regularization

I L(θ) = P K

u,v =1 W _uv kθ _u − θ _v k ² is Laplacian regularization

I `, r are separable in θ _k

I L is quadratic, separable in components of θ _k

I We’ll use operator splitting method (ADMM)

Solution method 22

(23)

Reformulation

I Replicate variables:

minimize `(θ) + r (˜ θ) + L(ˆ θ) subject to θ = ˜ θ = ˆ θ

I Augmented Lagrangian

L(θ, ˜ θ, ˆ θ, u, ˜ u) = `(θ) + r (˜ θ) + L(ˆ θ) + 1

2t kθ − ˆ θ + uk ² ₂ + 1

2t k˜ θ − ˆ θ + ˜ uk ² ₂

I u, ˜ u dual variables for the two constraints, t > 0 is algorithm parameter

Solution method 23

(24)

ADMM

I Augmented Lagrangian

L(θ, ˜ θ, ˆ θ, u, ˜ u) = `(θ) + r (˜ θ) + L(ˆ θ) + 1

2t kθ − ˆ θ + uk ² ₂ + 1

2t k˜ θ − ˆ θ + ˜ uk ² ₂

I ADMM: for i = 1, 2, . . .

θ ^{i +1} , ˜ θ ^{i +1} := argmin

θ,˜ θ

L(θ, ˜ θ, ˆ θ ⁱ , u ⁱ , ˜ u ⁱ ) θ ˆ ^{i +1} := argmin

θ ˆ

L(θ ⁱ , ˜ θ ⁱ , ˆ θ, u ⁱ , ˜ u ⁱ ) u ^{i +1} := u ⁱ + θ ^{i +1} − ˆ θ ^{i +1}

˜

u ^{i +1} := u ˜ ⁱ + ˜ θ ^{i +1} − ˆ θ ^{i +1}

Solution method 24

(25)

ADMM

I First step can be expressed as

θ _k ^{i +1} = prox _t`

_k

(ˆ θ ⁱ _k − u _k ⁱ ), θ ˜ ^{i +1} _k = prox _tr (ˆ θ ⁱ _k − ˜ u _k ⁱ ), k = 1, . . . , K

I proximal operator of tg is

prox _tg (ν) = argmin

θ

tg (θ) + (1/2)kθ − νk ² ₂

I Can evaluate these 2K proximal operators in parallel

Solution method 25

(26)

Loss proximal operators

I Often has closed form expression or efficient implementation

I Square loss: ` _k (θ _k ) = kX _k θ _k − y _k k ² ₂ – prox _t`

k

(ν) = (I + tX _k ^T X _k ) ⁻¹ (ν − tX _k ^T y _k ) – Cache factorization or warm-start CG

I Logistic loss: ` k (θ k ) = P

i log(1 + exp(−y ki θ _k ^T x ki )) – Use L-BFGS to evaluate prox _t`

k

(ν) – Warm-start

I Many others . . .

Solution method 26

(27)

Regularizer proximal operators

I Often has closed form expression or efficient implementation Regularizer r (θ _k ) prox _tr (ν) Sum of squares (` ₂ ) (λ/2)kθ _k k ² ₂ ν/(tλ + 1)

` 1 norm λkθ _k k ₁ (ν − tλ) + − (−ν − tλ) ₊ Nonnegative I ₊ (θ _k ) (θ _k ) ₊

I λ > 0 local regularization parameter

Solution method 27

(28)

Laplacian proximal operator

I Separable across each component of θ _k

I To find each component (θ _k ) i , need to solve a Laplacian system

I Many efficient ways to solve, e.g., diagonally preconditioned CG

I These n systems can be solved in parallel

Solution method 28

(29)

Software implementation

I Available at www.github.com/cvxgrp/strat_models

I numpy, scipy for matrix operations

I networkx for handling graphs and graph operations

I torch for L-BFGS and GPU computation

I multiprocessing for parallelism

I model.fit(X,Y,Z,G) (writes θ _k on graph nodes)

Solution method 29

(30)

Outline

Stratified models Data models

Regularization graphs Solution method Examples Conclusions

Examples 30

(31)

House price prediction

I N ≈ 22000 (z, x , y ) from King County, WA

I Split into train/test ≈ 16200/5400

I z = (latitude bin, longitude bin)

I x ∈ R ¹⁰ = features of house, y = log of house price

I Graph is 50 × 50 grid with all edge weights same; K = 2500

I Stratified ridge regression model with two hyperparameters (one for local regularization, one for Laplacian regularization)

Examples 31

(32)

House price prediction: Results

I Compare stratifed, common, and random forest with 50 trees

Model Parameters RMS test error

Stratified 25000 0.181

Common 10 0.316

Random forest 985888 0.184

Examples 32

(33)

House price prediction: Parameters

bedrooms bathrooms sqft living sqft lot floors

waterfront condition grade yr built intercept

Examples 33

(34)

Chicago crime prediction

I N ≈ 7 000 000 (z, y ) pairs for 2017, 2018

I Train on 2017, test on 2018

I y = number of crimes

I z = (location bin, week of year, day of week, hour of day); K ≈ 3 500 000

I Graph is Cartesian product of grid, three cycles; four graph edge weights

I Stratified Poisson model with four hyperparameters (one for each graph edge weight)

Examples 34

(35)

Chicago crime prediction

I Compare three models, average negative log likelihood on test data

Model Train Test Separate 0.068 0.740 Stratified 0.221 0.234 Common 0.279 0.278

Examples 35

(36)

Chicago crime prediction Crime rate as a function of latitude/longitude

-87.85 -87.77 -87.69 -87.61 -87.52 42.02

41.93

41.83

41.74

41.64 0.02 0.04 0.06 0.08 0.10 0.12

Examples 36

(37)

Chicago crime prediction Crime rate as a function of week of year

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 0.036

0.038 0.040 0.042 0.044

Examples 37

(38)

Chicago crime prediction Crime rate as a function of hour of week

M Tu W Th F Sa Su

0.020 0.025 0.030 0.035 0.040 0.045 0.050 0.055

Examples 38

(39)

Outline

Stratified models Data models

Regularization graphs Solution method Examples Conclusions

Conclusions 39

(40)

Conclusions

I Stratified models combine

– simple dependence on some features (x ) – complex dependence on others (z)

I Often interpretable

I Laplacian regularization encodes prior on values of z, so models can borrow strength from their neighbors

I Effective method to build time-varying, space-varying, seasonally-varying models

I Efficient, distributed ADMM-based implementation for large-scale data

Conclusions 40

A Distributed Method for Fitting Laplacian Regularized Stratiﬁed Models

A Distributed Method for Fitting Laplacian Regularized Stratified Models

Jonathan Tuck Shane Barratt Stephen Boyd

International Conference on Statistical Optimization and Learning

Beijing Jiatong University, June 19 2019

Outline

Stratified models

Data models

Regularization graphs Solution method Examples Conclusions

Stratified models 2

Data records

I Data records have form (z, x , y ) ∈ Z × X × Y

I z ∈ Z = {1, . . . , K } are categorical features (we’ll stratify over)

I x ∈ X are the other features

I y ∈ Y is the label/dependent variable

Stratified models 3

Stratified model

I Fit a model to data (z, x , y )

I Stratified model: fit different model for each value of z

I θ k is model parameter for z = k

I θ = (θ 1 , . . . , θ K ) ∈ Θ ⊆ R Kn parameterizes the stratified model

I Old idea [Kernan 99]

I Example: stratified regression model ˆ y = x T θ z

Stratified models 4

Example

0.0 0.2 0.4 0.6 0.8 1.0

10 5 0 5

10 z=1

z=-1

ˆ

y = θ 1 + θ 2 x + θ 3 z, z ∈ {−1, 1}

Stratified models 5

Example

0.0 0.2 0.4 0.6 0.8 1.0

10 5 0 5 10

z=1 z=-1

ˆ y =

( θ −1,1 + θ −1,2 x z = −1 θ 1,1 + θ 1,2 x z = 1

Stratified models 6

Stratified model

I Stratified model is simple function of x (e.g., linear), arbitrary function of z

I Examples: separate models for – z ∈ {Male, Female}

– z ∈ {Monday, . . . , Sunday}

I If K is large, might not have enough training data to fit θ k

I Extreme case: no training data for some values of z

I We’ll add regularization so nearby θ k ’s are close

Stratified models 7

Stratified model with Laplacian regularization

I Choose θ = (θ 1 , . . . , θ K ) to minimize

N

X

i =1

l (θ z

, x i , y i ) +

K

X

k=1

r (θ k ) +

K

X

u,v =1

W uv kθ u − θ v k 2

I l is loss function, r is (local) regularization

I Last term is Laplacian regularization

I W uv ≥ 0 are edge weights of graph with node set Z

I Convex problem when l , r convex in θ

Stratified models 8

Stratified model with Laplacian regularization

I Graph encodes prior that nearby values of z should have similar models

I Can be used to capture periodicities, other structure

I Examples:

– θ male and θ female should be close – θ jan should be close to θ feb and θ dec

I Model for each value of z ‘borrows strength’ from its neighbors

I Works even when there’s no data for some values of z

I As W uv → 0, get traditional (unregularized) stratified model

I As W uv → ∞, get common model (θ 1 = · · · = θ K )

Stratified models 9

Related work

I Network lasso [Hallac 15] )

I Pliable lasso [Tibshirani 17]

I θ = (θ 1 , . . . , θ K ) ∈ Θ ⊆ R ^Kn parameterizes the stratified model

I Example: stratified regression model ˆ y = x ^T θ _z

( θ −1,1 + θ −1,2 x z = −1 θ _1,1 + θ _1,2 x z = 1

I If K is large, might not have enough training data to fit θ _k

I We’ll add regularization so nearby θ _k ’s are close

l (θ _z

, x _i , y _i ) +

r (θ _k ) +

W _uv kθ _u − θ _v k ²

I W _uv ≥ 0 are edge weights of graph with node set Z

– θ _male and θ _female should be close – θ _jan should be close to θ _feb and θ _dec

I As W uv → ∞, get common model (θ ₁ = · · · = θ K )

I Regression: X = R ⁿ , Y = R

– l (θ, x , y ) = p(x ^T θ − y ), p is penalty function – y = x ˆ ^T θ z

I Classification: X = R ⁿ , Y = {−1, 1}

– l (θ, x , y ) = p(yx ^T θ) – y = sign(x ˆ ^T θ z )

I M-class classification: X ∈ R ⁿ , Y = {1, . . . , M}

– l (θ, x , y ) = p y (x ^T θ), θ ∈ R ^n×M , p y is penalty function for class y – y = argmax(x ˆ ^T θ z )

I Multinomial logistic regression: X = R ⁿ , Y = {1, . . . , M}

prob(y | x , z) = exp(x ^T θ _z ) _y P M

j =1 exp(x ^T θ z ) j

exp(x ^T θ) j

 − (x ^T θ) y

I Gaussian distribution: Y = R ^m

p(y | z) = (2π) ^−m/2 det(Σ) ^−1/2 exp(−1/2(y − µ) ^T Σ ⁻¹ (y − µ))

I Use natural parameter θ = (S , ν) = (Σ ⁻¹ , Σ ⁻¹ µ) (so Σ = S ⁻¹ , µ = S ⁻¹ ν)

l (θ, y ) = − log det S + y ^T Sy − 2y ^T ν + ν ^T S ⁻¹ ν