Variable Selection for Global Fréchet Regression

(1)

Danielle C. Tucker¹, Yichao Wu¹, Hans-Georg Müller²

¹University of Illinois at Chicago

²UC Davis

September 2021

(2)

Research Focus: Complex Data Types

(3)

Motivating Example

(5)

Motivation: Bike Rental Data

Let’s look at a data set collected by Capital Bikeshare in Washington D.C.

This data set spans the years 2011 and 2012 for a total of 731 days. For each day, there are 24 observations of bike rental counts and predictors which include

RBW: Indicator of really bad weather (snowy and/or rainy)

Holiday: Indicator of a public holiday celebrated in Washington D.C.

2012: Indicator of the year 2012

Humid: Continuous variable of standardized humidity

Temp: Continuous variable of standardized temperature

Response Y: The distribution of rentals for each day, represented by its 24 observed quantiles.
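One plausible reading of this response can be sketched as follows, assuming the 24 hourly counts of a day are simply sorted so that they act as empirical quantiles on an equally spaced grid. The function name and the toy counts are hypothetical and are not taken from the data set.

```python
import numpy as np

def daily_quantile_response(hourly_counts):
    """Return the 24 order statistics of one day's hourly rental counts.

    Assumption: the sorted counts serve as the empirical quantile
    function of that day's rental distribution.
    """
    counts = np.asarray(hourly_counts, dtype=float)
    if counts.shape != (24,):
        raise ValueError("expected exactly 24 hourly counts")
    return np.sort(counts)

# Hypothetical counts for a single day (simulated, not real data).
rng = np.random.default_rng(0)
example_day = rng.poisson(lam=100, size=24)
q = daily_quantile_response(example_day)
```

Each day then contributes one nondecreasing vector of length 24, which is the object that the regression treats as a single response.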

(6)

Bike Rental Data Snippet

Date        Count  RBW  Holiday  2012  Humidity  Temperature
2011-01-01     16    0        0     0      0.81         0.24
2011-01-01     40    0        0     0      0.80         0.22
2011-01-01     32    0        0     0      0.80         0.22
2011-01-01     13    0        0     0      0.75         0.24
2011-01-01      1    0        0     0      0.75         0.24
...
2011-01-01     35    1        0     0      0.88         0.42
2011-01-01     37    1        0     0      0.88         0.42
2011-01-01     36    0        0     0      0.87         0.40
2011-01-01     34    0        0     0      0.87         0.40
2011-01-01     28    0        0     0      0.94         0.40

(7)

Example Bike Distributions

(8)

How to Extend Existing Methods?

Classical Linear Regression

\[ Y = \beta_0 + X^T\beta + \varepsilon \]

with a p-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors $\varepsilon$ with mean zero and finite variance. We collect a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:

1 Estimate $\beta_0$ and $\beta$

2 Predict a future observation at any x in the domain of interest $\mathcal{X}$

Here, what kind of Y are we working with? $Y \in \mathbb{R}$: Euclidean space.

(9)

From Euclidean Space to General Metric Space

Definition

A metric space is an ordered pair $(\mathcal{Y}, d)$ where $\mathcal{Y}$ is a set and d is a metric on $\mathcal{Y}$, i.e. a function $d : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ such that for any $y_1, y_2, y_3 \in \mathcal{Y}$ the following hold:

1 $d(y_1, y_2) = 0 \iff y_1 = y_2$ (identity of indiscernibles)

2 $d(y_1, y_2) = d(y_2, y_1)$ (symmetry)

3 $d(y_1, y_3) \le d(y_1, y_2) + d(y_2, y_3)$ (triangle inequality)

(10)

Examples

Probability distributions + Wasserstein metric

Symmetric positive definite M × M matrices + Cholesky Decomposition metric

Spheres in R³ + geodesic distance

(11)

Probability Distribution Metric Space

Definition

Let $\mathcal{Y}_1$ be the set of probability distributions. The 2-Wasserstein distance between two distributions with CDFs $H(\cdot)$ and $G(\cdot)$ is defined as

\[ d_W(H, G) = \sqrt{\int_0^1 \left(H^{-1}(t) - G^{-1}(t)\right)^2 \, dt}. \]

We denote by $(\mathcal{Y}_1, d_W)$ the metric space of probability distributions equipped with the Wasserstein distance.
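As a rough numerical illustration of this definition, the sketch below approximates the integral by an average over an equally spaced grid of probability levels. The function name `wasserstein2`, the grid size, and the two normal distributions are illustrative assumptions.

```python
import numpy as np
from statistics import NormalDist

def wasserstein2(q1, q2):
    """Approximate 2-Wasserstein distance between two distributions.

    q1, q2: quantile functions evaluated on the same equally spaced
    grid of probability levels in (0, 1).
    """
    q1 = np.asarray(q1, dtype=float)
    q2 = np.asarray(q2, dtype=float)
    if q1.shape != q2.shape:
        raise ValueError("quantile vectors must share one grid")
    # The grid average approximates the integral over [0, 1].
    return float(np.sqrt(np.mean((q1 - q2) ** 2)))

# Midpoint grid on (0, 1) and quantiles of N(0, 1) and N(2, 1).
grid = (np.arange(200) + 0.5) / 200
std_normal = NormalDist()
q_a = np.array([std_normal.inv_cdf(t) for t in grid])
q_b = 2.0 + q_a
dist_ab = wasserstein2(q_a, q_b)
```

For two normal distributions that differ only by a location shift of 2, the quantile functions differ by the constant 2, so `dist_ab` is 2 up to rounding.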

(12)

Symmetric, Positive Definite Matrix Metric Space

Definition

Let $\mathcal{Y}_2$ be the set of symmetric, positive definite (SPD) matrices, and let $P_1$ and $P_2$ be two SPD matrices. By the Cholesky decomposition, we can write $P_1 = (P_1^{1/2})^T P_1^{1/2}$ and $P_2 = (P_2^{1/2})^T P_2^{1/2}$, where $P_1^{1/2}$ and $P_2^{1/2}$ are upper triangular matrices with positive diagonal entries. We then define the Cholesky decomposition distance between $P_1$ and $P_2$ as

\[ d_C(P_1, P_2) = \sqrt{\operatorname{trace}\left\{ (P_1^{1/2} - P_2^{1/2})^T (P_1^{1/2} - P_2^{1/2}) \right\}}. \]

We denote by $(\mathcal{Y}_2, d_C)$ the metric space of SPD matrices equipped with the Cholesky decomposition distance.
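A minimal sketch of this distance, assuming NumPy is available. `np.linalg.cholesky` returns the lower triangular factor, so its transpose plays the role of the upper triangular factor in the definition. The function name and the two test matrices are illustrative.

```python
import numpy as np

def cholesky_distance(p1, p2):
    """Cholesky decomposition distance between two SPD matrices."""
    # np.linalg.cholesky gives L with P = L @ L.T; the transpose L.T is
    # the upper triangular factor U with P = U.T @ U.
    u1 = np.linalg.cholesky(np.asarray(p1, dtype=float)).T
    u2 = np.linalg.cholesky(np.asarray(p2, dtype=float)).T
    diff = u1 - u2
    return float(np.sqrt(np.trace(diff.T @ diff)))

p_one = np.eye(3)          # Cholesky factor is the identity
p_two = 4.0 * np.eye(3)    # Cholesky factor is 2 * identity
d_c = cholesky_distance(p_one, p_two)
```

Here the two factors differ by the 3 × 3 identity, so `d_c` equals the square root of 3 up to rounding.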

(13)

Fréchet Mean and Variance

Fréchet extended the concepts of mean and variance to random objects in 1948.

Let Y be a random object in the metric space $(\mathcal{Y}, d)$. Then the Fréchet mean and variance of Y are defined as

\[ y_\oplus = \arg\min_{y \in \mathcal{Y}} E\left(d^2(Y, y)\right) \quad \text{and} \quad V_\oplus = E\left(d^2(Y, y_\oplus)\right), \tag{1} \]

respectively.
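For distributions on the real line with the 2-Wasserstein metric, the sample Fréchet mean has a closed form: its quantile function is the pointwise average of the individual quantile functions. The sketch below uses that fact on a common grid. The function names and the three toy distributions are illustrative assumptions.

```python
import numpy as np

def frechet_mean_wasserstein(quantiles):
    """Sample Frechet mean under d_W: average the quantile functions.

    quantiles: (n, m) array, row i holds the quantile function of Y_i
    on a common equally spaced grid.
    """
    return np.asarray(quantiles, dtype=float).mean(axis=0)

def frechet_variance_wasserstein(quantiles):
    """Sample Frechet variance: average squared d_W to the Frechet mean."""
    q = np.asarray(quantiles, dtype=float)
    center = frechet_mean_wasserstein(q)
    sq_dist = np.mean((q - center) ** 2, axis=1)  # d_W^2(Y_i, mean)
    return float(sq_dist.mean())

# Three toy distributions that are shifts of one base quantile curve.
base = np.linspace(-1.0, 1.0, 50)
sample = np.vstack([base - 1.0, base, base + 1.0])
mean_q = frechet_mean_wasserstein(sample)
var_q = frechet_variance_wasserstein(sample)
```

In this toy sample the Fréchet mean is the unshifted curve and the Fréchet variance is (1 + 0 + 1)/3 = 2/3.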

(14)

Global Fréchet Regression

(15)

Fréchet Regression Function

Petersen and Müller (2019) introduced the general concept of a Fréchet regression function of Y given X = x, with $x = (x_1, \ldots, x_p)^T$, as

\[ m_\oplus(x) = \arg\min_{y \in \mathcal{Y}} M_\oplus(y, x), \tag{2} \]

where $M_\oplus(\cdot, x) = E\left(d^2(Y, \cdot) \mid X = x\right)$.

(16)

Global Fréchet Regression: An Extension of Classical Linear Regression

Classical Linear Regression

\[ Y = \beta_0 + X^T\beta + \varepsilon \]

with a p-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors $\varepsilon$ with mean zero and finite variance. We collect a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:

1 Estimate $\beta_0$ and $\beta$

2 Predict a future observation at any x in the domain of interest $\mathcal{X}$.

(17)

Equivalent Interpretation of Linear Regression

For the second objective, with a little bit of work, we can show that the prediction at x can be equivalently interpreted as the minimizer of

\[ \min_{y} \sum_{i=1}^{n} \left[ 1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X}) \right] (Y_i - y)^2. \tag{3} \]

If we replace the squared difference $(Y_i - y)^2$ with a more general squared metric distance $d^2(Y_i, y)$, we get the global Fréchet regression model!
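The equivalence can be checked numerically. The sketch below assumes that the sample covariance uses the divisor n, in which case the weights sum to n and the minimizer of the weighted criterion is the weighted average of the responses. The simulated data and the name `global_weights` are illustrative.

```python
import numpy as np

def global_weights(X, x):
    """Weights s_i(x) = 1 + (x - Xbar)^T Sigma_hat^{-1} (X_i - Xbar)."""
    X = np.asarray(X, dtype=float)
    x = np.asarray(x, dtype=float)
    xbar = X.mean(axis=0)
    Xc = X - xbar
    sigma_hat = Xc.T @ Xc / X.shape[0]  # divisor n (assumption)
    return 1.0 + Xc @ np.linalg.solve(sigma_hat, x - xbar)

# Simulated linear-model data (hypothetical coefficients).
rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
x_new = np.array([0.3, -0.2, 1.0])

# Minimizer of sum_i s_i (Y_i - y)^2; the weights sum to n.
s = global_weights(X, x_new)
pred_weighted = np.mean(s * Y)

# Ordinary least squares prediction at the same point.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
pred_ols = coef[0] + x_new @ coef[1:]
```

The two predictions agree up to rounding, which is the reason the weighted form can be carried over to responses that only have a distance.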

(18)

Global Fréchet Regression

The global Fréchet regression model is characterized by

\[ m_\oplus(x) = \arg\min_{y \in \mathcal{Y}} E\left(d^2(Y, y) \mid X = x\right) = g_\oplus(x) \quad \text{for any } x \in \mathcal{X}, \tag{4} \]

where

\[ g_\oplus(x) = \arg\min_{y \in \mathcal{Y}} E_{(X,Y)}\left\{ \left[ 1 + (x - \mu)^T \Sigma^{-1} (X - \mu) \right] d^2(Y, y) \right\} \tag{5} \]

and $g_\oplus(x)$ is assumed to be well-defined.

(19)

FRiSO: Fréchet Ridge Selection Operator

(20)

Variable Selection Objective

As in linear regression, it could be of great interest to determine which $X_j$ in $X = (X_1, \ldots, X_p)^T$ are important predictors, especially in this era of Big Data when p could be very large! Further, determining the important predictors could provide some capability for inference.

In linear regression, we have sparsity-encouraging penalty methods such as

LASSO

Ridge

SCAD

Problem: We have no coefficients in Fréchet regression!

\[ \hat{g}_\oplus(x) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left[ 1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X}) \right] d^2(Y_i, y) \]
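For distribution responses with the Wasserstein metric, this estimator can be sketched in a few lines: because the weights sum to n, the minimizer is the L2 projection of the weighted average of quantile functions onto nondecreasing functions. The sketch assumes quantile functions on a common equally spaced grid and a covariance with divisor n; all names and the toy data are illustrative.

```python
import numpy as np

def project_monotone(v):
    """L2 projection onto nondecreasing sequences (pool adjacent violators)."""
    values, sizes = [], []
    for item in np.asarray(v, dtype=float):
        values.append(float(item))
        sizes.append(1)
        # Merge neighbouring blocks while the ordering is violated.
        while len(values) > 1 and values[-2] > values[-1]:
            total = sizes[-2] + sizes[-1]
            merged = (values[-2] * sizes[-2] + values[-1] * sizes[-1]) / total
            values[-2:] = [merged]
            sizes[-2:] = [total]
    return np.repeat(values, sizes)

def global_frechet_wasserstein(X, Q, x):
    """Global Frechet regression fit at x for quantile-function responses.

    X: (n, p) predictors; Q: (n, m) quantile functions on a common grid.
    """
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = X.shape[0]
    xbar = X.mean(axis=0)
    Xc = X - xbar
    sigma_hat = Xc.T @ Xc / n  # divisor n (assumption)
    s = 1.0 + Xc @ np.linalg.solve(sigma_hat, np.asarray(x, dtype=float) - xbar)
    # Weighted average of quantile functions, then monotone projection.
    return project_monotone(s @ Q / n)

# Toy data: each response is a base curve shifted by its predictor value.
rng = np.random.default_rng(2)
X_toy = rng.uniform(-1.0, 1.0, size=(40, 1))
base = np.linspace(-2.0, 2.0, 25)
Q_toy = X_toy + base
fit = global_frechet_wasserstein(X_toy, Q_toy, np.array([0.5]))
```

Because the toy responses depend linearly on the predictor, the fit at x = 0.5 recovers the base curve shifted by 0.5.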

(21)

Individually Penalized Ridge Regression

Wu (2020) developed a new variable selection method for linear regression based on ridge regression, called Individually Penalized Ridge Regression. Based on a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$ from the classical linear regression model, one estimates the unknown regression coefficients $\beta_0$ and $\beta$ by solving

\[ \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} (Y_i - \beta_0 - X_i^T\beta)^2 + \frac{1}{2} \sum_{j=1}^{p} \nu_j \beta_j^2 \tag{6} \]

with ridge regularization parameters $\nu_j \ge 0$ for $j = 1, \ldots, p$. Let’s reparameterize this with $\lambda_j = 1/\nu_j$. Note: a less important predictor $X_j$ should receive a heavier penalty, i.e. a larger $\nu_j$ and hence a smaller $\lambda_j$.

(22)

Individually Penalized Ridge Regression

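Variable selection for the linear model then proceeds by solving

\[ \min_{\lambda} \; \langle Y - H(\lambda)Y, \; Y - H(\lambda)Y \rangle \tag{7} \]

\[ \text{subject to } \lambda_j \ge 0, \quad j = 1, \ldots, p; \tag{8} \]

\[ \sum_{j=1}^{p} \lambda_j \le \tau, \tag{9} \]

where

\[ H(\lambda) = \frac{1}{n} \left\{ 1_{n \times 1} 1_{n \times 1}^T + (X - 1_{n \times 1}\bar{X}^T) \left[ \hat{\Sigma} + \operatorname{diag}(\lambda^{-1}) \right]^{-1} (X - 1_{n \times 1}\bar{X}^T)^T \right\}. \]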
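A minimal sketch of the matrix H(λ) and of objective (7), assuming the sample covariance uses the divisor n. To allow λ_j = 0 (an infinite penalty), the inverse is rewritten as D^{1/2}[D^{1/2} Σ̂ D^{1/2} + I]^{-1} D^{1/2} with D = diag(λ), which is algebraically the same quantity. Function names and the simulated data are illustrative, and the constrained minimization over λ is not shown.

```python
import numpy as np

def hat_matrix(X, lam):
    """H(lambda) for individually penalized ridge regression."""
    X = np.asarray(X, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    sigma_hat = Xc.T @ Xc / n  # divisor n (assumption)
    # [Sigma + diag(1/lam)]^{-1} in a form that tolerates lam_j = 0.
    root = np.sqrt(lam)
    inner = root[:, None] * sigma_hat * root[None, :] + np.eye(lam.size)
    middle = root[:, None] * np.linalg.inv(inner) * root[None, :]
    return (np.ones((n, n)) + Xc @ middle @ Xc.T) / n

def selection_objective(X, Y, lam):
    """Residual sum of squares <Y - H(lam) Y, Y - H(lam) Y>."""
    resid = Y - hat_matrix(X, lam) @ Y
    return float(resid @ resid)

# Simulated data in which only the first predictor matters.
rng = np.random.default_rng(4)
X_sim = rng.normal(size=(30, 3))
Y_sim = X_sim[:, 0] + rng.normal(scale=0.1, size=30)

H_null = hat_matrix(X_sim, np.zeros(3))                # all predictors removed
H_one = hat_matrix(X_sim, np.array([1e12, 0.0, 0.0]))  # first one nearly unpenalized
```

With all λ_j = 0 the fit collapses to the sample mean, and with a very large λ_1 and λ_2 = λ_3 = 0 it approaches ordinary least squares on the first predictor alone.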

(23)

Individually Penalized Fréchet Regression

As before, we can show that extending individually penalized ridge regression to the more general Fréchet regression framework yields the Individually Penalized Fréchet Regression Estimator

\[ \hat{R}_\oplus(x; \lambda) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left\{ 1 + (x - \bar{X})^T \left[ \hat{\Sigma} + \operatorname{diag}(\lambda) \right]^{-1} (X_i - \bar{X}) \right\} d^2(Y_i, y) \tag{10} \]

for the prediction of the Fréchet regression function $m_\oplus(x)$ at any location x in the domain of interest $\mathcal{X}$.

(24)

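FRiSO: Fréchet Ridge Selection Operator

1 Given the metric space $(\mathcal{Y}, d)$, $\hat{R}_\oplus(x; \lambda^{-1})$ is

\[ \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left\{ 1 + (x - \bar{X})^T \left[ \hat{\Sigma} + \operatorname{diag}(\lambda^{-1}) \right]^{-1} (X_i - \bar{X}) \right\} d^2(Y_i, y) \]

2 Then we find $\hat{\lambda}(\tau) = (\hat{\lambda}_1(\tau), \ldots, \hat{\lambda}_p(\tau))^T$ for $\tau \ge 0$ by solving

\[ \min_{\lambda} \; \frac{1}{n} \sum_{i=1}^{n} d^2\left(Y_i, \hat{R}_\oplus(X_i; \lambda^{-1})\right) \tag{11} \]

\[ \text{subject to } \lambda_j \ge 0, \quad j = 1, \ldots, p; \tag{12} \]

\[ \sum_{j=1}^{p} \lambda_j \le \tau \tag{13} \]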
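For distribution responses with the Wasserstein metric, objective (11) can be evaluated directly for any given λ. The sketch below assumes quantile functions on a common grid, a covariance with divisor n, and a fit evaluated at each training point X_i. It only evaluates the objective; the constrained minimization over λ is not shown. All names and the toy data are illustrative.

```python
import numpy as np

def project_monotone(v):
    """L2 projection onto nondecreasing sequences (pool adjacent violators)."""
    values, sizes = [], []
    for item in np.asarray(v, dtype=float):
        values.append(float(item))
        sizes.append(1)
        while len(values) > 1 and values[-2] > values[-1]:
            total = sizes[-2] + sizes[-1]
            merged = (values[-2] * sizes[-2] + values[-1] * sizes[-1]) / total
            values[-2:] = [merged]
            sizes[-2:] = [total]
    return np.repeat(values, sizes)

def friso_objective(X, Q, lam):
    """(1/n) sum_i d_W^2(Y_i, R_hat(X_i; 1/lam)) for responses Q of shape (n, m)."""
    X = np.asarray(X, dtype=float)
    Q = np.asarray(Q, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    sigma_hat = Xc.T @ Xc / n  # divisor n (assumption)
    # [Sigma + diag(1/lam)]^{-1}, written so that lam_j = 0 is allowed.
    root = np.sqrt(lam)
    inner = root[:, None] * sigma_hat * root[None, :] + np.eye(lam.size)
    middle = root[:, None] * np.linalg.inv(inner) * root[None, :]
    # Row k holds the weights attached to Y_1, ..., Y_n when predicting at X_k.
    weights = 1.0 + Xc @ middle @ Xc.T
    fits = np.apply_along_axis(project_monotone, 1, weights @ Q / n)
    return float(np.mean((Q - fits) ** 2))

# Toy data: only the first of three predictors shifts the distribution.
rng = np.random.default_rng(3)
n_obs, tau = 60, 5.0
X_toy = rng.uniform(-1.0, 1.0, size=(n_obs, 3))
base = np.linspace(-1.0, 1.0, 20)
Q_toy = 2.0 * X_toy[:, [0]] + base

obj_relevant = friso_objective(X_toy, Q_toy, [tau, 0.0, 0.0])
obj_irrelevant = friso_objective(X_toy, Q_toy, [0.0, tau, 0.0])
```

Spending the budget τ on the relevant predictor gives a much smaller objective than spending it on an irrelevant one, which is the mechanism that drives the selection.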

(25)

FRiSO: Selection Consistency

With FRiSO, we indeed have selection consistency.

Definition

A set $I \subseteq \{1, 2, \ldots, p\}$ is called the important predictor set for global Fréchet regression of random objects Y on multivariate random vectors X if I is the smallest set satisfying $Y \perp\!\!\!\perp X_{I^c} \mid X_I$, i.e., Y is conditionally independent of $X_{I^c}$ given $X_I$.

Theorem

Under a few reasonable conditions, when $\tau = \tau_n \to \infty$ as $n \to \infty$, the solution $\hat{\lambda}(\tau_n)$ of (11) satisfies $\hat{\lambda}_j(\tau_n) \xrightarrow{p} \infty$ for $j \in I$ and $\hat{\lambda}_{j'}(\tau_n) \xrightarrow{p} 0$ for $j' \notin I$.

(26)

FRiSO: Steps 3 and 4 (Refitting)

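3 Collect the set $\hat{I}(\tau) = \{j : \hat{\lambda}_j(\tau) > 0\}$ to indicate the selected predictors

4 Refit the global Fréchet model with only the selected predictors to get the estimator $\hat{m}^{\text{refit}}_\oplus(x; \tau)$:

\[ \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left[ 1 + (x_{\hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)})^T \hat{\Sigma}^{-1}_{\hat{I}(\tau), \hat{I}(\tau)} (X_{i, \hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)}) \right] d^2(Y_i, y). \]

Why is this refitting helpful? Theorem 1 states that $\hat{\lambda}_j \xrightarrow{p} \infty$ for $j \in I$, which implies $\hat{\nu}_j = 1/\hat{\lambda}_j \xrightarrow{p} 0$. In the limit, this eliminates the penalty attached to an important predictor, and with it the bias. However, we have a finite sample size n, so some penalty, and hence some bias, remains on the selected predictors.

Refitting reduces this bias.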

(27)

Theorem 1: Required Conditions

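Condition [A]: For any A satisfying $I \subseteq A \subseteq \{1, 2, \ldots, p\}$, we have

\[ \left\{ E\left( \Sigma^{-1}_{A,A}(X_A - \mu_A) \mid X_I \right) \right\}_I = \Sigma^{-1}_{I,I}(X_I - \mu_I) \]

and

\[ \left\{ E\left( \Sigma^{-1}_{A,A}(X_A - \mu_A) \mid X_I \right) \right\}_{A \setminus I} = 0, \]

where $A \setminus I = \{j : j \in A \text{ and } j \notin I\}$.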

(28)

Theorem 1: Required Conditions Cont.

Condition [B]: $E_X d^2(m_\oplus(X), L^A_\oplus(X)) \equiv \int_{\mathcal{X}} d^2(m_\oplus(x), L^A_\oplus(x)) \, F_X(dx) > 0$ for any set A satisfying $I \setminus A \ne \emptyset$.

Condition [C]: Assume that $E_X d^2(L_\oplus(X), R_\oplus(X; \nu)) > 0$ for any nonnegative vector $\nu$ satisfying $\|\nu_I\| > 0$.

Condition [D]: Assume that the gradient $\frac{\partial}{\partial \nu_I} E_X d^2(L_\oplus(X), R_\oplus(X; \nu)) = 0$ and the Hessian $\frac{\partial^2}{\partial \nu_I \partial \nu_I^T} E_X d^2(L_\oplus(X), R_\oplus(X; \nu))$ is strictly positive definite at any $\nu$ satisfying $\nu_j = 0$ for any $j \in I$.

(29)

Simulation Studies

(30)

Simulation Study Setup

Correlated scalar predictors $X_j \sim U(-1, 1)$, $j = 1, 2, \ldots, p$, are generated in two steps:

1 Generate $Z = (Z_1, Z_2, \ldots, Z_p)^T$ as multivariate Gaussian with $E(Z_j) = 0$ and $\operatorname{cov}(Z_j, Z_{j'}) = \rho^{|j - j'|}$

2 Set $X_j = 2\Phi(Z_j) - 1$ for $j = 1, \ldots, p$, where $\Phi$ is the standard normal distribution function

We set p = 10 and ρ = 0.5.
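The two steps above can be sketched as follows. The function name, the seed argument, and the use of the error function to evaluate Φ are implementation choices made for this illustration.

```python
import math
import numpy as np

def simulate_predictors(n, p=10, rho=0.5, seed=None):
    """Generate n rows of correlated U(-1, 1) predictors."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # Step 1: Gaussian vector with cov(Z_j, Z_j') = rho^{|j - j'|}.
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # Step 2: X_j = 2 * Phi(Z_j) - 1, with Phi computed from erf.
    phi = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    return 2.0 * phi - 1.0

X_train = simulate_predictors(200, seed=0)
```

Each column is marginally uniform on (−1, 1), while neighbouring columns stay positively correlated through the Gaussian step.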

(31)

A Probability Distribution Example: $(\mathcal{Y}_1, d_W)$

The Fréchet regression function is given by

\[ m_\oplus(x) = E(Y(\cdot) \mid X = x) = \mu_0 + \beta(x_4 + x_8) + (\sigma_0 + \gamma x_1)\Phi^{-1}(\cdot). \]

Conditional on X, the random response Y is generated by adding noise as follows: $Y = \mu + \sigma\Phi^{-1}$ with $\mu \mid X \sim N(\mu_0 + \beta(X_4 + X_8), \nu_1)$ and $\sigma \mid X \sim \text{Gamma}\left((\sigma_0 + \gamma X_1)^2/\nu_2, \; \nu_2/(\sigma_0 + \gamma X_1)\right)$ being independently sampled. The additional parameters are set as $\mu_0 = 0$, $\sigma_0 = 3$, $\beta = 3/4$, $\gamma = 1$, $\nu_1 = 1$, and $\nu_2 = 1/2$.

Training and validation sample sizes are both n = 200. The testing sample size is $\tilde{n} = 100n$. The validation set is used for the selection of τ. The testing set is used to evaluate the performance of the estimated Fréchet regression.
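A sketch of this response-generating step, representing each random distribution by its quantile function on a grid. It assumes that ν1 is the variance of µ and that the Gamma distribution is parameterized by shape and scale, so that σ has conditional mean σ0 + γX1. The stand-in predictors below are independent uniforms used only to keep the example self-contained; the function name and grid are illustrative.

```python
import numpy as np
from statistics import NormalDist

def simulate_distribution_responses(X, grid, mu0=0.0, sigma0=3.0, beta=0.75,
                                    gamma=1.0, nu1=1.0, nu2=0.5, seed=None):
    """Return quantile functions of Y = mu + sigma * Phi^{-1} on `grid`."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    mean_mu = mu0 + beta * (X[:, 3] + X[:, 7])   # X4 and X8 (0-based columns 3, 7)
    mean_sigma = sigma0 + gamma * X[:, 0]        # X1 (0-based column 0)
    mu = rng.normal(mean_mu, np.sqrt(nu1))       # nu1 treated as a variance (assumption)
    # Shape-scale Gamma with mean shape * scale = mean_sigma.
    sigma = rng.gamma(shape=mean_sigma ** 2 / nu2, scale=nu2 / mean_sigma)
    z = np.array([NormalDist().inv_cdf(t) for t in grid])
    return mu[:, None] + sigma[:, None] * z[None, :]

# Stand-in predictors on (-1, 1); the correlated design is not reproduced here.
rng_x = np.random.default_rng(5)
X_demo = rng_x.uniform(-1.0, 1.0, size=(200, 10))
grid = (np.arange(50) + 0.5) / 50
Q_demo = simulate_distribution_responses(X_demo, grid, seed=6)
```

Every row of `Q_demo` is a nondecreasing quantile function, so it can be passed directly to a Wasserstein-based fitting routine.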

(32)

Results: Selection Consistency

We obtain the optimal solution ˆλ(τ ) over a pre-specified grid for τ , {τ₁< τ2< ... < τ_K}. A solution path is considered to be consistent if the optimal solution ˆλ(τ_k) leads to the same sparsity pattern as the truth for some k ∈ {1, 2, ..., K }.

Table: Example 5.2.2 simulation results for variable selection with FRiSO for global Fréchet regression when the random objects are probability distributions.

                  Selection frequency                                Path
                  X1   X2   X3   X4   X5   X6   X7   X8   X9   X10   consistency
                  100  17   21   100  26   22   20   100  20   12    96
With refitting    100  0    0    100  4    1    0    100  0    0

(33)

Solution Path: One Random Repetition

[Figure: solution paths of λ_j for the predictors X1, ..., X10.]

(34)

Results: Model Performance

Denote the test set as $\{(\tilde{X}_i, \tilde{Y}_i) : i = 1, \ldots, \tilde{n}\}$. We collect:

1 $D_1 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \tilde{Y}_i)$ (noise)

2 $D_2 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \hat{m}_\oplus(\tilde{X}_i))$ (deviation from true function)

3 $D_3 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}_\oplus(\tilde{X}_i), \tilde{Y}_i)$ (prediction error)

4 $D_4 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \hat{m}^*)$ ($D_2$ without predictors)

5 $D_5 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}^*, \tilde{Y}_i)$ ($D_3$ without predictors)

Table: Example 5.2.2 simulation results (prediction) of Fréchet regression for probability distributions with the Wasserstein metric.

Mean (standard error) over 100 repetitions

D1     D2             D3             D4             D5
1.311  0.122 (0.003)  1.433 (0.003)  0.616 (0.001)  1.929 (0.001)

(35)

Notes

On the previous slide, $\hat{m}^* = \arg\min_{y} \sum_{i=1}^{n} \int_0^1 [y^{-1}(t) - Y_i^{-1}(t)]^2 \, dt$.

D₁, D₂ and D₄ cannot be observed from real data

D₅ is a measure of out-of-sample SSTO

D₃ is a measure of out-of-sample SSE

(36)

An SPD Matrix Example: $(\mathcal{Y}_2, d_C)$

Let I denote the M × M identity matrix and $U = (U_{i,j})$ denote the M × M matrix with $U_{i,j} = 1$ if i < j and 0 otherwise. We choose M = 5.

The Fréchet regression function is given by

\[ m_\oplus(x) = E(Y \mid X = x) = E(A)^T E(A), \]

where

\[ E(A) = \{\mu_0 + \beta(x_1 + x_3) + \sigma_0 + \gamma(x_5 + x_7 + x_9)\} I + \{\sigma_0 + \gamma(x_5 + x_7 + x_9)\} U. \]

Noise addition: $Y = A^T A$ where $A = (\mu + \sigma) I + \sigma U$, with $\mu \mid X \sim N(\mu_0 + \beta(X_1 + X_3), \nu_1)$ and $\sigma \mid X \sim \text{Gamma}\left((\sigma_0 + \gamma(X_5 + X_7 + X_9))^2/\nu_2, \; \nu_2/(\sigma_0 + \gamma(X_5 + X_7 + X_9))\right)$.

We set $\mu_0 = 3$, $\sigma_0 = 3$, $\beta = 2$, $\gamma = 3$, $\nu_1 = 1$, and $\nu_2 = 2$.

(37)

Results: Selection Consistency

Table: Simulation results (variable selection) of Fréchet regression for SPD matrix data

                  Selection frequency                                Path
                  X1   X2   X3   X4   X5   X6   X7   X8   X9   X10   consistency
                  100  1    100  0    100  1    100  0    100  0     100
With refitting    100  0    99   0    100  1    100  0    100  0

(38)

Solution Path: One Random Repetition

[Figure: solution paths of λ̂_j for the variables X1, ..., X10.]

(39)

Results: Model Performance

Table: Simulation results on prediction for Fréchet regression for SPD matrix data

Mean (standard error) over 100 repetitions

D1 D2 D3 D4 D5

60.18 (4.34) 73.13 (4.37) 241.45 (4.33) 267.35 (4.33) 8.14 35.31 (4.27) 35.16 (4.34) with refitting

(40)

Bike Rental Distribution Data

(41)

Recall the bike rental data example from the beginning. To show the variable selection accuracy of FRiSO, we create 6 additional noise variables:

X₁ ∼ N(0, 1)

X₂ ∼ Γ(α = 1, β = 2)

X₃ ∼ Γ(2, 2)

X₄ ∼ Γ(15, 1)

X₅ ∼ Γ(15, 2)

X₆ ∼ N(0, 1) + √Γ(35, 2)

All predictors are then standardized to have mean zero and variance one.

(42)

Variable Selection on Bike Rental Data

10-fold cross-validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal τ with refitting is 8.

[Figure: solution paths of λ̂ for the variables 2012, BW, Holiday, Humid, RBW, Temp, Wind, Work, X1, X2, X3, X4, X5, X6.]

(43)

A Generalized R²

To assess the performance of our final model, which includes 2012, Temp, Work, RBW, BW, and Wind, we can compute a generalized R²:

\[ \hat{R}_\oplus^2 = 1 - \frac{\sum_{i=1}^{n} d^2(Y_i, \hat{m}_\oplus(X_i))}{\sum_{i=1}^{n} d^2(Y_i, \hat{m}^*)} = 1 - \frac{\text{SSE}}{\text{SSTO}}. \]

For our final refitted model, after choosing τ = 8, $\hat{R}_\oplus^2 = 0.708$.
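The generalized R² only needs a distance function, fitted values, and a predictor-free center. A minimal sketch, using a grid-based Wasserstein distance and toy quantile functions as illustrative stand-ins (none of the names or values come from the bike rental analysis):

```python
import numpy as np

def generalized_r2(Y, fitted, center, dist):
    """Generalized R^2 = 1 - SSE / SSTO for metric-space responses.

    Y, fitted: sequences of responses and fitted objects;
    center: predictor-free fit (sample Frechet mean);
    dist: function returning the distance between two objects.
    """
    sse = sum(dist(y, f) ** 2 for y, f in zip(Y, fitted))
    ssto = sum(dist(y, center) ** 2 for y in Y)
    return 1.0 - sse / ssto

def wasserstein2(q1, q2):
    """Grid approximation of the 2-Wasserstein distance."""
    return float(np.sqrt(np.mean((np.asarray(q1) - np.asarray(q2)) ** 2)))

# Toy responses: three shifted copies of one quantile curve.
base = np.linspace(-1.0, 1.0, 30)
Y_toy = [base - 1.0, base, base + 1.0]
center = np.mean(Y_toy, axis=0)

r2_perfect = generalized_r2(Y_toy, Y_toy, center, wasserstein2)
r2_null = generalized_r2(Y_toy, [center] * 3, center, wasserstein2)
```

A fit that reproduces every response gives a value of 1, while a fit that ignores the predictors and always returns the center gives 0.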

(44)

Taxi Network Data

(45)

Taxi Network Construction

The New York City Taxi and Limousine Commission provides records on yellow taxi rides. We construct a total of 723 SPD matrices of size 10 × 10 as follows:

1 Filter the data on the month of January 2016.

2 Filter on observations occurring in Manhattan.

3 For each hour, we collect the number of pairwise connections between neighborhoods based on taxi pick-ups and drop-offs.

(46)

Taxi Ride Potential Predictors (Averaged per Hour)

Ave. Distance: Mean distance travelled, standardized

Ave. Fare: Mean fare, standardized

Ave. Passengers: Mean number of passengers, standardized

Ave. Tip: Mean tip, standardized

Cash: Sum of cash indicators for type of payment, standardized

Credit: Sum of credit indicators for type of payment, standardized

Dispute: Sum of dispute indicators for type of payment, standardized

Free: Sum of free indicators for type of payment, standardized

Late Hour: Indicator for the hour being between 11pm and 5am, standardized

Vendor: Sum of the vendor indicators, standardized (in the original data, the vendors for the recording devices installed in each taxi are

(47)

Potential Weather Predictors (Averaged per Day)

From https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date, we further collect New York City weather history for January 2016.

Day’s Ave. Humid.: Daily mean humidity, standardized

Day’s Ave. Press.: Daily mean barometric pressure, standardized

Day’s Ave. Temp.: Daily mean temperature, standardized

Day’s Ave. Wind: Daily mean windspeed, standardized

Day’s Total Precip.: Daily total precipitation, standardized

This then leads to a total of fifteen potential predictors.

(48)

Variable Selection on Taxi Ride Data

Validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal τ with refitting is 4.5, with $\hat{R}_\oplus^2 = 0.429$.

[Figure: solution paths of λ̂ for the variables Ave. Distance, Ave. Fare, Ave. Passengers, Ave. Tip, Cash, Credit, Day's Ave. Humid., Day's Ave. Press., Day's Ave. Temp., Day's Ave. Wind, Day's Total Precip., Dispute, Free, Late Hour.]

(49)

Concluding Remarks

(50)

We have done the following:

1 Developed a novel variable selection method for global Fréchet regression

2 Proven FRiSO satisfies selection consistency under certain conditions

3 Shown these conditions hold for useful examples

4 Demonstrated through simulation and real data examples that FRiSO performs well with finite n

(51)

Future Work

Future work will include

Applying our method in collaborations with other departments/researchers

Developing LRT for all models and further inference

Exploring the effect of metric choice on analysis

(52)

Thank you!

Thank you all for joining! Please reach out if you have

any questions or would like to discuss this project.

(53)

Some Selected References

Fan, J. and J. Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20 (1), 101-148.

Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l’institut Henri Poincaré 10 (4), 215-310.

Petersen, A. and H.-G. Müller (2019). Fréchet regression for random objects with Euclidean predictors. The Annals of Statistics 47 (2), 691-719.

Wu, Y. (2020). Can’t ridge regression perform variable selection? Technometrics, in press.
