Variable Selection for Global Fréchet Regression

Danielle C. Tucker¹, Yichao Wu¹, Hans-Georg Müller²

¹ University of Illinois at Chicago

² UC Davis

September 2021

Research Focus: Complex Data Types

Table of Contents

1 Motivating Example

2 Global Fréchet Regression

3 FRiSO: Fréchet Ridge Selection Operator

4 Simulation Studies

5 Bike Rental Distribution Data

6 Taxi Network Data

7 Concluding Remarks

Motivating Example

Motivation: Bike Rental Data

Let’s look at a data set collected by Capital Bikeshare in Washington D.C.

This data set spans the years 2011 and 2012, for a total of 731 days. For each day, there are 24 observations of bike rental counts, along with predictors that include:

RBW: Indicator of really bad weather (snowy and/or rainy)

Holiday: Indicator of a public holiday celebrated in Washington D.C.

2012: Indicator of the year 2012

Humid: Continuous variable of standardized humidity

Temp: Continuous variable of standardized temperature

Response Y: The distribution of rentals for each day, represented by that day's 24 observed quantiles.

Bike Rental Data Snippet

Date        Count  RBW  Holiday  2012  Humidity  Temperature
2011-01-01  16     0    0        0     0.81      0.24
2011-01-01  40     0    0        0     0.80      0.22
2011-01-01  32     0    0        0     0.80      0.22
2011-01-01  13     0    0        0     0.75      0.24
2011-01-01  1      0    0        0     0.75      0.24
2011-01-01  1      0    0        0     0.75      0.24
...
2011-01-01  35     1    0        0     0.88      0.42
2011-01-01  37     1    0        0     0.88      0.42
2011-01-01  36     0    0        0     0.87      0.40
2011-01-01  34     0    0        0     0.87      0.40
2011-01-01  28     0    0        0     0.94      0.40

[Figure: Example Bike Distributions]

How to Extend Existing Methods?

Classical Linear Regression

$$Y = \beta_0 + X^T\beta + \varepsilon$$

with a $p$-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors $\varepsilon$ with mean zero and finite variance. We collect a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:

1 Estimate $\beta_0$ and $\beta$

2 Predict a future observation at any $x$ in the domain of interest $\mathcal{X}$

Here, what kind of $Y$ are we working with? $Y \in \mathbb{R}$, a Euclidean space.

From Euclidean Space to General Metric Space

Definition

A metric space is an ordered pair $(\mathcal{Y}, d)$ where $\mathcal{Y}$ is a set and $d$ is a metric on $\mathcal{Y}$, i.e. a function $d : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ such that for any $y_1, y_2, y_3 \in \mathcal{Y}$, the following hold:

1 $d(y_1, y_2) = 0 \iff y_1 = y_2$ (Identity of Indiscernibles)

2 $d(y_1, y_2) = d(y_2, y_1)$ (Symmetry)

3 $d(y_1, y_3) \le d(y_1, y_2) + d(y_2, y_3)$ (Triangle Inequality)

Examples

Probability distributions + Wasserstein metric

Symmetric positive definite $M \times M$ matrices + Cholesky decomposition metric

Spheres in $\mathbb{R}^3$ + geodesic distance

Probability Distribution Metric Space

Definition

Let $\mathcal{Y}_1$ be the set of probability distributions. The 2-Wasserstein distance between two distributions with CDFs $H(\cdot)$ and $G(\cdot)$ is defined as

$$d_W(H, G) = \sqrt{\int_0^1 \left(H^{-1}(t) - G^{-1}(t)\right)^2 \, dt},$$

where $H^{-1}$ and $G^{-1}$ are the corresponding quantile functions.

We denote $(\mathcal{Y}_1, d_W)$ as the metric space of probability distributions equipped with the Wasserstein distance.
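As a minimal sketch of the definition above, the 2-Wasserstein distance between two samples can be approximated by evaluating their empirical quantile functions on a grid of $t$ values. The function name `wasserstein2` and the grid size are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def wasserstein2(sample_h, sample_g, grid_size=200):
    """Approximate d_W between two samples via their empirical quantile functions."""
    # Midpoint grid on (0, 1); averaging over it approximates the integral over t.
    t = (np.arange(grid_size) + 0.5) / grid_size
    q_h = np.quantile(sample_h, t)  # empirical H^{-1}(t)
    q_g = np.quantile(sample_g, t)  # empirical G^{-1}(t)
    return np.sqrt(np.mean((q_h - q_g) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
shift_distance = wasserstein2(x, x + 2.0)  # pure location shift by 2
print(shift_distance)
```

A pure location shift moves every quantile by the same amount, so the distance in this example equals the size of the shift up to floating-point error.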

Symmetric, Positive Definite Matrix Metric Space

Definition

Let $\mathcal{Y}_2$ be the set of symmetric, positive definite (SPD) matrices. Let $P_1$ and $P_2$ be two SPD matrices. Then, under the Cholesky decomposition, we can write $P_1 = (P_1^{1/2})^T P_1^{1/2}$ and $P_2 = (P_2^{1/2})^T P_2^{1/2}$, where $P_1^{1/2}$ and $P_2^{1/2}$ are upper triangular matrices with positive diagonal components. Then we define the Cholesky decomposition distance between $P_1$ and $P_2$ as

$$d_C(P_1, P_2) = \sqrt{\operatorname{trace}\left[(P_1^{1/2} - P_2^{1/2})^T (P_1^{1/2} - P_2^{1/2})\right]}.$$

We denote $(\mathcal{Y}_2, d_C)$ as the metric space of SPD matrices equipped with the Cholesky decomposition distance.
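The Cholesky decomposition distance between SPD matrices can be sketched with NumPy as follows. The function name `cholesky_distance` is a hypothetical helper chosen for illustration; note that `np.linalg.cholesky` returns the lower triangular factor, so it is transposed to match the upper triangular convention of the definition.

```python
import numpy as np

def cholesky_distance(p1, p2):
    """Cholesky decomposition distance between two SPD matrices."""
    # np.linalg.cholesky returns lower triangular L with P = L L^T,
    # so the upper triangular factor with P = U^T U is U = L^T.
    u1 = np.linalg.cholesky(p1).T
    u2 = np.linalg.cholesky(p2).T
    diff = u1 - u2
    # sqrt(trace(diff^T diff)) is the Frobenius norm of diff.
    return np.sqrt(np.trace(diff.T @ diff))

# The factors of I and 4I are I and 2I, so the distance is sqrt(2) for 2 x 2 matrices.
d = cholesky_distance(np.eye(2), 4.0 * np.eye(2))
print(d)
```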

Fréchet Mean and Variance

Fréchet extended the concepts of mean and variance to random objects in 1948.

Let $Y$ be a random object in the metric space $(\mathcal{Y}, d)$. Then the Fréchet mean and variance of $Y$ are defined as

Fréchet mean and variance

$$y_\oplus = \arg\min_{y \in \mathcal{Y}} E(d^2(Y, y)) \quad \text{and} \quad V_\oplus = E(d^2(Y, y_\oplus)) \qquad (1)$$

respectively.
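For distributions on the real line under the 2-Wasserstein metric, the sample Fréchet mean has a closed form: its quantile function is the pointwise average of the observed quantile functions. The sketch below assumes each observation is stored as a quantile function on a common grid; the helper name `frechet_mean_and_variance` is illustrative.

```python
import numpy as np

def frechet_mean_and_variance(quantiles):
    """quantiles: (n, m) array, row i holds Y_i^{-1} on a common grid of t values."""
    mean_q = quantiles.mean(axis=0)  # quantile function of the sample Frechet mean
    # Squared Wasserstein distance of each observation to the mean
    # (the grid average approximates the integral over t).
    sq_dist = np.mean((quantiles - mean_q) ** 2, axis=1)
    return mean_q, sq_dist.mean()

t = (np.arange(100) + 0.5) / 100
# Three uniform distributions on [0, 1], [1, 2], [2, 3]: Y_i^{-1}(t) = t + shift.
q = np.stack([t + shift for shift in (0.0, 1.0, 2.0)])
mean_q, var = frechet_mean_and_variance(q)
print(var)
```

In this example the Fréchet mean is the uniform distribution on [1, 2], and the Fréchet variance is the average of the squared shifts from it.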

Global Fréchet Regression

Fréchet Regression Function

Petersen and Müller (2019) introduced a general concept of a Fréchet regression function of $Y$ given $X = x$, with $x = (x_1, \ldots, x_p)^T$, as

Fréchet Regression

$$m(x) = \arg\min_{y \in \mathcal{Y}} M(y, x) \qquad (2)$$

where $M(\cdot, x) = E(d^2(Y, \cdot) \mid X = x)$.

Global Fréchet Regression: An Extension of Classical Linear Regression

Classical Linear Regression

$$Y = \beta_0 + X^T\beta + \varepsilon$$

with a $p$-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors $\varepsilon$ with mean zero and finite variance. We collect a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:

1 Estimate $\beta_0$ and $\beta$

2 Predict a future observation at any $x$ in the domain of interest $\mathcal{X}$.

Equivalent Interpretation of Linear Regression

For the second objective, with a little bit of work, we can show that this future prediction can be equivalently interpreted as the solution of

$$\min_y \sum_{i=1}^n \left[1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X})\right] (Y_i - y)^2 \qquad (3)$$

We can replace the squared difference $(Y_i - y)^2$ with a more general squared metric distance $d^2(Y_i, y)$, and we get the global Fréchet regression model!
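The equivalence can be checked numerically. The sketch below uses simulated data (the coefficients and the point `x_new` are arbitrary choices for illustration) and takes $\hat{\Sigma}$ to be the sample covariance with divisor $n$. Because the weights sum to $n$, the minimizer of the weighted sum of squares is the weighted average of the $Y_i$, and it coincides with the least squares prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

x_new = np.array([0.3, -0.2, 1.0])
x_bar = X.mean(axis=0)
Xc = X - x_bar
sigma_hat = Xc.T @ Xc / n  # sample covariance with divisor n

# Weights s_i(x) = 1 + (x - X_bar)^T Sigma_hat^{-1} (X_i - X_bar); they sum to n.
weights = 1.0 + Xc @ np.linalg.solve(sigma_hat, x_new - x_bar)
# The minimizer of sum_i s_i (Y_i - y)^2 is the weighted average of the Y_i.
pred_weighted = np.sum(weights * Y) / np.sum(weights)

# Ordinary least squares prediction with an intercept, for comparison.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
pred_ols = coef[0] + x_new @ coef[1:]

print(pred_weighted, pred_ols)
```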

The global Fréchet regression model is characterized by

Global Fréchet Regression

$$m(x) = \arg\min_{y \in \mathcal{Y}} E(d^2(Y, y) \mid X = x) = g(x) \text{ for any } x \in \mathcal{X}, \qquad (4)$$

where

$$g(x) = \arg\min_{y \in \mathcal{Y}} E_{(X,Y)}\left\{\left[1 + (x - \mu)^T \Sigma^{-1} (X - \mu)\right] d^2(Y, y)\right\} \qquad (5)$$

with $\mu = E(X)$ and $\Sigma = \operatorname{cov}(X)$, and where $g(x)$ is assumed to be well-defined.
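For distributional responses under the Wasserstein metric, the squared distance is a squared $L^2$ distance between quantile functions, so the sample version of (5) reduces to a weighted average of quantile functions. The sketch below makes two assumptions that are not in the slides: responses are stored as quantile functions on a common grid, and the result is not projected back onto the set of monotone functions (a full implementation would need that step, since the weights can be negative). The name `global_frechet_quantiles` is hypothetical.

```python
import numpy as np

def global_frechet_quantiles(X, Q, x_new):
    """Weighted average of quantile functions with global Frechet weights.

    X: (n, p) predictors; Q: (n, m) quantile functions on a common grid.
    Assumption: no projection onto monotone functions is applied.
    """
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    Xc = X - x_bar
    sigma_hat = Xc.T @ Xc / n
    weights = 1.0 + Xc @ np.linalg.solve(sigma_hat, x_new - x_bar)
    return weights @ Q / n  # the weights sum to n

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(30, 2))
t = (np.arange(50) + 0.5) / 50
Q = X[:, [0]] + t[None, :]  # Y_i is uniform on [X_i1, X_i1 + 1]
pred = global_frechet_quantiles(X, Q, np.array([0.4, -0.1]))
```

Since the quantile functions in this example depend linearly on the first predictor, the prediction at `x_new` recovers the quantile function `0.4 + t`, and the prediction at the predictor mean is the sample Fréchet mean.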

FRiSO: Fréchet Ridge Selection Operator

Variable Selection Objective

As in linear regression, it could be of great interest to determine which $X_j$ in $X = (X_1, \ldots, X_p)^T$ are important predictors, especially in this era of Big Data when $p$ could be very large! Further, determining the important predictors could provide some capability for inference.

In linear regression, we have sparsity-encouraging penalty methods such as

LASSO

Ridge

SCAD

Problem: We have no coefficients in Fréchet regression!

$$\hat{g}(x) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^n \left[1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X})\right] d^2(Y_i, y)$$

Individually Penalized Ridge Regression

Wu (2020) developed a new variable selection method for linear regression based on ridge regression, called Individually Penalized Ridge Regression. Based on a random sample $\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$ from the classical linear regression model, one estimates the unknown regression coefficients $\beta_0$ and $\beta$ by solving

Individually Penalized Ridge Regression Estimator

$$\min_{\beta_0, \beta} \frac{1}{2n} \sum_{i=1}^n (Y_i - \beta_0 - X_i^T\beta)^2 + \frac{1}{2} \sum_{j=1}^p \nu_j \beta_j^2 \qquad (6)$$

with ridge regularization parameters $\nu_j \ge 0$ for $j = 1, \ldots, p$. Let's reparameterize this with $\lambda_j = 1/\nu_j$. Note: a less important predictor $X_j$ should receive a larger penalty $\nu_j$, i.e. a smaller $\lambda_j$.

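Variable selection for the linear model then proceeds by solving

Individually Penalized Ridge Regression

$$\min_\lambda \langle Y - H(\lambda)Y,\; Y - H(\lambda)Y \rangle \qquad (7)$$

$$\text{subject to } \lambda_j \ge 0,\ j = 1, \ldots, p; \qquad (8)$$

$$\sum_{j=1}^p \lambda_j \le \tau \qquad (9)$$

where

$$H(\lambda) = \frac{1}{n}\left\{ 1_{n \times 1} 1_{n \times 1}^T + (\mathbf{X} - 1_{n \times 1}\bar{X}^T)\left[\hat{\Sigma} + \operatorname{diag}(\lambda^{-1})\right]^{-1} (\mathbf{X} - 1_{n \times 1}\bar{X}^T)^T \right\},$$

and $\mathbf{X}$ denotes the $n \times p$ matrix of predictors.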
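The matrix $H(\lambda)$ can be computed directly and cross-checked against the solution of (6). The sketch below uses simulated data and assumes every $\lambda_j > 0$ so that $1/\lambda_j$ is finite; the function name `hat_matrix` and the particular values of $\lambda$ are illustrative choices.

```python
import numpy as np

def hat_matrix(X, lam):
    """H(lambda) for individually penalized ridge regression (all lam_j > 0 assumed)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                # X - 1 X_bar^T
    sigma_hat = Xc.T @ Xc / n
    inner = np.linalg.inv(sigma_hat + np.diag(1.0 / lam))
    return (np.ones((n, n)) + Xc @ inner @ Xc.T) / n

rng = np.random.default_rng(2)
n, p = 40, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(size=n)
lam = np.array([5.0, 0.1, 5.0])

fitted = hat_matrix(X, lam) @ Y
objective = np.sum((Y - fitted) ** 2)      # <Y - H(lam) Y, Y - H(lam) Y>

# Cross-check: solve (6) directly with nu_j = 1 / lam_j.
Xc = X - X.mean(axis=0)
beta = np.linalg.solve(Xc.T @ Xc / n + np.diag(1.0 / lam), Xc.T @ Y / n)
direct = Y.mean() + Xc @ beta
```

The two sets of fitted values agree, which confirms that minimizing (7) over $\lambda$ amounts to tuning the individual ridge penalties in (6).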

Individually Penalized Fréchet Regression

As before, we can show that extending individually penalized ridge regression to the more general Fréchet regression framework yields the Individually Penalized Fréchet Regression Estimator $\hat{R}(x; \lambda)$, defined as

Individually Penalized Fréchet Regression Estimator

$$\hat{R}(x; \lambda) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^n \left\{1 + (x - \bar{X})^T \left[\hat{\Sigma} + \operatorname{diag}(\lambda)\right]^{-1} (X_i - \bar{X})\right\} d^2(Y_i, y) \qquad (10)$$

for the prediction of the Fréchet regression function $m(x)$ at any location $x$ in the domain of interest $\mathcal{X}$.

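FRiSO: Fréchet Ridge Selection Operator

1 Given the metric space $(\mathcal{Y}, d)$, $\hat{R}(x; \lambda^{-1})$ is

$$\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^n \left\{1 + (x - \bar{X})^T \left[\hat{\Sigma} + \operatorname{diag}(\lambda^{-1})\right]^{-1} (X_i - \bar{X})\right\} d^2(Y_i, y)$$

2 Then we find $\hat{\lambda}(\tau) = (\hat{\lambda}_1(\tau), \ldots, \hat{\lambda}_p(\tau))^T$ for $\tau \ge 0$ by solving

$$\min_\lambda \frac{1}{n} \sum_{i=1}^n d^2(Y_i, \hat{R}(X_i; \lambda^{-1})) \qquad (11)$$

$$\text{subject to } \lambda_j \ge 0,\ j = 1, \ldots, p; \qquad (12)$$

$$\sum_{j=1}^p \lambda_j \le \tau \qquad (13)$$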
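Steps 1 and 2 can be sketched for distributional responses, where the fitted quantile functions are obtained by applying $H(\lambda)$ to the matrix of observed quantile functions. Several choices here are assumptions rather than the authors' implementation: the SLSQP solver from SciPy, the omission of the projection onto monotone functions, the simulated data, and the function names. The code uses the identity $[\hat{\Sigma} + \operatorname{diag}(\lambda^{-1})]^{-1} = \operatorname{diag}(\lambda)[\hat{\Sigma}\operatorname{diag}(\lambda) + I]^{-1}$ so that $\lambda_j = 0$ is allowed.

```python
import numpy as np
from scipy.optimize import minimize

def friso_objective(lam, Xc, sigma_hat, Q):
    """Objective (11) for quantile-function responses (no monotone projection)."""
    n, p = Xc.shape
    # [Sigma_hat + diag(1/lam)]^{-1} = diag(lam) (Sigma_hat diag(lam) + I)^{-1},
    # which stays well defined when some lam_j = 0.
    inner = np.diag(lam) @ np.linalg.inv(sigma_hat @ np.diag(lam) + np.eye(p))
    H = (np.ones((n, n)) + Xc @ inner @ Xc.T) / n
    # Average squared Wasserstein distance, with the integral approximated on the grid.
    return np.mean((Q - H @ Q) ** 2)

def friso(X, Q, tau):
    """Solve (11)-(13) numerically; SLSQP is an assumed solver choice."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    sigma_hat = Xc.T @ Xc / n
    result = minimize(
        friso_objective,
        x0=np.full(p, tau / p),
        args=(Xc, sigma_hat, Q),
        method="SLSQP",
        bounds=[(0.0, None)] * p,
        constraints=[{"type": "ineq", "fun": lambda lam: tau - np.sum(lam)}],
    )
    return np.clip(result.x, 0.0, None)

rng = np.random.default_rng(3)
n, p = 100, 3
X = rng.uniform(-1, 1, size=(n, p))
t = (np.arange(50) + 0.5) / 50
mu = 3.0 * X[:, 0] + rng.normal(scale=0.3, size=n)
Q = mu[:, None] + t[None, :]  # Y_i uniform on [mu_i, mu_i + 1]; only X1 matters
lam_hat = friso(X, Q, tau=1.0)
print(lam_hat)
```

With a small budget $\tau$, the optimizer is expected to spend most of it on the predictor that actually drives the response.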

FRiSO: Selection Consistency

With FRiSO, we indeed have selection consistency.

Definition

A set $I \subseteq \{1, 2, \ldots, p\}$ is called the important predictor set for global Fréchet regression of random objects $Y$ on multivariate random vectors $X$ if $I$ is the smallest set satisfying $Y \perp\!\!\!\perp X_{I^c} \mid X_I$, i.e., $Y$ is conditionally independent of $X_{I^c}$ given $X_I$.

Theorem

Under a few reasonable conditions, when $\tau = \tau_n \to \infty$ as $n \to \infty$, the solution $\hat{\lambda}(\tau_n)$ of (11) satisfies $\hat{\lambda}_j(\tau_n) \xrightarrow{p} \infty$ for $j \in I$ and $\hat{\lambda}_{j'}(\tau_n) \xrightarrow{p} 0$ for $j' \notin I$.

FRiSO: Steps 3 and 4 (Refitting)

3 Collect the set $\hat{I}(\tau) = \{j : \hat{\lambda}_j > 0\}$ to indicate the selected predictors

4 Refit the global Fréchet model with only the selected predictors to get the estimator $\hat{m}_{\text{refit}}(x; \tau)$:

$$\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^n \left[1 + (x_{\hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)})^T \hat{\Sigma}^{-1}_{\hat{I}(\tau), \hat{I}(\tau)} (X_{i, \hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)})\right] d^2(Y_i, y).$$

Why is this refitting helpful? Theorem 1 states that $\hat{\lambda}_j \xrightarrow{p} \infty$ for $j \in I$. This implies $\hat{\nu}_j = 1/\hat{\lambda}_j \xrightarrow{p} 0$. This would eliminate the penalty attached to an important predictor and eliminate bias. However, we have a finite sample size $n$, so in practice some penalty remains on the important predictors.

Refitting reduces the model's bias.

Theorem 1: Required Conditions

Condition [A]: For any $A$ satisfying $I \subseteq A \subseteq \{1, 2, \ldots, p\}$, we have

$$\left\{E\left(\Sigma_{A,A}^{-1}(X_A - \mu_A) \mid X_I\right)\right\}_I = \Sigma_{I,I}^{-1}(X_I - \mu_I)$$

and

$$\left\{E\left(\Sigma_{A,A}^{-1}(X_A - \mu_A) \mid X_I\right)\right\}_{A \setminus I} = 0,$$

where $A \setminus I = \{j : j \in A \text{ and } j \notin I\}$.

Condition [B]: $E_X d^2(m(X), L_A(X)) \equiv \int_{\mathcal{X}} d^2(m(x), L_A(x)) \, F_X(dx) > 0$ for any set $A$ satisfying $I \setminus A \ne \emptyset$.

Condition [C]: Assume that $E_X d^2(L(X), R(X; \nu)) > 0$ for any nonnegative vector $\nu$ satisfying $\|\nu_I\| > 0$.

Condition [D]: Assume that the gradient $\frac{\partial}{\partial \nu_I} E_X d^2(L(X), R(X; \nu)) = 0$ and the Hessian $\frac{\partial^2}{\partial \nu_I \partial \nu_I^T} E_X d^2(L(X), R(X; \nu))$ is strictly positive definite at any $\nu$ satisfying $\nu_j = 0$ for all $j \in I$.

Simulation Studies

Simulation Study Setup

Correlated scalar predictors $X_j \sim U(-1, 1)$, $j = 1, 2, \ldots, p$, are generated in two steps:

1 $Z = (Z_1, Z_2, \ldots, Z_p)^T$ is multivariate Gaussian with $E(Z_j) = 0$ and $\operatorname{cov}(Z_j, Z_{j'}) = \rho^{|j - j'|}$

2 $X_j = 2\Phi(Z_j) - 1$ for $j = 1, \ldots, p$, where $\Phi$ is the standard normal distribution function

We set $p = 10$ and $\rho = 0.5$.
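The two-step predictor generation can be sketched as follows. The function name `simulate_predictors` and the seed are illustrative; the defaults match the stated $p = 10$ and $\rho = 0.5$.

```python
import numpy as np
from scipy.stats import norm

def simulate_predictors(n, p=10, rho=0.5, seed=0):
    """Correlated U(-1, 1) predictors via a Gaussian copula construction."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])  # cov(Z_j, Z_j') = rho^|j - j'|
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return 2.0 * norm.cdf(Z) - 1.0  # each X_j is marginally U(-1, 1)

X = simulate_predictors(200)
print(X.shape)
```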

A Probability Distribution Example: $(\mathcal{Y}_1, d_W)$

The Fréchet regression function is given by

$$m(x) = E(Y(\cdot) \mid X = x) = \mu_0 + \beta(x_4 + x_8) + (\sigma_0 + \gamma x_1)\Phi^{-1}(\cdot).$$

Conditional on $X$, the random response $Y$ is generated by adding noise as follows: $Y = \mu + \sigma\Phi^{-1}$, with $\mu \mid X \sim N(\mu_0 + \beta(X_4 + X_8), \nu_1)$ and $\sigma \mid X \sim \text{Gamma}\left((\sigma_0 + \gamma X_1)^2/\nu_2,\ \nu_2/(\sigma_0 + \gamma X_1)\right)$ being independently sampled. The additional parameters are set as $\mu_0 = 0$, $\sigma_0 = 3$, $\beta = 3/4$, $\gamma = 1$, $\nu_1 = 1$, and $\nu_2 = 1/2$.

Training and validation sample sizes are both $n = 200$. The testing sample size is $\tilde{n} = 100n$. The validation set is used for the selection of $\tau$. The testing set is used to evaluate the performance of the estimated Fréchet regression.

Results: Selection Consistency

We obtain the optimal solution $\hat{\lambda}(\tau)$ over a pre-specified grid for $\tau$, $\{\tau_1 < \tau_2 < \ldots < \tau_K\}$. A solution path is considered to be consistent if the optimal solution $\hat{\lambda}(\tau_k)$ leads to the same sparsity pattern as the truth for some $k \in \{1, 2, \ldots, K\}$.

Table: Example 5.2.2 simulation results for variable selection with FRiSO for global Fréchet regression when the random objects are probability distributions.

                    Selection frequency                          Path
                    X1   X2  X3  X4   X5  X6  X7  X8   X9  X10   consistency
Without refitting   100  17  21  100  26  22  20  100  20  12    96
With refitting      100  0   0   100  4   1   0   100  0   0

Solution Path: One Random Repetition

[Figure: solution path of $\lambda_j$ for the predictors X1 through X10, one random repetition.]

Results: Model Performance

Denote the test set as $\{(\tilde{X}_i, \tilde{Y}_i) : i = 1, \ldots, \tilde{n}\}$. We collect:

1 $D_1 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m(\tilde{X}_i), \tilde{Y}_i)$ (noise)

2 $D_2 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m(\tilde{X}_i), \hat{m}(\tilde{X}_i))$ (deviation from true function)

3 $D_3 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}(\tilde{X}_i), \tilde{Y}_i)$ (prediction error)

4 $D_4 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m(\tilde{X}_i), \hat{m})$ (D2 without predictors)

5 $D_5 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}, \tilde{Y}_i)$ (D3 without predictors)

Table: Example 5.2.2 simulation results (prediction) of Fréchet regression for probability distributions with the Wasserstein metric. Mean (standard error) over 100 repetitions.

D1     D2             D3             D4             D5
1.311  0.122 (0.003)  1.433 (0.003)  0.616 (0.001)  1.929 (0.001)

Notes

In the definitions above, $\hat{m} = \arg\min_y \sum_{i=1}^n \int_0^1 \left[y^{-1}(t) - Y_i^{-1}(t)\right]^2 dt$.

D1, D2 and D4 cannot be observed from real data

D5 is a measure of out-of-sample SSTO

D3 is a measure of out-of-sample SSE

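An SPD Matrix Example: $(\mathcal{Y}_2, d_C)$

Let $I$ denote an $M \times M$ identity matrix and $U = (U_{i,j})$ denote an $M \times M$ matrix where $U_{i,j} = \mathbb{1}\{i < j\}$. We choose $M = 5$.

The Fréchet regression function is given by

$$m(x) = E(Y \mid X = x) = E(A)^T E(A),$$

where

$$E(A) = \{\mu_0 + \beta(x_1 + x_3) + \sigma_0 + \gamma(x_5 + x_7 + x_9)\} I + \{\sigma_0 + \gamma(x_5 + x_7 + x_9)\} U.$$

Noise Addition: $Y = A^T A$ where $A = (\mu + \sigma)I + \sigma U$, with $\mu \mid X \sim N(\mu_0 + \beta(X_1 + X_3), \nu_1)$ and $\sigma \mid X \sim \text{Gamma}\left((\sigma_0 + \gamma(X_5 + X_7 + X_9))^2/\nu_2,\ \nu_2/(\sigma_0 + \gamma(X_5 + X_7 + X_9))\right)$.

We set $\mu_0 = 3$, $\sigma_0 = 3$, $\beta = 2$, $\gamma = 3$, $\nu_1 = 1$, and $\nu_2 = 2$.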

Results: Selection Consistency

Table: Simulation results (variable selection) of Fréchet regression for SPD matrix data

                    Selection frequency                           Path
                    X1   X2  X3   X4  X5   X6  X7   X8  X9   X10  consistency
Without refitting   100  1   100  0   100  1   100  0   100  0    100
With refitting      100  0   99   0   100  1   100  0   100  0

Solution Path: One Random Repetition

[Figure: solution path of $\hat{\lambda}_j$ for the variables X1 through X10, one random repetition.]

Results: Model Performance

Table: Simulation results on prediction for Fréchet regression for SPD matrix data. Mean (standard error) over 100 repetitions.

D1 D2 D3 D4 D5

60.18 (4.34) 73.13 (4.37) 241.45 (4.33) 267.35 (4.33) 8.14 35.31 (4.27) 35.16 (4.34) with refitting

Bike Rental Distribution Data

Recall the bike rental data example from the beginning. To show the variable selection accuracy of FRiSO, we create 6 additional noise variables:

$X_1 \sim N(0, 1)$

$X_2 \sim \Gamma(\alpha = 1, \beta = 2)$

$X_3 \sim \Gamma(2, 2)$

$X_4 \sim \Gamma(15, 1)$

$X_5 \sim \Gamma(15, 2)$

$X_6 \sim N(0, 1) + \sqrt{\Gamma(35, 2)}$

All predictors are then standardized to have mean zero and variance one.

Variable Selection on Bike Rental Data

10-fold cross-validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal $\tau$ with refitting is 8.

[Figure: $\hat{\lambda}$ for each variable; legend: 2012, BW, Holiday, Humid, RBW, Temp, Wind, Work, X1, X2, X3, X4, X5, X6.]

A Generalized $R^2$

To assess the performance of our final model, which includes 2012, Temp, Work, RBW, BW, and Wind, we can compute a generalized $R^2$:

$$\hat{R}^2 = 1 - \frac{\sum_{i=1}^n d^2(Y_i, \hat{m}(X_i))}{\sum_{i=1}^n d^2(Y_i, \hat{m})} = 1 - \frac{SSE}{SSTO}.$$

The $\hat{R}^2 = 0.708$ for our final refitted model, after choosing $\tau = 8$.
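A generalized $R^2$ of this form can be sketched for distributional responses. The code assumes observed and fitted responses are stored as quantile functions on a common grid and uses the Wasserstein distance; the function name `generalized_r2` and the toy data are illustrative.

```python
import numpy as np

def generalized_r2(Q, fitted_Q):
    """Q, fitted_Q: (n, m) observed and fitted quantile functions on a common grid."""
    # SSE: sum of squared Wasserstein distances to the fitted values.
    sse = np.sum(np.mean((Q - fitted_Q) ** 2, axis=1))
    # SSTO: sum of squared Wasserstein distances to the sample Frechet mean.
    mean_q = Q.mean(axis=0)
    ssto = np.sum(np.mean((Q - mean_q) ** 2, axis=1))
    return 1.0 - sse / ssto

t = (np.arange(50) + 0.5) / 50
Q = np.stack([t + shift for shift in (0.0, 1.0, 2.0, 3.0)])
r2_perfect = generalized_r2(Q, Q)                               # fitted = observed
r2_null = generalized_r2(Q, np.tile(Q.mean(axis=0), (4, 1)))    # fitted = Frechet mean
print(r2_perfect, r2_null)
```

A perfect fit gives a value of one, and a fit that ignores the predictors gives zero, mirroring the classical $R^2$.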

Taxi Network Data

Taxi Network Construction

The New York City Taxi and Limousine Commission provides records on yellow taxi rides. To engineer 723 total 10 × 10 SPD matrices:

1 Filter the data on the month of January 2016.

2 Filter on observations occurring in Manhattan.

3 For each hour, we collect the number of pairwise connections between neighborhoods based on taxi pick-ups and drop-offs.

Taxi Ride Potential Predictors (Averaged per Hour)

Ave. Distance: Mean distance travelled, standardized

Ave. Fare: Mean fare, standardized

Ave. Passengers: Mean number of passengers, standardized

Ave. Tip: Mean tip, standardized

Cash: Sum of cash indicators for type of payment, standardized

Credit: Sum of credit indicators for type of payment, standardized

Dispute: Sum of dispute indicators for type of payment, standardized

Free: Sum of free indicators for type of payment, standardized

Late Hour: Indicator for the hour being between 11pm and 5am, standardized

Vendor: Sum of the vendor indicators, standardized (in the original data, the vendors for the recording devices installed in each taxi are

Potential Weather Predictors (Averaged per Day)

From https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date, we further collect New York City weather history for January 2016.

Day's Ave. Humid.: Daily mean humidity, standardized

Day's Ave. Press.: Daily mean barometric pressure, standardized

Day's Ave. Temp.: Daily mean temperature, standardized

Day's Ave. Wind: Daily mean windspeed, standardized

Day's Total Precip.: Daily total precipitation, standardized

This then leads to a total of fifteen potential predictors.

Variable Selection on Taxi Ride Data

Validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal $\tau$ with refitting is 4.5. The $\hat{R}^2 = 0.429$.

[Figure: $\hat{\lambda}$ for each variable; legend: Ave. Distance, Ave. Fare, Ave. Passengers, Ave. Tip, Cash, Credit, Day's Ave. Humid., Day's Ave. Press., Day's Ave. Temp., Day's Ave. Wind, Day's Total Precip., Dispute, Free, Late Hour.]

Concluding Remarks

We have done the following:

1 Developed a novel variable selection method for global Fr´echet regression

2 Proven FRiSO satisfies selection consistency under certain conditions

3 Shown these conditions hold for useful examples

4 Demonstrated through simulation and real data examples that FRiSO performs well with finite n

Future Work

Future work will include

Applying our method in collaborations with other departments/researchers

Developing LRT for all models and further inference

Exploring the effect of metric choice on analysis

Thank you!

Thank you all for joining! Please reach out if you have any questions or would like to discuss this project.

Some Selected References

Fan, J. and J. Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20 (1), 101-148.

Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'institut Henri Poincaré 10 (4), 215-310.

Petersen, A. and H.-G. Müller (2019). Fréchet regression for random objects with Euclidean predictors. The Annals of Statistics 47 (2), 691-719.

Wu, Y. (2020). Can't ridge regression perform variable selection? Technometrics, in press.
