Variable Selection for Global Fréchet Regression
Danielle C. Tucker¹, Yichao Wu¹, Hans-Georg Müller²
¹University of Illinois at Chicago
²UC Davis
September 2021
Research Focus: Complex Data Types
Table of Contents
1 Motivating Example
2 Global Fréchet Regression
3 FRiSO: Fréchet Ridge Selection Operator
4 Simulation Studies
5 Bike Rental Distribution Data
6 Taxi Network Data
7 Concluding Remarks
Motivating Example
Motivation: Bike Rental Data
Let’s look at a data set collected by Capital Bikeshare in Washington D.C.
This data set spans the years 2011 and 2012 for a total of 731 days. For each day, there are 24 observations of bike rental counts and predictors which include
RBW: Indicator of really bad weather (snowy and/or rainy)
Holiday: Indicator of a public holiday celebrated in Washington D.C.
2012: Indicator of the year 2012
Humid: Continuous variable of standardized humidity
Temp: Continuous variable of standardized temperature
Response Y: Each day's distribution of rentals, represented by its 24 observed quantiles.
Bike Rental Data Snippet
Date Count RBW Holiday 2012 Humidity Temperature
2011-01-01 16 0 0 0 0.81 0.24
2011-01-01 40 0 0 0 0.80 0.22
2011-01-01 32 0 0 0 0.80 0.22
2011-01-01 13 0 0 0 0.75 0.24
2011-01-01 1 0 0 0 0.75 0.24
2011-01-01 1 0 0 0 0.75 0.24
...
2011-01-01 35 1 0 0 0.88 0.42
2011-01-01 37 1 0 0 0.88 0.42
2011-01-01 36 0 0 0 0.87 0.40
2011-01-01 34 0 0 0 0.87 0.40
2011-01-01 28 0 0 0 0.94 0.40
Example Bike Distributions
How to Extend Existing Methods?
Classical Linear Regression
$Y = \beta_0 + X^T\beta + \varepsilon$
with a p-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors ε with mean zero and finite variance. We collect a random sample
$\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:
1 Estimate β0 and β
2 Predict a future observation at any x in the domain of interest $\mathcal{X}$
Here, what kind of Y are we working with? Y ∈ R: Euclidean space.
From Euclidean Space to General Metric Space
Definition
A metric space is an ordered pair (Y, d) where Y is a set and d is a metric on Y, i.e., a function
d : Y × Y → R such that for any y1, y2, y3 ∈ Y, the following hold:
1 d(y1, y2) = 0 ⇐⇒ y1 = y2 (identity of indiscernibles)
2 d(y1, y2) = d(y2, y1) (symmetry)
3 d(y1, y3) ≤ d(y1, y2) + d(y2, y3) (triangle inequality)
Examples
Probability distributions + Wasserstein metric
Symmetric positive definite M × M matrices + Cholesky Decomposition metric
Spheres in R3 + geodesic distance
Probability Distribution Metric Space
Definition
Let Y1 be the set of probability distributions. The 2-Wasserstein distance between two distributions with CDFs H(·) and G(·) is defined as
$d_W(H, G) = \sqrt{\int_0^1 \left(H^{-1}(t) - G^{-1}(t)\right)^2 \, dt}.$
We denote (Y1, dW) as the metric space of probability distributions equipped with the Wasserstein distance.
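A minimal numerical sketch of this distance, assuming each distribution is stored as its quantile function evaluated on a common, equally spaced grid in (0, 1). The function name `wasserstein2` and the midpoint grid are illustrative choices, not part of the original slides.

```python
import numpy as np
from statistics import NormalDist

def wasserstein2(q_h, q_g):
    """Approximate 2-Wasserstein distance from two quantile functions
    sampled on the same equally spaced grid of t values in (0, 1)."""
    q_h = np.asarray(q_h, dtype=float)
    q_g = np.asarray(q_g, dtype=float)
    # The mean over an equally spaced grid approximates the integral on [0, 1].
    return float(np.sqrt(np.mean((q_h - q_g) ** 2)))

# Midpoint grid avoids t = 0 and t = 1, where normal quantiles are infinite.
t_grid = (np.arange(200) + 0.5) / 200
q_std = np.array([NormalDist().inv_cdf(t) for t in t_grid])  # N(0, 1)
q_shift = q_std + 1.0                                        # N(1, 1)
dist_shift = wasserstein2(q_std, q_shift)
```

Shifting a distribution by a constant c moves every quantile by c, so the distance between N(0, 1) and N(1, 1) is 1, which the grid approximation reproduces.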
Symmetric, Positive Definite Matrix Metric Space
Definition
Let Y2 be the set of symmetric, positive definite (SPD) matrices. Let P1 and P2 be two SPD matrices. Then, under the Cholesky decomposition, we can write $P_1 = (P_1^{1/2})^T P_1^{1/2}$ and $P_2 = (P_2^{1/2})^T P_2^{1/2}$, where $P_1^{1/2}$ and $P_2^{1/2}$ are upper triangular matrices with positive diagonal entries. We then define the Cholesky decomposition distance between P1 and P2 as
$d_C(P_1, P_2) = \sqrt{\operatorname{trace}\left[(P_1^{1/2} - P_2^{1/2})^T (P_1^{1/2} - P_2^{1/2})\right]}.$
We denote (Y2, dC) as the metric space of SPD matrices equipped with the Cholesky decomposition distance.
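The Cholesky decomposition distance can be sketched in a few lines with NumPy. The function name `cholesky_distance` is hypothetical; note that `np.linalg.cholesky` returns the lower triangular factor, so it is transposed to match the upper triangular convention used above.

```python
import numpy as np

def cholesky_distance(p1, p2):
    """Cholesky decomposition distance between two SPD matrices."""
    # np.linalg.cholesky returns lower triangular L with P = L L^T,
    # so the upper triangular factor P^{1/2} of the definition is L^T.
    u1 = np.linalg.cholesky(np.asarray(p1, dtype=float)).T
    u2 = np.linalg.cholesky(np.asarray(p2, dtype=float)).T
    diff = u1 - u2
    # trace(diff^T diff) is the squared Frobenius norm of the difference.
    return float(np.sqrt(np.trace(diff.T @ diff)))

p_a = np.eye(2)          # Cholesky factor: identity
p_b = 4.0 * np.eye(2)    # Cholesky factor: 2 * identity
dist_ab = cholesky_distance(p_a, p_b)
```

Here the two factors differ by the identity matrix, so the distance is the square root of trace(I) = 2.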
Fréchet Mean and Variance
Fréchet extended the concepts of mean and variance to random objects in 1948.
Let Y be a random object in the metric space (Y, d). Then the Fréchet mean and variance of Y are defined as
Fréchet mean and variance
$y_\oplus = \arg\min_{y \in \mathcal{Y}} E\left(d^2(Y, y)\right)$ and $V_\oplus = E\left(d^2(Y, y_\oplus)\right)$, (1)
respectively.
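As a sanity check of definition (1), the familiar Euclidean special case can be worked out directly. This sketch assumes the space is the real line with d(y1, y2) = |y1 − y2| and that Y has a finite second moment.

```latex
\[
E\,(Y - y)^2 \;=\; E\,(Y - E\,Y)^2 + (E\,Y - y)^2
\quad\Longrightarrow\quad
y_\oplus = E(Y), \qquad V_\oplus = \operatorname{Var}(Y).
\]
```

The cross term vanishes because E(Y − E Y) = 0, so the Fréchet mean and variance reduce to the usual mean and variance.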
Global Fréchet Regression
Fréchet Regression Function
Petersen and Müller (2019) introduced the general concept of a Fréchet regression function of Y given X = x, with $x = (x_1, \ldots, x_p)^T$, as
Fréchet Regression
$m_\oplus(x) = \arg\min_{y \in \mathcal{Y}} M_\oplus(y, x)$, (2)
where $M_\oplus(\cdot, x) = E\left(d^2(Y, \cdot) \mid X = x\right)$.
Global Fréchet Regression:
An Extension of Classical Linear Regression
Classical Linear Regression
$Y = \beta_0 + X^T\beta + \varepsilon$
with a p-dimensional predictor vector $X \in \mathcal{X} \subset \mathbb{R}^p$ and random errors ε with mean zero and finite variance. We collect a random sample
$\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$. We then have two main goals:
1 Estimate β0 and β
2 Predict a future observation at any x in the domain of interest $\mathcal{X}$.
Equivalent Interpretation of Linear Regression
For the second objective, with a little bit of work, we can show that this future prediction can be equivalently interpreted as the minimizer of
$\min_{y} \sum_{i=1}^{n} \left[1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X})\right] (Y_i - y)^2$. (3)
We can replace the squared difference $(Y_i - y)^2$ with a more general squared metric distance $d^2(Y_i, y)$, and we get the global Fréchet regression model!
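The equivalence behind (3) can be checked numerically. The sketch below assumes the covariance estimate uses divisor n (with this choice the match is exact); the helper name `global_weights` and the simulated data are illustrative only.

```python
import numpy as np

def global_weights(x, X):
    """Weights s_i(x) = 1 + (x - Xbar)^T Sigma_hat^{-1} (X_i - Xbar)."""
    X = np.asarray(X, dtype=float)
    xbar = X.mean(axis=0)
    Xc = X - xbar
    sigma_hat = Xc.T @ Xc / X.shape[0]   # covariance estimate with divisor n
    return 1.0 + Xc @ np.linalg.solve(sigma_hat, np.asarray(x, dtype=float) - xbar)

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
Y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)
x_new = np.array([0.3, -0.2, 0.5])

# Minimizer of the weighted sum of squares in (3): a weighted average of Y_i.
s = global_weights(x_new, X)
weighted_pred = np.sum(s * Y) / np.sum(s)

# Ordinary least squares prediction with an intercept, for comparison.
design = np.column_stack([np.ones(n), X])
coef = np.linalg.lstsq(design, Y, rcond=None)[0]
ols_pred = coef[0] + x_new @ coef[1:]
```

The weights sum to n because the centered predictors sum to zero, and `weighted_pred` agrees with `ols_pred` up to rounding error.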
Global Fréchet Regression
The global Fréchet regression model is characterized by
Global Fréchet Regression
$m_\oplus(x) = \arg\min_{y \in \mathcal{Y}} E\left(d^2(Y, y) \mid X = x\right) = g_\oplus(x)$ for any $x \in \mathcal{X}$, where (4)
$g_\oplus(x) = \arg\min_{y \in \mathcal{Y}} E_{(X,Y)}\left\{\left[1 + (x - \mu)^T \Sigma^{-1} (X - \mu)\right] d^2(Y, y)\right\}$, (5)
with μ = E(X) and Σ = Cov(X), and where $g_\oplus(x)$ is assumed to be well-defined.
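For distributional responses under the Wasserstein metric, the sample version of (5) has a simple form when each response is stored as a quantile function on a common grid: take the weighted average of the quantile functions, then project it onto nondecreasing functions. The sketch below is one possible implementation under that representation, not the authors' code; `project_monotone` (a pool-adjacent-violators projection) and the other names are assumptions for illustration.

```python
import numpy as np

def global_weights(x, X):
    """Weights s_i(x) = 1 + (x - Xbar)^T Sigma_hat^{-1} (X_i - Xbar)."""
    xbar = X.mean(axis=0)
    Xc = X - xbar
    sigma_hat = Xc.T @ Xc / X.shape[0]
    return 1.0 + Xc @ np.linalg.solve(sigma_hat, np.asarray(x, dtype=float) - xbar)

def project_monotone(v):
    """Least-squares projection of v onto nondecreasing sequences."""
    blocks = []  # each block stores [sum of values, number of values]
    for value in v:
        blocks.append([float(value), 1])
        # Merge adjacent blocks while their means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    return np.concatenate([np.full(count, total / count) for total, count in blocks])

def global_frechet_wasserstein(x, X, Q):
    """Q[i] holds the quantile function of Y_i on a common grid."""
    weighted_mean = global_weights(x, X) @ Q / X.shape[0]
    return project_monotone(weighted_mean)
```

The projection matters because the weights can be negative for x far from the sample mean, in which case the weighted average need not be a valid quantile function.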
FRiSO: Fréchet Ridge Selection Operator
Variable Selection Objective
As in linear regression, it could be of great interest to determine which Xj in X = (X1, ..., Xp)T are important predictors, especially in this era of Big Data when p could be very large! Further, determining the important predictors could provide some capability for inference.
In linear regression, we have sparsity-encouraging penalty methods such as
LASSO
Ridge
SCAD
Problem: We have no coefficients in Fréchet regression!
$\hat{g}_\oplus(x) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left[1 + (x - \bar{X})^T \hat{\Sigma}^{-1} (X_i - \bar{X})\right] d^2(Y_i, y)$
Individually Penalized Ridge Regression
Wu (2020) developed a new variable selection method for linear regression based on ridge regression, called Individually Penalized Ridge Regression. Based on a random sample
$\{(X_i, Y_i) : i = 1, 2, \ldots, n\}$ from the classical linear regression model, one estimates the unknown regression coefficients β0 and β by solving
Individually Penalized Ridge Regression Estimator
$\min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} (Y_i - \beta_0 - X_i^T \beta)^2 + \frac{1}{2} \sum_{j=1}^{p} \nu_j \beta_j^2$ (6)
with ridge regularization parameters νj ≥ 0 for j = 1, ..., p. Let's reparameterize this with λj = 1/νj. Note: a less important predictor Xj should receive a heavier penalty, that is, a larger νj and hence a smaller λj.
Individually Penalized Ridge Regression
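Variable selection for the linear model then proceeds by solving
Individually Penalized Ridge Regression
$\min_{\lambda} \; \langle Y - H(\lambda) Y, \; Y - H(\lambda) Y \rangle$ (7)
subject to $\lambda_j \ge 0$, j = 1, ..., p; (8)
$\sum_{j=1}^{p} \lambda_j \le \tau$, (9)
where $\lambda^{-1} = (1/\lambda_1, \ldots, 1/\lambda_p)^T$ and
$H(\lambda) = \frac{1}{n} \left\{ 1_{n \times 1} 1_{n \times 1}^T + (X - 1_{n \times 1} \bar{X}^T) \left[\hat{\Sigma} + \operatorname{diag}(\lambda^{-1})\right]^{-1} (X - 1_{n \times 1} \bar{X}^T)^T \right\}$.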
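To make the hat matrix H(λ) concrete, here is a small sketch of it together with the objective in (7). It assumes all λj are strictly positive (λj = 0 corresponds to an infinite penalty and would need special handling) and that the covariance estimate uses divisor n; the function names are hypothetical.

```python
import numpy as np

def ipr_hat_matrix(X, lam):
    """H(lambda) for individually penalized ridge regression (lam > 0)."""
    X = np.asarray(X, dtype=float)
    lam = np.asarray(lam, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                # X - 1 Xbar^T
    sigma_hat = Xc.T @ Xc / n              # covariance estimate with divisor n
    middle = np.linalg.inv(sigma_hat + np.diag(1.0 / lam))
    return (np.ones((n, n)) + Xc @ middle @ Xc.T) / n

def ipr_objective(X, Y, lam):
    """Residual inner product <Y - H(lambda)Y, Y - H(lambda)Y> from (7)."""
    resid = Y - ipr_hat_matrix(X, lam) @ Y
    return float(resid @ resid)
```

Multiplying H(λ) by Y gives the fitted values of the ridge problem (6) with νj = 1/λj, so the objective in (7) is the residual sum of squares of that fit viewed as a function of λ.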
Individually Penalized Fréchet Regression
As before, we can show that extending individually penalized ridge regression to the more general Fréchet regression framework yields the estimator $\hat{R}_\oplus(x; \lambda)$:
Individually Penalized Fréchet Regression Estimator
$\hat{R}_\oplus(x; \lambda) = \arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left\{1 + (x - \bar{X})^T \left[\hat{\Sigma} + \operatorname{diag}(\lambda)\right]^{-1} (X_i - \bar{X})\right\} d^2(Y_i, y)$ (10)
for the prediction of the Fréchet regression function $m_\oplus(x)$ at any location x in the domain of interest $\mathcal{X}$.
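FRiSO: Fréchet Ridge Selection Operator
1 Given the metric space (Y, d), $\hat{R}_\oplus(x; \lambda^{-1})$ is
$\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left\{1 + (x - \bar{X})^T \left[\hat{\Sigma} + \operatorname{diag}(\lambda^{-1})\right]^{-1} (X_i - \bar{X})\right\} d^2(Y_i, y)$
2 Then we find $\hat{\lambda}(\tau) = (\hat{\lambda}_1(\tau), \ldots, \hat{\lambda}_p(\tau))^T$ for τ ≥ 0 by solving
$\min_{\lambda} \; \frac{1}{n} \sum_{i=1}^{n} d^2\left(Y_i, \hat{R}_\oplus(X_i; \lambda^{-1})\right)$ (11)
subject to $\lambda_j \ge 0$, j = 1, ..., p; (12)
$\sum_{j=1}^{p} \lambda_j \le \tau$. (13)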
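The criterion in (11) can be evaluated directly once the responses have a vector representation. The sketch below is a simplified illustration, not the authors' implementation: it assumes each response is a row of `Q` (for example a quantile function on a grid), uses the mean squared difference between rows as the squared distance, requires λj > 0, and omits the monotone projection that distributional responses would need. Minimizing over λ under constraints (12) and (13) would additionally require a constrained optimizer, which is not shown.

```python
import numpy as np

def penalized_weights(X, nu):
    """Row i holds the weights that define R_hat(X_i; nu)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    sigma_hat = Xc.T @ Xc / n
    return 1.0 + Xc @ np.linalg.solve(sigma_hat + np.diag(nu), Xc.T)

def friso_objective(lam, X, Q):
    """Criterion (11) for vector-valued responses Q[i], with lam > 0."""
    lam = np.asarray(lam, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = Q.shape[0]
    fits = penalized_weights(X, 1.0 / lam) @ Q / n   # R_hat(X_i; 1/lam) for each i
    return float(np.mean(np.mean((Q - fits) ** 2, axis=1)))

# Toy data (illustrative): only the first predictor matters.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 3))
Q = (3.0 * X[:, 0] + 0.1 * rng.normal(size=200))[:, None]
obj_signal = friso_objective([5.0, 1e-6, 1e-6], X, Q)   # budget on X1
obj_noise = friso_objective([1e-6, 5.0, 1e-6], X, Q)    # budget on X2
```

Spending the budget τ on the relevant predictor gives a much smaller criterion value than spending it on an irrelevant one, which is the mechanism that drives selection.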
FRiSO: Selection Consistency
With FRiSO, we indeed have selection consistency.
Definition
A set I ⊆ {1, 2, ..., p} is called the important predictor set for global Fréchet regression of random objects Y on multivariate random vectors X, if I is the smallest set satisfying $Y \perp\!\!\!\perp X_{I^c} \mid X_I$, i.e., Y is conditionally independent of $X_{I^c}$ given $X_I$.
Theorem
Under a few reasonable conditions, when τ = τn → ∞ as n → ∞, the solution $\hat{\lambda}(\tau_n)$ of (11) satisfies $\hat{\lambda}_j(\tau_n) \to_p \infty$ for j ∈ I and $\hat{\lambda}_{j'}(\tau_n) \to_p 0$ for j′ ∉ I.
FRiSO: Steps 3 and 4 (Refitting)
3 Collect the set $\hat{I}(\tau) = \{j : \hat{\lambda}_j > 0\}$ to indicate the selected predictors
4 Refit the global Fréchet model with only the selected predictors to get the estimator $\hat{m}^{\mathrm{refit}}_\oplus(x; \tau)$:
$\arg\min_{y \in \mathcal{Y}} \sum_{i=1}^{n} \left[1 + (x_{\hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)})^T \hat{\Sigma}^{-1}_{\hat{I}(\tau), \hat{I}(\tau)} (X_{i, \hat{I}(\tau)} - \bar{X}_{\hat{I}(\tau)})\right] d^2(Y_i, y)$.
Why is this refitting helpful? Theorem 1 states that $\hat{\lambda}_j \to_p \infty$ for j ∈ I. This implies $\hat{\nu}_j = 1/\hat{\lambda}_j \to_p 0$, which would eliminate the penalty attached to an important predictor and the bias it causes. However, we have a finite sample size n, so some penalty, and hence some bias, remains.
Refitting reduces the model's bias.
Theorem 1: Required Conditions
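Condition [A]: For any A satisfying I ⊆ A ⊆ {1, 2, ..., p}, we have
$\left\{E\left(\Sigma_{A,A}^{-1}(X_A - \mu_A) \mid X_I\right)\right\}_I = \Sigma_{I,I}^{-1}(X_I - \mu_I)$
and
$\left\{E\left(\Sigma_{A,A}^{-1}(X_A - \mu_A) \mid X_I\right)\right\}_{A \setminus I} = 0$, where A \ I = {j : j ∈ A and j ∉ I}.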
Theorem 1: Required Conditions Cont.
Condition [B]: $E_X d^2(m_\oplus(X), L^A_\oplus(X)) \equiv \int_{\mathcal{X}} d^2(m_\oplus(x), L^A_\oplus(x)) \, F_X(dx) > 0$ for any set A satisfying I \ A ≠ ∅.
Condition [C]: Assume that $E_X d^2(L_\oplus(X), R_\oplus(X; \nu)) > 0$ for any nonnegative vector ν satisfying $\|\nu_I\| > 0$.
Condition [D]: Assume that the gradient $\frac{\partial}{\partial \nu_I} E_X d^2(L_\oplus(X), R_\oplus(X; \nu)) = 0$ and the Hessian $\frac{\partial^2}{\partial \nu_I \partial \nu_I^T} E_X d^2(L_\oplus(X), R_\oplus(X; \nu))$ is strictly positive definite at any ν satisfying $\nu_j = 0$ for all j ∈ I.
Simulation Studies
Simulation Study Setup
Correlated scalar predictors Xj ∼ U (−1, 1), j = 1, 2, . . . , p, are generated in two steps:
1 $Z = (Z_1, Z_2, \ldots, Z_p)^T$ is multivariate Gaussian with $E(Z_j) = 0$ and $\mathrm{cov}(Z_j, Z_{j'}) = \rho^{|j - j'|}$
2 $X_j = 2\Phi(Z_j) - 1$ for j = 1, ..., p, where Φ is the standard normal distribution function
We set p = 10 and ρ = 0.5.
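The two-step construction above can be sketched as follows. The function name `simulate_predictors` and the `seed` argument are illustrative; Φ is computed from the error function in the standard library.

```python
import math
import numpy as np

def simulate_predictors(n, p=10, rho=0.5, seed=None):
    """Correlated U(-1, 1) predictors via a Gaussian copula."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])    # cov(Z_j, Z_j') = rho^|j - j'|
    Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
    # Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(Z / math.sqrt(2.0)))
    return 2.0 * Phi - 1.0                               # X_j = 2 Phi(Z_j) - 1

X_sim = simulate_predictors(500, seed=1)
```

Each column is marginally uniform on (−1, 1) because Φ(Zj) is uniform on (0, 1), while neighboring columns stay positively correlated through Z.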
A Probability Distribution Example: (Y1, dW)
The Fréchet regression function is given by
$m_\oplus(x) = E(Y(\cdot) \mid X = x) = \mu_0 + \beta(x_4 + x_8) + (\sigma_0 + \gamma x_1)\Phi^{-1}(\cdot)$.
Conditional on X, the random response Y is generated by adding noise as follows: $Y = \mu + \sigma\Phi^{-1}$ with $\mu \mid X \sim N(\mu_0 + \beta(X_4 + X_8), \nu_1)$ and $\sigma \mid X \sim \mathrm{Gamma}\left((\sigma_0 + \gamma X_1)^2/\nu_2, \; \nu_2/(\sigma_0 + \gamma X_1)\right)$ being independently sampled. The additional parameters are set as µ0 = 0, σ0 = 3, β = 3/4, γ = 1, ν1 = 1, and ν2 = 1/2.
Training and validation sample sizes are both n = 200. The testing sample size is ñ = 100n. The validation set is used to select τ. The testing set is used to evaluate the performance of the estimated Fréchet regression.
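A sketch of this response-generating mechanism, with each response stored as a quantile function on a grid. Two readings are assumed here and are not confirmed by the slides: ν1 is treated as a variance, and the Gamma distribution is parameterized by (shape, scale), which gives σ a conditional mean of σ0 + γX1 and variance ν2. The function name and arguments are illustrative.

```python
import numpy as np
from statistics import NormalDist

def simulate_distribution_responses(X, t_grid, mu0=0.0, sigma0=3.0, beta=0.75,
                                    gamma=1.0, nu1=1.0, nu2=0.5, seed=None):
    """Return Q with Q[i, k] = mu_i + sigma_i * Phi^{-1}(t_k)."""
    rng = np.random.default_rng(seed)
    mean_sigma = sigma0 + gamma * X[:, 0]                  # depends on X_1
    # mu | X ~ N(mu0 + beta (X_4 + X_8), nu1), with nu1 read as a variance
    mu = rng.normal(mu0 + beta * (X[:, 3] + X[:, 7]), np.sqrt(nu1))
    # sigma | X ~ Gamma(shape, scale): mean = mean_sigma, variance = nu2
    sigma = rng.gamma(shape=mean_sigma ** 2 / nu2, scale=nu2 / mean_sigma)
    z = np.array([NormalDist().inv_cdf(t) for t in t_grid])  # Phi^{-1} on the grid
    return mu[:, None] + sigma[:, None] * z[None, :]

rng = np.random.default_rng(3)
X_demo = rng.uniform(-1.0, 1.0, size=(50, 10))
t_demo = (np.arange(20) + 0.5) / 20
Q_demo = simulate_distribution_responses(X_demo, t_demo, seed=4)
```

With σ0 = 3, γ = 1 and predictors in (−1, 1), the Gamma mean stays above 2, so every simulated σ is positive and each row of `Q_demo` is a valid, strictly increasing quantile function.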
Results: Selection Consistency
We obtain the optimal solution ˆλ(τ ) over a pre-specified grid for τ , {τ1< τ2< ... < τK}. A solution path is considered to be consistent if the optimal solution ˆλ(τk) leads to the same sparsity pattern as the truth for some k ∈ {1, 2, ..., K }.
Table: Example 5.2.2 simulation results for variable selection with FRiSO for global Fréchet regression when the random objects are probability distributions.
                   Selection frequency                              Path
                   X1   X2  X3  X4   X5  X6  X7  X8   X9  X10  consistency
Without refitting  100  17  21  100  26  22  20  100  20  12   96
With refitting     100  0   0   100  4   1   0   100  0   0
Solution Path: One Random Repetition
[Figure: solution paths of λj for predictors X1 to X10]
Results: Model Performance
Denote the test set as {( ˜Xi, ˜Yi) : i = 1, ..., ˜n}. We collect:
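1 $D_1 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \tilde{Y}_i)$ (noise)
2 $D_2 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \hat{m}_\oplus(\tilde{X}_i))$ (deviation from true function)
3 $D_3 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}_\oplus(\tilde{X}_i), \tilde{Y}_i)$ (prediction error)
4 $D_4 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(m_\oplus(\tilde{X}_i), \hat{m}^*)$ (D2 without predictors)
5 $D_5 = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} d_W^2(\hat{m}^*, \tilde{Y}_i)$ (D3 without predictors)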
Table: Example 5.2.2 simulation results (prediction) of Fréchet regression for probability distributions with the Wasserstein metric.
Mean (standard error) over 100 repetitions
D1     D2             D3             D4             D5
1.311  0.122 (0.003)  1.433 (0.003)  0.616 (0.001)  1.929 (0.001)
Notes
On the previous slide, $\hat{m}^* = \arg\min_{y} \sum_{i=1}^{n} \int_0^1 \left[y^{-1}(t) - Y_i^{-1}(t)\right]^2 dt$.
D1, D2 and D4 cannot be observed from real data
D5 is a measure of out-of-sample SSTO
D3 is a measure of out-of-sample SSE
An SPD Matrix Example: (Y2, dC)
Let I denote the M × M identity matrix and $U = (U_{i,j})$ denote the M × M matrix with $U_{i,j} = I\{i < j\}$, the indicator that i < j. We choose M = 5.
The Fréchet regression function is given by
$m_\oplus(x) = E(Y \mid X = x) = E(A)^T E(A)$, where
$E(A) = \{\mu_0 + \beta(x_1 + x_3) + \sigma_0 + \gamma(x_5 + x_7 + x_9)\} I + \{\sigma_0 + \gamma(x_5 + x_7 + x_9)\} U$.
Noise addition: $Y = A^T A$, where $A = (\mu + \sigma) I + \sigma U$, with $\mu \mid X \sim N(\mu_0 + \beta(X_1 + X_3), \nu_1)$ and
$\sigma \mid X \sim \mathrm{Gamma}\left((\sigma_0 + \gamma(X_5 + X_7 + X_9))^2/\nu_2, \; \nu_2/(\sigma_0 + \gamma(X_5 + X_7 + X_9))\right)$.
We set µ0 = 3, σ0 = 3, β = 2, γ = 3, ν1 = 1, and ν2 = 2.
Results: Selection Consistency
Table: Simulation results (variable selection) of Fréchet regression for SPD matrix data
                   Selection frequency                                Path
                   X1   X2  X3   X4  X5   X6  X7   X8  X9   X10  consistency
Without refitting  100  1   100  0   100  1   100  0   100  0    100
With refitting     100  0   99   0   100  1   100  0   100  0
Solution Path: One Random Repetition
[Figure: solution paths of λ̂j for predictors X1 to X10]
Results: Model Performance
Table: Simulation results on prediction for Fréchet regression for SPD matrix data
Mean (standard error) over 100 repetitions
D1 D2 D3 D4 D5
60.18 (4.34) 73.13 (4.37) 241.45 (4.33) 267.35 (4.33) 8.14 35.31 (4.27) 35.16 (4.34) with refitting
Bike Rental Distribution Data
Recall the bike rental data example from the beginning. To show the variable selection accuracy of FRiSO, we create 6 additional noise variables:
X1 ∼ N(0, 1)
X2 ∼ Γ(α = 1, β = 2)
X3 ∼ Γ(2, 2)
X4 ∼ Γ(15, 1)
X5 ∼ Γ(15, 2)
X6 ∼ N(0, 1) + √Γ(35, 2)
All predictors are then standardized to have mean zero and variance one.
Variable Selection on Bike Rental Data
10-fold cross-validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal τ with refitting is 8.
[Figure: solution paths of λ̂ for the predictors 2012, BW, Holiday, Humid, RBW, Temp, Wind, Work and the noise variables X1 to X6, with the cross-validation error curves]
A Generalized R²
To gain some insight into the performance of our final model, which includes 2012, Temp, Work, RBW, BW, and Wind, we can compute a generalized R²:
$\hat{R}^2_\oplus = 1 - \frac{\sum_{i=1}^{n} d^2(Y_i, \hat{m}_\oplus(X_i))}{\sum_{i=1}^{n} d^2(Y_i, \hat{m}^*)} = 1 - \frac{\mathrm{SSE}}{\mathrm{SSTO}}$.
We obtain $\hat{R}^2_\oplus = 0.708$ for our final refitted model, after choosing τ = 8.
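For distributional responses, the generalized R² can be computed from quantile functions on a common grid. This sketch assumes the squared Wasserstein distance is approximated by the mean squared difference over the grid and that the no-predictor fit m̂* is the pointwise average of the observed quantile functions; the function name is hypothetical.

```python
import numpy as np

def generalized_r2(Q_obs, Q_fit):
    """Generalized R^2 = 1 - SSE / SSTO for quantile-function responses."""
    Q_obs = np.asarray(Q_obs, dtype=float)
    Q_fit = np.asarray(Q_fit, dtype=float)
    q_star = Q_obs.mean(axis=0)                                # fit without predictors
    sse = np.sum(np.mean((Q_obs - Q_fit) ** 2, axis=1))        # sum of d^2(Y_i, fit_i)
    ssto = np.sum(np.mean((Q_obs - q_star) ** 2, axis=1))      # sum of d^2(Y_i, m*)
    return float(1.0 - sse / ssto)

Q_obs_demo = np.array([[0.0, 1.0], [2.0, 3.0]])
r2_perfect = generalized_r2(Q_obs_demo, Q_obs_demo)
r2_null = generalized_r2(Q_obs_demo, np.tile(Q_obs_demo.mean(axis=0), (2, 1)))
```

A perfect fit gives a value of 1 and a fit that ignores the predictors gives 0, mirroring the classical coefficient of determination.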
Taxi Network Data
Taxi Network Construction
The New York City Taxi and Limousine Commission provides records on yellow taxi rides. To engineer 723 total 10 × 10 SPD matrices:
1 Filter the data on the month of January 2016.
2 Filter on observations occurring in Manhattan.
3 For each hour, we collect the number of pairwise connections between neighborhoods based on taxi pick-ups and drop-offs.
Taxi Ride Potential Predictors (Averaged per Hour)
Ave. Distance: Mean distance travelled, standardized
Ave. Fare: Mean fare, standardized
Ave. Passengers: Mean number of passengers, standardized
Ave. Tip: Mean tip, standardized
Cash: Sum of cash indicators for type of payment, standardized
Credit: Sum of credit indicators for type of payment, standardized
Dispute: Sum of dispute indicators for type of payment, standardized
Free: Sum of free indicators for type of payment, standardized
Late Hour: Indicator for the hour being between 11pm and 5am, standardized
Vendor: Sum of the vendor indicators, standardized (in the original data, the vendors for the recording devices installed in each taxi are
Potential Weather Predictors (Averaged per Day)
From https://www.wunderground.com/history/daily/us/ny/new-york-city/KLGA/date, we further collect New York City weather history for January 2016.
Day's Ave. Humid.: Daily mean humidity, standardized
Day's Ave. Press.: Daily mean barometric pressure, standardized
Day's Ave. Temp.: Daily mean temperature, standardized
Day's Ave. Wind: Daily mean windspeed, standardized
Day's Total Precip.: Daily total precipitation, standardized
This then leads to a total of fifteen potential predictors.
Variable Selection on Taxi Ride Data
Validation errors with and without refitting are shown as the solid and dashed black lines, respectively. The optimal τ with refitting is 4.5. The Rˆ⊕2 = 0.429.
[Figure: solution paths of λ̂ for the candidate taxi ride and weather predictors, with the validation error curves]
Concluding Remarks
We have done the following:
1 Developed a novel variable selection method for global Fr´echet regression
2 Proven FRiSO satisfies selection consistency under certain conditions
3 Shown these conditions hold for useful examples
4 Demonstrated through simulation and real data examples that FRiSO performs well with finite n
Future Work
Future work will include
Applying our method in collaborations with other departments/researchers
Developing LRT for all models and further inference
Exploring the effect of metric choice on analysis
Thank you!
Thank you all for joining! Please reach out if you have
any questions or would like to discuss this project.
Some Selected References
Fan, J. and J. Lv (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20 (1), 101-148.
Fréchet, M. (1948). Les éléments aléatoires de nature quelconque dans un espace distancié. Annales de l'Institut Henri Poincaré 10 (4), 215-310.
Petersen, A. and H.-G. Müller (2019). Fréchet regression for random objects with Euclidean predictors. The Annals of Statistics 47 (2), 691-719.
Wu, Y. (2020). Can't ridge regression perform variable selection? Technometrics, in press.