Missing Value Imputation for Mixed Data via Gaussian Copula
Yuxuan Zhao
ORIE 4741, December 1 2020
Table of Contents
1 Motivation
2 Gaussian copula model
3 Demo
4 More about missing value imputation
Motivation
Let's first look at a General Social Survey (GSS) dataset.
Figure 1: 2538 participants and 17 questions, of which one is continuous (age, 72 levels) and 16 are ordinal (2 to 48 levels). In total, 25.6% of entries are missing, and no row is complete. Data available at https://gssdataexplorer.norc.org/.
Variable description
Figure 2: Description of the 17 variables used.
Example variables
- General happiness: Taken all together, how would you say things are these days–would you say that you are very happy, pretty happy, or not too happy?
- Respondent's income: In which of these groups did your earnings from (OCCUPATION IN OCC) for last year–[the previous year]–fall? That is, before taxes or other deductions. Just tell me the letter.
- How many people in contact in a typical weekday: On average, about how many people do you have contact with in a typical week day, including people you live with. Please tell me which of the following categories best matches your estimate.
- Weeks r. worked last year: In [the previous year] how many weeks did you work either full-time or part-time not counting work around the house–include paid vacations and sick leave?
Recap: GLRM imputes mixed data better than PCA
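Generalized low rank model: find low rank matrices $X \in \mathbb{R}^{n \times k}$ and $W \in \mathbb{R}^{k \times p}$ such that $XW$ approximates $Y \in \mathbb{R}^{n \times p}$ well:
$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \ell_j(Y_{ij}, x_i^T w_j) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$
- The loss $\ell_j$ can vary for different $j$.
- The row regularizers $r_i$ and column regularizers $\tilde{r}_j$ can vary.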
Great flexibility usually means many choices to make...
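To make the objective above concrete, here is a minimal sketch of one particular instance: quadratic loss for every column and quadratic regularization with a single parameter. These choices, and the names `glrm_objective`, `mask`, and `lam`, are illustrative assumptions and not something the slides prescribe.

```python
import numpy as np

def glrm_objective(Y, mask, X, W, lam):
    """GLRM objective with quadratic loss and quadratic regularization.

    Y    : (n, p) data matrix, may hold NaN where unobserved
    mask : (n, p) boolean array, True on the observed set Omega
    X, W : (n, k) and (k, p) factors
    lam  : regularization weight (hypothetical name)
    """
    # Only observed entries contribute to the loss.
    residual = np.where(mask, Y - X @ W, 0.0)
    loss = np.sum(residual ** 2)
    # r_i(x_i) = lam * ||x_i||^2 and r~_j(w_j) = lam * ||w_j||^2.
    reg = lam * (np.sum(X ** 2) + np.sum(W ** 2))
    return loss + reg

# Tiny usage example with one missing entry.
Y = np.array([[1.0, np.nan], [3.0, 4.0]])
mask = ~np.isnan(Y)
X = np.array([[1.0], [2.0]])
W = np.array([[1.0, 2.0]])
value = glrm_objective(Y, mask, X, W, lam=0.5)
```

Even in this simplest instance, the rank `k` and the weight `lam` still have to be chosen by the user.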
GLRM: practical consideration
- What loss $\ell_j$ to choose?
- How to assign weights to $\ell_j$ when columns have different scales?
- What regularizers $r_i$, $\tilde{r}_j$ to use?
And there are tuning parameters...
- How to choose the rank $k$?
- If $r_i$ and $\tilde{r}_j$ are set to quadratic regularization with parameter $\lambda$, how to choose $\lambda$?
- Need to search over a two-dimensional grid.
Is the problem just about computation?
GLRM: low rank assumption
- Only works well when $Y$ can be approximated by a low rank matrix.
- Big data (large $n$ and large $p$) usually has low rank structure.
  - Movie rating datasets: many movies and many users.
- Long skinny data (large $n$ and small $p$) usually does not have low rank structure.
  - Social survey data: many participants, few questions.
Get over the low rank assumption
- Large $n$ allows learning more complex variable dependence than low rank structure.
- Statistical dependence structure: model the joint distribution.
  - Gaussian distribution for a quantitative vector:
    1. All 1-dimensional marginals are Gaussian.
    2. The joint $p$-dimensional distribution is multivariate Gaussian.
First, can we use a 1-dimensional Gaussian to model an ordinal/binary variable?
Histograms for some GSS variables
[Figure: histograms of counts for four GSS variables. General happiness: levels 1 to 3, from "Very happy" to "Not too happy". Respondent's income: levels 1 to 12, from "Less than $1000" to "more than $25000". How many people in contact in a typical weekday: levels 1 to 5, from "0-4 persons" to "50 or more". Weeks r. worked last year: 0 to 52.]
Generate ordinal data by thresholding Gaussian variable
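[Figure: the normal z value on the horizontal axis (ticks at -∞, -1, 0, 1, ∞) is mapped to the ordinal x values 1, 2, 3.]
- Select thresholds to ensure the desired class proportions.
- A mapping between ordinal levels and intervals.
- $f(z) = x$ for $z \in [a_x, a_{x+1})$, or equivalently $f^{-1}(x) = [a_x, a_{x+1})$.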
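The thresholding idea can be sketched as follows: choose the cutoffs as standard normal quantiles of the cumulative class proportions, then map a latent z to the level whose interval contains it. The function names are hypothetical, and the sketch assumes every level appears at least once in the observed data.

```python
import numpy as np
from statistics import NormalDist

def estimate_thresholds(x_obs, levels):
    """Interior cutoffs so that a standard normal z falls in each
    interval with the observed frequency of the matching level."""
    x_obs = np.asarray(x_obs)
    props = np.array([np.mean(x_obs == lev) for lev in levels])
    cum = np.cumsum(props)[:-1]  # drop the final cumulative value 1.0
    return np.array([NormalDist().inv_cdf(float(c)) for c in cum])

def threshold(z, cutoffs, levels):
    """f(z) = x for z in [a_x, a_{x+1}), applied entrywise."""
    idx = np.searchsorted(cutoffs, z, side="right")
    return np.asarray(levels)[idx]

# Usage: three levels observed with proportions 0.25, 0.5, 0.25.
levels = [1, 2, 3]
x_obs = [1, 1, 2, 2, 2, 2, 3, 3]
cutoffs = estimate_thresholds(x_obs, levels)
x_new = threshold(np.array([-1.0, 0.0, 1.0]), cutoffs, levels)
```

Here the two interior cutoffs are the 25% and 75% quantiles of the standard normal, and the outer endpoints are implicitly -∞ and ∞.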
Estimated thresholds for some GSS variables
[Figure: histograms of counts for General happiness (levels 1 to 3) and How many people in contact in a typical weekday (levels 1 to 5), each paired with a normal density curve plotted on [-4, 4].]
Figure 3: Red vertical lines indicate estimated thresholds.
Poll
[Figure: histogram of counts for "How satisfied are you with the work you do" (levels 1 to 4, from very satisfied to very dissatisfied), followed by three normal density plots on [-4, 4] labeled Choice A, Choice B, and Choice C.]
Figure 4: In which plot do the cutoff points correspond to the variable in the histogram?
Gaussian copula model
Gaussian copula model for mixed data
We say $x = (x_1, \dots, x_p)$ follows the Gaussian copula model if
- marginals: $x = f(z)$ for $f = (f_1, \dots, f_p)$ entrywise monotonic, i.e. $x_j = f_j(z_j)$, $j = 1, \dots, p$
- copula: $z \sim N(0, \Sigma)$ with correlation matrix $\Sigma$
Poll: what are the tuning parameters for the Gaussian copula?
A. The marginals $f$
B. The copula correlation $\Sigma$
C. There is no tuning parameter
Parameter estimation:
- Estimate $f_j$ to match the observed empirical distribution.
- Estimate $\Sigma$ through an EM algorithm.
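For a continuous column, matching the empirical distribution can be sketched by pushing the empirical CDF through the standard normal quantile function. The n/(n+1) scaling below is one common way to keep the values strictly inside (0, 1) and is an assumption of this sketch, as is the function name `to_latent`. Ordinal columns would instead use interval cutoffs, and estimating $\Sigma$ with missing or ordinal entries requires the EM algorithm, which is not shown here.

```python
import numpy as np
from statistics import NormalDist

def to_latent(x_obs):
    """Approximate f_j^{-1} on a fully observed continuous column:
    z_i = Phi^{-1}( n/(n+1) * ECDF(x_i) )."""
    x_obs = np.asarray(x_obs, dtype=float)
    n = len(x_obs)
    # Number of observations <= x_i, i.e. n * ECDF(x_i).
    counts = np.searchsorted(np.sort(x_obs), x_obs, side="right")
    u = counts / (n + 1)  # strictly between 0 and 1
    return np.array([NormalDist().inv_cdf(float(p)) for p in u])

# Usage: the transform is monotone, so the ordering of x is preserved.
z = to_latent([3.0, 1.0, 2.0])
```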
Given parameter estimate, imputation is easy
Figure 5: Curves indicate density and dots mark the observation.
Figure 6: Curves indicate density and crosses mark the prediction.
Figure 7: Curves indicate density and crosses mark the prediction.
Given parameters, imputation is easy
- observed entries $x_O$ of a new row $x \in \mathbb{R}^p$, $O \subset \{1, \dots, p\}$
- missing entries $M = \{1, \dots, p\} \setminus O$
- marginals $f = (f_O, f_M)$ and copula correlation matrix $\Sigma$
- the truncated region: $z_O \in f_O^{-1}(x_O) := \prod_{j \in O} f_j^{-1}(x_j)$
Impute missing entries using normality of $z_M$:
- latent missing $z_M$ are normal given $z_O$:
$$z_M \mid z_O \sim N\left(\Sigma_{M,O} \Sigma_{O,O}^{-1} z_O,\ \Sigma_{M,M} - \Sigma_{M,O} \Sigma_{O,O}^{-1} \Sigma_{O,M}\right)$$
- predict with the mean:
$$\hat{z}_M = E[z_M \mid z_O \in f_O^{-1}(x_O)] = \Sigma_{M,O} \Sigma_{O,O}^{-1} E[z_O \mid z_O \in f_O^{-1}(x_O)]$$
- map back to the observed space: $\hat{x}_M = f_M(\hat{z}_M)$
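The conditional mean step can be sketched in the simplest setting, where all observed entries are continuous so that $z_O$ is known exactly and the truncated region collapses to a single point. With ordinal observed entries, $E[z_O \mid z_O \in f_O^{-1}(x_O)]$ would be a truncated normal mean, which this sketch does not compute. The function name `impute_latent` is hypothetical.

```python
import numpy as np

def impute_latent(z, Sigma):
    """Fill NaN entries of a latent row z with the conditional mean
    Sigma_{M,O} Sigma_{O,O}^{-1} z_O."""
    z = np.asarray(z, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    M = np.isnan(z)   # missing coordinates
    O = ~M            # observed coordinates
    z_hat = z.copy()
    # Solve Sigma_{O,O} a = z_O instead of forming the inverse.
    a = np.linalg.solve(Sigma[np.ix_(O, O)], z[O])
    z_hat[M] = Sigma[np.ix_(M, O)] @ a
    return z_hat

# Usage: two variables with latent correlation 0.5.
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
z_hat = impute_latent([2.0, np.nan], Sigma)
```

The imputed latent value would then be mapped back through the marginal $f_M$ to obtain $\hat{x}_M$.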
Demo
Gaussian copula yields insight
Figure 8: Large correlations ($|\cdot| > 0.3$) between a few interesting variables from GSS data.
More about missing value imputation
Imputation itself can be the goal
Figure 9: Rating matrix in a recommendation system. Left: binary ratings. Right: 1 to 5 star ratings. Imputation is used to recommend entries with high imputed values.
missForest: impute continuous and categorical mixed data
For each variable, train a random forest using all other features to predict/impute this variable.
- Advantages:
  - Supports categorical data
  - Returns an estimate of imputation error
  - Good accuracy usually observed in practice
- Disadvantages:
  - Does not distinguish between categorical data and ordinal data
  - May be very slow to converge
Multiple imputation
When imputation is an intermediate step for learning some parameter $\theta$ (e.g. linear coefficients) on the imputed complete dataset:
1. Generate $m$ different imputed datasets $X^{(1)}, \dots, X^{(m)}$.
2. For each imputed dataset $X^{(j)}$, learn the desired model parameter $\hat{\theta}^{(j)}$, for $j = 1, \dots, m$.
3. Combine all estimates into one: $\hat{\theta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\theta}^{(j)}$.
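Steps 2 and 3 can be sketched for the linear coefficient example, assuming the m imputed datasets have already been generated by some imputation model. Ordinary least squares as the downstream model and the name `pooled_coefficients` are assumptions made for illustration.

```python
import numpy as np

def pooled_coefficients(imputed_datasets, y):
    """Fit least squares on each imputed dataset and average
    the m coefficient vectors."""
    estimates = [np.linalg.lstsq(X, y, rcond=None)[0]
                 for X in imputed_datasets]
    return np.mean(estimates, axis=0)

# Usage with m = 2 toy "imputed" datasets.
X1 = np.eye(2)
X2 = 2.0 * np.eye(2)
y = np.array([2.0, 4.0])
theta_hat = pooled_coefficients([X1, X2], y)
```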
- Decreases the variance of the estimate $\hat{\theta}$, similar to the idea of the bootstrap.
- Still needs a model to form every single imputation.
- Large bias can occur if the underlying model is not appropriate.
Choices:
- Gaussian copula: sbgcop in R.
- Chained equations: MICE in R and Python.
Which imputation method to choose?
1. Select a few reasonable imputation methods to choose from.
2. Divide your dataset into a training set and a test set.
3. For each combination of imputation method and tuning parameters, impute the dataset and train the model you want to use.
4. Pick the combination that performs best on the test set.
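The selection procedure above can be sketched with two deliberately simple candidates, column mean and column median imputation, and least squares as the downstream model. All names here (`impute_mean`, `impute_median`, `select_imputer`, `n_train`) are hypothetical, and a real comparison would include stronger methods and their tuning parameters.

```python
import numpy as np

def impute_mean(X):
    """Replace NaN entries by the column mean of observed values."""
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def impute_median(X):
    """Replace NaN entries by the column median of observed values."""
    return np.where(np.isnan(X), np.nanmedian(X, axis=0), X)

def select_imputer(X, y, imputers, n_train):
    """Impute with each candidate, fit least squares on the first
    n_train rows, and score by mean squared error on the rest."""
    scores = {}
    for name, impute in imputers.items():
        X_imp = impute(X)
        coef = np.linalg.lstsq(X_imp[:n_train], y[:n_train], rcond=None)[0]
        pred = X_imp[n_train:] @ coef
        scores[name] = float(np.mean((pred - y[n_train:]) ** 2))
    best = min(scores, key=scores.get)
    return best, scores

# Usage on synthetic data with about 20% of entries removed at random.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(40, 3))
y = X_full @ np.array([1.0, 2.0, 3.0])
X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan
best, scores = select_imputer(
    X_miss, y, {"mean": impute_mean, "median": impute_median}, n_train=30
)
```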
For more information
- Python package: https://github.com/udellgroup/online_mixed_gc_imp
- R package: https://github.com/udellgroup/mixedgcImp
- Paper: https://dl.acm.org/doi/abs/10.1145/3394486.3403106