Missing Value Imputation for Mixed Data via Gaussian Copula

(1)

Yuxuan Zhao

ORIE 4741, December 1 2020

(2)

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

4 More about missing value imputation

(3)

Motivation

Let's first look at a General Social Survey (GSS) dataset.

Figure 1: 2538 participants and 17 questions, of which one is continuous (age, 72 levels) and 16 are ordinal (2 to 48 levels). In total, 25.6% of entries are missing, and no row is complete. Data available at https://gssdataexplorer.norc.org/.

(4)

Motivation

Variable description

Figure 2: Description of the 17 variables used.

(5)

Motivation

Example variables

- General happiness: Taken all together, how would you say things are these days–would you say that you are very happy, pretty happy, or not too happy?

- Respondent's income: In which of these groups did your earnings from (OCCUPATION IN OCC) for last year–[the previous year]–fall? That is, before taxes or other deductions. Just tell me the letter.

- How many people in contact in a typical weekday: On average, about how many people do you have contact with in a typical week day, including people you live with. Please tell me which of the following categories best matches your estimate.

- Weeks r. worked last year: In [the previous year] how many weeks did you work either full-time or part-time not counting work around the house–include paid vacations and sick leave?

(6)

Motivation

Recap: GLRM imputes mixed data better than PCA

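Generalized low rank model (GLRM): find low rank matrices $X \in \mathbb{R}^{n \times k}$ and $W \in \mathbb{R}^{k \times p}$ such that $XW$ approximates $Y \in \mathbb{R}^{n \times p}$ well:

$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \ell_j(Y_{ij}, x_i^T w_j) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

Here $\Omega$ is the set of observed entries, $x_i$ is row $i$ of $X$ and $w_j$ is column $j$ of $W$.

- The loss $\ell_j$ can vary across columns $j$.
- The row regularizers $r_i$ and the column regularizers $\tilde{r}_j$ can vary as well.

Great flexibility usually means many choices to make...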
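To make the objective concrete, here is a minimal sketch that evaluates it for one particular set of choices: quadratic loss for every column and quadratic regularization with a single shared weight. The function name `glrm_objective`, the boolean `mask` that encodes Ω, the parameter `lam`, and the toy data are all assumptions made for illustration; they do not come from any GLRM package.

```python
import numpy as np

def glrm_objective(Y, mask, X, W, lam):
    """Quadratic-loss GLRM objective, summed over observed entries only.

    Y:    (n, p) data; entries where mask is False are ignored.
    mask: (n, p) boolean array, True where Y_ij is observed (the set Omega).
    X:    (n, k) row factors; W: (k, p) column factors.
    lam:  weight of the quadratic regularizers (assumed shared by rows and columns).
    """
    residual = np.where(mask, Y - X @ W, 0.0)  # unobserved entries contribute 0
    loss = np.sum(residual ** 2)
    reg = lam * (np.sum(X ** 2) + np.sum(W ** 2))
    return loss + reg

# Toy example: a rank-1 matrix with one missing entry.
Y = np.array([[1.0, 2.0], [2.0, np.nan]])
mask = ~np.isnan(Y)
X = np.array([[1.0], [2.0]])
W = np.array([[1.0, 2.0]])
value = glrm_objective(Y, mask, X, W, lam=0.1)
```

In this toy case the factors reproduce every observed entry exactly, so the loss term vanishes and only the regularization term remains.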

(8)

Motivation

GLRM: practical consideration

$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \ell_j(Y_{ij}, x_i^T w_j) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

- What loss $\ell_j$ to choose?
- How to assign weights to the $\ell_j$ when columns have different scales?
- What regularizers $r_i$, $\tilde{r}_j$ to use?

And there are tuning parameters...

(10)

Motivation

$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \ell_j(Y_{ij}, x_i^T w_j) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

- How to choose the rank $k$?
- If $r_i$ and $\tilde{r}_j$ are set to quadratic regularization with parameter $\lambda$, how to choose $\lambda$?
- Need to search over a two-dimensional grid.

Is the problem just about computation?
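The two-dimensional search can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the lecture: it takes quadratic loss and quadratic regularization, fits the model by alternating ridge regressions (a swapped-in fitting method), and scores each pair of rank and regularization weight on held-out observed entries. The names `fit_quadratic_glrm`, `holdout`, `scores`, the synthetic data, and the grid values are all hypothetical.

```python
import itertools
import numpy as np

def fit_quadratic_glrm(Y, mask, k, lam, n_iter=20, seed=0):
    """Alternating ridge regressions for quadratic loss and regularization."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    X = rng.normal(size=(n, k))
    W = rng.normal(size=(k, p))
    Y0 = np.where(mask, Y, 0.0)
    ridge = lam * np.eye(k)
    for _ in range(n_iter):
        for i in range(n):   # update row i of X from its observed columns
            Wo = W[:, mask[i]]
            X[i] = np.linalg.solve(Wo @ Wo.T + ridge, Wo @ Y0[i, mask[i]])
        for j in range(p):   # update column j of W from its observed rows
            Xo = X[mask[:, j]]
            W[:, j] = np.linalg.solve(Xo.T @ Xo + ridge, Xo.T @ Y0[mask[:, j], j])
    return X, W

# Synthetic rank-2 data with noise; about 80% of entries are observed.
rng = np.random.default_rng(0)
truth = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 8))
Y = truth + 0.1 * rng.normal(size=truth.shape)
observed = rng.random(Y.shape) < 0.8
holdout = observed & (rng.random(Y.shape) < 0.2)  # validation entries
train = observed & ~holdout

# Every (k, lambda) pair needs its own fit: this is the two-dimensional grid.
scores = {}
for k, lam in itertools.product([1, 2, 3], [0.01, 0.1, 1.0]):
    X, W = fit_quadratic_glrm(Y, train, k, lam)
    scores[(k, lam)] = float(np.mean((Y - X @ W)[holdout] ** 2))
best_k, best_lam = min(scores, key=scores.get)
```

The cost grows with the product of the two grid sizes, and each grid point is a full model fit.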

(13)

Motivation

GLRM: low rank assumption

$$\text{minimize} \quad \sum_{(i,j) \in \Omega} \ell_j(Y_{ij}, x_i^T w_j) + \sum_{i=1}^{n} r_i(x_i) + \sum_{j=1}^{p} \tilde{r}_j(w_j)$$

- Only works well when $Y$ can be approximated by a low rank matrix.
- Big data (large $n$ and large $p$) usually has low rank structure.
  - Movie rating datasets: many movies and many users.
- Long skinny data (large $n$ and small $p$) usually does not have low rank structure.
  - Social survey data: many participants, few questions.

(14)

Motivation

Get over the low rank assumption

- Large n allows learning more complex variable dependence than the low rank structure.
- Statistical dependence structure: model the joint distribution
  - Gaussian distribution for quantitative vector
    1. All 1-dimensional marginals are Gaussian
    2. The joint p-dimensional distribution is multivariate Gaussian

First, can we use a 1-dimensional Gaussian to model an ordinal/binary variable?

(17)

Motivation

Histograms for some GSS variables

[Figure: histograms of counts for four GSS variables. General happiness (3 levels, from "Very happy" to "Not too happy"); Respondent's income (12 levels, from "Less than $1000" to "more than $25000"); How many people in contact in a typical weekday (5 levels, from "0-4 persons" to "50 or more"); Weeks r. worked last year (from 0 to 52).]

(18)

Motivation

Generate ordinal data by thresholding Gaussian variable

[Figure: normal z value (axis ticks −∞, −1, 0, 1, ∞) against ordinal x value (levels 1, 2, 3).]

- Select thresholds to ensure the desired class proportions.
- This gives a mapping between ordinal levels and intervals.
- $f(z) = x$ for $z \in [a_x, a_{x+1})$, or equivalently $f^{-1}(x) = [a_x, a_{x+1})$.
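A minimal sketch of the thresholding idea, using only the Python standard library. The helper names `thresholds_from_proportions` and `to_ordinal` and the example proportions are assumptions for illustration. The cutoffs are chosen as standard normal quantiles of the cumulative class proportions, so a standard normal z lands in each level with the desired probability.

```python
from statistics import NormalDist

def thresholds_from_proportions(proportions):
    """Interior cutoffs a_2 < ... < a_K such that a standard normal z falls
    in level x with probability proportions[x - 1]."""
    std_normal = NormalDist()
    cutoffs, cumulative = [], 0.0
    for prob in proportions[:-1]:      # the last level needs no upper cutoff
        cumulative += prob
        cutoffs.append(std_normal.inv_cdf(cumulative))
    return cutoffs

def to_ordinal(z, cutoffs):
    """f(z): 1 plus the number of cutoffs at or below z."""
    return 1 + sum(z >= a for a in cutoffs)

# Three levels with class proportions 30%, 50%, 20% (made-up numbers).
cutoffs = thresholds_from_proportions([0.3, 0.5, 0.2])
levels = [to_ordinal(z, cutoffs) for z in (-1.0, 0.0, 2.0)]
```

With the convention $a_1 = -\infty$ and $a_{K+1} = \infty$, the returned list holds the remaining cutoffs $a_2, \ldots, a_K$.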

(19)

Motivation

Estimated thresholds for some GSS variables

[Figure: for "General happiness" (from "Very happy" to "Not too happy") and "How many people in contact in a typical weekday" (from "0-4 persons" to "50 or more"), a histogram of counts shown with a standard normal density curve.]

Figure 3: Red vertical lines indicate estimated thresholds.

(20)

Motivation

Poll

[Figure: histogram of counts for "How satisfied are you with the work you do" (4 levels, from "very satisfied" to "very dissatisfied"), and three normal density plots with cutoff points, labeled Choice A, Choice B and Choice C.]

Figure 4: In which plot do the cutoff points correspond to the variable in the histogram?

(21)

Gaussian copula model

(22)

Gaussian copula model for mixed data

We say $x = (x_1, \ldots, x_p)$ follows the Gaussian copula model if

- marginals: $x = f(z)$ for $f = (f_1, \ldots, f_p)$ entrywise monotonic, i.e. $x_j = f_j(z_j)$, $j = 1, \ldots, p$
- copula: $z \sim \mathcal{N}(0, \Sigma)$ with correlation matrix $\Sigma$

Poll: what are the tuning parameters for the Gaussian copula?

A. The marginals $f$
B. The copula correlation $\Sigma$
C. There is no tuning parameter

(24)

We say $x = (x_1, \ldots, x_p)$ follows the Gaussian copula model if

- marginals: $x = f(z)$ for $f = (f_1, \ldots, f_p)$ entrywise monotonic, i.e. $x_j = f_j(z_j)$, $j = 1, \ldots, p$
- copula: $z \sim \mathcal{N}(0, \Sigma)$ with correlation matrix $\Sigma$

Both parts are estimated from data:

- Estimate $f_j$ to match the observed empirical distribution
- Estimate $\Sigma$ through an EM algorithm
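For a continuous column, matching the empirical distribution can be sketched as below: each observed value is sent to the standard normal quantile of its empirical CDF value, which gives the latent score $z_j = f_j^{-1}(x_j)$. The function name `latent_from_continuous` and the scaling of the empirical CDF by n / (n + 1) (so the largest value does not map to infinity) are assumptions of this sketch. Ordinal columns and the EM step for $\Sigma$ are not shown.

```python
from statistics import NormalDist
import numpy as np

def latent_from_continuous(column):
    """Map observed continuous values to latent normal scores
    z = Phi^{-1}(F_hat(x)), where F_hat is the scaled empirical CDF of the
    observed entries. Missing entries (NaN) stay NaN."""
    column = np.asarray(column, dtype=float)
    observed = ~np.isnan(column)
    values = column[observed]
    n = values.size
    # ranks[i] = number of observed values <= values[i]
    ranks = np.searchsorted(np.sort(values), values, side="right")
    inv_cdf = NormalDist().inv_cdf
    z = np.full(column.shape, np.nan)
    z[observed] = [inv_cdf(r / (n + 1)) for r in ranks]
    return z

z = latent_from_continuous([2.0, np.nan, 5.0, 3.0])
```

The map is monotonic in the observed values, as the model requires of each $f_j$.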

(25)

Given parameter estimate, imputation is easy

Figure 5: Curves indicate density and dots mark the observation.

(26)

Figure 6: Curves indicate density and crosses mark the prediction.

(27)

Figure 7: Curves indicate density and crosses mark the prediction.

(29)

Given parameters, imputation is easy

- observed entries $x_O$ of new row $x \in \mathbb{R}^p$, $O \subset \{1, \ldots, p\}$
- missing entries $M = \{1, \ldots, p\} \setminus O$
- marginals $f = (f_O, f_M)$ and copula correlation matrix $\Sigma$
- the truncated region: $z_O \in f_O^{-1}(x_O) := \prod_{j \in O} f_j^{-1}(x_j)$

Impute missing entries using normality of $z_M$:

- latent missing $z_M$ are normal given $z_O$:

$$z_M \mid z_O \sim \mathcal{N}\left(\Sigma_{M,O} \Sigma_{O,O}^{-1} z_O,\; \Sigma_{M,M} - \Sigma_{M,O} \Sigma_{O,O}^{-1} \Sigma_{O,M}\right)$$

- predict with mean

$$\hat{z}_M = E[z_M \mid z_O \in f_O^{-1}(x_O)] = \Sigma_{M,O} \Sigma_{O,O}^{-1} E[z_O \mid z_O \in f_O^{-1}(x_O)]$$

- map back to observed space: $\hat{x}_M = f_M(\hat{z}_M)$
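The conditional mean step can be sketched in a few lines for the simplest case, where every observed entry is continuous. Then $f_O^{-1}(x_O)$ is a single point, so $E[z_O \mid z_O \in f_O^{-1}(x_O)] = z_O$. The function name `impute_latent` and the example correlation matrix are assumptions for illustration. For ordinal observed entries the inner expectation is a truncated normal mean, which this sketch does not compute.

```python
import numpy as np

def impute_latent(z, Sigma):
    """Fill the NaN entries of a latent vector z with
    Sigma_{M,O} Sigma_{O,O}^{-1} z_O, the conditional mean of z_M given z_O."""
    z = np.asarray(z, dtype=float)
    missing = np.isnan(z)
    obs = ~missing
    Sigma_MO = Sigma[np.ix_(missing, obs)]
    Sigma_OO = Sigma[np.ix_(obs, obs)]
    z_hat = z.copy()
    # Solve a linear system instead of forming the inverse explicitly.
    z_hat[missing] = Sigma_MO @ np.linalg.solve(Sigma_OO, z[obs])
    return z_hat

# Variable 2 is correlated with variable 1 and independent of variable 3.
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
z_hat = impute_latent([2.0, np.nan, 1.0], Sigma)
```

The imputed latent value depends only on the observed variable it is correlated with. Applying $f_M$ to it would give the imputation in the observed space.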

(30)

Demo

(31)

Demo

Gaussian copula yields insight

Figure 8: Large correlations ($|\cdot| > 0.3$) between a few interesting variables from GSS data.

(32)

More about missing value imputation

(33)

Imputation itself can be the goal

Figure 9: Rating matrix in a recommendation system. Left: binary rating. Right: 1-5 star rating. Imputation is used to recommend the entries with high imputed values.

(34)

missForest: impute continuous and categorical mixed data

For each variable, train a random forest using all other features to predict/impute this variable.

- Advantages:
  - Supports categorical data
  - Returns an estimate of imputation error
  - Good accuracy usually observed in practice
- Disadvantages:
  - Does not distinguish between categorical data and ordinal data
  - May be very slow to converge

(35)

Multiple imputation

When imputation is an intermediate step for learning some parameter $\theta$ (e.g. linear coefficients) on the imputed complete dataset:

1. Generate $m$ different imputed datasets $X^{(1)}, \ldots, X^{(m)}$.
2. For each imputed dataset $X^{(j)}$, learn the desired model parameter $\hat{\theta}^{(j)}$, for $j = 1, \ldots, m$.
3. Combine all estimates into one:

$$\hat{\theta} = \frac{1}{m} \sum_{j=1}^{m} \hat{\theta}^{(j)}$$
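The three steps can be sketched on synthetic data as follows. Everything specific here is an assumption for illustration: the two-variable data, the imputation model (a linear regression of `x2` on `x1` with Gaussian noise), and the parameter of interest (the mean of `x2`). A fuller treatment would also draw the imputation model's own parameters at random for each dataset, which this sketch omits.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 5
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
missing = rng.random(n) < 0.3        # entries of x2 treated as missing
obs = ~missing

# Hypothetical imputation model, fitted on rows where x2 is observed.
slope, intercept = np.polyfit(x1[obs], x2[obs], deg=1)
resid_sd = np.std(x2[obs] - (slope * x1[obs] + intercept))

estimates = []
for _ in range(m):
    x2_imputed = x2.copy()
    # Step 1: draw (rather than just predict) the missing values,
    # so that the m imputed datasets differ from each other.
    noise = resid_sd * rng.normal(size=missing.sum())
    x2_imputed[missing] = slope * x1[missing] + intercept + noise
    # Step 2: learn the parameter of interest on the completed data.
    estimates.append(x2_imputed.mean())

# Step 3: pool by averaging the m estimates.
theta_hat = float(np.mean(estimates))
```

The spread of `estimates` across the m datasets reflects the uncertainty due to the missing values, which a single imputation would hide.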

(37)

Multiple imputation

- Decreases the variance of the estimated $\hat{\theta}$, similar to the idea of the bootstrap.
- Still needs a model to form every single imputation.
- Large bias can arise if the underlying model is misspecified.

Choices:

- Gaussian copula: sbgcop in R.
- Chained equations: MICE in R and Python.

(39)

Which imputation method to choose?

1. Select a few reasonable candidate imputation methods.
2. Divide your dataset into a training set and a test set.
3. For each combination of imputation method and tuning parameters, impute the dataset and train the model you want to use.
4. Pick the combination that performs best on the test set.
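A minimal sketch of this selection loop, under stated assumptions: the candidates are two deliberately simple stand-ins (column mean and column median imputation, neither with tuning parameters), the downstream model is least squares with an intercept, and the data are synthetic. The names `impute_with`, `candidates` and `test_mse` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
features = rng.exponential(size=(n, 3))
target = features @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
features[rng.random(features.shape) < 0.2] = np.nan   # knock out 20% of entries

# Step 2: split into a training set and a test set.
train = np.arange(n) < 200
test = ~train

def impute_with(stat, X_fit, X):
    """Fill NaNs in X with a per-column statistic computed on X_fit."""
    fill = stat(X_fit, axis=0)
    return np.where(np.isnan(X), fill, X)

# Step 1: candidate imputation methods (simple stand-ins).
candidates = {"mean": np.nanmean, "median": np.nanmedian}

# Step 3: impute, then train the downstream model for each candidate.
test_mse = {}
for name, stat in candidates.items():
    X_train = impute_with(stat, features[train], features[train])
    X_test = impute_with(stat, features[train], features[test])
    A_train = np.column_stack([np.ones(X_train.shape[0]), X_train])
    A_test = np.column_stack([np.ones(X_test.shape[0]), X_test])
    coef, *_ = np.linalg.lstsq(A_train, target[train], rcond=None)
    test_mse[name] = float(np.mean((A_test @ coef - target[test]) ** 2))

# Step 4: pick the candidate with the lowest test error.
best_method = min(test_mse, key=test_mse.get)
```

The fill statistics are computed on the training rows only and then reused for the test rows, so no information from the test set leaks into the imputation.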

(40)

For more information

- Python package: https://github.com/udellgroup/online_mixed_gc_imp
- R package: https://github.com/udellgroup/mixedgcImp
- Paper: https://dl.acm.org/doi/abs/10.1145/3394486.3403106