• No results found

Missing Value Imputation for Mixed Data via Gaussian Copula

N/A
N/A
Protected

Academic year: 2022

Share "Missing Value Imputation for Mixed Data via Gaussian Copula"

Copied!
40
0
0

Loading.... (view fulltext now)

Full text

(1)

Missing Value Imputation for Mixed Data via Gaussian Copula

Yuxuan Zhao

ORIE 4741, December 1 2020

(2)

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

4 More about missing value imputation

(3)

Motivation

Let’s first see a general social survey dataset

Figure 1:2538 participants and 17 questions, among which one is continuous (age, 72 levels) and 16 are ordinal (2-48) levels. 25.6% entries are missing in total. There is no complete row. Data available at

https://gssdataexplorer.norc.org/.

(4)

Motivation

Variable description

Figure 2:Description for the 17 variables used.

(5)

Motivation

Example variables

I General happiness: Taken all together, how would you say things are these days–would you say that you are very happy, pretty happy, or not too happy?

I Respondents income: In which of these groups did your earnings from (OCCUPATION IN OCC) for last year–[the previous year]–fall? That is, before taxes or other deductions. Just tell me the letter.

I How many people in contact in a typical weekday: On average, about how many people do you have contact with in a typical week day, including people you live with. Please tell me which of the following categories best matches your estimate.

I Weeks r. worked last year: In [the previous year] how many weeks did you work either full-time or part-time not counting work around the house–include paid vacations and sick leave?

(6)

Motivation

Recap: GLRM imputes mixed data better than PCA

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I `j can vary for different j .

I The regularizer for row ri and column ˜rj can vary.

Great flexibility usually means many choices to make...

(7)

Motivation

Recap: GLRM imputes mixed data better than PCA

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I `j can vary for different j .

I The regularizer for row ri and column ˜rj can vary.

Great flexibility usually means many choices to make...

(8)

Motivation

GLRM: practical consideration

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I What `j to choose?

I How to assign weights to `j when columns have different scales?

I What regularizer ri, ˜rj to use?

And there are tuning parameters...

(9)

Motivation

GLRM: practical consideration

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I What `j to choose?

I How to assign weights to `j when columns have different scales?

I What regularizer ri, ˜rj to use?

And there are tuning parameters...

(10)

Motivation

GLRM: practical consideration

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I How to choose the rank k?

I If setting ri and ˜rj as quadratic regularization with parameter λ, how to choose λ?

I Need to search over two-dimensional grid. Is the problem just about computation?

(11)

Motivation

GLRM: practical consideration

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I How to choose the rank k?

I If setting ri and ˜rj as quadratic regularization with parameter λ, how to choose λ?

I Need to search over two-dimensional grid.

Is the problem just about computation?

(12)

Motivation

GLRM: practical consideration

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j

Yij, xiTwj +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I How to choose the rank k?

I If setting ri and ˜rj as quadratic regularization with parameter λ, how to choose λ?

I Need to search over two-dimensional grid.

Is the problem just about computation?

(13)

Motivation

GLRM: low rank assumption

Generalized low rank model: find low rank matrix X ∈ Rn×k and W ∈ Rk×p such that XW approximates Y ∈ Rn×p well:

minimize X

(i ,j )∈Ω

`j



Yij, xiTwj

 +

n

X

i =1

ri(xi) +

d

X

j =1

˜ rj(wj)

I Only works well when Y can be approximated by low rank matrix.

I Big data (large n and large p) usually have low rank structure.

I Movie rating datasets: many movies and many users

I Long skinny data (large n and small p) usually does not have low rank structure.

I Social survey data: many participants, few questions.

(14)

Motivation

Get over the low rank assumption

I Large n allows learning more complex variable dependence than the low rank structure.

I Statistical dependence structure: model the joint distribution I Gaussian distribution for quantitative vector

1 All 1-dimensional marginals are Gaussian

2 The joint p-dimensional distribution is multivariate Gaussian

First, can we use 1-dimensional Gaussian to model ordinal/binary variable?

(15)

Motivation

Get over the low rank assumption

I Large n allows learning more complex variable dependence than the low rank structure.

I Statistical dependence structure: model the joint distribution I Gaussian distribution for quantitative vector

1 All 1-dimensional marginals are Gaussian

2 The joint p-dimensional distribution is multivariate Gaussian

First, can we use 1-dimensional Gaussian to model ordinal/binary variable?

(16)

Motivation

Get over the low rank assumption

I Large n allows learning more complex variable dependence than the low rank structure.

I Statistical dependence structure: model the joint distribution I Gaussian distribution for quantitative vector

1 All 1-dimensional marginals are Gaussian

2 The joint p-dimensional distribution is multivariate Gaussian

First, can we use 1-dimensional Gaussian to model ordinal/binary variable?

(17)

Motivation

Histograms for some GSS variables

1 2 3

General happiness

from left to right: Very happy to Not too happy Counts 04008001200

1 2 3 4 5 6 7 8 9 10 1112

Respondont's income

from left to right: Less than $1000 to more than $25000 Counts 0200400600800

1 2 3 4 5

How many people in contact in a typical weekday

from left to right: 0−4 persons to 50 or more Counts 050150250350

0 3 6 11 15 20 24 28 34 38 42 46 50 Weeks r. worked last year

from left to right: 0 to 52 Counts 02006001000

(18)

Motivation

Generate ordinal data by thresholding Gaussian variable

-∞ -1 0 1

1 2 3

normal z value

ordinalxvalue

I Select thresholds to ensure desired class proportion.

I A mapping between ordinal levels and intervals.

I f (z) = x for z ∈ [ax, ax +1) or f−1(x ) = [ax, ax +1).

(19)

Motivation

Estimated thresholds for some GSS variables

1 2 3

General happiness

from left to right: Very happy to Not too happy Counts 04008001200

−4 −2 0 2 4

0.00.10.20.30.4

x

y

1 2 3 4 5

How many people in contact in a typical weekday

from left to right: 0−4 persons to 50 or more Counts 050150250350

−4 −2 0 2 4

0.00.10.20.30.4

x

y

Figure 3:Red vertical lines indicate estimated thresholds.

(20)

Motivation

Poll

1 2 3 4

How satisfied are you with the work you do

from left to right: very satisfied to very dissatisfied Counts 0200600

−4 −2 0 2 4

0.00.10.20.30.4

Choice: A

x

y

−4 −2 0 2 4

0.00.10.20.30.4

Choice: B

x

y

−4 −2 0 2 4

0.00.10.20.30.4

Choice: C

x

y

Figure 4:In which plot the cutoff points correspond to the variable in histogram?

(21)

Gaussian copula model

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

4 More about missing value imputation

(22)

Gaussian copula model

Gaussian copula model for mixed data

We say x = (x1, . . . , xp) follows the Gaussian copula model if I marginals: x = f(z) for f = (f1, . . . , fp) entrywise monotonic,

xj = fj(zj), j = 1, . . . , p I copula: z ∼ N (0, Σ) with correlation matrix Σ

Poll: what are the tuning parameters for Gaussian copula? A The marginals f

B The copula correlation Σ C There is no tuning parameter

(23)

Gaussian copula model

Gaussian copula model for mixed data

We say x = (x1, . . . , xp) follows the Gaussian copula model if I marginals: x = f(z) for f = (f1, . . . , fp) entrywise monotonic,

xj = fj(zj), j = 1, . . . , p I copula: z ∼ N (0, Σ) with correlation matrix Σ Poll: what are the tuning parameters for Gaussian copula?

A The marginals f

B The copula correlation Σ C There is no tuning parameter

(24)

Gaussian copula model

Gaussian copula model for mixed data

We say x = (x1, . . . , xp) follows the Gaussian copula model if I marginals: x = f(z) for f = (f1, . . . , fn) entrywise monotonic,

xj = fj(zj), j = 1, . . . , p I copula: z ∼ N (0, Σ) with correlation matrix Σ

I Estimate fj to match the observed empirical distribution I Estimate Σ through an EM algorithm

(25)

Gaussian copula model

Given parameter estimate, imputation is easy

?

?

?

?

Figure 5:Curves indicate density and dots mark the observation.

(26)

Gaussian copula model

Given parameter estimate, imputation is easy

x

x

+

Figure 6:Curves indicate density and crosses mark the prediction.

(27)

Gaussian copula model

Given parameter estimate, imputation is easy

x

x

x

x

Figure 7:Curves indicate density and crosses mark the prediction.

(28)

Gaussian copula model

Given parameters, imputation is easy

I observed entries xO of new row x ∈ Rp, O ⊂ {1, . . . , p}

I missing entries M = {1, . . . , p} \ O

I marginals f = (fO, fM) and copula correlation matrix Σ I the truncated region: zO ∈ fO−1(xO) :=Q

j ∈Ofj−1(xj)

impute missing entries using normality of zM: I latent missing zM are normal given zO:

zM|zO ∼ N (ΣM,OΣ−1O,OzO, ΣM,M− ΣM,OΣ−1O,OΣO,M)

(29)

Gaussian copula model

Given parameters, imputation is easy I observed entries xO of new row x ∈ Rp, O ⊂ {1, . . . , p}

I missing entries M = {1, . . . , p} \ O

I marginals f = (fO, fM) and copula correlation matrix Σ I the truncated region: zO ∈ fO−1(xO) :=Q

j ∈Ofj−1(xj) impute missing entries using normality of zM:

I latent missing zM are normal given zO:

zM|zO ∼ N (ΣM,OΣ−1O,OzO, ΣM,M− ΣM,OΣ−1O,OΣO,M) I predict with mean

ˆzM = E[zM|zO∈ fO−1(xO)] = ΣM,OΣ−1O,OE[zO|zO∈ fO−1(xO)]

I map back to observed space ˆxM= fM(ˆzM)

(30)

Demo

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

4 More about missing value imputation

(31)

Demo

Gaussian copula yields insight

Figure 8:Large Correlations (| · | > 0.3) between a few interesting variables from GSS data.

(32)

More about missing value imputation

Table of Contents

1 Motivation

2 Gaussian copula model

3 Demo

4 More about missing value imputation

(33)

More about missing value imputation

Imputation itself can be the goal

Figure 9:Rating matrix in recommendation system. Left: binary rating. Right:

1 − 5 star rating. Imputation for recommending entries with high imputed value

(34)

More about missing value imputation

missForest: impute continuous and categorical mixed data

For each variable, train a random forest using all other features to predict/impute this variable.

I Advantages:

I Supports categorical data

I Returns estimate of imputation error I Good accuracy usually observed in practice I Disadvantages:

I Does not distinguish between categorical data and ordinal data I May be very slow to converge

(35)

More about missing value imputation

Multiple imputation

When imputation is the intermediate step to learn some parameter θ, e.g.

linear coefficients, on imputed complete dataset:

1 Generate m different imputed datasets X(1), . . . , X(m).

2 For each imputed dataset X(j ), learn the desired model parameter ˆθ(j ) for j = 1, . . . , m.

3 Combine all estiamtes into one: ˆθ =

Pm j =1θˆ(j )

m .

(36)

More about missing value imputation

Multiple imputation

When imputation is the intermediate step to learn some parameter θ, e.g.

linear coefficients, on imputed complete dataset:

1 Generate m different imputed datasets X(1), . . . , X(m).

2 For each imputed dataset X(j ), learn the desired model parameter ˆθ(j ) for j = 1, . . . , m.

3 Combine all estiamtes into one: ˆθ =

Pm j =1θˆ(j )

m .

(37)

More about missing value imputation

Multiple imputation

I Decrease the variance of estimated ˆθ, similar to the idea of Bootstrap.

I Still need a model to form every single imputation.

I Large bias can happen if the underlying model is not proper.

Choices:

I Gaussian copula: sbgcop in R.

I Chained equations: MICE in R and Python.

(38)

More about missing value imputation

Multiple imputation

I Decrease the variance of estimated ˆθ, similar to the idea of Bootstrap.

I Still need a model to form every single imputation.

I Large bias can happen if the underlying model is not proper.

Choices:

I Gaussian copula: sbgcop in R.

I Chained equations: MICE in R and Python.

(39)

More about missing value imputation

Which imputation method to choose?

1 Select a few reasonable imputation methods you want to choose from.

2 Divide your dataset into training set and test set.

3 For each imputation method and tuning parameter combination, impute the dataset and train the model you want to use.

4 Pick the combination which perform best on the test set.

(40)

More about missing value imputation

For more information

I Python package

https://github.com/udellgroup/online_mixed_gc_imp I R package https://github.com/udellgroup/mixedgcImp

I Paper https://dl.acm.org/doi/abs/10.1145/3394486.3403106

References

Related documents

Trays for each sort plan number 4,000 barcoded articles Large Deliver imprinted or metered articles at a lower price than Full Rate mail, with no minimum volume Imprint

When this is all in order and the 0.5mm wire hasn’t moved then fix a couple of chairs to the left of the toe, checking with a straight edge run through the crossing that the

A market may be removed from the approved list by the deputy administrator veterinary services, APHIS, USDA when it is determined by the executive secretary of the board or the

2 Carlton Street, Suite 1200, Toronto ON M5B 2M9 | Telephone: 1-888-327-7377 | Web site: www.eqao.com | © 2015 Queen’s Printer for Ontario.. Grade 9 Assessment of Mathematics

The result shows that employee’s individual attributes, top management support, compatibility, IT infrastructure and industry pressure are important factors that influencing

17th IPHS Conference, Delft 2016 | HISTORY - URBANISM - RESILIENCE | VOlume 04 Planning and Heritage | Politics, Planning, Heritage and urban Space | Civic Space

transatlantic cooperation. Positive deviant behavior and nutrition education. How do socio-economic status, perceived economic barriers and nutritional benefits affect quality