• No results found

DataWig: Missing Value Imputation for Tables

N/A
N/A
Protected

Academic year: 2020

Share "DataWig: Missing Value Imputation for Tables"

Copied!
6
0
0

Loading.... (view fulltext now)

Full text

(1)

DataWig: Missing Value Imputation for Tables

Felix Bießmann

[email protected]

Beuth University, Luxemburger Str. 10, 13353 Berlin (work done while at Amazon Research)

Tammo Rukat

[email protected]

Phillipp Schmidt

[email protected]

Amazon Research, Krausenstr. 38, 10117 Berlin, Germany

Prathik Naidu

[email protected]

Department of Computer Science, Stanford University

Stanford, CA 94305, USA (work done while at Amazon Research)

Sebastian Schelter

[email protected]

Center for Data Science, New York University, New York, USA (work done while at Amazon

Research)

Andrey Taptunov

[email protected]

Snowflake, Stresemannstraße 123, 10963 Berlin (work done while at Amazon Research)

Dustin Lange

[email protected]

Amazon Research, Krausenstr. 38, 10117 Berlin, Germany

David Salinas

[email protected]

Naver Labs, 6 Chemin de Maupertuis, 38240 Meylan, France (work done while at Amazon Research)

Editor:

Alexandre Gramfort

Abstract

With the growing importance of machine learning (ML) algorithms for practical applications,

reducing data quality problems in ML pipelines has become a major focus of research. In

many cases missing values can break data pipelines which makes completeness one of the most

impactful data quality challenges. Current missing value imputation methods are focusing

on numerical or categorical data and can be difficult to scale to datasets with millions of

rows. We release

DataWig

, a robust and scalable approach for missing value imputation

that can be applied to tables with heterogeneous data types, including unstructured text.

DataWig

combines deep learning feature extractors with automatic hyperparameter tuning.

This enables users without a machine learning background, such as data engineers, to

impute missing values with minimal effort in tables with more heterogeneous data types

than supported in existing libraries, while requiring less glue code for feature engineering

and offering more flexible modelling options. We demonstrate that

DataWig

compares

favourably to existing imputation packages. Source code, documentation, and unit tests for

this package are available at:

github.com/awslabs/datawig

Keywords:

missing value imputation, deep learning, heterogeneous data

1. Introduction

Machine learning (ML) algorithms have become a standard technology in production use cases.

One of the main reasons for suboptimal predictive performance of such systems is low data

c

(2)

Biessmann, Rukat, Schmidt, Naidu, Schelter, Taptunov, Lange

Data Type

Featurizers

Loss

Numerical

Normalization

Neural Network

Regression

Categorical

Embeddings

Softmax

Text

Bag-of-Words

LSTM

N/A

t a b l e = p a n d a s . r e a d _ c s v (’ p r o d u c t s . csv ’) m i s s i n g = t a b l e [ t a b l e [’ c o l o r ’]. i s n u l l ()]

# i n s t a n t i a t e m o d e l and t r a i n i m p u t e r

m o d e l = S i m p l e I m p u t e r (

i n p u t _ c o l u m n s =[’ d e s c r i p t i o n ’, ’ p r o d u c t _ t y p e ’, ’ s i z e ’] ,

o u t p u t _ c o l u m n s =[’ c o l o r ’]) . fit ( t a b l e )

# i m p u t e m i s s i n g v a l u e s

i m p u t e d = m o d e l . p r e d i c t ( m i s s i n g )

Figure 1:

Left

: Available featurizers and loss functions for different data types in

DataWig

.

Right:

Application example of

DataWig

API for the use case shown in Figure 2.

quality and one of the most frequent data quality problems are missing values. Imputation

of missing values can help to increase data quality by filling gaps in training data. However

automated and scalable imputations for tables with heterogeneous data types including

free form text fields remains challenging. Here we present

DataWig

, a software package

that aims at minimizing the effort required for missing value imputation in heterogeneous

data sources. Most research in the field of imputation focuses on imputing missing values

in

matrices

, that is imputation of numerical values from other numerical values (Mayer

et al., 2019). Popular approaches include

k-nearest neighbors

(KNN) (Batista and Monard,

2003),

multivariate imputation by chained equations

(MICE) (Little and Rubin, 2002),

matrix factorization

(Koren et al., 2009; Mazumder et al., 2010; Troyanskaya et al., 2001) or

deep learning methods (Gondara and Wang, 2017; Zhang et al., 2018; Mattei and Frellsen,

2019). While some recent work addresses imputation for more heterogeneous data types

(Stekhoven and Bühlmann, 2012; Yoon et al., 2018; Nazabal et al., 2018),

heterogeneous

in

those studies refers to binary, ordinal or categorical variables, which can be easily transformed

into numerical representations. In practice also these simple transformations require glue

code that can be difficult to adapt and maintain in a production setting. Writing such feature

extraction code is out of scope for many engineers and can incur considerable technical debt

on any data pipeline (Sculley et al., 2015; Schelter et al., 2018). We release

DataWig

to

complement existing imputation libraries by an imputation solution for tables that contain

not only numerical values or categorical values, but also more generic data types such as

unstructured text. Extending the functionality of previous packages,

DataWig

’s imputation

automatically selects from a number of feature extractors, including deep learning techniques,

and learns all parameters in an end-to-end fashion using the symbolic API of Apache

mxnet

to ensure efficient execution on both CPUs and GPUs.

2. Imputation Model

The imputation model in

DataWig

is inspired by established approaches (van Buuren,

2018) and follows the approach of MICE, also referred to as

fully conditional specification

:

for each to-be-imputed column (referred to as

output column

), the user can specify the

(3)

DataWig: Missing Value Imputation for Tables

scenario, our system’s simple API (presented in Section 4) is also bene�cial for other use cases: users without any machine learning experience or coding skills who want to run simple classi�cation experiments on non-numerical data can leverage the system as long as they can export their data from an Excel sheet and specify the target column.

In order to demonstrate the scalability and performance of our approach, we present experiments on samples of a large product catalog and on public datasets extracted from Wikipedia. In the experiments on the product catalog, we impute missing product attribute values for a variety of product types and attributes. The sizes of the data sets sampled from the product catalog are between several 1,000 rows and several million rows, which is between one to four orders of magnitude larger than data sets in previous work on missing value imputation on numerical data [10]. We evaluate the imputations on product data with very di�erent languages (English and Japanese), and �nd that our system is able to deal with di�erent languages without any language-speci�c preprocessing, such as tokenization. In the Wikipedia experiments, we impute missing infobox attributes from the article abstracts for a number of infobox properties.

A sketch of the proposed approach is shown in Figure 1. We will use the running example of imputing missing color attributes for products in a retail catalog. Our system operates on a table with non-numerical data, where the column to be imputed is the

colorattribute and the columns considered as input data for the

imputation are other product attribute columns such asproduct description. The proposed approach then trains a machine learn-ing model for each to be imputed column that learns to predict the observed values of the to be imputed column from the remaining columns (or a subset thereof). Each input column of the system is fed into a featurizer component that processes sequential data (such as unstructured text) or categorical data. In the case of thecolor

attribute, the to be imputed column is modeled as a categorical variable that is predicted from the concatenation of all featurizer outputs.

The reason we propose this imputation model is that in an ex-tensive literature review we found that the topic of imputation for non-numerical data beyond rule-based systems was not covered very well. There exists a lot of work on imputation [30], and on modeling non-numerical data, but to the best of our knowledge there is no work on end-to-end systems that learn how to extract features and impute missing values on non-numerical data at scale. This paper aims at �lling this gap by providing the following con-tributions:

• Scalable deep learning for imputation.We present an impu-tation approach that is based on state of the art deep learning models (Section 3).

• High precision imputations.In extensive experiments on pub-lic and private real-world datasets, we compare our imputation approach against standard imputation baselines and observe up to 100-fold improvements of imputation quality (Section 6).

• Language-agnostic text feature extraction.Our approach op-erates on the character level and can impute with high precision and recall independent of the language present in a data source (Section 5, Section 6).

Product

Type Description Size Color

Shoe Ideal for running … 12UK Black

SDCards Best SDCard ever … 8GB Blue

Dress This yellow dress … M ?

Feature Columns Label Column

Character Sequence

One Hot Encoder

Embedding n-gram hashingLSTM or

Feature Concatenation

Imputation

distinguish betweencategoricalandsequentialdata. For categorical data the numerical representation

114

xc {1,2, . . . , M

c}of thestringdata in columncwas simply the index of the value histogram

115

of sizeMc; for notational simplicity also these scalar variables will be denoted as vectorxcin the

116

following. For sequential data, the numerical representationxc {0,1,2, . . . , A

c}Scis a vector of

117

lengthSc, whereScdenotes the length of the sequence orstringin columncandAcdenotes the

118

size of the set of all characters observed in columnc. The data types were determined using heuristics.

119

In the data sets used in the experiments the data types of the columns were easy to separate into

120

free text fields (product description, bullet points, item name) and categorical variables

121

(e.g.color, brand, size, ...). If the data types are not known upfront, heuristics based on the

122

distribution of values in a column can be used [?].

123

Once the the non-numerical data is encoded into their respective numerical representation, a column

124

specific feature extraction mapping c(xk) RDc is computed, whereDkdenotes the dimensionality

125

for a latent variable associated with columnc. We considered three different types of featurizers

126

c(.). For categorical data we use a one-hot encoded embedding (as known from word embeddings).

127

For columnscwith sequentialstringdata we consider two different possibilities for c(xc): an

128

n-gram representation or a character-based embedding using a Long-Short-Term-Memory (LSTM)

129

recurrent neural network [?]TODO: CITATION. For the character n-gram representation, c(xc)is

130

a hashing function that maps each n-gram, wheren {1, . . . ,5}, in the character sequencexcto

131

aDcdimensional vector; hereDcdenotes here the number of hash buckets. Note that the hashing

132

featurizer is a stateless component that does not require any training, whereas the other two types of

133

feature maps contain parameters that are learned using backpropagation.

134

For the case of categorical embeddings, we use a standard linear embedding fed into one fully

135

connected layer. The hyperparameter for this featurizer was a single one and used to set both the

136

embedding dimensionality as well as the number of hidden units of the output layer. In the LSTM

137

case, we featurizexcby iterating an LSTM through the sequence of characters ofxc

ithat are each

138

represented as continuous vector via a character embedding. The sequence of charactersxcis then

139

mapped to a sequence of statesh(c,1), . . . ,h(c,Sc)and we take the last stateh(c,Sc), mapped through 140

a fully connected layer as the featurization ofxc. The hyperparameters of each LSTM featurizer are

141

then the number of layers, the number of hidden units of the LSTM cell and the dimension of the

142

characters embeddingciand the number of hidden units of the final fully connected output layer of

143

the LSTM featurizer.

144

Finally all feature vectors c(xc)are concatenated into one feature vectorx˜ RDwhereD= Dc

145

is the sum over all latent dimensionsDc. We will refer to the numerical representation of the values

146

in the to-be-imputed column asy {1,2, . . . , Dy}, as in standard supervised learning settings. The

147

symbols of the target columns use the same encoding as the aforementioned categorical variables.

148

After the featurization˜xof input or feature columns and the encodingyof the to-be-imputed 149

column we can cast the imputation problem as a supervised problem by learning to predict the label

150

distribution ofyfromx˜.

151

Imputation is then performed by modelingp(y|x˜, ), the probability over all observed values or

152

classes ofygiven an input or feature vectorx˜with some learned parameters . The probability

153

p(y|x˜, )is modeled, as

154

p(y|x˜, ) =softmax[Wx˜+b] (1)

where = (W, z, b)are parameters to learn withW RDy D,b RD

y andzis a vector containing

155

all parameters of all learned column featurizers c. Finally,softmax(q)denotes the elementwise

156

softmax function expq

jexpqjwhereqjis thejelement of a vectorq. 157

The parameters are learned by minimizing the cross-entropy loss between the predicted distribution

158

and the observed labelsy, e.g. by taking

159

= arg min

N

1 Dy

1

log(p(y|x˜, )) onehot(y) (2)

wherep(y|˜x, )) RDydenotes the output of the model andNis the number of rows for which 160

a value was observed in the target column corresponding to y. We useone-hot(y) {0,1}Dyto 161

4

of sizeMc; for notational simplicity also these scalar variables will be denoted as vectorxcin the

116

following. For sequential data, the numerical representationxc {0,1,2, . . . , A

c}Scis a vector of

117

lengthSc, whereScdenotes the length of the sequence orstringin columncandAcdenotes the

118

size of the set of all characters observed in columnc. The data types were determined using heuristics. 119

In the data sets used in the experiments the data types of the columns were easy to separate into

120

free text fields (product description, bullet points, item name) and categorical variables

121

(e.g.color, brand, size, ...). If the data types are not known upfront, heuristics based on the

122

distribution of values in a column can be used [?].

123

Once the the non-numerical data is encoded into their respective numerical representation, a column

124

specific feature extraction mapping c(xk) RDc is computed, whereDkdenotes the dimensionality

125

for a latent variable associated with columnc. We considered three different types of featurizers

126

c(.). For categorical data we use a one-hot encoded embedding (as known from word embeddings).

127

For columnscwith sequentialstringdata we consider two different possibilities for c(xc): an

128

n-gram representation or a character-based embedding using a Long-Short-Term-Memory (LSTM)

129

recurrent neural network [?]TODO: CITATION. For the character n-gram representation, c(xc)is

130

a hashing function that maps each n-gram, wheren {1, . . . ,5}, in the character sequencexcto

131

aDcdimensional vector; hereDcdenotes here the number of hash buckets. Note that the hashing

132

featurizer is a stateless component that does not require any training, whereas the other two types of

133

feature maps contain parameters that are learned using backpropagation.

134

For the case of categorical embeddings, we use a standard linear embedding fed into one fully

135

connected layer. The hyperparameter for this featurizer was a single one and used to set both the

136

embedding dimensionality as well as the number of hidden units of the output layer. In the LSTM

137

case, we featurizexcby iterating an LSTM through the sequence of characters ofxc

ithat are each

138

represented as continuous vector via a character embedding. The sequence of charactersxcis then

139

mapped to a sequence of statesh(c,1), . . . ,h(c,Sc)and we take the last stateh(c,Sc), mapped through 140

a fully connected layer as the featurization ofxc. The hyperparameters of each LSTM featurizer are

141

then the number of layers, the number of hidden units of the LSTM cell and the dimension of the

142

characters embeddingciand the number of hidden units of the final fully connected output layer of

143

the LSTM featurizer.

144

Finally all feature vectors c(xc)are concatenated into one feature vector˜x RDwhereD= Dc

145

is the sum over all latent dimensionsDc. We will refer to the numerical representation of the values

146

in the to-be-imputed column asy {1,2, . . . , Dy}, as in standard supervised learning settings. The

147

symbols of the target columns use the same encoding as the aforementioned categorical variables.

148

After the featurizationx˜ of input or feature columns and the encodingy of the to-be-imputed

149

column we can cast the imputation problem as a supervised problem by learning to predict the label

150

distribution ofyfromx˜.

151

Imputation is then performed by modelingp(y|˜x, ), the probability over all observed values or

152

classes ofygiven an input or feature vectorx˜with some learned parameters . The probability

153

p(y|˜x, )is modeled, as

154

p(y|x˜, ) =softmax[Wx˜+b] (1)

where = (W, z, b)are parameters to learn withW RDy D,b RD

y andzis a vector containing

155

all parameters of all learned column featurizers c. Finally,softmax(q)denotes the elementwise

156

softmax function expq

jexpqj whereqjis thejelement of a vectorq. 157

The parameters are learned by minimizing the cross-entropy loss between the predicted distribution

158

and the observed labelsy, e.g. by taking

159

= arg min

N

1 Dy

1

log(p(y|x˜, )) onehot(y) (2)

wherep(y|˜x, )) RDydenotes the output of the model andNis the number of rows for which 160

a value was observed in the target column corresponding to y. We useone-hot(y) {0,1}Dyto 161

4

The first step in the model is to transform for each row thestringdata of each columncinto a 112

numerical representationxc. We use different encoders for different non-numerical data types and

113

distinguish betweencategoricalandsequentialdata. For categorical data the numerical representation 114

xc {1,2, . . . , Mc}of thestringdata in columncwas simply the index of the value histogram

115

of sizeMc; for notational simplicity also these scalar variables will be denoted as vectorxcin the

116

following. For sequential data, the numerical representationxc {0,1,2, . . . , Ac}Scis a vector of 117

lengthSc, whereScdenotes the length of the sequence orstringin columncandAcdenotes the

118

size of the set of all characters observed in columnc. The data types were determined using heuristics. 119

In the data sets used in the experiments the data types of the columns were easy to separate into 120

free text fields (product description, bullet points, item name) and categorical variables 121

(e.g.color, brand, size, ...). If the data types are not known upfront, heuristics based on the 122

distribution of values in a column can be used [?]. 123

Once the the non-numerical data is encoded into their respective numerical representation, a column 124

specific feature extraction mapping c(xk) RDc is computed, whereDkdenotes the dimensionality

125

for a latent variable associated with columnc. We considered three different types of featurizers 126

c(.). For categorical data we use a one-hot encoded embedding (as known from word embeddings).

127

For columnscwith sequentialstringdata we consider two different possibilities for c(xc): an

128

n-gram representation or a character-based embedding using a Long-Short-Term-Memory (LSTM) 129

recurrent neural network [?]TODO: CITATION. For the character n-gram representation, c(xc)is

130

a hashing function that maps each n-gram, wheren {1, . . . ,5}, in the character sequencexcto

131

aDcdimensional vector; hereDcdenotes here the number of hash buckets. Note that the hashing

132

featurizer is a stateless component that does not require any training, whereas the other two types of 133

feature maps contain parameters that are learned using backpropagation. 134

For the case of categorical embeddings, we use a standard linear embedding fed into one fully 135

connected layer. The hyperparameter for this featurizer was a single one and used to set both the 136

embedding dimensionality as well as the number of hidden units of the output layer. In the LSTM 137

case, we featurizexcby iterating an LSTM through the sequence of characters ofxc

ithat are each

138

represented as continuous vector via a character embedding. The sequence of charactersxcis then

139

mapped to a sequence of statesh(c,1), . . . ,h(c,Sc)and we take the last stateh(c,Sc), mapped through 140

a fully connected layer as the featurization ofxc. The hyperparameters of each LSTM featurizer are

141

then the number of layers, the number of hidden units of the LSTM cell and the dimension of the 142

characters embeddingciand the number of hidden units of the final fully connected output layer of

143

the LSTM featurizer. 144

Finally all feature vectors c(xc)are concatenated into one feature vectorx˜ RDwhereD= Dc

145

is the sum over all latent dimensionsDc. We will refer to the numerical representation of the values

146

in the to-be-imputed column asy {1,2, . . . , Dy}, as in standard supervised learning settings. The 147

symbols of the target columns use the same encoding as the aforementioned categorical variables. 148

After the featurization˜xof input or feature columns and the encodingyof the to-be-imputed 149

column we can cast the imputation problem as a supervised problem by learning to predict the label 150

distribution ofyfromx˜. 151

Imputation is then performed by modelingp(y|x˜, ), the probability over all observed values or 152

classes ofygiven an input or feature vectorx˜with some learned parameters . The probability 153

p(y|x˜, )is modeled, as 154

p(y|˜x, ) =softmax[Wx˜+b] (1) where = (W, z, b)are parameters to learn withW RDy D,b RD

y andzis a vector containing

155

all parameters of all learned column featurizers c. Finally,softmax(q)denotes the elementwise

156

softmax function expq

jexpqjwhereqjis thejelement of a vectorq. 157

The parameters are learned by minimizing the cross-entropy loss between the predicted distribution 158

and the observed labelsy, e.g. by taking 159

= arg min

N

1

Dy

1

log(p(y|x˜, )) onehot(y) (2)

wherep(y|˜x, )) RDydenotes the output of the model andNis the number of rows for which 160

a value was observed in the target column corresponding to y. We useone-hot(y) {0,1}Dyto 161

4

of sizeMc; for notational simplicity also these scalar variables will be denoted as vectorxcin the

116

following. For sequential data, the numerical representationxc {0,1,2, . . . , A

c}Scis a vector of

117

lengthSc, whereScdenotes the length of the sequence orstringin columncandAcdenotes the

118

size of the set of all characters observed in columnc. The data types were determined using heuristics.

119

In the data sets used in the experiments the data types of the columns were easy to separate into

120

free text fields (product description, bullet points, item name) and categorical variables

121

(e.g.color, brand, size, ...). If the data types are not known upfront, heuristics based on the

122

distribution of values in a column can be used [?].

123

Once the the non-numerical data is encoded into their respective numerical representation, a column

124

specific feature extraction mapping c(xk) RDc is computed, whereDkdenotes the dimensionality

125

for a latent variable associated with columnc. We considered three different types of featurizers

126

c(.). For categorical data we use a one-hot encoded embedding (as known from word embeddings).

127

For columnscwith sequentialstringdata we consider two different possibilities for c(xc): an

128

n-gram representation or a character-based embedding using a Long-Short-Term-Memory (LSTM)

129

recurrent neural network [?]TODO: CITATION. For the character n-gram representation, c(xc)is

130

a hashing function that maps each n-gram, wheren {1, . . . ,5}, in the character sequencexcto

131

aDcdimensional vector; hereDcdenotes here the number of hash buckets. Note that the hashing

132

featurizer is a stateless component that does not require any training, whereas the other two types of

133

feature maps contain parameters that are learned using backpropagation.

134

For the case of categorical embeddings, we use a standard linear embedding fed into one fully

135

connected layer. The hyperparameter for this featurizer was a single one and used to set both the

136

embedding dimensionality as well as the number of hidden units of the output layer. In the LSTM

137

case, we featurizexcby iterating an LSTM through the sequence of characters ofxc

ithat are each

138

represented as continuous vector via a character embedding. The sequence of charactersxcis then

139

mapped to a sequence of statesh(c,1), . . . ,h(c,Sc)and we take the last stateh(c,Sc), mapped through 140

a fully connected layer as the featurization ofxc. The hyperparameters of each LSTM featurizer are

141

then the number of layers, the number of hidden units of the LSTM cell and the dimension of the

142

characters embeddingciand the number of hidden units of the final fully connected output layer of

143

the LSTM featurizer.

144

Finally all feature vectors c(xc)are concatenated into one feature vectorx˜ RDwhereD= Dc

145

is the sum over all latent dimensionsDc. We will refer to the numerical representation of the values

146

in the to-be-imputed column asy {1,2, . . . , Dy}, as in standard supervised learning settings. The

147

symbols of the target columns use the same encoding as the aforementioned categorical variables.

148

After the featurizationx˜of input or feature columns and the encodingyof the to-be-imputed

149

column we can cast the imputation problem as a supervised problem by learning to predict the label

150

distribution ofyfromx˜.

151

Imputation is then performed by modelingp(y|x˜, ), the probability over all observed values or

152

classes ofygiven an input or feature vectorx˜with some learned parameters . The probability

153

p(y|x˜, )is modeled, as

154

p(y|x˜, ) =softmax[W˜x+b] (1)

where = (W, z, b)are parameters to learn withW RDy D,b RD

y andzis a vector containing

155

all parameters of all learned column featurizers c. Finally,softmax(q)denotes the elementwise

156

softmax function expq

jexpqjwhereqjis thejelement of a vectorq. 157

The parameters are learned by minimizing the cross-entropy loss between the predicted distribution

158

and the observed labelsy, e.g. by taking 159

= arg min

N

1 Dy

1

log(p(y|˜x, )) onehot(y) (2)

wherep(y|x˜, )) RDydenotes the output of the model andNis the number of rows for which 160

a value was observed in the target column corresponding to y. We useone-hot(y) {0,1}Dyto 161 4 String Representation Numerical Representation Training Rows Featurizers … … Pandas, Dask or Spark MxNet Latent representation

Imputation of attribute color

C PU C PU /G PU To-Be-Imputed Row 1. 2. 3. 4. One Hot Encoder One Hot Encoder Embedding

Figure 1: Imputation example on non-numerical data with deep learning; symbols explained in Section 3.

•End-to-end optimization of imputation model.Our system learns numerical feature representations automatically and is readily applicable as a plugin in data pipelines that require com-pleteness for data sources (Section 3).

2 RELATED WORK

Missing data is a common problem in statistics and has become more important with the increasing availability of data and the popularity of data science. Methods for dealing with missing data can be divided into the following categories [21]:

(1) Remove cases with incomplete data (2) Add dedicated missing value symbol (3) Impute missing values

Approach (1) is also known as complete-case analysis and is the simplest approach to implement – yet it has the decisive disadvan-tage of excluding a large part of the data. Rows of a table are often not complete, especially when dealing with heterogeneous data sources. Discarding an entire row of a table if just one column has a missing value would often discard a substantial part of the data. Approach (2) is also simple to implement as it essentially only introduces a placeholder symbol for missing data. The resulting data is consumed by downstream ML models as if there were no missing values. This approach can be considered the de-facto standard in many machine learning pipelines and often achieves competitive results, as the missingness of data can also convey information. If the application is really only focused on the �nal output of a model and the goal is to make a pipeline survive missing data cases, then this approach is sensible.

Finally, approach (3) replaces missing values with substitute values (also known as imputation). In many application scenarios including the one considered in this work, users are interested in imputing these missing values. For instance, when browsing for a product, a customer might re�ne a search by speci�c queries (say

Figure 2: Imputation example on non-numerical data with deep learning.

Depending on its data type, each input column gets a dedicated

featurizer

denoted below as

φ

. Similarly, depending on the data type for the output column,

DataWig

uses a different

loss function. The types of featurizers and loss functions currently available in

DataWig

are

listed in Figure 1 (left). The code design enables users to extend these types easily to images

or sequences. More formally,

DataWig

imputes values

y

ˆ

o

=

f

x

I

)

in an output column

o

, where

f

refers to the imputation model learned on the observed values in column

o

and

x

˜

I

refers to the concatenation of the features extracted from all input columns

x

˜

I

=

[φ1(

x

1

), φ2(

x

2

), . . . , φ

CI

(

x

CI

)]

, see also Figure 2. Depending on the data type in the output

column,

f

is fitted using either a regression or a cross-entropy loss. The API allows imputation

of missing values in a table by simply passing in a

pandas

dataframe and specifying the input

and output columns, see Figure 1 (right). Alternatively, all missing values in a dataframe can

be imputed by calling

SimpleImputer.complete(df)

. Additionally

DataWig

has a number

of features that help to automate end-to-end imputation for practitioners: The data types

are detected using heuristics and the corresponding features are learned automatically during

the training of the imputation model. All hyperparameters and neural architectures are

optimized using random search (Bergstra and Bengio, 2012), which can be constrained to

a specified time limit. Probabilistic model outputs are automatically calibrated on the

validation set (Guo et al., 2017), and if requested explanations for the imputations can be

computed for string input columns to better understand the imputations. Moreover, the

model is equipped with functionality to compensate for label shift between the training and

unlabelled production data using in the approach proposed by Lipton et al. (2018).

3. Evaluation

In Figure 3 we compare

DataWig

on

numerical missing value imputation

against three methods

from the

fancyimpute

package (mean, KNN and matrix factorization) and two methods

from the IterativeImputer of

sklearn

with the estimators RandomForestRegressor and

(4)

Biessmann, Rukat, Schmidt, Naidu, Schelter, Taptunov, Lange

5

10

30

50

Percent Missing

0.0

0.5

1.0

Relative MSE

Missing completely at random

5

10

30

50

Percent Missing

0.0

0.5

1.0

Missing at random

5

10

30

50

Percent Missing

0.0

0.5

1.0

Missing not at random

mean

knn

mf

sklearn_rf

sklearn_linreg

datawig

Figure 3: Comparison of imputation performance across several synthetic and real world

data sets with varying amounts of missing data and missingness structure. Relative mean

squared errors were normalized to the highest error in a condition.

imputation here means that 10 consecutive imputation rounds were performed for replacing

the missing values in the input columns. All methods were evaluated on one synthetic linear

and one synthetic non-linear problem and five real data sets available in

sklearn

. Values

were discarded either completely at random, at random (conditioned on values in another

randomly chosen column being in a random interval) or not at random (conditioned on

values to be discarded). In Figure 3 the relative mean-squared error is shown, normalized to

the highest MSE in a given condition. For

DataWig

the

SimpleImputer.complete

function

with random search for hyperparameter tuning was used. For each baseline method, grid

search was performed for hyperparameter optimization on a validation set, test errors were

obtained on a separate test set, for details and unnormalized results see benchmarks github

repository. We observe that

DataWig

compares favourably with other implementations for

numeric imputation, even in the difficult missing-not-at-random condition. These experiments

allow for a comparison of

DataWig

with existing packages designed for numeric data. For

imputation with text data, standard numerical imputation methods cannot be used. When

comparing

DataWig

with mode imputation and string matching (Dallachiesa et al., 2013)

DataWig

achieves a median F1-score of 60% across three tasks, imputation of the Wikipedia

attributes

birth-place, genre

and

location

, with a simple n-gram model. Mode imputation

reached a median F1-score of 0.7% and string matching 7.5% (Biessmann et al., 2018).

4. Conclusion

We present

DataWig

, a software package that enables practitioners such as data engineers

to achieve state-of-the-art imputation results with minimal set up and maintenance. Our

package complements the open source ecosystem by offering deep learning modules combined

with neural architecture search and end-to-end optimization of the imputation pipeline,

also for data types like free text fields.

DataWig

compares favorably to existing imputation

approaches on numeric imputation problems, but also when imputing values in tables

containing unstructured text. The software, unit tests, and all experiments are available

under

github.com/awslabs/datawig

. While the present version of our software does not

(5)

References

Gustavo Batista and Maria Carolina Monard. An analysis of four missing data treatment

methods for supervised learning.

Applied Artificial Intelligence

, 17(5-6):519–533, 2003.

benchmarks github repository.

https://github.com/awslabs/datawig/blob/master/

experiments/benchmarks.py

.

James Bergstra and Yoshua Bengio. Random Search for Hyper-Parameter Optimization.

Journal of Machine Learning Research

, 13:281–305, 2012. URL

http://dl.acm.org/

citation.cfm?id=2188395

.

Felix Biessmann, David Salinas, Sebastian Schelter, Philipp Schmidt, and Dustin Lange. “deep”

learning for missing value imputation in tables with non-numerical data. In

International

Conference on Information and Knowledge Management (CIKM)

, 2018.

Ramiro D. Camino, Christian A. Hammerschmidt, and Radu State. Improving Missing Data

Imputation with Deep Generative Models. feb 2019. URL

http://arxiv.org/abs/1902.

10666

.

Michele Dallachiesa, Amr Ebaid, Ahmed Eldawy, Ahmed Elmagarmid, Ihab F Ilyas, Mourad

Ouzzani, and Nan Tang. Nadeef: a commodity data cleaning system. In

ACM SIGMOD

,

pages 541–552. ACM, 2013.

Lovedeep Gondara and Ke Wang. Multiple imputation using deep denoising autoencoders.

CoRR

, abs/1705.02737, 2017. URL

http://arxiv.org/abs/1705.02737

.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On Calibration of Modern

Neural Networks. In

International Conference on Machine Learning (ICML)

, 2017.

Yehuda Koren, Robert M. Bell, and Chris Volinsky. Matrix factorization techniques for

recommender systems.

IEEE Computer

, 42(8):30–37, 2009.

Zachary C. Lipton, Yu-Xiang Wang, and Alex Smola. Detecting and Correcting for Label

Shift with Black Box Predictors.

International Conference on Machine Learning (ICML)

,

2018.

R. J. A. Little and D. B. Rubin.

Statistical analysis with missing data. 2nd ed.

Wiley-Interscience, Hoboken, NJ„ 2002.

Pierre-Alexandre Mattei and Jes Frellsen. MIWAE: Deep generative modelling and imputation

of incomplete data sets. In

International Conference on Machine Learning (ICML)

, 2019.

Imke Mayer, Julie Josse, Nicholas Tierney, and Nathalie Vialaneix. R-miss-tastic: a unified

platform for missing values methods and workflows. art. arXiv:1908.04822, 2019.

Rahul Mazumder, Trevor Hastie, and Robert Tibshirani. Spectral regularization algorithms

for learning large incomplete matrices.

Journal of Machine Learning Research

, 11:2287–

(6)

Biessmann, Rukat, Schmidt, Naidu, Schelter, Taptunov, Lange

A Nazabal, Pablo M Olmos, Zoubin Ghahramani, and Isabel Valera. Handling Incomplete

Heterogeneous Data using VAEs. 2018. URL

https://arxiv.org/pdf/1807.03653.pdf

.

Sebastian Schelter, Felix Biessmann, Tim Januschowski, David Salinas, Stephan Seufert,

and Gyuri Szarvas. On challenges in machine learning model management.

IEEE Data

Engineering Bulletin

, 41, 12 2018.

D Sculley, G Holt, D Golovin, E Davydov, T Phillips, D Ebner, V Chaudhary, M Young, and

D Dennison. Hidden Technical Debt in Machine Learning Systems.

Neural Information

Processing Systems (NeurIPS)

, 2015.

Daniel J Stekhoven and Peter Bühlmann. MissForest—non-parametric missing value

imputa-tion for mixed-type data.

Bioinformatics

, 28(1):112–118, 2012.

Olga G. Troyanskaya, Michael N. Cantor, Gavin Sherlock, Patrick O. Brown, Trevor Hastie,

Robert Tibshirani, David Botstein, and Russ B. Altman. Missing value estimation methods

for DNA microarrays.

Bioinformatics

, 17(6):520–525, 2001.

S. van Buuren.

Flexible Imputation of Missing Data. 2nd ed.

CRC/Chapman & Hall, 2018.

Jinsung Yoon, James Jordon, and Mihaela van der Schaar. GAIN: Missing Data Imputation

using Generative Adversarial Nets.

International Conference on Machine Learning (ICML)

,

2018. URL

http://arxiv.org/abs/1806.02920

.

Hongbao Zhang, Pengtao Xie, and Eric P. Xing. Missing value imputation based on deep

generative models.

CoRR

, abs/1808.01684, 2018. URL

http://arxiv.org/abs/1808.

References

Related documents

In bringing together some of London’s finest medical minds, BHB Medical are not only able to offer best-in-class knowledge and advice, but is also able to facilitate access to

Even though the (data) range between member states increased with the enlargement of the Union, delegation of risk regulation as a general mode of governance

Points Time 1 160 10233997 Alexander Александр PROKOPOVICH ПРОКОПОВИЧ RUS 107EZ03 CAPITOL КАПИТОЛЬ WESTF Вестф...

Specifically, when acetaminophen and/or ibuprofen do not lower the pain enough, then start with a lower dose of narcotic, and increase the dose if pain remains uncontrolled,

It is possible to assign special missing values to numeric variables to distinguish different types of missing data.. The letter in the special missing value may be uppercase

Through an artefactual field experiment with 200 Bolivian microfinance borrowers, we observe that subjects from real-world delinquent borrowing groups do not prefer risky

Y-Tween can form the wall units of a bathroom, separating dinning room and kitchen, in a child’s bedroom to provide the simultaneous functions of wall, shelves and cupboards....

Cross-Section Data - Variables: 136 - 141 (Political Efficacy Variables) Imputation of Missing Values on TOUCH and INTEREST... Listwise Deletion of Missing Values after Imputation