Uncertainty Analysis in Statistical Matching

(1)

Uncertainty Analysis in Statistical Matching

Pier Luigi Conti¹, Daniela Marella², and Mauro Scanu³

Among the goals of statistical matching, a very important one is the estimation of the joint distribution of variables not jointly observed in a sample survey but separately available from independent sample surveys. The absence of joint information on the variables of interest leads to uncertainty about the data generating model. The present article reviews the concept of uncertainty in statistical matching and how to measure it by providing a unified framework for the parametric and nonparametric setting. Furthermore, the reduction of uncertainty due to the introduction of logical constraints is investigated and a simulation experiment is performed.

Key words: Data fusion; synthetical matching.

1. Introduction

A key issue in decision processes is the availability of information. Rich and detailed information will help a decision maker in making good decisions. Information may be gathered through a survey but the cost of implementing a new survey and the availability of information in archives or in other surveys may lead researchers to combine available information from different data sources. Statistical matching (sometimes called data fusion) aims at combining information available in distinct sample surveys referred to the same target population. Formally, let Y and Z be two random variables (rv). Statistical matching is defined as the estimation of the ðY; ZÞ joint distribution (e.g., its cumulative distribution function – cdf – Fð y; zÞ) or of some of its parameters when:

. Y and Z are not jointly observed in a survey, but . Y is observed in a sample A, of size nA,

. Z is observed in a sample B, of size nB,

. A and B are independent, and the sets of observed units in the two samples do not overlap (it is not possible to use record linkage),

. A and B both observe a set of additional variables X.

A detailed list of statistical matching applications is to be found in D’Orazio et al. (2006b) and Ridder and Moffitt (2007).

Generally speaking, two approaches have been considered. At first, e.g., Okner (1972), techniques based on the conditional independence assumption between Y and Z given X (CIA assumption) were considered. Appropriateness of CIA is discussed in several papers.

qStatistics Sweden

1 University of Rome, La Sapienza, p. Aldo Moro 5, IT 00185 Rome, Italy. Email: [email protected]

2 University of Rome, Roma Tre, via Milazzo 11B, IT 00185 Rome, Italy. Email: [email protected]

3 ISTAT, via Cesare Balbo 16, 00184 Rome, Italy. Email: [email protected]

(2)

We cite, among others, Sims (1972) and Rodgers (1984). The second group of techniques uses external auxiliary information on the statistical relationships between Y and Z (e.g., an additional file C where ðX; Y; ZÞ are jointly observed is available, as in Singh et al. (1993)).

Most of these approaches aim at making an inference as to the distribution of Y and Z by reconstructing, through imputation procedures, a synthetic file containing X, Y, Z data.

These approaches are actually theoretically justified when the joint probability distribution of the variables of interest in the population coincides with the probability distribution of the same variables in the synthetic data file, or at least when these two distributions are “very close”. The discrepancy between the joint distribution of the variables of interest (a) in the population, and (b) in the synthetic data file, is usually referred to as matching noise (cf. Paass 1986). Attempts at evaluating the “closeness” of the empirical distribution of imputed data to the empirical distribution of “real” data have been performed in the literature, see D’Orazio et al. (2006b). In a nonparametric setting an important role is played by hot deck methods, as well as k-Nearest Neighbour (kNN, for short) methods. Their properties are studied in Marella et al. (2008) and in Conti et al.

(2009a), where both theoretical and simulation results are obtained.

As a matter of fact, the CIA is usually a misspecified assumption, while external auxiliary information is hardly ever available. The lack of joint information on the variables of interest is the cause of uncertainty about the model of ðX; Y; ZÞ, since the sample information provided by A and B is actually unable to discriminate among a set of plausible models for ðX; Y; ZÞ. In other terms, the adopted statistical model is not identifiable on the basis of sample data. Hence, a third group of techniques that does not directly aim at reconstructing a complete data set is introduced. This group of techniques addresses the so-called identification problem. The main consequence of the lack of identifiability is that some parameters of the model cannot be estimated on the basis of the available sample information. For instance, in a parametric setting, instead of point estimates, one can only reasonably construct sets of “possible estimates”, compatible with what can be actually estimated. These sets (usually intervals) formally provide a representation of uncertainty about the model parameters.

In this setting, the main task consists in constructing a coherent measure that can reasonably quantify the uncertainty about the (estimated) model. From an operational point of view, a measure of uncertainty essentially quantifies how “large” is the class of models estimable on the basis of the available sample information. The smaller the measure of uncertainty, the smaller the class of estimated models.

This article aims at reviewing uncertainty in statistical matching providing a unified framework for the parametric and nonparametric approach. More specifically, in Section 2 model uncertainty is defined and uncertainty measures are introduced. Furthermore, several examples are illustrated for both parametric and nonparametric settings. Section 3 shows the effect on model uncertainty due to the introduction of logical constraints (frequently used in imputation and editing). Finally, in Section 4 a simulation experiment is performed.

2. What is Uncertainty in Statistical Matching

Let ðX; Y; ZÞ be a trivariate random variable with joint (cumulative) distribution function (df) Mðx; y; zÞ. Denote further by Hð y; zjxÞ the joint df of ðY; ZÞ conditionally on X, Fð yjxÞ

(3)

and GðzjxÞ the marginal d.f.s of Y and Z conditionally on X, respectively, and QðxÞ the marginal df of X.

Let M ¼ {Mðx; y; zÞ}, H^x¼ {Hð y; zjxÞ}, F^x¼ {Fð yjxÞ}, G^x¼ {GðzjxÞ}, Q ¼ {QðxÞ}

be the sets of df’s in which M, H, F, G and Q lie. Knowledge of a specific distribution in M is equivalent to knowledge of the corresponding distributions in H^xand Q .

We further assume that the observation mechanism allows one to identify the classes F^x, G^x and Q , but not the class H^x, unless special assumptions are made. The most important is the conditional independence assumption, under which we have H^x¼ {Fð yjxÞGðzjxÞ; F [ F^x; G [ G^x}, i.e., H^x¼ F^x£ G^x.

We distinguish two main cases:

1. Parametric case: the classes H^xand Q are indexed by real or vector parameters. As a consequence we may write:

Q ¼ {QjðxÞ;j[ J}

F^x¼ {Ffð yjxÞ;f[ F^x} G^x¼ {GcðzjxÞ;c[ C^x} H^x¼ {H_gð y; zjxÞ;g[ G^x}

2. Nonparametric case: the above classes H^xand Q cannot be indexed by real or vector parameters.

Given the (known) marginal distributions Fð yjxÞ and GðzjxÞ, uncertainty is defined as the set of probability distributions of the random vector ðY; ZjXÞ compatible with Fð yjxÞ and GðzjxÞ. Formally

H²ð y; zjxÞ # Hð y; zjxÞ # H^þð y; zjxÞ ð1Þ

where

H^þð y; zjxÞ ¼ sup_H[H^x{Hð y; zjxÞ : Hð y; þ1jxÞ ¼ Fð yjxÞ; Hðþ1; zjxÞ ¼ GðzjxÞ};

H²ð y; zjxÞ ¼ inf_H[H^x{Hð y; zjxÞ : Hð y; þ1jxÞ ¼ Fð yjxÞ; Hðþ1; zjxÞ ¼ GðzjxÞ}:

In the parametric case, we may write H^þð y; zjxÞ ¼ H^þð y; zjx;c;fÞ

¼

g[G^x

sup{Hð y; zjx;gÞ; Hð y; þ1jx;gÞ ¼ Fð yjx;fÞ; Hðþ1; zjx;gÞ ¼ Gðzjx;cÞ}

H²ð y; zjxÞ ¼ H²ð y; zjx;c;fÞ

¼g[Ginf{Hð y; zjx;^x gÞ; Hð y; þ1jx;gÞ ¼ Fð yjx;fÞ; Hðþ1; zjx;gÞ ¼ Gðzjx;cÞ}

The df’s belonging to the class H^x are compatible with the available information, namely they may have generated the observed data.

The interval (1) summarizes the pointwise uncertainty about the statistical model for every triple ðx; y; zÞ. It is intuitive to take the length of such an interval as a pointwise

(4)

measure of uncertainty. The wider the interval of extremes ½H²ð; jxÞ; H^þð; jxÞ the more uncertain the statistical model generating the data w.r.t. ðx; y; zÞ. Of course, if the model is identifiable, then the interval reduces to a single point, with length zero, and there is no uncertainty at all.

In this case, it is possible to compute the following measures of uncertainty. The first one is based on computing for each point ðx; y; zÞ the distance between the extrema H²ð y; zjxÞ; H^þð y; zjxÞ.

Formally

D^yzjx¼ H^þðFð yjxÞ; GðzjxÞÞ 2 H²ðFð yjxÞ; GðzjxÞÞ ð2Þ In order to summarize the differences in (2) into an overall measure of uncertainty we may take the average length

D ¼ ð

R³

D^yzjxdTðx; y; zÞ

where Tðx; y; zÞ is a weight function onR³, i.e., a measure having total mass 1. A “natural”

choice consists in taking

dTðx; y; zÞ ¼ dFð yjxÞdGðzjxÞdQðxÞ

This distribution is “natural” because: i) it is the simplest choice given the available df’s Fð yjxÞ, GðzjxÞ, QðxÞ and makes the integral in D easily computable in many cases;

ii) among all the possible associations between Y and Z, we consider a neutral position, i.e., we do not give preference to any specific positive or negative association. Hence, a conditional measure of uncertainty is

D^x¼ ð

R²

D^yzjxdFð yjxÞdGðzjxÞ ð3Þ

As a matter of fact, integrating (3) with respect to X, we obtain again the overall measure of uncertainty D

D ¼ ð

RD^xdQðxÞ ð4Þ

Relationships (3), (4) show that the unconditional uncertainty measure (4) can be expressed as a weighted mean of conditional uncertainty measures (3). Then, the larger D^xthe more uncertain the data generating statistical model.

If interest is only in the joint distribution of the (not jointly observed) variables ðY; ZÞ, it is possible also to consider the unconditional Fre´chet class

ðEx½H²ðFð yjxÞ; GðzjxÞÞ; Ex½H^þðFð yjxÞ; GðzjxÞÞÞ ð5Þ A unique number can be obtained by using an appropriate weight function Tð y; zÞ.

The uncertainty measure depends on the marginal df’s F and G, that can be estimated by the available sample information. The asymptotic properties of the estimators of F and G (both in the parametric and nonparametric cases) determine the asymptotic properties of D. The limit distribution of D allows to construct confidence intervals and hypothesis tests for the uncertainty measure.

(5)

Example 1. Nonparametric Case – Assume that H^xcannot be indexed by real or vector parameters, so that H^x is the set of all bivariate df’s having marginal df’s Fð yjxÞ and GðzjxÞ. Denote further by Lðu; vÞ ¼ max ð0; u þ v 2 1Þ, and by Uðu; vÞ ¼ min ðu; vÞ. Then the Fre´chet inequalities

LðFð yjxÞ; GðzjxÞÞ # Hð y; zjxÞ # UðFð yjxÞ; GðzjxÞÞ ð6Þ hold true, LðFð yjxÞ; GðzjxÞÞ and UðFð yjxÞ; GðzjxÞÞ being bivariate df’s corresponding to maximal negative and maximal positive association between Y and Z (given X).

Furthermore

UðFðþ1jxÞ; GðzjxÞÞ ¼ LðFðþ1jxÞ; GðzjxÞÞ ¼ GðzjxÞ and

UðFð yjxÞ; Gðþ1jxÞÞ ¼ LðFð yjxÞ; Gðþ1jxÞÞ ¼ Fð yjxÞ

namely LðFð yjxÞ; GðzjxÞÞ and UðFð yjxÞ; GðzjxÞÞ both possess marginal df’s Fð yjxÞ and GðzjxÞ.

As a consequence of Fre´chet inequalities (6), we have H²ð y; zjxÞ ¼ LðFð yjxÞ; GðzjxÞÞ

and

H^þð y; zjxÞ ¼ UðFð yjxÞ; GðzjxÞÞ

With such a choice, the overall uncertainty measure (4) becomes

D ¼ ð

R

ð

R²½UðFð yjxÞ; GðzjxÞÞ 2 LðFð yjxÞ; GðzjxÞÞd½Fð yjxÞdGðzjxÞ

dQðxÞ

¼ ð

R

D^xdQðxÞ ¼ Ex½D^x

ð7Þ

where D^x¼

ð

R²½UðFð yjxÞ; GðzjxÞÞ 2 LðFð yjxÞ; GðzjxÞÞdFð yjxÞdGðzjxÞ ð8Þ is the uncertainty measure about the considered statistical model, conditionally on X ¼ x.

Further results and simplifications can be obtained when Fð yjxÞ and GðzjxÞ are continuous df’s, so that R_x¼ FðYjxÞ and Sx¼ GðZjxÞ both have uniform distribution in [0,1] for every x. From the Sklar theorem (Nelsen 1999), the joint df H may be uniquely represented as

Hð y; zjxÞ ¼ C_xðFð yjxÞ; GðzjxÞÞ ð9Þ

where Cxðu; vÞ is the copula, i.e., the joint d.f. of Rxand S_x. In this case, the conditional

(6)

uncertainty measure simplifies to D^x¼

ð

½0;1²

½Uðr; sÞ 2 Lðr; sÞd½r s

¼ ð1

0

ð1 0

{ min ðr; sÞ 2 max ð0; r þ s 2 1Þ}dr ds

ð10Þ

In this setting, it is straightforward to show that the uncertainty measure (10) is equal to 1/6 for any F and G, and for any x. The value D^x¼ 1=6 represents the maximum uncertainty achieved when no external auxiliary information beyond the knowledge of the marginals Fð yjxÞ and GðzjxÞ is available. As a consequence, also the unconditional uncertainty measure (7) takes the value 1/6.

Example 2. Contingency Tables – Assume that, given a discrete rv X with I categories, Y and Z are discrete rv’s with J and K categories, respectively, not necessarily ordered. In this case, the whole theory developed so far can still be applied by simply replacing the d.f.s with probability functions. With no loss of generality, the symbols i ¼ 1; : : : ; I, j ¼ 1; : : : ; J, and k ¼ 1; : : : ; K, denote the categories taken by X, Y and Z, respectively.

Letg_jkjibe the probability that ðY ¼ j; Z ¼ kjX ¼ i Þ, with marginalsf_jji representing the probability of the event ðY ¼ jjX ¼ i Þ andckjirepresenting the probability of the event ðZ ¼ kjX ¼ i Þ. Since

Lðfjji;ckjiÞ #gj;kji# Uðfjji;ckjiÞ ð11Þ

the conditional uncertainty measure turns out to be equal to D^x¼i¼X^J

j¼1

X^K

k¼1

{Uðf_jji;c_kjiÞ 2 Lðf_jji;c_kjiÞ}f_jjic_kji

and the overall uncertainty measure is D ¼X^I

i¼1

D^x¼iji

wherejirepresents the probability of the event ðX ¼ i Þ. Sharper results are obtained when the categories taken by ðX; Y; ZÞ are ordered. For the sake of simplicity, we use the customary order for natural numbers. In this case, the cumulative df’s are

H_j;kji¼X^j

y¼1

X^k

z¼1

gyzji; j ¼ 1; : : : ; J; k ¼ 1; : : : ; K; i ¼ 1; : : : ; I

F_jji¼X^j

y¼1

fyji; j ¼ 1; : : : ; J; i ¼ 1; : : : ; I

G_kji¼X^k

z¼1

c_zji; k ¼ 1; : : : ; K; i ¼ 1; : : : ; I

(7)

Using the same arguments as in Example 1, the inequalities

LðF_jji; G_kjiÞ # Hj;kji# UðF_jji; G_kjiÞ ð12Þ

hold. Note that inequalities (12) imply that

g²_jkji#gjkji#g^þ_jkji ð13Þ

where

g²_jkji¼ LðF_jji; G_kjiÞ 2 LðF_j21ji; G_kjiÞ 2 LðF_jji; G_k21jiÞ þ LðF_j21ji; G_k21jiÞ g^þ_jkji¼ UðFjji; G_kjiÞ 2 UðFj21ji; G_kjiÞ 2 UðFjji; G_k21jiÞ þ UðFj21ji; G_k21jiÞ Then, it is not difficult to realize that

g²_jkji$ Lðf_jji;c_kjiÞ g^þ_jkji# Uðfjji;ckjiÞ

so that inequalities (13) are sharper than (11). At any rate, the conditional uncertainty measure is

D^x¼i¼X^J

j¼1

X^K

k¼1

{UðF_jji; G_kjiÞ 2 LðF_jji; G_kjiÞ}f_jjic_kji ð14Þ

and the unconditional uncertainty measure is D ¼X^I

i¼1

D^x¼iji ð15Þ

In contrast with what was found in Example 1, D is not equal to 1/6, since the uncertainty measure depends on the marginal probabilities of YjX and ZjX.

Example 3. Multinormal Distribution – Let X, Y, Z, be jointly multinormally distributed, with mean vector and covariance matrix equal to

m¼ mx

my

mz

2 66 4

3 77 5; S¼

s²_x sxy sxz

sxy s²_y syz

sxz syz s²_z 2

66 64

3 77

75 ð16Þ

respectively. All bivariate marginals are still multinormal, with mean vectors and covariance matrices easily obtained from (16). Furthermore, conditionally on X, (Y, Z) do have joint bivariate normal distribution, with mean vector and covariance matrix given by

myzjx¼

myþby=xðx 2mxÞ mzþbz=xðx 2mxÞ 2

64

3 75;

S_yzjx¼

s²_y1 2r²_xy syszðryz2rxyrxzÞ syszðryz2rxyrxzÞ s²_z1 2r²_xz 2

66 4

3 77 5

ð17Þ

(8)

where byjx¼sxy

s²_x^;bzjx¼sxz

s²_x

are the regression coefficients of Y, Z w.r.t. X, respectively, and rxy ¼ sxy

sxsy

;rxz¼ sxz

sxsz

;ryz¼ syz

sysz

are the correlation coefficients between ðX; YÞ, ðX; ZÞ, ðY; ZÞ, respectively.

The only unidentified parameter is ryz, the correlation coefficient between Y and Z. From (17) it is immediate to see that, givenrxy andrxz,ryzranges in the interval

rxyrxz2

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 2r²_xy

1 2r²_xz

r

;rxyrxzþ

1 2r²_xz

r

ð18Þ

From the Slepian (1962) inequality, the joint d.f. of Y, Z given X is an increasing function of the correlation coefficient between Y and Z given X:

r_yzjx ¼ ryz2rxyrxz

1 2r²_xz

r

i.e., it turns out to be a monotone function of ryz. In other words, the conditional d.f.s Hð y; zjxÞ are totally ordered on the basis of ryz. The case ryz¼rxyrxz2 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

ð1 2r²_xyÞð1 2r²_xzÞ q

corresponds to ryzjx ¼ 21, so that conditionally on X, Y is a linear decreasing function of Z (and vice versa). As a consequence, H²ð y; zjxÞ ¼ max ð0; Fð yjxÞ þ GðzjxÞ 2 1Þ, as in Example 1. Similarly, the case ryz¼rxyrxzþ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi

ð1 2r²_xyÞð1 2r²_xzÞ q

corresponds to r_yzjx¼ 1, i.e., to H^þð y; zjxÞ ¼ min ðFð yjxÞ; GðzjxÞÞ, again as in Example 1. As a consequence, the computation of uncertainty measures parallels the nonparametric case.

The assumption of multinormal distribution of ðX; Y; ZÞ is largely debated in the statistical matching literature, since the seminal paper by Kadane (1978).

Useful discussions and advances are to be found in Moriarity and Scheuren (2001), Ra¨ssler (2002), Kiesl and Ra¨ssler (2008).

At first glance, results of Examples 1, 3 seem to be contradictory. In the multinormal case, uncertainty affects the correlation coefficient ryz. Uncertainty bounds for the correlation coefficients are determined by the condition that the correlation matrix of ðX; Y; ZÞ must be positive definite. In this case, the uncertainty space restricts continuously to a single value when rxy or rxz go continuously to 1 (or 2 1). The width of the Fre´chet class behaves differently. The lower and upper bounds of the Fre´chet class only depend on the marginal df’s of Y and Z (given X).

The copula transformation makes such marginals uniform in the interval (0, 1), and hence the whole Fre´chet class is always mapped on the same subset of the square (0, 1)². This is why the uncertainty measure remains constantly equal to 1/6.

A discontinuity happens if either Y or Z is fully determined by X. In this case, the square (0, 1)² collapses to the segment (0, 1), and the uncertainty measure becomes equal to zero.

(9)

Example 4. Skew-normal Distribution – Suppose that the joint distribution of ðY; ZÞ, conditionally on X, does possess a bivariate skew-normal distribution with parameters ðd; V;a;tÞ, (see Capitanio et al. 2003), where

V ¼

vyy vyz

vyz vzz

!

is a 2 £ 2 matrix. In particular,vyzis the association parameter between Y and Z, given X, and ranges in the interval ð21; 1Þ. Of course, this means that also the distribution of ðX; Y; ZÞ does possess (extended) skew-normal distribution as shown in Capitanio et al.

(2003). Asvyztends to þ 1, ðY; ZÞjX tends to be perfectly positively correlated, so that H^þð y; zjxÞ ¼ min ðFð yjxÞ; GðzjxÞÞ, again as in Example 1.

However, asvyztends to 2 1, Y and Z (given X) do not achieve the situation of perfect negative correlation. Hence, H²ð y; zjxÞ $ max ð0; Fð yjxÞ þ GðzjxÞÞ.

Example 5. Farlie-Gumbel-Morgenstern Distribution – Suppose that X is distributed as a uniform in ½0; 1. Conditionally on X, let Y and Z be marginally distributed as uniform distributions in ½0; x. Furthermore, the joint d.f. of ðY; ZÞ given X is:

Hð y; zjxÞ ¼y x z

x 1 þa 1 2y x

1 2z x

h i

ð19Þ where 21 #a# 1 (otherwise (19) is not a distribution function), 0 # y # x, 0 # z # x.

The maximal value H^þð y; zjxÞ is obtained whena¼ 1, and is equal to H^þð y; zjxÞ ¼y

x z

x 1 þ 1 2y x

1 2z x

h i

which is strictly smaller than min ðFð yjxÞ; GðzjxÞÞ. On the other hand, the minimal value H²ð y; zjxÞ is obtained whena¼ 21, and is equal to

H²ð y; zjxÞ ¼y x z

x 1 2 1 2y x

1 2z x

h i

which is strictly greater than max ð0; Fð yjxÞ þ GðzjxÞÞ.

3. Reducing Uncertainty

When auxiliary information in the form of logical constraints regarding the statistical model for ðY; ZÞ or ðY; ZjXÞ is available, some models for ðX; Y; ZÞ become illogical and must be excluded from the set of plausible distribution functions. As a consequence, the statistical model for the data becomes less uncertain. This kind of information is frequently used, for instance, in imputation and editing, see Luzi et al. (2007). The introduction of such constraints may complicate the estimation process, because there could be no probability distributions satisfying both the logical constraints and lying in the set of all estimated bivariate distributions.

There are essentially two types of constraints:

1. constraints on the values of the parameters;

2. constraints on the support of ðY; ZÞ or ðY; ZjXÞ.

(10)

The first kind of constraints is essentially used in the parametric case. They have been largely studied in D’Orazio et al. (2006a) when ðX; Y; ZÞ are categorical. The constraints are in terms of inequalities between the cell probabilities of the contingency table ðY; ZÞ or ðY; ZÞjX. In the normal case, these constraints can be used when information on the correlation coefficientryzor equivalentlyr_yzjx is available, see Ra¨ssler (2002).

The second kind of constraints has been mainly used in the parametric-multinomial case (D’Orazio et al. 2006a). These constraints have been defined in terms of structural zeros on the joint ðY; ZÞ distribution. In Vantaggi (2008) a different estimation method to deal with the case of constraints in form of structural zeros in a coherent probability framework is proposed.

The use of structural zeros for reducing uncertainty has been confined to the case of categorical rv’s. In fact, for continuous rv’s the introduction of structural zeros corresponds to the restriction of the ðY; ZÞ or ðY; ZjXÞ support to a subset of the Cartesian product of the Y and Z supports, i.e., ðY; ZÞ is fully concentrated on a set of points ð y; zÞ [R². The relationship between the original marginal parameters and the joint ðY; ZÞ distribution is heavily dependent on the nature of this constraint and does not allow easy generalizations. For instance, even if both ðYjXÞ and ðZjXÞ follow a normal distribution, the introduction of a structural zero will prevent ðY; ZjXÞ from being bivariate normal. As a matter of fact, structural zeros can be easily used in a nonparametric approach, and the estimation of uncertainty can be performed using Equations (3) and (4).

For the sake of simplicity, assume that Y and Z are continuous r.v.’s, that X is a discrete r.v., and that the support constraints have the following shape: a_x# Y=Z # b_x given X. An example of this kind of constraint happens in household surveys, when X is a household socio-economic character, Y the household consumption and Z the household income. Conditionally on X, the inequality ax# Y=Z # b_x holds. Using the notation a ^ b ¼ min ða; bÞ, it is not difficult to see that the pair of inequalities

L^x G z ^ y a_xjx

; Fð y ^ bxzjxÞ

# Hðz; yjxÞ # U^x G z ^ y a_xjx

; Fð y ^ bxzjxÞ

holds, where

U^x G z ^ y ax

jx

; Fð y ^ bxzjxÞ

¼ min G z ^ y ax

jx

; Fð y ^ bxzjxÞ

¼ min GðzjxÞ ^ G y a_xjx

; Fð yjxÞg ^ Fðb_xzjxÞ

¼ min GðzjxÞ; G y ax

jx

; Fð yjxÞ; FðbxzjxÞ

ð20Þ

L^x G z ^ y ax

jx

; Fð y ^ bxzjxÞ

¼ max 0; G z ^ y ax

jx

þ Fð y ^ bxzjxÞ 2 1

¼ max 0; GðzjxÞ ^ G y ax

jx

þ Fð yjxÞ ^ FðbxzjxÞ 2 1

ð21Þ

(11)

With obvious notation, according to (8) the conditional uncertainly measure is then D^x_c¼

ð

R²

U^x G z ^ y ax

jx

; Fð y ^ bxzjxÞ

2L^x G z ^ y ax

jx

; Fð y ^ bxzjxÞ

dFð yjxÞdGðzjxÞ

ð22Þ

where c denotes the constraint ax# Y=Z # bx. The unconditional uncertainty measure is easily obtained from (22) by averaging w.r.t. the distribution of X.

As remarked above, in this case both conditional and unconditional uncertainty measures are strictly smaller than 1/6. In other words, the constraint a_x# Y=Z # b_x reduces the uncertainty about the statistical model.

Clearly, the reduction in the model uncertainty depends on how informative the imposed constraints are. In some circumstances, the logical constraints can be so informative that the Fre´chet class reduces to a unique distribution. This happens when Z can be exactly predicted by X (or alternatively by ðX; YÞ) due to a deterministic relationship between X and Z (or ðX; Z; YÞ). The effects on model uncertainty due to the introduction of logical constraints in a nonparametric setting are investigated in Conti et al.

(2009b). Moreover, in Conti et al. (2009b) estimators of conditional and unconditional measures of uncertainty under logical constraints are proposed and their consistency and asymptotic normality are proved.

4. A Simulation Study

In this section we perform a simulation experiment in order to evaluate the effects on model uncertainty due to the introduction of logical constraints and in order to investigate the asymptotic behavior of the uncertainty measures proposed in Section 3. The simulation involves the following steps.

1. A sample A composed by n_Ai.i.d. records has been generated according to a bivariate normal distribution ðX; YÞ with mean vectormxy ¼ ð0; 0Þ and covariance matrix given by

Sxy ¼

1 0:5

0:5 1

!

2. A sample B composed by nBi.i.d. records has been generated according to a bivariate normal distribution ðX; ZÞ with mean vectormxz¼ ð0; 0Þ and covariance matrix given by

Sxz¼ 1 0:8 0:8 1

!

3. Since X is a continuous variable in order to compute the uncertainty measures we need to discretize it. The range of observed values x ¼ xA< xB has been divided into I ¼ 20 intervals according to the hth percentiles of data, for h ¼ 5 2 95ð5Þ. Formally, the discretized variable will assume the value x ¼ i, for i ¼ 1; : : : ; 20.

(12)

4. Conditionally on x ¼ i (for i ¼ 1; : : : ; 20) the pointwise measure of uncertainty in ðx; y; zÞ is estimated by

D^^yzjx¼ minð ^G_n_BðzjxÞ; ^F_n_Að yjxÞÞ 2 maxð ^G_n_BðzjxÞ þ ^F_n_Að yjxÞ 2 1; 0Þ ð23Þ where ^F_n_Að yjxÞ and ^G_n_BðzjxÞ are the empirical distribution functions of Fð yjxÞ and GðzjxÞ, respectively. Let nA;xand nB;xbe the number of units such that x ¼ i in samples A and B, and denote by y ¼ ð y1;x; : : : ; yn_A;xÞ and z ¼ ðz1;x; : : : ; zn_B;xÞ the corresponding sample values of Y and Z, respectively. Then, for x ¼ i we obtain nA;xnB;x pointwise uncertainty measures (23).

5. Conditionally on x ¼ i the uncertainty measure when no external auxiliary information is available is obtained by averaging the n_A;xn_B;x pointwise uncertainty measures (23).

Formally D^^x¼i¼ 1

n_A;xn_B_{;x y[y} X

z[z

XD^^yzjx ð24Þ

6. The overall unconditional uncertainty measure is a weighted mean of the conditional uncertainty measure (24)

D ¼^

x

XD^x¼i^pðxÞ ð25Þ

where

^pðxÞ ¼ nA;xþ nB;x

nAþ nB

ð26Þ

7. Suppose that auxiliary information in the form of logical constraints regarding the statistical model for ðX; Y; ZÞ is available. More specifically let us assume that there exist constants ax and bx such that ax# Y=Z # b_x. In the simulation study we set ax¼ 0:1=x and bx¼ 1=x. Then, conditionally on x ¼ i and under the constraint 0:1=x # Y=Z # 1=x, the pointwise uncertainty measure in ðx; y; zÞ is estimated by

D^^yzjx_c ¼ ^U^xð y; zÞ 2 ^L^xð y; zÞ

where the subscript c represents the constraint and

U^^xð y; zÞ ¼ min{ ^FnAð yjxÞ; ^F_n_AðbxzjxÞ; ^G_n_BðzjxÞ; ^G_n_Bð y=axjxÞ} ð27Þ

^L^xð y; zÞ ¼ max{0; minð ^Fn_Að yjxÞ; ^Fn_AðbxzjxÞÞ þ minð ^Gn_BðzjxÞ; ^Gn_Bð y=axjxÞÞ 2 1} ð28Þ

8. Conditionally on x ¼ i the estimator of conditional uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is given by

D^^x¼i_c ¼ 1 nA;xnB;x y[y

X

z[z

XD^^yzjx_c ð29Þ

(13)

9. The overall unconditional uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is obtained as a weighted mean of conditional uncertainty measures (29)

D^_c¼X

x

D^^x¼i_c ^pðxÞ ð30Þ

where ^pðxÞ is given by (26).

10. Steps 1 to 9 have been repeated 500 times and for different sample sizes nA¼ nB¼ n ¼ ð1; 000; 2; 000; 5; 000Þ.

Given n, for each sample s (for s ¼ 1; : : : ; 500) and for each category i (for i ¼ 1; : : : ; 20) denote by ^D^x¼i;s, ^D^s, ^D^x¼i;s_c , ^D^s_cthe uncertainty measures (24), (25), (29) and (30), respectively.

As for the conditional uncertainty measure estimates (24), their average over simulation runs is

D^x¼i¼ 1 500

X⁵⁰⁰

s¼1

D^^x¼i;s ð31Þ

while D ¼ 1

500 X⁵⁰⁰

s¼1

D^^s ð32Þ

represents the corresponding overall uncertainty. The standard deviation of ^D^x¼i;s, again over simulation runs, is equal to

SDð ^D^x¼iÞ ¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1

499 X⁵⁰⁰

s¼1

ð ^D^x¼i;s2 D^x¼iÞ² vu

uu

t ð33Þ

and the corresponding mean squared error is given by

MSEð ^D^x¼iÞ ¼ ½Sdð ^D^x¼iÞ²þ ð D^x¼i2 D^x¼iÞ² ð34Þ where D^x¼i¼ 1=6.

Analogously, if we refer to the conditional uncertainty measure estimates (29), we have that the average over simulation runs is given by

D^x¼i¼ 1 500

X⁵⁰⁰

s¼1

D^^x¼i;s_c ð35Þ

while

Dc¼X⁵⁰⁰

s¼1

D^s_c ð36Þ

(14)

is the corresponding overall uncertainty under the constraint 0:1=x # Y=Z # 1=x. The standard deviation over simulation runs is

SD ^D^x¼i_c

¼

ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1

499 X⁵⁰⁰

s¼1

D^^x¼i;s_c 2 D^x¼i_c

2

vu uu

t ð37Þ

and finally the corresponding mean squared error is MSE ^D^x¼i_c

¼ Sd ^h D^x¼i_c i2

þ D^x¼i2 D^x¼i_c ²

ð38Þ where D^x¼i_c is the actual conditional uncertainty measure for the ith category under the constraint c. The values D^x¼i_c have been numerically computed. More specifically N ¼ 30; 000 i.i.d. records have been generated from ðX; YÞ and ðX; ZÞ according to the distributions specified in Steps 1 and 2. The actual overall uncertainty measure under the constraint is given by

D_c¼ 1 20

X²⁰

i¼1

D^x¼i_c ø 0:11225 ð39Þ

4.1. Simulation Results

In Table 1 the uncertainty measures D and Dc given by (32) and (36), respectively, are reported for different values of the sample size n. Note that D is always smaller than 1/6, and that it tends to 1/6 as the sample size n increases. Furthermore, the constraint 0:1=x # Y=Z # 1=x reduces the model uncertainty for ðX; Y; ZÞ from D ø 0:16 to D_cø 0:11.

In Figure 1 the expectation of the estimated uncertainty (31) is reported. As the category x ¼ i varies, the expected estimated uncertainty is essentially the same. Furthermore, as the sample size n increases from 1,000 to 5,000, for each category x ¼ i the expectation of the estimated mean uncertainty tends to 1/6 and the precision of our estimator increases as shown in Figure 2, where the standard deviation (33) is reported. As shown in Figure 3, as n increases, the mean squared error (34) decreases for each category x ¼ i. Such a property comes from the consistency of the empirical distribution functions ^Fn_Að yjxÞ and ^Gn_BðzjxÞ as estimators of Fð yjxÞ and GðzjxÞ, respectively.

The same analysis has been performed under the constraint 0:1=x # Y=Z # 1=x where ax¼ 0:1=x and bx¼ 1=x. Conditional and unconditional uncertainty measures

Table 1. Uncertainty measures as the sample size n varies

n D Dc

1,000 0.16653 0.10946 2,000 0.16663 0.11068 5,000 0.16666 0.11163

(15)

(numerically computed) are reported in Figure 4. The horizontal line is the unconditional uncertainty measure, while the dashed line depicts the conditional uncertainty measure as x ¼ i varies. Conditional uncertainty takes small values, about 3 £ 10²²– 7 £ 10²², for x ¼ 1 – 5(1), while it is about 0.15 for the categories x ¼ 13 2 18ð1Þ. The behavior of the corresponding estimates is shown in Figures 5, 6 and 7. In particular, in Figure 5 the expectations of the estimated uncertainty for sample sizes n ¼ 1; 000; 2; 000; 5; 000, i.e., D^x¼i_c given by (35), are reported. The absolute bias of estimators is approximately constant

3.0×10^–5

2.0×10^–5

1.0×10^–5

0.0

5 10 15 20

x

Uncertainty Sd

n = 5,000 n = 2,000 n = 1,000

Fig. 2. Standard deviation of estimated uncertainty measures in the unconstrained case n = 5,000

n = 2,000 n = 1,000

5 10 15 20

x

Uncertainty mean

0.16675

0.16670

0.16665

0.16660

0.16655

0.16650

Fig. 1. Expected values of estimated uncertainty measures in the unconstrained case

(16)

and small, if compared to the corresponding true uncertainty measures. In Figures 6, 7 the standard deviation (37) and the mean squared error (38) of our uncertainty estimators are reported. The highest efficiency of estimation is obtained for the classes x ¼ 1 2 5ð1Þ.

The smallest efficiency is obtained for the classes x ¼ 12 2 16ð1Þ. At any rate, we stress that the mean squared errors of conditional uncertainty estimators are all considerably small, ranging from 10²³to 1:3 £ 10²².

In Figure 8 the Kernel density estimate of the overall uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is shown. Such an estimate has been computed using for

2.5×10^–8

2.0×10^–8

1.5×10^–8

1.0×10^–8

0.0

5 10 15 20

x

Uncertainty Mse

n = 5,000 n = 2,000 n = 1,000

Fig. 3. Mean squared error of estimated uncertainty measures in the unconstrained case

0.15

0.10

0.05

5 10 15 20

x

Fig. 4. Uncertainty measureDcandDⁱ_cfor each x ¼ i under the constraint 0:1=x # Y=Z # 1=x

(17)

each sample size n the 500 value D_cgiven by (30). Note that as the sample size n increases the uncertainty measure distribution tends to a normal distribution as proved in Conti et al.

(2009b). The bandwidth selection rule is given by Sheather and Jones (1991). Similar considerations hold for the estimated conditional uncertainty measures.

5. Concluding Remarks

The approach described in the above sections seems weird in statistical inference: the result of an estimation process is not a single value, or a single distribution, but a set of

0.15

0.10

0.05

5 10 15 20

x

Uncertainty mean

n = 5,000 n = 2,000 n = 1,000

Fig. 5. Expectation of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x

0.030

0.025

0.020

0.015

0.010

0.005

0.000

5 10 15 20

x

Uncertainty Sd

n = 5,000 n = 2,000 n = 1,000

Fig. 6. Standard deviation of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x

(18)

values or a set of distributions. Such sets are justified by the lack of joint information on the rv’s of interest Y and Z, and can be mitigated (although not avoided) by strong statistical relationships of X with Y and/or Z, as well as by constraints justified by logical rules on the ðY; ZÞ joint distribution. If a set is not acceptable, it is then necessary to include additional assumptions, such as the CIA. However, the use of additional assumptions, sometimes neither theoretically justified nor testable on the available data, is not a benefit, as stated in the so-called “law of decreasing credibility” (cf. Manski 2003): the credibility of inference decreases with the strength of the assumption maintained.

0.030

0.025

0.020

0.015

0.010

0.005

0.000

5 10 15 20

x

Uncertainty Mse

n = 5,000 n = 2,000 n = 1,000

Fig. 7. Mean squared error of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x

200

150

100

50

0

0.104 0.106 0.108 0.110 0.112 0.114 0.116 0.118

Density

n = 5,000 n = 2,000 n = 1,000

Fig. 8. Density estimate of overall uncertainty measure under the constraint 0:1=x # Y=Z # 1=x

(19)

The present article essentially acknowledges the importance of the law of decreasing credibility. The measure of uncertainty defined in Section 2 aims at measuring the

“credibility of inferences” that can be made on the basis of available data and prior information. Similar approaches have also been used in the areas of statistical disclosure control (cf. Fienberg and Slavkovic 2005), ecological inference (cf. King 1997) and analysis of partially observed data sets when neither MAR nor MCAR assumptions are made (cf. Imbens and Manski 2004).

6. References

Capitanio, A., Azzalini, A., and Stanghellini, E. (2003). Graphical Models for Skew- Normal Variates. Scandinavian Journal of Statistics, 30, 129 – 144.

Conti, P.L., Marella, D., and Scanu, M. (2009a). Evaluation of Matching Noise for Imputation Techniques based on Nonparametric Local Linear Regression Estimators.

Computational Statistics & Data Analysis, 53, 354 – 365.

Conti, P.L., Marella, D., and Scanu, M. (2009b). How Far from Identifiability?

A Nonparametric Approach to Uncertainty in Statistical Matching under Logical Constraints. Technical Report n.22, DSPSA, Universita` di Roma “La Sapienza”.

D’Orazio, M., Di Zio, M., and Scanu, M. (2006a). Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints. Journal of Official Statistics, 22, 137 – 157.

D’Orazio, M., Di Zio, M., and Scanu, M. (2006b). Statistical Matching: Theory and Practice. Chichester: Wiley.

Fienberg, S.E. and Slavkovic, A.B. (2005). Preserving the Confidentiality of Categorical Data Bases When Releasing Information for Association Rules. Data Mining and Knowledge Discovery, 11, 155 – 180.

Imbens, G. and Manski, C. (2004). Confidence Intervals for Partially Identified Parameters. Econometrica, 72, 1845 – 1857.

Kadane, J.B. (1978). Some Statistical Problems in Merging Data Files. Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159 – 179 (Reprinted in 2001, Journal of Official Statistics, 17, 423 – 433).

Kiesl, H. and Ra¨ssler, S. (2008). The Validity of Data Fusion. Insights on Data Integration Methodologies, ESSnet-ISAD workshop, Vienna, 29 – 30 May 2008, 59 – 67.

King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton: Princeton University Press.

Luzi, O., Di Zio, M., Guarnera, U., Manzari, A., De Waal, T., Pannekoek, J., Hoogland, J., Tempelman, C., Hulliger, B., and Kilchmann, D. (2007). Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys, ISTAT, CBS, SFSO, Eurostat. Available at http://edimbus.istat.it.

Manski, C.F. (2003). Partial Identification of Probability Distributions. New York:

Springer Verlag.

Marella, D., Scanu, M., and Conti, P.L. (2008). On the Matching Noise of Some Nonparametric Imputation Procedures. Statistics and Probability Letters, 78, 1593 – 1600.

(20)

Moriarity, C. and Scheuren, F. (2001). Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure. Journal of Official Statistics, 17, 407 – 422.

Nelsen, R.B. (1999). An Introduction to Copulas. New York: Springer Verlag.

Okner, B.A. (1972). Constructing a New Microdata Base from Existing Microdata Sets:

The 1966 Merge File. Annals of Economic and Social Measurement, 1, 325 – 362.

Paass, G. (1986). Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information. Microanalytic Simulation Models to Support Social and Financial Policy, G.H. Orcutt and H. Quinke (eds). Amsterdam: Elsevier.

Ra¨ssler, S. (2002). Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Lecture Notes in Statistics. New York: Springer Verlag.

Ridder, G. and Moffitt, R. (2007). The Econometrics of Data Combination. Handbook of Econometrics, J.J. Heckmann and E.E. Leamer (eds). vol. 6A. Amsterdam: Elsevier.

Rodgers, W.L. (1984). An Evaluation of Statistical Matching. Journal of Business and Economic Statistics, 2, 91 – 102.

Sheather, S.J. and Jones, M.C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society, Series B, 53, 683 – 690.

Sims, C.A. (1972). Comments and Rejoinder (On Okner (1972)). Annals of Economic and Social Measurement, 1, 343 – 345, 355 – 357.

Singh, A.C., Mantel, H., Kinack, M., and Rowe, G. (1993). Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption.

Survey Methodology, 19, 59 – 79.

Slepian, D. (1962). The One-Sided Barrier Problem for Gaussian Noise. Bell System Technical Journal, 41, 463 – 501.

Vantaggi, B. (2008). Statistical Matching of Multiple Sources: A Look Through Coherence. International Journal of Approximate Reasoning, 49, 701 – 711.

Received November 2009 Revised November 2011