Uncertainty Analysis in Statistical Matching
Pier Luigi Conti1, Daniela Marella2, and Mauro Scanu3
Among the goals of statistical matching, a very important one is the estimation of the joint distribution of variables not jointly observed in a sample survey but separately available from independent sample surveys. The absence of joint information on the variables of interest leads to uncertainty about the data generating model. The present article reviews the concept of uncertainty in statistical matching and how to measure it by providing a unified framework for the parametric and nonparametric setting. Furthermore, the reduction of uncertainty due to the introduction of logical constraints is investigated and a simulation experiment is performed.
Key words: Data fusion; synthetical matching.
1. Introduction
A key issue in decision processes is the availability of information. Rich and detailed information will help a decision maker in making good decisions. Information may be gathered through a survey but the cost of implementing a new survey and the availability of information in archives or in other surveys may lead researchers to combine available information from different data sources. Statistical matching (sometimes called data fusion) aims at combining information available in distinct sample surveys referred to the same target population. Formally, let Y and Z be two random variables (rv). Statistical matching is defined as the estimation of the ðY; ZÞ joint distribution (e.g., its cumulative distribution function – cdf – Fð y; zÞ) or of some of its parameters when:
. Y and Z are not jointly observed in a survey, but . Y is observed in a sample A, of size nA,
. Z is observed in a sample B, of size nB,
. A and B are independent, and the sets of observed units in the two samples do not overlap (it is not possible to use record linkage),
. A and B both observe a set of additional variables X.
A detailed list of statistical matching applications is to be found in D’Orazio et al. (2006b) and Ridder and Moffitt (2007).
Generally speaking, two approaches have been considered. At first, e.g., Okner (1972), techniques based on the conditional independence assumption between Y and Z given X (CIA assumption) were considered. Appropriateness of CIA is discussed in several papers.
qStatistics Sweden
1 University of Rome, La Sapienza, p. Aldo Moro 5, IT 00185 Rome, Italy. Email: [email protected]
2 University of Rome, Roma Tre, via Milazzo 11B, IT 00185 Rome, Italy. Email: [email protected]
3 ISTAT, via Cesare Balbo 16, 00184 Rome, Italy. Email: [email protected]
We cite, among others, Sims (1972) and Rodgers (1984). The second group of techniques uses external auxiliary information on the statistical relationships between Y and Z (e.g., an additional file C where ðX; Y; ZÞ are jointly observed is available, as in Singh et al. (1993)).
Most of these approaches aim at making an inference as to the distribution of Y and Z by reconstructing, through imputation procedures, a synthetic file containing X, Y, Z data.
These approaches are actually theoretically justified when the joint probability distribution of the variables of interest in the population coincides with the probability distribution of the same variables in the synthetic data file, or at least when these two distributions are “very close”. The discrepancy between the joint distribution of the variables of interest (a) in the population, and (b) in the synthetic data file, is usually referred to as matching noise (cf. Paass 1986). Attempts at evaluating the “closeness” of the empirical distribution of imputed data to the empirical distribution of “real” data have been performed in the literature, see D’Orazio et al. (2006b). In a nonparametric setting an important role is played by hot deck methods, as well as k-Nearest Neighbour (kNN, for short) methods. Their properties are studied in Marella et al. (2008) and in Conti et al.
(2009a), where both theoretical and simulation results are obtained.
As a matter of fact, the CIA is usually a misspecified assumption, while external auxiliary information is hardly ever available. The lack of joint information on the variables of interest is the cause of uncertainty about the model of ðX; Y; ZÞ, since the sample information provided by A and B is actually unable to discriminate among a set of plausible models for ðX; Y; ZÞ. In other terms, the adopted statistical model is not identifiable on the basis of sample data. Hence, a third group of techniques that does not directly aim at reconstructing a complete data set is introduced. This group of techniques addresses the so-called identification problem. The main consequence of the lack of identifiability is that some parameters of the model cannot be estimated on the basis of the available sample information. For instance, in a parametric setting, instead of point estimates, one can only reasonably construct sets of “possible estimates”, compatible with what can be actually estimated. These sets (usually intervals) formally provide a representation of uncertainty about the model parameters.
In this setting, the main task consists in constructing a coherent measure that can reasonably quantify the uncertainty about the (estimated) model. From an operational point of view, a measure of uncertainty essentially quantifies how “large” is the class of models estimable on the basis of the available sample information. The smaller the measure of uncertainty, the smaller the class of estimated models.
This article aims at reviewing uncertainty in statistical matching providing a unified framework for the parametric and nonparametric approach. More specifically, in Section 2 model uncertainty is defined and uncertainty measures are introduced. Furthermore, several examples are illustrated for both parametric and nonparametric settings. Section 3 shows the effect on model uncertainty due to the introduction of logical constraints (frequently used in imputation and editing). Finally, in Section 4 a simulation experiment is performed.
2. What is Uncertainty in Statistical Matching
Let ðX; Y; ZÞ be a trivariate random variable with joint (cumulative) distribution function (df) Mðx; y; zÞ. Denote further by Hð y; zjxÞ the joint df of ðY; ZÞ conditionally on X, Fð yjxÞ
and GðzjxÞ the marginal d.f.s of Y and Z conditionally on X, respectively, and QðxÞ the marginal df of X.
Let M ¼ {Mðx; y; zÞ}, Hx¼ {Hð y; zjxÞ}, Fx¼ {Fð yjxÞ}, Gx¼ {GðzjxÞ}, Q ¼ {QðxÞ}
be the sets of df’s in which M, H, F, G and Q lie. Knowledge of a specific distribution in M is equivalent to knowledge of the corresponding distributions in Hxand Q .
We further assume that the observation mechanism allows one to identify the classes Fx, Gx and Q , but not the class Hx, unless special assumptions are made. The most important is the conditional independence assumption, under which we have Hx¼ {Fð yjxÞGðzjxÞ; F [ Fx; G [ Gx}, i.e., Hx¼ Fx£ Gx.
We distinguish two main cases:
1. Parametric case: the classes Hxand Q are indexed by real or vector parameters. As a consequence we may write:
Q ¼ {QjðxÞ;j[ J}
Fx¼ {Ffð yjxÞ;f[ Fx} Gx¼ {GcðzjxÞ;c[ Cx} Hx¼ {Hgð y; zjxÞ;g[ Gx}
2. Nonparametric case: the above classes Hxand Q cannot be indexed by real or vector parameters.
Given the (known) marginal distributions Fð yjxÞ and GðzjxÞ, uncertainty is defined as the set of probability distributions of the random vector ðY; ZjXÞ compatible with Fð yjxÞ and GðzjxÞ. Formally
H2ð y; zjxÞ # Hð y; zjxÞ # Hþð y; zjxÞ ð1Þ
where
Hþð y; zjxÞ ¼ supH[Hx{Hð y; zjxÞ : Hð y; þ1jxÞ ¼ Fð yjxÞ; Hðþ1; zjxÞ ¼ GðzjxÞ};
H2ð y; zjxÞ ¼ infH[Hx{Hð y; zjxÞ : Hð y; þ1jxÞ ¼ Fð yjxÞ; Hðþ1; zjxÞ ¼ GðzjxÞ}:
In the parametric case, we may write Hþð y; zjxÞ ¼ Hþð y; zjx;c;fÞ
¼
g[Gx
sup{Hð y; zjx;gÞ; Hð y; þ1jx;gÞ ¼ Fð yjx;fÞ; Hðþ1; zjx;gÞ ¼ Gðzjx;cÞ}
H2ð y; zjxÞ ¼ H2ð y; zjx;c;fÞ
¼g[Ginf{Hð y; zjx;x gÞ; Hð y; þ1jx;gÞ ¼ Fð yjx;fÞ; Hðþ1; zjx;gÞ ¼ Gðzjx;cÞ}
The df’s belonging to the class Hx are compatible with the available information, namely they may have generated the observed data.
The interval (1) summarizes the pointwise uncertainty about the statistical model for every triple ðx; y; zÞ. It is intuitive to take the length of such an interval as a pointwise
measure of uncertainty. The wider the interval of extremes ½H2ð; jxÞ; Hþð; jxÞ the more uncertain the statistical model generating the data w.r.t. ðx; y; zÞ. Of course, if the model is identifiable, then the interval reduces to a single point, with length zero, and there is no uncertainty at all.
In this case, it is possible to compute the following measures of uncertainty. The first one is based on computing for each point ðx; y; zÞ the distance between the extrema H2ð y; zjxÞ; Hþð y; zjxÞ.
Formally
Dyzjx¼ HþðFð yjxÞ; GðzjxÞÞ 2 H2ðFð yjxÞ; GðzjxÞÞ ð2Þ In order to summarize the differences in (2) into an overall measure of uncertainty we may take the average length
D ¼ ð
R3
DyzjxdTðx; y; zÞ
where Tðx; y; zÞ is a weight function onR3, i.e., a measure having total mass 1. A “natural”
choice consists in taking
dTðx; y; zÞ ¼ dFð yjxÞdGðzjxÞdQðxÞ
This distribution is “natural” because: i) it is the simplest choice given the available df’s Fð yjxÞ, GðzjxÞ, QðxÞ and makes the integral in D easily computable in many cases;
ii) among all the possible associations between Y and Z, we consider a neutral position, i.e., we do not give preference to any specific positive or negative association. Hence, a conditional measure of uncertainty is
Dx¼ ð
R2
DyzjxdFð yjxÞdGðzjxÞ ð3Þ
As a matter of fact, integrating (3) with respect to X, we obtain again the overall measure of uncertainty D
D ¼ ð
RDxdQðxÞ ð4Þ
Relationships (3), (4) show that the unconditional uncertainty measure (4) can be expressed as a weighted mean of conditional uncertainty measures (3). Then, the larger Dxthe more uncertain the data generating statistical model.
If interest is only in the joint distribution of the (not jointly observed) variables ðY; ZÞ, it is possible also to consider the unconditional Fre´chet class
ðEx½H2ðFð yjxÞ; GðzjxÞÞ; Ex½HþðFð yjxÞ; GðzjxÞÞÞ ð5Þ A unique number can be obtained by using an appropriate weight function Tð y; zÞ.
The uncertainty measure depends on the marginal df’s F and G, that can be estimated by the available sample information. The asymptotic properties of the estimators of F and G (both in the parametric and nonparametric cases) determine the asymptotic properties of D. The limit distribution of D allows to construct confidence intervals and hypothesis tests for the uncertainty measure.
Example 1. Nonparametric Case – Assume that Hxcannot be indexed by real or vector parameters, so that Hx is the set of all bivariate df’s having marginal df’s Fð yjxÞ and GðzjxÞ. Denote further by Lðu; vÞ ¼ max ð0; u þ v 2 1Þ, and by Uðu; vÞ ¼ min ðu; vÞ. Then the Fre´chet inequalities
LðFð yjxÞ; GðzjxÞÞ # Hð y; zjxÞ # UðFð yjxÞ; GðzjxÞÞ ð6Þ hold true, LðFð yjxÞ; GðzjxÞÞ and UðFð yjxÞ; GðzjxÞÞ being bivariate df’s corresponding to maximal negative and maximal positive association between Y and Z (given X).
Furthermore
UðFðþ1jxÞ; GðzjxÞÞ ¼ LðFðþ1jxÞ; GðzjxÞÞ ¼ GðzjxÞ and
UðFð yjxÞ; Gðþ1jxÞÞ ¼ LðFð yjxÞ; Gðþ1jxÞÞ ¼ Fð yjxÞ
namely LðFð yjxÞ; GðzjxÞÞ and UðFð yjxÞ; GðzjxÞÞ both possess marginal df’s Fð yjxÞ and GðzjxÞ.
As a consequence of Fre´chet inequalities (6), we have H2ð y; zjxÞ ¼ LðFð yjxÞ; GðzjxÞÞ
and
Hþð y; zjxÞ ¼ UðFð yjxÞ; GðzjxÞÞ
With such a choice, the overall uncertainty measure (4) becomes
D ¼ ð
R
ð
R2½UðFð yjxÞ; GðzjxÞÞ 2 LðFð yjxÞ; GðzjxÞÞd½Fð yjxÞdGðzjxÞ
dQðxÞ
¼ ð
R
DxdQðxÞ ¼ Ex½Dx
ð7Þ
where Dx¼
ð
R2½UðFð yjxÞ; GðzjxÞÞ 2 LðFð yjxÞ; GðzjxÞÞdFð yjxÞdGðzjxÞ ð8Þ is the uncertainty measure about the considered statistical model, conditionally on X ¼ x.
Further results and simplifications can be obtained when Fð yjxÞ and GðzjxÞ are continuous df’s, so that Rx¼ FðYjxÞ and Sx¼ GðZjxÞ both have uniform distribution in [0,1] for every x. From the Sklar theorem (Nelsen 1999), the joint df H may be uniquely represented as
Hð y; zjxÞ ¼ CxðFð yjxÞ; GðzjxÞÞ ð9Þ
where Cxðu; vÞ is the copula, i.e., the joint d.f. of Rxand Sx. In this case, the conditional
uncertainty measure simplifies to Dx¼
ð
½0;12
½Uðr; sÞ 2 Lðr; sÞd½r s
¼ ð1
0
ð1 0
{ min ðr; sÞ 2 max ð0; r þ s 2 1Þ}dr ds
ð10Þ
In this setting, it is straightforward to show that the uncertainty measure (10) is equal to 1/6 for any F and G, and for any x. The value Dx¼ 1=6 represents the maximum uncertainty achieved when no external auxiliary information beyond the knowledge of the marginals Fð yjxÞ and GðzjxÞ is available. As a consequence, also the unconditional uncertainty measure (7) takes the value 1/6.
Example 2. Contingency Tables – Assume that, given a discrete rv X with I categories, Y and Z are discrete rv’s with J and K categories, respectively, not necessarily ordered. In this case, the whole theory developed so far can still be applied by simply replacing the d.f.s with probability functions. With no loss of generality, the symbols i ¼ 1; : : : ; I, j ¼ 1; : : : ; J, and k ¼ 1; : : : ; K, denote the categories taken by X, Y and Z, respectively.
Letgjkjibe the probability that ðY ¼ j; Z ¼ kjX ¼ i Þ, with marginalsfjji representing the probability of the event ðY ¼ jjX ¼ i Þ andckjirepresenting the probability of the event ðZ ¼ kjX ¼ i Þ. Since
Lðfjji;ckjiÞ #gj;kji# Uðfjji;ckjiÞ ð11Þ
the conditional uncertainty measure turns out to be equal to Dx¼i¼XJ
j¼1
XK
k¼1
{Uðfjji;ckjiÞ 2 Lðfjji;ckjiÞ}fjjickji
and the overall uncertainty measure is D ¼XI
i¼1
Dx¼iji
wherejirepresents the probability of the event ðX ¼ i Þ. Sharper results are obtained when the categories taken by ðX; Y; ZÞ are ordered. For the sake of simplicity, we use the customary order for natural numbers. In this case, the cumulative df’s are
Hj;kji¼Xj
y¼1
Xk
z¼1
gyzji; j ¼ 1; : : : ; J; k ¼ 1; : : : ; K; i ¼ 1; : : : ; I
Fjji¼Xj
y¼1
fyji; j ¼ 1; : : : ; J; i ¼ 1; : : : ; I
Gkji¼Xk
z¼1
czji; k ¼ 1; : : : ; K; i ¼ 1; : : : ; I
Using the same arguments as in Example 1, the inequalities
LðFjji; GkjiÞ # Hj;kji# UðFjji; GkjiÞ ð12Þ
hold. Note that inequalities (12) imply that
g2jkji#gjkji#gþjkji ð13Þ
where
g2jkji¼ LðFjji; GkjiÞ 2 LðFj21ji; GkjiÞ 2 LðFjji; Gk21jiÞ þ LðFj21ji; Gk21jiÞ gþjkji¼ UðFjji; GkjiÞ 2 UðFj21ji; GkjiÞ 2 UðFjji; Gk21jiÞ þ UðFj21ji; Gk21jiÞ Then, it is not difficult to realize that
g2jkji$ Lðfjji;ckjiÞ gþjkji# Uðfjji;ckjiÞ
so that inequalities (13) are sharper than (11). At any rate, the conditional uncertainty measure is
Dx¼i¼XJ
j¼1
XK
k¼1
{UðFjji; GkjiÞ 2 LðFjji; GkjiÞ}fjjickji ð14Þ
and the unconditional uncertainty measure is D ¼XI
i¼1
Dx¼iji ð15Þ
In contrast with what was found in Example 1, D is not equal to 1/6, since the uncertainty measure depends on the marginal probabilities of YjX and ZjX.
Example 3. Multinormal Distribution – Let X, Y, Z, be jointly multinormally distributed, with mean vector and covariance matrix equal to
m¼ mx
my
mz
2 66 4
3 77 5; S¼
s2x sxy sxz
sxy s2y syz
sxz syz s2z 2
66 64
3 77
75 ð16Þ
respectively. All bivariate marginals are still multinormal, with mean vectors and covariance matrices easily obtained from (16). Furthermore, conditionally on X, (Y, Z) do have joint bivariate normal distribution, with mean vector and covariance matrix given by
myzjx¼
myþby=xðx 2mxÞ mzþbz=xðx 2mxÞ 2
64
3 75;
Syzjx¼
s2y1 2r2xy syszðryz2rxyrxzÞ syszðryz2rxyrxzÞ s2z1 2r2xz 2
66 4
3 77 5
ð17Þ
where byjx¼sxy
s2x;bzjx¼sxz
s2x
are the regression coefficients of Y, Z w.r.t. X, respectively, and rxy ¼ sxy
sxsy
;rxz¼ sxz
sxsz
;ryz¼ syz
sysz
are the correlation coefficients between ðX; YÞ, ðX; ZÞ, ðY; ZÞ, respectively.
The only unidentified parameter is ryz, the correlation coefficient between Y and Z. From (17) it is immediate to see that, givenrxy andrxz,ryzranges in the interval
rxyrxz2
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 2r2xy
1 2r2xz
r
;rxyrxzþ
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 2r2xy
1 2r2xz
r
ð18Þ
From the Slepian (1962) inequality, the joint d.f. of Y, Z given X is an increasing function of the correlation coefficient between Y and Z given X:
ryzjx ¼ ryz2rxyrxz
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1 2r2xy
1 2r2xz
r
i.e., it turns out to be a monotone function of ryz. In other words, the conditional d.f.s Hð y; zjxÞ are totally ordered on the basis of ryz. The case ryz¼rxyrxz2 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ð1 2r2xyÞð1 2r2xzÞ q
corresponds to ryzjx ¼ 21, so that conditionally on X, Y is a linear decreasing function of Z (and vice versa). As a consequence, H2ð y; zjxÞ ¼ max ð0; Fð yjxÞ þ GðzjxÞ 2 1Þ, as in Example 1. Similarly, the case ryz¼rxyrxzþ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
ð1 2r2xyÞð1 2r2xzÞ q
corresponds to ryzjx¼ 1, i.e., to Hþð y; zjxÞ ¼ min ðFð yjxÞ; GðzjxÞÞ, again as in Example 1. As a consequence, the computation of uncertainty measures parallels the nonparametric case.
The assumption of multinormal distribution of ðX; Y; ZÞ is largely debated in the statistical matching literature, since the seminal paper by Kadane (1978).
Useful discussions and advances are to be found in Moriarity and Scheuren (2001), Ra¨ssler (2002), Kiesl and Ra¨ssler (2008).
At first glance, results of Examples 1, 3 seem to be contradictory. In the multinormal case, uncertainty affects the correlation coefficient ryz. Uncertainty bounds for the correlation coefficients are determined by the condition that the correlation matrix of ðX; Y; ZÞ must be positive definite. In this case, the uncertainty space restricts continuously to a single value when rxy or rxz go continuously to 1 (or 2 1). The width of the Fre´chet class behaves differently. The lower and upper bounds of the Fre´chet class only depend on the marginal df’s of Y and Z (given X).
The copula transformation makes such marginals uniform in the interval (0, 1), and hence the whole Fre´chet class is always mapped on the same subset of the square (0, 1)2. This is why the uncertainty measure remains constantly equal to 1/6.
A discontinuity happens if either Y or Z is fully determined by X. In this case, the square (0, 1)2 collapses to the segment (0, 1), and the uncertainty measure becomes equal to zero.
Example 4. Skew-normal Distribution – Suppose that the joint distribution of ðY; ZÞ, conditionally on X, does possess a bivariate skew-normal distribution with parameters ðd; V;a;tÞ, (see Capitanio et al. 2003), where
V ¼
vyy vyz
vyz vzz
!
is a 2 £ 2 matrix. In particular,vyzis the association parameter between Y and Z, given X, and ranges in the interval ð21; 1Þ. Of course, this means that also the distribution of ðX; Y; ZÞ does possess (extended) skew-normal distribution as shown in Capitanio et al.
(2003). Asvyztends to þ 1, ðY; ZÞjX tends to be perfectly positively correlated, so that Hþð y; zjxÞ ¼ min ðFð yjxÞ; GðzjxÞÞ, again as in Example 1.
However, asvyztends to 2 1, Y and Z (given X) do not achieve the situation of perfect negative correlation. Hence, H2ð y; zjxÞ $ max ð0; Fð yjxÞ þ GðzjxÞÞ.
Example 5. Farlie-Gumbel-Morgenstern Distribution – Suppose that X is distributed as a uniform in ½0; 1. Conditionally on X, let Y and Z be marginally distributed as uniform distributions in ½0; x. Furthermore, the joint d.f. of ðY; ZÞ given X is:
Hð y; zjxÞ ¼y x z
x 1 þa 1 2y x
1 2z x
h i
ð19Þ where 21 #a# 1 (otherwise (19) is not a distribution function), 0 # y # x, 0 # z # x.
The maximal value Hþð y; zjxÞ is obtained whena¼ 1, and is equal to Hþð y; zjxÞ ¼y
x z
x 1 þ 1 2y x
1 2z x
h i
which is strictly smaller than min ðFð yjxÞ; GðzjxÞÞ. On the other hand, the minimal value H2ð y; zjxÞ is obtained whena¼ 21, and is equal to
H2ð y; zjxÞ ¼y x z
x 1 2 1 2y x
1 2z x
h i
which is strictly greater than max ð0; Fð yjxÞ þ GðzjxÞÞ.
3. Reducing Uncertainty
When auxiliary information in the form of logical constraints regarding the statistical model for ðY; ZÞ or ðY; ZjXÞ is available, some models for ðX; Y; ZÞ become illogical and must be excluded from the set of plausible distribution functions. As a consequence, the statistical model for the data becomes less uncertain. This kind of information is frequently used, for instance, in imputation and editing, see Luzi et al. (2007). The introduction of such constraints may complicate the estimation process, because there could be no probability distributions satisfying both the logical constraints and lying in the set of all estimated bivariate distributions.
There are essentially two types of constraints:
1. constraints on the values of the parameters;
2. constraints on the support of ðY; ZÞ or ðY; ZjXÞ.
The first kind of constraints is essentially used in the parametric case. They have been largely studied in D’Orazio et al. (2006a) when ðX; Y; ZÞ are categorical. The constraints are in terms of inequalities between the cell probabilities of the contingency table ðY; ZÞ or ðY; ZÞjX. In the normal case, these constraints can be used when information on the correlation coefficientryzor equivalentlyryzjx is available, see Ra¨ssler (2002).
The second kind of constraints has been mainly used in the parametric-multinomial case (D’Orazio et al. 2006a). These constraints have been defined in terms of structural zeros on the joint ðY; ZÞ distribution. In Vantaggi (2008) a different estimation method to deal with the case of constraints in form of structural zeros in a coherent probability framework is proposed.
The use of structural zeros for reducing uncertainty has been confined to the case of categorical rv’s. In fact, for continuous rv’s the introduction of structural zeros corresponds to the restriction of the ðY; ZÞ or ðY; ZjXÞ support to a subset of the Cartesian product of the Y and Z supports, i.e., ðY; ZÞ is fully concentrated on a set of points ð y; zÞ [R2. The relationship between the original marginal parameters and the joint ðY; ZÞ distribution is heavily dependent on the nature of this constraint and does not allow easy generalizations. For instance, even if both ðYjXÞ and ðZjXÞ follow a normal distribution, the introduction of a structural zero will prevent ðY; ZjXÞ from being bivariate normal. As a matter of fact, structural zeros can be easily used in a nonparametric approach, and the estimation of uncertainty can be performed using Equations (3) and (4).
For the sake of simplicity, assume that Y and Z are continuous r.v.’s, that X is a discrete r.v., and that the support constraints have the following shape: ax# Y=Z # bx given X. An example of this kind of constraint happens in household surveys, when X is a household socio-economic character, Y the household consumption and Z the household income. Conditionally on X, the inequality ax# Y=Z # bx holds. Using the notation a ^ b ¼ min ða; bÞ, it is not difficult to see that the pair of inequalities
Lx G z ^ y axjx
; Fð y ^ bxzjxÞ
# Hðz; yjxÞ # Ux G z ^ y axjx
; Fð y ^ bxzjxÞ
holds, where
Ux G z ^ y ax
jx
; Fð y ^ bxzjxÞ
¼ min G z ^ y ax
jx
; Fð y ^ bxzjxÞ
¼ min GðzjxÞ ^ G y axjx
; Fð yjxÞg ^ FðbxzjxÞ
¼ min GðzjxÞ; G y ax
jx
; Fð yjxÞ; FðbxzjxÞ
ð20Þ
Lx G z ^ y ax
jx
; Fð y ^ bxzjxÞ
¼ max 0; G z ^ y ax
jx
þ Fð y ^ bxzjxÞ 2 1
¼ max 0; GðzjxÞ ^ G y ax
jx
þ Fð yjxÞ ^ FðbxzjxÞ 2 1
ð21Þ
With obvious notation, according to (8) the conditional uncertainly measure is then Dxc¼
ð
R2
Ux G z ^ y ax
jx
; Fð y ^ bxzjxÞ
2Lx G z ^ y ax
jx
; Fð y ^ bxzjxÞ
dFð yjxÞdGðzjxÞ
ð22Þ
where c denotes the constraint ax# Y=Z # bx. The unconditional uncertainty measure is easily obtained from (22) by averaging w.r.t. the distribution of X.
As remarked above, in this case both conditional and unconditional uncertainty measures are strictly smaller than 1/6. In other words, the constraint ax# Y=Z # bx reduces the uncertainty about the statistical model.
Clearly, the reduction in the model uncertainty depends on how informative the imposed constraints are. In some circumstances, the logical constraints can be so informative that the Fre´chet class reduces to a unique distribution. This happens when Z can be exactly predicted by X (or alternatively by ðX; YÞ) due to a deterministic relationship between X and Z (or ðX; Z; YÞ). The effects on model uncertainty due to the introduction of logical constraints in a nonparametric setting are investigated in Conti et al.
(2009b). Moreover, in Conti et al. (2009b) estimators of conditional and unconditional measures of uncertainty under logical constraints are proposed and their consistency and asymptotic normality are proved.
4. A Simulation Study
In this section we perform a simulation experiment in order to evaluate the effects on model uncertainty due to the introduction of logical constraints and in order to investigate the asymptotic behavior of the uncertainty measures proposed in Section 3. The simulation involves the following steps.
1. A sample A composed by nAi.i.d. records has been generated according to a bivariate normal distribution ðX; YÞ with mean vectormxy ¼ ð0; 0Þ and covariance matrix given by
Sxy ¼
1 0:5
0:5 1
!
2. A sample B composed by nBi.i.d. records has been generated according to a bivariate normal distribution ðX; ZÞ with mean vectormxz¼ ð0; 0Þ and covariance matrix given by
Sxz¼ 1 0:8 0:8 1
!
3. Since X is a continuous variable in order to compute the uncertainty measures we need to discretize it. The range of observed values x ¼ xA< xB has been divided into I ¼ 20 intervals according to the hth percentiles of data, for h ¼ 5 2 95ð5Þ. Formally, the discretized variable will assume the value x ¼ i, for i ¼ 1; : : : ; 20.
4. Conditionally on x ¼ i (for i ¼ 1; : : : ; 20) the pointwise measure of uncertainty in ðx; y; zÞ is estimated by
D^yzjx¼ minð ^GnBðzjxÞ; ^FnAð yjxÞÞ 2 maxð ^GnBðzjxÞ þ ^FnAð yjxÞ 2 1; 0Þ ð23Þ where ^FnAð yjxÞ and ^GnBðzjxÞ are the empirical distribution functions of Fð yjxÞ and GðzjxÞ, respectively. Let nA;xand nB;xbe the number of units such that x ¼ i in samples A and B, and denote by y ¼ ð y1;x; : : : ; ynA;xÞ and z ¼ ðz1;x; : : : ; znB;xÞ the corresponding sample values of Y and Z, respectively. Then, for x ¼ i we obtain nA;xnB;x pointwise uncertainty measures (23).
5. Conditionally on x ¼ i the uncertainty measure when no external auxiliary information is available is obtained by averaging the nA;xnB;x pointwise uncertainty measures (23).
Formally D^x¼i¼ 1
nA;xnB;x y[y X
z[z
XD^yzjx ð24Þ
6. The overall unconditional uncertainty measure is a weighted mean of the conditional uncertainty measure (24)
D ¼^
x
XDx¼i^pðxÞ ð25Þ
where
^pðxÞ ¼ nA;xþ nB;x
nAþ nB
ð26Þ
7. Suppose that auxiliary information in the form of logical constraints regarding the statistical model for ðX; Y; ZÞ is available. More specifically let us assume that there exist constants ax and bx such that ax# Y=Z # bx. In the simulation study we set ax¼ 0:1=x and bx¼ 1=x. Then, conditionally on x ¼ i and under the constraint 0:1=x # Y=Z # 1=x, the pointwise uncertainty measure in ðx; y; zÞ is estimated by
D^yzjxc ¼ ^Uxð y; zÞ 2 ^Lxð y; zÞ
where the subscript c represents the constraint and
U^xð y; zÞ ¼ min{ ^FnAð yjxÞ; ^FnAðbxzjxÞ; ^GnBðzjxÞ; ^GnBð y=axjxÞ} ð27Þ
^Lxð y; zÞ ¼ max{0; minð ^FnAð yjxÞ; ^FnAðbxzjxÞÞ þ minð ^GnBðzjxÞ; ^GnBð y=axjxÞÞ 2 1} ð28Þ
8. Conditionally on x ¼ i the estimator of conditional uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is given by
D^x¼ic ¼ 1 nA;xnB;x y[y
X
z[z
XD^yzjxc ð29Þ
9. The overall unconditional uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is obtained as a weighted mean of conditional uncertainty measures (29)
D^c¼X
x
D^x¼ic ^pðxÞ ð30Þ
where ^pðxÞ is given by (26).
10. Steps 1 to 9 have been repeated 500 times and for different sample sizes nA¼ nB¼ n ¼ ð1; 000; 2; 000; 5; 000Þ.
Given n, for each sample s (for s ¼ 1; : : : ; 500) and for each category i (for i ¼ 1; : : : ; 20) denote by ^Dx¼i;s, ^Ds, ^Dx¼i;sc , ^Dscthe uncertainty measures (24), (25), (29) and (30), respectively.
As for the conditional uncertainty measure estimates (24), their average over simulation runs is
Dx¼i¼ 1 500
X500
s¼1
D^x¼i;s ð31Þ
while D ¼ 1
500 X500
s¼1
D^s ð32Þ
represents the corresponding overall uncertainty. The standard deviation of ^Dx¼i;s, again over simulation runs, is equal to
SDð ^Dx¼iÞ ¼
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1
499 X500
s¼1
ð ^Dx¼i;s2 Dx¼iÞ2 vu
uu
t ð33Þ
and the corresponding mean squared error is given by
MSEð ^Dx¼iÞ ¼ ½Sdð ^Dx¼iÞ2þ ð Dx¼i2 Dx¼iÞ2 ð34Þ where Dx¼i¼ 1=6.
Analogously, if we refer to the conditional uncertainty measure estimates (29), we have that the average over simulation runs is given by
Dx¼i¼ 1 500
X500
s¼1
D^x¼i;sc ð35Þ
while
Dc¼X500
s¼1
Dsc ð36Þ
is the corresponding overall uncertainty under the constraint 0:1=x # Y=Z # 1=x. The standard deviation over simulation runs is
SD ^Dx¼ic
¼
ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 1
499 X500
s¼1
D^x¼i;sc 2 Dx¼ic
2
vu uu
t ð37Þ
and finally the corresponding mean squared error is MSE ^Dx¼ic
¼ Sd ^h Dx¼ic i2
þ Dx¼i2 Dx¼ic 2
ð38Þ where Dx¼ic is the actual conditional uncertainty measure for the ith category under the constraint c. The values Dx¼ic have been numerically computed. More specifically N ¼ 30; 000 i.i.d. records have been generated from ðX; YÞ and ðX; ZÞ according to the distributions specified in Steps 1 and 2. The actual overall uncertainty measure under the constraint is given by
Dc¼ 1 20
X20
i¼1
Dx¼ic ø 0:11225 ð39Þ
4.1. Simulation Results
In Table 1 the uncertainty measures D and Dc given by (32) and (36), respectively, are reported for different values of the sample size n. Note that D is always smaller than 1/6, and that it tends to 1/6 as the sample size n increases. Furthermore, the constraint 0:1=x # Y=Z # 1=x reduces the model uncertainty for ðX; Y; ZÞ from D ø 0:16 to Dcø 0:11.
In Figure 1 the expectation of the estimated uncertainty (31) is reported. As the category x ¼ i varies, the expected estimated uncertainty is essentially the same. Furthermore, as the sample size n increases from 1,000 to 5,000, for each category x ¼ i the expectation of the estimated mean uncertainty tends to 1/6 and the precision of our estimator increases as shown in Figure 2, where the standard deviation (33) is reported. As shown in Figure 3, as n increases, the mean squared error (34) decreases for each category x ¼ i. Such a property comes from the consistency of the empirical distribution functions ^FnAð yjxÞ and ^GnBðzjxÞ as estimators of Fð yjxÞ and GðzjxÞ, respectively.
The same analysis has been performed under the constraint 0:1=x # Y=Z # 1=x where ax¼ 0:1=x and bx¼ 1=x. Conditional and unconditional uncertainty measures
Table 1. Uncertainty measures as the sample size n varies
n D Dc
1,000 0.16653 0.10946 2,000 0.16663 0.11068 5,000 0.16666 0.11163
(numerically computed) are reported in Figure 4. The horizontal line is the unconditional uncertainty measure, while the dashed line depicts the conditional uncertainty measure as x ¼ i varies. Conditional uncertainty takes small values, about 3 £ 1022– 7 £ 1022, for x ¼ 1 – 5(1), while it is about 0.15 for the categories x ¼ 13 2 18ð1Þ. The behavior of the corresponding estimates is shown in Figures 5, 6 and 7. In particular, in Figure 5 the expectations of the estimated uncertainty for sample sizes n ¼ 1; 000; 2; 000; 5; 000, i.e., Dx¼ic given by (35), are reported. The absolute bias of estimators is approximately constant
3.0×10–5
2.0×10–5
1.0×10–5
0.0
5 10 15 20
x
Uncertainty Sd
n = 5,000 n = 2,000 n = 1,000
Fig. 2. Standard deviation of estimated uncertainty measures in the unconstrained case n = 5,000
n = 2,000 n = 1,000
5 10 15 20
x
Uncertainty mean
0.16675
0.16670
0.16665
0.16660
0.16655
0.16650
Fig. 1. Expected values of estimated uncertainty measures in the unconstrained case
and small, if compared to the corresponding true uncertainty measures. In Figures 6, 7 the standard deviation (37) and the mean squared error (38) of our uncertainty estimators are reported. The highest efficiency of estimation is obtained for the classes x ¼ 1 2 5ð1Þ.
The smallest efficiency is obtained for the classes x ¼ 12 2 16ð1Þ. At any rate, we stress that the mean squared errors of conditional uncertainty estimators are all considerably small, ranging from 1023to 1:3 £ 1022.
In Figure 8 the Kernel density estimate of the overall uncertainty measure under the constraint 0:1=x # Y=Z # 1=x is shown. Such an estimate has been computed using for
2.5×10–8
2.0×10–8
1.5×10–8
1.0×10–8
0.0
5 10 15 20
x
Uncertainty Mse
n = 5,000 n = 2,000 n = 1,000
Fig. 3. Mean squared error of estimated uncertainty measures in the unconstrained case
0.15
0.10
0.05
5 10 15 20
x
Fig. 4. Uncertainty measureDcandDicfor each x ¼ i under the constraint 0:1=x # Y=Z # 1=x
each sample size n the 500 value Dcgiven by (30). Note that as the sample size n increases the uncertainty measure distribution tends to a normal distribution as proved in Conti et al.
(2009b). The bandwidth selection rule is given by Sheather and Jones (1991). Similar considerations hold for the estimated conditional uncertainty measures.
5. Concluding Remarks
The approach described in the above sections seems weird in statistical inference: the result of an estimation process is not a single value, or a single distribution, but a set of
0.15
0.10
0.05
5 10 15 20
x
Uncertainty mean
n = 5,000 n = 2,000 n = 1,000
Fig. 5. Expectation of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x
0.030
0.025
0.020
0.015
0.010
0.005
0.000
5 10 15 20
x
Uncertainty Sd
n = 5,000 n = 2,000 n = 1,000
Fig. 6. Standard deviation of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x
values or a set of distributions. Such sets are justified by the lack of joint information on the rv’s of interest Y and Z, and can be mitigated (although not avoided) by strong statistical relationships of X with Y and/or Z, as well as by constraints justified by logical rules on the ðY; ZÞ joint distribution. If a set is not acceptable, it is then necessary to include additional assumptions, such as the CIA. However, the use of additional assumptions, sometimes neither theoretically justified nor testable on the available data, is not a benefit, as stated in the so-called “law of decreasing credibility” (cf. Manski 2003): the credibility of inference decreases with the strength of the assumption maintained.
0.030
0.025
0.020
0.015
0.010
0.005
0.000
5 10 15 20
x
Uncertainty Mse
n = 5,000 n = 2,000 n = 1,000
Fig. 7. Mean squared error of estimated uncertainty measures under the constraint 0:1=x # Y=Z # 1=x
200
150
100
50
0
0.104 0.106 0.108 0.110 0.112 0.114 0.116 0.118
Density
n = 5,000 n = 2,000 n = 1,000
Fig. 8. Density estimate of overall uncertainty measure under the constraint 0:1=x # Y=Z # 1=x
The present article essentially acknowledges the importance of the law of decreasing credibility. The measure of uncertainty defined in Section 2 aims at measuring the
“credibility of inferences” that can be made on the basis of available data and prior information. Similar approaches have also been used in the areas of statistical disclosure control (cf. Fienberg and Slavkovic 2005), ecological inference (cf. King 1997) and analysis of partially observed data sets when neither MAR nor MCAR assumptions are made (cf. Imbens and Manski 2004).
6. References
Capitanio, A., Azzalini, A., and Stanghellini, E. (2003). Graphical Models for Skew- Normal Variates. Scandinavian Journal of Statistics, 30, 129 – 144.
Conti, P.L., Marella, D., and Scanu, M. (2009a). Evaluation of Matching Noise for Imputation Techniques based on Nonparametric Local Linear Regression Estimators.
Computational Statistics & Data Analysis, 53, 354 – 365.
Conti, P.L., Marella, D., and Scanu, M. (2009b). How Far from Identifiability?
A Nonparametric Approach to Uncertainty in Statistical Matching under Logical Constraints. Technical Report n.22, DSPSA, Universita` di Roma “La Sapienza”.
D’Orazio, M., Di Zio, M., and Scanu, M. (2006a). Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints. Journal of Official Statistics, 22, 137 – 157.
D’Orazio, M., Di Zio, M., and Scanu, M. (2006b). Statistical Matching: Theory and Practice. Chichester: Wiley.
Fienberg, S.E. and Slavkovic, A.B. (2005). Preserving the Confidentiality of Categorical Data Bases When Releasing Information for Association Rules. Data Mining and Knowledge Discovery, 11, 155 – 180.
Imbens, G. and Manski, C. (2004). Confidence Intervals for Partially Identified Parameters. Econometrica, 72, 1845 – 1857.
Kadane, J.B. (1978). Some Statistical Problems in Merging Data Files. Compendium of Tax Research, Department of Treasury, U.S. Government Printing Office, Washington D.C., 159 – 179 (Reprinted in 2001, Journal of Official Statistics, 17, 423 – 433).
Kiesl, H. and Ra¨ssler, S. (2008). The Validity of Data Fusion. Insights on Data Integration Methodologies, ESSnet-ISAD workshop, Vienna, 29 – 30 May 2008, 59 – 67.
King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual Behavior from Aggregate Data. Princeton: Princeton University Press.
Luzi, O., Di Zio, M., Guarnera, U., Manzari, A., De Waal, T., Pannekoek, J., Hoogland, J., Tempelman, C., Hulliger, B., and Kilchmann, D. (2007). Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys, ISTAT, CBS, SFSO, Eurostat. Available at http://edimbus.istat.it.
Manski, C.F. (2003). Partial Identification of Probability Distributions. New York:
Springer Verlag.
Marella, D., Scanu, M., and Conti, P.L. (2008). On the Matching Noise of Some Nonparametric Imputation Procedures. Statistics and Probability Letters, 78, 1593 – 1600.
Moriarity, C. and Scheuren, F. (2001). Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure. Journal of Official Statistics, 17, 407 – 422.
Nelsen, R.B. (1999). An Introduction to Copulas. New York: Springer Verlag.
Okner, B.A. (1972). Constructing a New Microdata Base from Existing Microdata Sets:
The 1966 Merge File. Annals of Economic and Social Measurement, 1, 325 – 362.
Paass, G. (1986). Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information. Microanalytic Simulation Models to Support Social and Financial Policy, G.H. Orcutt and H. Quinke (eds). Amsterdam: Elsevier.
Ra¨ssler, S. (2002). Statistical Matching: A Frequentist Theory, Practical Applications and Alternative Bayesian Approaches. Lecture Notes in Statistics. New York: Springer Verlag.
Ridder, G. and Moffitt, R. (2007). The Econometrics of Data Combination. Handbook of Econometrics, J.J. Heckmann and E.E. Leamer (eds). vol. 6A. Amsterdam: Elsevier.
Rodgers, W.L. (1984). An Evaluation of Statistical Matching. Journal of Business and Economic Statistics, 2, 91 – 102.
Sheather, S.J. and Jones, M.C. (1991). A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society, Series B, 53, 683 – 690.
Sims, C.A. (1972). Comments and Rejoinder (On Okner (1972)). Annals of Economic and Social Measurement, 1, 343 – 345, 355 – 357.
Singh, A.C., Mantel, H., Kinack, M., and Rowe, G. (1993). Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption.
Survey Methodology, 19, 59 – 79.
Slepian, D. (1962). The One-Sided Barrier Problem for Gaussian Noise. Bell System Technical Journal, 41, 463 – 501.
Vantaggi, B. (2008). Statistical Matching of Multiple Sources: A Look Through Coherence. International Journal of Approximate Reasoning, 49, 701 – 711.
Received November 2009 Revised November 2011