1.2 Imputation models
1.2.3 Copula models
1.2.3.1 Basics in copulas
Copula modelling, another joint modelling approach, has the potential to inherit the merits of both JM and FCS: it provides flexibility in the marginal distributions while ensuring a proper joint distribution. In this thesis, we consider copula-based models to impute missing values. The word ‘copula’ means ‘a link, tie, bond’. In mathematics and statistics, generally speaking, it refers to joining together one-dimensional cumulative distribution functions (CDFs) $F_1, \ldots, F_p$ of variables $y_1, \ldots, y_p$ to form a joint CDF, $F$. Each variable is modeled by its marginal distribution, $F_l(y_l) = u_l$, $l = 1, \ldots, p$, where $u_l$ is uniformly distributed, and the dependence among $u = (u_1, \ldots, u_p)$ is captured by the copula function $C$. More formally, a copula function $C : [0,1]^p \to [0,1]$ is defined as the joint CDF of the uniformly distributed random variables $U_1, \ldots, U_p$, such that $C(u_1, \ldots, u_p) = P(U_1 \le u_1, \ldots, U_p \le u_p)$. An equivalent but
Copula models are commonly used for constructing multivariate CDFs, as implied by Sklar’s theorem (Sklar, 1959). Sklar’s theorem shows that there always exists a copula function $C$ such that $F(y_1, \ldots, y_p) = C(F_1(y_1), \ldots, F_p(y_p))$, and $C$ is unique if the random variables $y_l$, $l = 1, \ldots, p$, are continuous.
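As a minimal illustration of Sklar’s theorem, the following Python sketch builds a bivariate joint distribution by gluing two arbitrary marginals together with a Gaussian copula. The correlation of 0.7, the exponential and beta marginals, and all variable names are assumptions chosen purely for illustration; they are not taken from the thesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Assumed illustration: a bivariate Gaussian copula with correlation 0.7,
# combined with an exponential and a beta marginal.
corr = np.array([[1.0, 0.7], [0.7, 1.0]])
marginals = [stats.expon(scale=2.0), stats.beta(2.0, 5.0)]

# 1. Draw latent Gaussian vectors and map them to uniforms: u_l = Phi(z_l).
z = rng.multivariate_normal(mean=np.zeros(2), cov=corr, size=20_000)
u = stats.norm.cdf(z)

# 2. Apply the inverse marginal CDFs: y_l = F_l^{-1}(u_l).
y = np.column_stack([m.ppf(u[:, l]) for l, m in enumerate(marginals)])

# 3. Applying F_l to y_l recovers the uniforms, so that
#    F(y_1, y_2) = C(F_1(y_1), F_2(y_2)) with C the Gaussian copula.
u_back = np.column_stack([m.cdf(y[:, l]) for l, m in enumerate(marginals)])

# The dependence survives the change of marginals.
tau, _ = stats.kendalltau(y[:, 0], y[:, 1])
print("sample means:", y.mean(axis=0))
print("Kendall's tau:", tau)
```

Each column of `y` follows its own marginal, while the dependence between the columns is determined entirely by the copula.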
One merit of a copula model is its invariance property: if $T_1, \ldots, T_p$ are strictly increasing functions, then $C$ is also the copula of the transformed variables $T_1(y_1), \ldots, T_p(y_p)$. Moreover, Pearson’s correlation measures only the linear relationship between two variables and is not suitable for quantifying nonlinear relationships. Rank-based association measures such as Kendall’s tau $\tau$ and Spearman’s rho $\rho$ (Embrechts et al., 2002), which describe the concordance between two variables, and tail dependence, which describes the co-movement between extreme values (to be discussed in Chapter 3, Section 3.4.1), can be computed as functions of the association parameters in copula models.
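The invariance property can be checked empirically. The sketch below is only an illustration: the simulated data and the transformations `exp` and a cubic polynomial are assumptions, not part of the thesis. It shows that the rank-based measures are unchanged by strictly increasing transformations, whereas Pearson’s correlation is not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two dependent continuous variables (assumed toy data).
y1 = rng.normal(size=500)
y2 = 0.6 * y1 + rng.normal(size=500)

# Strictly increasing transformations T1 and T2 (assumed examples).
t1 = np.exp(y1)
t2 = y2 ** 3 + y2

# Rank-based measures depend only on the copula, so they are unchanged.
tau_before, _ = stats.kendalltau(y1, y2)
tau_after, _ = stats.kendalltau(t1, t2)
rho_before, _ = stats.spearmanr(y1, y2)
rho_after, _ = stats.spearmanr(t1, t2)

# Pearson's correlation also depends on the marginals, so it changes.
r_before, _ = stats.pearsonr(y1, y2)
r_after, _ = stats.pearsonr(t1, t2)

print("Kendall's tau :", tau_before, tau_after)
print("Spearman's rho:", rho_before, rho_after)
print("Pearson's r   :", r_before, r_after)
```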
For example, consider two continuous random variables $y_1$ and $y_2$ with univariate CDFs $F_1$ and $F_2$ respectively. Define the uniformly distributed random variables $u = F_1(y_1)$ and $v = F_2(y_2)$. Kendall’s tau $\tau$, Spearman’s rho $\rho$, lower tail dependence $\lambda_{lw}$ and upper tail dependence $\lambda_{up}$ can be calculated from the copula function as follows:
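\[
\begin{aligned}
\tau &= 4 \int_0^1 \int_0^1 C(u, v)\, dC(u, v) - 1, \\
\rho &= 12 \int_0^1 \int_0^1 C(u, v)\, du\, dv - 3, \\
\lambda_{lw} &= \lim_{t \to 0^+} \frac{C(t, t)}{t}, \\
\lambda_{up} &= 2 - \lim_{t \to 1^-} \frac{1 - C(t, t)}{1 - t}.
\end{aligned}
\tag{1.4}
\]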
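The expressions for $\rho$, $\lambda_{lw}$ and $\lambda_{up}$ in (1.4) can be evaluated numerically for any bivariate copula function. The following is a rough sketch: the helper names are hypothetical, the limits are approximated by plugging in a small $t$, and the product copula $uv$ and the comonotone copula $\min(u, v)$ are assumed test cases for which the answers are known in closed form.

```python
import numpy as np
from scipy import integrate


def spearman_rho(copula):
    """rho = 12 * int_0^1 int_0^1 C(u, v) du dv - 3."""
    integral, _ = integrate.dblquad(lambda v, u: copula(u, v), 0.0, 1.0, 0.0, 1.0)
    return 12.0 * integral - 3.0


def lower_tail(copula, t=1e-6):
    """Approximate lambda_lw by C(t, t) / t at a small t."""
    return copula(t, t) / t


def upper_tail(copula, t=1e-6):
    """Approximate lambda_up by 2 - (1 - C(s, s)) / (1 - s) at s = 1 - t."""
    s = 1.0 - t
    return 2.0 - (1.0 - copula(s, s)) / (1.0 - s)


def product(u, v):
    """Copula of two independent variables."""
    return u * v


def comonotone(u, v):
    """Copula of two perfectly positively dependent variables."""
    return np.minimum(u, v)


for name, c in [("product", product), ("comonotone", comonotone)]:
    print(name, spearman_rho(c), lower_tail(c), upper_tail(c))
```

For the product copula all three quantities are (approximately) zero, and for the comonotone copula all three equal one, which matches the closed-form values.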
The simplest copula is the independence copula, $C(u, v) = uv$, which holds if and only if $y_1$ and $y_2$ are independent. A slightly more complicated class is the Archimedean copulas, which include the Clayton, Frank and Gumbel copulas. As an example, the Clayton copula is given by $C_\alpha(u, v) = \max\left([u^{-\alpha} + v^{-\alpha} - 1]^{-1/\alpha}, 0\right)$, where $\alpha \in [-1, \infty) \setminus \{0\}$. The parameter $\alpha$ is regarded as the association parameter of the Clayton copula, and its relationship with Kendall’s tau is $\tau = \alpha / (\alpha + 2)$. The Clayton copula is asymmetric in the sense that its upper tail dependence parameter is $\lambda_{up} = 0$, whereas its lower tail dependence parameter is $\lambda_{lw} = 2^{-1/\alpha}$ for $\alpha > 0$. Another class of copulas is the elliptical copulas, including the Gaussian and t copulas, which are discussed in more detail in Section 2.2 and Section 3.2.2. The relationships between the association parameter $\alpha$ in elliptical copulas and Spearman’s rho and Kendall’s tau are $\rho = \frac{6}{\pi} \arcsin(\alpha / 2)$ and $\tau = \frac{2}{\pi} \arcsin(\alpha)$ respectively; the expression for $\tau$ holds for both copulas, whereas the expression for $\rho$ is exact for the Gaussian copula. Note that both the Gaussian and t copulas are symmetric, but the Gaussian copula is asymptotically independent in its tails, whereas the t copula has a certain amount of tail dependence controlled by the degrees of freedom parameter. In this thesis we only consider elliptical copulas, not only because they allow for different pairwise associations between variables when generalizing to more than two variables, but also because of their mathematical convenience. Moreover, we also consider a mixture of copulas in Chapter 4, as a convex combination of copulas (Nelsen, 2007, Section 3.2.4).
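The relationships quoted for the Clayton copula can be checked numerically. The sketch below makes several assumptions that are not in the thesis: it restricts attention to $\alpha > 0$ (so the $\max(\cdot, 0)$ is inactive), it samples by conditional inversion, and the function names and the value $\alpha = 2$ are chosen for illustration only. The last two helpers simply encode the arcsine relationships for the elliptical case.

```python
import numpy as np
from scipy import stats


def clayton_cdf(u, v, alpha):
    """Clayton copula C_alpha(u, v), valid as written for alpha > 0."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def sample_clayton(n, alpha, rng):
    """Draw n pairs (u, v) by conditional inversion: u ~ U(0, 1), then v | u."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    v = (u ** -alpha * (w ** (-alpha / (1.0 + alpha)) - 1.0) + 1.0) ** (-1.0 / alpha)
    return u, v


def elliptical_tau(alpha):
    """Kendall's tau implied by the association parameter of an elliptical copula."""
    return 2.0 / np.pi * np.arcsin(alpha)


def gaussian_spearman(alpha):
    """Spearman's rho implied by the correlation of a Gaussian copula."""
    return 6.0 / np.pi * np.arcsin(alpha / 2.0)


alpha = 2.0
rng = np.random.default_rng(2)
u, v = sample_clayton(20_000, alpha, rng)

# Kendall's tau: theoretical alpha / (alpha + 2) versus the sample estimate.
tau_theory = alpha / (alpha + 2.0)  # 0.5 for alpha = 2
tau_sample, _ = stats.kendalltau(u, v)

# Lower tail dependence: 2^(-1/alpha) versus C(t, t) / t at a small t.
t = 1e-6
lambda_lw_theory = 2.0 ** (-1.0 / alpha)
lambda_lw_numeric = clayton_cdf(t, t, alpha) / t

print("tau      :", tau_theory, tau_sample)
print("lambda_lw:", lambda_lw_theory, lambda_lw_numeric)
```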
We refer to the book by Nelsen (2007) for a more comprehensive review of copulas and the paper by Trivedi et al. (2007) for a nice summary.
1.2.3.2 Applications of copula models
There exist many applications of copulas in the literature. To list a few, Bouyé et al. (2000) summarized financial applications including credit scoring, asset returns modelling and risk measurements; Hu (2006) studied the dependence patterns across financial markets; and more recently Liu et al. (2017) proposed a time-varying copula model to study the dependence structure between security and commodity markets. Extreme value copulas have been considered in actuarial science, where Cebrian et al. (2003) applied their models to a medical claim database and Dupuis and Jones (2006) illustrated their approaches on four risk-related data sets. While copula models have their major applications in finance, actuarial studies and economics, they have been actively studied in other fields as well. For example, Nguyen-Huy et al. (2017) explored copula models to forecast spring seasonal precipitation in Australia’s agro-ecological zones; Yin and Yuan (2009) used a copula regression in a Bayesian adaptive design for finding the optimal dosage in oncology; and Valle et al. (2018) extended the work of Wu et al. (2015) by using an infinite mixture of copula models to study the effect of socioeconomic factors on the relationship between twins’ cognitive abilities.
1.2.3.3 Copula-based imputation
Copula models have proven to be very powerful for modelling variables of different types and shapes when there is an underlying dependence among them. They adopt a ‘bottom-up’ strategy where the starting point is the marginal distributions $F_l$, which are then glued together by the copula function $C$. In some other ‘top-down’ joint modelling approaches, the marginal distributions are fully determined by their parent joint distribution; however, constructing a joint model whose marginal distributions are suitable for each variable can be a daunting task, especially when there are a large number of variables of mixed type that are skewed or multi-modal. By starting from the marginal distributions, we can accommodate the features of each variable and ensure that the imputed data take proper values in the correct range, which is usually a merit of FCS. In addition, copula models guarantee the existence of a compatible joint distribution, which is not guaranteed by the FCS approach. The multinomial probit models for ordered or unordered categorical data can be treated as a special case of a copula model, because the underlying latent variables corresponding to each category are assumed to follow a multivariate Gaussian distribution (Albert and Chib, 1993).
Using the copula model as an imputation engine is relatively new but has drawn some attention in the literature. Käärik (2006) and Käärik and Käärik (2009) were among the first authors to consider imputation using a Gaussian copula model where the missing data pattern was monotone. In their papers, they imputed missing data due to dropouts in longitudinal data sets, where a monotone missing data pattern is often observed. Compound symmetry and autoregressive dependence structures were considered for the correlation matrix. Lascio (2015) found that copula-based imputation from the Archimedean family compared favourably with nearest neighbour donor imputation and regression imputation by the EM algorithm. Hollenbach et al. (2014) compared the performance of imputation by the copula model using the extended rank likelihood approach (Hoff, 2007) (to be discussed in Section 2.2) with JM (as implemented in the ‘Amelia’ package in R) and FCS (as implemented in the ‘mice’ package in R), and concluded that the copula imputation approach outperformed the other two approaches, with slightly smaller bias, a higher coverage rate and narrower confidence intervals. The improvement was more pronounced when the missing data were not normally distributed. They implemented the imputation using the R package ‘sbgcop’ (Hoff, 2018), which has the option to impute missing data under the MAR assumption. Shen and Weissfeld (2006) considered the NMAR scenario by building a joint model for the outcome variables and the associated missing data indicators, and they claimed that it could be used to eliminate the potential bias caused by non-ignorable missingness. Generalized estimating equations (GEE) were used to estimate the marginal distributions of the missing indicators given the observed data, and then the parameters in the Gaussian copula model and the marginal distributions of the outcome variables were estimated.
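To make the ‘bottom-up’ idea concrete, the following Python sketch performs a single stochastic imputation under a Gaussian copula. It is a deliberately simplified illustration and not the method of any of the papers cited above: it assumes empirical marginals, estimates the latent correlation matrix from the fully observed rows only, treats the estimated parameters as fixed, and the function name `gaussian_copula_impute` and the toy data are hypothetical.

```python
import numpy as np
from scipy import stats


def gaussian_copula_impute(y, rng):
    """Single stochastic imputation of NaNs in y (n x p) under a Gaussian copula.

    Simplifying assumptions: empirical marginals, correlation estimated
    from complete rows, parameters treated as fixed.
    """
    y = np.asarray(y, dtype=float)
    n, p = y.shape
    miss = np.isnan(y)

    # Step 1: normal scores z_l = Phi^{-1}(rank / (n_obs + 1)) per column.
    z = np.full_like(y, np.nan)
    for l in range(p):
        obs = ~miss[:, l]
        z[obs, l] = stats.norm.ppf(stats.rankdata(y[obs, l]) / (obs.sum() + 1))

    # Step 2: latent correlation matrix from the fully observed rows.
    complete = ~miss.any(axis=1)
    corr = np.corrcoef(z[complete], rowvar=False)

    # Step 3: draw z_mis | z_obs from the conditional Gaussian, row by row.
    out = y.copy()
    for i in np.flatnonzero(miss.any(axis=1)):
        m, o = miss[i], ~miss[i]
        if o.any():
            coef = np.linalg.solve(corr[np.ix_(o, o)], corr[np.ix_(o, m)]).T
            mean = coef @ z[i, o]
            cov = corr[np.ix_(m, m)] - coef @ corr[np.ix_(o, m)]
        else:
            mean, cov = np.zeros(m.sum()), corr[np.ix_(m, m)]
        z_draw = rng.multivariate_normal(mean, cov)

        # Step 4: back-transform through the empirical quantile function,
        # so imputed values stay within the observed range of the variable.
        for l, zl in zip(np.flatnonzero(m), z_draw):
            out[i, l] = np.quantile(y[~miss[:, l], l], stats.norm.cdf(zl))
    return out


# Toy data (assumed): one skewed and one symmetric variable, with roughly
# 25% of the first column deleted completely at random.
rng = np.random.default_rng(3)
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=400)
data = np.column_stack([np.exp(latent[:, 0]), latent[:, 1]])
data[rng.uniform(size=400) < 0.25, 0] = np.nan
completed = gaussian_copula_impute(data, rng)
```

Because the back-transformation uses the empirical quantiles of each variable, the imputed values of the skewed column remain positive and within its observed range, while the dependence on the second column is carried by the latent Gaussian draw. A proper multiple imputation procedure would additionally propagate the uncertainty in the marginals and the correlation matrix, which this sketch ignores.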