Dirichlet process mixture - Recent developments of copula based models to handle missing data o


3.2.1 Dirichlet process mixture

We start by introducing some fundamental concepts of a Dirichlet process (DP) that are relevant to this article; for a more comprehensive review, refer to Chapters 1 and 2 of Müller et al. (2015).

3.2.1.1 Construction of a Dirichlet process

A nonparametric model is characterized by an infinite number of parameters. Suppose we have a collection of data $y_1, \dots, y_N$ that follow a distribution $G$ in an infinite dimensional space. To proceed with a Bayesian nonparametric model, we need a prior on $G$, known as a Bayesian nonparametric prior. Among the range of Bayesian nonparametric priors, we focus on the DP prior because of its mathematical convenience. A $\mathrm{DP}(M G_0)$ is characterized by two quantities, the total mass parameter $M$ and the centering measure $G_0$. For each partition $\{B_1, \dots, B_K\}$ of a set $B$, $\mathrm{DP}(M G_0)$ assigns probability $G(B_k)$ to every subset $B_k$ such that

$$\big(G(B_1), \dots, G(B_K)\big) \sim \mathrm{Dir}\big(M G_0(B_1), \dots, M G_0(B_K)\big),$$

where $\mathrm{Dir}(\cdot)$ denotes the Dirichlet distribution. The definition shows the analogy between a DP and a Dirichlet distribution: a DP is an infinite dimensional extension of a Dirichlet distribution. In addition, the mathematical convenience of a DP comes from its conjugacy. Let the data $y_1, \dots, y_N \mid G \overset{\text{i.i.d.}}{\sim} G$ and the prior $G \sim \mathrm{DP}(M G_0)$; then the posterior also follows a DP:

$$G \mid y_1, \dots, y_N \sim \mathrm{DP}\Big(M G_0 + \sum_{i=1}^{N} \delta_{y_i}\Big) = \mathrm{DP}\Big((M+N)\Big(\frac{M}{M+N}\, G_0 + \frac{1}{M+N} \sum_{i=1}^{N} \delta_{y_i}\Big)\Big),$$

where the posterior centering measure is a weighted average of the prior centering measure $G_0$ and the point masses at the observed data, denoted by the Dirac function $\delta(\cdot)$, and the total mass parameter increases by $N$, from $M$ to $M+N$.
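As a small worked illustration of the conjugate update (the values $M = 2$ and $N = 8$ are chosen only for this example and do not come from the text), the posterior total mass becomes $M + N = 10$ and

$$G \mid y_1, \dots, y_8 \sim \mathrm{DP}\Big(10\,\Big(\tfrac{2}{10}\, G_0 + \tfrac{1}{10} \sum_{i=1}^{8} \delta_{y_i}\Big)\Big), \qquad E\big[G(B) \mid y_1, \dots, y_8\big] = 0.2\, G_0(B) + 0.1\, \#\{i : y_i \in B\}.$$

Under these assumed values the prior centering measure keeps weight 0.2 and the empirical distribution of the eight observations receives the remaining 0.8; as $N$ grows with $M$ fixed, the weight on $G_0$ shrinks towards zero.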

A DP can be constructed in several equivalent ways, e.g., the Chinese restaurant process, the Pólya urn process, and the stick breaking construction (Müller et al., 2015). The Chinese restaurant process specifies a distribution over partitions of $N$ points in a sequential manner. As a plain explanation, consider a restaurant with an infinite number of tables. The first customer sits at the first table, and the $(n+1)$th customer sits either at an empty table with probability $M/(M+n)$ or at existing table $k$ with probability $n_k/(M+n)$, where $n_k$ is the number of people sitting at table $k$ before the $(n+1)$th customer comes in. In a similar vein, the Pólya urn process specifies the distribution of $(y_1, \dots, y_n)$ as a product of a sequence of conditionals,

$$p(y_1, \dots, y_n) = p(y_1) \prod_{i=2}^{n} p(y_i \mid y_1, \dots, y_{i-1}),$$

where

$$p(y_i \mid y_1, \dots, y_{i-1}) = \frac{1}{M+i-1} \sum_{h=1}^{i-1} \delta_{y_h}(y_i) + \frac{M}{M+i-1}\, G_0(y_i), \qquad i = 2, 3, \dots, n,$$

and $y_1 \sim G_0$. The $(n+1)$th data point either equals one of the existing data points $y_h$, $h = 1, \dots, n$, or is a new value drawn from the centering measure $G_0$. A more constructive way to represent a DP is through the stick breaking construction, where $G$ is represented as an infinite sum of point masses with weights $w_h$ and locations $\theta_h$:

$$G(\cdot) = \sum_{h=1}^{\infty} w_h\, \delta_{\theta_h}(\cdot), \qquad \theta_h \sim G_0, \quad v_h \sim \mathrm{Beta}(1, M), \quad w_h = v_h \prod_{l<h} (1 - v_l).$$

This can be thought of as breaking a stick of unit length at position $v_1$ and setting $w_1$ equal to the length of the piece that was broken off; the parameter associated with that piece is $\theta_1$, which is drawn from the centering measure $G_0$. We then successively break off a fraction $v_h$ of the remaining stick, $h = 2, 3, \dots$, compute $w_h$, and draw the cluster-specific parameter $\theta_h$ from the centering measure $G_0$. In this way a DP induces clusters, as data with the same parameter $\theta_h$ form a cluster. In this chapter we use the stick breaking construction because it facilitates computation.
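The sequential seating rule of the Chinese restaurant process can be sketched in a few lines. This is a minimal illustration rather than code from the source: the function name `chinese_restaurant_process`, the seed, and the choice of 500 customers are assumptions made for the example.

```python
import numpy as np

def chinese_restaurant_process(n_customers, M, rng):
    """Seat customers one at a time; return each customer's table and the table sizes."""
    counts = [1]   # the first customer sits at the first table
    seats = [0]
    for n in range(1, n_customers):
        # With n customers seated: existing table k has probability n_k / (M + n),
        # and a new (empty) table has probability M / (M + n).
        probs = np.array(counts + [M], dtype=float) / (M + n)
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)      # open a new table
        else:
            counts[k] += 1        # join an existing table
        seats.append(k)
    return np.array(seats), np.array(counts)

rng = np.random.default_rng(0)    # hypothetical seed
for M in (1.0, 10.0, 100.0):
    seats, counts = chinese_restaurant_process(500, M, rng)
    print(f"M = {M:g}: {len(counts)} occupied tables, largest table has {counts.max()} customers")
```

The number of occupied tables plays the role of the number of clusters, so running the loop for several values of $M$ gives a quick feel for how the total mass parameter affects the induced partition.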

In a DP, the centering measure $G_0$ is the expectation of the distribution $G$ before data are collected, and it therefore represents the prior belief about $G$. The total mass parameter $M$ controls how concentrated the induced distribution is around $G_0$: a larger $M$ generates more clusters and a distribution closer to $G_0$. To illustrate this, in Figure 3.1 we generated three DPs by the stick breaking construction, with $G_0 = N(0,1)$ and $M = 1, 10, 100$ respectively. When $M = 1$, only a few clusters were generated, each with a large weight, and the induced distribution was far from the standard normal distribution. When $M = 100$, on the other hand, a greater number of clusters were generated, each with a small weight, and the induced distribution was very close to the standard normal distribution.
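A draw of this kind can be sketched with a truncated stick breaking construction. The truncation level `H = 2000`, the device of setting the last break to one so that the weights sum to one, the 0.01 threshold, and the seed are all assumptions of this sketch and are not taken from the text.

```python
import numpy as np

def stick_breaking(M, H, rng):
    """Truncated stick breaking draw from DP(M G0) with G0 = N(0, 1)."""
    v = rng.beta(1.0, M, size=H)      # v_h ~ Beta(1, M)
    v[-1] = 1.0                       # assumption: close the stick at the truncation level
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    w = v * remaining                 # w_h = v_h * prod_{l<h} (1 - v_l)
    theta = rng.standard_normal(H)    # theta_h ~ G0
    return w, theta

rng = np.random.default_rng(1)        # hypothetical seed
for M in (1.0, 10.0, 100.0):
    w, theta = stick_breaking(M, H=2000, rng=rng)
    y = rng.choice(theta, size=1000, p=w)   # draws from the discrete G
    print(f"M = {M:g}: {int(np.sum(w > 0.01))} weights above 0.01, "
          f"{len(np.unique(y))} distinct values among 1000 draws")
```

Plotting `w` against its index and a histogram of `y` for each `M` would give pictures of the same type as those described for Figure 3.1, although the exact output depends on the seed.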

3.2.1.2 Dirichlet process mixture

[Figure 3.1 panels, for $M = 1, 10, 100$: the stick breaking weights plotted against their index (axes: Index, weights) and a histogram of the induced distribution with $G_0 = N(0,1)$ (axes: y, Frequency).]

Figure 3.1: Dirichlet processes generated by the stick breaking construction, with $G_0 = N(0,1)$ and $M = 1, 10, 100$ respectively.

From the stick breaking construction we can see that a DP only generates discrete distributions, so it cannot be used directly to model continuous distributions. To overcome this limitation, a Dirichlet process mixture (DPM) model places a DP prior on the distribution of the parameters of a parametric kernel $f_\theta$:

$$f_G(y) = \int f_\theta(y)\, dG(\theta), \qquad G \sim \mathrm{DP}(M G_0). \tag{3.1}$$

Equivalently, a DPM can be represented hierarchically, where each data point $y_i$ is associated with its own parameter $\theta_i$ and the distribution of those parameters follows a DP:

$$y_i \mid \theta_i \sim f_{\theta_i}, \qquad \theta_i \mid G \sim G, \qquad G \sim \mathrm{DP}(M G_0). \tag{3.2}$$
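The hierarchical form can be simulated without representing $G$ explicitly, by drawing the $\theta_i$ sequentially from the Pólya urn scheme and then drawing each $y_i$ from its kernel. The sketch below assumes $G_0 = N(0,1)$ and a normal kernel $f_\theta = N(\theta, \sigma^2)$ with $\sigma = 0.3$; these choices, the function name `sample_dpm`, and the seed are illustrative and do not come from the text.

```python
import numpy as np

def sample_dpm(n, M, sigma, rng):
    """Draw y_1..y_n from the hierarchical DPM with G integrated out (Polya urn)."""
    theta = np.empty(n)
    for i in range(n):
        # With i earlier parameters: a fresh draw from G0 has probability M / (M + i),
        # otherwise theta_i copies one of the earlier values chosen uniformly at random.
        if rng.random() < M / (M + i):
            theta[i] = rng.standard_normal()       # assumed G0 = N(0, 1)
        else:
            theta[i] = theta[rng.integers(i)]
    y = rng.normal(loc=theta, scale=sigma)         # assumed kernel N(theta_i, sigma^2)
    return y, theta

rng = np.random.default_rng(2)                     # hypothetical seed
y, theta = sample_dpm(n=200, M=1.0, sigma=0.3, rng=rng)
print(f"{len(np.unique(theta))} distinct parameter values among {len(theta)} observations")
```

Observations that share a value of `theta` belong to the same cluster, while the data `y` themselves are continuous, which is the point of mixing over a kernel.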

In this representation, clusters are induced by the grouping of the parameters. A third equivalent representation is constructed through the stick breaking representation of a DP, which induces an infinite mixture model

$$f(y \mid w, \theta) = \sum_{h=1}^{\infty} w_h\, f(y \mid \theta_h). \tag{3.3}$$

It shares many of the same interpretations and properties as a finite mixture model, but is more flexible in terms of allowing the model complexity to be adapted to the data.
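In practice the infinite sum is often approximated by truncating it at a finite number of components. The sketch below evaluates such a truncated mixture density on a grid; the normal kernel with $\sigma = 0.5$, $G_0 = N(0,1)$, $M = 5$, the truncation level `H = 200`, and the seed are assumptions of the example, not values from the text.

```python
import numpy as np

def dpm_density(y, w, theta, sigma):
    """Evaluate sum_h w_h N(y | theta_h, sigma^2) at each point of y."""
    y = np.asarray(y, dtype=float)[:, None]
    kern = np.exp(-0.5 * ((y - theta[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return kern @ w

rng = np.random.default_rng(3)                 # hypothetical seed
M, H, sigma = 5.0, 200, 0.5                    # assumed values for illustration

# Truncated stick breaking weights; the last break is set to one so they sum to one.
v = rng.beta(1.0, M, size=H)
v[-1] = 1.0
w = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
theta = rng.standard_normal(H)                 # theta_h ~ G0 = N(0, 1)

grid = np.linspace(-6.0, 6.0, 1201)
dens = dpm_density(grid, w, theta, sigma)
area = float(np.sum(dens) * (grid[1] - grid[0]))   # Riemann sum; should be close to one
print(f"approximate area under the mixture density: {area:.3f}")
```

Unlike the discrete draw of $G$ itself, the resulting density is smooth, and only the components with non-negligible weight contribute visibly to it.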


3.2.1.3 Selecting the number of components in a finite mixture model

In a finite mixture model $f(y_i \mid w, \theta) = \sum_{h=1}^{H} w_h f(y_i \mid \theta_h)$, the number of clusters $H$ is fixed although unknown, and a model selection procedure can be applied to choose a ‘best’ $H$. This is known as the order selection problem (McLachlan and Peel, 2004, Chapter 6). Commonly used methods for the order selection problem include calculating an information criterion (AIC, BIC), performing a likelihood ratio test, and calculating the Bayes factor between two competing models (Carlin and Chib, 1995). These methods require a number of candidate models to be fitted and one to be chosen among them, so they can be quite computationally intensive, and they may fail because of nonidentifiability issues and the use of improper priors. Specifically, the nonidentifiability issue arises in a mixture distribution when the model can be represented equally well by different values of $H$, and this may lead to an unbounded likelihood function in a frequentist model (Basford and McLachlan, 1985) or a divergent posterior if an improper prior is specified in a Bayesian model. In addition, these methods fail to account for the variability in $H$ because it is treated as fixed. Some fully Bayesian procedures instead carry out model selection and parameter estimation in a single MCMC run by treating $H$ as a random variable. For example, the reversible jump algorithm (Richardson and Green, 1997) allows the chain to jump between different values of $H$, corresponding to different models, until convergence; in a birth and death process (Stephens, 2000) the parameters in the model are viewed as a point process and new components are allowed to be ‘born’ and existing components to ‘die’; and in prior parallel tempering (Van Havre et al., 2015) far more clusters than are supported by the data are initially included, and the redundant clusters are gradually removed or merged with other clusters in the MCMC algorithm by controlling hyperparameters. We have chosen to use the prior parallel tempering algorithm in our proposed model because it requires neither the design of a delicate balance criterion, as the trans-dimensional algorithms do, nor much extra programming effort: the parallel chains differ only in hyperparameters that resemble temperatures. Furthermore, the prior parallel tempering algorithm simultaneously achieves order selection and assists mixing.

In document Recent developments of copula based models to handle missing data of mixed type in multivariate analysis (Page 69-74)