
4.2 Some remarks on adaptive smoothing

4.2.2 Comparison to Markov random fields

In sharp contrast to (4.9) is the conditional approach for Markov random fields (MRFs). Recall that for MRFs two regions are called neighbors if they contribute to each other's full conditional. Hence, the conditional correlation is always non-zero whenever two regions $i$ and $j$ are neighbors. This conditional correlation is determined by the precision matrix $Q$. To perform spatial smoothing, the precision matrix is chosen fixed, usually with non-zero entries for pairs of geographically adjacent regions. Thus, the definition of neighborhood for the MRF agrees with the definition of geographical neighborhood given in Section 2.1.1. We will focus on this definition, although other choices are possible.


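Now consider a GMRF for the parameters $\lambda$, i.e. a pairwise difference prior with scale parameter $\kappa$:

$$
\begin{aligned}
p(\lambda \mid \kappa) &\propto \exp\Bigl( -\frac{\kappa}{2} \sum_{i \sim j} (\lambda_i - \lambda_j)^2 \Bigr) && (4.13) \\
&= \exp\Bigl( \frac{\kappa}{2} \sum_{i < j} k_{ij} (\lambda_i - \lambda_j)^2 \Bigr). && (4.14)
\end{aligned}
$$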

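The precision matrix is given by $Q = \kappa K$, where $K = (k_{ij})$ is a penalty matrix with off-diagonal elements

$$
k_{ij} = \begin{cases} -1 & \text{if } i \sim j, \\ 0 & \text{otherwise,} \end{cases} \qquad \text{for } i \neq j,
$$

and the number of neighbors on the diagonal

$$
k_i = k_{ii} = -\sum_{j \neq i} k_{ij}, \qquad i = 1, \ldots, n.
$$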
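As a small numerical illustration of these definitions, the following sketch builds the penalty matrix from an adjacency matrix and checks that the sum of squared differences in (4.13) coincides with the quadratic form $\lambda^\top K \lambda$. The path graph, the parameter values, and the function names `penalty_matrix` and `pairwise_penalty` are assumptions made for this example only; they do not appear in the text.

```python
import numpy as np

def penalty_matrix(adjacency):
    """Penalty matrix K: -a_ij off the diagonal, number of neighbors on it.

    Assumes a symmetric 0/1 adjacency matrix with zero diagonal.
    """
    A = np.asarray(adjacency, dtype=float)
    return np.diag(A.sum(axis=1)) - A

def pairwise_penalty(lam, adjacency):
    """Sum of (lam_i - lam_j)^2 over neighboring pairs, each pair counted once."""
    A = np.asarray(adjacency)
    n = len(lam)
    return sum((lam[i] - lam[j]) ** 2
               for i in range(n) for j in range(i + 1, n) if A[i, j])

# Hypothetical example: four regions on a path 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
K = penalty_matrix(A)
lam = np.array([0.5, 1.0, 3.0, 2.0])

quad = lam @ K @ lam              # quadratic form lam' K lam
pair = pairwise_penalty(lam, A)   # sum over i ~ j of squared differences
print(quad, pair)
```

Both quantities agree, so the prior can equivalently be written as $\exp(-\tfrac{\kappa}{2} \lambda^\top K \lambda)$. The rows of $K$ sum to zero, which is why $K$ is singular.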

Note that the off-diagonal elements of the penalty matrix are the negative entries of the adjacency matrix of the underlying graph, i.e. $k_{ij} = -a_{ij}$ for $i \neq j$ (cf. Section 2.1.2). This penalty matrix controls the conditional correlation structure, and therefore Clayton (1996) also calls it the “inverse variance-covariance structure”. The parameters in two adjacent regions are conditionally correlated with

$$
\operatorname{Cor}(\lambda_i, \lambda_j \mid \cdot) = \frac{1}{\sqrt{k_i k_j}}, \qquad \text{for } i \sim j.
$$
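The conditional correlation can be checked numerically with the standard Gaussian identity that the correlation of two components given all others equals $-q_{ij} / \sqrt{q_{ii} q_{jj}}$ for a precision matrix $Q = (q_{ij})$. The sketch below is only an illustration: the four-region graph, the value of $\kappa$, and the helper name `conditional_correlation` are assumptions for this example.

```python
import numpy as np

def conditional_correlation(Q, i, j):
    """Correlation of components i and j given all others, from precision Q."""
    return -Q[i, j] / np.sqrt(Q[i, i] * Q[j, j])

# Hypothetical graph: region 0 borders regions 1, 2 and 3; regions 1 and 2
# also border each other.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
K = np.diag(A.sum(axis=1)) - A   # penalty matrix
kappa = 2.5                      # arbitrary scale; it cancels in the ratio
Q = kappa * K

k = np.diag(K)                   # number of neighbors per region: 3, 2, 2, 1
cor_01 = conditional_correlation(Q, 0, 1)
cor_03 = conditional_correlation(Q, 0, 3)
cor_13 = conditional_correlation(Q, 1, 3)   # regions 1 and 3 are not neighbors
print(cor_01, 1 / np.sqrt(k[0] * k[1]))
print(cor_03, 1 / np.sqrt(k[0] * k[3]))
```

Since $\kappa$ cancels, changing the scale parameter leaves these correlations untouched; only the numbers of neighbors enter.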

Thus, the conditional correlation of two parameters solely depends on the number of neighbors of the two regions and is fixed. The local smoothing behavior of the GMRF is predefined by the specification of the precision matrix. Note that the local amount of smoothing is determined by the marginal variance of the regions, not the conditional. For pairwise difference priors, the marginal variances are not defined, but can be derived under linear constraints, see Section 4.3.4.

What varies is the global amount of smoothing, according to the unknown scale parameter $\kappa$. For fixed precision $\kappa$, prior (4.13) penalizes differences in the parameters $\lambda$ and supports a smooth parameter surface. Hence, the prior opposes the likelihood and allows for smoothing, where the global amount of smoothing depends on the scale parameter. However, smoothing is not adaptive to the observed data, since the penalty matrix depends only on the underlying graph. Thus, there is no structural learning in MRFs.

Using other definitions of neighborhood will not change this, e.g. the use of second-order neighborhoods (see Figure 2.2) will lead to smoother results but not to adaptive smoothing. Other non-Gaussian approaches, e.g. based on absolute differences

$$
p(\lambda \mid \kappa) \propto \exp\Bigl( -\frac{\kappa}{2} \sum_{i \sim j} |\lambda_i - \lambda_j| \Bigr),
$$


are more robust versions and allow for stronger edges in the parameter surface. Still, the smoothing behavior depends only on the underlying graph.
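To make the difference between the two penalties concrete, the following sketch compares them on a path graph for two hypothetical parameter surfaces that cover the same total rise: one sharp step and one gradual ramp. The data and the function names are assumptions for this illustration, and the common factor $\kappa/2$ is omitted.

```python
import numpy as np

def squared_penalty(lam):
    """Sum of squared differences between neighbors on a path graph."""
    return float(np.sum(np.diff(lam) ** 2))

def absolute_penalty(lam):
    """Sum of absolute differences between neighbors on a path graph."""
    return float(np.sum(np.abs(np.diff(lam))))

# Two hypothetical surfaces rising from 0 to 3 over seven regions.
step = np.array([0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 3.0])   # one sharp edge
ramp = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])   # gradual increase

print(squared_penalty(step), squared_penalty(ramp))     # 9.0 versus 1.5
print(absolute_penalty(step), absolute_penalty(ramp))   # 3.0 versus 3.0
```

In this example the squared penalty charges the sharp edge six times as much as the ramp, whereas the absolute penalty treats both alike. In both cases the set of penalized pairs is given by the graph alone.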

For GMRFs, adaptive smoothing requires inference on the structure of the precision matrix, i.e. inference on the elements of the penalty matrix $K$. One approach is to interpret the $k_{ij}$ in (4.14) as (negative) weights on the differences between the parameters. Fahrmeir, Gössl & Hennerfeind (2003) propose a model where the non-zero entries in the penalty matrix are stochastic and estimated within the algorithm. This is an appealing extension but has the unpleasant feature that the normalizing constant of the pairwise difference prior is difficult to derive. Furthermore, smoothing is now variable and adaptive to the data, but the structure of the penalty matrix is still fixed because only predefined non-zero elements of $K$ are subject to statistical inference.

A further step would be to assume a variable neighborhood structure. For example, we may implement a move that switches off-diagonal elements of $K$ from 0 to $-1$ and back (and simultaneously updates the diagonal elements). This idea would indeed amount to structural learning based on the data. Still, some care has to be taken to ensure symmetry of $K$. In addition, the extreme case with $k_{ij} = 0$ for all pairs $(i,j)$ has to be avoided. In this case, the parameters are independent and the pairwise differences between parameters become irrelevant. Thus, the prior (4.14) does not oppose the likelihood and no spatial smoothing is performed.
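The bookkeeping for such a move could look roughly like the sketch below. It covers only the proposal step: it flips one pair symmetrically, updates the two affected diagonal entries, and refuses a proposal that would leave no edges at all. The function name `toggle_edge` and the convention of returning `None` for a refused move are assumptions of this example; an acceptance step for a sampler is deliberately not shown.

```python
import numpy as np

def toggle_edge(K, i, j):
    """Propose a penalty matrix with the pair (i, j) switched on or off.

    Returns a new matrix, or None if the move would remove the last edge
    (the case k_ij = 0 for all pairs, which has to be avoided).
    """
    if i == j:
        raise ValueError("i and j must be different regions")
    K_new = K.copy()
    delta = -1 if K[i, j] == 0 else 1   # 0 -> -1 adds an edge, -1 -> 0 removes it
    K_new[i, j] += delta                # update both off-diagonal entries
    K_new[j, i] += delta                # so that K stays symmetric
    K_new[i, i] -= delta                # neighbor counts on the diagonal
    K_new[j, j] -= delta
    if not np.any(K_new[np.triu_indices_from(K_new, k=1)]):
        return None                     # no neighbors left: refuse the move
    return K_new

# Hypothetical example: path 0 - 1 - 2, then add the edge between 0 and 2.
K = np.array([[ 1, -1,  0],
              [-1,  2, -1],
              [ 0, -1,  1]])
K_added = toggle_edge(K, 0, 2)
print(K_added)

# Removing the only edge of a two-region graph is refused.
K_single = np.array([[1, -1], [-1, 1]])
print(toggle_edge(K_single, 0, 1))
```

Updating both off-diagonal entries and both diagonal entries in one step keeps every proposed matrix symmetric with zero row sums.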

4.2.3 Summary

The CPM is one possibility to perform adaptive smoothing with respect to the observed data for arbitrary graphs. Adaptiveness is achieved by inference on the correlation matrix of the parameters. The prior model, as proposed in Section 2.2.2, assumes that parameters λ are constant within each cluster. At first sight, this is a rather strong assumption but crucial for any spatially adaptive estimation.

In general, this assumption is not necessary. There are applications where other formulations might be useful. Indeed, there exist related approaches in which the assumption of constant parameters is relaxed. Holmes et al. (1999) propose a Bayesian partition model for applications in continuous space. In one dimension, this can be seen as regression modeling with partitions for which in every subset the unknown function is linear instead of constant. This is the continuous analogue to the piecewise linear model $\lambda_i = \alpha_j + \beta_j t_i$, $t_i \in C_j$, for time series data, already tackled in connection with Example 2.4. Although the parameters $\lambda$ are no longer constant within each cluster, the parameters defining the linear pieces still are, i.e. the intercepts $\boldsymbol{\alpha}_k = (\alpha_1, \ldots, \alpha_k)$ and the slopes $\boldsymbol{\beta}_k = (\beta_1, \ldots, \beta_k)$. Thus, the more flexible model is achieved by increasing the dimension of the parameter space, i.e. $\theta_j = (\alpha_j, \beta_j)$ for cluster $C_j$.
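A minimal sketch of the piecewise linear model for time series data may help to fix ideas. The time points, the cluster assignment, and the values of the intercepts and slopes are invented for this illustration, and the function name `piecewise_linear` is an assumption.

```python
import numpy as np

def piecewise_linear(t, cluster, alpha, beta):
    """Evaluate lambda_i = alpha_j + beta_j * t_i, where j is the cluster of unit i."""
    t = np.asarray(t, dtype=float)
    cluster = np.asarray(cluster)
    return np.asarray(alpha)[cluster] + np.asarray(beta)[cluster] * t

# Hypothetical example: six time points split into k = 2 clusters.
t = np.arange(1, 7)                       # t_i = 1, ..., 6
cluster = np.array([0, 0, 0, 1, 1, 1])    # cluster index j for each unit i
alpha = np.array([1.0, 10.0])             # intercepts alpha_1, alpha_2
beta = np.array([0.5, -1.0])              # slopes beta_1, beta_2

lam = piecewise_linear(t, cluster, alpha, beta)
print(lam)   # 1.5, 2.0, 2.5 in the first cluster; 6.0, 5.0, 4.0 in the second

# Setting all slopes to zero recovers parameters that are constant per cluster.
lam_const = piecewise_linear(t, cluster, alpha, np.zeros(2))
print(lam_const)
```

Each cluster now carries two parameters instead of one, which is the increase in dimension described above.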

More generally, any deterministic functions $f_j$ between the unit identifier (e.g. time point $i$) and the parameter are conceivable, i.e. $\lambda_i = f_j(t_i)$ for $i \in C_j$. For reasons of identifiability, the dimension of the parameters $\theta_j$, $j = 1, \ldots, k$, should be well below the number of observations in each cluster. However, this idea only works for certain graphs. To define such functions, the

In document Raßer, Günter (2003): Clustering Partition Models for Discrete Structures with Applications in Geographical Epidemiology. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 88-91)