
4.2 Some remarks on adaptive smoothing

4.2.2 Comparison to Markov random fields

In sharp contrast to (4.9) is the conditional approach for Markov random fields. Recall that for MRFs two regions are called neighbors if they contribute to the full conditional of each other. Hence, the conditional correlation is always non-zero whenever two regions $i$ and $j$ are neighbors. This conditional correlation is determined by the precision matrix $Q$. To perform spatial smoothing, the structure of the precision matrix is fixed in advance, usually with non-zero entries for pairs of geographically adjacent regions. Thus, the definition of neighborhood for the MRF is in agreement with the definition of geographical neighborhood as given in Section 2.1.1. We will focus on this definition, although other choices are possible.


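Now consider a GMRF for the parameters $\lambda$, i.e. a pairwise difference prior with scale parameter $\kappa$:

\begin{align}
p(\lambda \mid \kappa) &\propto \exp\Bigl( -\frac{\kappa}{2} \sum_{i \sim j} (\lambda_i - \lambda_j)^2 \Bigr) \tag{4.13} \\
&= \exp\Bigl( \frac{\kappa}{2} \sum_{i < j} k_{ij} (\lambda_i - \lambda_j)^2 \Bigr). \tag{4.14}
\end{align}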

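The precision matrix is given by $Q = \kappa K$, where $K = (k_{ij})$ is a penalty matrix with off-diagonal elements

\[
k_{ij} =
\begin{cases}
-1 & \text{if } i \sim j, \\
0 & \text{otherwise},
\end{cases}
\qquad \text{for } i \neq j,
\]

and the number of neighbors on the diagonal

\[
k_i = k_{ii} = -\sum_{j \neq i} k_{ij}, \qquad i = 1, \ldots, n.
\]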
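As an illustration of this construction, the following minimal sketch builds $K$ from an adjacency matrix and checks numerically that the quadratic form $\lambda^\top K \lambda$ equals the sum of squared differences over neighboring pairs, which is the quantity appearing in the exponent of (4.13). It assumes that the sum over $i \sim j$ counts each unordered pair once. The path graph with four regions, the values of `lam`, and all function names are hypothetical choices made for this example only.

```python
def penalty_matrix(adjacency):
    """Return K with k_ij = -a_ij off the diagonal and k_ii = number of neighbors of i."""
    n = len(adjacency)
    K = [[-adjacency[i][j] if i != j else 0 for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] = -sum(K[i][j] for j in range(n) if j != i)
    return K


def quadratic_form(K, lam):
    """Compute lam' K lam."""
    n = len(lam)
    return sum(lam[i] * K[i][j] * lam[j] for i in range(n) for j in range(n))


def pairwise_penalty(adjacency, lam):
    """Sum of squared differences over unordered neighbor pairs i ~ j."""
    n = len(lam)
    return sum(
        (lam[i] - lam[j]) ** 2
        for i in range(n)
        for j in range(i + 1, n)
        if adjacency[i][j]
    )


# Hypothetical example: path graph 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K = penalty_matrix(A)
lam = [0.5, 1.0, -0.25, 2.0]  # arbitrary illustrative parameter values

print(K)
print(quadratic_form(K, lam), pairwise_penalty(A, lam))
```

Every row of $K$ sums to zero, so $K$ is singular. This is in line with the later remark that marginal variances are not defined for pairwise difference priors.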

Note that the off-diagonal elements in the penalty matrix are the negative entries of the adjacency matrix of the underlying graph, i.e. $k_{ij} = -a_{ij}$ for $i \neq j$ (cf. Section 2.1.2). This penalty matrix controls the conditional correlation structure and therefore Clayton (1996) also calls it the “inverse variance-covariance structure”. The parameters in two adjacent regions are conditionally correlated

\[
\operatorname{Cor}(\lambda_i, \lambda_j \mid \cdot) = \frac{1}{\sqrt{k_i k_j}}, \qquad \text{for } i \sim j.
\]
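The conditional correlation can be checked numerically. The sketch below relies on the standard GMRF identity $\operatorname{Cor}(\lambda_i, \lambda_j \mid \cdot) = -q_{ij} / \sqrt{q_{ii} q_{jj}}$ for a precision matrix $Q = (q_{ij})$, which is assumed here rather than taken from the text. The path graph with four regions, the value of `kappa`, and the function name are hypothetical.

```python
import math


def conditional_correlation(Q, i, j):
    """Conditional correlation of components i and j given all others,
    using the identity -Q_ij / sqrt(Q_ii * Q_jj) for a Gaussian precision matrix Q."""
    return -Q[i][j] / math.sqrt(Q[i][i] * Q[j][j])


# Penalty matrix of the hypothetical path graph 0 - 1 - 2 - 3.
K = [
    [1, -1, 0, 0],
    [-1, 2, -1, 0],
    [0, -1, 2, -1],
    [0, 0, -1, 1],
]
kappa = 3.7  # arbitrary illustrative scale parameter
Q = [[kappa * k for k in row] for row in K]

cor_01 = conditional_correlation(Q, 0, 1)  # neighbors with k_0 = 1, k_1 = 2
cor_12 = conditional_correlation(Q, 1, 2)  # neighbors with k_1 = 2, k_2 = 2
cor_02 = conditional_correlation(Q, 0, 2)  # not neighbors

print(cor_01, cor_12, cor_02)
```

The scale parameter cancels in the ratio, so the result depends only on the numbers of neighbors $k_i$ and $k_j$.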

Thus, the conditional correlation of two parameters depends solely on the number of neighbors of the two regions and is fixed. The local smoothing behavior of the GMRF is predefined by the specification of the precision matrix. Note that the local amount of smoothing is determined by the marginal variance of the regions, not the conditional variance. For pairwise difference priors, the marginal variances are not defined, but can be derived under linear constraints, see Section 4.3.4.

What varies is the global amount of smoothing according to the unknown scale parameter $\kappa$. For fixed precision $\kappa$, prior (4.13) penalizes differences in the parameters $\lambda$ and supports a smooth parameter surface. Hence, the prior opposes the likelihood and allows for smoothing, where the global amount of smoothing depends on the scale parameter. But smoothing is not adaptive to the observed data, since the penalty matrix depends only on the underlying graph. Thus, there is no structural learning in MRFs.

Using other definitions of neighborhood will not change this, e.g. the use of second-order neighborhoods (see Figure 2.2) will lead to smoother results but not to adaptive smoothing. Other non-Gaussian approaches, e.g. based on absolute differences

\[
p(\lambda \mid \kappa) \propto \exp\Bigl( -\frac{\kappa}{2} \sum_{i \sim j} |\lambda_i - \lambda_j| \Bigr),
\]


are more robust versions and allow for stronger edges in the parameter surface. Still, the smoothing behavior depends only on the underlying graph.
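A small numerical comparison may make the difference between the two penalties concrete. The sketch below evaluates the sum of squared differences and the sum of absolute differences for two hypothetical parameter surfaces on a path graph with four regions: a gradual ramp and a sharp step with the same total rise. The graph, the surfaces, and the function names are assumptions made for illustration.

```python
def neighbor_pairs(adjacency):
    """List unordered neighbor pairs (i, j) with i < j."""
    n = len(adjacency)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if adjacency[i][j]]


def squared_penalty(adjacency, lam):
    """Sum of squared differences over neighbor pairs (Gaussian case)."""
    return sum((lam[i] - lam[j]) ** 2 for i, j in neighbor_pairs(adjacency))


def absolute_penalty(adjacency, lam):
    """Sum of absolute differences over neighbor pairs (non-Gaussian case)."""
    return sum(abs(lam[i] - lam[j]) for i, j in neighbor_pairs(adjacency))


# Hypothetical path graph 0 - 1 - 2 - 3.
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
ramp = [0, 1, 2, 3]  # gradual increase
step = [0, 0, 3, 3]  # single sharp edge between regions 1 and 2

print(squared_penalty(A, ramp), squared_penalty(A, step))    # 3 9
print(absolute_penalty(A, ramp), absolute_penalty(A, step))  # 3 3
```

In this example the squared penalty is three times larger for the step than for the ramp, whereas the absolute penalty treats both surfaces alike. In both cases the set of penalized pairs is given by the graph alone.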

For GMRFs, adaptive smoothing requires inference on the structure of the precision matrix, i.e. inference on the elements of the penalty matrix $K$. One approach is to interpret the $k_{ij}$ in (4.14) as (negative) weights on the differences between the parameters. Fahrmeir, Gössl & Hennerfeind (2003) propose a model where the non-zero entries in the penalty matrix are stochastic and estimated within the algorithm. This is an appealing extension but has the unpleasant feature that the normalizing constant of the pairwise difference prior is difficult to derive. Furthermore, smoothing is now variable and adaptive to the data, but the structure of the penalty matrix is still fixed because only predefined non-zero elements of $K$ are subject to statistical inference.

A further step would be to assume a variable neighborhood structure. For example, we may implement a move that switches off-diagonal elements of $K$ from $0$ to $-1$ and vice versa (and simultaneously updates the diagonal elements). This idea would indeed amount to structural learning based on the data. Still, some care has to be taken to ensure symmetry of $K$. In addition, the extreme case with $k_{ij} = 0$ for all pairs $(i, j)$ has to be avoided. In this case, the parameters are independent and the pairwise differences between parameters become irrelevant. Thus, the prior (4.14) does not oppose the likelihood and no spatial smoothing is performed.
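The bookkeeping for such a move can be sketched as follows. This is only an illustration of how a single switch might keep $K$ symmetric, keep the diagonal consistent, and reject the edgeless extreme case; it is not a complete sampler, and it says nothing about proposal or acceptance probabilities. The function name `toggle_edge` and the convention of returning `None` for a rejected move are hypothetical.

```python
def toggle_edge(K, i, j):
    """Return a copy of K with the pair (i, j) switched between 0 and -1.

    Both k_ij and k_ji are changed to keep K symmetric, and the diagonal
    entries k_ii and k_jj are updated so that they still count neighbors.
    Returns None if the move would leave the graph without any edge.
    """
    if i == j:
        raise ValueError("i and j must refer to different regions")
    new = [row[:] for row in K]
    delta = -1 if new[i][j] == 0 else 1  # -1 adds an edge, +1 removes it
    new[i][j] += delta
    new[j][i] += delta
    new[i][i] -= delta
    new[j][j] -= delta
    # All diagonal entries are zero exactly when no pair is neighbored.
    if all(new[r][r] == 0 for r in range(len(new))):
        return None
    return new


# Hypothetical path graph 0 - 1 - 2.
K = [
    [1, -1, 0],
    [-1, 2, -1],
    [0, -1, 1],
]
K_added = toggle_edge(K, 0, 2)          # switch k_02 from 0 to -1
K_restored = toggle_edge(K_added, 0, 2)  # switch it back
rejected = toggle_edge([[1, -1], [-1, 1]], 0, 1)  # would remove the only edge

print(K_added)
print(K_restored == K, rejected)
```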

4.2.3 Summary

The CPM is one possibility to perform adaptive smoothing with respect to the observed data for arbitrary graphs. Adaptiveness is achieved by inference on the correlation matrix of the parameters. The prior model, as proposed in Section 2.2.2, assumes that parameters λ are constant within each cluster. At first sight, this is a rather strong assumption but crucial for any spatially adaptive estimation.

In general, this assumption is not necessary. There are applications where other formulations might be useful. Indeed, there exist related approaches in which the assumption of constant parameters is relaxed. Holmes et al. (1999) propose a Bayesian partition model for applications in continuous space. In one dimension, this can be seen as regression modeling with partitions for which in every subset the unknown function is linear instead of constant. This is the continuous analogue to the piecewise linear model $\lambda_i = \alpha_j + \beta_j t_i$, $t_i \in C_j$, for time series data, already tackled in connection with Example 2.4. Although the parameters $\lambda$ are no longer constant within each cluster, the parameters defining the linear pieces still are, i.e. the intercepts $\boldsymbol{\alpha}_k = (\alpha_1, \ldots, \alpha_k)$ and the slopes $\boldsymbol{\beta}_k = (\beta_1, \ldots, \beta_k)$. Thus, the more flexible model is achieved by increasing the dimension of the parameter space, i.e. $\theta_j = (\alpha_j, \beta_j)$ for cluster $C_j$.
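The piecewise linear formulation can be illustrated with a short sketch that maps cluster-level parameters $\theta_j = (\alpha_j, \beta_j)$ to unit-level parameters $\lambda_i$. The time points, the partition into two clusters, the parameter values, and the function name are hypothetical and serve only as an example.

```python
def piecewise_linear(times, clusters, theta):
    """Compute lambda_i = alpha_j + beta_j * t_i for every unit i in cluster C_j.

    times    -- list of time points t_i
    clusters -- list of clusters, each a list of unit indices
    theta    -- list of (alpha_j, beta_j) pairs, one per cluster
    """
    lam = [None] * len(times)
    for members, (alpha, beta) in zip(clusters, theta):
        for i in members:
            lam[i] = alpha + beta * times[i]
    return lam


times = [0, 1, 2, 3, 4, 5]
clusters = [[0, 1, 2], [3, 4, 5]]     # two clusters of consecutive time points
theta = [(1.0, 0.5), (4.0, -1.0)]     # (intercept, slope) per cluster

lam = piecewise_linear(times, clusters, theta)
print(lam)  # [1.0, 1.5, 2.0, 1.0, 0.0, -1.0]

# Setting all slopes to zero recovers the piecewise constant model.
lam_const = piecewise_linear(times, clusters, [(1.0, 0.0), (4.0, 0.0)])
print(lam_const)  # [1.0, 1.0, 1.0, 4.0, 4.0, 4.0]
```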

More generally, any deterministic functions $f_j$ between the unit identifier (e.g. time point $i$) and the parameter are conceivable, i.e. $\lambda_i = f_j(t_i)$ for $i \in C_j$. For reasons of identifiability, the dimension of the parameters $\theta_j$, $j = 1, \ldots, k$, should be well below the number of observations in each cluster. However, this idea only works for certain graphs. To define such functions, the