

4.1.2 Example walkthrough and code snippets

The functionality of sparr, as with many software packages, is best illustrated by example. In this section we re-create the example in Section 2.6.2 by estimating and displaying an adaptive relative risk function for the PBC data, along with asymptotically estimated tolerance contours.

Data

Initially, it is sensible to simply view the raw data. The following code loads the data set into the current workspace, re-formats the data structures, and yields Figure 4.1.

R> library(sparr)
R> data(PBC)
R> pbc.cases <- split(PBC)[[1]]
R> pbc.controls <- split(PBC)[[2]]
R> plot(pbc.controls,pch=3,main="")
R> box(bty="L");axis(1);axis(2);title(ylab="Northings",xlab="Eastings")
R> points(pbc.cases,col=2)

Inspection reveals, as we would expect, similar case-control dispersions based on the natural inhomogeneity in the population. Comparing the distribution of cases versus the distribution of controls, and attempting to identify any apparent differences, is our key objective.

Bandwidths

Firstly, we must sensibly calculate our isotropic bandwidth(s) for the required kernel smoothing. Recall the decision in Section 2.3.1 to select a common global bandwidth and separate pilot bandwidths for the case and control density estimates when implementing variable smoothing. Furthermore, recall the poor performance of the single-density leave-one-out least-squares cross-validation bandwidth selector when compared to Terrell’s (1990) oversmoothing principle in the numerical experiments in Section 2.5. With this in mind, observe the following code:

[Figure 4.1: plot of the raw PBC case and control locations; horizontal axis Eastings, vertical axis Northings.]

R> OS.f <- OS(pbc.cases)
R> OS.f
[1] 317.4248
R> OS.g <- OS(pbc.controls)
R> OS.g
[1] 330.2433
R> OS.pool <- OS(PBC, nstar=sqrt(pbc.cases$n*pbc.controls$n))
R> OS.pool
[1] 349.8445

These commands calculate the single-density oversmoothing bandwidths for the case, control and pooled data via the sparr function OS, which uses the mean of the two axis-specific interquartile ranges scaled by 1/1.34 to represent an ad hoc robust scalar measure of standard deviation, as required for the computation of the oversmoothing bandwidths. The argument nstar for OS.pool corresponds to our discussion in Section 2.3.1, where we set the ‘effective’ sample size to the geometric mean of the separate case and control sample sizes to prevent too much influence by the larger of the two.
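As a rough illustration of the two ingredients just described (the robust scale measure and the geometric-mean effective sample size), the following base-R sketch uses simulated coordinates rather than the PBC data. The names xy, robust_sd and n_star are hypothetical, and the sketch deliberately stops short of the full oversmoothing constant applied inside OS; only the sample sizes 761 and 3020 are taken from the output shown in this section.

```r
# Simulated coordinates standing in for a point pattern (not the PBC data)
set.seed(1)
xy <- cbind(x = rnorm(500, sd = 2), y = rnorm(500, sd = 3))

# Mean of the two axis-specific interquartile ranges, scaled by 1/1.34,
# used as a single robust stand-in for the standard deviation
robust_sd <- mean(c(IQR(xy[, "x"]), IQR(xy[, "y"]))) / 1.34

# 'Effective' sample size: geometric mean of the case and control counts
n_case <- 761
n_ctrl <- 3020
n_star <- sqrt(n_case * n_ctrl)   # roughly 1516, far below n_case + n_ctrl
```

The geometric mean always lies between the two sample sizes, so the much larger control sample cannot dominate the pooled bandwidth calculation.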

Density estimation

Estimation of the densities may now take place. We execute

R> pbc.pooled.density <- bivariate.density(PBC,pilotH=0.5*OS.pool,
+    globalH=OS.pool,adaptive=T,res=128,WIN=PBC$window,comment=F)
R> pbc.case.density <- bivariate.density(pbc.cases,pilotH=0.5*OS.f,
+    globalH=OS.pool,adaptive=T,res=128,WIN=PBC$window,
+    gamma=pbc.pooled.density$gamma,comment=F)
R> pbc.control.density <- bivariate.density(pbc.controls,pilotH=0.5*OS.g,
+    globalH=OS.pool,adaptive=T,res=128,WIN=PBC$window,
+    gamma=pbc.pooled.density$gamma,comment=F)

which provides us with the required objects. The object pbc.pooled.density is simply the density estimate of the pooled data set, i.e. ignoring the case/control marking of the observations; this is needed for computation of the subsequent p-value surfaces. Global bandwidths are set equal to the OS value computed from the pooled data. Note that the pilot bandwidths are set to half of the OS-computed values; the case and control estimates utilise the separately computed oversmoothing bandwidths for the pilots, in alignment with earlier comments (see Section 2.6.2). We also employ a common geometric mean scaling term computed from the pooled data, γω, for the case/control estimates as the argument gamma (the reader is referred back to Section 2.2.2). Finally, note that the grid resolution of our kernel estimates is set using the argument res=128; this indicates that we wish to evaluate the densities on a 128×128 grid, automatically discarding any cells lying outside the irregular polygon which defines our study region, specified by WIN. Function progress is commented upon during execution; we have disabled this through use of the argument comment=F. The option comment=T is particularly useful when we are dealing with large data sets and/or very fine grid resolutions.
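To make the roles of the pilot bandwidth, the global bandwidth and the geometric mean scaling term more concrete, the following base-R sketch computes Abramson-type variable bandwidths for simulated points. It is a simplified illustration under stated assumptions, not the code inside bivariate.density: it uses a Gaussian kernel, ignores edge correction, and all names (xy, pilot_h, global_h, gamma_scale, h_i) are hypothetical.

```r
# Simulated observations standing in for a point pattern
set.seed(1)
n  <- 200
xy <- cbind(rnorm(n), rnorm(n))
pilot_h  <- 0.5   # pilot bandwidth (assumed value)
global_h <- 1.0   # global bandwidth (assumed value)

# Fixed-bandwidth Gaussian pilot density evaluated at each observation
d2    <- as.matrix(dist(xy))^2
pilot <- rowMeans(exp(-d2 / (2 * pilot_h^2))) / (2 * pi * pilot_h^2)

# Geometric mean of pilot^(-1/2): the scaling term that keeps the
# variable bandwidths on the same scale as the global bandwidth
gamma_scale <- exp(mean(log(pilot^(-1/2))))

# One bandwidth per observation: larger where the pilot density is low
h_i <- global_h * pilot^(-1/2) / gamma_scale
```

With this scaling, the geometric mean of h_i equals global_h. Supplying the same scaling term to both the case and control estimates, as done above via gamma, keeps their bandwidths directly comparable.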

The objects returned by bivariate.density are given their own class, bivden. We can inspect basic information concerning our kernel estimates using class-specific versions of the generic R print and summary commands. For example, printing a bivden object yields

R> pbc.case.density

Bivariate kernel density/intensity estimate

Adaptive isotropic smoothing with (pilot) h = 158.7124 global h = 349.8445 unit(s)

No. of observations: 761

Risk surface and tolerance contour computation

The relative risk function may now be estimated. This is achieved through use of the function risk, which by default computes the log-transformed density ratio as advocated by e.g. Kelsall and Diggle (1995a). We run this command using our pre-computed case and control density estimate objects pbc.case.density and pbc.control.density, again suppressing output with comment=F. By default, risk also produces a pixel image plot of the resulting risk surface; this has been disabled for the moment (the various plotting options will be explored once we have computed the asymptotic tolerance contours).

R> pbc.risk <- risk(f=pbc.case.density,g=pbc.control.density,
+    comment=F,plotit=F)

R> summary(pbc.risk)

Log-Relative risk function.

Estimated log-risk range -4.485437 to 1.207464.

9095 grid cells out of 16384 fall inside study region.

--Numerator (case) density--

Bivariate kernel density/intensity estimate

Adaptive isotropic smoothing with (pilot) h = 158.7124 global h = 349.8445 unit(s)

No. of observations: 761

Evaluated over 128 by 128 rectangular grid.

Defined study region is a polygon with 115 vertices.

9095 grid cells out of 16384 fall inside study region.

--Denominator (control) density--

Bivariate kernel density/intensity estimate

Adaptive isotropic smoothing with (pilot) h = 165.1216 global h = 349.8445 unit(s)

No. of observations: 3020

Evaluated over 128 by 128 rectangular grid.

Defined study region is a polygon with 115 vertices.

Estimated density range 3.544127e-10 to 3.027201e-07.

9095 grid cells out of 16384 fall inside study region.

The risk function produces an object of class rrs for which print and summary commands are also available. Use of summary.rrs, which subsequently calls summary.bivden on both objects supplied as f and g, is shown in the above output. A numerical summary of pbc.risk shows a maximum density-ratio value of exp(1.207) = 3.343. Can we consider risk magnitudes at this level ‘unusually extreme’ for our current data set?
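The back-transformation from the log scale to the density-ratio scale can be checked directly. The short sketch below simply exponentiates the two end points of the log-risk range reported by summary above; the variable names are hypothetical.

```r
# End points of the estimated log-risk range, copied from the summary output
log_range <- c(min = -4.485437, max = 1.207464)

# Back-transform to the density-ratio (relative risk) scale
ratio_range <- exp(log_range)
round(ratio_range, 3)
```

On the ratio scale the surface therefore ranges from roughly one hundredth of the control density to a little over three times the control density.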

Natural sampling variability will clearly produce benign fluctuations in the resulting spatial risk surface. It is of interest to highlight those sub-regions, if any exist, that are sufficiently extreme to be labelled ‘statistically significant’ in the typical frequentist interpretation of the phrase. In spatial disease risk mapping, the search for significant positive fluctuations is the more common goal; identifying any ‘hotspots’ in the risk of disease contraction is clearly of interest to epidemiologists and public health officials.

Upon definition of appropriate hypotheses, we discussed in Section 2.4 the derivation of a p-value surface corresponding to a kernel-smoothed relative risk function. Denoting our study region by R, suppose we denote our (log) risk function of interest as ρ(x), x ∈ R. As also covered in these earlier discussions, hotspot detection would correspond to the hypotheses H0 : ρ(x) = 0, HA : ρ(x) > 0; diminished risk would imply testing for HA : ρ(x) < 0; a two-sided test dictates HA : ρ(x) ≠ 0. Once we have specified hypotheses and estimated the corresponding p-value surface, any significant sub-regions are highlighted on a plot of the estimated risk function via tolerance contours at pre-defined significance levels.
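The three alternatives differ only in which tail of the reference distribution is used. As a minimal sketch, assuming a standardised test statistic z that is approximately standard normal under H0 at each location (the values of z below are made up for illustration), the corresponding p-values can be computed in base R as follows.

```r
# Hypothetical standardised statistics at four locations
z <- c(-2.5, 0, 1.8, 3.1)

# H_A: rho(x) > 0 (hotspot detection): upper-tail probability
p_upper <- pnorm(z, lower.tail = FALSE)

# H_A: rho(x) < 0 (diminished risk): lower-tail probability
p_lower <- pnorm(z)

# H_A: rho(x) != 0 (two-sided): both tails
p_two <- 2 * pnorm(abs(z), lower.tail = FALSE)
```

A location is then enclosed by, say, the 0.05 tolerance contour when its p-value under the chosen alternative falls below 0.05.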

Historically, p-value surfaces were computed via Monte-Carlo methods, namely random permutations of case/control marks and repeated evaluation of the kernel-smoothed risk function, against which the actual estimated risk function values for each spatial coordinate would be compared. In Chapter 2, we proposed a theoretical alternative, based on the asymptotic normality of kernel density estimates (e.g. Parzen, 1962) coupled with asymptotic expressions for the variances of the fixed and adaptive kernel-estimated density-ratios. Numerically stable and computationally cheaper than MC methods, the asymptotic approach to tolerance contour calculation offers competitive performance, particularly for large data sets and when employing adaptive smoothing.
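The Monte-Carlo idea can be sketched in a few lines of base R. The example below is a toy illustration under several assumptions: the data are simulated, a fixed-bandwidth Gaussian kernel is used without edge correction, the upper-tailed alternative is tested, and the (count + 1)/(iterations + 1) convention is one common choice that may differ from what tolerance does internally. All names are hypothetical.

```r
set.seed(42)
n_case <- 60
n_ctrl <- 120
cases  <- cbind(rnorm(n_case, mean = 0.5), rnorm(n_case))  # shifted cases
ctrls  <- cbind(rnorm(n_ctrl), rnorm(n_ctrl))
pts    <- rbind(cases, ctrls)
is_case <- rep(c(TRUE, FALSE), c(n_case, n_ctrl))

# Evaluation grid (15 x 15 cells)
grid <- as.matrix(expand.grid(x = seq(-2, 2, length.out = 15),
                              y = seq(-2, 2, length.out = 15)))

# Fixed-bandwidth Gaussian kernel density estimate on the grid
kde <- function(xy, grid, h) {
  d2 <- outer(grid[, 1], xy[, 1], "-")^2 + outer(grid[, 2], xy[, 2], "-")^2
  rowMeans(exp(-d2 / (2 * h^2))) / (2 * pi * h^2)
}

# Log relative risk surface for a given case/control labelling
log_risk <- function(is_case, h = 0.6) {
  log(kde(pts[is_case, , drop = FALSE], grid, h)) -
    log(kde(pts[!is_case, , drop = FALSE], grid, h))
}

observed <- log_risk(is_case)

# Permute the marks, re-estimate, and count how often the permuted
# surface is at least as large as the observed one in each cell
iter   <- 99
exceed <- numeric(nrow(grid))
for (i in seq_len(iter)) {
  exceed <- exceed + (log_risk(sample(is_case)) >= observed)
}
p_mc <- (exceed + 1) / (iter + 1)
```

Every iteration requires two full density estimates, which is why the cost grows quickly with sample size, grid resolution and, in particular, adaptive smoothing.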

The p-value surfaces are computed in sparr with the function tolerance. To illustrate the difference in computational expense between the two approaches, ‘MC’ and ‘ASY’, we evaluate both, using 100 iterations for the former.

R> pbc.tol.mc <- tolerance(pbc.risk,pbc.pooled.density,method="MC",ITER=100)

[1] "Tue Sep 20 22:01:47 2011"

Monte-Carlo iteration no.

10 20 30 40 50 60 70 80 90

[1] "Wed Sep 21 01:54:21 2011"

R> pbc.tol.asy <- tolerance(pbc.risk,pbc.pooled.density,method="ASY")

[1] "Wed Sep 21 09:39:12 2011"

--Adaptive-bandwidth asymptotics--

calculating integrals K2... --f-- --g--

calculating integrals L2... --f-- --g--

[1] "Wed Sep 21 09:40:16 2011"

Supply of the pre-calculated rrs object, as well as the relevant pooled kernel density estimate, is required. Function progress is printed to the console on the fly; as for earlier functions, this can be disabled by setting comment=F. Note that for only 100 iterations, the MC p-value surface required almost four hours to complete for our adaptive PBC example. By comparison, the asymptotic method took a little over a minute. The comments referring to integrals K2 and L2 correspond to estimation of the square-bracketed terms in equation (2.9) for the case and control densities f and g respectively. Returned is a named list including the component P, a matrix of equal dimensions to the original rectangular evaluation grid, containing the p-values of interest over R.

Visualisation

The final step is to produce plots of our estimated risk function and any statistically significant sub-regions therein through the use of tolerance contours. To this end, sparr provides several flexible forms of risk surface visualisation. Arguably, one of the most intuitive is a pixel image or ‘heatplot’. The following code produces Figure 4.2.

R> par(mfrow=c(1,2))
R> plot(pbc.risk,display="heat",col=heat.colors(10)[10:1],
+    tolerance.matrix=pbc.tol.asy$P,
+    tol.opt=list(levels=c(0.01,0.05),lty=c(1,2)),
+    main="",xlab="Eastings",ylab="Northings")
R> plot(pbc.risk,display="heat",col=heat.colors(10)[10:1],
+    tolerance.matrix=pbc.tol.mc$P,
+    tol.opt=list(levels=c(0.01,0.05),lty=c(1,2)),
+    main="",xlab="Eastings",ylab="Northings")

Figure 4.2: Heatplots of the adaptive log-relative risk surfaces for the PBC data generated using sparr, displaying asymptotic and Monte-Carlo tolerance contours (left and right respectively), superimposed at significance levels of 0.01 (solid line) and 0.05 (dashed).

Inspection of the plots shows some distinctions between the theoretically and empirically computed contours. Note that the plot.rrs function allows the user to directly include the calculated tolerance contours through the use of the argument tolerance.matrix. The appearance of the contours can be minimally controlled via a named list argument tol.opt. Should the user wish more control over their inclusion, they may set tolerance.matrix to NULL, in which case tol.opt is ignored, and subsequently add the desired contours to a pre-plotted risk surface using the standard R command contour (package graphics) with argument add=T.
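The manual route described above relies only on standard graphics functions. As a hedged sketch of the general pattern, the following base-R code overlays contours of a p-value matrix on an image of a surface; both matrices (surf and pmat) are artificial stand-ins rather than sparr output, and image is used here in place of plot.rrs.

```r
# Artificial surface with a single peak, on a 50 x 50 grid
x <- seq(0, 1, length.out = 50)
y <- seq(0, 1, length.out = 50)
surf <- outer(x, y, function(a, b) exp(-((a - 0.7)^2 + (b - 0.4)^2) / 0.02))

# Artificial 'p-value' matrix: small where the surface is high
pmat <- 1 - 0.999 * surf

# Heat image of the surface, then contours at the 0.01 and 0.05 levels
image(x, y, surf, col = heat.colors(10)[10:1],
      xlab = "Eastings", ylab = "Northings")
contour(x, y, pmat, levels = c(0.01, 0.05), lty = c(1, 2),
        drawlabels = FALSE, add = TRUE)
```

Because contour is called separately, any of its usual arguments (line width, colour, labelling) can be adjusted without going through tol.opt.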

In this example, the asymptotic method has appeared to amplify the significance of individual peaks somewhat, flagging a sub-region on the eastern border as well as a smaller northern area at the 1% level. The Monte-Carlo contours agree with the apparent significance of the eastern border sub-region, albeit only at the 5% level, but do not indicate much interest in any other areas (although we have only made use of 100 iterations for this method to spare excessive computation time). In any case, it should be stressed that superimposition of tolerance contours of this sort upon a risk surface image is intended to provide a general idea of possible sub-regions of interest as an exploratory tool only, as opposed to definitive proof of anomalous behaviour. The agreement between the two approaches on the significant eastern-border area would lead us in this case to conclude that this is the area that most warrants further investigation by the relevant health officials.

In addition to the standard heatplot, the user can view the risk function as a contour or perspective plot, setting display="contour" or display="persp". One distinctive feature of sparr is that we may also produce an interactive three-dimensional perspective plot based on the powerful features of the rgl package. Allowing full rotation and zooming via left and right mouse button holds respectively, it produces images from which it can be easier to assess the relative magnitudes of peaks and troughs in a particular surface than from, e.g., a 2D heatplot.

The following code opens an interactive graphics device, plots the coloured 3D surface, and adds the asymptotically derived tolerance contours as in the left-hand panel of Figure 4.2. Figure 4.3 shows two screenshots of the interactive plot.

R> asp <- diff(PBC$window$yrange)/diff(PBC$window$xrange)
R> plot(pbc.risk,display="3d",aspect=c(1,asp,1),col=heat.colors(10)[10:1],
+    tolerance.matrix=pbc.tol.asy$P,
+    tol.opt=list(levels=c(0.01,0.05),raise=0.025,col="blue"))

approximating 3D boundary...

approximating 3D tolerance contours...

To preserve the correct x:y axis aspect ratio, we must specify our desired relative scale via the argument aspect=c(x,y,z). This is achieved by first finding the correct ratio, which we store as the value asp. Also noteworthy: the tol.opt option raise is only relevant to display="3d" plots when !is.null(tolerance.matrix). This is a scalar constant which artificially translates the superimposed 3D contours raise units along the z-axis, used to prevent segments of the included contours falling ‘below’ the plotted surface (an artefact of the approximations made during the 3D plotting).
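The aspect calculation itself is simple arithmetic on the extent of the study window. A small sketch with made-up coordinate ranges (xrange and yrange below are hypothetical, not the PBC window) shows the idea:

```r
# Hypothetical bounding ranges of a study window
xrange <- c(0, 8000)
yrange <- c(0, 15000)

# Ratio of the y-extent to the x-extent
asp <- diff(yrange) / diff(xrange)   # 1.875

# Relative scales for the x, y and z axes of the 3D plot
aspect <- c(1, asp, 1)
```

A window that is taller than it is wide therefore gives asp greater than 1, stretching the y-axis relative to the x-axis so that distances are drawn on a common scale.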

Much the same plotting choices exist for standalone density estimates (i.e. objects of class bivden as opposed to rrs per se), and greater flexibility than can be fully covered here is of course available. In fact, in addition to the arguments for plot.bivden and plot.rrs we have encountered and discussed thus far, there is an additional ... field reserved for arbitrary arguments depending on which display type we have elected. In the accompanying package documentation, the user is directed to the R help files for functions in external packages for proper use of ... in fine-tuning density and risk function plots. For the four displays "heat", "contour", "persp" and "3d", consultation should be made to ?plot.im, ?contour, ?persp and ?persp3d respectively.

Figure 4.3: Two snapshots of the interactive 3D adaptive bandwidth PBC risk function, with superimposed asymptotic tolerance contours.