Data Description - Diagnostic test - Modelling of survival and incidence for colorectal cancer


5.2.2 Data Description

The data used for this chapter comprise the colorectal cancer cases diagnosed between 2008 and 2013 in the North-West of Peninsular Malaysia. After excluding seven cases without address information, 1248 colorectal cancer cases remained from the four states in this region (556, 379, 247 and 69 cases in Kedah, Penang, Perak and Perlis respectively). These 1248 cases represent 39.6% of all colorectal cancer cases in Peninsular Malaysia.

For this chapter, we used each patient’s geographical location for analysis. We applied the same procedure for acquiring the coordinates of the cases as in Chapter 4, that is, we searched for the coordinates of the point locations given by the addresses available in the data. We used Google Maps, 1MalaysiaMap (MaCGDI, 2012), and other resources from Google to search for the coordinates of the addresses. For individuals whose address coordinates were not available through these sources, we generated an approximate location. We worked through progressively coarser levels, from small street (less than 1 km long) to housing area to village to ‘mukim’ (sub-division of a district), as also explained in the previous chapter (Table .

We used the 2010 data from Worldpop as the source of our population data for the analysis (Worldpop, 2017). These are raster data at 100 m resolution. We projected the Worldpop data onto our region and aggregated them to a regular grid of square cells with 4 km sides. For the analysis, the number of people in each grid cell represents the population at risk.
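The aggregation step can be sketched as a block sum. This is a minimal illustration, not the code used in the study: it assumes the 100 m raster has already been projected and loaded as a 2-D array, so that 40 x 40 fine cells make up one 4 km cell. The function name `aggregate_population` and the toy raster are hypothetical.

```python
import numpy as np

def aggregate_population(raster, factor=40):
    """Sum a fine population raster into square blocks of factor x factor cells.

    With 100 m cells, factor=40 gives 4 km grid cells. Cells with missing
    population (NaN, for example sea) are counted as zero.
    """
    raster = np.nan_to_num(np.asarray(raster, dtype=float), nan=0.0)
    rows, cols = raster.shape
    # Pad the bottom and right edges with zeros so both dimensions divide evenly.
    pad_r = (-rows) % factor
    pad_c = (-cols) % factor
    padded = np.pad(raster, ((0, pad_r), (0, pad_c)), constant_values=0.0)
    n_r = padded.shape[0] // factor
    n_c = padded.shape[1] // factor
    # Reshape to (coarse row, fine row, coarse col, fine col), then sum the fine axes.
    return padded.reshape(n_r, factor, n_c, factor).sum(axis=(1, 3))

# Toy example: a 100 x 90 raster with one person in every 100 m cell.
fine = np.ones((100, 90))
coarse = aggregate_population(fine, factor=40)
```

Summing, rather than averaging, keeps the total population unchanged, which is what a population-at-risk count per cell requires.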

Table 5.1: Percentage of addresses assigned at each location level in North-West Peninsular Malaysia

| State | Accurate address (%) | Small street (%) | Village (%) | Small mukim (%) | Total (%) |
|---|---|---|---|---|---|
| Penang | 84.9 | 6.9 | 8.2 | 0.0 | 100 |
| Kedah | 65.0 | 10.0 | 19.0 | 6.0 | 100 |
| Perlis | 37.8 | 4.3 | 57.9 | 0.0 | 100 |
| Perak | 61.6 | 12.1 | 24.7 | 1.6 | 100 |

We chose three explanatory variables to be evaluated as potential predictors for colorectal cancer incidence. They were:

• the level of health service provision, using the number of hospitals per unit area as a proxy;

• a measure of socioeconomic status;

• the proportion of Chinese people in the population.

We explained how we derived the values of hospital intensity and the socioeconomic index in Chapter 4. We chose the proportion of Chinese people as a predictor because this race had the highest age-standardised incidence rate of colorectal cancer in Malaysia in a previous study (Hassan et al., 2016). This agrees with the analysis of colorectal cancer incidence by race described in results section 5.3.1.

5.2.3 Statistical Analysis

5.2.3.1 Modelling Incidence of Colorectal Cancer

We modelled the number of colorectal cancer cases for the four states in North-West Peninsular Malaysia over the six-year period 2008 to 2013.

As mentioned earlier, we used a point process model, fitted using the R software package lgcp (Taylor et al., 2015). It is a generalised linear mixed effects model in which the number of cases follows a Poisson distribution. The observed number of cases is explained by the chosen fixed effects (socioeconomic status, hospital intensity, proportion of Chinese people) and spatially correlated random effects. The model, also known as the spatial log-Gaussian Cox process, is as follows:

$$X(s) \sim \mathrm{Poisson}\{R(s)\}$$

$$R(s) = C_A \exp\{\alpha \log P(s) + Z(s)\beta + Y(s)\}$$

$$\log R(s) = \log(C_A) + \alpha \log P(s) + Z(s)\beta + Y(s)$$

$X(s)$ denotes the observed number of cases (counts) in the grid cell containing spatial location $s$, $C_A$ is the cell area, and $P(s)$ is the population at risk in that cell. We chose to use the covariate transformation $\log P(s)$ in the model to give a more flexible relationship between population and the number of cases, rather than assuming proportionality as is common for Poisson models. If it transpires that $\alpha \approx 1$, then this is, in essence, similar to including $P(s)$ as an offset in the Poisson model. $Z(s)$ is a vector of area-level covariates (fixed effects) with associated effects $\beta$. $Y(s)$ is a spatial random effect: having accounted for the fixed effects, $Y(s)$ represents the variation in risk that they do not explain. We assume that $Y(s)$ is similar at locations $s_1$ and $s_2$ that are close to each other, but nearly independent at locations that are far apart. To describe the decay in correlation for two points $s_1$ and $s_2$ that are distance $d$ apart, we assume $\mathrm{corr}(Y(s_1), Y(s_2)) = \exp(-d/\phi)$ for some parameter $\phi > 0$. We further assume the marginal variance of $Y$ is $\sigma^2 > 0$.
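To make the model concrete, the following sketch simulates counts from it on a small grid. It is an illustration only and does not reproduce the lgcp fitting procedure. The grid, population, covariates and all parameter values are made up; the intercept column in `Z` is an assumption added so that the simulated counts are of plausible size; the correlation is taken to be exp(-d/phi); and the random effect is given mean -sigma^2/2, the lgcp parameterisation described in the Prediction section.

```python
import numpy as np

def exponential_correlation(coords, phi):
    """Correlation matrix with entries exp(-d / phi) for pairwise distances d."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.exp(-dist / phi)

def expected_cases(cell_area, population, Z, alpha, beta, Y):
    """R(s) = C_A * exp{alpha * log P(s) + Z(s) beta + Y(s)} for every cell."""
    return cell_area * np.exp(alpha * np.log(population) + Z @ beta + Y)

rng = np.random.default_rng(1)

# Hypothetical 5 x 5 grid of 4 km cells; coordinates are cell centres in km.
xs, ys = np.meshgrid(np.arange(5) * 4.0, np.arange(5) * 4.0)
coords = np.column_stack([xs.ravel(), ys.ravel()])
n_cells = len(coords)

sigma2, phi = 0.5, 10.0                       # illustrative values only
corr = exponential_correlation(coords, phi)
# Spatially correlated random effect with mean -sigma2/2 and variance sigma2.
Y = rng.multivariate_normal(np.full(n_cells, -sigma2 / 2.0), sigma2 * corr)

population = rng.integers(100, 5000, size=n_cells).astype(float)
# Assumed intercept column plus three standardised covariates (made-up values).
Z = np.column_stack([np.ones(n_cells), rng.normal(size=(n_cells, 3))])
beta = np.array([-10.0, 0.2, -0.1, 0.3])      # illustrative values only
alpha = 1.0

R = expected_cases(16.0, population, Z, alpha, beta, Y)   # C_A = 16 km^2
X = rng.poisson(R)                                         # simulated counts per cell
```

With `alpha = 1.0` the expected count is exactly proportional to the population, which is the offset case mentioned in the text; other values of `alpha` bend that relationship.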

We normalized the area-level covariates chosen for the model because we wanted to see which factors had the largest effect on colorectal cancer cases in our study; putting the covariates on a common scale makes their estimated effects directly comparable.
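The text does not say which normalisation was used. The sketch below assumes the common choice of centring each covariate at zero and scaling it to unit standard deviation; the function name `standardise` and the input values are hypothetical.

```python
import numpy as np

def standardise(covariates):
    """Centre each column at zero and scale it to unit standard deviation."""
    covariates = np.asarray(covariates, dtype=float)
    mean = covariates.mean(axis=0)
    sd = covariates.std(axis=0, ddof=1)
    return (covariates - mean) / sd

# Made-up values for three cells. Columns: hospital intensity,
# socioeconomic index, proportion of Chinese people.
raw = np.array([
    [0.02, 35.0, 0.10],
    [0.10, 50.0, 0.40],
    [0.06, 65.0, 0.25],
])
Z = standardise(raw)
```

After this transformation each coefficient describes the change in the log expected count per one standard deviation of its covariate, so the magnitudes can be compared across covariates.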

5.2.3.2 Prediction

In order to predict the expected number of cases outside the region used to fit the model, that is, for the whole of Peninsular Malaysia, we need estimates for all terms in the model: $C_A$, $\lambda(s)$, $Z(s)$, $\beta$, $\alpha$, $\phi$, $\sigma^2$ and $Y(s)$. Note that outside North-West Peninsular Malaysia, our predictions are extrapolations and should be treated with some caution.

We arranged the prediction grid so that it is an extension of the grid previously used for analysis. The grid we used in North-West Peninsular Malaysia is therefore a subset of the grid covering all of Peninsular Malaysia. The main complication with producing extrapolated predictions of the expected number of cases concerns the process $Y$. Since $Y$ is spatially correlated, producing these predictions would require repeated inversion of a very large matrix, which is beyond the capacity of desktop computers.

Our model assumes that $E(Y(s)) = -\sigma^2/2$ for any $s$, as this is how $Y$ is parameterized in the package lgcp used to fit these data (Taylor et al., 2015). Therefore, rather than simulate a correlated $Y$ on the area outside North-West Peninsular Malaysia, we instead simulated uncorrelated $Y$, so that $Y$ has expectation $-\sigma^2/2$ and variance $\sigma^2$.
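A short sketch of the uncorrelated simulation, assuming independent Normal draws with mean -sigma^2/2 and variance sigma^2 (the marginal variance given earlier). The value of `sigma2` is illustrative and is not an estimate from the study.

```python
import numpy as np

def simulate_uncorrelated_Y(sigma2, n_cells, rng):
    """Draw independent Y ~ Normal(-sigma2 / 2, sigma2), one value per cell."""
    return rng.normal(loc=-sigma2 / 2.0, scale=np.sqrt(sigma2), size=n_cells)

rng = np.random.default_rng(42)
sigma2 = 0.8                      # illustrative value only
Y = simulate_uncorrelated_Y(sigma2, n_cells=200_000, rng=rng)

# With this mean, exp(Y) has expectation exp(-sigma2/2 + sigma2/2) = 1,
# so the random effect leaves the average expected count unchanged.
mean_exp_Y = np.exp(Y).mean()
```

Drawing the cells independently avoids building and inverting the full spatial covariance matrix, at the cost of ignoring spatial smoothness in the extrapolated region.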

We also sought to obtain an estimate of the incidence of colorectal cancer across the whole of Peninsular Malaysia. In principle this is straightforward from our model: it just involves aggregating the predicted number of cases in each cell over Peninsular Malaysia and dividing by the population at risk. However, this is complicated by the fact that the National Patient Cancer Registry - Colorectal Cancer database is known to be incomplete beyond North-West Peninsular Malaysia. Each sample $i$ of our MCMC chain, $(\alpha^{(i)}, \beta^{(i)}, \sigma^{(i)}, \phi^{(i)})$, yields a different prediction of the total number of cases, $T_i$. Since we observed $T^* = 3155$ cases in our database, we know that the true total $T$ must be at least $T^*$. Therefore, in our prediction for the total number of cases, we give inference for $\max[T_i, T^*]$, including a confidence interval.
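The truncation at the observed total can be sketched as follows. The predicted totals are made-up numbers, the function name is hypothetical, and the use of sample quantiles for the interval is an assumption, since the text does not say how the interval was computed.

```python
import numpy as np

def summarise_total_cases(T_samples, T_observed, level=0.95):
    """Summarise max(T_i, T*) over MCMC samples with a median and an interval."""
    truncated = np.maximum(np.asarray(T_samples, dtype=float), T_observed)
    tail = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(truncated, [tail, 0.5, 1.0 - tail])
    return {"median": median, "lower": lower, "upper": upper}

T_star = 3155                                   # observed cases, from the text
# Hypothetical predicted totals, one per MCMC sample.
T_samples = np.array([2900, 3100, 3400, 3600, 3800, 4100, 4500])
summary = summarise_total_cases(T_samples, T_star)
```

Any sample that predicts fewer cases than were actually observed is raised to the observed total, so the lower end of the interval can never fall below 3155.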

We then mapped the predicted incidence of colorectal cancer for the whole of Peninsular Malaysia. Because the mapped predictions are uncertain, we also plotted the posterior probability that the incidence in each area exceeds the national average. Both maps are shown in results section 5.3.4. We also predicted the mean case ascertainment rate across the whole of Peninsular Malaysia with a 95% credible interval, calculated as the number of observed cases divided by the number of expected (predicted) cases.
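Both summaries can be computed directly from the MCMC output. The sketch below is a hedged illustration with made-up numbers: it assumes the predicted cases are stored as an array with one row per MCMC sample and one column per grid cell, and that the national average is recomputed within each sample. The function names are hypothetical.

```python
import numpy as np

def exceedance_probability(incidence_samples, threshold_samples):
    """Share of MCMC samples in which each cell's incidence exceeds the threshold.

    incidence_samples: array of shape (n_samples, n_cells)
    threshold_samples: array of shape (n_samples,), e.g. the national average
    """
    exceeds = incidence_samples > threshold_samples[:, None]
    return exceeds.mean(axis=0)

def ascertainment_rate(T_observed, T_samples, level=0.95):
    """Observed / predicted cases per sample, summarised by mean and interval."""
    rate = T_observed / np.asarray(T_samples, dtype=float)
    tail = (1.0 - level) / 2.0
    lower, upper = np.quantile(rate, [tail, 1.0 - tail])
    return rate.mean(), lower, upper

# Hypothetical predicted cases for 3 cells over 4 MCMC samples.
cases = np.array([[1.0, 5.0, 1.0],
                  [3.0, 4.0, 1.0],
                  [2.0, 6.0, 2.0],
                  [4.0, 5.0, 1.0]])
population = np.array([1000.0, 1000.0, 2000.0])

incidence = cases / population                    # per-cell incidence per sample
national = cases.sum(axis=1) / population.sum()   # national average per sample
p_exceed = exceedance_probability(incidence, national)

# Hypothetical predicted totals, already truncated below at the observed 3155.
mean_rate, rate_lower, rate_upper = ascertainment_rate(3155, [3155.0, 3943.75, 6310.0])
```

A cell with an exceedance probability near 1 is above the national average in almost every sample, which conveys the uncertainty that a single map of predicted incidence hides.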


5.3 Results

In document Modelling of survival and incidence for colorectal cancer in Malaysia (Page 154-159)