
7.2 Expert knowledge

7.2.1 Elicitation

Although it is clear from the literature that it is preferable to elicit the views of several experts, as this exercise is for illustrative purposes we interview a single expert. Our recruited expert has general knowledge about missing data in longitudinal studies and specific knowledge about missing MCS family income. We now describe the interview, which follows the recommendations in the ELICITOR user manual. Our expert is not shown any of our modelling results before the completion of the elicitation exercise.

Selection of variables to explain income missingness

To begin with, we ask our expert whether the five variables we have been using in our response model of missingness, level, change, sc, eth and ctry, are suitable for explaining missingness of income variables in sweep 2. He confirms that they are and does not suggest additional explanatory variables.

Eliciting optimum values and design points

Next we ask our expert which category or level of each variable he thinks would maximise the probability of an individual responding to the income question in sweep 2. This is called the optimum value, and his belief for each variable is shown in Table 7.6.

Table 7.6: MCS income example: explanatory variables for income missingness in sweep 2

name      description                                  design points                         optimum value
level     level of hourly pay (sweep 1)                £4, £10, £25 and £50                  £10
change    change in hourly pay (sweep 2 - sweep 1)     -£5, £0 and £5                        £0
sc        social class                                 NS-SEC 1, NS-SEC 2 and NS-SEC 4/5^b   NS-SEC 2
eth       ethnicity                                    white and non-white                   white
ctry^a    country                                      EWS and NI                            EWS

a EWS=England/Wales/Scotland; NI=Northern Ireland

b NS-SEC 1=managerial and professional occupations; NS-SEC 2=intermediate occupations; NS-SEC 4=lower supervisory and technical occupations; NS-SEC 5=semi-routine and routine occupations

Then we ask which other values of the variables should be used for eliciting the response probability. These and the optimum value are known as the design points of the variable, and are also given in Table 7.6. Our expert’s choices necessitate three changes to our response model of missingness.

1 Our expert believes that the response probability would be the same for individuals with social class classification NS-SEC 4 and NS-SEC 5. So we combine categories 3 and 4, and use a three-category rather than four-category sc.

2 Our expert does not expect any difference in the probability of response between individuals residing in England, Scotland and Wales. Hence we combine these three countries to form a two-category rather than four-category ctry.

3 Our expert’s opinion is that an individual is most likely to respond if their hourly pay is £10 (a little higher than the median sweep 1 hourly pay of £7) and their hourly pay rate does not change between sweeps. This means that the linear functional form which we have been using for variables level and change is no longer suitable. Following discussion with our expert, we decide to switch to a piecewise linear form.

Eliciting the response probability at the overall optimum value

The overall optimum value occurs when all the covariates are set to the optimum value of their design points. This corresponds to the intercept of the response equation and is the highest probability of observing income. First we elicit the median value for the intercept by asking our expert how many individuals out of a sample of 100 he would expect to respond, if they all had optimum values for the five covariates. Our expert expects that 95% would respond. We plot this value using the ELICITOR software, and then ask our expert to confirm that he feels that the true value is equally likely to be above or below 95%, which he does.

We then consider the confidence that our expert has in his estimate, eliciting lower and upper quartiles by the bisection method. For the lower quartile, we ask our expert to assume that the true value is actually below his estimate, and then to choose a value such that the true value is equally likely to be higher or lower. We elicit the upper quartile similarly. We graph these values using ELICITOR (see left graph in Figure 7.1) and as a check, we ask our expert to confirm that he feels that the true value is equally likely to be inside or outside this interval.

Eliciting the response probability at the other design points

The five explanatory variables are now considered in turn. Each variable is assumed independent, so covariances do not need to be assessed, and response probabilities for the design points are elicited assuming all the other variables are at their optimum level. The probability for the optimum value is the same as that shown in the intercept graph, and the probabilities (median and 50% interval) for the remaining design points are elicited as before, again using the ELICITOR software. These elicited values are shown in the “original elicited %” columns in Table 7.7. The right graph in Figure 7.1 shows a screenshot from ELICITOR of the elicitation for variable change.

Table 7.7: MCS income example: elicited values of response percentage for sweep 2 income

explanatory             design                    original elicited %      adjusted elicited %
variable                point                     lq^b   median   uq^b     lq^b   median   uq^b
intercept                                         90     95       98
hourly pay (sweep 1)    £4                        80     85       90
                        £10                       optimum value
                        £25                       80     85       90
                        £50                       65     75       85       75     80
hourly pay change^a     -£5                       80     90       95
                        £0                        optimum value
                        £5                        70     80       95       80     90
social class            NS-SEC 1                  85     90       95
                        NS-SEC 2                  optimum value
                        NS-SEC 4/5                85     90       95
ethnicity               white                     optimum value
                        non-white                 70     80       90       80     85
country                 England/Wales/Scotland    optimum value
                        Northern Ireland          80     85       90       85     90       95

a change in hourly pay from sweep 1 to sweep 2

b lq=lower quartile; uq=upper quartile

Providing feedback on the elicitation

Providing feedback is an important part of the elicitation process, as it allows the expert to reconsider his assessments. ELICITOR enables feedback during the elicitation, including the display of alternative intervals. For example, if a 50% interval is elicited, then we can provide our expert with more information on the implications of his chosen values by asking ELICITOR to display a 90% interval. Any variable can be revisited at any stage in the elicitation. We use these features throughout the interview. Our expert feels that it would be useful to see the implied median for the worst case, i.e. when all the variables are set to their minimum design points. He expects a value of about 60%. This cannot be generated without running WinBUGS to forward sample from our expert’s prior, so we agree to provide this feedback after the meeting.

During discussion of the elicitation process, our expert comments that he had found questions about the change variable the most difficult part. This is the variable which allows informative missingness in our model, and about which there is greatest uncertainty on account of the missingness.

Converting the elicited values into WinBUGS code

ELICITOR automatically converts the probabilities of response at different design points into WinBUGS code with informative priors, and Equation 7.3 shows the resulting equations for this elicitation. Note that in this equation the parameters of the Normal distributions are written as the mean and variance, whereas WinBUGS uses the mean and precision.

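\[
\begin{aligned}
\text{logit}(p_i) &= \theta_0 + \text{Piecewise}(level_i) + \text{Piecewise}(change_i) \\
&\quad + (\theta_{ctry} \times ctry_{i1}) + (\theta_{eth} \times eth_{i1}) + \sum_{k=1}^{3} (\theta_{sc[k]} \times sc_{[k]i1}) \\
level_i &= hpay_{i1} \\
change_i &= hpay_{i2} - hpay_{i1} \\
\text{Piecewise}(level_i) &=
\begin{cases}
\theta_{level[1]} \times (level_i - 10) & : level_i < 10 \\
\theta_{level[2]} \times (level_i - 10) & : 10 \le level_i < 25 \\
\theta_{level[3]} \times (level_i - 25) + \theta_{level[2]} \times 15 & : level_i \ge 25
\end{cases} \\
\text{Piecewise}(change_i) &=
\begin{cases}
\delta_1 \times change_i & : change_i < 0 \\
\delta_2 \times change_i & : change_i \ge 0
\end{cases} \\
\theta_0 &\sim N(3.0, 1.64) \\
\theta_{level[1]} &\sim N(0.21, 0.01); \quad \theta_{level[2]} \sim N(-0.085, 0.002); \quad \theta_{level[3]} \sim N(-0.026, 0.001) \\
\delta_1 &\sim N(0.15, 0.06); \quad \delta_2 \sim N(-0.32, 0.10) \\
\theta_{ctry} &\sim N(-1.3, 0.36) \\
\theta_{eth} &\sim N(-1.6, 1.04) \\
\theta_{sc[1]} &\sim N(-0.8, 0.83); \quad \theta_{sc[2]} = 0; \quad \theta_{sc[3]} \sim N(-0.8, 0.83)
\end{aligned}
\tag{7.3}
\]

where the second index on the variables indicates sweep, and $sc_{[k]}$ is a binary indicator for sc category $k$ (1 if $sc_i$ is category $k$, 0 otherwise).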
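To make the structure of Equation 7.3 concrete, the following Python sketch evaluates the linear predictor with every parameter fixed at its prior mean, and converts a prior variance into the precision that WinBUGS expects. It is an illustration only: the function and dictionary names are hypothetical, the 0/1 coding of the country and ethnicity indicators (1 for Northern Ireland, 1 for non-white) is an assumption, and none of this is ELICITOR or WinBUGS output.

```python
import math

# Prior means copied from Equation 7.3 (theta_sc[2] is fixed at 0).
PRIOR_MEAN = {
    "theta0": 3.0,
    "level": (0.21, -0.085, -0.026),   # theta_level[1..3]
    "delta": (0.15, -0.32),            # delta_1, delta_2
    "ctry": -1.3,
    "eth": -1.6,
    "sc": (-0.8, 0.0, -0.8),           # theta_sc[1..3]
}


def precision(variance):
    """WinBUGS parameterises the Normal by precision = 1 / variance."""
    return 1.0 / variance


def piecewise_level(level, t1, t2, t3):
    """Piecewise linear term for level, with knots at 10 and 25."""
    if level < 10:
        return t1 * (level - 10)
    if level < 25:
        return t2 * (level - 10)
    return t3 * (level - 25) + t2 * 15


def piecewise_change(change, d1, d2):
    """Piecewise linear term for change, with a knot at 0."""
    return d1 * change if change < 0 else d2 * change


def response_probability(level, change, ni, non_white, sc, m=PRIOR_MEAN):
    """Response probability with every parameter set to its prior mean.

    ni and non_white are assumed 0/1 indicators; sc is the category 1, 2 or 3.
    """
    eta = (m["theta0"]
           + piecewise_level(level, *m["level"])
           + piecewise_change(change, *m["delta"])
           + m["ctry"] * ni
           + m["eth"] * non_white
           + m["sc"][sc - 1])
    return 1.0 / (1.0 + math.exp(-eta))


optimum = response_probability(10, 0, 0, 0, 2)
low_pay = response_probability(4, 0, 0, 0, 2)
print(round(optimum, 2), round(low_pay, 2))
print(round(precision(1.64), 2))   # precision of the prior for theta0
```

At the overall optimum and at the £4 design point this gives roughly 0.95 and 0.85, in line with the elicited medians in Table 7.7.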

The mean of the prior for $\theta_0$ is the logit of the elicited median for the intercept (as a proportion), i.e.

\[
\text{prior mean of } \theta_0 = \text{logit}(0.95) = \log\left(\frac{0.95}{1 - 0.95}\right) = 2.9.
\]

The small discrepancies between the results of this and subsequent calculations and the values shown in Equation 7.3 are due to rounding errors and the inexact entering of the elicited values into ELICITOR through its “click and point” mechanism. ELICITOR makes use of an approximation from Walpole and Myers (1993), p.211, for calculating the quantile, $q(f)$, of a Normal($\mu$, $\sigma^2$) distribution, as given by

\[
q_{\mu,\sigma}(f) \approx \mu + \sigma \left( 4.91\{ f^{0.14} - (1 - f)^{0.14} \} \right). \tag{7.4}
\]

Using Equation 7.4, the standard deviations can be approximated as

\[
\sigma \approx \frac{q(f) - \mu}{4.91\{ f^{0.14} - (1 - f)^{0.14} \}}. \tag{7.5}
\]
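As a rough check on Equation 7.4, the approximate standard Normal quantile can be compared with the exact one. The sketch below uses `statistics.NormalDist` from the Python standard library; the function name `approx_z` is a hypothetical label for the bracketed term in Equation 7.4 (the case µ = 0, σ = 1).

```python
from statistics import NormalDist


def approx_z(f):
    """Approximate standard Normal quantile from Equation 7.4 (mu=0, sigma=1)."""
    return 4.91 * (f ** 0.14 - (1 - f) ** 0.14)


for f in (0.025, 0.25, 0.5, 0.75, 0.975):
    exact = NormalDist().inv_cdf(f)
    print(f"f={f:<5}  approx={approx_z(f):+.4f}  exact={exact:+.4f}")
```

For the quartiles used in this elicitation the approximation gives about ±0.672, against an exact value of about ±0.674.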

Hence to calculate an approximate standard deviation, we need the elicited probability from one quantile in addition to the median. As two additional quantiles are available, ELICITOR uses the average of the approximated standard deviation from both quantiles. So the variance, $\sigma^2$, of the prior for $\theta_0$ is estimated as follows:

\[
\begin{aligned}
\sigma_{lq} &= \frac{\text{logit}(0.90) - \text{logit}(0.95)}{4.91\{ 0.25^{0.14} - (1 - 0.25)^{0.14} \}} \\
\sigma_{uq} &= \frac{\text{logit}(0.98) - \text{logit}(0.95)}{4.91\{ 0.75^{0.14} - (1 - 0.75)^{0.14} \}} \\
\sigma^2 &= \frac{1}{2}\left( \sigma_{lq}^2 + \sigma_{uq}^2 \right) = 1.61
\end{aligned}
\]

where $\sigma_{lq}$ and $\sigma_{uq}$ are the approximations based on the lower and upper quartiles respectively.
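Substituting numerical values makes this arithmetic easier to follow. The figures below are our own evaluation of these expressions, rounded for display, using logit(0.90) ≈ 2.197, logit(0.95) ≈ 2.944, logit(0.98) ≈ 3.892 and 4.91{0.25^0.14 − 0.75^0.14} ≈ −0.672:

\[
\sigma_{lq} \approx \frac{2.197 - 2.944}{-0.672} \approx 1.11, \qquad
\sigma_{uq} \approx \frac{3.892 - 2.944}{0.672} \approx 1.41, \qquad
\sigma^2 \approx \frac{1}{2}\left( 1.11^2 + 1.41^2 \right) \approx 1.61.
\]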

For the parameters associated with the binary and categorical variables, the prior mean is calculated by subtracting the logit of the elicited median for the intercept from the logit of the elicited median of its associated explanatory variable. For example,

\[
\text{prior mean of } \theta_{eth} = \text{logit}(0.80) - \text{logit}(0.95) = -1.6.
\]

The remaining prior variances are calculated in a similar way to the prior variance for $\theta_0$; for example, the prior variance of $\theta_{eth}$ is

\[
\frac{1}{2}\left( \left( \frac{\text{logit}(0.70) - \text{logit}(0.80)}{4.91\{ 0.25^{0.14} - (1 - 0.25)^{0.14} \}} \right)^2 + \left( \frac{\text{logit}(0.90) - \text{logit}(0.80)}{4.91\{ 0.75^{0.14} - (1 - 0.75)^{0.14} \}} \right)^2 \right) = 1.05.
\]
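The conversion from elicited quartiles to a prior mean and variance can be sketched in a few lines of Python. This is our own minimal reconstruction of the calculation described here, not ELICITOR's code; the function names are hypothetical, and the inputs are the original elicited proportions for the intercept and for non-white ethnicity from Table 7.7.

```python
import math


def logit(p):
    """Log-odds of a proportion p."""
    return math.log(p / (1 - p))


def approx_z(f):
    """Approximate standard Normal quantile (Equation 7.4 with mu=0, sigma=1)."""
    return 4.91 * (f ** 0.14 - (1 - f) ** 0.14)


def prior_variance(lq, median, uq):
    """Average of the squared sd estimates from each quartile (Equation 7.5)."""
    sd_lq = (logit(lq) - logit(median)) / approx_z(0.25)
    sd_uq = (logit(uq) - logit(median)) / approx_z(0.75)
    return 0.5 * (sd_lq ** 2 + sd_uq ** 2)


def prior_mean(median, intercept_median=None):
    """Logit of the median, taken relative to the intercept for covariates."""
    if intercept_median is None:
        return logit(median)
    return logit(median) - logit(intercept_median)


# Original elicited proportions (lq, median, uq) from Table 7.7.
intercept = (0.90, 0.95, 0.98)
non_white = (0.70, 0.80, 0.90)

theta0 = (prior_mean(intercept[1]), prior_variance(*intercept))
theta_eth = (prior_mean(non_white[1], intercept[1]), prior_variance(*non_white))
print("theta0   :", [round(v, 2) for v in theta0])
print("theta_eth:", [round(v, 2) for v in theta_eth])
```

The same two functions can be applied to the social class and country rows of Table 7.7; small differences from the values in Equation 7.3 are to be expected because the elicited values were entered into ELICITOR inexactly.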

The only difference in converting the elicitation for the continuous variables is that the distance between the design points must be taken into account in calculating the prior mean, to place it on a unit scale. As an example, we consider the calculation of the prior means for the two parameters in the piecewise linear equation for change,

\[
\begin{aligned}
\text{prior mean of } \delta_1 &= \frac{\text{logit}(0.90) - \text{logit}(0.95)}{-5 - 0} = 0.15 \\
\text{prior mean of } \delta_2 &= \frac{\text{logit}(0.80) - \text{logit}(0.95)}{5 - 0} = -0.31.
\end{aligned}
\]
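The slope calculation can be sketched in Python as follows. The helper `segment_slopes` is a hypothetical name, and it assumes that each slope is the difference in logits between two adjacent design points divided by the distance between them. This reproduces the two calculations for change, and for level it is our reading of the piecewise form in Equation 7.3 rather than documented ELICITOR behaviour. The medians are the original elicited values from Table 7.7.

```python
import math


def logit(p):
    """Log-odds of a proportion p."""
    return math.log(p / (1 - p))


def segment_slopes(points):
    """Slope on the logit scale between each pair of adjacent design points.

    points: list of (design point, elicited median proportion), sorted by
    design point and including the optimum value.
    """
    return [(logit(p1) - logit(p0)) / (x1 - x0)
            for (x0, p0), (x1, p1) in zip(points, points[1:])]


# change: -5, 0 (optimum) and +5 pounds per hour.
change_slopes = segment_slopes([(-5, 0.90), (0, 0.95), (5, 0.80)])

# level: 4, 10 (optimum), 25 and 50 pounds per hour.
level_slopes = segment_slopes([(4, 0.85), (10, 0.95), (25, 0.85), (50, 0.75)])

print("delta      :", [round(s, 3) for s in change_slopes])
print("theta_level:", [round(s, 3) for s in level_slopes])
```

Under this assumption the slopes for level come out close to the prior means of 0.21, −0.085 and −0.026 shown in Equation 7.3.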

The prior variance calculations are the same as for the categorical variables.

The WinBUGS code generated by ELICITOR is run for a number of design points, including those elicited during the interview and the minimum as requested by our expert. Based on a sample of 20,000 iterations, the median for the probability of response in the worst case turns out to be 1%, very different from the 60% expected by our expert. The 95% interval, based on the 2.5 and 97.5 percentiles of the prior distribution, is (0.00, 0.75). The density of the probability of response in this worst case is shown as the dotted red line in the left plot of Figure 7.2. The densities for the optimum case (solid black line) and two intermediate cases are also shown. The median response probabilities for the design points agree with the elicited values, confirming that the model is correctly implemented. In order to see the effect of eliciting narrower intervals, we also forward sample from WinBUGS using code that is generated assuming the expert’s uncertainty was elicited via a 95% interval (i.e. the lower and upper quartiles become the 2.5 and 97.5 percentiles respectively). The impact can be seen in the right plot of Figure 7.2. Note that the medians are unchanged.
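The forward sampling step can be imitated outside WinBUGS. The Python sketch below is a stand-in for illustration, not the ELICITOR-generated code: it draws the parameters independently from the Normal priors in Equation 7.3 and evaluates the worst-case response probability (level £50, change +£5, social class 1, non-white, Northern Ireland). The function names, the seed and the `sd_scale` argument are hypothetical; the narrower priors are obtained by assuming, via Equation 7.5, that reinterpreting the quartiles as 2.5 and 97.5 percentiles shrinks every prior standard deviation by the same factor.

```python
import math
import random
import statistics

# Prior (mean, variance) pairs copied from Equation 7.3.
PRIORS = {
    "theta0": (3.0, 1.64),
    "level2": (-0.085, 0.002),
    "level3": (-0.026, 0.001),
    "delta2": (-0.32, 0.10),
    "ctry": (-1.3, 0.36),
    "eth": (-1.6, 1.04),
    "sc1": (-0.8, 0.83),
}


def approx_z(f):
    """Approximate standard Normal quantile (Equation 7.4 with mu=0, sigma=1)."""
    return 4.91 * (f ** 0.14 - (1 - f) ** 0.14)


def worst_case_draws(n_draws=20000, sd_scale=1.0, seed=1):
    """Sample worst-case response probabilities from independent priors."""
    rng = random.Random(seed)

    def draw(name):
        mean, variance = PRIORS[name]
        return rng.gauss(mean, sd_scale * math.sqrt(variance))

    probs = []
    for _ in range(n_draws):
        eta = (draw("theta0")
               + draw("level3") * (50 - 25) + draw("level2") * 15
               + draw("delta2") * 5
               + draw("ctry") + draw("eth") + draw("sc1"))
        probs.append(1.0 / (1.0 + math.exp(-eta)))
    return probs


def summarise(probs):
    """Median and 2.5 / 97.5 percentiles of the sampled probabilities."""
    cuts = statistics.quantiles(probs, n=40)   # 39 cut points in steps of 2.5%
    return statistics.median(probs), cuts[0], cuts[-1]


# Quartiles as elicited (50% interval).
median_50, lower_50, upper_50 = summarise(worst_case_draws())

# Same elicited values treated as 2.5 and 97.5 percentiles (95% interval):
# by Equation 7.5 each prior sd shrinks by approx_z(0.75) / approx_z(0.975).
shrink = approx_z(0.75) / approx_z(0.975)
median_95, lower_95, upper_95 = summarise(worst_case_draws(sd_scale=shrink))

print(f"50% elicitation: median={median_50:.3f}, "
      f"95% interval=({lower_50:.2f}, {upper_50:.2f})")
print(f"95% elicitation: median={median_95:.3f}, "
      f"95% interval=({lower_95:.2f}, {upper_95:.2f})")
```

Under these assumptions the worst-case median should land near 1% with a 97.5 percentile near 0.75, in line with the WinBUGS results reported above, while the narrower priors tighten the interval and leave the median essentially unchanged.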

Figure 7.2: MCS income example: prior densities generated by forward sampling using WinBUGS, based on the original elicitation for selected design points

[Figure: two panels titled "50% interval"^a (left) and "95% interval"^a (right). Each plots Density (0 to 25) against probability of response (0.0 to 1.0) for four design points: optimum, intermediate 1^b, intermediate 2^c and worst case.]

a In the interview a 50% interval was elicited. To enable a comparison with narrower intervals, we also forward sample from WinBUGS using code that is generated assuming the expert’s uncertainty was elicited via a 95% interval (i.e. the lower and upper quartiles become the 2.5 and 97.5 percentiles respectively).

b Intermediate design point 1: level = £4, all other variables at optimum

c Intermediate design point 2: level = £4, change = £5, all other variables at optimum

Revisiting the elicited values

In the light of this, our expert adjusts some of his medians and intervals as shown under the “adjusted elicited %” heading in Table 7.7. We generate new WinBUGS code using ELICITOR and rerun our model. The worst case response probability median is still only 9%. This worst case is very extreme, as it assumes that the individual is paid at an hourly rate of £50 in sweep 1, has an increase in pay of £5 an hour between sweeps, is in social class 1, is non-white and lives in Northern Ireland. To put this into context, out of 505 individuals, only one individual is paid £50 or more in sweep 1, and only one has an hourly rate between £25 and £50. However, it does also reveal a difficulty with this approach. The rate of response rapidly decreases as probabilities are multiplied, and having good intuition about probabilities that are combined is difficult.

Part of the problem is that the probabilities are not really independent as assumed in the elicitation. Hindsight suggests that it would have been better to focus on eliciting information on fewer variables and allow for correlation between these variables. This would require modifying ELICITOR or developing an alternative method.

In document Bayesian methods for modelling non-random missing data mechanisms in longitudinal studies (Page 148-154)