MCS income data - Bayesian full probability modelling


3.1.2 MCS income data

Income data from surveys is widely analysed, and income non-response is of considerable concern. We take income data from the MCS to examine the impact of using different Bayesian models on the conclusions to some typical research questions. Analysis of data from the MCS needs to take account of its design, as it is clustered geographically, and disproportionately stratified (Plewis, 2007a). The population is stratified by UK country (England, Wales, Scotland and Northern Ireland), with England further stratified into three strata (ethnic minority, disadvantaged and advantaged) and the other three countries into two strata (disadvantaged and advantaged). For each stratum individuals are clustered by electoral ward.
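The stratification described above can be sketched as a small lookup. This is an illustrative enumeration only; the names `WARD_TYPES` and `build_strata` are hypothetical and do not come from the MCS documentation.

```python
# Ward types per country, as described in the text: England has three
# strata, the other three countries have two each.
WARD_TYPES = {
    "England": ["advantaged", "disadvantaged", "ethnic minority"],
    "Wales": ["advantaged", "disadvantaged"],
    "Scotland": ["advantaged", "disadvantaged"],
    "Northern Ireland": ["advantaged", "disadvantaged"],
}


def build_strata(ward_types):
    """Return one (country, ward type) pair per stratum."""
    return [
        (country, ward_type)
        for country, types in ward_types.items()
        for ward_type in types
    ]


strata = build_strata(WARD_TYPES)
print(len(strata))  # 9
```

The count of nine pairs corresponds to a stratum variable with 9 levels.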

Using data from sweeps 1 and 2, we predict income for the subset of main respondents who are single in sweep 1, in paid work and not self-employed. We will consider a number of motivating questions which include:

• How much extra does an individual earn if they have a degree?

• Does change in partnership status affect income?

• What effect does ethnicity have on an individual’s rate of pay?

Our dataset is formed from individuals who, in sweep 1, are single (3,194 individuals), in paid work (741) and not self-employed (which removes a further 27). We also exclude those who are known to be self-employed or not working in sweep 2, and four records with extreme pay values which look suspicious, leaving 559 individuals. Further details of the data selection process can be found in Table A.2. By definition we are looking at a set of individuals who are, with two exceptions (see Appendix A), the mothers of very young children, so it is hardly surprising that many are working part-time. To simplify our models, we choose hourly net pay as our response variable. Hourly net pay, hpay, is calculated by dividing annual pay by the number of hours worked in a year, and further details of this calculation can be found in Appendix A. The distribution of the observed hpay is shown in Figure 3.2.
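As a rough illustration of the hpay calculation, the following sketch divides annual net pay by annual hours. The exact procedure is given in Appendix A and is not reproduced here; deriving annual hours as weekly hours times 52 is an assumption made for this example, and the function name `hourly_net_pay` is hypothetical.

```python
WEEKS_PER_YEAR = 52  # assumption for illustration; see Appendix A for the actual rule


def hourly_net_pay(annual_net_pay, weekly_hours, weeks_per_year=WEEKS_PER_YEAR):
    """Annual net pay divided by hours worked in a year.

    Returns None when either input is missing, so that a missing
    component yields a missing hpay rather than an error.
    """
    if annual_net_pay is None or weekly_hours is None:
        return None
    hours_per_year = weekly_hours * weeks_per_year
    if hours_per_year <= 0:
        raise ValueError("hours worked in a year must be positive")
    return annual_net_pay / hours_per_year


# Hypothetical part-time worker: 10,400 a year at 20 hours a week.
print(hourly_net_pay(10400, 20))  # 10.0
```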

Figure 3.2: MCS hourly net pay by sweep

[Figure: two histograms, one for sweep 1 and one for sweep 2; x-axis: hourly net pay; y-axis: frequency]

Drawing on the existing literature, we select potential covariates with our motivating questions and the structure of the survey in mind. Our dataset also includes variables which may help to explain the missingness (Hawkes and Plewis, 2008). These variables are detailed briefly in Table 3.3, with more extensive descriptions and source given in Table A.3.

Table 3.3: Description of MCS income dataset variables

| name | description | details |
|---|---|---|
| age | age at interview | continuous (a): median = 26, range = (15, 48) |
| hsize | household size | continuous (a): median = 3, range = (2, 11) |
| kids | number of children | continuous (a): median = 1, range = (1, 8) |
| edu | educational level | 6 levels (b) |
| sc | social class | 4 levels (NS-SEC 5 classes with 3 omitted) (c) |
| eth | ethnic group | 2 levels (1 = white; 2 = non-white) |
| sing (d) | single/partner | 2 levels (1 = single; 2 = partner) |
| reg | region of country | 2 levels (1 = London; 2 = other) |
| ctry | country | 4 levels (e) |
| stratum | country by ward type (f) | 9 levels |
| ward (g) | group or single electoral ward | |

(a) all the continuous covariates are centred and standardised; the median and ranges are for sweep 1 on the original scale.

(b) the level of National Vocational Qualification (NVQ) equivalence of the individual’s highest academic or vocational educational qualification

(c) social class 3 is small employers and own account workers, and these individuals are excluded by definition

(d) always 1 for sweep 1 by dataset definition

(e) 1 = England; 2 = Wales; 3 = Scotland; 4 = Northern Ireland

(f) three strata for England (advantaged, disadvantaged and ethnic minority); two strata for Wales, Scotland and Northern Ireland (advantaged and disadvantaged)

(g) the sample is clustered by ward
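The centring and standardising applied to the continuous covariates (age, hsize, kids) can be sketched as below. The source does not say whether the sample or population standard deviation was used; the sample version is assumed here, and the input values are hypothetical.

```python
from statistics import mean, stdev


def standardise(values):
    """Centre on the mean and scale by the sample standard deviation."""
    centre = mean(values)
    scale = stdev(values)  # sample SD (n - 1 denominator); an assumption
    return [(v - centre) / scale for v in values]


ages = [19, 24, 26, 31, 40]  # hypothetical ages, not MCS data
z_ages = standardise(ages)
```

After this transformation the covariate has mean 0 and standard deviation 1, so regression coefficients are expressed per standard deviation of the original variable.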

Our educational level variable, edu, is the level of National Vocational Qualification (NVQ) equivalence of the main respondent’s highest academic or vocational educational qualification, and details of these levels can be found in Table A.4. We treat edu as missing for individuals with only other or overseas qualifications. The main respondent’s social class uses the National Statistics Socio-Economic Classification (NS-SEC) grouped into 5 categories, but since we have excluded the self-employed from our dataset, there are no individuals in category 3 and sc has 4 levels (see Table A.5). Note that sing is always 1 in sweep 1 from the definition of our dataset, but is used to indicate whether the individual has acquired a partner by sweep 2.

Ctry, stratum and ward are fully observed by survey design, and the pattern of missingness for the remaining variables in the dataset is shown for sweeps 1 and 2 in Tables 3.4 and 3.5 respectively. In sweep 1, 8% of individuals have missing hpay, a very small number have missing edu, sc or eth, and the remaining variables are completely observed. In sweep 2 missingness is substantially higher, with 32% of individuals having no sweep 2 data due to wave missingness (Pattern 6 in Table 3.5), and a small amount of item missingness, predominantly for hpay. The pattern of missingness across both sweeps is shown in Table 3.6. We restrict our analysis of this dataset to modelling the missingness in sweep 2.
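The percentages quoted above can be checked by tallying the pattern counts reported in Tables 3.4 and 3.5. The sketch below hard-codes those counts, reading 1 as observed and 0 as missing (the reading implied by the MissObs totals); the helper name `tally_missing` is hypothetical.

```python
SWEEP1_VARS = ["age", "hsize", "kids", "reg", "eth", "sc", "edu", "hpay"]
SWEEP1_PATTERNS = [  # (number of records, observed flags) from Table 3.4
    (505, [1, 1, 1, 1, 1, 1, 1, 1]),
    (43, [1, 1, 1, 1, 1, 1, 1, 0]),
    (3, [1, 1, 1, 1, 1, 1, 0, 1]),
    (2, [1, 1, 1, 1, 1, 0, 1, 1]),
    (1, [1, 1, 1, 1, 0, 1, 1, 1]),
    (4, [1, 1, 1, 1, 1, 1, 0, 0]),
    (1, [1, 1, 1, 1, 1, 0, 0, 1]),
]

SWEEP2_VARS = ["age", "hsize", "kids", "eth", "reg", "sing", "edu", "sc", "hpay"]
SWEEP2_PATTERNS = [  # (number of records, observed flags) from Table 3.5
    (343, [1, 1, 1, 1, 1, 1, 1, 1, 1]),
    (22, [1, 1, 1, 1, 1, 1, 1, 1, 0]),
    (2, [1, 1, 1, 1, 1, 1, 0, 1, 1]),
    (9, [1, 1, 1, 1, 1, 1, 1, 0, 1]),
    (2, [1, 1, 1, 1, 1, 1, 1, 0, 0]),
    (181, [0, 0, 0, 0, 0, 0, 0, 0, 0]),
]


def tally_missing(variables, patterns):
    """Return (total records, missing count per variable)."""
    total = sum(count for count, _ in patterns)
    missing = {name: 0 for name in variables}
    for count, flags in patterns:
        for name, observed in zip(variables, flags):
            if observed == 0:
                missing[name] += count
    return total, missing


total1, miss1 = tally_missing(SWEEP1_VARS, SWEEP1_PATTERNS)
total2, miss2 = tally_missing(SWEEP2_VARS, SWEEP2_PATTERNS)

# Share of individuals with missing hpay in sweep 1.
print(total1, miss1["hpay"], round(100 * miss1["hpay"] / total1))  # 559 47 8
# Share with no sweep 2 data at all (Pattern 6, wave missingness).
print(total2, miss2["age"], round(100 * miss2["age"] / total2))  # 559 181 32
```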

Table 3.4: Missingness Pattern for sweep 1 MCS income data

| | number of records (b) | age1 | hsize1 | kids1 | reg1 | eth1 | sc1 | edu1 | hpay1 | number of missing variables (c) |
|---|---|---|---|---|---|---|---|---|---|---|
| Pattern 1 | 505 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Pattern 2 | 43 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Pattern 3 | 3 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Pattern 4 | 2 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Pattern 5 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| Pattern 6 | 4 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 |
| Pattern 7 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 2 |
| MissObs (a) | | 0 | 0 | 0 | 0 | 1 | 3 | 8 | 47 | |

(a) total missing observations for the variable

(b) number of records for the missingness pattern

(c) number of missing variables in the missingness pattern

Table 3.5: Missingness Pattern for sweep 2 MCS income data

| | number of records (b) | age2 | hsize2 | kids2 | eth2 | reg2 | sing2 | edu2 | sc2 | hpay2 | number of missing variables (c) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Pattern 1 | 343 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| Pattern 2 | 22 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| Pattern 3 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| Pattern 4 | 9 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| Pattern 5 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 2 |
| Pattern 6 | 181 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 9 |
| MissObs (a) | | 181 | 181 | 181 | 181 | 181 | 181 | 183 | 192 | 205 | |

(a) total missing observations for the variable

(b) number of records for the missingness pattern

(c) number of missing variables in the missingness pattern

Some sweep 2 data was collected from individuals who were originally non-contacts or refusals in sweep 2, after they were re-issued by the field work agency. In our dataset, seven individuals have a complete set of sweep 2 variables as a result of these re-issues. We will set these data to missing for the purpose of fitting our models, so they can be used for model checking.
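Holding back the re-issued records for model checking could be sketched as follows. The record layout, the `reissued` flag, the `id` field and the function name are all hypothetical; they illustrate the idea of masking observed values before fitting and keeping them aside for later comparison with model predictions.

```python
SWEEP2_VARS = ["age2", "hsize2", "kids2", "eth2", "reg2",
               "sing2", "edu2", "sc2", "hpay2"]


def hold_out_reissues(records, variables=SWEEP2_VARS):
    """Mask sweep 2 values of re-issued records and return them separately.

    Returns (masked_records, held_out), where held_out maps record id to
    the original sweep 2 values. The input records are not modified.
    """
    masked, held_out = [], {}
    for record in records:
        copy = dict(record)
        if copy.get("reissued"):
            held_out[copy["id"]] = {name: copy[name] for name in variables}
            for name in variables:
                copy[name] = None  # treated as missing when fitting
        masked.append(copy)
    return masked, held_out


# Two hypothetical records with placeholder values; the second is a re-issue.
base = {name: 1.0 for name in SWEEP2_VARS}
records = [
    {"id": 1, "reissued": False, **base, "hpay2": 6.0},
    {"id": 2, "reissued": True, **base, "hpay2": 7.5},
]
masked, held_out = hold_out_reissues(records)
```

The held-out values can then be compared with the model's predictions for the masked entries.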

In document Bayesian methods for modelling non-random missing data mechanisms in longitudinal studies (Page 48-51)