Birth data: extraction, sample selection, and data cleaning

4 Methods

4.4 Birth data: extraction, sample selection, and data cleaning

4.4.1 Preliminary data extract and missing data

On the 30^th September 2008 birth data were extracted from eClipse for all infants enrolled in BiB (n = 4707). Technical problems with eClipse meant that the variables ‘ethnicity’ and ‘postcode’ were not included in the extract.

Instead, these variables were extracted from System One and merged into the dataset. This preliminary data extract was used to determine the frequency of infants who did not have datum for each variable (i.e. missing data) (see Table 4.1). A large percentage of infants had missing data for the

variables ‘baby hospital number’, ‘diabetes’, ‘obstetric history of GDM’,

‘abdominal circumference’, and ‘head circumference’. The BiB project has two ongoing processes, commonly referred to as ‘back-logging’, that are being used to reduce the frequency of missing data.

• Mother’s hospital notes, for BiB mothers, are being pulled from storage at St Luke’s hospital and taken to BRI. A pro-forma has been produced detailing what data from the mother notes should be in eClipse. Where a recording for a variable is in the notes, but not in eClipse, it is entered into the system.

• Baby notes for infants enrolled in BiB, who have no anthropometry in eClipse, are being pulled from storage at St Luke’s and taken to BRI.

Anthropometric data that are in the baby notes but for some reason have not been entered into eClipse are entered into the system.

At the time of the preliminary data extract, both of these back-logging processes had not been completed for all infants enrolled in BiB. Three weeks were spent working with the BiB team to complete these processes for infants enrolled in BiB up until the 30^th September 2008. This ensured that the final data extract would include all data that had been collected.

4.4.2 Final data extract and missing data

The final birth data were subsequently extracted from eClipse for the same group of 4707 infants, enrolled in BiB between 12th March 2007 and the 30th September 2008, for which the preliminary data extract was performed.

Again, these data were merged with ethnicity and postcode from System One, and the percentages of infants with missing data were calculated (see Table 4.1).

Table 4.1. The frequency of infants (n=4707) with missing data

Variable Percentage of infants with missing or incomplete data

Preliminary data extract Final data extract

Baby hospital number 66.75 66.75

Baby NHS number 0.02 0.02

Maternal indicators of diabetes 97.09 97.09

Obstetric history of GDM 97.39 97.39

Baby date of birth 0.00 0.00

Sex 0.00 0.00

Gestational age 0.02 0.02

Gravidity 0.87 0.87

Registerable parity 8.12 7.88⁺

Non-registerable parity 18.95 11.92⁺

Birthweight 0.02 0.02

Abdominal circumference 61.97 16.95⁺

Head circumference 61.06 14.09⁺

Ethnicity 4.42 1.44⁺

Postcode 4.33 0.01⁺

+percentages that decreased from preliminary data extract

The first back-logging process employed by BiB reduced the percentage of infants with missing data for registerable and non-registerable parity. Whilst, the second process reduced the percentage of infants with missing data for head circumference and abdominal circumference from 61.97% to 16.95%

and from 61.06% to 14.09%, respectively. Technical work on System One also reduced the number of infants with missing data for ethnicity and postcode. Despite this, there were still 3142 infants who had a missing (n=3086) or incomplete (n=56) baby hospital number. The baby notes stored at St Luke’s are filed by hospital number. It was particularly important that missing and incomplete baby hospital numbers were found so that baby notes could be located for data cleaning. The baby NHS numbers for infants with a missing/ incomplete hospital number were sent to Database Support, Information Services at BRI. Baby NHS numbers were matched with the iPM patient administration system to reveal baby hospital numbers. Baby hospitals numbers were matched for 2304 infants, which meant that only 838 infants had missing or incomplete baby hospital numbers. The issue of missing baby hospital numbers was passed on to a data quality team to investigate why these data were not in eClipse.

A large percentage of infants had no recording for the variables, ‘diabetes’

(97.09%) and ‘obstetric history of GDM’ (97.39%). These variables were on the pro-forma used in the first back-logging process, and all information about diabetes and obstetric history of GDM in mothers’ notes had therefore been entered into eClipse. It is possible that a result for these variables was only entered into eClipse if a mother had diabetes or history of GDM. A senior midwife confirmed that this was what probably happened. Therefore, infants with no recording for the variable ‘diabetes’ were recoded as not being diabetic, and infants with no recording for the variable ‘obstetric history of GDM’ were recoded as having no history. Assumptions were used to reduce the amount of missing data for the parity (registerable and non-registerable) variables. Results for parity and gravidity were recorded at antenatal appointments, at a time when BiB women would have been pregnant. Both parities added together should have, therefore, equalled one less than gravidity, unless the woman had previously had a multiple birth.

Infants with no result for both registerable and non-registerable parity were scored para zero for each variable, and infants with a result for one parity variable but no result for the other were scored para zero for the missing parity, as long as both parities added together equalled one less than gravidity. At some point after parity was recorded, women gave birth to their babies, which meant that registerable parity had to be increased by a score of one (i.e. para zero became para one).

4.4.3 Sample selection

Out of the 4707 infants with data, a sample was selected who had no missing data, apart from baby hospital number (see Figure 4.3). There was a relatively large number of infants with no baby hospital number (n=838) and removing these infants would have drastically reduced the total sample size. If an infant’s hospital number was later needed to find baby notes at St Luke’s hospital, and baby hospital number for that infant was missing, he/she was then removed from the sample (see Figure 4.4). Infants born to mothers with diabetes or GDM demonstrate different patterns of foetal and postnatal growth (Mello et al. 1997; Plagemann, Harder & Dudenhausen 2008) compared to infants born to mothers with normal glucose/insulin

metabolism. Therefore, infants whose mothers had diabetes (n=137) or obstetric history of GDM (n=123) were removed from the sample. Offspring of multiple births also demonstrate different patterns of infant growth compared to singletons (van Dommelen et al. 2008). Therefore, twins (n=110) and triplets (n=6) were deleted from our sample. One infant was withdrawn from the study, and all his/her data were deleted. All infants whose ethnicity was not ‘White British’ or ‘Asian or Asian British: Pakistani’

were also removed from the sample (n=751).

Figure 4.3. Sample selection

4707

4706 4674

3886 3730 3690

3662 3427

3277 3173

3172 2655

2631

Missing baby NHS number (n=1) Missing mothers hospital number (n=32)

Missing abdominal circumference (n=788) Missing head circumference (n=156)

Missing ethnicity (n=40) Missing gravidity (n=28) Missing registerable parity (n=235)

Missing non-registerable parity (n=150) Indicators of diabetes or history of GDM (n=140)

Withdrawn from BiB (n=1) Not White British or Pakistani (n=517)

Twins (n=24) All infants (n)

Selected sample (n)

4.4.4 Phase 2-Postnatal Growth data cleaning checks

4.4.4.1 Identifying erroneous data by detecting outliers

Potentially erroneous data were identified by flagging outliers. For each anthropometric measurement, any case that was outside ± two SDs from the mean was identified. This process identified 103, 62, and 102 cases that were potentially erroneous for the weight, abdominal circumference, and head circumference variables, respectively. Some infants had cases flagged for more than one measurement, and overall this process identified 202 infants with potentially erroneous data. Where possible, baby hospital numbers were used to locate baby notes and each potentially erroneous case was checked against the raw data. Twenty-four infants did not have a baby hospital number and we could not, therefore, locate their baby notes.

Baby notes were not found for 19 infants, and for one infant we found baby notes but the raw data were missing. These infants were removed from the sample (see Figure 4.4). In total 158 sets of baby notes, with raw data, were found. When our data were checked against the raw data in the baby notes there was one discrepancy for weight and nine discrepancies for both abdominal circumference and head circumference. Where discrepancies were found, the recording in our dataset was changed to the recording found in the baby notes (see Appendix IV).

Weight was measured to the last completed 10g, and each recording should have therefore ended in a zero (i.e. 2720g not 2722g). A frequency distribution flagged eight cases where weight did not end in zero. Where possible, baby hospital numbers were used to locate baby notes and each case was checked against the raw data. Baby notes were found for seven infants and no discrepancies between our data and the raw data were found. However, to ensure consistency between infants, weight was rounded to the nearest 10g (see Appendix IV). Baby notes for one infant could not be found, and this infant was removed from our sample (see Figure 4.4). Abdominal and head circumference were measured with the precision of one decimal place. A frequency distribution did not flag any

cases where abdominal or head circumference was not recorded to one decimal place.

All other variables were categorical. Frequency distributions were used to flag cases that fell outside the expected range. For example, sex was coded as one for male and two for female. The frequency distribution demonstrated that no infants had values other than one or two for the sex variable. Values were within the expected range for all categorical variables.

4.4.4.2 Identifying erroneous data within the normal distribution of the data

Flagging extreme outliers does not identify all erroneous data. It is likely that some erroneous data are contained within the normal distribution of the data. For this reason it was necessary to check 5% of the data against the raw data. However, apart from anthropometry, it was not known which variables could be found as raw data in the baby notes and which variables were inputted directly into eClipse (i.e. electronic data). We located baby notes for a random one percent of the sample (n=26) to determine what data were inputted straight into eClispe. Sex, gestational age, birthweight, abdominal circumference, head circumference, and postcode were found in 100% of baby notes. Babies’ date of birth was only found in 61.5% of the notes, and gravidity, registerable parity, non registerable parity, and ethnicity were not found in any of the notes.

For variables which were found in 100% of baby notes (sex, gestational age, birthweight, abdominal circumference, head circumference, and postcode) we randomly selected 5% (n=129) of cases within each variable to check against raw data in the baby notes. For gestational age, abdominal circumference, head circumference, and postcode there were errors in 1.55% (n=2), 2.33% (n=3), 1.55% (n=2), and 13.18% (n=17) of the selected cases, respectively. For sex and birthweight there were no errors in the data. Due to the nature of the data, we enforced a 3% error rate within each variable, meaning that if there were errors in more than 3% of the selected cases, for any one variable, then the process would be repeated for another

5% of cases for all variables. All of our error rates were below this level apart from that for postcode. Error rates within the postcode variable may be so high because of families changing address. When a family changes address this information is changed in eClipse and System One. Our data therefore reflected the most recent postcode to which an infant was registered. For this reason we did not to perform another 5% check of cases for this variable. For the same reason, we did not change the recording in our dataset for the recording in the baby notes for the 17 cases where errors were found. For the other variables, where errors were found, the recording in our dataset was changed to the recording found in the baby notes (see Appendix IV).

4.4.5 Final sample

The data cleaning process flagged cases that were checked against the raw data in baby notes. Twenty-four infants did not have a baby hospital number, which meant that their baby notes could not be located. Baby notes were not found for another 20 infants, even though we had baby hospital numbers for these infants. Finally, for one infant we found baby notes but the raw data needed for checking were missing. All of these infants were deleted from our sample. At the end of the birth data cleaning process our sample size had reduced from 2631 to 2586 (see Figure 4.4). These 2586 infants formed the core analysis group.

Figure 4.4. Final sample

2631 2607 2587

No baby hospital number (n = 24) Could not find baby notes (n = 20) Baby notes found but missing raw data (n = 1) Selected sample (n)

2586 Final selected

sample (n)

4.5 Postnatal anthropometry: Extraction, sample selection, and

In document The growth of Bradford infants (Page 81-89)