Introduction to thesis
Chapter 3: Data sources and preparation
3.7 Data preparation
3.7.2 Data merge
Annual records of SHS and TAF were provided to an external third party data vendor to be de-identified by replacing the students’ NRIC numbers with a unique record identifier string number. As listed in Table 29, annual SHS data were available from 1990 to 2011 from HPB and TAF data available from 1997 to 2011 from MOE, totalling 16,353,262 student records.
Subsequently, 37 Microsoft Excel files (each between 200,000 to 500,000 rows of records, each row representing one student with one BMI measured at a certain age) were sent to me and the following steps were taken to clean and merge all the data into one single dataset for further statistical analysis:
1. Estimate student’s age
a. The student’s age had to be estimated, as this was not provided, from the raw data sets. This was derived for each student in SHS and TAF, based on the educational levels code, which were recorded during the school health screening (Table 30).
The ages for male and female were different at pre-matriculation into university as boys are required to undergo two and a half-years of mandatory national
service in the military at age of 18 and for most boys, the age at which they would enter university would be 21 as compared to girls at 19.
2. Determine race group
a. Coding for race group had to be standardised and recoded between SHS and TAF datasets as different labels were used (Table 31).
Table 30 Estimated age at each academic level in Singapore for females and males
Code Description Age (Female) Age (Male)
N1 Nursery 1 3 3
* - For the purposes of this thesis, students who had a record of “Others” for their academic levels were coded as 999 and will be excluded from statistical analysis
Table 31 Categorical variable labels assigned to students belonging to different race groups Code Race Groups Categorical variable labels
CN Chinese 0
MY Malay 1
EU Eurasian 2
ID Indonesian 3
IP Indian/Pakistani/Sri Lankan 4
XX Others 5
3. Calculate BMI
a. Body Mass Index (BMI) was calculated based on the formula of weight (in kilograms) divided by the square height (in meters).
4. Remove outliers of BMI
a. Records with BMI values greater than 99 and less than 10 were removed 5. Remove duplicates of BMI
a. In order to address multiple measurements of BMI of a student who had been screened by both SHS and TAF in the same year, plus the need to establish a long form of the data structure for latent growth and mixture analysis in Mplus later on, a Power Pivot table was created using unique record IDs and averaged BMI data for each year and data source.
b. 37 separate Excel files containing the pivot table output of IDs and BMI measurements for age 6 to 25 were created.
c. Given the 1 million rows limitation in Microsoft Excel, steps 1 to 5 had to be done within each Excel file separately.
6. Combine all of SHS and TAF excel-based data into a single Microsoft Access database a. In order to reduce the computing power and memory required to manipulate the 16 million rows of data, two separate procedures were used to compile the BMI only data and other student variables (age, race group, postal code, school code and gender) into two different tables within a single Microsoft Access database.
b. A BMIpivot table was first created in Microsoft Access by importing and appending BMI only data (wide form) from the 37 Microsoft Excel files. At this
point, each row of data represents one student ID with one BMI measurement at one age
c. A variables table was then created for data on race and gender d. MOH NHS2010 data was imported into another separate table e. Combine and merge BMI measurements per unique student record f. This procedure aims to combine, merge (average duplicates), all BMI
measurements of each student into a single row record g. An SQL query command was executed:
i. SELECT BMIpivot.id AS Expr1, Avg(BMIpivot.bmi4) AS bmi4, Avg(BMIpivot.bmi5) AS bmi5, Avg(BMIpivot.bmi6) AS bmi6, Avg(BMIpivot.bmi7) AS bmi7, Avg(BMIpivot.bmi8) AS bmi8, Avg(BMIpivot.bmi9) AS bmi9, Avg(BMIpivot.bmi10) AS bmi10, Avg(BMIpivot.bmi11) AS bmi11, Avg(BMIpivot.bmi12) AS bmi12, Avg(BMIpivot.bmi13) AS bmi13, Avg(BMIpivot.bmi14) AS bmi14, Avg(BMIpivot.bmi15) AS bmi15, Avg(BMIpivot.bmi16) AS bmi16, Avg(BMIpivot.bmi17) AS bmi17, Avg(BMIpivot.bmi18) AS bmi18, Avg(BMIpivot.bmi19) AS bmi19, Avg(BMIpivot.bmi20) AS bmi20, Avg(BMIpivot.bmi21) AS bmi21, Avg(BMIpivot.bmi22) AS bmi22, Avg(BMIpivot.bmi23) AS bmi23, Avg(BMIpivot.bmi24) AS bmi24, Avg(BMIpivot.bmi25) AS bmi25 INTO BMImerge
ii. FROM BMIpivot
iii. GROUP BY BMIpivot.id;
iv. Output: BMImerge table
7. Define new variable “count” for number of BMI measurements across ages 6 to 25 per student
a. This procedure aims to calculate number of BMI measurements available for each student in the database
b. An SQL query command was executed:
i. SELECT BMImerge.id, IIf(bmi4 Is Null,0,1)+IIf(bmi5 Is
Null,0,1)+IIf(bmi6 Is Null,0,1)+IIf(bmi7 Is Null,0,1)+IIf(bmi8 Is Null,0,1)+IIf(bmi9 Is Null,0,1)+IIf(bmi10 Is Null,0,1)+IIf(bmi11 Is Null,0,1)+IIf(bmi12 Is Null,0,1)+IIf(bmi13 Is Null,0,1)+IIf(bmi14 Is Null,0,1)+IIf(bmi15 Is Null,0,1)+IIf(bmi16 Is Null,0,1)+IIf(bmi17 Is Null,0,1)+IIf(bmi18 Is Null,0,1)+IIf(bmi19 Is Null,0,1)+IIf(bmi20 Is Null,0,1)+IIf(bmi21 Is Null,0,1)+IIf(bmi22 Is Null,0,1)+IIf(bmi23 Is Null,0,1)+IIf(bmi24 Is Null,0,1)+IIf(bmi25 Is Null,0,1) AS Result INTO [Count]
ii. FROM BMImerge;
8. Create Master Dataset A
a. This procedure aims to select minimum number of BMI measurements available for each student in the database, merge variables on unique ID and create different new output tables. Example below shows a new output table with all students with at least 1 BMI measurement
b. An SQL query command was executed:
i. SELECT Count.Result, BMImerge.id, BMImerge.bmi4, BMImerge.bmi5, BMImerge.bmi6, BMImerge.bmi7, BMImerge.bmi8, BMImerge.bmi9,
BMImerge.bmi10, BMImerge.bmi11, BMImerge.bmi12, BMImerge.bmi13, BMImerge.bmi14, BMImerge.bmi15, BMImerge.bmi16, BMImerge.bmi17, BMImerge.bmi18, BMImerge.bmi19, BMImerge.bmi20, BMImerge.bmi21, BMImerge.bmi22, BMImerge.bmi23, BMImerge.bmi24,
BMImerge.bmi25, RaceFemale.race, RaceFemale.female INTO 7bmi ii. FROM (BMImerge INNER JOIN [Count] ON BMImerge.id = Count.id)
INNER JOIN RaceFemale ON (Count.id = RaceFemale.id) AND (BMImerge.id = RaceFemale.id)
iii. WHERE (((Count.Result)>1));
iv. Output: AllBMI table 9. Link MOH unique ID to BMI data
a. This procedure aims to join BMI values to MOH unique ID (from NHS2010) b. An SQL query command was executed:
i. SELECT MOHID.id, BMImerge.bmi4, BMImerge.bmi5,
BMImerge.bmi6, BMImerge.bmi7, BMImerge.bmi8, BMImerge.bmi9, BMImerge.bmi10, BMImerge.bmi11, BMImerge.bmi12,
BMImerge.bmi13, BMImerge.bmi14, BMImerge.bmi15, BMImerge.bmi16, BMImerge.bmi17, BMImerge.bmi18, BMImerge.bmi19, BMImerge.bmi20, BMImerge.bmi21, BMImerge.bmi22, BMImerge.bmi23, BMImerge.bmi24, BMImerge.bmi25 INTO MOHmerge
ii. FROM BMImerge INNER JOIN MOHID ON BMImerge.id = MOHID.id;
iii. Output: MOHmerge table 10. Create Master Dataset B
a. This procedure aims to create a dataset with MOH ID, BMI values and MOH data from NHS2010
b. An SQL query command was executed:
i. SELECT MOHdatamerge.id, MOHdatamerge.age, MOHdatamerge.race, MOHdatamerge.female, MOHmerge.bmi4, MOHmerge.bmi5,
MOHmerge.bmi6, MOHmerge.bmi7, MOHmerge.bmi8, MOHmerge.bmi9, MOHmerge.bmi10, MOHmerge.bmi11, MOHmerge.bmi12, MOHmerge.bmi13, MOHmerge.bmi14, MOHmerge.bmi15, MOHmerge.bmi16, MOHmerge.bmi17, MOHmerge.bmi18, MOHmerge.bmi19, MOHmerge.bmi20, MOHmerge.bmi21, MOHmerge.bmi22, MOHmerge.bmi23,
MOHmerge.bmi24, MOHmerge.bmi25, MOHdatamerge.Cho_indcr, MOHdatamerge.DailySmk_indcr, MOHdatamerge.DM_indcr, MOHdatamerge.GHQ12_group, MOHdatamerge.Hyptn_indcr INTO MOHdataBMImerge
ii. FROM MOHdatamerge INNER JOIN MOHmerge ON MOHdatamerge.id
= MOHmerge.id;
iii. Output: MOHdataBMImerge table
3.7.3 Timeline
The study protocol was developed on the 24th February 2012 and finalized on the 4th of October 2012 after a series of consultative meetings and revisions by the Health Promotion
Board, the Ministry of Education and the Ministry of Health. Data request and preparation for Master Dataset A took approximately five months to complete and data request to MOH and preparation for Master Dataset B took approximately two months to complete. Final steps in data cleaning on both Master Dataset A and B were completed within a month. Figure 2 shows the key milestones during study protocol development, data request, merge and final preparations.
Figure 2 Timeline for development of study protocol, data request and preparation for the thesis (Feb 2012 to Aug 2013)
3.8 Conclusion
The final outcome of data preparation was a dataset that represented the largest pooling of anthropometric data from about 2.7 million individuals born from 1973 to 2003 collected through routine school-based health screening in Singapore from 1990 to 2011 by HPB and MOE. Details on the basic summary statistics of the final datasets will be described in the next Chapter.