Chapter 3 Data, cohort samples and performance measures

3.3 Generating the student panel and the data cleaning process

We process the data in the way that most closely matches our research interests. We first construct the Student Panel Dataset (SPD) from the basic student register, and then add the students’ school marks, teacher identifiers, and the corresponding National examination results.

Given the available administrative data set, we assemble the main structure of the SPD from the Enrolment DB. Because this data base is only available from 2004 onwards, we use the Performance DB, which differs slightly from the Enrolment DB, for the year 2003.

The two data bases (Enrolment DB and Performance DB) are very similar, but they are submitted at different times of the year and are used by the Mineduc for different purposes. Although they do not match perfectly on the student ID (Mrun), they can be treated as substitutes. We therefore start the student panel in 2003 with the Performance DB and then append the Enrolment DB year by year up to 2012.

However, before combining these data bases, which in theory share the same format, we need to standardise all the aggregated information for every year. Once the yearly data are merged into a single data set, we can start the data cleaning process of our SPD. Figure 3.1 shows a flowchart of the dataset generation process, from the original sources to the SPD and the subsequent cross-section cohorts.
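The standardise-then-append step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the canonical column names, the raw column names of the 2003 file, and the helper functions `standardise` and `build_register` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical canonical columns for the stacked student register.
CANONICAL = ("year", "mrun", "rbd", "grade", "letter")


def standardise(row: Dict[str, object], year: int,
                rename: Dict[str, str]) -> Dict[str, object]:
    """Map one raw record onto the canonical columns and stamp the year."""
    out = {rename.get(key, key): value for key, value in row.items()}
    out["year"] = year
    # Columns missing from the raw file are kept as None.
    return {col: out.get(col) for col in CANONICAL}


def build_register(yearly: Dict[int, List[Dict[str, object]]],
                   renames: Dict[int, Dict[str, str]]) -> List[Dict[str, object]]:
    """Append the standardised yearly files in chronological order."""
    register: List[Dict[str, object]] = []
    for year in sorted(yearly):
        rename = renames.get(year, {})
        register.extend(standardise(row, year, rename) for row in yearly[year])
    return register


# Toy input: 2003 stands in for the Performance DB, 2004 for the Enrolment DB.
# The raw 2003 column names below are invented for the example.
yearly_files = {
    2004: [{"mrun": 1, "rbd": 10, "grade": 2, "letter": "A"}],
    2003: [{"student_id": 1, "school": 10, "grade_code": 1, "class_letter": "A"}],
}
column_maps = {
    2003: {"student_id": "mrun", "school": "rbd",
           "grade_code": "grade", "class_letter": "letter"},
}
register = build_register(yearly_files, column_maps)
```

Keeping one rename map per year isolates the format differences between the two sources in a single place, so the stacked register has identical columns in every year.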

Figure 3.1: Processing Data Sets - Flowchart

[Flowchart. Source boxes: Administrative Data Set; The National Examination (Simce) Data Set; Enrolment DB; Performance DB; Individual Scores DB; School Marks DB; Teachers DB; School Directories DB; Parental Questionnaire DB. Student Registers(i), 2003-12: Year, Student ID (Mrun), School ID (RBD), Grade, Letter, Region Code, Type of School Dep. Cleaning Process. Student Panel Dataset (SPD) 2003-12. Cross Section Cohorts(ii), 2003-2012: Year, Student ID (Mrun), School ID (RBD), Grade, Letter, Region Code, Type of School Dep, GPA, Lang Marks, Maths Marks, Teachers ID (Lang & Maths), Simce Scores (Lang & Maths).]

Notes: (i) To construct the Student Register we use the Enrolment DB from 2004 to 2013, and the Performance DB for 2003, as the Enrolment DB was not available for that year. (ii) We use cross-section cohorts for the VAM and we also create Mini Panels for particular years and grades (e.g. in Chapter 4). Other data bases (or variables) can be merged on request.

In short, we put together yearly information from two main sources, (i) the administrative data set and (ii) the Simce data set, and apply the data cleaning process that allows us to construct the final SPD for the period 2003-2012. Most of our analyses, however, are based on cross-section cohorts selected from this SPD.

3.3.1 Cleaning the Aggregated Student Register (2003-2012)

We identified some cases where students had more than one observation per year, which in general means more than one school per year. We therefore had to clean the data by designing selection criteria that assign only one school per student-year. This process enables us to construct our Student Panel Dataset (SPD).

The initial data set contained 27,561,043 observations, of which 715,334 belonged to students who had at least one year of duplicated registers. This group with assignment problems, that is, more than one school per student-year, was classified as the “Bad IDs” group and set aside for the cleaning process.3 Even though it accounted for only about 3% of the total data set, it was necessary to define coherent school selection criteria corresponding to the possible reasons for a pupil’s duplicated observations in a year (e.g. a school did not update its student register after the pupil moved to another school).

To generate rational selection criteria, we first ruled out all individuals who had three or more registered schools in any one year. This group represented less than 0.01% of the initial number of observations. We then continued working only with those individuals who had one duplicated school register (that is, two schools) in at least one year. This group accounts for 583,097 observations in total. However, if we focus only on the years with duplicated cases, the number of observations falls to 98,212; these are the final cases to which we apply the school assignment criteria.
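The grouping described above can be sketched as follows. This is a minimal illustration, assuming that a duplicated register means two or more distinct school IDs for the same student and year; the function name `classify_students` and the toy records are hypothetical.

```python
from collections import defaultdict


def classify_students(register):
    """Split student IDs by the largest number of distinct schools in any year."""
    schools = defaultdict(set)          # (student, year) -> set of school IDs
    for row in register:
        schools[(row["mrun"], row["year"])].add(row["rbd"])

    worst = defaultdict(int)            # student -> max schools in a single year
    for (student, _year), ids in schools.items():
        worst[student] = max(worst[student], len(ids))

    clean = {s for s, n in worst.items() if n == 1}
    two_schools = {s for s, n in worst.items() if n == 2}
    dropped = {s for s, n in worst.items() if n >= 3}      # ruled out entirely
    # Student-years to which a school assignment rule must be applied.
    duplicated_years = {key for key, ids in schools.items()
                        if len(ids) == 2 and key[0] in two_schools}
    return clean, two_schools, dropped, duplicated_years


toy_register = [
    {"mrun": 1, "year": 2003, "rbd": 10},
    {"mrun": 1, "year": 2004, "rbd": 10},
    {"mrun": 2, "year": 2003, "rbd": 10},
    {"mrun": 2, "year": 2003, "rbd": 20},
    {"mrun": 2, "year": 2004, "rbd": 20},
    {"mrun": 3, "year": 2003, "rbd": 10},
    {"mrun": 3, "year": 2003, "rbd": 20},
    {"mrun": 3, "year": 2003, "rbd": 30},
]
clean, two_schools, dropped, duplicated_years = classify_students(toy_register)
```

In the toy data, student 2 keeps the full history but only the year 2003 needs a school assignment, which mirrors the distinction between the whole group and the duplicated years only.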

The school allocation criteria essentially assign a school ID by backwards induction. If a student had more than one school in a particular year t, we look forward to the following year t+1 to check whether the student is enrolled (without duplicated registers) in one of the schools registered in t. If there is a match, we keep for year t the school that appears only up to year t, as we can then identify the school movement from t to t+1. Unfortunately, we do not know the point in time at which the pupil changed school during year t.4
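The allocation rule can be written as a small function. This is a sketch under one reading of the rule, namely that the school retained for year t is the one that does not reappear in year t+1; the function name `backward_assign` is hypothetical, and the full criteria are those referred to in Appendix 3.1.

```python
from typing import Optional, Set


def backward_assign(schools_t: Set[int], schools_next: Set[int]) -> Optional[int]:
    """Return the school to keep in year t, or None if the rule does not apply."""
    if len(schools_t) != 2 or len(schools_next) != 1:
        return None                     # needs a two-school year t and a clean t+1
    (next_school,) = schools_next
    if next_school not in schools_t:
        return None                     # no clear transition from t to t+1
    (origin,) = schools_t - {next_school}
    return origin                       # the school that appears only up to year t


kept = backward_assign({10, 20}, {20})       # student moves from school 10 to 20
no_match = backward_assign({10, 20}, {30})   # t+1 school not among the year-t schools
```

Returning `None` when the rule does not apply makes it explicit which cases are left for a different assignment procedure.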

When backwards induction cannot be applied because there is no clear sequential school transition from one year to the next, we apply a random assignment process.5 However, only 25,103 cases were randomly assigned, which represents only about 0.09% of the student panel. Finally, we end up with a panel of 27,428,806 observations, distributed across ten years and twelve grades.
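A random fallback of this kind could look as follows. This is only a sketch: the equal-probability draw, the fixed seed and the function name `random_assign` are assumptions made for illustration, not details taken from the text.

```python
import random


def random_assign(candidates, rng):
    """Pick one of the candidate schools with equal probability."""
    # Sorting first makes the draw independent of set iteration order.
    return rng.choice(sorted(candidates))


rng = random.Random(12345)              # hypothetical seed, for reproducibility
chosen = random_assign({10, 20}, rng)
```

Fixing the seed means the same school is selected each time the cleaning process is rerun, so the resulting panel is reproducible.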

3 We label as “Bad IDs” those students who have at least one duplicated observation per year in any year of the observed period. We then set aside their entire history, with and without duplicated cases, to reconstruct a proper panel with only one observation per student-school-year. Details of the school selection criteria are available in Appendix 3.1.

4 The backward induction selection criteria are applied only to duplicated cases with two schools per year. When a pupil is registered at three or more schools, we drop the register from the dataset.

5 The random assignment process consists in choosing one of the two possible schools, setting

3.3.2 Merging with complementary data sets

Due to capacity restrictions, we work with our constructed Student Panel Dataset (SPD) as the main panel, from which we obtain the cross-section cohorts and mini panels used for the analyses in later chapters. Additional information will be added depending on the interests and purposes of potential investigations. The idea is to define representative cohorts to work with and to merge them with all the necessary information.

The complementary data sets can be linked by student ID (Mrun) and school ID (RBD); for the teacher data bases we also use grade (CodGrade) and classroom identification (Letter). Keeping the student panel structure, we simply add variables per student-year for the selected cohorts used in the analyses and estimations.
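The linking step can be illustrated with a simple left merge that preserves the panel rows. This is a minimal sketch: the helper `left_merge`, the inclusion of the year among the keys, and the column `teacher_id_maths` are assumptions for the example, and the complementary data set is assumed to have at most one row per key.

```python
def left_merge(panel, extra, keys):
    """Add the columns of `extra` to each panel row that matches on `keys`."""
    lookup = {tuple(row[k] for k in keys): row for row in extra}
    merged = []
    for row in panel:
        match = lookup.get(tuple(row[k] for k in keys), {})
        added = {col: val for col, val in match.items() if col not in keys}
        merged.append({**row, **added})   # unmatched rows are kept unchanged
    return merged


panel = [
    {"mrun": 1, "year": 2005, "rbd": 10, "CodGrade": 4, "Letter": "A"},
    {"mrun": 2, "year": 2005, "rbd": 10, "CodGrade": 4, "Letter": "B"},
]
teachers = [
    {"rbd": 10, "year": 2005, "CodGrade": 4, "Letter": "A", "teacher_id_maths": 501},
]
merged = left_merge(panel, teachers, keys=("rbd", "year", "CodGrade", "Letter"))
```

Because every panel row is carried through, the student-year structure of the panel is unchanged and only new variables are added where a match exists.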

Before merging the SPD with the rest of the data bases, it is necessary to clean and process the data for every period. In the following section, we show the composition of each complementary dataset with a brief description, depending on the level to which we have processed the data bases so far.