Chapter 3 Data Collection and Preprocessing
3.1 Data Description
The data used for the analysis falls into the following categories:
I. Large-scale California-wide high school data on Adequate Yearly Progress (AYP) performance metrics for three years: 2013, 2014, and 2015 (attributes tabulated in Appendix 1)
II. California-wide high school data on pupil demographics, second language learners, free meal program participation, and teacher qualifications (attributes tabulated in Appendix 2)
III. California-wide district-level finance and demographic information, including attributes on various sources of revenues (attributes tabulated in Appendix 3)
For a future research step, individual student yearly report data (grades in various subjects, text-based trimester comments from teachers, and recommendations for progress) was simulated from the mean and standard deviation of the performance measures of each school selected as a sample from the clusters.
The datasets are described in more detail as follows.
I. High school performance data at school level from the state of California: A total of 1,636 high schools were used for the macro-level analysis to identify school clusters. Only traditional high schools are included; schools for differently abled students and vocational high schools were excluded. The data consists of school-level performance metrics for the years 2013, 2014, and 2015 from the Adequate Yearly Progress Report, obtained from the California Department of Education. Table 5 lists a sample of the attributes used for the analysis. The data was originally provided as an Excel file, which was converted to a comma-separated values (CSV) file for analysis in the R language. The raw data required further cleaning before it could be analyzed; the problems encountered and the cleaning methods are discussed in detail in Section 3.2.
Table 5: Sample of attributes used for clustering from all California high schools
Field Name  Type       Width  Description
cds         Character  14     County/district/school code
rtype       Character  1      Record type: D=district, S=school, X=state
type        Character  1      Type: 1=unified, 2=elementary district, 3=9–12 high district, 4=7–12 high district, E=elementary school, M=middle school, H=high school
sname       Character  50     School name
dname       Character  50     District name
cname       Character  50     County name
Crit1       Character  2      Number of AYP criteria met, based only on participation rate and additional indicators
Crit2       Character  2      Number of AYP criteria possible
m_enr       Character  7      Schoolwide or Local Education Agency (LEA)-wide math enrollment
m_tst       Character  7      Schoolwide or LEA-wide math, number of students tested
m_prate     Character  5      Schoolwide or LEA-wide math participation rate
m_val       Character  7      Schoolwide or LEA-wide math valid scores
m_prof      Character  7      Schoolwide math number of students scoring proficient or above
m_pprof     Character  5      Schoolwide math percent of students scoring proficient or above
mp_aa
Adequate Yearly Progress (AYP) data from the California school system. A sample of important attributes is listed here. Please refer to Appendix 1 for a full list of attributes.
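Because every field in Table 5 is stored as a character type, the school-level high school records have to be selected and their numeric fields converted before any analysis. The following is a minimal sketch of that step, written in Python for illustration only (the analysis in this work was carried out in R). The sample records, the `--` placeholder for a missing value, and the function names are assumptions made for the example, not part of the original data specification. The sketch does not cover the exclusion of vocational schools or schools for differently abled students, since Table 5 does not show a field for that.

```python
import csv
import io

# Hypothetical three-record extract that uses field names from Table 5.
SAMPLE_CSV = """cds,rtype,type,sname,m_enr,m_tst,m_prate
01100170000001,S,H,Example High,200,190,95
01100170000000,D,1,,5000,4800,96
01100170000002,S,M,Example Middle,300,--,
"""

# Character fields that hold numeric values.
NUMERIC_FIELDS = ("m_enr", "m_tst", "m_prate")


def to_number(text):
    """Convert a character field to float; return None if it is not numeric."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def load_high_schools(handle):
    """Keep school-level high school records and parse their numeric fields."""
    rows = []
    for record in csv.DictReader(handle):
        # rtype "S" marks a school record; type "H" marks a high school.
        if record["rtype"] != "S" or record["type"] != "H":
            continue
        for field in NUMERIC_FIELDS:
            record[field] = to_number(record[field])
        rows.append(record)
    return rows


high_schools = load_high_schools(io.StringIO(SAMPLE_CSV))
```

Reading the file with `csv.DictReader` keeps `cds` as a string, so the leading zeros of the 14-character code are preserved.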
II. Second dataset and data spread in multiple tables: The second step of the analysis used regression methods to examine whether other variables, such as the number of students in the free lunch program, the number of migrants, and teacher qualifications, correlate with the clustering. The information for these attributes had to be retrieved from several other tables provided by the California Department of Education. Because certain attributes were not available for all 1,636 schools, some steps used only a sample set of schools from each cluster derived in the initial analysis. Please refer to Appendix 2 for a full attribute list.
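Combining attributes spread over several tables and then drawing schools from each cluster could be sketched as follows. This is an illustrative Python sketch, not the procedure used in this work: it assumes the tables share the `cds` code as a key, uses made-up school codes and values, and draws the per-cluster sample at random.

```python
import random

# Hypothetical inputs: cluster labels from the first analysis step and one
# attribute from a second table, both keyed by a (shortened) cds code.
clusters = {"A1": 0, "A2": 0, "A3": 0, "B1": 1, "B2": 1, "C1": 2}
free_meals = {"A1": 120, "A2": 85, "A3": 40, "B1": 300, "C1": 15}


def join_on_key(left, right):
    """Inner join: keep only the schools present in both tables."""
    return {key: (left[key], right[key]) for key in left if key in right}


def sample_per_cluster(joined, n_per_cluster, seed=0):
    """Draw up to n_per_cluster schools at random from each cluster."""
    rng = random.Random(seed)
    by_cluster = {}
    for key, (cluster, _value) in joined.items():
        by_cluster.setdefault(cluster, []).append(key)
    return {
        cluster: sorted(rng.sample(keys, min(n_per_cluster, len(keys))))
        for cluster, keys in by_cluster.items()
    }


joined = join_on_key(clusters, free_meals)
sample = sample_per_cluster(joined, n_per_cluster=2)
```

School `B2` has no entry in the second table, so the inner join drops it before sampling. This mirrors the situation in which an attribute is not available for every school.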
III. District-Level Finance Data: California has 87 high school districts and 330 unified school districts. For this research, a total of 337 districts are taken into account, including high schools. The dataset consists of 51 attributes used in the analysis, which are listed in Appendix 3.
Next steps: simulated data with merged numeric and text data: The three datasets just described are macro datasets that do not contain individual, micro-level data. Micro-level data for individual student performance is not legally available from the school system. To establish a technical flow for testing algorithms in future research, micro-level data at the individual level was simulated based on the mean and standard deviation available for the schools taken as a random sample from each of the clusters created from the analysis in step 1. The attributes selected are a reflection of the California system’s high school grade reports and annual student reports, containing data on student performance on the courses studied, teacher comments for each trimester, and recommendations for improving performance or college plans.
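The simulation idea described above, generating individual scores from a school's mean and standard deviation, can be illustrated with a short Python sketch. The normal distribution, the 0 to 100 score scale, the school names, and the summary values are all assumptions made for the example; the source specifies only that the simulation was based on each school's mean and standard deviation.

```python
import random
import statistics


def simulate_scores(mean, sd, n_students, seed=None, low=0.0, high=100.0):
    """Draw n_students scores from a normal distribution, clipped to [low, high]."""
    rng = random.Random(seed)
    return [min(high, max(low, rng.gauss(mean, sd))) for _ in range(n_students)]


# Hypothetical school-level summary statistics: (mean, standard deviation).
schools = {"school_a": (72.0, 8.0), "school_b": (55.0, 12.0)}

# One simulated score list per school, each with its own seed.
simulated = {
    name: simulate_scores(mean, sd, n_students=500, seed=index)
    for index, (name, (mean, sd)) in enumerate(schools.items())
}

# Sample mean of the simulated scores, for comparison with the input mean.
school_a_mean = statistics.mean(simulated["school_a"])
```

Fixing the seed makes the simulated dataset reproducible, which is useful when the same micro-level data is reused to compare algorithms.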
Another main goal of the research is to demonstrate a combined analysis of text data and numeric data. The trimester comments and recommendations of the teachers are text data. After text mining, the results were expressed as numeric variables and merged with the rest of the student report variables for analyzing the micro-level data. The