Total 62 45 107 Overall, about half of all papers presented evidence which could be broadly described as
4. The Validity of Value Added Measures
5.3 Key Data Sources
5.3.1 Performance Tables Data
In England, details of school performance and characteristics for all state-funded schools are published annually in the ‘performance tables’ on the Department for Education (DfE) website (DfE, 2015). Approximately 7% of pupils nationally attend private schools and so data are generally unavailable for these pupils. Where data are available for privately educated pupils, measures are not always comparable because of differences in the qualifications taken. School performance data go back to 1994 and full school-level datasets are freely and readily available from 2005 onwards (DfE, 2015). This research uses school-level data from 2011-2014, where 2011 was the first year in which the current VA measure was used and 2014 was the most recent year for which performance data were available during analysis. It is possible to match schools across years of data using the unique school reference numbers recorded with each record in the data. Even when schools change name or type, it is generally possible to match them using local authority codes and establishment codes.
The data provided in the performance tables are used in two of the studies in this thesis: first, the study of the reliability of value-added scores across years and, second, the study of observable biases and error in the official value-added scores. Further details of the exact measures used are given later in this chapter and in the relevant results sections.
5.3.2 National Pupil Database Data
As well as the school-level data collected in the performance tables (above), the DfE collects pupil-level data which combine data on pupil and school characteristics with examination results from key stage examination years. Pupil-level data can be obtained by application to the DfE, which encourages use of the data for appropriate research or school improvement purposes. The National Pupil Database (NPD) is a very large dataset and has performance data going back to 1996; these performance data have been matched with the School Census data (formerly the Pupil-level Annual School Census, PLASC) since 2002 (DfE, 2015). A large number of variables are collected relating to achievement and to pupil and school characteristics. These fields change over the years with new policies and improvements in the data, but the general trend has been towards data of greater quantity and quality. More information can be found on the NPD Wiki, which is maintained by NPD users in the research community and is updated regularly as the NPD changes (Allen, 2015a). The main study (Study 1) in which the pupil-level NPD data are used concerns bias and error within the official value-added data used in policy and practice. Study 1 looks at some of the difficulties in constructing the measure and so uses the more fine-grained pupil-level data as well as the school-level data described in the last section. NPD data are also used in the study presented here regarding the consistency of value-added estimates across cohorts within schools in a given year (Study 4). As will be described in the next section, this research also obtained a dataset containing teacher-assessed performance data and numerous contextual variables. This dataset was collected as part of previous research funded by the DfE and the data were matched with NPD data. The dataset described in the next section, therefore, is a combination of data collected in the DfE study and NPD data.
5.3.3 ‘Making Good Progress’ Data
The NPD (see above) contains performance data for pupils at National Curriculum Key Stage years (where Key Stages 1 to 5 correspond to ages 7, 11, 14, 16 and 18). Several studies in this thesis, however, required performance data for year groups who were between these years. Study 2 compares value-added estimates of school effectiveness with those produced using a regression discontinuity design, requiring performance data for consecutive year groups. Study 3 examines the stability of performance for a given cohort followed over time. Study 4 examines the consistency of performance between all year groups in a key stage at a single point in time. All of these require performance data for year groups outside of key stage years.
During initial searches for existing data to meet this need, a DfE study known as ‘Making Good Progress’ (MGP) was identified. Access to the MGP dataset for use in this thesis was obtained through an application process. The MGP study looked at how pupils progressed during Key Stages 2 and 3 (DfE, 2011), collecting teacher-assessed performance data for all cohorts within this age range for three study years. It is a very large dataset which is well-suited to the intended analyses. Summary details are given presently, before more study-specific details are given in later sections. Further details of the MGP dataset, including the local educational authorities included, the variables collected, the validity of the teacher-assessed data and the methodology of the data collection, can be found in the DfE statistical report based upon it (DfE, 2011).
The MGP dataset is large, with data for 148,135 pupils spanning 342 schools, 10 local authorities and 6 consecutive school year groups (UK years 3-9) across 3 years. There were 100,000 pupils in 2007/2008, fairly evenly spread across years 3 to 9 (age 8 to 14). This overall number dropped to just over 70,000 by the third year, again spread fairly evenly across the age range. The MGP report compares the achieved sample with national data for these year groups across a range of pupil background variables, finding it to be ‘broadly representative’ of pupils in years 3 to 9 nationally (DfE, 2011, p.6).
The analyses of the MGP data make use of the teacher-assessed data based on the National Curriculum (NC) levels framework and guidance. NC levels are designed to be a single scale tracking attainment from age 5 to age 14. It is questionable, however, whether the NC levels can be considered an interval scale (where the difference between levels 3 and 4 can be assumed to be the same size as that between levels 4 and 5, for example) and whether levelling is consistent across teachers and across the full age range. There is also evidence to suggest that teacher-assessed levels can be unreliable in some circumstances, although it may be possible to improve this by way of moderation procedures and well-designed assessment criteria (Harlen, 2005). The evidence base on both the reliability of teacher assessments and the effectiveness of moderation as a way of improving it is considerably lacking at present, however (Johnson, 2013). The MGP report discusses the quality of teacher assessments and includes an annex which compares the teacher-assessed levels with those obtained in the Key Stage 2 and 3 examinations (DfE, 2011). This gives some indication of how consistent teacher-assessed and examination-assessed levels were in this instance. Agreement between the teacher-assessed and examination-assessed levels varied from 56% to 77% in KS2 writing, 36% to 95% in KS2 reading and 64% to 89% in KS2 mathematics. Some of the discrepancies will stem from differences in timing between the two measures, with teachers’ scores being lower than the examined results because they were recorded some time earlier (DfE, 2011).
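The agreement figures quoted above can be illustrated with a minimal sketch. It assumes that ‘agreement’ means the percentage of pupils whose teacher-assessed level exactly matches their examination level; the precise definition used in the DfE report is not restated here. The function names and the eight pupils' levels are invented for illustration and are not MGP data.

```python
def percent_agreement(teacher_levels, exam_levels):
    """Percentage of pupils whose two levels match exactly."""
    if len(teacher_levels) != len(exam_levels):
        raise ValueError("both lists must cover the same pupils")
    if not teacher_levels:
        raise ValueError("at least one pupil is required")
    matches = sum(t == e for t, e in zip(teacher_levels, exam_levels))
    return 100.0 * matches / len(teacher_levels)


def mean_difference(teacher_levels, exam_levels):
    """Mean of (teacher level - examination level).

    A negative value means teachers recorded lower levels on average.
    """
    diffs = [t - e for t, e in zip(teacher_levels, exam_levels)]
    return sum(diffs) / len(diffs)


# Invented levels for eight pupils (assumption: not taken from the MGP data).
teacher = [4, 4, 3, 5, 4, 3, 5, 4]
exam = [4, 5, 3, 5, 4, 4, 5, 4]

agreement = percent_agreement(teacher, exam)
difference = mean_difference(teacher, exam)
print(agreement, difference)
```

In this invented example six of the eight pupils have matching levels, and the two mismatches are both cases where the teacher-assessed level is one level below the examination level, which is the direction of discrepancy that earlier recording of teacher assessments would produce.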
Because of these differences, analyses of the MGP data use only the mathematics performance data, which tended to have higher levels of agreement with the examination-assessed data (see DfE, 2011, p.41, for a chart showing the correspondence between teacher-assessed and examination-assessed KS2 mathematics). The report comments on the moderation activities which took place in schools during the study, noting that during the pilot study concerns were expressed about the initial quality of teacher assessment but that the quality improved as the ‘processes bedded’ (DfE, 2011, p.7). The correspondence between the teacher-assessed levels and the examination levels tended to increase over the period 2008-2010. The sample in the third year was reduced, however, which is likely to reduce robustness, especially at secondary level (where the school numbers were lower). One other factor relating to the consistency of teacher- and examination-assessed attainment is that agreement was generally at the higher end of the above range for pupils of average or above-average ability and lower for lower-attaining pupils.
These problems of validity are inherent in educational and psychometric measurement in general and so concerns about the quality of the data used, while certainly noteworthy, are not held to be especially problematic for this study in particular. The quality of the MGP dataset is comparatively high, with the main difficulty being the fact that performance is teacher-assessed. Of course, it is not the case that examination-based measures of academic performance are entirely valid, especially when based on a single examination in a high-stakes context (Stobart, 2008). Moreover, teacher assessments are used as part of the predominantly examination-based key stage results in the NPD (above) and are widely used in practice in schools (see Chapter 3 for details and Chapter 7 for further discussion of how this study limitation influences the implications of the results for policy and practice, respectively).
5.3.4 Approaches and Actions Common to all Studies
The majority of the information for each study is contained within the sections below and in the corresponding results sections. This section briefly outlines some steps taken with the data that are common to all analyses and so saves repetition. Several explanatory points are made to ensure key distinctions made within the various analyses are clear.
All analyses contained within the results chapter were conducted using either Stata (v13) or SPSS (v22); the final analysis was completed almost exclusively using Stata. Syntax has been stored and can be produced on request. All the analyses concern state-funded mainstream schools only. Special schools and independent schools, and their pupils, can be identified in all of the available datasets and so are removed from all analyses. The reason for omitting the former is that special schools take pupils with highly individual and specialist needs. This makes the use of value-added measures (which seek to produce like-for-like comparisons) highly questionable. Special schools, at best, could be considered a more challenging application of the methodology, whereas the intention is to consider its use in more favourable circumstances. Similarly, including private schools would raise issues surrounding the measure of performance, as many private schools take alternative qualifications which are not counted within the official data and many of their pupils lack the prior attainment measures used in the calculations. For present purposes, excluding private and special schools from the analysis allows the core issues to be addressed.
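The exclusion step described above can be sketched as a simple filter. This is an illustration only, not the Stata or SPSS syntax used in the thesis: the field name `school_type`, its category labels and the example records are all hypothetical and do not correspond to actual variable names in the performance tables, NPD or MGP data.

```python
# Hypothetical category labels for the school types to be excluded.
EXCLUDED_TYPES = {"special", "independent"}


def mainstream_state_funded(schools):
    """Keep only records whose school type is not in EXCLUDED_TYPES."""
    return [s for s in schools if s["school_type"] not in EXCLUDED_TYPES]


# Invented example records (assumption: not real schools).
schools = [
    {"school_id": 1, "school_type": "community"},
    {"school_id": 2, "school_type": "special"},
    {"school_id": 3, "school_type": "academy"},
    {"school_id": 4, "school_type": "independent"},
]

kept = mainstream_state_funded(schools)
kept_ids = [s["school_id"] for s in kept]
print(kept_ids)
```

Applying the same filter to every dataset before any analysis keeps the sample definition identical across the studies.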
Another point relevant to several of the analyses is that there has been a great deal of reform of English schools. Part of this reform changes the legal and financial arrangements of schools so that they are funded directly by central government (rather than local government) and designates such schools as ‘academy’ schools. One result is that many of the data sources used record the pre-academy and post-academy schools under different school reference IDs. Despite the nominal change to the school, it was thought appropriate to match schools operating on the same site using local authority and establishment codes rather than the school IDs, as the former remain unchanged through academisation.
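The matching approach can be sketched as follows, assuming each school record carries a school reference ID, a local authority code and an establishment code. The field names (`school_id`, `la_code`, `estab_code`) and all values are hypothetical placeholders rather than the actual column names or identifiers in the official data, and the sketch assumes that the combination of the two codes is unique within a year.

```python
def site_key(record):
    """Composite key: (local authority code, establishment code)."""
    return (record["la_code"], record["estab_code"])


def match_schools(earlier_year, later_year):
    """Pair records from two years that share the same composite key.

    Returns a list of (earlier school ID, later school ID) tuples.
    Records with no counterpart in the later year are left unmatched.
    """
    later_by_key = {site_key(r): r for r in later_year}
    pairs = []
    for record in earlier_year:
        counterpart = later_by_key.get(site_key(record))
        if counterpart is not None:
            pairs.append((record["school_id"], counterpart["school_id"]))
    return pairs


# Invented records: the first school converts to an academy and receives
# a new school ID; the second is unchanged; the third has no later record.
schools_2011 = [
    {"school_id": 100001, "la_code": 301, "estab_code": 4001},
    {"school_id": 100002, "la_code": 301, "estab_code": 4002},
    {"school_id": 100003, "la_code": 302, "estab_code": 4001},
]
schools_2014 = [
    {"school_id": 136001, "la_code": 301, "estab_code": 4001},
    {"school_id": 100002, "la_code": 301, "estab_code": 4002},
]

pairs = match_schools(schools_2011, schools_2014)
print(pairs)
```

Matching on the school ID alone would have lost the first school in this example, since its ID changes on conversion while the composite key does not.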
In terms of presentation of the results, model output is placed in the main text only where it is the object of discussion. Output which is not discussed directly but underpins key results is included in the appendices.