Studies may restrict comparisons of grade distributions from two different examinations to the same group of students (Nuttall et al., 1974; NEAB, 1993; UCLES, 1993). The assumption is that each student has a fixed level of ability and should therefore achieve the same grade on each of the two examinations if the examinations are graded with similar severity. It is further assumed that affective factors such as motivation and confidence, and the effects of syllabuses and examinations upon students’ examination performances, are identical and therefore controlled for, simply because the same students are involved. Forrest and Shoesmith (1985), in their review of inter-board examination comparability studies conducted by examination boards during the period 1978-1985, state that there is no reason why this should be so.
If we consider the entire group of candidates [students] taking both Physics and Chemistry, say, in a particular board, how does the distribution of grades in Physics for that group compare with the corresponding distribution in Chemistry? It is sometimes argued that, if everything else is in order, the two distributions should be the same since the population of candidates is the same; but there is a counter-argument that too little is known (for example, about the degree of motivation and intensity of study in different subjects) to justify such an assertion.
(Forrest and Shoesmith, 1985, p. 11)
Nevertheless, the control of student variables by investigating the same group of students is used annually by examining groups in a method known as subject pair analysis for their own internal monitoring of the validity of their measures.
I approached the GCSE examining groups in early 1994 requesting information about the methods used for investigating examination comparability. The six groups that responded confirmed they used subject pair analysis and revealed a general reluctance to place the outcomes of such analysis in the public domain. One group, the Northern Examinations and Assessment Board (NEAB), expressed misgivings about the validity of the outcomes. This group provided an example of subject pair analysis from their 1993 GCSE science examinations (Table 3.1). The students’ GCSE grades are first converted to integers (A=8, B=7 down to U=1). The mean grade gained in each of the two subjects by all of the students who took that particular pair of examinations is then calculated and the difference obtained. Overall, using this method, Table 3.1 is said to show that Chemistry syllabuses A and B are more severely graded than those for either Biology or Physics A and B.
Table 3.1 NEAB 1993 GCSE Subject Pair Analyses

Examination A   Mean Grade (A)   Examination B   Mean Grade (B)   Difference (A - B)*
Biology         6.5              Chemistry A     6.3               0.2
Biology         6.5              Chemistry B     6.1               0.4
Biology         6.7              Physics A       6.7               0.0
Biology         6.8              Physics B       6.4               0.4
Chemistry A     6.6              Physics A       6.7              -0.1
Chemistry A     6.4              Physics B       6.5              -0.1
Chemistry B     6.5              Physics A       6.9              -0.4

*Value is positive when examination A is less severely graded than examination B
*Value is negative when examination A is more severely graded than examination B
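The calculation behind Table 3.1 can be sketched in a few lines. This is only an illustration of the procedure as described in the text, not NEAB's own implementation: the function name `subject_pair_difference`, the record layout and the four student records are hypothetical, and the grade scale assumes the eight 1993 GCSE grades A to G plus U.

```python
from statistics import mean

# NEAB convention from the text: A=8, B=7, ... down to U=1.
GRADES = "ABCDEFGU"
NEAB_SCALE = {grade: 8 - i for i, grade in enumerate(GRADES)}


def subject_pair_difference(results, subject_a, subject_b, scale=NEAB_SCALE):
    """Return (mean_a, mean_b, mean_a - mean_b, n) for students who took both subjects.

    `results` maps a student id to a dict of {subject: letter grade}.
    Students with a grade in only one of the two subjects are left out,
    which is the omission NEAB lists as a limitation of the method.
    """
    pairs = [
        (scale[grades[subject_a]], scale[grades[subject_b]])
        for grades in results.values()
        if subject_a in grades and subject_b in grades
    ]
    if not pairs:
        raise ValueError("no student took both subjects")
    mean_a = mean(a for a, _ in pairs)
    mean_b = mean(b for _, b in pairs)
    return mean_a, mean_b, mean_a - mean_b, len(pairs)


# Hypothetical records, invented purely for illustration.
students = {
    "s1": {"Biology": "A", "Chemistry": "B"},
    "s2": {"Biology": "C", "Chemistry": "C"},
    "s3": {"Biology": "B", "Chemistry": "D"},
    "s4": {"Biology": "A"},  # took Biology only, so excluded from the pair
}

mean_a, mean_b, diff, n = subject_pair_difference(students, "Biology", "Chemistry")
print(mean_a, mean_b, diff, n)
```

On the NEAB scale a positive difference would be read as examination A being less severely graded than examination B, subject to all the caveats discussed in this section.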
NEAB stressed that the subject pair method outcomes do not represent a definitive statement on the relative severity/leniency of particular syllabuses but merely serve as one of a number of indicators. Factors identified by NEAB that could affect the validity of subject pair analysis outcomes included:
(i) the possible differences in the teaching of the subjects paired together; the lengths of the courses; the amount of time devoted to the subjects; the disparity in school facilities (school variables);
(ii) the possible differences in the interests and motivations of the students in the paired subjects (student variables);
(iii) students who took only one subject are omitted from the analysis. The proportions of students taking only one subject varied from one subject to another. Consequently the method only partially represents the grading by subject, which affects the validity of the data in Table 3.1.
Concern (i) involves school variables that affect the students’ learning opportunities and not, in my view, the comparability of the examinations. Concern (ii) reiterates my concern with student variables expressed earlier. Concern (iii) illustrates the complexity of the limitations of adopting the ‘same student’ treatment of student variables. However, it is argued that the assumptions upon which the ‘same student’ methodology is based become more tenable as the number of students increases. When a large number of students is involved, there is more chance of a similar spread of examination entry policies, of affective factors such as motivation, and of the other variables upon which the grade distributions in the compared subjects depend. The difficulty lies in trying to establish the population size needed to justify the assumptions. Furthermore, sub-group effects may skew overall examination performances. For example, science examinations emphasizing electrical content set in contexts not reflective of girls’ out-of-school experiences could alter some girls’ confidence and actual examination performance (Johnson and Murphy, 1986). Thus, simply due to the girls’ sub-group effect, the same group of students could produce different grade distributions in two different science examinations. By taking larger numbers of students the effects from socio-cultural factors such as gender on examination performance would become even more evident.
It is interesting to note that in my 1994 communication with GCSE examination groups, the only other source of subject pair analysis outcomes was UCLES, and these were reported only by sex sub-group (Table 3.2). In contrast to the NEAB analyses, UCLES converted their GCSE grades to integers on a scale where A=1, B=2 down to U=8. From Table 3.2 one might say that for both boys and girls the severity of grading increases in the order of Biology, Physics and Chemistry.
Table 3.2 UCLES 1993 GCSE Subject Pair Analyses

BOYS
Examination A   Mean Grade (A)   Examination B   Mean Grade (B)   Difference (A - B)*
Chemistry       2.647            Biology         2.385            0.262
Chemistry       2.685            Physics         2.371            0.314
Chemistry       2.933            Biology         2.435            0.498
Chemistry       2.485            Physics         2.381            0.104

*Value is positive when examination A is more severely graded than examination B
*Value is negative when examination A is less severely graded than examination B
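Because the two boards number the grades in opposite directions, the sign of a difference means opposite things in Tables 3.1 and 3.2. As a rough sketch, assuming the same eight grades A to G plus U on both scales, a UCLES mean grade can be restated on the NEAB scale by subtracting it from 9, and a UCLES difference by reversing its sign. The function names below are hypothetical.

```python
GRADES = "ABCDEFGU"
NEAB_SCALE = {grade: 8 - i for i, grade in enumerate(GRADES)}   # A=8 ... U=1
UCLES_SCALE = {grade: 1 + i for i, grade in enumerate(GRADES)}  # A=1 ... U=8


def ucles_mean_to_neab(mean_grade):
    """Restate a UCLES-scale mean grade on the NEAB scale.

    Each grade's two codes sum to 9, and a mean is linear in the codes,
    so the same relation holds for mean grades.
    """
    return 9 - mean_grade


def ucles_difference_to_neab(difference):
    """Restate a UCLES-scale (A - B) difference on the NEAB scale.

    The constant 9 cancels in a difference, leaving only a sign change.
    """
    return -difference


# First row of Table 3.2: Chemistry 2.647, Biology 2.385, difference 0.262.
chemistry = round(ucles_mean_to_neab(2.647), 3)
biology = round(ucles_mean_to_neab(2.385), 3)
difference = round(ucles_difference_to_neab(0.262), 3)
print(chemistry, biology, difference)
```

Restated this way, the first row of Table 3.2 has a negative difference, which under the NEAB convention is read as Chemistry being more severely graded than Biology, the same interpretation as the positive value under the UCLES convention.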
When I communicated with the GCSE examining groups again in 1995/6 they confirmed that subject pair analyses were still used annually in examination comparability studies for internal group use. NEAB alone supplied some of their analysis outcomes, this time for 1995 examinations and with reference to sex groups. The various pairings of the subjects Biology, Chemistry and Physics revealed no apparent differences in severity of grading between these subjects. However, the performances of the two sexes showed that girls were significantly (P < 0.05) more severely graded on Chemistry than Biology but there was no difference in grading for the boys on these two subjects. When Biology and Physics were paired, girls were significantly (P < 0.05) more severely graded on Physics and boys were significantly (P < 0.05) more severely graded on Biology. Similarly, when Chemistry and Physics were paired, girls were significantly (P < 0.05) more severely graded on Physics and boys were significantly (P < 0.05) more severely graded on Chemistry. Thus the similarity in subject mean grades masks underlying differences in the mean grades of sex sub-groups.
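The masking effect can be seen in a toy calculation. The four records below are invented purely for illustration and are not NEAB data; grades are coded on the NEAB scale (A=8 down to U=1), and the helper `mean_difference` is a hypothetical name.

```python
from statistics import mean

# Hypothetical (sex, grade in examination A, grade in examination B) records.
records = [
    ("girl", 7, 6),
    ("girl", 6, 5),
    ("boy", 5, 6),
    ("boy", 6, 7),
]


def mean_difference(rows):
    """Mean grade in A minus mean grade in B for the given rows."""
    return mean(a for _, a, _ in rows) - mean(b for _, _, b in rows)


overall = mean_difference(records)
by_sex = {
    sex: mean_difference([row for row in records if row[0] == sex])
    for sex in ("girl", "boy")
}
print(overall, by_sex)
```

In this invented example the overall difference is zero, while the girls' and boys' differences are equal in size and opposite in sign, so the pooled comparison alone would suggest no difference in grading severity.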
These findings further exemplify the limitations of the ‘same student’ methodology in terms of sub-group effects. In Johnson and Murphy’s (1986) APU research, which unlike subject pair analysis was not based on psychometric theory, sub-group effects were anticipated and indeed were identified. The effects were considered to result from the interaction of the sub-group, for example boys and girls, though not necessarily all boys and all girls, with particular aspects of the examination. The assumption that differences in performance outcomes for boys and girls across subjects reflect differences in grading is therefore challenged, as they might instead reflect differences in boys’ and girls’ views of what is significant in the item, or indeed differences in their opportunities to learn: in short, the social gender mediation of teaching and learning. A combination of demands in assessment artefacts can differentially affect students’ performances, arguably to an extent that reduces the validity of comparing grade outcomes from different examinations taken by the same group of students.
The ‘same student’ treatment may claim to control several variables that affect examination performance, but this claim does not hold in reality. At best, the ‘same student’ treatment and subject pair analysis are only an indicator of examination comparability, and then only on condition that the compared examinations are taken ideally by the entire age cohort or, more realistically within the context of GCSE, by very large numbers of students.