3. Model Fit in a Frequentist Framework

3.5 Assessing the agreement between observations and model predictions

3.5.3 Additional considerations for polytomous items

When considering the fit of polytomous items, poor fit may be explained by disordered categories or disordered category thresholds. A higher category should imply more of the latent variable; if it does not, the category will exhibit large misfit. Apart from differing in order, categories may also differ in their probability of being observed, and a category that is rarely observed can result in disordered thresholds. Disordered thresholds do not necessarily degrade measurement, but they imply that a category discriminates across a very narrow range of the latent variable (Linacre, 2004b). If a category has very few observations, its parameters can only be estimated approximately. As the Rasch method of test equating depends on the estimation of the category parameters, poor estimation of these parameters can pose technical difficulties.
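The link between thresholds and rarely observed categories can be illustrated with a minimal sketch. It assumes a partial credit parameterisation, which is one common way of writing polytomous Rasch models and is not necessarily the one used in the analyses reported here; the threshold values and the ability grid are hypothetical. With ordered thresholds every category is the most probable one somewhere along the latent variable, whereas with disordered thresholds the middle category is never the most probable.

```python
import math

def category_probabilities(theta, thresholds):
    """Partial credit model probabilities for categories 0..m at ability theta."""
    # Cumulative sums of (theta - threshold); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    peak = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - peak) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=241):
    """Categories that are the most probable at some point on a theta grid."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = category_probabilities(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal

ordered = [-1.0, 1.0]     # hypothetical thresholds with tau_1 < tau_2
disordered = [1.0, -1.0]  # hypothetical thresholds with tau_1 > tau_2

modal_ordered = modal_categories(ordered)        # all three categories appear
modal_disordered = modal_categories(disordered)  # the middle category is missing
```

In the disordered case the middle category still has a non-zero probability at every ability; it is simply never the single most likely response, which is consistent with it being observed relatively rarely.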

3.6 Method

3.6.1 Design

Rasch and OPLM models were fitted to a selection of GCSE tests. Summary statistics were then calculated to verify that the items appeared to contribute to a coherent measurement instrument. Model fit was investigated using the following steps:

(i) Routine analysis

Firstly, classical indices such as facility values (p-values), item-total correlations and the distribution of scores at both test and item level were routinely examined.
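The classical indices in step (i) can be sketched in a few lines. This is an illustration under stated assumptions rather than the procedure actually used: the score matrix and maximum marks are hypothetical, and the item-total correlation is computed against the uncorrected total (the item itself is included in the total).

```python
import math
from collections import Counter

def facility(item_scores, max_mark):
    """Facility value: mean item score as a proportion of the maximum mark."""
    return sum(item_scores) / (len(item_scores) * max_mark)

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: rows are candidates, columns are items.
scores = [
    [1, 1, 2],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]
max_marks = [1, 1, 2]  # hypothetical maximum mark per item

totals = [sum(row) for row in scores]
columns = [list(col) for col in zip(*scores)]

facilities = [facility(col, m) for col, m in zip(columns, max_marks)]
item_total = [pearson(col, totals) for col in columns]
score_counts = Counter(totals)  # test-level score distribution
```

Dividing by the maximum mark keeps facility values on a 0 to 1 scale for polytomous items as well as dichotomous ones.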

(ii) Unidimensionality

Secondly, the dimensionality of the tests was examined using:

(a) Drasgow and Lissak’s (1983) linear factor analysis approach as implemented by Rizopoulos (2006). All items were dichotomised and a maximum sample of 1,000 candidates was used to minimise processing time.

(b) Principal Components Analysis of residuals (PCAR) as implemented in Winsteps

(iii) Test level measures of fit

Then, test-level measures were obtained using:

(a) R0 and R1M tests (Verhelst & Glas, 1995) as implemented in OPLM

(b) Graphical comparisons between the observed score distribution and the predicted score distribution as suggested by Swaminathan et al. (2007) and implemented in R (R Development Core Team, 2010). The person parameters under the Rasch model were estimated using the MML procedure from eRm (Mair & Hatzinger, 2007). A maximum sample of 1,000 candidates was used to minimise processing time.
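One way to obtain a predicted score distribution for the comparison in step (iii)(b) is sketched below. It is an assumption-laden illustration, not the R code used in the study: it handles dichotomous Rasch items only, the function names are hypothetical, and the recursion that builds the distribution one item at a time is a standard technique swapped in for illustration. The distribution of the total score is computed for each estimated ability and then averaged over candidates.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def score_distribution(theta, difficulties):
    """P(total score = s | theta) for s = 0..n, built one item at a time."""
    dist = [1.0]  # before any item is added, a score of 0 is certain
    for b in difficulties:
        p = rasch_p(theta, b)
        updated = [0.0] * (len(dist) + 1)
        for s, prob in enumerate(dist):
            updated[s] += prob * (1.0 - p)  # item answered incorrectly
            updated[s + 1] += prob * p      # item answered correctly
        dist = updated
    return dist

def predicted_distribution(thetas, difficulties):
    """Average the conditional score distributions over all candidates."""
    averaged = [0.0] * (len(difficulties) + 1)
    for theta in thetas:
        for s, prob in enumerate(score_distribution(theta, difficulties)):
            averaged[s] += prob / len(thetas)
    return averaged

# Hypothetical ability estimates and item difficulties, in logits.
thetas = [-1.0, 0.0, 0.5, 1.5]
difficulties = [-0.5, 0.0, 0.8]
predicted = predicted_distribution(thetas, difficulties)
```

The resulting proportions can be plotted against the observed proportion of candidates at each total score; large, systematic gaps between the two curves suggest misfit at test level.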

(iv) Item level measures of fit

Finally, misfit was examined at the item level using:

(a) Standardised Infit and Outfit statistics as implemented in Winsteps

(b) M-statistics as implemented in OPLM
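The Infit and Outfit statistics in step (iv)(a) are built on mean squares of residuals between observed and expected responses. The sketch below computes these mean squares for a single dichotomous item under the Rasch model, using hypothetical abilities and responses. It is not Winsteps' implementation, and it stops at the mean squares: the standardised values reported by Winsteps apply a further transformation that is omitted here.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals = 0.0   # sum of (x - p)^2
    information = 0.0         # sum of model variances p(1 - p)
    squared_z = 0.0           # sum of squared standardised residuals
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        variance = p * (1.0 - p)
        squared_residuals += (x - p) ** 2
        information += variance
        squared_z += (x - p) ** 2 / variance
    infit = squared_residuals / information  # information-weighted mean square
    outfit = squared_z / len(responses)      # unweighted mean square
    return infit, outfit

# Hypothetical case: a low-ability success and a high-ability failure.
infit, outfit = item_fit(responses=[1, 0], thetas=[-3.0, 3.0], b=0.0)
```

Both mean squares have an expected value near 1 when the data fit the model. Outfit is more sensitive to unexpected responses from candidates far from the item's difficulty, while Infit weights each response by its information and so emphasises candidates close to the item's difficulty.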

3.6.2 Components

Thirteen tests were selected to cover a variety of item types, response lengths, subject areas and difficulties. They were also chosen with test equating in mind: they have common items between levels and a coursework element common to both tiers that can be used for cross-validation purposes. Tests with longer response items such as essays, or with optional items, were excluded as these introduce assumptions about marking and choice that do not hold.

3.6.2.1 Science (Biology, Chemistry, Physics)

The Science tests have two primary objectives. The first is to assess candidates’ knowledge and understanding of science and how science works. The second is to assess the application of their skills, knowledge and understanding of science and how science works. At foundation tier the candidates answer 5 matching items (four pieces of information matched to four stimuli) and 16 multiple-choice items (with four response categories and only one correct answer). The test is divided into 9 sections, each preceded by a stimulus. The stimulus may be in the form of a graph, a table, a paragraph, or some combination of all three. At higher tier candidates answer 2 matching items and 28 multiple-choice items.

3.6.2.2 Mathematics

Mathematics assesses: use and application of mathematics; number and algebra; shape, space and measures; handling data. The foundation tier has 63 and 56 items in Papers 1 and 2 respectively, both with a total mark of 100. The higher tier has 47 items and 50 items in Papers 1 and 2 respectively, both with a total mark of 100. For all papers the items vary from single mark items through to four mark items. Very few of the items are multiple-choice.

3.6.2.3 Geography

For Geography, candidates are expected to: show knowledge of places, environments and themes at a range of scales from local to global; show understanding of some specified content; apply their knowledge and understanding in a variety of physical and human contexts; select and use a variety of skills and techniques appropriate to geographical studies and enquiry. Paper 1 comprises a series of short answer items and two structured items on the United Kingdom. The paper also includes one or more items based on a UK Ordnance Survey map. Both tiers have a maximum mark of 75, with 33 items on the foundation tier and 28 items on the higher tier. The maximum mark for an item is 6 for both tiers. Paper 2 comprises four sections. Section A comprises a series of short answer items taken from: The European Union; The Wider World; Global Issues. The remaining sections each comprise a structured item on one of those same three areas. Both tiers have a maximum mark of 120; the foundation tier has 47 items while the higher tier has 31 items. The maximum mark for an item is 6 on the foundation and 9 on the higher. No items are multiple-choice.

3.6.2.4 Mathematics Functional Skills

Mathematics Functional Skills aims to assess how well candidates demonstrate their mathematical skills in a range of contexts for a range of purposes. The items therefore embed the mathematics within authentic contexts. Paper 1 comprises 30 short response dichotomous items, some of which are multiple-choice.

3.7 Results