subject of Chapter 7). This is the measurement component. For a closed-response item such as a multiple-choice question, this may simply be that each response is scored as 0 for incorrect and 1 for correct; for performance tasks, it would be necessary to provide more complex rating scales or other devices to guide the judgements of human assessors.
Test assembly specification: This document provides the instructions for how the entire test is constructed. We might know that there are four item types, but we still need to know how many of each item type we need in the test. If the items are coded by context, topic, degree of pragmatic knowledge required to respond (as in our ‘rapport item’ at the end of the last chapter), or whatever other features we think characterise the criterion domain, we may need to specify how many items are required for each category.
We also know from Chapter 2 that test reliability in norm-referenced tests is directly related to test length.
We may therefore need to specify the target reliability and the minimum number of items needed to meet the target. The test assembly specification therefore plays a critical role in showing that the number and range of items in any form of the test adequately represent the key features of the criterion situation in the real world.
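The link between test length and reliability can be made concrete with a small calculation. The sketch below uses the Spearman-Brown prophecy formula, a standard result in classical test theory that is not named in this chapter, to estimate the minimum number of items needed to reach a target reliability. It assumes that any added items are parallel in quality to the existing ones; the function names and the figures in the example are illustrative only.

```python
import math


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test is lengthened by length_factor.

    Assumes the added items behave like the existing ones (parallel items).
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def min_items_for_target(current_items: int,
                         current_reliability: float,
                         target_reliability: float) -> int:
    """Smallest number of items predicted to reach target_reliability."""
    if not (0 < current_reliability < 1 and 0 < target_reliability < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Spearman-Brown solved for the lengthening factor.
    factor = (target_reliability * (1 - current_reliability)) / (
        current_reliability * (1 - target_reliability)
    )
    # Round before taking the ceiling to avoid floating-point overshoot.
    return math.ceil(round(factor * current_items, 9))


# Hypothetical pilot: 20 items with reliability .70, target reliability .90.
print(min_items_for_target(20, 0.70, 0.90))  # 78
```

A figure such as this is only a planning estimate for the test assembly specification; the reliability of an assembled form still has to be checked empirically.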
Presentation specification: Often overlooked by language test designers, the presentation specification tells the test production team precisely how the items and any support material are to be presented to the test takers.
If it is a paper-and-pencil test, we need to specify the margin size, the font type and size, spacing, and where page numbers will appear. In computer-based tests the presentation specification is even more important, because we know that variation in presentation can cause fluctuations in scores that are construct irrelevant. Interface design and specification limit what it is possible to do, such as the use of colour, scrolling, or the amount of text that can be presented on a single screen (Fulcher, 2003b).
Delivery specification: The delivery specification sets out the details of the test administration, test security, and timing. These specifications may include spacing between desks or computers, the number of invigilators/proctors required per number of test takers, what may or may not be used during the test (dictionaries, for example),
What are test specifications? 129
how long each sub-test should take, and what the overall time allocated to the test is.
Together, these specifications allow us to build and deliver a ‘test form’ for a particular administration. A ‘test’ is an abstract term. In a very real sense, a ‘test’ is the collection of specifications. Any realisation of the specifications is a test form. We can use the specifications to create three, four or a hundred test forms. And we don’t use the term ‘test version’, for a very good reason. A test form is generated from a test specification; one reason for having test specifications is to try to ensure that each form looks roughly the same because it is made up of the same item types, with the same number of items, representing the same set of constructs in each section. The specification is also designed to try to make sure that each form is of the same difficulty. Recall the discussion of fairness and equality in Chapter 2. One role for test specifications is to ensure that if we need two forms of a test on the same day for security reasons – say, one group of learners is taking the test in the morning and another in the afternoon – it should not matter to a test taker whether they are assigned to the morning or afternoon group.
Similarly, if someone takes the test this year or next year, assuming that their ability on the construct has not changed, they should get a similar score (within the standard error of measurement), even if the forms are different. A critical feature of test forms, therefore, is that they are parallel; there is no change between them. However, when we talk about a version of a test, we imply that it has changed. Over time test designers learn new things about their tests. Some items are not as good as we thought they were at measuring the construct they were intended to measure. Perhaps some items are sensitive to variables like gender, or first language background, and so they have to be removed. Perhaps we see signs that the test-taking population is changing and so some items have to be made more difficult. Changes to the test require the test specifications to be changed so that we have a new version of the test. The new version, in its turn, generates new forms – but all the new forms are parallel and there is no change between them. This relationship between forms and versions is illustrated in Figure 5.1. Here we can see that a test was developed at a point in time and a number of forms were created. These are identical twins, as it were. The test has subsequently been changed and improved on two separate occasions. This is part of the natural evolution of a test specification. Each time, the previous forms have been discontinued and new forms produced. The test remains the same. It is still a test of the same constructs. The test specifications evolve into new versions of the test. The forms are realisations of a particular test version.
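The idea that scores on two parallel forms should be similar ‘within the standard error of measurement’ can be illustrated with a short calculation. The sketch below uses the classical formula SEM = SD × √(1 − reliability). Treating two scores as ‘similar’ when they differ by no more than a chosen number of SEMs is our own simplifying assumption for illustration (a stricter analysis would use the standard error of the difference between two scores), and the function names and figures are hypothetical.

```python
import math


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical SEM: the score standard deviation times sqrt(1 - reliability)."""
    if score_sd < 0 or not 0 <= reliability <= 1:
        raise ValueError("score_sd must be >= 0 and reliability within [0, 1]")
    return score_sd * math.sqrt(1 - reliability)


def scores_are_similar(score_a: float, score_b: float,
                       score_sd: float, reliability: float,
                       n_sems: float = 1.0) -> bool:
    """True if the two scores differ by no more than n_sems SEMs.

    Assumption: 'similar' is read as an absolute difference within n_sems SEMs.
    """
    sem = standard_error_of_measurement(score_sd, reliability)
    return abs(score_a - score_b) <= n_sems * sem


# Hypothetical forms with a score SD of 10 and reliability .91 (SEM of about 3).
print(scores_are_similar(62, 64, score_sd=10, reliability=0.91))  # True
print(scores_are_similar(62, 66, score_sd=10, reliability=0.91))  # False
```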
You will not be surprised to learn that test specifications are part of the technology that we discussed in Chapter 2. However, they are also a critical component of criterion-referenced testing.
Although the term ‘specifications’ is not used, the earliest discussion appears in Ruch (1924: 95–99). Indeed, the advice he gives to teachers covers pretty much the content of the five kinds of specifications listed above. While no early specifications appear to survive, we can see evidence of them in many early publications on testing. For example, Yoakum and Yerkes (1920) contains multiple forms of ten army tests that are remarkably parallel, both in content and statistical performance. It seems highly unlikely that these forms could have been generated without a set of specifications, although sadly the authors skip over precisely how the test designers moved from initial ideas to multiple forms (1920: 2–3). We can also see evidence of specifications in some task descriptions, such as these from Burt (1922: 24–25):
Understanding Simple Commands
Procedure. ‘Show me’ [‘put your finger on,’ ‘point to’] … (i) … ‘your nose’ … (ii) … ‘your eyes’ … (iii) … ‘your mouth’ …
Each request (repeated several times, if necessary) should be given and answered separately.
Evaluation. All three injunctions should be correctly performed: but abundant repetition and free encouragement may first be used. (Opening the mouth, winking the eyes, etc., may be accepted.)
[Terman adds (iv) ‘hair’; this requires three out of four to be correct; allows using a doll, and the question: ‘Is this its (or your) nose? … Then where is its (or your) nose?’]
In more open tasks that resemble modern-day speaking tests much more closely, we also have the following (slightly adapted) example (Burt, 1923: 26–27):
Describing Pictures
Three pictures chosen as containing people, and suggesting a story, and having a certain standardised difficulty.
Procedure. ‘Look at this picture and tell me about it.’
‘What is this?’ If the child says ‘a picture’, ‘Tell me what you see there.’ It seems better to avoid leading phrases like ‘What can you see in it?’ which suggests enumeration, and ‘What are they doing?’ which suggests interpretation. Repeat instructions once for each picture, if there is no answer. Words of praise or encouragement may be added: ‘Isn’t it a pretty picture? … Do you like it?’ Or even, ‘That’s right’ if the child is on the point of saying something, but is withheld by shyness.
[Fig. 5.1. Forms and versions. Time 1: Original Version (Form 1, Form 2, Form 3, Form n). Time 2: Version 2 (Form 1, Form 2, Form 3, Form n). Time 3: Version 3 (Form 1, Form 2, Form 3, Form n).]
Evaluation of Replies. Record the type of response given to the first picture. If doubtful, use the second and third, and record the type of response most frequently given.
Types of Response.
A. Enumeration. Replies giving a mere list of persons, objects or details.
B. Description. Phrases indicating actions or characteristics.
C. Interpretation. Replies going beyond what is actually visible in the picture, and mentioning the situation or emotion it suggests. For interpretation, the average order of ease appears to be: (i) man and woman, (ii) convict, (iii) man and boy.
[Note on pictures: There can, I think, be little doubt that pictures better printed, larger, coloured … representing actions in progress … allowing children … would be much more appropriate than … engravings. Many investigators use pictures of their own. But the above alone have been standardised.]
The pictures associated with this are reproduced here from Burt (1923: 27, 49, 51).
In these two examples, we can identify a number of features from test specifications that we still use today. Firstly, it is stated what the item is intended to test. In the second example, we are informed what each picture should contain, which also implies what they should not contain. Each one should suggest a story. The three pictures should be
of slightly different difficulty, and be presented in order from the easiest to the most difficult. We note the reason for the selection of pictures: it is related to the expected response and scoring – or the evidence specification. The highest score is to be given to test takers who are able to infer context and meaning beyond what is in the picture.
Next in each example is the procedure. These are the instructions to the interlocutor/examiner about precisely how the test should be conducted in a standardised manner. There is also guidance on just how far the examiner may vary the questions to be asked, and the degree of repetition and encouragement that can be given. Today, the levels of encouragement that are allowed in this picture description test would probably not be allowed in high-stakes testing, as it would be disclosing to the test taker what the outcome of the test is likely to be.
Next comes information on what kind of responses are expected, and the range of responses that can count as evidence of understanding. This is part of what we would now call an evidence model, and in the picture description task we also have a scoring model that tells the examiner what kind of evidence counts towards a grade at a particular level, with ‘interpretation’ being the highest.
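To see how such a scoring model works as a rule, the recording procedure in Burt's picture description task can be sketched in a few lines of code. This is our own illustrative reading rather than anything Burt provides: the names are hypothetical, and because the original does not say what to do when all three responses differ, the sketch simply assumes that the earliest response among those tied is recorded.

```python
from collections import Counter
from enum import Enum


class ResponseType(Enum):
    ENUMERATION = "A"      # a mere list of persons, objects or details
    DESCRIPTION = "B"      # phrases indicating actions or characteristics
    INTERPRETATION = "C"   # goes beyond what is visible in the picture


def recorded_response_type(responses, first_is_doubtful=False):
    """Return the response type to record for one test taker.

    responses: ResponseType values in the order the pictures were shown.
    first_is_doubtful: the examiner's judgement about the first response.
    """
    if not responses:
        raise ValueError("at least one response is required")
    if not first_is_doubtful or len(responses) == 1:
        # Normal case: record the response to the first picture.
        return responses[0]
    counts = Counter(responses)
    highest = max(counts.values())
    # Assumption (not in the source): ties go to the earliest tied response.
    return next(r for r in responses if counts[r] == highest)


A, B, C = ResponseType.ENUMERATION, ResponseType.DESCRIPTION, ResponseType.INTERPRETATION
print(recorded_response_type([A, B, B], first_is_doubtful=True))  # ResponseType.DESCRIPTION
```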
Remarkably, in the picture description task we also have information on the presentation specification. Burt recommends that pictures should be large, in colour, contain children (the target population of this test) and be action scenes. It is clear from the note that users of this item specification put their own pictures in place of the ones provided.
This is an element of freedom (variation) in the specification, although it is noted that there is no evidence to suggest that the other pictures are in fact parallel with the ones provided in the sample; the only evidence for how this item type works comes from the particular pictures presented here. Like all specifications, this one suggests a piece of research that is needed to support the test development process: trying out the item with a range of possible pictures to see what content and physical properties are likely to be permissible without significantly changing the difficulty of the items produced from the specification. This kind of research is very much in evidence today. For example, when the TOEFL test was computerised, it became possible to place visual images on the screen during the listening test. Ginther (2001, 2002) studied the relationship between the type of visual and the type of listening stimulus and discovered that when pictures carried information that supported the listening text, test scores improved. We may speculate that the pictures act as an independent means of ‘knowledge activation’, that calls up in the mind of the test taker a schema that is relevant to the easier processing of the text before they hear it for the first time (Rost, 2002: 62–64). As Buck (2001: 20) argues, ‘schemata guide the interpretation of text, setting up expectations for people, places or events’. It should not be surprising that the extent to which a visual activates this knowledge accurately prior to listening would increase comprehension to some degree.
We can see that the concept of test specifications is not new. Specifications were originally conceived as design documents so that forms of a test would look as similar as possible, and work in the same way. The test is seen as a measuring device. As we saw in Chapter 2, reliable measuring instruments that produce the same results whenever and wherever they are used are essential to scientific progress. The test specification is part of the technology required to craft precision instruments that give the same measurement results. As the specifications evolve, the instruments themselves come into being. These are tested out, and sometimes we find that features of the instrument produce variability that we did not expect. The sources of variability are researched. If these prove to be part of what we wish to measure – the construct – the test specifications are changed to allow their continued presence in future versions. If they prove to be construct irrelevant they are a source of ‘error’, and the instrument needs to be redesigned to eliminate it. The test specifications are therefore changed to stop the further production of items with these features.
I have framed the last paragraph in terms of the classical view of test specifications.
Test specifications are still used in the same way today. However, there are other ways of looking at test specifications. Not surprisingly, these come from the criterion-referenced testing movement.