In our discussion of criterion-referenced testing in Chapter 3, we saw that one of the key principles of the paradigm was the link between the assessment and the domain in the real world to which inferences were to be made. The movement that Glaser (1963) began was really concerned with linking assessment to teaching so that outcomes could be described and measured. Hambleton (1994: 23) puts this most accurately:
One of the most important contributions of criterion-referenced measurement to testing practice was the central focus it placed on describing the intended outcomes of instruction – that is, the objectives. Requiring teachers and/or test developers to describe clearly the knowledge and skills to be tested provides the framework needed to write valid test items, to evaluate item–objective congruence, and to enhance the quality of test score interpretations.
The phrase ‘item–objective congruence’ refers to the relationship between the item or task and the learning objective that it is designed to test. As the objective is defined in the item specification, this is now more frequently referred to as ‘item–spec congruence’ or ‘item/task fit-to-spec’ (Davidson and Lynch, 2002: 44–45). The point being made is that specifications make us think, as teachers, very carefully about what it is we think the object of a learning activity is. When we ask our students to do this task, what knowledge, skill or ability do we think it is helping them to acquire, and why? The specification forces the language test designer to be explicit about the reason for the use of the item and what it is the item is intended to test. If we consider Figure 4.1 once more, we can see that after the test is assembled into its final form comes ‘inferences’ – the inferences we wish to make from the outcome of the test to what the learner knows or can do.
A good specification is the explicit statement of the rationale for an inference to be made from successful performance on the item or task to the construct, and from the construct to the criterion. This is what Hambleton means in the rather terse ending: ‘to enhance the quality of test score interpretations’.
Popham and Husek (1969: 3) were the first to see that this link, as expressed in the test specification, would be a major consideration in assessing the validity of a criterion-referenced test: ‘The meaning of the score … flows directly from the connection between the items and the criterion.’ Later, Popham (1994: 16) claimed that ‘the increased clarity attributed to criterion-referenced tests was derived from the test-item specifications that were generated in order to guide item writers’. We have already seen that test and item specifications were used well before the advent of the criterion-referenced testing movement. Nevertheless, Popham was largely responsible for making the value of specifications so evident to both teachers and testers. Popham’s (1978) classic test specification format is reproduced in Figure 5.2, as it appears in Davidson and Lynch (2002: 14).
Specification Number: Provide a short index number.
Title of Specification: A short title should be given that generally characterizes each spec. The title is a good way to outline skills across several specifications.
Related Specification(s), if any: List the numbers and/or titles of specs related to this one, if any. For example, in a reading test separate detailed specifications would be given for the passage and for each item.
(1) General Description (GD): A brief general statement of the behaviour to be tested. The GD is very similar to the core of a learning objective. The purpose of testing this skill may also be stated in the GD. The wording of this does not need to follow strict instructional objective guidelines.
(2) Prompt Attributes (PA): A complete and detailed description of what the student will encounter.
(3) Response Attributes (RA): A complete and detailed description of the way in which the student will provide the answer; that is, a complete and detailed description of what the student will do in response to the prompt and what will constitute a failure or success. There are two basic types of RAs:
    a. Selected Response (note that the choices must be randomly rearranged later in test development): Clear and detailed descriptions of each choice in a multiple-choice format.
    b. Constructed Response: A clear and detailed description of the type of response the student will perform, including the criteria for evaluating or rating the response.
(4) Sample Item (SI): An illustrative item or task that reflects this specification, that is, the sort of item or task this specification should generate.
(5) Specification Supplement (SS): A detailed explanation of any additional information needed to construct items for a given spec. In grammar tests, for example, it is often necessary to specify the precise grammar forms tested. In a vocabulary specification, a list of testable words might be given. A reading specification might list in its supplement the textbooks from which reading test passages may be drawn.
Fig. 5.2. Popham’s (1978) five-component test specification format
The use of a specification template like Popham’s is beneficial for teachers and test designers alike, in achieving clarity of purpose in testing. Even though this was developed in the 1970s, it is amazing just how well the sections of the picture description from Burt (1922, 1923) would fit into the template. The general description contains the construct or target behaviour that the task or item is intended to test. If used in classroom assessment, or to describe tasks that the language teacher is developing for learning purposes, the general description can be used to link the task type into the syllabus. This is particularly useful if a team of teachers is going to be generating a range of similar tasks to articulate a spiral syllabus.
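For teams that keep their specifications in electronic form, the template in Figure 5.2 can be treated as a structured record. The sketch below is one possible representation in Python and is not part of Popham's format. The class name `Specification`, the field names, and the decision to treat the supplement and related specifications as optional are all our own assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Specification:
    """One test specification, laid out after the five components in Figure 5.2.

    Field names are hypothetical; only the component labels come from the figure.
    """
    number: str                   # short index number
    title: str                    # short title characterizing the spec
    general_description: str      # (1) GD: behaviour to be tested
    prompt_attributes: str        # (2) PA: what the student will encounter
    response_attributes: str      # (3) RA: how the student responds, and what counts as success
    sample_items: List[str]       # (4) SI: illustrative items or tasks
    supplement: str = ""          # (5) SS: extra information for item writers (assumed optional)
    related_specs: List[str] = field(default_factory=list)

    def missing_components(self) -> List[str]:
        """Return the names of the core components that are still empty."""
        core = {
            "general_description": self.general_description,
            "prompt_attributes": self.prompt_attributes,
            "response_attributes": self.response_attributes,
            "sample_items": self.sample_items,
        }
        return [name for name, value in core.items() if not value]


# A draft spec with no sample item yet (all content here is placeholder text).
draft = Specification(
    number="1",
    title="Placeholder title",
    general_description="Placeholder statement of the behaviour to be tested.",
    prompt_attributes="Placeholder description of the input.",
    response_attributes="Placeholder description of the expected response.",
    sample_items=[],
)
print(draft.missing_components())  # ['sample_items']
```

A record like this makes it easy for a team to see at a glance which parts of a draft specification still need to be written before item writing can begin.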
The prompt attribute defines what instructions the test taker will be given, and what kind of input is required to generate the required response. It is important that all instructions can be understood by the test takers in the way intended by the test designers. Any source of potential misunderstanding needs to be ironed out well before any items that come from the specification are used operationally. The prompt attribute may also contain information relating to the source and difficulty of input materials, such as reading or listening texts. The text types, ranges and genres may be specified in order to link them directly to a criterion context. In performance tests we may specify who the interlocutors may be, and how they are to conduct the test. Next comes the response attribute, which describes precisely what the test taker is expected to do in their response to the prompt. This may be as simple as selecting the ‘correct’ answer from a selection of four options, or specifying the expected nature of an extended piece of writing, or production of extended speech.
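Figure 5.2 notes that, for selected response items, the choices must be randomly rearranged later in test development. A minimal sketch of that step is given below. The function name `shuffle_options` and the placeholder options are hypothetical; the only requirement taken from the figure is that the order is randomized, and the sketch adds the practical detail that the position of the key must be tracked through the shuffle.

```python
import random
from typing import List, Optional, Tuple


def shuffle_options(
    options: List[str],
    key_index: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Return the options in random order, plus the new position of the key."""
    rng = rng or random.Random()
    order = list(range(len(options)))   # original positions 0..n-1
    rng.shuffle(order)                  # random permutation of the positions
    shuffled = [options[i] for i in order]
    new_key_index = order.index(key_index)
    return shuffled, new_key_index


# Placeholder options; the key was written first by the item writer.
options = ["key", "distracter one", "distracter two", "distracter three"]
shuffled, new_key = shuffle_options(options, key_index=0)
print(shuffled[new_key])  # key
```

Passing a seeded `random.Random` instance makes the rearrangement reproducible, which may be useful if a test form has to be rebuilt later.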
All specifications contain sample items that illustrate what is intended by the specification. Sometimes ‘anti-items’ are also contained in the specifications to show what is not intended. In Chapter 4, Section 4, we presented a number of items from Davidson and Fulcher (2007). Here is the specification for these multiple-choice items designed to test understanding of service encounters. You will note that all the features from the Popham specification are included under the heading ‘Guiding Language’.
Version 0.25 of the CEFR A1 Service Encounter Spec
Note: in the sample items, an asterisk (‘*’) indicates the intended correct choice, or ‘key’.
Guiding language:
At the lowest level of the CEFR, simple transactions are mastered. These transactions share linguistic features, which are assessed by tasks generated by this spec. Transactions typically tested at this level include:
‘Can ask people for things and give people things’
‘Can handle numbers, quantities, cost, and time’
Tasks should focus on basic language constructions common to these transactions.
Because this is a lower level on the CEFR, we envision (a) an objectively keyed test, and (b) one in which the response is a selection (on a paper or computer screen).
The oral stimuli are presented in recorded formats, on a tape recorder or by digital playback. The examinee is instructed to pick the best response from among the four alternatives shown in each test item.
Each task should have a single target focus that reflects simple question construction about matters of quantity, time, cost, and so forth. Syntactic complexity of the prompts is permitted, provided that such complexity does not draw focus away from the target forms on which the multiple-choice task depends. The idea here is to focus the test taker on to the meaning-laden target components of the transaction. It is assumed that the particular format of the question is not as relevant as listening for the key details of time, quantity, etc.
Both acceptable and unacceptable tasks are illustrated in this spec. Transactions of multiple turns are not acceptable. Also not acceptable are turns that have many utterances or complex embedded syntax that prevents listening for the target constructions.
Distracters are permitted that test rapport. Consider the alternative version of Sample Task One. Note the change to (d) in which a somewhat more rude response is presented – while technically accurate in terms of focused listening, the more-rude choice (d) violates an expectation of politeness for the encounter, and it is therefore considered to be a wrong response.
(From Davidson and Fulcher, 2007: 239–240)
The sample anti-items are as follows:
[Unacceptable sample 1 – multiple turns]
[The examinee hears:]
Voice 1: Can I buy some apples?
Voice 2: Yes, happy to help.
Voice 1: These over here look good.
Voice 2: Yes, those are nice. They are two for 75p.
[The examinee sees:]
What comes next?
a) How much are they?
b) How much are two?
c) Thank you. I’ll buy two. *
d) Thank you. How much?
[Unacceptable sample 2 – complex syntax]
[The examinee hears:]
Voice 1: I am not satisfied with the calculations you’ve produced for us. It seems to me that the total invoiced price should not exceed the average invoice in our audit from last year. What did we figure wrong?
Voice 2: I don’t know. The numbers in this spreadsheet ring false to me, as well.
[The examinee sees:]
What comes next?
a) The figures seem satisfactory to me.
b) Everything seems OK, so far as my number-crunching takes me.
c) Perhaps we ought to crunch the numbers again. *
d) Can we put the numbers into a spreadsheet and figure out what’s wrong?
We can see that anti-items give clear indications to item writers what they should avoid producing. When specifications go into operation, test developers can monitor the kinds of items that item writers produce. When items that they had not envisaged in the specifications are created, the specifications can be updated to exclude these items, and the samples included in the anti-item list.
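Some of this monitoring can be partly automated. The sketch below screens a drafted listening item against constraints of the kind stated in the service encounter spec: four alternatives, a single key, no multi-turn transactions, and no overlong turns. It is an illustration only. The names `ListeningItem` and `screen_item`, the cut-off of two turns, and the limit of fifteen words per turn are our assumptions, and a word count is at best a crude stand-in for ‘complex embedded syntax’, which still needs human judgement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListeningItem:
    turns: List[str]        # what the examinee hears, one string per speaker turn
    options: List[str]      # what the examinee sees
    key_indexes: List[int]  # positions of the options marked with an asterisk


def screen_item(
    item: ListeningItem,
    max_turns: int = 2,            # assumed cut-off, not stated in the spec
    max_words_per_turn: int = 15,  # assumed proxy for syntactic complexity
    n_options: int = 4,            # the spec refers to four alternatives
) -> List[str]:
    """Return a list of reasons why the item may not fit the spec (empty if none)."""
    problems = []
    if len(item.turns) > max_turns:
        problems.append("multiple turns")
    if any(len(turn.split()) > max_words_per_turn for turn in item.turns):
        problems.append("turn too long")
    if len(item.options) != n_options:
        problems.append("wrong number of options")
    if len(item.key_indexes) != 1:
        problems.append("must have exactly one key")
    return problems


# The two unacceptable samples from the spec, entered as data.
sample_1 = ListeningItem(
    turns=[
        "Can I buy some apples?",
        "Yes, happy to help.",
        "These over here look good.",
        "Yes, those are nice. They are two for 75p.",
    ],
    options=["How much are they?", "How much are two?",
             "Thank you. I'll buy two.", "Thank you. How much?"],
    key_indexes=[2],
)
sample_2 = ListeningItem(
    turns=[
        "I am not satisfied with the calculations you've produced for us. "
        "It seems to me that the total invoiced price should not exceed the "
        "average invoice in our audit from last year. What did we figure wrong?",
        "I don't know. The numbers in this spreadsheet ring false to me, as well.",
    ],
    options=["The figures seem satisfactory to me.",
             "Everything seems OK, so far as my number-crunching takes me.",
             "Perhaps we ought to crunch the numbers again.",
             "Can we put the numbers into a spreadsheet and figure out what's wrong?"],
    key_indexes=[2],
)
print(screen_item(sample_1))  # ['multiple turns']
print(screen_item(sample_2))  # ['turn too long']
```

A screen of this kind can only flag candidates for review. Whether a flagged item is excluded, or whether the specification itself should be revised, remains a decision for the test developers.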
Finally, a specification supplement may be added. This includes any additional information that would help the task/item writer to create parallel items. In a speaking test this may include an interlocutor frame. Milanovic et al. (1996: 17) describe a frame in this way:
The interlocutor is provided with a frame of topics and questions to be dealt with – the interlocutor frame. S/he is expected to follow this frame closely for all candidates, although, clearly the nature of the interaction is influenced by several factors such as background, personality and competence of the candidate. The range of topics covered in this phase include:
greetings and introductions;
giving information about self related to current job or status;
work and travel;
a built-in topic switch future career prospects interests;
closing exchanges.
This spells out in much greater detail precisely how the speaking test is to be conducted, amplifying what may occur in the prompt attribute. In a writing test, the supplement may give additional information about the nature of the intended audience or the functions that might be covered, such as complaining or inviting. Again, these details may be related directly to the criterion domain of interest.
To conclude this section, we return to the most important argument. Only by using specifications can we generate large numbers of items or tasks that are parallel for use in multiple test forms. The specifications are also the focal point for elaborating an argument that shows how the test items are directly related to test constructs. However, in creating tasks for classroom assessment and for classroom activities, specification creation also serves an important role. The specification can be a focal point for teacher collaboration in defining what it is that is being taught and learned. Teachers can use the specifications to create multiple tasks in teams that can be used in delivering a spiral curriculum that offers multiple opportunities for learning.