prototyping and piloting - Practical Language Testing

aside for cohesive groups to really take items and specifications apart in critical discussions. The purpose is to ensure that only robust items emerge from the process, for which there is wide agreement that the item type will not only work, but that it will elicit a response that provides valuable information on the construct of interest. Initial evaluation is undertaken by the test developers themselves, usually with help from other applied linguists or teachers with a range of experience in teaching and assessing.

In order to illustrate this I am going to use an example from a real test specification workshop, conducted for Oxford University Press, similar to the one described in Fulcher and Davidson (2007: 316–317). The context is the development of a computer-delivered placement test. The project brought together more than twenty experienced teachers and item writers. The workshop was divided into a number of stages, as follows:

Stage 1 Groups of teachers are formed and engage in an ice-breaking activity.

Stage 2 Review test constructs and create task/item specifications with sample items.

Stage 3 Groups swap sample items but not specifications. Each group attempts to reverse engineer the sample item from the other group.

Stage 4 Groups are given the original task/item specification and asked to critique the sample item in preparation for giving feedback to the group that designed the item.

Stage 5 Plenary session in which each group receives feedback on specifications and items, then responds to the critique.

Two concepts are critical to this process. The first is reverse engineering, and the second is item–spec congruence (or item–spec fit). We have already encountered reverse engineering in previous chapters; as a group evaluation technique it is a very powerful tool.

Although there are different types of reverse engineering (see Fulcher and Davidson, 2007: 57), the most common is critical reverse engineering, in which we take a sample item and analyse it to ask what it is testing, whether it is a useful item, and what problems we might face if we use it. The outcome may be to revise an item, or to abandon it completely. Item–spec congruence is particularly relevant to Stage 4. Here, a group sees the original item specification and checks to see whether the item could reasonably have been generated from the specification, and whether they have been able to reverse engineer the general description. The group has to consider whether an item and its specification are both congruent and useful.

We will begin by considering Stage 2 briefly. The participants had been asked to consult a range of sources on listening and reading constructs. These sources included books and articles, as well as models like those we have discussed in Chapter 5, including the CEFR and the Canadian Language Benchmarks. In Stage 2 the groups focused on which constructs would be most relevant for a placement test to be used in a language school that is following a particular syllabus with an associated set of materials. Many constructs were selected and agreed upon; one of these was ‘ability to identify facts in short, clear, simple messages and announcements’. One of the groups was given the task of designing a listening item type to test this construct.

Evaluating items, tasks and specifications 161

Here is the item that was produced by the design group. Remember that this item is to be presented on a computer, so answers require the manipulation of the mouse and keyboard.

Tapescript (from the teaching materials)

Woman: Yes?

Man: I’d like some information about the rock concert tonight.

Woman: Certainly. How can I help?

Man: Where is it on?

Woman: At the Regent Theatre in Bank Street.

Man: What time does it start?

Woman: At seven-thirty.

Man: And how much are the tickets?

Woman: Well, the ten-euro tickets are all sold – the only ones we have left are fifteen euros.

Man: That’s fine – I’ll have those.

Woman: How many would you like?

Man: Four, please.

Woman: We have four in the front row or in the middle of the theatre.

Man: I’ll take the ones in the middle, please. The front row will be too close to the stage.

We would make this an answerphone message or recorded announcement covering the message.

Click on the word or number which is not correct on each line. Key in the correct information in 1 to 6 below. You will hear the recording twice. You can key in your answer at any time. Once you have heard the recording, you will have 60 seconds to fill in your answer.

City Ticket Agency

0 Event Jazz concert

1 Place Regent Cinema

2 Address Bank Road

3 Time 8.30

Tickets bought

4 Price 10 euros

5 Number 5

6 Seat(s) row front

0 rock

1 ______

2 ______

3 ______

4 ______

5 ______

6 ______
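For readers who find it easier to think about an item as data, the following is a minimal sketch of how the sample item above might be represented for computer delivery. The answer key is inferred from the tapescript. The names `NoteLine`, `score_line` and `score_item`, and the rule that a point is awarded only when both the click and the typed correction are right, are my own assumptions for illustration; the design group did not state how the item would be scored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoteLine:
    """One line of the on-screen notes, containing exactly one error."""
    number: int
    label: str
    shown: str        # text displayed to the test taker
    wrong_word: str   # the word or number the test taker should click on
    correction: str   # what the recording actually says


# Lines 1-6 of the sample item; corrections inferred from the tapescript.
ITEM = [
    NoteLine(1, "Place", "Regent Cinema", "Cinema", "Theatre"),
    NoteLine(2, "Address", "Bank Road", "Road", "Street"),
    NoteLine(3, "Time", "8.30", "8.30", "7.30"),
    NoteLine(4, "Price", "10 euros", "10", "15"),
    NoteLine(5, "Number", "5", "5", "4"),
    NoteLine(6, "Seat(s)", "row front", "front", "middle"),
]


def score_line(line, clicked, typed):
    """Assumed rule: 1 point only if BOTH response steps are correct."""
    click_ok = clicked == line.wrong_word
    typed_ok = typed.strip().lower() == line.correction.lower()
    return int(click_ok and typed_ok)


def score_item(responses):
    """responses maps line number -> (clicked word, typed correction)."""
    return sum(
        score_line(line, *responses.get(line.number, ("", "")))
        for line in ITEM
    )
```

Even in this small sketch, each line requires two separate response steps before any credit is given, and the typed step is compared with the key character by character.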

You have already been told what this item is supposed to test. Remember that the evaluation group only had access to the item and nothing else during Stage 3 of the workshop.

Before we move on to discuss Stage 3, you may wish to spend some time writing your own critique of this sample item. You can then compare your own views with those of the group.

Rather than simply listing a set of questions or criticisms that came out of Stage 3, I am going to present the transcript of the discussion with annotations. The reason for this, following Davidson and Lynch (2002), is that item development and review must be seen as a collaborative group activity. Individuals do not always see problems with items or tests. The problems and solutions emerge in discussion and debate, and good specifications evolve in the process. The following transcript is not exact. I have not recorded all overlapping speech, or attempted to transcribe hesitations, false starts, and so on. At points the discussion drifted from topic, and I have removed those sections that were not directly relevant. Nevertheless, the transcript does accurately reflect what was said in the workshop. As you read through the transcript, consider what ideas are being generated, where agreement starts to form, and where disagreements remain. At various places there are observations within text boxes to bring out salient points.

For ease of reading, our four participants in the discussion are Angela, Bill, Carol and Dave, although these are not their real names. We join them in Stage 3, in which they have been asked to reverse engineer the specification for the task.

Angela: We would make this an answerphone recorded message covering the conversation the information

Bill: what’s it say?

Carol: Oh they’ve taken that dialogue, that’s interesting, isn’t it?

Angela: we would make this an answerphone recorded message covering the information

Bill: okay, yeah

Angela: click on the word or number which is not correct

Bill: just one that’s not correct

Angela: on each line

Bill: oh okay

Dave: hm hm and the lines are

Angela: Key in the correct information in one to six below

Dave: so we need an input box as well

Carol: alright then we need to see this really

Angela: click on the word or number which is not correct

Bill: okay so I think what you have to do is well that’s quite complicated isn’t it? I mean technically. But I imagine what you’re doing is in each line you have to kind of like select a word that’s not right so jazz is wrong it’s rock and you have to select ‘regent’ and change it to ‘odeon’ or something. That’s what it is isn’t it?

Carol: key in the correct information in one to six so alright

Bill: so the next one is theatre not cinema

Carol: so you click on that and then put in what it should be here. You will hear the recording twice. You can key in your answer at any time. Once you have heard the recording you will have sixty seconds to fill in your answers. It’s very complicated.

Bill: It’s very complicated it’s very complicated to achieve as well

Carol: why doesn’t it it’s more than yeah what it’s doing is actually very simple isn’t it? It’s just correct the answers

Bill: yeah

Carol: what’s the point of clicking on it and then typing the thing in why can’t you just

Angela: well exactly it seems to be a very long-winded way of just selecting the answers

Carol: you’ve actually got to hear it without any prompt haven’t you? So is it notes? Can I just have a look is it is it erm so presumably you get a few seconds to read it through the city ticket agency jazz concert I’d like some information about the rock concert tonight you select jazz

Bill: technically it’s quite difficult isn’t it because you’ve got to have selectable text and you’ve got to have

Carol: and something that you can key

Bill: key in boxes it’s technically quite hard

Carol: hm it seems uneconomical

Bill: procedurally very tough yeah

Carol: for what you’re getting out of it

Angela: so why don’t you just give them two words and they click on the right one?

Carol: hmmm

Bill: yes exactly yes

Dave: so you’ve got rock jazz and you just click on one of these

Angela: just click on it yes

Looking at a dialogue of an item review is fascinating from many points of view. In the preceding section the lexical cohesion between turns and the ways in which members contribute to building consensus are particularly interesting. You may wish to mark the text to show how this happens; for example, by highlighting each use of the words ‘complicated’, ‘complex’, ‘difficult’, ‘hard’ and ‘simple’. In the opening discussion the group has not attempted to identify the intended construct. Rather, they are clearly having difficulty understanding just what it is the test taker has to do in response to the item. In focusing on the response attribute they also become involved in the delivery specification. It seems to them that just producing this item in a computer environment is likely to be difficult. But the difficulty is not just in the technicalities, it is also in the ‘procedure’ that the test takers are being asked to follow, as Bill makes clear.

We rejoin the discussion, where it takes a very interesting turn.

Bill: because you wouldn’t have to write it out you might

Angela: because you would have a difficulty here you’ve got the spelling

Dave: so do you think then they’ve made this a spelling test? If they’re not giving you the answers written then it’s also a spelling test

Angela: yes but it’s also a hearing test

Dave: yes but it’s also but they are giving you the answers I think they are giving you the answers down the bottom

Carol: are they

Dave: one two three four five six will have alternatives won’t they

Bill: no they’re writing boxes they’re empty fields

Angela: you have sixty seconds to write in

Carol: where do so so you have to remember what they were I mean you can write this down or

Angela: it’s very confusing

Carol: because you may know what’s wrong but then you’ve got to remember what’s right and then you put the answers in and you can put them in any time

Angela: shall we try it? I’ll read it and you try doing it

Carol: yeah

Angela: first of all you have to identify I suppose you mark on there

Carol: I suppose you identify it on the first listening and answer on the second I think

At this point Angela attempts to read out the prompt and the dialogue while the others attempt to answer the item. However, we note that in this short period of discussion Angela and Carol have raised two serious problems with the item as their understanding of it develops. The first is that by getting the students to type the correct response into a box the item may be testing spelling. This is a computer-based test, and the computer will score the answers. But the item is clearly a listening item – the group appears to agree on this even though it hasn’t been explicitly stated. The second issue is more subtle. Carol has seen that typing in the correct answer can only occur after listening to the text all the way through. This means that the test takers must remember the correct answer for each incorrect word. The implication of this observation is that the item is likely to be sensitive to short-term memory capacity, and the test is not meant to be a memory test. The group has identified two potential threats to score meaning from construct-irrelevant variance.
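The spelling threat can be made concrete with a small sketch of how a computer might compare a typed answer with the key. Under an exact-match rule, a test taker who has clearly heard ‘theatre’ but types ‘theater’ receives no credit, so the score partly reflects spelling. A more tolerant rule is one possible response. Everything here is an illustrative assumption on my part, including the function names and the 0.8 similarity threshold; the workshop did not specify a scoring algorithm.

```python
from difflib import SequenceMatcher


def exact_match(typed, key):
    """Strict rule: credit only for a character-perfect answer."""
    return typed.strip().lower() == key.lower()


def tolerant_match(typed, key, threshold=0.8):
    """Accept near-miss spellings of word answers.

    The threshold is an arbitrary illustrative value, not an
    empirically derived cut-off.
    """
    typed = typed.strip().lower()
    key = key.lower()
    # Times, prices and quantities must be exact: '14' is not
    # a misspelling of '15'. Empty answers never earn credit.
    if not typed or any(ch.isdigit() for ch in key):
        return typed == key
    similarity = SequenceMatcher(None, typed, key).ratio()
    return similarity >= threshold
```

With these definitions, `exact_match("theater", "theatre")` is false while `tolerant_match("theater", "theatre")` is true, and an unrelated word such as ‘cinema’ is rejected by both. Any such tolerance would itself need to be written into the specification and tried out, since a threshold that is too generous could credit a different word altogether.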

We rejoin the conversation after the group has had the opportunity to do a ‘try-out’ of the item.

Carol: It’s all very complicated

Angela: I think it’s fine but I think it just needs simplifying

Bill: It’s a variation on a blank fill really isn’t it

Angela: yes

Bill: where you’ve got to identify which blanks which blanks to fill in

Dave: so general description this would be identifying specific information or is there a special phrase or

Angela: listening for specific information but it’s correcting wrong information

Bill: yes correcting year or just correcting information

Carol: they’re not even similar sounding words are they they’re just totally different things it’s not like

Dave: but are we supposed to use this tapescript here?

Bill: yes but

Carol: I’d like some information about the rock concert tonight

Angela: rock

Carol: jazz

Angela: so then you have to what do you do then?

Dave: well they say they’re going to make it a monologue aren’t they it’s going to be an answerphone message okay so they take that information about the concert and say ‘hi this is Bob. I wonder if you want to come to the jazz concert tonight it’s at the Leicester Square cinema’

Carol: Hang on a minute it’s an answerphone it’s a recorded answerphone message isn’t it

Angela: trouble is an illustrative item or task reflects the specification it’s difficult to do a listening without the script isn’t it

Bill: well we can imagine it’s not hard to imagine from that ‘yes hello this is the regent cinema tonight’s concert is a rock concert which starts at eight fifteen and the tickets are eleven euros fifty the ten euro tickets are all sold’

Angela: So it’s the ability to identify the wrong information and replace it with the correct information

In this section it is interesting to see that members of the group have made very different assumptions about what the answerphone message is. Dave assumes that it is an invitation left on the answerphone of the test taker, while Bill’s explanation is that it is a recorded message from the cinema. This appears to be ambiguous in the sample item because the designers have presented a dialogue and not rendered it in the genre required. Despite this serious problem, the group appear to agree about what the item is designed to test. Bill’s interpretation and the conclusion summarised by Angela at the end of this section show that they have managed to discern the intentions of the item designers.

The discussion drifts back to whether this item can really be called a ‘gap fill’ or a ‘cloze’, and they decide that it doesn’t really fit into either category. We return to the discussion as the group begins to look at the prompt attribute.

Angela: Where does that leave us then? Prompt attributes. So you need a recorded answering machine message of around you want a word limit

Dave: fifty words would be more than enough

Angela: in order to extract information thirty seconds is one hundred and ten words in listening

Dave: yes yes you’re right

Angela: so one hundred and ten to one hundred and twenty or something it’s difficult to get the exact number of words but you need a kind of parameter and that would be a short snippety listening

Bill: and then the students correct the notes

Angela: no I think you need to say students read the text and then listen

Dave: no they read the notes

Angela: well notes yeah the information in the task they need a ten second period they read that and they

Carol: listen

Angela: to the recorded message presumably as many times as they like

Carol: twice

Angela: does it say twice?

Carol: it says twice

Angela: I mean is the first listening to do the task and the second listening to check what they’ve done or

Carol: well surely the first one is to to identify the mistake and the second one to write the answer

Angela: but don’t you want to do that together

Carol: they can do it any time can’t they they can put it in any time

Angela: you will hear the recording twice and key in the answer any time once you have heard the recording you will have sixty seconds

Carol: that’s very difficult because presumably you have time to identify and then write it in so you’re going to have to remember what the answers are

Dave: six lines six questions there’s one question for each line so that makes it a bit easier so it’s not just a completely random set of notes so you know that in each line there’s one error that you have to correct

Carol: these would be better off next to

Angela: yeah they would I think it’s a bit of a layout problem

Carol: and you’ve only got two words to choose from haven’t you?

Dave: yeah you have

Carol: which word is wrong what should it have been and that’s it

Bill: once you get to the second part the first part ceases to be of any relevance

Dave: does clicking on the incorrect word have any purpose?

Carol: that’s what I’m wondering was it just for them to be able to remember what it was

Bill: it would make more sense to provide a task with the word underlined so they could listen for what the correct word is

It was at this point that the group was given a short break. They were then given the specification for the sample item. The specification was written using a Popham-style template, which we reproduce here.

Title of Specification: Listening Correction Task

General Description