PILOT TESTING AND TEST CONSTRUCTION

Introduction

It is common practice to pilot sets of test items amongst members of the target population. This piloting forms a further empirical basis for accepting or rejecting items and provides general administrative information. In particular, ambiguous items need to be identified, as do negative discriminators, and the piloting should indicate which types of items, if any, are easiest and most difficult. This last is not for the purposes of exclusion or retention, but to give the researcher some idea of how hard or how easy any particular test he devises will be. This will be important in recommending its use. Administratively, it is necessary to determine the number of items which may be answered in the time period and "to discover weaknesses or needed improvements in the mechanics of test taking, in the directions to examiner and examinee, in the provisions for the responses, in the sample or fore-exercises, in the typographical format, and so forth" (Conrad, 1951, p. 251).

Such piloting should, of course, take place amongst the target population of the final test, and consideration must be given to the selection of such a sample. Criteria must be advanced for the rejection or amendment of an item, and means provided for repiloting amended items. Following on from this, it is possible to construct a functional reading test for evaluation.

Target Population

SOFRP aims to develop tests of occupational functional reading ability, for use with Fifth-Form pupils in their last year in Sheffield secondary schools. The exact target population is those Sheffield Fifth-Form pupils intending to leave at the age of 16+ years, although the tests should be valid for other cities or areas with similar industries. Unfortunately, it is difficult, if not impossible, to sort the leavers from the non-leavers in a Fifth-Form, as progression to Sixth-Form or Further Education is so often determined by examination results published in the summer following the leaving date. Also, from the fieldwork it was clear that those with several O-levels had no reading difficulties at work and would likely find a functional reading test very easy. It was felt, therefore, that more fruitful data might be obtained by piloting amongst the 'effective' target population: those of middle to low ability (i.e. those predominantly taking CSE examinations, or few examinations at all), rather than boring the higher ability pupils or subjecting remedial pupils to an ordeal. Further, it was reasoned that if the middle-low group answered an item correctly, so would the higher group; if incorrectly, so also would the lowest group. Much information was thus to be gained from the middle-low ability group.

For the purposes of the pilot, the target population was restricted to those predominantly taking CSE examinations or few examinations in the Fifth-Form of state schools in Sheffield Metropolitan District.

Sampling the Population

Piloting of test items need not involve a large sample, as it needs only a small set of responses from a typical group to indicate patterns of response, point out ambiguities etc. If one has a large number of items, however, one must test several groups, rather than have all subjects take every item: fatigue will play its part and is to be avoided.

Sheffield has thirty-eight schools (all comprehensive) with Fifth-Form pupils and, due to the pursuit of enlightened policies, catchment areas have been organised to attempt a mix of pupils in each school wherever possible. Inner city schools are thus not the blighted denizens of the working class, nor the suburban schools the privileged paradises of the middle-class. The system is not perfect, of course, but it does mean that any one of a number of schools can be selected as being fairly representative of the whole population. 'Fairly' is a judgemental term, but when such a judgement is left to one familiar with every school in the area, one can be reasonably sure of some accuracy. That person was a Senior Advisor in the Local Education Authority, who assisted in the arrangements for contacting each school used in the Project at this and later stages.

Just one school was used in the initial piloting, in an area of above average S.E.S. but with pupils from other areas attending. It was a school of good reputation but not the highest flyer. There were approximately 200+ to a year group, covering the range of ability, and it can be said that this was a fair sample of Sheffield Fifth-Formers. Only one school was used in order to minimise the amount of travelling and administration involved in visiting several schools.

The school was asked to provide, in the first instance, groups of about 10 to 15 pupils, of middle to low ability.

Assembly of Test Booklets

All items were assembled into six test booklets of twenty or twenty-one items each. All booklets had some action or completion items as a first section, preceded by an example and a page of introductory instructions. This was followed by a second section of multiple-choice items, again preceded by examples and instructions. One booklet also contained an orally administered item.

For each item, the reading passage preceded the question part of the item. The booklets were bound with title sheets with room for the testee's name and school's name.

Administration

Three members of Polytechnic staff, including the author, undertook the pilot testing, and common administration instructions were agreed. These included a common introduction to the tester and the nature and reasons for the testing. The instructions then continued with the reading out of the instruction page in the test booklet and instructions for the example. Similar instructions were given for the second section. In addition, each tester had a timing schedule on which to record, for each section, the administration time, the time after which two-thirds had finished and the time when all had completed.

The testers each tested two groups on the same morning in the pilot school. Each group had about ten pupils, assigned to each group in no particular order, and each group took a unique set of items. No school staff were present at the testing.

Criteria for Item Scrutiny

In a sense, item rejection, revision or retention is undertaken on a largely judgemental basis. Following the discussion of item analysis in Chapter 8, above, the usual item statistics are only useful if they have high values (restricted variability will lower the values: therefore a high value can be trusted to be no lower, but a low value cannot be trusted not to be higher). An 'inoperative' answer possibility may not need changing, for one is testing to see how well the testee can do, not to see how well one can lure him from the correct answer.

One or two things do have value, however. An incorrect answer possibility selected very frequently, particularly more often than the correct one, indicates an ambiguity, probably in the question stem or the answer possibility itself. A uniform spread of answers across the possibilities tends to indicate guesswork: the item may be either very difficult or ambiguous. A high discrimination index would be evidence for the former, a low index might indicate the latter; a negative value certainly would indicate the latter. A high rate of omission indicates that the testees are probably failing to understand the question part of the item. If the passage were misunderstood, one would expect some guessing or wrong answers, but not omission. The observant tester will also notice items upon which a great deal of time is spent by pupils.
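The scrutiny criteria above can be sketched as a simple screening routine. This is only an illustration, not the procedure used in the Project: the function names, the thresholds (`omit_threshold`, `spread_tolerance`) and the upper-lower form of the discrimination index are all assumptions introduced here, and any flag raised would still need the judgemental review described in this section.

```python
from collections import Counter


def discrimination_index(upper_correct, lower_correct, group_size):
    """Upper-lower index (assumed form): difference in the proportion
    answering correctly between equal-sized high and low scoring groups."""
    return (upper_correct - lower_correct) / group_size


def scrutinise_item(responses, key, options,
                    omit_threshold=0.25, spread_tolerance=0.10):
    """Return warning flags for one multiple-choice item.

    responses: one entry per testee; an option label, or None for an omission.
    key: the correct option. Thresholds are hypothetical illustration values.
    """
    flags = []
    n = len(responses)
    if n == 0:
        return flags

    counts = Counter(r for r in responses if r is not None)
    answered = sum(counts.values())
    omitted = n - answered

    # A wrong option chosen more often than the key suggests ambiguity.
    for option in options:
        if option != key and counts[option] > counts[key]:
            flags.append(
                f"option {option} chosen more often than the key: possible ambiguity"
            )

    # A near-uniform spread across all options suggests guesswork.
    if answered > 0:
        shares = [counts[o] / answered for o in options]
        if max(shares) - min(shares) <= spread_tolerance:
            flags.append("near-uniform spread of answers: possible guessing")

    # Many omissions suggest the question part is not being understood.
    if omitted / n >= omit_threshold:
        flags.append("high omission rate: question part may be unclear")

    return flags


if __name__ == "__main__":
    pilot = ["A", "B", "B", "B", "C", None, None, None, "B", "A"]
    for flag in scrutinise_item(pilot, key="A", options="ABCD"):
        print(flag)
    print(discrimination_index(upper_correct=8, lower_correct=3, group_size=10))
```

With groups of about ten pupils, as in the pilot, such counts are small, so the flags are prompts for inspection rather than grounds for rejection in themselves.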

Scrutiny of Items and Repiloting

All test items were subject to scrutiny on the basis of the criteria discussed above. Most defects were minor, as the items had already passed through content validation procedures designed to find ambiguities or other undue complexity. A number did, however, require revision of either the question stem or a specific answer possibility. A number of typographical errors were also discovered and corrected.

On a more general note, it was decided that, to reflect more accurately job-related reading tasks, the question should precede the reading passage. It was felt that a young person at work more often tends to go to a passage with specific purposes, rather than reading through it and then answering questions.

A number of items were deleted as inappropriate materials for the test, on the grounds that to remove the ambiguity or other defect would trivialise or destroy the item per se. Others were amended and these were repiloted in the same school. This repiloting also gave the opportunity to correct the details of administrative procedure and to check on the timing of the test. Time on the practice items had been found too lengthy and, also, the change to the question being placed first meant changes in procedure and explanation. Two groups of about thirty pupils each were given a new booklet each for this purpose, tested by the author and another tester in one morning.

Results of Piloting

In all, 173 items were successfully constructed, validated and piloted, although some were different versions of the same item.

It was found that, within the double school period allotted for testing (70 minutes), a maximum of 32 items, of mixed types, in two sections, could be answered. This figure might be exceeded if all items were multiple-choice, and might not be reached if all were action or completion items. It was decided that an absolute maximum of 35 items per double school period was likely, given time for introduction and administrative procedures.

In general, the administrative procedures were acceptable, given that the action and completion items were novel types to the pupils, who took a little time to get used to them. Although ideally it was desired to allow all pupils to finish, the need for separate administration instructions for different sections indicated that a time limit would need to be imposed on the first section, to ensure that pupils were able to continue to the second section without individual instruction. Pupils would have free licence to return to the earlier section if they desired, however.

Construction of a Functional Reading Test

With a large number of test items successfully validated and piloted, and with information available on the number of items which may be included in a test, the question of constructing a test of occupational functional reading ability arises. Such a test must aim to include a wide variety of items drawn from different jobs, of different types of content and involving different linguistic tasks. All of this must fit within the framework of the results of the SOFRP and the empirically determined criteria previously discussed.

Content of the Test

Of the six content-types derived from the Sticht classification, items were available for five ("Tables of Contents and Indexes" had no items constructed for it, as the questions tended to be more complex than the reading passage or too ambiguous, or the reading passages available were inappropriate for item construction). Of the seven job types (including "Unemployment and Job Seeking" in category 7), there were no items available for "Professional", for reasons previously discussed (Chapter 7). Of the four linguistic tasks, no items of the "Attitudinal" category survived piloting.

In order to have every combination of these categories, it would be necessary to have at least ninety items in the test (5 x 6 x 3 = 90). This was clearly impossible within the double school period time limit. It was decided to construct two tests, therefore: Form A, which would be as complete a sample as possible of the different combinations of job, content and linguistic task; and Form B, which would provide extra items to complement Form A in decision-making. That is, Form A would be a test in its own right, but optionally, more information could be obtained by also administering Form B. The combination of these (Form A and Form B) would give a more complete picture of pupil performance if necessary. Both tests would be timed to last a double school period each and contain roughly 30 items.
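The count of ninety can be checked by enumerating the combinations directly. The sketch below is illustrative only: the category lists are taken from the figures and discussion in this chapter, the sixth job type is assumed to be "Unemployment and Job Seeking" (the remaining job type once "Professional" is excluded), and the variable names are introduced here for the example.

```python
from itertools import product

# Five content-types with items available (after Sticht).
CONTENT_TYPES = [
    "Standards & Specification",
    "Identification & Physical Description",
    "Procedural Directions",
    "Procedural Checkpoints",
    "Functional Description",
]

# Six job types with items available ("Professional" excluded).
JOB_TYPES = [
    "Apprentices",
    "Clerical",
    "Distribution",
    "Operative/others",
    "Induction",
    "Unemployment and Job Seeking",  # assumed to be the sixth job type
]

# Three linguistic tasks surviving piloting ("Attitudinal" excluded).
LINGUISTIC_TASKS = ["Referential", "Regulative", "Definitional"]

# Absolute maximum number of items per double school period found in piloting.
MAX_ITEMS_PER_DOUBLE_PERIOD = 35

# One cell per combination of content-type, job type and linguistic task.
cells = list(product(CONTENT_TYPES, JOB_TYPES, LINGUISTIC_TASKS))
fits_in_one_period = len(cells) <= MAX_ITEMS_PER_DOUBLE_PERIOD

if __name__ == "__main__":
    print(len(cells), fits_in_one_period)
```

Even with a single item per cell, full coverage would need far more items than the 35 that one double period allows, which is the constraint behind the two-form design.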

Functional Reading Test, Form A

Form A consisted of 31 test items. No category of content, job or linguistic task was omitted except "Unemployed", and these categories formed subtests for more detailed analysis. The following figures (Figures 10.1 to 10.3) show the numbers of items in each category:

Job Category                Number of Items

Apprentices                       14
Clerical                           5
Distribution                       4
Operative/others                   2
Induction                          6

Total                             31

Figure 10.1: Items per Job Category, Form A

Content Category                          Number of Items

Standards & Specification                        5
Identification & Physical Description           10
Procedural Directions                            6
Procedural Checkpoints                           4
Functional Description                           6

Total                                           31

Figure 10.2: Items per Content Category, Form A

Linguistic Task             Number of Items

Referential                       17
Regulative                        13
Definitional                       1

Total                             31

Figure 10.3: Items per Linguistic Task, Form A

Form A is given as Appendix VII, below. It comprises six action items, one completion item and twenty-four multiple-choice items.

Job Category                Form B      Complete Long Form

Apprentices                    3               17
Clerical                       4                9
Distribution                  10               14
Operative/others               9               11
Induction                      4               10

Total                         30               61

Figure 10.4: Items per Job Category, Form B and Complete Long Form

Content Category                          Form B      Complete Long Form

Standards & Specification                   12               17
Identification & Physical Description       14               24
Procedural Directions                        3                9
Procedural Checkpoints                       0                4
Functional Description                       1                7

Total                                       30               61

Figure 10.5: Items per Content Category, Form B and Complete Long Form

Linguistic Task             Form B      Complete Long Form

Referential                   14               31
Regulative                     9               22
Definitional                   7                8

Total                         30               61

Figure 10.6: Items per Linguistic Task, Form B and Complete Long Form
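As a cross-check on Figures 10.1 to 10.6, the Form A and Form B counts should sum, category by category, to the Complete Long Form counts, and each classification should total 31, 30 and 61 items respectively. The sketch below is only an illustrative tally: the dictionary layout and the function name `check_tallies` are assumptions made for the example, and the Form A count for Referential items is taken as 17, the value implied by the Form A total of 31 and the Long Form figure of 31.

```python
FORM_A = {
    "job": {"Apprentices": 14, "Clerical": 5, "Distribution": 4,
            "Operative/others": 2, "Induction": 6},
    "content": {"Standards & Specification": 5,
                "Identification & Physical Description": 10,
                "Procedural Directions": 6,
                "Procedural Checkpoints": 4,
                "Functional Description": 6},
    # Referential = 17 is inferred from the totals (31 - 13 - 1).
    "task": {"Referential": 17, "Regulative": 13, "Definitional": 1},
}

FORM_B = {
    "job": {"Apprentices": 3, "Clerical": 4, "Distribution": 10,
            "Operative/others": 9, "Induction": 4},
    "content": {"Standards & Specification": 12,
                "Identification & Physical Description": 14,
                "Procedural Directions": 3,
                "Procedural Checkpoints": 0,
                "Functional Description": 1},
    "task": {"Referential": 14, "Regulative": 9, "Definitional": 7},
}

LONG_FORM = {
    "job": {"Apprentices": 17, "Clerical": 9, "Distribution": 14,
            "Operative/others": 11, "Induction": 10},
    "content": {"Standards & Specification": 17,
                "Identification & Physical Description": 24,
                "Procedural Directions": 9,
                "Procedural Checkpoints": 4,
                "Functional Description": 7},
    "task": {"Referential": 31, "Regulative": 22, "Definitional": 8},
}


def check_tallies(form_a, form_b, long_form, totals=(31, 30, 61)):
    """Return a list of inconsistencies between the three sets of counts."""
    problems = []
    for dimension, categories in long_form.items():
        # Each Long Form count should equal Form A plus Form B.
        for category, expected in categories.items():
            combined = form_a[dimension][category] + form_b[dimension][category]
            if combined != expected:
                problems.append(f"{dimension}/{category}: {combined} != {expected}")
        # Each classification should account for every item in each form.
        for form, total in zip((form_a, form_b, long_form), totals):
            if sum(form[dimension].values()) != total:
                problems.append(f"{dimension}: column does not sum to {total}")
    return problems


if __name__ == "__main__":
    print(check_tallies(FORM_A, FORM_B, LONG_FORM))
```

An empty list means the three sets of figures are mutually consistent under the assumed Referential count.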

CHAPTER 11
