2.8 Options for educational test developers: Millman and Greene (1989)

Millman and Greene’s chapter on test development in the third edition of Educational Measurement (Linn (ed.) 1989) concerns the specification and development of tests of achievement and ability. The authors explicitly state that the chapter is aimed at professional test constructors, not classroom teachers (Millman and Greene 1989:335). Their goal is to discuss the range of different options available to test developers, rather than to give procedural guidelines for a standard test development process. It is this perspective of different purposes and options that motivates the inclusion of this text in the present overview. In procedural terms, the stages of test development that Millman and Greene cover are the same as those in the texts reviewed already.

2.8.1 View of test development

Millman and Greene’s (1989) discussion is organised according to logical steps in test development. They begin with test purposes, then they discuss the possible contents of test specifications, followed by concerns in item development, item evaluation and trialling, selection of items for potential inclusion in tests, and assembly of test forms. The authors emphasize that test planning is fundamentally iterative, so that the stages influence each other.

Millman and Greene (1989:335) point out that the “most important step in educational test development is to delineate the purpose of the test.” Their categorisation of purposes is different from many others, because it is not organised by the kinds of educational decisions that are to be made on the basis of the test, such as placement, diagnosis, and selection or initial evaluation, formative evaluation, and summative evaluation. Tests of achievement and ability are difficult to categorise according to such criteria, because while they should ostensibly belong to different categories, they share many functional purposes. Therefore, Millman and Greene (1989:336-337) categorise tests by the type of inference that will be made on the basis of the results. They distinguish between three domains and three types of inference. The domains are curricular, cognitive, and future criterion setting, and the types of inference are description of individual examinees’ attainments, mastery decisions, and description of performance for a group or system. The curricular domain is further subdivided into domain inferences before instruction, during instruction, and after instruction. Each of the cells in the ensuing matrix identifies a set of test purposes with similar characteristics, such as diagnosis (description of individuals’ attainments during instruction), program evaluation (description of performance for a group or system after instruction), certification (mastery decisions about a cognitive domain), or selection (mastery decision in relation to a future criterion setting). The authors’ point is that the types and domains of inferences have strong implications for features like test content and length, the kinds of items to be included, and criteria for evaluating items (p. 338). For instance, if inferences are drawn about individuals’ abilities, each test form has to be representative of the domain and comparable to other individuals’ test forms. If the inferences are closely related to an instructional program, the possible question types may be limited to those used in the instructional setting and the abilities and skills assessed to those of the curriculum objectives. If the inferences are related to a cognitive domain, individual instructional programs should not so strongly influence the definition of the skills assessed or the range of possible item types. Instead, such tests should be closely related to theoretical conceptualizations of mental abilities. If inferences do not concern individuals but instructional programs, comparability between different test forms answered by examinees is not an issue, but the study design as a whole should cover both content that the program focuses on and content in which it may be weaker.

Millman and Greene’s categorisation of test purposes is most useful for educational tests of achievement and ability. In the context of the present thesis, its benefit is the focus on the inferences drawn from the scores. Yet even in this categorisation, the tests that I will examine in Part Two of the thesis belong to two categories. Language tests used as admission criteria for university studies must be based on a theoretical conceptualization of the necessary language ability rather than on any curriculum specifications, but because of the selection function, they also refer to a future criterion setting. I will return to the issue of purpose in Part Two of the thesis.
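The purpose matrix described above can be sketched as a simple lookup table. The sketch below is illustrative only and is not taken from Millman and Greene: the label strings and the names `DOMAINS`, `INFERENCES`, `NAMED_PURPOSES` and `purpose_for` are assumptions made here, and only the four cells named in the summary above are filled in. The remaining cells are deliberately left empty rather than guessed.

```python
from itertools import product

# Five columns: the curricular domain split by timing, plus the two other domains.
DOMAINS = [
    "curricular: before instruction",
    "curricular: during instruction",
    "curricular: after instruction",
    "cognitive",
    "future criterion setting",
]

# Three rows: the types of inference.
INFERENCES = [
    "individual attainment",
    "mastery decision",
    "group or system performance",
]

# Only the cells named in the text above; labels are hypothetical shorthand.
NAMED_PURPOSES = {
    ("curricular: during instruction", "individual attainment"): "diagnosis",
    ("curricular: after instruction", "group or system performance"): "program evaluation",
    ("cognitive", "mastery decision"): "certification",
    ("future criterion setting", "mastery decision"): "selection",
}

# Full matrix: every (domain, inference) cell, with None where no purpose is named here.
MATRIX = {cell: NAMED_PURPOSES.get(cell) for cell in product(DOMAINS, INFERENCES)}


def purpose_for(domain, inference):
    """Return the named purpose for a cell, or None if the cell is not filled in."""
    if domain not in DOMAINS or inference not in INFERENCES:
        raise ValueError("unknown domain or inference type")
    return MATRIX[(domain, inference)]


if __name__ == "__main__":
    print(purpose_for("future criterion setting", "mastery decision"))
```

A university admission language test of the kind discussed above would, on this reading, need to be looked up under both the cognitive and the future criterion setting columns, which is the double membership noted in the text.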

Test specifications, according to Millman and Greene, should define test content, item types and psychometric characteristics, scoring criteria and procedures, and number of items to be developed (1989:338). Their discussion of alternatives for test content is thorough. It starts from the definition of the sources of test content, such as curricula or theories of ability, and the authors suggest that the content specification can be clarified especially through a characterisation of high performance in the domain being tested, for example through stating what experts can do compared with novices, or a characterisation of differences in strategies or knowledge structures between experts. The content definition should also make it clear whether the test construct is uni- or multidimensional, in correspondence with the curriculum or other source which the test is intended to operationalize. Furthermore, the authors point out that the content specification is influenced by the type of intended score interpretation, whether it will be domain- or norm-referenced. Domain-referenced scores require clear specification of each sub-component in the domain, whereas norm-referenced scores rather require a clear definition of the main component(s) of the construct assessed (Millman and Greene 1989:341-342). If both kinds of inferences will be made, both content definition concerns will need to be addressed in the specifications, and balances be struck between broad and detailed content domain specifications, discrimination and content validity as criteria for item selection, and the appropriate rules for distribution of test content within each test form.

Millman and Greene (1989:343-345) also discuss the specification alternatives for scoring at some length. They contrast the relative ease of right-wrong scoring decisions with the potential for more detailed feedback if partial credit scoring is used. They recommend partial credit scoring especially for situations where feedback is desired on examinees’ strengths and weaknesses. In terms of performance assessment, they discuss componential (or analytic) scoring, which they consider the most appropriate for multidimensional content specifications, and holistic scoring, which is the most suitable for unidimensional content. Both kinds of scoring require clear specification of proficiency at different levels, and careful development and quality assurance of the assessment process through the training of judges and regular monitoring of their work. They then discuss weighting, which they consider in the light of validly reflecting the content definition in the test. Weighting of content coverage, Millman and Greene explain, can be done by developing different numbers of items for different content areas according to their importance, and by applying weights that regulate how much importance individual items or groups of items have for the final score. This may require careful analysis of item statistics within content-motivated subsets of items to check that all relevant subsets contribute appropriately to the total score. Provision for such procedures should be made in the test specifications.
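The two weighting mechanisms described above, the number of items per content area and explicit weights applied to items or groups of items, can be illustrated with a minimal sketch. It is not Millman and Greene's procedure: the content areas, item identifiers, weights and function names are all hypothetical, and items are assumed to be scored right-wrong (0 or 1).

```python
def weighted_total(item_scores, item_area, area_weights):
    """Sum the item scores, each multiplied by the weight of its content area."""
    return sum(
        score * area_weights[item_area[item]]
        for item, score in item_scores.items()
    )


def max_share_by_area(item_area, area_weights, max_points=1):
    """Share of the maximum weighted total that each content area can contribute.

    The share depends on both the number of items in the area and its weight.
    """
    maxima = {}
    for item, area in item_area.items():
        maxima[area] = maxima.get(area, 0) + max_points * area_weights[area]
    total = sum(maxima.values())
    return {area: value / total for area, value in maxima.items()}


if __name__ == "__main__":
    # Hypothetical test form: three reading items and two writing items.
    item_area = {"q1": "reading", "q2": "reading", "q3": "reading",
                 "q4": "writing", "q5": "writing"}
    # Hypothetical weights: each writing item counts double.
    area_weights = {"reading": 1.0, "writing": 2.0}
    # One examinee's right-wrong scores.
    item_scores = {"q1": 1, "q2": 0, "q3": 1, "q4": 1, "q5": 0}

    print(weighted_total(item_scores, item_area, area_weights))
    print(max_share_by_area(item_area, area_weights))
```

In this made-up form, writing has fewer items than reading but can contribute more than half of the maximum score because of its weight. Checking such shares against the intended importance of each content area is one simple way to see whether the scoring reflects the content definition.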

In the area of item writing, Millman and Greene (1989:349-351) discuss the continuum from the freedom of creative artists operationalizing a theoretical construct to almost mechanical rule-governed item generation according to detailed specifications. They finish by defending fairly detailed and prescriptive instructions for item writers, because it is easier to specify the principles by which such items have been created, and thus analyse their content.

Millman and Greene (1989:354) divide item evaluation activities into two broad categories: those where the content and format of items are judged against a set of criteria, and those where examinee data from item tryouts is used to evaluate item performance. They advocate the use of both methodologies and the combination of their information when items are selected for operational tests and when test forms are constructed. They list item-content criteria as item accuracy and communicability; suitability of the item as judged against the content specification in terms of difficulty, importance and perceived bias; conformity to specifications; relevance to real-world tasks; and in educational contexts, opportunity to learn (Millman and Greene 1989:354-362). The criteria based on item response data that they discuss are item difficulty, discrimination, indexes based on subset-motivated patterns of item responses, and distractor analysis. They do not discuss Item Response Theory methods because these are discussed elsewhere in Educational Measurement (Linn (ed.) 1989). When introducing the criteria based on performance data, the authors discuss important considerations in item tryout design, namely how an appropriate sample of examinees is acquired, how the number of items to be trialled is determined, and how test developers can decide a strategy for item tryout that corresponds with the practical setting in which the test is being developed. The alternatives that they discuss are using experimental items as the operational test, embedding items within operational tests, and arranging a separate tryout. Each strategy has its advantages and drawbacks, which the test developers need to tackle when they know how trialling will be done in their case. In addition to main trials, Millman and Greene (1989:356) recommend a small-scale preliminary tryout before the main trials to weed out gross flaws in instructions and task wordings.
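Two of the response-data criteria mentioned above, item difficulty and discrimination, can be illustrated with a short sketch based on classical test theory. The particular indexes are assumptions made here, since the summary above does not say which formulas Millman and Greene prefer: difficulty is taken as the proportion of correct responses, and discrimination as the correlation between the item score and the rest score (the total score with the item itself removed). The tryout data and function names are invented for illustration.

```python
from statistics import mean, pstdev


def item_difficulty(responses, item):
    """Proportion of examinees who answered the item correctly."""
    return mean(row[item] for row in responses)


def item_discrimination(responses, item):
    """Pearson correlation between the item score and the rest score.

    The rest score is the examinee's total with this item removed, so the
    item is not correlated with itself. Returns NaN if either variable has
    no variance (for example, if every examinee answered the item correctly).
    """
    x = [row[item] for row in responses]
    y = [sum(row) - row[item] for row in responses]
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return float("nan")
    mx, my = mean(x), mean(y)
    covariance = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return covariance / (sx * sy)


if __name__ == "__main__":
    # Invented tryout data: one row per examinee, one 0/1 column per item.
    responses = [
        [1, 1, 1, 0],
        [1, 1, 0, 0],
        [1, 0, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 0],
    ]
    for item in range(4):
        print(item,
              item_difficulty(responses, item),
              item_discrimination(responses, item))
```

In the invented data, the last item is answered correctly mainly by examinees with low rest scores, so its discrimination is negative. In a real tryout, such an item would be a candidate for review against the item-content criteria before any decision to discard it, in line with the authors' advice to combine both kinds of information.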

2.8.2 Principles and quality criteria

In the introduction to their chapter, Millman and Greene state (1989:335) that although they “appreciate the importance of such factors as the cost, the consequences of an incorrect decision or inference, and the political and organizational milieu in which test planning and development take place”, they will confine themselves to technical matters in test development. They do not present the principles that they promote as an explicit list, but their discussion emphasizes good planning, coherence and quality assurance, especially through validity and reliability.

2.8.3 View of validation

Millman and Greene (1989) do not explicitly discuss validation in their chapter. However, throughout their text they treat validity as one of the criteria guiding test development. This is particularly apparent in their treatment of test specifications and through these, all concerns in test development which are related to the construct to be assessed. They state that “the major function of [test specifications] is, quite simply, to enhance the ultimate validity of test-score inferences. Derived directly from the designated purpose of the test, the specification of test attributes provides a guide to subsequent item development, tryout, evaluation, selection, and assembly. This direct grounding of developmental activities in test purpose helps to insure the congruence between intended and actual test-score inferences and, thus, the validity of the latter.” (Millman and Greene 1989:338.) They particularly use validity as a criterion in discussing definitions of test content, which for them encompasses construct definition, in judging item quality against content specifications, and in analyzing the appropriate weighting of each content area for scoring the test.

2.8.4 Distinctive characteristics of the text

Millman and Greene’s chapter discusses the traditional phases of test development in the context of educational testing of achievement and ability. Its specific feature is the discussion of the alternatives that test developers have at each stage: sometimes a range that all test developers have to choose from, sometimes the practicalities of how different decision-making purposes influence the activities undertaken at the same stage. None of the alternatives is perfect, but when a range of them is presented, it is easy for test developers to compare benefits and drawbacks. Another distinctive feature of Millman and Greene’s chapter is its emphasis on the content/construct definition as a core for the whole test development process.

2.9 Test development in Standards for educational and

In document UNIVERSITY OF JYVÄSKYLÄ Centre for Applied Language Studies (Page 51-56)