2.9 Test development in Standards for educational and psychological testing

The Standards for educational and psychological testing (American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME) 1999:2) state that the purpose of the Standards is “to provide criteria for the evaluation of tests, testing practices, and the effects of test use.” The Standards is perhaps the best known and most referred-to set of criteria for evaluating educational and psychological tests. Although the document is American, it is well respected in other parts of the world as well. The current Standards is the sixth revised edition of guidelines for test construction and use from the three sponsoring organisations, the first having been produced separately by APA and AERA in the 1950s. The target audience of the Standards is all professional test developers. The document codifies a set of practices which the educational and psychological measurement community views as a desirable standard. The format of the standards is prescriptive, but there are no formal enforcement mechanisms; professional honour should compel test developers to follow them.

The new Standards for educational and psychological testing (AERA 1999) includes six chapters on test construction, evaluation, and documentation, one each for: validity; reliability; test development and revision; scales, norms, and score comparability; test administration, scoring, and reporting; and test documents. Each of the chapters first discusses general concerns related to its topic, and then presents and, where necessary, explains the standards related to it. The introductory texts have been expanded from previous versions; their purpose is to educate future test developers and users and help all readers understand the standards related to each topic. The first two chapters concern standards for the key measurement criteria in the evaluation of tests, validity and reliability, and the last four contain standards for the stages and products of the test development process. Furthermore, Part Two of the Standards includes four chapters on fairness issues. There is some overlap in the scope of the chapters, and the chapter on test development and revision states that “issues bearing on validity, reliability, and fairness are interwoven within the stages of test development” (AERA 1999:37).

2.9.1 View of test development

The Standards identifies four main steps in test development: “(a) delineation of the purpose(s) of the test and the scope of the construct or the extent of the domain to be measured; (b) development and evaluation of the test specifications; (c) development, field testing, evaluation, and selection of items and scoring guides and procedures; and (d) assembly and evaluation of the test for operational use” (AERA 1999:37). It also states that the development activities are not always sequential but that “there is often a subtle interplay” between the stages, so that the writing of items and scoring rubrics clarifies the definition of the construct. Furthermore, the Standards emphasizes the idea that the rationale for a test is strengthened when both logical/theoretical evidence in the form of the framework and empirical evidence from item development and test construction are available to support the interpretations of test scores (AERA 1999:41).

The aim of the first step of test development, according to the Standards (AERA 1999:37), is to extend the original statement of purpose into a detailed framework for the test to be developed. The framework “delineates the aspects (e.g., content, skills, processes, and diagnostic features) of the construct or domain to be measured”, and guides all subsequent test evaluation. The specifications, then, detail “the format of items, tasks, or questions; the response format or conditions for responding; and the type of scoring procedures. The test specifications may also include such factors as time restrictions, characteristics of the intended population of test takers, and procedures for administration” (AERA 1999:38). Specifications are written to guide all subsequent test development activities, and they should be written for all kinds of assessments, including portfolios and other performance assessments.

The Standards points out that specifications must define the nature of the items to be written in some detail, including the number of response alternatives to be included in selected-response items and explicit scoring criteria for constructed-response items (AERA 1999:38). The document identifies two main types of scoring for extended performances: analytic scoring, where performances are given a number of scores for different features in the performance as well as an overall score, and holistic scoring, where the same features might be observed but only one overall score is given. The readers are told that analytic scoring suits diagnostic assessment and the description of the strengths and weaknesses of learners, while holistic scoring is appropriate for purposes where an overall score is needed and for skills which consist of complex and highly interrelated subskills (AERA 1999:38-39).

The Standards states (AERA 1999:39) that when actual items and scoring rubrics begin to be written, a participatory approach may be used where practitioners or teachers are actively involved in the development work. The participants should be experts, however, in that they should be very familiar with the domain, able to apply the scoring rubrics, and know the characteristics of the target population of test takers. Experts may also be involved in item review procedures, which can be used in quality control in addition to pilot testing. Such review usually concerns content quality, clarity or lack of ambiguity, and possibly sensitivity to issues such as gender or cultural differences.

In the final step of initial test development, the items are assembled into test forms, or item pools are created for an adaptive test. Here, the responsibility of the test developer is to ensure that “the items selected for the test meet the requirements of the test specifications.” According to the Standards (1999:39), item selection may be guided by criteria such as content quality and scope, appropriateness for intended population, difficulty, and discrimination. Similarly, the test developer must make sure “that the scoring procedures are consistent with the purpose(s) of the test and facilitate meaningful score interpretation.”

On score interpretation, the Standards makes the point (AERA et al. 1999:39-40) that the nature of the intended score interpretation influences the range of criteria used in the selection of items for the test. If the score interpretation is to be norm-referenced, item difficulty, discrimination, and inter-item correlations may be particularly important, because the test needs to discriminate well among test takers at all points of the scale. If absolute, or criterion-referenced, score interpretations are intended, adequate representation of the relevant domain is very important “even if many of the items are relatively easy or nondiscriminating” (AERA 1999:40). If cut scores are needed in score interpretation, discrimination is particularly important around the cut scores.
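The item statistics mentioned above can be illustrated with a minimal sketch. It assumes dichotomously scored items (1 = correct, 0 = incorrect) and uses two classical indices that the Standards itself does not prescribe: the proportion of correct responses as the difficulty index, and the correlation between an item and the total score on the remaining items as the discrimination index. The function names `pearson` and `item_statistics` are hypothetical and chosen for illustration only.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        # The correlation is undefined when either variable is constant;
        # returning 0.0 here is a convention assumed for this sketch.
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_statistics(responses):
    """Return (difficulty, discrimination) for each item.

    responses: one list of 0/1 item scores per test taker,
    all lists covering the same items in the same order.
    """
    totals = [sum(r) for r in responses]
    stats = []
    for i in range(len(responses[0])):
        item = [r[i] for r in responses]
        # Difficulty: proportion of test takers answering correctly.
        difficulty = mean(item)
        # Discrimination: correlation of the item with the total score
        # on the remaining items (the item itself is excluded).
        rest = [t - x for t, x in zip(totals, item)]
        stats.append((difficulty, pearson(item, rest)))
    return stats


if __name__ == "__main__":
    # Made-up illustrative data: four test takers, three items.
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    for number, (p, r) in enumerate(item_statistics(data), start=1):
        print(f"item {number}: difficulty={p:.2f}, discrimination={r:.2f}")
```

For a norm-referenced interpretation, a developer might favour items with middling difficulty and high discrimination; for a criterion-referenced one, an easy or weakly discriminating item may still be kept because it represents the domain.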

The actual standards for good practice in test development (see AERA 1999:43-48) require careful documentation of all test development procedures. Test developers should document the test framework, specifications, assessment criteria, intended uses of the test, and the procedures used to develop and review these. Any trialling and standard setting activities should also be documented in detail. Assessors should be trained, and the training and qualification procedures documented; administration instructions should be clearly presented and justified; and the public should be informed about the nature of the test and its intended uses in sufficient detail to ensure appropriate use of the test.

The Standards chapter on scales, norms, and score comparability (AERA 1999:49-60) presents rationales and standards which are directly related to score interpretation and score use. The text is relevant for the test development process especially in the sense that the planning and gathering of evidence to create reporting scales for the test, establish norms, and decide on the mechanisms by which score comparability between forms will be ensured has to start during test development. The strategies planned will be implemented throughout the operational phase of the test because new items and test forms will always require scaling, calibration, and equating. The chapter also explains what the assessment system must be like to ensure that scores from different forms can be compared and equated. This is not possible if different versions measure different constructs, there are distinct differences in reliability or in overall test difficulty between forms, the time limits or other administration conditions are different between the different forms, or the test forms are designed to different specifications. Furthermore, the chapter advises the readers that the establishment of cut scores, i.e. points on the reporting scale which distinguish between different categories of ability, is always partly a matter of judgement. The procedures used to establish the cut scores during test development, and the qualifications of the people who take part in the procedure, should be carefully documented, so that the standard setting procedures can be reviewed and repeated if necessary.
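To make the idea of placing scores from two forms on a common scale concrete, the following is a minimal sketch of linear equating, one classical method among several; the Standards does not prescribe it. The sketch assumes that the two forms were taken by equivalent groups and that the comparability conditions listed above hold. The function name `linear_equate` and the score lists are hypothetical.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Express a Form X raw score on the Form Y scale.

    Linear equating matches the means and standard deviations of the
    two score distributions:
        y = mean_Y + (sd_Y / sd_X) * (x - mean_X)
    Assumes equivalent groups took the two forms.
    """
    mu_x, sd_x = mean(form_x_scores), pstdev(form_x_scores)
    mu_y, sd_y = mean(form_y_scores), pstdev(form_y_scores)
    if sd_x == 0:
        raise ValueError("Form X scores have no variance; cannot equate.")
    return mu_y + (sd_y / sd_x) * (x - mu_x)


if __name__ == "__main__":
    # Made-up illustrative scores: Form Y is uniformly five points easier.
    form_x = [10, 20, 30]
    form_y = [15, 25, 35]
    print(linear_equate(20, form_x, form_y))
```

In this made-up case the two distributions differ only in their means, so a Form X score of 20 corresponds to a Form Y score of 25. If the forms measured different constructs or differed markedly in reliability, such a conversion would not be meaningful, which is the point the chapter makes.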

For test administration and scoring, the Standards makes the point that the procedures for these activities given in the test documentation must be followed to ensure the usefulness and interpretability of the test scores (AERA 1999:61). If this is not done, the comparability of the scores and the fairness of the assessment system for individual test takers are endangered. This is also why it is important that the test documentation includes such instructions. The chapter on supporting documentation for tests (AERA 1999:67-70) lists the following features which a test’s documentation should specify: “the nature of the test; its intended use; the processes involved in the test’s development; technical information related to scoring, interpretation, and evidence of validity and reliability; scaling and norming if appropriate to the instrument; and guidelines for test administration and interpretation”. The documentation should be clear, complete, accurate, and current, and it should be “available to qualified individuals as appropriate.” Test users will need the documentation to evaluate the quality of the test and its appropriacy for their needs.

2.9.2 Principles and quality criteria

The Standards (1999) does not explicitly discuss principles and quality criteria for test development. The chapter on test development refers to validity, reliability, and fairness issues, and the standards for test development encourage careful documentation of all stages of test development, and the presentation of both theoretical rationales and empirical evidence to support cases for intended score interpretation and use. Validity is portrayed as the overarching concern in test development, focusing on score interpretations which are entailed by proposed uses of tests; reliability is related to consistency of measurement; and fairness in terms of test quality is envisioned to mean that the test should not contain deficiencies which cause the score interpretations to be different for identifiable groups of test takers, nor should the test documentation allow for the test to be administered or its scores used in such a way as to disadvantage identifiable groups of test takers.

2.9.3 View of validation

The Standards considers the validation process to be about “accumulating evidence to provide a sound scientific basis for the proposed score interpretations” (AERA et al. 1999:9). Validation is focused on score interpretations, not on test instruments, and when scores are used for more than one purpose, each of the intended interpretations requires its own validity case.

Validation and the documentation underlying test development are related in that validation should start from the test framework, which is one of the first documents that a test development body should draft. The test framework contains a definition of the construct, i.e., “the knowledge, skills, abilities, processes, or characteristics to be assessed. The framework indicates how this representation of the construct is to be distinguished from other constructs and how it should relate to other variables. The conceptual framework is partially shaped by the ways in which test scores will be used.” (AERA 1999:9.)

The aim of validation, according to the Standards (1999:9), is to provide “a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use.” Since the test’s framework is the starting point for validation, and the framework is influenced by the purpose of the test, test purpose has implications for test development and evaluation. Validation continues into the operational use of tests, and all evidence accumulated when a test is being offered is potentially relevant for old and new validity cases.

The Standards points out that validation involves careful attention to possible distortions of score meaning (AERA 1999:10). Such distortions may happen because the construct is inadequately represented by the test, or perhaps because some of the test methods turn out to have a significant effect on scores. “That is, the process of validation may lead to revisions in the test, the conceptual framework of the test, or both. The revised test would then need validation.”

The evidence that test development activities offer for validation is based on the documents related to test development, especially the test framework and specifications, and the items contained in test forms. The primary method of providing such data, as described in the Standards (1999:11-12), is expert judgement. Experts can be asked to analyse the relationship between a test’s content and the construct it is intended to measure, as defined in the test framework. If the test has been developed on the basis of a content domain specification, the items or score patterns can be judged against this document to assess how well each test form represents the specification. Similarly, experts can be asked to judge the quality and representativeness of items against specifications. Furthermore, as the Standards points out, expert panels can identify potential unfairness in the review of test construct or content domain definitions.

2.9.4 Distinctive characteristics of the text

The Standards for educational and psychological testing (AERA et al. 1999) provides thorough documentation of professional standards for test development. The language is exhortatory, but effort has clearly been made to make the wordings clear and the standards comprehensible. The introduction to each chapter supports the comprehensibility and makes the Standards educative reading. The target audience is professional test developers, and the expectation is that the measurement or assessment procedure to which the standards are applied is relatively formal and standardised, as can be gleaned from references to alternate forms, standardised administration procedures, and different kinds of supporting documentation.
