Procedures and Materials
Standard 4.15
The directions for test administration should be presented with sufficient clarity so that it is possible for others to replicate the administration conditions under which the data on reliability, validity, and (where appropriate) norms were obtained. Allowable variations in administration procedures should be clearly described. The process for reviewing requests for additional testing variations should also be documented.
Comment: Because all people administering tests, including those in schools, industry, and clinics, need to follow test administration procedures carefully, it is essential that test administrators receive detailed instructions on test administration guidelines and procedures. Testing accommodations may be needed to allow accurate measurement of intended constructs for specific groups of test takers, such as individuals with disabilities and individuals whose native language is not English.
(See chap. 3, "Fairness in Testing.")
Standard 4.16
The instructions presented to test takers should contain sufficient detail so that test takers can respond to a task in the manner that the test developer intended. When appropriate, sample materials, practice or sample questions, criteria for scoring, and a representative item identified with each item format or major area in the test's classification or domain should be provided to the test takers prior to the administration of the test, or should be included in the testing material as part of the standard administration instructions.
Comment: For example, in a personality inventory the intent may be that test takers give the first response that occurs to them. Such an expectation should be made clear in the inventory directions.
As another example, in directions for interest or occupational inventories, it may be important to specify whether test takers are to mark the activities they would prefer under ideal conditions or whether they are to consider both their opportunity and their ability realistically.
Instructions and any practice materials should be available in formats that can be accessed by all test takers. For example, if a braille version of the test is provided, the instructions and any practice materials should also be provided in a form that can be accessed by students who take the braille version.
The extent and nature of practice materials and directions depend on expected levels of knowledge among test takers. For example, in using a novel test format, it may be very important to provide the test taker with a practice opportunity as part of the test administration. In some testing situations, it may be important for the instructions to address such matters as time limits and the effects that guessing has on test scores. If expansion or elaboration of the test instructions is permitted, the conditions under which this may be done should be stated clearly in the form of general rules and by giving representative examples. If no expansion or elaboration is to be permitted, this should be stated explicitly. Test developers should include guidance for dealing with typical questions from test takers. Test administrators should be instructed on how to deal with questions that may arise during the testing period.
Standard 4.17
If a test or part of a test is intended for research use only and is not distributed for operational use, statements to that effect should be displayed prominently on all relevant test administration and interpretation materials that are provided to the test user.
Comment: This standard refers to tests that are intended for research use only. It does not refer to standard test development functions that occur prior to the operational use of a test (e.g., item and form tryouts). There may be legal requirements to inform participants of how the test developer will use the data generated from the test, including the user's personally identifiable information, how that information will be protected, and with whom it might be shared.
Standard 4.18
Procedures for scoring and, if relevant, scoring criteria should be presented by the test developer with sufficient detail and clarity to maximize the accuracy of scoring. Instructions for using rating scales or for deriving scores obtained by coding, scaling, or classifying constructed responses should be clear. This is especially critical for extended-response items such as performance tasks, portfolios, and essays.
Comment: In scoring more complex responses, test developers must provide detailed rubrics and training in their use. Providing multiple examples of responses at each score level for use in training scorers and monitoring scoring consistency is also common practice, although these are typically added to scoring specifications during item development and tryouts. For monitoring scoring effectiveness, consistency criteria for qualifying scorers should be specified, as appropriate, along with procedures, such as double-scoring of some or all responses. As appropriate, test developers should specify selection criteria for scorers and procedures for training, qualifying, and monitoring scorers. If different groups of scorers are used with different administrations, procedures for checking the comparability of scores generated by the different groups should be specified and implemented.
Standard 4.19
When automated algorithms are to be used to score complex examinee responses, characteristics of responses at each score level should be documented along with the theoretical and empirical bases for the use of the algorithms.
Comment: Automated scoring algorithms should be supported by an articulation of the theoretical and methodological bases for their use that is sufficiently detailed to establish a rationale for linking the resulting test scores to the underlying construct of interest. In addition, the automated scoring algorithm should have empirical research support, such as agreement rates with human scorers, prior to operational use, as well as evidence that the scoring algorithms do not introduce systematic bias against some subgroups.
Because automated scoring algorithms are often considered proprietary, their developers are rarely willing to reveal scoring and weighting rules in public documentation. Also, in some cases, full disclosure of details of the scoring algorithm might result in coaching strategies that would increase scores without any real change in the construct(s) being assessed. In such cases, developers should describe the general characteristics of scoring algorithms. They may also have the algorithms reviewed by independent experts, under conditions of nondisclosure, and collect independent judgments of the extent to which the resulting scores will accurately implement intended scoring rubrics and be free from bias for intended examinee subpopulations.
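As an illustration of the kind of empirical support mentioned above, the following minimal sketch computes human-machine exact agreement and the mean machine-minus-human score difference separately for each subgroup. The function name, the record layout, the subgroup labels, and the scores are all hypothetical; the Standards do not prescribe any particular statistic, and an operational study would typically use larger samples and additional indices.

```python
from collections import defaultdict


def agreement_by_subgroup(records):
    """Summarize human-machine agreement for each subgroup.

    `records` is an iterable of (subgroup, human_score, machine_score)
    tuples. This layout is an assumption made for illustration only.
    """
    totals = defaultdict(lambda: {"n": 0, "exact": 0, "diff_sum": 0})
    for group, human, machine in records:
        t = totals[group]
        t["n"] += 1
        t["exact"] += int(human == machine)   # count exact matches
        t["diff_sum"] += machine - human      # signed difference flags direction of bias
    return {
        group: {
            "n": t["n"],
            "exact_agreement": t["exact"] / t["n"],
            "mean_machine_minus_human": t["diff_sum"] / t["n"],
        }
        for group, t in totals.items()
    }


# Hypothetical data: (subgroup, human score, machine score)
records = [
    ("group_a", 3, 3), ("group_a", 2, 2), ("group_a", 4, 3), ("group_a", 1, 1),
    ("group_b", 3, 2), ("group_b", 2, 2), ("group_b", 4, 3), ("group_b", 1, 0),
]
summary = agreement_by_subgroup(records)
for group, stats in sorted(summary.items()):
    print(group, stats)
```

In this made-up data the machine agrees with human scorers less often for group_b and scores that group lower on average, which is the kind of pattern that would prompt further investigation before operational use.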
Standard 4.20
The process for selecting, training, qualifying, and monitoring scorers should be specified by the test developer. The training materials, such as the scoring rubrics and examples of test takers' responses that illustrate the levels on the rubric score scale, and the procedures for training scorers should result in a degree of accuracy and agreement among scorers that allows the scores to be interpreted as originally intended by the test developer. Specifications should also describe processes for assessing scorer consistency and potential drift over time in raters' scoring.
Comment: To the extent possible, scoring processes and materials should anticipate issues that may arise during scoring. Training materials should address any common misconceptions about the rubrics used to describe score levels. When written text is being scored, it is common to include a set of prescored responses for use in training and for judging scoring accuracy. The basis for determining scoring consistency (e.g., percentage of exact agreement, percentage within one score point, or some other index of agreement) should be indicated. Information on scoring consistency is essential to estimating the precision of resulting scores.
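The two agreement indices named in the comment can be sketched as follows. This is a minimal illustration, assuming integer rubric scores from two scorers who rated the same responses; the function name and the scores are hypothetical.

```python
def agreement_rates(scores_a, scores_b):
    """Return (exact agreement, agreement within one score point) as proportions."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Both scorers must rate the same non-empty set of responses.")
    n = len(scores_a)
    pairs = list(zip(scores_a, scores_b))
    exact = sum(a == b for a, b in pairs)
    within_one = sum(abs(a - b) <= 1 for a, b in pairs)  # includes exact matches
    return exact / n, within_one / n


# Hypothetical rubric scores (0-4) from two scorers on eight responses
rater_1 = [3, 2, 4, 1, 3, 2, 4, 0]
rater_2 = [3, 3, 4, 1, 1, 2, 3, 0]

exact_rate, adjacent_rate = agreement_rates(rater_1, rater_2)
print(f"Exact agreement: {exact_rate:.1%}")        # 62.5%
print(f"Within one point: {adjacent_rate:.1%}")    # 87.5%
```

Neither index corrects for agreement expected by chance, which is one reason the comment also allows for "some other index of agreement."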
Standard 4.21
When test users are responsible for scoring and scoring requires scorer judgment, the test user is responsible for providing adequate training and instruction to the scorers and for examining scorer agreement and accuracy. The test developer should document the expected level of scorer agreement and accuracy and should provide as much technical guidance as possible to aid test users in satisfying this standard.
Comment: A common practice of test developers is to provide training materials (e.g., scoring rubrics, examples of test takers' responses at each score level) and procedures when scoring is done by test users and requires scorer judgment. Training provided to support local scoring should include standards for checking scorer accuracy during training and operational scoring. Training should also cover any special consideration for test-taker groups that might interact differently with the task to be scored.
Standard 4.22
Test developers should specify the procedures used to interpret test scores and, when appropriate, the normative or standardization samples or the criterion used.
Comment: Test specifications may indicate that the intended scores should be interpreted as indicating an absolute level of the construct being measured or as indicating standing on the construct relative to other examinees, or both. In absolute score interpretations, the score or average is assumed to reflect directly a level of competence or mastery in some defined criterion domain. In relative score interpretations the status of an individual (or group) is determined by comparing the score (or mean score) with the performance of others in one or more defined populations.
Tests designed to facilitate one type of interpretation may function less effectively for the other type of interpretation. Given appropriate test design and adequate supporting data, however, scores arising from norm-referenced testing programs may provide reasonable absolute score interpretations, and scores arising from criterion-referenced programs may provide reasonable relative score interpretations.
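The contrast between the two interpretations can be illustrated with a short sketch that reads the same raw score both ways. The norm sample, the cut score, and the function names are hypothetical, and the percentile rank uses the common midpoint convention (half of the tied scores count as below), which is only one of several definitions in use.

```python
def percentile_rank(score, norm_scores):
    """Relative interpretation: standing of `score` within a norm sample."""
    below = sum(s < score for s in norm_scores)
    equal = sum(s == score for s in norm_scores)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


def meets_criterion(score, cut_score):
    """Absolute interpretation: compare `score` with a fixed mastery cut score."""
    return score >= cut_score


# Hypothetical norm sample and cut score
norm_sample = [12, 15, 15, 18, 20, 22, 25, 27, 30, 34]
cut_score = 24
raw_score = 22

print(percentile_rank(raw_score, norm_sample))  # 55.0
print(meets_criterion(raw_score, cut_score))    # False
```

Here the same score sits slightly above the middle of the norm sample yet falls below the mastery cut, which shows why the intended interpretation and its reference sample or criterion need to be specified.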
Standard 4.23
When a test score is derived from the differential weighting of items or subscores, the test developer should document the rationale and process used to develop, review, and assign item weights.
When the item weights are obtained based on empirical data, the sample used for obtaining item weights should be representative of the population for which the test is intended and large enough to provide accurate estimates of optimal weights. When the item weights are obtained based on expert judgment, the qualifications of the judges should be documented.
Comment: Changes in the population of test takers, along with other changes, for example in instructions, training, or job requirements, may affect the original derived item weights, necessitating subsequent studies. In many cases, content areas are weighted by specifying a different number of items from different areas. The rationale for weighting the different content areas should also be documented and periodically reviewed.
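The two forms of weighting described above, explicit weights applied to subscores and implicit weights created by the number of items per content area, can be sketched as follows. All names, weights, and item counts are hypothetical and chosen only to make the arithmetic easy to follow; the sketch also assumes equally scored items, so that item counts translate directly into nominal weights.

```python
def weighted_composite(subscores, weights):
    """Combine subscores into one composite using explicit weights."""
    if set(subscores) != set(weights):
        raise ValueError("Subscores and weights must cover the same areas.")
    return sum(weights[area] * subscores[area] for area in subscores)


def implicit_weights_from_item_counts(item_counts):
    """Nominal weight of each content area implied by its share of items."""
    total = sum(item_counts.values())
    return {area: count / total for area, count in item_counts.items()}


# Hypothetical explicit weighting of three subscores
subscores = {"reading": 20, "writing": 10, "listening": 15}
weights = {"reading": 0.5, "writing": 0.25, "listening": 0.25}
print(weighted_composite(subscores, weights))  # 16.25

# Hypothetical test blueprint: number of items per content area
item_counts = {"algebra": 20, "geometry": 10, "statistics": 10}
print(implicit_weights_from_item_counts(item_counts))
```

Writing the implied weights out in this way makes it easier to document the rationale for them and to revisit them when the test-taker population or job requirements change.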