
Despite all the care that is taken in assessment design to ensure that the developed tasks measure the intended content and skills, it is still necessary to evaluate empirically whether the inferences drawn from the assessment results are valid. Validity refers to the extent to which assessment tasks measure the skills that they are intended to measure (see, e.g., Kane, 2006, 2013; Messick, 1993; National Research Council, 2001, 2006). More formally, “Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on the test” (Messick, 1989, p. 13). Validation involves evaluation of the proposed interpretations and uses of the assessment results, using different kinds of evidence, evidence that is rational and empirical and both qualitative and quantitative. For the examples discussed in this report, validation would include analysis of the processes and theory used to design and develop the assessment, evidence that the respondents were indeed thinking in the ways envisaged in that theory, the internal structure of the assessment, the relationships between results and other outcome measures, whether the consequences of using the assessment results were as expected, and other studies designed to examine the extent to which the intended interpretations of assessment results are fair, justifiable, and appropriate for a given purpose (see American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999).

FIGURE 3-19 Learning progression for analyzing and interpreting data.

NOTES: Cha = chance, CoS = conceptions of statistics, DaD = data display, InI = informal inference, MoV = models of variability, MRC = meta-representational competence, ToM = theory of measurement. See text for discussion.

Evidence of validity is typically collected once a preliminary set of tasks and corresponding scoring rubrics have been developed. Traditionally, validity concerns associated with achievement tests have focused on test content, that is, the degree to which the test samples the subject matter domain about which inferences are to be drawn. This sort of validity is confirmed through evaluation of the alignment between the content of the assessment tasks and the subject-matter framework, in this case, the NGSS.

Measurement experts increasingly agree that traditional external forms of validation, which emphasize consistency with other measures, as well as the search for indirect indicators that can show this consistency statistically, should be supplemented with evidence of the cognitive and substantive aspects of validity (Linn et al., 1991; Messick, 1993). That is, the trustworthiness of the interpretation of test scores should rest in part on empirical evidence that the assessment tasks actually reflect the intended cognitive processes. There are few alternative measures that assess the three-dimensional science learning described in the NGSS and hence could be used to evaluate consistency, so the empirical validity evidence will be especially important for the new assessments that states will be developing as part of their implementation of the NGSS.

Examining the processes that students use as they perform an assessment task is one way to evaluate whether the tasks are functioning as intended, another important component of validity. One method for doing this is called protocol analysis (or cognitive labs), in which students are asked to think aloud as they solve problems or to describe retrospectively how they solved the problem (Ericsson and Simon, 1984). Another method is called analysis of reasons, in which students are asked to provide rationales for their responses to the tasks. A third method, analysis of errors, is a process of drawing inferences about students’ processes from incorrect procedures, concepts, or representations of the problems (National Research Council, 2001).

The empirical evidence used to investigate the extent to which the various components of an assessment actually perform together in the way they were designed to is referred to collectively as evidence based on the internal structure of the test (see American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999). For instance, in our example of measuring silkworm larvae growth, one form of evidence based on internal structure would be the match between the hypothesized levels of the construct maps and the empirical difficulty order shown in the measurement map in Figure 3-15 above.
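One simple way to quantify the match between hypothesized levels and empirical difficulty order is a rank correlation. The sketch below is illustrative only: the report does not prescribe a statistic, the choice of Spearman's rho is an assumption, and the item levels and difficulty estimates are invented for the example rather than taken from Figure 3-15.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: the construct-map level each item was written to target,
# and that item's empirically estimated difficulty (for example, in logits).
hypothesized_level = [1, 1, 2, 2, 3, 3, 4]
empirical_difficulty = [-1.8, -1.2, -0.6, -0.9, 0.4, 0.1, 1.5]

rho = spearman_rho(hypothesized_level, empirical_difficulty)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 would indicate that items written for higher levels tend to be empirically harder, which is consistent with the hypothesized ordering. A low or negative value would suggest that the construct map or the items need to be revisited.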

One critical aspect of validity is fairness. An assessment is considered fair if test takers can demonstrate their proficiency in the targeted content and skills without other, irrelevant factors interfering with their performance. Many attributes of test items can contribute to what measurement experts refer to as construct-irrelevant variance, which occurs when the test questions require skills that are not the focus of the assessment. For instance, an assessment that is intended to measure a certain science practice may include a lengthy reading passage. Besides assessing skill in the particular practice, the question will also require a certain level of reading skill. Assessment respondents who do not have sufficient reading skills will not be able to accurately demonstrate their proficiency with the targeted science skills. Similarly, respondents who do not have a sufficient command of the language in which an assessment is presented will not be able to demonstrate their proficiency in the science skills that are the focus of the assessment. Attempting to increase fairness can be difficult, however, and can create additional problems. For example, assessment tasks that minimize reliance on language by using online graphic representations may also introduce a new construct-irrelevant issue because students have varying familiarity with these kinds of representations or with the possible ways to interact with them offered by the technology.

Cultural, racial, and gender issues may also pose fairness questions. Test items should be designed so that they do not in some way disadvantage the respondent on the basis of those characteristics, socioeconomic status, or other background characteristics. For example, if a passage uses an example more familiar or accessible to boys than girls (e.g., an example drawn from a sport in which boys are more likely to participate), it may give the boys an unfair advantage. The opposite may occur if an example is drawn from cooking (with which girls are more likely to have experience). The same may happen if the material in the task is more familiar to students from a white, Anglo-Saxon background than to students from minority racial and ethnic backgrounds or more familiar to students who live in urban areas than those in rural areas.

It is important to keep in mind that attributes of tasks that may seem unimportant can cause differential performance, often in ways that are unexpected and not predicted by assessment designers. There are processes for bias and sensitivity reviews of assessment tasks that can help identify such problems before the assessment is given (see, e.g., Basterra et al., 2011; Camilli, 2006; Schmeiser and Welch, 2006; Solano-Flores and Li, 2009). Indeed, this process was begun by the NGSS. The NGSS development work included a process to review and refine the performance expectations using this lens (see Appendix 4 of the NGSS). After an assessment has been given, analyses of differential item functioning can help identify problematic questions so that they can be excluded from scoring (see, e.g., Camilli and Shepard, 1994; Holland and Wainer, 1993; Sudweeks and Tolman, 1993).
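As an illustration of what a differential item functioning analysis can look like, the sketch below uses the Mantel-Haenszel procedure, one widely used approach that the text itself does not specify. Test takers from a reference group and a focal group are matched on a score stratum, and a common odds ratio for answering the item correctly is pooled across strata. The group labels, strata, counts, and the helper `expand` are all hypothetical, and the rescaling by -2.35 follows the commonly used delta-scale convention.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio across score strata for a single item.

    records: iterable of (group, stratum, correct), where group is "ref" or
    "focal", stratum is the matching score level, and correct is 0 or 1.
    Raises ZeroDivisionError if no stratum has both a reference-group error
    and a focal-group success.
    """
    # Per-stratum counts: [ref right, ref wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, stratum, correct in records:
        index = (0 if group == "ref" else 2) + (0 if correct else 1)
        tables[stratum][index] += 1

    numerator = denominator = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


def expand(stratum, ref_right, ref_wrong, focal_right, focal_wrong):
    """Hypothetical helper: turn cell counts into individual response records."""
    return (
        [("ref", stratum, 1)] * ref_right
        + [("ref", stratum, 0)] * ref_wrong
        + [("focal", stratum, 1)] * focal_right
        + [("focal", stratum, 0)] * focal_wrong
    )


# Invented counts for one item in three total-score strata.
records = (
    expand("low", 20, 30, 10, 40)
    + expand("mid", 30, 20, 20, 30)
    + expand("high", 40, 10, 30, 20)
)

alpha = mantel_haenszel_odds_ratio(records)
delta = -2.35 * math.log(alpha)  # negative values favor the reference group
print(f"MH odds ratio = {alpha:.2f}, MH delta = {delta:.2f}")
```

An odds ratio near 1 (delta near 0) suggests that matched test takers from the two groups perform similarly on the item. Values far from 1 flag the item for review, which is a judgment step and not an automatic exclusion.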

A particular concern for science assessment is the opportunity to learn, that is, the extent to which students have had adequate instruction in the assessed material to be able to demonstrate proficiency on the targeted content and skills. Inferences based on assessment results cannot be valid if students have not had the opportunity to learn the tested material, and the problem is exacerbated when access to adequate instruction is uneven among schools, districts, and states. This equity issue has particular urgency in the context of a new approach to science education that places many new kinds of expectations on students. The issue was highlighted in A Framework for K-12 Science Education: Practices, Crosscutting Concepts, and Core Ideas (National Research Council, 2012a, p. 280), which noted:

. . . access to high quality education in science and engineering is not equitable across the country; it remains determined in large part by an individual’s socioeconomic class, racial or ethnic group, gender, language background, disability designation, or national origin.

The validity of science assessments designed to evaluate the content and skills depicted in the framework could be undermined simply because students do not have equal access to quality instruction. As noted by Pellegrino (2013), a major challenge in the validation of assessments designed to measure the NGSS performance expectations is the need for such work to be done in instructional settings where students have had adequate opportunity to learn the integrated knowledge envisioned by the framework and the NGSS. We consider this issue in more detail in Chapter 7 in the context of suggestions regarding implementation of next generation science assessments.