Measurement and Implementation Issues - Developing Assessments for the NGSStandards

In Chapter 3, we note that the selection and development of assessment tasks should be guided by the constructs to be assessed and the best ways of eliciting evidence about a student’s proficiency relative to that construct. The NGSS performance expectations emphasize the importance of providing students the opportunity to demonstrate their proficiencies in both science content and practices. Ideally, evidence of those proficiencies would be based on observations of students actually engaging in scientific and engineering practices relative to disciplinary core ideas. In the measurement field, these types of assessment tasks are typically performance based and include questions that require students to construct or supply an answer, produce a product, or perform an activity. Most of the tasks we discuss in Chapters 2, 3, and 4 are examples of performance tasks.

Performance tasks can be and have been designed to work well in a classroom setting to help guide instructional decision making. For several reasons, they have been less frequently used in the context of monitoring assessments administered on a large scale.

First, monitoring assessments are typically designed to cover a much broader domain than tests used in classroom settings. When the goal is to assess an entire year or more of student learning, it is difficult to obtain a broad enough sampling of an individual student’s achievement using performance tasks. Because each performance task takes considerable time to complete, only a few can be administered in a testing session, and with fewer tasks there is less opportunity to fully represent the domain of interest.

Second, the reliability, or generalizability, of the resulting scores can be problematic. Generalizability refers to the extent to which a student’s test scores reflect a stable or consistent construct rather than error, and thus support a valid inference about the student’s proficiency with respect to the domain being tested. Obtaining reliable individual scores requires that students each take multiple performance tasks, but administering enough tasks to obtain the desired reliability often creates feasibility problems in terms of the cost and time for testing. Careful task and test design (described below) can help address this issue.
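One common way to reason about how many tasks are needed is the Spearman-Brown formula from classical test theory, which the report itself does not invoke. The sketch below is illustrative only: it assumes the tasks are parallel (equally reliable and measuring the same construct), and the reliability values and function names are hypothetical rather than taken from any assessment program.

```python
import math


def spearman_brown(single_task_reliability: float, n_tasks: int) -> float:
    """Projected reliability of a score based on n_tasks parallel tasks."""
    r = single_task_reliability
    return n_tasks * r / (1 + (n_tasks - 1) * r)


def tasks_needed(single_task_reliability: float, target: float) -> int:
    """Smallest number of parallel tasks whose projected reliability meets target."""
    r = single_task_reliability
    exact = target * (1 - r) / (r * (1 - target))
    # Subtract a tiny tolerance so floating-point noise does not round up a
    # value that is mathematically a whole number.
    return max(1, math.ceil(exact - 1e-9))


# Hypothetical values: one performance task with reliability 0.30,
# and a desired score reliability of 0.80.
projected = spearman_brown(0.30, 5)
needed = tasks_needed(0.30, 0.80)
print(f"5 tasks -> projected reliability {projected:.2f}")
print(f"tasks needed for 0.80: {needed}")
```

Under these assumed values, about ten tasks per student would be required, which illustrates why the cost and time for testing become a feasibility problem.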

Third, some of the monitoring purposes shown in Table 5-1 (in the second row) require comparisons across time. When the goal is to examine performance across time, the assessment conditions and tasks need to be comparable across the two testing occasions. If the goal is to compare the performance of this year’s students with that of last year’s students, the two groups of students should be required to respond to the same set of tasks or a different but equivalent set of tasks (equivalent in terms of difficulty and content coverage). This requirement presents a challenge for assessments using performance tasks since such tasks generally cannot be reused because they are based on situations that are often highly memorable.4 And, once they are given, they are usually treated as publicly available.5 Another option for comparison across time is to give a second group of students a different set of tasks and use statistical equating methods to adjust for differences in the difficulty of the tasks so that the scores can be placed on the same scale.6 However, most equating designs rely on the reuse of some tasks or items. To date, the problem of equating assessments that rely solely on performance tasks has not yet been solved. Some assessment programs that include both performance tasks and other sorts of items use the items that are not performance based to equate different test forms, but this approach is not ideal: the two types of tasks may actually measure somewhat different constructs, so there is a need for studies that explore when such equating would likely yield accurate results.
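To make the idea of adjusting for differences in form difficulty concrete, the sketch below applies linear (mean-sigma) equating, one of the simplest textbook methods. It is an illustration under strong assumptions, not a method prescribed by the report: the two groups are assumed to be randomly equivalent, the score lists are invented, and the function name `linear_equate` is hypothetical. It does not address the unsolved problem of equating forms made up solely of performance tasks.

```python
from statistics import mean, pstdev


def linear_equate(x, form_x_scores, form_y_scores):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    mu_x, sd_x = mean(form_x_scores), pstdev(form_x_scores)
    mu_y, sd_y = mean(form_y_scores), pstdev(form_y_scores)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Invented scores from two groups assumed to be randomly equivalent.
last_year_form_y = [11, 14, 17, 20, 23]
this_year_form_x = [10, 12, 14, 16, 18]

equated = linear_equate(16, this_year_form_x, last_year_form_y)
print(f"A Form X score of 16 corresponds to about {equated:.1f} on Form Y")
```

The transformation places a Form X score at the same distance from the mean, in standard deviation units, on the Form Y scale. Operational equating designs are considerably more elaborate.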

4 That is, test takers may talk about them after the test is completed, and share them with each other and their teachers. This exposes the questions and allows other students to practice for them or similar tasks, potentially in ways that affect the ability of the task to measure the intended construct.

5 For similar reasons, it can be difficult to field test these kinds of items.

6 For a full discussion of equating methods, which is beyond the scope of this report, see Kolen

Fourth, scoring performance tasks is a challenge. As we discuss in Chapter 3, performance tasks are typically scored using a rubric that lays out criteria for assigning scores. The rubric describes the features of students’ responses required for each score and usually includes examples of student work at each scoring level. Most performance tasks are currently scored by humans who are trained to apply the criteria. Although computer-based scoring algorithms are increasingly in use, they are not generally used for content-based tasks (see, e.g., Bennett and Bejar, 1998; Braun et al., 2006; Nehm and Härtig, 2011; Williamson et al., 2006, 2012). When humans do the scoring, their variability in applying the criteria introduces judgment uncertainty. Using multiple scorers for each response reduces this uncertainty, but it adds to the time and cost required for scoring.
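The trade-off between scorer uncertainty and scoring cost can be sketched numerically. The example assumes that scorers’ errors are independent with a common standard deviation, so the error in an averaged score shrinks with the square root of the number of scorers while cost grows linearly. All numbers and function names are hypothetical.

```python
import math


def rater_standard_error(rater_error_sd: float, n_raters: int) -> float:
    """Standard error of the mean rating, assuming independent rater errors."""
    return rater_error_sd / math.sqrt(n_raters)


def scoring_cost(n_responses: int, n_raters: int, cost_per_read: float) -> float:
    """Total cost when every response is read by n_raters scorers."""
    return n_responses * n_raters * cost_per_read


# Hypothetical inputs: rater error SD of 0.8 rubric points,
# 10,000 responses, and 1.50 currency units per read.
for k in (1, 2, 4):
    se = rater_standard_error(0.8, k)
    cost = scoring_cost(10_000, k, 1.50)
    print(f"{k} scorer(s): SE = {se:.2f} points, cost = {cost:,.0f}")
```

Under this assumption, halving the scorer-related error requires four times as many reads, which is why double or triple scoring is costly at scale.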

This particular form of uncertainty does not affect multiple-choice items, but they are subject to uncertainty because of guessing, something that is much less likely to affect performance tasks. To deal with these issues, a combination of response types could be used, including some that require demonstrations, some that require short constructed responses, and some that use a selected-response format. Selected-response formats, particularly multiple-choice questions, have often been criticized as useful only for assessing low-level knowledge and skills. But this criticism refers primarily to isolated multiple-choice questions that are poorly related to an overall assessment design. (Examples include questions that are not related to a well-developed construct map in the construct-modeling approach or not based on the claims and inferences in an evidence-centered design approach; see Chapter 3.) With a small set of contextually linked items that are closely related to an assessment design, the difference between well-produced selected-response items and open-ended items may not be substantial. Using a combination of response types can help to minimize concerns associated with using only performance tasks on assessments intended for monitoring purposes.

Examples

Despite the various measurement and implementation challenges discussed above, a number of assessment programs have made use of performance tasks and portfolios7 of student work. Some were quite successful and are ongoing, and some experienced difficulties that led to their discontinuation. In considering options for assessing the NGSS performance expectations for monitoring purposes, we began by reviewing assessment programs that have made use of performance tasks, as well as those that have used portfolios. At the state level, Kentucky, Vermont, and Maryland implemented such assessment programs in the late 1980s and early 1990s.

7 A portfolio is a collection of work, often with personal commentary or self-analysis, that is assembled over time as a cumulative record of accomplishment (see Hamilton et al., 2009). A portfolio can be either standardized or nonstandardized: in a standardized portfolio, the materials are developed in response to specific guidelines; in a nonstandardized portfolio, the students and teachers are free to choose what to include.

In 1990, Kentucky adopted an assessment for students in grades 4, 8, and 11 that included three types of questions: multiple-choice and short essay questions, performance tasks that required students to solve practical and applied problems, and portfolios in writing and mathematics in which students presented the best examples of their classroom work for a school year. Assessments were given in seven areas: reading, writing, social science, science, math, arts and humanities, and practical living/vocational studies. Scores were reported for individual students.

In 1988, Vermont implemented a statewide assessment in mathematics and writing for students in grades 4 and 8 that included two parts: a portfolio component and uniform subject-matter tests. For the portfolio, the tasks were not standardized: teachers and students had unconstrained choice in selecting the products to include. The portfolios were complemented by subject-matter tests that were standardized and consisted of a variety of item types. Scores were reported for individual students.

The Maryland School Performance Assessment Program (MSPAP) was implemented in 1991. It assessed reading, writing, language usage, mathematics, science, and social sciences in grades 3, 5, and 8. All of the tasks were performance based, including some that required short-answer responses and others that required complex, multistage responses to data, experiences, or text. Some of the activities integrated skills from several subject areas, some were hands-on tasks involving the use of equipment, and some were accompanied by preassessment activities that were not scored. The MSPAP used a matrix-sampling approach: that is, the items were sampled so that each student took only a portion of the exam in each subject. The sampling design allowed for the reporting of scores for schools but not for individual students.
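A matrix-sampling design of this general kind can be sketched as follows. This is not a reconstruction of the MSPAP design: the task labels, block count, and student roster are invented, and the rotation rule is one simple assumption about how blocks might be distributed.

```python
def make_blocks(tasks, n_blocks):
    """Split the task pool into n_blocks blocks of roughly equal size."""
    return [tasks[i::n_blocks] for i in range(n_blocks)]


def assign_blocks(students, blocks):
    """Rotate through the blocks so each student receives exactly one block."""
    return {s: blocks[i % len(blocks)] for i, s in enumerate(students)}


# Invented pool of 12 tasks and a roster of 6 students in one school.
task_pool = [f"T{n}" for n in range(1, 13)]
roster = [f"student_{n}" for n in range(1, 7)]

blocks = make_blocks(task_pool, n_blocks=3)
assignment = assign_blocks(roster, blocks)

# Each student answers only a portion of the task pool...
per_student = {s: len(b) for s, b in assignment.items()}
# ...but across the school every task is administered.
covered = set().union(*assignment.values())
print(per_student)
print(covered == set(task_pool))
```

Because no student sees the whole pool, results are naturally aggregated at the school level, which is consistent with the reporting limitation described above.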

These assessment programs were ambitious, innovative responses to calls for education reform. They made use of assessment approaches that were then cutting edge for the measurement field. They were discontinued for many reasons, including technical measurement problems, practical constraints (e.g., the costs of the assessments and the time they took to administer), and the imposition of the accountability requirements of NCLB (see Chapter 1), which they could not readily satisfy.8

8 A thorough analysis of the experiences in these states is beyond the scope of this report, but there have been several studies. For Kentucky, see Hambleton et al. (1995) and Catterall et al. (1998). For Vermont, see Koretz et al. (1992a,b, 1993a,b,c, 1994). For Maryland, see Hambleton et al. (2000), Ferrara (2009), and Yen and Ferrara (1997). Hamilton et al. (2009) provide an overview of all three of these programs. Hill and DePascale (2003) have pointed out that some critics of these programs failed to distinguish between the reliability of student-level scores and school-level scores. For purposes of school-level reporting, the technical quality of some of these assessments appears to have been better than generally assumed.

Other programs that use performance tasks are ongoing. At the state level, the science portion of the New England Common Assessment Program (NECAP) includes a performance component to assess inquiry skills, along with questions that rely on other formats. The state assessments in New York include laboratory tasks that students complete in the classroom and that are scored by teachers. NAEP routinely uses extended constructed-response questions, and in 2009 conducted a special science assessment that focused on hands-on tasks and computer simulations. The Program for International Student Assessment (PISA) includes constructed-response tasks that require analysis and applications of knowledge to novel problems or contexts. Portfolios are currently used as part of the Advanced Placement (AP) examination in studio art.

Beyond the K-12 level, the Collegiate Learning Assessment makes use of performance tasks and analytic writing tasks. For advanced teacher certification, the National Board for Professional Teaching Standards uses an assessment composed of two parts: a portfolio and a 1-day exam given at an assessment center.9 The portfolio requires teachers to accumulate work samples over the course of a school year according to a specific set of instructions. The assessment center exam consists of constructed-response questions that measure the teacher’s content and pedagogical knowledge. The portfolio and constructed responses are scored centrally by teachers who are specially trained.

The U.S. Medical Licensing Examination uses a performance-based assessment (called the Clinical Skills Assessment) as part of the series of exams required for medical licensure. The performance component is an assessment of clinical skills in which prospective physicians have to gather information from simulated patients, perform physical examinations, and communicate their findings to patients and colleagues.10 Information from this assessment is considered along with scores from a traditional paper-and-pencil test of clinical skills in making licensing decisions.
