Validity Studies for Performance Assessment

Surprisingly, despite calls by many to ensure that the common exam for novice teachers be both “valid and reliable,” the only validation study published on the TPA is the Summary Report produced by SCALE in November 2013 (SCALE, 2013). While often declared “valid and reliable” by proponents,7 the studies linked to such statements (when there are citations) are for the California PACT assessment, California CalTPA (Riggs, Verdi, & Arlin, 2009), or the Connecticut BEST exam. While these assessments are similar to the TPA, especially the PACT, they are not the same. In addition, because not all California schools adopted the PACT, the studies conducted on its reliability and validity have been small in scale for the purpose of supporting a national assessment for accountability. Other than SCALE’s Summary Report (2013), as of January 2014, no other validation study could be cited for the TPA or its 2013 iteration, the edTPA. The lack of studies examining the validity of the TPA in relation to teaching outcomes, or reliability data for the instrument and scoring process, demonstrates a need for further study. There are numerous mentions of pilot data collected for validation purposes, so it may be that other studies are currently underway. Both the edTPA Summary Report and the studies available for those performance exams that are similar to, but not identical to, the TPA are discussed below.

Produce (lessons and teaching materials), or Performance (teaching experiences) evidence” (Goodman, Arbona, & Dominquez de Rameriz, 2008, p. 29). Portfolios from other states were seen as significantly different enough from the TPA, or had no validity or reliability data available, and therefore did not contribute to this study.

SCALE summary report.

SCALE published the only known TPA validation study in November 2013.8 It is important to note that this study uses data from the edTPA, which differs from the TPA. For instance, it includes only three tasks (planning, instruction, and assessment).9 This truncated report provides a historical summary of the development process, the assessment procedure, and validity and reliability data. Based on an analysis of roughly 4,000 submissions (33%) from the spring 2013 term,10 SCALE suggests that the edTPA has strong construct validity, inter-rater agreement,11 and high correlations with job analysis studies (pp. 17-24). While the report acknowledges some implementation issues, it shares no threats to validity. SCALE is expected to offer a full technical report in 2014.

7 E.g., “A recent development that will significantly strengthen accountability for teacher preparation programs and reflect candidate’ readiness for the classroom is the creation of a nationally available, valid, and reliable teacher performance assessment (TPA). … A few additional performance assessment models exist, but they are not as widely used as TPA, nor do most of them have the reliability and validity that TPA and PACT do” (AACTE, 2011, p. 9).

8 Though the report suggests that the studies were conducted over the course of multiple years, all scores and data were derived from one administration of the edTPA. Note: the TPA and edTPA are two different versions of the assessment. In Fall 2013, SCALE renamed the assessment the edTPA when it published a revised iteration. It is not clear whether or how the data collected from the 2012 field tests of the TPA were used. For instance, were they compared to the 2013 edTPA?

PACT.

As noted, Connecticut and California were early adopters of a mandated performance assessment. There are several studies of the initial licensure exam, the PACT, developed primarily as a direct evaluation of a candidate’s teaching for credentialing decisions. One of the additional purposes of PACT is to serve as a formative, professional learning experience for candidates. Lastly, PACT was designed to provide evidence that programs can use to understand their strengths and weaknesses for program improvement. In 2007, a technical report was published by the consortium summarizing the validity and reliability studies from the pilot (Pecheone & Chung-Wei, 2007). Based on that report, and the endorsement of the growing consortium of universities, the PACT was approved for use by the California Commission on Teacher Credentialing.

9 “The Stanford Center for Assessment, Learning, and Equity (SCALE) is the lead developer of the edTPA, and Stanford University is the sole owner of the edTPA” (p. preface). The researcher requested permission to examine the edTPA version of the assessment but did not receive a response. Therefore, a full comparison of these two instruments could not be conducted here. The task and rubric numbers in the summary report do not align to the TPA, making it difficult to connect the data collected in this study and the summary report.

10 According to SCALE, standard setting occurred in August 2013. Participants in this study were selected from the group that preceded standard setting (SCALE, 2013, pp. 1-3).

11 According to SCALE, 10% of all submissions are randomly selected to be double scored by Pearson (p. 23), though it is not clear whether this is 10% of the 4,000 submissions selected for the study or 10% of the full participation group of 12,000 candidates. Therefore, it is not possible to confirm inter-rater agreement based on the data provided in this report.

Okhremtchouk et al. (2009) used a mixed-methods survey to analyze the effects of the PACT on student teaching, university coursework, instructional practice, classroom management, personal time, and candidate perceptions of the level of support required for success. Their study discusses three findings. First, the PACT was overly time-consuming, cutting into the time available to the candidate to focus on instruction, development, and university coursework. Second, the PACT did contribute to candidates’ perception of their professional growth as teachers. Finally, IHE and placement site support networks and mechanisms are essential to candidate success. The authors write, “one of the most useful findings of this study is the local factor: how school placements impact student teachers ability to complete PACT” (Okhremtchouk et al., 2009, p. 59). The outcome of their investigation suggests that those IHEs with a PDS model in place are better equipped to provide support for candidates completing the PACT.

Beyond Okhremtchouk et al., other studies examine the formative value of the PACT. Ruth Chung-Wei provides a context for the value of the PACT as a developmental learning tool for candidates (Chung, 2008; 2007; 2005; Darling-Hammond, Chung Wei, & Johnson, 2009; Pecheone & Chung, 2006). Darling-Hammond, Chung-Wei, and Johnson (2009) have used these studies to argue successfully that “current research suggests that there are many teacher characteristics and abilities which, in combination, predict teaching effectiveness” (p. 631).

As mentioned above, a preliminary study of one year of pilot data (2003-2004) on the PACT was conducted by Pecheone and Chung-Wei (2007). They report preliminary findings from a two-year pilot with thirteen IHE programs.12 Pecheone and Chung-Wei found discernible patterns in student performance demonstrating high levels of achievement in instructional planning. For the portion of the PACT that was double scored during the pilot, the study found high levels of inter-rater reliability. The pilot data also confirmed other studies of the PACT as a significant factor in the formative development of teachers. Pecheone and Chung-Wei found that candidates in urban settings believed that mandated teaching decisions in their settings were too limiting for them to succeed on the PACT. The data confirm that reports of these limitations among candidates in urban settings were associated with lower test scores (p. 29). Pecheone and Chung-Wei’s study of the pilot data from thirteen programs (in one IHE) is often cited as evidence of the validity of the PACT. Their work is the most important validity evidence for the PACT in the literature to date. However, some have questioned whether that study is enough to support national adoption of a similar instrument without its own record of reliability. Ann Berlak (2010) writes that “the key question is whether PACT scores accurately and objectively measure quality teaching. That PACT assessments are neither reliable nor valid is certain to become widely apparent in the next decade.”

In fact, studies of the PACT as a summative assessment are less common. Stephen Newton (2010) conducted a predictive validity study of the PACT for Stanford (the developer of PACT), presenting the relationship between beginning teachers’ scores on PACT and their later teaching effectiveness as measured by value-added achievement gains in their students’ English Language Arts test scores. This study examined student test scores from first- and second-year teachers who taught students in the upper elementary grade levels. The study found that the PACT was highly correlated (.58 to .66) with at least one of four value-added measures (Newton, 2010, p. 12). Newton writes, “for each additional point a teacher scored on PACT, her students averaged a gain of one percentile point per year on the California Standards Tests as compared with similar students” (p. 12). The summary finding confirmed “the validity of the PACT as a measure of teacher quality, and as a useful tool for evaluation of candidates and as a way to provide feedback to teacher education institutions” (p. 13). It is notable, however, that this investigation was small in participant numbers (only 14 teachers and 259 students participated). Ducker, Castellano, Tellez and Wilson (2013) examined the internal structure of the Elementary Literature Teaching Event in the PACT, finding high reliability coefficients and domain-based structures but poor evidence for the task-based structure. Another study compared university supervisor predictions and candidate scores and found that predictions did not match score performance (Sandholtz & Shea, 2011). In contrast to the literature on the formative value of the PACT, the literature offers mixed views of the PACT as an assessment that can provide data for licensure and accountability decisions.

12 Pecheone and Chung (Wei) published the initial pilot data in their 2006 article and then a full study for California that was pivotal in the CCTC endorsement of the PACT (2007). Because the data shared in both the report and the article are from the same study and explicitly connected by the authors, I discuss these two sources as one study.

Several articles focus on the programmatic value of the PACT. Darling-Hammond (2006) concluded that the PACT was an integral part of assessing program outcomes at Stanford and helped to identify areas for attention across institutions (p. 131). Stanford faculty Ira Lit and Rachel Lotan (2013) agree in their examination of the PACT and the dilemmas that this high-stakes performance assessment created for individual candidates and for different programs within an institution. The central dilemmas studied are “managing the conflicting values of the formative nature of the work of educators and the summative imperatives of high-stakes assessment” and “reconciling the contribution of high-stakes assessment to curricular coherence and alignment of practice, on the one hand, and the program’s perspective of offering a range of competing theories and fruitful practices, on the other” (Lit & Lotan, 2013, p. 55). They find that differences across programs create a “balancing act” of implementing PACT and working with candidates as they complete it. Like Peck, Gallucci, and Sloan (2010), Lit and Lotan found that as professionals and teacher trainers “respect for the professional judgment of a faculty member and the program’s commitment to support intellectual diversity were pitted against the demands and priorities of high-stakes assessment” (p. 70) and that this can lead to homogenization. Lit and Lotan write, “because of its high-stakes nature, the Teaching Event becomes something of an albatross-an experience to worry and fret over, rather than an opportunity for thinking, reflecting, and improving on one’s practice” (p. 65). While their article is overwhelmingly positive and supportive, these are dilemmas indeed. Reading between the lines, the takeaway for validation investigations is that the procedural conditions under which the PACT (and its offshoot, the TPA) is performed will necessarily vary. This will affect the inferences that can be made from those PACT test scores.

In document An Argument-Based Validation Study of the Teacher Performance Assessment in Washington State (Page 41-47)