Validity Evidence 16: Case Study

Case study candidates were asked how well they understood the outcomes of their performance, what the scores might mean, and whether they felt there was guidance in place for how decisions would be derived from TPA scores. As mentioned above, one difficulty with data collection and analysis was that TPA scores were not reported to the university or to candidates prior to the end of the term. Therefore, candidates were asked to provide their views without knowing their scores or overall performance on the TPA.

Findings.

Like faculty, all case study candidates expressed frustration that they did not know whether, when, or what type of feedback they would receive about their performance, or what decisions might be made from those scores. Findings also reflect candidate confusion about completing a summative evaluation during a formative experience. Jackie says, “In order to learn you have to do something and then [get] feedback, immediate feedback” but “I don’t get anything back until August [four months after submission]. I have no idea how I did” (Phase 4).

Uncertainty about scoring and scorer interpretation of the handbook led to feelings of anger and frustration and accusations that the handbooks were not ready for field testing. This is illustrated in the excerpt below:

Jennifer: “I am feeling kind of lackadaisical about it because it is not high stakes for me. I think I would feel very different if it was. I would not be super excited, especially just, you know, [because] in the end it kind of just comes down to little things. The fact that there is so many mistakes in the pamphlets that we were given, that we had to dig to figure out what they really wanted, the fact that they kind of wrote in a convoluted way. Those are things that students get upset about when they have to take a test from a teacher. It is what we are taught not to do. You make the expectations clear, you teach the test, and you don’t throw any curve balls and that is exactly what they were doing with the TPA, you know. You proof read your own stuff. You don’t judge kids for not knowing what to do when your test is thrown together and it looks like crap and that is kind of how I felt when I was writing the TPA. There were portions in there that were science requirements in my history pamphlet.”

Researcher: “That is really embarrassing.”

Jennifer: “I mean it is not little things. They were like glaring, ‘oh, my gosh, you have the wrong requirements in here.’ Can you imagine if you did something like that on a test and then gave them to students? . . . Then say, [to your student] ‘you’re not moving on to the 11th grade if you don’t pass this test that is convoluted and thrown together.’ No.” (Phase 4)

Many candidate questions were about the scoring procedure. Because faculty were not trained by Pearson, and the training materials were available only to those hired to score for Pearson, it was difficult for faculty and supervisors to advise candidates. Jill says, “Overall, I think the TPA is a pretty accurate representation or asking for accurate skills and knowledge from teachers [but] the scoring process I was a little confused about” (Phase 1). Candidates were asked if they believed that they met standard. Jennifer states that “it is hard for me to [know] until I see the scores, because I mean, obviously, I think I am doing what I am supposed to but I don’t know what a three really looks like, or a four, or a five” (Phase 2). She goes on to clarify her question about scoring: “are they looking at my commentary, at my lesson plans, are they looking at it as a whole?” Jennifer’s worry is that she needed answers to every prompt and rubric requirement to appear in both the artifacts and commentary. “I am afraid that they might misunderstand something” (Phase 4).

Similarly, Jill reflects that, “I am not sure if any of it will meet standard. Just kind of across the board, [I am] really unsure of what it is that they are looking for. I have tried to be as articulate as possible in my commentary but there comes a point where I just can’t repeat myself anymore and I feel like they are either going to get it or they are not going to get it” (Phase 2). Lack of shared “gold-standard” samples made interpretation of the handbook more difficult. Jamie describes this difficulty, “The thing that got me through and really helped me in [earlier coursework] . . . was being able to look at somebody else’s [work] and see what it takes to meet standard” (Phase 4). She verbalizes that samples of performance are helpful, “not because I want to copy but I just want to see ‘okay they might be looking for this’ or ‘they might be looking for that’” (Phase 4).

Other concerns were expressed about how scores would be calculated for passing. For instance, would rubric scores be averaged? Would there be a cut score? Would candidates pass so long as they never scored a one on a rubric? What if they exceeded expectations on multiple tasks but failed one rubric (Jennifer, Phase 4)? Jennifer says, “At this point I don’t even know how I am being graded. I mean, do you have to get all three’s” (Jennifer, Phase 4)?

As stated earlier, Sterner decided to ask candidates to submit the TPA at the midterm point because it was not clear whether future candidates who did not pass would need to submit another TPA in the same term. The consequence of this decision was that candidate submissions revealed their teaching abilities in the middle of ST, rather than the end. Jill describes the problem:

I think that the way that it [TPA] is put into the ST experience it is more of a formative assessment because it is early on and we haven’t had much experience in the classroom. But they are using it as a summative assessment which is kind of one of the reasons that I feel it is unfair because it’s measuring me at the start of my teaching and not when I am comfortable in the classroom, comfortable with the students and ready to videotape myself. (Phase 3)

Jamie says, “it would be great to . . . have somebody look through it . . . and give me feedback on the whole thing because then I could have maybe applied that to the rest of my student teaching” (Phase 4).

Support to validity.

Some of the concerns expressed by candidates were a result of the timing of TPA submission, a decision made by Sterner, not SCALE or Pearson.

Threat to validity.

The purpose of ST is to provide candidates a formative learning opportunity and to guide their practice as a teacher. Without feedback, or simply as a summative assessment, the TPA does not offer candidates opportunities for growth and is, therefore, a questionable pedagogical tool to introduce during ST.

Not enough guidance was in place for candidates to understand and interpret test scores and their meanings. For this reason, Sterner did not disclose raw scores to candidates. Instead, the program provided candidates with areas of “strength” and “focus” based on mean scores on the tasks and categories. More guidance is needed for candidates to understand the scoring procedures (specifically, the relationship between artifacts, commentaries, and tasks in the scoring process), score outcomes, passing scores, and requirements. In addition, samples of performance at different rubric levels are needed to facilitate the interpretation of the Handbook requirements.

Conclusion

This chapter discussed the validity assumptions of the WA TPA field test in spring 2012 using the ABV methodology articulated by Michael Kane and the Cambridge Reporting Framework developed by Shaw, Crisp, and Johnson. Of particular importance was the extent to which test scores could not be generalized across different ST contexts. The overall findings suggest that the operationalized construct was stable, but scores were not generalizable and guidance regarding score meaning and use was not in place prior to the field test. Low correlation between the TPA and university instruments provided divergent evidence for the use of the TPA, indicating that decisions based solely on the TPA may not be reliable. The risk of making the wrong decision is high. Other inferences (Inferences 2 and 4) suggest potential weaknesses but not failures of validity. The next chapter presents the validity narrative and discussion, implications and next steps for study, and a brief discussion of the ABV model.

In document An Argument-Based Validation Study of the Teacher Performance Assessment in Washington State (Page 196-200)