Value-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors

(1)

Douglas N. Harris

Associate Professor of Educational Policy and Public Affairs, University of Wisconsin at Madison

October 19, 2010

SERVE Southeast REL Webinar


(Book forthcoming, Harvard Educ. Press, February, 2011)

(2)

Preview

 Discuss how we measure (or really fail to measure) teacher performance today

 Explain what value-added measures are and how they might improve performance measurement

 Discuss how well value-added measures capture teacher performance—the different types of errors

 Interpret research evidence about the errors

 Provide a sense of perspective, as well as some specific recommendations, about how to use value-added measures

(3)

A Question for All Organizations

 How should we measure and reward performance?

– What if we only measure performance related to one organizational goal and omit other goals?

– What happens if we measure performance badly for any or all goals?

– How do we align the incentives of workers with those of the organizations using imperfect measures?

 Specific concerns in schools:

– Many goals to balance

– Need for professionalism

– Desire to “keep politics out”

(4)

Rationale for Value-Added

(5)

The Traditional “Credentials Strategy” to Teacher Quality

 Until the 1990s, the education system focused on rule compliance and resources: finance, class size, . . .

 Teacher credentials also fall within the resources, or “input” approach

– Undergraduate Education and Test Scores

– Graduate Education and Experience

– Certification

 Unfortunately, the only one related to teacher effectiveness is experience

 Therefore important to consider outcomes and instructional practice as alternatives

(6)

Formal Teacher Evaluations

 Do these make up for weaknesses of credentials?

 Evaluations do not focus on the “technical core” of teaching

– i.e., they ignore instructional practice

 90% of teachers receive the highest rating

 Principals often do not have the training or the time to be instructional leaders

– Partly because low stakes of evaluations give little reason to take evaluation seriously

 I almost never hear teachers or administrators say that the formal evaluation works well

(7)

Teacher Effectiveness Varies

 “New” research suggests that teacher effectiveness varies a great deal, even within individual schools

– Some even argue that we could eliminate the achievement gap simply by reassigning the most effective teachers to minority children

– Differences are exaggerated but the larger conclusion about variation is not really in dispute

– Also, consistent with the evidence on credentials—if credentials worked, we would see less variation

 Yet, we measure teacher effectiveness poorly and accountability focuses on whole schools

(8)

A Failure of Test-Based Accountability: The Snapshot Problem

 Snapshot = Any measure of student outcomes at a single point in time

– Regardless of test reporting methods (% proficient, scale scores, etc.)

– Until now, all accountability has been on snapshots

 The Problem: Students enter the classroom at very different starting points, because of factors outside the control of the school

– The “starting gate inequality”

 Why is this a problem?

(9)

Cardinal Rule of Accountability

 Rule: Hold people accountable for what they can control

– Part 1: “Hold people accountable . . .”

 Meaning that accountability is important

– Part 2: “ . . . for what they can control.”

 Meaning that the details matter

 Accountability systems have failed to follow the Cardinal Rule because snapshots fail to account for what students bring to the classroom

(10)

Consequences

 Driving teachers out of low-snapshot schools

 Pushing low-snapshot students out the door

 Complacency in high-snapshot schools

 Value-added measures can help address the snapshot problem and reduce these consequences

(11)

Questions About the Rationale for Value-Added

(12)

What are Value-Added Measures (VAM)?

(13)

Basic VAM

 If the problem is accounting for what students bring with them to the classroom, then measure what they bring

 Annual student testing allows researchers to subtract prior scores from current ones; the difference is growth

 Growth can be calculated for different test score reporting methods

– Scale scores and NCEs best

 Ideal: Growth of individual students based on scale scores

– The paradigm shift
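The subtraction described above can be sketched in a few lines. This is only an illustration: the teacher labels and scale scores are made up, and the function name snapshot_and_growth is hypothetical. It reports each teacher's snapshot (mean current score) next to mean growth (current minus prior).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year scale score, current-year scale score).
records = [
    ("Teacher A", 300, 350),
    ("Teacher A", 320, 370),
    ("Teacher B", 500, 550),
    ("Teacher B", 520, 570),
]

def snapshot_and_growth(rows):
    """Return {teacher: (mean current score, mean growth)}."""
    current = defaultdict(list)
    growth = defaultdict(list)
    for teacher, prior, now in rows:
        current[teacher].append(now)
        growth[teacher].append(now - prior)  # growth = current score minus prior score
    return {t: (mean(current[t]), mean(growth[t])) for t in current}

results = snapshot_and_growth(records)
for teacher, (snapshot, gain) in results.items():
    print(teacher, snapshot, gain)
```

In this made-up data the two teachers have very different snapshots but identical mean growth, which is the distinction the basic approach is meant to expose.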

(14)

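Illustration of Basic Approach: 2 Teachers w/ Same VA

[Figure: achievement (vertical axis) from the start of the school year to the end of the year (horizontal axis) for Mr. Hacker (low snapshot) and Ms. Erickson (high snapshot); the gap between them is labeled “Starting Gate Inequality”.]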

(15)

Illustration #2: 2 Teachers w/ Different Value-Added

[Figure: achievement (vertical axis) from the start of the school year to the end of the year (horizontal axis) for Ms. Smith (low VA, but high snapshot) and Ms. Bloom (high VA, but low snapshot).]

(16)

Advanced VAM

 Limits of Basic VAM:

– Unequal school resources

 With each control variable included, VAMs account for the contribution of each factor to student achievement on the average, in all schools

 Based on these measured contributions, VAMs assign “bonus points” to schools with few school resources (and more disadvantaged students if demographics are included)

– If having 1 fewer student in class increases test scores by 2 points, then a school with 5 more students per class than avg. school gets 10 bonus points

 Each control variable added helps to make the schools in each bucket more and more similar in terms of what they can control
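The “bonus points” arithmetic above can be written out as a small sketch. The 2 points per student and the 5-student difference come from the example on this slide; the class sizes of 30 and 25, the raw growth of 40, and the function names are assumptions chosen only to make the example concrete. A real VAM estimates these contributions statistically rather than taking them as given.

```python
def bonus_points(class_size, avg_class_size, points_per_student):
    """Points credited for a factor outside the school's control.

    points_per_student is the estimated score gain from having
    one fewer student per class (assumed known here).
    """
    return (class_size - avg_class_size) * points_per_student

def adjusted_growth(raw_growth, class_size, avg_class_size, points_per_student):
    """Raw growth plus the bonus (or penalty) for class size."""
    return raw_growth + bonus_points(class_size, avg_class_size, points_per_student)

# A school with 5 more students per class than the average school.
bonus = bonus_points(class_size=30, avg_class_size=25, points_per_student=2)
adjusted = adjusted_growth(raw_growth=40, class_size=30, avg_class_size=25,
                           points_per_student=2)
print(bonus, adjusted)
```

Under the same rule, a school with smaller-than-average classes receives a negative adjustment, which is what makes the comparison relative to the average school.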

(22)

Controversy of Student Demographics

 Accounting for student demographics can be interpreted as “lowering expectations” for disadvantaged students

 In one sense, this is true: schools with fewer school resources and more disadvantaged students can achieve the same ratings as other schools with lower actual achievement gains

 In another sense, this is false: value-added does not provide schools with any incentive to give greater effort to disadvantaged students

– We can apply “weights” that give as much or as little weight to disadvantaged students as we wish
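One way to read the “weights” idea is as a weighted average of student gains. The sketch below is an assumption about how such weights might be applied, not a description of any particular state or district system; the gains, the disadvantaged flags, and the function name weighted_value_added are all hypothetical.

```python
def weighted_value_added(gains, disadvantaged, weight):
    """Weighted mean of student gains.

    Each disadvantaged student counts `weight` times; every other
    student counts once. weight=1.0 gives the ordinary mean.
    """
    weights = [weight if flag else 1.0 for flag in disadvantaged]
    total = sum(w * g for w, g in zip(weights, gains))
    return total / sum(weights)

# Hypothetical classroom: the first two students are disadvantaged.
gains = [10.0, 20.0, 30.0, 40.0]
disadvantaged = [True, True, False, False]

equal = weighted_value_added(gains, disadvantaged, weight=1.0)
double = weighted_value_added(gains, disadvantaged, weight=2.0)
print(equal, double)
```

With the weight doubled, the measure moves toward the gains of the disadvantaged students, so raising their growth matters more to the rating.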

(23)

Value-Added Measures are Relative

 VA allows us to make comparisons among schools and teachers (it’s relative), not draw absolute conclusions about performance

 On the one hand, this means that some teachers and schools will have low value-added no matter what they do

 On the other hand, we would never want to say that a teacher or school that reaches a particular standard is “good enough”

– Relative measures facilitate continuous improvement

(24)

Questions About How Value-Added Measures are Created

(25)

Possible Errors in Value-Added Measures

(26)

Two Basic Types of Errors

 Systematic Error: The error is more likely to occur with a particular school or teacher

– Snapshots are a case in point: they systematically disadvantage low-snapshot schools

 Random Error: Is equally likely to arise for everyone

– Example: A coin toss

– Two sources:

 Measurement error (from the student test scores)

 Sampling error (more students, less sampling error)

– Random error is worse with growth measures

(27)

Illustrating Random Error in Growth Measures

[Figure: achievement (vertical axis) over time (horizontal axis). 3rd grade score: 1100; 4th grade score: 1400. Maximum growth: +500; minimum growth: +100.]
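The numbers in this illustration are consistent with each test score being off by as much as 100 points in either direction. That margin is an inference from the figure's values, not something stated on the slide, and the function name growth_range is hypothetical. Under that assumption, a short sketch reproduces the minimum and maximum growth:

```python
def growth_range(prior, current, margin):
    """Smallest and largest growth consistent with two scores
    that may each be off by up to +/- margin points."""
    lowest = (current - margin) - (prior + margin)
    highest = (current + margin) - (prior - margin)
    return lowest, highest

# Scores from the illustration; the 100-point margin is assumed.
low, high = growth_range(prior=1100, current=1400, margin=100)
print(low, high)
```

Each score is uncertain by 100 points, but the growth of 300 is uncertain by 200 points, because the errors in both scores carry into the difference. This is why random error is worse with growth measures.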

(28)

More on Errors

 Types of random errors

– Type I error = in this case, the probability of concluding two teachers perform differently when they are in fact the same (“statistical significance”)

– Type II error = the probability of concluding two teachers perform the same when they are different

 Random and systematic errors are both important for deciding how to use performance measures

(29)

Statistical Errors and Decision Errors

[Diagram relating statistical errors to decision errors through policy. Recoverable labels follow.]

 Random Errors

– Type One Error: Conclude two are different when they are really the same

– Type Two Error: Conclude two are the same when they are really different

 Systematic Errors

 Policy

 Decision Errors

– Decision Error One. Example: Give an award to someone who really isn’t high-performing

– Decision Error Two. Example: Leave someone on the job who is performing poorly

– ? ?

(30)

“We made too many wrong mistakes”

-- Yogi Berra

(31)

Research on Strengths and Weaknesses of VAM

(32)

The Good News

 Research on VAM is in its infancy, but . . .

 Again, differences between the lowest and highest value-added teachers seem large

 VAM measures have been partly validated by a random assignment experiment (here in LA)

 VAM measures of teacher effectiveness are positively correlated with principals’ subjective assessments of teachers

(33)

The Bad News

 VA no better than the tests—garbage in, garbage out

– Much effort right now toward improving the quality of student assessments

 Are imprecise

– Hard to say that one teacher is clearly better than another based on VAM-A

– As a result, teacher measures are unstable

 Vary across tests (same subject)

 Sensitive to specific statistical assumptions

 May not totally address the tracking problem

(34)

The Limited Applicability of VAM

 One of the main limitations of VAM is that, in most states, it can only be applied easily in grades 4-8, math and reading

 Excludes:

– Teachers in other subjects, coaches, specialists

– Teachers in grades K-3 and 9-12

– New teachers

 On the other hand, it wouldn’t make sense for teacher evaluations to be the same across all grades and subjects

(35)

Questions About the Strengths and Weaknesses of Value-Added Measures

(36)

Putting the Evidence in Perspective

 Researchers have strict standards about drawing conclusions based on statistics (about teacher performance or anything else)

– See AERA/APA/NCME standards

 As decision-makers, you do not have this luxury: you cannot wait around for ideal solutions, or accept large numbers of ineffective teachers remaining in classrooms

 All measures have their advantages and disadvantages and you have to compare them

(37)

The Double Standard

 Critics of VAM don’t apply the same standard to credentials that they do to value-added

– Example: Do credentials “converge with results from other ratings of quality, such as classroom observations, parent surveys, …”?

– Answer: No way.

 No performance measure could possibly meet the AERA/APA/NCME standards

(38)

“When I hear somebody sigh, 'Life is hard,' I am always tempted to ask, 'Compared to what?'”

-- Sydney J. Harris (journalist)

(39)

Understanding Value-Added:

(42)

Low- vs. High-Stakes



 There aren’t any “no stakes” uses

 Lowest stakes:

– School-level VA with school bonuses

 Medium stakes:

– Report teacher VA to school principal

– Performance pay

 Highest stakes:

– Make VA measures publicly available

– Tenure and dismissal

(43)

Recommendations: Using Value-Added to Improve Teaching and Learning

(44)

Recommendations for Using VAM

#1: Use value-added to measure school performance and hold schools accountable

#2: Experiment with and carefully evaluate policies that use value-added to measure the performance of individual teachers

#3: In creating performance measures, combine value-added with other measures more closely related to actual practice

#4: Experiment with and carefully evaluate policies that use value-added to measure the performance of teacher teams

(45)

Recommendations: Part II

#5a: Consider extending value-added to other grades, subjects, and student outcomes . . .

#5b: . . . But don’t let the tail wag the dog.

#6: Avoid the “air bag” problem. Don’t drive value-added measures “too fast.”

(46)

Recommendations on Creating and Reporting VA Measures: Part I

#1: Use student tests that reflect rich content and are standardized, scaled and criterion-referenced

#2: Create data systems that link student outcomes over time and to teachers and schools

#3: Include all students, including special education, English Language Learners, and students with some missing data

#4: Make adjustments to align the timing of the test with the timing of schooling activities

(47)

Recommendations on Creating and Reporting VA Measures: Part II

#5: Average value-added measures over ≥ 2 years

#6: Create value-added measures based on comparisons among teachers and schools that facilitate cooperation and collaboration

#7: Create value-added measures that compare teachers within grades and subjects

#8: Account for factors that are outside the control of those being evaluated

#9: Adjust for sampling error

#10: Report confidence intervals
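Recommendations #5, #9, and #10 above can be illustrated together with a rough sketch. It assumes a normal approximation (z = 1.96) and treats "averaging over two years" as pooling the students from both years, which is only one of several ways to do it. The gains and the function name confidence_interval are hypothetical, and with classes this small a t-based interval would be more appropriate.

```python
from math import sqrt
from statistics import mean, stdev

def confidence_interval(gains, z=1.96):
    """Approximate 95% interval for a teacher's mean student gain.

    Uses a normal approximation; the standard error shrinks as
    more students are included.
    """
    m = mean(gains)
    se = stdev(gains) / sqrt(len(gains))
    return m - z * se, m + z * se

# Hypothetical student gains for one teacher in two consecutive years.
year1 = [12.0, 30.0, 18.0, 24.0]
year2 = [20.0, 16.0, 28.0, 14.0]

one_year = confidence_interval(year1)
two_years = confidence_interval(year1 + year2)

print("1 year: ", one_year)
print("2 years:", two_years)
```

In this made-up data the two-year interval is narrower than the one-year interval, which is the practical argument for combining years and for reporting the interval rather than a single number.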

(48)

An Additional Recommendation

 Use value-added to evaluate school, district, and state programs and practices

 The evidence on teacher credentials (above) is a good example

 The value-added approach solves the same problem in program evaluation as it does in educator performance

– Avoid systematic errors

(49)

How Are Others Using VA?

 Most districts are following these recommendations

– Mixing value-added with classroom observations

– Dozens of districts are using value-added as a partial basis for merit pay (federal TIF program)

– Revamping formal evaluation and tenure decisions

– For many, lack of good data system is the first barrier

 Some problems

– Moving too fast (“air bag” problem)

– Lack of professional development about VA measures

– Dueling evaluation systems

(50)

“All models are false but some models are useful.”

-- George E.P. Box

(51)

Conclusion: Moving Forward to Ensure Teacher Effectiveness in LAUSD

 We can do better than the credentialing and check list evaluation system

 In deciding how to use VAM, we should:

(1) Ask ourselves: Is this system going to give high ratings to the types of teachers I would want for my children and the types of schools I would want them to attend?

(2) Compare VAM to the alternatives

 We need a comprehensive system of teacher effectiveness, and performance measures in some form represent one important element

(52)

Papers and References

 Policy brief from PACE (forthcoming)

 Forthcoming book on value-added from Harvard Education Press (February)

 My web site:

http://www.education.wisc.edu/eps/faculty/harris.asp

 Web site focused on teacher quality research:

http://www.teacherqualityresearch.org

 Ed Week Commentary (June, 2008)

 National Conference on Value-Added