Douglas N. Harris
Associate Professor of Educational Policy and Public Affairs University of Wisconsin at Madison
October 19, 2010
SERVE Southeast REL Webinar
Value-Added Measures of Educator Performance:
Clearing Away the Smoke and Mirrors
(Book forthcoming, Harvard Educ. Press, February, 2011)
Preview
Discuss how we measure (or really fail to measure) teacher performance today
Explain what value-added measures are and how they might improve performance measurement
Discuss how well value-added measures capture teacher performance—the different types of errors
Interpret research evidence about the errors
Provide a sense of perspective, as well as some specific recommendations, about how to use value-added measures
A Question for All Organizations
How should we measure and reward performance?
– What if we only measure performance related to one organizational goal and omit other goals?
– What happens if we measure performance badly for any or all goals?
– How do we align the incentives of workers with those of the organizations using imperfect measures?
Specific concerns in schools:
– Many goals to balance
– Need for professionalism
– Desire to “keep politics out”
Rationale for Value-Added
The Traditional “Credentials Strategy” to Teacher Quality
Until 1990s, the education system focused on rule compliance and resources—finance, class size, . . .
Teacher credentials also fall within the resources, or “input” approach
– Undergraduate Education and Test Scores
– Graduate Education and Experience
– Certification
Unfortunately, the only one related to teacher effectiveness is experience
Therefore important to consider outcomes and instructional practice as alternatives
Formal Teacher Evaluations
Do these make up for weaknesses of credentials?
Evaluations do not focus on the “technical core” of teaching
– i.e., they ignore instructional practice
90% of teachers receive the highest rating
Principals often do not have the training or the time to be instructional leaders
– Partly because low stakes of evaluations give little reason to take evaluation seriously
I almost never hear teachers or administrators say that the formal evaluation works well
Teacher Effectiveness Varies
“New” research suggests that teacher effectiveness varies a great deal, even within individual schools
– Some even argue that we could eliminate the achievement gap simply by reassigning the most effective teachers to minority children
– Differences are exaggerated but the larger conclusion about variation is not really in dispute
– Also, consistent with the evidence on credentials—if credentials worked, we would see less variation
Yet, we measure teacher effectiveness poorly and accountability focuses on whole schools
A Failure of Test-Based Accountability:
The Snapshot Problem
Snapshot = Any measure of student outcomes at a single point in time
– Regardless of test reporting methods (% proficient, scale scores, etc.)
– Until now, all accountability has been on snapshots
The Problem: Students enter the classroom at very different starting points, because of factors outside the control of the school
– The “starting gate inequality”
Why is this a problem?
Cardinal Rule of Accountability
Rule: Hold people accountable for what they can control
– Part 1: “Hold people accountable . . .”
Meaning that accountability is important
– Part 2: “ . . . for what they can control.”
Meaning that the details matter
Accountability systems have failed to follow the Cardinal Rule because snapshots fail to account for what students bring to the classroom
Consequences
Driving teachers out of low-snapshot schools
Pushing low-snapshot students out the door
Complacency in high-snapshot schools
Value-added measures can help address the snapshot problem and reduce these consequences
Questions About the Rationale for Value-Added
What are Value-Added Measures (VAM)?
Basic VAM
If the problem is accounting for what students bring with them to the classroom, then measure what they bring
Annual student testing allows researchers to subtract prior scores from current ones to measure growth
Growth can be calculated for different test score reporting methods
– Scale scores and NCEs (normal curve equivalents) work best
Ideal: Growth of individual students based on scale scores
– The paradigm shift
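As a minimal sketch of the subtraction described above, the following compares a snapshot (average current score) with growth (average of current minus prior) for two classrooms. The classroom names and all scores are hypothetical values chosen for illustration; they are not data from the presentation.

```python
from statistics import mean

# Hypothetical scale scores as (prior year, current year) pairs per student.
classrooms = {
    "low_start": [(1000, 1250), (1050, 1300), (1100, 1350)],
    "high_start": [(1400, 1650), (1450, 1700), (1500, 1750)],
}

def snapshot(pairs):
    """Average current-year score: a single point in time."""
    return mean(cur for _, cur in pairs)

def growth(pairs):
    """Average of (current - prior) across individual students."""
    return mean(cur - pri for pri, cur in pairs)

for name, pairs in classrooms.items():
    print(name, "snapshot:", snapshot(pairs), "growth:", growth(pairs))
```

In this made-up example the two classrooms differ widely on the snapshot but show identical growth, which is the distinction the basic approach is meant to expose.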
Illustration of Basic Approach: 2 Teachers w/ Same VA
[Figure: achievement plotted from start of school year to end of year. Labels: Mr. Hacker (low snapshot); Ms. Erickson (high snapshot); starting gate inequality.]
Illustration #2: 2 Teachers w/ Different Value-Added
[Figure: achievement plotted from start of school year to end of year. Labels: Ms. Smith (low VA, but high snapshot); Ms. Bloom (high VA, but low snapshot).]
Advanced VAM
Limits of Basic VAM:
– Unequal school resources
– Prior achievement may not be enough to account for student differences
Possible solution: Compare similar schools
– Put them into buckets
– Apples to apples comparisons (within buckets)
Teachers whose students make greater than predicted growth have “high value-added”
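One way to picture the bucket comparison is to treat the average growth within a bucket as the predicted growth and to define value-added as actual minus predicted. This is a simplified sketch, not the statistical model used in practice; the teacher labels, bucket names, and growth figures are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, bucket of similar schools, average student growth).
records = [
    ("A", "low_resource", 240),
    ("B", "low_resource", 200),
    ("C", "high_resource", 300),
    ("D", "high_resource", 260),
]

def value_added(rows):
    """Actual growth minus the mean growth of the teacher's own bucket."""
    by_bucket = defaultdict(list)
    for _, bucket, g in rows:
        by_bucket[bucket].append(g)
    # Predicted growth here is simply the bucket average (an assumption).
    predicted = {bucket: mean(gs) for bucket, gs in by_bucket.items()}
    return {teacher: g - predicted[bucket] for teacher, bucket, g in rows}

va = value_added(records)
print(va)
```

In this example teacher A has lower raw growth than teacher D but higher value-added, because each is compared only with teachers in the same bucket.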
Illustration of Advanced VAM: A Simple Comparison
[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; district growth, or similar schools. Label: high value-added.]
Illustration of Advanced VAM: Prediction Approach
[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; predicted growth. Label: high value-added.]
Illustration of Advanced VAM: Prediction Approach w/ Low-Value-Added
[Figure: achievement over time, grades 3 to 6. Line: individual school growth. Labels: low value-added; high value-added.]
Illustration of Advanced VAM: Prediction Approach with “Controls”
[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; predicted growth, small class sizes; predicted growth, large class sizes. Label: high value-added.]
How Exactly Does It Work?
With each control variable included, VAMs account for the contribution of each factor to student achievement on average, in all schools
Based on these measured contributions, VAMs assign “bonus points” to schools with few school resources (and more disadvantaged students if demographics are included)
– If having 1 fewer student in class increases test scores by 2 points, then a school with 5 more students per class than the average school gets 10 bonus points
Each control variable added helps to make the schools in each bucket more and more similar in terms of what they can control
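The class-size arithmetic above can be written out directly. The 2-points-per-student effect and the 5-student gap come from the slide; the specific class sizes (27 and 22) and the raw growth figure are hypothetical, and a real VAM would estimate the effect from data rather than take it as given.

```python
def class_size_bonus(school_class_size, average_class_size, points_per_student=2.0):
    """Bonus points for a school whose classes are larger than average.

    points_per_student is the assumed test-score gain from having one
    fewer student in the class (2 points in the slide's example).
    """
    return (school_class_size - average_class_size) * points_per_student

# Hypothetical school with 27 students per class where the average is 22.
bonus = class_size_bonus(27, 22)
raw_growth = 40.0                      # hypothetical unadjusted growth
adjusted_growth = raw_growth + bonus   # growth after the class-size adjustment
print(bonus, adjusted_growth)
```

A school with smaller-than-average classes would receive a negative bonus under the same rule, so the adjustment cuts both ways.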
Controversy of Student Demographics
Accounting for student demographics can be interpreted as “lowering expectations” for disadvantaged students
In one sense, this is true: schools with fewer school resources and more disadvantaged students can achieve the same ratings as other schools with lower actual achievement gains
In another sense, this is false: value-added does not provide schools with any incentive to give greater effort to disadvantaged students
– We can apply “weights” that give as much or as little weight to disadvantaged students as we wish
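The idea of applying weights can be illustrated with a weighted average of student growth. This is only a sketch of one possible weighting scheme; the growth values and the choice of a weight of 2 are hypothetical.

```python
def weighted_mean_growth(students, disadvantaged_weight=1.0):
    """Weighted average growth.

    students: list of (growth, is_disadvantaged) pairs.
    Disadvantaged students count disadvantaged_weight times; others count once.
    """
    total = 0.0
    weight_sum = 0.0
    for growth, is_disadvantaged in students:
        w = disadvantaged_weight if is_disadvantaged else 1.0
        total += w * growth
        weight_sum += w
    return total / weight_sum

# Hypothetical classroom: one disadvantaged student with the largest gain.
students = [(30, True), (10, False), (20, False)]
equal = weighted_mean_growth(students)
double = weighted_mean_growth(students, disadvantaged_weight=2.0)
print(equal, double)
```

Raising the weight makes gains for disadvantaged students count more toward the rating, which changes the incentive without changing the underlying growth data.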
Value-Added Measures are Relative
VA allows us to make comparisons among schools and teachers (it’s relative), not draw absolute conclusions about performance
On the one hand, this means that some teachers and schools will have low value-added no matter what they do
On the other hand, we would never want to say that a teacher or school that reaches a particular standard is “good enough”
– Relative measures facilitate continuous improvement
Questions About How Value-Added Measures are Created
Possible Errors in Value-Added Measures
Two Basic Types of Errors
Systematic Error: The error is more likely to occur with a particular school or teacher
– Snapshots are a case in point: they systematically disadvantage low-snapshot schools
Random Error: The error is equally likely to arise for everyone
– Example: A coin toss
– Two sources:
Measurement error (from the student test scores)
Sampling error (more students, less sampling error)
– Random error is worse with growth measures
Illustrating Random Error in Growth Measures
[Figure: achievement over time. 3rd grade score: 1100; 4th grade score: 1400. Maximum growth: +500; minimum growth: +100.]
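The figures in this illustration are consistent with each score carrying an error band of plus or minus 100 points, which is an inference from the numbers and not something the slide states. Under that assumption, the sketch below shows why growth is noisier than either score alone and why more students reduce sampling error. The independence of errors is also an assumption.

```python
import math

def growth_range(prior, current, band):
    """Smallest and largest growth if each score may be off by +/- band."""
    observed = current - prior
    return observed - 2 * band, observed + 2 * band

def growth_error_sd(sd_prior, sd_current):
    """SD of the error in (current - prior), assuming independent errors."""
    return math.sqrt(sd_prior ** 2 + sd_current ** 2)

def sampling_sd(student_sd, n_students):
    """SD of a classroom average; it shrinks as the number of students grows."""
    return student_sd / math.sqrt(n_students)

print(growth_range(1100, 1400, 100))   # slide's scores, assumed band of 100
print(growth_error_sd(3, 4))           # hypothetical error SDs for each test
print(sampling_sd(10, 25))             # hypothetical student-level SD and class size
```

Because the error in a difference combines the errors of both scores, the growth measure is always at least as noisy as the noisier of the two tests.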
More on Errors
Types of random errors
– Type I error = in this case, the probability of concluding two teachers perform differently when they are in fact the same (“statistical significance”)
– Type II error = the probability of concluding two teachers perform the same when they are in fact different
Random and systematic errors are both important for deciding how to use performance measures
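To make the two error types concrete, the sketch below uses a two-sided z-test on the difference between two teachers' value-added estimates. The z-test is an assumption chosen for simplicity, not a method named in the presentation, and the differences and standard errors are hypothetical. The Type I error rate is fixed by the chosen significance level; the Type II error rate depends on the true difference and on how noisy the estimates are.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution

def critical_value(alpha):
    """Two-sided critical z value for a Type I error rate of alpha."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)

def type_two_error(true_difference, standard_error, alpha=0.05):
    """Probability of failing to detect a real difference between two teachers."""
    z_crit = critical_value(alpha)
    shift = true_difference / standard_error
    return STD_NORMAL.cdf(z_crit - shift) - STD_NORMAL.cdf(-z_crit - shift)

# Hypothetical: a true difference of 2 points, estimated with more or less noise.
print(type_two_error(2.0, 1.0))
print(type_two_error(2.0, 2.0))
```

Lowering alpha to reduce Type I errors raises the critical value and therefore raises the Type II error rate, which is the trade-off that decisions about awards and dismissal have to weigh.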
Statistical Errors and Decision Errors
Random Errors
– Type One Error: Conclude two are different when they are the same
– Type Two Error: Conclude two are the same when they are really different
Decision Error One:
– Example: Give an award to someone who really isn’t high-performing
Decision Error Two:
– Example: Leave someone on the job who is performing poorly
Systematic Errors
[Diagram labels: Policy; ? ?]
“We made too many wrong mistakes”
-- Yogi Berra
Research on Strengths and
Weaknesses of VAM
The Good News
Research on VAM is in its infancy, but . . .
Again, differences between the lowest and highest value-added teachers seem large
VAM measures have been partly validated by a random assignment experiment (here in LA)
VAM measures of teacher effectiveness are positively correlated with principals’ subjective assessments of teachers
The Bad News
VA is no better than the tests: garbage in, garbage out
– Much effort right now toward improving the quality of student assessments
VA measures are imprecise
– Hard to say that one teacher is clearly better than another based on VAM-A
– As a result, teacher measures are unstable
Vary across tests (same subject)
Sensitive to specific statistical assumptions
May not totally address the tracking problem
The Limited Applicability of VAM
One of the main limitations of VAM is that, in most states, it can only be applied easily in grades 4-8, math and reading
Excludes:
– Teachers in other subjects, coaches, specialists
– Teachers in grades K-3 and 9-12
– New teachers
On the other hand, it wouldn’t make sense for teacher evaluations to be the same across all grades and subjects
Questions About the Strengths and Weaknesses of Value-Added Measures
Putting the Evidence in Perspective
Researchers have strict standards about drawing conclusions based on statistics (about teacher performance or anything else)
– See AERA/APA/NCME standards
As decision-makers, you do not have this luxury: you cannot wait around for ideal solutions, or accept large numbers of ineffective teachers remaining in classrooms
All measures have their advantages and disadvantages and you have to compare them
The Double Standard
Critics of VAM don’t apply the same standard to credentials that they do to value-added
– Example: Do credentials “converge with results from other ratings of quality, such as classroom observations, parent surveys, …”?
– Answer: No way.
No performance measure could possibly meet the AERA/APA/NCME standards
“When I hear somebody sigh, 'Life is hard,' I am always tempted to ask, 'Compared to what?'”
-- Sydney J. Harris (journalist)
Understanding Value-Added:
The 3 Key Distinctions
Teacher vs. School Value-Added
Teacher value-added is arguably more problematic than school value-added
– it is more subject to student tracking
– fewer students per teacher
– teachers aren’t accustomed to substantive evaluation
Trade-off between “free-riding” and accuracy
A middle option: team value-added
– Elementary schools: grade-level teams
– Middle and high schools: subject matter teams
Formative vs. Summative
VAM is inherently summative—it does not provide much guidance on how to improve
No measure can do both well
Formative and summative measures are complementary
– Formative measures alone provide a path to improvement but perhaps not an incentive
The credentialing problem
– Summative measures provide an incentive but no path
Low- vs. High-Stakes
There aren’t any “no stakes” uses
Lowest stakes:
– School-level VA with school bonuses
Medium stakes:
– Report teacher VA to school principal
– Performance pay
Highest stakes:
– Make VA measures publicly available
– Tenure and dismissal
Recommendations: Using Value-Added to Improve Teaching and Learning
Recommendations for Using VAM
#1: Use value-added to measure school performance and hold schools accountable
#2: Experiment with and carefully evaluate policies that use value-added to measure the performance of individual teachers
#3: In creating performance measures, combine value-added with other measures more closely related to actual practice
#4: Experiment with and carefully evaluate policies that use value-added to measure the performance of teacher teams
Recommendations: Part II
#5a: Consider extending value-added to other grades, subjects, and student outcomes . . .
#5b: . . . But don’t let the tail wag the dog.
#6: Avoid the “air bag” problem. Don’t drive value-added measures “too fast.”
Recommendations on Creating and Reporting VA Measures: Part I
#1: Use student tests that reflect rich content and are standardized, scaled and criterion-referenced
#2: Create data systems that link student outcomes over time and to teachers and schools
#3: Include all students, including special education, English Language Learners, and students with some missing data
#4: Make adjustments to align the timing of the test with the timing of schooling activities
Recommendations on Creating and Reporting VA Measures: Part II
#5: Average value-added measures over ≥ 2 years
#6: Create value-added measures based on comparisons among teachers and schools that facilitate cooperation and collaboration
#7: Create value-added measures that compare teachers within grades and subjects
#8: Account for factors that are outside the control of those being evaluated
#9: Adjust for sampling error
#10: Report confidence intervals
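Recommendations #5 and #10 can be sketched together: average a teacher's yearly estimates, and report an interval around an estimate instead of a single number. The interval below uses a normal approximation, which is an assumption; with small classes a t-based interval would be wider and more appropriate. All values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def average_over_years(yearly_value_added):
    """Average a teacher's value-added estimates across two or more years."""
    return mean(yearly_value_added)

def confidence_interval(student_growth, level=0.95):
    """Normal-approximation interval for a classroom's mean growth."""
    n = len(student_growth)
    center = mean(student_growth)
    standard_error = stdev(student_growth) / math.sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return center - z * standard_error, center + z * standard_error

# Hypothetical yearly estimates and student-level growth values.
print(average_over_years([12, -4]))
low, high = confidence_interval([10, 20, 30, 40])
print(low, high)
```

When the intervals for two teachers overlap heavily, the data give little basis for ranking one above the other.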
An Additional Recommendation
Use value-added to evaluate school, district, and state programs and practices
The evidence on teacher credentials (above) is a good example
The value-added approach solves the same problem in program evaluation as it does in educator performance
– Avoid systematic errors
How Are Others Using VA?
Most districts are following these recommendations
– Mixing value-added with classroom observations
– Dozens of districts are using value-added as a partial basis for merit pay (federal TIF program)
– Revamping formal evaluation and tenure decisions
– For many, lack of good data system is the first barrier
Some problems
– Moving too fast (“air bag” problem)
– Lack of professional development about VA measures
– Dueling evaluation systems
“All models are false but some models are useful.”
-- George E.P. Box
Conclusion: Moving Forward to Ensure Teacher Effectiveness in LAUSD
We can do better than the credentialing and checklist evaluation system
In deciding how to use VAM, we should:
(1) Ask ourselves: Is this system going to give high ratings to the types of teachers and schools I would want my children to attend?
(2) Compare VAM to the alternatives
We need a comprehensive system of teacher effectiveness, and performance measures in some form represent one important element
Papers and References
Policy brief from PACE (forthcoming)
Forthcoming book on value-added from Harvard Education Press (February)
My web site:
http://www.education.wisc.edu/eps/faculty/harris.asp
Web site focused on teacher quality research:
http://www.teacherqualityresearch.org
Ed Week Commentary (June, 2008)
National Conference on Value-Added