
Value-Added Measures of Educator Performance: Clearing Away the Smoke and Mirrors


(1)

Douglas N. Harris

Associate Professor of Educational Policy and Public Affairs University of Wisconsin at Madison

October 19, 2010

SERVE Southeast REL Webinar

Value-Added Measures of Educator Performance:

Clearing Away the Smoke and Mirrors

(Book forthcoming, Harvard Educ. Press, February, 2011)

(2)

Preview

Discuss how we measure (or really fail to measure) teacher performance today

Explain what value-added measures are and how they might improve performance measurement

Discuss how well value-added measures capture teacher performance—the different types of errors

Interpret research evidence about the errors

Provide a sense of perspective, as well as some specific recommendations, about how to use value-added measures

(3)

A Question for All Organizations

How should we measure and reward performance?

– What if we only measure performance related to one organizational goal and omit other goals?

– What happens if we measure performance badly for any or all goals?

– How do we align the incentives of workers with those of the organizations using imperfect measures?

Specific concerns in schools:

– Many goals to balance

– Need for professionalism

– Desire to “keep politics out”

(4)

Rationale for Value-Added

(5)

The Traditional “Credentials Strategy” to Teacher Quality

Until the 1990s, the education system focused on rule compliance and resources: finance, class size, . . .

Teacher credentials also fall within the resources, or “input” approach

– Undergraduate Education and Test Scores

– Graduate Education and Experience

– Certification

Unfortunately, the only one related to teacher effectiveness is experience

Therefore important to consider outcomes and instructional practice as alternatives

(6)

Formal Teacher Evaluations

Do these make up for weaknesses of credentials?

Evaluations do not focus on the “technical core” of teaching

– i.e., they ignore instructional practice

90% of teachers receive the highest rating

Principals often do not have the training or the time to be instructional leaders

– Partly because low stakes of evaluations give little reason to take evaluation seriously

I almost never hear teachers or administrators say that the formal evaluation works well

(7)

Teacher Effectiveness Varies

“New” research suggests that teacher effectiveness varies a great deal, even within individual schools

– Some even argue that we could eliminate the achievement gap simply by reassigning the most effective teachers to minority children

– Differences are exaggerated but the larger conclusion about variation is not really in dispute

– Also, consistent with the evidence on credentials—if credentials worked, we would see less variation

Yet, we measure teacher effectiveness poorly and accountability focuses on whole schools

(8)

A Failure of Test-Based Accountability:

The Snapshot Problem

Snapshot = Any measure of student outcomes at a single point in time

– Regardless of test reporting methods (% proficient, scale scores, etc.)

– Until now, all accountability has been on snapshots

The Problem: Students enter the classroom at very different starting points, because of factors outside the control of the school

– The “starting gate inequality”

Why is this a problem?

(9)

Cardinal Rule of Accountability

Rule: Hold people accountable for what they can control

– Part 1: “Hold people accountable . . .”

Meaning that accountability is important

– Part 2: “ . . . for what they can control.”

Meaning that the details matter

Accountability systems have failed to follow the Cardinal Rule because snapshots fail to account for what students bring to the classroom

(10)

Consequences

Driving teachers out of low-snapshot schools

Pushing low-snapshot students out the door

Complacency in high-snapshot schools

Value-added measures can help address the snapshot problem and reduce these consequences

(11)

Questions About the Rationale for Value-Added

(12)

What are Value-Added Measures (VAM)?

(13)

Basic VAM

If the problem is accounting for what students bring with them to the classroom, then measure what they bring

Annual student testing allows researchers to subtract prior scores from current ones to measure growth

Growth can be calculated for different test score reporting methods

– Scale scores and NCEs best

Ideal: Growth of individual students based on scale scores

– The paradigm shift
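The subtraction described on this slide can be sketched in a few lines of Python. This is only an illustration: the teacher labels, student records, and scale scores below are invented, and the function names are hypothetical rather than part of any VAM software.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (teacher, prior-year scale score, current-year scale score)
records = [
    ("Teacher A", 300, 340),
    ("Teacher A", 320, 350),
    ("Teacher B", 500, 540),
    ("Teacher B", 520, 550),
]

def basic_value_added(rows):
    """Average growth (current minus prior score) for each teacher."""
    growth = defaultdict(list)
    for teacher, prior, current in rows:
        growth[teacher].append(current - prior)
    return {teacher: mean(values) for teacher, values in growth.items()}

def snapshot(rows):
    """Average end-of-year score for each teacher (the 'snapshot')."""
    scores = defaultdict(list)
    for teacher, _prior, current in rows:
        scores[teacher].append(current)
    return {teacher: mean(values) for teacher, values in scores.items()}

va = basic_value_added(records)
snap = snapshot(records)
```

In this made-up data the two teachers have very different snapshots but identical average growth, which is the situation a snapshot-based system cannot see.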

(14)

Illustration of Basic Approach:

2 Teachers w/ Same VA

[Figure: achievement from start of school year to end of year for two teachers. Mr. Hacker: low snapshot. Ms. Erickson: high snapshot. The gap between their starting points is labeled “starting gate inequality.”]

(15)

Illustration #2:

2 Teachers w/ Different Value-Added

[Figure: achievement from start of school year to end of year for two teachers. Ms. Smith: low VA, but high snapshot. Ms. Bloom: high VA, but low snapshot.]

(16)

Advanced VAM

Limits of Basic VAM:

– Unequal school resources

– Prior achievement may not be enough to account for student differences

Possible solution: Compare similar schools

– Put them into buckets

– Apples to apples comparisons (within buckets)

Teachers whose students make greater than predicted growth have “high value-added”
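One simple way to turn "greater than predicted growth" into a number is to predict each student's current score from the prior score and average the leftover (actual minus predicted) by teacher. The sketch below assumes a straight-line prediction fitted by ordinary least squares on invented data; the slides do not specify the statistical model, so treat this as one possible reading and the names as hypothetical.

```python
from collections import defaultdict
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

def prediction_value_added(rows):
    """Mean of (actual - predicted) current score for each teacher."""
    priors = [prior for _t, prior, _c in rows]
    currents = [current for _t, _p, current in rows]
    a, b = fit_line(priors, currents)
    residuals = defaultdict(list)
    for teacher, prior, current in rows:
        residuals[teacher].append(current - (a + b * prior))
    return {teacher: mean(values) for teacher, values in residuals.items()}

# Hypothetical records: (teacher, prior score, current score)
rows = [
    ("Teacher A", 400, 410), ("Teacher A", 500, 510), ("Teacher A", 600, 610),
    ("Teacher B", 400, 450), ("Teacher B", 500, 550), ("Teacher B", 600, 650),
]
va = prediction_value_added(rows)
```

Because the prediction is fitted to all students at once, the results are relative: one teacher's positive value is balanced by another's negative value. Adding control variables to the prediction is what narrows the comparison to similar schools.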

(17)

Illustration of Advanced VAM:

A Simple Comparison

[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; district growth, or similar schools. Label: high value-added.]

(18)

Illustration of Advanced VAM:

Prediction Approach

[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; predicted growth. Label: high value-added.]

(19)

Illustration of Advanced VAM:

Prediction Approach w/ Low-Value-Added

[Figure: achievement over time, grades 3 to 6. Line: individual school growth. Labels: low value-added; high value-added.]

(20)

Illustration of Advanced VAM:

Prediction Approach with “Controls”

[Figure: achievement over time, grades 3 to 6. Lines: individual school growth; predicted growth, small class sizes; predicted growth, large class sizes. Label: high value-added.]

(21)

How Exactly Does It Work?

With each control variable included, VAMs account for the contribution of each factor to student achievement on the average, in all schools

Based on these measured contributions, VAMs assign “bonus points” to schools with few school resources (and more disadvantaged students if demographics are included)

– If having 1 fewer student in class increases test scores by 2 points, then a school with 5 more students per class than avg. school gets 10 bonus points

Each control variable added helps to make the schools in each bucket more and more similar in terms of what they can control
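The class-size arithmetic on this slide can be written out directly. The sketch below is illustrative only: the function name is hypothetical, and the class sizes of 25 and 20 are assumed values chosen to give the slide's gap of 5 students; only the 2-point effect and the 10 bonus points come from the slide.

```python
def bonus_points(school_value, average_value, points_per_unit):
    """Adjustment that offsets a school's gap from the average on one control variable.

    points_per_unit is the estimated change in test scores when the control
    variable rises by one unit. For class size in the slide's example it is
    negative: one more student per class lowers scores by 2 points.
    """
    gap = school_value - average_value
    return -gap * points_per_unit

CLASS_SIZE_EFFECT = -2.0  # points per additional student (from the slide)

# Assumed class sizes: 25 in this school versus a 20-student average.
large_class_bonus = bonus_points(25, 20, CLASS_SIZE_EFFECT)

# A school with 5 fewer students than average gives the points back.
small_class_bonus = bonus_points(15, 20, CLASS_SIZE_EFFECT)
```

With several control variables, the same calculation would be repeated for each one and the adjustments summed.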

(22)

Controversy of Student Demographics

Accounting for student demographics can be interpreted as “lowering expectations” for disadvantaged students

In one sense, this is true: schools with fewer school resources and more disadvantaged students can achieve the same ratings as other schools with lower actual achievement gains

In another sense, this is false: value-added does not provide schools with any incentive to give greater effort to disadvantaged students

– We can apply “weights” that give as much or as little weight to disadvantaged students as we wish

(23)

Value-Added Measures are Relative

VA allows us to make comparisons among schools and teachers (it’s relative), not draw absolute conclusions about performance

On the one hand, this means that some teachers and schools will have low value-added no matter what they do

On the other hand, we would never want to say that a teacher or school that reaches a particular standard is “good enough”

– Relative measures facilitate continuous improvement

(24)

Questions About How Value-Added Measures are Created

(25)

Possible Errors in Value-Added Measures

(26)

Two Basic Types of Errors

Systematic Error: The error is more likely to occur with a particular school or teacher

– Snapshots are a case in point: they systematically disadvantage low-snapshot schools

Random Error: Is equally likely to arise for everyone

– Example: A coin toss

– Two sources:

Measurement error (from the student test scores)

Sampling error (more students, less sampling error)

– Random error is worse with growth measures

(27)

Illustrating Random Error in Growth Measures

[Figure: achievement over time. 3rd grade score: 1100. 4th grade score: 1400. Maximum growth: +500. Minimum growth: +100.]
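The numbers in this illustration are consistent with each test score being off by as much as 100 points in either direction. That error band is an assumption inferred from the figure's minimum and maximum, not something the slide states. Under that assumption, the sketch below reproduces the range of possible growth.

```python
def growth_range(prior, current, error):
    """Lowest, observed, and highest growth when each score may be off by +/- error."""
    observed = current - prior
    # Worst cases: prior overstated and current understated, or the reverse.
    lowest = (current - error) - (prior + error)
    highest = (current + error) - (prior - error)
    return lowest, observed, highest

low, observed, high = growth_range(prior=1100, current=1400, error=100)
```

The band around growth is twice as wide as the band around either score, which is why random error is worse with growth measures than with snapshots. If the two errors are independent with the same standard deviation, the standard deviation of the difference is larger by a factor of the square root of 2 rather than 2, but the direction of the effect is the same.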

(28)

More on Errors

Types of random errors

– Type I error = in this case, the probability of concluding two teachers perform differently when they are in fact the same (“statistical significance”)

– Type II error = the probability of concluding two teachers perform the same when they are in fact different

Random and systematic errors are both important for deciding how to use performance measures
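The two kinds of random error defined on this slide can be made concrete with a small simulation. Everything numeric here is an assumption for illustration: 25 students per teacher, student-level noise with a standard deviation of 30 points that is treated as known, a true gap of 5 points in the second scenario, and a conventional 1.96 cutoff.

```python
import random
from statistics import mean

def differs(true_a, true_b, n_students, noise_sd, rng, z_crit=1.96):
    """Simulate one class per teacher and test whether mean growth differs."""
    a = [rng.gauss(true_a, noise_sd) for _ in range(n_students)]
    b = [rng.gauss(true_b, noise_sd) for _ in range(n_students)]
    # Standard error of the difference in means, with noise_sd treated as known.
    se_diff = noise_sd * (2 / n_students) ** 0.5
    return abs(mean(a) - mean(b)) / se_diff > z_crit

def rejection_rate(true_a, true_b, trials=2000, n_students=25,
                   noise_sd=30.0, seed=0):
    """Share of simulated comparisons that conclude the teachers differ."""
    rng = random.Random(seed)
    hits = sum(differs(true_a, true_b, n_students, noise_sd, rng)
               for _ in range(trials))
    return hits / trials

# Teachers truly the same: any "difference" found is a Type I error.
type_1_rate = rejection_rate(true_a=50, true_b=50)

# Teachers truly 5 points apart: failing to find it is a Type II error.
type_2_rate = 1 - rejection_rate(true_a=50, true_b=55)
```

With these assumed settings the Type I rate should sit near the 5% the cutoff is designed for, while the Type II rate is very high: a real but modest difference between two teachers is usually missed when classes are small and scores are noisy.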

(29)

Statistical Errors and Decision Errors

Random Errors

– Type One Error: Conclude two are different when they are really the same

– Type Two Error: Conclude two are the same when they are really different

Systematic Errors

Policy

– Decision Error One. Example: Give an award to someone who really isn’t high-performing

– Decision Error Two. Example: Leave someone on the job who is performing poorly

[Diagram: the links from the statistical errors to the decision errors are marked “?”.]

(30)

“We made too many wrong mistakes”

-- Yogi Berra

(31)

Research on Strengths and Weaknesses of VAM

(32)

The Good News

Research on VAM is in its infancy, but . . .

Again, differences between the lowest and highest value-added teachers seem large

VAM measures have been partly validated by a random assignment experiment (here in LA)

VAM measures of teacher effectiveness are positively correlated with principals’ subjective assessments of teachers

(33)

The Bad News

VA no better than the tests—garbage in, garbage out

– Much effort right now toward improving the quality of student assessments

VA measures are imprecise

– Hard to say that one teacher is clearly better than another based on VAM-A

– As a result, teacher measures are unstable

Vary across tests (same subject)

Sensitive to specific statistical assumptions

May not totally address the tracking problem

(34)

The Limited Applicability of VAM

One of the main limitations of VAM is that, in most states, it can only be applied easily in grades 4-8, math and reading

Excludes:

– Teachers in other subjects, coaches, specialists

– Teachers in grades K-3 and 9-12

– New teachers

On the other hand, it wouldn’t make sense for teacher evaluations to be the same across all grades and subjects

(35)

Questions About the Strengths and Weaknesses of Value-Added Measures

(36)

Putting the Evidence in Perspective

Researchers have strict standards about drawing conclusions based on statistics (about teacher performance or anything else)

– See AERA/APA/NCME standards

As decision-makers, you do not have this luxury: you cannot wait around for ideal solutions, or accept large numbers of ineffective teachers remaining in classrooms

All measures have their advantages and disadvantages and you have to compare them

(37)

The Double Standard

Critics of VAM don’t apply the same standard to credentials that they do to value-added

– Example: Do credentials “converge with results from other ratings of quality, such as classroom observations, parent surveys, …”?

– Answer: No way.

No performance measure could possibly meet the AERA/APA/NCME standards

(38)

“When I hear somebody sigh, 'Life is hard,' I am always tempted to ask, 'Compared to what?'”

-- Sydney J. Harris (journalist)

(39)

Understanding Value-Added:

The 3 Key Distinctions

(40)

Teacher vs. School Value-Added

Teacher value-added is arguably more problematic than school value-added

– it is more subject to student tracking

– fewer students per teacher

– teachers aren’t accustomed to substantive evaluation

Trade-off between “free-riding” and accuracy

A middle option: team value-added

– Elementary schools: grade-level teams

– Middle and high schools: subject matter teams

(41)

Formative vs. Summative

VAM is inherently summative—it does not provide much guidance on how to improve

No measure can do both well

Formative and summative measures are complementary

– Formative measures alone provide a path to improvement but perhaps not an incentive

The credentialing problem

– Summative measures provide an incentive but no path

(42)

Low- vs. High-Stakes

There aren’t any “no stakes” uses

Lowest stakes:

– School-level VA with school bonuses

Medium stakes:

– Report teacher VA to school principal

– Performance pay

Highest stakes:

– Make VA measures publicly available

– Tenure and dismissal

(43)

Recommendations: Using Value-Added to Improve Teaching and Learning

(44)

Recommendations for Using VAM

#1: Use value-added to measure school performance and hold schools accountable

#2: Experiment with and carefully evaluate policies that use value-added to measure the performance of individual teachers

#3: In creating performance measures, combine value-added with other measures more closely related to actual practice

#4: Experiment with and carefully evaluate policies that use value-added to measure the performance of teacher teams

(45)

Recommendations: Part II

#5a: Consider extending value-added to other grades, subjects, and student outcomes . . .

#5b: . . . But don’t let the tail wag the dog.

#6: Avoid the “air bag” problem. Don’t drive value-added measures “too fast.”

(46)

Recommendations on Creating and Reporting VA Measures: Part I

#1: Use student tests that reflect rich content and are standardized, scaled and criterion-referenced

#2: Create data systems that link student outcomes over time and to teachers and schools

#3: Include all students, including special education, English Language Learners, and students with some missing data

#4: Make adjustments to align the timing of the test with the timing of schooling activities

(47)

Recommendations on Creating and Reporting VA Measures: Part II

#5: Average value-added measures over ≥ 2 years

#6: Create value-added measures based on comparisons among teachers and schools that facilitate cooperation and collaboration

#7: Create value-added measures that compare teachers within grades and subjects

#8: Account for factors that are outside the control of those being evaluated

#9: Adjust for sampling error

#10: Report confidence intervals
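Recommendations #5 and #10 work together: pooling more than one year of student growth gives more students per teacher and therefore a narrower confidence interval. The sketch below uses invented growth scores for one hypothetical teacher and a rough normal-approximation interval with a 1.96 multiplier; with samples this small, a t-based multiplier would give a somewhat wider interval.

```python
from statistics import mean, stdev

def confidence_interval(growths, z=1.96):
    """Approximate 95% confidence interval for a teacher's mean student growth."""
    n = len(growths)
    centre = mean(growths)
    # Sampling error shrinks with the square root of the number of students.
    half_width = z * stdev(growths) / n ** 0.5
    return centre - half_width, centre + half_width

# Hypothetical student growth scores for one teacher in two school years.
year_1 = [20, 45, 10, 60, 35]
year_2 = [30, 50, 15, 55, 40]

one_year = confidence_interval(year_1)
two_years = confidence_interval(year_1 + year_2)

one_year_width = one_year[1] - one_year[0]
two_year_width = two_years[1] - two_years[0]
```

Reporting the interval alongside the point estimate makes it visible when two teachers' measures overlap too much to be told apart.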

(48)

An Additional Recommendation

Use value-added to evaluate school, district, and state programs and practices

The evidence on teacher credentials (above) is a good example

The value-added approach solves the same problem in program evaluation as it does in educator performance

– Avoid systematic errors

(49)

How Are Others Using VA?

Most districts are following these recommendations

– Mixing value-added with classroom observations

– Dozens of districts are using value-added as a partial basis for merit pay (federal TIF program)

– Revamping formal evaluation and tenure decisions

– For many, lack of a good data system is the first barrier

Some problems

– Moving too fast (“air bag” problem)

– Lack of professional development about VA measures

– Dueling evaluation systems

(50)

“All models are false but some models are useful.”

-- George E.P. Box

(51)

Conclusion: Moving Forward to Ensure Teacher Effectiveness in LAUSD

We can do better than the credentialing and checklist evaluation system

In deciding how to use VAM, we should:

(1) Ask ourselves: Is this system going to give high ratings to the types of teachers I would want for my children and the types of schools I would want them to attend?

(2) Compare VAM to the alternatives

We need a comprehensive system of teacher effectiveness, and performance measures in some form represent one important element

(52)

Papers and References

Policy brief from PACE (forthcoming)

Forthcoming book on value-added from Harvard Education Press (February)

My web site:

http://www.education.wisc.edu/eps/faculty/harris.asp

Web site focused on teacher quality research:

http://www.teacherqualityresearch.org

Ed Week Commentary (June, 2008)

National Conference on Value-Added

