Student Test Score Based Measures of Teacher Effectiveness Won’t Improve NJ Schools

Posted on March 13, 2011



Op-Ed from: http://www.northjersey.com

The recent Teacher Effectiveness Task Force report recommended basing teacher evaluation significantly on student test scores. A few weeks earlier, Education Commissioner Cerf recommended that teacher tenure and dismissal decisions, as well as compensation decisions, be based largely on student assessment data.

Implicit in these recommendations is that the state and local districts would design a system for linking student assessment data to teachers for purposes of estimating teacher effectiveness. The goal of statistical “teacher effectiveness” measurement systems, including the most common approach called value-added modeling (VAM), is to estimate the extent to which a specific teacher contributes to the learning gains of a group (or groups) of students assigned to that teacher in a given year.

Unfortunately, while this all sounds good, it just doesn’t work, at least not well enough to even consider using it for high-stakes decisions about teacher tenure, dismissal or compensation. Here’s a short list (my full list is much longer) of reasons why:

  1. It is not possible to equate the difficulty of moving a group of children 5 points (or rank and percentile positions) at one end of a test scale to moving children 5 points at the other end. Yet that is precisely what the proposed evaluations endeavor to accomplish. In such a system, the only fair way to compare one teacher to another would be to ensure that each has a randomly assigned group of children whose initial achievement is spread similarly across the testing scale. Real schools and districts don’t work that way. It is also not possible to compare a 5-point gain in reading to a 5-point gain in math. These limitations undermine the entire proposed system.
  2. Even with the best models and data, teacher ratings are highly inconsistent from year to year and have very high rates of misclassification. According to one recent major study, there is a 35% chance of identifying an average teacher as poor given one year of data, and a 25% chance given three years. Getting a good rating is a statistical crapshoot.
  3. If we rate the same teacher with the same students, but with two different tests in the same subject, we get very different results. UC Berkeley economist Jesse Rothstein, re-evaluating the findings of the much-touted Gates Foundation Measures of Effective Teaching (MET) study, noted that more than 40% of teachers who placed in the bottom quarter on one test (the state test) were in the top half when using the other (alternative) test. That is, teacher ratings based on the state assessment were only slightly better than a coin toss for identifying which teachers did well using the alternative assessment.
  4. No matter how hard statisticians try, and no matter how good the data and statistical model, it is very difficult to separate a teacher’s effect on student learning gains from other classroom effects, like peer effects (the race and poverty of the peer group). New Jersey schools are highly segregated, hampering our ability to make valid comparisons across teachers who work in vastly different settings. Statistical models attempt to adjust away these differences, but usually come up short.
  5. Kids learn over the summer too, and higher-income kids learn more than their lower-income peers over the summer. As a result, annual testing data aren’t very useful for measuring teacher effectiveness. Annual (rather than fall-spring) testing data significantly disadvantage teachers serving children whose summer learning lags. Setting aside all of the unresolvable problems above, this one can be fixed with fall-spring assessments. But it cannot be resolved in any fast-tracked plan involving current New Jersey assessments, which are annual. The task force report irresponsibly ignores this HUGE AND OBVIOUS concern, recommending fast-tracked use of current assessment data.
  6. As noted by the task force, only those teachers responsible for reading and math in grades 3 to 8 could readily be assigned ratings (fewer than 20% of teachers). Testing everything else is a foolish and expensive endeavor. This means school districts will need separate contracts for separate classes of teachers and will have limited ability to move teachers from one contract type to another (for example, from second to fourth grade). Further, pundits have been arguing that a) we should be using effectiveness measures instead of experience to implement layoffs due to budget cuts, and b) we shouldn’t be laying off core classroom teachers in grades 3 to 8. Yet those are the only teachers for whom “effectiveness” measures would be available.
  7. Basing teacher evaluations, tenure decisions and dismissal decisions on scores that may be influenced by which students a teacher serves provides a substantial disincentive for teachers to serve kids with the greatest needs, disruptive kids, or kids with disruptive family lives. Many of these factors are not, and cannot be, captured by variables in the best models. Some have argued that including value-added metrics in teacher evaluation reduces the ability of school administrators to arbitrarily dismiss a teacher. Rather, use of these metrics provides new opportunities to sabotage a teacher’s career through creative student assignment practices.
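
To make the instability described in item 2 concrete, here is a minimal simulation sketch. It is an illustration under made-up assumptions, not a reproduction of the study cited above: each hypothetical teacher has a stable "true" effect, each year's rating is that effect plus independent estimation noise, and the function name, sample size, noise levels and cutoff are all invented for the example. It asks how many teachers rated in the bottom quarter one year land there again the next year.

```python
import random


def bottom_group_persistence(n_teachers=2000, noise_sd=1.0, cut=0.25, seed=0):
    """Share of teachers rated in the bottom `cut` fraction in year 1
    who are rated in the bottom `cut` fraction again in year 2.

    All parameters are hypothetical; true effects are drawn with SD 1.
    """
    rng = random.Random(seed)
    true_effect = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]

    # Each year's rating = stable true effect + independent estimation noise.
    year1 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]
    year2 = [t + rng.gauss(0.0, noise_sd) for t in true_effect]

    k = int(n_teachers * cut)

    def bottom(scores):
        # Indices of the k lowest-rated teachers.
        order = sorted(range(n_teachers), key=lambda i: scores[i])
        return set(order[:k])

    return len(bottom(year1) & bottom(year2)) / k


if __name__ == "__main__":
    for sd in (0.0, 0.5, 1.0, 2.0):
        share = bottom_group_persistence(noise_sd=sd)
        print(f"noise SD {sd}: {share:.2f} of bottom-quarter teachers repeat")
```

With no noise, every bottom-quarter teacher repeats. As the assumed noise grows relative to the spread of true effects, the repeat rate falls toward the 25% that pure chance would produce, which is the sense in which a rating becomes a crapshoot.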

In short, we may be able to estimate a statistical model that suggests that teacher effects vary widely across the education system – that teachers matter. But we would be hard pressed to use that model to identify with any degree of certainty which individual teachers are good teachers and which are bad.

Contrary to education reform wisdom, adopting such problematic measures will not make the teaching profession a more desirable career option for America’s best and brightest college graduates. In fact, it will likely make things much worse. Establishing a system where achieving tenure or getting a raise becomes a roll of the dice and where a teacher’s career can be ended by a roll of the dice is no way to improve the teacher work force.

Contrary to education reform wisdom, using these metrics as a basis for dismissing teachers will NOT reduce the legal hassles associated with removal of tenured teachers. As the first rounds of teachers are dismissed by random error of statistical models alone, by manipulation of student assignments, or when larger shares of minority teachers are dismissed largely as a function of the students they serve, there will likely be a new flood of lawsuits like none ever previously experienced. Employment lawyers, sharpen your pencils and round up your statistics experts.

Authors of the task force report might argue that they are putting only 45% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that places a single large, or even significant, weight on a single quantified factor is that this factor necessarily becomes the tipping point, or trigger mechanism. It may be 45% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.
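
One way this can happen, sketched under assumptions of my own rather than anything in the task force report: if the other components of the evaluation vary little from teacher to teacher, then nearly all of the differences in the composite score come from the test-score component, whatever its nominal weight. The function below (a hypothetical helper, assuming two independent components with made-up standard deviations) computes the share of composite-score variance contributed by the value-added component.

```python
def vam_share_of_variance(w_vam, sd_vam, sd_other):
    """Share of composite-score variance that comes from the VAM component.

    Assumes composite = w_vam * vam + (1 - w_vam) * other, with the two
    components independent. All inputs are hypothetical illustration values.
    """
    w_other = 1.0 - w_vam
    var_vam = (w_vam * sd_vam) ** 2
    var_other = (w_other * sd_other) ** 2
    total = var_vam + var_other
    if total == 0:
        raise ValueError("composite score has no variance")
    return var_vam / total


if __name__ == "__main__":
    # Equal spread in both components vs. nearly uniform "other" ratings.
    print(round(vam_share_of_variance(0.45, 1.0, 1.0), 2))
    print(round(vam_share_of_variance(0.45, 1.0, 0.2), 2))
```

With equal spread in both components, the 45% weight accounts for roughly 0.40 of the variance in the composite. If the other measures are only one fifth as spread out, that share rises to roughly 0.94, so the ranking of teachers is driven almost entirely by the test-score metric.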

Self-proclaimed “reformers” make the argument that the present system of teacher evaluation is so bad as to be non-existent. Reformers argue that the current system has a 100% error rate (assuming current evaluations label all teachers as good, when all are actually bad)!

From the “reformer” viewpoint, something is always better than nothing.

Value added is something.

We must do something.

Therefore, we must do value-added.

Reformers also point to studies showing that teachers’ value-added scores are the best predictor (albeit a weak and error-prone predictor) of teachers’ future value-added scores, a self-fulfilling prophecy. These arguments are incredibly flimsy.

In response, I often explain that if we lived in a society that walked everywhere, and a new automotive invention came along, but had the tendency to burst into a ball of flames on every third start, I think I’d walk. Now is a time to walk! Some innovations just aren’t ready for broad public adoption – and some may never be. Some, like this one, may not be a very good idea to begin with. That said, improving teacher evaluation is not a simple either/or and now may be a good time to step back from this false dichotomy and discuss more productive alternatives.
