Much has been made of late of the erroneous classification of 44 teachers in Washington DC as ineffective, a rating that exposed them to job consequences. This particular erroneous rating was based on an “error” in the calculation of the teachers’ total ratings, as acknowledged by the consulting firm applying the ratings. That is, in this case, the consultants simply did not carry out their calculations as intended. This is not to suggest by any stretch that the intended calculations are necessarily more accurate or precise than the unintended error. That is, there may be, and likely are, far more than these 44 teachers whose ratings fall arbitrarily and capriciously in the zone whereby those teachers would face employment consequences.
So, how can we tell… how can we identify such teachers? Well, DC’s own evaluation study of IMPACT provides us one useful road map and even a list of individuals arbitrarily harmed by the evaluation model. As I’ve stated on many, many occasions, it is simply inappropriate to make bright line distinctions through fuzzy data. Teacher evaluation data are fuzzy. Yet teacher evaluation systems like IMPACT impose on those data many bright lines, or cut points, to make important consequential decisions. Distinctions which are unwarranted. Distinctions which characterize as substantively different individuals who simply are not.
Nowhere is this more clearly acknowledged than in Tom Dee and Jim Wyckoff’s choice of regression discontinuity to evaluate the effect of being placed in different performance categories. I discussed this method in a previous post. As explained in the NY Times Econ blog:
To study the program’s effect, the researchers compared teachers whose evaluation scores were very close to the threshold for being considered a high performer or a low performer. This general method is common in social science. It assumes that little actual difference exists between a teacher at, say, the 16th percentile and the 15.9th percentile, even if they fall on either side of the threshold. Holding all else equal, the researchers can then assume that differences between teachers on either side of the threshold stem from the threshold itself.
In other words, the central assumption of the method is that those who fall just above and those just below a given, blunt threshold (through noisy data) really are no different from one another. Yet, they face different consequences and behave correspondingly. I pointed out in my previous post that in many ways, this was a rather silly research design to prove that “IMPACT works.” Really what it shows is that if you arbitrarily label otherwise similar teachers as acceptable and others as bad, those labeled as bad are going to feel bad about it and be more likely to leave. That’s not much of a revelation.
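To make that assumption concrete, here is a minimal simulation sketch. It uses made-up numbers, not IMPACT data: the normal distribution of "true" quality, the noise level, the 16th-percentile cut, and the window width are all assumptions chosen for illustration, and `simulate_cut_point` is a hypothetical helper name.

```python
import random
import statistics


def simulate_cut_point(n_teachers=50_000, noise_sd=1.0, cut_percentile=0.16,
                       bandwidth=0.1, seed=0):
    """Compare simulated teachers just below vs. just above a cut score.

    Every parameter here is an illustrative assumption, not an estimate
    taken from IMPACT or from the Dee/Wyckoff study.
    """
    rng = random.Random(seed)
    # Hypothetical "true" effectiveness, assumed standard normal.
    true_quality = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    # Observed rating = true quality + independent measurement noise.
    observed = [q + rng.gauss(0.0, noise_sd) for q in true_quality]
    # The bright line: the observed score at the chosen percentile.
    cut = sorted(observed)[int(cut_percentile * n_teachers)]

    # True quality of teachers whose observed score lands in a narrow
    # window on either side of the cut.
    just_below = [q for q, s in zip(true_quality, observed)
                  if cut - bandwidth <= s < cut]
    just_above = [q for q, s in zip(true_quality, observed)
                  if cut <= s < cut + bandwidth]

    return {
        "n_below": len(just_below),
        "n_above": len(just_above),
        "mean_true_below": statistics.mean(just_below),
        "mean_true_above": statistics.mean(just_above),
        "sd_true_below": statistics.stdev(just_below),
        "sd_true_above": statistics.stdev(just_above),
    }


if __name__ == "__main__":
    for name, value in simulate_cut_point().items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

Under these assumptions, the average true quality of the two groups should differ only slightly, while the spread of true quality within each group is much larger than that difference. That is the sense in which the cut score separates teachers who are, statistically, not distinguishable.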
But, there may be other uses for the Dee/Wyckoff RD study and its underlying data, with opportunity to call these researchers to the witness stand to explain the premise of regression discontinuity. You see, underlying their analyses is a sufficient sample of teachers in DC who have been put in the bottom performance category and may have faced job consequences as a result. By admission of the research design itself, these teachers have been arbitrarily placed in that category by the placement of a cut-score. They are, by admission of the research design, statistically no different from their peers who were lucky enough to be placed above that line and avoid consequences.
This seems at least a worthwhile due process challenge to me. To be clear, such violations are unavoidable in these teacher evaluation systems that try so hard to replace human judgment with mathematical algorithms, applying certain cut points to uncertain information.
So, to my colleagues in DC, I might suggest that now is the time to request the underlying data on the teachers included in that regression discontinuity model, and identify which ones were arbitrarily classified as ineffective and faced consequences as a result.
Point of clarification: Is this the best such case to be brought on this issue? Perhaps not. I think the challenges to SGPs being bluntly used for consequential decisions, when they were not even designed for distilling teacher effect, are much more straightforward. But what is unique here is that we now have on record an acknowledgement that the cut-points distinguishing between some teachers facing job consequences and others not facing job consequences were considered, by way of research evaluation design, to be arbitrary and not meaningful statistically. From a legal strategy standpoint, that’s a huge admission. It’s an admission that cut-points that forcibly (by policy design) over-ride professional judgment are in fact arbitrary and distinguish between the non-distinguishable. And I would argue that it would be pretty damning to the district for plaintiffs’ counsel to simply ask Dee or Wyckoff on the stand what a regression discontinuity design does… how it works… etc.
Baker, B.D., Green, P.C., Oluwole, J. (2013) The legal consequences of mandating high stakes decisions based on low quality information: Teacher Evaluation in the Race-to-the-Top Era. Education Policy Analysis Archives
Green, P.C., Baker, B.D., Oluwole, J. (2012) Legal implications of dismissing teachers on the basis of value-added measures based on student test scores. BYU Education and Law Journal 2012 (1)