Today brings us yet another opportunity to apply common sense interpretation to an otherwise seemingly complex research study – this time on the “effectiveness” of the DC Impact teacher evaluation system in improving teaching quality in the district. The study, by some of my favorite researchers (no sarcasm here, these are good, thoughtful individuals who do high quality work), is nicely described in the New York Times Economix Blog section:
To study the program’s effect, the researchers compared teachers whose evaluation scores were very close to the threshold for being considered a high performer or a low performer. This general method is common in social science. It assumes that little actual difference exists between a teacher at, say, the 16th percentile and the 15.9th percentile, even if they fall on either side of the threshold. Holding all else equal, the researchers can then assume that differences between teachers on either side of the threshold stem from the threshold itself.
The results suggest that the program had perhaps its largest effect on the rate at which low-performing teachers left the school system. About 20 percent of teachers just above the threshold for low performance left the school system at the end of a year; the probability that a teacher just below the threshold would quit was instead above 30 percent.
In addition, low-performing teachers who remained lifted their performance, according to the system’s criteria. To give a sense of scale, the researchers noted that the effect was about half as large as the substantial gains that teachers typically make in their first years of teaching combined.
So, for research and stats geeks this description speaks to a design referred to as regression discontinuity analysis. It sounds complicated but it’s really not. The idea is that whenever we stick cut-points through some distribution of ratings or scores – through messy/noisy data – some people fall just below those cut-points and others just above. But the cut-points are really arbitrary, and those who fall just above the line really aren’t substantively, or even statistically significantly, different from those who fall just below the line. It’s almost equivalent (and is assumed equivalent for research purposes) to taking a group of otherwise similar individuals and randomly assigning one score to some (below the line) and another score to others (above the line).
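For those who like to see the logic in code, here’s a minimal simulation of that idea. All the numbers are invented (the cut-point, the attrition rates, the bandwidth around the line) – the point is only to show the mechanics: teachers near the cut-point differ only by noise, the label alone drives who leaves, and comparing the two sides of the line in a narrow window recovers that effect.

```python
import random

random.seed(0)

CUTOFF = 250     # hypothetical rating cut-point -- invented for illustration
BANDWIDTH = 10   # only compare teachers within +/- 10 points of the cut

# Simulate teachers whose scores near the line differ only by noise;
# there is no "true quality" difference built in on either side.
teachers = [{"score": random.gauss(250, 30)} for _ in range(10_000)]

for t in teachers:
    labeled_ineffective = t["score"] < CUTOFF
    # Assume the label itself (not quality) raises the chance of leaving:
    # 20% baseline attrition, 30% if labeled ineffective.
    leave_prob = 0.30 if labeled_ineffective else 0.20
    t["left"] = random.random() < leave_prob

# Regression-discontinuity-style comparison: restrict to the narrow
# window around the cut, where the two groups are essentially
# interchangeable teachers who happened to land on different sides.
window = [t for t in teachers if abs(t["score"] - CUTOFF) <= BANDWIDTH]
below = [t for t in window if t["score"] < CUTOFF]
above = [t for t in window if t["score"] >= CUTOFF]

def attrition(group):
    return sum(t["left"] for t in group) / len(group)

print(f"attrition just below the line: {attrition(below):.2f}")
print(f"attrition just above the line: {attrition(above):.2f}")
```

Run it and the gap between the two rates is roughly the 20-versus-30-percent difference we baked in – even though, by construction, the teachers on either side of the line are the same.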
In one application of this approach, researchers from Harvard studied the effect of high stakes high school exit exams on students in Massachusetts. Students who barely passed the test were compared with those who barely failed the test on the first try. In reality, missing or correctly answering one or two additional questions does not validly indicate that one child knows their math better than the other. The kids were otherwise comparable, but some were labeled failures and others successes. Those labeled failures were more likely to drop out of high school and less likely to attend college.
The conclusion – these arbitrary, non-meaningful distinctions adopted in policy are harmful.
This brings us to the present study on the DC Impact teacher evaluation system. Here, the researchers identified teachers who were really no different from one another statistically on their DC Impact ratings, but some were just a few fractions of a point low enough to be labeled as Ineffective and face the threat of dismissal, and others just high enough to be out of the woods for now. That is, there really aren’t any substantive observed quality differences between these two groups. Note that the researchers studied this at the high end of the ratings distribution as well, but didn’t really find as much going on there.
Put simply, what this study says is that if we take a group of otherwise similar teachers, and randomly label some as “ok” and tell others they suck and their jobs are on the line, the latter group is more likely to seek employment elsewhere. No big revelation there and certainly no evidence that DC Impact “works.”
Rather, arbitrary, non-meaningful distinctions are still consequential. This is largely what was found in the Massachusetts high stakes testing studies.
Actually, one implication for supervisors is that if you want to get a teacher you don’t like to leave your school, find a way to give them a bad rating. But I think most supervisors and principals could already figure that one out.
Here’s an alternative experiment to try – take a group of otherwise similar teachers and randomly assign them to group 1 and group 2. We’ll treat Group 1 okay… just okay… no real pats on the back or accolades. Group 2, on the other hand, will be berated and treated like crap by the principal on a daily basis, and each day in passing, the principal will scowl at them and say… “your job’s on the line!”
My thinking is that group 2 teachers will be more likely to seek employment elsewhere. That’s not hugely different from the DC Impact research framework, and neither are the policy implications. Does this mean that teacher evaluation works – that it has appropriate labor market consequences? No… not at all. It means that arbitrary differential treatment matters.
Of course, this would be an unethical experiment unlikely to make it through IRB approval. But heck, screwing with people’s lives via actual arbitrary and capricious employee rating schemes, adopted as policy, is totally okay.
As for the second conclusion… that those who do stay appear to improve their game… it certainly makes sense that individuals would try not to continue being the whipping boy, even if they perceive their prior selection as whipping boy to be arbitrary and capricious. Notably, the bulk of the evaluations in this study were based on observed behaviors, not test-based metrics, and with observations, teachers have more direct control over what their supervisors observe and can therefore respond accordingly. Whether these behavior changes have anything to do with better actual on-the-job performance – “good teaching” – is at least questionable.