Rolling Dice: If I roll a “6” you’re fired!

Okay… Picture this… I’m rolling dice… and each time I roll a “6” some loud-mouthed, tweet-happy pundit who just loves value-added assessment for teachers gets fired. Sound fair? It might happen to someone who sucks at their job… or it might just be someone who is rather average. Doesn’t matter. They lost on the roll of the dice. A 1 in 6 chance. Not that bad. A 5 in 6 chance of keeping their job. Can’t you live with that?

This report was just released the other day from the National Center for Education Statistics:

The report carries out a series of statistical tests to determine the identification “error” rates for “bad teachers” when using typical value added statistical methods. Here’s a synopsis of the findings from the report itself:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.


Type I error rate (α) is the probability that based on c years of data, the hypothesis test will find that a truly average teacher (such as Teacher 4) performed significantly worse than average. (p. 12)

So, that means there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. So, what I really need are some 4-sided dice. I gave the pundits odds that are too good! Admittedly, this is the likelihood of identifying an “average” teacher as well below average. The likelihood of identifying an above-average teacher as below average would be lower. Here’s the relevant definition of a “false positive” error rate from the study:

the false positive error rate, FPR(q), is the probability that a teacher (such as Teacher 5) whose true performance level is q SDs above average is falsely identified for special assistance. (p. 12)

From the first quote above, even this occurs 1 in 10 times given three years of data (and 2 in 10 given only one year). And here’s the definition of a “false negative” error:

false negative error rate is the probability that the hypothesis test will fail to identify teachers (such as Teachers 1 and 2 in Figure 2.1) whose true performance is at least T SDs below average.

…which also occurs 1 in 10 times given three years of data (and 2 in 10 given only one year).
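To put the rates quoted above on the same footing as my dice, here is a minimal sketch that converts each probability into "1 in N" odds. The percentages are the ones quoted from the report; the helper name `one_in_n` and the scenario labels are my own, for illustration only.

```python
def one_in_n(probability):
    """Express a probability as "1 in N" odds, e.g. 0.25 -> 4.0."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return 1 / probability


# Rates quoted from the report, plus the six-sided die from the opening.
scenarios = {
    "six-sided die shows a 6": 1 / 6,
    "Type I error, three years of data": 0.25,
    "Type I error, one year of data": 0.35,
    "false positive, three years of data": 0.10,
    "false positive, one year of data": 0.20,
}

for label, p in scenarios.items():
    print(f"{label}: {p:.1%}, about 1 in {one_in_n(p):.1f}")
```

On these numbers, the six-sided die is kinder to the pundits than either Type I error rate.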

These concerns are not new. In a previous post, I discuss various problems with using value added measures for identifying good and bad teachers, such as temporal instability:

The introduction of this new report notes:

Existing research has consistently found that teacher- and school-level averages of student test score gains can be unstable over time. Studies have found only moderate year-to-year correlations—ranging from 0.2 to 0.6—in the value-added estimates of individual teachers (McCaffrey et al. 2009; Goldhaber and Hansen 2008) or small to medium-sized school grade-level teams (Kane and Staiger 2002b). As a result, there are significant annual changes in teacher rankings based on value-added estimates.
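To get a feel for what a year-to-year correlation of 0.2 to 0.6 does to rankings, here is a rough simulation sketch. It rests on a toy model that is my assumption, not the report's method: each teacher's yearly estimate is a stable "true" effect plus independent noise, with variances chosen so that estimates from two years correlate at the stated level. The function name, the number of teachers and the bottom-fifth cutoff are all hypothetical choices.

```python
import random


def simulate_persistence(correlation, n_teachers=20000, seed=0):
    """Share of year-1 bottom-fifth teachers who are also bottom-fifth in year 2.

    Assumption (not from the report): each year's estimate is a stable
    teacher effect plus independent noise, with variances chosen so the
    year-to-year correlation of the estimates equals `correlation`.
    """
    rng = random.Random(seed)
    signal_sd = correlation ** 0.5        # spread of the stable teacher effect
    noise_sd = (1 - correlation) ** 0.5   # spread of the year-specific noise
    year1, year2 = [], []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0, signal_sd)
        year1.append(true_effect + rng.gauss(0, noise_sd))
        year2.append(true_effect + rng.gauss(0, noise_sd))

    def bottom_fifth(scores):
        # Indices of teachers scoring below the 20th-percentile cutoff.
        cutoff = sorted(scores)[len(scores) // 5]
        return {i for i, s in enumerate(scores) if s < cutoff}

    low1, low2 = bottom_fifth(year1), bottom_fifth(year2)
    return len(low1 & low2) / len(low1)


for r in (0.2, 0.4, 0.6):
    share = simulate_persistence(r)
    print(f"correlation {r}: {share:.0%} stay in the bottom fifth")
```

In this toy model the share who stay in the bottom fifth grows with the correlation, but the noise keeps it well short of 100%.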

In my first post on this topic (and subsequent ones), I point out that the National Academies have already cautioned that:

“A student’s scores may be affected by many factors other than a teacher — his or her motivation, for example, or the amount of parental support — and value-added techniques have not yet found a good way to account for these other elements.”

And again, this new report provides a laundry list of factors that affect value-added assessment beyond the scope of the analysis itself:

However, several other features of value-added estimators that have been analyzed in the literature also have important implications for the appropriate use of value-added modeling in performance measurement. These features include the extent of estimator bias (Kane and Staiger 2008; Rothstein 2010; Koedel and Betts 2009), the scaling of test scores used in the estimates (Ballou 2009; Briggs and Weeks 2009), the degree to which the estimates reflect students’ future benefits from their current teachers’ instruction (Jacob et al. 2008), the appropriate reference point from which to compare the magnitude of estimation errors (Rogosa 2005), the association between value-added estimates and other measures of teacher quality (Rockoff et al. 2008; Jacob and Lefgren 2008), and the presence of spillover effects between teachers (Jackson and Bruegmann 2009).

In my opinion, the most significant problem here is non-random assignment. The noise problem is important, but much less so. It just happens to be the topic of the day.

But alas, we continue to move forward… full steam ahead.

As I see it there are two groups of characters pitching fast-track adoption of value-added teacher evaluation policies.

Statistically Inept Pundits (who really don’t care anyway): The statistically inept pundits are those we see on Twitter every day, applauding the mass firing of DC teachers, praising the Colorado teacher evaluation bill and thinking that RttT is just AWESOME, regardless of the mixed (at best) evidence behind the reforms promoted by RttT (like value-added teacher assessment). My take is that they have no idea what any of this means… have little capacity to understand it anyway… and probably don’t much care. To them, I’m just a curmudgeonly academic throwing a wet blanket on their teacher-bashing party. After all, who but a wet blanket could really be against making sure all kids have good teachers… making sure that we fire and/or lay off the bad teachers, not just the inexperienced ones? These teachers are dangerous after all. They are hurting kids. We must stop them! Can’t argue with that. Or can we? The problem is, we just don’t have ideal, or even reasonably good, methods for distinguishing between those good and bad teachers. And school districts that are all of a sudden facing huge budget deficits and laying off hundreds of teachers don’t retroactively have in place an evaluation system with sufficient precision to weed out the bad – nor could they. Implementing “quality-based layoffs” here and now is among the most problematic suggestions currently out there. The value-added assessment systems yet to be implemented aren’t even up to the task. I’m really confused why these pundits who have so little knowledge about this stuff are so convinced that it is just so AWESOME.

Reform Engineers: Reform engineers view this issue in purely statistical and probabilistic terms – setting legal, moral and ethical concerns aside. I can empathize with that somewhat, until I try to make it actually work in schools and until I let those moral, ethical and legal concerns creep into my head. Perhaps I’ve gone soft. I’d have been all for this no more than 5 years ago. The reform engineer assumes first that it is the test scores that we want to improve as our central objective – and only the test scores. Test scores are the be-all and end-all measure. The reform engineer is okay with the odds above because more than 50% of the time they will fire the right person. That may be good enough – statistically. And, as long as they have decent odds of replacing the low-performing teacher with at least an average teacher – each time – then the system should move gradually in a positive direction. All that matters is that we have the potential for a net positive quality effect on replacing the 3/4 of fired teachers who were correctly identified and at least breaking even on the 1/4 who were falsely fired. That’s a pretty loaded set of assumptions though. Are we really going to get the best applicants to a school district where they know they might be fired for no reason on a 25% chance (if using 3 years of data) or 35% chance (on one year)? Of course, I didn’t even factor into this the number of bad teachers identified as good.

I guess that one could try to dismiss those moral, ethical and legal concerns regarding wrongly dismissing teachers by arguing that if it’s better for the kids in the end, then wrongly firing 1 in 4 average teachers along the way is the price we have to pay. I suspect that’s what the pundits would argue – since it’s about fairness to the kids, not fairness to the teachers, right? Still, this seems like a heavy toll to pay, an unnecessary toll, and quite honestly, one that’s not even that likely to work even in the best of engineered circumstances.
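One more wrinkle on the 3/4 versus 1/4 arithmetic: how the fired group actually splits between correctly and wrongly identified teachers also depends on how many teachers are truly low performing to begin with. Here is a minimal sketch of that calculation, under a simplified two-group model that is my assumption and not the report's method: every teacher is either truly low performing or not, truly low performers are flagged 75% of the time, and everyone else is flagged 25% of the time. The function name and all input values are hypothetical.

```python
def share_wrongly_flagged(share_truly_low, detection_rate, false_flag_rate):
    """Fraction of flagged teachers who are not truly low performing.

    Simplified two-group model (an assumption, not the report's method):
    every teacher is either truly low performing or not.
    """
    flagged_low = share_truly_low * detection_rate
    flagged_other = (1 - share_truly_low) * false_flag_rate
    return flagged_other / (flagged_low + flagged_other)


# Hypothetical inputs: 75% detection and a 25% false-flag rate,
# under different shares of truly low performers.
for share in (0.5, 0.2, 0.1):
    wrong = share_wrongly_flagged(share, 0.75, 0.25)
    print(f"{share:.0%} truly low performing: {wrong:.0%} of flagged teachers wrongly flagged")
```

Under these assumptions, the tidy 3/4 versus 1/4 split only appears when half of all teachers are truly low performing. When truly low performers are rarer, a larger share of those flagged are flagged wrongly.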


Follow-up notes: A few comments I have received have argued, from a reform engineering perspective, that if we a) use the maximum number of years of data possible, and b) focus on identifying the bottom 10% or fewer of teachers, then based on the analysis in the NCES/Mathematica report we might significantly reduce our error rate – down to, say, 10% of teachers being incorrectly fired. Further, it is more likely that those incorrectly identified as failing are closer to failing anyway. That is not, however, true in all cases. This raises an interesting ethical question: what is the tolerable threshold for randomly firing the wrong teacher, or keeping the wrong teacher?

Further, I’d like to emphasize again that there are many problems that seriously undermine the application of value-added assessment for teacher hiring/firing decisions. This issue probably ranks about 3rd among the major problem categories. And this issue has many dimensions. First, there is the statistical and measurement issue of having statistical noise result in wrongful teacher dismissal. There are also the litigation consequences that follow. And there are questions over how the use of such methods will influence individuals thinking about pursuing teaching as a career, if pay is not substantially increased to counterbalance these new job risks. It’s not just about tweaking the statistical model and cut-points to bring the false positives into a tolerable zone. This type of shortsightedness is all too common in the types of technocratic solutions I, myself, used to favor.

Here’s a quick synopsis of the two other major issues undermining the usefulness of value-added assessment for teacher evaluation & dismissal (on the assumption that majority weight is placed on value-added assessment):

1) That students are not randomly assigned across teachers and that this non-random assignment may severely bias estimates of teacher quality. The fact that non-random assignment of students may bias estimates of teacher quality will also likely have adverse labor market effects, making it harder to get the teachers we need in the classrooms where we need them most – at least without a substantial increase to their salaries to offset the risk.

2) That only a fraction of teachers can even be evaluated this way in the best of possible cases (generally less than 20%), and even their “teacher effects” are tainted – or enhanced – by one another. As I discussed previously, this means establishing different contracts for those who will versus those who will not be evaluated by test scores, creating at least two classes of teachers in schools and likely leading to even greater tensions between them. Further, there will likely be labor market effects, with certain types of teachers either jockeying for position as a VAM-evaluated teacher or avoiding those positions.

More can be found on my entire blog thread on this topic: