Many of us have had an extensive ongoing conversation about the Big Study (CFR) that caught media attention last week. That conversation has included much thoughtful feedback from the authors of the study. That’s how it should be: a good, ongoing discussion delving into technical details and considering alternative policy implications. I received the following kind note from one of the study authors, John Friedman, in which he addresses three major points in my critique:
Thank you very much for your thorough and well-reasoned comment on our paper. You raise three major concerns with the study in your post which we’d like to address. First, you write that “just because teacher VA scores in a massive data set show variance does not mean that we can identify with any level of precision or accuracy which individual teachers … are ‘good’ and which are ‘bad.’” You are certainly correct that there is lots of noise in the measurement of quality for any individual teacher. But I don’t think it is right that we cannot identify individual teachers’ quality with any precision. In fact, our value-added estimates for individual teachers come with confidence intervals that exactly quantify the degree of uncertainty, as we discuss in Section 6.2 of the paper. For instance, a teacher who after 3 average-sized classrooms had VA of -0.2, which is 2 standard deviations below the mean, would have a confidence interval of approximately [-0.41,0.01]. This range implies that there is an 80% chance that the teacher is among the worst 15% in the system, and less than a 5% chance that the teacher is better than average. Importantly, we take account of this teacher-level uncertainty in our calculations in Figure 10. Even taking account of this uncertainty, replacing this teacher with an average one would generate $190K in NPV future earnings for the students per classroom. Thus, even taking into account imprecision, value-added still provides useful information about individual teachers. The imprecision does imply that we should use other measures (such as principal ratings or student feedback) in combination with VA (more on this below).
Your second concern is about the policy implications of the study, in particular the quotations given by my co-author and I for the NYT article, which give the impression that we view dismissing low-VA teachers as the best solution. These quotes were taken out of context and we’d like to clarify our actual position. As we emphasize in our executive summary and paper, the policy implications of the study are not completely clear. What we know is that great teachers have great value and that test-score based VA measures can be useful in identifying such teachers. In the long run, the best way to improve teaching will likely require making teaching a highly prestigious and well rewarded profession that attracts top talent. Our interpretation of the policy implications of the paper is better reflected in this article we wrote for the New York Times.
Finally, you suggest to your readers that the earnings gains from replacing a bottom-5% teacher with an average one are small — only $250 per year. This is an arithmetic error due to not adjusting for discounting. We discount all gains back to age 12 at a 5% interest rate in order to put everything in today’s dollars, which is standard practice in economics. Your calculation requires the undiscounted gain (i.e. summing the cumulative earnings impact), which is $50,000 per student for a 1 SD better teacher (84th pctile vs 50th pctile) in one grade. Discounted back to age 12 at a 5% interest rate, $50K is equivalent to about $9K. $50,000 over a lifetime – around $1,000 per year – is still only a moderate amount, but we think it would be implausible that a single teacher could do more than that on average. So the magnitudes strike us as reasonable yet important. It sounds like many readers make this discounting mistake, so it might be helpful to correct your calculation so that your readers have the facts right (the paper itself also provides these calculations in Appendix Table 14).
Thank you again for your thoughtful post. We look forward to reading your comments on our work and others in the future.
I do have comments in response to each of these points, as well as a few additional thoughts. And I certainly welcome any additional response from John or the other authors.
On precision & accuracy
The first point above addresses only the confidence interval around a teacher’s VA estimate for the teacher in the bottom 15%. Even then, if we were to use the VA estimate as a blunt instrument (acknowledging that the paper does not make such a recommendation – but does simulate it as an option) for deselection, this would result in a 20% chance of dismissing a teacher who is not legitimately in the bottom 15% (and up to a 5% chance of dismissing one who is actually above average), given three years of data. Yes, that’s far better than break even (after waiting three full years), and permits one to simulate a positive effect of replacing the bottom 15% (in purely hypothetical terms, holding lots of stuff constant). But acting on this information, accepting a 1/5 misfire rate to generate a small marginal benefit, might still have a chilling effect on future teacher supply (given the extent to which the error is entirely out of teachers’ control).
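The 80% and 5% figures quoted in the letter can be roughly reproduced with a normal approximation. The sketch below is only my illustration, not the CFR authors' calculation. It assumes that the quoted interval is a symmetric 95% interval, that uncertainty about the teacher's true VA is normal around the estimate, and that true VA across teachers is normal with a standard deviation of 0.1 (implied by the statement that -0.2 is 2 standard deviations below the mean). All variable names are hypothetical.

```python
from statistics import NormalDist

# Figures quoted in the letter.
va_estimate = -0.2             # teacher's estimated value-added
ci_low, ci_high = -0.41, 0.01  # quoted confidence interval
sd_across_teachers = 0.1       # implied: -0.2 is 2 SDs below the mean

# Assumption: the interval is a symmetric 95% interval,
# so its half-width is about 1.96 standard errors.
z95 = NormalDist().inv_cdf(0.975)
std_error = (ci_high - ci_low) / (2 * z95)

# Assumption: uncertainty about this teacher's true VA is normal
# and centered on the estimate.
teacher = NormalDist(mu=va_estimate, sigma=std_error)

# Assumption: true VA across teachers is normal with mean 0,
# which puts the 15th-percentile cutoff near -1.04 SDs.
bottom15_cutoff = NormalDist(mu=0.0, sigma=sd_across_teachers).inv_cdf(0.15)

p_bottom15 = teacher.cdf(bottom15_cutoff)
p_above_average = 1 - teacher.cdf(0.0)
misfire_rate = 1 - p_bottom15  # chance the teacher is not truly in the bottom 15%

print(f"P(truly in bottom 15%): {p_bottom15:.2f}")
print(f"P(truly above average): {p_above_average:.2f}")
print(f"Misfire rate:           {misfire_rate:.2f}")
```

Under these assumptions the numbers land close to the ones in the letter: roughly a four-in-five chance of being in the bottom 15%, and a chance of being above average that is below 5%. The misfire rate is simply the complement of the first probability.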
But the confidence interval is only one piece of the puzzle. It is the collective pieces of that puzzle that have led me to believe that the VA estimates are of limited if any value as a human resource management tool, as similarly concluded by Jesse Rothstein in his review of the first round of Gates MET findings.
We also know that if we were to use a different test of the supposed same content, we are quite likely to get different effectiveness ratings for teachers (either following the Gates MET findings, or the Corcoran & Jennings findings). That is, the present analysis tells us only whether there exists a certain level of confidence in the teacher ratings on a single instrument, which may or may not be the best assessment of teaching quality for that content area. Further, test-to-test differences in teacher ratings may be caused by any number of factors. I would expect that test scaling differences, as much as subtle content and question format differences, along with differences in the stakes attached, lead to the difference in ratings across the same teachers when different tests are used. Given that the tests changed at different points in the CFR study, and there are likely at least some teachers who maintained constant assignments across those changes, CFR could explore shifts in VA estimates across different tests for the same teachers. Next paper? (The current paper is already 5 or 6 rolled into one.)
Also, as the CFR paper appropriately acknowledges, the VA estimates – and any resulting assumptions that they are valid – are contingent upon the fact that they were estimated retrospectively, using assessments for which there were no stakes attached – most importantly, where high stakes personnel decisions were not based on the tests.
And one final technical point: just because the model across all cases does not reveal any systematic patterns of bias does not mean that significant numbers of teacher cases within the mix would not have their ratings compromised by various types of biases (associated with either observables or unobservables). Yes, the bias, on average, is either a wash or drowned out by the noise. But there may be clusters of teachers serving clusters of students and/or in certain types of settings where the bias cuts one way or the other. This may be a huge issue if school officials are required to place heavy emphasis on these measures, and where some schools are affected by biased estimates (in any direction) and others not.
On the limited usefulness of VAM estimates
I do not deny, though I’m increasingly skeptical, that these models produce some useful information at the individual level. They do indeed, as CFR explain, produce a prediction – with error – of the likelihood that a teacher produces higher or lower gains across students on a specific test or set of tests (for what that test is worth). That may be useful information. But that’s a very small piece of a much larger human resource puzzle. First of all, it’s a very limited piece of information on a very small subset of teachers in schools.
While pundits often opine about the potential cost effectiveness of these statistical estimates for use in teacher evaluation versus more labor intensive observation protocols, we must consider in that cost effectiveness analysis that the VA estimates are capturing effectiveness only a) with respect to the specific tests in question (since other tests may yield very different results) and b) for a small share of our staff districtwide.
I do appreciate, and did recognize that the CFR paper doesn’t make a case for deselection with heavy emphasis on VA estimates. Rather, the paper ponders the policy implications in the typical way in which we academically speculate. That doesn’t always play well in the media – and certainly didn’t this time.
The problem – and a very big one – is that states (and districts) are actually mandating rigid use of these metrics including proposing that these metrics be used in layoff protocols (quality based RIF) – essentially deselection. Yes, most states are saying “use test-score based measures for 50%” and use other stuff for the other half. And political supporters are arguing – “no-one is saying to use test scores as the only measure.” The reality is that when you put a rigid metric (and policymakers will ignore those error bands) into an evaluation protocol and combine it with less rigid, less quantified other measures the rigid metric will invariably become the tipping factor. It may be 50% of the protocol, but will drive 100% of the decision.
Also, state policymakers and local decision makers for the most part do not know the difference between a well estimated VAM, with appropriate checks for bias, and a Student Growth Percentile score, which is being pitched to many state policymakers as a viable alternative and has now been adopted in many states, with no covariates and no published statistical evaluation of its properties, biases, etc.
Further, I would argue that there are actually perverse incentives for state policymakers and local district officials to adopt bad and/or severely biased VAMs because those VAMs are likely to appear more stable (less noisy) over time (because they will, year after year, inappropriately disadvantage the same teachers).
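The stability point can be made concrete with a toy variance decomposition. This is a minimal sketch under my own assumptions, not a result from the CFR paper: each yearly rating is modeled as a true teacher effect plus a persistent bias plus independent yearly noise, with the three components uncorrelated. The function name and the variance values are hypothetical, chosen only for illustration.

```python
def year_to_year_correlation(var_true, var_bias, var_noise):
    """Correlation between two years of ratings for the same teachers.

    Assumes rating = true effect + persistent bias + independent yearly noise,
    with all three components uncorrelated with one another.
    """
    # Both the true effect and the bias repeat every year, so both
    # contribute to the covariance between years.
    persistent = var_true + var_bias
    return persistent / (persistent + var_noise)


# Hypothetical variances, for illustration only.
var_true, var_noise = 0.01, 0.02

unbiased = year_to_year_correlation(var_true, var_bias=0.0, var_noise=var_noise)
biased = year_to_year_correlation(var_true, var_bias=0.01, var_noise=var_noise)

print(f"unbiased model: {unbiased:.2f}")
print(f"biased model:   {biased:.2f}")
```

In this toy setup the biased model looks more reliable from year to year even though the extra stability comes entirely from the bias, which is exactly why stability alone is a poor criterion for choosing a model.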
State policymakers are more than willing to make that completely unjustified leap that the CFR results necessarily indicate that Student Growth Percentiles – just like a well estimated (though still insufficient) VAM – can and should be used as blunt deselection tools (or tools for denying and/or removing tenure).
In short, even the best VAMs provide us with little more than noisy estimates of teaching effectiveness, measured by a single set of assessments, for a small share of teachers.
Given the body of research, now expanded with the CFR study, while I acknowledge that these models can pick up seemingly interesting variance across teachers, I stand by my perspective that that information is of extremely limited use for characterizing individual teacher effectiveness.
On the $250 calculation (and my real point)
My main point regarding the breakdown to $250 from $266k was that the $266k was generated for WOW effect, from an otherwise non-startling number (be it $1,000 or $250). It’s the intentional exaggeration by extrapolation that concerns me, like stretching the Y axes in the NY Times story (theirs, not yours). True, I simplified and didn’t discount (via an arbitrary 5%) and instead did a simple back-of-the-napkin calculation that would then reconcile, for the readers, with the related graph – which shows about a $250 shift in earnings at age 28 (but stretches the Y axis to also exaggerate the effect). Perhaps it is more reasonable to point out that this is about a $250 shift on roughly $20,500, or slightly more than 1.2%.
I agree that when we see shifts even this seemingly subtle, in large data sets and in this type of analysis, they may be meaningful shifts. And I recognize that researchers try to find alternative ways to illustrate the magnitude of those shifts. But, in the context of the NY Times story, this one came off as stretching the meaningfulness of the estimate – multiplying it just enough times (by the whole class then by lifetime) to make it seem much bigger, and therefore much more meaningful. That was easy blog fodder. But again, I put it down in that section of my critique focused on the presentation, not the substance.
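For readers who want to see how discounting moves these numbers around, here is a minimal sketch. The 5% rate, the age-12 base, and the $50,000 and $9,000 figures come from the letter above. The flat earnings profile over ages 22 through 65 is purely my assumption and is not the profile used in the paper, so the discounted total is not expected to match the paper's figure. The function and variable names are hypothetical.

```python
import math

DISCOUNT_RATE = 0.05  # rate stated in the letter
BASE_AGE = 12         # gains are discounted back to age 12


def present_value(gains_by_age, rate=DISCOUNT_RATE, base_age=BASE_AGE):
    """Discount a {age: dollar gain} stream back to base_age."""
    return sum(gain / (1 + rate) ** (age - base_age)
               for age, gain in gains_by_age.items())


# Assumption (not from the paper): the $50,000 undiscounted gain is
# spread evenly over ages 22 through 65.
ages = range(22, 66)
flat_stream = {age: 50_000 / len(ages) for age in ages}

undiscounted = sum(flat_stream.values())
discounted = present_value(flat_stream)

# Number of years at which a single $50,000 payment is worth $9,000 at 5%.
implied_years = math.log(50_000 / 9_000) / math.log(1 + DISCOUNT_RATE)

# The back-of-the-napkin ratio: $250 on roughly $20,500 at age 28.
share_of_earnings = 250 / 20_500

print(f"undiscounted total:      ${undiscounted:,.0f}")
print(f"discounted (flat case):  ${discounted:,.0f}")
print(f"implied horizon (years): {implied_years:.1f}")
print(f"share of age-28 income:  {share_of_earnings:.2%}")
```

The flat profile gives a present value somewhat above the $9K quoted in the letter. One reading, which is my inference rather than anything stated in the letter, is that the paper's gains are weighted toward later working years. Either way, the same gain can be presented as a large lifetime total, a modest discounted sum, or a small percentage of annual earnings, depending on the framing.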
If I were a district personnel director, would I want these data? Would I use them? How?
This is one that I’ve thought about quite a bit.
Yes, probably. I would want to be able to generate a report of the VA estimates for teachers in the district. Ideally, I’d like to be able to generate a report based on alternative model specifications (option to leave in and take out potential biases) and on alternative assessments (or mixes of them). I’d like the sensitivity analysis option in order to evaluate the robustness of the ratings, and to see how changes to model specification affect certain teachers (to gain insights, for example, regarding things like peer effect vs. teacher effect).
If I felt, when poring over the data, that they were telling me something about some of my teachers (good or bad), I might then use these data to suggest to principals how to distribute their observation efforts through the year. Which classes should they focus on? Which teachers? It would be a noisy pre-screening tool, and would not dictate any final decision. It might start the evaluation process, but would certainly not end it.
Further, even if I did decide that I have a systematically underperforming middle school math teacher (for example), I would only be likely to try to remove that teacher if I was pretty sure that I could replace him or her with someone better. It is utterly foolish from a human resource perspective to automatically assume that I will necessarily be able to replace this “bad” teacher with an “average” one. Fire now, and then wait to see what the applicant pool looks like and hope for the best?
Since the most vocal VAM advocates love to make the baseball analogies… pointing out the supposed connection between VAM teacher deselection arguments and Moneyball, consider that statistical advantage in baseball is achieved by trading for players with better statistics – trading up (based on which statistics a team prefers/needs). You don’t just unload your bottom 5% or 15% players in on-base-percentage and hope that players with on-base-percentage equal to your team average will show up on your doorstep. (I acknowledge that the baseball statistics analogies to using VAM for teacher evaluation are completely stupid to begin with.)
Unfortunately, state policymakers are not viewing it this way – not seeking reasonable introduction of new information into a complex human resource evaluation process. Rather, they are rapidly adopting excessively rigid mandates regarding the use of VA estimates or Student Growth Percentiles as the major component of teacher evaluation, determination of teacher tenure and dismissal. And unfortunately, they are misreading and misrepresenting (in my view) the CFR study to drive home their case.