Rebutting (again) the Persistent Flow of Disinformation on VAM, SGP and Teacher Evaluation

Posted on June 8, 2013



This post is in response to testimony overheard from recent presentations to the New Jersey State Board of Education. For background and more thorough explanations of issues pertaining to the use of Value-Added Models (VAMs) and Student Growth Percentiles (SGPs), please see the following two sources:

  • Baker, B.D., Oluwole, J., Green, P.C. III (2013) The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21(5). This article is part of EPAA/AAPE’s Special Issue on Value-Added: What America’s Policymakers Need to Know and Understand, Guest Edited by Dr. Audrey Amrein-Beardsley and Assistant Editors Dr. Clarin Collins, Dr. Sarah Polasky, and Ed Sloat. Retrieved [date], from http://epaa.asu.edu/ojs/article/view/1298
  • Baker, B.D., Oluwole, J. (2013) Deconstructing Disinformation on Student Growth Percentiles & Teacher Evaluation in New Jersey. New Jersey Education Policy Forum 1(1) http://njedpolicy.files.wordpress.com/2013/05/sgp_disinformation_bakeroluwole1.pdf

Here, I address a handful of key points.

First, different choices of statistical model or method for estimating a teacher’s “effect” on test score growth matter. Indeed, one might find that adding new variables, controlling for this, that and the other thing, doesn’t always shift the entire pattern significantly. But a substantial body of literature indicates that even subtle changes to the included variables or modeling approach can significantly change individual teachers’ ratings and substantially reshuffle teachers across rating categories. Further, these changes may be most substantial for teachers in the tails of the distribution – those for whom the rating might be most consequential.

Second, I reiterate that value-added models in their best, most thorough form are not the same as student growth percentile estimates. Specifically, those who have made direct comparisons of VAMs versus SGPs for rating teachers have found that SGPs – by omission of additional variables – are less appropriate. That is, they don’t do a very good job of sorting out the teacher’s influence on test score growth!

Third, I point out that the argument that VAM as a teacher effect indicator is as good as batting average for hitters or earned run average for pitchers simply means that VAM is a pretty crappy indicator of teacher quality.

Fourth, I reiterate a point I’ve made on numerous occasions: just because we see a murky pattern of relationship and significant variation across thousands of points in a scatterplot doesn’t mean that we can make any reasonable judgment about the position of any one point in that mess. Using VAM or SGP estimates to make high-stakes personnel decisions about individual teachers violates this very simple rule. So does drawing specific, certain cut scores through these uncertain estimates in order to categorize teachers as effective or not.

Two Examples of How Models & Variables Matter

States are moving full steam ahead on adopting variants of value-added and growth percentile models for rating their teachers, and one thing that’s becoming rather obvious is that these models and the data on which they rely vary widely. Some states and districts have chosen to adopt value-added or growth percentile models that include only a single year of students’ prior scores to address differences in student backgrounds, while others are adopting more thorough value-added models that also include additional student demographic characteristics, classroom characteristics including class size, and other classroom and school characteristics that might influence – outside the teacher’s control – the growth in student outcomes. Some researchers have argued that in the aggregate – across the patterns as a whole – this stuff doesn’t always seem to matter that much. But we also have a substantial body of evidence that when it comes to rating individual teachers, it does.

For example, a few years back, the Los Angeles Times contracted with Richard Buddin to estimate a relatively simple value-added model of teacher effects on test scores in Los Angeles. Buddin included prior scores and student demographic variables. However, in a critique of Buddin’s report, Briggs and Domingue ran the following re-analysis to determine the sensitivity of individual teacher ratings to model changes, including additional prior scores and additional demographic and classroom-level variables:

The second stage of the sensitivity analysis was designed to illustrate the magnitude of this bias. To do this, we specified an alternate value-added model that, in addition to the variables Buddin used in his approach, controlled for (1) a longer history of a student’s test performance, (2) peer influence, and (3) school-level factors. We then compared the results—the inferences about teacher effectiveness—from this arguably stronger alternate model to those derived from the one specified by Buddin that was subsequently used by the L.A. Times to rate teachers. Since the Times model had five different levels of teacher effectiveness, we also placed teachers into these levels on the basis of effect estimates from the alternate model. If the Times model were perfectly accurate, there would be no difference in results between the two models. Our sensitivity analysis indicates that the effects estimated for LAUSD teachers can be quite sensitive to choices concerning the underlying statistical model. For reading outcomes, our findings included the following:

Only 46.4% of teachers would retain the same effectiveness rating under both models, 8.1% of those teachers identified as effective under our alternative model are identified as “more” or “most” effective in the L.A. Times specification, and 12.6% of those identified as “less” or “least” effective under the alternative model are identified as relatively effective by the L.A. Times model.

For math outcomes, our findings included the following:

Only 60.8% of teachers would retain the same effectiveness rating, 1.4% of those teachers identified as effective under the alternative model are identified as ineffective in the L.A. Times model, and 2.7% would go from a rating of ineffective under the alternative model to effective under the L.A. Times model.

The impact of using a different model is considerably stronger for reading outcomes, which indicates that elementary school age students in Los Angeles are more distinctively sorted into classrooms with regard to reading (as opposed to math) skills. But depending on how the measures are being used, even the lesser level of different outcomes for math could be of concern.

  • Briggs, D. & Domingue, B. (2011). Due diligence and the evaluation of teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District Teachers by the Los Angeles Times. Boulder, CO: National Education Policy Center. Retrieved June 4, 2012 from http://nepc.colorado.edu/publication/due-diligence.

Similarly, Ballou and colleagues ran sensitivity tests of teacher ratings applying variants of VAM models:

As the availability of longitudinal data systems has grown, so has interest in developing tools that use these systems to improve student learning. Value-added models (VAM) are one such tool. VAMs provide estimates of gains in student achievement that can be ascribed to specific teachers or schools. Most researchers examining VAMs are confident that information derived from these models can be used to draw attention to teachers or schools that may be underperforming and could benefit from additional assistance. They also, however, caution educators about the use of such models as the only consideration for high-stakes outcomes such as compensation, tenure, or employment decisions. In this paper, we consider the impact of omitted variables on teachers’ value-added estimates, and whether commonly used single-equation or two-stage estimates are preferable when possibly important covariates are not available for inclusion in the value-added model. The findings indicate that these modeling choices can significantly influence outcomes for individual teachers, particularly those in the tails of the performance distribution who are most likely to be targeted by high-stakes policies.

In short, the conclusions here are that model specification and variables included matter. And they can matter a lot. It is reckless and irresponsible to assert otherwise and even more so to never bother to run comparable sensitivity analyses to those above prior to requiring the use of measures for high stakes decisions.
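To make that point concrete, here is a toy simulation – entirely hypothetical numbers, not any state’s actual model or the Briggs–Domingue specifications – of what happens to quintile ratings when a fuller model strips out a classroom-level covariate (say, peer composition) that a naive model ignores:

```python
import random

random.seed(42)

N_TEACHERS = 1000

# Each teacher has a true effect, plus a classroom-level covariate
# (outside the teacher's control) that also moves student gains.
teachers = []
for _ in range(N_TEACHERS):
    true_effect = random.gauss(0, 1)
    covariate = random.gauss(0, 1)      # e.g. peer composition
    noise = random.gauss(0, 1)          # sampling noise in the class mean gain
    mean_gain = true_effect + 0.7 * covariate + noise
    teachers.append((true_effect, covariate, mean_gain))

def quintile_ranks(scores):
    """Assign each value a quintile rating 0-4 by rank order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for pos, i in enumerate(order):
        ranks[i] = pos * 5 // len(scores)
    return ranks

# Model A: naive estimate, ignores the covariate entirely.
naive = [g for _, _, g in teachers]
# Model B: fuller estimate that strips out the covariate's contribution.
adjusted = [g - 0.7 * c for _, c, g in teachers]

ratings_a = quintile_ranks(naive)
ratings_b = quintile_ranks(adjusted)

same = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
print(f"Teachers keeping the same quintile rating: {same / N_TEACHERS:.0%}")
```

In runs like this, a sizable share of teachers change rating categories even though the two sets of estimates are highly correlated overall – the same flavor of reshuffling Briggs and Domingue document with real data.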

SGP & a comprehensive VAM are NOT THE SAME!

This point is really just an extension of the previous one. Most SGP models, which are a subset of VAMs, take the simplest form, accounting for only a single prior year of test scores. Proponents of SGPs like to make a big deal about how the approach re-scales the data from its original artificial test scaling to a scale-free (and thus somehow problem-free?) percentile rank measure. The argument is that we can’t really ever know, for example, whether it’s easier or harder to increase your SAT (or any test) score from 600 to 650, or from 700 to 750, even though both are 50-point increases. Test-score distances simply aren’t like running distances. You know what? Neither are ranks/percentiles that are based on those test score scales! Rescaling is merely recasting the same ol’ stuff, though it can at times be helpful for interpreting results. If the original scores don’t show legitimate variation – for example, if they have a strong ceiling or floor effect, or simply have a lot of meaningless (noise) variation – then so too will any rescaled form of them.
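A tiny illustration of that last point (my own toy example, not drawn from any of the papers cited here): if a test has a hard ceiling, converting scores to percentile ranks cannot recover the variation the ceiling destroyed – everyone pinned at the top gets the identical rank:

```python
import bisect
import random

random.seed(1)

CEILING = 100
true_ability = [random.gauss(80, 15) for _ in range(10000)]
# Observed scores are capped at the test's ceiling.
observed = [min(round(a), CEILING) for a in true_ability]

def percentile_ranks(scores):
    """Percentile rank = fraction of all scores strictly below each score."""
    s = sorted(scores)
    n = len(s)
    return [bisect.bisect_left(s, x) / n for x in scores]

pct = percentile_ranks(observed)

# Everyone pinned at the ceiling gets exactly the same percentile rank:
# the rescaling cannot manufacture distinctions the raw scores lost.
ceiling_pcts = {p for x, p in zip(observed, pct) if x == CEILING}
n_at_ceiling = sum(1 for x in observed if x == CEILING)
print(f"Students at the ceiling: {n_at_ceiling}")
print(f"Distinct percentile ranks among them: {len(ceiling_pcts)}")
```

However the true abilities of those ceiling students differ, the percentile transformation sees one undifferentiated clump.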

Setting aside the re-scaling smokescreen, two recent working papers compare SGP and VAM estimates for teacher and school evaluation, and both raise concerns about the face validity and statistical properties of SGPs. Here’s what they find.

Goldhaber and Walch (2012) conclude “For the purpose of starting conversations about student achievement, SGPs might be a useful tool, but one might wish to use a different methodology for rewarding teacher performance or making high-stakes teacher selection decisions” (p. 30).

  •  Goldhaber, D., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. University of Washington at Bothell, Center for Education Data & Research. CEDR Working Paper 2012-6.

Ehlert and colleagues (2012) note “Although SGPs are currently employed for this purpose by several states, we argue that they (a) cannot be used for causal inference (nor were they designed to be used as such) and (b) are the least successful of the three models [Student Growth Percentiles, One-Step & Two-Step VAM] in leveling the playing field across schools”(p. 23).

  •  Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. National Center for Analysis of Longitudinal Data in Education Research (CALDER). Working Paper #80. http://ideas.repec.org/p/umc/wpaper/1210.html

If VAM is as reliable as Batting Averages or ERA, that simply makes it a BAD INDICATOR of FUTURE PERFORMANCE!

I’m increasingly mind-blown by those who return, time after time, to really bad baseball analogies to make their point that these value-added or SGP estimates are really good indicators of teacher effectiveness. I’m not that much of a baseball statistics geek, though I’m becoming more and more intrigued as time passes. The standard pro-VAM argument goes that VAM estimates for individual teachers have a correlation of about .35 from one year to the next. Casual readers of statistics often see this as “low,” working from the relatively naïve perspective that a “high” correlation is about .8. The idea is that a good indicator of teacher effect would be one that reveals the true, persistent effectiveness of that teacher from year to year. Even better, a good indicator would be one that allows us to tell whether that teacher is likely to be a good teacher in future years. A correlation of only about .35 doesn’t give us much confidence.
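For a rough sense of what a .35 correlation buys you, here’s a quick simulation (stylized numbers, assuming bivariate normal estimates – not actual teacher data):

```python
import random

random.seed(7)

R = 0.35        # assumed year-over-year correlation of the teacher estimates
N = 100_000

year1, year2 = [], []
for _ in range(N):
    x = random.gauss(0, 1)
    # Build a year-2 estimate correlated at R with the year-1 estimate.
    y = R * x + (1 - R ** 2) ** 0.5 * random.gauss(0, 1)
    year1.append(x)
    year2.append(y)

print(f"Share of year-2 variance explained by year 1: {R ** 2:.0%}")

# How often does a top-20% teacher in year 1 land in the top 20% again?
cut1 = sorted(year1)[int(0.8 * N)]
cut2 = sorted(year2)[int(0.8 * N)]
top1 = [i for i in range(N) if year1[i] >= cut1]
repeat = sum(1 for i in top1 if year2[i] >= cut2)
print(f"Top-quintile teachers still top-quintile next year: {repeat / len(top1):.0%}")
```

A .35 correlation means year 1 explains only about 12% of the variance in year 2 (.35² ≈ .12), and in simulations like this only roughly a third of top-quintile teachers repeat in the top quintile the following year.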

That said, let’s be clear that all we’re even talking about here is the likelihood that a teacher whose students showed test score gains in one year will have a new batch of students who show similar test score gains the following year (or at least, in relative terms, that a teacher who is above the average of teachers for student test score gains remains similarly above that average the following year). That is, the measure itself may be of very limited use, so the extent to which it is consistent or not may not really be that important. But I digress.

In order to try to make a .35 correlation sound good, VAM proponents will often argue that the year over year correlation between baseball batting averages, or earned run averages is really only about the same. And since we all know that batting average and earned run average are really, really important baseball indicators of player quality, then VAM must be a really, really important indicator of teacher quality. Uh… not so much!

If there’s one thing Baseball statistics geeks really seem to agree on, it’s that Batting Averages and Earned Run Averages for pitchers are crappy predictors of future performance precisely because of their low year over year correlation.

This piece from beyondtheboxscore.com provides some explanation:

Not surprisingly, Batting Average comes in at about the same consistency for hitters as ERA for pitchers. One reason why BA is so inconsistent is that it is highly correlated to Batting Average on Balls in Play (BABIP)–.79–and BABIP only has a year-to-year correlation of .35.

Descriptive statistics like OBP and SLG fare much better, both coming in at .62 and .63 respectively. When many argue that OBP is a better statistic than BA it is for a number of reasons, but one is that it’s more reliable in terms of identifying a hitter’s true skill since it correlates more year-to-year.

And this piece provides additional explanation of descriptive versus predictive metrics.

An additional really important point here, however, is that these baseball indicators are relatively simple mathematical calculations – like taking the number of hits (a relatively easily measured term) divided by at-bats (also easily measured). These aren’t noisy regression estimates based on the test bubble-filling behaviors of groups of 8- and 9-year-old kids. And most baseball metrics are arguably more clearly related to the job responsibilities of the player – though the fun stuff enters in when we start talking about modeling personnel decisions in terms of their influence on wins above replacement.

Just because you have a loose/weak pattern across thousands of points doesn’t add to the credibility of judging any one point!

One of the biggest fallacies in the application of VAM (or SGP) is that having a weak or modest relationship between year over year estimates for the same teachers, produced across thousands of teachers serving thousands of students, provides us with good enough (certainly better than anything else!) information to inform school or district level personnel policy.

Wrong! Knowing that there exists a modest pattern in a scatterplot of thousands of teachers from year one to year two, PROVIDES US WITH LITTLE USEFUL INFORMATION ABOUT ANY ONE POINT IN THAT SCATTERPLOT!

In other words, given the degree of noise in these best-case (least biased) estimates, there exists very limited real signal about the influence of any one teacher on his/her students’ test scores. What we have here is limited real signal on a measure – measured test score gains from last year to this – which captures a very limited scope of outcomes. And, if we’re lucky, we can generate this noisy estimate of a measure of limited value for about 1/5 of our teachers.

Asserting that useful information can be garnered about the position of a single point in a massive scatterplot, based on such a loose pattern, violates the most basic understandings of statistics. And this is exactly what using value-added estimates to evaluate individual teachers, and to put them into categories based on specific cut scores applied to these noisy measures, does!

The idea that we can apply strict cut scores to noisy statistical regression model estimates to characterize an individual teacher as “highly effective” versus merely “very effective” is statistically ridiculous, and validated as such by the resulting statistics themselves.
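A toy sketch of the problem (hypothetical noise level, not calibrated to any real evaluation system): give each teacher two equally legitimate noisy estimates of the same underlying effectiveness, apply the same cut score to both, and see how often the label flips:

```python
import random

random.seed(3)

N = 50_000
NOISE_SD = 1.0   # assumption: estimation noise comparable to the true spread

true_effect = [random.gauss(0, 1) for _ in range(N)]
# Two equally legitimate noisy estimates of the same underlying teachers.
est_a = [t + random.gauss(0, NOISE_SD) for t in true_effect]
est_b = [t + random.gauss(0, NOISE_SD) for t in true_effect]

# "Highly effective" = top quartile of the estimates, via a hard cut score.
cut_a = sorted(est_a)[int(0.75 * N)]
cut_b = sorted(est_b)[int(0.75 * N)]
label_a = [e >= cut_a for e in est_a]
label_b = [e >= cut_b for e in est_b]

flips = sum(1 for a, b in zip(label_a, label_b) if a != b)
print(f"Teachers whose label flips between the two estimates: {flips / N:.0%}")
```

Nothing about the teachers changed between the two estimates – only the noise draw did – yet a substantial share of labels flip across the cut score.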

Can useful information be garnered from the pattern as a whole? Perhaps. Statistics aren’t entirely worthless, nor is this variation of statistical application. I’d be in trouble if this were all entirely pointless. These models and their resulting estimates describe patterns – patterns of test score growth across lots and lots of kids and lots and lots of teachers – and groups and subgroups of kids and teachers. And these models may provide interesting insights into groups and subgroups if the original sample size is large enough. We might find, for example, that teachers applying one algebra teaching approach in several schools appear to be advancing students’ measured grasp of key concepts better than teachers in other schools (assuming equal students and settings) applying a different teaching method.

But we would be hard pressed to say with any certainty, which of these teachers are “good teachers” and which are “bad.”