On Misrepresenting (Gates) MET to Advance State Policy Agendas

In my previous post I chastised state officials for their blatant mischaracterization of metrics to be employed in teacher evaluation. That post raised (in Twitter conversation) the issue of the frequent misrepresentation of findings from the Gates Foundation Measures of Effective Teaching (MET) project. Policymakers frequently invoke the Gates MET findings as providing broad-based support for whatever measures they might choose to use (such as student growth percentiles), however they might choose to use them.

Here is one example in a recent article from NJ Spotlight (John Mooney) regarding proposed teacher evaluation regulations in New Jersey:

New academic paper: One of the most outspoken critics has been Bruce Baker, a professor and researcher at Rutgers’ Graduate School of Education. He and two other researchers recently published a paper questioning the practice, titled “The Legal Consequences of Mandating High Stakes Decisions Based on Low Quality Information: Teacher Evaluation in the Race-to-the-Top Era.” It outlines the teacher evaluation systems being adopted nationwide and questions the use of SGP specifically, saying the percentile measures are not designed to gauge teacher effectiveness and “thus have no place” in determining, especially, a teacher’s job fate.

The state’s response: The Christie administration cites its own research to back up its plans, the most favored being the recent Measures of Effective Teaching (MET) project funded by the Gates Foundation, which tracked 3,000 teachers over three years and found that student achievement measures in general are a critical component in determining a teacher’s effectiveness.

I asked colleague Morgan Polikoff of the University of Southern California for his comments. Note that Morgan and I aren’t entirely on the same page regarding the usefulness of even the best possible versions of teacher effect (on test score gain) measures… but we’re not that far apart either. It’s my impression that Morgan believes that better-estimated measures can be more valuable in policy decision making than I perhaps think they can be. My perspective is presented here (and Morgan is free to provide his). My skepticism arises in part from my perception that there is neither interest among nor incentive for state policymakers to actually develop better measures (as evidenced in my previous post), and in part from my doubt that some of the major issues can ever be resolved.

That aside, here are Morgan Polikoff’s comments regarding misrepresentation of the Gates MET findings – in particular, as applied to states adopting student growth percentile measures:

As a member of the Measures of Effective Teaching (MET) project research team, I was asked by Bruce to pen a response to the state’s use of MET to support its choice of student growth percentiles (SGPs) for teacher evaluations. Speaking on my behalf only (and not on behalf of the larger research team), I can say that the MET project says nothing at all about the use of SGPs. The growth measures used in the MET project were, in fact, based on value-added models (VAMs) (http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf). The MET project’s VAMs, unlike student growth percentiles, included an extensive list of student covariates, such as demographics, free/reduced-price lunch, English language learner, and special education status.

Extrapolating from these results and inferring that the same applies to SGPs is not an appropriate use of the available evidence. The MET results cannot speak to the differences between SGP and VAM measures, but there is both conceptual and empirical evidence that VAM measures controlling for student background characteristics are more appropriate (link to your paper and to Cory Koedel’s AEFP paper). For instance, SGP models are likely to rate teachers of the most disadvantaged students the lowest (cite Cory’s paper). This may result in all kinds of negative unintended consequences, such as teachers avoiding these kinds of students.

In short, state policymakers should consider all of the available evidence on SGPs vs. VAMs, and they should not rely on MET to make arguments about measures that were not studied in that work.
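As a purely illustrative sketch of the modeling difference Polikoff describes, the toy simulation below contrasts a VAM-style estimate (regression on prior score plus a student covariate, with teacher dummies) against an SGP-style measure (percentile rank of each student's score among students with similar prior scores, summarized by teacher). All numbers, the data-generating process, and the 0.8 SD disadvantage penalty are my own assumptions for illustration; this is not the MET project's model or any state's implementation.

```python
# Hypothetical illustration: when disadvantage depresses growth independent of
# the teacher, an SGP-style measure (conditioning only on prior score) loads
# that penalty onto teachers serving disadvantaged students, while a
# covariate-adjusted VAM-style regression largely removes it.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
teacher = rng.integers(0, 20, n)                      # 20 hypothetical teachers
prior = rng.normal(0, 1, n)                           # prior-year test score
frl = (rng.random(n) < 0.04 * teacher).astype(float)  # disadvantage sorted across classrooms
true_effect = rng.normal(0, 0.2, 20)                  # assumed true teacher contributions
# Assumed data-generating process: a 0.8 SD penalty for disadvantage, regardless of teacher.
score = 0.8 * prior - 0.8 * frl + true_effect[teacher] + rng.normal(0, 0.5, n)

# VAM-style: OLS of current score on prior score, the covariate, and teacher dummies.
X = np.column_stack([np.ones(n), prior, frl,
                     (teacher[:, None] == np.arange(1, 20)).astype(float)])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
vam = np.concatenate([[0.0], beta[3:]])               # teacher effects relative to teacher 0

# SGP-style: percentile rank of each student's score within prior-score deciles,
# then the median percentile per teacher.
decile = np.digitize(prior, np.quantile(prior, np.linspace(0.1, 0.9, 9)))
pct = np.empty(n)
for d in range(10):
    idx = np.where(decile == d)[0]
    pct[idx] = score[idx].argsort().argsort() / (len(idx) - 1) * 100
sgp = np.array([np.median(pct[teacher == t]) for t in range(20)])

frl_share = np.array([frl[teacher == t].mean() for t in range(20)])
print(np.corrcoef(frl_share, sgp)[0, 1])  # negative: SGP penalizes disadvantaged classrooms
print(beta[2])                            # the VAM recovers the assumed disadvantage penalty
```

In this toy world the SGP ranking is systematically lower for teachers with more disadvantaged students even though their assumed true effects are independent of classroom composition, which is the mechanism behind the unintended consequences noted above.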



Baker, B. D., Oluwole, J., & Green, P. C., III (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21(5). This article is part of EPAA/AAPE’s Special Issue on Value-Added: What America’s Policymakers Need to Know and Understand, guest edited by Dr. Audrey Amrein-Beardsley and assistant editors Dr. Clarin Collins, Dr. Sarah Polasky, and Ed Sloat. Retrieved [date], from http://epaa.asu.edu/ojs/article/view/1298

Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting Growth Measures for School and Teacher Evaluations. http://ideas.repec.org/p/umc/wpaper/1210.html

  1. Could you append or just write for me a dumbed-down explanation of the difference between SGP and VAM? I will use it to introduce this.


    1. The key difference is explained in the previous post – which I think needs more attention:

      With value-added modeling, which does attempt to parse statistically the relationship between a student being assigned to teacher X and that student’s achievement growth, controlling for various characteristics of the student and the student’s peer group, there still exists a substantial possibility of random-error-based misclassification of the teacher, or of remaining bias in the teacher’s classification (something we didn’t catch in the model affected that teacher’s estimate). And there’s little way of knowing which is which.
      With student growth percentiles, there is no attempt at all to parse statistically the relationship between a student’s assignment to a particular teacher and the teacher’s supposed responsibility for that student’s change in test score percentile rank among her peers.

      Also, I explain the difference in this video.

    2. Quick summary: value-added models attempt, I would argue unsuccessfully, to parse the influence of the teacher on student test score growth, whereas growth percentile models make no effort to isolate a teacher effect. It’s entirely about the relative reshuffling of students, aggregated to the teacher level. As I say in the video:

      One approach tries (VAM) and the other one doesn’t (SGP). One doesn’t work (VAM) and the other is completely wrong for the purpose to begin with (SGP).

      1. Of course I would moderate the last sentence a skosh to say “One approach tries (VAM) and the other one doesn’t (SGP). One works a bit (VAM) and the other is completely wrong for the purpose to begin with (SGP).”

      2. Totally reasonable modification. To me, it all comes down to the balance of (a) true signal, (b) false signal (bias), and (c) remaining random noise. The more we can tease out true signal, the better off we may be… but we can’t ever really know how much of each part is left. Even in year-to-year correlations we may be capturing a big chunk of persistent bias. Cool experiments with switchers, etc., help us understand what might be true signal… but while that’s great fun for research, it doesn’t really help in practice (imagine how many teachers we’d actually be able to rate in a year). The huge obvious gap in blind/naive (whatever we wish to call it) use of SGPs for teacher evaluation is the presumption that all variation, including false signal (bias), is actually true signal. Even worse is the complete disregard by public officials of the possibility that this may even be a problem.

  2. I wish you would share this information with the NJ Board of Ed, which has bought Cerf’s circuitous explanations hook, line, and sinker without understanding the implications for real-life teacher evaluations.

    1. I agree – I’d like to see this presented to the NJ State Board of Ed prior to May 1, their deadline for approval….
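The signal/bias/noise balance discussed in the exchange above can be sketched with a toy simulation (all variance magnitudes are assumed purely for illustration). It shows why a strong year-to-year correlation cannot, by itself, distinguish persistent bias from true teacher signal:

```python
# Toy decomposition: a measured teacher effect mixes true signal, persistent
# bias, and year-specific noise. Two very different worlds produce the same
# "stability" statistic.
import numpy as np

rng = np.random.default_rng(1)
n = 10000  # hypothetical teachers, large so the correlations are stable

# Scenario A: signal, persistent bias, and noise each with unit variance.
signal = rng.normal(0, 1, n)
bias = rng.normal(0, 1, n)       # e.g., stable sorting of students to teachers
year1 = signal + bias + rng.normal(0, 1, n)
year2 = signal + bias + rng.normal(0, 1, n)
r = np.corrcoef(year1, year2)[0, 1]

# Scenario B: no bias at all, just a larger true signal (variance 2).
signal_b = rng.normal(0, np.sqrt(2), n)
year1_b = signal_b + rng.normal(0, 1, n)
year2_b = signal_b + rng.normal(0, 1, n)
r_pure = np.corrcoef(year1_b, year2_b)[0, 1]

# Both correlations come out near 2/3: the year-to-year correlation alone
# cannot tell persistent bias apart from true signal.
print(round(r, 2), round(r_pure, 2))
```

In both scenarios the year-to-year correlation is about 0.67, yet in the first, half of the persistent component is bias. This is the sense in which even reassuringly stable estimates may carry a big chunk of persistent bias.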
