Firing teachers based on bad (VAM) versus wrong (SGP) measures of effectiveness: Legal note

In the near future, my article with Preston Green and Joseph Oluwole on legal concerns regarding the use of value-added modeling for high-stakes decisions will come out in the BYU Education and Law Journal. In that article, we expand on arguments I first laid out in this blog post about how the use of these noisy and potentially biased metrics is likely to lead to a flood of litigation challenging teacher dismissals.

In short, as I have discussed on numerous occasions on this blog, value-added models attempt to estimate the effect of the individual teacher on growth in measured student outcomes. But these models tend to produce very imprecise estimates with very large error ranges, jumping around substantially from year to year. Further, individual teacher effectiveness estimates are highly susceptible to even subtle changes in model variables. And failure to address key omitted variables can lead to systematic model biases, which may even produce racially disparate teacher dismissals (see here and, for follow-up, here).
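The year-to-year instability claim is easy to demonstrate with a toy simulation (all numbers invented for illustration; this is not any state's actual model): hold each teacher's "true" effect fixed, add estimation noise larger than the signal, and see how poorly one year's rating predicts the next.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation (illustrative only): 500 teachers with a stable "true"
# effect, observed each year with estimation error larger than the signal.
n_teachers = 500
true_effect = rng.normal(0, 1, n_teachers)   # signal, sd = 1
noise_sd = 2.0                               # estimation error, sd = 2

year1 = true_effect + rng.normal(0, noise_sd, n_teachers)
year2 = true_effect + rng.normal(0, noise_sd, n_teachers)

# Year-to-year correlation of the estimates: roughly
# var(signal) / (var(signal) + var(noise)) = 1 / 5 = 0.2
r = np.corrcoef(year1, year2)[0, 1]

# Of the teachers flagged in the bottom quintile in year 1,
# what share are flagged again in year 2?
bottom1 = year1 <= np.quantile(year1, 0.2)
bottom2 = year2 <= np.quantile(year2, 0.2)
repeat_rate = (bottom1 & bottom2).sum() / bottom1.sum()
```

Even with a perfectly stable underlying effect, most teachers flagged "ineffective" in one year are not flagged the next, purely because of noise.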

Value-added modeling as a basis for high-stakes decision making is fraught with problems likely to be vetted in the courts. These problems are most likely to come to light under overly rigid state policies requiring that teachers be rated poorly if they receive low scores on the quantitative component of evaluations, and dictating that teachers be put on watch and/or de-tenured after two years of bad evaluations (see my post with NYC data on the problems with this approach).

Significant effort has been applied toward determining the reliability, validity and usefulness of value-added modeling for inferring school, teacher, principal and teacher preparation institution effectiveness. Just see the program from this recent conference.

As implied above, when cases challenging dismissal based on VAM make it to court, deliberations will most likely center on whether these models are sufficiently reliable and valid for making such judgments – whether teachers are able to understand the basis on which they have been dismissed, and whether they can reasonably be assumed to have had any control over their fate. Further, there are questions about how the methods/models may have been manipulated to disadvantage certain teachers.

But what about those STUDENT GROWTH PERCENTILES being pitched for similar use in states like New Jersey? While the arguments might take a similar approach of questioning the reliability or validity of the method for determining teacher effectiveness (the supposed basis for dismissal), the arguments regarding SGPs could take a much simpler route. In really simple terms, SGPs aren’t even designed to identify the teacher’s effect on student growth. VAMs are designed to do this, but fail.

When VAMs are challenged in court, one must show that they have failed at their intended objective. But it’s much, much easier to explain in court that SGPs make no attempt whatsoever to estimate the portion of student growth that is under the control of, and therefore attributable to, the teacher (see here for more explanation). As such, it is, on its face, inappropriate to dismiss a teacher on the basis of a low classroom (or teacher) aggregate student growth metric like SGP. Note also that even if SGPs are integrated into a “multiple measures” evaluation model, if the SGP data become the tipping point or a significant basis for such decisions, the entire system becomes vulnerable to challenge.*
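The design difference is visible even in a deliberately oversimplified sketch (real SGP models use quantile regression on prior-score histories, and real VAMs are far more elaborate; every variable and parameter below is invented). The SGP function contains no teacher term at all – it only describes where a student's score lands among similar-prior-score peers – while the VAM's teacher coefficients are the attribution step itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 300 students, 3 teachers. All parameters are invented.
n = 300
teacher = rng.integers(0, 3, n)              # teacher assignment
poverty = rng.integers(0, 2, n)              # a potential omitted variable
prior = rng.normal(50, 10, n)                # prior-year score
true_effect = np.array([-2.0, 0.0, 2.0])     # unknowable in practice
current = prior + true_effect[teacher] - 3 * poverty + rng.normal(0, 5, n)

def sgp(prior, current, n_bins=10):
    """SGP-style description: a student's percentile rank of current score
    among peers with similar prior scores. There is no teacher term and no
    covariate here -- nothing attributes growth to anyone."""
    edges = np.quantile(prior, np.linspace(0, 1, n_bins + 1))
    bin_idx = np.digitize(prior, edges[1:-1])
    pct = np.empty(len(current))
    for b in range(n_bins):
        mask = bin_idx == b
        ranks = current[mask].argsort().argsort()
        pct[mask] = 100 * (ranks + 0.5) / mask.sum()
    return pct

def vam(prior, poverty, teacher, current):
    """VAM-style inference: OLS of current score on prior score, a control,
    and teacher dummies. The teacher coefficients ARE the attribution step
    that SGP never takes."""
    dummies = np.eye(3)[teacher]
    X = np.column_stack([np.ones(len(prior)), prior, poverty, dummies[:, 1:]])
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)
    return beta[-2:]   # effects of teachers 1 and 2 relative to teacher 0

student_sgps = sgp(prior, current)           # one number per student, 0-100
teacher_effects = vam(prior, poverty, teacher, current)
```

Aggregating `student_sgps` by classroom produces a number per teacher, but nothing in its construction licenses the inference that the teacher caused it – that is the whole point.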

The authors (& vendor) of the SGP approach, in a very recent reply to my original critique, noted:

Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

That is, the authors and purveyors clearly state that SGPs make no ATTRIBUTION OF RESPONSIBILITY for progress to either the teacher or the school. The measure itself – the SGP – is entirely separable from attribution to the teacher (or school) of responsibility for that measure!

As I explain in my response, here, this point is key. It’s all about “attribution” and “inference.” This is not splitting hairs; it is the central point! It is my experience from expert testimony that judges are more likely to be philosophers than statisticians (an empirical question, if someone knows?). Thus, quibbling over the meaning of these words is likely to go further than quibbling over the statistical precision and reliability of VAMs. And the quibbling here is relatively straightforward – and far more than mere quibbling, I would argue.

A due process standard for teacher dismissal would, at the very least, require that where the basis for dismissal is teaching “ineffectiveness,” the measure upon which dismissal is based be one intended to INFER a teacher’s effect on student learning growth – a measure which would allow ATTRIBUTION OF [TEACHER] RESPONSIBILITY for that student growth or lack thereof. This is a very straightforward, non-statistical point.**

Put very simply, on its face, SGP is entirely inappropriate as a basis for determining teacher “ineffectiveness” leading to dismissal.*** By contrast, VAM is, on its face, appropriate, but in application fails to provide sufficient protection against wrongful dismissal.

There are important implications for pending state policies and for current and future teacher evaluation pilot programs in New Jersey and other SGP states like Colorado. First, regarding legislation, it would be entirely inappropriate, and a recipe for disaster, to mandate that soon-to-be-available SGP data be tied in any way to high-stakes personnel decisions like de-tenuring or dismissal. That is, SGPs should be neither explicitly nor implicitly suggested as a basis for determining teacher effectiveness. Second, local school administrators would be wise to consider carefully how they use these measures, if they choose to use them at all.


*I have noted on numerous occasions on this blog that in teacher effectiveness rating systems that a) slice arbitrary, decisive performance categories through noisy metrics and b) use a weighted structure of percentages placing all factors alongside one another (rather than applying them sequentially), the quantified metric can easily drive the majority of decisions, even when weighted at a seemingly small share (20% or so). If the quantified metric is the component of the evaluation system that varies most, and if we assume that variation to be “real” (valid), the quantified metric is likely to be 100% of the tipping point in many evaluations, despite being only 20% of the weighting.
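The arithmetic behind this footnote can be checked with made-up numbers (a hypothetical district; nothing here is drawn from real evaluation data): when observation ratings cluster tightly and the growth metric spreads widely, the 20%-weighted component dominates the composite's variance and its rank ordering.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical district of 1,000 teachers (all numbers invented).
# Observation ratings cluster tightly near the top of a 1-4 scale,
# while the test-based growth metric varies widely.
obs = rng.normal(3.5, 0.1, 1000)     # 80% weight, almost no spread
growth = rng.normal(2.5, 1.0, 1000)  # 20% weight, wide spread

composite = 0.8 * obs + 0.2 * growth

# Share of the composite's variance driven by the 20% component:
# (0.2^2 * 1.0) / (0.8^2 * 0.01 + 0.2^2 * 1.0), roughly 0.86
growth_var_share = (0.2**2) * growth.var() / composite.var()

# The composite's ordering closely tracks the growth metric alone,
# so the 20% component effectively decides who lands in the bottom category.
corr = np.corrcoef(composite, growth)[0, 1]
```

So a component weighted at 20% on paper can carry most of the variance, and hence most of the decision, exactly as the footnote argues.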

A critical flaw in many legislative frameworks for teacher evaluation, and in district-adopted policies, is that they place the quantitative metrics alongside other measures, including observations, in a weighted calculation of teacher effectiveness. It is this parallel treatment of the measures that permits the test-driven component to override all other “measures” in the ultimate determination of teacher effectiveness and, in some cases, of whether a teacher is tenured or dismissed. A simple, logical resolution to this problem is to use the quantitative measures as a first step – a noisy pre-screening – in which administrators, perhaps central office human resources, review the data to determine whether they indicate potential problem areas across schools and teachers, knowing full well that these might be false signals due to data error and bias. The data used in this way might then guide district administration on where to allocate additional classroom observation effort in a given year. Used this way, the quantified measures might improve the efficiency of time allocation in a comprehensive evaluation model, but they would not serve as the tipping point for decision making. I suspect, however, that even used in this more reasonable way, administrators will find over time that the initial signals tend not to be particularly useful.

**Indeed, one can also argue that a VAM regression merely describes the relationship between having teacher X and achieving growth Y, controlling for A, B, C and so on (where A, B and C include various student, classroom-level and school characteristics). To the extent that one can effectively argue that a VAM is merely descriptive and does not provide a basis for valid inference, similar arguments can be made. BUT, in my view, this is still more subtle than the OUTRIGHT FAILURE OF SGP to even consider A, B & C – factors influencing student outcomes that are clearly outside teachers’ control.

***A non-trivial point is that if you review the conference program from the AEFP conference I mentioned above, or existing literature on this point, you will find numerous articles and papers critiquing the use of VAM for determining teacher effectiveness. But, there are none critiquing SGP. Is this because it is well understood that SGPs are an iron-clad method overcoming the problems of VAM? Absolutely not. Academics will evaluate and critique anything which claims to have a specific purpose. Scholars have not critiqued the usefulness of SGPs for inferring teacher effectiveness, have not evaluated their reliability or validity for this purpose, BECAUSE SCHOLARS UNDERSTAND FULL WELL THAT THEY ARE NEITHER DESIGNED NOR INTENDED FOR THIS PURPOSE.


11 thoughts on “Firing teachers based on bad (VAM) versus wrong (SGP) measures of effectiveness: Legal note”

  1. Again, pertinent and clear. Thank you.

    Perhaps this might be considered too conspiracist, but given the wide influence of legislative tampering by ALEC and others, it’s hard any longer not to at least entertain the thought that these scenarios are all “gamed” in advance. How much can one suppose that these measures are designed to create the grounds for lawsuits?

    1. I really don’t think so, at least in the cases where I have some insights and know people involved. In NJ, we’ve actually got good people involved in this process of developing/adopting these models and measures. But, I don’t believe we have the technical capacity to evaluate them sufficiently, or to critique what I perceive to be an increasingly deceptive double speak from the vendors/consultants selling this stuff. I’ve spoken with individuals in other states who are on the ground in the implementation/adoption of this stuff. There are good people involved elsewhere too. So I’m not buying a conspiracy angle – yet. Further, I think there are many involved who are seeking, as I have in some posts, the potentially responsible uses of these measures, if there are any. But, once the measure (or monster) is created, then there are legislators (and perhaps district officials) who want to adopt policies that effectively mandate irresponsible use of these measures (perhaps there’s conspiracy there, but I’m not sure yet).

  2. Not surprisingly, people in Colorado are using the SGP to train administrators to assign students to “effective” teachers using SGP data–

  3. And, I assume, the same problems exist when attempting to use SGP to identify “effective” and “ineffective” principals (which will surely follow the attempt to use it to identify “effective” teachers)?

  4. Dr. Baker,
    Well done and well reasoned. The trial lawyers lie in wait. Your exposé will serve the coming bonanza well. NO ONE wants this legislation to proceed more than the trial lawyers.

    On a different matter, have you ever considered the impact on HIGH-performing districts and “Gifted and Talented” students specifically? For example, if a high-achieving sixth grader takes Algebra but sits for the NJASK, which does not “emphasize” Algebra, might not the SGP be even more insensitive? Will schools be discouraged from “outpacing” the NJASK or the coming Common Core tests?

    I realize the attention the urban schools need, BUT the needs of the BRIGHTEST students deserve attention as well.
    Thank you

  5. Very well done.
    With respect to the (Mathematica) paper you link regarding the inability of VAM (school-level scores) to measure the effectiveness of principals, it is important to note that some states use school-level scores as part of their teacher evaluations, particularly for those teaching nontested subjects. Pennsylvania’s Department of Education has pushed for legislation (H.B. 1980) that would use school-level scores for as much as 15% of the evaluation of teachers of students in tested and nontested subjects. The latter group also includes nonteaching professionals. The central argument of the Mathematica researchers’ paper is that principals exercise little control over school-level scores. It seems obvious that individual teachers or nonteaching professionals exercise even less.

  6. Dr. Baker-

    It seems to me this paragraph points to a key concern, if not the key concern, about the use of VAM for evaluation purposes:

    As implied above, when cases challenging dismissal based on VAM make it to court, deliberations will most likely center on whether these models are sufficiently reliable and valid for making such judgments – whether teachers are able to understand the basis on which they have been dismissed, and whether they can reasonably be assumed to have had any control over their fate. Further, there are questions about how the methods/models may have been manipulated to disadvantage certain teachers.

    Perhaps I’m misunderstanding your last sentence, but I see no reason to look for “manipulation” in the sense of suggesting a deliberate attempt to disadvantage certain teachers.
    As you have explained so well many times before, all statistical modeling requires choices (and subjective judgments). These are manipulations that require no malice. However, we are still left with this question: how can value-added results be considered valid measures of teacher effectiveness if different, equally defensible methods/models produce different “winners” and “losers”?

  7. Louisiana has decided to use the VAM for all personnel decisions. We can lose tenure after one ineffective rating, lose the chance for raises, lose our certification, and lose our jobs. The VAM is supposed to be 50% of our evaluation, but if it is ineffective, our whole evaluation is ineffective. I will sue if they try to dismiss me over a flawed model. I am sure lawyers are just waiting.

Comments are closed.