In the near future my article with Preston Green and Joseph Oluwole on legal concerns regarding the use of value-added modeling for making high-stakes decisions will come out in the BYU Education and Law Journal. In that article, we expand on various arguments I first laid out in this blog post about how the use of these noisy and potentially biased metrics is likely to lead to a flood of litigation challenging teacher dismissals.
In short, as I have discussed on numerous occasions on this blog, value-added models attempt to estimate the effect of the individual teacher on growth in measured student outcomes. But these models tend to produce very imprecise estimates with very large error ranges, jumping around a lot from year to year. Further, individual teacher effectiveness estimates are highly susceptible to even subtle changes to model variables. And failure to address key omitted variables can lead to systemic model biases which may even lead to racially disparate teacher dismissals (see here, and for a follow-up, here).
Value-added modeling as a basis for high-stakes decision making is fraught with problems likely to be vetted in the courts. These problems are most likely to come to light in the context of overly rigid state policies requiring that teachers be rated poorly if they receive low scores on the quantitative component of evaluations, and where state policies dictate that teachers must be put on watch and/or de-tenured after two years of bad evaluations (see my post with NYC data on problems with this approach).
Significant effort has been applied toward determining the reliability, validity and usefulness of value-added modeling for inferring school, teacher, principal and teacher preparation institution effectiveness. Just see the program from this recent conference.
As implied above, it is most likely that when cases challenging dismissal based on VAM make it to court, deliberations will center on whether these models are sufficiently reliable or valid for making such judgments – whether teachers are able to understand the basis on which they have been dismissed and whether it is assumed that they have had any control over their fate. Further, there exist questions about how the methods/models may have been manipulated in order to disadvantage certain teachers.
But what about those STUDENT GROWTH PERCENTILES being pitched for similar use in states like New Jersey? While on the one hand the arguments might take a similar approach of questioning the reliability or validity of the method for determining teacher effectiveness (the supposed basis for dismissal), the arguments regarding SGPs might take a much simpler approach. In really simple terms, SGPs aren’t even designed to identify the teacher’s effect on student growth. VAMs are designed to do this, but fail.
When VAMs are challenged in court, one must show that they have failed in their intended objective. But it’s much, much easier to explain in court that SGPs make no attempt whatsoever to estimate that portion of student growth that is under the control of, therefore attributable to, the teacher (see here for more explanation of this). As such, it is, on its face, inappropriate to dismiss the teacher on the basis of a low classroom (or teacher) aggregate student growth metric like SGP. Note also that even if integrated into a “multiple measures” evaluation model, if the SGP data becomes the tipping point or significant basis for such decisions, the entire system becomes vulnerable to challenge.*
The authors (& vendor) of SGP, in very recent reply to my original critique of SGPs, noted:
Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.
That is, the authors and purveyors clearly state that SGPs make no ATTRIBUTION OF RESPONSIBILITY for progress to either the teacher or the school. The measure itself – the SGP – is entirely separable from attribution to the teacher (or school) of responsibility for that measure!
As I explain in my response, here, this point is key. It’s all about “attribution” and “inference.” This is not splitting hairs. This is a/the central point! It is my experience from expert testimony that judges are more likely to be philosophers than statisticians (empirical question if someone knows?). Thus quibbling over the meaning of these words is likely to go further than quibbling over the statistical precision and reliability of VAMs. And the quibbling here is relatively straightforward, and far more than mere quibbling I would argue.
A due process standard for teacher dismissal would at the very least require that the measure upon which dismissal was based, where the basis was teaching “ineffectiveness”, was a measure that was intended to INFER a teacher’s effect on student learning growth – a measure which would allow ATTRIBUTION OF [TEACHER] RESPONSIBILITY for that student growth or lack thereof. This is a very straightforward, non-statistical point.**
Put very simply, on its face, SGP is entirely inappropriate as a basis for determining teacher “ineffectiveness” leading to teacher dismissal.*** By contrast, VAM is, on its face, appropriate, but in application fails to provide sufficient protections against wrongful dismissal.
There are important implications for pending state policies and current and future pilot programs regarding teacher evaluation in New Jersey and other SGP states like Colorado. First, regarding legislation, it would be entirely inappropriate and a recipe for disaster to mandate that soon-to-be available SGP data be used in any way tied to high-stakes personnel decisions like de-tenuring or dismissal. That is, SGPs should be neither explicitly nor implicitly suggested as a basis for determining teacher effectiveness. Second, local school administrators would be wise to consider carefully how they choose to use these measures, if they choose to use them at all.
*I have noted on numerous occasions on this blog that in teacher effectiveness rating systems that a) use arbitrary performance categories, drawing decisive cut points through noisy metrics, and b) use a weighted structure of percentages putting all factors alongside one another (rather than sequential application), the quantified metric can easily drive the majority of decisions, even if weighted at a seemingly small share (20% or so). If the quantified metric is the component of the evaluation system that varies most, and if we assume that variation to be “real” (valid), the quantified metric is likely to be 100% of the tipping point in many evaluations, despite being only 20% of the weighting.
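The arithmetic behind that tipping-point claim can be sketched with a toy example. Everything here is an assumption made up for illustration: the six teachers, their scores on a 1-4 scale, the 80/20 weights, and the 2.85 cutoff do not come from any actual state system. Observation scores are deliberately clustered tightly and growth scores deliberately spread widely, which is the situation described above.

```python
from statistics import pstdev

# Hypothetical 1-4 scale ratings for six teachers (made-up numbers).
# Observation scores cluster tightly; growth-based scores spread widely.
ratings = {
    "A": {"observation": 3.2, "growth": 1.0},
    "B": {"observation": 3.0, "growth": 3.8},
    "C": {"observation": 3.1, "growth": 1.4},
    "D": {"observation": 3.3, "growth": 2.9},
    "E": {"observation": 2.9, "growth": 3.5},
    "F": {"observation": 3.1, "growth": 2.2},
}

WEIGHTS = {"observation": 0.8, "growth": 0.2}  # assumed weighting
CUTOFF = 2.85  # assumed line below which a teacher is rated "ineffective"


def composite(scores):
    """Weighted sum of the parallel components."""
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)


def lowest(part, n):
    """Names of the n teachers scoring lowest on one component."""
    ranked = sorted(ratings, key=lambda name: ratings[name][part])
    return sorted(ranked[:n])


# Teachers who fall below the cutoff on the weighted composite.
flagged = sorted(name for name in ratings if composite(ratings[name]) < CUTOFF)

# How much each component moves the composite: weight times spread.
weighted_spread = {
    part: WEIGHTS[part] * pstdev([r[part] for r in ratings.values()])
    for part in WEIGHTS
}

print("flagged by composite:", flagged)
print("lowest two on growth:", lowest("growth", 2))
print("lowest two on observation:", lowest("observation", 2))
print("weighted spread:", weighted_spread)
```

In this made-up data the two teachers flagged by the composite are exactly the two lowest on the growth score, while the two teachers with the weakest observations are not flagged at all. The growth component carries only 20% of the weight, yet its weighted spread is larger than that of the observation component, so it decides who lands below the line.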
A critical flaw with many legislative frameworks for teacher evaluation and district-adopted policies is that they place the quantitative metrics alongside other measures including observations, in a weighted calculation of teacher effectiveness. It is this parallel treatment of the measures that permits the test-driven component to override all other “measures” when it comes to the ultimate determination of teacher effectiveness and in some cases whether the teacher is tenured or dismissed. A simple logical resolution to this problem is to use the quantitative measures as a first step – a noisy pre-screening – in which administrators – perhaps central office human resources – might review the data to determine whether the data are indicating potential problem areas across schools & teachers – knowing full well that these might be false signals due to data error and bias. But, the data used in this way at this step might then guide district administration on where to allocate additional effort in classroom observations in a given year. In this case, the quantified measures might ideally improve the efficiency of time allocation in a comprehensive evaluation model but would not serve as the tipping point for decision making. I suspect however, that even used in this more reasonable way, administrators will realize over time that the initial signals tend not to be particularly useful.
**Indeed, one can also argue that a VAM regression merely describes the relationship between having X teacher, and achieving Y growth, controlling for A, B, C and so on (where A, B, C include various student characteristics, classroom level characteristics and school characteristics). To the extent that one can effectively argue that a VAM model is merely descriptive and also does not provide a basis for valid inference, similar arguments can be made. BUT, in my view, this is still more subtle than the OUTRIGHT FAILURE OF SGP to even consider A, B & C – which are factors clearly outside of teachers’ control over student outcomes.
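The distinction between describing growth and adjusting for A, B & C can be sketched with a toy dataset. All of this is assumed for illustration: the ten student records are invented, the percentile function is a simple rank among students with the same prior score (real SGP calculations are more elaborate), and the within-group comparison is only a crude stand-in for the covariate controls a VAM would include. By construction neither teacher has any effect; the only difference is that teacher Y has more low-income students, and low-income students in this made-up data gain less.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up records: (teacher, low_income flag, prior score, current score).
# Within each income group, both teachers' students gain the same on average.
students = [
    ("X", 0, 50, 59.0), ("X", 0, 50, 60.0), ("X", 0, 50, 61.0),
    ("X", 0, 50, 62.0), ("X", 1, 50, 54.5),
    ("Y", 0, 50, 60.5), ("Y", 1, 50, 53.0), ("Y", 1, 50, 54.0),
    ("Y", 1, 50, 55.0), ("Y", 1, 50, 56.0),
]


def growth_percentiles(records):
    """Percentile rank of each current score among same-prior-score peers."""
    peers = defaultdict(list)
    for _, _, prior, current in records:
        peers[prior].append(current)
    result = []
    for teacher, _, prior, current in records:
        group = peers[prior]
        below = sum(score < current for score in group)
        ties = sum(score == current for score in group)
        result.append((teacher, 100 * (below + 0.5 * ties) / len(group)))
    return result


def median_percentile_by_teacher(records):
    """Descriptive classroom aggregate: no student characteristics used."""
    by_teacher = defaultdict(list)
    for teacher, pct in growth_percentiles(records):
        by_teacher[teacher].append(pct)
    return {teacher: median(pcts) for teacher, pcts in by_teacher.items()}


def adjusted_gap(records, teacher_a="X", teacher_b="Y"):
    """Mean gain of teacher_b minus teacher_a, compared within income groups."""
    gains = defaultdict(list)
    for teacher, low_income, prior, current in records:
        gains[(teacher, low_income)].append(current - prior)
    gaps = [
        mean(gains[(teacher_b, group)]) - mean(gains[(teacher_a, group)])
        for group in (0, 1)
    ]
    return mean(gaps)


print(median_percentile_by_teacher(students))  # {'X': 65.0, 'Y': 35.0}
print(adjusted_gap(students))                  # 0.0
```

The descriptive aggregate puts teacher Y's median student 30 percentile points below teacher X's, even though the gap disappears once students are compared within income groups. This does not show that VAMs succeed in practice; it only illustrates why a measure that never considers A, B & C cannot support attribution to the teacher.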
***A non-trivial point is that if you review the conference program from the AEFP conference I mentioned above, or existing literature on this point, you will find numerous articles and papers critiquing the use of VAM for determining teacher effectiveness. But, there are none critiquing SGP. Is this because it is well understood that SGPs are an iron-clad method overcoming the problems of VAM? Absolutely not. Academics will evaluate and critique anything which claims to have a specific purpose. Scholars have not critiqued the usefulness of SGPs for inferring teacher effectiveness, have not evaluated their reliability or validity for this purpose, BECAUSE SCHOLARS UNDERSTAND FULL WELL THAT THEY ARE NEITHER DESIGNED NOR INTENDED FOR THIS PURPOSE.