More on the SGP debate: A reply

Posted on September 13, 2011

This new post from Ed News Colorado responds to my critique of Student Growth Percentiles (SGPs).

I must say that I agree with almost everything in this response to my post, with a few exceptions. First, the authors argue:

Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

No, I do not conflate the data and measures with their proposed use. Policymakers are doing that, and doing so on poor advice from other policymakers who miss the important point – the primary purpose – that Betebenner, Briggs and colleagues explain. That is precisely why I relied on their work in my previous post: it explains their intent and lays out their caveats.

Policymakers, by contrast, are pitching the direct use of SGPs in teacher evaluation. Whether the designers intended this or not, that's what's happening. Perhaps this is because the designers have not been explaining the actual intent and design as bluntly as they do here.

Further, I should point out that while I have marginally more faith that a value-added model (VAM) could, in theory, be used to parse out a teacher's effect than an SGP, which isn't even intended to do so, I have no more faith than they do that a VAM can actually accomplish this objective. They interpret my post as follows:

Despite Professor Baker’s criticism of VAM/SGP models for teacher evaluation, he appears to hold out more hope than we do that statistical models can precisely parse the contribution of an individual teacher or school from the myriad of other factors that contribute to students’ achievement.

I'm not, as they characterize me, a supporter of VAM over SGP, as any reader of this blog certainly realizes. However, it is critically important that state policymakers be informed that SGP is not even intended to be used in this way. I'm very pleased they have chosen to make this the central point of their response!

And while SGP information might reasonably be used in other ways, if it is used as a tool for ranking and sorting teachers or schools by effectiveness, SGP results would likely be even more biased than VAM results… and we may not know, or even be able to figure out, by how much.

I agree entirely with their statement (except for their removal of "freakin"):

We would add that it is a similar “massive … leap” to assume a causal relationship between any VAM quantity and a causal effect for a teacher or school, not just SGPs. We concur with Rubin et al (2004) who assert that quantities derived from these models are descriptive, not causal, measures. However, just because measures are descriptive does NOT imply that the quantities cannot and should not be used as part of a larger investigation of root causes.

The authors of the response make one more point that I find objectionable (because it's a cop-out!):

To be clear about our own opinions on the subject: The results of large-scale assessments should never be used as the sole determinant of education/educator quality.

What this point accomplishes is to let policymakers go on assuming (citing this quote as their basis) that they can use this kind of information for, say, a fixed 90% share of high-stakes decisions about school or teacher performance, and certainly that a fixed 40% or 50% weight would be reasonable. Just not 100%. Sure, the authors didn't mean that. But it's an easy stretch for a policymaker.

If the measures aren't meant to isolate system, school or teacher effectiveness, or if they were meant to but simply can't, they should NOT be used for any fixed, defined, inflexible share of any high-stakes decision. In fact, even better and more useful measures shouldn't be used so rigidly.

[Also, as I’ve pointed out in the past, when a rigid indicator is included as a large share (even 40% or more) in a system of otherwise subjective judgments, the rigid indicator might constitute 40% of the weight but drive 100% of the decision.]
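To make the bracketed point concrete, here is a toy sketch in Python. Everything in it is hypothetical: ten made-up teachers, a 40/60 weighting, both scores on a 0 to 100 scale, and, crucially, the assumption that the subjective ratings are bunched closely together while the test-based measure is spread out. Under that assumption, the test-based measure carries only 40% of the weight yet fully determines who gets flagged.

```python
# Hypothetical illustration only: the teachers, weights, scales and ratings
# below are invented assumptions, not data from any real evaluation system.

TEST_WEIGHT = 0.4        # the "rigid" test-based indicator
SUBJECTIVE_WEIGHT = 0.6  # the otherwise subjective judgments


def composite(test_score, subjective_score):
    """Weighted composite of two scores on a 0-100 scale."""
    return TEST_WEIGHT * test_score + SUBJECTIVE_WEIGHT * subjective_score


def bottom_k(scores, k):
    """Return the ids of the k lowest-scoring teachers, lowest first."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    return [teacher_id for teacher_id, _ in ranked[:k]]


# Test-based percentiles spread widely: 5, 15, ..., 95.
test = {f"T{i}": 5 + 10 * i for i in range(10)}

# Subjective ratings bunched near the top of the scale (values 88 to 92),
# deliberately shuffled so they do not line up with the test scores.
subjective = {f"T{i}": 88 + (i * 7) % 5 for i in range(10)}

composite_scores = {t: composite(test[t], subjective[t]) for t in test}

if __name__ == "__main__":
    k = 3  # hypothetical number of teachers flagged for action
    print("Flagged by composite:      ", bottom_k(composite_scores, k))
    print("Flagged by test score only:", bottom_k(test, k))
    print("Flagged by subjective only:", bottom_k(subjective, k))

    # How far apart teachers can get on each weighted component.
    test_spread = TEST_WEIGHT * (max(test.values()) - min(test.values()))
    subj_spread = SUBJECTIVE_WEIGHT * (
        max(subjective.values()) - min(subjective.values())
    )
    print(f"Weighted spread: test {test_spread:.1f} pts, "
          f"subjective {subj_spread:.1f} pts")
```

Running it flags T0, T1 and T2 on the composite, exactly the same three teachers flagged by the test score alone, while the subjective ratings on their own would have flagged T0, T5 and T3. The weighted test component spreads teachers across 36 points versus 2.4 for the subjective component, so the 60% share barely moves anyone's rank. If the subjective ratings were themselves widely spread, the picture would change; the point is only that nominal weights say little about which component actually drives the decision.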

So, to summarize, I'm glad we are, for the most part, on the same page. I'm frustrated that I'm the one who had to raise this issue, which I did in part because it was pretty clear to me from reading the existing work on SGPs that many were conflating the measure with its use. I'm still concerned about the use, and especially concerned in the current policy context. I hope that in the future the designers and promoters of SGP will proclaim their own caveats – their own cautions – and their own guidelines for appropriate use more loudly and clearly.

Simply handing off the tool to the end user and then walking away in the face of misuse and abuse would be irresponsible.

Addendum: By the way, I do hope the authors will happily testify on behalf of the first teacher who is wrongfully dismissed or "de-tenured" on the basis of three bad SGPs in a row, and that they will testify that SGPs were never intended to establish a causal relationship to teacher effectiveness, nor can they reasonably be interpreted as such.