More on the SGP debate: A reply

This new post from Ed News Colorado is in response to my critique of Student Growth Percentiles here:

I must say that I agree with almost everything in this response to my post, except for a few points. First, they argue:

Unfortunately Professor Baker conflates the data (i.e. the measure) with the use. A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.

No, I do not conflate the data and measures with their proposed use. Policymakers are doing that, and doing so on ill-advised guidance from other policymakers who miss the important point – the primary purpose – as Betebenner, Briggs and colleagues explain. This is precisely why I used their work in my previous post – because it explains their intent and provides their caveats.

Policymakers, by contrast, are pitching the direct use of SGPs in teacher evaluation. Whether the developers intended this or not, that’s what’s happening. Perhaps this is because they have not explained, as bluntly as they do here, what the actual intent/design was.

Further, I should point out that while I have marginally more faith that a VAM could, in theory, be used to parse out a teacher effect than an SGP, which isn’t even intended to do so, I do not have any more faith than they do that a VAM actually can accomplish this objective. They interpret my post as follows:

Despite Professor Baker’s criticism of VAM/SGP models for teacher evaluation, he appears to hold out more hope than we do that statistical models can precisely parse the contribution of an individual teacher or school from the myriad of other factors that contribute to students’ achievement.

I’m not, as they characterize me, a VAM supporter over SGP, and any reader of this blog certainly realizes that. However, it is critically important that state policymakers be informed that SGP is not even intended to be used in this way. I’m very pleased they have chosen to make this the central point of their response!

And while SGP information might reasonably be used in other ways, if used as a tool for ranking and sorting teachers or schools by effectiveness, SGP results would likely be even more biased than VAM results… and we may not even know, or be able to figure out, to what extent.

I agree entirely with their statement (but for the removal of “freakin”):

We would add that it is a similar “massive … leap” to assume a causal relationship between any VAM quantity and a causal effect for a teacher or school, not just SGPs. We concur with Rubin et al (2004) who assert that quantities derived from these models are descriptive, not causal, measures. However, just because measures are descriptive does NOT imply that the quantities cannot and should not be used as part of a larger investigation of root causes.

The authors of the response make one more point, that I find objectionable (because it’s a cop out!):

To be clear about our own opinions on the subject: The results of large-scale assessments should never be used as the sole determinant of education/educator quality.

What the authors accomplish with this point is to permit policymakers to continue to assume (pointing to this quote as their basis) that they can actually use this kind of information for, say, a fixed 90% share of high stakes decision making regarding school or teacher performance – and certainly that a fixed 40% or 50% weight would be reasonable. Just not 100%. Sure, they didn’t mean that. But it’s an easy stretch for a policymaker.

If the measures aren’t meant to isolate system, school or teacher effectiveness, or if they were meant to but simply can’t, they should NOT be used for any fixed, defined, inflexible share of any high stakes decision making.  In fact, even better, more useful measures shouldn’t be used so rigidly.

[Also, as I’ve pointed out in the past, when a rigid indicator is included as a large share (even 40% or more) in a system of otherwise subjective judgments, the rigid indicator might constitute 40% of the weight but drive 100% of the decision.]
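That bracketed point is easy to demonstrate with a toy simulation (all numbers hypothetical, chosen only for illustration): when the subjective components of a composite cluster tightly – as observation scores typically do – while the rigid indicator varies widely, a 40% weight on the rigid indicator ends up determining essentially the entire ranking.

```python
import random

random.seed(0)

# Hypothetical composite rating for 100 teachers:
# a subjective component (60% weight) that clusters tightly near the top,
# and a rigid statistical indicator (40% weight) that varies widely.
n = 100
subjective = [random.gauss(90, 2) for _ in range(n)]
rigid = [random.gauss(50, 15) for _ in range(n)]
composite = [0.6 * s + 0.4 * r for s, r in zip(subjective, rigid)]

def pearson(a, b):
    """Plain Pearson correlation, computed by hand for self-containment."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# The 40%-weighted component contributes 0.4*15 = 6 points of spread versus
# 0.6*2 = 1.2 for the 60%-weighted one, so it dominates the final ordering.
print(pearson(rigid, composite))       # near 1.0
print(pearson(subjective, composite))  # far lower
```

The weights govern levels, but the component with the most variance governs who ranks above whom – which is what drives the high stakes decisions.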

So, to summarize, I’m glad we are, for the most part, on the same page. I’m frustrated that I’m the one who had to raise this issue, in part because it was pretty clear to me from reading the existing work on SGPs that many were conflating the measure with its use. I’m still concerned about the use, and especially concerned in the current policy context. I hope in the future that the designers and promoters of SGP will proclaim more loudly and clearly their own caveats – their own cautions – and their own guidelines for appropriate use.

Simply handing off the tool to the end user and then walking away in the face of misuse and abuse would be irresponsible.

Addendum: By the way, I do hope the authors will happily testify on behalf of the first teacher who is wrongfully dismissed or “de-tenured” on the basis of 3 bad SGPs in a row. That they will testify that SGPs were never intended to assume a causal relationship to teacher effectiveness, nor can they be reasonably interpreted as such.


12 thoughts on “More on the SGP debate: A reply”

  1. Thanks for all the work you do to clarify the use and abuse of educational data.

    I have a question about what you view as an appropriate use of VAM as it currently exists. For example, this VAM report on my local school district seems to me to be useless or almost useless as a guide to making policy decisions: What use do you think could be made of it?

    1. I think you’ve missed my point that I actually don’t know that VAM can produce useful information, at least in terms of determining teacher effectiveness. But VAM is at least designed to do so, whereas SGPs really aren’t. So, on the one hand, you’ve got a tool that is designed to estimate a teacher effect but doesn’t accomplish that goal. And on the other hand, you’ve got a tool that isn’t even supposed to be used that way, and also can’t accomplish that goal. BUT… you’ve got policymakers mandating that both measures be used in this way (and, to some extent, researchers and consultants remaining silent on this misuse). None of it makes a whole lot of sense, does it? Thanks for the link. I’ll explore it further!

  2. My district isn’t looking at VAM for teacher effectiveness, but they make noises (that never result in action) about using it for policy evaluation and budgeting. I probably should have mentioned those examples, but refrained because I wanted to ask the general “if not teacher evaluation, then what?” question.

    Meanwhile, money that could be used directly in the service of education is spent each year to generate these reports, and annually the Board Members and administrators – who lack training in statistics – spend an hour or so being confused by an overly simplified walk-through (this year’s powerpoints are linked here:

    My position has been that, while interesting from a policy wonk perspective, I have never seen anything in these reports that could help improve education or district management. Two years ago they almost took it out of the budget, but the Superintendent chanted “data-driven” a few times and back it went.

    I’m a historian (of education) by training who took some advanced regression and other quantitative analysis courses, but I’m far from confident in my knowledge of this area. With that in mind, I thought I’d ask an expert – you – whether you thought this was useful, or how you thought it might be useful.

    As an interested wonk, parent, and advocate, I appreciate any attention you can give the report or questions.

    Thanks again.

    1. I am skeptical that either VAMs or SGPs can provide insights to anyone who doesn’t have a pretty good understanding of the nuances of these kinds of data/estimates and the underlying properties of the tests. I agree with you on the opportunity cost issue.

      We are spending a great deal of time and money on these things for questionable return. It was perhaps assumed by some economists that we would just ram these things through at large scale (statewide and in large urban districts) and use them to prune the teacher workforce, and that this would make the system better (layers of faulty and offensive assumptions). We’d make some wrong decisions, hopefully statistically more “right” than wrong, and we’d develop a massive model and data set for large enough numbers of teachers that the cost per unit (cost per bad teacher correctly fired, counterbalanced by the cost per good teacher wrongly fired) would be relatively low. We’d bring it all to scale.

      Now, I find this whole version of the story to be too offensive to really dig into here and now. But, it also hasn’t necessarily played out this way… except for some large city systems like NYC and DC, and a few rigid mandate state systems. Instead, we are now attempting to be more “thoughtful” about this stuff and asking teachers to ponder their statistical ratings for insights into how they interact with children? How they teach? And we are asking administrators to ponder teachers’ statistical estimates for any meaning they might find.

      It’s like a freakin’ inkblot test… and one that at the school or medium size district level comes at a pretty high price for producing each inkblot.

      While I was on my run this evening, I thought back to my teaching days and asked whether it would have been useful to me to simply have some rating of my aggregate effectiveness – simply relative to other teachers. Nothing specific about the performance of my students on specific content/concepts. Just some abstract number… like the relative rarity that my students scored X at the end of my class given that they scored X-Y at the end of last year’s class? I asked myself: would that actually tell me anything about what I should do next year (let alone provide the information to make reasonable mid-year changes)? Should I go watch teachers who got better ratings? Could I? Would they protect their turf? Besides, knowing what I do now, I also know that most of the teachers who got a better rating likely got it because of either a) random error/noise in the data or b) some unmeasured attribute of the students they serve (bias).

      It does strike me that there are a lot more useful things we could/should be spending our time looking at in order to inform practice or evaluate teachers. And that the cumulative expenditure on these ink blots, including the cost of time spent musing over them, might be better applied elsewhere.
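The “relative rarity” described a couple of paragraphs up is, at heart, a conditional percentile rank. Here is a deliberately toy sketch of that idea (data and function name hypothetical; the actual Colorado Growth Model estimates conditional quantiles via quantile regression rather than the crude peer-binning used here):

```python
def growth_percentile(students, prior, current, band=5):
    """Toy SGP: percentile rank of `current` among students whose prior
    score falls within +/- `band` points of `prior`.
    (The real model fits smooth conditional quantile functions instead.)"""
    peers = [cur for (pri, cur) in students if abs(pri - prior) <= band]
    below = sum(1 for c in peers if c < current)
    return 100.0 * below / len(peers)

# Hypothetical data: (prior-year score, current-year score) pairs.
students = [(50, 55), (52, 60), (48, 50), (51, 70), (49, 45),
            (50, 58), (53, 62), (47, 52), (52, 49), (50, 66)]

# A student with prior 50 and current 60 outgrew 6 of 10 similar peers:
print(growth_percentile(students, prior=50, current=60))  # 60.0
```

Note that nothing in the calculation refers to the teacher or the school: the number describes a student’s progress relative to academic peers, which is exactly why attributing it to a teacher requires a separate (and unestablished) causal argument.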

    2. Thomas, here is one use of school and grade-level VAM for which I’ve heard positive feedback: At a district principals meeting, principals divide into groups of similar schools based on geographic regions, attainment levels, or student populations. A principal from a high value-added school will present her results to the group and explain the strategies she believes are producing the results. (This is something decided in advance, so the principal has some time to go over her report and consider a presentation.) The other principals can then give their input as well.
      There are other possible uses, such as program evaluation. From the report you linked to, it appears if Madison is considering any district-wide curriculum changes, they should consider 4th and 5th grade math.

  3. I should add that the State of Wisconsin is moving toward VAM (this version most likely) for teacher and school effectiveness ratings. This is being done via a Gubernatorial “Accountability Design Team” that has not been very transparent.

  4. Few know which state just took away collective bargaining, tenure, and seniority all at once! Nothing to fret about now, as recycling teachers for new, lower-paid ones every two years is a fact of life here in Idaho. Is there another state that has so harmed its teachers and education?

  5. Thanks for the example. Doing this of course assumes that the VAM scores are meaningful, something I am very skeptical about, because of the underlying data (WKCE standardized tests), the regression coefficients and other parts of the statistical methods, and the fact that individual schools appear to be all over the place from year to year (this is about year 5 or 6 of MMSD VAM reports; note that no trend or historical data is included in the current version). I will note that a glance at one graph tells me that low poverty schools tend to be high VAM schools, despite all the adjustment that is supposed to deal with that (there are some exceptions, like the school that serves grad student housing – high poverty, but…..).

    One thing to add is that MMSD’s recent Task Force on Math decided that middle school teacher content knowledge was what needed the most attention. Little or no local evidence, but there were some national panic reports on the topic at the time the Task Force was meeting. That was about $75,000 for the Task Force, and we are easily above $1,000,000 for continuing ed and staff development to address this “problem.” Maybe it is a problem, but the evidence wasn’t very convincing.

  6. So, why did they write a piece that you don’t disagree with? As best I can tell, they made no effort to link their arguments with the real world.
    They quote approvingly, “All models are wrong, but some are useful”.

    But the above piece gave no evidence that their models are useful. The closest they come is the statement,
    “The causal nature of the questions together with the observational nature of the data makes the use of large-scale assessment data difficult “detective work”. … We BELIEVE that the education system as a whole can benefit from such scrupulous detective work.” (emphasis mine; the failure to explain that belief is theirs)

    Of course, “when all stakeholders hold a seat at the table and are collectively engaged in these efforts to develop and maintain an education system geared toward maximizing the academic progress of all students.” What does that have to do with using VAMs for evaluation in systems run by flesh and blood individuals? If all men were angels, the above would be relevant. After all, VAMs don’t inflict harm, humans and organizations using VAMs are the danger.

    1. I believe they were actually trying to manipulate the argument and distract the reader from the main issue – the one they acknowledge: that SGPs are not meant for determining teacher effectiveness. They wanted to cast it instead as “SGPs can be useful measures”… which I’m not sure of either, since the viewer of the measure has to be able to interpret the statistics in light of the context or contexts from which they are derived, since the measures make no attempt to integrate contextual factors. In short, they have something to sell. Many states are buying it on the basis that they believe they can use it as a substantial factor in rating teacher effectiveness. If the authors are too blunt about the limitations in this regard, states might stop buying it. They’d probably buy something else that’s only marginally less bad (not good, but less bad). Some states might buy it on the basis that it’s at least a useful measure, whether for teacher effectiveness or not. State legislators might be willing to permit local administrators to make bad decisions with the data, even if state policymakers knew better. It’s really convenient to pass the buck on the question of appropriate use versus the design and intent of the measure. If the appropriate uses are too limited, then there’s no reason to buy it at all.

  7. Michelle Rhee recently called VAM a “fairer, more transparent and consistent way to evaluate teachers.” I think it was the “transparent” that got to me the most.

    1. Indeed, being classified as effective or not based on your fixed effect (and given the standard error of the estimate) in a multi-year, complex regression equation is hardly transparent.
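To make that concrete, here is a toy sketch (entirely simulated, hypothetical numbers) of the kind of regression at issue: current scores regressed on prior scores plus teacher dummies. With realistic noise and 30 students per teacher, the standard error on a teacher “effect” comes out roughly as large as the effect itself – the point about opacity and uncertainty in one small example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-VAM: score ~ prior score + teacher dummies.
n_teachers, per_teacher = 3, 30
teacher = np.repeat(np.arange(n_teachers), per_teacher)
prior = rng.normal(50, 10, size=teacher.size)
true_effect = np.array([0.0, 2.0, -1.0])   # assumed "true" teacher effects
score = 10 + 0.8 * prior + true_effect[teacher] + rng.normal(0, 8, teacher.size)

# Design matrix: intercept, prior score, dummies for teachers 1 and 2
# (teacher 0 is the reference category).
X = np.column_stack([
    np.ones_like(prior),
    prior,
    (teacher == 1).astype(float),
    (teacher == 2).astype(float),
])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# Standard errors from the usual OLS variance formula.
resid = score - X @ beta
dof = X.shape[0] - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# With noise sd 8 and 30 students per teacher, the standard error on a
# teacher dummy is on the order of 2 points -- as large as the effects.
print(beta[2:], se[2:])
```

A teacher classified as “ineffective” on this basis is being judged on a coefficient whose confidence interval may comfortably include zero.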
