I’ve posted a few blogs recently on the topic of Student Growth Percentile scores, or SGPs, and on how many state policymakers have moved to adopt these measures and integrate them into new evaluation systems for teachers. In my first post, I argued that SGPs are simply not designed to support inferences about teacher effectiveness.
The designers of SGPs replied to my first post, suggesting that by arguing that these measures can’t and shouldn’t be used to infer teacher effectiveness, I was conflating the measures with their use. In their response (more below), they explained in greater detail what was essentially my main point – that SGPs are not designed or intended to infer teacher effectiveness from student achievement growth. They also argued that the policymakers they have advised on adopting SGPs understood that.
Well, let’s review what’s going on in New Jersey, where a handful of districts have signed on to the Department of Education’s pilot teacher evaluation program, explained here: http://www.state.nj.us/education/EE4NJ/faq/
Specifically, here’s how NJDOE responds to the question of how standardized testing data, and SGPs based on those data, would be used within the pilot evaluations:
Q: How much weight do standardized test scores get in the evaluations?
A: Standardized test scores are not available for every subject or grade. For those that exist (Math and English Language Arts teachers of grades 4-8), Student Growth Percentages (SGPs), which require pre- and post-assessments, will be used. The SGPs should account for 35%-45% of evaluations. The NJDOE will work with pilot districts to determine how student achievement will be measured in non-tested subjects and grades.
Now, here is a quote from Betebenner and colleagues’ response to my criticism of policymakers’ proposed uses of SGPs in teacher evaluation.
A primary purpose in the development of the Colorado Growth Model (Student Growth Percentiles/SGPs) was to distinguish the measure from the use: To separate the description of student progress (the SGP) from the attribution of responsibility for that progress.
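To make that description-versus-attribution distinction concrete, here is a minimal, hypothetical sketch of what a growth percentile describes. (This is not the actual SGP methodology, which conditions on multiple prior scores via quantile regression; the scores and the `growth_percentile` helper are invented for illustration.)

```python
def growth_percentile(current_score, peer_scores):
    """Percentile rank of a student's current score among academic
    peers who had the same prior score. It describes progress relative
    to peers; it says nothing about who or what caused that progress."""
    below = sum(1 for s in peer_scores if s < current_score)
    return round(100 * below / len(peer_scores))

# Hypothetical: peers who all scored 200 last year; their scores this year.
peers = [205, 210, 212, 215, 220, 225, 230, 235, 240, 245]

# A student with the same prior score who scored 232 this year
# outgrew 7 of the 10 peers.
print(growth_percentile(232, peers))  # prints 70
```

The output is purely descriptive: the student grew more than 70% of comparable peers. Nothing in the calculation identifies the teacher, the school, or any other cause.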
But, you see, using these data to “evaluate teachers” necessarily implies “attribution of responsibility for that progress.” Attribution of responsibility to the teacher! If one cannot use these measures to attribute responsibility to the teacher, then how can one possibly use these measures to “evaluate” the teacher? One can’t. You can’t. No one can. No one should!
Perhaps in an effort to preserve proprietary interests, Betebenner and colleagues in their reply to my original criticism also note:
To be clear about our own opinions on the subject: The results of large-scale assessments should never be used as the sole determinant of education/educator quality.
No state or district that we work with intends them to be used in such a fashion. That, however, does not mean that these data cannot be part of a larger body of evidence collected to examine education/educator quality.
But this statement stands in direct conflict with the first above. If the tool is insufficient for – simply not even designed to – ATTRIBUTE RESPONSIBILITY FOR PROGRESS to either teachers or schools, then it simply can’t and SHOULDN’T BE USED THAT WAY! Be it for 10% or 90%.
The reality is that even though Betebenner and colleagues explain that they believe the policymakers with whom they have consulted “get it” and would never consider misusing the measures in the ways I explained in my original post, that is precisely what is going on.
Also, I noted previously that this paragraph from their response is a complete cop-out. I explained:
What the authors accomplish with this point is permitting policymakers to still assume (pointing to this quote as their basis) that they can actually use this kind of information, for example, for a fixed 90% share of high-stakes decision making regarding school or teacher performance, and certainly that a fixed 40% or 50% weight would be reasonable. Just not 100%. Sure, they didn’t mean that. But it’s an easy stretch for a policymaker.
If the measures aren’t meant to isolate system, school, or teacher effectiveness, or if they were meant to but simply can’t, they should NOT be used for any fixed, defined, inflexible share of any high-stakes decision making. In fact, even better, more useful measures shouldn’t be used so rigidly.
[Also, as I’ve pointed out in the past, when a rigid indicator is included as a large share (even 40% or more) in a system of otherwise subjective judgments, the rigid indicator might constitute 40% of the weight but drive 100% of the decision.]
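The bracketed point above can be shown with a toy calculation. (The teacher names, scores, and weights below are invented for illustration; they are not drawn from any actual evaluation system.)

```python
# Hypothetical: three teachers, each with a rigid SGP-based score (0-100)
# and a subjective observation score (0-100). Observation ratings tend to
# cluster tightly, while test-based measures spread widely.
teachers = {
    "A": (20, 88),  # (SGP-based score, observation score)
    "B": (50, 90),
    "C": (80, 92),
}

# Composite: 40% rigid indicator, 60% subjective ratings.
composites = {
    name: 0.40 * sgp + 0.60 * obs
    for name, (sgp, obs) in teachers.items()
}
print(composites)
# The observation scores span only 4 points, so the 60%-weighted component
# can separate composites by at most 0.60 * 4 = 2.4 points. The SGP scores
# span 60 points, so the 40%-weighted component separates them by up to
# 0.40 * 60 = 24 points. The "40%" measure fully determines the ranking.
```

In this sketch the rigid indicator carries 40% of the nominal weight but 100% of the decision: reorder the SGP scores and the ranking flips; reorder the observation scores and nothing changes.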
Look. It’s pretty simple. If you want to pilot an airplane effectively, the plane needs to have the right instruments – flight instruments. Suppose you’re coming in for a landing in dense fog over mountainous terrain, you look down to where your flight instruments should be (http://www.b737.org.uk/images/fltinsts_panel_nonefis.jpg), and there sits an alto saxophone instead (albeit a fine Selmer Mark VI with a serial number in the 180s). You’re screwed. You might have a few minutes left to blow through the changes to A Foggy Day, but your chances of successfully piloting the plane to a safe landing are severely diminished.
Okay, this analogy is a bit of a stretch. But it is not a stretch to acknowledge that SGPs were simply not designed to attribute responsibility for student progress to teachers. Meanwhile, VAM models try to attribute student progress to teachers, but cannot do so effectively, accurately, or precisely. So, we have a choice of piloting the plane with either a) the wrong instruments (SGPs) or b) instruments that don’t work very well (that have high error rates and comparable problems of inference). When faced with choices this bad, it may be wise to take another course entirely. Don’t pilot the damn plane! It would be a shame to crash it with such a beautiful saxophone on board!