I spent some time the other day, while out running, pondering the usefulness of student growth percentile estimates and value added estimates of teacher effectiveness for the average school or district level practitioner. How would they use them? What would they see in them? How might these performance snapshots inform practice?
Let’s just say I am skeptical that either VAMs (Value Added Models) or SGPs (Student Growth Percentiles) can provide useful insights to anyone who doesn’t have a pretty good understanding of the nuances of these kinds of data/estimates and the underlying properties of the tests. If I were a principal, would I rather have the information than not? Perhaps. But I’m someone whose primary collecting hobby is, well, collecting data. That doesn’t mean it all has meaning, or more specifically, that it has sufficient meaning to influence my thinking or actions. Some does. Some doesn’t. Keeping some of the data that doesn’t have much meaning actually helps me to delineate. But I digress.
It seems like we are spending a great deal of time and money on these things for questionable return. We are investing substantial resources in simply maximizing the links in our data systems between individual students’ records and their classroom teachers of record, hopefully increasing our coverage to, oh, somewhere between 10% and 20% of teachers (those with intact, single teacher classrooms, serving children who already have a track record of prior tests – e.g. upper elementary classroom teachers).
At the outset of this whole “statistical rating of teachers” endeavor, it was perhaps assumed by some economists that we would just ram these things through as large scale evaluation tools (statewide and in large urban districts) and use them to prune the teacher workforce and that would make the system better. We’d shoot first… ask questions later (if at all). We’d make some wrong decisions, hopefully statistically more “right” than wrong, and we’d develop a massive model and data set for large enough numbers of teachers that the cost per unit (cost per bad teacher correctly fired, counterbalanced by the cost per good teacher wrongly fired) would be relatively low. We’d bring it all to scale, and scale would mean efficiency.
Now, I find this whole version of the story to be too offensive to really dig into here and now. I’ve written previously about “smart selection” versus “dumb selection” regarding personnel decisions in schools. And this would be what I called “dumb selection.”
But, it also hasn’t necessarily played out this way… thankfully… except perhaps for some large city systems like Washington, DC, and a few more rigidly mandated state systems (though we’re mostly in wait-and-see mode there as well). Instead, we are now attempting to be more “thoughtful” about how we use this stuff and asking teachers to ponder their statistical ratings for insights into how they interact with children? How they teach? And we are asking administrators to ponder teachers’ statistical estimates for any meaning they might find.
In my current role, as a researcher of education policy, I love equations like this:
I like to see the long lists of coefficients (estimates of how some measure in the model relates to the dependent variable) spit out in my Stata logs and ponder what they might mean, with full consideration of what I’ve chosen to include or exclude in the model, and whether I’m comfortable that the measures on both sides of the equation are of sufficient quality to really tell me anything… or at least something.
The other evening, I thought back to my teaching days (considered a liability as an education policy researcher) and wondered whether it would have been useful to me to simply have some rating of my aggregate effectiveness – simply relative to other teachers. Nothing specific about the performance of my students on specific content/concepts. Just some abstract number… like the relative rarity that my students scored X at the end of my class given that they scored X-Y at the end of last year’s class? Or, some generalized “effectiveness” rating category based on whether my coefficient in the model surpassed a specific cut score to call me “exceptional” or merely “adequate?” Something like this.
Would that be useful to me? To the principal? If I were the principal?
Given that I typically taught 2 sections of 7th grade life science and 2 of 8th grade physical science (yeah… cushy private school job), with class sizes of about 18 students each, which rotated through different times of day, I might also find it fun to compare growth of my various classes. Did the disruptive distraction kid really cause my ratings in one life science section to crash (you know who you are!)? Was the same kid able to bring her 8th grade teacher down the next year (hopefully not me again!)?
I asked myself… would those ratings actually tell me anything about what I should do next year (accepting that the data would come on a yearly cycle)? Should I go watch teachers who got better ratings? Could I? Would they protect their turf? Would that even tell me a damn thing? Besides, knowing what I do now, I also know that large shares of the teachers who got a better rating likely got that rating either because of a) random error/noise in the data or b) some unmeasured attribute of the students they serve (bias). Of course, I didn’t know that then, so what would I think?
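For the data collectors in the audience, the noise point can be illustrated with a toy simulation. This is only a sketch under made-up assumptions, not an actual VAM or SGP: the function name, the number of teachers, and the spreads of "true" teacher effects and student-level noise are all hypothetical values chosen for illustration. The class size of 18 simply borrows from my own classes. It rates each teacher by the average gain of one class and asks what share of the teachers rated in the top fifth truly belong there.

```python
import random
import statistics


def simulate_top_quintile_overlap(n_teachers=500, class_size=18,
                                  teacher_sd=0.15, student_sd=1.0, seed=0):
    """Share of teachers rated in the top fifth who are truly in the top fifth.

    All parameter values are illustrative assumptions, not estimates
    taken from any real testing or evaluation system.
    """
    rng = random.Random(seed)

    # Hypothetical "true" effect of each teacher on student gains.
    true_effects = [rng.gauss(0.0, teacher_sd) for _ in range(n_teachers)]

    # Each teacher's rating is the mean gain of one class: the true
    # effect plus student-level noise, averaged over class_size students.
    estimates = []
    for effect in true_effects:
        gains = [effect + rng.gauss(0.0, student_sd) for _ in range(class_size)]
        estimates.append(statistics.mean(gains))

    # Compare the top 20% by true effect with the top 20% by rating.
    k = n_teachers // 5
    order = range(n_teachers)
    top_true = set(sorted(order, key=lambda i: true_effects[i], reverse=True)[:k])
    top_rated = set(sorted(order, key=lambda i: estimates[i], reverse=True)[:k])
    return len(top_true & top_rated) / k


if __name__ == "__main__":
    share = simulate_top_quintile_overlap()
    print(f"Share of top-rated teachers truly in the top fifth: {share:.2f}")
```

Setting `student_sd` to zero removes the noise and the two lists match exactly; how far they drift apart otherwise depends entirely on the assumed spreads. Note that the sketch covers only random error. The bias from unmeasured student attributes is a separate problem that it does not model at all.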
My gut instinct is that any of these aggregate indicators of a teacher’s relative effectiveness, generated from complex statistical models, with or without corrections for other factors, are little more than ink blots to most teachers and administrators. And I’m not convinced they’ll ever be anything more than that. They possess many of the same attributes of randomness or fuzziness as an ink blot. And while the staunchest advocate might wish them to appear as an impressionist painting, I expect they are still most often seen as ink blots – not even a Jackson Pollock. More random than pattern. And even if/when there is a pattern, the average viewer may never pick it up.
I anxiously (though skeptically) await well crafted qualitative studies exploring stakeholders’ interpretations of these inkblots.
But these aren’t just any ink blots. They are rather expensive ink blots if and when we start trying to use them in more comprehensive and human resource intensive ways through local public schools and districts and if we weigh on them the burden that we MUST use them not merely to inform, but rather to DRIVE our decisions – and must find significant meaning in them to justify doing so. That is, if we really expect teachers and principals to log significant hours trying to derive meaning from them, after consultants, researchers, central office administrators and state department officials have labored over data system design, linking teachers to students, and deciding on the most aesthetically pleasing representation of teacher performance classifications for the individual reporting system. Using these tools as quick screening, blunt instruments is certainly a bad idea. But is this – staring at them for endless hours in search of meaning that may not be there – much better?
It strikes me that there are a lot more useful things we could/should/might be spending our time looking at in order to inform and improve educational practice or evaluate teachers. And that the cumulative expenditure on these ink blots, including the cost of time spent musing over them, might be better applied elsewhere.