Readers of my blog know I’m both a data geek and a skeptic of the usefulness of value-added data, specifically as a human resource management tool for schools and districts. There’s been much talk this week about the release of the New York City teacher ratings to the media, and the subsequent publication of those data by various news outlets. Most of the talk about the ratings has focused on the error rates in the ratings, and reporters from each news outlet have spent a great deal of time hiding behind their supposed ultra-responsibleness of being sure to inform the public that these ratings are not absolute, that they have significant error ranges, etc. Matt Di Carlo over at Shanker Blog has already provided a very solid explanatory piece on the error ranges and how those ranges affect classification of teachers as either good or bad.
But, the imprecision – as represented by error ranges – of each teacher’s effectiveness estimate is but one small piece of this puzzle. And in my view, the various other issues involved go much further in undermining the usefulness of the value added measures which have been presented by the media as necessarily accurate albeit lacking in precision.
Remember, what we are talking about here are statistical estimates generated on tests of two different areas of student content knowledge – math and English language arts. What is being estimated is the extent of change in score (for each student, from one year to the next) on these particular forms of these particular tests of this particular content, and only for this particular subset of teachers who work in these particular schools.
We know from other research (from Corcoran and Jennings, and from the first Gates MET report) that value added estimates might be quite different for teachers of the same subject area if a different test of that subject is used.
We know that summer learning may affect student annual value added, yet in this case, NYC is estimating teacher effectiveness on student outcomes from year to year. That is, the difference in a student’s score from one day in the spring of 2009 to another in the spring of 2010 is being attributed to a teacher who has contact with that child for a few hours a day from September to June (but not July and August).
The NYC value-added model does indeed include a number of factors which attempt to make fairer comparisons between teachers of similar grade levels, similar class sizes, etc. But we also know that those attempts work only so well.
Focusing on error rate alone presumes that we’ve got the model and the estimates right – that we are making valid assertions about the measures and their attribution to teaching effectiveness.
That is, that we really are estimating the teacher’s influence on a legitimate measure of student learning in the given content area.
Then error rates are thrown into the discussion (and onto the estimates) to provide the relevant statistical caveats about their precision.
That is, accepting that we are measuring the right thing and rightly attributing it to the teacher, there might be some noise – some error – in our estimates.
If the estimates lack validity, or are biased, the amount of noise, or error, around the invalid or biased estimate is really a moot point.
In fact, as I’ve pointed out before on this blog, value added estimates that retain bias by failing to fully control for outside influences are actually likely to be more stable over time (to the extent that the outside influences themselves remain stable over time). And that’s not a good thing.
So, to the news reporters out there, be careful about hiding behind the disclaimer that you’ve responsibly provided the error rates to the public. There’s a lot more to it than that.
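To make the point about bias and stability concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration, not the NYC model: each teacher's estimate is a true effect plus a persistent bias (say, from stable school conditions the model fails to control for) plus fresh noise each year. The standard deviations, the function names and the sample size are all made up.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate_stability(bias_sd, n_teachers=5000, true_sd=1.0, noise_sd=2.0, seed=0):
    """Year-to-year correlation of estimates = true effect + persistent bias + noise.

    All parameters are hypothetical; the bias term is drawn once per teacher
    and reused in both years, mimicking an uncontrolled, stable outside influence.
    """
    rng = random.Random(seed)
    year1, year2 = [], []
    for _ in range(n_teachers):
        true_effect = rng.gauss(0.0, true_sd)
        bias = rng.gauss(0.0, bias_sd)  # identical in both years
        year1.append(true_effect + bias + rng.gauss(0.0, noise_sd))
        year2.append(true_effect + bias + rng.gauss(0.0, noise_sd))
    return pearson(year1, year2)

unbiased_r = simulate_stability(bias_sd=0.0)
biased_r = simulate_stability(bias_sd=1.5)
print(f"no persistent bias:   r = {unbiased_r:.2f}")
print(f"with persistent bias: r = {biased_r:.2f}")
```

Under these assumptions the biased estimates come out more stable from year to year, even though they are less valid as measures of the teacher's own contribution. A higher year-to-year correlation is therefore not, by itself, evidence of a better measure.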
Playing with the Data
So, now for a little playing with the data, which can be found here:
I personally wanted to check out a few things, starting with assessing the year to year stability of the ratings. So, let’s start with some year to year correlations achieved by merging the teacher data reports across years for teachers who stayed in the same school teaching the same subject area to the same grade level. Note that teacher IDs are removed from the data. But teachers can be matched within school, subject and grade level, by name over time (by concatenating the dbn [school code], teacher name, grade level and subject area [changing subject area and grade level naming to match between older and newer files]). First, here’s how the year to year correlations play out for teachers teaching the same grade, subject area and in the same school each year.
Sifting through the Noise
As with other value-added studies, the correlations across teachers in their ratings from one year to the next seem to range from about .10 to about .50. Note that between 2009-10 and 2008-09 Math value-added estimates were relatively highly correlated, compared to previous years (with little clear evidence as to why, but for possible changes to assessments, etc.). Year to year correlations for ELA are pretty darn low, especially prior to the most recent two years.
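For anyone who wants to try the matching step described above, here is a rough sketch. It is not the code I ran. The field names (dbn, teacher_name, grade, subject, va_estimate), the alias table and the toy rows are all hypothetical, and the real files need more cleanup than this.

```python
from math import sqrt

# Hypothetical alias table: older and newer files name subjects differently.
SUBJECT_ALIASES = {"english language arts": "ela", "mathematics": "math"}

def match_key(row):
    """Composite key: school code (dbn) + teacher name + grade + subject."""
    subject = str(row["subject"]).strip().lower()
    subject = SUBJECT_ALIASES.get(subject, subject)
    parts = [row["dbn"], row["teacher_name"], row["grade"]]
    return "|".join([str(p).strip().lower() for p in parts] + [subject])

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def year_to_year_correlation(prior_rows, current_rows, value_field="va_estimate"):
    """Match teachers on the composite key and correlate their estimates.

    Only teachers present in both years (same school, grade, subject) are kept.
    Duplicate keys within a year silently collapse to the last row seen.
    """
    prior = {match_key(r): r[value_field] for r in prior_rows}
    pairs = [
        (prior[match_key(r)], r[value_field])
        for r in current_rows
        if match_key(r) in prior
    ]
    xs, ys = zip(*pairs)
    return pearson(xs, ys), len(pairs)

# Toy, made-up rows purely to exercise the functions.
prior_rows = [
    {"dbn": "01M001", "teacher_name": "Smith, A", "grade": 4, "subject": "Mathematics", "va_estimate": 0.1},
    {"dbn": "01M001", "teacher_name": "Jones, B", "grade": 5, "subject": "Mathematics", "va_estimate": 0.3},
    {"dbn": "01M002", "teacher_name": "Lee, C", "grade": 4, "subject": "Mathematics", "va_estimate": -0.2},
    {"dbn": "01M002", "teacher_name": "Diaz, D", "grade": 6, "subject": "Mathematics", "va_estimate": 0.5},
]
current_rows = [
    {"dbn": "01M001", "teacher_name": "SMITH, A", "grade": 4, "subject": "Math", "va_estimate": 0.2},
    {"dbn": "01M001", "teacher_name": "Jones, B", "grade": 5, "subject": "Math", "va_estimate": 0.1},
    {"dbn": "01M002", "teacher_name": "Lee, C", "grade": 4, "subject": "Math", "va_estimate": -0.1},
    {"dbn": "01M002", "teacher_name": "Evans, E", "grade": 6, "subject": "Math", "va_estimate": 0.4},
]
r, n_matched = year_to_year_correlation(prior_rows, current_rows)
print(f"matched {n_matched} teachers, r = {r:.2f}")
```

Matching on names is fragile: spelling changes, name changes and two teachers with the same name in the same school, grade and subject will all produce missed or false matches, so any correlation from this approach applies only to the cleanly matched subset.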
I’ve done a little color coding here for fun. Dots coded in orange are those that stayed in the “average” category from one year to the next. Dots in bright red are those that stayed “high” or “above average” from one year to the next and dots in pale blue were “low” or “below average” from one year to the next. But there are also significant numbers of dots that were above average or high in one year, and below average or low in the next. 9 to 15% (of those who were “good” or were “bad” in the previous year) move all the way from good to bad or bad to good. 20 to 35% who were “bad” stayed “bad” & 20 to 35% who were “good” stayed “good.” And this is between the two years that show the highest correlation for ELA.
Here’s what the math estimates look like:
There’s actually a visually identifiable positive relationship here. Again, this is the relationship between the two most recent years, which by comparison to previous years, showed a higher correlation.
For math, only about 7% of teachers (of those who were “good” or “bad” the previous year) jump all the way from bad to good or good to bad, and about 30 to 50% of those who were good remain good, or of those who were bad remain bad.
But, that still means that even in the more consistently estimated models, half or more of teachers move into or out of the good or bad categories from year to year, between the two years that show the highest correlation in recent years.
And this finding still ignores whether other factors may be at play in keeping teachers in certain categories. For example, whether teachers stay labeled as ‘good’ because they continue to work with better students or in better environments.
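The good/bad movement shares quoted above for ELA and math come from collapsing the rating categories and tabulating year-to-year transitions. A minimal sketch of that tabulation follows, with hypothetical function names and made-up rating pairs rather than the actual NYC data.

```python
GOOD = {"high", "above average"}
BAD = {"low", "below average"}

def collapse(rating):
    """Collapse the five published categories into good / average / bad."""
    label = rating.strip().lower()
    if label in GOOD:
        return "good"
    if label in BAD:
        return "bad"
    return "average"

def transition_shares(pairs):
    """pairs: (prior-year rating, current-year rating) for matched teachers.

    Shares are computed relative to teachers who were good or bad in the
    prior year. Assumes at least one prior-good and one prior-bad teacher.
    """
    collapsed = [(collapse(a), collapse(b)) for a, b in pairs]
    prior_good = [now for before, now in collapsed if before == "good"]
    prior_bad = [now for before, now in collapsed if before == "bad"]
    flipped = prior_good.count("bad") + prior_bad.count("good")
    return {
        "stayed_good": prior_good.count("good") / len(prior_good),
        "stayed_bad": prior_bad.count("bad") / len(prior_bad),
        "flipped": flipped / (len(prior_good) + len(prior_bad)),
    }

# Made-up pairs purely for illustration.
toy_pairs = [
    ("High", "Above Average"),
    ("Above Average", "Average"),
    ("Above Average", "Low"),
    ("Low", "Below Average"),
    ("Below Average", "High"),
    ("Low", "Average"),
    ("Average", "Average"),
    ("Average", "High"),
]
shares = transition_shares(toy_pairs)
print(shares)
```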
Searching for Potential Sources of Bias
My next fun little exercise in playing with the VA data involved merging the data by school dbn to my data set on NYC school characteristics. I limited my sample for now to teachers in schools serving all grade levels 4 to 8 and with complete data in my NYC schools data, which include a combination of measures from the NCES Common Core and NY State School Report Cards. I did a whole lot of fishing around to determine whether there were any particular characteristics of schools that appeared associated either or both with individual teacher value added estimates or with the likelihood that a teacher ended up being rated “good” or “bad” by my aggregations used here. I will present my preliminary findings with respect to those likelihoods here.
Here are a few logistic regression models of the odds that a teacher was rated “good” or rated “bad” based on a) the multi-year value-added categorical rating for the teacher and b) based on school year 2009 characteristics of their school across grades 4 to 8.
After fishing through a plethora of measures on school characteristics (because I don’t have classroom characteristics for each teacher), I found that with relative consistency, using the Math ratings, teachers in schools with higher math proficiency rates tended to get better value added estimates for math and were more likely to be rated “good.” This result was consistent across multiple attempts, models, subsamples (Note that I’ve only got 1300 of the total math teachers rated here… but it’s still a pretty good and well distributed sample). Also, teachers in schools with larger average class size tended to have lower likelihood of being classified as “above average” or “high” performers. These findings make some sense, in that peer group effects may be influencing teacher ratings, and class size effects (perhaps as spillover?) may not be fully captured in the model. The attendance rate factor is somewhat more perplexing.
Again, these models were run with the multi-year value added classification.
Next, I checked to see if there were differences in the likelihood of getting back to back good or back to back bad ratings by school characteristics. Here are the models:
As it turns out, the likelihood of achieving back to back good or back to back bad ratings is also influenced by school characteristics. Here, as class size increases by 1 student, the likelihood that a teacher in that school gets back to back bad ratings goes up by nearly 8%. The likelihood of getting back to back good ratings declines by 6%. The likelihood of getting back to back good ratings increases by nearly 8% in a school with 1% higher math proficiency rate in grades 4 to 8.
These are admittedly preliminary checks on the data, but these findings in my view do warrant further investigation into school level correlates with the math value added estimates and classifications in particular. These findings are certainly suggestive of possible estimate bias.
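A note on reading the percentages above. Logistic regression results are usually reported as odds ratios, so this sketch assumes that "goes up by nearly 8%" corresponds to an odds ratio of roughly 1.08 per additional student. That mapping, the function names and the 20% baseline probability are assumptions for illustration, not figures from the models.

```python
def percent_change_in_odds(odds_ratio, units=1):
    """Percent change in the odds after `units` one-unit increases.

    Odds ratios compound multiplicatively, not additively.
    """
    return (odds_ratio ** units - 1.0) * 100.0

def shift_probability(baseline_prob, odds_ratio, units=1):
    """Convert a baseline probability to odds, apply the ratio, convert back."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1.0 + new_odds)

# Assumed odds ratio of about 1.08 per extra student (from "nearly 8%").
one_more = percent_change_in_odds(1.08, units=1)
five_more = percent_change_in_odds(1.08, units=5)

# Hypothetical baseline: a 20% chance of back to back bad ratings.
new_prob = shift_probability(0.20, 1.08, units=5)
print(f"+1 student: odds up {one_more:.0f}%; +5 students: odds up {five_more:.0f}%")
print(f"baseline 20% -> {new_prob:.1%} with 5 more students per class")
```

The practical point is that a change in odds is not the same as a change in probability, and the size of the probability shift depends on where a teacher starts.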
Who Gets VAM-ED?
Finally, while there’s been much talk about these ratings being released for such a seemingly large number of teachers – 18,000 – it’s important to put those numbers in context in order to evaluate their relevance. First of all, it’s 18,000 ratings, not teachers. Several teachers are rated for both math and ELA, bringing the total number of individuals down significantly from 18,000. In still generous terms, the 18,000 or so are more like “positions” within schools, but even then, the elementary classroom teacher covers both areas even within the same assignment or position.
Based on the NY State Personnel Master File for 2009-10, there were about 150,000 (linkable to individual schools including those in the VA reports) certified staffing assignments in New York City in 2009-10 (where individual teachers cover more than one assignment). In that light, 18,000 is not that big a share.
But let’s look at it at the school level using two sample schools. For these comparisons I picked two schools which had among the largest numbers of VA math estimates (with many of the same teachers in those schools having VA ELA estimates). The actual listing of teacher assignments is provided for two schools below, along with the number of teachers for whom there were Math VA estimates. Again, these are schools with among the highest reported number (and share) of teachers who were assigned math effectiveness ratings.
In each case, we are Math VAM-ing around 30% of total teacher assignments [not teachers, but assignments] (with substantial overlap for ELA). Clearly, several of the teacher assignments in the mix for each school are completely un-VAM-able. States such as Tennessee have adopted the absurd strategy that these other staff should be evaluated on the basis of the scores for those who can be VAM-ed.
A couple of issues are important to consider here. First, these listings more than anything convey the complexity of what goes on in schools – the type of people who need to come together and work together collectively on behalf of the interests of kids. VAM-ing some subset of those teachers and putting their faces in the NY Post is unhelpful in many regards. Certainly there exist significant incentives for teachers to migrate to un-vammed assignments to the extent possible. And please don’t tell me that the answer to this dilemma is to VAM the Orchestra conductor or Art teacher. That’s just freakin’ stupid!
As Preston Green, Joseph Oluwole and I discuss in our forthcoming article in the BYU Education and Law Journal, coupling the complexities of staffing real schools and evaluating the diverse array of professionals that exist in those schools with VAM-based rating schemes necessarily means adopting differentiated contractual agreements, leading to numerous possible perverse incentives and illogical management decisions (as we’ve already seen in Tennessee as well as in the structure of the DC IMPACT contract).