Readers of my blog know I’m both a data geek and a skeptic of the usefulness of value-added data, specifically as a human resource management tool for schools and districts. There’s been much talk this week about the release of the New York City teacher ratings to the media, and the subsequent publication of those data by various news outlets. Most of the talk about the ratings has focused on the error rates, and reporters from each news outlet have spent a great deal of time hiding behind their supposed responsibility in being sure to inform the public that these ratings are not absolute, that they have significant error ranges, etc. Matt Di Carlo over at Shanker Blog has already provided a very solid explanatory piece on the error ranges and how those ranges affect classification of teachers as either good or bad.

But, the imprecision – as represented by error ranges – of each teacher’s effectiveness estimate is but one small piece of this puzzle. And in my view, the various other issues involved go much further in undermining the usefulness of the value added measures which have been presented by the media as necessarily accurate albeit lacking in precision.

Remember, what we are talking about here are statistical estimates generated on tests of two different areas of student content knowledge – math and English language arts. What is being estimated is the extent of change in score (for each student, from one year to the next) on these particular forms of these particular tests of this particular content, and only for this particular subset of teachers who work in these particular schools.

We know from other research (from Corcoran and Jennings, and from the first Gates MET report) that value added estimates might be quite different for teachers of the same subject area if a different test of that subject is used.

We know that summer learning may affect student annual value added, yet in this case, NYC is estimating teacher effectiveness on student outcomes from year to year. That is, the difference in a student’s score from one day in the spring of 2009 to another in the spring of 2010 is being attributed to a teacher who has contact with that child for a few hours a day from September to June (but not July and August).

The NYC value-added model does indeed include a number of factors which attempt to make fairer comparisons between teachers of similar grade levels, similar class sizes, etc. But we also know that those attempts work only so well.

Focusing on error rate alone presumes that we’ve got the model and the estimates right – that we are making valid assertions about the measures and their attribution to teaching effectiveness.

That is, that we really are estimating the teacher’s influence on a legitimate measure of student learning in the given content area.

Then error rates are thrown into the discussion (and onto the estimates) to provide the relevant statistical caveats about their precision.

That is, accepting that we are measuring the right thing and rightly attributing it to the teacher, there might be some noise – some error – in our estimates.

If the estimates lack validity, or are biased, the rate of noise, or error around the invalid or biased estimate is really a moot point.

In fact, as I’ve pointed out before on this blog, it is quite likely that value added estimates that retain bias by failing to fully control for outside influences are actually likely to be more stable over time (to the extent that the outside influences remain more stable over time). And that’s not a good thing.

So, to the news reporters out there, be careful about hiding behind the disclaimer that you’ve responsibly provided the error rates to the public. There’s a lot more to it than that.

**Playing with the Data**

So, now for a little playing with the data, which can be found here:

I personally wanted to check out a few things, starting with assessing the year to year stability of the ratings. So, let’s start with some year to year correlations achieved by merging the teacher data reports across years for teachers who stayed in the same school teaching the same subject area to the same grade level. Note that teacher IDs are removed from the data. But teachers can be matched within school, subject and grade level, by name over time (by concatenating the dbn [school code], teacher name, grade level and subject area [changing subject area and grade level naming to match between older and newer files]). First, here’s how the year to year correlations play out for teachers teaching the same grade, subject area and in the same school each year.
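The matching step above can be sketched in a few lines of pandas. This is a minimal illustration with made-up rows and hypothetical column names (the actual released files use different headers in different years, which is why subject and grade naming had to be reconciled first):

```python
import pandas as pd

# Toy stand-ins for two years of teacher data reports (hypothetical values).
old = pd.DataFrame({
    "dbn":     ["01M015", "01M015", "02M042"],
    "teacher": ["SMITH J", "JONES A", "LEE K"],
    "grade":   ["5", "6", "5"],
    "subject": ["MATH", "MATH", "ELA"],
    "va_0809": [0.12, -0.30, 0.05],
})
new = pd.DataFrame({
    "dbn":     ["01M015", "01M015", "02M042"],
    "teacher": ["SMITH J", "JONES A", "LEE K"],
    "grade":   ["5", "6", "5"],
    "subject": ["MATH", "MATH", "ELA"],
    "va_0910": [0.20, -0.10, 0.40],
})

# Since teacher IDs are stripped from the public files, build a match key by
# concatenating school code, teacher name, grade level, and subject area.
for df in (old, new):
    df["key"] = df["dbn"] + "|" + df["teacher"] + "|" + df["grade"] + "|" + df["subject"]

# An inner merge keeps only teachers in the same school, grade, and subject both years.
merged = old.merge(new, on="key")

# Year-to-year correlation of the value-added estimates for those stayers.
r = merged["va_0809"].corr(merged["va_0910"])
```

Note that name-based matching like this will miss teachers whose names are recorded differently across files, so the matched sample is a subset of all stayers.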

**Sifting through the Noise**

As with other value-added studies, the correlations across teachers’ ratings from one year to the next seem to range from about .10 to about .50. Note that between 2008-09 and 2009-10, math value-added estimates were relatively highly correlated compared to previous years (with little clear evidence as to why, aside from possible changes to the assessments, etc.). Year to year correlations for ELA are pretty darn low, especially prior to the most recent two years.

Visually, here’s what the relationship between the most recent two years of ELA VAM ratings looks like:

I’ve done a little color coding here for fun. Dots coded in orange are those that stayed in the “average” category from one year to the next. Dots in bright red are those that stayed “high” or “above average” from one year to the next and dots in pale blue were “low” or “below average” from one year to the next. But there are also significant numbers of dots that were above average or high in one year, and below average or low in the next. 9 to 15% (of those who were “good” or were “bad” in the previous year) move all the way from good to bad or bad to good. 20 to 35% who were “bad” stayed “bad” & 20 to 35% who were “good” stayed “good.” And this is between the two years that show the highest correlation for ELA.
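The category-jumping percentages above come from tabulating year-over-year rating transitions. Here is a small sketch of that tabulation with invented labels (the category names and data are illustrative, not the actual NYC files):

```python
import pandas as pd

# Hypothetical two-year rating labels for the same matched teachers.
y1 = ["high", "low", "average", "above avg", "low", "high", "average", "below avg"]
y2 = ["low", "high", "average", "high", "below avg", "above avg", "average", "average"]
df = pd.DataFrame({"y1": y1, "y2": y2})

good = {"above avg", "high"}   # "good" = above average or high
bad = {"low", "below avg"}     # "bad"  = below average or low

was_good = df["y1"].isin(good)
was_bad = df["y1"].isin(bad)

# Share of previously "good" teachers who flipped all the way to "bad",
# share of previously "bad" teachers who flipped to "good", and persistence.
good_to_bad = (was_good & df["y2"].isin(bad)).sum() / was_good.sum()
bad_to_good = (was_bad & df["y2"].isin(good)).sum() / was_bad.sum()
stayed_good = (was_good & df["y2"].isin(good)).sum() / was_good.sum()
```

A full transition matrix (e.g. `pd.crosstab(df["y1"], df["y2"], normalize="index")`) shows the same movement across all five categories at once.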

Here’s what the math estimates look like:

There’s actually a visually identifiable positive relationship here. Again, this is the relationship between the two most recent years, which by comparison to previous years, showed a higher correlation.

For math, only about 7% of teachers jump all the way from being bad to good or good to bad (of those who were “good” or “bad” the previous year), and about 30 to 50% who were good remain good, or who were bad, remain bad.

But, that still means that even in the more consistently estimated models, half or more of teachers move into or out of the good or bad categories from year to year, between the two years that show the highest correlation in recent years.

And this finding still ignores whether other factors may be at play in keeping teachers in certain categories. For example, whether teachers stay labeled as ‘good’ because they continue to work with better students or in better environments.

**Searching for Potential Sources of Bias**

My next fun little exercise in playing with the VA data involved merging the data by school dbn to my data set on NYC school characteristics. I limited my sample for now to teachers in schools serving all grade levels 4 to 8 and with complete data in my NYC schools data, which include a combination of measures from the NCES Common Core and NY State School Report Cards. I did a whole lot of fishing around to determine whether there were any particular characteristics of schools that appeared associated with individual teacher value added estimates, with the likelihood that a teacher ended up being rated “good” or “bad” by my aggregations used here, or both. I will present my preliminary findings with respect to those likelihoods here.

Here are a few logistic regression models of the odds that a teacher was rated “good” or rated “bad,” based on a) the multi-year value-added categorical rating for the teacher and b) school year 2009 characteristics of their school across grades 4 to 8.

After fishing through a plethora of measures on school characteristics (because I don’t have classroom characteristics for each teacher), I found with relative consistency that, using the math ratings, teachers in schools with higher math proficiency rates tended to get better value added estimates for math and were more likely to be rated “good.” This result was consistent across multiple attempts, models, and subsamples (note that I’ve only got 1,300 of the total math teachers rated here, but it’s still a pretty good and well distributed sample). Also, teachers in schools with larger average class size tended to have a lower likelihood of being classified as “above average” or “high” performers. These findings make some sense, in that peer group effects may be influencing teacher ratings, and class size effects (perhaps as spillover?) may not be fully captured in the model. The attendance rate factor is somewhat more perplexing.
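The general shape of this kind of check can be sketched with simulated data: regress a binary “rated good” indicator on school-level proficiency and class size via a plain logit fit. Everything below is invented for illustration (the predictors, true coefficients, and sample size are assumptions, not the post’s actual models), using a bare Newton-Raphson fit rather than a stats package:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical school-level predictors: math proficiency rate and mean class size.
prof = rng.uniform(0.2, 0.9, n)
csize = rng.uniform(18.0, 30.0, n)
X = np.column_stack([np.ones(n), prof, csize])

# Simulate a "rated good" outcome whose odds rise with proficiency
# and fall with class size (coefficients chosen arbitrarily for the demo).
true_b = np.array([2.0, 3.0, -0.15])
p = 1 / (1 + np.exp(-X @ true_b))
y = rng.binomial(1, p)

# Newton-Raphson (IRLS) fit of the logit model.
b = np.zeros(3)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ b))          # fitted probabilities
    grad = X.T @ (y - mu)                   # score vector
    W = mu * (1 - mu)                       # IRLS weights
    hess = (X * W[:, None]).T @ X           # observed information
    b = b + np.linalg.solve(hess, grad)

# exp(coef) is the multiplicative change in the odds of a "good" rating
# per one-unit change in the predictor.
odds_ratios = np.exp(b)
```

If a school-level variable that the VAM is supposed to have adjusted away still carries a significant coefficient in a regression like this, that is the kind of pattern suggestive of bias.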

Again, these models were run with the multi-year value added classification.

Next, I checked to see if there were differences in the likelihood of getting back to back good or back to back bad ratings by school characteristics. Here are the models:

As it turns out, the likelihood of achieving back to back good or back to back bad ratings is also influenced by school characteristics. Here, as class size increases by 1 student, the likelihood that a teacher in that school gets back to back bad ratings goes up by nearly 8%. The likelihood of getting back to back good ratings declines by 6%. The likelihood of getting back to back good ratings increases by nearly 8% in a school with 1% higher math proficiency rate in grades 4 to 8.
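Percentage statements like these come from exponentiating logit coefficients: a coefficient b implies a (exp(b) − 1) × 100 percent change in the odds per one-unit change in the predictor. A quick sketch with hypothetical coefficients (not the actual fitted values from these models):

```python
import math

# Hypothetical logit coefficients, chosen to land near the percentages in the text.
b_csize_bad = 0.075    # per additional student, odds of back-to-back "bad"
b_csize_good = -0.062  # per additional student, odds of back-to-back "good"

def pct_change_in_odds(b):
    """Convert a logit coefficient to a percent change in the odds."""
    return (math.exp(b) - 1) * 100

# A coefficient of 0.075 corresponds to roughly a 7.8% increase in the odds,
# and -0.062 to roughly a 6% decrease.
print(round(pct_change_in_odds(b_csize_bad), 1))
print(round(pct_change_in_odds(b_csize_good), 1))
```

For small coefficients the percent change is close to 100 × b itself, which is why coefficients and percentages are sometimes used loosely interchangeably.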

These are admittedly preliminary checks on the data, but these findings in my view do warrant further investigation into school level correlates with the math value added estimates and classifications in particular. These findings are certainly suggestive of possible estimate bias.

**Who Gets VAM-ED?**

Finally, while there’s been much talk about these ratings being released for such a seemingly large number of teachers – 18,000 – it’s important to put those numbers in context in order to evaluate their relevance. First of all, it’s 18,000 ratings, not teachers. Many teachers are rated for both math and ELA, bringing the total number of individuals down significantly from 18,000. In still generous terms, the 18,000 or so are more like “positions” within schools, but even then, an elementary classroom teacher covers both subject areas within the same assignment or position.

Based on the NY State Personnel Master File for 2009-10, there were about 150,000 certified staffing assignments in New York City in 2009-10 (linkable to individual schools, including those in the VA reports), where individual teachers may cover more than one assignment. In that light, 18,000 is not that big a share.

But let’s look at it at the school level using two sample schools. For these comparisons I picked two schools which had among the largest numbers of VA math estimates (with many of the same teachers in those schools having VA ELA estimates). The actual listing of teacher assignments is provided for two schools below, along with the number of teachers for whom there were Math VA estimates. Again, these are schools with among the highest reported number (and share) of teachers who were assigned math effectiveness ratings.

In each case, we are Math VAM-ing around 30% of total teacher assignments [not teachers, but assignments] (with substantial overlap for ELA). Clearly, several of the teacher assignments in the mix for each school are completely un-VAM-able. States such as Tennessee have adopted the absurd strategy that these other staff should be evaluated on the basis of the scores for those who can be VAM-ed.

A couple of issues are important to consider here. First, these listings more than anything convey the complexity of what goes on in schools – the types of people who need to come together and work collectively on behalf of the interests of kids. VAM-ing some subset of those teachers and putting their faces in the NY Post is unhelpful in many regards. Certainly there exist significant incentives for teachers to migrate to un-VAMed assignments to the extent possible. And please don’t tell me that the answer to this dilemma is to VAM the orchestra conductor or art teacher. That’s just freakin’ stupid!

As Preston Green, Joseph Oluwole and I discuss in our forthcoming article in the BYU Education and Law Journal, coupling the complexities of staffing real schools and evaluating the diverse array of professionals that exist in those schools with VAM-based rating schemes necessarily means adopting differentiated contractual agreements, leading to numerous possible perverse incentives and illogical management decisions (as we’ve already seen in Tennessee as well as in the structure of the DC IMPACT contract).

Bruce,

I will peek at this NYC data. But let me be emphatic: if this teacher quality data does not represent columns of college GPA, Praxis test, and SAT/ACT or GRE scores, then what is the use? Let’s stop kidding ourselves. Is there reliable teacher quality data that explains the variance in student performance in New York? I don’t think so. But I will read this excerpt when I get a chance – from my enormous and weighted commitments.

T Bynoe

Bruce-

First, thanks for this and everything else you’ve been doing to be a voice of sanity and deep insight on this and so many other educational issues.

This is what seems to be the takeaway quote from your piece:

But, the imprecision – as represented by error ranges – of each teacher’s effectiveness estimate is but one small piece of this puzzle. And in my view, the various other issues involved go much further in undermining the usefulness of the value added measures which have been presented by the media as necessarily accurate albeit lacking in precision.

Unfortunately, your point apparently is lost on reporters, policymakers, pundits, advocates, and even some researchers. It seems to me that what is needed is a complete deconstruction (autopsy) of one of the VAMs now in use for evaluation of teachers in some state. Unless and until a team of psychometric, econometric, and statistical experts reviews literally step-by-step every assumption and decision that goes into the calculation of value-added estimates, and provides some estimate of the impact they cumulatively have on attributions of effectiveness, I’m not sure your point will be understood.

Everything including the tests, their scaling, the statistical methods (appropriateness and plausibility of their assumptions), the quality of the covariate data, treatment of missing data, sensitivity and specification tests, and so on must be reviewed. If I have time I will try to at least outline a flow chart. (I hope someone more knowledgeable beats me to it.)

In my own work, I have noted that the forced range of teacher effectiveness (from low to high, or 15th to 85th percentile) covers roughly 4 single point multiple choice questions on one of our state Reading tests (out of 52 points). The difference between an ineffective “bad” teacher and an average one is roughly 2 questions. This range is at the low end of the tests for which I have data. But there are others that are close. So do we really know if the VAMs are measuring educationally significant differences? (I also offer this as a counterexample to those who have done so much to inflate their representations of “teacher effects.”)

There obviously is so much work to be done before we should consider using VAMs for high stakes evaluations, but unfortunately educational fads are hard to resist, especially when so much money propels them. Before anyone raises the straw man, I do believe that teacher evaluation instruments and their implementation must be changed. (Who is arguing against that?) In fact that is well under way.

Thanks again.

Harris

You should research Harrison 2 school district in Colorado. They have a Pay for Performance plan that is being used as a model for the pilot schools in NJ. The superintendent there is a Broad Academy guy doing consulting here for the NJDOE. I think the company is FOCAL POINT. It may give you an idea of what the NJDOE plan will look like.

Also, I saw that the Rutgers Graduate School of Education was going to evaluate the new teacher evaluation system. I hope you are part of that effort.

What is more distressing about the Colorado system, and what is being planned in NJ, is that they are actually based on a performance measure that does not even attempt to estimate the teacher’s effect on student growth, but rather uses a student growth metric with absolutely no covariates yet still assumes that the teacher is responsible for the growth. I have written about these Student Growth Percentiles in several previous posts. They are entirely inappropriate for this use. I have my issues with VAM, as I discuss in this post, but VAMs are intended for isolating teacher effects on student outcomes. In my view, they can’t accomplish that lofty goal, but at least they try. SGPs don’t even try. It’s absurd and irresponsible.

https://schoolfinance101.wordpress.com/2011/09/02/take-your-sgp-and-vamit-damn-it/

https://schoolfinance101.wordpress.com/2011/09/13/more-on-the-sgp-debate-a-reply/

Thank you,

I will check the references you have supplied!

One other point about the Harrison 2 model. There is a feature in the principal’s evaluation formula that penalizes variance between the 50% achievement scores (median SGP) and the “observation” or process half of the formula. So, if the principal sees a teacher who is doing well by whatever rubrics they have, but the achievement scores are INCONGRUENT, the principal is penalized. It works the opposite way as well, so if the process scores (principal observations) are low but the achievement scores are high, the principal takes a hit. So in order to minimize variance (INCONGRUENCE), the principals will cluster to the middle, which will make the 50% weighting a LIE.

Thanks again for the thorough analysis. I hope you are part of the Rutgers review!

My head’s spinning a bit what with your impressive data analysis – but I did want to point out that in fact the teacher ratings are based not on student contact in the Sept-June time frame (as you mention in par. 5), but actually Sept-April, as most of the state testing happens in April.

There’s a good piece in today’s Times pulling this apart, and touching on the fact that fourth grade teachers are more likely to teach to the test than fifth grade, because the fourth grade tests will impact middle school admissions, while the fifth grade tests don’t affect the students much. http://www.nytimes.com/2012/03/05/nyregion/in-brooklyn-hard-working-teachers-sabotaged-when-student-test-scores-slip.html

Great point! I did forget to account for the actual testing dates/timeline, which does make the models even more questionable.

Further, there is certainly the possibility that student/family pressure to do well on 4th grade tests varies by student/family background such that 5th grade teachers will be confronted with biases in their students’ starting points, and so on. Thanks!