When VAMs Fail: Evaluating Ohio’s School Performance Measures


Any reader of my blog already knows that I’m a skeptic of the usefulness of value-added models for guiding high stakes personnel decisions in schools. As I’ve explained on previous occasions, while statistical models of large numbers of data points (lots of teachers or lots of schools) might provide us with some useful information on the extent of variation in student outcomes across schools or teachers, and might reveal some useful patterns, it’s generally not a useful exercise to try to say anything about any one single point within the data set. Yes, teacher “effectiveness” estimates are based on the many scores of the students taught by that teacher, but they are still highly unstable. Unstable to the point where, even as a researcher hoping to find value in this information, I’ve become skeptical.

However, I had still been holding out hope that school level aggregate information on student growth (value-added estimates) might be more useful, mainly because it represents a higher level of aggregation. That is, each school is indeed a single point in a school level analysis, but that point represents an aggregation of student scores, and more student scores than would be attributed to any one teacher in that school. Generally, school level value-added measures are somewhat more reliable BECAUSE of this aggregation.

I’m in the process of compiling data for a project that includes Ohio public schools. Ohio makes available school level value added ratings as well as traditional school performance level ratings. For that, I am grateful. Ohio also makes school site financial data available. Thanks again, Ohio!

At the outset of any project, I like to explore the properties of the various measures provided by the state. For example, to what extent are current accountability measures a) related to the same measures in previous years, and b) related to factors such as student population characteristics?
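
For readers who want to run these checks themselves, here is a minimal sketch of both diagnostics in Python. It assumes a hypothetical long-format file with one row per school per year; the file name and column names (school_id, year, perf_index, pct_free_lunch) are placeholders, not Ohio’s actual field names.

```python
import pandas as pd

# Hypothetical long-format file: one row per school per year, with placeholder
# column names (school_id, year, perf_index, pct_free_lunch).
df = pd.read_csv("ohio_school_measures.csv")

# (a) Year-to-year stability: correlate each school's measure with its own
#     value from the prior year (assumes 'year' is stored as an integer).
wide = df.pivot(index="school_id", columns="year", values="perf_index")
print("year-to-year correlation:", wide[2009].corr(wide[2010]))

# (b) Sensitivity to student population characteristics: correlate the
#     current-year measure with % free lunch.
current = df[df["year"] == 2010]
print("correlation with % free lunch:",
      current["perf_index"].corr(current["pct_free_lunch"]))
```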

Matt Di Carlo over at http://www.shankerblog.org (see: http://shankerblog.org/?p=3870) has already addressed many/most of these issues with regard to the Ohio data. But, I figured I’d just reiterate these points with a few additional figures, focusing especially on the school level value added ratings.

As Matt Di Carlo has already explained, Ohio’s performance index, which is based on percent passing data, is highly sensitive to concentrations of low income students.

Ohio performance index and % free lunch:

Nothing out of the ordinary here (except perhaps the large number of 0 values, which I didn’t bother to exclude, and which really compromise my r-squared… will fix if I get a chance). On this type of measure, this is pretty much expected and common across state systems. This is precisely why many state accountability system measures systematically penalize higher poverty schools and districts: they depend on performance level comparisons, and performance levels are highly sensitive to student/family backgrounds.
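
As a rough illustration of that zero-value issue (placeholder file and column names again, not Ohio’s actual ones), dropping the zero-index schools before computing the correlation is all the “fix” would take:

```python
import pandas as pd

df = pd.read_csv("ohio_school_measures.csv")  # same hypothetical file as above
cur = df[df["year"] == 2010]
nonzero = cur[cur["perf_index"] > 0]  # drop schools reported with an index of 0

print("r-squared, all schools:   ",
      cur["perf_index"].corr(cur["pct_free_lunch"]) ** 2)
print("r-squared, zeros excluded:",
      nonzero["perf_index"].corr(nonzero["pct_free_lunch"]) ** 2)
```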

As a result, these heavily poverty-biased measures are also pretty stable over time. Here’s the year-to-year correlation of the performance index.

I’ve pointed out previously that one good way to get more stable performance measures over time, whether for schools, districts or teachers, is to leave the bias in there. That is, keeping the measure heavily biased by student population characteristics keeps the measure more stable over time, so long as the student populations across schools and districts remain stable. More reliable, yes. More useful, absolutely NOT.

It’s pretty much the case that the performance index received by a school this year will be in line with the index received the previous year.

Therein lies part of the argument for moving toward gain or value-added ratings. Note however that an exclusive emphasis on value-added without consideration for performance level means that we can ignore persistent achievement gaps between groups and the overall level of performance of lower performing groups.  That’s at least a bit problematic from a policy perspective! But I’ll set that aside for now.

Let’s take a look at what we can and can’t resolve in Ohio school ratings by moving toward their value-added model (technical documentation here: http://www.ode.state.oh.us/GD/Templates/Pages/ODE/ODEDetail.aspx?Page=3&TopicRelationID=117&Content=113068).

As I noted above, I’d love to believe that the school level value-added estimates would provide at least some useful information to either policymakers or school officials. But, I’m now pretty damn skeptical, and here’s more evidence regarding why. Here is the relationship between 2008-09 and 2009-10 school value added ratings using the overall “value added index.”

Note that any school in the lower right quadrant is a school that had positive growth in 2009 but negative growth in 2010. Any school in the upper left had negative growth in 2009 and positive growth in 2010. It’s pretty much a random scatter. There is little relationship at all between what a school received in 2009 and in 2010 (or in 2008 or earlier, for that matter).
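
One quick way to see how random the scatter is, sketched here with placeholder file and column names, is to count the schools whose value-added index flips sign from one year to the next:

```python
import pandas as pd

# Hypothetical file with columns school_id, year, va_index (placeholders).
df = pd.read_csv("ohio_value_added.csv")
wide = df.pivot(index="school_id", columns="year", values="va_index").dropna()

flipped = (wide[2009] * wide[2010]) < 0  # positive one year, negative the next
print("year-to-year correlation:      ", round(wide[2009].corr(wide[2010]), 3))
print("share of schools changing sign:", round(flipped.mean(), 3))
```

If the ratings were pure noise centered at zero, roughly half the schools would land on the opposite side of zero each year.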

So, imagine you are a school principal and year after year your little dot in this scatter plot shows up in a completely different place – odds are quite in favor of that! What are you to do with this information? Imagine trying to attach state accountability to these measures? I’ve long expressed concern about attaching any immediate policy actions to this type of measure. But in this case, I’m even concerned as to whether I have any reasonable research use for these measures. They are pretty much noise.

Here’s a little fishing into the rather small predictable shares of variation in those measures:

As it turns out, the prior year index is a stronger (though still weak) predictor of the current year index. But it’s also the case that schools that had higher overall performance levels in the prior year tended to have lower value added the following year, and schools with higher % free lunch and higher % special education populations also had lower value added (among those starting at the same performance index level). That is, some of the predictable variation here is bias, indicative of model-related (if not test-related) ceiling effects as well as demographic bias. That’s really unhelpful, and likely overlooked by most playing around with these data.
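
Roughly speaking, that fishing expedition amounts to a regression along these lines. This is only a sketch of the general approach: the column names are placeholders, and a simple OLS on a merged wide file stands in for whatever specification one would actually settle on.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical merged wide file, one row per school, placeholder column names.
df = pd.read_csv("ohio_merged.csv")

model = smf.ols(
    "va_index_2010 ~ va_index_2009 + perf_index_2009 + pct_free_lunch + pct_special_ed",
    data=df,
).fit()

# The pattern described above would show up as negative coefficients on
# perf_index_2009, pct_free_lunch and pct_special_ed.
print(model.summary())
```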

I get a little further if I use the math gains (the reading gains are particularly noisy).

These are ever so slightly more predictable than the aggregate index, but not by a whole lot. And they too are a predictable function of stuff they shouldn’t be:

Again, schools that started with a higher performance index have lower gains, and schools with higher free lunch and special education populations have lower gains… and yes, these biases cut in opposite directions. But that doesn’t provide any comfort that they are counterbalancing in any way that makes these data at all useful.
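
The math-gain version of that check is the same sketch with the outcome swapped in (again, placeholder column names, not Ohio’s actual ones):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("ohio_merged.csv")  # same hypothetical merged file as above

math_model = smf.ols(
    "math_gain_2010 ~ math_gain_2009 + perf_index_2009 + pct_free_lunch + pct_special_ed",
    data=df,
).fit()
print(math_model.params)
```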

If anything, the properties of the Ohio value-added data are particularly disheartening.  There’s little if anything there to begin with and what appears to be there might be compromised by underlying biases.

Further, even if the estimates were both more reliable and potentially less biased, I’m not quite sure how local district administrators would derive meaning from them, meaning that would lead to actions that could be taken to improve, or turn around, their school in future years.

At this point, and given these data, the best way to achieve a statistical turnaround is probably to simply do nothing and sit and wait until the next year of data. Odds are pretty good your little dot (school) on the chart will end up in a completely different location the next time around!

 

2 thoughts on “When VAMs Fail: Evaluating Ohio’s School Performance Measures”

  1. You often critique single year VAM estimates because they are too volatile. What about using a multi-year measure for schools? The larger sample would reduce volatility.

    1. In earlier posts, I discuss the relatively high classification errors that persist even in 3-year average VAMs. Further, I discuss in other posts how, many times, the multi-year VAMs are really only generating averages of the persistent model bias, which we then wrongly interpret as a true measure of persistent teacher effectiveness. In this case, given the very low year-to-year correlations of the OH school level measures, it’s pretty much a non-issue. It would be an average of noise over time. Yeah, you’d get an average. But since it’s an average of such noisy data, I’d be hard pressed to argue that the average is meaningful in any way. Further, some of the predictable variation is actually bias. Taking averages of crappy numbers, like taking higher-level aggregations, can arguably make them more stable, but may not make them any more meaningful.
