The Average of Noise is not Signal, It’s Junk! More on NJ SGPs


I explained in my previous post that New Jersey’s school aggregate growth percentile measures are as correlated with things they shouldn’t be (average performance level and low income concentrations) as they are with themselves over time.  That is, while they seem relatively stable – correlation around .60 – it would appear that much of that correlation simply reflects the average composition and prior scores of the students in the schools.

In other words, New Jersey’s SGPs are stably biased – or consistently wrong!

But even the apparent consistency of these measures is giving some school officials reason to pause and ask just how useful they are for evaluating their students’ progress or their schools as a whole.

There are, for example, a good number of schools that appear to jump a significant number of percentile points from year 1 to year 2. Here is a scatterplot of the schools moving from over the 60th percentile to under the 40th percentile, and from under the 40th to over the 60th.

[Figure: Slide1 – scatterplot of year 1 vs. year 2 school growth percentiles, highlighting the big movers]

That’s right, West Cape May elementary… you rock… this year at least. Last year, well, you were less than mediocre. You are the new turnaround experts. Good thing we didn’t use last year’s SGP to shut you down! Either that, or these data have some real issues – in addition to the fact that much of the correlation that does exist is simply a reflection of persistent conditions of these schools.

So how is it, then, that even with such persistent bias from external factors, we can see schools move so far in the distribution?

I don’t have the raw data to test this particular assumption (nor will I likely ever see it), but I suspect these shifts result from a little-discussed but massive and persistent problem in all such SGP and VAM models.

I call it spinning variance where there’s little or none… and more specifically… creating a ruse of “meaningful variance” from a narrow band of noise.

What the heck do I mean by that?

Well, these SGP estimates, which range from the 0th to the 100th percentile, start with classrooms full of kids taking 50-item tests.

The raw number correct on those tests is then stretched into scale scores with a mean of 200, using an S-shaped conversion. At the higher and lower ends of the distribution, one or two questions can shift a scale score by 20 or more points. Stretch 1!
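To make Stretch 1 concrete, here’s a minimal sketch in Python. It assumes a generic logistic (S-shaped) raw-to-scale conversion – not New Jersey’s actual scaling tables – and the constants (50 items, a scale centered at 200, a multiplier of 25) are purely illustrative.

```python
import numpy as np

# Illustrative only: a generic logistic raw-to-scale conversion,
# NOT the state's actual scaling. 50-item test, scale centered at 200.
n_items = 50
raw = np.arange(0, n_items + 1)            # raw number correct, 0..50
p = (raw + 0.5) / (n_items + 1)            # proportion correct, kept off 0 and 1
scale = 200 + 25 * np.log(p / (1 - p))     # S-shaped (logit) stretch

step = np.diff(scale)                      # scale points gained per additional item
print(f"one more item near the middle:  {step[n_items // 2]:.1f} scale points")
print(f"one more item near the bottom:  {step[0]:.1f} scale points")
print(f"one more item near the top:     {step[-1]:.1f} scale points")
# Near the middle of the distribution an extra item moves the scale score a
# couple of points; near either tail the same single item moves it 20+ points.
```

Under this toy conversion, one additional item is worth about 2 scale points in the middle of the distribution and nearly 30 at the extremes – exactly the kind of stretching described above.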

While individual kids’ scores might spread out quite widely, differences in classroom averages or schoolwide averages vary much less.

Differences in “growth” (really not growth… but rather estimated differences in year-over-year test scores) vary even less… often trivially… and quite noisily. But these relatively trivial differences must still be spread out into 0 to 100 percentile ranks! Stretch 2!

I suspect that the difference between the additional items answered correctly by the median student in the 60th percentile school and the additional items answered correctly by the median student in the 40th percentile school is trivial.
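Here’s a small simulation of Stretch 2 with entirely made-up numbers – 500 schools whose average gains all hover around two test items per student, differing only by noise worth a fraction of an item. It isn’t New Jersey data, but it shows how ranking nearly identical averages into percentiles manufactures a full 0 to 100 spread out of next to nothing.

```python
import numpy as np

# Made-up numbers, for illustration only: 500 schools whose true average
# gains are essentially identical, give or take a fraction of a test item.
rng = np.random.default_rng(0)
n_schools = 500
mean_gain_items = 2.0 + rng.normal(0, 0.3, n_schools)   # avg items gained per student

# Rank those trivially different averages into 0-100 "growth percentiles"
pctile = 100 * mean_gain_items.argsort().argsort() / (n_schools - 1)

gap = np.percentile(mean_gain_items, 60) - np.percentile(mean_gain_items, 40)
print(f"range of mean gains:        {mean_gain_items.min():.2f} to {mean_gain_items.max():.2f} items")
print(f"range of percentile ranks:  {pctile.min():.0f} to {pctile.max():.0f}")
print(f"60th vs 40th percentile school differs by {gap:.2f} items per student")
# A fraction of one multiple-choice item separates an "above average growth"
# school from a "below average growth" school once the ranking stretches
# that narrow band out to 0-100.
```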

But alas, we must rank, as reformy logic and punitive statistical illiteracy dictate, and fractions of individual multiple-choice test items will dictate that rank under these methods.

Much of this narrow band of variance is simply noise (after sorting out the bias)… and thus the rankings based on spreading out that noise are completely freakin’ meaningless. These problems are equally bad if not worse [due to smaller sample sizes] when used for rating teachers.

Now for a few more figures. Above I show how, even when we retain the bias in the SGPs, we’ve got schools that jump 20 percentile points in one direction or the other over a single year! Do not cry, Belleville PS10. Alas, it is not your fault.

Here’s where these same schools lie, in year 1, with respect to poverty.

[Figure: Slide2 – year 1 school growth percentiles plotted against poverty]

And here’s where they lie in year 2 with respect to poverty.

[Figure: Slide3 – year 2 school growth percentiles plotted against poverty]

Of course, they have switched positions, merely by the way I’ve defined them. West Cape May is now awesome… and still low poverty, while the previous year they were low poverty but stunk! We’ve got some big movers at the other end too. Newark Educators Charter also made the big leap to awesomeness, while U. Heights falls from grace.

Now, the reformy, statistically illiterate response to this instability is to take the mean of year 1 and year 2 and call it stable… and more representative… and by doing so pull this group to the middle.

Let me put this really bluntly… the average of noise is not signal.

[Figure: Slide4 – the patterned scatter of growth percentiles against poverty, with these schools at its edges]

The position of these schools at the edges of the patterned scatter is the noise.

Now averaging the patterns does give us stronger signal – signal that the growth percentiles are painfully, offensively biased with respect to poverty (the “persistent effect” is one of “persistent poverty”). But averaging the outer bands of this distribution to pull them to the center does not by any stretch make the ratings for these schools more meaningful.
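A quick simulation of that point, with numbers I’ve made up to loosely mimic the correlations discussed here: each year’s school SGP is modeled as a poverty-driven bias plus year-specific noise, with no school effect in the data at all. Averaging the two years tightens the estimate around the bias – not around anything a school actually did.

```python
import numpy as np

# Made-up numbers: a poverty-driven bias plus noise, and NO school effect.
rng = np.random.default_rng(1)
n = 500
poverty = rng.uniform(0, 1, n)            # school % free lunch, 0-1
bias = 65 - 30 * poverty                  # biased "expected" SGP by poverty
sgp_y1 = bias + rng.normal(0, 7, n)       # year 1 = bias + noise
sgp_y2 = bias + rng.normal(0, 7, n)       # year 2 = bias + fresh noise
sgp_avg = (sgp_y1 + sgp_y2) / 2           # the "more stable" two-year mean

print("corr(year 1, year 2):         ", round(np.corrcoef(sgp_y1, sgp_y2)[0, 1], 2))
print("corr(year 1, poverty):        ", round(np.corrcoef(sgp_y1, poverty)[0, 1], 2))
print("corr(2-year average, poverty):", round(np.corrcoef(sgp_avg, poverty)[0, 1], 2))
# The average is indeed more stable and MORE correlated with poverty --
# averaging sharpened the bias, because bias and noise were all there was.
```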

As a fun additional exercise, I’ve used schoolwide proficiency rates and low-income concentrations to generate predicted values of year 1 and year 2 growth percentiles, and then taken the differences from those predicted values (in standard deviations) and used them as “adjusted” growth percentiles. That is, how much higher or lower is a school’s growth percentile than predicted, given only these two external factors?

In this graph, I identify those schools that jumped from more than one full standard deviation above their expected level to more than one full standard deviation below it, and vice versa. I’ve used schools that had both 4th and 7th grade proficiency data and free lunch data, which reduces my sample, and I’ve also used a pretty wide range for identifying performance changes. So I actually have fewer “outlier” schools.
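For anyone who wants to run this kind of adjustment on their own file, here’s a rough sketch. The file and column names (nj_school_sgp.csv; school, sgp_y1, sgp_y2, prof_rate, pct_free_lunch) are hypothetical placeholders, and standardizing the residuals by their own standard deviation is just one reasonable reading of “differences from predicted values in standard deviations.”

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per school with growth percentiles for two
# years, schoolwide proficiency rate, and % free lunch.
df = pd.read_csv("nj_school_sgp.csv")

def adjusted_sgp(sgp, X):
    """OLS residual of SGP on external factors, expressed in standard deviations."""
    X1 = np.column_stack([np.ones(len(X)), X])        # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, sgp, rcond=None)   # fit predicted SGP
    resid = sgp - X1 @ beta                           # actual minus predicted
    return resid / resid.std(ddof=1)                  # standardize the residual

X = df[["prof_rate", "pct_free_lunch"]].to_numpy()
df["adj_y1"] = adjusted_sgp(df["sgp_y1"].to_numpy(), X)
df["adj_y2"] = adjusted_sgp(df["sgp_y2"].to_numpy(), X)

# Schools swinging from more than 1 SD above expectation to more than 1 SD below, or back
jumpers = df[((df.adj_y1 > 1) & (df.adj_y2 < -1)) | ((df.adj_y1 < -1) & (df.adj_y2 > 1))]
print(jumpers[["school", "adj_y1", "adj_y2"]])
```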

[Figure: Slide5 – adjusted year 1 vs. year 2 growth percentiles, with schools crossing ±1 standard deviation identified]

The fun part here is that these aren’t even the same schools that were identified as the big jumpers before correcting for average performance level and % free lunch. No overlap at all.

So, just how useful are the growth percentile data for even making reasonable judgments about schools’ influence on their students’ outcomes?

Well, as noted in my prior post, the persistent bias is so overwhelming as to call into serious question whether they have any valid use at all.

And the noise surrounding that bias appears to effectively undermine any remaining usefulness one might try to pry from these measures.

One more time on the video explanation of this stuff!


4 Comments

  1. I consistently appreciate your analysis. Thank you for your work.

    Your explanation of the “stretching” effect of percentile ranks is particularly useful. Rankings cause tiny differences (or plain luck) to appear significant when they are not. I wonder if a graphically-minded reader could help render this idea in a visual way that would “stick” with more people?

    You consistently serve up examples of inappropriate uses of test score data, often with a salty side helping of disdain. I read your stuff pretty frequently, and I don’t actually think you are against tests so much as for caution in the interpretation of scores and restraint in their application to individuals. Am I right?

    Your strongest recurring policy message seems to be: “It’s poverty, stupid. Don’t get distracted.” Effects of poverty have such a powerful connection to scores that they tend to overwhelm anything that schools actually do. When deviations from this pattern appear, they need to be questioned for validity before they are scrutinized for meaning.

    It would be useful to hear more of your thoughts about distributional tails. In testimony for the Vergara trial, Raj Chetty argues that teachers that are consistently ineffective (measured using VAM) create long-term harm. Further, he argues that student test score data, where available, should play a strong role in teacher dismissal cases. He seems to express the opinion that teacher job protections (tenure) should be delayed in part because single-year scores are insufficient to inform indelible employment decisions. It would be interesting to hear your take on these questions. Our current system makes virtually no use at all of data in dismissals, and that just seems loopy. You say that your views have evolved a lot. At this point in your thinking, what seems to you to be the right role of data in making important decisions like these?

    1. You are right that I’m not opposed to reasonable use of testing data, or for that matter, even more reasonable use of modeled estimates of growth (though I’m increasingly pessimistic about the usefulness/clarity of the signal). I write about that issue at the end of this article: http://epaa.asu.edu/ojs/article/view/1298/1043

      ================
      This is not to suggest that any and all forms of student assessment data should be considered moot in thoughtful decision-making by school leaders and leadership teams. Rather, that incorrect, inappropriate use of this information is simply wrong – ethically and legally (a lower standard) wrong. We accept the proposition that tests of student knowledge and skills can provide useful insights both regarding what students know and potentially regarding what they have learned while attending a particular school or class. We are increasingly skeptical regarding the ability of value-added statistical models to parse any specific teacher’s effect on those outcomes. Furthermore, the relative weight in management decision-making placed on any one measure depends on the quality of that measure and likely fluctuates over time and across settings. That is, in some cases, with some teachers and in some years, test data may provide leaders and/or peers with more useful insights. In other cases, it may be quite obvious to informed professionals that the signal provided by the data is simply wrong – not a valid representation of the teacher’s effectiveness.

      Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested skills.

      School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher even if tenured should be dismissed on that basis. It may also be the case that the evidence would actually include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – indicated by their presentation of the evidence – would show that they have reasonably used information to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control.

  2. Another great piece, Bruce. It’s above their heads. It really is. The key challenge is how to educate the educrats – both the barbarians now in charge and the real educators who (in NYC at least) seem to be on the comeback, but who also lack the statistical literacy to follow most of this – on the limits of data, on the limits of complex change data, and on the limits of the quality of gatherable information (particularly when you are getting into complex measures of learning). But tracking and overinterpreting noise is an ancient problem in educational decision making – it’s just gotten severely worse with the accountability bandwagon and the obsession with high-stakes measures. I remember meetings in DOE (in two different states) in which superintendents touted differences in student attendance of 1-2 percentage points as if this really meant anything. Hours spent analyzing how school X had an attendance of 87% and school B had one of 85%; entire sessions spent trying to extract meaning from nothing – anything to guide “best practice”. The really damaging stuff of the recent decade is when you combine this kind of statistical illiteracy with the overblown confidence of the wonder boys/girls, with the PR campaign stuff of the Eva and Geoffrey kind, and with an almost total absence of educational sensibility – the kind that only comes from being in the front ranks, in classrooms and in schools, for at least 4-5 years; it breeds a certain kind of humility that they don’t teach in the Ivy League and certainly can’t teach in the 6-week TFA training sessions…
