The Average of Noise is not Signal, It’s Junk! More on NJ SGPs

Posted on February 3, 2014



I explained in my previous post that New Jersey’s school aggregate growth percentile measures are as correlated with things they shouldn’t be (average performance level and low income concentrations) as they are with themselves over time.  That is, while they seem relatively stable – correlation around .60 – it would appear that much of that correlation simply reflects the average composition and prior scores of the students in the schools.

In other words, New Jersey’s SGPs are stably biased – or consistently wrong!

But even the consistency of these measures is giving some school officials reason to pause and ask just how useful these measures are for evaluating their students’ progress or their school as a whole.

There are, for example, a good number of schools that appear to jump a significant number of percentile points from year 1 to year 2. Here is a scatterplot of the schools moving from over the 60th percentile to under the 40th percentile, and from under the 40th to over the 60th.

[Figure: Slide1 – scatterplot of year 1 vs. year 2 school growth percentiles, highlighting schools that crossed from above the 60th percentile to below the 40th, and vice versa]

That’s right, West Cape May Elementary… you rock… this year at least. Last year, well, you were less than mediocre. You are the new turnaround experts. Good thing we didn’t use last year’s SGP to shut you down! Either that, or these data have some real issues – in addition to the fact that much of the correlation that does exist is simply a reflection of persistent conditions of these schools.

So how is it, then, that even with such persistent bias caused by external factors, we can see schools move so far in the distribution?

I don’t have the raw data to test this particular assumption (nor will I likely ever see it), but I suspect these shifts result from a little-discussed but massive, persistent problem in all such SGP and VAM models.

I call it spinning variance where there’s little or none… and more specifically… creating a ruse of “meaningful variance” from a narrow band of noise.

What the heck do I mean by that?

Well, these SGP estimates, which range from 0 to 100, start with classrooms full of kids taking 50-item tests.

The raw number of items answered correctly on those tests is then stretched into scale scores with a mean of 200, using an S-shaped conversion. At the higher and lower ends of the distribution, one or two questions can shift scale scores by 20 or more points. Stretch 1!
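To make Stretch 1 concrete, here’s a minimal sketch of the kind of conversion I’m talking about. The curve and every constant in it are invented for illustration – this is not the state’s actual conversion table – but it shows how a conversion that is flat in the middle and steep at the ends makes one extra correct item near the ceiling worth far more scale-score points than one in the middle.

```python
import numpy as np

# Illustrative only (not the actual NJ ASK conversion table): a logit-style
# raw-to-scale transformation of a 50-item test, centered at a scale mean of 200.
# Conversions like this are nearly flat across the middle of the raw-score range
# and steep at the ends, so one extra item correct near the floor or ceiling moves
# the scale score far more than one extra item correct in the middle.
def raw_to_scale(raw_correct, n_items=50, scale_mean=200, spread=25):
    # Nudge proportion correct away from 0 and 1 so the logit stays finite.
    p = (raw_correct + 0.5) / (n_items + 1.0)
    return scale_mean + spread * np.log(p / (1 - p))

for raw in (24, 46, 49):
    gain = raw_to_scale(raw + 1) - raw_to_scale(raw)
    print(f"{raw} -> {raw + 1} items correct: scale score moves {gain:.1f} points")
```

With these made-up constants, one extra item in the middle of the range moves the scale score by about 2 points, while the last item at the top moves it by well over 20.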

While individual kids’ scores might spread out quite widely, differences in classroom averages or schoolwide averages vary much less.

Differences in “growth” (really not growth… but rather estimated differences in year-over-year test scores) vary even less… often trivially… and quite noisily. But these relatively trivial differences must still be spread out into 0 to 100 percentile ranks! Stretch 2!
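Here’s a toy illustration of Stretch 2 – not the actual SGP machinery (the real models condition on students’ prior scores), and every number below is made up – just a demonstration of what forcing a narrow, noisy band of aggregate differences onto a 0–100 rank scale does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up numbers: 300 hypothetical schools whose true "growth" is identical.
# Each school's measured mean gain averages ~60 students' noisy gains, so the
# schools differ only by sampling noise worth a fraction of a test item.
n_schools, n_students = 300, 60
student_noise = rng.normal(0, 3, (n_schools, n_students))  # noise in raw-score points
school_means = student_noise.mean(axis=1)                  # true growth is identical (0)

# Stretch 2: spread these trivial differences across a 0-100 percentile scale anyway.
ranks = 100 * school_means.argsort().argsort() / (n_schools - 1)

print("range of school mean gains (raw-score points):",
      round(school_means.max() - school_means.min(), 2))
print("range of resulting percentile ranks:", ranks.min(), "to", ranks.max())
```

The school means differ by a couple of raw-score points at most, all of it noise – yet the percentile ranks dutifully run from 0 to 100.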

I suspect that the differences in actual additional items answered correctly by the median student in the 60th percentile school are trivial when compared with additional items answered correctly by the median student in the 40th percentile school.

But alas, we must rank, as reformy logic and punitive statistical illiteracy dictate, and fractions of individual multiple-choice test items will dictate that rank under these methods.

Much of this narrow band of variance is simply noise (after sorting out the bias)… and thus the rankings based on spreading out that noise are completely freakin’ meaningless. These problems are equally bad, if not worse [due to smaller sample sizes], when used for rating teachers.

Now for a few more figures. Above, I show how even when we retain the bias in the SGPs, we’ve got schools that jump 20 percentile points in one direction or the other over a single year! Do not cry, Belleville PS10. Alas, it is not your fault.

Here’s where these same schools lie, in year 1, with respect to poverty.

[Figure: Slide2 – year 1 growth percentiles plotted against school poverty, with the big movers highlighted]

And here’s where they lie in year 2 with respect to poverty.

[Figure: Slide3 – year 2 growth percentiles plotted against school poverty, same schools highlighted]

Of course, they have switched positions, merely by the way I’ve defined them. West Cape May is now awesome… and still low poverty, while the previous year it was low poverty but stunk! We’ve got some big movers at the other end too. Newark Educators Charter also made the big leap to awesomeness, while U. Heights falls from grace.

Now, the reformy, statistically illiterate response to this instability is to take the mean of year 1 and year 2 and call it stable… and more representative… and, by doing so, pull this group to the middle.

Let me put this really bluntly… the average of noise is not signal.

[Figure: Slide4 – the patterned scatter of growth percentiles, with the big movers at its edges]

The position of these schools at the edges of the patterned scatter is the noise.

Now averaging the patterns does give us stronger signal – signal that the growth percentiles are painfully, offensively biased with respect to poverty (the “persistent effect” is one of “persistent poverty”). But averaging the outer bands of this distribution to pull them to the center does not by any stretch make the ratings for these schools more meaningful.
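Here’s a quick simulation of that point. Everything in it is invented – a persistent poverty-linked component, fresh year-specific noise, and exactly zero true school effect – but it shows what averaging two years actually buys you: a tighter grip on the persistent bias, and nothing else, because there is no true effect for the average to recover.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented numbers: each school's measured growth percentile is a stable,
# poverty-linked bias plus year-specific noise, with NO true school effect.
n = 400
poverty_bias = rng.normal(0, 10, n)          # persistent, poverty-linked component
true_effect = np.zeros(n)                    # assume no real school effect at all
year1 = true_effect + poverty_bias + rng.normal(0, 10, n)
year2 = true_effect + poverty_bias + rng.normal(0, 10, n)
two_year_avg = (year1 + year2) / 2

print("year-to-year correlation:", round(np.corrcoef(year1, year2)[0, 1], 2))
print("correlation of 2-year average with the bias:",
      round(np.corrcoef(two_year_avg, poverty_bias)[0, 1], 2))
# The averaged measure is more stable and more tightly tied to the persistent
# bias - not more meaningful about anything the school actually did.
```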

As a fun additional exercise, I’ve used schoolwide proficiency rates and low-income concentrations to generate predicted values of year 1 and year 2 growth percentiles, and then taken the differences from those predicted values (in standard deviations) and used them as “adjusted” growth percentiles. That is, how much higher or lower is a school’s growth percentile than predicted, given only these two external factors?
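For the curious, here is roughly what that adjustment looks like in code – a sketch with hypothetical variable names, not my actual script: regress each year’s school growth percentile on schoolwide proficiency and percent free lunch, then treat the standardized residuals as the “adjusted” growth percentiles.

```python
import numpy as np

# Sketch of the adjustment described above, with hypothetical inputs.
# sgp, proficiency, and pct_free_lunch are assumed to be equal-length
# float arrays, one entry per school.
def adjusted_sgp(sgp, proficiency, pct_free_lunch):
    # Predict the growth percentile from the two external factors alone.
    X = np.column_stack([np.ones_like(sgp), proficiency, pct_free_lunch])
    beta, *_ = np.linalg.lstsq(X, sgp, rcond=None)
    residuals = sgp - X @ beta
    # Express how far above or below its predicted value each school sits,
    # in standard deviations of the residuals.
    return residuals / residuals.std()

# Hypothetical usage:
# adj_y1 = adjusted_sgp(sgp_year1, proficiency, pct_free_lunch)
# adj_y2 = adjusted_sgp(sgp_year2, proficiency, pct_free_lunch)
# movers = ((adj_y1 > 1) & (adj_y2 < -1)) | ((adj_y1 < -1) & (adj_y2 > 1))
```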

In this graph, I identify those schools that jumped from over 1 full standard deviation above their expected level to over 1 full standard deviation below it, and vice versa. I’ve used only schools that had both 4th- and 7th-grade proficiency data and free lunch data, which reduces my sample, and I’ve also used a pretty wide range for identifying performance changes. So I actually have fewer “outlier” schools.

[Figure: Slide5 – “adjusted” (residual) growth in year 1 vs. year 2, highlighting schools that moved from more than 1 standard deviation above their expected level to more than 1 below, and vice versa]

The fun part here is that these aren’t even the same schools that were identified as the big jumpers before correcting for average performance level and % free lunch. No overlap at all.

So, just how useful are the growth percentile data for even making reasonable judgments about schools’ influence on their students’ outcomes?

Well, as noted in my prior post, the persistent bias is so overwhelming as to call into serious question whether they have any valid use at all.

And the noise surrounding that bias appears to effectively undermine any remaining usefulness one might try to pry from these measures.

One more time on the video explanation of this stuff!
