Ed Next’s triple-normative leap! Does the “Global Report Card” tell us anything?


Imagine trying to determine international rankings for tennis players or soccer teams entirely by a) determining how they rank relative to the average team or player in their country, then b) having only the average team or player from each country play each other in a tournament, then c) estimating how the top teams would rank when compared with each other based only on how their country’s average teams did when they played each other and how much better we think the individual teams or players are when compared to the average team or player in their country? Probably not that precise or even accurate, ya’ think?

Jay Greene and Josh McGee have produced a nifty new report and search tool that allows the average American Joe and Jane to see how their child’s local public school districts would stack up if one were to magically transport their district to Singapore or Finland.

 http://globalreportcard.org/

Even better, this nifty tool can be used by local newspapers to spread outrage throughout suburban communities everywhere across this mediocre land of ours.

To accomplish this mystical transportation, Greene and McGee rely on wizardry not often employed in credible empirical analysis: The Triple Normative Leap. Technically, it’s two leaps, across three norms. That is, the researcher-acrobat jumps from one normalized measure based on one underlying test, to another, and then to yet another (okay, actually to 50 others!). This is impressive, since the double-normative leap is tricky enough and has often resulted in severe injury.

To their credit, the authors provide pretty clear explanations of the triple-normative leap and how it is used to compare the performance of schools in Scarsdale, NY to kids in Finland without ever making those kids sit down and take an assessment that is comparable in any regard.

For example, the average student in Scarsdale School District in Westchester County, New York scored nearly one standard deviation above the mean for New York on the state’s math exam. The average student in New York scored six hundredths of a standard deviation above the national average of the NAEP exam given in the same year, and the average student in the United States scored about as far in the negative direction (-.055) from the international average on PISA. Our final index score for Scarsdale in 2007 is equal to the sum of the district, state, and national estimates (1+.06+ -.055 = 1.055). Since the final index score is expired in standard deviation units, it can easily be converted to a percentile for easy interpretation. In our example, Scarsdale would rank at the seventy seventh percentile internationally in math.

Note: Addition and spelling errors in Jay Greene’s original web-based materials: http://globalreportcard.org/about.html
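The arithmetic in the quoted passage is easy to check. Here is a minimal sketch in Python (standard library only) that sums the three quoted components and converts the result to a percentile under the report's own normality assumption. The function name `index_to_percentile` is made up for illustration and does not come from the report, and the district component is taken as exactly 1.0 because the passage gives only "nearly one."

```python
from statistics import NormalDist

# Components quoted in the passage above, in standard deviation units.
district_vs_state = 1.0    # Scarsdale vs. New York mean ("nearly one"; rounded here)
state_vs_nation = 0.06     # New York vs. U.S. mean on NAEP
nation_vs_world = -0.055   # U.S. vs. international mean on PISA


def index_to_percentile(index: float) -> float:
    """Percentile of an index score, ASSUMING a standard normal distribution."""
    return 100 * NormalDist().cdf(index)


index = district_vs_state + state_vs_nation + nation_vs_world
percentile = index_to_percentile(index)
print(f"index = {index:.3f}, percentile = {percentile:.1f}")
```

The components sum to 1.005, not the 1.055 printed in the quote, and under a standard normal curve either value lands at roughly the 84th to 85th percentile rather than the seventy-seventh. The passage does not give the unrounded district estimate, so the source of that gap cannot be pinned down from the quote alone.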

Now, Greene and McGee do recognize the potential limitations of making this leap across non-comparable assessments, with potentially non-comparable distributions. In their technical appendix, which few other than geeky stat guys like me will ever read, they explain:

In order to construct the Global Report Card we combine testing information at three separate levels of aggregation: state, national, and international. At each level we use the available testing information to estimate the distribution of student achievement. To allow for direct comparisons across state and national borders, and thus testing instruments, we map all testing data to the standard normal curve.

We must make two assumptions for our methodology to yield valid results. First, mapping to the standard normal requires us to make the assumption that the distribution of student achievement on each of the testing instruments is approximately normal at each level of aggregation (i.e. district, state, national). Second, to compare the distribution of student achievement across testing instruments we assume that standard deviation units are relatively similar across the 2 testing instruments and across time. In other words we assume that being a certain distance from mean student performance in Arkansas is similar to being the same distance from mean student performance in Massachusetts.

http://globalreportcard.org/docs/AboutTheIndex/Global-Report-Card-Technical-Appendix-8-30-11.pdf

So, they appropriately lay out the important assumptions: to actually rate individual districts in the U.S. against international standards, based on a) each district's position relative to other districts in its state, b) the state's position relative to the entire U.S., and then c) the entire U.S. relative to other countries, one must have a reasonable expectation that the distributions at each level are a) normal and b) have similar ranges. The range piece is key here because the spread of scores at any level dictates how many points a district can gain or lose when making each leap. Again, they appropriately lay out these potential concerns. And then, true-to-form, they ignore them entirely. They don’t even test whether these assumptions hold.
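Checking these assumptions would not take much. The sketch below shows one rough way to do it: standardize the district means within each state, then look at the skewness and at how far the extreme districts sit from the state mean. The two score lists are HYPOTHETICAL numbers invented for illustration, not real state data, and the helper names are mine.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def skewness(values):
    """Population skewness: 0 for a perfectly symmetric distribution."""
    return mean(z ** 3 for z in z_scores(values))


def summarize(name, district_means):
    z = z_scores(district_means)
    print(f"{name}: skew={skewness(district_means):+.2f}, "
          f"min z={min(z):+.2f}, max z={max(z):+.2f}")


# HYPOTHETICAL district mean scale scores, invented for illustration only.
state_a = [300, 310, 320, 330, 340, 350, 360, 370, 380]  # symmetric
state_b = [300, 302, 305, 308, 312, 318, 330, 355, 400]  # long right tail

summarize("State A", state_a)
summarize("State B", state_b)
```

Two things fall out of even this toy version. Standardizing is a linear rescaling, so it leaves the shape of a distribution untouched: a skewed set of scores is still skewed after it is expressed in standard deviation units. And the top district in the long-tailed state ends up with a much larger z-score than the top district in the symmetric state, which is exactly the kind of difference that determines how many points a district can pick up in a leap.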

The way I see it, if you’re going to point out a limitation and completely ignore it, you should at least point it out in the body of the report, not the appendix.

Setting aside that little concern for now, here’s how it all works. Walking backwards through their analysis, each US district starts with penalty points based on the U.S. mean on PISA compared to the international mean. That is, every district in the US is given a penalty (-.055) partly because of the legitimately low performance of large numbers of US students in states that have thrown their public education systems under the bus, including Arizona, Colorado… but more strikingly, Louisiana and the deep south.

Now, a high performing state might then be able to offset their national penalty by outperforming U.S. norms… but only to the extent that NAEP has a wide enough distribution to allow a high performer to gain enough points back to make up that ground. If NAEP has a narrower range than the PISA distribution, even if you rock on NAEP, you can’t gain back the ground lost. In theory, this might even make some sense, but it would depend on the truth of the report’s key assumptions, which (as noted) are never tested.

The next move in the triple-normative leap is the move to the wacky collection of state assessments and their widely varied scale score distributions. High performing districts in a state like California, where the mean NAEP score of California gives everyone another layer of penalty to start, and a big one at that, are screwed. California high performers get a NAEP based penalty on top of their US average penalty and have to make up that entire deficit with standard deviations on state assessments. They’ve got a lot of ground to make up in standard deviations from their own state mean on their state assessment (if it’s even possible).
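To see how the penalties stack, here is a small sketch that turns the report's additive index around and asks how far above its own state mean a district must score to reach a given international index. The -.055 national offset and the .06 New York offset are the figures quoted earlier. The -0.30 offset for a low-scoring state is a HYPOTHETICAL number chosen only to illustrate the mechanics, not an estimate for California or any other state, and `required_district_z` is a name I made up.

```python
US_VS_WORLD = -0.055  # U.S. offset from the international PISA mean, as quoted


def required_district_z(target_index, state_vs_nation, nation_vs_world=US_VS_WORLD):
    """District z-score (relative to its own state) needed to reach target_index,
    given that the index is the simple sum of the three components."""
    return target_index - state_vs_nation - nation_vs_world


# 0.06 is the New York figure quoted in the report's example.
# -0.30 is HYPOTHETICAL, standing in for a state well below the NAEP mean.
offsets = {
    "New York (quoted)": 0.06,
    "Low-NAEP state (hypothetical)": -0.30,
}

for state, offset in offsets.items():
    need = required_district_z(1.0, offset)
    print(f"{state}: needs {need:+.3f} SD above its state mean")
```

Under these assumed numbers, two districts aiming for the same international index of 1.0 face different hurdles on their own state tests (about 1.0 versus about 1.36 standard deviations) purely because of where their state sits on NAEP. Whether the second district can clear that hurdle at all depends on how much room its state's distribution leaves at the top.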

Let’s take a look at some of the actual district level distributions of standardized mean scale scores on state assessments. Remember, Greene and McGee’s triple normative leap only works well to the extent that state assessments are a) normally distributed, b) have similar range and c) are not particularly skewed in one direction or the other.

Note that these graphs are of the normalized distributions of scale scores.

Here’s California

Here’s Ohio

And Here’s Indiana

Oh well, so much for that little assumption. Perhaps most importantly, these distributions show that it depends quite a bit on what state your district is in whether your district has reasonable likelihood of making up 1, 2 or 3 points in the last normative leap.

Remember, every district loses about half a tenth of a point (-.055) from the start based on U.S. PISA performance. California districts actually appear to have greater opportunity to make up more ground on the last leap, because the spread of California normed scores on state assessments is wider. But, they’ll need it, since their state average performance on NAEP gets all districts in the state a large penalty.

Anyway, while it may be fun to play with Greene and McGee’s nifty web-based search tool, it really doesn’t give us much of a picture as to how individual local public school districts in the U.S. stack up against foreign nations. It’s just too much of a stretch to assume that a district’s normative position on quirky state assessments, with non-normal distributions, can actually be translated with any precision to represent that district’s position within the performance distribution of schools in Finland or Singapore.

So, while it may be fun to play with the tool and see how different local public school districts compare, more or less to one another as they relate to other countries, it is totally inappropriate to make bold claims that any of these findings speak to the supposed “mediocrity” of the best public schools in the U.S. Many may appear mediocre when transported internationally for no reason other than the penalty points assessed to them in the first two normative leaps (national and state mean), neither of which has much to do with their own performance.

And these concerns ignore the fact that we are dealing with substantively different assessment content. See: http://nepc.colorado.edu/thinktank/review-us-math

Addendum:

McGee was kind enough to open a discussion on the topic below, and clarified… which is what I was assuming already… that:

“We assume that being a certain distance from mean student performance in Arkansas is relatively similar to being the same distance from mean student performance in Massachusetts.”

My response is that the spread or variance issue is critically important here, even, and especially, when making this kind of assumption. It comes down to the reasons for the differences in spread (like the differences seen in the above histograms).

The variance in each state’s assessments across districts contains some variance that truly indicates differences in performance and some that indicates differences in tests. The problem is that we can’t tell which portion of the spread is “real” variation in performance across districts (driven largely by demographic differences) and which is a function of the different assessments – especially the different assessments across states. Some of the variance is clearly constrained by the underlying testing differences, and may also be upper or lower limit constrained.
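A toy example makes the point concrete. In the sketch below, two states have districts with IDENTICAL underlying performance, but one state's test tops out at a ceiling. Everything here is HYPOTHETICAL: the performance values, the 500/100 scale, the ceiling of 600, and the function names are invented for illustration.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# HYPOTHETICAL "true" district performance, identical in both states.
true_performance = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]


def score_without_ceiling(x):
    """A test that measures the full range of performance."""
    return 500 + 100 * x


def score_with_ceiling(x, ceiling=600):
    """The same scale, but scores cannot exceed the ceiling."""
    return min(500 + 100 * x, ceiling)


state_a = z_scores([score_without_ceiling(x) for x in true_performance])
state_b = z_scores([score_with_ceiling(x) for x in true_performance])

print(f"top district, test without ceiling: z = {state_a[-1]:+.2f}")
print(f"top district, test with ceiling:    z = {state_b[-1]:+.2f}")
```

The top district is equally strong in both states, yet it lands about 1.55 standard deviations above its state mean on one test and only about 1.08 on the other. Someone looking only at the standardized scores has no way to tell whether that gap reflects real differences across districts or just the test.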


5 Comments

  1. In 1999, a panel of the Board on Testing and Assessment of the Commission on Behavioral and Social Sciences and Education of the National Research Council concluded that it was not feasible to compare results from the currently administered commercial and state standardized tests in the same subject to one another through the development of a single equivalency or linking scale (Feuer et al., 1999). The psychometric reasons for this are many and complex. The point of that study was to determine the feasibility of linking results from different standardized tests as a substitute for a national exam. That would seem to undermine the validity of the whole project. Has the field of psychometrics overcome the hurdles identified by this distinguished panel in 1999? It would be interesting to have a psychometrician weigh in on this. (I am not one.)

    1. Feuer, M.J., Holland, P.W., Green, B.F., Bertenthal, M.W. and Hemphill, F.C. (eds.) (1999). Uncommon measures: equivalence and linkage among educational tests. Washington, DC: National Academy Press.

  2. Hi, Bruce. Thanks for your post on the GRC. I think it is fantastic that you thought about our work in such a detailed way. However, you miss the mark in a couple of ways.

    First, your sports analogy in the first paragraph is a little off base. All international (and for that matter national – i.e. the BCS) sports rankings are engaged in an exercise that is essentially the same as ours in spirit. They must rank a group of players or teams who do not all play each other, and when they do play it is in different places, under different conditions, and often on very different surfaces. The international rankings in tennis, golf and soccer must make much more heroic assumptions simply because their data is not as good as the international testing data. We can debate the relative merits of the data and estimation technique, but clearly the exercise is valuable and not so very different than what we see in other areas like sports.

    Second, I would like to point out that the graphs you include actually provide evidence that our first assumption holds across state exams. All of the states appear to have student achievement that is approximately normal. It is also important to note how small the US handicap of -0.055 standard deviation units really is. If I look at the graphs you have presented, it is clear that a significant proportion, if not majority, of districts are more than +/- 0.055 standard deviation units away from the mean for their state.

    Lastly, it appears you misunderstand our assumption about standard deviation units. It is not necessary for the actual data to have a spread that exhibits a similar range. Instead, for accuracy across the achievement distribution, it is important that standard deviation units have similar meaning across states and exams. In other words, we assume that being a certain distance from mean student performance in Arkansas is relatively similar to being the same distance from mean student performance in Massachusetts. As you point out we do not provide specific evidence in the report that this assumption holds. For evidence on this point we can look at the NAEP Urban District Assessment. If this assumption were completely erroneous, we would expect the rankings of the 15 large urban districts in different states who participated in NAEP’s Trial Urban District Assessment to bear very little correlation to our rankings. However, the opposite is true. Our calculations match up well with the rankings provided by Urban NAEP.

    1. I agree that many sports rankings do actually engage in similar comparisons, for example trying to determine relative strength of conferences in College Football or Basketball to set rankings during the season with little inter-conference competition. However, therein lies the problem. People know that this doesn’t work so well, even though we do it. And, it particularly doesn’t work so well when we try to determine the extremes of the distribution. That’s why I believe this is a fair analogy. Because we do it a lot but recognize the fallibility.

      Despite your explanation above, the spread or variance issue is critically important here. But, it comes down to the reasons for the differences in spread. The variance in each state’s assessments across districts contains some variance that truly indicates differences in performance and some that indicates differences in tests. Even if we had a truly comparable assessment across two states, they might have varied degrees of spread as a function of varied degrees of district heterogeneity. The problem is that we can’t tell which portion of the spread is “real” variation in performance across districts and which is a function of the different assessments – especially the different assessments across states. Some of the variance is clearly constrained by the underlying testing differences, and may also be upper or lower limit constrained. That’s a real problem here.

      Feeling like this was still a generally fun activity, I tried to be exceedingly fair in my choice of state test distributions, but even these have very different degrees of variance, which does matter when ranking districts at the extremes, which is what you and Jay seem to be encouraging people to do. Also, keep in mind that I have presented the normalized distributions for these states. These are normalizations of distributions that are really, really non-normal. But even the normal distributions are problematic. Again, we don’t know how much of that variance is “real” performance difference variance and how much is variance due to differences in the testing.

      You note this assumption above:

      “In other words, we assume that being a certain distance from mean student performance in Arkansas is relatively similar to being the same distance from mean student performance in Massachusetts.”

      This is the problematic assumption. It may be that the Mass tests show more variation (not sure which do) either because there really is more variation in Mass, or because Mass has a test which picks it up. Since we don’t know, we can’t be too confident in that assumption.

      This probably creates the biggest problems for those cases where you are trying to compare high end, or low end for that matter, U.S. districts to norms of other countries.

  3. Bruce,

    This is a great post and timely. Our hometown newspaper just published a guest editorial by a right-wing think tank that cited the Global Report Card as, surprise, further evidence of our mediocrity thus justifying more charters.

    Your response is incredibly detailed, yet technical. If you ever have the opportunity to explain your analysis in any less technical terms (I know, probably not an easy task given the topic) I’d love to see it. If I tried to use this to explain to local mom and dad, it would be a tough sell.

    Thanks for your efforts. I greatly enjoy your work.

    Brendan
