Understand your data & use it wisely! Tips for avoiding stupid mistakes with publicly available NJ data

Posted on April 11, 2014



My next few blog posts will return to a common theme on this blog: appropriate use of publicly available data sources. I figure it’s time to put some positive, instructive stuff out there. Some guidance for more casual users (and more reckless ones) of public data sources and for those just making their way into the game. In this post, I provide a few tips on using publicly available New Jersey schools data. The guidance provided herein is largely in response to repeated errors I’ve seen over time in using and reporting New Jersey school data. Some of those errors reflect simple oversight and a lack of deep understanding of the data; others seem a bit more suspect. Most of these recommendations apply to using other states’ data as well. Notably, most of these are tips that a thoughtful data analyst would arrive at on his or her own, by engaging in the appropriate preliminary evaluations of the data. But sadly these days, it doesn’t seem to work that way.

So, here are a few NJ state data tips.

NJ ASK scale score data vary by grade level, so aggregating across grade levels produces biased comparisons if schools have different numbers of kids in different grade levels

NJ, like other states, has adopted math and reading assessments for grades 3 to 8 and, like other states, has made numerous rather arbitrary decisions over time about how to establish the cut scores that determine proficiency on the assessments, and about the methods for converting raw scores (numbers of items on a 50 point test) into scale scores (with a proficiency cut score of 200 and a max score of 300). [1] The presumption behind this method is that “proficiency” has some common meaning across grade levels. That is, a child who is proficient in grade 3 math, for example, and who learns what he or she is supposed to in 4th grade (and only that), will again be proficient at the end of the year. But that doesn’t mean that the distributions of testing data actually support this assumption. Alternatively, the state could have scaled the scores year-over-year such that the average student remained the average student, a purely normative approach rather than the pseudo-standards-based (mostly normative) approach currently in use.

A few fun artifacts of the current approach are that a) proficiency rates vary from one grade to the next, giving a false impression that, for example, 5th graders simply aren’t doing as well as 4th graders in language arts, and that b) scale score averages vary similarly. Many a 5th or 6th grade teacher or grade level coordinator across the state has come under fire from district officials for their apparent NJASK underperformance compared to lower grades. But this underperformance is merely an artifact of arbitrary decisions in the design of the tests, the difficulty of the items, the conversion to scale scores and the arbitrary assignment of cut points.

Here’s a picture of the average scale scores drawn from school level data weighted by relevant test takers for NJASK math and NJASK language arts. Of course, the simplest implication here is that “kids get dumber at LAL as they progress through grades” and/or their teachers simply suck more, and that “kids get smarter in math as they progress through grades.” Alternatively, as stated above, this is really just an artifact of those layers of arbitrary decisions.

Figure 1. Scale Scores by Grade Level Statewide


Why then, do we care? How does this affect our common uses of the data? Well, on several occasions I’ve encountered presentations of schoolwide average scale scores as somehow representing school average test performance. The problem is that if you aggregate across grades, but have more kids in some grades than others, your average will be biased by the imbalance of kids. If you are seeking to play this bias to your advantage:

  1. If your school has more kids in grades 6 to 8 than in 3 to 5, you’d want to look at LAL scores. That’s because kids statewide simply score higher on LAL in grades 6 to 8. It would be completely unfair to compare schoolwide LAL scores for a school with mostly grades 6 to 8 students to schoolwide LAL scores for a school with mostly grades 3 to 5 students. Yet it is done far too often!
  2. Interestingly, the reverse appears true for math.

So, consumers of reports of school performance data in New Jersey should certainly be suspicious any time someone chooses to make comparisons solely on the basis of schoolwide LAL scores, or math scores for that matter. While it makes for many more graphs and tables, grade level disaggregation is the only way to go with these data.
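To see how much damage grade mix alone can do, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the grade-level means are invented (they are not actual NJ ASK values), and the two schools are hypothetical. Both schools score exactly at the grade mean in every single grade, so on a grade-by-grade comparison they are identical.

```python
# Hypothetical statewide grade-level means for one subject.
# These numbers are invented for illustration; they are NOT actual NJ ASK results.
STATE_GRADE_MEANS = {3: 200.0, 4: 202.0, 5: 204.0, 6: 208.0, 7: 212.0, 8: 216.0}


def schoolwide_mean(enrollment, grade_means):
    """Test-taker-weighted average scale score across all grades in a school."""
    total_students = sum(enrollment.values())
    weighted_sum = sum(n * grade_means[grade] for grade, n in enrollment.items())
    return weighted_sum / total_students


# Two made-up schools. In BOTH, every grade scores exactly the state mean for
# that grade, so grade by grade they are identical. Only the enrollment mix differs.
lower_grade_heavy = {3: 100, 4: 100, 5: 100, 6: 20, 7: 20, 8: 20}
upper_grade_heavy = {3: 20, 4: 20, 5: 20, 6: 100, 7: 100, 8: 100}

lower_avg = schoolwide_mean(lower_grade_heavy, STATE_GRADE_MEANS)
upper_avg = schoolwide_mean(upper_grade_heavy, STATE_GRADE_MEANS)

print(f"Lower-grade-heavy school: {lower_avg:.1f}")
print(f"Upper-grade-heavy school: {upper_avg:.1f}")
print(f"Gap produced by grade mix alone: {upper_avg - lower_avg:.1f}")
```

With these made-up numbers the schoolwide averages differ by roughly 6.7 scale score points even though neither school outperforms the other in any grade. The whole gap comes from where the kids are enrolled.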

Let’s take a look.

Here are Newark Charter and District schools by schoolwide LAL, and by low income concentration. Here, we see that Robert Treat Academy, North Star Academy and TEAM academy fall above the line. That is, relative to their low income concentrations (setting aside very low rates of children with disabilities or ELL children, and 50+% attrition at North Star), these schools have average schoolwide scale scores that appear to exceed expectations.

Figure 2. Newark Charter vs. District School Schoolwide LAL by Low Income Concentration


 

But it may not be a good idea (unless of course you are gaming the data) to use LAL school aggregate scale scores to represent the comparative performance of these schools with NPS schools. As Figure 3 shows, both North Star and TEAM academy, and especially TEAM academy, have larger shares of kids in the grades where average scores tend to be higher as a function of the tests and their rescaling.

Figure 3. Charter and District Grade Level Distributions

http://www.nj.gov/education/data/enr/enr13/enr.zip

Figures 4a and 4b break out the comparisons by grade level and provide both the math and LAL assessments for a more complete and more accurate picture, though still ignoring many other variables that may influence these scores (attrition, special education, ELL and gender balance). These figures also identify schools slated for takeover by charters. Whereas TEAM academy appeared on schoolwide aggregate to “beat the odds” on LAL, TEAM falls roughly on the trendline for LAL 6, 7 and 8 and falls below it for LAL 5. That is, disaggregation paints a different picture of TEAM academy in particular: one of a school that by grade level more or less meets the average expectation. Similarly for North Star, while their small groups of 3rd and 4th graders appear to substantially beat the odds, differences are much smaller for their 5th through 8th grade students when compared to only students in those same grades in other schools. Some similar patterns appear for math, except that TEAM in particular falls more consistently below the line.

Figure 4a. Charter and District Scale Scores vs. Low Income by Grade


Figure 4b. Charter and District Scale Scores vs. Low Income by Grade


Figure 4c. Charter and District Scale Scores vs. Low Income by Grade


A few related notes are in order:

  • Math assessments in grades 5-7 have very strong ceiling effects which are particularly noticeable in more affluent districts and schools where significant shares of children score 300.
  • As a result of the scale score fluctuations, proficiency rates also fluctuate by grade.

Not all measures are created equal: Measures, thresholds and cutpoints matter!

I’ve pointed this one out on a number of occasions – that finding the measure that best captures the variations in needs across schools is really important when you are trying to tease out how those variations relate to test scores. I’ve also explained over and over again how measures of low income concentration commonly used in education policy conversations are crude and often fail to capture variation across settings. But, among the not-so-great options for characterizing differences in student needs across schools, there are better and worse methods and measures. Two simple and highly related rules of thumb apply when evaluating factors that may affect or be strongly associated with student outcomes:

  1. The measure that picks up more variation across settings is usually the better measure, assuming that variation is not noise (simply a greater amount of random error in reporting of the measure).
  2. Typically, the measure that picks up more “real” variation across settings will also be more strongly correlated with the measure of interest – in many cases variations in student outcome levels.

A classic case of how different thresholds or cutpoints affect the meaningful variation captured across school settings is the choice between shares of children qualifying for free lunch (family income below 130% of the poverty level) and shares qualifying for free or reduced-price lunch (family income below the much higher 185% threshold) when comparing schools in a relatively high poverty setting. In many relatively high poverty settings, the share of children in families below the 185% threshold exceeds 80 to 90% across all schools. Yes, there may appear to be variation across those schools, but that variation within such a narrow, truncated range may be particularly noisy, and thus not very helpful in determining the extent to which low income shares compromise student outcomes. It is really important to understand that two schools with 80% of children below the 185% income threshold can be hugely different in terms of actual income and poverty distribution.
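A quick way to check for this kind of truncated range is simply to compare the spread of the two measures before doing anything else. The sketch below uses made-up percentages for eight hypothetical schools in one high poverty district (these are not Newark figures), and only Python's standard library.

```python
from statistics import mean, pstdev

# Invented school-level percentages for eight hypothetical schools.
pct_free = [62.0, 70.0, 75.0, 80.0, 84.0, 88.0, 91.0, 95.0]
pct_reduced = [28.0, 25.0, 15.0, 14.0, 12.0, 5.0, 6.0, 3.0]

# Free or reduced = below the 185% threshold; free alone = below the 130% threshold.
pct_free_or_reduced = [f + r for f, r in zip(pct_free, pct_reduced)]


def describe(values):
    """Return mean, standard deviation, minimum and maximum of a measure."""
    return {
        "mean": mean(values),
        "sd": pstdev(values),
        "min": min(values),
        "max": max(values),
    }


free_stats = describe(pct_free)
frl_stats = describe(pct_free_or_reduced)

for label, stats in (("% free", free_stats), ("% free or reduced", frl_stats)):
    print(
        f"{label:>18}: mean={stats['mean']:.1f} sd={stats['sd']:.1f} "
        f"range={stats['min']:.0f} to {stats['max']:.0f}"
    )
```

In this invented district the free or reduced measure is squeezed into a band of a few percentage points near the ceiling, while the free lunch measure spreads across more than thirty points. A larger spread is not automatically better (it could be noise), which is why the next step is to see how each measure relates to outcomes.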

Here, for example, is the distribution of schools by concentration of children in families below the 185% income threshold in Newark, NJ. The mean is around 90%!

Figure 5.


Now here is the distribution of schools by concentration of children in families below the 130% threshold. The bell curve looks similar in shape, but now the mean is around 80% and the spread is much greater. But even this is not really proof of the meaningfulness of this variation.

Figure 6.


 

But first a little aside. If, in figure 5, nearly all kids are below the free/reduced threshold and fewer below the free lunch threshold, we basically have a city where “if not free lunch, then reduced lunch.” Plotted, it looks like this:

Figure 7.


The correlation here is -.65 across Newark schools. What this actually means is that the percent of reduced lunch children is, in fact, a measure of the percent of lower need children in any school, because there are so few children who don’t qualify for either. Children qualifying for reduced price lunch in Newark are among the upper income children in Newark schools. If a school has fewer reduced lunch children, it typically means it has more free lunch children, and vice versa.

As such, comparing charter schools to district schools on the basis of % free or reduced is completely bogus, because charters serve very low shares of the lower income children but do serve the rest.

Second, it is statistically very problematic to put both of these measures – the inverse of one another because they account for nearly the entire population – in a single regression model!

Further validation of the importance of using the measures of actual lower income children is provided in the table below, which shows the correlations between outcome measures across schools and student population characteristics.

Figure 8. Correlations between low income concentrations and outcome measures

With respect to every outcome measure, % free lunch is more strongly negatively associated with the outcome measure than % free or reduced lunch. Of course, one striking problem here is that the growth percentile scores, while displaying weaker relationships to low income than level scores, do show a modest relationship, indicating their persistent bias, even across schools within a relatively narrow range of poverty (Newark). But that’s a side story for now!

To add to the assertion that % reduced lunch in a district like Newark (where % reduced means % not free lunch) is in fact a measure of relative advantage, take a look at the final column. % reduced lunch alone is strongly positively correlated with the outcome level measures. Statistically, this is a given since it is more or less the inverse of a measure that is strongly negatively correlated with the outcome measures.
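The same check is easy to run on any district's data. Here is a minimal sketch, again with invented numbers for eight hypothetical schools (not the Newark data behind the table), that correlates each low income measure with a school's mean scale score. The `pearson` helper is written out by hand so the example has no dependencies.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented school-level data: lunch shares and mean scale scores.
pct_free = [62.0, 70.0, 75.0, 80.0, 84.0, 88.0, 91.0, 95.0]
pct_reduced = [28.0, 25.0, 15.0, 14.0, 12.0, 5.0, 6.0, 3.0]
pct_free_or_reduced = [f + r for f, r in zip(pct_free, pct_reduced)]
mean_score = [215.0, 212.0, 207.0, 205.0, 200.0, 199.0, 195.0, 192.0]

r_free = pearson(pct_free, mean_score)
r_frl = pearson(pct_free_or_reduced, mean_score)
r_reduced = pearson(pct_reduced, mean_score)

print(f"% free vs. score:            {r_free:+.2f}")
print(f"% free or reduced vs. score: {r_frl:+.2f}")
print(f"% reduced vs. score:         {r_reduced:+.2f}")
```

In this toy district, % free has the strongest negative relationship with scores, % free or reduced a weaker one, and % reduced alone comes out positive, which is the pattern you would expect wherever "not free lunch" mostly means "reduced lunch." Real data will be noisier; the point is only that three correlations take a few lines to check.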

Know your data context!

Finally, and this is somewhat of an extension of a previous point, it’s really important if you intend to engage in any kind of comparisons across school settings, to get to know your context. Get to know your data and how they vary across schools. For example, know that nearly all kids in Newark fall below the 185% income threshold and that this means that if a child is not below the 130% income threshold, then they are likely still below the 185% threshold. This creates a whole different meaning from the “usual” assumptions about children qualified for reduced price lunch, how their shares vary across schools, and what it likely means.

Many urban districts have population distributions by race that are similarly in inverse proportion to one another. That is, in a city like Newark, schools that are not predominantly black tend to be predominantly Hispanic. Similar patterns exist in Chicago and Philadelphia at much larger scale. Here is the scatterplot for Newark. In Newark, the relationship between % black and % Hispanic is almost perfectly inverse!

Figure 9. % Black versus % Hispanic for Newark Schools

As Mark Weber and I pointed out in our One Newark briefs, just as it would be illogical (as well as profoundly embarrassing) to try to consider both % free and % reduced lunch in a model comparing Newark schools, it is hugely problematic to try to address both % Hispanic and % black in any model comparing Newark schools. Quite simply, for the most part, if not one then the other.
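One hedged way to put a number on "if not one then the other" is the variance inflation factor (VIF). With only two predictors, the VIF for each is 1 / (1 - r²), where r is the correlation between them. VIF is my choice of diagnostic for this illustration, not something the briefs used, and the percentages below are invented for eight hypothetical schools.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(r):
    """Variance inflation factor when a model contains exactly two predictors."""
    return 1.0 / (1.0 - r ** 2)


# Invented school-level shares where the two groups make up nearly everyone.
pct_black = [90.0, 80.0, 72.0, 60.0, 45.0, 30.0, 20.0, 10.0]
pct_hispanic = [8.0, 17.0, 25.0, 36.0, 50.0, 66.0, 74.0, 85.0]

r = pearson(pct_black, pct_hispanic)
vif = vif_two_predictors(r)

print(f"Correlation between % black and % Hispanic: {r:.3f}")
print(f"Variance inflation factor: {vif:.0f}")
```

A common rule of thumb treats a VIF above 10 as a sign that the coefficients on the two measures cannot be estimated separately with any stability. When two shares sum to nearly 100% in every school, the VIF blows far past that, and the model has no way to tell the effect of one from the absence of the other.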

Catching these problems is a matter of familiarity with context and familiarity with data. These are common issues. And I encourage budding grad students, think tankers and data analysts to pay closer attention to these issues.

How can we catch this stuff?

Know your context.

Run descriptive analyses first to get to know your data.

Make a whole bunch of scatterplots to get to know how variables relate to one another.
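Scatterplots are the best tool here, but a quick numeric screen can tell you which pairs to plot first. The sketch below is one possible first pass, not a standard routine: the function name `flag_related_pairs`, the 0.8 cutoff and the data are all my own assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_related_pairs(columns, threshold=0.8):
    """List every pair of measures whose absolute correlation meets the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented school-level measures for eight hypothetical schools.
schools = {
    "pct_free": [62.0, 70.0, 75.0, 80.0, 84.0, 88.0, 91.0, 95.0],
    "pct_reduced": [28.0, 25.0, 15.0, 14.0, 12.0, 5.0, 6.0, 3.0],
    "pct_black": [90.0, 80.0, 72.0, 60.0, 45.0, 30.0, 20.0, 10.0],
    "pct_hispanic": [8.0, 17.0, 25.0, 36.0, 50.0, 66.0, 74.0, 85.0],
}

flagged = flag_related_pairs(schools)
for name_a, name_b, r in flagged:
    print(f"{name_a} vs. {name_b}: r = {r:+.2f}  (plot this pair before modeling)")
```

Anything this screen flags deserves a scatterplot and a hard think about what each measure means in that particular context before both go into the same model.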

Don’t assume that the relationships and meanings of the measures in one context necessarily translate to another. The best example here is the meaning of % reduced lunch. It might just be a measure of relative advantage in a very high poverty urban setting.

And think… think… think twice… and think again about just what the measures mean… and perhaps more importantly, what they don’t and cannot!

 

Cheers!

 

 

 

[1] New Jersey Assessment of Skills and Knowledge 2012 TECHNICAL REPORT Grades 3-8. February 2013. NJ Department of Education. http://www.nj.gov/education/assessment/es/njask_tech_report12.pdf
