The big news over the weekend involved the LA Times posting of value-added ratings of LA public school teachers.

Here’s how the Times spun their methodology:

Seeking to shed light on the problem, The Times obtained seven years of math and English test scores from the Los Angeles Unified School District and used the information to estimate the effectiveness of L.A. teachers — something the district could do but has not.

The Times used a statistical approach known as value-added analysis, which rates teachers based on their students’ progress on standardized tests from year to year. Each student’s performance is compared with his or her own in past years, which largely controls for outside influences often blamed for academic failure: poverty, prior learning and other factors.

This spin immediately concerned me, because it appears to assume that simply using a student’s prior score erases, or controls for, any and all differences among students by family backgrounds as well as classroom level differences – who attends school with whom.

Thankfully (thanks to the immediate investigative work of Sherman Dorn), the analysis was at least marginally better than that and conducted by a very technically proficient researcher at RAND named Richard Buddin. Here’s his technical report:

The problem is that even someone as good as Buddin can only work with the data he has. And there are at least 3 major shortcomings of the data that Buddin appeared to have available for his value added models. I’m setting aside here the potential quality of the achievement measures themselves. Calculating (estimating) a teacher’s effect on their students’ learning and specifically, identifying the differences across teachers where those students are not randomly assigned (with same class size, comparable peer group, same air quality, lighting, materials, supplies, etc.) requires that we do a pretty damn good job of accounting for the measurable differences across the children assigned to teachers. This is especially true if our plan is to post names on the wall (or web)!

Here’s my quick read, short list of shortcomings to Buddin’s data, that I would suspect, lead to significant problems in precisely determining differences in teacher quality across students:

- While Buddin’s analysis includes student characteristics that may (and in fact appear to) influence student gains, Buddin – likely due to data limitations – includes only a simple classification variable for whether a student is a Title I student or not, and a simple classification variable for whether a student is limited in their English proficiency. These measures are woefully insufficient for a model being used to label teachers on a website as good or bad. Buddin notes that 97% of children in the lowest performing schools are poor, and 55% in higher performing schools are poor. Identifying children simply as poor or not poor misses entirely the variation among the poor to very poor children in LA public schools – which is most of the variation in family background in LA public schools. That is, the estimated model does not control at all for one teacher teaching a class of children who barely qualify for Title I programs, versus a teacher with a classroom of children of destitute homeless families, or multigenerational poverty. I suspect Buddin, himself, would have liked to have had more detailed information. But, you can only use what you’ve got. When you do, however, you need to be very clear about the shortcomings. Again, most kids in LA public schools are poor and the gradients of poverty are substantial. Those gradients are neglected entirely. Further, the model includes no “classroom” related factors such as class size, student peer group composition (either by a Hoxby approach of average ability level of peer group, or considering racial composition of peer group as done by Hanushek and Rivkin. Then again, it’s nearly if not entirely impossible to fully correct for classroom level factors in these models.).
- It would appear that Buddin’s analysis uses annual testing data, not fall-spring assessments. This means that the year-to-year gains interpreted as “teacher effects” include summer learning and/or summer learning lag. That is, we are assigning blame, or praise to teachers based on what kids learned, or lost over the summer. If this is true of the models, this is deeply problematic. Okay, you say, but Buddin accounted for whether a student was a Title I student and summer opportunities are highly associated with Poverty Status. But, as I note above, this very crude indicator is far from sufficient to differentiate across most LA public school students.
- Finally, researchers like Jesse Rothstein, among others have suggested that having multiple years of prior scores on students can significantly reduce the influence of non-random assignment of students to teachers on the ratings of teachers. Rothstein speaks of using 3-years of lagged scores (http://gsppi.berkeley.edu/faculty/jrothstein/published/rothstein_vam2.pdf) so as to sufficiently characterize the learning trajectories of students entering any given teacher’s class. It does not appear that Buddin’s analysis includes multiple lagged scores.

**So then what are some possible effects of these problems, where might we notice them, and why might they be problematic?**

One important effect, which I’ve blogged about previously, is that the value-added teacher ratings could be substantially biased by the non-random sorting of students – or in more human terms – teachers of children having characteristics not addressed by the models could be unfairly penalized, or for that matter, unfairly benefited.

Buddin is kind enough in his technical paper to provide for us, various teacher characteristics and student characteristics that are associated with the teacher value-added effects – that is, what kinds of teachers are good, and which ones are more likely to suck? Buddin shows some of the usual suspects, like the fact that novice (first 3 years) teachers tended to have lower average value added scores. Now, this might be reasonable if we also knew that novice teachers weren’t necessarily clustered with the poorest of students in the district. But, we don’t know that.

Strangely, Buddin also shows us that the number of gifted children a teacher has affects their value-added estimate – The more gifted children you have, the better teacher you are??? That seems a bit problematic, and raises the question as to why “gifted” was not used as a control measure in the value-added ratings? Statistically, this could be problematic if giftedness was defined by the outcome measure – test scores (making it endogenous). Nonetheless, the finding that having more gifted children is associated with the teacher effectiveness rating raises at least some concern over that pesky little non-random assignment issue.

*Now here’s the fun, and most problematic part:*

Buddin finds that black teachers have lower value-added scores for both ELA and MATH. Further, these are some of the largest negative effects in the second level analysis – especially for MATH. The interpretation here (for parent readers of the LA Times web site) is that having a black teacher for math is worse than having a novice teacher. In fact, it’s the worst possible thing! Having a black teacher for ELA is comparable to having a novice teacher.

Buddin also finds that having more black students in your class is negatively associated with teacher’s value-added scores, but writes off the effect as small. Teachers of black students in LA are simply worse? There is NO discussion of the potentially significant overlap between black teachers, novice teachers and serving black students, concentrated in black schools (as addressed by Hanushek and Rivken in link above).

By contrast, Buddin finds that having an Asian teacher is much, much better for MATH. In fact, Asian teachers are as much better (than white teachers) for math as black teachers are worse! Parents – go find yourself an Asian math teacher in LA? Also, having more Asian students in your class is associated with higher teacher ratings for Math. That is, you’re a better math teacher if you’ve got more Asian students, and you’re a really good math teacher if you’re Asian and have more Asian students?????

Talk about some nifty statistical stereotyping.

It makes me wonder if there might also be some racial disparity in the “gifted” classification variable, with more Asian students and fewer black students district-wide being classified as “gifted.”

IS ANYONE SEEING THE PROBLEM HERE? Should we really be considering using this information to either guide parent selection of teachers or to decide which teachers get fired?

I discussed the link between non-random assignment and racially disparate effects previously here:

Indeed there may be some substantive differences in the average academic (undergraduate & high school) preparation in math of black and Asian teachers in LA. And these differences may translate into real differences in the effectiveness of math teaching. But sadly, we’re not having that conversation here. Rather, the LA times is putting out a database, built on insufficient underlying model parameters, that produces these potentially seriously biased results.

While some of these statistically significant effects might be “small” across the entire population of teachers in LA, the likelihood that these “biases” significantly affect specific individual teacher’s value-added ratings is much greater – and that’s what’s so offensive about the use of this information by the LA Times. The “best possible,” still questionable, models estimated are not being used to draw simple, aggregate conclusions about the degree of variance across schools and classrooms, but rather they are being used to label individual cases from a large data set as “good” or “bad.” That is entirely inappropriate!

**Note: On Kane and Staiger versus Rothstein and non-random assignment**

Finally, a comment on references to two different studies on the influence of non-random assignment. Those wishing to write off the problems of non-random assignment typically refer to Kane and Staiger’s analysis using a relatively small, randomized sample. Those wishing to raise concerns over non-random assignment typically refer to Jesse Rothstein’s work. Eric Hanushek, in an exceptional overview article on Value Added assessment summarizes these two articles, and his own work as follows:

An alternative approach of Kane and Staiger (2008) of using estimates from a random assignment of teachers to classrooms finds little bias in traditional estimation, although the possible uniqueness of the sample and the limitations of the specification test suggest care in interpretation of the results.

A compelling part of the analysis in Rothstein (2010) is the development of falsification tests, where future teachers are shown to have significant effects on current achievement. Although this could be driven in part by subsequent year classroom placement on based on current achievement, the analysis suggests the presence of additional unobserved differences..

In related work, Hanushek and Rivkin (2010) use alternative, albeit imperfect, methods for judging which schools systematically sort students in a large Texas district. In the “sorted” samples, where random classroom assignment is rejected, this falsification test performs like that in North Carolina, but this is not the case in the remaining “unsorted” sample where random assignment is not rejected.

http://edpro.stanford.edu/hanushek/admin/pages/files/uploads/HanushekRivkin%20AEA2010.CALDER.pdf

I can get credit for asking the reporters for the link to the Buddin PDF, but I don’t think I get credit for anything else here!

I’m just a little old elementary teacher and I read Buddin’s report. From the perspective of a non-statistician, the general cognitive production function does look a little like mumbo jumbo.

In reference to said function: “… let μi0 be the student’s endowed ability that does not vary over time.” Endowed ability? Based upon what exactly? How does Buddin know what that is?

It’s a stated assumption of the model, but not one that really causes the big problems when the model is applied – except to the extent that the model doesn’t really try hard enough to establish the track record of each child’s trajectory, as I discuss above. Yeah… it sounds pretty damn offensive, but it’s not the most problematic stuff here. You’ll find a lot of stuff like this in economic explanations of education. Some of it is as bad as it sounds but some is not. Economists feel an obligation to lay out certain “assumptions” and theories as basis for the empirical model. It’s just part of economic research writing – like assuming away everything that compromises the validity of the model itself.

The old joke goes that an economist, an engineer and a physicist are trapped in a well… they all look at each other trying to figure out what to do to get out. The engineer starts looking for possible materials to aid in their escape, and drafting precise mechanical drawings on the side of the well. The physicist starts by sketching out equations on the side of the well to estimate well depth, various angles, etc. The economist smugly looks at both and says STOP! “First, we assume the well does not exist…”

I’m setting aside here the potential quality of the achievement measures themselves.B, b, but. Why would you do that? If the dependent variable is flawed, the rest of the endeavor is an idle exercise fraught with dangerous consequences.

Sample test items can be viewed at

http://www.cde.ca.gov/ta/tg/sr/css05rtq.asp

The measure is a rubber ruler in that the test changes from grade to grade. It’s a test of “English Language Arts” rather than a reading test. Each item is a trick question designed to “foil” novice students. The Math test is more a test of reading and technical math constructs that has more to do with reading capability than with math capability. By the time the scores are scaled, normatlized, and then put through the stat wringer with all the “issues” you describe, it’s a cruel and inhumane joke. It would be nice if we could assume that “production functions” don’t exist. Actually, they don’t exist, but if one believes in them, they’re as good as real.

Lack of sufficiently detailed information to go down that road – and plenty to critique even if we assumed the outcome measures were good, and appropriately scaled – which most are not (for these purposes).