A few comments on the Gates/Kane value-added study

(My apologies in advance for an excessively technical, research geeky post, but I felt it necessary in this case)

Take home points

1) As I read it, the new Gates/Kane value-added findings are NOT by any stretch of the imagination an endorsement of using value-added measures of teacher effectiveness for rating individual teachers as effective or not or for making high-stakes employment decisions. In this regard, the Gates/Kane findings are consistent with previous findings regarding stability, precision and accuracy of rating individual teachers.

2) Even in the best of cases, measures used in value-added models remain insufficiently precise or accurate to account for the differences in children served by different teachers in different classrooms (see discussion of poverty measure in first section, point #2 below).

3) Too many of these studies, including this one, adopt the logic that value-added outcomes can be treated both as a measure of effectiveness to be investigated (independent variable) and as the true measure of effectiveness (the dependent measure). That is, this study like others evaluates the usefulness of both value added measures and other measures of teacher quality by their ability to predict future (or different group) value-added measures. Certainly, the deck is stacked in favor of value added measures under such a model. See value-added as a predictor of itself below.

4) Value-added measures can be useful for exploring variations in student achievement gains across classroom settings and teachers, but I would argue that they remain of very limited use for identifying, with any real precision or accuracy, the quality of individual teachers. Among other things, the most useful findings in the new Gates/Kane study apply to very few teachers in the system (see final point below).

Detailed discussion

Much has been made of the preliminary findings of the Gates Foundation study on teacher effectiveness. Jason Felch of the LA Times has characterized the study as an outright endorsement of the use of value-added measures as the primary basis for determining teacher effectiveness. Mike Johnston, the Colorado State Senator behind that state’s new teacher tenure law, which requires that 50% of teacher evaluation be based on student growth (and tenure and removal of tenure based on the evaluation scheme), also seemed thrilled – via Twitter – that the Gates study found that value-added scores in one year predict value-added scores in another – seemingly assuming this finding unproblematically endorses his policies (?). His tweet (SenJohnston, Mike Johnston): “New Gates foundation report on effective teaching: value added on state test strongest predictor of future performance.”


Rather, the new Gates study tells us that we can use value-added analysis to learn about variations in student learning (or at least in test score growth) across classrooms and schools and that we can assume that some of this variation is related to variations in teacher quality. But, there remains substantial uncertainty in the capacity to estimate whether any one teacher is a good teacher or a bad one.

Perhaps the most important and interesting aspects of the study are its current and proposed explorations of the relationship between value-added measures and other measures, including student perceptions, principal perceptions and external evaluator ratings.

Gates Report vs. LA Times Analysis

In short, data quality and modeling matter, but you can only do so much.

For starters, let’s compare some of the features of the Gates study value-added models to the LAT models. These are some important differences to look for when you see value-added models being applied to study student performance differences across classrooms – especially where the goal is to assign outcome effects to teachers.

  1. The LA Times model, like many others, uses annual achievement data (as far as I can tell) to determine teacher effectiveness, whereas the Gates study at least explores the seasonality of learning – or more specifically, how much achievement change occurs over the summer (which is certainly outside of the teacher’s control AND differs across students by their socioeconomic status). One of the more interesting findings of the Gates study is that from 4th grade on: “The norm sample results imply that students improve their reading comprehension scores just as much (or more) between April and October as between October and April in the following grade. Scores may be rising as kids mature and get more practice outside of school.” This means that if there exist substantial differences in summer learning by students’ family income level and/or other factors, as has been found in other studies, then using annual data could significantly and inappropriately disadvantage teachers who are assigned students whose reading skills lagged over the summer. The existing blunt indicator of low income status is unlikely to be sufficiently precise to correct for summer learning differences.
  2. The LA Times model did include such blunt measures for poverty status and language proficiency, as well as disability status (single indicator), but later found shares of gifted children to be associated with differences in teacher ratings, along with student race. The Gates study includes similarly crude indicators of socioeconomic status, but does include in its value-added model whether individual children are classified as gifted. It also includes student race and the average characteristics of students in each classroom (peer group effect). This is a much richer and more appropriate model, but still likely insufficient to fully account for the non-random distribution of students.  That is, the Gates study models at least attempt to correct for the influence of peers in the classroom in addition to individual characteristics of students, but even this may be insufficient. One particular concern of mine is the use of a single dichotomous measure of child poverty – whether the child qualifies for free or reduced price lunch – and the share of children in each class who do. The reality is that in many urban public schooling settings like those involved in the Gates study, several elementary/middle schools have over 80% of children qualifying for free or reduced lunch, but this apparent similarity is no guarantee of similar poverty conditions among the children in one school or classroom compared to another. One classroom might be filled 80% with children whose family income is at or below 100% of the income threshold for poverty, whereas another classroom might be filled 80% with children whose family income is 85% higher (at the threshold for “reduced” price lunch). This is a big difference that is not captured with this crude measure.
  3. The LAT analysis uses a single set of achievement measures. Other studies like the work of Sean Corcoran (see below) using data from Houston, TX have shown us the relatively weak relationship between value-added ratings of teachers produced by one test and value-added ratings of teachers produced by another test. Thankfully, the Gates Foundation analysis takes steps to explore this question further but, I would argue, either overstates the relationship it found between tests or states that relationship in a way that might be misinterpreted by pundits seeking to advance the use of value-added for high stakes decisions (more later).

Learning about Variance vs. Rating Individual Teachers with Precision and Accuracy

If we are talking about using the value-added method to classify individual teachers as effective or ineffective and to use this information as the basis for dismissing teachers or for compensation, then we should be very concerned with the precision and accuracy of the measures as they apply to each individual teacher. In this context, one can characterize precision and accuracy as follows.

  • Precision – That there exists little error in our estimate that a teacher is responsible for producing good or bad student value-added on the test instrument used.  That is, we have little chance of classifying a good teacher as bad, an average teacher as bad, or vice versa.
  • Accuracy – That the test instrument and our use of it to measure teacher effectiveness are really measuring the “true” effectiveness of the teacher – or truly how good that teacher is at doing all of the things we expect that teacher to do.

If, instead of classifying individual teachers as good or bad (and firing them, or shaming them in the newspaper or on milk cartons), we are actually interested in learning about variations in “effectiveness” across many teachers and many sections of students over many years, and whether student perceptions, supervisor evaluations, classroom conditions and teaching practices are associated with differences in effectiveness, we are less concerned about precise and accurate classification of individuals and more concerned about the relationships between measures, across many individuals (measured with error).  That is, do groups of teachers who do more of “X” seem to produce better value-added gains? Do groups of teachers prepared in this way seem to produce better outcomes? We are not concerned about whether a given teacher is accurately “scored.” Instead, we are concerned about general trends and averages.

The Gates study, like most previous studies, finds what I would call relatively weak correlations between the value-added score an individual teacher receives for one section of students in math or reading compared to another, and from one year to the next. The Gates research report noted:

“When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.”
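The arithmetic behind that quoted rule of thumb can be sketched with a small simulation. This is my own toy setup, not the report's model: I assume each section score is a stable teacher component plus independent transitory noise, with total variance fixed at 1. The names `pearson` and `simulated_section_correlation`, and the parameter `persistent_share`, are hypothetical. Under those assumptions, the correlation between two sections taught by the same teacher should land close to the share of variance that is persistent.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulated_section_correlation(persistent_share, n_teachers=20000, seed=0):
    """Correlate two noisy section scores per simulated teacher.

    Assumption (mine): score = persistent part + transitory part, where the
    persistent part has variance `persistent_share` and the transitory part
    has variance 1 - persistent_share, drawn independently for each section.
    """
    rng = random.Random(seed)
    sd_persistent = persistent_share ** 0.5
    sd_transitory = (1.0 - persistent_share) ** 0.5
    section_a, section_b = [], []
    for _ in range(n_teachers):
        stable = rng.gauss(0.0, sd_persistent)
        section_a.append(stable + rng.gauss(0.0, sd_transitory))
        section_b.append(stable + rng.gauss(0.0, sd_transitory))
    return pearson(section_a, section_b)


if __name__ == "__main__":
    for share in (0.2, 0.4, 0.5):
        r = simulated_section_correlation(share)
        print(f"persistent share {share:.1f} -> simulated correlation {r:.3f}")
```

Read in reverse, this is the report's point: a between-section correlation of roughly .4 is what you would expect if only about 40% of the observed variation were stable, and that still assumes the stable part is all teacher.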

Below is a table of those correlations – taken from their Table #5.

Unfortunately, summaries of the Gates study seem to obsess on how relatively high the correlation is from year to year for teachers rated by student performance on the state math test (.404) and largely ignore how much lower many of the other correlations are. Why is the correlation for the ELA test under .20 and what does that say about the high-stakes usefulness of the approach? Like other studies evaluating the stability of value-added ratings, the correlations seem to run between .20 and .40, with some falling below .20. That’s not a very high correlation – which then suggests not a very high degree of precision in figuring out which individual teacher is a good teacher versus which one is bad. BUT THAT’S NOT THE POINT EITHER!

Now, the Gates study rightly points out that lower correlations do not mean that the information is entirely unimportant. The study focuses on what it calls “persistent” effects or “stable” effects, arguing that if there’s a ton of variation across classrooms and teachers, being able to explain even a portion of that variation is important – A portion of a lot is still something. A small slice of a huge pie may still provide some sustenance. The report notes:

“Assuming that the distribution of teacher effects is “bell-shaped” (that is, a normal distribution), this means that if one could accurately identify the subset of teachers with value-added in the top quartile, they would raise achievement for the average student in their class by .18 standard deviations relative to those assigned to the median teacher. Similarly, the worst quarter of teachers would lower achievement by .18 standard deviations. So the difference in average student achievement between having a top or bottom quartile teacher would be .36 standard deviations.” (p.19)

The language here is really, really, important, because it speaks to a theoretical and/or hypothetical difference between high and low performing teachers drawn from a very large analysis of teacher effects (across many teachers, classrooms, and multiple years). THIS DOES NOT SPEAK TO THE POSSIBILITY THAT WE CAN PRECISELY AND ACCURATELY IDENTIFY WHETHER ANY SINGLE TEACHER FALLS IN THE TOP OR BOTTOM GROUP! It’s a finding that makes sense when understood correctly but one that is ripe for misuse and misunderstanding.
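To illustrate why "if one could accurately identify" is carrying so much weight in that passage, here is a rough simulation sketch. Everything in it is an assumption of mine rather than anything from the Gates report: true teacher effects are standard normal, the observed rating is the true effect plus independent noise, and `reliability` is the hypothetical share of observed variance that is true signal. The function name `top_quartile_check` is also made up for this illustration.

```python
import random


def top_quartile_check(reliability, n_teachers=40000, seed=1):
    """Flag the top quartile on a noisy rating and compare with the truth.

    Returns (hit_rate, effect_ratio):
      hit_rate     = share of flagged teachers who are truly top quartile
      effect_ratio = mean true effect of flagged teachers divided by the
                     mean true effect of the real top quartile
    """
    rng = random.Random(seed)
    # Noise variance chosen so that var(true) / var(observed) == reliability.
    noise_sd = ((1.0 - reliability) / reliability) ** 0.5
    true_effects = [rng.gauss(0.0, 1.0) for _ in range(n_teachers)]
    observed = [t + rng.gauss(0.0, noise_sd) for t in true_effects]

    k = n_teachers // 4
    ranked_by_observed = sorted(
        range(n_teachers), key=lambda i: observed[i], reverse=True
    )
    ranked_by_true = sorted(
        range(n_teachers), key=lambda i: true_effects[i], reverse=True
    )
    flagged = ranked_by_observed[:k]
    real_top = set(ranked_by_true[:k])

    hit_rate = sum(1 for i in flagged if i in real_top) / k
    mean_true_flagged = sum(true_effects[i] for i in flagged) / k
    mean_true_real_top = sum(true_effects[i] for i in real_top) / k
    return hit_rate, mean_true_flagged / mean_true_real_top


if __name__ == "__main__":
    hit_rate, effect_ratio = top_quartile_check(reliability=0.4)
    print(f"flagged teachers who are truly top quartile: {hit_rate:.2f}")
    print(f"true effect of flagged group vs. real top quartile: {effect_ratio:.2f}")
```

In this toy setup the second number should sit near the square root of the reliability, so a hypothetical .18 standard deviation advantage for the true top quartile shrinks noticeably for the group we would actually flag, and a large share of the flagged teachers are not in the true top quartile at all.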

Yes, in probabilistic terms, this does suggest that if we implement mass layoffs in a system as large as NYC and base those layoffs on value-added measures, we have a pretty good chance of increasing value-added in later years – assuming our layoff policy does not change other conditions (class size, average quality of those in the system – replacement quality). But any improvements can be expected to be far, far, far less than the .18 figure used in the passage above. Even assuming no measurement error – that is, that the district is laying off the “right” teachers (a silly assumption) – the newly hired teachers can be expected to fall, at best, across the same normal curve. But I’ve discussed my taste for this approach to collateral damage in previous posts. In short, I believe it’s unnecessary and not that likely to play out as we might assume. (see discussion of reform engineers at bottom)

A Few more Technical Notes

Persistent or Stable Effects: The Gates report focuses on what it terms “persistent” effects of teachers on student value-added – assuming that these persistent effects represent the consistent, over time or across sections influence of a specific teacher on his/her students’ achievement gains. The report focuses on such “persistent” effects for a few reasons. First, the report uses this discussion to, I would argue, overplay the persistent influence teachers have on student outcomes – as in the quote above which is later used in the report to explain the share of the black-white achievement gap that could be closed by highly effective teachers. The assertion is that even if teacher effects explain a small portion of variations in student achievement gains, if variations in those gains are huge, then explaining a portion is important. Nonetheless, the persistent effects remain a relatively small portion (as high as a “modest” portion in some cases) – which dramatically reduces the precision with which we can identify the effectiveness of any one teacher (taking as given that the tests are the true measure of effectiveness – the validity concern).

AND, I would argue that it is a stretch to assume that the persistent effects within teachers are entirely a function of teacher effectiveness. The persistent effect of teachers may also include the persistent characteristics of students assigned to that teacher – that the teacher, year after year, and across sections is more likely to be assigned the more difficult students (or the more expert students). Persistent pattern yes. Persistent teacher effect? Perhaps partially (How much? Who knows?).

Like other studies, the identification of persistent effects from year to year, or across sections in the new Gates study merely reinforces that with more sections and/or more years of data (more students passing through) for any given teacher, we can gain a more stable value-added estimate and more precise indication of the value-added associated with the individual teacher. Again, the persistent effect may be a measure of the persistence of something other than the teacher’s actual effectiveness (teacher X always has the most disruptive kids, larger classes, noisiest/hottest/coldest – generally worst classroom).  The Gates study does not (BECAUSE IT WASN’T MEANT TO) assess how the error rate of identifying a teacher as “good” or “bad” changes with each additional year of data, but given that other findings are so consistent with other studies, I would suspect the error rate to be similar as well.
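The "more years, more stable estimate" logic can be sketched with the Spearman-Brown formula from classical test theory. To be clear, this is my illustration, not a calculation from the Gates report, and it leans on assumptions I have just argued are shaky: that the single-year correlation reflects a truly stable teacher component, and that the transitory part is independent from year to year. The function name `spearman_brown` and the input values are hypothetical.

```python
def spearman_brown(single_year_reliability, n_years):
    """Reliability of an average over n_years equally reliable measurements.

    Classical Spearman-Brown prophecy formula: k*r / (1 + (k - 1)*r).
    Assumes independent transitory noise across years (my assumption).
    """
    r = single_year_reliability
    k = n_years
    return k * r / (1.0 + (k - 1) * r)


if __name__ == "__main__":
    # Illustrative single-year values in the .20 to .40 range discussed above.
    for r in (0.2, 0.4):
        for years in (1, 2, 3, 5):
            print(f"r = {r:.1f}, {years} year(s): {spearman_brown(r, years):.3f}")
```

Even under these generous assumptions, a rating that starts at .2 needs several years of data to reach levels most people would consider moderately stable, and nothing in the formula separates the teacher from the persistently hot, noisy classroom.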

Differences Between Tests: The Gates study provides some useful comparisons of value-added ratings of teachers on one test, compared with ratings of the same teachers on another test – a) for kids in the same section in the same year, and b) for kids in different sections of classes with the same teacher.

Note that in a similar analysis, Corcoran, Jennings and Beveridge found:

“among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.”

Corcoran, Sean P., Jennifer L. Jennings, and Andrew A. Beveridge. 2010. “Teacher Effectiveness on High- and Low-Stakes Tests.” Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI.

That is, analysis of teacher value-added ratings on two separate tests called into question the extent to which individual teachers might accurately be classified as effective when using a single testing instrument. If we assume both tests to measure how effective a teacher is at teaching “math,” or a specific subject within “math,” then both tests should tell us the same thing about each teacher – which ones are truly effective math teachers and which ones are not. Corcoran’s findings raise serious questions about accuracy in this regard.

The Gates study argues that comparing teacher value-added across two math tests – where one is more conceptual – allows them to check that doing well on the state test did not compromise conceptual learning, as long as the state test results are correlated with results on the other, more conceptual test. That seems reasonable enough, to the extent that the testing instruments are being appropriately described (and to the extent they are valid instruments).  In terms of value-added ratings, the Gates study, like the Corcoran study, finds only a modest relationship between ratings of teachers based on one test and ratings based on the other:

“the correlation between a teacher’s value-added on the state test and their value-added on the Balanced Assessment in Math was .377 in the same section and .161 between sections.”

But the Gates study also explores the relationships between “persistent” components across tests – which must be done across sections taking the test in the same year (until subsequent years become available). They find:

“we estimate the correlation between the persistent component of teacher impacts on the state test and on BAM is moderately large, .54.”

“The correlation in the stable teacher component of ELA value-added and the Stanford 9 OE was lower, .37.”

I’m uncomfortable with the phrasing here that says – “persistent component of teacher impacts” – in part because there exist a number of other persistent conditions or factors that may be embedded in the persistent effect, as I discuss above. Setting that aside, however, what the authors are exploring is whether the correlated component – the portions of student performance on any given test that are assumed to represent teacher effectiveness – is similar between tests.
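For readers wondering how a between-section correlation as low as .161 can sit alongside a "persistent component" correlation of .54, one textbook route is Spearman's correction for attenuation: divide the observed cross-test correlation by the square root of the product of each measure's reliability. I do not know that this is the exact estimator used in the report, so treat the sketch below as an illustration of the general idea only. The function name `disattenuate` and the two reliability values are hypothetical placeholders, not figures from the study.

```python
def disattenuate(observed_r, reliability_x, reliability_y):
    """Spearman's correction for attenuation.

    Estimates the correlation between the stable components of two noisy
    measures from their observed correlation and their reliabilities.
    """
    return observed_r / (reliability_x * reliability_y) ** 0.5


if __name__ == "__main__":
    observed = 0.161  # between-section cross-test correlation quoted above
    # Hypothetical reliabilities, chosen only to show the mechanics.
    corrected = disattenuate(observed, reliability_x=0.4, reliability_y=0.25)
    print(f"observed {observed:.3f} -> corrected {corrected:.3f}")
```

The lower the assumed reliabilities, the more the correction inflates the observed figure. The corrected number describes a relationship between unobserved components across many teachers; the rating any individual teacher actually receives is still the noisy one.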

In any case, however, these correlations like the others in the Gates analysis are telling us how highly associated – or not – the assumed persistent component is across tests across many teachers teaching many sections of the same class.  This allows the authors to assert that across all of these teachers and the various sections they teach, there is a “moderately” large relationship between student performance on the two different tests, supporting the authors’ argument that one test somewhat validates the other. But again, this analysis, like the others in the report, does not suggest by any stretch of the imagination that either one test or the other will allow us to precisely identify the good teacher versus the bad one. There is still a significant amount of reshuffling going on in teacher ratings from one test to the next, even with the same students in the same class sections in the same year. And, of course, good teaching is not synonymous with raising a student’s test scores.

This analysis does suggest that we might – by using several tests – get a more accurate picture of student performance and how it varies across teachers, and does at least suggest that across multiple tests – if the persistent component is correlated – just like across multiple years – we might get a more stable picture of which teachers are doing better/worse.  Precise enough for high stakes decisions (and besides, how much more testing can we/they handle?)? I’m still not confident that’s the case.

Value-added is the best predictor of itself

This seems to be one of the findings that gets the most media-play (and was the basis of Senator Johnston’s proud tweets). Of course value-added is a better predictor of future value-added (on the same test and with the same model) than other factors are of future value-added – even if value-added is only a weak predictor of future (or different section) value-added. Amazingly, however, many of the student survey responses on factors related to things like “Challenge” seem almost as related to value-added as value-added is to itself. That is a surprising finding, and I’m not sure yet what to make of it. [Note that the correlations between student ratings and VAM were for the same class and year, whereas VAM predicting VAM is a) across sections and b) across years.]

Again, the main problem with this VAM predicts VAM argument is that it assumes value-added ratings in the subsequent year to be THE valid measure of the desired outcome. But that’s the part we just don’t yet know. Perhaps the student perceptions are actually a more valid representation of good teaching than the value-added measure? Perhaps we should flip the question around? It does seem reasonable enough to assume that we want to see students improve their knowledge and skills in measurable ways on high quality assessments. Whether our current batch of assessments, as we are currently using them and as they are being used in this analysis, accomplish that goal remains questionable.

What is perhaps most useful about the Gates study and future research questions is that it begins to explore with greater depth and breadth the other factors that are – and are not – associated with student achievement gains.

Findings apply to a relatively small share of teachers

I have noted in other blog posts on this topic that in the best of cases (or perhaps worst if we actually followed through with it), we might apply value added ratings to somewhat less than 20% of teachers – those directly responsible and solely responsible for teaching reading or math to insulated clusters of children in grades 3 to 8 – well… 4-8, actually … since many VA models use annual data and the testing starts with grade 3. Even for the elementary school teachers who could be rated, the content of the ratings would exclude a great deal of what they teach. Note that most of the interesting findings in the new Gates study are those which allow us to evaluate the correlations of teachers across different sections of the same course in addition to subsequent years. These comparisons can only be made at the middle school level (and/or upper elementary, if taught by section). Further, many of the language arts correlations were very low, limiting the more interesting discussions to math alone. That is, we need to keep in mind that in this particular study, most of the interesting findings apply to no more than 5% to 10% of teachers – those involved in teaching math in the upper elementary and middle grades – specifically those teaching multiple sections of the same math content each year.