I was immediately intrigued the other day when a friend passed along a link to the recent technical report on the New York State growth model, the results of which are expected/required to be integrated into district level teacher and principal evaluation systems under that state’s new teacher evaluation regulations. I did as I often do and went straight for the pictures – in this case, the scatterplots of the relationships between various “other” measures and the teacher and principal “effect” measures. There was plenty of interesting stuff there, some of which I’ll discuss below.

But then I went to the written language of the report – specifically the report’s (albeit in DRAFT form) conclusions. The conclusions were only two short paragraphs long, despite much to ponder being provided in the body of the report. The authors’ main conclusion was as follows:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

http://engageny.org/wp-content/uploads/2012/06/growth-model-11-12-air-technical-report.pdf

13-Nov-2012 20:54

**Updated Final Report: http://engageny.org/sites/default/files/resource/attachments/growth-model-11-12-air-technical-report_0.pdf**

Local copy of original DRAFT report: growth-model-11-12-air-technical-report

Local copy of FINAL report: growth-model-11-12-air-technical-report_FINAL

Unfortunately, the multitude of graphs that immediately preceded this conclusion undermine it entirely. But first, allow me to address the egregious conceptual problems with the framing of this conclusion.

**First Conceptually**

Let’s start with the low hanging fruit here. First and foremost, nowhere in the technical report, nowhere in their data analyses, do the authors actually measure “individual teacher and principal effectiveness.” And quite honestly, I don’t give a crap if the “specific regulatory requirements” refer to such measures in these terms. If that’s what the author is referring to in this language, that’s a pathetic copout. Indeed it may have been their charge to “measure individual teacher and principal effectiveness based on requirements stated in XYZ.” That’s how contracts for such work are often stated. But that does not obligate the author to conclude that this is actually what has been statistically accomplished. And I’m just getting started.

So, what is being measured and reported? At best, what we have are:

- An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain classrooms.

THIS IS NOT TO BE CONFLATED WITH “TEACHER EFFECTIVENESS”

Rather, it is merely a classroom aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in those classrooms, for a group of children who happen to spend a minority share of their day and year in those classrooms.

- An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain schools.

THIS IS NOT TO BE CONFLATED WITH “PRINCIPAL EFFECTIVENESS”

Rather, it is merely a school aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in classrooms that are housed in a given school under the leadership of perhaps one or more principals, vps, etc., for a group of children who happen to spend a minority share of their day and year in those classrooms.

**Now Statistically**

Following are a series of charts presented in the technical report, immediately preceding the above conclusion.

*Classroom Level Rating Bias*

*School Level Rating Bias*

And there are many more figures displaying more subtle biases, but biases that for clusters of teachers may be quite significant and consequential.

Based on the figures above, there certainly appears to be, both at the teacher, excuse me – classroom, and principal – I mean school level, substantial bias in the Mean Growth Percentile ratings with respect to initial performance levels on both math and reading. Teachers with students who had higher starting scores and principals in schools with higher starting scores tended to have higher Mean Growth Percentiles.
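To make "bias" concrete: if the ratings were unrelated to starting scores, the correlation between a classroom's mean prior score and its Mean Growth Percentile would sit near zero. Here is a minimal sketch of that check in Python. The classroom numbers are invented for illustration and are not taken from the report; `pearson_r` is just a helper name I'm using here.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# HYPOTHETICAL classrooms: (mean prior-year scale score, mean growth percentile).
# These values are made up to show the check, not drawn from the AIR report.
classrooms = [(640, 42), (655, 46), (668, 49), (675, 53), (690, 58)]
prior = [c[0] for c in classrooms]
mgp = [c[1] for c in classrooms]

r = pearson_r(prior, mgp)
print(f"correlation between prior score and MGP: {r:.2f}")
```

A value well above zero on real data is the pattern the scatterplots show: the rating tracks who the students were coming in.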

This might occur for several reasons. First, it might just be that the tests used to generate the MGPs are scaled such that it’s just easier to achieve growth in the upper ranges of scores. I came to a similar finding of bias in the NYC value added model, where schools having higher starting math scores showed higher value added. So perhaps something is going on here. It might also be that students clustered among higher performing peers tend to do better. And, it’s at least conceivable that students who previously had strong teachers and remain clustered together from year to year, continue to show strong growth. What is less likely is that many of the actual “better” teachers just so happen to be teaching the kids who had better scores to begin with.

That the systemic bias appears greater in the school level estimates than in the teacher level estimates is suggestive that the teacher level estimates may actually be even more biased than they appear. The aggregation of otherwise less biased estimates should not reveal more bias.
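One mechanism consistent with this is that noise in the classroom level estimates partly hides a bias that becomes visible once averaging to the school level cancels some of that noise. The toy simulation below is a sketch under assumptions I picked for illustration (200 schools, 10 teachers each, a fixed bias slope, normally distributed noise). It is not the report's model, and the simulated MGPs are not clamped to the 0-100 range.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def simulate(n_schools=200, teachers_per_school=10,
             slope=5.0, noise_sd=10.0, seed=0):
    """Return (teacher-level r, school-level r) for one simulated state.

    All parameters are ASSUMED values chosen for illustration only.
    Every classroom gets the same bias (slope * prior score) plus
    independent noise; schools are simple averages of their classrooms.
    """
    rng = random.Random(seed)
    t_prior, t_mgp, s_prior, s_mgp = [], [], [], []
    for _ in range(n_schools):
        school_level = rng.gauss(0, 1)  # school's typical prior score (standardized)
        priors, mgps = [], []
        for _ in range(teachers_per_school):
            prior = school_level + rng.gauss(0, 0.3)
            mgp = 50 + slope * prior + rng.gauss(0, noise_sd)
            priors.append(prior)
            mgps.append(mgp)
        t_prior.extend(priors)
        t_mgp.extend(mgps)
        s_prior.append(sum(priors) / len(priors))
        s_mgp.append(sum(mgps) / len(mgps))
    return pearson_r(t_prior, t_mgp), pearson_r(s_prior, s_mgp)

r_teacher, r_school = simulate()
print(f"teacher-level r = {r_teacher:.2f}, school-level r = {r_school:.2f}")
```

In this setup the bias built into every classroom is identical, yet the school level correlation comes out larger because the noise averages away. So a weaker-looking classroom scatterplot does not mean the classroom ratings are less biased.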

Further, as I’ve mentioned several times on this blog previously, even if there weren’t such glaringly apparent overall patterns of bias there still might be underlying biased clusters. That is, groups of teachers serving certain types of students might have ratings that are substantially WRONG, either in relation to observed characteristics of the students they serve or their settings, or of unobserved characteristics.

**Closing Thoughts**

To be blunt – the measures are neither conceptually nor statistically accurate. They suffer significant bias, as shown and then completely ignored by the authors. And inaccurate measures can’t be fair. Characterizing them as such is irresponsible.

I’ve now written 2 articles and numerous blog posts in which I have raised concerns about the likely overly rigid use of these very types of metrics when making high stakes personnel decisions. I have pointed out that misuse of this information may raise significant legal concerns. That is, when district administrators do start making teacher or principal dismissal decisions based on these data, some very interesting litigation will likely follow over whether this information really is sufficient for upholding due process (depending largely on how it is applied in the process).

I have pointed out that the originators of the SGP approach have stated in numerous technical documents and academic papers that SGPs are intended to be a descriptive tool and are not for making causal assertions (they are not for “attribution of responsibility”) regarding teacher effects on student outcomes. Yet, the authors persist in encouraging states and local districts to do just that. I certainly expect to see them called to the witness stand the first time SGP information is misused to attribute student failure to a teacher.

But the case of the NY-AIR technical report is somewhat more disconcerting. Here, we have a technically proficient author working for a highly respected organization – American Institutes for Research – ignoring all of the statistical red flags (after waving them), and seemingly oblivious to gaping conceptual holes (commonly understood limitations) between the actual statistical analyses presented and the concluding statements made (and language used throughout).

The conclusions are **WRONG** – **statistically and conceptually**. And the author needs to recognize that being so damn bluntly wrong may be consequential for the livelihoods of thousands of individual teachers and principals! Yes, it is indeed another leap for a local school administrator to use their state approved evaluation framework, coupled with these measures, to actually decide to adversely affect the livelihood and potential career of some wrongly classified teacher or principal – but the author of this report has given them the tool and provided his blessing. And that’s inexcusable.


==================

Note: In the executive summary, the report acknowledges these biases:

Despite the model conditioning on prior year test scores, schools and teachers with students who had higher prior year test scores, on average, had higher MGPs. Teachers of classes with higher percentages of economically disadvantaged students had lower MGPs.

But then blows them off throughout the remainder of the report, and never mentions that this might be important.


Bruce, out of curiosity: first, are you aware of any VAMs that use more than just two points to measure year-to-year progress? Would more measures in one year improve the stability of the growth measure? Second, how do VAM growth indexes compare to just regular school grades? To teacher evaluations such as those based on the Danielson instrument? Are these measures of teacher quality any more stable over time?

Yes… some VAMs in research applications attempt to use a long string of pupil prior scores. This does help in reducing non-random sorting bias. But, VAMs in practice tend not to do this because it further reduces the number of teachers to which ratings can be applied. If you shoot for a string of 3 prior scores, and have annual data, then you have to drop ratings for 3rd, 4th and 5th grade teachers, and start with 6th. There is an increasing body of work looking at the relationship between alternative measures and VAM estimates… but I can’t recall off the top of my head… papers that look specifically at stability of other measures over time. Most of this research takes a narrow perspective that VAM is the true effectiveness measure and that all other measures are validated only by their relationship to VAM. Thus VAM is best because even though it’s only marginally correlated with itself over time, it’s more correlated with itself than other measures are with it. That’s a huge conceptual flaw/logical flaw in much of this work.
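The grade arithmetic in that reply can be sketched in a few lines of Python. The assumption here is annual testing that begins in grade 3 and runs through grade 8; the function name and defaults are mine, not anything from a real VAM package.

```python
def grades_with_ratings(priors_required, first_tested=3, last_tested=8):
    """Grades whose students can have `priors_required` consecutive prior-year scores.

    ASSUMES one test per year from `first_tested` through `last_tested`,
    so a student in grade g has at most g - first_tested prior scores.
    """
    if priors_required < 0:
        raise ValueError("priors_required must be non-negative")
    return [g for g in range(first_tested, last_tested + 1)
            if g - first_tested >= priors_required]

for k in (1, 2, 3):
    print(f"{k} prior score(s) required -> rated grades: {grades_with_ratings(k)}")
```

Each additional prior score required removes one more grade of teachers from the pool that can be rated at all.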

“There is an increasing body of work looking at the relationship between alternative measures and VAM estimates… but I can’t recall off the top of my head… papers that look specifically at stability of other measures over time.”

Are you thinking about MET (2012) that analyzed administrator ratings and student survey ratings against VAM outcomes?

See EVAAS. http://www.sas.com/govedu/edu/k12/evaas/index.html

EVAAS, and its twin sister TVAAS, use a compilation of many years of data, across subject areas, to project student growth scores. There are no covariates as each historic score supposedly serves as a control for SES and other concomitant variables.

…Sorry – I should have also mentioned that EVAAS requires 3 years of data to tone down error bands and give the truest estimation.

Much of the bias is from peer effects–at the school and the teacher level. Both have been well established in the literature. They have been associated with tracking since the 1980s. (See Veldman, D. J., & Sanford, J. P. (1984). The influence of class ability level on student achievement and classroom behavior. American Educational Research Journal, 21, 629-644). By the way, lower ability students are more influenced than higher ability students. Teachers with large shares of such students are particularly affected.

There was also an interesting study after Katrina which demonstrated how the entrance of a large share of low achieving students into certain Houston schools adversely affected the test scores of the native Houston students. (an NBER working paper entitled KATRINA’S CHILDREN: EVIDENCE ON THE STRUCTURE OF PEER EFFECTS FROM HURRICANE EVACUEES) Remember also that there is no control for school type, class size, resource allocation. The huge bias in the principal #s will predict, I believe, a correlation with factors such as per pupil spending.

I would certainly expect that one potential cause of the relationship between higher starting scores and MGPs might be peer effect. But I do also suspect some underlying scaling issues – floor effect in the assessments. I do like Imberman’s paper on Katrina effects in Houston. He has many related papers… which shed light on such issues under extreme conditions.

Indeed, the AIR MGP model has its numerous shortcomings… but I would suspect that even if they did try to introduce the additional covariates, some of these biases would persist… as I show in the NYC VAM estimates which are based on a richer model.

Thanks for this analysis…what I find most interesting is the bell curvedness (sic) of the data. I am continually amazed at the powers’ denial of variations in human potential, never mind the role of poverty in academic achievement.

The “bell curvedness (sic) of the data” is perhaps not at all accidental. Most state tests of achievement dating back to the RCT’s have had a significant positive correlation (r=.60) to students’ IQ (the psychometric third rail!).

If in fact there remains this test bias then the tests are unfair by definition.

And even “worser” (sic) then is the probability that in effect we are rating teachers on the innate differences that the students bring to school and not what they have actually learned there.

And yes, you can still add to this the “role of poverty”, as well.

The only place that you can find the report is in Bruce’s local posting. They took the report off the NYSED website. Good job, Bruce 🙂

Thank you for your work and for acknowledging that none of the measures are statistically accurate. Not only is this irresponsible, it is a waste of tax payer dollars and ultimately injurious to all of the children in the State of New York.

Peer effect, yes, but also other Matthew effects. The reasons for Matthew effects are legion, but let me just give one small one that I’ve noticed this year in my own classroom. People learn vocabulary most effectively by reading. But if you are a weaker reader, it’s harder for you to understand a book with a lot of words in it that you don’t know, so you will tend to read books that don’t contain as many difficult words. Kids who are stronger readers are able to read books with more words they don’t know, and so they learn more words, so they are able to read more difficult books, etc. The rich get richer. There are dozens and dozens of these feedback loops operating…

The pattern of correlations in your charts would also be consistent with the possibility that better teachers really are assigned to classrooms with higher-performing children or that better principals move to schools with higher-performing children.

Ah, yes. The next academic fad and new upstart career for young lawyer types: education litigation. Or, as in my case, an alternative career after twenty years in the classroom. May even be the answer to our nation’s “unemployment problem”. Get those law school apps completed ASAP and beat the rush!

N. Wilson has demonstrated that the concepts and process of educational standards and standardized testing are rife with errors which render the whole process invalid. Therefore any conclusions drawn from said process are, in his words, “vain and illusory”. VAM and other evaluation schemes that use standardized test scores are by definition false, illogical, irrational, and UNETHICAL. See “Educational Standards and the Problem of Error” found at:

http://epaa.asu.edu/ojs/article/view/577/700

and “A Little Less than Valid: An Essay Review” found at:

http://www.edrev.info/essays/v10n5.pdf

The possible litigation issues will probably be overcome by instituting the rhetoric that “multiple sources” (both qualitative from administrators and quantitative from VAM), coupled with the ‘estimation’ argument, lead to better and more accurate teacher evaluation schemes than what we had before.

That will be the case, Daniel, when they are congruent, but what happens when they are not? When, for example, a principal rates the teacher highly effective based on other measures, or when the opposite is true. This will certainly happen. What worries me the most are the cases when an excellent teacher receives a low score on the test score component and then principals either second guess their own judgement or are pressured into giving a fine teacher a lower score. That is already happening in Tennessee. Mark my words, test scores will in the end trump all.

I agree Carol – when I read TN’s First to the Top report, I thought TN made it clear they want to know why there was some discrepancy between administrator ratings and VAM outcomes. Also, in the HISD study (from Beardsley), it was shown that administrators were changing ratings to stay within the boundaries of VAM outcomes.

So many variables are involved, so little control – especially for teachers. I can’t think of another profession in which workers are evaluated on so many things completely out of their control.

My personal experience as a recently retired superintendent in a VERY high performing school was that the lack of “head room” on the testing scale meant that our teachers couldn’t move the mean scores at all… some of the local budget hawks used this as evidence that either our teachers were mediocre or that the high test scores were the result of something they called “the Palo Alto” effect… that is, if you have smart parents who teach at an ivy league college or work as doctors in the hospital OF COURSE the kids are going to do well on tests… I have yet to read ANY scholarly analysis of VAM that supports its use…