When Disinformation is Fueled by Misinformation! CHANCELLOR TISCH, YOU ARE WRONG!


Very recently, I posted a critique of the recent technical report on New York State median growth percentiles to be used in that state’s teacher and principal evaluation system.

Today, I read this piece in the NY Post – an editorial by NY State Board of Regents Chancellor Merryl Tisch, and well, MY HEAD ALMOST EXPLODED!

The point of the editorial is to encourage NY City’s teachers and DOE to agree to a teacher evaluation system based on supposedly objective measures – where “objective measures” seems largely to be code language for estimates of teacher effectiveness derived from student assessment data.

First, I have written several previous posts on the usefulness of NYC’s value-added model for determining teacher effectiveness.

  1. the NYC VAM model retains some persistent biases
  2. the NYC VAM model is highly unstable from year to year
  3. the NYC VAM results capture only a handful of teachers per school and their results tend to jump all over the place
  4. adopting the NCTQ irreplaceables logic, the NYC VAM data are so noisy that few if any teachers are persistently irreplaceable
  5. for various reasons, it is unlikely that these are just early glitches in the system that will get better with time

Setting aside this long list of concerns about the NYC VAM results, I now turn to the NYSED – state median growth percentile data (which actually seem inferior to the NYC VAM model/estimates). In her editorial, Chancellor Tisch proclaims:

The student-growth scores provided by the state for teacher evaluations are adjusted for factors such as students who are English Language Learners, students with disabilities and students living in poverty. When used right, growth data from student assessments provide an objective measurement of student achievement and, by extension, teacher performance.

Let me be blunt here. CHANCELLOR TISCH – YOU ARE WRONG! FLAT OUT WRONG! IRRESPONSIBLY & PERHAPS NEGLIGENTLY WRONG!

[now, one might quibble that Chancellor Tisch has merely stated that the measures are “adjusted for” certain factors and she has not claimed that those adjustments actually work to eliminate bias. Further, she has merely declared that the measures are “objective” and not that they are accurate or precise. Personally, I don’t find this deceptive language at all comforting!]

Indeed, the measures attempt – but fail to sufficiently adjust for key factors. They retain substantial biases as identified in the state’s own technical report. And they are subject to many of the same error concerns as the NYC VAM model.  Given the findings of the state’s own technical report, it is irresponsible to suggest that these measures can and should be immediately considered for making personnel and compensation decisions.

Finally, as I laid out in my previous blog post, to suggest that “growth data from student assessments provide an objective measure of student achievement, and, by extension, teacher performance” IS A HUGE UNWARRANTED STRETCH!

While I might concur with the follow-up statement from Chancellor Tisch that “We should never judge an educator solely by test scores, but we shouldn’t completely disregard student performance and growth either,” I would argue that school leaders/peer teachers/personnel managers should absolutely have the option to completely disregard data that have high potential to be sending false signals, either as a function of persistent bias or error. Requiring action based on biased and error prone data (rather than permitting those data to be reasonably mined to the extent they may, OR MAY NOT, be useful) is a toxic formula for public schooling quality.

The one thing I can’t quite figure out here is which is the misinformation and which is the disinformation. In any case, both are wrong!

The rest of what I have to say, I’ve already said. But, so readers don’t have to click the link below to access the previous post, I’ve pasted the entire thing below. Enjoy!

COMPLETE PREVIOUS POST!

I was immediately intrigued the other day when a friend passed along a link to the recent technical report on the New York State growth model, the results of which are expected/required to be integrated into district level teacher and principal evaluation systems under that state’s new teacher evaluation regulations. I did as I often do and went straight for the pictures – in this case, the scatterplots of the relationships between various “other” measures and the teacher and principal “effect” measures. There was plenty of interesting stuff there, some of which I’ll discuss below.

But then I went to the written language of the report – specifically the report’s (albeit in DRAFT form)  conclusions. The conclusions were only two short paragraphs long, despite much to ponder being provided in the body of the report. The authors’ main conclusion was as follows:

The model selected to estimate growth scores for New York State provides a fair and accurate method for estimating individual teacher and principal effectiveness based on specific regulatory requirements for a “growth model” in the 2011-2012 school year. p. 40

http://engageny.org/wp-content/uploads/2012/06/growth-model-11-12-air-technical-report.pdf

13-Nov-2012 20:54

Updated Final Report: http://engageny.org/sites/default/files/resource/attachments/growth-model-11-12-air-technical-report_0.pdf

Local copy of original DRAFT report: growth-model-11-12-air-technical-report

Local copy of FINAL report: growth-model-11-12-air-technical-report_FINAL

Unfortunately, the multitude of graphs that immediately preceded this conclusion undermine it entirely. But first, allow me to address the egregious conceptual problems with the framing of this conclusion.

First Conceptually

Let’s start with the low hanging fruit here. First and foremost, nowhere in the technical report, nowhere in their data analyses, do the authors actually measure “individual teacher and principal effectiveness.” And quite honestly, I don’t give a crap if the “specific regulatory requirements” refer to such measures in these terms. If that’s what the author is referring to in this language, that’s a pathetic copout.  Indeed it may have been their charge to “measure individual teacher and principal effectiveness based on requirements stated in XYZ.” That’s how contracts for such work are often stated. But that does not obligate the author to conclude that this is actually what has been statistically accomplished. And I’m just getting started.

So, what is being measured and reported?  At best, what we have are:

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain classrooms.

THIS IS NOT TO BE CONFLATED WITH “TEACHER EFFECTIVENESS”

Rather, it is merely a classroom aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in those classrooms, for a group of children who happen to spend a minority share of their day and year in those classrooms.

  • An estimate of student relative test score change on one assessment each for ELA and Math (scaled to growth percentile) for students who happen to be clustered in certain schools.

THIS IS NOT TO BE CONFLATED WITH “PRINCIPAL EFFECTIVENESS”

Rather, it is merely a school aggregate statistical association based on data points pertaining to two subjects being addressed by teachers in classrooms that are housed in a given school under the leadership of perhaps one or more principals, vps, etc., for a group of children who happen to spend a minority share of their day and year in those classrooms.

Now Statistically

Following are a series of charts presented in the technical report, immediately preceding the above conclusion.

[Figure: Classroom Level Rating Bias]

[Figure: School Level Rating Bias]

And there are many more figures displaying more subtle biases, but biases that for clusters of teachers may be quite significant and consequential.

Based on the figures above, there certainly appears to be, both at the teacher, excuse me – classroom, and principal – I mean school level, substantial bias in the Mean Growth Percentile ratings with respect to initial performance levels on both math and reading. Teachers with students who had higher starting scores and principals in schools with higher starting scores tended to have higher Mean Growth Percentiles.
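One rough way to check for the pattern described above is to correlate each classroom's average prior score with its Mean Growth Percentile. The sketch below uses made-up numbers (not values from the technical report), and the names `pearson`, `prior_mean_score`, and `mgp` are hypothetical. If ratings were unrelated to starting scores, the correlation would sit near zero; a clearly positive value is the kind of tilt the scatterplots suggest. On its own, a check like this cannot say why the tilt exists.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical classrooms: illustrative values only, NOT from the report.
prior_mean_score = [650, 660, 670, 680, 690, 700]  # average starting scale score
mgp = [42, 45, 49, 50, 55, 58]                     # classroom Mean Growth Percentile

r = pearson(prior_mean_score, mgp)
print(f"correlation between prior score and MGP: {r:.2f}")
```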

This might occur for several reasons. First, it might just be that the tests used to generate the MGPs are scaled such that it’s just easier to achieve growth in the upper ranges of scores. I came to a similar finding of bias in the NYC value added model, where schools having higher starting math scores showed higher value added. So perhaps something is going on here. It might also be that students clustered among higher performing peers tend to do better. And, it’s at least conceivable that students who previously had strong teachers and remain clustered together from year to year, continue to show strong growth. What is less likely is that many of the actual “better” teachers just so happen to be teaching the kids who had better scores to begin with.

That the systemic bias appears greater in the school level estimates than in the teacher level estimates suggests that the teacher level estimates may actually be even more biased than they appear. The aggregation of otherwise less biased estimates should not reveal more bias.
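The reasoning here can be illustrated with a small simulation, under assumptions that are mine and not the report's: every classroom rating carries the same built-in tilt toward prior scores (`slope`), plus independent noise. Averaging classrooms into schools cancels much of the noise but none of the tilt, so the same underlying bias looks stronger at the school level. This is one mechanism consistent with the argument, not a reconstruction of what happened in the NY data.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate(n_schools=200, classes_per_school=10, slope=0.3, seed=0):
    """Return (classroom-level r, school-level r) for simulated ratings.

    All parameters are hypothetical. Each rating = slope * prior + noise,
    so the bias toward prior scores is identical in every classroom.
    """
    rng = random.Random(seed)
    class_prior, class_rating = [], []
    school_prior, school_rating = [], []
    for _ in range(n_schools):
        school_level = rng.gauss(0, 1)  # schools differ in average prior score
        priors, ratings = [], []
        for _ in range(classes_per_school):
            prior = school_level + rng.gauss(0, 0.5)
            rating = slope * prior + rng.gauss(0, 1)  # same tilt, plus noise
            priors.append(prior)
            ratings.append(rating)
        class_prior.extend(priors)
        class_rating.extend(ratings)
        # School-level values are simple averages of the classroom values.
        school_prior.append(sum(priors) / len(priors))
        school_rating.append(sum(ratings) / len(ratings))
    return pearson(class_prior, class_rating), pearson(school_prior, school_rating)

class_r, school_r = simulate()
print(f"classroom-level correlation: {class_r:.2f}")
print(f"school-level correlation:    {school_r:.2f}")
```

In this setup the classroom-level scatter understates a bias that is present in every single rating, which is the sense in which classroom estimates may be "more biased than they appear."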

Further, as I’ve mentioned several times on this blog previously, even if there weren’t such glaringly apparent overall patterns of bias, there still might be underlying biased clusters. That is, groups of teachers serving certain types of students might have ratings that are substantially WRONG, either in relation to observed characteristics of the students they serve or their settings, or of unobserved characteristics.

Closing Thoughts

To be blunt – the measures are neither conceptually nor statistically accurate. They suffer significant bias, as shown and then completely ignored by the authors. And inaccurate measures can’t be fair. Characterizing them as such is irresponsible.

I’ve now written 2 articles and numerous blog posts in which I have raised concerns about the likely overly rigid use of these very types of metrics when making high stakes personnel decisions. I have pointed out that misuse of this information may raise significant legal concerns. That is, when district administrators do start making teacher or principal dismissal decisions based on these data, there will likely follow some very interesting litigation over whether this information really is sufficient for upholding due process (depending largely on how it is applied in the process).

I have pointed out that the originators of the SGP approach have stated in numerous technical documents and academic papers that SGPs are intended to be a descriptive tool and are not for making causal assertions (they are not for “attribution of responsibility”) regarding teacher effects on student outcomes. Yet, the authors persist in encouraging states and local districts to do just that. I certainly expect to see them called to the witness stand the first time SGP information is misused to attribute student failure to a teacher.

But the case of the NY-AIR technical report is somewhat more disconcerting. Here, we have a technically proficient author working for a highly respected organization – American Institutes for Research – ignoring all of the statistical red flags (after waving them), and seemingly oblivious to gaping conceptual holes (commonly understood limitations) between the actual statistical analyses presented and the concluding statements made (and language used throughout).

The conclusions are WRONG statistically and conceptually. And the author needs to recognize that being so damn bluntly wrong may be consequential for the livelihoods of thousands of individual teachers and principals! Yes, it is indeed another leap for a local school administrator to use their state approved evaluation framework, coupled with these measures, to actually decide to adversely affect the livelihood and potential career of some wrongly classified teacher or principal – but the author of this report has given them the tool and provided his blessing. And that’s inexcusable.

18 Comments

  1. Bruce,

    She obviously neglected to read the studies that indicate that even when multiple value-added models (that address multiple variables) are put to the test, they fail to substantiate meaning behind “teacher effectiveness.” Even worse, models that account for student characteristics will create disincentives for teachers to work with students with the greatest needs.

    Lou

  2. I sense that this blog post will be a key document in the argument made by that litigating teacher or principal. This will only be settled in court. Sadly, many teachers and principals will be harmed in the process of getting there.

  3. Bruce-

    Excellent post(s)! I particularly appreciated your very clear discussion of the evidence of bias in the SGP measures, the inappropriateness of making causal attributions of effectiveness, and ultimately the validity of using these estimates as measures of teacher or principal effectiveness. However, I’d like to focus on Tisch’s claim regarding the “objectivity” of these measures, one that is frequently made by others.

    One great advantage value-added (VAM) or student growth percentile (SGP) proponents such as Tisch claim for multivariate statistical estimates of teacher effects is that they provide an objective measure of teacher performance. To this I would say that yes, the estimates produced are objective in the sense that they are not possibly influenced by any form of personal judgment about the individual being evaluated. That no doubt was the proponents’ intended meaning of the term “objective.” However, it is important for educators and policymakers to keep in mind that in the process of devising VAMs/SGPs, gathering data, and making innumerable methodological decisions prior to producing estimates, countless judgment calls are made by the developers.

    As your blog has amply documented, previous research has indicated that value-added attributions of teacher effectiveness are sensitive to a large array of choices, for example the use of alternate tests of the same subject matter (Papay, 2011; Rothstein, 2011; Corcoran et al., 2011). Recently, one study found, among other things, that teacher effects on high stakes tests decay faster than on low stakes ones (Corcoran et al., 2011). The value-added estimates for teachers on high- and low-stakes exams correlate only at a “moderate” level (ibid.).

    Given the state of the science (and “art?”) of value-added measurement, these methodological decisions cannot be entirely objective. Every time states make a “policy decision” about what VAM/SGP vendor to use, the choice of tests, control variables to include in the model, reference group(s) for calculating growth, number of years of data to require, and cutpoints delineating “ineffectiveness,” etc., they make pragmatic or value judgments that stray even further from common notions of objectivity. These judgment calls all precede the calculation of the estimate, and they may be overlooked when one only considers the “objective” score that results. These judgments, particularly ones involving modeling choices, may systematically advantage or disadvantage certain groups of teachers.

    So I would ask Chancellor Tisch, what comprehensive and “objective” process did New York State use to determine that the SGP it chose was the best available means for measuring teacher effectiveness currently on the market? Your blog provides compelling reasons to doubt.

  4. It is easier for me to believe that this is more about diminishing the political power of unions, by demoralizing and frightening teachers. Anyone who has spent more than three or four years in a real public school, teaching real public school students, knows that no formula can measure the job that needs to be done. The people being appointed to positions of power and respect in education today rarely meet this minimum criterion and have only well heeled and well connected allies, using PR to circle the wagons around the most significant sources of student struggles, continually distracting and handicapping those with the power to help with this educational idiocy.

  5. Here is yet another factor not accounted for in these “objective” evaluations: The students I teach all have emotional and/or behavioral disorders. During the last testing cycle, one of my students, who works at a level in the classroom far above the level that standardized tests tell us he should be able to work at, stood up halfway through the test and said, “F this test, it’s boring!” He then stormed out of the room and refused to finish. This student’s test scores depend on whether he took his medication, had an altercation on the bus or in the hallway, or even if he slept the night before. The only consistent thing about his test scores is that they consistently underestimate his abilities and classroom performance.

    His standardized test scores give us zero useable information as to his abilities or my effectiveness as a teacher. This same student writes wonderful essays, excels in science, and will often finish his work early and spend the rest of the class helping other students who have not yet mastered the day’s challenge.

    1. From this teacher’s perspective I couldn’t care less about “student achievement”. That phrase is the edudeformer’s term. It implies an end product. The teaching and learning process is just that–a process really without beginning or ending.

      1. The message promotes a (poorly designed) system based on knowingly deeply flawed information, requiring heavy reliance on that information and deceiving the public and end users of that information into believing that it is useful, reliable and valid. Therefore the message itself is wrong, deceptive and unethical.

        For more on the basic structural problems of that system and others like it, see: https://schoolfinance101.wordpress.com/2012/04/19/the-toxic-trifecta-bad-measurement-evolving-teacher-evaluation-policies/

        I have a forthcoming academic article that lays out greater detail on these issues.

      2. A definition of achievement is something accomplished successfully. It’s not a dirty word.
        I would say graduating from high school is an achievement. Aren’t we arguing on the same side? The state tests are being used in ways they were not meant to be. That does not negate the fact that low graduation rates and poor achievement are highly correlated with SES.

      3. Indeed measures of this type can/should be used thoughtfully. I’d rather have the measures than not. I’d even like to try modeling effects with the measures to figure out what’s going on in my system/school, were I in such a position. Yes…the problem is that the Chancellor is advocating misuse of a version of the measures – being used for a purpose other than its original intent – that also happens to be proven biased. I didn’t see that you were responding to someone else.

  6. Let’s assume all standardized tests are made by people who are educated and trained in all the available statistics’ knowledge out there. The claims are these tests are biased and should not be used to rate teachers or principals. So how can we justify using tests made by a classroom teacher with no formal training in making objective assessments, to rate, in the form of a grade, students? Shall we eliminate all tests then? Just use good subjective judgements to determine whether a student has learned based on the teacher’s best judgement?

    1. I am not arguing here that the tests themselves are biased. Nor am I arguing against thoughtful, responsible use of testing data. What I am pointing out here is that the AIR MGP model being used to estimate the teacher effect on student test score growth produces biased estimates of teacher effect – as acknowledged in the technical report. That is, the statistical model of the data produced biased statistical estimates – estimates, where, for example, teachers who have classes of kids who scored higher to begin with are more likely to get a higher rating. Principals in schools with higher special ed populations are more likely to get lower ratings. As such, the model used to estimate teacher or principal effect does not accurately characterize teacher effect on the assessment data. It fails to control for that which it claims to control for. The figures in the technical report show that clearly. The author chooses to ignore them in his/her conclusions.

      And yes, forcing the use of these measures which are both highly unstable from year to year (when better estimated) and in this case, substantially biased (in simple terms – WRONG – wrongly attributing effect to teachers/principals that is likely attributable to a condition outside their control) – for making staffing decisions/compensation decisions, etc. is likely to do more harm than good. Among other things, it reduces the likelihood that we might ever get around to using better estimates more appropriately.

      Such estimates – even from better models – while potentially useful for providing preliminary screening information, should never be used in the type of evaluation structure adopted in NY State. See: https://schoolfinance101.wordpress.com/2012/04/19/the-toxic-trifecta-bad-measurement-evolving-teacher-evaluation-policies/ (see: http://shankerblog.org/?p=7242 re: the more appropriate screening use of such estimates)

      The bias in these model estimates has little or nothing to do with the type of testing bias to which you refer above.

  7. This is exactly why it is so difficult to find experienced teachers to work with low performing students. It is one thing to want to help those who are the neediest, but it is completely another when in doing so to have to worry about being removed from your job for low performance evaluations. As Ms. Jones pointed out, these types of students are often emotional and will refuse to comply with testing at any given time. Not to mention, many low performing students are from low socioeconomic homes with serious problems, yet we expect them to come to school on testing days, regardless of what is occurring in their homes at the present time, and focus on doing their best. The whole idea is insane. What teacher needs this kind of stress year after year? We have created a system where teachers take on low achieving classrooms early in their careers, and then as soon as possible move up and teach higher performing students. The whole system accomplishes the exact opposite of what our students truly need.
