Yesterday, New Jersey’s Education Commissioner announced his plans for how teachers should be evaluated, what teachers should have to do to achieve tenure, and on what basis a teacher could be relieved of tenure. In short, Commissioner Cerf borrowed from the Colorado teacher tenure and evaluation plan which includes a few key elements (Colorado version outlined at end of post):
1. Evaluations based 50% on teacher effectiveness ratings generated with student assessment data – or value-added modeling (though not stated in those specific terms)
2. Teachers must receive 3 positive evaluations in a row in order to achieve tenure.
3. Teachers can lose tenure status or be placed at risk of losing tenure status if they receive 2 negative evaluations in a row.
This post is intended to illustrate just how ill-conceived – how poorly thought out – the above parameters are. This all seems logical on its face to anyone who knows little or nothing about the fallibility of measuring teacher effectiveness, or about probability and statistics more generally. Of course we only want to tenure “good” teachers and we want a simple mechanism to get rid of bad ones. If only it were that easy to set up simple parameters of goodness and badness and put such a system into place. Well, it’s not.
Here’s an activity for teachers to try today. It may take more than a day to get it done.
MATERIALS: DICE (well, really just one Die)! That’s all you need!
STEP 1: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 3 times in a row. STOP. Write down the total number of rolls.
STEP 2: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 2 times in a row. STOP. Write down the total number of rolls.
Post your results in the comments section below.
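For anyone without a die handy, the two steps above can be sketched in a few lines of Python. This is only an illustration of the activity as described: the function names `rolls_until_run` and `average_rolls`, the fixed seed and the number of trials are my own choices, not part of any evaluation system.

```python
import random

def rolls_until_run(run_length, rng, sides=6):
    """Roll a fair die until the same number shows `run_length` times in a row.

    Returns the total number of rolls, counting the rolls in the final run.
    """
    rolls = 0
    previous = None
    streak = 0
    while streak < run_length:
        face = rng.randint(1, sides)
        rolls += 1
        # A repeat extends the streak; anything else starts a new streak of 1.
        streak = streak + 1 if face == previous else 1
        previous = face
    return rolls

def average_rolls(run_length, trials=20000, seed=1):
    """Average the result over many simulated attempts (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return sum(rolls_until_run(run_length, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    rng = random.Random()
    print("STEP 1 (three in a row):", rolls_until_run(3, rng), "rolls")
    print("STEP 2 (two in a row):", rolls_until_run(2, rng), "rolls")
    print("Average for STEP 1:", round(average_rolls(3), 1))
    print("Average for STEP 2:", round(average_rolls(2), 1))
```

A single run is one "career"; the averages give a sense of what a large group of teachers rolling the same die would see.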
Now, what the heck does this all mean? Well, as I’ve written on multiple occasions, the year to year instability of teacher ratings based on student assessment scores is huge. Alternatively stated, the relationship between a teacher’s rating in one year and the next is pretty weak. The likelihood of getting the same rating two straight years is pretty low, and three straight years is very low. The year to year correlation, whether we are talking about the recent Gates/Kane studies or previous work, is about .2 to .3. There’s about a 35% chance that an average teacher in any year is misidentified as poor, given one year of data, and a 25% chance given three years of data. That’s a very high error rate and a very low year to year relationship. This is noise. Error. Teachers – this is not something over which you have control! Teachers have little control over whether they can get 3 good years in a row. AND IN THIS CASE, I’M TALKING ONLY ABOUT THE NOISE IN THE DATA, NOT THE BIAS RESULTING FROM WHICH STUDENTS YOU HAVE!
What does this mean for teachers being tenured and de-tenured under the above parameters? Given the random error and instability alone, it could take quite a long time, a damn long time, for any teacher to actually string together 3 good years of value added ratings. And even if one does, we can’t be that confident that he/she is really a good teacher. The dice rolling activity above may actually provide a reasonable estimate of how long it would take a teacher to get tenure (depending on how high or low teacher ratings have to be to achieve or lose tenure). In that activity, you’ve got a 1/6 chance with each roll of getting the same number you got on the previous roll. Of course, getting the same number as your first roll two more times is much less likely than getting that number only one more time. You can play it more conservatively by just seeing how long it takes to get 3 rolls in a row where you get a 4, 5 or 6 (above average rating), and then how long it takes to get only two in a row of a 1, 2, or 3.
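For readers who would rather not roll, the average waits can be worked out directly. This is a back-of-the-envelope sketch that assumes each year's rating is an independent roll of a fair die (my simplification for the dice analogy, not a model of real ratings). Here $T_k$ is the expected number of rolls until $k$ "successes" in a row when each roll succeeds with probability $p$, and $S_k$ is the expected number of rolls until the same number comes up $k$ times in a row; both symbols are my own notation.

```latex
% After reaching k-1 in a row, one more roll either finishes the run (prob. p)
% or wipes it out (prob. 1-p), in which case the wait starts over.
\begin{align*}
T_k &= T_{k-1} + 1 + (1-p)\,T_k, \qquad T_0 = 0 \\
\Rightarrow\quad T_k &= \frac{T_{k-1} + 1}{p} = \sum_{j=1}^{k} p^{-j} \\[4pt]
% Conservative version: 4, 5 or 6 is "good", 1, 2 or 3 is "bad", so p = 1/2.
T_3\big|_{p=1/2} &= 2 + 4 + 8 = 14 \quad \text{(three good years in a row)} \\
T_2\big|_{p=1/2} &= 2 + 4 = 6 \quad \text{(two bad years in a row)} \\[4pt]
% Same-number version: one free first roll, then k-1 matches in a row at p = 1/6.
S_3 &= 1 + T_2\big|_{p=1/6} = 1 + 6 + 36 = 43 \\
S_2 &= 1 + T_1\big|_{p=1/6} = 1 + 6 = 7
\end{align*}
```

So under these toy assumptions the average wait in the conservative version is 14 rolls to string together three good years, but only 6 to string together two bad ones.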
What does that mean? That means that it could take a damn long time to string together the ratings to get tenure, and not very long to be on the chopping block for losing it. Try the activity. Report your results below.
Each roll above is one year of experience. How many rolls did it take you to get tenure? And how long to lose it?
Now, I’ve actually given you a break here, because I’ve assumed that when you got the first of three in a row, the number you got was equivalent to a “good” teacher rating. It might have been a bad, or just average, rating. So, when you got three in a row, those three in a row might get you fired instead of tenured. So, let’s assume a 5 or a 6 represents a good rating. Try the exercise again and see how long it takes to get three 5s or three 6s in a row. (Or increase your odds of either success or failure by lumping together any 5 or 6 as successful and any 1 or 2 as unsuccessful, or by counting any roll of 1-3 as unsuccessful and any roll of 4-6 as successful.)
Of course, this change has to work both ways too. See how long it takes to get two 1s or two 2s in a row, assuming those represent bad ratings.
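These variations can be compared without rolling at all. The sketch below relies on a standard result for independent rolls of a fair die: the expected wait for k rolls in a row that each land in a chosen set of faces is 1/p + 1/p^2 + ... + 1/p^k, where p is the share of faces in the set. The helper name `expected_rolls`, the three scenarios, and the treatment of each year as an independent roll are my own simplifications.

```python
from fractions import Fraction

def expected_rolls(faces, run_length, sides=6):
    """Expected rolls of a fair die until `run_length` rolls in a row land in `faces`.

    Uses E = 1/p + 1/p**2 + ... + 1/p**run_length with p = len(faces) / sides,
    which ASSUMES every roll (year) is independent of the others.
    """
    p = Fraction(len(set(faces)), sides)
    return sum(1 / p**j for j in range(1, run_length + 1))

# Each scenario: (label, faces counted as "good", faces counted as "bad").
scenarios = [
    ("only a 6 is good, only a 1 is bad", {6}, {1}),
    ("5-6 good, 1-2 bad", {5, 6}, {1, 2}),
    ("4-6 good, 1-3 bad", {4, 5, 6}, {1, 2, 3}),
]

for label, good, bad in scenarios:
    to_tenure = expected_rolls(good, 3)  # three good ratings in a row
    to_lose = expected_rolls(bad, 2)     # two bad ratings in a row
    print(f"{label}: {to_tenure} rolls on average to tenure, {to_lose} to lose it")
```

With 5-6 counted as good and 1-2 as bad, that works out to 39 rolls on average to earn tenure against 12 to lose it. In every scenario the two-in-a-row event arrives far sooner than the three-in-a-row one.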
Now, defenders of this approach will likely argue that they are putting only 50% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that places a single large, or even significant, weight on a single quantified factor is that that specific factor necessarily becomes the tipping point, or trigger mechanism. It may be 50% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.
In short, based on the instability of measures alone, the average time to tenure will be quite long, and highly unpredictable. And, those who actually get tenure may not be much more effective, if at all, than those who don’t. It’s a crap shoot. Literally!
Then, losing tenure will be pretty easy… also on a crap shoot… but your odds of losing are much greater than your odds were of winning.
And who’s going to be lining up for these jobs?
Summary of research on “intertemporal instability” and “error rates”
The assumption in value-added modeling for estimating teacher “effectiveness” is that if one uses data on enough students passing through a given teacher each year, one can generate a stable estimate of the contribution of that teacher to those children’s achievement gains. However, this assumption is problematic because of the concept of inter-temporal instability: that is, the same teacher is highly likely to get a very different value-added rating from one year to the next. Tim Sass notes that the year-to-year correlation for a teacher’s value-added rating is only about 0.2 or 0.3 – at best a very modest correlation. Sass also notes that:
About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.
Further, most of the change or difference in the teacher’s value-added rating from one year to the next is unexplainable – not by differences in observed student characteristics, peer characteristics or school characteristics.
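To see how a correlation that weak plays out, here is a rough simulation. It is a sketch under assumptions of my own, not a reproduction of Sass's analysis: ratings in two consecutive years are drawn as standard normal scores with a year-to-year correlation of 0.25 (the middle of the 0.2 to 0.3 range), and teachers are sorted into quintiles each year. The function name `quintile_transitions` and the sample size are illustrative.

```python
import math
import random

def quintile_transitions(correlation=0.25, teachers=100_000, seed=1):
    """Simulate two years of ratings and follow the year-1 bottom-quintile teachers.

    ASSUMPTION: ratings are bivariate normal with the given year-to-year correlation.
    Returns (share staying in the bottom quintile, share jumping to the top quintile).
    """
    rng = random.Random(seed)
    noise_weight = math.sqrt(1 - correlation**2)
    year1, year2 = [], []
    for _ in range(teachers):
        a = rng.gauss(0, 1)
        # Year 2 shares only `correlation` of year 1's signal; the rest is fresh noise.
        b = correlation * a + noise_weight * rng.gauss(0, 1)
        year1.append(a)
        year2.append(b)

    def cutoffs(scores):
        """Return the 20th and 80th percentile scores."""
        ordered = sorted(scores)
        return ordered[len(ordered) // 5], ordered[4 * len(ordered) // 5]

    low1, _ = cutoffs(year1)
    low2, high2 = cutoffs(year2)

    # Year-2 scores of the teachers who were in the bottom quintile in year 1.
    bottom = [b for a, b in zip(year1, year2) if a < low1]
    stay = sum(b < low2 for b in bottom) / len(bottom)
    jump = sum(b >= high2 for b in bottom) / len(bottom)
    return stay, jump

stay, jump = quintile_transitions()
print(f"Bottom quintile again next year: {stay:.0%}")
print(f"Bottom quintile to top quintile: {jump:.0%}")
```

Under these assumptions the simulated shares should land broadly in line with the ranges quoted above, which is what you would expect if most of the year-to-year movement is noise. Setting `correlation=0` gives the pure-chance baseline of about 20 percent for both.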
Similarly, preliminary analyses from the Measures of Effective Teaching Project, funded by the Bill and Melinda Gates Foundation found:
When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.
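The arithmetic behind that statement is short. As a sketch, assume the usual textbook measurement model (my assumption, not something spelled out in the passage above): a teacher's measured value-added in year $t$, $y_t$, is a stable component $\mu$ plus a transitory component $\varepsilon_t$ that is independent across years and has the same variance each year.

```latex
\begin{align*}
y_t &= \mu + \varepsilon_t, \qquad
\operatorname{Var}(\mu) = \sigma_\mu^2, \quad
\operatorname{Var}(\varepsilon_t) = \sigma_\varepsilon^2, \quad
\operatorname{Cov}(\mu, \varepsilon_t) = \operatorname{Cov}(\varepsilon_1, \varepsilon_2) = 0 \\[4pt]
r = \operatorname{Corr}(y_1, y_2)
  &= \frac{\operatorname{Cov}(y_1, y_2)}{\sqrt{\operatorname{Var}(y_1)\operatorname{Var}(y_2)}}
   = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\varepsilon^2} \\[4pt]
\text{transitory share}
  &= \frac{\sigma_\varepsilon^2}{\sigma_\mu^2 + \sigma_\varepsilon^2} = 1 - r
\end{align*}
```

So a correlation below .5 means the transitory share is above one half, and a year-to-year correlation of 0.2 to 0.3 would put it at roughly 70 to 80 percent.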
While some statistical corrections and multi-year analysis might help, it is hard to guarantee or even be reasonably sure that a teacher would not be dismissed simply as a function of unexplainable low performance for two or three years in a row.
Classification & Model Prediction Error
Another technical problem of VAM teacher evaluation systems is classification and/or model prediction error. Researchers at Mathematica Policy Research Institute in a study funded by the U.S. Department of Education carried out a series of statistical tests and reviews of existing studies to determine the identification “error” rates for ineffective teachers when using typical value-added modeling methods. The report found:
Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.
Type I error refers to the probability that, based on a certain number of years of data, the model will find that a truly average teacher performed significantly worse than average. That means there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher, in this case a “false positive” identification. According to the study, this occurs one in ten times (given three years of data) and two in ten times (given only one year of data).
Same Teachers, Different Tests, Different Results
Whether a teacher is rated effective may depend on the assessment used for a specific subject area rather than on whether that teacher is generally effective in that subject area. For example, Houston uses two standardized tests each year to measure student achievement: the state Texas Assessment of Knowledge and Skills (TAKS) and the nationally-normed Stanford Achievement Test. Corcoran and colleagues used Houston Independent School District (HISD) data from each test to calculate separate value-added measures for fourth and fifth grade teachers. The authors found that a teacher’s value-added can vary considerably depending on which test is used. Specifically:
among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.
Similar issues apply to tests on different scales – different possible ranges of scores, or different statistical modification or treatment of raw scores, for example, whether student test scores are first converted into standardized scores relative to an average score, or expressed on some other scale such as percentile rank (which is done in some cases but would generally be considered inappropriate). For instance, if a teacher is typically assigned higher performing students and the scaling of a test is such that it becomes very difficult for students with high starting scores to improve over time, that teacher will be at a disadvantage. But another test of the same content, or simply one with different scaling of scores (so that smaller gains are adjusted to reflect the relative difficulty of achieving those gains), may produce an entirely different rating for that teacher.
Brief Description of Colorado Model
Colorado, Louisiana, and Tennessee have proposed teacher evaluation systems that will require 50% or more of each evaluation to be based on students’ academic growth. This section summarizes the evaluation systems in these states as well as the procedural protections that are provided for teachers.
Colorado’s statute creates a state council for educator effectiveness that advises the state board of education. A major goal of these councils is to aid in the creation of teacher evaluation systems in which “every teacher is evaluated using multiple fair, transparent, timely, rigorous, and valid methods.” Considerations of student academic growth must comprise at least 50% of each evaluation. Quality measures for teachers must include “measures of student longitudinal academic growth” such as “interim assessments results or evidence of student work, provided that all are rigorous and comparable across classrooms and aligned with state model content standards and performance standards.” These quality standards must take diverse factors into account including “special education, student mobility, and classrooms with a student population in which ninety-five percent meet the definition of high-risk student.”
Colorado’s statute also calls for school districts to develop appeals procedures. A teacher or principal who is deemed ineffective must receive written notice, documentation used for making this determination, and identification of deficiency. Further, the school district must ensure that a tenured teacher who disagrees with this designation has “an opportunity to appeal that rating, in accordance with a fair and transparent process, where applicable, through collective bargaining.” If no collective bargaining agreement is in place, then the teacher may request a review “by a mutually agreed-upon third party.” The school district or board for cooperative services must develop a remediation plan to correct these deficiencies, which will include professional development opportunities that are intended to help the teacher achieve an effective rating in her next evaluation. The teacher or principal must receive a reasonable amount of time to correct such deficiencies.
 Tim R. Sass, The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy, Urban Institute (2008), available at http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. See also Daniel F. McCaffrey et al., The Intertemporal Variability of Teacher Effect Estimates, 4 Educ. Fin. & Pol’y, 572 (2009).
 Sass, supra note 27.
 Bill & Melinda Gates Foundation, supra note 26.
 Peter Z. Schochet & Hanley S. Chiang, Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education (2010).
 Id. at 12.
 Sean P. Corcoran, Jennifer L. Jennings & Andrew A. Beveridge, Teacher Effectiveness on High- and Low-Stakes Tests, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).
 Co. Rev. Stat. § 22-9-105.5(2)(a) (2010).
 Id. § 22-9-105.5(3)(a).
 Id. The statute also calls for the creation of performance evaluation councils that advise school districts. Id. § 22-9-107(1). The performance evaluation councils also help school districts develop teacher evaluation systems that must be based on the same measures as that developed by the state council for educator effectiveness. Id. § 22-9-106(1)(e)(II). However, the performance evaluation councils lose their authority to set standards once the state board has promulgated rules and the initial phase of statewide implementation has been completed. Id. § 22-9-106(1)(e)(I).
 Id. § 22-9-106(3.5)(b)(II).