Yesterday, New Jersey’s Education Commissioner announced his plans for how teachers should be evaluated, what teachers should have to do to achieve tenure, and on what basis a teacher could be relieved of tenure. In short, Commissioner Cerf borrowed from the Colorado teacher tenure and evaluation plan, which includes a few key elements (Colorado version outlined at end of post):

1. Evaluations based 50% on teacher effectiveness ratings generated with student assessment data – or value-added modeling (though not stated in those specific terms)

2. Teachers must receive 3 positive evaluations in a row in order to achieve tenure.

3. Teachers can lose tenure status or be placed at risk of losing tenure status if they receive 2 negative evaluations in a row.

This post is intended to illustrate just how ill-conceived – how poorly thought out – the above parameters are. This all seems logical on its face to anyone who knows little or nothing about the fallibility of measuring teacher effectiveness, or about probability and statistics more generally. Of course we only want to tenure “good” teachers and we want a simple mechanism to get rid of bad ones. If only it were that easy to set up simple parameters of goodness and badness and put such a system into place. Well, it’s not.

Here’s an activity for teachers to try today. It may take more than a day to get it done.

**MATERIALS: DICE (well, really just one Die)! That’s all you need!**

**STEP 1: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 3 times in a row. STOP. Write down the total number of rolls.**

**STEP 2: Roll Die. Record result. Roll again. Record result. Keep rolling until you get the same number 2 times in a row. STOP. Write down the total number of rolls.**

**Post your results in the comments section below.**

Now, what the heck does this all mean? Well, as I’ve written on multiple occasions, the year to year instability of teacher ratings based on student assessment scores is huge. Alternatively stated, the relationship between a teacher’s rating in one year and the next is pretty weak. The likelihood of getting the same rating two straight years is pretty low, and three straight years is very low. The year to year correlation, whether we are talking about the recent Gates/Kane studies or previous work, is about .2 to .3. There’s about a 35% chance that an average teacher in any year is misidentified as poor, given one year of data, and a 25% chance given three years of data. That’s a very high error rate and a very weak year to year relationship. This is noise. Error. Teachers – this is not something over which you have control! Teachers have little control over whether they can get 3 good years in a row. AND IN THIS CASE, I’M TALKING ONLY ABOUT THE NOISE IN THE DATA, NOT THE BIAS RESULTING FROM WHICH STUDENTS YOU HAVE!

What does this mean for teachers being tenured and de-tenured under the above parameters? Given the random error and instability alone, it could take quite a long time, a damn long time, for any teacher to actually string together 3 good years of value added ratings. And even if one does, we can’t be that confident that he/she is really a good teacher. The dice rolling activity above may actually provide a reasonable estimate of how long it would take a teacher to get tenure (depending on how high or low teacher ratings have to be to achieve or lose tenure). In that case, you’ve got a 1/6 chance with each roll that you get the same number you got on the previous roll. Of course, getting the same number as your first roll two more times is a much lower probability than getting that number only one more time. You can play it more conservatively by just seeing how long it takes to get 3 rolls in a row where you get a 4, 5 or 6 (above average rating), and then how long it takes to get only two in a row of a 1, 2, or 3.

What does that mean? That means that it could take a damn long time to string together the ratings to get tenure, and not very long to be on the chopping block for losing it. Try the activity. Report your results below.

Each roll above is one year of experience. How many rolls did it take you to get tenure? And how long to lose it?

Now, I’ve actually given you a break here, because I’ve assumed that when you got the first of three in a row, the number you got was equivalent to a “good” teacher rating. It might have been a bad, or just average, rating. So, when you got three in a row, those three in a row might get you fired instead of tenured. So, let’s assume a 5 or a 6 represents a good rating. Try the exercise again and see how long it takes to get three 5s or three 6s in a row. (Or increase your odds of either success or failure by lumping together any 5 or 6 as successful and any 1 or 2 as unsuccessful, or by counting any roll of 1-3 as unsuccessful and any roll of 4-6 as successful.)

Of course, this change has to work both ways too. See how long it takes to get two 1s or two 2s in a row, assuming those represent bad ratings.
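If you’d rather not wear out your wrist, here is a minimal sketch of the 4-6 = good, 1-3 = bad version of the game. It assumes every roll (year) is independent of the last, which is the dice analogy and not a claim about real ratings. The function names `rolls_until_run` and `average_rolls` are just made up for this illustration.

```python
import random

def rolls_until_run(run_length, is_hit, rng):
    """Roll a fair die until `is_hit` is true `run_length` times in a row.
    Returns the total number of rolls (each roll stands for one year)."""
    streak = 0
    rolls = 0
    while streak < run_length:
        rolls += 1
        streak = streak + 1 if is_hit(rng.randint(1, 6)) else 0
    return rolls

def average_rolls(run_length, is_hit, trials=20000, seed=1):
    """Average the stopping time over many simulated careers."""
    rng = random.Random(seed)
    total = sum(rolls_until_run(run_length, is_hit, rng) for _ in range(trials))
    return total / trials

def good(roll):
    return roll >= 4   # assumption: 4, 5, 6 count as a "good" rating

def bad(roll):
    return roll <= 3   # assumption: 1, 2, 3 count as a "bad" rating

years_to_tenure = average_rolls(3, good)   # three good ratings in a row
years_to_lose = average_rolls(2, bad)      # two bad ratings in a row

print(f"average rolls to 3 good in a row: {years_to_tenure:.1f}")
print(f"average rolls to 2 bad in a row:  {years_to_lose:.1f}")
```

With a coin-flip chance of a good rating each year, the long-run averages work out to 14 rolls for three good in a row and 6 rolls for two bad in a row, so the simulated averages should land close to those.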

Now, defenders of this approach will likely argue that they are putting only 50% of the weight of evaluations on these measures. The rest will include a mix of other objective and subjective measures. The reality of an evaluation that includes a single large, or even significant weight, placed on a single quantified factor is that that specific factor necessarily becomes the tipping point, or trigger mechanism. It may be 50% of the evaluation weight, but it becomes 100% of the decision, because it’s a fixed, clearly defined (though poorly estimated) metric.

In short, based on the instability of measures alone, the average time to tenure will be quite long, and highly unpredictable. And, those who actually get tenure may not be much more effective, or any more, than those who don’t. It’s a crap shoot. Literally!

Then, losing tenure will be pretty easy… also on a crap shoot… but your odds of losing are much greater than your odds were of winning.

And who’s going to be lining up for these jobs?

**Summary of research on “intertemporal instability” and “error rates”**

The assumption in value-added modeling for estimating teacher “effectiveness” is that if one uses data on enough students passing through a given teacher each year, one can generate a stable estimate of the contribution of that teacher to those children’s achievement gains.[1] However, this assumption is problematic because of the concept of inter-temporal instability: that is, the same teacher is highly likely to get a very different value-added rating from one year to the next. Tim Sass notes that the year-to-year correlation for a teacher’s value-added rating is only about 0.2 or 0.3 – at best a very modest correlation. Sass also notes that:

About one quarter to one third of the teachers in the bottom and top quintiles stay in the same quintile from one year to the next while roughly 10 to 15 percent of teachers move all the way from the bottom quintile to the top and an equal proportion fall from the top quintile to the lowest quintile in the next year.[2]

Further, most of the change or difference in the teacher’s value-added rating from one year to the next is unexplainable – not by differences in observed student characteristics, peer characteristics or school characteristics.[3]
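To get a feel for what a year-to-year correlation of that size does to quintile rankings, here is a rough sketch. It is a toy model of my own, not Sass’s method: ratings in two years are drawn as standard normal scores with an assumed correlation of 0.25, and the function name `quintile_transitions` is hypothetical.

```python
import math
import random

def quintile_transitions(correlation, teachers=100_000, seed=7):
    """Simulate two years of ratings with the given correlation and return the
    share of year-1 bottom-quintile teachers who (a) stay in the bottom
    quintile and (b) jump to the top quintile in year 2."""
    rng = random.Random(seed)
    noise_weight = math.sqrt(1 - correlation ** 2)
    year1, year2 = [], []
    for _ in range(teachers):
        x = rng.gauss(0, 1)
        # Year 2 shares only `correlation` of its signal with year 1.
        y = correlation * x + noise_weight * rng.gauss(0, 1)
        year1.append(x)
        year2.append(y)

    def cutoffs(values):
        ordered = sorted(values)
        return ordered[teachers // 5], ordered[4 * teachers // 5]

    low1, _ = cutoffs(year1)
    low2, high2 = cutoffs(year2)
    bottom = [y for x, y in zip(year1, year2) if x < low1]
    stayed = sum(y < low2 for y in bottom) / len(bottom)
    jumped = sum(y >= high2 for y in bottom) / len(bottom)
    return stayed, jumped

stayed, jumped = quintile_transitions(0.25)
print(f"bottom quintile again next year: {stayed:.0%}")
print(f"bottom to top quintile next year: {jumped:.0%}")
```

If ratings were pure noise, 20% would stay and 20% would jump. A weak correlation nudges those shares apart, but not by much.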

Similarly, preliminary analyses from the *Measures of Effective Teaching Project*, funded by the Bill and Melinda Gates Foundation, found:

When the between-section or between-year correlation in teacher value-added is below .5, the implication is that more than half of the observed variation is due to transitory effects rather than stable differences between teachers. That is the case for all of the measures of value-added we calculated.[4]
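One way to see where the .5 threshold comes from is a simple decomposition. This is a textbook sketch under my own assumptions, not the report’s model: each year’s rating $\hat{v}_t$ is a stable teacher component $\mu$ plus transitory noise $\varepsilon_t$ that is independent across years and has the same variance each year.

$$
\hat{v}_t = \mu + \varepsilon_t, \qquad \operatorname{Corr}(\hat{v}_1, \hat{v}_2) = \frac{\sigma_\mu^2}{\sigma_\mu^2 + \sigma_\varepsilon^2}
$$

Under those assumptions the correlation is just the stable share of the total variance, so a correlation below .5 means $\sigma_\varepsilon^2 > \sigma_\mu^2$. At a correlation of .2 to .3, the transitory part would be 70 to 80 percent of the observed variation.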

While some statistical corrections and multi-year analysis might help, it is hard to guarantee or even be reasonably sure that a teacher would not be dismissed simply as a function of unexplainable low performance for two or three years in a row.

*Classification & Model Prediction Error*

Another technical problem of VAM teacher evaluation systems is classification and/or model prediction error. Researchers at Mathematica Policy Research Institute in a study funded by the U.S. Department of Education carried out a series of statistical tests and reviews of existing studies to determine the identification “error” rates for ineffective teachers when using typical value-added modeling methods.[5] The report found:

Type I and II error rates for comparing a teacher’s performance to the average are likely to be about 25 percent with three years of data and 35 percent with one year of data. Corresponding error rates for overall false positive and negative errors are 10 and 20 percent, respectively.[6]

Type I error refers to the probability that, based on a certain number of years of data, the model will find that a truly average teacher performed significantly worse than average.[7] That means there is about a 25% chance (using three years of data) or a 35% chance (using one year of data) that a teacher who is “average” would be identified as “significantly worse than average” and potentially be fired. Of particular concern is the likelihood that a “good teacher” is falsely identified as a “bad” teacher, in this case a “false positive” identification. According to the study, this occurs one in ten times (given three years of data) and two in ten (given only one year of data).

*Same Teachers, Different Tests, Different Results*

Determining whether a teacher is effective may vary depending on the assessment used for a specific subject area and not whether that teacher is a generally effective teacher in that subject area. For example, Houston uses two standardized tests each year to measure student achievement: the state Texas Assessment of Knowledge and Skills (TAKS) and the nationally-normed Stanford Achievement Test.[8] Corcoran and colleagues used Houston Independent School District (HISD) data from each test to calculate separate value-added measures for fourth and fifth grade teachers.[9] The authors found that a teacher’s value-added can vary considerably depending on which test is used.[10] Specifically:

among those who ranked in the top category (5) on the TAKS reading test, more than 17 percent ranked among the lowest two categories on the Stanford test. Similarly, more than 15 percent of the lowest value-added teachers on the TAKS were in the highest two categories on the Stanford.[11]

Similar issues apply to tests on different scales – different possible ranges of scores, or different statistical modification or treatment of raw scores, for example, whether student test scores are first converted into standardized scores relative to an average score, or expressed on some other scale such as percentile rank (which is done in some cases but would generally be considered inappropriate). For instance, if a teacher is typically assigned higher performing students and the scaling of a test is such that it becomes very difficult for students with high starting scores to improve over time, that teacher will be at a disadvantage. But another test of the same content, or simply one with different scaling of scores (so that smaller gains are adjusted to reflect the relative difficulty of achieving those gains), may produce an entirely different rating for that teacher.

**Brief Description of Colorado Model**

Colorado, Louisiana, and Tennessee have proposed teacher evaluation systems that will require 50% or more of each evaluation to be based on students’ academic growth. This section summarizes the evaluation systems in these states as well as the procedural protections that are provided for teachers.

Colorado’s statute creates a state council for educator effectiveness that advises the state board of education.[12] A major goal of this council is to aid in the creation of teacher evaluation systems in which “every teacher is evaluated using multiple fair, transparent, timely, rigorous, and valid methods.”[13] Considerations of student academic growth must comprise at least 50% of each evaluation.[14] Quality measures for teachers must include “measures of student longitudinal academic growth” such as “interim assessments results or evidence of student work, provided that all are rigorous and comparable across classrooms and aligned with state model content standards and performance standards.”[15] These quality standards must take diverse factors into account including “special education, student mobility, and classrooms with a student population in which ninety-five percent meet the definition of high-risk student.”[16]

Colorado’s statute also calls for school districts to develop appeals procedures. A teacher or principal who is deemed ineffective must receive written notice, documentation used for making this determination, and identification of deficiency.[17] Further, the school district must ensure that a tenured teacher who disagrees with this designation has “an opportunity to appeal that rating, in accordance with a fair and transparent process, where applicable, through collective bargaining.”[18] If no collective bargaining agreement is in place, then the teacher may request a review “by a mutually agreed-upon third party.”[19] The school district or board for cooperative services must develop a remediation plan to correct these deficiencies, which will include professional development opportunities that are intended to help the teacher achieve an effective rating in her next evaluation.[20] The teacher or principal must receive a reasonable amount of time to correct such deficiencies.[21]

[1] Tim R. Sass, *The Stability of Value-Added Measures of Teacher Quality and Implications for Teacher Compensation Policy*, Urban Institute (2008), *available at* http://www.urban.org/UploadedPDF/1001266_stabilityofvalue.pdf. *See also* Daniel F. McCaffrey *et al*., *The Intertemporal Variability of Teacher Effect Estimates*, 4 Educ. Fin. & Pol’y, 572 (2009).

[2] Sass, *supra* note 1.

[3] *Id*.

[4] Bill & Melinda Gates Foundation, *supra* note 26.

[5] Peter Z. Schochet & Hanley S. Chiang, *Error Rates in Measuring Teacher and School Performance Based on Student Test Score Gains* (NCEE 2010-4004). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education (2010).

[6] *Id.*

[7] *Id.* at 12.

[8] Sean P. Corcoran, Jennifer L. Jennings & Andrew A. Beveridge, *Teacher Effectiveness on High- and Low-Stakes Tests*, Paper presented at the Institute for Research on Poverty summer workshop, Madison, WI (2010).

[9] *Id.*

[10] *Id.*

[11] *Id.*

[12] Co. Rev. Stat. § 22-9-105.5(2)(a) (2010).

[13] *Id*. § 22-9-105.5(3)(a).

[14] *Id*.

[15] *Id*.

[16] *Id*. The statute also calls for the creation of performance evaluation councils that advise school districts. *Id*. § 22-9-107(1). The performance evaluation councils also help school districts develop teacher evaluation systems that must be based on the same measures as that developed by the state council for educator effectiveness. *Id*. § 22-9-106(1)(e)(II). However, the performance evaluation councils lose their authority to set standards once the state board has promulgated rules and the initial phase of statewide implementation has been completed. *Id*. § 22-9-106(1)(e)(I).

[17] *Id*. § 22-9-106(3.5)(b)(II).

[18] *Id*.

[19] *Id*.

[20] *Id*.

[21] *Id*.

You say

Maybe what we really want is for no teachers ever to get tenure. It’s not clear that tenure for elementary and high school teachers has really produced anything more than job security.

Tenure as a job benefit – or job security – or predictable career earnings – arguably allows for keeping wages lower, while retaining quality of entrants to the profession. It’s just one of many job benefits that, along with wages, make up the compensation package. In the current context, policy makers are proposing to flatten or reduce wages and substantially reduce other benefits, and in this case, significantly increase job risk and reduce earnings stability – or expectation of career earnings potential. Those are big changes that collectively are more likely to diminish the quality of the applicant pool than to improve it, and to diminish the quality of the current practicing teacher pool. This is especially true where the performance metrics proposed are so noisy and so outside the control of teachers. There are two levels to this. First, the general issue of noise leading to likelihood of dismissal based on random error alone. Second, the model bias that increases the likelihood of dismissal (regardless of actual teacher quality) in high poverty, high minority settings. So, these policies collectively are likely to reduce the average applicant quality and to severely reduce interest in teaching in disadvantaged settings. The situation would be marginally different if these proposals were coupled with substantial increases in average pay – rather than a zero sum merit game around a stagnant mean. It would also be different if it were possible to develop more useful, more accurate and precise metrics. That does not appear to be the case. There exists a sizeable body of research regarding the influence of average compensation on teacher quality and/or quality of applicants to the teaching profession. And there exists research on the flip side, on the negative effects of stagnant funding (tax and expenditure limits, etc.) on the long run quality of the teacher workforce. Here are a few…

Richard J. Murnane and Randall Olsen (1989) The effects of salaries and opportunity costs on length of stay in teaching: Evidence from Michigan. Review of Economics and Statistics 71 (2) 347-352

David N. Figlio (1997) Teacher Salaries and Teacher Quality. Economics Letters 55 267-271.

David N. Figlio (2002) Can Public Schools Buy Better-Qualified Teachers? Industrial and Labor Relations Review 55, 686-699.

Ronald Ferguson (1991) Paying for Public Education: New Evidence on How and Why Money Matters. Harvard Journal on Legislation. 28 (2) 465-498.

Figlio, D.N., Rueben, K. (2001) Tax limits and the qualifications of new teachers. Journal of Public Economics 80 (1) 49-71

Teachers are who we look up to, & we send our children daily. Off to their teachers they go! Education should not be where we cut corners and you definitely cannot turn them into the corporate world for ratings. I myself, work for a company that rates us on ratings 1-5 & 1, being the highest… based on opinions of your superiors and overall job performance. Rating a teacher is not about getting a rating 1-5 & shooting not to fall short on a two. It’s more about the statement below:

” This is especially true where the performance metrics proposed are so noisy and so outside the control of teachers. There are two levels to this. First, the general issue of noise leading to likelihood of dismissal based on random error alone. Second, the model bias that increases the likelihood of dismissal (regardless of actual teacher quality) in high poverty, high minority settings. So, these policies collectively are likely to reduce the average applicant quality and to severely reduce interest in teaching in disadvantaged settings. ”

And this we cannot do to our Teachers & Friends!!!

I read this post at the beginning of seventh period lunch, diligently rolled my die for twenty minutes straight without getting 3 of the same number in a row…got two in a row three times, though… unfortunately, I have to get back to grading papers and do not have enough time to figure out how many years it will take me to make tenure in the Vegas system…

I promise I’ll roll the dice later…

Am I missing something, or are standardized tests NOT included in the Colorado statute? “Comparable across classrooms” is awfully vague to me.

I had thought the error rates INCLUDED non-random assignment of students – thanks for clarifying.

If I recall correctly the colo. model relies heavily on the Colorado Growth model, which is a crude version of value-added, but there are still some issues to be resolved there. It’s a 50% share. I managed to miss that piece in my cut and paste. Fixed now.

There is potentially some overlap between error rate, noise and non-random assignment problems… but they are conceptually separate. Also, it is quite possible that non-random assignment actually stabilizes some teachers’ scores – by getting systematically similar kids assigned to them year after year – either to their disadvantage or advantage. Stable and more predictable… but not a measure of the actual teacher’s effectiveness. This is why the LA Times approach to “fixing” non-random assignment is bogus. Multiple years (multiple classes for a given teacher) can stabilize a teacher score, but do not solve the non-random assignment problem unless each teacher gets to teach a totally different (random) mix of students over time. One can reduce the non-random assignment problem by having multiple lagged scores on each student before they have a given teacher, but that’s different. Multiple lagged scores establish each child’s learning trajectory, making it more reasonable to determine how the teacher affects the learning trajectory – across many students with multiple lagged scores.

The CO law doesn’t specify a given test — or even (technically) a test at all, although that’s clearly what’s contemplated and what will be used in most cases. During a discussion at the Denver PBS hq, I asked the bill’s author about, e.g., kindergarten teachers, and he responded by pointing to tests that are already administered, such as the DIBELS. Even though his thought here may not make its way into the final implementation, it’s a scary thought that kindergarten teachers (and 1st grade) might be evaluated based on a test of that nature (“barking at words” rather than showing meaning) that has certainly not been validated for such a use.

By the way, I wouldn’t call the Colorado Growth Model “crude.” The idea is similar to the growth charts used for children — the percentile growth ‘ranking’ is based on the criterion of past experience. Notably, however, the Model’s developer, Damian Betebenner, has cautioned against using the Model in this sort of high-stakes way. But I doubt his cautions would be any less for value-added approaches based on regression models.

Kevin,

Thanks for the clarifications. I guess my point about the Colo. Growth model is that from the perspective of a model for estimating a teacher effect it would be crude. That it measures individual student growth, but it is not, as I understand it designed to be used to estimate effects of teacher, classroom or school factors on that growth. In that sense, it would be my perspective that a thorough value added model is more developed, but still obviously problematic in many ways, as identified both in Jesse’s review of the Gates findings and in Derek’s recent re-analysis of the LAT data. Ultimately, for these purposes, one must take the growth measures – however constructed – and assign a causal effect of a teacher. That to me is the big leap, and one that I don’t see the Colo. Growth model resolving (at all?).

Bruce

It’s been a while since I’ve done stats work, so please forgive me if I am wrong, but the dice roll seems to overstate the randomness of value added measures. The probability of rolling the same number three times in a row would be about 2.8% (6*((1/6)^3)). The probability of an average teacher not being classified as “significantly worse than average” would be 65% in one year (based on Mathematica’s analysis using one year of data, as I assume would be necessary under the department’s proposal) or 27.5% over three consecutive years (0.65^3). Certainly not the types of odds that will have people begging for teaching jobs, but still much better than 3%.
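For anyone who wants to check the two figures in this comment, a quick sketch of the arithmetic follows. The variable names are mine, and both calculations assume each roll or year is independent.

```python
# Chance that three given rolls all show the same face: 6 faces, each (1/6)^3.
same_number_three_times = 6 * (1 / 6) ** 3

# Chance that an average teacher avoids the "significantly worse than average"
# label three years running, at a 65% chance of avoiding it each year.
three_clean_years = 0.65 ** 3

print(f"same number three times: {same_number_three_times:.4f}")
print(f"three clean years:       {three_clean_years:.4f}")
```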

As an alternative to the dice roll, I did a quick simulation in Excel. I generated 10,000 random numbers with a mean of 0 and standard deviation of 1 in column A. In column B, I created a dummy that equals 1 if the value in the adjacent cell was greater than -0.363, 0 otherwise. This reflects the 65% chance of an average teacher not being classified as awful. I think the results should make everyone want to take a step back from making high stakes decisions with VAM. I will refer to the results as “good” (value of 1) or awful (value of 0).

Year 1: Good, Year2: Awful…at this point, I cannot get tenure for another three years if I am understanding the state’s plan correctly. Years 3 through 6…good, I’ve been tenured. Year 7, awful, I am now on the bubble. Years 8 and 9, good. Phew!

Then, something seriously improbable happens, years 10 through 18 all came up awful! In reality, I do not see any way this teacher would have made it to year 18, even though this is truly a competent (although, maybe not stellar) teacher! I recognize that this will not happen often, but if it happened once in my trial of one, it will happen multiple times each year if applied to 100,000+ teachers.
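Here is a rough Python sketch of the same kind of spreadsheet exercise, for anyone who wants to rerun it. It is not the commenter’s actual workbook. The cutoff is computed from the inverse normal CDF (roughly -0.385 for a 65% share), so it differs slightly from the -0.363 typed into the spreadsheet, and the function names are hypothetical.

```python
import random
from statistics import NormalDist

def simulate_ratings(years, share_good=0.65, seed=0):
    """Draw one standard-normal score per year and flag it good (True) when it
    clears the cutoff that leaves `share_good` of the distribution above it."""
    cutoff = NormalDist().inv_cdf(1 - share_good)
    rng = random.Random(seed)
    return [rng.gauss(0, 1) > cutoff for _ in range(years)]

def longest_bad_streak(ratings):
    """Length of the longest run of consecutive awful (False) years."""
    longest = current = 0
    for is_good in ratings:
        current = 0 if is_good else current + 1
        longest = max(longest, current)
    return longest

ratings = simulate_ratings(10_000)
share = sum(ratings) / len(ratings)

print(f"share of good years: {share:.1%}")
print(f"longest awful streak in 10,000 draws: {longest_bad_streak(ratings)}")
```

Treat the 10,000 draws as many careers laid end to end: long awful streaks for a truly average teacher are rare in any one career, but they do show up once you draw enough years.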

On a slightly different note, I would love to be a fly on the wall at meetings among senior management at Mathematica. They, on the one hand, do the value added measures for DC and one of their researchers was the lead author on a report supporting its use. On the other hand, other analysts produce a report that raises serious questions. Not sure how one company can juggle this.

It depends on where you set the cut points for how high a high rating has to be at least 3 times in a row, or how low a low rating has to be 2 times in a row. If you set the cut points higher and lower, around a narrower band, you reduce the likelihood of falling in that narrow band in subsequent years. Widen the range and you increase it. So the classification scheme applied to the ratings distribution matters greatly. So yes, if we set wide ranges around performance categories like those described in the Mathematica classification error analyses, considering the misclassification of the average teacher as ineffective, then the dice example does overstate the randomness as you note above. It just made for an easier example for the blog posting, and I assumed at least someone would play more carefully with odds. Thanks for the simulation example. I’ll ponder alternative versions.
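To put rough numbers on how much the cut points matter, here is a sketch using the standard waiting-time formula for a run of consecutive successes. It assumes each year’s rating is an independent draw that lands in the target band with probability `p`, which is the dice analogy rather than a model of real ratings. The function name is mine.

```python
def expected_years_for_streak(p, streak):
    """Expected number of independent yearly ratings until `streak` consecutive
    ratings land in a band that each rating hits with probability `p`.
    Standard result: 1/p + 1/p^2 + ... + 1/p^streak."""
    return sum(p ** -i for i in range(1, streak + 1))

# Wider "good" bands (larger p) shorten the wait for three good years in a row.
for p in (1 / 3, 1 / 2, 0.65):
    years = expected_years_for_streak(p, 3)
    print(f"P(good) = {p:.2f}: about {years:.1f} years to 3 good in a row")

# The same formula applied to two bad years in a row.
for p in (1 / 3, 1 / 2):
    years = expected_years_for_streak(p, 2)
    print(f"P(bad)  = {p:.2f}: about {years:.1f} years to 2 bad in a row")
```

With a one-in-three chance of a good rating the expected wait for tenure is 39 years, and with a one-in-two chance it is 14. Narrow the band and the wait balloons, which is the point about cut points made above.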

I did a variation on the dice method. I let a score of 1 or 2 = awful and scores of 3-6 = good. The scores follow:

5

2

4

6

2

1

1

3

2

4

1

6

3

4 — Tenure!

3

4

2

3

4

1

1 — Lost Tenure!

The problem is that you’d have likely been dismissed before you got to tenure, after the 2 followed by 2 1s. Thanks for playin’!
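If you want to score your own list of rolls the same way, here is a small sketch that walks through the sequence posted above, with 1-2 = awful and 3-6 = good. The helper name `first_streak_end` is made up for this example.

```python
def first_streak_end(ratings, target, length, start=0):
    """Return the 1-based roll number at which `length` consecutive entries
    equal to `target` are first completed, scanning from index `start`.
    Returns None if no such streak occurs."""
    streak = 0
    for position in range(start, len(ratings)):
        streak = streak + 1 if ratings[position] == target else 0
        if streak == length:
            return position + 1
    return None

# Rolls copied from the comment above, in order.
rolls = [5, 2, 4, 6, 2, 1, 1, 3, 2, 4, 1, 6, 3, 4, 3, 4, 2, 3, 4, 1, 1]
good = [roll >= 3 for roll in rolls]   # 3-6 = good, 1-2 = awful

tenure_year = first_streak_end(good, True, 3)
first_two_awful = first_streak_end(good, False, 2)
tenure_lost = first_streak_end(good, False, 2, start=tenure_year)

print(f"three good in a row completed at roll {tenure_year}")
print(f"first two awful in a row completed at roll {first_two_awful}")
print(f"two awful in a row after tenure completed at roll {tenure_lost}")
```

Run on that sequence, tenure arrives at roll 14 and is lost at roll 21, but the first pair of awful ratings is already complete at roll 6, well before tenure.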

I played the game twice, mostly because the first time, I didn’t get three in a row at all, even using the 5-6 great!,1-2 bad method.

Teacher 1 teaches for 40 years, because after all, she doesn’t deserve a decent pension plan anymore than job security or a decent wage. 4,2,1,possibly fired-next job: 5!, 2,1, oops, possibly fired again, 5!, 3, 5!,5!, 3darn it, 4,3,5!,5!, 3 darn it, 3,4,5!,4,2, 6!, 6!, 2, 1, possible firing, 2 bad teacher, 2 you’re outa here!New job (which restarts pension, remember?) 3,3,4, 6!, 1,4, 6!,5! 1, 1, you’re fired… and she didn’t make it to 40 years, just 39, because hey, old and incompetent, plus depressed.

She was rated unsatisfactory 11 times, and wonderful 12 times. She never made tenure, and was ranked unsatisfactory two or more years in a row 4 times.

Teacher 2, again for 40 years.

5!, 3, 4, 2, 6!,2,6!,5!,3,5!,1,1, possible firing 12 years on the job…6!, 4, 4, 5!, 6!,6!,6! in year 20, tenure, do a happy dance?, 3, 6!,4, 5!, 4, 2, 2 bye -bye tenure, possibly bye-bye job, only kept it 7 years. 3, 1, 1, if not fired before, probably on the line now, 4, 3, 3, 2, 4, 6!,6!, 5! WooHoo! tenure again in year 39, 5!

She was rated unsatisfactory 9 times, and wonderful 16 times. She made tenure twice, in year 20 and year 39, and was ranked unsatisfactory two or more years in a row only three times.

Not very encouraging is it?

ASSUME YOU’RE A GOOD TEACHER – OR AT LEAST MORE LIKELY TO RECEIVE A GOOD RATING. TRY MEGAN’S VARIATION –

1-2 =Poor performance rating

3 – 6 = Good performance rating

Tipping the balance somewhat in your favor? How much does it help?

Well, in my two examples,

teacher 1 makes tenure in year 5, loses it in year 7. She then has good ratings, and tenure, for the next 17 years, with only one poor rating, when she loses it all with 4 successive years of poor ratings. then it is 4 years in a row of good ratings, a blip, three more years of good ratings, and then three years before she is due to retire, she loses tenure again. ONLY two of her poor ratings were not consecutive. She actually had only 12 bad ratings out of 39, they just happened to be paired up. She had a total of 25 tenured years. The way I originally laid it out, she had no tenured years, so definitely an improvement.

teacher 2 now makes tenure in year 3, keeps it until year 12, then immediately regains it in year 15, and doesn’t have another poor performance review until she loses her tenure by having two years in a row at years 26, and 27. She bounces back, regaining her tenure in year 33, and finishes her 40 years with sterling reviews. Since she only had 9 unsatisfactory reviews out of the 40, using Megan’s scale really helped her. She had a total of 38 tenured years. The way I originally laid it out, she had 8 tenured years, definitely an improvement.

Of course, neither of these teachers likely ever gets to accumulate that total number of years tenured, because they likely get fired after losing tenure and getting the bad ratings. Perhaps they find a new job. But perhaps they either don’t or really don’t want to put up with having their career depend on a roll of the dice any more.

Personally, I wish we could do away with the term tenure completely. Can’t we come up with another term that will inform non-educators that we are really just talking about no dismissal without cause? I think that the negative connotations of the term are crippling in this fight. As long as people perceive “tenure” as “job for life”, we have a problem.

Oops, misadded here. I forgot to not count the consecutive years when you are earning tenure.

So teacher 1 there actually had 23 years with tenure, not necessarily at the same school. I started counting tenure on the year after she achieved it, and stopped at the second bad review year.

And teacher 2 had 27 years with tenure, not necessarily at the same school.

This is of course, assuming a lot of things as you noted: that they actually aren’t fired, or at the least are quickly rehired.