Teacher Selection: Smart Selection vs. Dumb Selection


I had a Twitter argument the other day about a blog post that compared the current debate around “de-selection” of bad teachers to eugenics. It is perhaps a bit harsh to compare Hanushek (cited author of papers on de-selecting bad teachers) to Hitler, if that was indeed the intent. However, I did not take that as the intent of the post by Cedar Riener. Offensive or not, I felt that the post made three key points about errors of reasoning that apply both to eugenicists and to those promoting empirical de-selection of fixed shares of the teacher workforce. Here’s a quick summary of those three points:

  • The first error is a deterministic view of a complex and uncertain process.
  • The second common error becomes apparent once the need arises to concretely measure quality.
  • The third error is a belief that important traits are fixed rather than changeable.

These are critically important, and help us to distinguish between smart selection and, well, dumb selection. These three errors of reasoning are the basis for dumb selection: a selection that is, as the author explains, destined to fail. But I do not see this particular condemnation of dumb selection as a condemnation of selection more generally. By contrast, the reformy pundit with whom I was arguing continued to claim that Riener’s blog was condemning any and all forms of selection as doomed to fail, a seemingly absurd proposition (and not how I read it at all).

Clearly, selection can and should play a positive role in the formation of the teacher workforce or in the formation of that team of school personnel that can make a school great.

Smart Selection: In nearly every human endeavor, and in any workforce or labor market activity, there exists some form of selection: selection of individuals into specific careers, jobs and roles, and de-selection of individuals out of careers, jobs and roles. Selection in and of itself is clearly not a bad thing. In fact, the best organizations necessarily select the best available individuals over time to work within those organizations. And individuals attempt to select the best organizations, careers, jobs and roles to suit their interests, motivation and needs. That is self-selection. Teacher selection, or any education system employee selection, is no different. And good teacher selection is obviously important for having good schools. Like any selection process on the labor market, teacher selection involves a two-sided match. On the one hand, there are the school leaders and existing employees (to the extent they play a role in recruitment and selection) who may help determine which applicants in a pool are the best fit for their school and the specific job in question. On the other hand, there are the signals sent out by the school (some within and some outside the control of existing staff and leaders) which influence the composition of the applicant pool and, for that matter, whether an individual who is selected decides to stay. These include signals about compensation, job characteristics and work environment. Managing this complex system well is key to having a great school. Sending the right signals. Creating the right environment. Making the right choices among applicants. Knowing when a choice was wrong. And handling difficult decisions with integrity.

There has also been much discussion of late about a recent publication by Brian Jacob of the University of Michigan, who found that when given the opportunity to play a strong role in selecting which probationary teachers should continue in their schools, principals generally selected teachers who later proved to generate good statistical outcomes (test scores). Note that this approach to declaring successful decision making suffers the circular logic I’ve frequently bemoaned on this blog. But, at the very least, Jacob’s findings suggest that decisions made by individuals – human beings considering multiple factors – are not counterproductive when measured against our current batch of narrow and noisy metrics. Specifically, Jacob found:

Principals are more likely to dismiss teachers who are frequently absent and who have previously received poor evaluations. They dismiss elementary school teachers who are less effective in raising student achievement. Principals are also less likely to dismiss teachers who attended competitive undergraduate colleges. It is interesting to note that dismissed teachers who were subsequently hired by a different school are much more likely than other first-year teachers in their new school to be dismissed again.

That to me seems like good selection. And it seems that principals are doing it reasonably well when given the chance. And this is why I also support using principals as the key leverage point in the process (with the caveat that principal quality itself is very unequally distributed, and must be improved).

Dumb “Selection:” Dumb selection, on the other hand – the kind of selection that is destined to fail if applied en masse in public schooling or any endeavor – suffers the three major flaws of reasoning addressed by Cedar Riener in his blog post. Now, you say to yourself, but who is really promoting dumb selection, and what more specifically are the elements of dumb selection when it comes to the teacher workforce? Here are the elements:

  1. Heavy weight (especially a fixed, large, predefined share) placed on value-added metrics in making teacher evaluation, compensation or dismissal decisions, even though these metrics can be corrupted, may suffer severe statistical bias, and are highly noisy and error-prone.
  2. Explicit, prior specification of the exact share of teachers who should be de-selected in any given year, or year after year over time OR prior specification of exact scores or ratings (categories) derived from those scores requiring action to be taken – including de-selection.

Sadly, several states have already adopted into policy the first of these dumb selection concepts – the mandate of a fixed weight to be placed on problematic measures. See this post by Matt Di Carlo at ShankerBlog for more on this topic.

Thus far, I do not know of states or districts that have, for example, required that the bottom-scoring 5% of teachers in any given year be de-selected. But states and districts have established categorical rating systems that sort teachers into high to low rating groups based on arbitrary cut points applied to these noisy measures, and have required that dismissal, intervention and compensation decisions be based on where teachers fall in the fixed, arbitrary classification scheme in a given year, or sequence of three years.

To some extent, the notion of de-selecting fixed shares of the teacher workforce based on noisy metrics comes from economists’ simulations, which are based more on convenience of measures than on active policy conversations. But in the past year, the lines between these simulations and reality have become blurred as policy conversations have indeed drifted toward actually using fixed values based on noisy achievement measures in place of seniority as a blunt tool to deselect teachers during times of budget cuts. If and when these simplified social science thought exercises are applied as public policy involving teachers, they do reek of the disturbingly technocratic, “value-neutral” mindset pervasive in eugenics as well.

One other recent paper that’s gotten attention applies this technocratic (my preference over eugenic) approach to determine whether using performance measures instead of seniority would result in a) different patterns of layoffs and b) different average “effectiveness” scores (again, that circular logic rears its ugly head). Now, of course, if you lay off based on effectiveness scores rather than seniority, the average effectiveness scores of those left should be higher. The deck is stacked in this reformy analysis. But, even then, the authors find very small differences, largely because a) seniority-based layoffs seem to be affecting mainly first and second year teachers, and b) effectiveness scores tend to be lower for first and second year teachers. Overall, the authors find:

We next examine our value-added measures of teacher effectiveness and find that teachers who received layoff notices were about 5 percent of a standard deviation less effective on average than the average teacher who did not receive a notice. This result is not surprising given that teachers who received layoff notices included many first and second-year teachers, and numerous studies show that, on average, effectiveness improves substantially over a teacher’s first few years of teaching.

Perhaps most importantly, these thought experiments, not ready for policy implementation prime time (nor will they ever be?), necessarily ignore the full complexity of the system to which they are applied, and as Riener noted, assume that individuals’ traits are fixed – how you are rated by the statistical model today is assumed to be correct (despite a huge chance it’s not) and assumed to be sufficient for classifying your usefulness as an employee, now and forever (be it a 1 or 3 year snapshot). In that sense, Riener’s comparison, while offensive to some, was right on target.

To summarize: Smart selection good. Dumb selection bad. Most importantly, selection itself is neither good nor bad. It all depends on how it’s done.

13 Comments

  1. Hey, thanks so much!
    A few thoughts on reading your post.
    You are right that I did not intend to compare modern day school reform to Hitler. I was hoping in part to reclaim the American eugenicists not as some sort of fringe group of fascists, but as a mainstream movement that had a lot of support among a lot of well-meaning people. In fact, I would argue that trying to ascertain the intent (good or bad) of either historical figures, or of current ones, is a useless enterprise. Far better to assume good intent (from teachers’ unions, Bill Gates, and even the Waltons) and attack the assumptions, evidence, and methods of those with whom we disagree.

    Regarding the difference between good selection and bad selection, I think you are right, but I would go one step further. I think that when we want selection to do the bulk of the work of improvement (whether it be a profession or the health or intelligence of society) they are bound to fail no matter what kind of metrics we use. This is a corollary of Campbell’s Law. I agree that we should trust principals more to do their own selection, but ultimately my point is this: the people are not the first problem. I am tired of arguing how many bad teachers there are. I agree it is non-zero, and as a graduate of DC Public Schools, I have had my fair share. (Of course, I had some bad teachers at Harvard too, but I don’t hear too many people talking about massive layoffs there, unless the Texas Public Policy Foundation has its way)

    Trying to select the best teachers and deselect the bad ones creates other problems, no matter how good your selection. Error that may be acceptable in the economic model can have demoralizing effects at the school level. On the other hand, I think efforts to improve the profession as a whole, like increased mentorship, increased classroom support, more planning periods, more equitable resources and smaller classes for the neediest students, and more freedom, ultimately have beneficial effects on selection by both making teachers more likely to stay in the profession and drawing more teachers to the profession (note that salary and merit pay are not included here). Granted, these changes take a little longer to take effect, but I think patience would be worth it.

    Reading your other stuff, I think you are on board with this, but I thought I would include that. I really appreciate you taking the time (both on Twitter and here) to riff on this point.

    I’ll add that I am sorry people feel that “eugenics” is simply shorthand for “evil.” Oliver Wendell Holmes is responsible for a great deal of progressive jurisprudence, arguing that the law should reflect society, and move with the times (not exactly Scalia, he). From Wikipedia: “Francis Biddle writes: ‘He was convinced that one who administers constitutional law should multiply his skepticisms to avoid heading into vague words like “liberty”, and reading into law his private convictions or the prejudices of his class.’” That he failed to apply this skepticism to his own attitude on eugenics should not cause us to dismiss him (and his fellow eugenicists) entirely.

  2. I agree entirely with your point that error that may be acceptable in the economic model may have demoralizing effects in real organizations with real people. I have written about this before. The economic models do not account for the potential demoralizing effects of teachers having to recognize that they have limited control over their fate/careers/job security. Having nearly random (as with VAM) firings occur with regularity sends one form of those signals I mentioned above in that two-sided matching discussion. Clearly that negative signal will alter who applies to teach – unless counterbalanced strongly with other factors such as dramatically increased pay or other benefits/amenities to offset the risk, perception of risk, or daily gut-wrenching uncertainty of whether statistical error alone will get you fired before next year. Yet, the usual simulations assume for simplicity that the average applicant quality will remain constant, even if the average salary is not changed. It’s just easier to simulate that way. But it’s not real. It can’t be applied in the real world with an expectation of the same result.

  3. Five thoughts:

    1. I like the designation “reformy pundit” and wish I could make it my Twitter handle.

    2. I’m glad Cedar commented on this post, Bruce, because he’s now made explicit what I was arguing with you via Twitter. Namely, Cedar disagrees with *all* selection, not just “deselection” of the bottom 8% or 5% or whatever, as you kept insisting. Which is kind of a weird argument to be having anyway, for two reasons. First, as you rightly observe, there aren’t any districts that automatically fire some set percentage of teachers (and thank goodness). DC’s IMPACT system, perhaps the most well-known and controversial teacher evaluation system, requires termination of teachers rated “ineffective,” which last year resulted in 2% being dismissed. And before anyone screams about the invalidity of VAM models, note that for 86% of DC public school teachers who don’t teach math or science, IMPACT is based almost entirely on classroom observations. Recommended reading: http://www.educationsector.org/publications/inside-impact-dcs-model-teacher-evaluation-system

    Second, Cedar’s contention that “selection is bound to fail” is difficult to square with what we know (or think we know) from studying other high performing education nations, including everyone’s favorite “non reformy” country, Finland. Without exception, they all use selectivity to drive talent into the teaching profession. I’d even go further to suggest that one major reason the US education system is in relative decline is that we’ve never figured out how to replace the extraordinary talent drain resulting from women entering the professional workforce over the last 30-40 years. In any event, I agree with the radical reformy pundit who’s previously said that “I believe the recruiting, training and hiring system is broken,” and that “if you build quality at the front door, then no one is there who shouldn’t be there.” (NEA President Dennis Van Roekel)

    3. I think it’s useful to compare and contrast your post, Bruce, with the one Cedar wrote. You lay out the three alleged errors of eugenicists and then discuss how they apply to education/selectivity; agree or disagree, there’s nothing unreasonable about your argument. Cedar, in contrast, went pretty deep into the history of the eugenics movement, complete with references to forced sterilization. Although he later walked back a few of his statements in the comments to his post, it’s hard to wipe up mud after it’s already been thrown. If I wrote a blog post stating that Diane Ravitch was a modern-day eugenicist because she, like the eugenicists, believes that there are certain intractable qualities (IQ, poverty) that render all other social policy efforts moot, she’d go bananas — and rightfully so! Cedar’s eugenics metaphor is more prejudicial than probative and I for one think he should be ashamed of himself, particularly since his other writings (I’ve read his blog) suggest he’s smart enough to know better.

    4. If we’re going to talk about demoralizing effects on the teaching profession, how about labor contracts (or state laws) that require quality-blind layoffs, and provide for lock-step compensation regardless of effectiveness? What pernicious effects might that have on attracting and keeping talented young professionals as teachers?

    5. On the boring invocation of Campbell’s “Law” as a defense to any change to education policy, I refer you to my extended Twitter debate with @CohenD over the past two days, and in particular what I have now coined as the White Stripes Law: “I didn’t rob a bank because you made up the law — you can’t take the effect and make it cause!”

    — Ben Riley

    1. I discuss the issues of the relative negative consequences of seniority versus VAM based layoffs here: https://schoolfinance101.wordpress.com/2011/01/12/thinking-through-cost-benefit-analysis-and-layoff-policies/

      While I do see some downside to current pay structures – having been a young teacher in a public school at one time – I see the potential for much worse in rigidly applied value-added metrics. This is largely driven by the uncertainty and inability to control one’s fate. As much as being certain that what you think is a wrong decision (seniority based) might be made, the uncertainty and randomness of the present alternatives is potentially much worse. Not to mention the potential “benefits” are actually quite negligible even if we ignore these issues.

      As for your Twitter debate with David Cohen: in that case, I also felt that you were twisting the conversation to make his argument overly broad. As I read it, he was arguing that if one ties financial or other incentives like merit pay to the measures, it becomes far more likely that the measures become both overemphasized and corrupted (Campbell). You seemed to be implying that he was against using measures at all… for accountability or system monitoring included. Those are two very different things – incentives and high stakes versus accountability and monitoring more generally. Obviously measures have their place. We’d have no idea how well some kids are doing and how poorly other kids are doing on selected skills/activities/knowledge/content without them. That is important to know, albeit limited information. And measures can reasonably be used for accountability (in a broad sense) and system monitoring. Used appropriately, perverse incentives can be avoided. But when monetary awards are attached to these measures, or when teachers’ job security or school closure becomes attached to these measures, they are more likely to be corrupted… especially when measures are constructed or applied poorly.

      I would argue that one reason we see such strong selection in some charter schools (often intentional through push out or creative use of admissions/placement tests) is that we’ve created accountability measures that provide little incentive for charter operators to try to serve especially needy children well. We’ve set up systems that push them into these behaviors. Good research studies account for these differences, but media praise, gubernatorial and presidential visits often occur simply because of high average pass rates or scores. And scrutiny and closure because of low average pass rates or scores – in spite of the fact that any reasonable analyst knows that the average pass rates and scores are heavily influenced by the population served. Thus the incentive to do the wrong thing, even by schools which started and largely continue to operate with the best of intentions.

      Teachers chasing high personal ratings may also exhibit undesirable behaviors, from the obviously objectionable flat out cheating to more subtle, likely more widespread, less “obviously wrong” behaviors such as avoiding teaching the difficult kids – kids who are a disruptive pain but who would otherwise not have a specific variable attached to them that would be accounted for in the VA model. Teachers might avoid at all costs teaching in the highest need DC schools (where their model appears to significantly disadvantage teachers in those settings), and the teachers with the greatest ability to go elsewhere might be the ones we’d really like to get into those schools. These aren’t cheating behaviors, and it’s hard to really call them “wrong” from a personal perspective, but they are still the wrong incentives – and incentives driven by inappropriate use of measures.

  4. My “selection is bound to fail” comment is actually this “when we want selection to do the bulk of the work of improvement they are bound to fail no matter what metrics we use.” Despite being ungrammatical (hah! self-pedantry) this is critically different than your generalization.

    As for learning from Finland that we should be more selective, that seems to me to be absurd cherry picking. It is as if I blamed the engagement and achievement of the student body at my college all on the admissions office. If they were only more selective, I wouldn’t have kids skipping class and failing. But the admissions officer could rightly reply: “But I can only select from those that apply. Further, remember that fantastic merit scholarship student who transferred to UVA? Whose fault is that?” The recruiting, hiring, and training process might be broken, but “fixing” teacher recruitment and training is a lot more than elevating the numbers and status of TFA. You can only recruit so much when the working conditions are crappy. Test-based merit pay does not = better working conditions.

    Regarding the demoralizing effects paragraph, if you really feel that teachers are demoralized because when they have an effective or hard-working year, they don’t get a bigger raise than their colleagues, you really haven’t met many teachers, or looked at what teachers say motivates them on surveys (or looked at the research on merit pay programs). I have lived with teachers most of my life, and this has rarely been a complaint.

    Regarding IMPACT, I am well aware that it is based mostly on classroom observations by “master teachers”. Did you know that at least one of these “master teachers” was in fact denied a job by a principal in DCPS? He reported failing to hire a teacher, and then seeing her the next year evaluating his teachers. I also know that it does NOT include any feedback on written work, for one. To me, this is simply bizarre. That it included pseudoscientific learning styles is another strike against it in my book.

    Regarding the “replacing the talented women” meme: this has also always struck me as bizarre, because isn’t this the exact cohort of people whose ineffectiveness is apparently shielded by LIFO? On one hand, 30 years ago we could get talented women teachers because of sexism. On the other, look at all these complacent teachers with 30 years experience which doesn’t matter anyways. We should get rid of some of them and replace them with high achievers!

    Finally, this is not a yes or no debate. Diane Ravitch doesn’t believe poverty is intractable, or that we shouldn’t do anything in education until we deal with poverty. She thinks we should temper our expectations when we have kids who go hungry, or have lead poisoning, etc. I don’t believe that we should swing the school doors open and let anyone with a pulse in the classroom. We believe that we should take the complexities of the system into account, and not expect any one lever to provide a miraculous solution. I am not that familiar with your opinions, so I can’t speak for you, but if you think that I am making a straw man out of Joel Klein, Michelle Rhee and Arne Duncan, you should read that Manifesto, and look at what Rhee actually did in DC. Look at what she did with the budget. Look at what she did with the principals at Shepherd and Oyster. Look at what she did at Hardy. Look at how she was pictured on the cover of Time. What (or who) needs sweeping? I think she was painfully clear about that. So I don’t really care if a little mud from little old me gets on the bottoms of reformy shoes as they stomp all over people I respect.

  5. I’ll add one question for Bruce (and for you, Benjamin, if you know).
    I have read about the relative share of a teacher’s evaluation score attributed to VAM, but I am not sure that is the most relevant index. If the share of the score is 40%, but the share of the _variance_ is much larger, then the test has far more influence than the 40% number indicates. Dan Ariely’s post reminded me of this: http://danariely.com/2011/07/19/teachers-cheating-and-incentives/

    I ask this because I know some of the elements on the IMPACT checklist are checked off by everyone, and some actually show variation. If we are grading on any kind of curve (and many of them are), then the absolute percentage doesn’t matter, but the variance does. So I would be curious to see an item by item accounting of variation in the IMPACT checklist, or if anyone has done anything similar in other evaluation metrics.

    1. As I’ve explained on my blog previously, if the relative share of evaluation assigned to VAM is 40%, and the VAM scores vary widely but the other measures making up other shares do not vary as widely, then VAM becomes the determining factor in teachers’ relative rankings/ratings. That’s an incredibly important point.

      Similar problems persist in faculty merit evaluations where teaching accounts for 40%, service 20% and research/scholarly publication for 40%. The weighting scheme suggests that teaching and research are equally weighted. But, if you are in a department where faculty members invariably get student survey teacher ratings that vary only from 4.5 to 5.0 (limited variation) thus all faculty get a “very good” or “exceptional” on their teaching 40%, BUT… scholarly productivity varies widely, with ratings from “poor” to “exceptional,” then the entire merit pay distribution is driven by scholarly productivity. My old dept. at Kansas used this type of approach… which worked fine for me (but for the generally lousy KU pay). My new dept uses more loose judgment… and it seems to work well.
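The point about nominal weight versus effective weight can be put in rough numbers. What follows is only a sketch under stated assumptions: the two components are treated as uncorrelated, the weights mirror the 40% example above, and the spreads are invented for illustration rather than taken from IMPACT or any real evaluation system. The function name `variance_shares` is likewise hypothetical.

```python
# Illustrative sketch only: the weights and spreads below are made-up numbers,
# not figures from IMPACT or any real evaluation system.

def variance_shares(weights, sds):
    """Share of the composite score's variance contributed by each component.

    Assumes the components are uncorrelated, so the variance of the weighted
    sum is the sum of (weight * standard deviation) squared.
    """
    contributions = [(w * s) ** 2 for w, s in zip(weights, sds)]
    total = sum(contributions)
    return [c / total for c in contributions]


# Nominal weights: 40% VAM, 60% everything else (e.g. observations).
weights = [0.4, 0.6]
# Hypothetical spreads on a common scale: VAM varies a lot, observations barely.
sds = [1.0, 0.2]

vam_share, other_share = variance_shares(weights, sds)
print(f"nominal VAM weight: {weights[0]:.0%}")
print(f"VAM share of variance in the composite: {vam_share:.0%}")
```

With these invented numbers, a measure carrying 40% of the nominal weight accounts for roughly nine tenths of the variation in the final score, which is why the relative rankings end up being driven by it.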

      This is also why some who are proponents of integrating VAM into evaluation systems have suggested that VAM shouldn’t be used this way. That it not be assigned a fixed share of the evaluation but rather that it be used in a sequential screening approach, as with tests for medical diagnosis. Note that I’m not totally on board with this, but it’s one hell of a lot better than the IMPACT approach or other state legislation assigning fixed shares. For example, in a diagnosis model one uses a convenient, albeit higher error rate tool for fast initial screening (like a quick strep test). VAM could be similarly used for scanning across teachers and classrooms to identify where there might be problems (or successes) which warrant further exploration – with full recognition of the amount of noise (false positives or negatives) in the measures. That further exploration would involve more intensive/expensive but (hopefully) more accurate observations and additional evaluation tools. Used this way, VAM would not be a determining factor, but rather a cheap/convenient, knowingly noisy preliminary screening tool. This is far more appropriate than the fixed share approach.

      My problem with even this approach is that I remain unconvinced that VAM is either precise or accurate enough for even these purposes. Given the year-to-year and test-to-test variation and given persistent problems of non-random sorting bias, I suspect that much time would be wasted chasing false signals, when other options might be far more effective.
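The worry about chasing false signals can be sketched with the same arithmetic used for medical screening tests. Everything here is an assumption for illustration: the share of teachers with a real problem, and the rates at which a noisy screen catches real problems or wrongly flags teachers without one, are hypothetical values, not estimates from any VAM study. The function name `screening_outcomes` is made up for this example.

```python
# Illustrative sketch only: all rates below are hypothetical, chosen to show
# the arithmetic of screening with a noisy measure, not measured properties
# of any value-added model.

def screening_outcomes(n_teachers, base_rate, sensitivity, specificity):
    """Return (true flags, false flags, precision) for a noisy screen.

    base_rate:   share of teachers who truly warrant a closer look
    sensitivity: chance the screen flags a teacher who truly does
    specificity: chance the screen clears a teacher who does not
    """
    true_problems = n_teachers * base_rate
    no_problems = n_teachers - true_problems
    true_flags = true_problems * sensitivity
    false_flags = no_problems * (1 - specificity)
    precision = true_flags / (true_flags + false_flags)
    return true_flags, false_flags, precision


true_flags, false_flags, precision = screening_outcomes(
    n_teachers=1000, base_rate=0.05, sensitivity=0.7, specificity=0.8
)
print(f"flagged with a real problem: {true_flags:.0f}")
print(f"flagged in error: {false_flags:.0f}")
print(f"share of flags that are real: {precision:.0%}")
```

Under these assumed rates, only about one flag in six points to a real problem, so most of the follow-up observation time would go to teachers who were flagged by noise alone. Different assumptions would change the numbers, but a low base rate combined with a noisy screen tends to produce this pattern.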

  6. You gentlemen are clearly much more knowledgeable about statistical analysis than I will ever be, but one issue stood out to me (as a former union representative for teachers): “[principals as] … human beings considering multiple factors … [in recommending retention of teachers].”

    Seldom taken into account is that *those* human beings (principals) often change on a regular basis. The presumption that the previous principal(s) did a good job evaluating (or did it at all) or held similar professional standards or thought about leadership the same way as the current one making the decision is vitally important to the affected teachers and the overall school environment, but is difficult to measure. The same can sometimes be said of superintendents. A new person coming into a leadership position who has very different ideas from the previous person in the job about curriculum, school discipline, teacher training and more may have a significant impact on the decision to retain a teacher (or not) even when unwise or unwarranted.

    1. Excellent point, and one reason why I’ve focused some of my recent research agenda specifically on principal stability!

    2. I agree that this is a critical, often ignored point.
      I would love to see more stability and less attrition at every level, including principals and superintendents.
      There is a learning curve to all of these jobs, and I would love to see us give people more time to learn how to get good at them, by learning the idiosyncrasies and cultures of the schools and systems in their charge.

      1. It’s a double edged sword though. You don’t want to see stable bad leadership, or stable inequitable distribution of leadership quality.

  7. True, but from my limited impressions of DCPS, we often simply exchange some flaws for others, and no one stays long enough to learn. But agreed that stable bad leadership is awful. I’d just love to see more acceptance of learning from mistakes as a natural part of leadership, rather than always an indictment of one’s incompetence. No doubt some leaders (just like some teachers) are beyond hope.

    I guess I am just an optimist about people’s ability to improve, as well as a big believer in the power of the situation. For example, as much as I dislike Michelle Rhee (can you tell?) I think she might have had to adopt a different approach had she not been given complete power (mayoral takeover) the way that she was.

  8. Nancy,

    You are entirely correct, and that point is largely missing from the teacher eval debate. Bruce and I have worked and are working on research about principal turnover. It is extremely high–especially in high-poverty and/or low-performing schools. These schools are precisely where policy makers want to focus teacher evals. But rather than spending the time, energy, effort, and MONEY to develop principals and retain them for long periods of time at the same school so that effective evaluations and (more importantly) effective support and improvement efforts can take place, policymakers have chosen the less expensive (maybe, maybe not) route of instituting VAMs. But even with VAMs–as Bruce astutely points out–you still need good teacher evals, which require well-trained, experienced, and stable school leaders.

    And strikingly, we know almost nothing about why principals leave schools at such a high rate. There are fewer than a handful of studies that delve deeply into the issue, and these only look at a few different places. Until we get a handle on this issue, we will never really get a handle on teacher eval and improvement.
