More Detail on the Problems of Rating Ed Schools by Teachers’ Students’ Outcomes

In my previous post, I explained that the new push to rate schools of education by the student outcome gains of teachers who graduated from those schools is a problematic endeavor… one unlikely to yield particularly useful information, and one that may create the wrong incentives for education schools. To reiterate, I laid out three reasons (and there are likely many more) why this approach is so problematic. Here, I break them out a bit further, into four:

  1. parsing out individual teachers’ academic backgrounds – that is, if teachers hold credentials and degrees from many institutions, which institution is primarily responsible for their effectiveness?
  2. the teacher workforce in most states includes a mix of teachers from a multitude of in-state and out-of-state institutions, public and private, with many of those institutions having only a handful of teachers in some states. States will not be able to evaluate all pipelines reliably. Does this mean that states should just cut off teachers from other states, or from institutions that don’t produce enough of their teachers to generate an estimate of the effectiveness of those teachers?
  3. because of the vast differences in state testing systems, and differences in the biases of those testing systems toward either higher- or lower-ability student populations (floor and ceiling effects), graduates of a given teaching college who, for example, flock to affluent suburban districts on either side of a state line might find themselves falling systematically at opposite ends of the effectiveness ratings. The differences may have little or nothing to do with actually being better or worse at delivering one state’s curriculum versus another, and may instead have everything to do with the ways in which the underlying scales of the tests bias teacher effectiveness ratings. We already know from research on value-added estimates that the same teacher may receive very different ratings on different tests, even in the same basic content area (math).
  4. and to me, this is still the big one: graduates of teaching programs are simply not distributed randomly across workplaces. This problem would perhaps be less severe if they were distributed in sufficient numbers across the various labor markets in a state, where local sample sizes would be sufficient for within-market analysis across all institutions. But teacher labor markets tend to be highly local, or regional within large states.
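
The small-sample problem in point 2 can be illustrated with a toy simulation (all numbers are invented for the sketch, not drawn from any actual state data): give every institution’s graduates the exact same true effectiveness, add only measurement noise, and the tiny pipelines still land at the extremes of the resulting ranking.

```python
import random
import statistics

random.seed(1)

# Hypothetical illustration: every institution's graduates share the SAME
# true effectiveness (0); only measurement noise differs across teachers.
NOISE_SD = 0.20  # per-teacher noise in the value-added estimate, in test SD units

# 20 "institutions": three large in-state producers, seventeen tiny pipelines
sizes = [300, 250, 200] + [3] * 17

# Institution rating = mean of its graduates' noisy value-added scores
ratings = sorted(
    (statistics.mean(random.gauss(0, NOISE_SD) for _ in range(n)), n)
    for n in sizes
)

print("lowest-rated institution (rating, # teachers):", ratings[0])
print("highest-rated institution (rating, # teachers):", ratings[-1])
# The extremes are nearly always the n=3 pipelines: pure sampling noise,
# since every institution is equally effective by construction.
```

A state agency reading this ranking at face value would “discover” a best and a worst pipeline, even though no real differences exist.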

I showed previously how the rates of children qualifying for free or reduced-price lunch vary significantly across the schools employing graduates of Kansas teacher preparation programs:

Racial composition varies as well:

But perhaps most importantly, the above two charts are merely indicative of the fact that the overall geographic distribution of teacher prep program graduates varies widely. Some are in low-income, remote rural settings with very small class sizes, while others are near the urban core of Kansas City, either in sprawling low-poverty suburbs or in the very poor, relatively population-dense inner urban fringe. Making legitimate comparisons of the relative effectiveness of teachers across these widely varied settings is a formidable task for even the most refined value-added model, and even that may be too optimistic.

Here’s the geographic distribution of teacher graduates of the major public teacher preparation institutions in Kansas:

The Kansas City suburbs in this figure are covered in red dots (KU), purple (K-State), and orange (Emporia State), along with a significant number of blue ones (Pitt State). Western Kansas is dominated by green dots (Fort Hays State) and southeast Kansas by blue ones (Pitt State). Wichita is dominated by black dots (Wichita State). Nearly all of these clusters are local/regional, around the locations of the universities. Certainly, much of the distribution also depends on demand for teachers, where the greatest growth has been in the Kansas City suburbs to the south and west (out toward Lawrence, home to KU).

Here it is peeled back. First KU:

Next K-State:

Wichita State:

Fort Hays State:

Pittsburg State:

Emporia State:

Even if we assume that value-added models could be an effective tool for a) rating teacher effectiveness and b) aggregating that teacher effectiveness to their preparation institutions, it is a stretch to assume that we could find any reasonable way to reliably and validly compare the effectiveness of the graduates of these public institutions, given that they are clustered in such vastly different educational settings – with widely varied resource levels, widely varied class sizes, kids who sit on buses for widely varied amounts of time, widely varied poverty levels, immigration patterns, and numerous other factors (it’s that other “unobservable” stuff that really complicates things!). The only reasonable statistical solution would be to randomly assign graduates of Kansas teacher preparation programs to Kansas schools upon graduation.
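
Why random assignment is the statistical fix can also be sketched with a toy simulation (again, all parameter values here are invented for illustration): two institutions produce equally effective teachers, but one institution’s graduates sort into high-poverty schools that carry an unmodeled drag on measured gains. Under sorted placement the institutions look different; under random assignment the apparent gap vanishes.

```python
import random
import statistics

random.seed(2)

N = 500                    # graduates per institution (hypothetical)
TRUE_EFFECT = 0.0          # both institutions produce equally effective teachers
SCHOOL_PENALTY = -0.10     # unmodeled drag on measured gains in high-poverty schools
NOISE_SD = 0.05            # per-teacher measurement noise

def measured_gain(in_high_poverty):
    """A graduate's measured value-added: true effect + school context + noise."""
    context = SCHOOL_PENALTY if in_high_poverty else 0.0
    return TRUE_EFFECT + context + random.gauss(0, NOISE_SD)

# Sorted placement: institution A's graduates cluster in high-poverty schools,
# institution B's graduates cluster in affluent ones
a_sorted = [measured_gain(True) for _ in range(N)]
b_sorted = [measured_gain(False) for _ in range(N)]

# Random assignment: each graduate lands in either setting with equal chance
a_random = [measured_gain(random.random() < 0.5) for _ in range(N)]
b_random = [measured_gain(random.random() < 0.5) for _ in range(N)]

gap_sorted = statistics.mean(b_sorted) - statistics.mean(a_sorted)
gap_random = statistics.mean(b_random) - statistics.mean(a_random)
print(f"apparent gap, sorted placement:  {gap_sorted:.3f}")  # ~0.10, entirely bias
print(f"apparent gap, random assignment: {gap_random:.3f}")  # ~0.00
```

The sorted-placement gap is pure school-context bias masquerading as an institutional effect, which is exactly what no amount of extra data on the same nonrandomly placed graduates can cure.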

As I noted in my previous post, I’m not entirely opposed to exploring our ability to generate useful information by testing statistical models of teacher effectiveness aggregated in this way (to preparation institutions or pipelines). It is certainly more reasonable to use this information in the aggregate for “program evaluation” purposes than for rating individual teachers. But even then, I remain skeptical that these data will be of any particular use either for state agencies in determining which institutions should or should not be producing teachers, or for the institutions themselves. It is a massive leap, for example, to assume that a teacher preparation institution might be able to look at the value-added ratings based on the performance of students of their graduates, and infer anything from those ratings about the programs and courses their graduates took as they pursued their undergraduate (or graduate) degrees. Though again, I’m not opposed to seeing what, if anything, one can learn in this regard.

What would be particularly irresponsible – and what is actually being recommended – is to accept this information as necessarily valid and reliable (which it is highly unlikely to be) and to mandate the use of this information as a substantial component of high stakes decisions about institutional accreditation.



  1. Bruce,
    Great posts on Teacher Ed and VAM. I was wondering, do you have a sense of how much the known reliability issues (e.g., 35% error rate in evaluating teachers using 1 year’s worth of test data – as you’ve written about before) become amplified when you bring it up and out into rating teacher education programs?

    1. Actually, at higher levels of aggregation, some of those issues should be less problematic. The error rates are likely highest when trying to stick a rating or classification on any one individual teacher, and the fewer the students for that teacher, the fewer the class sections and years, the higher the error. So, aggregation helps some of that. But, when the aggregations are of such imbalanced populations, spread so unevenly across settings with a multitude of unobservable differences, the potential for bias in the ratings is huge.

  2. This model also assumes that a teacher’s skills are solely the result of accredited programs, which definitely wasn’t true in my case. I learned most from my 20 years in sales.
