Take your SGP and VAMit, Damn it!

In the face of all of the public criticism over the imprecision of value-added estimates of teacher effectiveness, and debates over whether newspapers or school districts should publish VAM estimates of teacher effectiveness, policymakers in several states have come up with a clever shell game. Their argument?

We don’t use VAM… ‘cuz we know it has lots of problems, we use Student Growth Percentiles instead. They don’t have those problems.

WRONG! WRONG! WRONG! Put really simply, as a tool for inferring which teacher is “better” than another, or which school outperforms another, SGP is worse, not better, than VAM. This is largely because SGP is simply not designed for this purpose. And those who are now suggesting that it is are simply wrong. Further, those who actually support using tools like VAM to infer differences in teacher quality or school quality should be most nervous about the newfound popularity of SGP as an evaluation tool.

To a large extent, the confusion over these issues was created by Mike Johnston, a Colorado State Senator who went on a road tour last year pitching the Colorado teacher evaluation bill and explaining that the bill was based on the Colorado Student Growth Percentile Model, not that problematic VAM stuff. Johnston naively pitched to legislators and policymakers throughout the country that SGP is simply not like VAM (True) and that therefore, SGP is not susceptible to all of the concerns that have been raised based on rigorous statistical research on VAM (Patently FALSE!). Since that time, Johnston’s rhetoric that SGP gets around the perils of VAM has been widely adopted by state policymakers in states including New Jersey, and these state policymakers’ understanding of SGP and VAM is hardly any stronger than Johnston’s.

This brings me back to my exploding car analogy. I’ve pointed out previously that if we lived in a society where pretty much everyone still walked everywhere, and then someone came along with this new automotive invention that was really fast and convenient, but had the tendency to explode on every third start, I think I’d walk. I use this analogy to explain why I’m unwilling to jump on the VAM bandwagon, given the very high likelihood of falsely classifying a good teacher as bad and putting their job on the line – a likelihood of misfire that has been validated by research.  Well, if some other slick talking salesperson (who I refer to as slick Mikey J.) then showed up at my door with something that looked a lot like that automobile and had simply never been tested for similar failures, leading the salesperson to claim that this one doesn’t explode (for lack of evidence either way), I’d still freakin’ walk! I’d probably laugh in his face first. Then I’d walk.

Origins of the misinformation aside, let’s do a quick walk-through of how and why, when it comes to estimating teacher effectiveness, SGP is NOT immune to the various concerns that plague value-added modeling. In fact, it is potentially far more susceptible to specific concerns such as the non-random assignment of students and the influence of various student, peer and school level factors that may ultimately bias ratings of teacher effectiveness.

What is a value-added estimate?

A value-added estimate uses assessment data in the context of a statistical model, where the objective is quite specifically to estimate the extent to which a student having a specific teacher or attending a specific school influences that student’s difference in score from the beginning of the year to the end of the year – or period of treatment (in school or with teacher). The best of VAMs attempt to account for several prior year test scores (to account for the extent that having a certain teacher alters a child’s trajectory), classroom level mix of students, individual student background characteristics, and possibly school characteristics. The goal is to identify most accurately the share of the student’s value-added that should be attributed to the teacher as opposed to all that other stuff (a nearly impossible task).
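
For readers who think better in code, here is a deliberately stripped-down sketch of that logic. It is not any state's actual model: it uses a single prior score and no student, peer or school covariates, and the data, teacher labels and effect sizes are all made up for illustration. It predicts each student's end-of-year score from the prior score, then treats the average leftover (actual minus predicted) in each teacher's class as that teacher's "effect."

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def teacher_effects(records):
    """records: list of (teacher, prior_score, post_score).
    Returns {teacher: mean(actual post - predicted post)}."""
    a, b = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = {}
    for teacher, prior, post in records:
        residuals.setdefault(teacher, []).append(post - (a + b * prior))
    return {t: sum(v) / len(v) for t, v in residuals.items()}

# Hypothetical data: teacher "B" truly adds 5 points, teacher "A" adds 0.
rng = random.Random(0)
records = []
for teacher, true_effect in (("A", 0.0), ("B", 5.0)):
    for _ in range(200):
        prior = rng.gauss(50, 10)
        post = 10 + 0.9 * prior + true_effect + rng.gauss(0, 5)
        records.append((teacher, prior, post))

effects = teacher_effects(records)
# Effects are relative to the average teacher, so they sum to about zero
# and only the gap between teachers is meaningful.
print({t: round(e, 2) for t, e in effects.items()})
```

In this toy setup students are assigned to teachers at random, so the gap between the two estimates lands near the true 5 points. Everything that makes real VAMs hard (non-random assignment, unmeasured student differences, noisy tests) has been assumed away here.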

What is a Student Growth Percentile?

To oversimplify a bit, a student growth percentile is a measure of the relative change of a student’s performance compared to that of all students and based on a given underlying test or set of tests. That is, the individual scores obtained on these underlying tests are used to construct an index of student growth, where the median student, for example, may serve as a baseline for comparison. Some students have achievement growth on the underlying tests that is greater than the median student, while others have growth from one test to the next that is less (not how much the underlying scores changed, but how much the student moved within the mix of other students taking the same assessments, using a method called quantile regression to estimate the rarity that a child falls in her current position in the distribution, given her past position in the distribution).  For more precise explanations, see: http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf
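
To make the idea concrete, here is a rough sketch of a growth percentile. Big caveat: actual SGP implementations use quantile regression, as noted above. This sketch swaps in a much cruder stand-in, which is to group students into bins by prior score and rank each student's current score among the peers in the same bin. The function name, bin count and simulated cohort are all assumptions made for illustration.

```python
import random

def growth_percentiles(students, n_bins=10):
    """students: list of (student_id, prior_score, current_score).
    Returns {student_id: percentile from 1 to 99} of the current score
    among students who started in the same prior-score bin.
    Binning is a simplification; real SGPs use quantile regression."""
    ordered = sorted(students, key=lambda s: s[1])
    bin_size = max(1, len(ordered) // n_bins)
    result = {}
    for start in range(0, len(ordered), bin_size):
        peers = ordered[start:start + bin_size]
        for sid, _, current in peers:
            below = sum(1 for p in peers if p[2] < current)
            pct = int(100 * (below + 0.5) / len(peers))
            result[sid] = min(99, max(1, pct))
    return result

# Hypothetical cohort of 1,000 students with made-up score distributions.
rng = random.Random(7)
cohort = []
for sid in range(1000):
    prior = rng.gauss(50, 10)
    current = 5 + prior + rng.gauss(0, 5)
    cohort.append((sid, prior, current))

sgp = growth_percentiles(cohort)
```

Notice what is missing: nothing in this calculation knows who the teacher was, what the classroom looked like, or anything about the student other than two test scores. It is a ranking among academic peers, nothing more.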

So, on the one hand, we’ve got Value-Added Models, or VAMs, which attempt to construct a model of student achievement, and to estimate specific factors that may affect student achievement growth, including teachers, schools, and ideally controlling for prior scores of the same students, characteristics of other students in the same classroom and school characteristics. The richness of these various additional controls plays a significant role in limiting the extent to which one incorrectly assigns either positive or negative effects to teachers. Briggs and Domingue run various alternative scenarios to this effect here: http://nepc.colorado.edu/publication/due-diligence

On the other hand, we have a seemingly creative alternative for descriptively evaluating how one student’s performance over time compares to the larger group of students taking the same assessments. These growth measures can be aggregated to the classroom or school level to provide descriptive information on how the group of students grew in performance over time, on average, as a subset of a larger group. But, these measures include no attempt at all to attribute that growth or a portion of that growth to individual teachers or schools. That is, they do not sort out the extent to which that growth is a function of the teacher, as opposed to being a function of the mix of peers in the classroom.

What do we know about Value-added Estimates?

  • They are susceptible to non-random student sorting, even though they attempt to control for it by including a variety of measures of student level characteristics, classroom level and peer characteristics, and school characteristics. That is, teachers who persistently serve more difficult students, students who are more difficult in unmeasured ways, may be systematically disadvantaged.
  • They produce different results with different tests or different scaling of different tests. That is, a teacher’s rating based on her students’ performance on one test is likely to be very different from that same teacher’s rating based on her students’ performance on a different test, even of the same subject.
  • The resulting ratings have high rates of error for classifying teacher effectiveness, likely in large part due to error or noise in underlying assessment data and conditions under which students take those tests.
  • They are particularly problematic if based on annual assessment data, because these data fail to account for differences in summer learning, which vary widely by student backgrounds (where those students are non-randomly assigned across teachers).
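
The classification error point in the list above is easy to see in a toy simulation. This is a sketch under invented assumptions, not a replication of any published error rate: true teacher effects are drawn from a standard normal distribution, each estimate is the true effect plus random noise, and the bottom 20% by estimate get flagged. The function name and all parameter values are hypothetical.

```python
import random

def false_flag_rate(n_teachers=2000, noise_sd=1.0, cut=0.2, seed=1):
    """Share of teachers flagged as bottom-`cut` by their noisy estimate
    who are NOT truly in the bottom `cut` by their true effect."""
    rng = random.Random(seed)
    true = [rng.gauss(0, 1) for _ in range(n_teachers)]
    est = [t + rng.gauss(0, noise_sd) for t in true]
    k = int(n_teachers * cut)
    truly_low = set(sorted(range(n_teachers), key=lambda i: true[i])[:k])
    flagged = sorted(range(n_teachers), key=lambda i: est[i])[:k]
    wrong = sum(1 for i in flagged if i not in truly_low)
    return wrong / k

# Compare no noise, modest noise, and noise as large as the true spread.
rates = {sd: false_flag_rate(noise_sd=sd) for sd in (0.0, 0.25, 1.0)}
print(rates)
```

With no noise nobody is flagged wrongly, and the share of wrongly flagged teachers climbs as the noise grows relative to the real differences among teachers. The same arithmetic applies to any rating built on the same noisy test data, whatever the rating is called.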

What do we know, and what don’t we know, about SGP?

  • They rely on the same underlying assessment data as VAMs, but simply re-express performance in terms of changes in relative growth rather than the underlying scores (or rescaled scores).
    • They are therefore susceptible to at least equal classification error concerns.
    • Therefore, it is reasonable to assume that using different underlying tests may result in different normative comparisons of one student to another.
    • Therefore, they are equally problematic if based on annual assessment data.
  • They do not even attempt (because it’s not their purpose) to address non-random sorting concerns or other student and peer level factors that may affect “growth.”
    • Therefore, we don’t even know how badly these measures are biased by these omissions. Researchers have not tested this because it is presumed that these measures don’t attempt such causal inference.
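
Since nobody seems to have tested this with real data, here is a purely hypothetical simulation of how the sorting problem could play out. Two classrooms have teachers who contribute exactly the same amount, but the students in classroom B carry an unmeasured 3-point drag on growth (think summer learning loss). Growth percentiles are computed with a crude binning stand-in for the quantile regression that real SGPs use. Every number here is an assumption chosen for illustration.

```python
import random
from statistics import median

def percentile_within_peers(current, peer_scores):
    """Percentile rank of `current` among peers with similar prior scores."""
    below = sum(1 for s in peer_scores if s < current)
    return 100 * (below + 0.5) / len(peer_scores)

rng = random.Random(42)
students = []  # (classroom, prior_score, current_score)
for classroom, unmeasured_drag in (("A", 0.0), ("B", -3.0)):
    for _ in range(500):
        prior = rng.gauss(50, 10)
        # Both teachers contribute exactly the same (zero) effect.
        current = prior + 5 + unmeasured_drag + rng.gauss(0, 5)
        students.append((classroom, prior, current))

# Rank each student against the 100 students closest in prior score.
ordered = sorted(students, key=lambda s: s[1])
sgp_by_class = {"A": [], "B": []}
for start in range(0, len(ordered), 100):
    peers = ordered[start:start + 100]
    scores = [p[2] for p in peers]
    for classroom, _, current in peers:
        sgp_by_class[classroom].append(percentile_within_peers(current, scores))

median_sgp = {c: median(v) for c, v in sgp_by_class.items()}
print({c: round(m, 1) for c, m in median_sgp.items()})
```

The two teachers are identical by construction, yet classroom A ends up with a clearly higher median growth percentile. Read that median as a teacher rating and teacher B gets penalized for something entirely outside the classroom.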

Unfortunately, while SGPs are becoming quite popular across states including Massachusetts, Colorado and New Jersey, and SGPs are quickly becoming the basis for teacher effectiveness ratings, there doesn’t appear to be a whole lot of specific research addressing these potential shortcomings of SGPs. Actually, there’s little or none! This dearth of information may occur because researchers exploring these issues assume it to be a no-brainer that if VAMs suffer classification problems due to random error, then so too would SGPs based on the same data. If VAMs suffer from omitted variables bias, then SGP would be even more problematic, since it includes no other variables. Complete omission is certainly more problematic than partial omission, so why even bother testing it?

In fact, Derek Briggs, in a recent analysis in which he compares the attributes of VAMs and SGPs, explains:

We do not refer to school-level SGPs as value-added estimates for two reasons. First, no residual has been computed (though this could be done easily enough by subtracting the 50th percentile), and second, we wish to avoid the causal inference that high or low SGPs can be explained by high or low school quality (for details, see Betebenner, 2008).

As Briggs explains and as Betebenner originally proposed, SGP is essentially a descriptive tool for evaluating and comparing student growth, including descriptively evaluating growth in the aggregate. But, it is not by any stretch of the imagination designed to estimate the effect of the school or the teacher on that growth.

Again, Briggs in his conclusion section of his analysis of relative and absolute measures of student growth explains:

However, there is an important philosophical difference between the two modeling approaches in that Betebenner (2008) has focused upon the use of SGPs as a descriptive tool to characterize growth at the student-level, while the LM (layered model) is typically the engine behind the teacher or school effects that get produced for inferential purposes in the EVAAS. (value-added assessment system) http://dirwww.colorado.edu/education/faculty/derekbriggs/Docs/Briggs_Weeks_Is%20Growth%20in%20Student%20Achievement%20Scale%20Dependent.pdf

To clarify for the non-researchers and non-statisticians, what Briggs means in his reference to “inferential purposes” is that SGPs, unlike VAMs, are not even intended to “infer” that the growth was caused by differences in teacher or school quality. Briggs goes further to explain that overall, SGPs tend to be higher in schools with higher average achievement, based on Colorado data. Briggs explains:

These result suggest that schools that higher achieving students tend to, on average, show higher normative rates of growth than schools serving lower achieving students. Making the inferential leap that student growth is solely caused by the school and sources of influence therein, the results translate to saying that schools serving higher achieving students tend to, on average, be more effective than schools serving lower achieving students. The correlations between median SGP and current achievement are (tautologically) higher reflecting the fact that students growing faster show higher rates of achievement that is reflected in higher average rates of achievement at the school level.

Again, the whole point here is that it would be a leap, a massive freakin’ unwarranted leap, to assume a causal relationship between SGP and school quality without building the SGP into a model that more precisely attempts to distill that causal relationship (if any).

It’s a fun and interesting paper and one of the few that addresses SGP and VAM together, but intentionally does not explore the questions and concerns I pose herein regarding how the descriptive results of SGP would compare to a complete value added model at the teacher level, where the model was intended for estimating teacher effects. Rather, Briggs compares the SGP findings only to a simple value-added model of school effects with no background covariates,[1] and finds the two to be highly correlated. Even then Briggs finds that the school level VAM is less correlated with initial performance level than is the SGP (where that correlation is discussed above).

So then, where does all of this techno-babble bring us? It brings us to three key points.

  1. First, there appears to be no analysis of whether SGP is susceptible to the various problems faced by value-added models, largely because credible researchers (those not directly involved in selling SGP to state agencies or districts) consider it to be a non-issue. SGPs were never meant, nor are they designed, to actually measure the causal effect of teachers or schools on student achievement growth. They are merely descriptive measures of relative growth and include no attempt to control for the plethora of factors one would need to control for when inferring causal effects.
  2. Second, and following from the first, it is quite likely that if one did conduct these analyses, one would find that SGPs produce results that are much more severely biased than more comprehensive VAMs and that SGPs are at least equally susceptible to problems of random error and other issues associated with test administration (summer learning, etc.).
  3. Third, and most importantly, policymakers are far too easily duped into making really bad decisions with serious consequences when it comes to complex matters of statistics and measurement. While SGPs are, in some ways, substantively different from VAMs, they sure as heck aren’t better or more appropriate for determining teacher effectiveness. That’s just wrong!

And this is only an abbreviated list of the problems that bridge both VAM and SGP and more severely compromise SGP. Others include spillover effects (the fact that one teacher’s scores are potentially affected by other teachers on his/her team serving the same students in the same year), and the fact that only a handful of teachers (10 to 20%) could be assigned SGP scores, requiring differential contracts for those teachers and creating a disincentive to teach core content in elementary and middle grades. Bad policy is bad policy. And this conversation shift from VAM to SGP is little more than a smokescreen intended to substitute a potentially worse, but entirely untested, method for a method whose serious flaws are now well known.


Note: To those vendors of SGP (selling this stuff to state agencies and districts) who might claim my above critique to be unfair, I ask you to show me the technical analyses conducted by a qualified, fully independent third party that show that SGPs are not susceptible to non-random assignment problems, that they miraculously negate bias resulting from differences in summer learning even when using annual test data, that they have much lower classification error rates when assigning teacher effectiveness ratings, that teachers receive the same ratings regardless of which underlying tests are used and that one teacher’s ratings are not influenced by the other teachers of the same students. Until you can show me a vast body of literature on these issues specifically applied to SGP (or even using SGP as a measure within a VAM), comparable to that already in existence on more complete VAM models, don’t waste my time.

[1] Noting: “while the model above can be easily extended to allow for multivariate test outcomes (typical of applications of the EVAAS by Sanders), background covariates, and a term that links school effects to specific students in the event that students attend more than one school in a given year (c.f., Lockwood et al., 2007, p. 127-128), we have chosen this simpler specification in order to focus attention on the relationship between differences in our choice of the underlying scale and the resulting schools effect estimates.”


29 thoughts on “Take your SGP and VAMit, Damn it!”

  1. Both “VAM” and “SGM” have the same logic:

    Teacher Effect = [Average post-test] – [Average predicted post-test]

    The question for both is: how good is the prediction of the post-test?

    Traditional “VAM” uses linear prediction based on pretest and student background characteristics, but it can be estimated using multiple lag pretests or non-linear transformations (e.g. pretest squared, piecewise linear segments of pretest, etc.). In other words, we can try to model the growth process mathematically to capture as much of the nuance as possible.

    What you’re calling “SGM”, if I understand it correctly, uses exact match of pretest to predict post-test. It’s not obvious how that would be an improvement except with regard to the high degree of nonlinearity in the relationship between pretest and posttest (scaling/growth properties of the test). It should be a relatively weaker predictor near the tails of the distribution, i.e. teachers who work with exceptionally high or exceptionally low achievers, because there are simply fewer of them, so we will have less information about “normal” or expected growth for students at those starting points.

    That having been said, the relevant question for both types of models for estimating teacher effects should be, “How useful are they for making policy decisions?” Such decisions can include low stakes (redirection of PD resources) or higher stakes (pay and promotion). Degree of influence can be higher or lower based on the reliability and the consequences of incorrect decisions. It doesn’t have to be a black and white “use it” or “don’t use it”. See here for more discussion.

    1. I tend to agree with your assumptions above, and you have far more background and experience with this stuff than I do. I think your point about the comparisons at the tails is particularly relevant in part because I’ve heard at least some discussion among insiders working on these projects that they have found the need to try to compare students within smaller slices of the overall distribution. It also seems to me that what is essentially a rescaling exercise in SGPs would still need to be built into a VAM framework if one wished to parse out teacher or school effect on SGP. Briggs and Betebenner say as much in the paper I link. Even then, I’m not sure of the advantages of rescaling in this way, compared to options that have been tested within VAM.

      I also appreciate the thought you’ve given (in presentations I’ve seen) to potential uses of the information, including using noisy screening tools such as VAM estimates as preliminary indicators that may suggest the need for more in-depth evaluation, as opposed to trying to assign a specific weight or value (fixed and finite influence on decision) to these indicators within an evaluation system.

    2. When you consider that there is a research proven, time tested system of teacher evaluation out there, PAR, it does become a black and white issue. Go with a winner or pay big $$$$ for a known to be flawed system? Also, with VAM, schools will have to waste more time and money by checking its results to try to mitigate and correct the serious errors that will occur. Concerning PAR, Montgomery County, MD is a great example of how to do it right. Add to that the superior levels of support it affords to teachers to improve, and the way PAR allows everyone, yes unions included, to work together in removing those who should not be in a classroom, and the question of why any school system would waste their scarce funds on VAM becomes even more pointed. The only result that VAM is guaranteed to produce is that it will siphon off funds that could have been used to actually educate children and result in increased litigation when teachers that everyone knows are great fight for the jobs they were wrongly removed from, unless they throw in the towel and decide to go after better pay in the private sector. That’s just what our schools need, a brain drain. It’s time the idiots pushing VAM and other false, profit driven reforms stopped viewing our kids as cash cows and saw them as the future strength of our nation.

  2. The problem with these models is the fact that they are inferential at all. If you want to measure teacher effectiveness, you need to look at a teacher’s student population and their results and extrapolate your data from those assessments, and not use a predicted score at all. Growth percentages should be tied into actual student results, not predicted ones. Inference is to be used when you don’t know the population or its characteristics, and want to be able to make some prediction about how the population will behave, based on sample data. By making a prediction, and then holding the teacher accountable to your prediction (not their actual gains as measured through actual metrics), VAMs and SGPs are creating an unrealistic expectation. They should measure ACTUAL student growth and rank schools accordingly, after the fact.

    Take the SGP models, for instance (like the one used in Nevada). The goal is to create a baseline score that then distributes each student according to how well he/she did against that score. But this doesn’t measure “student growth”; it measures class rank. A student could achieve a 400% increase in pre-to-post test change, yet fall well below the 800% increases in growth that most other students have achieved. That 400% performance increase might not be enough to pass the state test, but it certainly shows that the teacher was effective. The SGP will not detect this change. It will only compare one student to the average, and rank them accordingly.

    I understand that there is some variability in the way these standardized tests change from year to year, and some variability in the way students perform from year to year. And I understand that the inferential methods are trying to adjust for that. But that’s exactly why these tests should not be used to measure student growth; a whole new type of assessment needs to be created (which, I think, is your point to begin with). One that does not vary in both these areas from year to year. Since it’s highly unlikely that such a test can be made, these tests should simply not be used to measure student growth at all.

    TLDR: There is a critical flaw in the flow of reasoning behind using these models: They use predictions to make determinations about whether or not a teacher WAS effective, when extrapolations about PAST performances should come from non-inferential methods.

    Thanks for (once again) sticking up for us teachers. We really appreciate knowing there is someone else out there fighting the good fight.

  3. I must also say that I have a small problem with the argument in the link provided by the commenter S. Glazerman. I would like to point out that the section labeled “Some classification errors are worse than others” makes the argument that:

    “However, an evaluation system that results in tenure and advancement for almost every teacher and thus has a very low rate of false negatives generates a high rate of false positives, i.e., teachers identified as effective who are not.”

    This is not necessarily true. Based on the figure provided, they come to the conclusion that the bar should be moved right for tenure and effectiveness, so as to remove the chance of false positives. This may be an incorrect statement. Without real data providing for the amount of overlap of the two populations, we cannot make an inference about which way to swing the “cut bar”. If the two populations have extremely separate means and small standard deviations, we may be able to swing that bar very low (read:left) and still minimize the amount of false positives and false negatives. It depends on the actual data; not some graph they included to make their point.

    1. I have more than a few small problems with that link:

      “Full-throated debate about policies such as merit pay and “last in-first out” should continue, but we should not let controversy over the uses of teacher evaluation information stand in the way of developing and improving measures of teacher performance.”

      This is, to be blunt, a cop out. You can’t build bombs and then disavow yourself from the collateral damage they cause by claiming you didn’t have any say over where they were dropped.

      There is no doubt that everyone reading this knows that good teachers will be fired if VAM, SGP, or other such systems are used in high-stakes personnel decisions. The insouciance of the authors toward this problem is deeply troubling. It is not simply mitigated by having fewer low-scoring teachers who sneak past the cut point; it is a matter of basic fairness that stains the entire profession.

      And there is no evidence I see that having a low-scoring teacher (not a bad teacher, a low-scoring one) in one year is such an enormous detriment to student learning anyway. Where is the empirical evidence that this is a greater danger than firing good teachers? In data-driven research? Seems circular to me.

      “Our message is that the interests of students and the interests of teachers in classification errors are not always congruent, and that a system that generates a fairly high rate of false negatives could still produce better outcomes for students by raising the overall quality of the teacher workforce.”

      This is pure conjecture. “Could” produce better outcomes? Please. You’re going to have to do a lot better than that if you want to advocate for a system that you freely admit is going to fire good teachers.

      I also am bothered by the casual tone the authors take toward moving the cut point, as if that were a simple matter – it is not. The assumption of standard distributions in the graph is troubling enough; to say that it is merely a matter of moving the cut point along the x-axis to get you a policy outcome you want dismisses the difficulty of determining what the tests scores tell us about student learning in the first place.

      “It is instructive to look at other sectors of the economy as a gauge for judging the stability of value-added measures. The use of imprecise measures to make high stakes decisions that place societal or institutional interests above those of individuals is wide spread and accepted in fields outside of teaching.”

      Please name one field where the work of another determines your evaluation, and you have no say in whom you choose to do this work. And where that work is a four-day test you aren’t allowed to see, and is likely being graded by a low-paid, low-skill amateur.

      Even doctors have latitude in the patients they accept, and their patients have far less control of their health outcomes than students do over their test scores. Teaching needs unique evaluations because it is a unique job.

      The fact is that VAM or SGP or whatever the latest fad is does one thing: it judges whose classes get good test scores. This slavish adherence to the cult of the bubble test is enormously bothersome. It is not the entirety of teaching – it is not even a small part of good teaching. The assumption that young children who take these tests are beneficiaries of the standardized testing regime needs to be questioned, and not merely in the abstract, data-driven language I see here.

      With all due respect to the many excellent scholars who collaborated on the Brookings paper, I think you are too far removed from those of us down in the trenches to see the implications of your ideas. You need to get out more.

      1. “Please name one field where the work of another determines your evaluation, and you have no say in whom you choose to do this work. And where that work is a four-day test you aren’t allowed to see, and is likely being graded by a low-paid, low-skill amateur.”

        Spot on. This question has been asked repeatedly all over the place and the abject absurdity of the answers will give you brain damage, if you get a direct answer at all.

        I keep asking that any business endeavor that finds a 35% (25% after 3 years of data collection) error rate in making key personnel decisions acceptable be named, and I get the same response.

        You can’t win a war by firing on your own troops.

  4. Yes, there are differences and problems with all measurements – particularly ones that are inchoate in their development. But the problem with this sort of analysis is that it does not talk about the opportunity cost of the current status quo, which is a lack of any measurement.

    I don’t support VAM or SGP (or GAAP accounting for that matter, since this is a finance blog) because they are flawless; I support them because they are better than the alternative. And I think they will grow more precise and useful over time.

    For Bruce I would simply ask: if you were running a school, and wanted to be able to gauge the effectiveness of different teachers, what would you use instead? I would gladly trade VAM or SGP for the ability to allow principals to make subjective evaluations of teacher (like every other profession) — but collective bargaining agreements forbid that. So as a practitioner, do you do nothing or use the tools you have to the best of your ability?

    1. I too would gladly allow principals latitude to make more substantive quality judgments and encourage them to do so. However, I do not ignore the fact that principals already do and can play a substantive role in teacher evaluation including dismissal. Not that they do this well everywhere, by any stretch, or even close to the extent I’d like to see. But they are perhaps not as broadly prohibited from making judgments as you suggest. I would argue that you substantially overstate the perils of maintaining the supposedly dreadful status quo. You also grossly overstate the rationality of “every other profession” (a phrase which immediately undermines your argument). Moving to deeply problematic measures like VAM or SGP and especially using them as rigid decision enforcing tools is counterproductive and I would argue quite likely to be far worse than the status quo – even if we adopt some uniform assumption of what that status quo is, based on the most pessimistic characterization of it.

    2. Collective bargaining does not forbid principals evaluating teachers. Lack of a spreadsheet does not mean that no one knows what’s going on. And, since VAM is in its infancy and known to have serious problems, wouldn’t the “business model” suggest that its developers offer to pay schools for the opportunity of using their students to further advance the quality of their product with the venture capital funds they got by presenting a good business plan, rather than being paid for something that’s nowhere near ready for a beta test? Would you send our troops into battle knowing that at best, 25% of their ammo is crap? Who wants to become a teacher if you can get fired for doing nothing wrong and everything right?

  5. @Alexander Ooms:

    “I would gladly trade VAM or SGP for the ability to allow principals to make subjective evaluations of teacher (like every other profession) — but collective bargaining agreements forbid that.”

    I think you meant tenure, not collective bargaining. At the school where I work, the principal (or vice principal) is in charge of conducting your final evaluation, which includes an in-class observation. If you are non-tenured and they observe you and deem you terrible, you are not invited back the following year. If you are tenured, and have had other poor observations that year by your department chair, you may receive notice of removal of tenure. And if the next year’s evaluations remain terrible, you are dismissed. In cases like this, tenure only prevents abuse of power. Collective bargaining simply ensures tenure.

    There is not a single school out there that is “doing nothing”. Every single school is educating young children. Whether or not they succeed is another matter, built on MANY factors. I suggest that, instead of throwing money away trying to create testing and evaluation systems to determine effectiveness, that we invest the money in those failing schools. It might pay off a little better.

    1. @J: I don’t know why it matters beyond semantics, but tenure rules (as well as most evaluation processes) are spelled out in the CBAs. It is just that most of the protection provisions in the CBAs kick in after some period of time (the non-probationary period). The CBA is what governs most evaluations and the process for removing teachers.

      Look, the question here is if there can be a quantitative system that provides a correlation between teachers and the academic growth of students (for I can’t imagine one might argue such a system would not benefit both parties). IF one thinks this is possible, we should be trying to make the current systems better. If one believes that this is so complex as to be beyond the ability of mankind — well, ok, but I think we all know what usually happens.

  6. It’s almost pointless to say, but the problem with the “exploding car” analogy is only that, well, cars don’t explode (and never have). What happens is that the first car breaks down a lot, and most people walk. The second car gets a little better, and more people try it. Eventually cars work well enough that everyone uses them because it is a more efficient system than walking long distances, and no one thinks about it twice.

    This is the way industries evolve — medicine, science, and eventually yes, even education. People will try something new (like VAM), and the first few iterations will have problems, but they will improve. I’d be stunned if a generation from now we don’t view some sort of quantitative tool for measuring teacher quality as normal — and we will wonder who the people were who ran around claiming it was an exploding bomb.

    1. But in medicine or engineering we do not solve a problem of mechanism that has been shown through much research to fail to accomplish its goals, by replacing it with a completely untested one, that in fact removes many of the most important safety components of the original. That is what we are talking about here. SGP is completely untested when it comes to the various problems that have been identified for VAMs (a few of which are listed here), which severely compromise their usefulness.

      It is negligent to decide to go ahead anyway with SGP without running any of these tests… especially since SGP does, in effect, remove critical safety features of VAMs, and, in fact, because it wasn’t even intended to be used this way. It’s a purely descriptive tool making no attempt to isolate a teacher or school effect.

      And, in fact, across industries, we’ve tried many things over the years that have never worked, where the problems they were intended to solve were eventually handled by completely different technologies. Not every idea from its inception becomes a widely adopted successful technology. Far more fail than succeed. Perhaps I should have chosen one of the failures (which would make a really crappy illustrative example because no-one would know of it). But that’s way off the point. Sometimes a particular technology simply has insurmountable technical problems/issues. Sometimes, it’s actually just a bad idea from the start. That may be where we’re at here.

      Your reasoning and comparisons to all those other industries and every possible profession that clearly do things more rationally than education systems seem severely clouded by a warped understanding of all those other systems in the supposed real world, as well as by layers of misconceptions about how schools and education systems work.

      1. Perhaps more along these lines: http://www.youtube.com/watch?v=C7OJvv4LG9M

        Yeah… eventually we were going to figure out how to fly, but many early attempts were pretty far off the mark.

        Oh… and pertinent to the issue at hand, we didn’t start selling trans-atlantic flight tickets for these various machines without ever testing them. They didn’t work (for their intended purpose – flight), therefore we didn’t end up using them. Perhaps it’s just a little more obvious that these things don’t work for their intended purpose. They don’t fly. A problem indeed, and an obvious one.

        But, SGP similarly doesn’t actually evaluate teacher effectiveness and isn’t even designed to. While VAM is designed to, it has shown repeated, consistent patterns of failure (different results from different tests, non-random assignment bias, high classification error). It’s not just that they are “imperfect” shall we say, but rather that I am increasingly skeptical that they provide much useful information at all (though, as a data geek, I always hoped they would). That said, to the extent they do provide useful information (if any), that information must be used thoughtfully/appropriately.

  7. Google “Montgomery County, MD” and “PAR” if you want to learn of a system that works really well and has been up and running for 11 years. It’s so good in fact that they turned down tens of millions in RTTT funding since it would have required them to abandon what has worked so very well for them. The school board, administration, teachers, parents and union are all solidly on the same page on this. This is being done in other places as well.

    1. I’m somewhat familiar with PAR, and have certainly heard much positive feedback on it. While I’m not entirely convinced that PAR is “the answer” the fact that alternatives like PAR do exist is important to debunking these absurd arguments that a) all public schools do nothing for evaluation and, of course b) all private industry is really great at evaluation and successfully implements performance based pay and dismisses employees only on the basis of effectiveness. Arguments to that effect are simply wrong (if I hear one more time, “teaching is the only profession where you have a job for life”, “all teacher evaluations are useless” or “all other professions reward their employees based on effectiveness.” That’s crap!). PAR is a great example of how schools and districts do different things when it comes to evaluation. And to others posting here, so too do various other industries. There is no simple dichotomy here. There is much to be learned from those things that are already going on in public education systems, including PAR. And there may be new and interesting variations on the horizon. I’ll let all of you take it from there…

      1. I’ve not seen any system better than PAR, or even close for that matter. An additional bit, it’s adaptable to “local situations” it’s not one size fits all. Next, consider that PAR is how the best public ed system in the world is doing it and a big part of how they got there in the first place. It is bureaucratically much leaner than VAM and unlike VAM or any other system now in place operates on a day to day ongoing basis. All this pretty much clinches it for me. Sorry for being pedantic on this, but PAR was designed to help teachers improve as well as for evaluation, all the rest, not so much at all. Here’s a link for more info. http://www.gse.harvard.edu/~ngt/par/

  8. My local elementary school is small, with just one class per grade, and the population is relatively stable, meaning that for the most part the kids in 3rd grade this year were the kids in 2nd grade last year. The teachers have mostly been in place for several years, and they are all credentialed and experienced.

    What do you see when you look at the proficient/etc scores year to year? Noise. The same teacher will show a range of 70% proficient to 30% proficient and back again. The same CLASS will show a range of proficient that varies by 30-40% up and down… keeping in mind that one student is never less than 4% and sometimes as much as 8%.

    This situation is about as controlled as the variables can get and what you see is that trying to evaluate which teacher is “best” using these scores is a fool’s errand.

    And this doesn’t even address that they ALL build on the base given to them by the kindergarten teacher, whose students are not tested. A terrible kindergarten program will affect the school all the way through.
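    The arithmetic behind that noise can be sketched in a few lines. This is only a rough illustration, not anything the school computes: the class sizes (12 and 25) are assumptions inferred from the 8% and 4% figures above, the "true" proficiency rate of 50% is hypothetical, and the binomial model assumes each student passes independently with the same probability.

```python
import math

def share_per_student(class_size: int) -> float:
    """Percentage points one student contributes to a class proficiency rate."""
    return 100.0 / class_size

def sampling_sd_points(class_size: int, true_rate: float) -> float:
    """Binomial standard deviation of the observed proficiency rate,
    in percentage points, assuming independent students (an assumption)."""
    return 100.0 * math.sqrt(true_rate * (1.0 - true_rate) / class_size)

# Hypothetical class sizes inferred from the "4% to 8% per student" remark.
for n in (12, 25):
    print(n, round(share_per_student(n), 1), round(sampling_sd_points(n, 0.5), 1))
```

    Under those assumptions, a class of 25 with an unchanged underlying pass rate would still be expected to move by roughly ten percentage points from year to year on chance alone, and a class of 12 by more.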

    1. And in this case, it sounds like you are talking about proficiency rates (% passing) bouncing around quite a bit year after year. Proficiency rates would actually tend to be more stable from year to year than would VAM estimates or SGPs. Proficiency rates and raw average scores have embedded in them all of the background a child brings to school with him/her. That would make them more stable, more predictable from year to year. VAM and SGP estimates both at least consider prior year scores – gain, so to speak. And VAM then also considers other background factors. Most/much of what determines how well a child does this year is a) his/her background family circumstance, b) peer group and c) prior performance. If you leave all of that in the scores (raw level of performance), year to year performance is relatively consistent. Take that stuff out and the year to year correlations can be really low. That is, VAM estimates will be that much noisier. But there’s another issue potentially at play here too, and that’s this idea of using cut scores to indicate proficiency, rather than using raw or scale scores. I guess if you had large numbers of kids whose raw or scale scores were frequently near the cutoff, you could get some jumping back and forth. Taking raw or scale scores and putting them into pass/fail categories around a cut score really reduces their usefulness.
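      The cut-score point can be illustrated with a made-up example. Everything here is hypothetical: the cut score of 200, the ten scale scores, and the 3-point shift are invented for illustration and do not come from any real test or class.

```python
from statistics import mean

CUT_SCORE = 200  # hypothetical proficiency cut score

def proficiency_rate(scores, cut=CUT_SCORE):
    """Percent of students scoring at or above the cut score."""
    return 100.0 * sum(s >= cut for s in scores) / len(scores)

# Hypothetical class with several students clustered just below the cut.
year1 = [180, 196, 197, 198, 199, 199, 201, 202, 215, 230]
# Second year: every student scores 3 scale points higher.
year2 = [s + 3 for s in year1]

print("mean scale score:", mean(year1), "->", mean(year2))
print("percent proficient:", proficiency_rate(year1), "->", proficiency_rate(year2))
```

      In this invented class the average scale score moves by only 3 points, yet the percent proficient doubles, because four students sitting just under the cutoff cross it. The pass/fail category discards how close each student was to the line.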

  9. Re the myth of:
    “teaching is the only profession where you have a job for life”

    Where’s our value-add testing for Supreme Court justices? 🙂
