Thoughts on Junk Indicators, School Rating Systems & Accountability


Over the past few years, co-authors Preston Green and Joseph Oluwole and I have written a few articles on the use, misuse and abuse of student growth metrics for evaluating teachers and for imposing employment consequences based on those metrics. Neither of our previous articles addressed the use of even more nonsensical status and status change measures for deciding which schools to close or reconstitute (with significant implications for employment consequences, racial disparities, and reshaping the racial makeup of the teacher workforce).

I have written a few posts on this blog in the past regarding school rating systems.

I’ve also tried to explain what might be appropriate and relevant uses of testing/assessment data for informing decision/policymaking:

It blows my mind, however, that states and local school districts continue to use the most absurdly inappropriate measures to determine which schools stay open, or close, and as a result which school employees are targeted for dismissal/replacement or at the very least disruption and displacement. Policymakers continue to use measures, indicators, matrices, and other total bu!!$#!+ distortions of measures they don’t comprehend, to disproportionately disrupt the schools and lives of low income and minority children, and the disproportionately minority teachers who serve those children. THIS HAS TO STOP!

Preston, Joseph and I previously explained problems of using value-added and growth percentile measures to evaluate and potentially dismiss teachers in cases where those measures led to racially disparate dismissals (under Title VII). We explain that courts would apply a three-part analysis as follows:

Courts apply a three-part, burden-shifting analysis for Title VII disparate impact claims. First, the plaintiffs must establish a prima facie case, showing that a challenged practice has an adverse impact on a minority group. Once the plaintiffs have established a prima facie case, the burden shifts to the employer to show that the employment practice in question has a “manifest relationship to the employment”; in other words, the employer has to show a “business justification.” If the employer satisfies this requirement, the burden then shifts to the plaintiffs to establish that less discriminatory alternatives exist.

In other words, showing the disparate effect is the first step. But that alone doesn’t mean that a policy or the measures it relies upon are wrong/unjustifiable/indefensible. That is, if the defendant (state, local district, “employer”) can validate that those measures/policy frameworks have a “manifest relationship” to the employment, then current policy may be acceptable. Plaintiffs still have the opportunity, though, to show that there are less discriminatory alternatives which are at least equally justifiable.

What follows are some preliminary thoughts in which I consider the “usual” measures in state and local school rating policies, and how one might evaluate these measures were they to come under fire in the context of Title VII litigation. It actually seems like a no-brainer, and a waste of time to write about this, since few if any of the usual measures and indicators have, from a researcher’s perspective, a “manifest relationship to the employment.” But the researcher’s perspective arguably sets too high a bar.

Types of Measures & Indicators in School Accountability Systems: An overview

I begin with a clarification of the distinction between “measures” and “indicators” in the context of elementary and secondary education policy:

  • Measures: Measures are based on attributes of a system to which we apply some measurement instrument at a given point in time. Measures aren’t the attributes themselves, but the information we gather from application of our measurement tool to the targeted attributes. For example, we construct pencil and paper, or computerized tests to “measure” achievement or aptitude in areas such as mathematics or language arts, typically involving batches of 50 items/questions/problems covering the intended content or skills. The measures we take can be referred to as “realizations” which are generated by underlying processes (all the stuff going on in school, as well as in the daily lives of the children attending and teachers working in those schools, inclusive of weather conditions, heating, cooling and lighting, home environments, etc.).  Similarly, when we take a child’s temperature we are taking a measure on that child which may inform us whether the child is suffering some illness. But that measure tells us nothing specific of the underlying process – that is, what is causing the child to have (or not have) a fever. If we wrongly assume the measure is the underlying process, the remedy for a high temperature is simply to bathe the child in ice, an unlikely solution to whatever underlying process is actually causing the fever.
  • Indicators: Indicators are re-expressions of measures, often aggregating, simplifying or combining measures to make them understandable or more “useful” for interpreting, diagnosing or evaluating systems – that is, making inferences regarding what may be wrong (or right) regarding underlying system processes. Indicators are best used as “screening” tools, useful for informing how we might distribute follow-up diagnostic effort.  That is, the indicators can’t tell us what’s wrong, or if anything really is wrong with underlying processes, but may provide us with direction as to which processes require additional observation.

Measures are only useful to the extent that they measure what we intend them to measure and that we use those measures appropriately based on their design and intent. Indicators are only useful to the extent that they appropriately re-express and/or combine measures, and do not, for example, result in substantial information loss or distortion which may compromise their validity or reliability. One can too easily take an otherwise informative and useful measure, and make it meaningless through inappropriate simplification.

Expanding on the body temperature example, we might want to develop an indicator of the health of a group of 100 schoolchildren. Following typical school indicator construction, we might simplify temporal body temperature readings for a group of 100 children to a binary classification of over or under 98.6 degrees. Doing so, however, would convert otherwise potentially meaningful (continuously scaled) data into something significantly less meaningful (if not outright junk). First, applying this precise “cut-score” to the temperature ignores the margin of error in the measurement, establishing a seemingly substantive difference between a temperature of 98.6 and 98.7, where such a small difference in reading might result from imprecision of the measurement instrument itself, or our use of it. Second, applying this cut-score ignores that a temperature of 103 is substantively different from a temperature of 98.7 (more so than a difference between 98.6 & 98.7). Given the imprecision of measurement (where temperature measurement is generally more precise than standardized testing), if large shares of the actual temperatures lie close to the 98.6 degree cut-score, then large numbers will likely be misclassified. The over/under classification scheme has resulted in substantial information loss, limiting our ability to diagnose issues/problems with underlying processes. We’ve taken an otherwise useful measure, and converted it into a meaningless junk indicator.
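The misclassification problem can be illustrated with a small simulation. The sketch below is not based on any real data: the spread of true temperatures, the measurement error (a standard deviation of 0.2 degrees) and the group size (10,000 rather than 100, only to keep the simulated rates stable) are all assumptions chosen for illustration.

```python
import random

CUT = 98.6  # the over/under threshold from the temperature example

def misclassification_rate(true_temps, error_sd, rng):
    """Share of children whose measured side of the cut differs from their true side."""
    wrong = 0
    for t in true_temps:
        measured = t + rng.gauss(0.0, error_sd)  # reading = truth + instrument error
        if (measured > CUT) != (t > CUT):
            wrong += 1
    return wrong / len(true_temps)

rng = random.Random(0)

# Hypothetical group whose true temperatures cluster around the cut-score
near_cut = [rng.gauss(98.6, 0.3) for _ in range(10_000)]
# Hypothetical group with clear fevers, far from the cut-score
far_from_cut = [rng.gauss(102.0, 0.5) for _ in range(10_000)]

rate_near = misclassification_rate(near_cut, error_sd=0.2, rng=rng)
rate_far = misclassification_rate(far_from_cut, error_sd=0.2, rng=rng)

# The binary indicator also cannot tell a reading of 98.7 from a reading of 103
same_label = (98.7 > CUT) == (103.0 > CUT)

print(f"misclassified when temperatures sit near the cut: {rate_near:.1%}")
print(f"misclassified when temperatures sit far from the cut: {rate_far:.1%}")
print(f"98.7 and 103 receive the same label: {same_label}")
```

Under these assumptions, a substantial share of the children near the cut end up on the wrong side of it purely because of measurement error, while almost none of the clearly feverish children do. The continuous readings carry that distinction; the over/under indicator throws it away.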

Notes on Validity and Reliability

As noted above, for a measure to be useful it must measure what we intend it to measure, and we must be using/interpreting that measure based on what it actually measures. That is, the measure should be valid, which takes the forms of “face validity” and “predictive validity” (there are many additional distinctions, but I will limit the discussion herein to these two). A test of “algebraic reasoning” should measure a student’s capacity to apply algebraic reasoning to test items which accurately represent the content of “algebraic reasoning.” That is content validity, which is closely related to face validity.

“Predictive validity” addresses whether the measure in question is “predictive” of a related, important outcome. This is particularly important in K-12 education systems where it is understood that successful test-taking is not the end-game for students. Rather, we hope these assessments will be predictive of some later life outcome, starting, for example, with higher levels of education attainment (high school graduation, college completion) and ultimately becoming a productive member of and/or contributor to society.

Measures commonly used for evaluating students, schools and education systems can actually have predictive validity without face validity. Typically, how well students perform on tests of language arts is a reasonable predictor of (that is, highly correlated with) how well they also do on tests of mathematics. But that doesn’t mean we can or should use tests of language arts as measures of mathematics achievement. The measures tend to be highly correlated because they each largely reflect cumulative differences in student backgrounds.

The measures should also be reliable. That is, they should consistently measure the same thing – though they might consistently measure the wrong thing (reliably invalid). If measures are neither reliable nor valid, indicators constructed from the measures are unlikely to be reliable or valid. But, it’s also possible that measures are reliable and/or valid, but indicators constructed from those measures are neither.

Often, conversion of measures to indicators compromises face validity, predictive validity, or both. Sometimes it’s as simple as choosing a measure that measures one thing (validly) and expressing it as an indicator to measure something else, like taking a test score which measures a student’s algebraic reasoning ability at a point in time, and using it as an indicator of the quality of a school, or effectiveness of that child’s teacher.

Other times, steps applied to convert measures to indicators, such as taking continuous scaled test scores and lumping them into categories, can convert a measure which had some predictive validity into an indicator which has little or none. For example, while high school mathematics test scores are somewhat predictive of success in college credit bearing math courses, simply being over or under a “passing” cut-score may have little relation to later success in college math, in part, because students on either side of that “passing” threshold do not have meaningfully different mathematics knowledge or skill.[1]

Aggregating a simplified metric (to proportions of a population over or under a given threshold) may compound the information loss. That is, looking at the percent of children over or under an arbitrary and likely meaningless bright-line cut-score, drawn precisely through an imprecise though potentially meaningful measure, provides little useful information about either the individuals or the group of students (much less the institution they attend, or individuals employed by that institution).
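The loss of predictive validity from lumping continuous scores into over/under categories can be sketched with simulated data. Everything here is hypothetical: a standardized test score, a later outcome constructed to correlate with that score at roughly 0.6, and a “passing” cut-score placed at the mean.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
N = 20_000
CUT_SCORE = 0.0  # hypothetical "passing" threshold on a standardized scale

scores, outcomes = [], []
for _ in range(N):
    score = rng.gauss(0.0, 1.0)
    # Assumed relationship: the later outcome is partly predicted by the score
    outcome = 0.6 * score + 0.8 * rng.gauss(0.0, 1.0)
    scores.append(score)
    outcomes.append(outcome)

# Re-express the continuous measure as an over/under indicator
passed = [1.0 if s >= CUT_SCORE else 0.0 for s in scores]

r_continuous = pearson(scores, outcomes)
r_pass_fail = pearson(passed, outcomes)

print(f"correlation of scale score with later outcome: {r_continuous:.2f}")
print(f"correlation of pass/fail with later outcome:   {r_pass_fail:.2f}")
```

In this toy setup the pass/fail indicator is noticeably less predictive of the later outcome than the scale score it was built from, even though nothing about the students or the test has changed. Only the re-expression differs.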

Misattribution, Misapplication & Misinterpretation

Far too often, face validity is substantially compromised in indicator construction. The problem is that the end users of the indicators often assume they mean something they simply don’t and never could. A common form of this problem is misattribution – or asserting that a measure or derived indicator provides insights into an underlying process – where in fact, the measures chosen, their re-expression and aggregation provide little or no insight into that process. I, along with colleagues Preston Green and Joseph Oluwole, explain the misapplication of student growth measures in the context of teacher evaluation. Student growth indicators (Student Growth Percentiles) are rescaled estimates of the relative (to peers) change in student performance (in reading and math) from one point in time to another. They do not, by their creators’ admission, attempt to isolate the contribution of the teacher or school to that growth. That is, they are not designed to attribute growth to teacher or school effectiveness and thus lack “face validity” for this purpose.[2] But many state teacher evaluation systems wrongly use these indicators for this purpose. In the same article, and a second related article[3], professors Green, Oluwole and I explain how related approaches, like value-added modeling, at least attempt to isolate classroom or school correlates of growth, partially addressing face validity concerns, but still failing to achieve sufficient statistical validity or reliability.
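A toy simulation shows how misattribution can arise when growth is conditioned on prior scores alone. This is a sketch under stated assumptions, not a reproduction of any state’s growth percentile or value-added model: the residual from a simple linear regression stands in for a growth measure, the single `low_income` flag stands in for student context, and the simulated schools contribute nothing that differs across students.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mean(values):
    return sum(values) / len(values)

rng = random.Random(2)
students = []
for _ in range(20_000):
    low_income = rng.random() < 0.5  # hypothetical context flag
    prior = rng.gauss(-0.5 if low_income else 0.5, 1.0)
    # Assumption: out-of-school context shifts this year's score by +/- 0.15;
    # no school or teacher effect is simulated at all.
    current = 0.8 * prior + (-0.15 if low_income else 0.15) + rng.gauss(0.0, 0.5)
    students.append((low_income, prior, current))

# "Growth" conditioned on prior score only (a stand-in for a growth percentile)
a, b = fit_line([s[1] for s in students], [s[2] for s in students])
growth = [(low, cur - (a + b * pri)) for low, pri, cur in students]
gap_prior_only = (mean([g for low, g in growth if not low])
                  - mean([g for low, g in growth if low]))

# Growth conditioned on prior score within each context group
# (a crude stand-in for adding a context covariate, as value-added models do)
adjusted = []
for flag in (True, False):
    group = [s for s in students if s[0] == flag]
    a_g, b_g = fit_line([s[1] for s in group], [s[2] for s in group])
    adjusted += [(flag, cur - (a_g + b_g * pri)) for _, pri, cur in group]
gap_adjusted = (mean([g for low, g in adjusted if not low])
                - mean([g for low, g in adjusted if low]))

print(f"growth gap, prior score only:       {gap_prior_only:.3f}")
print(f"growth gap, prior score + context:  {gap_adjusted:.3f}")
```

Even though no school effect exists in the simulated data, the prior-score-only growth measure shows a gap between the two groups of students. Anyone reading that gap as school or teacher effectiveness would be misattributing a difference in context. Conditioning on the context flag removes the gap here only because the simulation contains a single, perfectly measured context variable, which real data never do.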

Neither of our articles addresses the use of crude, school aggregate indicators, constructed with inappropriately reduced versions of assessment measures, to infer institutional (school) or individual (teacher) influence (effectiveness), leading to employment consequences. That is because these indicators clearly lack the most basic face validity for making such inferences or attributions. As such, we felt it unnecessary to bother critiquing these indicators for this purpose. As noted above, proportions of children over/under arbitrary thresholds on assessments tell us little about the achievement of those children. These indicators are aggregations of meaningless distinctions made through otherwise potentially meaningful measures. These aggregations of meaningless distinctions surely provide no useful information about the institutions (and by extension the employees of those institutions) attended by the students on which the measures were originally taken.

In the best case, the un-corrupted measure – the appropriately scaled test score itself – may be an indicator reflecting ALL of the underlying processes that have brought the child to this point in time in their mathematics or language arts achievement, knowledge or skills. Those processes include all that went on from maternal health through early childhood interactions, community and household conditions, general health and well-being along the way (& related environmental hazards), and even when entering school, the greater share of hours per day spent outside of the schooling environment.  Point in time academic achievement measures pick up cumulative effects of all of these processes and conditions, which are vastly disparate across children and their neighborhoods, which is precisely why these measures continue to reveal vast disparities by race and income, and by extension, across schools and the often highly segregated neighborhoods they serve.[4]

The Usual Indicators, What they Mean & Don’t Mean

Here, I provide an overview of the types of indicators often used in state school report cards and large district school rating systems.  Simply because they are often used does not make them valid or reliable. Nor does it provide the excuse for using these indicators inappropriately – such as misattribution to underlying processes – when the inappropriateness of such application is well understood.  Table 1 provides a summary of common indicators.  The majority of indicators in Table 1 are constructed from measures derived from standardized achievement tests, usually in math and English language arts, but increasingly in science and other subject areas.  State and local school rating systems also tend to include indicators of graduation and attendance rates.

Of all of the indicators listed in Table 1, only one – Value-Added estimates – attempts attribution of “effect” to schools and teachers, though with questionable and varied success. Most (except for the most thorough value-added models) are well understood to reflect both socioeconomic and racial bias, at the individual level and in group level aggregation. More detailed discussion of these indicators follows Table 1.

Table 1. Conceptual Overview of School Performance Indicators

| Indicator | Type | Facial | Notes | Attribution / Effect |
| --- | --- | --- | --- | --- |
| Academic assessment (e.g. reading/math standardized test) | Scale score or group mean scale score | Student or group status/performance level | All such measures norm referenced to an extent, even if attached to supposed criteria (content frameworks) | Makes NO attempt to isolate influence of schools or teachers [no manifest relationship] |
| | Percent “proficient” or higher (or above/below any status cut-point) | Status of a group of students relative to an arbitrary “cut-score” through distribution | Ignores that those just above/below threshold not substantively different. Substantially reduces information (precision) | Makes NO attempt to isolate influence of schools or teachers [no manifest relationship] |
| | Cohort Trend/Change | Difference in status of groups sequentially passing through a system | Typically measures whether subsequent group has higher share over/under threshold than previous. Influenced by differences in group makeup, and/or differences in test administration from one year to next. | Makes NO attempt to isolate influence of schools or teachers [no manifest relationship] |
| | Growth Percentiles | Change in student test score from time=t to time=t+1 | Usually involves rescaling data based on student position in distribution of student scores. Does not account for differences in student background, school or home context (resources) | Makes NO attempt to isolate influence of schools or teachers [no manifest relationship] |
| | Value-Added | Change in student test score from time=t to time=t+1 conditioned on student and school characteristics | Uses regression models to attempt to compare growth of otherwise similar students in otherwise similar settings. Ability to isolate classroom/school “effects” highly dependent on comprehensiveness, precision & accuracy of covariates. | Attempts to isolate relationship between gains and classroom factors (teachers) and schools. Suspect in terms of manifest relationship.[5] |
| Persistence & Completion | Graduation Rates / On-Time Progress / Dropout Rates | Student Status / Performance Level | Tracks student pathways through grade levels, courses against expectations (on track) | Makes NO attempt to isolate influence of schools (resources, etc.) or teachers [no manifest relationship] |
| Attendance | Proportion of “enrolled” students “attending” per day, averaged over specified time period | Status of a group of students relative to an arbitrary “cut-score” through distribution | Typically does not discriminate between types/causes of absences. Known to be disparate by race/SES, in relation to chronic health conditions.[6] | Makes NO attempt to isolate influence of schools (resources, etc.) or teachers [no manifest relationship] |

Common indicators constructed with standardized assessment measures are summarized below:

  • “Proficiency” Shares: Shares of children scoring above/below assigned cut-scores on standardized assessments. Few states or districts have conducted thorough (if any) statistical analysis of the predictive validity of the assigned cut-points or underlying assessments.[7] Raw scores underlying these indicators capture primarily cumulative differences in the starting points and backgrounds of students, individually and collectively in schools and classrooms. Proportions of children over/under thresholds depend on where those arbitrary thresholds are set.
    • Whether raw scores or proficiency shares, these indicators are well understood to substantially (if not primarily) reflect racial and socio-economic disparity across students and schools.
  • Change in “Proficiency” Shares (Cross-Cohort Percent Over/Under): For example, comparing the proficiency rate of this year’s 4th grade class to last year’s 4th grade class in the same school, perhaps calculating a trend over multiple cohorts/years. These indicators capture primarily a) changes in the starting points and backgrounds of incoming cohorts of students (demographic drift), and b) changes in the measures underlying the accountability system (test item familiarity, item difficulty, etc.), whether by design or not. In many cases, year over year changes in shares over/under proficiency cut-scores are little more than noise (assuming no substantive changes to tests themselves, or cohort demography) created by students shifting over/under arbitrary cut-scores with no substantive difference in their achievement level, knowledge or skill.
    • These indicators do not tend to reflect racial or socio-economic disparity (except for trends in those disparities) in large part because these indicators usually reflect nothing of substance or importance, and are often simply noise (junk).[8]
  • Student Achievement Growth (Student Growth Percentiles): Constructed by comparing test score growth of each student, with respect to others starting at similar points in the distribution, among their peers. These indicators include only prior scores and do not include other attributes of students, their peers or schooling context. These indicators capture differences in student measures from one point in time to another, and are most often (read always, in practice) scaled relative to other students, so as to indicate how a student’s “growth” compares with an average student’s growth. These indicators may also capture primarily the conditions under which the student is learning, both at home and in school.
    • These indicators tend to be racially and socio-economically disparate because they control only for differences in students’ initial scores, and not for students’ school and peer context, or students’ own socio-economic attributes.[9]
  • Value-Added Model Estimates: Based on statistical modeling of student test scores, given their prior scores, and various attributes of the students, their peers and schooling context (breadth of factors included varies widely, as does the precision with which these factors are measured). These measures are the “best attempt” (or, as I often say, the “least bad” alternative) to isolate the school and/or classroom related factors associated with differences in student measures from one point in time to another, but cannot, for example, differentiate between school resources (including instructional materials, building heating, cooling and lighting) and the “effectiveness” of employees.[10]
    • These indicators may substantively reduce racial and economic disparity of the measures on which they are based, by using rich data to compare growth of similar students in similar contexts.
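The claim that year-over-year changes in proficiency shares are often little more than noise can be checked with a short simulation. The cohort size, the cut-score and the score distribution below are hypothetical. The key assumption is that every cohort is drawn from exactly the same distribution, so any change in the proficiency rate is sampling noise by construction.

```python
import random

rng = random.Random(3)
CUT_SCORE = 0.0    # hypothetical proficiency threshold (standardized scale)
COHORT_SIZE = 60   # hypothetical number of 4th graders tested each year
YEARS = 10

def proficiency_rate(cohort):
    """Share of a cohort scoring at or above the cut-score."""
    return sum(score >= CUT_SCORE for score in cohort) / len(cohort)

# Every cohort comes from the SAME distribution: nothing about the school,
# the test, or the student population changes from year to year.
rates = []
for _ in range(YEARS):
    cohort = [rng.gauss(0.0, 1.0) for _ in range(COHORT_SIZE)]
    rates.append(proficiency_rate(cohort))

# Cross-cohort "change in proficiency" as a rating system would compute it
changes = [later - earlier for earlier, later in zip(rates, rates[1:])]

for year, rate in enumerate(rates, start=1):
    print(f"year {year}: {rate:.0%} proficient")
print(f"largest year-over-year swing: {max(abs(c) for c in changes):.0%}")
```

With cohorts of this size, the simulated proficiency rate moves around from year to year even though nothing real has changed. A rating system that rewards or penalizes a school on such swings would be acting on noise.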

This is not to suggest that value-added measures in particular have no practical use. Rather, they should be used appropriately, given their limitations. As Preston Green, Joseph Oluwole and I explain:

Arguably, a more reasonable and efficient use of these quantifiable metrics in human resource management might be to use them as a knowingly noisy pre-screening tool to identify where problems might exist across hundreds of classrooms in a large district. Value-added estimates might serve as a first step toward planning which classrooms to observe more frequently. Under such a model, when observations are completed, one might decide that the initial signal provided by the value-added estimate was simply wrong. One might also find that it produced useful insights regarding a teacher’s (or group of teachers’) effectiveness at helping students develop certain tested skills.

School leaders or leadership teams should clearly have the authority to make the case that a teacher is ineffective and that the teacher even if tenured should be dismissed on that basis. It may also be the case that the evidence would actually include data on student outcomes – growth, etc. The key, in our view, is that the leaders making the decision – indicated by their presentation of the evidence – would show that they have reasonably used information to make an informed management decision. Their reasonable interpretation of relevant information would constitute due process, as would their attempts to guide the teacher’s improvement on measures over which the teacher actually had control. (p. 19)[11]

Other measures receiving less attention (Why “attendance” rates are sucky indicators too!)

School rating reports also often include indicators of student attendance and attainment and/or progress toward goals including graduation. It must be understood that even these indicators are subject to a variety of external influences, making “attribution” complicated. For example, one might wrongly assume that attendance rates reflect the efforts of schools to get kids to attend. If schools do a good job, kids attend and if they don’t, kids skip.  If this is the case, there should be little difference in attendance rates between “rich schools” and “poor schools” unless there is actually different effort on the part of school staff.   This assertion ignores the well understood reality that children from lower income and minority backgrounds are far more likely to suffer from chronic illness including asthma, obesity or both, which is a strong predictor of chronic absence (>10).[12] Additionally, neighborhood safety affects daily attendance.[13] These forces combine to result in significant racial and economic disparities in school attendance rates which are beyond the control of local school personnel.

Once again… for my thoughts on productive use of relevant measures in education, see:

Notes:

[1] See for example, http://usny.nysed.gov/scoring_changes/MemotoDavidSteinerJuly1.pdf and/or Papay, J. P., Murnane, R. J., & Willett, J. B. (2010). The consequences of high school exit examinations for low-performing urban students: Evidence from Massachusetts. Educational Evaluation and Policy Analysis, 32(1), 5-23.

[2] Baker, B. D., Oluwole, J., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21, 1-71.

[3] Green III, P. C., Baker, B. D., & Oluwole, J. (2012). The legal and policy implications of value-added teacher assessment policies. BYU Education & Law Journal, 1.

[4] Duncan, G. J., & Murnane, R. J. (Eds.). (2011). Whither opportunity?: Rising inequality, schools, and children’s life chances. Russell Sage Foundation.

Coley, R. J., & Baker, B. (2013). Poverty and education: Finding the way forward. Educational Testing Service Center for Research on Human Capital and Education.

Reardon, S. F., & Robinson, J. P. (2008). Patterns and trends in racial/ethnic and socioeconomic academic achievement gaps. Handbook of research in education finance and policy, 497-516.

[5] Baker, B. D., Oluwole, J., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21, 1-71.

Green III, P. C., Baker, B. D., & Oluwole, J. (2012). The legal and policy implications of value-added teacher assessment policies. BYU Education & Law Journal, 1.

[6] http://www.changelabsolutions.org/sites/default/files/School-Financing_StatePolicymakers_FINAL_09302014.pdf

[7] Some concordance analyses relating ISAT scores to ACT “college ready” benchmarks have been produced. See: http://www.k12accountability.org/resources/Early-Intervention/Early-intervention-targets.pdf & http://evanstonroundtable.com/ftp/P.Zavitkovsky.2010.ISAT.chart.pdf

[8] See, for example: http://www.shankerinstitute.org/blog/if-your-evidence-changes-proficiency-rates-you-probably-dont-have-much-evidence

[9] See also: Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2014). Choosing the right growth measure. Education Next, 14(2) & https://njedpolicy.wordpress.com/2014/06/02/research-note-on-teacher-effect-vs-other-stuff-in-new-jerseys-growth-percentiles/ & http://www.shankerinstitute.org/blog/does-it-matter-how-we-measure-schools-test-based-performance

[10] Baker, B. D., Oluwole, J., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21, 1-71.

Green III, P. C., Baker, B. D., & Oluwole, J. (2012). The legal and policy implications of value-added teacher assessment policies. BYU Education & Law Journal, 1.

[11] Baker, B. D., Oluwole, J., & Green, P. C. (2013). The legal consequences of mandating high stakes decisions based on low quality information: Teacher evaluation in the race-to-the-top era. Education Policy Analysis Archives, 21, 1-71. http://epaa.asu.edu/ojs/article/view/1298/1043

[12] http://www.changelabsolutions.org/sites/default/files/School-Financing_StatePolicymakers_FINAL_09302014.pdf

[13] Sharkey, P., Schwartz, A. E., Ellen, I. G., & Lacoe, J. (2014). High stakes in the classroom, high stakes on the street: The effects of community violence on students’ standardized test performance. Sociological Science, 1, 199-220.
