There’s What You Want and What You Ask For

“When a measure becomes a target, it ceases to be a good measure.”
“Goodhart’s law.” Wikipedia, retrieved March 24, 2020.

“In economics, ‘rational expectations’ are model-consistent expectations, in that agents inside the model are assumed to ‘know the model’ and on average take the model’s predictions as valid.”
“Rational Expectations,” Wikipedia, retrieved March 24, 2020.


The issue of racial equity in K-12 education has been of concern to the public, educators, and policy makers for decades.[COLE1966][CLOT2007] In order to address this, in 2007 the North Carolina State Board of Education formed a commission with the task of “comprehensively reviewing and offering recommendations for revisioning the State’s test program and accountability system.”[NCDPI2008] Among these recommendations was that, “These new and different accountability models must be understandable and transparent. They must provide parents and other stakeholders with valid and meaningful information about the performance of students …”, and also, “Teachers and principals need support and data that enable them to make informed instructional decisions that result in positive outcomes for students.”

The Commission consisted of twenty-five members selected from legislators, corporate executives, educators, and university faculty. The Commission was assisted by thirty-two individuals who provided information and the benefit of their perspectives. As far as I can determine, there was perhaps one statistician or quantitative social scientist among these fifty-seven persons, although two were affiliated with an educational assessment company and might have provided statistical insights.

In the State Budget for 2013-15 the North Carolina legislature mandated school performance grades based on the results of standardized tests, combined with school growth scores produced from the SAS Educational Value-Added Assessment System.[NCSB402][SASevaas] A statement of the General Statutes applicable to school performance grades, as of 2020, is in the Appendix.[NCGS115C]

In 2015 the North Carolina Department of Public Instruction (NCDPI) published a report on the status of the state's standardized testing.[SMIT2015] The report states, “The current report focuses explicitly on the relationship between new assessments and their respective content standards or curricular goals. Phase 2 of the study will examine the relationship between instructional practice and relevant content standards based upon a randomly selected representative sample of teachers in the state, while Phase 3 will examine the impact of students' opportunity to learn standards-based content on student achievement. The completed study will provide the state with a unique data set for modeling the performance of the standards-based system as depicted by the various data collection and analysis strategies employed for the study.” I have not been able to find any follow-up to this report, that is, a Phase 2 or Phase 3, although something may exist.

What I Address Here

Reading the Commission charge and the 2015 status report leads me to conclude that the standardized tests were conceived of as aiding in evaluating and improving outcomes for individual students, and only by implication as providing school or instructor-level evaluations. Nevertheless, the General Statutes of North Carolina make it clear that the standardized tests have been put to these latter ends.[PSF2020] I will address the question: to what extent do the publicly available NCDPI data provide policymakers, educators, and the public with the means to assess the achievement of students and schools? More specifically, I will discuss some quantitative aspects of determinations of education equity between Black and White elementary school students in North Carolina public schools.

The standardized tests are constructed and vetted by educators and psychometricians. The state's standardized tests are constructed according to Item Response Theory (IRT), where questions seek to determine levels of comprehension of grade-related skills.[SMIT2015][NCDPIach] Test scores have broad ramifications for students, teachers, and administrators, as they are used for purposes beyond evaluating the needs and performance of individual students, such as grading schools, evaluating teachers, setting policies, and so on.[HO2015]

My analysis concentrates on grades 3, 4, and 5, primarily because students in these grades participate in a series of standardized Math and Reading tests, and the students pass from grade to grade in a reasonably orderly fashion in what is sometimes called a wave. I will concentrate on comparing achievement of Black and White students. The other race categories are unevenly distributed throughout North Carolina, making their analysis difficult. I will not be looking at charter schools, since they tend to change in grades offered and enrollments over the years, making their analysis better subject to a separate endeavor.

The intention of the Commission was two-fold: to create a framework for testing and accountability to inform about the performance of children, and to provide instructional support. It does not appear that the statistical underpinnings of the first of these intents were ever established. To the extent that this was undertaken, it has been by happenstance and without rigor.

The crux of the problem is the disjuncture among the stated intent of the Commission, how the data is actually used, and what is actually provided to policymakers, educators, and the public.

Problem 1: Percentages

What is disclosed in the publicly available, FERPA-compliant data is the number of students who achieve Grade Level Proficiency (GLP), and the total number of students tested, by race, rolled up for each grade for each school. Individual classes and teachers are not identified, unless there is only one class for a grade at a school, and then only by implication. The only practicable way to compare schools, or to follow schools across the years, is to use the percentage of students who achieve GLP, but grade size (the denominator) varies considerably. Numerical, statistical, and interpretive problems can be expected whenever percentages are compared across widely varying denominators.
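To make the denominator issue concrete, here is a minimal sketch under an assumption that is mine rather than NCDPI's: each tested student independently reaches GLP with the same underlying rate, so the count of proficient students is binomial. The function name and the grade sizes are hypothetical.

```python
import math

def glp_percent_standard_error(true_rate, n_students):
    """Binomial standard error, in percentage points, of an observed GLP percent.

    Assumes (for illustration only) independent students with a common rate.
    """
    return 100.0 * math.sqrt(true_rate * (1.0 - true_rate) / n_students)

# Hypothetical grade sizes; 0.5 is an assumed underlying proficiency rate.
for n in (10, 20, 60, 120):
    print(n, round(glp_percent_standard_error(0.5, n), 1))
```

Under these assumptions the sampling uncertainty is about 15.8 percentage points for ten students and about 4.6 for 120, so two schools with the same underlying rate can show quite different GLP percents purely because of grade size.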

Problem 2: Goodhart's law

The construction and validation of the standardized tests seem to omit any evaluation of the extent to which student performance is influenced by the expenditure of resources on student preparation. Performance must also depend on student opportunities quite outside the schools' scope of control.

Problem 3: Characteristics of the Tests

NCDPI does not make publicly available the Score-Frequency distributions except rolled up for all students (more on that below). That data is available from NCERDC but is FERPA-protected. Without knowledge of the race-based characteristics of the tests, we are dealing with avoidable mystery.

Problem 4: A Very Zen-like Experience or Goodhart's Law Again

What is described by the Commission does not recognize a student's past or future, but only isolated yearly evaluations. The concept of improvement is left to ad hoc comparisons of some summary measures, and to the completely separate EVAAS. This encourages educators, principals, and administrators to concentrate on improving the prospects for students to score well in the standardized tests, which can be different from learning the mandated materials.

Where Is The Data?

The publicly available data is primarily in what are called the disaggregated data files. These are FERPA-compliant, consequently some data is masked and individual classes are accumulated into one entry for each grade for each school. The data is shown for races and other demographics, such as gender and whether economically disadvantaged; these, however, are masked if the number of students is small (see the General Statutes in the Appendix). The disaggregated data files can be downloaded from NCDPI without the need for permission.[NCDPIdis]

FERPA-protected data is handed off by NCDPI to the North Carolina Education Research Data Center (NCERDC) under a contractual agreement.[NCERDC] NCERDC acts as a portal to the data, requiring research to be carried out according to FERPA-compliant privacy rules. This means that investigators must be associated with a university or research organization that has an Institutional Review Board (IRB). Investigators must be able to show that they will protect the data in compliance with NCERDC requirements, and to destroy the data (but of course not the results of analysis) upon completion of the project.

Grade Level Proficiency: Understandable and Reductionist

The Grade Level Proficiency categorization is among the most frequently mentioned quantitative criteria and is a realization of what the General Statutes mandate.[NCES2019] The grade 3, 4, and 5 standardized tests have been designed so that the score to achieve grade level proficiency is the median (give or take one point in a range of about eighty points) of the statewide student scores. This corresponds to achieving Level 3 on the five-level End of Grade tests. Gap analysis is the comparison of GLP percent mean values for Black students to those for White students. As generally used, the gap is the numerical difference between the GLP percents. While more refined methods can be used, consumption by policymakers and the public is eased through reducing whatever is happening to a single number. I avoid the use of the term “gap analysis,” since this subject already has a rich research literature and complex research methodologies.[JENC1998][CEPAgap] My interest lies in using the NCDPI publicly available data more broadly than just the GLP gap.
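As an illustration of the single-number gap, the following sketch computes the difference in GLP percents from proficient and tested counts, together with a rough standard error. The sign convention (White minus Black), the function name, and the treatment of the two groups as independent binomial samples are my assumptions for illustration, not a description of any NCDPI procedure.

```python
import math

def glp_gap(black_proficient, black_tested, white_proficient, white_tested):
    """Return (gap, standard_error) in percentage points.

    Gap is taken as White GLP percent minus Black GLP percent (an assumed
    convention). The standard error assumes independent binomial samples.
    """
    p_b = black_proficient / black_tested
    p_w = white_proficient / white_tested
    gap = 100.0 * (p_w - p_b)
    se = 100.0 * math.sqrt(
        p_b * (1.0 - p_b) / black_tested + p_w * (1.0 - p_w) / white_tested
    )
    return gap, se

# Hypothetical counts for one grade at one school.
gap, se = glp_gap(10, 40, 30, 40)
print(round(gap, 1), round(se, 1))
```

With only forty students in each group, the standard error in this made-up example is close to ten percentage points, which suggests how much is hidden when the gap is reported as a bare number.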

Score-Frequency Profiles

In grades 3, 4, and 5 the standardized tests are designed according to IRT methodologies, so that the achievement of GLP is not just a simple accumulation of correct answers. This is expressed most clearly by score-frequency diagrams. Figure 1 shows the data for all North Carolina public school students in grade 3, 2017-18 Mathematics.[NCDPIgb] The step at the Level 2/Level 3 boundary is an expression of the Item Response methodology. Its intent is to assure that getting some scattered collection of correct answers will not lead to achieving Grade Level Proficiency. This profile is derived from the testing at the end of the year, that is, it is not a design goal verified by sampling early in the school year. It also is completely devoid of the possible influence of the achievement levels of students on entry into the grade. I am sure that design goals are stated somewhere, but I do not know where that might be.

Figure 1. A Representative Score-Frequency Profile

How does the score-frequency diagram translate into observable results? NCDPI does not make the score-frequency data publicly available by subgroup (NCDPI uses that term to denote race and several other categories), but Figure 2 shows that there are different shapes in the end of year GLP percent by school results.

Figure 2. Representative GLP Percent By Race Distributions

It is evident from Figure 2 that the distributions for White and for the wealthier, Not-EDS (Economically Disadvantaged Students), students differ structurally from those for the other categories. Not only are the means different, but the shapes are different. The ALL data is more symmetrical than the White and Not-EDS, which lean toward higher percentages. The Black and EDS distributions have tails into higher GLP percent, while the mass of the students lies in the lower GLP percents. The profiles are so different that it raises the question of whether the tests are better at identifying the subgroup or assessing performance.[HO2015] (The vertical bar just below 100% in Not-EDS and White is an artifact of FERPA compliance and should be imagined as spread out over the 95% to 100% interval.)

I have already mentioned the 2015 program assessment that concentrated on the alignment between grade-specific learning objectives and the standardized tests.[SMIT2015] This omits any evaluation of student capabilities for taking any test, regardless of the goal of the test. But the ability to do well on tests must be at least associated with the extent to which teachers and principals expend resources on student preparation. It also must depend on student opportunities quite outside the schools' scope of control.

Comparisons Within Schools - A Moving Target

One way of comparing GLP percent of schools to themselves is to look across school years for the same grade. Since I am using grades 3, 4, and 5, and have six school years and two subjects, Math and ELA, that would result in a lot of plots. I did that and found that the similarities between grades, subjects, and years were so great that displaying one typical plot would be of value.

Hidden in plain sight:

Figure 3 combines the GLP percent data for three years, 2016-17, 2017-18, and 2018-19, into what is called a first differences plot. The horizontal axis shows, for all students regardless of race and by school for grade 3 Math, the difference between GLP percent for 2016-17 and 2017-18, while the vertical axis shows the difference between 2018-19 and 2017-18.

Figure 3. All Schools Math Grade 3

There are several striking features of Figure 3. For one, there is a “cloud,” and it is big. Differences extend to thirty percent, although the bulk are twenty-five percent or less. Another is that there are a lot of up and down changes, these being in the right-hand lower and the left-hand upper quadrants. From the viewpoint of statistics, the overall features of Figure 3 make sense, at least insofar as the smaller changes are concerned. Indeed, it would be very suspicious if the majority of changes were consistently in one direction. The underlying reason is that we are dealing with percentages where there are variations, some substantial, in the number of grade 3 students in each school. This leads into a second reason, that there are changes in the number of grade 3 students from year to year in the schools, which may well be associated with changes in staff and policies. This, combined with the race-related distributions shown in Figure 2, shows that the reductionist GLP percent is of some, but limited, usefulness. The primary usefulness is the comparison of statewide mean GLP percent by grade across years, although even that is negatively affected by what I discussed in relation to Figure 2. That GLP percent obscures details definitely does not inspire confidence or recommend its use by policymakers and educators.
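The statistical expectation behind the up and down pattern can be sketched with a small simulation. The assumptions are mine: every school has the same unchanging underlying proficiency rate, enrollment is fixed within a school across the three years, students are independent, and both differences are taken as later year minus earlier year. The enrollment range and function names are hypothetical, and no NCDPI data is used.

```python
import math
import random

def simulate_first_differences(n_schools=2000, true_rate=0.5, seed=0):
    """Return (xs, ys): first differences of GLP percent under pure sampling noise."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_schools):
        n = rng.randint(20, 120)  # hypothetical grade 3 enrollment
        pct = []
        for _year in range(3):
            proficient = sum(rng.random() < true_rate for _ in range(n))
            pct.append(100.0 * proficient / n)
        xs.append(pct[1] - pct[0])  # middle year minus first year
        ys.append(pct[2] - pct[1])  # last year minus middle year
    return xs, ys

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs, ys = simulate_first_differences()
print(round(pearson(xs, ys), 2))  # theory suggests a value near -0.5
```

Because the middle year enters both differences with opposite signs, noise alone should produce a correlation near -0.5 and a concentration of points in the upper-left and lower-right quadrants, even though nothing about the simulated schools changes.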

Figure 4 shows one of the sources of Figure 3's cloud. These are the year to year changes in grade enrollments. For many reasons, schools reconfigure, which might entail small changes or the large changes of adding or dropping a class. NCDPI data shows that average grade 3 class size was about twenty students, which helps interpret the changes. Figure 4 looks only at Black students for the convenience of having a less cluttered plot than for White or all students, there being about 250 more schools with White students than with Black students. These year to year changes raise many questions if they are appreciable. For instance, are changes by race related? In what ways is a school “the same” if changes are large? How can changes in GLP percent by race be used as a measure if changes are large?

Figure 4. Year to Year Changes in Grade 3 Enrollments

Correlations Between GLP Percentages for Mathematics and Reading

Another problem with the data is, again, associated with the use of GLP percent. Figure 4 showed that there is a large range of grade enrollments. This is especially true when looking at the number of students by race. In cases where that number is particularly small, calculating GLP percent is subject to variations that, in a sense, have nothing to do with the number of students achieving GLP. Rather, these are artifacts of using percentages with a large range of denominators. Part of this was addressed in Figure 3, but it goes beyond that and, in subtle ways, may affect assessments and decisions made on the basis of GLP percent. This is expressed in the statutory mandate that, for grades 3, 4, and 5, GLP percent for Mathematics and Reading be given equal weight. While that sounds commendable, it does not appear to have been subject to statistical review. This comes down to assessing the correlations of Math and Reading. It is not a story about race. Figure 5 shows a correlation scatterplot for grade 3 for Black students in 2018-19. Correlations for Hispanic or White students, and other grades and years tell the same story.

Figure 5. Math and Reading Correlations in Grade 3 2018-19

The correlograms are, first, for schools with the smallest classes, and second for those with the largest classes. They are obviously different. The one for the smallest classes is not a blob, as is the other plot. It shows a preference for equality between Math and Reading GLP percentages. Indeed, close inspection shows some other similar features. This means that sweeping schools of all grade enrollments together is not the “fair deal” that giving equal weights to Math and Reading might seem to be.
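One possible contributor to the preference for equality in small classes is that a GLP percent can take only a few distinct values when few students are tested. The sketch below computes, exactly, the chance that Math and Reading GLP percents coincide when the same number of students is tested in both. It assumes, purely for illustration, that the two counts are independent binomials with a common rate; real Math and Reading results are surely correlated, which would make coincidences more likely still. The function name is hypothetical.

```python
from math import comb

def prob_equal_percent(n_students, rate=0.5):
    """Probability that two independent binomial counts out of n_students agree.

    Equal counts over the same denominator mean equal GLP percents.
    """
    pmf = [
        comb(n_students, k) * rate**k * (1.0 - rate) ** (n_students - k)
        for k in range(n_students + 1)
    ]
    return sum(p * p for p in pmf)

# Hypothetical subgroup sizes: a very small one and a large one.
for n in (5, 60):
    print(n, round(prob_equal_percent(n), 3))
```

Under these assumptions roughly a quarter of five-student subgroups would show identical Math and Reading percents by chance, against well under a tenth of sixty-student subgroups, so pooling schools of all sizes mixes two quite different kinds of scatter.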

Wait, There’s More

There is more, which I shall treat briefly, although more detail is available. For one, in 2018-19 NCDPI made a change in the content of the publicly available data. Whereas previously the number of students achieving Level 1 and Level 2 of the standardized tests had been reported, beginning in 2018-19 for Mathematics these would be grouped into a single “not proficient” category.[NCDPI19br] (Arithmetically, this would simply be 100-GLP%.) I have not found any NCDPI discussion of this change, but it may have to do with a further hardening of FERPA compliance. The unfortunate consequence is that, when considering grades 3, 4, and 5, the profiles of grade 4 and grade 5 students at entry, that is, considering grade 3 and grade 4 from the prior year, are less detailed. Since the expected achievement of students is associated with their entry profile, this lack of data makes it difficult to address the extent to which schools achieve student improvement goals. Fortunately, the five Levels are retained for ELA.

Here is one last item. In the first two years of the current standardized tests, 2013-14 and 2014-15, NCDPI publicly available data included grades for schools where the number of students, by race, was below ten. In 2015-16 that changed, and records, by race, for those schools just vanished if the number of students was less than ten. The proportion of schools affected may be about ten percent at least for Black students. This was likely in response to an interpretation of FERPA compliance. While reporting GLP percent for such small enrollments would be misleading, dropping that many schools and in effect masking all that data without discussion is hardly helpful.

Conclusions and Future Activities

It should be no surprise that legislation has stipulated statistically unvetted and unsound uses of standardized tests. Put another way, I do not see any attempt to recognize and accommodate the departures from comparisons of identically distributed populations from which independent samples have been taken. Simplistic analyses, such as using the reductionist GLP percentage, return unreliable results. Without the ability to analyze the FERPA-protected data ourselves we can do no more than speculate about the oddities in the analysis of the publicly available data.[JACO2016a][JACO2016b]

Several research projects that would require only moderate resources come to mind. These include computing the score-frequency distributions, by race, for all the available years. Another is looking at the influence of student achievement at entry on the results at the end of a year. While the SAS EVAAS has been mandated to address growth, the methods and data are not available for review. This raises obstacles when combining GLP percentages and the EVAAS evaluations. Another project is to look at Black and White student achievement when small numbers of these students result in data being inaccessible in the NCDPI files. Yet another would be the pursuit of correlations between Math and Reading test scores. Another is to track schools where racial composition changes significantly from year to year. A larger project would be to follow individual students for a few years, but that would require a longer commitment than the others.


[CEPAgap] Stanford Center for Education Policy Analysis, The Educational Opportunity Monitoring Project, Racial and Ethnic Achievement Gaps

[CLOT2007] Charles T Clotfelter & Helen F Ladd & Jacob L Vigdor, 2009. “The Academic Achievement Gap in Grades 3 to 8,” The Review of Economics and Statistics, MIT Press, vol. 91(2), pages 398-419, October 2009.

[COLE1966] Coleman, J.S., et al., 1966. “Equality of Educational Opportunity.” U.S.Department of Health, Education, and Welfare. Super. of Docs. Catalog No. FS 5.238.38001.

[HO2015] Ho, A.D., Yu, C.C., 2015. “Descriptive Statistics for Modern Test Score Distributions: Skewness, Kurtosis, Discreteness, and Ceiling Effects.” Educational and Psychological Measurement 75 (3), 365-388.

[JACO2016a] Jacob, B.A., “Student test scores: How the sausage is made and why you should care.” Economic Studies at BROOKINGS, Evidence Speaks Reports, Vol 1, #25, August 11, 2016.

[JACO2016b] Jacob, B. and Rothstein, J., “The Measurement of Student Ability in Modern Assessment Systems.” Journal of Economic Perspectives, Vol. 30, #3, Summer 2016.

[JENC1998] Jencks, C., Phillips, M. 1998. “The Black-White Test Score Gap: Why It Persists and What Can Be Done”, BROOKINGS Article, March 1, 1998.

[NCDPI2008] “REPORT FROM THE BLUE RIBBON COMMISSION ON TESTING AND ACCOUNTABILITY” 2008.,%202008/Presentations/Blue%20Ribbon%20Commission%20on%20Testing%20and%20Accountability.pdf

[NCDPI19br] “2018-19 Business Rules for Calculating Results - Accountability Model Business Rules” 2019.

[NCDPIach] NCDPI Achievement Level Descriptors 2019

[NCDPIdis] NCDPI disaggregated data files (expand the “Reports of Supplemental Disaggregated …” section):,-school-system-and-school-performance-data


[NCDPIgb] NCDPI Green Books:

[NCDPIsad] NCDPI Student Accounting Data:

[NCERDC] North Carolina Education Research Data Center.

[NCGS115C] North Carolina General Statutes Chapter 115C Article 8 115C-83.15 and following, retrieved April 10, 2020.

[NCSB402] 2013-15 North Carolina Budget Bill 402

[PSF2018] “Quick Facts: Understanding Class Size Chaos,” Public Schools First NC, retrieved March 28, 2020.

[PSF2020] “A-F School Performance Grades,” Public Schools First NC, retrieved April 9, 2020.

[SASevaas] “SAS® EVAAS® for K-12,” retrieved April 11, 2020.

[SMIT2015] Smithson, J.L., 2015. “A Report to the NCDPI On the Alignment Characteristics of State Assessment Instruments Covering Grades 3-8, and High School Mathematics, Reading and Science (2015)”.

[STOO2020] Stoops, T., “Policy Position: Class Size”, John Locke Foundation, retrieved March 28, 2020.

Appendix: North Carolina General Statutes Concerning School Performance

The original, 2013-14 version of this is in Senate Bill 402.[NCSB402]

The version current as of 2020 is North Carolina General Statutes Chapter 115C, Article 8, § 115C-83.15 and following.[NCGS115C]

Part 1B. School Performance.

§ 115C-83.15. School achievement, growth, performance scores, and grades.

(a) School Scores and Grades. - The State Board of Education shall award school achievement, growth, and performance scores and an associated performance grade as required by G.S. 115C-12(9)c1., and calculated as provided in this section.

(b) Calculation of the School Achievement Score. - In calculating the overall school achievement score earned by schools, the State Board of Education shall total the sum of points earned by a school as follows:

(1) For schools serving any students in kindergarten through eighth grade, the State Board shall assign points on the following measures available for that school:

a. One point for each percent of students who score at or above proficient on annual assessments for mathematics in grades three through eight. For the purposes of this Part, an annual assessment for mathematics shall include any mathematics course with an end-of-course test.
b. One point for each percent of students who score at or above proficient on annual assessments for reading in grades three through eight.
c. One point for each percent of students who score at or above proficient on annual assessments for science in grades five and eight.
d. One point for each percent of students who progress in achieving English language proficiency on annual assessments in grades three through eight.

(2) For schools serving any students in ninth through twelfth grade, the State Board shall assign points on the following measures available for that school:

a. One point for each percent of students who score at or above proficient on either the Algebra I or Integrated Math I end-of-course test or, for students who completed Algebra I or Integrated Math I before ninth grade, another mathematics course with an end-of-course test.
b. One point for each percent of students who score at or above proficient on the English II end-of-course test.
c. One point for each percent of students who score at or above proficient on the Biology end-of-course test.
d. One point for each percent of students who complete Algebra II or Integrated Math III with a passing grade.
e. One point for each percent of students who either (i) achieve the minimum score required for admission into a constituent institution of The University of North Carolina on a nationally normed test of college readiness or (ii) are enrolled in Career and Technical Education courses and score at Silver, Gold, or Platinum levels on a nationally normed test of workplace readiness.
f. Repealed by Session Laws 2019-142, s. 1, effective July 19, 2019, and applicable to measures based on data from the 2018-2019 school year and each school year thereafter.
g. One point for each percent of students who graduate within four years of entering high school.
h. One point for each percent of students who progress in achieving English language proficiency.

In calculating the overall school achievement score earned by schools, the State Board of Education shall (i) use a composite approach to weigh the achievement elements based on the number of students measured by any given achievement element and (ii) proportionally adjust the scale to account for the absence of a school achievement element for award of scores to a school that does not have a measure of one of the school achievement elements annually assessed for the grades taught at that school. The overall school achievement score shall be translated to a 100-point scale and used for school reporting purposes as provided in G.S. 115C-12(9)c1., 115C-218.65, 115C-238.66, and 116-239.8.

(c) Calculation of the School Growth Score. - Using the Education Value-Added Assessment System (EVAAS), the State Board shall calculate the overall growth score earned by schools. In calculating the total growth score earned by schools, the State Board of Education shall weight student growth on the achievement measures as provided in subsection (b) of this section that have available growth values; provided that for schools serving students in grades nine through 12, the growth score shall only include growth values for measures calculated under sub-subdivisions a. and b. of subdivision (2) of subsection (b) of this section. The numerical values used to determine whether a school has met, exceeded, or has not met expected growth shall be translated to a 100-point scale and used for school reporting purposes as provided in G.S. 115C-12(9)c1., 115C-218.65, 115C-238.66, and 116-239.8.

(d) Calculation of the Overall School Performance Scores and Grades. - The State Board of Education shall calculate the overall school performance score by adding the school achievement score, as provided in subsection (b) of this section, and the school growth score, as determined using EVAAS as provided in subsection (c) of this section, earned by a school. The school achievement score shall account for eighty percent (80%), and the school growth score shall account for twenty percent (20%) of the total sum. For all schools, the total school performance score shall be converted to a 100-point scale and used to determine an overall school performance grade. The overall school performance grade shall be based on the following scale and shall not be modified to add any other designation related to other performance measures, such as a “plus” or “minus”:

(1) A school performance score of at least 85 is equivalent to an overall school performance grade of A.
(2) A school performance score of at least 70 is equivalent to an overall school performance grade of B.
(3) A school performance score of at least 55 is equivalent to an overall school performance grade of C.
(4) A school performance score of at least 40 is equivalent to an overall school performance grade of D.
(5) A school performance score of less than 40 is equivalent to an overall school performance grade of F.

(d1) Establishment of Subgroups of Students. - The State Board shall establish the minimum number of students in a subgroup served by a school that is necessary to disaggregate information on student performance and to determine a subgroup performance score and grade for the following subgroups of students:

(1) Economically disadvantaged students.
(2) Students from major racial and ethnic groups.
(3) Children with disabilities.
(4) English learners.

(d2) Calculation of the School Performance Scores and Grades for Certain Subgroups of Students Served by a School. - In addition to the overall school performance scores and grades awarded under this section, for each school that serves a minimum number of students in a subgroup of students listed in subsection (d1) of this section, the State Board of Education shall calculate school performance scores and shall determine a corresponding school performance grade for each subgroup using the same method as set forth in subsection (d) of this section. School performance scores for subgroups of students shall not be included in the calculation of the overall school performance scores and grades under subsection (d) of this section.
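
The arithmetic of subsection (d) can be sketched as follows. This is only my reading of the statutory text: it assumes both inputs are already on a 100-point scale, applies no rounding (the statute does not say how rounding is handled), and uses hypothetical function names. It is not NCDPI's implementation.

```python
def performance_score(achievement_score, growth_score):
    """Weighted sum per subsection (d): 80% achievement, 20% growth.

    Assumes both inputs are already expressed on a 100-point scale.
    """
    return 0.8 * achievement_score + 0.2 * growth_score

def performance_grade(score):
    """Letter grade from the scale in subsection (d), with no plus or minus."""
    if score >= 85:
        return "A"
    if score >= 70:
        return "B"
    if score >= 55:
        return "C"
    if score >= 40:
        return "D"
    return "F"

# Hypothetical school: achievement score 50, growth score 90.
score = performance_score(50, 90)
print(score, performance_grade(score))
```

With the 80/20 weighting, the hypothetical school above receives a C despite a high growth score, which illustrates how heavily the grade leans on the GLP-based achievement score.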