(Version: 6/15/2020)

This report is part of a series of analyses of publicly available North Carolina Department of Public Instruction student achievement data, known as the disaggregated data files. These FERPA-compliant files contain data on all the state’s public traditional and charter schools beginning in school year 2013-14. My intent in this series is to provide the public with discussions of the usefulness of the NCDPI data, and with descriptive statistics that I feel may be of interest. I will not perform modeling or hypothesis testing.

There’s What You Want and What You Ask For

“When a measure becomes a target, it ceases to be a good measure.”
“Goodhart’s law.” Wikipedia, retrieved March 24, 2020. https://en.wikipedia.org/wiki/Goodhart%27s_law

“In economics, ‘rational expectations’ are model-consistent expectations, in that agents inside the model are assumed to ‘know the model’ and on average take the model’s predictions as valid.”
“Rational Expectations.” Wikipedia, retrieved March 24, 2020. https://en.wikipedia.org/wiki/Rational_expectations

Introduction

The issue of racial equity in K-12 education has been of concern to the public, educators, and policy makers for decades.[COLE1966][CLOT2007] In order to address this, in 2007 the North Carolina State Board of Education formed a commission with the task of “comprehensively reviewing and offering recommendations for revisioning the State’s test program and accountability system.”[NCDPI2008] Among these recommendations was that, “These new and different accountability models must be understandable and transparent. They must provide parents and other stakeholders with valid and meaningful information about the performance of students …”, and also, “Teachers and principals need support and data that enable them to make informed instructional decisions that result in positive outcomes for students.”

The Commission consisted of twenty-five members selected from legislators, corporate executives, educators, and university faculty. The Commission was assisted by thirty-two individuals who provided information and the benefit of their perspectives. As far as I can determine, there was perhaps one statistician or quantitative social scientist among these fifty-seven persons, although two were affiliated with an educational assessment company and might have provided statistical insights.

In the State Budget for 2013-15 the North Carolina legislature mandated school performance grades based on the results of standardized tests, combined with school growth scores produced from the SAS Educational Value-Added Assessment System.[NCSB402][SASevaas] A statement of the General Statutes applicable to school performance grades, as of 2020, is in the Appendix.[NCGS115C]

In 2015 the North Carolina Department of Public Instruction (NCDPI) published a report on the status of the state's standardized testing.[SMIT2015] The report states, “The current report focuses explicitly on the relationship between new assessments and their respective content standards or curricular goals. Phase 2 of the study will examine the relationship between instructional practice and relevant content standards based upon a randomly selected representative sample of teachers in the state, while Phase 3 will examine the impact of students' opportunity to learn standards-based content on student achievement. The completed study will provide the state with a unique data set for modeling the performance of the standards-based system as depicted by the various data collection and analysis strategies employed for the study.” I have not been able to find any follow-up to this report, that is, a Phase 2 or Phase 3, but something may exist.

What I Address Here

Reading the Commission charge and the 2015 status report leads me to conclude that the standardized tests were conceived of as aiding in evaluating and improving outcomes for individual students, and only by implication providing school-level or instructor-level evaluations. Nevertheless, the General Statutes of North Carolina make it clear that the standardized tests have been purposed to these latter ends.[PSF2020] I will address the question: to what extent do the publicly available NCDPI data provide policymakers, educators, and the public with the means to assess the achievement of students and schools? More specifically, I will discuss some quantitative aspects of determinations of education equity between Black and White elementary school students in North Carolina public schools.

The standardized tests are constructed and vetted by educators and psychometricians. The state's standardized tests are constructed according to Item Response Theory (IRT), where questions seek to determine levels of comprehension of grade-related skills.[SMIT2015][NCDPIach] Test scores have broad ramifications for students, teachers, and administrators, as they are used for purposes beyond evaluating the needs and performance of individual students, such as grading schools, evaluating teachers, setting policies, and so on.[HO2015]

My analysis concentrates on grades 3, 4, and 5, primarily because students in these grades take a series of standardized Math and Reading tests, and because they pass from grade to grade in a reasonably orderly fashion, in what is sometimes called a wave. I will concentrate on comparing the achievement of Black and White students. The other race categories are unevenly distributed throughout North Carolina, making their analysis difficult. I will not be looking at charter schools, since they tend to change in grades offered and in enrollments over the years, making them better suited to a separate endeavor. A breakdown of school enrollment by race is available at the NCDPI website.[NCDPIenrol]

The intention of the Commission was two-fold: to create a framework for testing and accountability to inform about the performance of children, and to provide instructional support. It does not appear that the statistical underpinnings of the former intent were ever established. To the extent that this intent was pursued, the work has been happenstance rather than rigorous.

The crux of the problem is the disjuncture among the stated intent of the Commission, how the data is actually used, and what is actually provided to policymakers, educators, and the public.

Problem 1: Percentages

What is disclosed in the publicly available, FERPA-compliant data is the number of students who achieve Grade Level Proficiency (GLP), and the total number of students tested, by race, rolled up for each grade at each school. Individual classes and teachers are not identified, unless there is only one class for a grade at a school, and then only by implication. The only practicable way to compare schools, or to follow schools across the years, is to use the percentage of students who achieve GLP, but grade size (the denominator) varies considerably. Numerical, statistical, and interpretive problems can be expected whenever percentages with greatly varying denominators are compared.
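To make the denominator problem concrete, here is a minimal sketch, in Python, of the sampling noise in an observed GLP percent under a simple binomial model. The proficiency rate and grade sizes below are invented for illustration; they are not NCDPI figures.

    import math

    def glp_se(p, n):
        # Standard error, in percentage points, of an observed GLP percent
        # when the true proficiency rate is p and n students are tested.
        return 100.0 * math.sqrt(p * (1.0 - p) / n)

    for n in (20, 60, 150):
        print(f"n = {n:3d}: +/- {glp_se(0.5, n):.1f} points")
    # n =  20: +/- 11.2 points
    # n =  60: +/-  4.1... no -- 6.5 points
    # n = 150: +/-  4.1 points

A grade of twenty students can differ from an otherwise identical grade by a dozen percentage points through sampling variation alone, which is why comparisons across very different grade sizes are hazardous.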

Problem 2: Goodhart's Law

The construction and validation of the standardized tests seem to omit any evaluation of the extent to which student performance is influenced by the expenditure of resources on student preparation. Performance must also depend on student opportunities quite outside the schools' scope of control.

Problem 3: Characteristics of the Tests

NCDPI does not make the score-frequency distributions publicly available except rolled up for all students (more on that below). That data is available from NCERDC but is FERPA-protected. Without knowledge of the race-based characteristics of the tests, we are dealing with an avoidable mystery.

Problem 4: A Very Zen-like Experience or Goodhart's Law Again

What is described by the Commission does not recognize a student's past or future, but only isolated yearly evaluations. The concept of improvement is left to ad hoc comparisons of some summary measures, and to the completely separate EVAAS. This encourages educators, principals, and administrators to concentrate on improving students' prospects of scoring well on the standardized tests, which can be different from learning the mandated materials.

Where Is The Data?

The publicly available data is primarily in what are called the disaggregated data files. This data is for grades 3 through 8, including summaries. These are FERPA-compliant, consequently some data is masked and individual classes are accumulated into one entry for each grade for each school. The data is shown for races and other demographics, such as gender and whether economically disadvantaged; these, however, are masked if the number of students is small (see the General Statutes in the Appendix). The disaggregated data files can be downloaded from NCDPI without the need for permission.[NCDPIdis]
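When working with the downloaded files, the masked cells have to be handled explicitly. The sketch below assumes masked percentages appear as strings such as “>95” and “<5”; that coding is my assumption for illustration and should be verified against the actual files before use.

    def unmask(value):
        # Convert a possibly masked percentage cell to a float, mapping a
        # censored value to the midpoint of its interval. The ">95"/"<5"
        # codes are assumed, not documented here -- verify before relying
        # on this.
        if isinstance(value, str):
            if value.startswith(">"):
                return (float(value[1:]) + 100.0) / 2.0   # ">95" -> 97.5
            if value.startswith("<"):
                return float(value[1:]) / 2.0             # "<5"  -> 2.5
        return float(value)

    print([unmask(v) for v in ["62.4", ">95", "<5", 48.0]])
    # [62.4, 97.5, 2.5, 48.0]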

FERPA-protected data is handed off by NCDPI to the North Carolina Education Research Data Center (NCERDC) under a contractual agreement.[NCERDC] NCERDC acts as a portal to the data, requiring research to be carried out according to FERPA-compliant privacy rules. This means that investigators must be associated with a university or research organization that has an Institutional Review Board (IRB). Investigators must be able to show that they will protect the data in compliance with NCERDC requirements, and to destroy the data (but of course not the results of analysis) upon completion of the project.

Grade Level Proficiency: Understandable and Reductionist

The Grade Level Proficiency categorization is among the most frequently mentioned quantitative criteria and is a realization of what the General Statutes mandate.[NCES2019] The grade 3, 4, and 5 standardized tests have been designed so that the score needed to achieve grade level proficiency is the median (give or take one point in a range of about eighty points) of the statewide student scores. This corresponds to achieving Level 3 on the five-level End-of-Grade tests. Gap analysis is the comparison of mean GLP percent values for Black students to those for White students. As generally used, the gap is the numerical difference between the GLP percents. While more refined methods can be used, consumption by policymakers and the public is eased by reducing whatever is happening to a single number. I avoid the use of the term “gap analysis,” since this subject already has a rich research literature and complex research methodologies.[JENC1998][CEPAgap] My interest lies in using the NCDPI publicly available data more broadly than just the GLP gap.
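As a worked example of the gap as generally used (all counts are invented):

    # GLP percent is the ratio of the two disclosed counts; the gap is
    # the difference of the subgroup percents. All numbers are invented.
    white_glp, white_tested = 54, 80
    black_glp, black_tested = 14, 40

    white_pct = 100.0 * white_glp / white_tested   # 67.5
    black_pct = 100.0 * black_glp / black_tested   # 35.0
    gap = white_pct - black_pct                    # 32.5 percentage points

Note that the two percents rest on very different denominators, which is Problem 1 again.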

Score-Frequency Profiles

In grades 3, 4, and 5 the standardized tests are designed according to IRT methodologies, so that the achievement of GLP is not just a simple accumulation of correct answers. This is expressed most clearly by score-frequency diagrams. Figure 1 shows the data for all North Carolina public school students in grade 3, 2017-18 Mathematics.[NCDPIgb] The step at the Level 2/Level 3 boundary is an expression of the Item Response methodology. Its intent is to assure that some scattered collection of correct answers will not lead to achieving Grade Level Proficiency. This profile is derived from the testing at the end of the year; that is, it is not a design goal verified by sampling early in the school year. It also says nothing about the possible influence of students' achievement levels on entry into the grade. I am sure that design goals are stated somewhere, but I do not know where that might be.

Figure 1. A Representative Score-Frequency Profile
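To see why an IRT-scored test is not a simple count of correct answers, consider a minimal sketch of the two-parameter logistic (2PL) item response function, one common IRT form. This is an illustration only: the item parameters below are invented, and I do not know the specific IRT model NCDPI uses.

    import math

    def p_correct(theta, a, b):
        # 2PL item response function: the probability that a student of
        # ability theta answers an item of discrimination a and
        # difficulty b correctly.
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    # Two hypothetical items: one easy and diffuse, one hard and sharp.
    for theta in (-1.0, 0.0, 1.0):
        easy = p_correct(theta, a=0.8, b=-1.0)
        hard = p_correct(theta, a=2.0, b=0.5)
        print(f"theta = {theta:+.1f}: easy item {easy:.2f}, hard item {hard:.2f}")

Under such a model a student's scale score is estimated from the pattern of responses weighted by curves like these, not from a raw count, which is one way a step like that at the Level 2/Level 3 boundary can arise.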

How does the score-frequency diagram translate into observable results? NCDPI does not make the score-frequency data publicly available by subgroup (NCDPI uses that term to denote race and several other categories), but Figure 2 shows that the shapes of the end-of-year GLP percent by school distributions differ.

Figure 2. Representative GLP Percent By Race Distributions

It is evident from Figure 2 that the distributions for White students and for the wealthier Not-EDS students (EDS: Economically Disadvantaged Students) differ structurally from those for the other categories. Not only are the means different, but the shapes are different. The ALL data is more symmetrical than the White and Not-EDS, which lean toward higher percentages. The Black and EDS distributions have tails into higher GLP percent, while the mass of the students lies in the lower GLP percents. The profiles are so different that it raises the question of whether the tests are better at identifying the subgroup or assessing performance.[HO2015] (The vertical bar just below 100% in Not-EDS and White is an artifact of FERPA compliance and should be imagined as spread out over the 95% to 100% interval.)

I have already mentioned the 2015 program assessment that concentrated on the alignment between grade-specific learning objectives and the standardized tests.[SMIT2015] This omits any evaluation of student capabilities for taking any test, regardless of the goal of the test. But the ability to do well on tests must be at least associated with the extent to which teachers and principals expend resources on student preparation. It also must depend on student opportunities quite outside the schools' scope of control.

Comparisons Within Schools - A Moving Target

One way of comparing schools' GLP percents to themselves is to look across school years for the same grade. Since I am using grades 3, 4, and 5, and have six school years and two subjects, Math and ELA, that would result in a lot of plots. I did that and found the similarities between grades, subjects, and years so great that displaying one typical plot suffices.

Hidden in plain sight:

Figure 3 combines the GLP percent data for three years, 2016-17, 2017-18, and 2018-19, into what is called a first differences plot. The horizontal axis shows, for all students regardless of race and by school for grade 3 Math, the difference between GLP percent for 2016-17 and 2017-18, while the vertical axis shows the difference between 2018-19 and 2017-18.

Figure 3. All Schools Math Grade 3
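For readers who want to reproduce this kind of plot, here is a minimal sketch of the first-differences construction. The table, its column names, and the sign convention (later year minus earlier year) are my assumptions for illustration.

    import pandas as pd

    # Hypothetical wide table: one row per school, one GLP percent per year.
    glp = pd.DataFrame({
        "school":  ["A", "B", "C"],
        "2016-17": [45.0, 60.0, 72.0],
        "2017-18": [52.0, 55.0, 70.0],
        "2018-19": [48.0, 63.0, 75.0],
    })

    # First differences: x is the change from 2016-17 to 2017-18, and
    # y is the change from 2017-18 to 2018-19, one point per school.
    glp["x"] = glp["2017-18"] - glp["2016-17"]
    glp["y"] = glp["2018-19"] - glp["2017-18"]
    print(glp[["school", "x", "y"]])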

There are several striking features of Figure 3. For one, there is a “cloud,” and it is big. Differences extend to thirty percentage points, although the bulk are twenty-five points or less. Another is that there are a lot of up-and-down changes, these being in the lower right-hand and upper left-hand quadrants. From the viewpoint of statistics, the overall features of Figure 3 make sense, at least insofar as the smaller changes are concerned. Indeed, it would be very suspicious if the majority of changes were consistently in one direction. The underlying reason is that we are dealing with percentages where there are variations, some substantial, in the number of grade 3 students in each school. This leads into a second reason: there are changes in the number of grade 3 students from year to year in the schools, which may well be associated with changes in staff and policies. This, combined with the race-related distributions shown in Figure 2, shows that the reductionist GLP percent is of some, but limited, usefulness. Its primary usefulness is the comparison of statewide mean GLP percent by grade across years, although even that is negatively affected by what I discussed in relation to Figure 2. That the GLP percent obscures details definitely does not inspire confidence or recommend its use by policymakers and educators.
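A small simulation illustrates how much of the cloud pure sampling variation can generate on its own. Here a school's true proficiency rate is held fixed across years, and only the binomial draw changes; all parameters are invented.

    import random

    def mean_abs_change(p, n, trials=10000):
        # Mean absolute year-to-year change, in percentage points, in the
        # observed GLP percent when the true rate p never changes and
        # n students are tested each year.
        total = 0.0
        for _ in range(trials):
            y1 = 100.0 * sum(random.random() < p for _ in range(n)) / n
            y2 = 100.0 * sum(random.random() < p for _ in range(n)) / n
            total += abs(y2 - y1)
        return total / trials

    random.seed(1)
    for n in (20, 40, 60):
        print(f"n = {n}: mean |change| ~ {mean_abs_change(0.5, n):.1f} points")
    # Roughly 13, 9, and 7 points respectively -- with no real change at all.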

Figure 4 shows one of the sources of Figure 3's cloud: the year-to-year changes in grade enrollments. For many reasons, schools reconfigure, which might entail small changes or the large changes of adding or dropping a class. NCDPI data shows that the average grade 3 class size was about twenty students, which helps in interpreting the changes. Figure 4 looks only at Black students for the convenience of having a less cluttered plot than for White or all students, there being about 250 more schools with White students than with Black students. These year-to-year changes raise many questions if they are appreciable. For instance, are changes by race related? In what ways is a school “the same” if changes are large? How can changes in GLP percent by race be used as a measure if changes are large? And so on.