PARCC Again


The NJ Board of Education announced this week that student test scores would count for 30% of a teacher’s evaluation, three times the previous share.  This has reignited controversy over the PARCC standardized tests.

What this story leaves out is that the NJ-ASK score was 30% of a teacher’s evaluation until the 2014-15 school year.  For the year the PARCC was introduced, the student score portion was lowered to 10%.  We are only returning to the status quo ante.

Opponents of PARCC in NJ schools point to a well-constructed statistical argument made by Gerald Goldin, a professor at Rutgers.  I have added two letters from him at the bottom, so you don’t have to follow a link.

In short, he condemns the use of students’ PARCC results to evaluate teachers.  His arguments are valid, but they don’t answer the most fundamental question a statistician should answer: “Compared to what?”

The other 70% of a teacher’s evaluation comes from observations of the teacher – the same kind of evaluation most employees would get from their bosses.  It has four components:

  • Planning
  • Environment
  • Instruction
  • Professionalism

What Goldin fails to address is whether a subjective evaluation of teacher competence is a better indicator than the admittedly imperfect but objective student test scores.

There is, of course, another option for teacher evaluation: don’t do it at all.  Either pay all teachers the same or put them on a uniform salary increase schedule.  Under such a schedule, all teachers who started in 2006 are paid the same.

What NJ taxpayers are paying for is supposedly the best primary and secondary public education in the United States.  To get that, we need the best teachers and best teaching practices in the country.  How do we know if we are getting the best teaching?  Most parents want to see their kids get into a good college, or at least have the skills for a good job.  The Abbott formula sums it up over the long haul:

  • Percent of adults with no high school diploma
  • Percent of adults with some college education
  • Occupational status
  • Unemployment rate
  • Percent of individuals in poverty
  • Median family income.

We need a means to measure whether the money we are investing in our schools and our kids is paying off with better results in the six Abbott measures.  If someone comes up with a better measure of teachers, I’m all ears.  Until then, I want teachers evaluated, hired and fired, and paid according to the measure that best aligns taxpayer goals with teaching methods.  Today we have only three:

  • Teacher time-in-service
  • Observation of teachers with subjective measures
  • Scores on student tests.

Sure, PARCC tests are imperfect.  Still, they have been shown to predict college performance better than the SAT.  Neither of those tests predicts college performance nearly as well as a student’s high school GPA.  The problem is that you can’t judge a teacher’s performance by his or her students’ GPA.  Teachers would sit down at the end of the marking period, give all their students high marks, and write themselves an excellent evaluation in so doing.

PARCC tests are an imperfect measure of a student’s learning.  A student may have an off day and not do well.  That’s random error.  When you look at a classroom full of students, or a school full, or a district full, those errors largely average out.
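
A minimal sketch of that averaging, assuming each student's test-day error is independent and roughly normal with a spread of 10 scale points. Both the independence and the 10-point figure are assumptions for illustration, not PARCC numbers, and the function name is hypothetical. The sketch covers only random error; it says nothing about any systematic error in the test.

```python
import random
import statistics


def spread_of_group_means(group_size, noise_sd=10.0, trials=2000, seed=0):
    """Spread (standard deviation) of the average test-day error
    across many simulated groups of the given size."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        statistics.fmean([rng.gauss(0.0, noise_sd) for _ in range(group_size)])
        for _ in range(trials)
    ]
    return statistics.stdev(means)


# One student, one classroom, one larger group (sizes are illustrative).
for size in (1, 25, 400):
    print(size, round(spread_of_group_means(size), 2))
```

Under these assumptions the spread of the group average shrinks roughly with the square root of the group size, so a class of 25 carries about one fifth of the random error of a single student.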

Goldin criticizes the effects of kurtosis and thresholds.  Those would be valid criticisms, if the test measured achievement of grade-level proficiency.  Today, the tests measure only percentile (rank relative to other students).  The designers of PARCC need to establish the absolute levels of proficiency the school system expects of a sixth grader or a tenth grader.  We have five years to do that before we use PARCC as a gate for graduation.  If there is no progress from today, I will be sorely disappointed.  For now, though, we have to administer the test and collect the history we need to develop the score thresholds we need in 2021.

I have very little faith in a system that rewards teachers for their ability to impress a classroom observer, but disregards the measurable performance of students.  And I hope that you, my reader, reject the number of years’ experience in the class as a measure of return for NJ taxes spent.


Goldin’s letters

An 8th thing to know: Junk statistics


I read with interest the on-line article (March 16, 2015), “7 things to know about PARCC’s effect on teacher evaluations.”


As a mathematical scientist with knowledge of modeling, of statistics, and of mathematics education research, I am persuaded that what we see here could fairly be termed “junk statistics” — numbers without meaning or significance, dressing up the evaluation process with the illusion of rigor in a way that can only serve to deceive the public.


Most New Jersey parents and other residents do not have the level of technical mathematical understanding that would enable them to see through such a pseudoscientific numbers game.  It is not especially reassuring that only 10% of the evaluation of teachers will be based on such numbers this year, 20% next year, or that a teacher can only be fired based on two years’ data.  Pseudoscience deserves no weight whatsoever in educational policy.  It is immensely troubling that things have reached this point in New Jersey.


I have not examined the specific plans for using PARCC data directly, but am basing this note on the information in the article.  Some of the more detailed reasons for my opinion are provided in a separate comment.


In short, I think the 8th thing to know about PARCC’s effect on teacher evaluation is that the public is being conned by junk statistics.  The adverse effects on our children’s education are immediate.  This planned misuse of test results influences both teachers and children.




Gerald A. Goldin, Ph.D.

Distinguished Professor

Mathematics, Physics, and Mathematics Education

Rutgers – The State University of New Jersey




Why the reportedly planned use of PARCC test statistics is “junk science”:


First, of course, we have the “scale error” of measurement in each of the two tests (PARCC and NJ-ASK).  Second, we have random error of measurement in each of the two tests, including the effects of all the uncontrollable variables on each student’s performance on any given day, resulting in inattention, misreading of a question, “careless mistakes,”  etc.  Third, we have any systematic error of measurement – possibly understating or overstating student competency – that may be present in the test instruments, may be different in the two instruments, and may vary across the test scales.


The magnitude of each of these sources of error is about doubled when the difference of two independently-obtained scores is taken, as it is in calculating the gain score.  In addition, since two different test instruments are being used in the calculation, taking the difference of the scores requires some derived scale not specified in the article, which can introduce additional error.  These sources of error mean that each student’s individual gain score has a wide “error bar” as a measure of whatever it is that each test is designed to measure.
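
An illustrative sketch of the gain-score point (not part of the original letter). It assumes the two tests share a scale, that each has independent, normally distributed error with a spread of 5 points, and that the student's true score does not change; all three are assumptions made for the example. Under those assumptions the variance of the difference is the sum of the two variances, so the variance doubles and the standard deviation grows by a factor of about 1.41.

```python
import random
import statistics

rng = random.Random(1)   # fixed seed so the run is repeatable
ERROR_SD = 5.0           # assumed measurement error per test, in scale points
N = 20000                # number of simulated students

# The true score is identical on both tests, so every nonzero "gain"
# below is pure measurement error.
gains = [rng.gauss(0.0, ERROR_SD) - rng.gauss(0.0, ERROR_SD) for _ in range(N)]

sd_ratio = statistics.stdev(gains) / ERROR_SD
var_ratio = statistics.variance(gains) / ERROR_SD ** 2

print(round(sd_ratio, 2))   # ratio of spreads, close to the square root of 2
print(round(var_ratio, 2))  # ratio of variances, close to 2
```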


Fourth, we have “threshold effects” – some students are advanced well beyond the content intended to be measured by each test, while others are far behind in their knowledge of that content.  The threshold effects contribute to contaminating the data with scores that are not applicable at all.  Note that while the scores of such students may be extremely high or low, their difference from one year to the next may not be extreme at all.  Thus they can contribute importantly in calculating a median (see below).


A fifth effect results from students who did not take one of the two tests.  Their gain scores cannot be calculated, and consequently some fraction of each teacher’s class will be omitted from the data.  This may or may not occur randomly, and in any case it contributes to the questionability of the results.


Sixth is the fact that many variables other than the teacher influence test performance – parents’ level of education, socioeconomic variables, effects of prior schooling, community of residence, and so forth.  Sophisticated statistical methods sometimes used to “factor out” such effects (so-called “value added modeling”) introduce so much additional randomness that no teacher’s class comes close in size to being a statistically significant sample.  But without the use of such methods, one cannot properly attribute “academic growth” or its absence to the teacher.


According to the description in the article, the student gain scores are then converted to a percentile scale ranging from 0 to 100, by comparison with other students having “similar academic histories.”  It is not clear to me whether this means simply comparison with all those having taken both tests at the same grade level, or also means possibly stratifying with respect to other, socioeconomic variables (such as district factor groupings) in calculating the percentiles.  Then the median of these percentile scores is found across the teacher’s class.  Finally the median percentile of gain scores is converted to a scale of 1-4; it is not specified whether one merely divides by 25, or some other method is used.


However, a seventh objection is that test scores, and consequently gain scores, are typically distributed according to a bell-shaped curve (that is, approximately a normal distribution).  Percentile scores, on the other hand, form a level distribution (that is, they are uniformly distributed from 0 to 99).  This artificially magnifies the scale toward the center of the bell-shaped distribution, and diminishes it at the tails.  Small absolute differences in gain scores near the mean gain score result in important percentile differences, while large absolute differences in gain scores near the extremes result in small percentile differences.
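
An illustrative sketch of this percentile stretching (not part of the original letter). It assumes gain scores follow a normal distribution with mean 0 and standard deviation 10, which are hypothetical values chosen for the example, and it uses Python's `statistics.NormalDist`.

```python
from statistics import NormalDist

# Hypothetical distribution of gain scores (assumed, not PARCC data).
gain_scores = NormalDist(mu=0.0, sigma=10.0)


def percentile(gain):
    """Percentile rank (0 to 100) of a gain score under the assumed distribution."""
    return 100 * gain_scores.cdf(gain)


# The same 5-point difference in gain, measured in two places on the curve.
near_mean = percentile(2.5) - percentile(-2.5)
in_tail = percentile(25.0) - percentile(20.0)

print(round(near_mean, 1), round(in_tail, 1))
```

With these assumed numbers, a 5-point difference straddling the mean spans about 20 percentile points, while the same 5-point difference out in the upper tail spans fewer than 2.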


There are more complications.  The distribution of performance on one or both tests may be skewed (this is called kurtosis), so that it is not a symmetrical bell-shaped curve.  How wide the distribution of scores is (the “sample standard deviation”) is very important, but does not seem to have been taken into account explicitly.  Sometimes this is done in establishing the scales for reporting scores, in which case one thereby introduces an additional source of random error into the derived score, particularly when distributions are skewed.


Eighth, and perhaps most tellingly, the median score as a measure of central tendency is entirely insensitive to the distribution of scores above and below it.  A teacher of 25 students with a median “academic growth” score of 40 might have as many as 12 students with academic growth scores over 90, or not a single student with an academic growth score above 45.  To use the same statistic in both cases is patently absurd.
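
An illustrative sketch of the two classes described above (not part of the original letter). The individual scores are invented for the example; only the class size of 25, the median of 40, and the "12 students over 90" versus "nobody above 45" contrast come from the text.

```python
import statistics

# Two hypothetical classes of 25 growth percentiles, both with median 40.
class_a = [10] * 12 + [40] + [95] * 12  # twelve students above 90
class_b = [5] * 12 + [40] + [45] * 12   # no student above 45

median_a = statistics.median(class_a)
median_b = statistics.median(class_b)
mean_a = statistics.fmean(class_a)
mean_b = statistics.fmean(class_b)

print(median_a, median_b)  # identical medians
print(mean_a, mean_b)      # very different means
```

Both invented classes report a median of 40, yet their means are 52.0 and 25.6, which is the insensitivity the paragraph describes.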


These comments do not address the validity of the tests, which some others have criticized.  They pertain to the statistics of interpreting the results.


The teacher evaluation scores that will be derived from the PARCC test will tell us nothing whatsoever about teaching quality.  But their use tells us a lot about the quality of the educational policies being pursued in New Jersey and, more generally, the United States.


Gerald A. Goldin, Ph.D.

Distinguished Professor, Rutgers University

Mathematics, Physics, Mathematics Education
