Thursday, 4 October 2007

Dangerous curves

In response to an Orin Kerr post about a grade complaint lawsuit against the University of Massachusetts, Megan McArdle asks why professors use curves in the first place:

[W]hy do faculty, particularly at the undergraduate level where the task is mastery of a basic body of knowledge, set exams where the majority of the students can’t answer a majority of the questions? Or, conversely, as I’ve also seen happen, where the difference between an A and a C is a few points, because everyone scored in the high 90’s? Is figuring out what your students are likely to know really so hard for an experienced teacher?

I’ve spent a lot of time the last four years looking into psychometric theory as part of my research on measurement (you can read a very brief primer here, or my working paper here), so I think I can take a stab at an answer. Or, a new answer: I’ve blogged a little about grading before at the macro level; you might want to read that post first to see where I’m coming from here.

The fundamental problem in test development is to measure the student’s domain-specific knowledge, preferably about things covered in the course. We measure this knowledge using a series of indicators—responses to questions—which we hope will tap this knowledge. There is no way, except intuition, to know a priori how well these questions work; once we have given an exam, we can look at various statistics that indicate how well each question performs as a measure of knowledge, but the first time the question is used it’s pure guesswork. And, we don’t want to give identical exams multiple times, because fraternities and sororities on most campuses have giant vaults full of old exams.

So we are always having to mix in new questions, which may suck. If too many of the questions suck—if almost all of the students get them right or get them wrong, or the good students do no better than the poor students on them—we get an examination that has a very flat grade distribution for reasons other than “all the students have equal knowledge of the material.”
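To make "questions that suck" a bit more concrete, here is a minimal sketch of the kind of after-the-fact item statistics described above. It assumes each response is scored 1 (right) or 0 (wrong); the function names and the tiny data set are made up for illustration, and the discrimination measure used here (correlation between an item and the score on the remaining items) is just one common choice.

```python
from statistics import mean, pstdev

def item_difficulty(responses, item):
    """Proportion of students who got the item right (1.0 = everybody did)."""
    return mean(row[item] for row in responses)

def item_discrimination(responses, item):
    """Correlation between an item and the total score on the *other* items.

    Returns 0.0 when either side has no variance, e.g. when every
    student got the item right and it cannot separate anyone.
    """
    item_scores = [row[item] for row in responses]
    rest_scores = [sum(row) - row[item] for row in responses]
    sd_item, sd_rest = pstdev(item_scores), pstdev(rest_scores)
    if sd_item == 0 or sd_rest == 0:
        return 0.0
    m_item, m_rest = mean(item_scores), mean(rest_scores)
    cov = mean((i - m_item) * (r - m_rest)
               for i, r in zip(item_scores, rest_scores))
    return cov / (sd_item * sd_rest)

# Hypothetical scored responses: 6 students (rows) x 4 questions (columns).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]

for q in range(4):
    print(q, item_difficulty(responses, q), item_discrimination(responses, q))
```

In this made-up data, question 0 is answered correctly by everyone, so it tells us nothing about who knows the material, and question 3 is answered correctly more often by students who did worse on the rest of the exam. Both would be candidates for rewriting before the next exam.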

It turns out in psychometric theory that the best examinations have questions that all do a good job of distinguishing good from bad students (they have high “discrimination”) and that span a variety of difficulty levels, ranging from easy to hard. Most examinations don’t have these properties. The people who write standardized tests like the SAT, ACT, and GRE spend lots of time and effort on these things and have thousands of exams to work with, and even they don’t achieve perfection; that’s why they don’t report “raw” scores on the exams, instead reporting “standardized” scores that make them comparable over time.
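As a rough illustration of why a standardized score travels better than a raw one, the sketch below rescales raw scores to a chosen mean and spread. This is a plain linear (z-score) transformation, not the equating procedure the testing companies actually use; the function name and the 500/100 scale are assumptions picked for the example.

```python
from statistics import mean, pstdev

def standardize(raw_scores, target_mean=500, target_sd=100):
    """Linearly rescale raw scores to a target mean and standard deviation."""
    m, sd = mean(raw_scores), pstdev(raw_scores)
    if sd == 0:
        # Everyone scored the same, so everyone gets the target mean.
        return [float(target_mean)] * len(raw_scores)
    return [target_mean + target_sd * (x - m) / sd for x in raw_scores]

# Hypothetical raw scores from a hard exam and an easy exam.
hard_exam = [40, 50, 60]
easy_exam = [85, 90, 95]

print(standardize(hard_exam))
print(standardize(easy_exam))
```

The two exams have very different raw scores, but the student in the middle of each class ends up with the same scaled score, and so do the students at the top and bottom. The raw number depends on how hard the exam happened to be; the scaled number depends only on where the student sits relative to everyone else who took it.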

If you go beyond simple true/false and multiple choice tests, the problems become worse; grading essays can be a bit of a nightmare. Some people develop really detailed rubrics for them; my tendency is to grade fairly holistically, with a few self-set guidelines for how to treat common problems consistently (defined point ranges for issues like citation problems, grammar and style, and the like).

So, we curve and otherwise muck with the grade distribution to correct these problems. Generally speaking, after throwing out the “earned F” students (students who did not complete all of the assignments and flunked as a result), I tend to aim for an average “curved” grade in the low 80s and try to assign grades based on the best mapping between the standard 90-80-70-60 cutoffs and GPAs. It doesn’t always work out perfectly, but in the end the relative (within-class) and absolute grades seem to be about right.
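One simple way to carry out a curve like this is sketched below. It assumes an additive shift (the same number of points added to every score) and a target average of 82; neither detail comes from the description above, which does not say how the curve is computed, so treat both as placeholders. The "earned F" students are assumed to have been removed from the list already.

```python
from statistics import mean

def curve(scores, target_mean=82.0):
    """Shift every score by the same amount so the class average hits the target.

    Scores are capped at 100, so the curved average can land slightly
    below the target when a top score hits the cap.
    """
    shift = target_mean - mean(scores)
    return [min(100.0, s + shift) for s in scores]

def letter(score):
    """Map a curved score to a letter using the standard 90-80-70-60 cutoffs."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"

# Hypothetical raw scores for students who completed all assignments.
raw = [55, 62, 70, 74, 89]
curved = curve(raw)
print([(r, c, letter(c)) for r, c in zip(raw, curved)])
```

An additive shift preserves the gaps between students, which keeps the relative (within-class) ordering intact while moving the absolute grades to where the cutoffs make sense. Note that the same function would lower everyone's score if the raw average came in above the target, which may or may not be what you want.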

Update: More on grading from Orin Kerr here.