Dangerous curves (Signifying Nothing: how does saying nothing at all become so loud?)

Thursday, 4 October 2007

In response to an Orin Kerr post about a grade complaint lawsuit against the University of Massachusetts, Megan McArdle asks why professors use curves in the first place:

[W]hy do faculty, particularly at the undergraduate level where the task is mastery of a basic body of knowledge, set exams where the majority of the students can’t answer a majority of the questions? Or, conversely, as I’ve also seen happen, where the difference between an A and a C is a few points, because everyone scored in the high 90’s? Is figuring out what your students are likely to know really so hard for an experienced teacher?

I’ve spent a lot of time the last four years looking into psychometric theory as part of my research on measurement (you can read a very brief primer here, or my working paper here), so I think I can take a stab at an answer. Or, a new answer: I’ve blogged a little about grading before at the macro level; you might want to read that post first to see where I’m coming from here.

The fundamental problem in test development is to measure the student’s domain-specific knowledge, preferably about things covered in the course. We measure this knowledge using a series of indicators—responses to questions—which we hope will tap this knowledge. There is no way, except intuition, to know a priori how well these questions work; once we have given an exam, we can look at various statistics that indicate how well each question performs as a measure of knowledge, but the first time the question is used it’s pure guesswork. And, we don’t want to give identical exams multiple times, because fraternities and sororities on most campuses have giant vaults full of old exams.

So we are always having to mix in new questions, which may suck. If too many of the questions suck—if almost all of the students get them right or get them wrong, or the good students do no better than the poor students on them—we get an examination that has a very flat grade distribution for reasons other than “all the students have equal knowledge of the material.”

It turns out in psychometric theory that the best examinations have questions that all do a good job of distinguishing good from bad students (they have high “discrimination”) and have a variety of difficulty levels, ranging from easy to hard. Most examinations don’t have these properties; the people who write standardized tests like the SAT, ACT, and GRE spend lots of time and effort on these things and have thousands of exams to work with, and even they don’t achieve perfection—that’s why they don’t report “raw” scores on the exams, instead reporting “standardized” scores that make them comparable over time.
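To make "difficulty" and "discrimination" concrete, here is a minimal sketch of a classical item analysis in Python. It is an illustration rather than the method from the working paper: it assumes each question is scored 0 or 1, takes difficulty to be the share of students who got the item right, and takes discrimination to be the correlation between the item and the student's score on the remaining items. The function names and the sample data are made up for the example.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series has no variation."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # everyone got the same score, so the item cannot discriminate
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / (sx * sy)


def item_stats(responses):
    """responses: one list of 0/1 item scores per student.

    Returns a (difficulty, discrimination) pair for each item.
    """
    totals = [sum(student) for student in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [student[j] for student in responses]
        # Score on the *other* items, so the item is not correlated with itself.
        rest = [total - x for total, x in zip(totals, item)]
        stats.append((mean(item), pearson(item, rest)))
    return stats


# Hypothetical class of five students and four questions.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
stats = item_stats(responses)
```

In this made-up data the first question is answered correctly by everyone, so it tells us nothing about who knows the material; the second question is missed only by the weaker students and gets a positive discrimination value.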

If you go beyond simple true/false and multiple choice tests, the problems become worse; grading essays can be a bit of a nightmare. Some people develop really detailed rubrics for them; my tendency is to grade fairly holistically, with a few self-set guidelines for how to treat common problems consistently (defined point ranges for issues like citation problems, grammar and style, and the like).

So, we curve and otherwise muck with the grade distribution to correct these problems. Generally speaking, after throwing out the “earned F” students (students who did not complete all of the assignments and flunked as a result), I tend to aim for an average “curved” grade in the low 80s and try to assign grades based on the best mapping between the standard 90-80-70-60 cutoffs and GPAs. It doesn’t always work out perfectly, but in the end the relative (within-class) and absolute grades seem to be about right.
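As a rough sketch of what that kind of curve can look like in code, the Python below shifts every score by the same amount so the class average lands on a target, then applies the 90-80-70-60 cutoffs. The additive shift, the target of 82, the cap at 100, and the sample scores are all assumptions chosen for illustration; the actual adjustment involves more judgment than this.

```python
def curve_scores(raw, target_mean=82.0, cap=100.0):
    """Add a constant to every score so the class mean hits target_mean.

    Scores are capped at `cap`, so if any score hits the cap the final
    mean can fall slightly below the target.
    """
    shift = target_mean - sum(raw) / len(raw)
    return [min(cap, score + shift) for score in raw]


def letter(score):
    """Map a numeric score to a letter using the 90-80-70-60 cutoffs."""
    for cutoff, grade in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if score >= cutoff:
            return grade
    return "F"


# Hypothetical raw scores, with the "earned F" students already removed.
raw = [55.0, 62.0, 70.0, 48.0, 65.0]
curved = curve_scores(raw)
letters = [letter(score) for score in curved]
```

A constant shift keeps the ranking of students and the gaps between them intact, which is the within-class part; the target mean and the fixed cutoffs supply the absolute part.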

Update: More on grading from Orin Kerr here.

1 comment:

Any views expressed in these comments are solely those of their authors; they do not reflect the views of the authors of Signifying Nothing, unless attributed to one of us.

1. Neil Lawrence wrote @ Fri, 5 Oct 2007, 12:13 pm CDT:

I read this with great interest. It’s always hard passing judgement on some student’s knowledge. Some people take written tests better than others. Maybe it was a bad day for the student due to illness, stress, etc. I know because I missed three questions on a math test in college because I didn’t turn the page over; I was too tired and not thinking right. Luckily, the professor graded me on the questions I had answered.

We (USAF navigation instructors) had similar problems in the Undergraduate Navigation School with testing, but we had little leeway in our test grading as individual instructors. Test grading was relatively easy as the tests were standardized, multiple choice exams. Occasionally we’d have a bad question or a case where there was more than one correct answer. In that case we could throw out the question and adjust the grade accordingly. Grading flight and simulator check rides was much more difficult, especially the flight checks, as they were harder to recreate minute-by-minute. We had a comprehensive grading guide for check rides, but we were allowed to exercise judgement based on weather conditions, equipment malfunctions, etc. After a while at the school, I was in the position to edit the syllabus, edit the instructors’ classroom instruction guides and write new tests. Writing the tests was quite a challenge since the USAF had very definite guidelines on multiple choice questions. The only way to proof the tests was to give them to other instructors, which often proved entertaining for the test giver and/or instructional for the test takers. Even after 55+ years of aerial navigation we were still finding errors in the two major USAF flight manuals. Dad
