Monday, 25 August 2003

The L.A. Times poll and oversampling

Dan Weintraub notes that the Los Angeles Times poll of California voters, the first to show a lead for Bustamante outside the margin of error, included a special sample of 125 Latino voters. Dan hasn’t gotten clarification yet as to how the Latinos were counted in the overall poll, which interviewed 1,351 (self-declared?) registered voters, 801 of whom were deemed “likely” voters.

The key question is whether the 125 Latinos were all “likely” voters or just registered voters. As a share of the 1,351 registered voters, the count seems reasonable for a sample of Californians; however, if all 125 were among the “likely” voters, Latinos were oversampled, and that should have been corrected. (* For more on this, see the note below.)

So the big question is whether the oversample was folded into the main poll and, if so, whether it was compensated for. If it wasn’t, the Times poll is giving us a biased estimate of the population parameter (in this case, the percentage of likely voters who are planning to vote for Bustamante or leaning that way).
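To get a feel for the size of the problem, here is a minimal sketch of what happens when an oversample is simply pooled with the main sample. Every number in it is made up for illustration (a population that is 15 percent Latino, 50 percent support among Latinos and 30 percent among everyone else); none of it comes from the Times poll, and the function name is hypothetical.

```python
# Hypothetical numbers for illustration only; none come from the Times poll.

def pooled_share(counts, support):
    """Unweighted share backing a candidate when all groups are simply pooled."""
    total = sum(counts.values())
    return sum(counts[g] * support[g] for g in counts) / total

# Assumed within-group support for the candidate.
support = {"latino": 0.50, "other": 0.30}

# A sample that mirrors an assumed 15% / 85% population split.
proportional = {"latino": 150, "other": 850}

# The same sample with 150 extra Latino interviews pooled in, uncorrected.
oversampled = {"latino": 300, "other": 850}

true_share = pooled_share(proportional, support)
naive_share = pooled_share(oversampled, support)
bias = naive_share - true_share

print(f"proportional sample: {true_share:.3f}")
print(f"pooled oversample:   {naive_share:.3f}")
print(f"bias:                {bias:+.3f}")
```

Under these assumptions the pooled estimate overstates the candidate's support by a couple of points. The direction and size of the bias depend entirely on how far apart the two groups are and how large the oversample is.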

Another possible source of the high Bustamante number is that the Times poll included “leaners” in addition to voters who initially declared a preference for a particular candidate. (Generally in surveys on vote choice, if you say “I don’t know” to the first question, a followup question will ask if there’s a candidate you are leaning towards.) If other polls aren’t combining the two categories, this could explain a big part of the difference. It might also be of substantive interest; if Bustamante’s support includes a disproportionate share of leaners, they would be easier for other candidates to sway than voters who are committed to Bustamante.
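As a rough illustration of how much folding in leaners can move a topline, the sketch below uses invented tallies for two made-up candidates, "A" and "B". The counts and function names are assumptions for the example, not figures from any poll.

```python
# Hypothetical tallies for two made-up candidates; not taken from any real poll.
N_RESPONDENTS = 800
committed = {"A": 280, "B": 250}  # named the candidate on the first question
leaners = {"A": 70, "B": 30}      # said "don't know", then named a lean on the follow-up

def topline(include_leaners):
    """Each candidate's share of all respondents, with or without leaners."""
    result = {}
    for cand in committed:
        count = committed[cand] + (leaners[cand] if include_leaners else 0)
        result[cand] = count / N_RESPONDENTS
    return result

def leaner_fraction(cand):
    """Fraction of a candidate's combined support that comes from leaners."""
    return leaners[cand] / (committed[cand] + leaners[cand])

without = topline(include_leaners=False)
combined = topline(include_leaners=True)

print("lead without leaners:", round(without["A"] - without["B"], 4))
print("lead with leaners:   ", round(combined["A"] - combined["B"], 4))
print("leaner share of A's support:", round(leaner_fraction("A"), 3))
print("leaner share of B's support:", round(leaner_fraction("B"), 3))
```

With these invented counts, A's lead more than doubles once leaners are included, and a larger fraction of A's support is soft. Two polls reporting the two different toplines would look like they disagree even if the underlying responses were identical.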

* If the oversampling was part of the master sampling design, the Latino voters could have been assigned a lower individual weight to correct for the oversampling (the technical term for this sort of procedure is stratification). Stratifying is generally hard to do in telephone surveys. (You can generally fudge it using quotas, which raise theoretical problems of their own.)
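The weighting correction described above can be sketched as follows, assuming the population share of each group is known from some outside source such as registration records. The shares, counts, and support levels are all invented for the example, and the function names are hypothetical; this is one simple way to do it, not a description of what the Times did.

```python
# Hypothetical figures throughout; the population shares are assumed to be
# known from an outside source such as registration records.

def design_weights(population_share, sample_counts):
    """Weight for each group = population share / share of the sample."""
    n = sum(sample_counts.values())
    return {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}

def weighted_share(sample_counts, support, weights):
    """Weighted share backing the candidate across all groups."""
    num = sum(weights[g] * sample_counts[g] * support[g] for g in sample_counts)
    den = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return num / den

population_share = {"latino": 0.15, "other": 0.85}  # assumed population mix
sample_counts = {"latino": 300, "other": 850}       # Latinos oversampled by design
support = {"latino": 0.50, "other": 0.30}           # assumed within-group support

weights = design_weights(population_share, sample_counts)
estimate = weighted_share(sample_counts, support, weights)

print("weights:", {g: round(w, 3) for g, w in weights.items()})
print("weighted estimate:", round(estimate, 3))
```

Each oversampled respondent counts for less than one person and everyone else counts for slightly more, so the weighted sample has the same mix as the assumed population. The extra Latino interviews still pay off: they tighten the estimate for Latinos as a group without distorting the overall number.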

However, if the oversampling was done post hoc, the only valid correction (at least in terms of classic probability sampling) is to use only the Latino voters who were in the original sample for inference to the population at large. An alternative approach called propensity weighting, which would weight the Latinos less because they are overrepresented relative to the other ethnic groups, is somewhat controversial. That said, Harris Interactive, the online survey research company, has demonstrated to at least some researchers’ satisfaction that Internet surveys with propensity weighting can actually do a better job of estimating population parameters than either standard telephone surveys or Knowledge Networks’ competing WebTV-based survey product, which is based on probability sampling; see this article from Political Analysis for details. There’s a subtle difference between the two weighting schemes (propensity weighting versus stratification), but in terms of sampling theory it’s a very important difference.

Interestingly enough, if you don’t care about the actual population parameters, sampling bias is far less problematic. For example, if you wanted to estimate a statistical model explaining support for candidates in the recall, sampling bias really isn’t an issue provided your explanatory variables account for the cause of the sampling bias. So if Latinos are oversampled, and you include a variable representing Latinos in the model, that particular oversampling problem is no longer an issue for making inferences about individual behavior. (However, if you don’t include such a variable, you’ll end up with non-spherical errors that will have unknown effects on your significance tests. It’s not quite omitted variable bias, but it’s still an ugly problem. Hence, to be safe, it is best to use the sample weights if you have them available, particularly if you don’t know how your sample was stratified.)
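A toy version of that point: fit the simplest possible model, a regression of support on a single Latino indicator, to a proportional sample and to an oversampled one. The data are deterministic stand-ins built from assumed support rates (50 percent among Latinos, 30 percent among others), and the helper names are hypothetical. It only illustrates the coefficient estimates, not the standard-error issue mentioned in the parenthetical.

```python
# Deterministic stand-in data built from assumed support rates; nothing here
# comes from a real survey.

def make_sample(n_latino, n_other):
    """Rows of (latino_indicator, supports_candidate) with 50% / 30% support."""
    n_lat_yes = n_latino // 2
    n_oth_yes = (3 * n_other) // 10
    rows = [(1, 1)] * n_lat_yes + [(1, 0)] * (n_latino - n_lat_yes)
    rows += [(0, 1)] * n_oth_yes + [(0, 0)] * (n_other - n_oth_yes)
    return rows

def fit_dummy_model(rows):
    """Ordinary least squares of y on one 0/1 indicator; returns (intercept, slope)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def pooled_mean(rows):
    """Raw share supporting the candidate, ignoring group membership."""
    return sum(y for _, y in rows) / len(rows)

proportional = make_sample(n_latino=150, n_other=850)
oversampled = make_sample(n_latino=300, n_other=850)

print("model, proportional:", fit_dummy_model(proportional))
print("model, oversampled: ", fit_dummy_model(oversampled))
print("pooled means:", pooled_mean(proportional), pooled_mean(oversampled))
```

The intercept (support among non-Latinos) and the slope (the Latino difference) come out the same in both samples, because the model conditions on the variable that drove the oversampling. The pooled mean, which is the population-parameter estimate, is the only quantity that moves.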