Data analysis involves a lot of technicalities, but it also sometimes means accounting for human error. If data have been manually keyed in, typos are inevitable. Checking numerical values for outliers can catch some of these mistakes, but what exactly should we look for?

There are many kinds of typo, of course, and some would be difficult to detect. If a 6 is keyed instead of a 7 in a long string of digits, for instance, who can say that's not a reasonable value? You might be able to judge against historical data points, or some other benchmark, using typical outlier detection methods.

But very often, the typos we're most concerned with are the egregious ones, and we want to prioritize finding those -- because there are too many modest outliers to investigate individually, or to simply throw out. Common egregious typos include omitting a digit and duplicating one (or more than one). These are especially easy to make, and to miss, when a figure already contains a run of repeated digits, as when "70000" becomes "700000".

I was curious about the size of the numerical change this kind of error would create. I wrote a quick simulation in R to randomly drop digits or duplicate them, and ran it on a dummy data set. Here are the two functions (not robust since they're just for testing):

# duplicate the digit at position d of x (a random position if d is not given)
dupdigit <- function(x, d=NULL) {
    x <- format(x, scientific=FALSE)            # work on the number as a character string
    if(is.null(d)) d <- sample(1:nchar(x), 1)   # pick a position to corrupt
    # keep characters 1..d, then repeat from position d onward, so the digit at d appears twice
    xd <- paste(substr(x, 1, d), substr(x, d, nchar(x)), sep='')
    return(as.numeric(xd))
}

# drop the digit at position d of x (a random position if d is not given)
dropdigit <- function(x, d=NULL) {
    x <- format(x, scientific=FALSE)            # work on the number as a character string
    if(is.null(d)) d <- sample(1:nchar(x), 1)   # pick a position to corrupt
    # keep everything before position d and everything after it
    xd <- paste(substr(x, 1, d-1), substr(x, d+1, nchar(x)), sep='')
    return(as.numeric(xd))
}
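
To get from these functions to actual numbers, it's enough to corrupt each value in a vector once and look at the ratio of corrupted to original. A driver along the lines below will do -- the sim_ratios helper and the particular dummy data are just stand-ins for illustration, not a prescription:

# ratio of corrupted value to original, one random typo per value
sim_ratios <- function(values, corrupt) {
    sapply(values, function(v) corrupt(v) / v)
}

set.seed(1)
dummy <- round(rlnorm(10000, meanlog=8, sdlog=1))  # dummy data: positive whole numbers, mostly 3-5 digits

summary(sim_ratios(dummy, dupdigit))   # duplicated digit: values inflated, around 10x
summary(sim_ratios(dummy, dropdigit))  # dropped digit: values deflated, roughly an order of magnitude (sometimes to 0)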

I was half expecting to get some sort of non-intuitive result, but in fact the results are pretty straightforward: either error adds or removes one decimal place, so the corrupted value lands roughly an order of magnitude away from the original -- duplicating a digit inflates it by about a factor of ten, and dropping a digit shrinks it by about as much (dropping the leading digit being the least predictable case).

This remains true even when many values come "rounded", with trailing digits that are always zero. It changes slightly if there is a natural cut-off -- say, values that can only range from 1 to 700. And it appears to hold across various distributions of the original data: uniform, normal and log-normal at least.
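
That last point is easy enough to check by swapping in different dummy distributions -- a sketch, with arbitrary parameters chosen only so all values stay positive:

set.seed(2)
datasets <- list(
    uniform   = round(runif(5000, min=100, max=99999)),
    normal    = round(rnorm(5000, mean=50000, sd=5000)),
    lognormal = round(rlnorm(5000, meanlog=8, sdlog=1))
)

# median corrupted/original ratio, per distribution and per error type
sapply(datasets, function(vals) c(
    dup  = median(sapply(vals, function(v) dupdigit(v) / v)),
    drop = median(sapply(vals, function(v) dropdigit(v) / v))
))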

You could then logically look for values about six times larger than expected, or four times smaller, to single out these typographic errors. The duplicates could even be fixed automatically (with some guess-work).
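
Here's a rough sketch of both ideas. The thresholds follow the rule of thumb above, while the "expected" baseline (a historical median, say) and the function names are placeholders rather than a worked-out method:

# flag values that are suspiciously far from an expected baseline value
flag_typos <- function(x, expected) {
    ratio <- x / expected
    ifelse(ratio > 6, "suspect duplicated digit",
           ifelse(ratio < 1/4, "suspect dropped digit", "ok"))
}

# guess-work repair for a suspected duplicated digit: collapse each adjacent
# repeated character in turn and keep the candidate closest to the expected value
fix_duplicate <- function(x, expected) {
    s  <- format(x, scientific=FALSE)
    ch <- strsplit(s, "")[[1]]
    doubled <- which(ch[-length(ch)] == ch[-1])    # positions followed by the same character
    if(length(doubled) == 0) return(x)             # nothing obviously duplicated
    candidates <- sapply(doubled, function(i)
        as.numeric(paste0(substr(s, 1, i), substr(s, i + 2, nchar(s)))))
    candidates[which.min(abs(candidates - expected))]
}

flag_typos(c(70000, 700000, 7000), expected=70000)  # "ok", "suspect duplicated digit", "suspect dropped digit"
fix_duplicate(700000, expected=70000)               # 70000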