Data analysis involves a lot of technicalities, but sometimes it also means accounting for human error. If data have been manually keyed in, typos are inevitable. Checking numerical values for outliers can find some of these mistakes, but what should we look for?
We all know that statistics can lie and confuse. We can easily confuse ourselves, in fact -- if we don't think very carefully about what a statistic is measuring, and what we want to know. A subtle mismatch between these can cause "bugs" in our calculations and our thinking. Here is one interesting sort of statistical bug, based on very simple math.
If a bank is creating new paper notes, what sizes should it issue to make transactions easiest for the public? This is one example of a type of problem I've become interested in. It is basically about how to divide a scale into discrete intervals, under various constraints. It seems very abstract, but it actually has a lot to do with real-world design and engineering.
Graphs and charts often mislead by obscuring the unreliability of their source data. But even if a graph-maker wants to do better, it can be hard to present such information intelligibly without long or technical sidebars. Here is one approach for visually displaying both the primary data and their reliability in one graph.
Imagine you're given the task of evaluating the abilities of the individuals in a group -- perhaps in sharp-shooting -- and awarding ratings like A or F. You might plan a series of challenges of increasing difficulty. What should you do if these turn out to be much harder than you planned -- so that while you expected a mean success rate of 60%, it was actually 20%? An obvious solution is to multiply all scores by 3 to bring up the average. But it turns out that this (and any other linear correction) penalizes the weaker performers. Better alternatives exist.