Outliers and rare events

Submitted by Sammy Smith (sammy@thesga.org)


When a scientist analyzes data, most of the values are often similar to one another, except for one or a few. These differing values are called outliers, meaning they lie outside the pattern of most of the values in the sample. Outliers are thus rare within that data set. In the set of imaginary data points in the plot above, the outlier plots far to the upper right. The question of how to deal with outliers haunts many scientists, and is a point of analysis for some statisticians.

Nassim Nicholas Taleb, a mathematical researcher, published a book titled The Black Swan: The Impact of the Highly Improbable (2007, Random House). A “black swan,” to Taleb, is a rare event that is difficult to predict yet has an outsize impact, beyond normal expectations. Thus, the outlier carries an element of randomness and an element of uncertainty.

The name Taleb chose for the book, “Black Swan,” refers to the assumption by Europeans that all swans are white, since all wild swans Europeans were familiar with for centuries were indeed white. The term “black swan” thus was a metaphor for an impossibility. No one (in their world) had seen a black swan, so for them black swans did not exist. Then, in 1697, a Dutch explorer in Australia found black swans, and the Europeans had a bit of a shock. Thus, they altered the term to mean something assumed to be impossible that actually happened.

To Taleb, a Black Swan event is a surprise and has a major impact. Although he applies this concept to financial investment patterns, archaeologists can learn from consideration of Black Swan events and outliers.

First, you have to think about the data set that produced the Black Swan outlier. Perhaps the data are just a small sample, so that the apparent outlier is really part of a normal distribution of data, and only looks isolated because some data points are missing. You also need to make sure that the data were measured accurately and precisely enough that the outlier does not result from some form of measurement error.
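As a rough illustration of that first check, here is a minimal Python sketch that flags values lying far outside the bulk of a small data set. It uses the common interquartile-range rule of thumb (values more than 1.5 times the interquartile range beyond the quartiles), which is only one convention among several and is not a method proposed in this article. The function name flag_outliers_iqr, the multiplier k, and the dates are all hypothetical, made up for this example.

```python
from statistics import quantiles


def flag_outliers_iqr(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Hypothetical helper for illustration only; k = 1.5 is a common
    rule of thumb, not a universal standard.
    """
    # Quartiles of the sample (cut points at 25%, 50%, 75%).
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    return [v for v in values if v < low_fence or v > high_fence]


# Imaginary radiocarbon dates in years BP (made-up numbers).
dates = [1210, 1190, 1250, 1230, 1205, 1240, 1220, 2100]
print(flag_outliers_iqr(dates))
```

A flag from a rule like this says only that a value is unusual relative to the rest of the sample. It cannot say whether the value is a measurement problem, an effect of small sample size, or a real event, and with only a handful of data points the fences themselves are not very stable.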

If the data set seems complete, or at least representative of the whole, so that the outlier is “real,” then how do we explain it?

Archaeologists sometimes encounter statistical outliers in, for example, a set of radiocarbon dates. Sometimes, the “bad” date results from problems with the sample itself, which skew its date. Sometimes, the “bad” date reflects something real that simply doesn’t match previous interpretations: for example, that some particular artifact type was actually used earlier or later than previous data and dates suggest.

Sometimes, because the “real world” doesn’t always make sense at a given time, it is hard to determine, based on field and laboratory methodology, why a particular outlier date is “bad.” If we assume it is not “bad,” and that it measures a real data point that is beyond expectations based on other reliable data, then we have far different concerns when we try to explain what that outlier means.

Author’s note: I am not a statistician, and this is by no means a complete disquisition on this subject. Instead, my intention is to raise the issue of interpreting outliers, and perhaps add a new twist to it for some. The Edge Foundation website has a long article by Taleb, posted in September 2008, that you might be interested in reading; it elaborates on Black Swan outliers and, ultimately, on human behavior.