Straight Up Statistics – Deconstructing the Average

As more statistics come out every day that are supposed to tell us the recession is over and we’ve hit bottom, it’s more important than ever to be aware of the nuances that come with them.

Government officials, bankers, retailers and snake oil salesmen alike throw out statistical arguments at the drop of a hat, telling you why their pitch is the only one worth listening to because they have the data to back it up. But before accepting what you hear or read at face value just because some nameless research institute did a study, stop for a minute to ponder the complexities of even the most seemingly innocuous of statistics: The average.

Let’s first assume the data being quoted were reliably gathered and analyzed (this is almost never a safe assumption, but that’s a topic for another day), then examine how the average and another so-called “descriptive statistic,” the median, are used in the data reports we see every day.

While on the surface it may seem that these two statistical measures could be interchangeable (indeed they are often used interchangeably with no explanation), they tell us very different things about the data they describe.

The median of a given group of data is its middle value. For instance, if your dataset has five data points and you line them all up from smallest to largest, the third value is your median. (With an even number of data points, the median is the average of the two middle values.) The average, or mean, of a dataset is instead determined by summing all values and dividing by the number of data points.

For example, suppose you are looking at real estate sales in a certain area within a certain time frame and you had the following 5 values: $300,000, $320,000, $320,000, $450,000, and $1,200,000. The median of this set is $320,000 (the middle value). The average is $518,000 (2,590,000 / 5). As you can see, even in this simple example, the two descriptive statistics are significantly different.
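If you would rather check the arithmetic than take my word for it, here is a minimal sketch in Python using the standard library's statistics module. The variable names are my own; the five prices are the ones from the example above.

```python
from statistics import mean, median

# The five sale prices from the example above
sale_prices = [300_000, 320_000, 320_000, 450_000, 1_200_000]

# Average (mean): sum of all values divided by the number of values
average_price = mean(sale_prices)

# Median: the middle value once the prices are sorted
median_price = median(sale_prices)

print(f"Average: ${average_price:,.0f}")  # Average: $518,000
print(f"Median:  ${median_price:,.0f}")   # Median:  $320,000
```

Try deleting the $1,200,000 sale and running it again: the median stays at $320,000 while the average falls sharply, which is exactly the sensitivity to extreme values that matters here.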

Real estate sales are often represented by the median value. The reasons for this are varied, but center around the fact that a few sales at extremely high levels (like that $2 million house on the top of the hill) can easily skew the average of a dataset towards those properties, even though most homes in the area are selling at lower prices.

For example, in Temecula, CA, where most homes sell at modest levels (by California standards) but some homes sell for significantly more, the average sale price in 2008 was about $435,000. The median price, on the other hand, was around $359,000. That puts the average more than 20% above the median.

Contrast that with areas where home prices are more homogeneous, like Daly City, CA, where the average and median values are more closely in line. In 2008, the average sale price for Daly City was around $562,000 while the median was about $558,000, a much smaller spread (less than 1%).
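The spread for both cities can be worked out in a few lines of Python. This is only a sketch: the helper name relative_spread is hypothetical, and I am assuming the spread is measured as a fraction of the median, which is the choice that matches the percentages quoted here.

```python
def relative_spread(average, median):
    """How far the average sits above the median, as a fraction of the median."""
    return (average - median) / median

# Approximate 2008 figures quoted in this article: (average, median)
markets = {
    "Temecula": (435_000, 359_000),
    "Daly City": (562_000, 558_000),
}

spreads = {city: relative_spread(avg, med) for city, (avg, med) in markets.items()}

for city, spread in spreads.items():
    print(f"{city}: {spread:.1%}")
# Temecula: 21.2%
# Daly City: 0.7%
```

Measuring against the average instead would give a somewhat smaller figure for Temecula (about 17.5%), so it is worth asking which base is being used whenever someone quotes a percentage difference.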

So which is better? Average or median? As can be seen from the examples above, neither.

Both display different aspects of the same set of data points. In Temecula, where the median and average diverge widely, using the average skews the picture towards a much higher price level. An individual from out of state looking to buy there might incorrectly assume they couldn’t afford to do so. On the other hand, looking solely at the median leaves out the fact that there are million-dollar-plus estates in Temecula available to buyers looking for that sort of thing.

When the National Association of Realtors releases its monthly sales statistics (the real estate pricing data carried by most major news outlets), it presents sales price data as both median and average values. These values are used to track sales prices over time to identify trends in sales activity nationwide and regionally. While both median and average values are freely available to anyone with internet access, the median values are often the ones quoted in the popular press.

By focusing exclusively on median values, however, one can miss interesting trends.

For example, on a nationwide level and in three of the four regions identified, median and average home sale prices have been tracking at around the same relative spread since 2005. In the West region, however, the median sales price has been falling faster than the average price.

This widening gap helps tell the story of what’s been happening in Western real estate markets in the past few years. In most markets, high-priced homes have retained their value better than homes that are closer to or below the median. Since so many lower-end homes are being sold, many after foreclosure, the sheer volume of these transactions is dragging down the median figures. The average, on the other hand, is propped up by the few expensive homes still being sold.
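To see the mechanism in miniature, here is a toy sketch in Python. Every number in it is invented for illustration and none of it is real sales data: I assume homes under $500,000 lose 30% of their value while pricier homes lose only 5%, then compare how far the median and the average fall.

```python
from statistics import mean, median

# Hypothetical sale prices, invented purely for illustration
before = [300_000, 320_000, 340_000, 360_000, 380_000, 1_000_000, 2_000_000]

def reprice(price):
    # Assumption: homes under $500,000 lose 30%, pricier homes lose 5%
    return price * (0.70 if price < 500_000 else 0.95)

after = [reprice(p) for p in before]

def pct_change(old, new):
    """Fractional change from old to new (negative means a decline)."""
    return (new - old) / old

median_change = pct_change(median(before), median(after))
average_change = pct_change(mean(before), mean(after))

print(f"Median change:  {median_change:.1%}")
print(f"Average change: {average_change:.1%}")
```

In this made-up market the median drops by the full 30% while the average drops by roughly 14%, because the two expensive homes carry so much of the total. A flood of additional low-end sales, as described above, would also push the median down, though this sketch leaves that volume effect out.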

This analysis then raises the question: why does the trend only exist in the West? As other regions decline, can we expect the same pattern to play out? Why are higher-priced homes holding up better? If expensive homes begin to lose their value, what would that do to the median and average sales prices? What does the data look like on a city or zip code level?

It’s easy to see that just by comparing the median and average sales price trends, much insight (or at the very least another list of questions) can be gained.

I could go on all day about the wealth of information that such a seemingly simple statistic as the average can provide those with the patience and curiosity to “drill down” past the headlines. But my point is simply this: Pay attention! Don’t let the evening news or your favorite web news source gloss over the statistics to prove whatever skewed point they want to make that day. Spend the time to think critically about the information or you run the risk of being fleeced regularly for the rest of your life.

At the very least, pay close attention to the source of any information you are receiving, particularly when that information comes in the form of a statistic. If you are being presented with a descriptive statistic like an average or a median, notice which one you are being given and pause for a second to think about why the source chose one and not the other.

Furthermore, if you notice that a single set of data is being described interchangeably by median and average, this should throw up a huge red flag as to the reliability of the information and its source.