The average season is over 3,701 days long! Don’t believe me? Check the data. Better yet, chart it using a modern visualization called a box plot.
Okay, I misled you, but data can be misleading if we don’t properly represent its extremes. The data here is the average season length, in days, of the planets in our solar system that have measurable seasonality¹.
Sometimes we feel that we effectively represent data using averages and variances. These crude measurements may be enough for some data sets. Other data sets may require a more clinical visualization. To build one, we need to learn the statistical values behind box plots, as well as how to create the chart.
To get a more focused visualization, we will need to understand some basic statistical values. The median of a data set is the middle number when the data set is sorted in ascending order². If the data set has an even number of entries, the median is the average of the two middle values.
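To make the difference between the average and the median concrete, here is a minimal sketch using Python's standard statistics module. The values are hypothetical (five modest readings and one extreme one), not the planetary data from this post.

```python
from statistics import mean, median

# Hypothetical values, NOT the planetary data from this post:
# five modest readings plus one extreme reading.
values = [10, 12, 14, 16, 18, 500]

average = mean(values)    # pulled upward by the single extreme value
middle = median(values)   # even count, so the average of the two middle values (14 and 16)

print("mean:", average)
print("median:", middle)
```

The single extreme value drags the mean far above every other reading, while the median stays in the middle of the pack.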
A quartile marks the middle of the data between the median and one of the extremes. The first quartile (Q1) is the middle value between the lowest value in the data set and the median. The second quartile (Q2) is the median itself. The third quartile (Q3) is the middle value between the median and the highest value in the data set.
The distance between the first and third quartiles, Q3 - Q1, is called the Interquartile Range (IQR)³. The middle half of the data falls inside this range.
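The quartile definitions above can be sketched in a few lines of Python. This is one possible reading of them (the "median of each half" method); other conventions exist, such as the interpolating methods behind statistics.quantiles, and they can give slightly different values. The function name and the sample list are illustrative assumptions, not part of the original data.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two data points.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[half + n % 2:]   # values above it (skips the middle value when n is odd)
    return median(lower), median(ordered), median(upper)

# Hypothetical sample, not the planetary data.
sample = [2, 4, 6, 8, 10, 12, 14, 16]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1  # the Interquartile Range is the distance from Q1 to Q3

print(q1, q2, q3, iqr)
```

When the data set has an odd number of entries, this sketch leaves the median itself out of both halves. That is a common choice, but not the only one.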
Box plots come in several variants that differ in where the whiskers end. The three most common end them at the minimum and maximum values, at 1.5 IQR below Q1 and above Q3 (this calculation is explained below), or at one standard deviation below and above the mean. Depending on your analytics requirements, any of these is acceptable. In this example, we will use the 1.5 IQR version.
The formulas for the upper and lower whiskers in a 1.5 IQR box plot are Q3 + (1.5 * IQR) and Q1 - (1.5 * IQR), respectively. If we plug in 4,928 for Q3, 150 for Q1, and 4,778 for the IQR, we get 12,095 and -7,017. These limits rarely land on actual data points, so we pull each whisker inward to the nearest data point that still lies within its limit. In our case, that is 7,300 for the upper whisker and 56.5 for the lower whisker.
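The whisker arithmetic can be checked with a short Python sketch. The quartile values 150 and 4,928 and the data points 56.5 and 7,300 are the ones quoted in this post. The function names are my own, and the value 15000 is a hypothetical stand-in for the one extreme planet, since the full data table is not reproduced here.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) limits: Q1 - k * IQR and Q3 + k * IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def whiskers_and_outliers(data, q1, q3, k=1.5):
    """Pull each whisker in to the most extreme data point inside its limit."""
    low_fence, high_fence = fences(q1, q3, k)
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return min(inside), max(inside), outliers

# Q1 and Q3 as quoted in the post.
low_fence, high_fence = fences(150, 4928)

# Deliberately incomplete list: 56.5 and 7300 come from the post,
# 15000 is a HYPOTHETICAL placeholder for the extreme planet.
points = [56.5, 7300, 15000]
lower, upper, outliers = whiskers_and_outliers(points, 150, 4928)

print(low_fence, high_fence)
print(lower, upper, outliers)
```

Any point that falls outside the two limits is left off the whiskers and reported separately, which is how the box plot singles out an outlier.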
How to create the Box Plot
Now that the terminology is defined, and we have identified Neptune as the value skewing the average in our opening example, we need to represent this as a box plot. Start by placing each planet on a single-axis chart scaled in days.
We can see that one value is much larger than the rest, but is it statistically different? Next, plot the median value.
We can tell that the median is left of the center. Next, draw a line at the first quartile and another at the third quartile, and shade in the space between Q1 and Q3.
The middle half of our data falls between these two marks. Finally, add the upper and lower whiskers at 7,300 and 56.5, respectively. These are the highest and lowest data points that fall within the 1.5 IQR limits.
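For readers who want to see the drawing steps as code, here is a rough text-only sketch in Python. A charting library would normally do this work; the function below is my own illustration and only mirrors the order of the steps: the whisker line, the shaded box, the median mark, then any outliers. The whisker and quartile values are the ones quoted in this post, while the median of 1400 and the outlier of 15000 are hypothetical placeholders.

```python
def text_box_plot(low, q1, med, q3, high, outliers=(), width=60):
    """Draw a one-line box plot: '-' whisker, '=' box, '|' median, 'o' outlier."""
    points = [low, high, *outliers]
    lo, hi = min(points), max(points)
    span = (hi - lo) or 1  # avoid dividing by zero when all values are equal

    def col(value):
        # Map a data value to a column index between 0 and width - 1.
        return round((value - lo) / span * (width - 1))

    row = [" "] * width
    for i in range(col(low), col(high) + 1):
        row[i] = "-"            # line from the lower whisker to the upper whisker
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="            # shaded box between Q1 and Q3
    row[col(med)] = "|"         # median marker
    for value in outliers:
        row[col(value)] = "o"   # points beyond the whiskers
    return "".join(row)

# 56.5, 150, 4928 and 7300 are quoted in the post;
# the median (1400) and the outlier (15000) are HYPOTHETICAL placeholders.
line = text_box_plot(56.5, 150, 1400, 4928, 7300, outliers=[15000])
print(line)
```

Even in this crude form, the gap between the end of the upper whisker and the lone outlier marker is the visual cue the box plot is designed to give.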
It is now easy to see, and statistically supported, that one planet is an outlier in our data. We can also add information, and hide unnecessary information, to make the best use of the user’s available attention.
Here, color represents a nominal categorization, and removing the data that is not the focus yields a chart that quickly directs our audience to the outliers in our data.
Though some of us think we know what is skewing our data, it may be necessary to prove to others around us just what is occurring. Depicting outliers can be an effective way to communicate data points that skew results. Another useful property of our data is relationships, and we will investigate how to depict related data using a chord chart in the next post in this blog series.