For Data Designers, Visualization Can Be Misleading
I'm a big fan of the NYT Graphics Department. They produce a lot of great visualization work and occasionally document their process on the chartsnthings blog.
Recently they published a graphic titled For Young Drivers, Drinking is More Dangerous. The graphic is a beautiful heat map and immediately inspired me to copy down the data so that I could try to reproduce it using QlikView (that effort will come in a later post). As I was transcribing the data, the title and explanation of the data became unsettling.
For Young Drivers, Drinking is More Dangerous.
As in, drinking and driving is more dangerous for young drivers than old drivers.
Is this explanation plausible? Sure. Is it supported by the data presented? No.
Based on the legend of the heat map, the color density of the cells represents the number of fatal crashes for each BAC and age combination:
Browsing this chart, it becomes clear that there is a huge problem with younger people drinking and driving. However, nothing indicates that the problem lies in the drunk-driving abilities of those under 27. Because this data represents counts of events (the total number of fatal crashes), it is much more plausible that the distribution of crashes is driven by how many young people drive drunk. In other words, the number of drunk fatal crashes is higher for younger people because younger people drink and drive more frequently than older people.
If we examine the Time of Day version of the graph, we can see how the analysis quickly falls apart:
If we apply the same logic that the NYT used previously, we would conclude that drunk driving is much less dangerous between the hours of 5 AM and 5 PM. Does that make any sense? Of course not. The accidents are concentrated in the 11 PM to 4 AM range because that is when alcohol consumption spikes.
The NYT's analysis assumes that the number of drunk drivers is evenly distributed across ages, which is a big assumption to make. In order to substantiate the claim that older drivers can handle their alcohol behind the wheel better than younger drivers, we would need to discard this assumption. How many people from each age group were driving at each BAC and were never involved in a crash? If we had that data, we could calculate the percentage of drunk drivers in each age and BAC group that caused crashes, and compare that metric to determine who is more dangerous as a drunk driver.
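To make the missing-denominator problem concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration; none of it comes from the NYT data, which does not include how many drunk drivers were on the road.

```python
# All numbers below are hypothetical, chosen only to illustrate the argument.
# "drunk_drivers" is the exposure figure that the NYT data does not provide.
groups = {
    "younger": {"drunk_drivers": 100_000, "fatal_crashes": 500},
    "older": {"drunk_drivers": 20_000, "fatal_crashes": 100},
}


def crash_rate(fatal_crashes, drunk_drivers):
    """Fraction of a group's drunk drivers who were in a fatal crash."""
    return fatal_crashes / drunk_drivers


rates = {
    name: crash_rate(g["fatal_crashes"], g["drunk_drivers"])
    for name, g in groups.items()
}

for name, rate in rates.items():
    crashes = groups[name]["fatal_crashes"]
    print(f"{name}: {crashes} fatal crashes, rate {rate:.2%}")
```

In this made-up scenario the younger group has five times as many fatal crashes, yet both groups crash at the same rate per drunk driver. A chart of raw counts cannot distinguish this case from one where younger drivers really are more dangerous.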
So, while there is clearly an issue with younger people causing accidents due to drunk driving, this statement:
For less experienced drivers, one or two drinks can cause the loss of reasoning and reaction time that results in a fatal crash.
is not proven by the chart presented. The point of the article is to support lowering the BAC limit for younger drivers. Can this data set provide support for this legislation? Maybe.
Using the data I loaded into QlikView, I decided to put together a few charts to investigate the proposed legislation. First, I created a bar chart using the BAC and Age groupings provided by the NYT:
From this view, the discrepancy is much less dramatic than in the heat map. However, the age grouping here actually exacerbates the sample size issue. The youngest group spans the fewest years, while the oldest group spans the most. Let's remove that effect by dividing by the number of ages in each age group to get fatal crashes per age:
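As a rough sketch of that normalization (done here in Python rather than QlikView), the snippet below divides each group's total by the number of ages it covers. The age bands and crash counts are hypothetical placeholders, not the NYT groupings or figures.

```python
# Hypothetical age bands and crash counts, for illustration only.
age_groups = [
    # (label, first_age, last_age, fatal_crashes)
    ("16-20", 16, 20, 500),
    ("21-34", 21, 34, 700),
    ("35-80", 35, 80, 920),
]


def crashes_per_age(fatal_crashes, first_age, last_age):
    """Average fatal crashes per single year of age within a group."""
    years_in_group = last_age - first_age + 1  # both endpoints are included
    return fatal_crashes / years_in_group


per_age = {
    label: crashes_per_age(crashes, first, last)
    for label, first, last, crashes in age_groups
}

for label, value in per_age.items():
    print(f"{label}: {value:.1f} fatal crashes per age")
```

With these placeholder numbers the widest band has the largest raw total but the smallest per-age figure, which is the kind of reversal the normalization is meant to expose.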
We've now arrived at a conclusion similar to that of the heat map. The youngest age group has the most fatal crashes related to drunk driving. We still don't know anything about the sample size, so we can't really compare the absolute figures. However, this bar chart does highlight something that the heat map does not show as readily: the transition from BAC group to BAC group. In the heat map, you can somewhat observe that the oldest group is trending upwards with BAC. However, it is not very obvious, since the colors are faint compared with the intense frequency with which younger drunk drivers are causing accidents, and it is certainly not quantifiable. From this view, however, the trend is more readily observed. With this in mind, we may be able to make a comparison that answers how BAC affects age groups differently. We can take this data and analyze how the frequency of accidents for each age group changes at each BAC level. Thus, regardless of the initial sample size for each group, we can see whose fatal car crash rate was affected most by increases in BAC:
The table above shows us that at the second BAC group, which is the range that the new legislation is targeting, the youngest age group's frequency of accidents increases by 56%. Eureka! The author's point may have some basis. However, the oldest group also increased by an alarming 50%. This suggests that the oldest group doesn't drive well right under the legal limit either. Just as the author hypothesized that experience played a role for the younger group, I would guess that diminished vision and reaction time due to aging might help explain the oldest group's driving. Regardless, that is just an idea and is not proven by the data.
I also included a column showing the difference between the last BAC group and the first. Overall, it would seem that the older you get, the more likely you are to crash if you have been drinking. Based on the table above, one might suggest that the legal BAC limit be lowered for both young and old people. However, just as the NYT's hypothesis was based on an assumption that the number of drunk drivers was evenly distributed across ages, the analysis of this table is based on the assumption that the number of drunk drivers is evenly distributed across BAC groups. Thus, it may not be any more valuable than the previous analysis.
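For readers who want to reproduce the two columns described above, here is a minimal Python sketch of the arithmetic: the percentage change from one BAC group to the next, and the change from the first BAC group to the last. The counts are hypothetical stand-ins, not the values from the table.

```python
# Hypothetical fatal-crash counts per BAC group (lowest BAC first),
# for illustration only; these are not the values from the NYT data.
bac_counts = {
    "youngest": [40, 60, 75],
    "oldest": [10, 15, 24],
}


def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) * 100 / old


def bac_changes(counts):
    """Return (change between consecutive BAC groups, change from first to last)."""
    step_changes = [pct_change(a, b) for a, b in zip(counts, counts[1:])]
    overall_change = pct_change(counts[0], counts[-1])
    return step_changes, overall_change


changes = {group: bac_changes(counts) for group, counts in bac_counts.items()}

for group, (steps, overall) in changes.items():
    print(f"{group}: step changes {steps}, first-to-last change {overall}")
```

Because each change is relative to the same age group's own baseline, the size of the group cancels out. The result still depends on how many people drive at each BAC level, which is exactly the assumption questioned above.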
What the data tells us for sure is that too many young people are causing fatal car crashes by drinking and driving.