Strip Plots




Strip plots are a form of scatter plots using only 1 expression. When a data set is not too large, plotting every instance of a dimension over a single metric can be a useful exercise for reviewing a distribution and discovering outliers in the data. This raw data view can also be combined with summary information like the mean or median of a set of values to add further context. It also can be useful to compare multiple strip plots side by side; for this post, I will refer to these charts as "multi strip plots".

Both of these charts can be created in QlikView, although the latter has some constraints applied that may not make it ideal for production applications. For the following examples, I have used an insurance data set. The set contains underwriters who are responsible for insurance policies. The metrics we use to analyze the policies includes premium amounts, days to expiration, days renewed before expiration, and renewal status.

Single Strip Plot

In a single strip plot, the points  are stacked on top of each other in 1 column. By utilizing this with a transparent color for our points, we can produce charts like the following, which plots underwriters (UW) by the amount of policies they cover:

single-column

By viewing the distribution across a metric in this way, you can glean information about your underwriters as a whole, spot groupings of underwriters based on the metric, and find individual points of interest. In this example, we have 156 underwriters plotted. The x-axis has an expression equal to "1" and is hidden; the y-axis contains the expression "sum([Policy Count])" to compute the number of policies. The dimension is UW. The example on the right includes some statistical benchmarks to provide further context to the distribution of data. These were added using the "Reference Lines" on the Presentation tab.

While this chart has some value, it can be hard to read in its current format. Because of the number of points, it is hard to compare areas of high concentration. To solve this problem, strip plots frequently use a technique called "jittering". “Jittering” is taking the strip plot data points and randomly dispersing them across the x-axis to give better visibility into all of the points plotted. In QlikView, I was able to replace my x-axis expression with both rowno() and rand() to accomplish this:

single-column-jittered

Another improvement that can be made to the chart is the coloring. The points don’t necessarily have to be the same color. By randomly distributing the points across a set of colors, we can discern the overlapping points with greater ease. This makes the chart a bit more legible in the high density areas:

single-column-multi-coloredTo randomly assign the colors, I used the pick() and rand() functions to randomly select 1 of three colors. I used the following expression in the Background Color settings of my first expression:

=pick(ceil(rand()*3),$(vLiteGreyT(100)),$(vLiteBlueT(150)),$(vLiteOrangeT(150)))

The variables contain argb() functions for calculating transparent versions of specific colors.

The single strip plot is pretty easily implemented, can fit in a small space, and responds well to filtering.

Multi Strip Plot

The multi strip plot is much more complex to implement. I experimented with a number of ways to approach it; ultimately, to get the formatting in a way that was legible, I created the chart under the following constraints:

  1. Use a scatter plot
  2. Set static min and max values for the x-axis, resulting in a finite number of dimensions that can be compared on the plot
  3. Use a straight table overlaying the scatter plot to label the dimension values
  4. Use thousand, million, and billion symbols for the metric's number settings to minimize the amount of re-sizing of the y-axis labels when filtering. The re-sizing of the y-axis labels can affect the length of the x-axis, which can cause it to be out of line with the straight table label

Under these constraints, I was able to build the following chart:

multiple

This example plots a point for every policy. Each column corresponds to an UW. The metric is the number of days that the policy was renewed before expiration; negative values are good, while positive values are bad, indicating that a policy expired before it was renewed. The grey point represents the median of the metric for each UW; this could be extended to include other points like quartile ranges. The chart limits to only 10 UWs and is sorted in this instance by median. The sort order can be determined with an expression.

The scatterplot uses dimensions POLICY (to plot the points), UW (to break out the points into the various columns in our x-axis expression), and a valueloop of (1,2) to facilitate calculation of the median as well as the main metric.

In the single scatter plot, rand() was used on the x-axis to create a scatter of points between 0-1. In order to create the separate strips here, I needed to create a separate range of x-values for each UW plotted. For example, Jody Huckfeldt should have x-values randomly distributed from 0-1, while Willard Keathley has values 1-2, etc. To create this separation, I needed a sorted rank for each UW so I could appropriately determine in which range to plot their policies. I also wanted to include a gap of half the strip plot in between each plot. This was accomplished using the following expression to determine the x-axis value:

rand()+1.5*Aggr(NODISTINCT Rank(Aggr(NODISTINCT median(DAYS_PRIOR_RNWL),UW)),UW)

Axis blogger Jerry DiMaso supplied the aggr() function that calculated the rank of my UW dimension based on the desired metric. In this case, I used the median value. By multiplying this by 1.5, we create an offset for each new UW added to the chart. Then, rand() is added to create a range of length 1 from the offset. The result is that 1st UW is plotted on the x-axis from 1.5-2.5, the 2nd from 3-4, the 3rd from 4.5-5.5, etc.

This expression plots the x-values for the policies. In order to add the median, the expression needed to be modified to reference the valueloop field. For the median values, I wanted the point to always be drawn dead center. I modified the expression to facilitate that:

=pick(valueloop(1,2,1),rand()+1.5*Aggr(NODISTINCT Rank(Aggr(NODISTINCT median(DAYS_PRIOR_RNWL),UW)),UW),.5+1.5*Aggr(NODISTINCT Rank(Aggr(NODISTINCT median(DAYS_PRIOR_RNWL),UW)),UW))

The second part of the pick statement calculates the midpoint of each range; this is where the median is subsequently calculated.

The y-value expression checks the valueloop again. For the first value, it plots the metric since these correspond with our randomly distributed policies on the x-axis. For the 2nd value which is automatically centered, it plots the median value:

=pick(valueloop(1,2,1),avg(DAYS_PRIOR_RNWL),if(above(total UW,2)<>UW,median(total<UW> DAYS_PRIOR_RNWL)))

In the median calculation, I check to see if the UW value has changed so that I will only calculate and plot the median point once. If you change the chart into a straight table, you will see that this works correctly. However, I am not confident that it is working the same way in scatter plot; while testing, the result of making the median point transparent suggested that the median was still being drawn multiple times.

The coloring of the points is calculated using the pick() function to assign different colors to the policies and median point.

Because the x-axis of a scatter plot will only show numbers, some constraints had to be applied to allow for labeling. I decided to use a horizontal straight table containing the UWs sorted in the same fashion as the scatter plot. However, the cell widths of the table are static, so the scatter plot columns must be as static in width as possible as well. This is why a static min and max is applied, as well as a calculated condition so only 10 or less UWs will be plotted on the chart. For this implemetation, 1-16.5 was the static range I used. Then, I placed the straight table on top and played with the cell widths until they matched with the columns. These were placed over the x-axis, which was given a white font. The x-axis was not set to "hidden" because the space is needed to hold the straight table. Without the x-axis, the plot collapses to the bottom of the chart border.

The above is a brief summary of the more complex implementation challenges of the multi strip plot. If you would like to play with either of the charts demonstrated in this post, you can download a sample QVW here: Strip Plots

A potential next step with this chart could be to combine it with Matthew Crowther's love of indexed charts to create the mythical Indexed Exploding Multi Strip Plot.

-Speros

TAGS: QlikView, Visualization

Related News

IS YOUR COMPANY UPGRADING TO QLIKVIEW 12?

Axis Group was approached by a client asking for help...

Read More

Part 1 of this series describes what Qlik Sense does better than QlikView, from the perspective...

Read More