Hi! This is actually from January 9, 2016! It's from my old data blog but I resurrected it because I've always been proud of this piece of work. It was pretty popular back in the day, and the very smart Christopher Groskopf incorporated it into his Python package leather (the code mentions this blog!) As I am not a very good coder, and as the leather library went on to be very popular, this was a charming surprise for me and I'm quite tickled by it. Here's the original blog:
For an arbitrary set of data points, what’s the ‘best’ graph axis for those points? Say that your x-variable runs from 5 to 45. A human can quickly pick a few promising options for the x-axis. It could have ticks at 0, 15, 30, and 45. Or perhaps 0, 10, 20, 30, 40, and 50. Getting a program to generate ‘nice’ options like this is a bit trickier. I’ve been working on a graphing app recently and I’ve reproduced my solution, with notes, below.
Although there are a number of algorithms people have put forth on Stack Overflow and elsewhere, many of these do not handle certain kinds of data sets correctly, and virtually none treat 0 correctly in my opinion. I started from scratch with the following 5 rules for my function:
My Javascript algorithm is below with comments. The basic approach is to generate a large number of candidate axes, score each axis based on the amount of wasted space and the number of ticks used, and then compare these scores. It is a bit brute force in this way. Computation is cheap! It takes an array of data values as input – for example [0, 12, 8, 20, 5] or whatever. I have excluded parts of the function – to handle dates and certain other edge cases.
First, get the minimum and maximum of the series, toggle the zero_flag variable if 0 is between the min and max, and get the range of the data.
Next, define ‘nice’ numbers. You could change this if you’d like to include other possibilities, but I decided I would allow counting by 1s, 2s, 5s, 10s, 15s, 25s, and 75s. This will make a bit more sense below.
This next part is a bit of path dependence – I had an algorithm where the number of ticks was more central and I kept that framework but I probably wouldn’t do it this way again. I get a naive value for the distance between ticks and determine the place value of this distance.
Now using the value of digits (the place value of steps), generate a list of candidate steps. These are just the values from nice_steps above multiplied by powers of 10 according to the place value of digits. Because computation doesn’t matter to me, I check 10^place value+1, 10^place value, and 10^place value-1. Most of these candidate step lengths will be terrible but it doesn’t matter – they will get weeded out in the next step. If our initial step length was 13, candidate steps will be generated by taking 1, 10 and 100 * all values of nice_ticks. So 13 would result in candidate steps: [.1,.2,.5,1,.15,.25,.75,1,2,5,10,2.5,2.5,7.5,20,50,100,15,25,75]
Loop through candidate steps and generate an axis based on each step length.
All that remains is to score all the candidate arrays. I’m not going to include my scorer, because there are a lot of arbitrary choices involved, but basically I look at how much space each array wastes compared to the data use that as a starting value. Each array gets the score 10^percent wasted space – then I further penalize the array for large values of ticks, tick values that I don’t like as much (.15 for example, is great in certain cases, but probably shouldn’t be liked as much by the function as .1). The array with the lowest score ‘wins’.