For an arbitrary set of data points, what’s the ‘best’ graph axis for those points? Say that your x-variable runs from 5 to 45. A human can quickly pick a few promising options for the x-axis. It could have ticks at 0, 15, 30, and 45. Or perhaps 0, 10, 20, 30, 40, and 50. Getting a program to generate ‘nice’ options like this is a bit trickier. I’ve been working on a graphing app recently and I’ve reproduced my solution, with notes, below.
People have put forth a number of algorithms on Stack Overflow and elsewhere, but many of them do not handle certain kinds of data sets correctly and, in my opinion, virtually none treats 0 correctly. I started from scratch with the following 5 rules for my function:
1. If an axis crosses 0, 0 *must* be an axis tick value
2. Axis ticks *must* be attached to a grid line (i.e. you can’t have x-axis ticks floating in space; they must be attached to a y-axis grid line as you see in the example above)
3. The data should be as tightly contained as possible (little wasted space)
4. There should be no fewer than 4 ticks on each axis, and if rule 2 requires it, up to 10
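To make the rules concrete, here is a minimal Python sketch of one way to pick ticks for a single axis. It is not the function from my app: it only covers the zero rule, the tightness rule, and the 4-to-10 tick limit, and it ignores the grid-line rule, which depends on the other axis. The name `nice_ticks` and the set of step multipliers in `MANTISSAS` are assumptions made for illustration.

```python
import math

# Assumed set of "nice" step multipliers (illustrative, not from the original app).
MANTISSAS = (1, 1.5, 2, 2.5, 5)


def nice_ticks(dmin, dmax, min_ticks=4, max_ticks=10):
    """Return tick values covering [dmin, dmax] with little wasted space."""
    if dmax <= dmin:
        raise ValueError("dmax must be greater than dmin")
    span = dmax - dmin
    exponent = math.floor(math.log10(span))
    best = None
    # Try step sizes around the order of magnitude of the data span.
    for k in range(exponent - 1, exponent + 2):
        for m in MANTISSAS:
            step = m * 10.0 ** k
            # Ticks are integer multiples of the step, so 0 is always
            # a tick whenever the axis crosses 0.
            first = math.floor(dmin / step)
            last = math.ceil(dmax / step)
            count = last - first + 1
            if not min_ticks <= count <= max_ticks:
                continue
            # Prefer the least wasted space, then the fewest ticks.
            waste = (last - first) * step - span
            key = (waste, count)
            if best is None or key < best[0]:
                best = (key, first, last, step)
    if best is None:
        raise ValueError("no tick layout satisfies the tick-count limits")
    _, first, last, step = best
    return [i * step for i in range(first, last + 1)]
```

For the 5 to 45 example, `nice_ticks(5, 45)` returns ticks at 5, 10, and so on up to 45, because that layout wastes no space and the axis never crosses 0. For data that does cross 0, such as -3 to 45, every candidate layout includes 0 as a tick.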
There are 14 years of data, starting in 2000, but the first two appear to be incomplete and of course 2013 isn’t over yet, so I restricted my analysis to 2002-2013. I wanted to map the data, not because I had any interesting point to make about the location of federal contracts, but rather because I wanted to get more practice using Kartograph. Unfortunately, the raw data does not provide very good information on the locations of contract winners. Most of the time there is a little bit of location data embedded in a field for the contract winner’s name. But these are often nothing more than zip codes or incomplete addresses. To locate the data on a map, these addresses needed to be geocoded (turning an address into a lat/lon coordinate pair).
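As a rough illustration of the first step, here is a small Python sketch that pulls a zip code out of a free-text name field before handing it to a geocoder. The field layout and the sample strings are hypothetical, not taken from the contract data, and `extract_zip` is a name made up for this example.

```python
import re

# Five-digit US zip code, optionally followed by a ZIP+4 suffix.
ZIP_PATTERN = re.compile(r"\b(\d{5})(?:-\d{4})?\b")


def extract_zip(name_field):
    """Return the last zip-code-like token in the field, or None."""
    matches = ZIP_PATTERN.findall(name_field)
    # Take the last match, on the assumption that the zip code sits at
    # the end of an address and a 5-digit street number would come first.
    return matches[-1] if matches else None


# Hypothetical examples of a contract winner's name field.
print(extract_zip("ACME CORP 123 MAIN ST AUSTIN TX 78701"))
print(extract_zip("NO LOCATION INC"))
```

A zip code only locates a winner to within a postal area, so the result of a step like this would still need to go through a geocoding service to become a lat/lon pair.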
A little while back a friend suggested that maybe we could write a fun piece about BBQ. We are both living in Texas and while BBQ has always been a big deal in Texas, it seems to be getting bigger, with Texas Monthly appointing a full-time BBQ editor. I’ve been wanting to learn to work with maps better and to get into interactive web visualizations for a while, so this seemed like a good excuse to work on both.
I used the Yelp API to get data on BBQ restaurants in each of the 30 largest US cities (according to Wikipedia, city limits only). You’re really not supposed to use the API like this and in the future I’ll use one of Yelp’s academic data sets. To make it work, I had to split each city up into 625 lat/lon grid points and query each one separately. I averaged the ratings of all the restaurants I found in a city to get an average for that city. I also sampled a small number of restaurants in each city and collected review text for those restaurants. (Yelp really doesn’t like you doing this, since it can’t be done with the API, and I don’t recommend trying it. Let’s just say that I can no longer read Yelp at home.) I used a dictionary of food adjectives I found on the web to pare the corpus down, and found three words for each city that are frequently used to describe that city’s BBQ. As you can see from mousing over the cities on the map below, this didn’t always work out great. More thoughts under the map…
US BBQ according to Yelp reviews
Circle size is proportional to the number of restaurants. Mouse over a city to see adjectives commonly used in Yelp reviews for area restaurants.
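For anyone curious about the grid sampling and the city averages described before the map, here is a rough Python sketch. It assumes the 625 points form an evenly spaced 25 by 25 grid over a city's bounding box, and that the same restaurant can turn up in several queries and should only be counted once. The function names, the bounding-box arguments, and the `id` and `rating` keys are all hypothetical, and no actual Yelp calls are shown.

```python
def grid_points(south, west, north, east, n=25):
    """Return n * n evenly spaced (lat, lon) points over a bounding box."""
    lat_step = (north - south) / (n - 1)
    lon_step = (east - west) / (n - 1)
    return [
        (south + i * lat_step, west + j * lon_step)
        for i in range(n)
        for j in range(n)
    ]


def city_average(restaurants):
    """Average the ratings of the unique restaurants found for a city."""
    # Neighbouring grid queries can return the same restaurant, so key
    # on a (hypothetical) id field to count each one once.
    unique = {r["id"]: r["rating"] for r in restaurants}
    if not unique:
        return None
    return sum(unique.values()) / len(unique)


# Made-up bounding box and results, for illustration only.
points = grid_points(30.0, -98.0, 30.5, -97.5)
found = [
    {"id": "a", "rating": 4.0},
    {"id": "a", "rating": 4.0},
    {"id": "b", "rating": 3.0},
]
print(len(points), city_average(found))
```

Each grid point would be the center of one query, and the results from all 625 queries would be pooled before averaging.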