I am still tweaking the graphs I’ve already shown and working on some new things,[1] but I want to post something about the decisions I made in this process. The biggest puzzle in making these kinds of visuals has to do with binning. Binning means sorting data into discrete bins to make it easier to interpret. Here’s a shooting graph where the data has been binned very little (into 1ft x 1ft squares):
This graph is visually cluttered and hard to make sense of. What’s more, it’s a mix of deep red and deep blue because many squares contain only a few shots, so the FG% in those squares is 100% or 0%. So you have to do some binning. If you’ve seen Kirk Goldsberry’s stuff, you might have noticed that there is binning going on, even though the bins are not explicit (he also makes some charts with no binning at all). All shots near one another are the same color because, like me, Kirk is computing a FG% for a region but then showing the location of shots in much smaller regions. In his case, the smaller regions are hexagons. Mine are 1ft x 1ft squares. For my post on Roy Hibbert, I binned shots into 17 zones. Squares in each zone are the same color because all shots in the zone were binned together and a single FG% was computed for it. The squares just show where shots come from within a zone. Here are the regions I use:
I didn’t really think about these regions; I just arbitrarily made them up. They’re not too far from the regions that stats.NBA.com uses for its shot charts, and I like the way the Roy Hibbert graphs turned out. A bit later, though, I decided to look at a player who I know is a poor defender: Jose Calderon.[2] Here are those graphs.
Opposing FG%, Calderon on court
Opposing FG%, Calderon off court
At best these are visually confusing and hard to draw a conclusion from. At worst they give you the impression that Calderon is a plus defender. While this would be a nice surprise, and a good data visualization can certainly show you things you didn’t expect, it’s just not plausible here. Opponent FG% rises by 2% when Calderon is on the court, and the Mavericks give up way more points per 100 possessions. The bins I chose (and to a limited extent, my FG% coloring scheme) are, in this case, misleading. This is because bin size involves a tradeoff. Small bins give you a more detailed picture of what’s going on in your data. If there were just one bin, the whole court, I couldn’t tell you anything about different locations on the court. But smaller bins also have less data in them, so you get extreme results like those in the first graph of this post, where the bins fluctuate between deep blue and deep red.[3]
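To make the tradeoff concrete, here is a minimal Python sketch. It is not the code behind these graphs, and the function names and shot counts are made up for illustration. It computes a bin’s FG% along with a Wilson score interval, which is one standard way (swapped in here as an assumption) to express how uncertain a percentage is when it comes from only a few attempts.

```python
import math


def fg_pct(made, attempts):
    """Field goal percentage for one bin, on a 0-100 scale."""
    return 100.0 * made / attempts


def wilson_interval(made, attempts, z=1.96):
    """Approximate 95% Wilson score interval for a bin's FG%, on a 0-100 scale.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if attempts == 0:
        # No shots at all: the data says nothing about this bin.
        return (0.0, 100.0)
    p = made / attempts
    denom = 1 + z * z / attempts
    center = (p + z * z / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts
                         + z * z / (4 * attempts ** 2)) / denom
    return (100 * (center - half), 100 * (center + half))


if __name__ == "__main__":
    # Hypothetical bins: a tiny 1ft x 1ft square vs. a large zone.
    for label, made, attempts in [("small square", 2, 2),
                                  ("large zone", 450, 1000)]:
        low, high = wilson_interval(made, attempts)
        print(f"{label}: FG% = {fg_pct(made, attempts):.1f}, "
              f"plausible range = {low:.1f} to {high:.1f}")
```

The square with two makes on two attempts shows up as 100%, but its plausible range covers most of the scale, while the large zone’s range is only a few points wide. That is the sense in which small bins produce extreme colors that don’t mean much.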
Clearly I need to redesign my bins, and a reasonable idea is to try to design bins that will have roughly the same number of shots in them. But this is not so easy to implement because of where shots are taken on a basketball court. The graph below shows how many shots have been attempted in each of the 1ft x 1ft squares on the basketball court. There are about 105,000 shots total (this data is actually a few weeks old) and the vast majority of them are right under the hoop. If you wanted 20 regions with equal numbers of shots, each of the 1ft x 1ft squares around the hoop would have to be its own zone.
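The equal-count problem can be sketched in a few lines of Python. This is an illustration under stated assumptions, not my actual pipeline: the shot coordinates are invented, positions are assumed to be in feet relative to the hoop, and `square_counts` and `squares_filling_a_zone` are hypothetical helper names. The idea is to count attempts per 1ft x 1ft square and flag any square that already holds a full zone’s share of the shots.

```python
import math
from collections import Counter


def square_counts(shots):
    """Count attempts in each 1ft x 1ft square.

    shots: iterable of (x_ft, y_ft) positions, assumed relative to the hoop.
    """
    return Counter((math.floor(x), math.floor(y)) for x, y in shots)


def squares_filling_a_zone(shots, n_zones):
    """Squares that alone hold at least one zone's share of all shots."""
    shots = list(shots)
    target = len(shots) / n_zones  # shots per zone if zones were equal
    counts = square_counts(shots)
    return sorted(sq for sq, c in counts.items() if c >= target)


if __name__ == "__main__":
    # Made-up data: most attempts right at the hoop, the rest spread out.
    shots = ([(0.5, 0.5)] * 40
             + [(-0.5, 0.5)] * 30
             + [(float(i), 10.5) for i in range(-15, 15)])
    print(squares_filling_a_zone(shots, n_zones=10))
```

With data this concentrated, the squares next to the hoop each reach the per-zone target by themselves, so an equal-count scheme has nothing left to merge them with, while everything farther out has to be lumped into a handful of very large zones.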
- [1] Graphs of the on-court/off-court difference players make, adjusting or normalizing graphs to account for teammates and opponents, graphs focused on offense rather than defense, and other things.
- [2] I am an MFFL, so no offense to Jose. I’m a big fan.
- [3] To put it another way, we are very uncertain about what the ‘true’ value of these bins should be.