Thanks, everyone, for the really great response to my previous blog post on Twitter. Some people asked good questions, and I want to clarify a few things and talk about some improvements that could (maybe should) be made to this approach. This post will be kinda methods-y but may include some non-technical tidbits.
First, one thing I really should have mentioned in the original post is that this approach has all the same drawbacks as adjusted plus-minus. Specifically, it has the same collinearity problems that make it difficult, if not impossible, to compare players who are close substitutes for one another. For example, Ian Mahinmi and Roy Hibbert never play together, so the regression has very little information for separating one player’s effect from the other’s. This makes estimates for both players somewhat unreliable.
Second, I want to reiterate that I have not put much (any) thought into testing model specifications. For example, I am not currently including a constant in the regression models. The way my code works, I run 1,750 models for each team, one team at a time.[1] This is not like adjusted plus-minus, where every player is included all at once. For each team, the model has one covariate per player who has defended more than 1,000 shots, plus one covariate for home/away. If I did include an intercept, I would interpret it as a team-wide effect independent of any single player’s defense. You could think of it as the contribution of the coach’s scheme. This could be pretty interesting! In some early testing, I found that the intercept was actually detrimental to prediction, so I left it out. BUT: I have not tested specifications in any systematic way! I just ran a few of the 1,750 regressions for one team and looked at some results. What I would like to do is calculate a PRE (proportional reduction in error) for every single regression, and then compare PREs between models to see which specification is best. This is near the top of my to-do list.
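Here is a rough sketch of what that PRE comparison could look like. It is not the code behind the charts. It assumes each regression is plain least squares on a 0/1 made-shot outcome, that PRE is measured on held-out shots, and that the baseline always predicts the training make rate. The function names (`fit_ols`, `predict`, `pre`, `held_out_pre`) are made up for illustration.

```python
import numpy as np

def fit_ols(X, y, intercept):
    """Least-squares coefficients; optionally prepend a column of ones."""
    if intercept:
        X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def predict(X, beta, intercept):
    """Predictions from coefficients fitted with the same intercept setting."""
    if intercept:
        X = np.column_stack([np.ones(len(X)), X])
    return X @ beta

def pre(y, y_hat, y_base):
    """Proportional reduction in error relative to baseline predictions."""
    sse_model = np.sum((y - y_hat) ** 2)
    sse_base = np.sum((y - y_base) ** 2)
    return 1.0 - sse_model / sse_base

def held_out_pre(X_train, y_train, X_test, y_test, intercept):
    """Fit on training shots, score PRE on held-out shots."""
    beta = fit_ols(X_train, y_train, intercept)
    y_hat = predict(X_test, beta, intercept)
    # Assumed baseline: always predict the training make rate.
    y_base = np.full(len(y_test), y_train.mean())
    return pre(y_test, y_hat, y_base)
```

Calling `held_out_pre` twice on the same split, once with `intercept=True` and once with `intercept=False`, gives two numbers that can be compared directly. Scoring on held-out shots matters here because in-sample error can never get worse when an intercept is added.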
Along these same lines, it’s possible I could improve my accounting for offensive players by, for example, including four more covariates that take the value of the non-shooting offensive players’ offensive RAPM, or something like that. Seems a little ad-hocish but I might try it.
Selecting shots for each regression
@ShrillCosby asked about cell sizes and I realized I hadn’t explained this very well. 1ft x 1ft cells are waaaaaay too small for regression. Many of the cells you are seeing on player charts only have 5 or 6 shots in them (perimeter ones mostly). So let me explain the algorithm a bit better. Basically, to gather shots for a regression, my algorithm looks in the cell first. If the cell is located right under the basket, it probably stops there, because there are tons of shots right under the basket. If the cell just has a few shots in it, however, the search is expanded. It looks outside the cell, pulling in the closest shots until it has reached some completely arbitrary threshold I set (I think right now it’s 8% of total shot volume). I do set some limits on this process. First, it is only allowed to grab shots of the same type, where type is either 3pt, midrange, or close (close is <8ft from the basket). Second, it is never allowed to look further than 12 feet away. If, after this process, it has not found at least 100 shots to regress on, it doesn’t run a regression for that cell at all. You can sort of see this play out on certain players’ charts. If you look at James Johnson, for example, you can see he defends right-side 3s very poorly and left-side 3s quite well. In the middle there is no significance. That’s probably because in the middle I am regressing on shots to both the left and the right, and the result is just noise. But clearly the signal is much better once you get away from the middle. I also give a visual representation of how this works in my methods blog post.
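The selection rules above can be sketched roughly as follows. This is an illustration under assumptions, not the actual code: it assumes distance is measured from the cell center, that every shot inside the cell counts regardless of type, and that "8% of total shot volume" means 8% of the shots passed in. The name `select_shots` and its parameters are hypothetical.

```python
import numpy as np

def select_shots(xy, shot_type, cell_center, cell_type,
                 target_frac=0.08, max_dist=12.0, min_shots=100):
    """Return indices of shots to regress on for one cell, or None."""
    xy = np.asarray(xy, dtype=float)          # shot locations in feet
    shot_type = np.asarray(shot_type)         # "three", "mid", or "close"
    center = np.asarray(cell_center, dtype=float)
    target = int(np.ceil(target_frac * len(xy)))

    # Step 1: shots inside the 1ft x 1ft cell always count.
    in_cell = np.all(np.abs(xy - center) <= 0.5, axis=1)
    chosen = np.flatnonzero(in_cell)

    # Step 2: if the cell is thin, pull in the nearest same-type shots
    # within max_dist until the target count is reached.
    if len(chosen) < target:
        dist = np.linalg.norm(xy - center, axis=1)
        eligible = (~in_cell) & (shot_type == cell_type) & (dist <= max_dist)
        candidates = np.flatnonzero(eligible)
        nearest_first = candidates[np.argsort(dist[candidates], kind="stable")]
        chosen = np.concatenate([chosen, nearest_first[: target - len(chosen)]])

    # Step 3: skip the regression entirely if too few shots were found.
    return chosen if len(chosen) >= min_shots else None
```

A cell under the basket usually clears the target in step 1 and never expands, while a perimeter cell leans almost entirely on step 2.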
Is there a better way to do this? I think the one place where there’s room for improvement here is to move away from OLS and use weighted least squares instead. The idea would be to make shots closer to the actual square count for more and distant shots count for less. This is probably the right thing to do, but it would be a pretty significant code rewrite, so I doubt I’ll get to it anytime real soon.
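For a sense of what the weighted version might involve, here is a minimal sketch. The Gaussian kernel and the 4-foot `bandwidth` are assumptions picked purely for illustration, and `distance_weights` and `wls` are hypothetical names. Weighted least squares is done by scaling each row by the square root of its weight and then running ordinary least squares.

```python
import numpy as np

def distance_weights(dist, bandwidth=4.0):
    """Gaussian kernel weights: 1 at the cell, shrinking with distance (feet)."""
    dist = np.asarray(dist, dtype=float)
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def wls(X, y, w):
    """Weighted least squares via row scaling by sqrt(weight)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    root_w = np.sqrt(np.asarray(w, dtype=float))
    beta, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return beta
```

With all weights equal this reduces to plain OLS, so it could in principle slot in wherever the OLS fit happens, with the distances coming from the shot-gathering step.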
Drawing and sizing squares
I didn’t explain this at all in the previous post. I decided that every player would get exactly 250 squares on their chart[2], corresponding to the 250 locations where they defended the largest number of shots. For players who don’t play many minutes, this necessarily means that their charts contain squares with much smaller shot volumes than starters’ charts do. Squares are scaled for each player such that the smallest square is always 8 pixels across and the largest (the square right under the basket) is always 25 pixels across. Scaling is done according to the log of shot volume in each square. There’s really no other way to do it: shot volume near the basket is just massively larger than shot volume anywhere else, and linear scaling would result in a really funky-looking graph. The actual size of each 1ft x 1ft square is 15 x 15 pixels, but I went with the maximum size of 25 to convey just how huge shot volume is right under the hoop.
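As a sketch of that sizing rule: take the top cells by volume, then map log volume linearly onto the 8 to 25 pixel range. The function name `square_sizes` is hypothetical, and it assumes every selected cell has at least one shot (so the log is defined).

```python
import numpy as np

def square_sizes(volumes, n_squares=250, min_px=8.0, max_px=25.0):
    """Pick the busiest cells and size each square by log shot volume."""
    volumes = np.asarray(volumes, dtype=float)
    # Indices of the n_squares cells with the most shots, busiest first.
    top = np.argsort(volumes)[::-1][:n_squares]
    log_v = np.log(volumes[top])
    span = log_v.max() - log_v.min()
    if span == 0:
        # All selected cells have the same volume: draw them all at max size.
        frac = np.ones_like(log_v)
    else:
        frac = (log_v - log_v.min()) / span
    return top, min_px + (max_px - min_px) * frac
```

For example, cells with 1, 10, and 100 shots come out at 8, 16.5, and 25 pixels. Linear scaling would put the middle one at about 9.5 pixels, barely distinguishable from the smallest.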
Data and code
A couple of people asked about the data. I scraped all of it from NBA.com. I worked pretty hard to minimize the amount of data I had to pull down from them. (I say this because it would be terrible if they took steps to restrict scraping. Right now they not only allow it but even seem to encourage it: there are some ‘hidden’ APIs that make the whole process fairly easy.) I may make the data public. There’s no reason I couldn’t do it now; I’m just nervous about violating NBA.com’s terms of service and making them mad.
This issue of whether a player is good at forcing shots from a certain location has come up a bit too. I really want to address this because I think it’s critical to understanding defense from the perspective of court location. I am going to try to have something in place when I share the results for the rest of the players, which should be sometime this week(-end maybe).
[1] This is also a problem for players who change teams. Evan Turner doesn’t qualify for the Pacers regressions. He might qualify for the 76ers regressions, but then the graph will not show his full-season effort, only the shots he defended while with the 76ers. I haven’t quite figured out how I want to deal with this.
[2] More than that just clutters up the chart and is visually confusing. This is purely a design decision. If I allowed more squares they would probably all end up being gray anyway.