Adjusted Defensive Impact Location Methods (including actual code!)

Here are the grisly details about calculating adjusted defensive impact by court location that nobody has been waiting for. I’ve partially explained the methods before here, but some things have changed and that explanation was not comprehensive. So buckle in and get ready for a lot of really dry prose, a couple of kinda cool graphs, and some code that should help you understand the algorithm I’m using to select shots.

Data
First, it helps to understand the format my data is in. I have an observation for every shot taken in the NBA this season (2013-2014). There are roughly 200k observations. Each observation tells me a bunch of things about the shot, but for our purposes what really matters is that I have the five offensive and five defensive players who were on the court when the shot was taken, the identity of the shooter, the location of the shot, and whether the shooter was at home or away.

I scraped the data from NBA.com. There are a few little problems with it. First, the NBA.com data for three games was corrupt and couldn’t really be used (I probably could have used parts of it but then I’d have to add a bunch of error checking and eh). A slightly more significant problem has to do with recovering the names of the five offensive and five defensive players on the court. This was by far the hardest part of the process. Basically I get a list of all substitutions that happen in a game and then use that to reconstruct who is on the court. Unfortunately the resources I use for this can be a bit lazy. For example, sometimes in Suns games you will see “Markieff in for Markieff” or something like that. That’s not a huge deal–I know which Markieff was active so I can sub him out and sub in the inactive one. There were some other problems with the sub patterns that were harder to deal with. Consequently, I don’t know one of the players on the court for some shots. There are about 1,700 shots out of the 200,000 where I don’t know one of the players on the court. I think this is relatively harmless, but I do know that it might be particularly harmful for certain players. I know, for example, that the Pacers’ George Hill was frequently affected by this, so it may be causing moderate problems for a small subset of players.
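To make the lineup reconstruction concrete, here is a minimal Python sketch of replaying a substitution log to find who is on the floor for each shot. It is not the scraper described above: the event format, the `UNKNOWN` label, and the function name are all assumptions made up for illustration, and it only handles the simple case where the incoming player cannot be read.

```python
UNKNOWN = "UNKNOWN"  # hypothetical placeholder for a slot we cannot identify


def lineups_at_shots(starters, events):
    """Replay one team's events in order; return {shot_id: frozenset of names}.

    starters: the five players who open the period.
    events:   ("sub", player_out, player_in) or ("shot", shot_id) tuples in
              game order. player_in may be None when the log is unreadable.
    """
    on_court = list(starters)
    lineups = {}
    for event in events:
        if event[0] == "sub":
            _, player_out, player_in = event
            if player_out not in on_court:
                # Inconsistent log entry: skip it rather than guess who left.
                continue
            slot = on_court.index(player_out)
            on_court[slot] = player_in if player_in is not None else UNKNOWN
        elif event[0] == "shot":
            lineups[event[1]] = frozenset(on_court)
    return lineups


game = [
    ("shot", 1),
    ("sub", "E", "F"),
    ("shot", 2),
    ("sub", "A", None),  # incoming player could not be read from the log
    ("shot", 3),
]
result = lineups_at_shots(["A", "B", "C", "D", "E"], game)
print(sorted(result[3]))
```

Shots taken after the unreadable substitution carry four known names plus the placeholder, which is the "I don’t know one of the players" situation.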

And here’s the data. I wasn’t going to share it, mainly because I was concerned about the size (400MB uncompressed), but then I found out that when it is zipped, it is just 6.6MB. This makes the compression algorithm black magic as far as I am concerned, but also means I can probably afford the bandwidth. Please be gentle, and if you have questions don’t hesitate to tweet at me. Also please mention that I am the source of the data if you use it for something.

zipfile Shot location data for adjusted defensive impact

The Regression
I run 1,600 regressions–one for each of the 1ft x 1ft squares that I care about on the court. If you’re not familiar, an NBA basketball court is 50 feet wide and 94 feet long. Half court then would be 47 feet, but my graphs only extend to 32 feet from the baseline. 32 x 50 = 1,600. Each regression takes the following form:

Shot_made = Offensive_FG% + Home + Defensive_P1 + Defensive_P2 + … + Offensive_P1 + Offensive_P2 + …

Where shot_made is a dummy variable indicating whether or not the shot was good, home is a dummy for whether or not the offensive player was at home, and the defensive/offensive variables are dummies for each player in the league who was a defensive/offensive player for at least 1,000 shots (I don’t have the number in front of me but this works out to 600-ish independent variables). The final model also includes team dummies, which the equation above leaves out. Offensive_FG% deserves its own extended explanation. This is the probability of the shooter making this particular shot. I get this probability by looking at all shots the shooter has taken in my dataset and getting an average for that particular location on the court. More on this in the next section.

I use OLS and I intentionally suppress the constant, because Offensive_FG% theoretically is the constant. I did some testing of this specification choice and it basically has no impact at all. I also constrain the coefficient on Offensive_FG% to be 1, essentially forcing it to be the constant. To do this I rearrange the regression equation to the following:

Shot_made – Offensive_FG% = Home + Defensive_P1 + Defensive_P2 + … + Offensive_P1 + Offensive_P2 + …

This new dependent variable takes values from -1 to 1.
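Here is a toy Python sketch of that rearrangement: subtract the shooter’s expected FG% from the outcome, then fit OLS with no constant on the remaining dummies. The six shots and the two dummies (`home`, one defender) are invented for illustration, and the small normal-equations solver is just a stand-in for a real regression routine.

```python
def ols_no_constant(rows, y):
    """Least squares without an intercept, solved via the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting on [X'X | X'y].
    aug = [xtx[i] + [xty[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Invented shots: (shot_made, offensive_fg, home, defender_a_on_court)
shots = [
    (1, 0.5, 1, 1),
    (0, 0.5, 1, 0),
    (1, 0.4, 0, 1),
    (0, 0.4, 0, 0),
    (0, 0.6, 1, 1),
    (1, 0.6, 0, 0),
]

# Moving offensive_fg to the left side fixes its coefficient at 1.
dep_var = [made - fg for made, fg, _, _ in shots]
design = [[home, defender] for _, _, home, defender in shots]
coef_home, coef_defender = ols_no_constant(design, dep_var)
print(coef_home, coef_defender)
```

A negative coefficient on a defender’s dummy means shooters do worse than their own expected FG% when that defender is on the floor.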

As many people have correctly noticed, there is no way to get enough shots for a regression out of a 1ft x 1ft square (unless you are right under the basket), which brings us to the next step: collecting shots.

Collecting Shots for Each Regression
This is where it gets a little tricky. There are actually two parts to collecting shots. First, there is a shot collecting algorithm that goes into figuring out what the offensive_FG% is for each shot. I’m going to sidestep that issue today, because it has already been calculated in the data file above. Instead, I’m going to focus on shot collection for the actual regressions. To illustrate how it works I will show some Stata code that I use to verify my work (the real work is done in Python, but I wrote code for both languages to catch and eliminate bugs).

First, some data prep (again, this should work using the dataset provided above).

insheet using "PATH TO CSV HERE/data_reduced.csv", comma clear

* Drop observations if there is no offensiveplayerfg. This happens
* when there are not enough observations for the shooter to reliably
* determine FG% from a spot on the court.
drop if offensiveplayerfg=="OMIT"
destring,replace

* Encode team variable for later use in regression as dummies
encode team,gen(team2)
drop team
rename team2 team

* gen dependent variable
gen dep_var=shot_made-offensiveplayerfg


* drop players with fewer than 1000 shots
* (count always sets r(N), even when no observations match)
foreach var of varlist def_-def_miroslavraduljica {
	quietly count if `var'==1
	if r(N)<1000 {
		drop `var'
	}
}

foreach var of varlist off_-off_miroslavraduljica {
	quietly count if `var'==1
	if r(N)<1000 {
		drop `var'
	}
}
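
The same filter can be sketched in Python. This assumes each shot is a dict of column name to value (a made-up in-memory layout, not how the CSV has to be loaded), and the function name is hypothetical.

```python
def keep_frequent_players(shots, prefix, min_shots=1000):
    """Return dummy columns under `prefix` that equal 1 on at least min_shots rows."""
    counts = {}
    for shot in shots:
        for name, value in shot.items():
            if name.startswith(prefix) and value == 1:
                counts[name] = counts.get(name, 0) + 1
    # Keep a column only if the player was on the court often enough.
    return sorted(name for name, n in counts.items() if n >= min_shots)


# Tiny invented example with a threshold of 2 instead of 1000.
toy_shots = [
    {"def_a": 1, "def_b": 0, "off_c": 1},
    {"def_a": 1, "def_b": 1, "off_c": 1},
    {"def_a": 0, "def_b": 0, "off_c": 1},
]
kept_defenders = keep_frequent_players(toy_shots, "def_", min_shots=2)
print(kept_defenders)
```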

Ok, now we’re ready for the main regression loop. I’ll take this in stages and explain each bit.

local y=-52.5
local x=-250

This is just setting our starting points. If you’re just wanting to see the regression in one box or whatever, you can just change these to that box’s coordinates. As a quick aside: the x and y locations are distance from the hoop in tenths of a foot. The hoop is the center of the whole thing, hence the negative values. The y-axis is the distance from the baseline to mid court. The x-axis is from one sideline to the other.

forvalues i=1(1)1600 {
	local x_center=`x'+5
	local y_center=`y'+5

Here’s the start of the loop. We’ll be iterating 1,600 times, once for each of our squares. For each square, it’s convenient to find the center of the square. Remember each square is 1ft x 1ft or 10 x 10 since these are tenths of a foot, so the center is the corner +5.
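
If it helps to see the whole grid at once, this Python sketch enumerates the same 1,600 corners and centers in the same order as the Stata loop (y advances first, then x). The function name and its parameters are assumptions for illustration; the starting values and box size come from the code above.

```python
def box_grid(x_start=-250.0, y_start=-52.5, size=10.0, n_x=50, n_y=32):
    """Yield (x, y, x_center, y_center) for each box, in tenths of a foot."""
    for ix in range(n_x):          # sideline to sideline
        for iy in range(n_y):      # baseline toward mid court
            x = x_start + ix * size
            y = y_start + iy * size
            yield x, y, x + size / 2, y + size / 2


boxes = list(box_grid())
print(len(boxes), boxes[0], boxes[-1])
```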

	gen box=0
	cap: replace box=1 if (x_loc>=`x' & x_loc<`x'+10) & (y_loc>=`y' & y_loc<`y'+10)
	
	gen extend_box=0
	gen distance=sqrt((`x_center'-x_loc)^2+(`y_center'-y_loc)^2)
	cap: replace extend_box=1 if distance<120

Box is a dummy variable that indicates if a shot originates from inside our box. The second line here identifies all those shots. Extend_box is what we really care about though. This is the dummy variable that will identify all of the shots in our regression, in a bigger area than just the box itself. The first step to populating extend_box is to find the distance of each shot in the dataset from the center of our box. That’s the distance variable, generated above.

The way this works is I set extend_box to 1 for the absolute largest number of shots that could possibly be in the regression, and then I gradually reduce them according to the criteria you’ll see below. My first rule is that I don’t include a shot in a regression if it is more than 12 feet away from the center of the box, so the first step is to set extend_box to 1 if the distance is less than 120.
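
A minimal Python sketch of these two steps, assuming shots are given as plain (x_loc, y_loc) tuples; the function names are made up for illustration. Like the Stata code, the box test is closed on the lower edge and open on the upper edge, and the distance cutoff is strict.

```python
import math


def in_box(shot_x, shot_y, x, y, size=10.0):
    """True if the shot falls inside the box whose lower-left corner is (x, y)."""
    return x <= shot_x < x + size and y <= shot_y < y + size


def candidate_shots(shots, x, y, size=10.0, max_distance=120.0):
    """Return (distance, index) for every shot closer than max_distance to the box center."""
    x_center, y_center = x + size / 2, y + size / 2
    candidates = []
    for index, (shot_x, shot_y) in enumerate(shots):
        distance = math.hypot(x_center - shot_x, y_center - shot_y)
        if distance < max_distance:
            candidates.append((distance, index))
    return candidates


toy_shots = [(5.0, 5.0), (10.0, 5.0), (5.0, 125.0), (5.0, 124.0)]
nearby = candidate_shots(toy_shots, x=0.0, y=0.0)
print(nearby)
```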

	* Figure out the region for box
	if abs(`x_center')>=220 | ((`x_center')^2+(`y_center')^2>=56406.25) {
		cap: replace extend_box=0 if region!=2
	}
	
	if ((`x_center')^2+(`y_center')^2<=6400) {
		cap: replace extend_box=0 if region!=0
	}
	
	if ((`x_center')^2+(`y_center')^2>6400) & abs(`x_center')<220 & ((`x_center')^2+(`y_center')^2<56406.25) {
		cap: replace extend_box=0 if region!=1
	}

This is setup for winnowing extend_box down. These three ifs determine if the location we are looking at is a close, mid, or 3-point location. I use the center of the box to make this determination. I then eliminate shots if they are not of the same type as the box itself. That is, if the center of the box is a 3-point location, then only 3-point shots will be used in the regression for that box, and so on.
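
The three conditions can be collapsed into one function. This Python sketch uses the thresholds from the Stata code (220 is 22 feet, 6400 is 8 feet squared, 56406.25 is 23.75 feet squared, all in tenths of a foot); the region codes 0, 1, and 2 are read off the `region!=` tests above, and the function name is hypothetical.

```python
def box_region(x_center, y_center):
    """Classify a box center: 0 = close, 1 = mid-range, 2 = three-point."""
    dist_sq = x_center ** 2 + y_center ** 2
    if abs(x_center) >= 220 or dist_sq >= 56406.25:
        return 2  # corner three or beyond the arc
    if dist_sq <= 6400:
        return 0  # within 8 feet of the hoop
    return 1      # everything in between


print(box_region(5, 5), box_region(0, 150), box_region(225, 0))
```

Only shots whose own region matches the box’s region stay in extend_box.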

	sort extend_box distance
	cap: by extend_box: replace extend_box=0 if _n>10000

This is the final modification to extend_box. I remove all observations in excess of the closest 10,000. In practice, what this means is that virtually every box uses shots that are less than 12 feet distant (almost all regressions hit the 10k cap). Circle size varies though: some regressions only have to wander a foot or two from their center, while others encompass 8 or 9 feet. The 12-foot limit and the 10k limit are arbitrary, and I haven’t systematically tested alternatives. Some early testing I did suggests that the regressions themselves are fairly robust to small changes in these parameters, but I admit to some ignorance on this point.
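
The cap itself is just a sort and a slice. A minimal Python sketch, assuming the candidates arrive as (distance, shot_index) pairs; the function name is made up. Ties in distance are broken here by shot index, which may not match how Stata orders tied observations.

```python
def cap_closest(candidates, keep=10000):
    """Return the shot indices of the `keep` candidates nearest the box center."""
    return [index for _, index in sorted(candidates)[:keep]]


toy_candidates = [(30.0, 0), (5.0, 1), (12.5, 2), (80.0, 3)]
selected = cap_closest(toy_candidates, keep=2)
print(selected)
```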

	xi: reg dep_var home def_-def_cjmiles off_-off_cjmiles i.team if extend_box==1, nocons
	drop box
	drop distance
	drop extend_box 	

The regression, and removing clutter to get ready for the next regression.

	local y=`y'+10
	if `y'==267.5 {
		local y=-52.5
		local x=`x'+10
	}
}

And cycling to the next box. What I am not showing here is capturing the coefficients and turning them into a spreadsheet. I do that part in Python. As I mentioned before, the Stata code is just a validation tool. If you want to see for yourself that this all checks out, take a look at the source code on my adjusted defensive impact page, download a player’s .csv file, and verify. The csv files show 250 locations with the coefficient in the third column and the standard error in the 4th column. For some reason my Python and Stata standard errors are very slightly different from each other. This is a bug I haven’t figured out yet, but the differences are tiny.

Specification and Model Fit
I have done some specification testing. I tested the following possibilities: with/without a constant, with/without team dummies, with/without offensive player dummies, and all combinations of these three. Looking at R² and proportional reduction in error, offensive player dummies are a clear win. Team dummies are nearly a wash but appear to very slightly improve the model so I went ahead and threw them in. The constant doesn’t appear to add anything to the model and theoretically I think it’s appropriate to leave it out.

The two figures below show model fit statistics for the final model. The first graph shows proportional reduction in error (PRE). PRE ranges from very poor (3%) to pretty darn good (30%). Average PRE is 10%. As the first figure shows, the pattern of PRE is interesting. Higher PREs are shown in warmer colors. Deep greens indicate lower PREs. I don’t know what I expected, but I definitely didn’t expect the border of the restricted area to pop out.[1] The patterns appear fairly regular. The edge of the restricted area is the easiest area to predict, followed by the corners and then the midrange. It is most difficult to predict shot outcomes behind the arc.

Model Fit--PRE

The second graph, below, shows R² for each regression. R² ranges from 0.05 to 0.12. Let me rant about R² for a second.[2] First of all, keep in mind that I moved offensive_fg% to the lefthand side of the model. This means the R² is the variation explained by only the variables left on the righthand side. If offensive_fg% were on the righthand side,[3] the R² would range from 0.40 to 0.70. Second, a model can’t remove error that is random from an event, and there is a significant stochastic component to whether or not a shot is made. It is unlikely you could make a model to predict NBA shot success with an R² of 0.90 or something like that. That would be like claiming you have a model with an R² of 0.80 for a coin flip.
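
The coin flip point can be made exact. If each shot goes in with some true probability p, the law of total variance says the best any model can do (in the population) is explain Var(p) out of a total variance of mean(p) × (1 − mean(p)). The Python sketch below computes that ceiling; the function name and the example probabilities are made up for illustration, and it assumes a perfectly calibrated model, which no real model is.

```python
def max_r2(make_probabilities):
    """Upper bound on R² when outcomes are Bernoulli draws with these true probabilities."""
    n = len(make_probabilities)
    mean_p = sum(make_probabilities) / n
    total_var = mean_p * (1 - mean_p)  # variance of the 0/1 outcome
    if total_var == 0:
        raise ValueError("outcome never varies, so R² is undefined")
    explained_var = sum((p - mean_p) ** 2 for p in make_probabilities) / n
    return explained_var / total_var


coin_flip = max_r2([0.5, 0.5, 0.5, 0.5])  # every event is a fair coin
mixed = max_r2([0.3, 0.7])                # half tough shots, half easy ones
print(coin_flip, mixed)
```

For a fair coin the ceiling is zero, and even when make probabilities swing between 30% and 70% the ceiling stays well below the kind of R² people expect from a "good" model.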

Model Fit--R2

R² is a little better near the hoop and in the corners, but otherwise fairly uniform.

Parting Thoughts
This was a pretty big undertaking for me and there are lots of things I didn’t get done or would have done differently in retrospect. It might make sense to re-imagine the regression as a geographically weighted regression, or even just a simple weighted OLS to de-emphasize shots that are taken far from the center of the square being drawn. I should probably be using ridge regression. There are certain covariates I really wish I had collected (it would be nice to know if a shot was taken as part of a fast break possession, for example). There are robustness checks I didn’t do (validation of my 10,000 shot and 12 foot distance parameters).

Moreover, adjusted defensive impact suffers from many of the same ailments as adjusted plus-minus (APM). There are collinearity problems,[4] significance is poor,[5] and some players do not pass the smell test (see James Harden[6]). But these are the problems of every method we have.[7] I hope to revisit this in the next basketball season, and maybe I can improve on the method. In the meantime, I would be very interested in hearing your feedback.

  1. I haven’t poked around much but I don’t think this is just an artifact. The naive model produces fewer errors close to the hoop, because FG% rises near the hoop. This should naturally inflate PRE a little since smaller numerical increases in prediction will lead to larger % increases in prediction. But that should be true for the entire area under the hoop, not just the perimeter of the restricted area. I feel like I’m missing something but I don’t know what.
  2. The short version is that people focus on R² too much.
  3. This is actually a pretty inconsequential change: the coefficient on it is indistinguishable from 1 in every regression.
  4. This may be part of the story behind the Harden graph, which doesn’t track closely to most people’s intuition.
  5. It makes you wonder about statistical significance, because APM has strong face validity for many players even though it does not approach statistical significance.
  6. I recently read some old posts Dave Berri wrote criticizing APM. While the criticisms were valid, and I think some skepticism is warranted, a great deal of ink is spilled to support the superiority of wins produced. In another blog entry from around the same time period he lists the top 50 players by wins produced for 2010-2011. Dirk Nowitzki does not feature in the top 50. Let it sink in for a second. Every method will produce some inaccurate estimates.