This post was prompted by this Twitter exchange:

@hetrick46 Very hard to compare them using any kind of regression because they are perfect subs for one another (never play together)

— Austin Clemens (@AustinClemens2) April 21, 2014

I want to clarify this issue for non-stats people. Adjusted plus-minus and similar analytics, including my own adjusted defensive impact analytic that I debuted a couple days ago, use multivariate regression to ‘adjust’ the raw numbers. The basic idea is that a lot of players who are not very good get to share the floor with great players sometimes. Udonis Haslem gets a fair bit of floor time with LeBron James. When that happens, Haslem’s raw plus-minus is going to look great! But that’s mostly because James was also in the game with him. So the regression controls for all the players on the court, and tries to extract the contribution made by each.

One of the frequently cited problems with this approach is multicollinearity. This happens when certain players are close or perfect substitutes for each other, like Ian Mahinmi and Roy Hibbert. Mahinmi and Hibbert have not spent a single second on the court together this whole season (there have been instances where both have been off the court, but this is rare). The variables we use to control for the presence of Hibbert and Mahinmi are therefore highly correlated: when one is 0 (indicating that the player is not on the court) the other is almost always 1.

But that’s only true for the particular specification used in APM. Comparing perfectly correlated variables is not only possible in regression, it’s done all the time! At my day job right now, I’m working with a big dataset of criminal justice outcomes, and one variable that is shown to be a significant predictor of criminal recidivism is gender. If we created two variables, one indicating whether a person is male, and one indicating whether a person is female, these variables would be perfectly (negatively) correlated. But this isn’t a problem–we simply exclude one of these variables, and the one we exclude becomes the comparison group. So if I place the male variable in the model, for example, I would then interpret the model as saying^{1} “Males are 5% more likely than females to recidivate, all else equal.” What if I try to include both variables in the model? I won’t get very far if the variables are *perfectly* collinear, as they are in this case, but Mahinmi and Hibbert are sometimes both out together, so their indicator variables are not perfectly correlated.
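To make the male/female example concrete, here is a minimal sketch in Python with NumPy. The six-person dataset is a made-up toy, and the variable names are purely illustrative. It shows that an intercept plus both indicators carries redundant information, while dropping one indicator leaves a usable design matrix.

```python
import numpy as np

# Hypothetical toy data (not from any real dataset): 1 = male, 0 = female.
male = np.array([1, 0, 1, 1, 0, 0])
female = 1 - male
intercept = np.ones_like(male)

# Intercept plus BOTH indicators: male + female == intercept on every row,
# so one column is redundant and the matrix is rank deficient.
X_both = np.column_stack([intercept, male, female])

# Drop one indicator; the excluded group (female) becomes the baseline.
X_ref = np.column_stack([intercept, male])

rank_both = np.linalg.matrix_rank(X_both)
rank_ref = np.linalg.matrix_rank(X_ref)
corr = np.corrcoef(male, female)[0, 1]

print("columns vs rank, both indicators:", X_both.shape[1], rank_both)
print("columns vs rank, one indicator:  ", X_ref.shape[1], rank_ref)
print("correlation between indicators:  ", corr)
```

With both indicators the matrix has three columns but only rank two, which is why the regression cannot separate the two effects. With one indicator dropped, the column count and the rank agree.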

How does multicollinearity affect the model? The most common problem with multicollinearity is that the standard errors associated with the model estimates blow up. That’s bad if you care about statistical significance, but generally we don’t for things like APM and RPM. And it’s very possible that multicollinearity won’t affect estimates of player ability at all. It could, though, especially when you don’t just have two groups but something closer to 200, as we do when we want to get estimates for nearly every NBA player.
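The "errors blow up" point can be illustrated with a small sketch. It assumes ordinary least squares with constant error variance, in which case each coefficient's sampling variance is proportional to the matching diagonal entry of (X'X)^-1. The two designs below are invented for illustration: one where two players are near-perfect substitutes (both sit on only 10 of 1,000 rows), and one where their minutes are unrelated.

```python
import numpy as np

def coef_variance_factors(X):
    """Diagonal of (X'X)^-1: each coefficient's sampling variance,
    in units of the error variance, under standard OLS assumptions."""
    return np.diag(np.linalg.inv(X.T @ X))

n = 1000
ones = np.ones(n)

# Hypothetical player A is on the floor for the first 500 rows.
a = (np.arange(n) < 500).astype(float)

# Near-substitute: B plays whenever A sits, except 10 rows where both sit.
b_near = 1.0 - a
b_near[-10:] = 0.0

# Comparison case: B's minutes are unrelated to A's (alternating rows).
b_indep = (np.arange(n) % 2).astype(float)

var_near = coef_variance_factors(np.column_stack([ones, a, b_near]))
var_indep = coef_variance_factors(np.column_stack([ones, a, b_indep]))

# Index 1 is player A's coefficient in both designs.
ratio = var_near[1] / var_indep[1]
print("variance factor, near-substitutes:", var_near[1])
print("variance factor, unrelated minutes:", var_indep[1])
print("ratio:", ratio)
```

In this toy design the variance factor on player A's coefficient is about 25 times larger when B is a near-substitute, so the standard error is roughly five times larger, even though both designs use the same number of rows.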

What if we respecify the model though? Remember we can make this work if we exclude one of either Mahinmi or Hibbert such that that player is the comparison group. I don’t have the data necessary for APM, but I do have shot data for the 2013-2014 season. I ran the following regression using all shots taken against the Pacers:

shot_made = offensive_fg% + home + defensive_p1 + defensive_p2 + defensive_p3 + defensive_p4 + defensive_p5 + …

Where shot_made is an indicator variable that is 1 if the shot went in, 0 if it didn’t, offensive_fg% is the FG% of the player taking the shot from that spot on the court, home is an indicator variable that is 1 if the Pacers are at home and otherwise 0, and the defensive_p* variables are just indicator variables for every Pacers player–even ones who barely saw the court. HOWEVER, I remove Ian Mahinmi’s indicator variable (this is arbitrary, I would get the exact same answer if I removed Hibbert’s). This is now a well-specified model that doesn’t suffer from multicollinearity (at least, not between Hibbert and Mahinmi) and for which the coefficients I retrieve can be interpreted as a player’s defensive contribution relative to Ian Mahinmi.

What I get back from the model is an estimate of how much that player’s presence on the court either adds to or subtracts from the probability that an opposing player will hit his shot compared to if Ian Mahinmi was on the court instead. To my surprise, the coefficient for Hibbert’s indicator variable is 0.005. This means that Hibbert’s presence increases the probability that an opposing player will hit his shot by 0.5 percentage points compared to if Ian Mahinmi was on the court. So the model says Mahinmi is infinitesimally better. This estimate is not statistically significant however, reflecting just how close these two guys are in terms of the defense they provide. Additionally, keep in mind that my model is only about a player’s effect on shots. APM is more holistic, and can reflect things like the ability to create turnovers, draw offensive fouls, rebound and so on.
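The reference-group setup can be sketched as follows. This is a simplified illustration, not the actual model: it assumes a linear probability model fit by ordinary least squares, leaves out the offensive FG% and home controls, and uses made-up shot counts chosen only so the arithmetic is easy to check. The `other_big` indicator is a hypothetical stand-in for whoever plays center when both Hibbert and Mahinmi sit.

```python
import numpy as np

def make_rows(n_shots, n_made, hibbert, other_big):
    """Build outcomes and indicator columns for one hypothetical lineup type."""
    y = np.zeros(n_shots)
    y[:n_made] = 1.0  # 1 = opponent made the shot
    X = np.column_stack([
        np.ones(n_shots),             # intercept: Mahinmi on the floor (baseline)
        np.full(n_shots, hibbert),    # Hibbert indicator
        np.full(n_shots, other_big),  # hypothetical third big man indicator
    ])
    return y, X

# Made-up counts, NOT real Pacers data.
blocks = [
    make_rows(1000, 440, 1.0, 0.0),  # Hibbert at center
    make_rows(1000, 450, 0.0, 0.0),  # Mahinmi at center (indicator omitted)
    make_rows(100, 50, 0.0, 1.0),    # both sitting
]
y = np.concatenate([b[0] for b in blocks])
X = np.vstack([b[1] for b in blocks])

# Mahinmi has no column, so every coefficient is measured relative to him.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline, hibbert_vs_mahinmi, other_vs_mahinmi = coef

print("opponent make rate with Mahinmi on:", baseline)
print("Hibbert relative to Mahinmi:", hibbert_vs_mahinmi)
print("other big relative to Mahinmi:", other_vs_mahinmi)
```

Because Mahinmi's indicator is left out, the intercept recovers the opponent make rate when he is on the floor, and each remaining coefficient is the difference from that rate. In this toy data Hibbert's coefficient is simply 0.44 minus 0.45.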

The problem is you can’t apply this approach on a large scale. If ESPN came out with a new stat “RPMCTIM”, which of course stands for “Real plus-minus compared to Ian Mahinmi,” no one would quite know what to make of it. Virtually everyone in the league would appear to be a poor defender, because the comparison is Ian Mahinmi–a great defender!

1. All these numbers are completely made up.