@treycausey tweeted about this Uber blog post about how Uber reduced DUI arrests in Seattle. Using a technique called regression discontinuity, they claim that Uber has reduced DUI arrests by about 10% on average. The post bugged me though, because there is not a lot of detail on the methods, and regression discontinuity is the sort of research design that is very much dependent on specification. In this post I replicate the study and walk through what regression discontinuity is and why it can be a very effective research design. Ultimately I think it’s plausible that Uber did in fact reduce DUIs in Seattle, but the story is a bit more complex than the blog post lets on. Continued under the break!

Why would Uber reduce DUI arrests? It would have to be the case that people who once drove to and from bars now take Uber instead. An alternate hypothesis is that Uber will siphon off existing cab customers without creating new ones. If this is the case, then there will be no change in the number of DUIs. Uber would probably say that Uber cabs are more convenient and more readily available than traditional cabs, and therefore have the potential to attract new customers.^{1} Ok, that’s our theory, now let me quickly explain what regression discontinuity is.

**Why Regression Discontinuity?**

Suppose every student who scores 1400 or higher on the SAT^{2} receives a national merit scholarship. We want to know if the scholarship itself causes better academic achievement in the future. Comparing students who received the scholarship to those who did not is not an appropriate research design. On average, students who received the scholarship are smarter, more focused, or better at taking tests (hence the higher SAT score) than those who did not. What we want is a population of students who do not receive the scholarship and a population of students who do receive it, where assignment of the scholarship is plausibly random. The idea behind regression discontinuity is that we can achieve this if we look at students very close to the cutoff point for receiving the scholarship (an SAT of 1400).

The core assumption is that scores are at least somewhat random. A student who receives a 1400 is not obviously smarter than a student who receives a 1390. The latter student might simply have guessed incorrectly on a question that the former student guessed correctly. Students near the cutoff point therefore receive the treatment (a scholarship) in a way that is very nearly random, or at least, as close as we can get to random since we’re unlikely to be able to conduct a controlled experiment on this topic.

The methods behind regression discontinuity are reassuringly simple. Because we have a theoretical reason to believe that assignment is random in the neighborhood of 1400, all we do is throw out all the students who aren’t close to that cutoff and then run simple linear regression on what remains. The dependent variable will be scholastic achievement, and the only two independent variables we really need are one for the actual SAT score received (this is called the forcing variable) and one for receipt of the treatment (0/1: did the student receive a scholarship?). We don’t really need anything else because assignment is random. For an appropriately large sample size, all other variables are uncorrelated with the treatment, so leaving them out does not bias the estimate of its effect.
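To make the recipe concrete, here is a minimal sketch in R using simulated students. Everything in it is invented for illustration: the score distribution, the `gpa` outcome, the built-in jump of 0.3 at the cutoff, and the 100-point window are all assumptions, not real scholarship data.

```r
# Simulated data only: every number below is invented for illustration.
set.seed(42)
n <- 5000
sat <- round(rnorm(n, mean = 1200, sd = 150) / 10) * 10  # forcing variable
scholarship <- as.integer(sat >= 1400)                    # treatment: 1 at or above the cutoff

# Hypothetical outcome: rises smoothly with SAT, plus a true jump of 0.3 at the cutoff
gpa <- 1.0 + 0.002 * sat + 0.3 * scholarship + rnorm(n, sd = 0.4)
students <- data.frame(sat, scholarship, gpa)

# Throw out everyone more than 100 points from the cutoff
near <- subset(students, abs(sat - 1400) <= 100)

# Regress the outcome on the treatment dummy and the forcing variable
fit_rd <- lm(gpa ~ scholarship + sat, data = near)
rd_estimate <- coef(fit_rd)["scholarship"]
print(summary(fit_rd)$coefficients)
```

The coefficient on `scholarship` is the estimated jump at the cutoff; in this simulation it should land somewhere near the 0.3 that was built in.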

**Replicating Uber's Regression**

Conceptually, the forcing variable here (that is, the variable that determines the treatment cutoff) is the date. According to Uber’s Seattle blog, Uber started officially offering service in Seattle on August 12th of 2011, so I used that as the cutoff for the “uber” variable, which is the treatment variable. It is 1 if the date is >=8/12/2011 and 0 if the date is <8/12/2011.

An important concept in regression discontinuity is bandwidth. Remember we're going to throw away points that are not close to the assignment cutoff. You probably noticed in my previous discussion that I didn't specify *how* close these points have to be. There is no rule for this, so we’re just going to try different bandwidths and see what happens. We might try 50 days on either side of our cutoff, for example.
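As a rough sketch of that setup, the snippet below builds the treatment dummy from the launch date and keeps only the days inside a 50-day bandwidth. The frame `days`, the column `day`, and the name `near_cutoff` are illustrative, and the date range simply assumes the 389 pre-launch and 1,030 post-launch days discussed in this post. No DUI counts are included.

```r
cutoff <- as.Date("2011-08-12")   # Uber's official Seattle launch

# One row per day: 389 days before the cutoff, 1,030 days from the cutoff onward
days <- data.frame(date = seq(cutoff - 389, cutoff + 1029, by = "day"))

# Treatment dummy: 1 on or after launch, 0 before
days$uber <- as.integer(days$date >= cutoff)

# Forcing variable re-centered on the cutoff (negative = before launch)
days$day <- as.numeric(difftime(days$date, cutoff, units = "days"))

# Keep 50 days on either side of the cutoff
bandwidth <- 50
near_cutoff <- subset(days, day >= -bandwidth & day < bandwidth)
print(table(near_cutoff$uber))
```

Changing `bandwidth` is all it takes to widen or narrow the comparison window.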

Unfortunately, the Uber blog post does not specify the bandwidth used.^{3} The output looks like it comes from R and they may have used an algorithm to choose an appropriate bandwidth. The bandwidth must be huge, because they include an independent variable for Marijuana and that doesn’t happen until a good 2 years after Uber is introduced to Seattle! Moreover, their bandwidth must be asymmetric, because there are more days of data post-Uber introduction (1,030) than there are pre-Uber introduction (389).

Here’s the regression if we just use all data points:

Variable | Coefficient | P-value |
---|---|---|
Uber | -0.677 | 0.043 |
Day | 0.002 | 0.001 |
Marijuana | 1.438 | 0.000 |
Friday | -1.878 | 0.000 |
Monday | -4.470 | 0.000 |
Saturday | 0.284 | 0.359 |
Thursday | -3.264 | 0.000 |
Tuesday | -4.625 | 0.000 |
Wednesday | -4.119 | 0.000 |

These results are close-ish to what Uber displays, so I assume they used a very wide bandwidth. This is good and bad. To start with the bad, the wider you make the bandwidth, the more likely it is that other stuff happens that interferes with what should be a relatively clean causal design. This is why Uber has to include the marijuana control, for example. A narrower bandwidth will more precisely identify observations where the assignment of the treatment is nearly random. Just as we don’t want to compare students with SAT scores of 1100 to students with SAT scores of 1600, it’s not good to compare a day 2 years before Uber’s introduction to one 2 years after Uber’s introduction. Any number of things could have happened in those 4 years that also influence the number of DUIs and that we are not controlling for.

On the flip side, using a very small bandwidth could be bad in its own way. Using regression discontinuity is complicated in this particular case by the fact that the treatment of Uber providing service in the city is not really a 0/1 treatment. As time goes on, Uber market share probably increases, and the effect of the Uber treatment grows. With small bandwidths, Uber marketshare is small, and we bias the regression against finding an effect on DUI.^{4}

Whenever you use regression discontinuity, you should test various bandwidths to see how the bandwidth affects your estimate of the treatment effect. The graph below shows how the coefficient on the Uber treatment variable changes as I widen the bandwidth in increments of 25 days on either side of the cutoff. 95% confidence intervals are shown for each point estimate.

As we should expect, it is difficult to say anything about the treatment at first, because there aren’t enough observations for the treatment variable to obtain significance. Once we have 125 days on either side of the cutoff the coefficient is significant and the effect size is substantial. This effect lingers for slightly larger bandwidths but overall as you increase the size of the bandwidth the effect of the Uber treatment diminishes, until it is basically 0 for large bandwidths. It seems then like the effect found in the Uber blogpost is being driven in large part by days that occur more than 400 days after Uber opened its doors in Seattle. That makes me skeptical to say the least, but it does seem to me like some of the smaller bandwidth regressions demonstrate that there was a short-lived effect.
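The bandwidth sweep described above can be sketched as a loop instead of one regression per line. This is only a sketch on placeholder data: the `sim` frame and its Poisson counts are invented (with no Uber effect built in), the marijuana control is left out because its start date isn't given here, and the 390.5 midpoint is inferred from the subsets used in the R commands posted in the comments.

```r
# Placeholder data: invented daily counts with NO built-in Uber effect.
set.seed(1)
cutoff_day <- 390.5                                   # midpoint of the 365.5 to 415.5 window
sim <- data.frame(eventdate = 1:1419)
sim$uber <- as.integer(sim$eventdate > cutoff_day)    # treatment dummy
sim$dayofweek <- factor(sim$eventdate %% 7)           # stand-in for day of week
sim$incidents <- rpois(nrow(sim), lambda = 8)         # stand-in DUI counts

# Re-run the same regression for bandwidths of 25, 50, ..., 400 days
bandwidths <- seq(25, 400, by = 25)
rows <- lapply(bandwidths, function(bw) {
  win <- subset(sim, eventdate >= cutoff_day - bw & eventdate <= cutoff_day + bw)
  fit <- lm(incidents ~ uber + eventdate + factor(dayofweek), data = win)
  est <- summary(fit)$coefficients["uber", ]
  data.frame(bandwidth = bw,
             coefficient = est[["Estimate"]],
             se = est[["Std. Error"]])
})

# One row per bandwidth: point estimate and standard error for the Uber dummy
data2 <- do.call(rbind, rows)
print(data2)
```

A frame shaped like `data2` (columns `bandwidth`, `coefficient`, `se`) is what a plot of point estimates with 1.96 × se error bars needs.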

1. They might also argue that the supply of cabs is artificially constrained due to a limited number of licenses, as is the case in most major cities. If that’s the case, it is more likely that Uber will add new cab customers, since wait times should decrease substantially.
2. Yes I know it’s out of 2400 or something these days, but back in my day it was 1600! Kids these days!
3. I am assuming they used parametric regression and I don’t know a damned thing about nonparametric methods for RD so I hope that assumption is correct.
4. One way to remedy this would be to include an interaction of the Uber dummy with the date. This would allow for the possibility that the slope of the best fit line is different after the discontinuity, perhaps increasing as we get further away and Uber marketshare increases. This raises various functional form questions though, because it assumes a linear increase in marketshare and a linear relationship between marketshare and DUIs.
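The interaction idea in the last footnote might look something like this in R. The data are simulated, with an invented effect that is zero at launch and grows by 0.01 incidents per day afterwards; the names `sim_int` and `fit_int` are illustrative, and this is not a specification that was run on the Seattle data.

```r
# Simulated data: the effect sizes below are invented for illustration.
set.seed(7)
sim_int <- data.frame(day = -200:199)            # days relative to launch
sim_int$uber <- as.integer(sim_int$day >= 0)     # treatment dummy

# Invented outcome: gentle upward trend, plus an Uber effect that grows with time
sim_int$incidents <- 8 + 0.002 * sim_int$day -
  0.01 * sim_int$uber * sim_int$day + rnorm(nrow(sim_int), sd = 1)

# uber * day expands to uber + day + uber:day
fit_int <- lm(incidents ~ uber * day, data = sim_int)
print(coef(fit_int))
```

Here `uber` is the jump at the cutoff itself and `uber:day` is the change in slope after launch, so a growing effect shows up in the interaction term and not in the dummy.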

Thanks for taking the skeptical view! Peer review can only make our work stronger. Cheers :)

Can you share the R code used?

Sorry I missed this comment in my inbox. I am happy to but I also can’t find it at the moment. I’ll find it this weekend or early next week and put it up.

Using the dataset (which I just updated to fix a couple of small problems), here are the commands for the big regression and then regressions of varying bandwidths to make the graph:

lm(incidents~uber+marijuana+eventdate+factor(dayofweek))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=365.5 & eventdate<=415.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=340.5 & eventdate<=440.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=315.5 & eventdate<=465.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=290.5 & eventdate<=490.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=265.5 & eventdate<=515.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=240.5 & eventdate<=540.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=215.5 & eventdate<=565.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=190.5 & eventdate<=590.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=165.5 & eventdate<=615.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=140.5 & eventdate<=640.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=115.5 & eventdate<=665.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=90.5 & eventdate<=690.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=65.5 & eventdate<=715.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=40.5 & eventdate<=740.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=15.5 & eventdate<=765.5))

lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=-9.5 & eventdate<=790.5))

And then my graph is just a quick ggplot thing:

ggplot(data2, aes(x=bandwidth, y=coefficient)) + geom_point() + geom_errorbar(aes(ymax = coefficient+1.96*se, ymin = coefficient-1.96*se))

When you ran this analysis (or any for that matter), would you expand the data set you posted to include zeros? And if so, does that raise concern for the appropriateness of a linear model?

The data posted should have expanded 0s actually. It was missing 1 or 2 days so I updated it just now. As far as the model goes, I don’t see a problem with using linear regression. There might be more appropriate things to use (a count model of some sort), but OLS is pretty robust to this kind of thing.

Hey, this is some really cool stuff for playing around in Stata; would you possibly have either .do files or the R/SAS equivalent that I could convert to Stata format? Thanks!

Sorry for taking so long to respond to this comment. The R code is shown in a comment a bit above yours. Doing it in Stata really isn’t that difficult. You just do OLS regression with if statements to limit the bandwidth. So for example the equivalent of:

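lm(incidents~uber+marijuana+eventdate+factor(dayofweek),data=subset(data,eventdate>=365.5 & eventdate<=415.5))

would be something like:

reg incidents uber marijuana eventdate i.dayofweek if eventdate>365.5 & eventdate<415.5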

If you are using older versions of Stata you may have to throw a xi: on the beginning of that statement.