Tuesday, March 4, 2014

Divvy Data Challenge

With any large data set, it is always good to get acquainted with it first through a few simple histograms. I veered away from the geographic data, since that comes later, and instead asked a few simple questions like "How long do people take Divvy bikes out?"




(It's pretty clear the subscribers understand the rules best)

and, "How far do they ride them?"





or, "What time of day/day of the week are bikes used?"






You can tell who is commuting, but it's particularly surprising how little subscribers use the bikes on weekends.

or, "What is the distribution of rides by day, and how is it affected by the weather?"



(Weather data obtained from the NOAA NCDC. Precipitation totals are from O'Hare, while daily high temperatures are from Adler Planetarium)

What I think is more interesting than cumulative totals, though, is trying to see exactly where the Divvy bikes are used. The bikes presumably don't have tracking devices, so we're stuck with only information about when and where a bike is taken and returned. Still, this tells us more than simply drawing straight tie lines between points would.

Let's presume a bike is travelling from point A to B in 10 minutes. We know that 2 minutes into the ride, it can't possibly be farther from point A than 2 minutes at a cycling pace. Likewise, we know that the bike must be returned to point B in 8 minutes, so it also couldn't be farther than 8 minutes away from point B. If we draw appropriately sized circles around each point (assuming a constant average cycling pace) the bike must be where the two circles overlap. The following image describes this process.
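The two-circle argument above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the figures: it assumes station positions are already in flat x/y coordinates measured in kilometres, and the pace of 0.25 km per minute (15 km/h) and all function names are placeholders of my own choosing.

```python
import math

SPEED_KM_PER_MIN = 0.25  # assumed constant cycling pace (15 km/h), not from the data


def circle_radii(duration_min, elapsed_min, speed=SPEED_KM_PER_MIN):
    """Radii of the circles around A (start) and B (end) at a given moment."""
    r_a = speed * elapsed_min                   # how far the bike can have gone from A
    r_b = speed * (duration_min - elapsed_min)  # how far it can still be from B
    return r_a, r_b


def could_be_at(p, a, b, duration_min, elapsed_min, speed=SPEED_KM_PER_MIN):
    """True if point p lies inside both circles, i.e. in the overlap region."""
    r_a, r_b = circle_radii(duration_min, elapsed_min, speed)
    return math.dist(p, a) <= r_a and math.dist(p, b) <= r_b


def overlap_area(a, b, r_a, r_b):
    """Area of the intersection of two circles (zero if they do not overlap)."""
    d = math.dist(a, b)
    if d >= r_a + r_b:
        return 0.0  # circles are apart or only touching
    if d <= abs(r_a - r_b):
        return math.pi * min(r_a, r_b) ** 2  # smaller circle lies inside the larger
    # Half-angles of the two circular segments; clamp guards against rounding.
    cos_alpha = max(-1.0, min(1.0, (d * d + r_a * r_a - r_b * r_b) / (2 * d * r_a)))
    cos_beta = max(-1.0, min(1.0, (d * d + r_b * r_b - r_a * r_a) / (2 * d * r_b)))
    alpha, beta = math.acos(cos_alpha), math.acos(cos_beta)
    return (r_a * r_a * (alpha - math.sin(2 * alpha) / 2)
            + r_b * r_b * (beta - math.sin(2 * beta) / 2))


# The 10-minute ride from the text, 2 minutes in, with A and B 1 km apart.
a, b = (0.0, 0.0), (1.0, 0.0)
r_a, r_b = circle_radii(10, 2)
print(could_be_at((0.3, 0.0), a, b, 10, 2), overlap_area(a, b, r_a, r_b))
```

The overlap area is the useful by-product here: a small region means the bike's position is well constrained, while a large one means it could be almost anywhere nearby.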



We can then repeat this process for every bike ride and animate it. I've also colored the overlap regions by the average direction of the ride, to make it easier to track where the regions are going between points. Regions are white if the bike is returned to the same location. Coloration is vector summed for regions which overlap. Transparency is also scaled relative to the size of the overlap region, to indicate the likelihood of a bike being at any given point. The results look like this:



We can further filter the rides shown by type of user.

Customers:

and Subscribers:

While the animations are neat, and some information can be gleaned from them, especially from the subscribers animation, the data can become rather overwhelming when a lot of regions overlap. To ease this problem in the above videos, only rides which lasted less than an hour were used, in keeping with how Divvy bikes are intended to be used (rides of 30 minutes or less).


This data can also be compiled into a heat map of where bikes go for any filtered set of rides, by 'exposing' our map just to the transparency data, without the color information.
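One way to picture this 'exposure' is as a grid that accumulates a little weight wherever an overlap region falls. The sketch below is a guess at the idea rather than the actual rendering code: it assumes flat x/y coordinates in kilometres, a constant pace of 0.25 km per minute, one sample per minute, and a weight that is spread evenly over the cells of each region so that small (well constrained) regions expose the map more strongly. The `rides` tuple layout and the function name are hypothetical.

```python
import math


def ride_heat_map(rides, grid_x, grid_y, speed=0.25, step_min=1.0):
    """Accumulate overlap regions of many rides into a 2D heat map.

    rides  -- iterable of (a, b, duration_min), with a and b as (x, y) in km
    grid_x -- x coordinates of the grid columns
    grid_y -- y coordinates of the grid rows
    Returns a list of rows, heat[row][col].
    """
    heat = [[0.0] * len(grid_x) for _ in grid_y]
    for a, b, duration in rides:
        n_steps = int(duration // step_min) + 1
        for k in range(n_steps):
            t = k * step_min
            r_a, r_b = speed * t, speed * (duration - t)
            # Grid cells that lie inside both circles at time t.
            cells = [
                (i, j)
                for i, y in enumerate(grid_y)
                for j, x in enumerate(grid_x)
                if math.dist((x, y), a) <= r_a and math.dist((x, y), b) <= r_b
            ]
            # Spread one unit of 'exposure' evenly over the region.
            for i, j in cells:
                heat[i][j] += 1.0 / len(cells)
    return heat


# One 10-minute ride between two points 1 km apart, on a 100 m grid.
grid_x = [-1 + 0.1 * k for k in range(31)]
grid_y = [-1 + 0.1 * k for k in range(21)]
heat = ride_heat_map([((0.0, 0.0), (1.0, 0.0), 10.0)], grid_x, grid_y)
```

Filtering (by user type, month, weekday, or hour) would happen before this step, simply by choosing which rides go into the list. For a full data set, a vectorized array library would be a much faster choice than nested Python loops.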


So subscribers (in July) riding on weekdays between 7 and 9am look like this:




And customers (in July) riding on Saturdays between 9 and 11am look like this:



Another thing we can do is isolate a single bike by its bike id. Divvy Red was found in the data (bikeid: 1015 determined by unique ride count and corroborative evidence from the Twitter account) and could be tracked in a similar fashion. Geo- and time-tagged images tweeted with the #DivvyRed tag could be used to give further context for where the promotional bike had been. Work on this is ongoing.

Another animation which can be made is one for station usage. Here, a blue dot is placed on each station location. A two-hour running total of bikes entering and leaving is kept for each station. If bikes are used, the blue dot is changed to a pie chart whose size is determined by the total number of bikes entering or leaving, with the red portion representing the number of bikes entering and the green portion the number of bikes leaving. The animation takes 30-minute steps between frames and runs through the entire data set for all rides.
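The per-frame bookkeeping for this animation can be sketched as follows. This is a minimal illustration under my own assumptions: the running total is treated as a trailing two-hour window ending at each frame time, and the trip tuple layout and function names are hypothetical rather than the columns of the Divvy files.

```python
from collections import Counter
from datetime import datetime, timedelta


def station_frames(trips, start, end,
                   window=timedelta(hours=2), step=timedelta(minutes=30)):
    """Per-frame counts of bikes entering and leaving each station.

    trips -- list of (start_time, from_station, stop_time, to_station)
    Returns a list of (frame_time, entering, leaving), where entering and
    leaving are Counters keyed by station.
    """
    frames = []
    t = start
    while t <= end:
        entering, leaving = Counter(), Counter()
        for start_time, from_station, stop_time, to_station in trips:
            if t - window < start_time <= t:
                leaving[from_station] += 1   # green portion of the pie
            if t - window < stop_time <= t:
                entering[to_station] += 1    # red portion of the pie
        frames.append((t, entering, leaving))
        t += step
    return frames


def pie_for(station, entering, leaving):
    """Total traffic (pie size) and red share for one station, or None if idle."""
    total = entering[station] + leaving[station]
    if total == 0:
        return None  # station stays a plain blue dot
    return total, entering[station] / total


trips = [
    (datetime(2013, 7, 1, 8, 0), "A", datetime(2013, 7, 1, 8, 20), "B"),
    (datetime(2013, 7, 1, 8, 10), "A", datetime(2013, 7, 1, 8, 50), "C"),
]
frames = station_frames(trips, datetime(2013, 7, 1, 8, 0), datetime(2013, 7, 1, 11, 0))
```

Rescanning every trip for every frame is slow on the full data set; sorting the trips by time and sliding the window forward would be the obvious refinement, but the simple loop keeps the logic easy to follow.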




What is interesting to note is that toward the beginning of the data set, few of the stations are used, which could relate to the visibility of the Divvy program or to the number of stations that had been deployed at the time. Some stations, especially ones outside of the Loop and Streeterville, like those on Ashland or Milwaukee, are still used quite a bit more than neighboring stations on less trafficked streets.