As I talked about in my previous post, I want to put together a system that will help Citibike by identifying the best stations to rebalance. I’ve decided to leave future station prediction to the models built over at DSSG and instead focus on what we can do once we can predict how stations will look in 30 or 60 minutes.

Scraping in R

Parsing JSON with R is a bit silly, but I eventually came up with a decent solution that pulls multiple fields out of a fairly deeply nested JSON object without having to use either recursion or painfully inefficient R for loops:
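A minimal sketch of one way to do this, assuming the jsonlite package; the field names below mimic the shape of the old Citibike station feed but are illustrative, not necessarily the exact schema or the approach used in the original post:

```r
# Sketch: flatten a nested JSON object into a data frame in one call,
# with no recursion or explicit for loops. Assumes jsonlite is installed;
# the JSON below is a hand-written stand-in for the live feed.
library(jsonlite)

raw <- '{
  "stationBeanList": [
    {"id": 72, "stationName": "W 52 St & 11 Ave",
     "availableBikes": 12, "availableDocks": 27,
     "latitude": 40.767, "longitude": -73.994},
    {"id": 79, "stationName": "Franklin St & W Broadway",
     "availableBikes": 3, "availableDocks": 30,
     "latitude": 40.719, "longitude": -74.007}
  ]
}'

# fromJSON simplifies the nested list of records straight into a data frame.
stations <- fromJSON(raw)$stationBeanList
stations[, c("id", "availableBikes", "latitude", "longitude")]
```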

One minor annoyance was that I ended up with a data frame of factors (rather than numerics or character vectors), so I had to go through and manually convert the columns into something friendlier.

Clustering mostly empty stations

I decided to try out k-means clustering, as it generally performs well. One small problem is that the algorithm requires the user to specify the number of clusters up front. To choose it, I used the standard practice of plotting the within-groups sum of squares against the number of clusters and picking the point where an “elbow” appears, which in this case was around four (using only slightly modified code from here):
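A sketch of that elbow calculation, in the style of the widely circulated Quick-R snippet; the coordinates here are synthetic stand-ins for the station data, not the real feed:

```r
# Elbow heuristic: compute the total within-groups sum of squares for a
# range of k and look for the bend in the curve.
set.seed(42)
coords <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 5),  ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# For k = 1 the within-groups SS is just the total variance of the data.
wss <- (nrow(coords) - 1) * sum(apply(coords, 2, var))
for (k in 2:10) {
  wss[k] <- sum(kmeans(coords, centers = k)$withinss)
}

plot(1:10, wss, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-groups sum of squares")
```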

Then, we just attach the kmeans cluster assignments back to the input data frame and plot the result:
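Roughly, that step looks like this; again the longitude/latitude values are synthetic placeholders for the station data frame, and the choice of four centers follows the elbow estimate above:

```r
# Sketch: run kmeans, attach the assignments to the input data frame,
# and color the points by cluster.
set.seed(42)
coords <- data.frame(
  longitude = rnorm(60, mean = rep(c(-74.00, -73.95, -73.90), each = 20), sd = 0.01),
  latitude  = rnorm(60, mean = rep(c(40.70, 40.75, 40.80),   each = 20), sd = 0.01)
)

fit <- kmeans(coords, centers = 4)
coords$cluster <- factor(fit$cluster)  # cluster label per station

plot(coords$longitude, coords$latitude, col = coords$cluster,
     pch = 19, xlab = "Longitude", ylab = "Latitude")
```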

And there it is – a super basic visualization of the geographic clusters of missing bikes. The next step is to translate those clusters into graphs (with distances as edge weights) and then find the most central point of each graph.