Picking up from the previous post, the next step of the original plan is to represent the station clusters that we built as a network graph.
The first step here is to explode out each cluster into a data format that can be turned into a graph. I used R's combn function to accomplish this:
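A rough sketch of what that function might look like, based on the walkthrough below. The name cluster.to.edges, the lat/lon column names, and the from/to output columns are placeholders of mine, and the haversine.distance helper is defined a bit further down:

```r
# Sketch: explode a cluster of stations into pairwise edges with distances,
# keeping only pairs within walking distance of each other.
# Assumes `stations` is a data.frame with station names as row names and
# columns named `lat` and `lon` (these names are placeholders).
cluster.to.edges <- function(stations, max.dist = 0.5) {
  # All pairwise combinations of station names: a 2 x n matrix
  pairs <- combn(rownames(stations), 2)

  # Haversine distance (in miles) for each pair of stations
  dists <- apply(pairs, 2, function(pair) {
    haversine.distance(stations[pair[1], "lat"], stations[pair[1], "lon"],
                       stations[pair[2], "lat"], stations[pair[2], "lon"])
  })

  # Transpose the 2 x n matrix into an n x 2 data.frame of edges
  edges <- as.data.frame(t(pairs), stringsAsFactors = FALSE)
  names(edges) <- c("from", "to")
  edges$distance <- dists

  # Keep only pairs within a (totally unscientific) half-mile walk
  edges[edges$distance <= max.dist, ]
}
```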
This function is a bit complex, so let's map out exactly what is going on here. The function takes in a data.frame whose row names are station names and that has a column of latitudes and one of longitudes. The combn function returns a 2 x n matrix, where n is the number of pairwise combinations of stations. I then use an apply call to calculate the Haversine distance between the two points in each combination. The haversine.distance function is given as:
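A minimal version of that helper, assuming coordinates in decimal degrees and a result in miles:

```r
# Haversine distance: great-circle distance between two points given in
# decimal degrees, returned in miles.
haversine.distance <- function(lat1, lon1, lat2, lon2, radius = 3959) {
  to.rad <- function(deg) deg * pi / 180

  d.lat <- to.rad(lat2 - lat1)
  d.lon <- to.rad(lon2 - lon1)

  # Standard haversine formula
  a <- sin(d.lat / 2)^2 +
       cos(to.rad(lat1)) * cos(to.rad(lat2)) * sin(d.lon / 2)^2
  c <- 2 * atan2(sqrt(a), sqrt(1 - a))

  # Earth's radius of ~3959 miles gives the result in miles
  radius * c
}
```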
Once I have the Haversine distance between every combination of points, I simply convert the output to a data.frame and transpose it, turning the 2 x n matrix into an n x 2 data.frame. Finally, I discard point combinations that are more than half a mile apart (a totally unscientific guess about how far people might be willing to walk to get a bike) and return the final data.frame. With this function, we are able to turn a set of points into data that can be represented as a network graph.
Now that we have the data in a form that can be represented as a network graph, we can go ahead and use the wonderful igraph package to plot the network representation and perform calculations on it. Note that for the rest of the post, I am using one of the four clusters identified in the last post.
First, let’s load up the data and plot it to see what it looks like:
Now that we have this network representation, we can go ahead and apply some centrality measures to try and understand how to use this graph to make recommendations about which stations to fill first. Two centrality measures spring to mind as good fits here: degree centrality (measuring how many ties a node has) and possibly closeness centrality (measuring how long it might take to traverse from a given node to the other nodes). The igraph package provides the ability to calculate both of these measures, so we can take a look at the output and see how each measure stacks up against what we might expect.
First up, let’s take a look at degree centrality. Degree centrality basically just measures how many edges touch a certain node. This seemingly simple method gives us a good approximation of where the most “central” nodes are located.
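One way to look at it, sizing each vertex by its degree (the exact plotting choices are mine):

```r
# Degree centrality: the number of edges touching each station.
deg <- degree(g)

# Size each vertex by its degree to make the most connected stations pop out
plot(g, vertex.size = deg, vertex.label = NA)
```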
Examining this graph, we see that while the measure does a good job of finding central points in the network, it is biased towards one side of the graph, which means that taking the most central nodes according to this measure might end up leaving parts of the city uncovered. However, this can be worked around in the future using a function that builds a list of "close" nodes and only takes nodes that aren't "close" to each other.
Next up, let's take a look at closeness centrality. Because a closeness centrality score ends up between 0 and 1, we have to do some transformations in order to see how the measure works graphically.
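A sketch of that transformation; rescaling by the maximum score is just one arbitrary way to get the values into a usable vertex-size range:

```r
# Closeness centrality: small fractional scores, so rescale before plotting
close <- closeness(g)

# Map the largest score to a reasonable vertex size
plot(g, vertex.size = close / max(close) * 15, vertex.label = NA)
```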
Looking at this graph, we see that the closeness measure actually seems to favor nodes that "connect" the larger parts of the graph. Because of this, the conceptually simpler degree centrality measure looks like the better choice going forward.
Now that we have our clusters and our centrality measure, all we have to do is parse out the data from the graph structure and nicely wrap up the results. Look for these in future posts!