In the last post, we used the RateBeer API and Exploratory to gather and analyze data that matched beer styles to specific types of glassware. That analysis surfaced some interesting patterns, in particular which glass styles are tied to the most popular (or most obscure) beer types. In this follow-up, we’re going to do some data wrangling using Exploratory, with the end goal of prepping the data for ingestion into Gephi, where we will build a network graph.
Why do we want to build a network graph for this data? Basically, so we can depict relationships between entities in a highly visual manner. This can help viewers to more easily understand connections within the data, and draw conclusions from their observations. The graphs can also be designed to be interactive, allowing users to engage with the data and potentially discover further insights.
While Exploratory can perform many statistical and analytic tasks, we’re going to use it for the relatively mundane task of data wrangling this time around. Our primary goal is to convert the data into a format that can be easily read into Gephi. In network graph parlance, we need nodes and edges as the basis for any subsequent graphs. In this example, there are two kinds of nodes:
- Beer style nodes
- Glassware nodes
In our case, we know that beer styles will only connect to glassware types, and vice versa. This will ultimately form what is known as a bipartite network, where two distinct sets of nodes connect to one another but have no intra-group connections (no beer styles connecting to other beer styles, for example).
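To make the bipartite idea concrete, here is a minimal Python sketch that checks whether an edge list only ever links a beer style to a glassware type. The `is_bipartite` helper and the placeholder node names are hypothetical and are not part of Exploratory, Gephi, or the RateBeer data.

```python
def is_bipartite(edges, styles, glasses):
    """Return True if every edge joins one beer style to one glassware type."""
    styles, glasses = set(styles), set(glasses)
    if styles & glasses:
        return False  # a node cannot belong to both groups
    return all(
        (a in styles and b in glasses) or (a in glasses and b in styles)
        for a, b in edges
    )

# Placeholder names for illustration only, not real RateBeer data.
styles = {"Style A", "Style B"}
glasses = {"Glass X", "Glass Y"}
good_edges = [("Style A", "Glass X"), ("Style B", "Glass X"), ("Style B", "Glass Y")]
bad_edges = good_edges + [("Style A", "Style B")]  # a style-to-style link

print(is_bipartite(good_edges, styles, glasses))
print(is_bipartite(bad_edges, styles, glasses))
```

The second edge list fails the check because it contains a link between two beer styles, which is exactly the intra-group connection a bipartite network rules out.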
Our first step will be to create the edges file for use in the network graph. Exploratory has a handy function called unnest, which converts the JSON data in the glasses column into something more useful for our purposes. Here’s the dialog for unnest, where we only need to provide the source column and the data separator:
Now that the data has been converted, Exploratory has provided all we need to create edges for our upcoming graph:
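For readers who want to see what this step amounts to outside Exploratory, here is a rough Python sketch. It assumes the glasses column holds a separator-delimited string and that each row pairs a style name with that string; the `unnest_glasses` helper, the comma separator, and the sample rows are all hypothetical stand-ins rather than the actual RateBeer data or Exploratory’s implementation.

```python
def unnest_glasses(rows, sep=","):
    """Expand each (style, glasses) row into one (style, glass) row per glass."""
    edges = []
    for style, glasses in rows:
        for glass in glasses.split(sep):
            glass = glass.strip()
            if glass:  # skip empty entries left by stray separators
                edges.append((style, glass))
    return edges

# Made-up sample rows, for illustration only.
sample = [
    ("Style A", "Glass X, Glass Y"),
    ("Style B", "Glass Y"),
]
edges = unnest_glasses(sample)
for style, glass in edges:
    print(style, "->", glass)
```

Each output row holds exactly one style and one glass, which is the shape an edge list needs.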
Next, we’ll use Exploratory to create a set of beer style nodes, made remarkably easy by the Summarize function. Note that we have previously filtered the data to display only beer types; a separate branch does the same for glassware styles. Here’s a look at the Summarize approach for beer types:
We simply group by the name column and take the average of the number of beers value (if we used sum, we would double count many of the values). Our beer styles data is now ready for exporting to Excel, where it can be merged with the glassware style nodes. We now do the same for glassware styles:
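The group-and-average step can be sketched in a few lines of Python. This assumes, as the double-counting remark suggests, that a name can appear on several rows carrying the same count; the `summarize_nodes` helper and the numbers below are hypothetical and only illustrate the idea.

```python
from collections import defaultdict
from statistics import mean

def summarize_nodes(rows):
    """Group (name, count) rows by name and average the counts per name."""
    groups = defaultdict(list)
    for name, count in rows:
        groups[name].append(count)
    return {name: mean(counts) for name, counts in groups.items()}

# Made-up rows: "Style A" appears twice with the same count.
rows = [("Style A", 120), ("Style A", 120), ("Style B", 45)]
nodes = summarize_nodes(rows)

# Summing instead would inflate any name that appears more than once.
summed_style_a = sum(count for name, count in rows if name == "Style A")
print(nodes["Style A"], summed_style_a)
```

Averaging identical repeated values returns the original value, while summing them inflates it by the number of repeats, which is why the average is the safer choice here.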
Our final data wrangling will take place in Excel, where we can do some simple formatting and naming to make the data friendly for Gephi. We’ll combine the two node files into a single tab-delimited source file; we’ll also edit the edges file to contain source and target attributes, an essential step for importing the data into Gephi. Here’s a look at the edges file:
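For anyone who prefers a script to Excel, the same formatting can be sketched in Python. Gephi’s spreadsheet import looks for Source and Target columns in an edges table and an Id column (optionally with Label) in a nodes table. The `write_nodes` and `write_edges` helpers, the extra Type column, the choice of tab as the delimiter for both files, and the sample values are assumptions made for illustration.

```python
import csv
import io

def write_nodes(nodes, out):
    """Write (label, node_type) pairs as a tab-delimited nodes table."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Id", "Label", "Type"])
    for label, node_type in nodes:
        # The label doubles as the Id, so edges can refer to nodes by name.
        writer.writerow([label, label, node_type])

def write_edges(edges, out):
    """Write (source, target) pairs as a tab-delimited edges table."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)

# Placeholder values; an in-memory buffer stands in for a real file.
node_buf, edge_buf = io.StringIO(), io.StringIO()
write_nodes([("Style A", "beer_style"), ("Glass X", "glassware")], node_buf)
write_edges([("Style A", "Glass X")], edge_buf)
print(node_buf.getvalue())
print(edge_buf.getvalue())
```

To write real files, pass a handle opened with `open(path, "w", newline="")` in place of the `StringIO` buffers.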
Now that our prep is done, we’re ready for the fun part, where we actually build the graph. That will be the focus of our next post on this topic. As always, thanks for reading!