In this entry in our informal series tracking data from the RateBeer API, we’ll take a look at using Exploratory for some simple text analysis. The goal here is to pair specific descriptive words from user reviews with their associated score. There are some nuances involved here, as reviewers will offer multiple terms to describe a beer, but I’m still hoping that we can get a directional sense for which words are more correlated with higher (or lower) scores.
I’m going to walk through several steps in some level of detail:
- Gather the data from the RateBeer API
- Save the data to a .json file format
- Read the data into Exploratory
- Tokenize the text into individual terms
- Filter the text to remove stopwords
- Group the data by the tokens
- Run summarize functions for counts and average scores
- Filter the tokens for the most frequently used terms
- Create charts showing our results
Whew! Sounds like a lot of steps, and probably a lot of work. Certainly, if I were attempting to do this using R code, it might be, especially given my lack of skill with R scripting :). However, as we’ll see shortly, Exploratory makes this process rather quick and simple. Alright, let’s get started.
Step 1: Gather the RateBeer data
In our previous posts, we walked through the general process of gathering data from the RateBeer API. We will follow the same general framework, except now we need to extract user comments along with their ratings for a specific beer. Our focus beer is once again Bell's Two Hearted Ale, one of America's signature IPAs, which will yield many available reviews. However, we can't simply request all reviews from the API at once, as there is a limit of 100 per pull. So we start with the first 100 (the most recent reviews) and then use some logic to pull the next 100, and the 100 after that. Here's our query:
query {
  beerReviews(beerId: 1502, first: 100) {
    totalCount
    items {
      author {
        id
      }
      beer {
        name
      }
      id
      score
      comment
      createdAt
      updatedAt
    }
  }
}
Our next 100 simply requires requesting the reviews beyond the final id value of the previous pull, using the after argument.
query {
  beerReviews(beerId: 1502, first: 100, after: 11214363) {
    totalCount
    items {
      author {
        id
      }
      beer {
        name
      }
      id
      score
      comment
      createdAt
      updatedAt
    }
  }
}
When it’s all said and done (and edited for .json specs), I’ll have 300 user reviews, which will give us a solid base for this level of analysis.
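The paginated pulls above follow a simple pattern, so they could also be scripted. Here's a minimal Python sketch that just builds the query strings for each page; actually sending them would require the real RateBeer endpoint and API key, which aren't shown here, so the network step is omitted.

```python
# Sketch: build the paginated beerReviews queries used in this post.
# Only the query construction is shown; the endpoint/API key needed to
# actually send these requests are not part of this example.

REVIEW_FIELDS = """
    totalCount
    items {
      author { id }
      beer { name }
      id
      score
      comment
      createdAt
      updatedAt
    }"""

def build_query(beer_id, first=100, after=None):
    """Return a beerReviews query, adding the 'after' id for pages 2+."""
    after_arg = f", after: {after}" if after is not None else ""
    return (f"query {{\n  beerReviews(beerId: {beer_id}, first: {first}"
            f"{after_arg}) {{{REVIEW_FIELDS}\n  }}\n}}")

# First pull, then the next 100 after the last id of the previous pull:
page1 = build_query(1502)
page2 = build_query(1502, after=11214363)
```

Each subsequent page just feeds the last id from the previous page into the after argument, exactly as in the second query above.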
Step 2: Save the .json file
This is an easy one, once we’ve made the minor edits to meet .json syntax specifications (removing redundant braces and brackets from our 2nd & 3rd data pulls). Here’s a glimpse of our data, showing the first two of our 300 reviews:
{
  "data": {
    "beerReviews": {
      "totalCount": 3817,
      "items": [
        {
          "author": {
            "id": "417272"
          },
          "beer": {
            "name": "Bell's Two Hearted Ale"
          },
          "id": "11329643",
          "score": 3.9,
          "comment": "On tap at One Mile House. Pours orange gold. Pine, grapefruit, pineapple, touch of papaya, orange peel, nice lingering bitterness. Good body. Very nice.",
          "createdAt": "2018-11-11T23:46:45.377Z",
          "updatedAt": "2018-11-11T23:48:15.827Z"
        },
        {
          "author": {
            "id": "356459"
          },
          "beer": {
            "name": "Bell's Two Hearted Ale"
          },
          "id": "11308335",
          "score": 4.2,
          "comment": "11/4/18 (Chicago): Canned 9/25/18, purchased at Bell's General Store earlier in the day, 6 pack cans $11.50 w/ tax. Can poured into pint glass. Slighty dark golden color, puffy head, thick lacing, solid carbonation. Citrus pine aroma, sharp pleasant, sweet. Very well balanced, citrus pine taste, dry hops, bitter nice. Dry creamy mouth feel, solid presence, smooth drinking. Overall excellent IPA, glad I stopped at Bell's Brewery for this one!",
          "createdAt": "2018-11-05T00:27:51.220Z",
          "updatedAt": "2018-11-05T00:27:51.220Z"
        },
Our key information can be found in the score and comment fields.
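Pulling those two fields out of the nested structure is straightforward in any language. Here's a short Python sketch using a trimmed-down stand-in for the saved file:

```python
import json

# A trimmed-down stand-in for the saved .json file described above.
raw = """{
  "data": {
    "beerReviews": {
      "totalCount": 3817,
      "items": [
        {"id": "11329643", "score": 3.9,
         "comment": "Pine, grapefruit, pineapple, nice lingering bitterness."},
        {"id": "11308335", "score": 4.2,
         "comment": "Citrus pine aroma, very well balanced."}
      ]
    }
  }
}"""

# Navigate data -> beerReviews -> items, then keep (score, comment) pairs.
reviews = json.loads(raw)["data"]["beerReviews"]["items"]
pairs = [(r["score"], r["comment"]) for r in reviews]
```

With the real file, the same two lines of extraction would yield all 300 score/comment pairs.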
Step 3: Read the data into Exploratory
As noted in our previous post, Exploratory is very adept at identifying .json data, and provides a specific option to import it. As anyone who has worked with .json data is aware, it’s not always this simple! Here’s the dialog window:

Step 4: Tokenize the text
This is where the fun begins! Exploratory provides many R-based functions for working with text fields. In our example, we'll use these to begin analyzing the comment field and its values. Using the drop-down menu on the comment field, we can view the many text operations:

We’ll select the Tokenize Text option, which gives us this screen:

We can choose to tokenize by sentences, paragraphs, lines, or words, among other options. In our case, we select words. This gives us a new column named ‘token’, which provides single words that are still associated with their original user review:
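Conceptually, word tokenization just splits each comment into words while keeping the review's id and score attached to every word. A rough Python equivalent of what Exploratory produces here (Exploratory itself generates R under the hood; the regex below is an illustrative simplification):

```python
import re

# Two sample reviews standing in for our 300.
reviews = [
    {"id": "11329643", "score": 3.9, "comment": "Pours orange gold. Very nice."},
    {"id": "11308335", "score": 4.2, "comment": "Citrus pine taste, bitter nice."},
]

# One row per (review, token), lowercased -- each token stays linked
# to its original review's score.
rows = [
    {"id": r["id"], "score": r["score"], "token": tok.lower()}
    for r in reviews
    for tok in re.findall(r"[A-Za-z']+", r["comment"])
]
```

Each comment expands into multiple rows, one per token, which is exactly the shape the screenshot below shows.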

Step 5: Filter the text to remove stopwords
A necessity in text analysis is making sure our data makes sense, and it's best to do that cleanup before creating any charts. Removing stopwords is one of the most useful cleanup techniques, as it gets rid of the connector words ("a", "the", "and", and so on) that would clutter our charts without adding any insight. So let's use Exploratory for this process. First we remove any useless characters from our tokens using the str_clean function.

Next, we remove stopwords using a filter. In this case, we use the English language option, since most of our reviews are in that language:

We also want to have only alphabetic values for our tokens. Incidental data about the price of a beer or spurious emojis will not add much to our analysis!

So now we have some very clean data we can work with for our ultimate analysis. Bear in mind, Exploratory provides many other text operations for deeper analysis.
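The same cleanup can be sketched in a few lines of Python. Note that the stopword set below is a tiny illustrative subset; Exploratory's built-in English stopword list is far larger.

```python
# Tiny illustrative stopword set -- Exploratory's English list is much larger.
STOPWORDS = {"a", "the", "and", "of", "at", "on", "very", "into"}

tokens = ["pours", "orange", "gold", "very", "nice", "6", ":)", "the"]

clean = [
    t for t in tokens
    if t.isalpha()          # alphabetic only: drop prices, emojis, etc.
    and t not in STOPWORDS  # drop connector words
]
```

The isalpha() check handles the same concern raised above: incidental prices and spurious emojis are filtered out along with the stopwords.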
Step 6: Group the data by tokens
In order to summarize our tokens to see which words are most informative, we first need to tell Exploratory how we wish to group the data. This is very simply done, using the dropdown menu:

We then select the token field, and move on to summarizing the data.
Step 7: Run summarize functions for counts and average scores
Now that we have the data grouped at the token level, we can easily run calculations. First we do a simple row count that will tell us the frequency for each word:

Followed by a mean calculation on the score column:

Next, we’re ready to finalize our analysis.
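Exploratory handles the group/summarize step through its menus, but the underlying logic is a single pass over the token rows. A plain-Python sketch of the same count and mean-score calculations:

```python
from collections import defaultdict

# (token, score) pairs, one per token occurrence, standing in for our data.
rows = [("pine", 3.9), ("citrus", 4.2), ("pine", 4.2), ("nice", 3.9)]

# Accumulate a count and running total per token...
stats = defaultdict(lambda: {"count": 0, "total": 0.0})
for token, score in rows:
    stats[token]["count"] += 1
    stats[token]["total"] += score

# ...then convert the totals into mean scores.
summary = {
    tok: {"count": s["count"], "mean_score": s["total"] / s["count"]}
    for tok, s in stats.items()
}
```

The result is one row per token with its frequency and average score, mirroring the summarized table in the screenshots.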
Step 8: Filter the tokens for the most frequently used terms
While we now have our calculations complete, we still have nearly 2,300 rows of summary data, a bit unwieldy for charting. To address this, I’ll create a simple filter that keeps only tokens with 50 or more mentions:

Our data set is now down to a more manageable 34 rows. Now we can proceed to putting this in visual form via some charts.
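The filter itself is a one-liner, assuming a mapping from each token to its count and mean score:

```python
# Summarized tokens (illustrative values, not the actual results).
summary = {
    "pine":   {"count": 120, "mean_score": 4.01},
    "taste":  {"count": 95,  "mean_score": 3.93},
    "papaya": {"count": 3,   "mean_score": 4.10},
}

# Keep only frequently mentioned tokens (50+ occurrences).
frequent = {tok: s for tok, s in summary.items() if s["count"] >= 50}
```

Rarely used words like "papaya" drop out, leaving only the terms with enough mentions to chart meaningfully.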
Step 9: Create charts showing our results
For our initial chart, let's try a histogram showing the number of words that fall into each average-score range. Histograms are often very useful for quickly understanding the composition of a dataset. Here's ours:

We can see a roughly bell-shaped distribution here, with 15 of the 34 words associated with ratings between 3.9 and 3.95 (on a 5-point scale). The distribution does skew to the right, with more words landing in the 3.95-to-4 and 4-to-4.05 buckets than below 3.9. All in all, this is a rather narrow range, indicating generally highly favorable scores for the most frequently used words.
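For readers without Exploratory, the same histogram can be reproduced by bucketing each word's mean score into 0.05-wide bins. A small sketch (sample scores are illustrative, not the actual results):

```python
from collections import Counter

# Mean scores per frequent token (illustrative sample values).
mean_scores = [3.91, 3.93, 3.97, 4.02, 3.88, 3.94]

def bucket(score):
    """Lower edge of the 0.05-wide bin containing the score."""
    # Work in hundredths to avoid floating-point floor-division surprises.
    return int(round(score * 100)) // 5 * 5 / 100

histogram = Counter(bucket(s) for s in mean_scores)
```

Counting scores per bucket this way gives the bar heights of the histogram above.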
For our final chart, we'll actually opt for a color-coded table of results; I found this works better for this example than the bar chart I would typically use. This lets us see which individual words are linked to higher (or lower) scores. The (partial) results:

Let’s go to a full screen view:

Darker colors in this case represent higher average scores. The top scoring words are ipa, can, balanced, hop, malt, and well. Hmm…I may need to go back and remove some of these, as they don’t really tell us much about the characteristics of the beer. Let’s look further.
Close behind, we see terms like floral, orange, grapefruit, and pine. Now we’re getting more to the essence of what makes Two Hearted Ale one of the top IPAs. One other note – just as the histogram predicted, the overall score distribution is very tight, ranging from 3.83 to 4.05, so there is probably little that separates most of these terms from a statistical viewpoint, at least based on 300 reviews.
Thus ends our foray into using Exploratory for basic text analysis, and beginning to understand what users like about Bell’s Two Hearted Ale. I hope you enjoyed this, and thanks for reading!