
Bell’s Two Hearted Text Mining

Posted on November 30, 2018 (updated December 5, 2018) by kc2519

In this entry in our informal series tracking data from the RateBeer API, we’ll take a look at using Exploratory for some simple text analysis. The goal here is to pair specific descriptive words from user reviews with their associated score. There are some nuances involved here, as reviewers will offer multiple terms to describe a beer, but I’m still hoping that we can get a directional sense for which words are more correlated with higher (or lower) scores.

I’m going to walk through several steps in some level of detail:

  1. Gather the data from the RateBeer API
  2. Save the data to a .json file format
  3. Read the data into Exploratory
  4. Tokenize the text into individual terms
  5. Filter the text to remove stopwords
  6. Group the data by the tokens
  7. Run summarize functions for counts and average scores
  8. Filter the tokens for the most frequently used terms
  9. Create charts showing our results

Whew! Sounds like a lot of steps, and probably a lot of work. Certainly, if I were attempting to do this using R code, it might be, especially given my lack of skill with R scripting :). However, as we’ll see shortly, Exploratory makes this process rather quick and simple. Alright, let’s get started.

Step 1: Gather the RateBeer data

In our previous posts, we walked through the general process of gathering data from the RateBeer API. We will follow the same general framework, except now we need to extract user comments along with their ratings for a specific beer. Our focus beer is once again Bell’s Two Hearted Ale, one of America’s signature IPAs, which should yield plenty of reviews. However, we can’t simply request all reviews from the API, as each pull is limited to 100 reviews. So we start with the first 100 (the most recent reviews) and then use some simple pagination logic to pull the next 100, and the 100 after that. Here’s our query:

query {
  beerReviews(beerId: 1502, first: 100) {
    totalCount
    items {
      author {
        id
      }
      beer {
        name
      }
      id
      score
      comment
      createdAt
      updatedAt
    }
  }
}

Getting the next 100 simply requires requesting reviews beyond the final id value of the previous pull, using the after argument.

query {
  beerReviews(beerId: 1502, first: 100, after: 11214363) {
    totalCount
    items {
      author {
        id
      }
      beer {
        name
      }
      id
      score
      comment
      createdAt
      updatedAt
    }
  }
}

When all is said and done (and the output is edited to meet .json specs), I’ll have 300 user reviews, which gives us a solid base for this level of analysis.
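If you’d rather script the three pulls, here’s a minimal sketch in R using the httr and jsonlite packages. Fair warning: the endpoint URL, the x-api-key header, and the fetch_reviews helper are my own placeholders for illustration, so substitute whatever endpoint and credentials RateBeer issued with your API key.

library(httr)
library(jsonlite)

endpoint <- "https://api.ratebeer.com/v1/api/graphql"  # assumed endpoint
api_key  <- "YOUR_API_KEY"                             # placeholder credential

# Hypothetical helper: pull one batch of up to 100 reviews, optionally
# starting after a given review id (mirroring the queries above)
fetch_reviews <- function(after = NULL) {
  after_arg <- if (is.null(after)) "" else paste0(", after: ", after)
  q <- sprintf('query {
    beerReviews(beerId: 1502, first: 100%s) {
      items { id score comment createdAt updatedAt }
    }
  }', after_arg)
  res <- POST(endpoint,
              add_headers(`x-api-key` = api_key),
              body = list(query = q), encode = "json")
  fromJSON(content(res, as = "text"))$data$beerReviews$items
}

batch1 <- fetch_reviews()                            # most recent 100
batch2 <- fetch_reviews(after = tail(batch1$id, 1))  # the next 100
batch3 <- fetch_reviews(after = tail(batch2$id, 1))  # and 100 more
reviews <- rbind(batch1, batch2, batch3)             # 300 reviews total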

Step 2: Save the .json file

This is an easy one, once we’ve made the minor edits to meet .json syntax specifications (removing redundant braces and brackets from our 2nd & 3rd data pulls). Here’s a glimpse of our data, showing the first two of our 300 reviews:

{
  "data": {
    "beerReviews": {
      "totalCount": 3817,
      "items": [
        {
          "author": {
            "id": "417272"
          },
          "beer": {
            "name": "Bell’s Two Hearted Ale"
          },
          "id": "11329643",
          "score": 3.9,
          "comment": "On tap at One Mile House. Pours orange gold. Pine, grapefruit, pineapple, touch of papaya, orange peel, nice lingering bitterness. Good body. Very nice.",
          "createdAt": "2018-11-11T23:46:45.377Z",
          "updatedAt": "2018-11-11T23:48:15.827Z"
        },
        {
          "author": {
            "id": "356459"
          },
          "beer": {
            "name": "Bell’s Two Hearted Ale"
          },
          "id": "11308335",
          "score": 4.2,
          "comment": "11/4/18 (Chicago): Canned 9/25/18, purchased at Bell’s General Store earlier in the day, 6 pack cans $11.50 w/ tax. Can poured into pint glass. Slighty dark golden color, puffy head, thick lacing, solid carbonation. Citrus pine aroma, sharp pleasant, sweet. Very well balanced, citrus pine taste, dry hops, bitter nice. Dry creamy mouth feel, solid presence, smooth drinking. Overall excellent IPA, glad I stopped at Bell’s Brewery for this one!",
          "createdAt": "2018-11-05T00:27:51.220Z",
          "updatedAt": "2018-11-05T00:27:51.220Z"
        },

Our key information can be found in the score and comment fields.

Step 3: Read the data into Exploratory

As noted in our previous post, Exploratory is very adept at identifying .json data, and provides a specific option to import it. As anyone who has worked with .json data is aware, it’s not always this simple! Here’s the dialog window:

Importing the .json data
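Outside of Exploratory, the same import takes only a couple of lines of R with jsonlite; the file name two_hearted_reviews.json is just my placeholder for wherever you saved the merged data.

library(jsonlite)

raw <- fromJSON("two_hearted_reviews.json", flatten = TRUE)
reviews <- raw$data$beerReviews$items  # one row per review
reviews[1:2, c("score", "comment")]    # the two fields we care about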

Step 4: Tokenize the text

This is where the fun begins! Exploratory provides us with many R commands for working with text fields. In our example, we’ll need these to begin analyzing the comment field and its respective values. Using the drop-down menu from the comment field, we can view the many text operations:

Viewing text operations

We’ll select the Tokenize Text option, which gives us this screen:

Tokenize text options

We can choose to tokenize by sentences, paragraphs, lines, or words, among other options. In our case, we select words. This gives us a new column named ‘token’, which provides single words that are still associated with their original user review:

Creating a tokens column
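Exploratory is generating R behind the scenes here. A rough equivalent using the tidytext package (my substitution, not necessarily the exact function Exploratory calls) would look like this; note that unnest_tokens defaults to word-level tokens, lowercases the text, and strips punctuation:

library(dplyr)
library(tidytext)

# One row per word, with the score carried along from the parent review
tokens <- reviews %>%
  unnest_tokens(token, comment)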

Step 5: Filter the text to remove stopwords

One of the necessities in text analysis is making sure our tokens carry real information, and it’s a great idea to do this cleanup before creating any charts. Removing stopwords is one of the most useful cleanup techniques, as it gets rid of all the connector words ("a", "the", "and", and so on) that would clutter our charts without adding any insight. So let’s use Exploratory for this process. First, we remove any useless characters from our tokens using the str_clean command.

Cleaning text strings

Next, we remove stopwords using a filter. In this case, we use the English language option, since most of our reviews are in that language:

Stopword filtering

We also want to have only alphabetic values for our tokens. Incidental data about the price of a beer or spurious emojis will not add much to our analysis!

Retaining only alphabetic characters

So now we have some very clean data we can work with for our ultimate analysis. Bear in mind, Exploratory provides many other text operations for deeper analysis.
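In tidytext/stringr terms, this cleanup boils down to an anti-join against a stopword list plus a letters-only filter. Again, this is a sketch of the idea rather than the exact code Exploratory generates:

library(stringr)

clean_tokens <- tokens %>%
  anti_join(stop_words, by = c("token" = "word")) %>%  # drop English stopwords
  filter(str_detect(token, "^[a-z]+$"))                # keep alphabetic tokens only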

Step 6: Group the data by tokens

In order to summarize our tokens to see which words are most informative, we first need to tell Exploratory how we wish to group the data. This is very simply done, using the drop-down menu:

Grouping by tokens

We then select the token field, and move on to summarizing the data.

Step 7: Run summarize functions for counts and average scores

Now that we have the data grouped at the token level, we can easily run calculations. First we do a simple row count that will tell us the frequency for each word:

Row count by token

Followed by a mean calculation on the score column:

Mean score by token
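Steps 6 and 7 collapse into a single dplyr pipeline: group by token, then summarize with a row count and a mean of the score column (the count and avg_score column names are my own):

token_summary <- clean_tokens %>%
  group_by(token) %>%
  summarize(count = n(),
            avg_score = mean(score, na.rm = TRUE))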

Next, we’re ready to finalize our analysis.

Step 8: Filter the tokens for the most frequently used terms

While we now have our calculations complete, we still have nearly 2,300 rows of summary data, a bit unwieldy for charting. To address this, I’ll create a simple filter that keeps only tokens with 50 or more mentions:

Filtering for tokens with 50 or more mentions

Our data set is now down to a more manageable 34 rows. Now we can proceed to putting this in visual form via some charts.
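The equivalent dplyr filter is a one-liner:

frequent_tokens <- token_summary %>%
  filter(count >= 50)  # 34 tokens survive in this dataset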

Step 9: Create charts showing our results

For our initial chart, let’s try a histogram that shows the number of words corresponding with average score ranges. Histograms are often very useful for quickly understanding the composition of a dataset. Here’s ours:

Average score histogram

We can see a roughly bell-shaped distribution here, with 15 of the 34 words associated with average ratings between 3.9 and 3.95 (on a 5-point scale). The distribution does skew to the right, with more words landing in the 3.95 to 4.0 and 4.0 to 4.05 buckets than below 3.9. All in all, this is a rather narrow range, indicating generally favorable scores for the most frequently used words.
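If you’re following along in R, a ggplot2 histogram with 0.05-wide bins reproduces this view; the bin width is my choice to match the buckets discussed above.

library(ggplot2)

ggplot(frequent_tokens, aes(x = avg_score)) +
  geom_histogram(binwidth = 0.05, fill = "steelblue", color = "white") +
  labs(x = "Average score (5-point scale)", y = "Number of words")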

For our final chart, we’ll actually opt for a color-coded table of results; I found this worked better for this example than the bar chart I would typically use. This will let us see which individual words are linked to higher (or lower) scores. The (partial) results:

Token average scores

Let’s go to a full screen view:

Token scores full page view

Darker colors in this case represent higher average scores. The top-scoring words are ipa, can, balanced, hop, malt, and well. Hmm…I may need to go back and remove some of these, as they don’t really tell us much about the characteristics of the beer. Let’s look further.

Close behind, we see terms like floral, orange, grapefruit, and pine. Now we’re getting more to the essence of what makes Two Hearted Ale one of the top IPAs. One other note: just as the histogram suggested, the overall score distribution is very tight, ranging from 3.83 to 4.05, so there is probably little that separates most of these terms from a statistical standpoint, at least based on 300 reviews.
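For anyone still scripting along, a geom_tile chart is one way to approximate this color-coded table in ggplot2, with darker fills for higher average scores; the color endpoints are arbitrary choices on my part.

ggplot(frequent_tokens,
       aes(x = "avg score", y = reorder(token, avg_score), fill = avg_score)) +
  geom_tile() +
  scale_fill_gradient(low = "#cfe8ff", high = "#08306b") +
  labs(x = NULL, y = NULL, fill = "Avg score")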

Thus ends our foray into using Exploratory for basic text analysis, and beginning to understand what users like about Bell’s Two Hearted Ale. I hope you enjoyed this, and thanks for reading!







