How Was Your Stay? Exploring Hawai’i Island TripAdvisor Reviews with NLP

Have you ever wondered what your customers were saying about your business? Probably seems like a silly question, right? After all, the “voice of the customer” should be important to any business and is the raison d’être for anyone in market research or user experience. Yet after spending years in this field, I began to wonder: “Are we really getting valuable insights from our surveys and focus groups?”

While the market research industry seems to be unwilling to part with these bread-and-butter methodologies, there are alternatives today that may not have been as easy to explore prior to modern computing power or the advent of the Internet. Today, consumers are ready and willing to offer their opinions for free! Of course, I’m speaking about online reviews.

Online reviews are an important source of data for businesses and consumers alike. For businesses, reviews provide an ongoing flow of information and sentiment, while for consumers they can be highly influential in purchase decisions. Today it’s fairly easy to tap into this river of product/service reviews through software tools…or build your own.

Python and Natural Language Processing

One positive evolutionary change that should be noted about the market research industry is that there has finally been an acceptance that emotion is the driver of behavior. The expression of feelings can be a much more powerful indicator of future behavior than a “Very Likely” checkbox. Yet the problem remains: how does one capture emotion?

Sentiment analysis is one technique: a means of parsing out emotions from written (and now spoken) words. What better way to understand a consumer’s feelings than in their own words? Granted, the stalwart open-ended question has existed forever in surveys and IDIs (in-depth interviews), but oftentimes:

  • The costly process of coding for topic/emotion extraction means that researchers typically only cherry-pick a few illustrative responses to include in reports — and discard the rest
  • While popular for visualization, word clouds sacrifice context and are often the only option in the tool chest for displaying text data
  • As with any direct questioning technique, there is inherent bias introduced by the researcher in the wording of the question itself which can lead the response and miss out on other important factors

Today, consumers can openly voice their opinions online in a free-form, undirected manner. This allows them to speak on whatever concerns they may have without the potential bias of a researcher (yet to be fair, bias is never completely removed from research — it can still be imposed during the analysis phase).

Python has become the de facto programming language of choice for data science and offers a variety of packages for processing human language. Natural language processing (NLP) arguably has more approaches and techniques than most other machine learning subfields and is a significant component of modern artificial intelligence applications. What can NLP teach us about visitors’ stays at hotels and resorts on, say, Hawai’i Island?

Gathering Reviews

Several websites and services exist that allow users to obtain product/service reviews (e.g. shopping.com, shopzilla.com, webhose.io), but typically for a price. TripAdvisor’s API is designed only for embedding data into third-party websites, and it does not include the review text itself. We’ll have to obtain it by another means.

Web scraping, or gathering and storing website content via a “bot,” is sometimes frowned upon by web admins, who can (and will) put controls in place if they detect excessive activity from a single IP address. Luckily, this wasn’t an issue here: using the Python package Scrapy, around 41,500 text reviews, headlines, bubble ratings, review dates, and hotel names for Hawai’i Island were collected, covering over 14 years. Not bad for about an hour of programming!
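For illustration, a stripped-down spider might look like the sketch below. The start URL and CSS selectors are hypothetical placeholders, since TripAdvisor’s markup changes often and would need to be inspected before running anything:

```python
# Minimal Scrapy spider sketch. The URL and CSS selectors are placeholders
# and would need to be verified against TripAdvisor's live markup.
import scrapy

class HotelReviewSpider(scrapy.Spider):
    name = "hotel_reviews"
    start_urls = ["https://www.tripadvisor.com/EXAMPLE-Hawaii-Island-Hotels"]  # placeholder
    custom_settings = {"DOWNLOAD_DELAY": 2}  # throttle requests to be polite

    def parse(self, response):
        # Yield one record per review block on the page
        for review in response.css("div.review-container"):
            yield {
                "hotel": response.css("h1::text").get(),
                "headline": review.css("span.review-headline::text").get(),
                "text": review.css("p.review-text::text").get(),
                "bubbles": review.css("span.bubble-rating::attr(class)").get(),
                "date": review.css("span.rating-date::attr(title)").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse)
```

Running `scrapy runspider hotel_reviews.py -o reviews.json` would write the scraped records to a JSON file for the steps that follow.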

Prepping for Processing

Anyone who’s worked with data knows that raw data is almost always messy. TripAdvisor reviews are no different, but unstructured text is a different kind of messy, one that NLP attempts to address. For this, the Python package NLTK (Natural Language Toolkit) was used:

  • All text in each review was made lowercase. This is called case folding, and it ensures an NLP algorithm treats words like “Resort” and “resort” (the former could appear at the beginning of a sentence) as the same. Granted, this doesn’t help with context, since “resort” can mean both “hotel” and “the action of turning to and adopting a strategy or course of action,” but for a demo, this is fine.
  • Text can also be partitioned so a computer knows what a sentence is and what an individual word is. This is called tokenization, and it turns sentences/words into variables that can be used during the model-building process. “What a great hotel” tokenized by word would be “What”, “a”, “great”, “hotel.” This allows an algorithm to count how many times a word appears in a single review or across the entire collection of reviews, a key step in building Naive Bayes/Bag-of-Words/TF-IDF models for classifying documents into categories.
  • Part-of-Speech Tagging assigns words to word types like “nouns,” “verbs,” “adjectives,” etc. This makes it easier to extract informative components of sentences like noun phrases (e.g. “wonderful spa”, “beautiful pool”, “horrible restaurant” are all examples of simple noun phrases: an adjective followed by a noun). Noun phrases can be used to identify sentiments as they apply to specific features of a product/service, in this case hotel amenities (see the sketch after this list).
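A minimal sketch of these three steps using NLTK (the sample review is invented, and the tokenizer and tagger models need a one-time download):

```python
# Sketch: case folding, tokenization, and POS tagging with NLTK.
# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

review = "What a great hotel. The staff kept the beautiful pool spotless."

# 1. Case folding: "Resort" and "resort" become the same token
lowered = review.lower()

# 2. Tokenization: split into sentences, then into words
sentences = nltk.sent_tokenize(lowered)
tokens = [nltk.word_tokenize(s) for s in sentences]

# 3. Part-of-speech tagging: "JJ" marks adjectives, "NN*" marks nouns
tagged = [nltk.pos_tag(t) for t in tokens]

# Simple noun phrases: an adjective followed directly by a noun
for sent in tagged:
    for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
        if t1.startswith("JJ") and t2.startswith("NN"):
            print(w1, w2)  # e.g. "great hotel", "beautiful pool"
```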

NLP has other preprocessing options depending on the application. For example, stemming takes words like “hot,” “hotter,” and “hottest” and strips suffixes so all three reduce to “hot” (lemmatization is a related technique that maps words to their dictionary form instead). The purpose is similar to case folding: we don’t want to treat all three as different words when they convey the same idea.

Also, stopword removal can be used to remove common words like “I,” “the,” “a,” and so on from text. These words appear very frequently but convey no information and can skew word frequency counts.
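A quick sketch of both techniques (the example words are illustrative; the wordnet and stopwords resources need a one-time download):

```python
# Sketch: stemming, lemmatization, and stopword removal with NLTK.
# One-time setup: nltk.download("wordnet"); nltk.download("stopwords")
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming strips suffixes by rule; the result isn't always a real word
print(stemmer.stem("pools"))    # -> "pool"
print(stemmer.stem("staying"))  # -> "stay"

# Lemmatization maps to a dictionary form; pos="a" treats the word as an adjective
print(lemmatizer.lemmatize("hottest", pos="a"))  # -> "hot"

# Stopword removal drops high-frequency, low-information words
tokens = ["i", "loved", "the", "pool", "and", "the", "water", "slide"]
print([t for t in tokens if t not in stopwords.words("english")])
# -> ['loved', 'pool', 'water', 'slide']
```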

What Are Reviewers Talking About?

TripAdvisor allows reviewers to rate things like “cleanliness” and “value” but it might be more helpful to see what they want to talk about themselves. One way to begin this process is to extract nouns from the reviews:
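As a sketch, such a tally can be built by POS-tagging every review and counting the noun tokens (here `reviews` is assumed to be the list of scraped review strings):

```python
# Sketch: count the most frequent nouns across all reviews.
# Assumes `reviews` is a list of review strings loaded from the scraped data.
from collections import Counter
import nltk

noun_counts = Counter()
for review in reviews:
    tokens = nltk.word_tokenize(review.lower())
    for word, tag in nltk.pos_tag(tokens):
        if tag.startswith("NN"):  # NN/NNS/NNP/NNPS are the noun tags
            noun_counts[word] += 1

# The 50 most common nouns, ready for charting
for noun, count in noun_counts.most_common(50):
    print(f"{noun}: {count}")
```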

The resulting visualization (included in the notebook linked at the end) depicts the 50 most frequently used nouns in our 41,500 reviews. Some are more helpful than others. For instance, “i” is the most frequent, although it would have been removed had we eliminated stopwords. “Hotel,” “room,” and “beach” come next, but their value is minimal (since all beaches in Hawai’i are public, we can’t really attribute beach sentiments to a hotel). “Pool” and “staff,” however, could be useful.

How Do Reviewers Feel?

NLTK comes with a built-in sentiment analyzer called VADER (Valence Aware Dictionary and sEntiment Reasoner). VADER works a lot like other sentiment analysis algorithms: sentences are “read” and words (or groups of words occurring together, called n-grams) are matched to a lexicon of words, each with a sentiment “score” tied to it. This score indicates how positive, neutral, or negative a word/word group is. In VADER it’s normalized between -1 (entirely negative) and 1 (entirely positive), with neutral in the middle. A review is “read” by the algorithm collectively, and the scores are assigned and averaged out to ultimately give the review a positive, neutral, or negative sentiment.

The lexicon VADER uses was initially constructed using human raters to calculate the scores. That means VADER has some interesting features, a few of which are demonstrated in the sketch after this list:

  • It can use punctuation like “!” and “!!” to effectively increase a score in either direction so “Great” is positive, “Great!” is a little more positive, and “Great!!” is still more positive
  • It can use ALL CAPS to increase a score in either direction so “horrible” is negative and “HORRIBLE” is more negative
  • It accounts for intensifiers, so “The food was good” is positive and “The food was really good” is more positive
  • It can take into account conjunctions like “but” to sense two sentiments within the same sentence
  • It can take into account n-grams to tell if sentence sentiment “flips” because of word placement as in “This resort really isn’t all that great.”
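A quick way to see these heuristics in action is to score a few of the phrases above (the vader_lexicon resource needs a one-time download):

```python
# Sketch: VADER's compound score on phrases illustrating its heuristics.
# One-time setup: nltk.download("vader_lexicon")
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
examples = [
    "Great", "Great!", "Great!!",                     # punctuation boosts
    "horrible", "HORRIBLE",                           # all caps boosts
    "The food was good", "The food was really good",  # intensifier boosts
    "This resort really isn't all that great.",       # negation flips
]
for text in examples:
    print(f"{text!r}: {sia.polarity_scores(text)['compound']:.3f}")
```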

Earlier it was shown that reviewers frequently mentioned words like “pool” and “staff.” For demonstrative purposes, let’s also look at other high-frequency nouns like “food,” “service,” and “location.” We extract all sentences in our collection of reviews that contain these five words, score each one with VADER, then average the scores across all sentences.
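A sketch of that scoring loop, again assuming `reviews` holds the scraped review strings:

```python
# Sketch: average VADER sentiment for sentences mentioning selected amenities.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

amenities = ["pool", "staff", "food", "service", "location"]
sia = SentimentIntensityAnalyzer()
scores = {a: [] for a in amenities}

for review in reviews:
    for sentence in nltk.sent_tokenize(review):
        words = set(nltk.word_tokenize(sentence.lower()))
        hits = [a for a in amenities if a in words]
        if hits:
            compound = sia.polarity_scores(sentence)["compound"]
            for a in hits:
                scores[a].append(compound)

for amenity, vals in scores.items():
    if vals:  # a minimum mention count could also be enforced here
        print(f"{amenity}: mean={sum(vals) / len(vals):.3f} (n={len(vals)})")
```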

Generally speaking, most sentences that mention these amenities are positive (remember, the scale goes from -1 to 1). This isn’t totally surprising, since the majority of Hawai’i Island hotel reviews on TripAdvisor are given bubble ratings of “5 out of 5.”

We can get a little more context by using another feature of the NLTK package called concordance, which shows some of the sentences that contain these amenities in an easy-to-read view:
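The concordance call itself is a one-liner once the tokens are wrapped in an NLTK Text object (again assuming the `reviews` list):

```python
# Sketch: show "pool" in context across all review text with NLTK's concordance.
import nltk
from nltk.text import Text

all_tokens = nltk.word_tokenize(" ".join(reviews).lower())
Text(all_tokens).concordance("pool", width=80, lines=10)  # keyword in context
```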

Now there is a little additional context:

  • “lovely pool, great waterview from many rooms”
  • “pool was out of service for half our stay”
  • “loved the pool and water slide”

Examining Individual Hotels

While this info is interesting, it would probably be more useful in a real-life setting if you could limit reviews to just one hotel (additionally, since the reviews scraped off TripAdvisor contain dates, you may want to narrow them by that too). Let’s look at the Hilton Waikoloa for example:
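As a sketch, with the scraped records loaded into pandas, narrowing by hotel and date takes only a few lines (the “hotel,” “date,” and “text” column names are assumptions carried over from the hypothetical spider above):

```python
# Sketch: filter the scraped reviews to one hotel and an optional date range.
import pandas as pd

df = pd.read_json("reviews.json")  # output of the Scrapy sketch above
df["date"] = pd.to_datetime(df["date"])

hilton = df[df["hotel"].str.contains("Hilton Waikoloa", case=False, na=False)]
recent = hilton[hilton["date"] >= "2015-01-01"]  # optional date filter
reviews = recent["text"].tolist()                # feed into the earlier steps
```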

Similar nouns appear for Hilton, but let’s substitute “service” and “location” with “lagoon” and “tram” (these are two amenities that are unique to the Hilton Waikoloa). Again, most reviews are favorable for these amenities.

Hilton also receives mostly “5 out of 5” bubble ratings so the positive sentiments make sense and in a way validate the VADER algorithm.

But remember that these nouns/amenities have been cherry-picked from the list of the top 50 most frequently mentioned. This introduces our own biases into the analysis, and it would be better if we could do more to reduce that.

Rather than choose a few nouns/amenities, let’s just look at all nouns. The goal is to get a little more differentiation. Keep in mind, though, that there isn’t a stringent threshold for positive, neutral, and negative sentiments in the VADER scores, so it’s up to the end user to determine where those lines fall.

When looking at all nouns (with at least 100 mentions, to remove outliers), there is a little more differentiation. For instance, “KPC” refers to Hilton’s “Kamuela Provision Company,” the hotel’s flagship restaurant. “Turtle,” “turtles,” and “waterfall” all refer to the on-premise lagoon (sea turtles frequent the area).

Nouns with less positive scores include “manager,” “hhonors” (Hilton’s rewards program), “carpet” and “smell” (which seem to indicate cleanliness issues), and “kirin,” which refers to another on-premise restaurant that is no longer open (the low scores could reflect problems that led to Kirin’s closing, or guests complaining that it closed; further investigation would be necessary).

Visualizing Sentiment Word Choices

A lot of work in NLP revolves around classifying documents into categories based on how words are used differently (or not at all) depending on the category. When analyzing TripAdvisor reviews, we aren’t really concerned with classification, since the bubble rating a reviewer gave a hotel effectively provides that label. Still, it might be interesting to see how positive and negative reviews differ in terms of word choice.

The Python package Scattertext provides an intuitive means of visualizing word usage differences between dichotomous categories; in our case, we’ll compare 1 bubble vs. 5 bubble reviews. Others may choose different ways of grouping positive/negative reviews (e.g. a “top-two/bottom-two box score,” or even using the review sentiment scores themselves to create custom partitions), but in survey research, most 5-point Likert scale responses tend to fall in the middle unless an individual feels very strongly about whatever they are being asked.

I suggest reading the Scattertext documentation to understand the use of Scaled F-Scores in creating this visualization, but basically, it calculates a score for each word based on the number of times it appears within a category and overall across categories, with an additional scaling factor. The resulting scatterplot places words mostly unique to one category toward the corners (upper left for 5 bubble words, lower right for 1 bubble words).
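A minimal sketch following the pattern in the Scattertext documentation (the DataFrame and its “bubbles” and “text” columns are assumptions carried over from earlier):

```python
# Sketch: Scattertext comparison of 1 bubble vs. 5 bubble review vocabulary.
import scattertext as st

subset = df[df["bubbles"].isin([1, 5])].copy()
subset["label"] = subset["bubbles"].map({1: "1 bubble", 5: "5 bubbles"})

corpus = st.CorpusFromPandas(
    subset,
    category_col="label",
    text_col="text",
    nlp=st.whitespace_nlp_with_sentences,  # lightweight built-in tokenizer
).build()

html = st.produce_scattertext_explorer(
    corpus,
    category="5 bubbles",
    category_name="5 bubbles",
    not_category_name="1 bubble",
    width_in_pixels=1000,
)
with open("review_scattertext.html", "w") as f:
    f.write(html)  # interactive chart opens in a browser
```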

When looking at the full 14 years of reviews, it becomes clear that the biggest complaints in 1 bubble reviews are around cleanliness. Scattertext also allows a user to enter a specific word and extract sentences where it is mentioned. For example, “stains” appears, on average, zero times per 25,000 terms in 5 bubble reviews but 4 times per 25,000 terms in 1 bubble reviews.

For the interactive version of the Scattertext charts and the code used to produce this demo, visit the Jupyter Notebook here (takes a second to load).

Pros and Cons

Comparing NLP review analysis to surveys and focus groups in market research isn’t completely apples-to-apples — there are both benefits and drawbacks:

  • While bias can never be completely eliminated, the free-form environment of online reviews can reduce bias imposed by fellow focus group participants and researchers themselves.
  • The cost of incentivizing survey/focus group participants is becoming prohibitive, whereas the exercise above can be implemented for the cost of a data solutions professional or purchased off the shelf. There’s no need to pay reviewers…they are willing to offer their opinions for free, and contrary to popular belief, they aren’t just complaining!
  • An NLP tool is always on. An organization doesn’t need to prepare a survey/focus group, field it, process it, and report on it, which can take weeks if not months. Reviews can be scraped as part of a big data ETL system, then cleaned, processed, and reported on in near real time.

There are also drawbacks:

  • As we’ve seen above, NLP doesn’t always eliminate ambiguity or provide clear sentiment, and it doesn’t allow the “deeper dive” that a survey/focus group sometimes can (i.e. the why behind the what).
  • Depending on what’s being reviewed, there can be a large discrepancy between positive and negative reviews. In this case, the positive far outweigh the negative. While that’s a good thing all in all, it makes it harder to pinpoint customer pain points accurately, which is ultimately the goal of this type of analysis.
  • There is a tendency in business to “not fix what isn’t broke.” If an NLP system is set up and left to run without monitoring, there is no guarantee it will keep performing well without maintenance. Errors could go unseen, and bad decisions could be made (or not made) based on unreliable outcomes due to something “broken.”