Fly the Frustrating Skies

Sentiment Analysis of the U.S. Airline Industry

Brave New World

A couple of decades ago, sentiment analysis and topic modeling were niche pastimes of computational linguistics nerds and got little public attention. But with online review and social media sites such as Yelp, Amazon, Facebook and Twitter spewing a constant stream of text data onto the internet, natural language processing (NLP) techniques are quickly becoming some of the most important algorithms in the data scientist's toolkit.

Recently, Twitter has become an important source of information for many brands. With more and more companies developing a social media presence, marketers and brand managers are constantly seeking information on what people are saying about their products and whether it is good or bad (without, you know, having to read thousands or millions of tweets). For my fourth Metis project, I compared the performance of various sentiment analyzers to find the most effective strategy for identifying customer complaints directed at the Twitter accounts of six major U.S. airlines (United, American Airlines, US Airways, Southwest, Delta, and Virgin America).

Out-of-the-Box Sentiment Analyzers: You Get What You Code For

The Python TextBlob library provides an easy, "out-of-the-box" sentiment analyzer. It requires only a single line of code: you simply plop the tweet (or any other text) into the TextBlob analyzer, and it returns a sentiment "polarity" between -1 and 1, with -1 being extremely negative and 1 being extremely positive.

To begin my analysis, I pulled 500 of the most recent tweets to United Airlines and calculated the TextBlob sentiment polarity for each. The polarity distribution is below:

Wait a minute, that doesn't seem right. The above histogram indicates that the majority of tweets are neutral, but my personal experience with United suggests that most consumers hate the airline with a fiery passion. Surely the sentiment should be skewed more towards the negative. Sure enough, if we manually review a handful of the tweets, we see that TextBlob is not picking up on the negativity of many tweeters.

Alternate Approaches

A little web research led me to two viable alternatives to TextBlob:

  • The Valence Aware Dictionary and Sentiment Reasoner (VADER): Similar to TextBlob, VADER uses a lexicon with assigned polarities to evaluate sentiment. However, VADER's lexicon has been specifically designed to evaluate sentiment in microblog content (such as tweets), and includes slang, emoticons, and other features commonly used in tweets that may be absent from traditional lexicons.

  • Supervised Text Classification: I also identified two potential sources of human-labeled data which I could use to train a supervised text classifier: (a) a corpus of 10,000 tweets included in the Python Natural Language Toolkit (NLTK) library, labeled as positive or negative, and (b) a dataset of more than 10,000 tweets directed to U.S. airlines in February 2015, labeled as positive, neutral, or negative, sourced from micro-tasking platform Crowdflower's "Data for Everyone" library.

Performance Comparison

Now that we've identified several different sentiment analyzers, it's time to put them to the test! Specifically, I wanted to investigate:

  1. Whether supervised text classification would outperform unsupervised, lexicon-based methods, and

  2. If so, whether the topic of the training data matters: will training a supervised text classifier on airline-specific tweets yield better results than training it on tweets about any topic?

The accuracies of our sentiment analyzers on a test set of 30% of the Crowdflower data are shown below. Since the unsupervised, lexicon-based approaches return a continuous polarity as opposed to a binary positive/negative classification, I selected the positive-negative cutoff that maximized accuracy. For our supervised classifiers, I used three common text classification algorithms (Bernoulli Naive Bayes, Multinomial Naive Bayes, and Linear SVM) with two common text feature extraction methodologies (count vectorization and term frequency-inverse document frequency (TF-IDF) vectorization).
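The cutoff selection for the lexicon-based analyzers can be sketched in plain Python. This is one possible way to do it, not necessarily the exact procedure used in the project: the best_cutoff function, the rule "predict positive when polarity is above the cutoff", and the toy data are all assumptions for illustration.

```python
from typing import Sequence, Tuple


def best_cutoff(polarities: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Find the polarity cutoff that maximizes accuracy.

    A text is predicted positive (1) when its polarity is strictly greater
    than the cutoff, and negative (0) otherwise. Returns (cutoff, accuracy).
    """
    if not polarities or len(polarities) != len(labels):
        raise ValueError("polarities and labels must be non-empty and equal in length")

    # Candidate cutoffs: one below every score (everything positive),
    # plus each distinct observed score.
    candidates = [min(polarities) - 1.0] + sorted(set(polarities))

    best = (candidates[0], -1.0)
    for cutoff in candidates:
        correct = sum((p > cutoff) == bool(y) for p, y in zip(polarities, labels))
        accuracy = correct / len(labels)
        if accuracy > best[1]:  # keep the first cutoff reaching the best accuracy
            best = (cutoff, accuracy)
    return best


# Toy polarities and labels (1 = positive, 0 = negative), made up for illustration.
toy_polarities = [-0.6, -0.2, 0.0, 0.0, 0.1, 0.5, 0.8]
toy_labels = [0, 0, 0, 0, 0, 1, 1]

cutoff, accuracy = best_cutoff(toy_polarities, toy_labels)
print(f"best cutoff: {cutoff}, accuracy: {accuracy:.2f}")
```

Only the observed scores need to be tried as cutoffs, because accuracy can only change when the cutoff crosses one of them.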

Supervised models do outperform unsupervised ones, but only when trained on a corpus of context-specific material! A Bernoulli Naive Bayes classifier (with count vectorization) or a linear SVM (with TF-IDF vectorization) outperformed the dummy classifier by ~15 percentage points, while the unsupervised methods and the supervised classifiers trained on a general corpus of tweets outperformed the dummy by at most 5 percentage points. These results show how important subject-matter-relevant training data is for sentiment classification.
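A comparison like this one can be set up with scikit-learn along the following lines. This is a minimal sketch, not the project's actual code: the compare_classifiers function, the "most frequent class" dummy strategy, the default hyperparameters, and the toy tweets are all assumptions for illustration.

```python
from itertools import product

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, labels, test_size=0.3, seed=0):
    """Return test-set accuracy for a dummy baseline and each vectorizer/classifier pair."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    scores = {}

    # Baseline: always predict the most frequent training class.
    # The dummy ignores its features, so placeholder zeros are enough.
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    scores["dummy (most frequent)"] = dummy.score(np.zeros((len(y_test), 1)), y_test)

    vectorizers = {"count": CountVectorizer, "tfidf": TfidfVectorizer}
    classifiers = {
        "bernoulli_nb": BernoulliNB,
        "multinomial_nb": MultinomialNB,
        "linear_svm": LinearSVC,
    }
    # Fit one pipeline per (vectorizer, classifier) combination.
    for (v_name, vec), (c_name, clf) in product(vectorizers.items(), classifiers.items()):
        model = make_pipeline(vec(), clf())
        model.fit(X_train, y_train)
        scores[f"{c_name} + {v_name}"] = model.score(X_test, y_test)
    return scores


# Toy tweets made up for illustration; real use would load the labeled airline data.
toy_texts = [
    "delayed again and no explanation", "lost my bag, worst service ever",
    "cancelled flight and rude staff", "stuck on the tarmac for hours",
    "never flying with you again", "terrible customer service on hold",
    "great crew and smooth flight", "thanks for the quick rebooking",
    "love the new seats", "friendly staff and on time arrival",
    "best flight I have had in years", "awesome service, thank you",
]
toy_labels = ["negative"] * 6 + ["positive"] * 6

for name, acc in compare_classifiers(toy_texts, toy_labels).items():
    print(f"{name:28s} {acc:.2f}")
```

To test the effect of training-data topic, the same pipelines would be fit on a general tweet corpus and scored on the airline test set instead of splitting a single dataset.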

You can also check out my GitHub repository for this project here.