Movie Review Sentiment Analysis

Eric D. Manley and Timothy M. Urness
Drake University
eric.manley@drake.edu, timothy.urness@drake.edu

Metadata

Summary Students use Rotten Tomatoes movie review data to predict the sentiment of new text.
Topics CS 1 Version: File I/O, early control structures, and accumulators/counters for the minimal version (optionally, it can include methods, arrays, lists, dictionaries, and/or min/max algorithms)
CS 2 Version: Hash Tables, custom classes, string manipulation
Audience CS1/CS2
Difficulty CS 1 Version: Easy.
CS 2 Version: Intermediate.
Strengths
  • Real data
  • Simple algorithm achieves compelling results
  • No special external libraries or software
  • Introduce data science early in the curriculum
  • Very extensible
Weaknesses
  • The naive sentiment analysis algorithm works well enough, but it has limitations. Off the shelf, its false positive rate isn't great; this can be improved by adjusting the cutoff that separates negative scores from positive ones (by default, we use a cutoff of 2, since this is the score of a neutral review).
  • Requires some string tokenizing to score more than a single word at a time. For the CS 1 version, we didn't want to get into that at this point in the semester, so we had students score phrases by averaging scores from a file with one word per line. This obviously isn't quite as rewarding as a program that can score text in live dialogue with a user. In the CS 2 version, we used some hacky tokenization, though this could be extended to use better available libraries.
  • CS 1 Version: to avoid using more advanced data structures, student programs have to read through the same file many times in a given run, which is inefficient (but still not too bad). This can be improved by storing the data in an array, list, or dictionary.
Dependencies None
Variants
  • Two versions of the assignments are given targeting different levels and concepts.
  • Each assignment contains several optional parts depending on which programming concepts need to be emphasized.
  • We have used it in Java, Python, and C++ classes at different points in the semester depending on whether or not we wanted to store the data in some kind of data structure.
  • Challenge students to improve on the naive algorithm, test their results using separate validation data, and compete for best results on an unlabeled test set
  • Improve results with simple ideas: make it case insensitive, remove stop words (common words that don't carry much meaning), etc.
  • Visualize words and phrases based on sentiment

About

This assignment uses movie reviews from the Rotten Tomatoes database to do some simple sentiment analysis. Students will write programs that use the review text and a manually labeled review score to automatically learn how negative or positive the connotations of a particular word are. This can then be used to predict the sentiment of new text with reasonably good results. For example, student programs will be able to read text like this:

The film was a breath of fresh air.

and predict that it is a positive review, while predicting negative sentiment for text like this:

It made me want to poke out my eyeballs.

The data (with some preprocessing by us) comes from a sentiment analysis project at Stanford (which used a much more sophisticated algorithm) and has been used for a Kaggle machine learning competition.
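As a rough illustration of the idea, here is a minimal Python sketch. It is not the assignment's reference solution: it assumes that a word's score is the average score of the labeled reviews it appears in, and that a piece of text is scored by averaging its word scores and comparing the result to the neutral cutoff of 2. The sample reviews, their scores, and all function names are made up for the example, and the real data file format may differ.

```python
from collections import defaultdict

# Hypothetical toy data as (score, review text) pairs.
# These reviews and labels are invented for illustration only.
SAMPLE_REVIEWS = [
    (4, "a breath of fresh air"),
    (3, "fresh and funny"),
    (0, "made me want to poke out my eyeballs"),
    (1, "dull and painful"),
]

# The source describes 2 as the score of a neutral review.
NEUTRAL_CUTOFF = 2.0


def learn_word_scores(reviews):
    """Map each word to the average score of the reviews it occurs in.

    Assumption: every occurrence of a word contributes the score of
    the review that contains it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, text in reviews:
        for word in text.lower().split():
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def score_text(text, word_scores, default=NEUTRAL_CUTOFF):
    """Average the learned scores of the words in text.

    Words never seen in the labeled data are treated as neutral.
    """
    words = text.lower().split()
    if not words:
        return default
    return sum(word_scores.get(w, default) for w in words) / len(words)


def predict(text, word_scores, cutoff=NEUTRAL_CUTOFF):
    """Label text as positive when its score reaches the cutoff."""
    if score_text(text, word_scores) >= cutoff:
        return "positive"
    return "negative"


if __name__ == "__main__":
    scores = learn_word_scores(SAMPLE_REVIEWS)
    for phrase in ("fresh air", "dull eyeballs"):
        print(phrase, "->", predict(phrase, scores))
```

Splitting on whitespace keeps the sketch short; punctuation handling and better tokenization are left out on purpose. Note that a score exactly equal to the cutoff is counted as positive here, which is an arbitrary choice.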

We have provided two examples of projects based on this idea that we have used in a CS 1 course and a CS 2 course, though there are many extensions that could be made for these or other higher-level courses.
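Two of the simpler extensions mentioned under Variants and Weaknesses, normalizing the text and adjusting the cutoff, might look something like the Python sketch below. The stop-word list, the function names, and the example numbers are all hypothetical, and the sketch assumes that a review whose labeled score is below 2 counts as truly negative.

```python
import string

# Hypothetical, deliberately tiny stop-word list; a real one would be longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "it", "was", "to"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase the text, trim punctuation from word edges, drop stop words."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in stop_words:
            tokens.append(word)
    return tokens


def false_positive_rate(scored, cutoff, neutral=2):
    """Fraction of truly negative reviews that would be predicted positive.

    scored is a list of (predicted_score, labeled_score) pairs.
    Assumption: a labeled score below `neutral` means the review is negative,
    and a predicted score at or above `cutoff` means "predicted positive".
    """
    negative_predictions = [pred for pred, label in scored if label < neutral]
    if not negative_predictions:
        return 0.0
    false_positives = sum(1 for pred in negative_predictions if pred >= cutoff)
    return false_positives / len(negative_predictions)


if __name__ == "__main__":
    print(tokenize("The film was a breath of fresh air."))

    # Invented (predicted, labeled) pairs, just to show the effect of the cutoff.
    scored = [(2.3, 1), (1.5, 0), (2.6, 4), (2.1, 0)]
    for cutoff in (2.0, 2.2, 2.5):
        print(cutoff, false_positive_rate(scored, cutoff))
```

Raising the cutoff lowers the false positive rate in this toy example, but on real data it would also turn some true positives into false negatives, so students would want to check any new cutoff against separate validation data.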

Materials