Movie Review Sentiment Analysis

Eric D. Manley and Timothy M. Urness
Drake University
eric.manley@drake.edu, timothy.urness@drake.edu

Metadata

Summary	Students use Rotten Tomatoes movie review data to predict the sentiment of new text.
Topics	CS 1 Version: File I/O, early control structures, and accumulators/counters for the minimal version (optionally, it can include methods, arrays, lists, dictionaries, and/or min/max algorithms). CS 2 Version: Hash tables, custom classes, string manipulation.
Audience	CS1/CS2
Difficulty	CS 1 Version: Easy. CS 2 Version: Intermediate.
Strengths	Real data; simple algorithm achieves compelling results; no special external libraries or software; introduces data science early in the curriculum; very extensible.
Weaknesses	The naive sentiment analysis algorithm works well enough but has limitations. Off the shelf, its false positive rate isn't great; this can be addressed by adjusting the cutoff that separates negative scores from positive ones (by default, we use a cutoff of 2, since this is the score of a neutral review). Scoring more than a single word at a time requires some string tokenizing. For the CS 1 version, we didn't want to get into this at that point in the semester, so we had students score phrases by averaging scores from a file with one word per line. This isn't quite as rewarding as a program that can score text in live dialogue with a user. In the CS 2 version, we used some hacky tokenization, though this could be extended to use better available libraries. CS 1 Version: to avoid more advanced data structures, student programs have to read through the same file many times in a given run, which is inefficient (but still not too bad). This can be improved by storing the data in an array, list, or dictionary.
Dependencies	None
Variants	Two versions of the assignment are given, targeting different levels and concepts. Each version contains several optional parts, depending on which programming concepts need to be emphasized. We have used it in Java, Python, and C++ classes at different points in the semester, depending on whether or not we wanted to store the data in some kind of data structure. Challenge students to improve on the naive algorithm, test their results using separate validation data, and compete for best results on an unlabeled test set. Improve results with simple ideas: make it case insensitive, remove stop words (common words that don't carry much meaning), etc. Visualize words and phrases based on sentiment.

About

This assignment uses movie reviews from the Rotten Tomatoes database to do some simple sentiment analysis. Students will write programs that use the review text and a manually labeled review score to automatically learn how negative or positive the connotations of a particular word are. This can then be used to predict the sentiment of new text with reasonably good results. For example, student programs will be able to read text like this:

The film was a breath of fresh air.

and predict that it is a positive review while predicting negative sentiment for text like this:

It made me want to poke out my eyeballs.
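
The naive approach described above can be sketched in a few lines of Python. This is only an illustration, not the handout's solution: it assumes each line of the data holds an integer score from 0 (negative) to 4 (positive) followed by the review text, uses a tiny made-up sample in place of the real data file, and all function names are hypothetical. A word's score is taken to be the average score of the reviews that contain it, and a piece of text is scored by averaging its known words against the default cutoff of 2.

```python
# Made-up sample lines standing in for the real data file.
# Assumed format: "<score> <review text>", score from 0 (negative) to 4 (positive).
SAMPLE_REVIEWS = [
    "4 a breath of fresh air for the genre",
    "3 fresh and funny , with a warm heart",
    "1 a dull and lifeless film",
    "0 painfully dull from start to finish",
]

NEUTRAL_CUTOFF = 2.0  # the score of a neutral review


def word_score(word, review_lines):
    """Return the average score of the reviews containing `word`, or None if unseen."""
    total = 0
    count = 0
    for line in review_lines:
        score_text, _, text = line.partition(" ")
        if word in text.split():
            total += int(score_text)  # accumulator
            count += 1                # counter
    if count == 0:
        return None
    return total / count


def text_score(text, review_lines):
    """Average the scores of the known words in `text`; neutral if none are known."""
    scores = []
    for word in text.lower().split():
        score = word_score(word, review_lines)  # rescans all reviews for every word
        if score is not None:
            scores.append(score)
    if not scores:
        return NEUTRAL_CUTOFF
    return sum(scores) / len(scores)


def label(score, cutoff=NEUTRAL_CUTOFF):
    """Turn a numeric score into a sentiment label using the given cutoff."""
    if score > cutoff:
        return "positive"
    if score < cutoff:
        return "negative"
    return "neutral"


print(label(text_score("fresh air", SAMPLE_REVIEWS)))  # positive
print(label(text_score("dull film", SAMPLE_REVIEWS)))  # negative
```

Because `word_score` walks every review for each word, this mirrors the repeated-scan cost of the minimal CS 1 version; with the real data, the lines would be read from the file instead of a hard-coded list.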

The data (with some pre-processing from us) is from a Sentiment Analysis project at Stanford (which used a much more sophisticated algorithm) and has been used for a Kaggle machine learning competition.

We have provided two examples of projects based on this idea that we have used in a CS 1 course and a CS 2 course, though there are many extensions that could be made for these or other higher-level courses.

Materials

Movie review data file. We removed all of the partial reviews from the Kaggle data and reformatted it to make it a little easier for students to read into their programs.
CS 1 Assignment Handout. In this assignment, students use the data to determine the sentiment of individual words and practice common early CS 1 concepts like control structures, file I/O, accumulators/counters, min/max algorithm, and methods.
CS 1 Starter Code. This code shows how to read the different fields of the movie review data and search for words within reviews. This is short and can be developed live with students or given ahead of time.
CS 2 Assignment Handout. In this assignment, students predict the sentiment of larger pieces of text. The assignment requires appropriate data structures (e.g. hash tables, custom classes) to increase the search speed and reduce the need for excessive file access.
CS 2 Starter Code. This code shows how to read the movie review data. It also provides the .h files for the custom class and hash table functions that need to be implemented.
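
As a rough illustration of the data-structure idea behind the CS 2 version, the sketch below builds a word table once and then scores whole sentences from it. It is not the provided starter code: a Python dictionary stands in for the hash table and custom class students implement, the data format (an integer score from 0 to 4 followed by the review text) and the sample lines are assumptions, the stop-word list is illustrative only, and every name is hypothetical.

```python
import re

# Made-up sample lines; assumed format is "<score> <review text>".
SAMPLE_REVIEWS = [
    "4 a breath of fresh air for the genre",
    "3 fresh and funny , with a warm heart",
    "1 a dull and lifeless film",
    "0 painfully dull from start to finish",
]

# A tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "the", "of", "and", "for", "to", "from", "with"}


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_word_table(review_lines, stop_words=frozenset()):
    """Map each word to [sum of review scores, number of reviews containing it]."""
    table = {}
    for line in review_lines:
        score_text, _, text = line.partition(" ")
        score = int(score_text)
        # set(...) counts a word once per review, even if it repeats.
        for word in set(tokenize(text)) - stop_words:
            entry = table.setdefault(word, [0, 0])
            entry[0] += score
            entry[1] += 1
    return table


def predict(text, table, cutoff=2.0):
    """Label text by the average score of its known words.

    Returns "unknown" when no word appears in the table; a score exactly
    equal to the cutoff is treated as negative in this sketch.
    """
    scores = [table[w][0] / table[w][1] for w in tokenize(text) if w in table]
    if not scores:
        return "unknown"
    average = sum(scores) / len(scores)
    return "positive" if average > cutoff else "negative"


table = build_word_table(SAMPLE_REVIEWS, STOP_WORDS)
print(predict("The film was a breath of fresh air.", table))       # positive
print(predict("Painfully dull.", table))                           # negative
print(predict("It made me want to poke out my eyeballs.", table))  # unknown
```

With only four sample reviews, the second example sentence from the About section contains no known words, which is why the sketch reports "unknown" for it; the full data set covers far more vocabulary. Raising or lowering `cutoff` is one way to trade false positives against false negatives.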