Movie Review Sentiment Analysis
Eric D. Manley and Timothy M. Urness
Students use Rotten Tomatoes movie review data to predict the sentiment of new text.
CS 1 Version: File I/O, early control structures, and accumulators/counters for the minimal version (optionally, it can include methods, arrays, lists, dictionaries, and/or min/max algorithms)
CS 2 Version: Hash Tables, custom classes, string manipulation
CS 1 Version: Easy.
CS 2 Version: Intermediate.
- Real data
- Simple algorithm achieves compelling results
- No special external libraries or software
- Introduce data science early in the curriculum
- Very extensible
The naive sentiment analysis algorithm works well enough, though it has limitations. Out of the box, its false positive rate is high, but this can be mitigated by adjusting the cutoff that separates negative scores from positive ones (by default, we use a cutoff of 2 since this is the score of a neutral review).
- Scoring more than a single word at a time requires some string tokenizing. For the CS 1 version, we didn't want to get into that at this point in the semester, so we had students score phrases by averaging scores for the words in a file with one word per line. This obviously isn't quite as rewarding as a program that can score text in live dialogue with a user. In the CS 2 version, we used some hacky tokenization, though this could be extended to use better available libraries.
- CS 1 Version: To avoid using more advanced data structures, student programs have to read through the same file many times in a given run, which is inefficient (but still not too bad). This can be improved by storing the data in an array, list, or dictionary.
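Both weaknesses above, the fixed cutoff and the repeated file reads, can be addressed in a few lines. The Python sketch below is illustrative only, not the handout's solution: the names `build_word_scores` and `classify` and the tiny in-memory sample are hypothetical, and it assumes each review is a numeric score (with 2 as neutral) paired with its text.

```python
from collections import defaultdict


def build_word_scores(reviews):
    """Pass over the labeled reviews once and return {word: average score}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for score, text in reviews:
        # Count each word once per review, like a "does the review contain it" search.
        for word in set(text.lower().split()):
            totals[word] += score
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def classify(text, word_scores, cutoff=2.0):
    """Average the scores of known words; words never seen before are skipped."""
    known = [word_scores[w] for w in text.lower().split() if w in word_scores]
    if not known:
        return "unknown"
    average = sum(known) / len(known)
    # Treating an average exactly at the cutoff as negative is an arbitrary choice.
    return "positive" if average > cutoff else "negative"


# Hypothetical sample data (score, text); a real program would read the data file.
reviews = [
    (4, "a fresh and funny film"),
    (3, "a funny if uneven film"),
    (1, "a dull film"),
    (0, "dull and lifeless"),
]
word_scores = build_word_scores(reviews)
print(classify("funny film", word_scores))              # default cutoff of 2
print(classify("funny film", word_scores, cutoff=3.2))  # stricter cutoff
```

Raising the cutoff makes the classifier more reluctant to call text positive, which is one way to trade false positives for false negatives. The dictionary is built in a single pass, so later lookups no longer require rereading the file.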
Two versions of the assignments are given targeting different levels and concepts.
Each assignment contains several optional parts depending on which programming concepts need to be emphasized.
- We have used it in Java, Python, and C++ classes at different points in the semester depending on whether or not we wanted to store the data in some kind of data structure.
Challenge students to improve on the naive algorithm, test their results using separate validation data, and compete for best results on an unlabeled test set
Improve results with simple ideas: make it case insensitive, remove stop words (common words that don't carry much meaning), etc.
Visualize words and phrases based on sentiment
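As a rough illustration of the simple improvements mentioned above (case insensitivity and stop-word removal), the following Python sketch normalizes text before scoring. The `tokenize` function and the short `STOP_WORDS` set are hypothetical; a real stop-word list would be much longer.

```python
import string

# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "it", "was", "to", "is"}


def tokenize(text, stop_words=STOP_WORDS):
    """Lowercase each word, strip surrounding punctuation, and drop stop words."""
    tokens = []
    for raw in text.split():
        word = raw.strip(string.punctuation).lower()
        if word and word not in stop_words:
            tokens.append(word)
    return tokens


print(tokenize("The film was a breath of fresh air."))
# ['film', 'breath', 'fresh', 'air']
```

Applying the same normalization to both the training reviews and the new text keeps "Fresh", "fresh", and "fresh." from being treated as three different words.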
This assignment uses movie reviews from the Rotten Tomatoes database to do some simple sentiment analysis. Students will write programs that use the review text and a manually labeled review score to automatically learn how negative or positive the connotations of a particular word are. This can then be used to predict the sentiment of new text with reasonably good results.
For example, student programs will be able to read text like this:
The film was a breath of fresh air.
and predict that it is a positive review while predicting negative sentiment for text like this:
It made me want to poke out my eyeballs.
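The idea can be sketched in a few lines of Python using only a loop and accumulators, in the spirit of the CS 1 version. This is a minimal sketch under stated assumptions, not the handout's solution: the function names are hypothetical, and it assumes each line of data holds an integer score followed by a space and the review text, which may not match the provided file exactly.

```python
def word_score(word, lines):
    """Average the scores of every review containing the word (accumulator pattern)."""
    total = 0
    count = 0
    for line in lines:
        score_text, _, review = line.partition(" ")
        if word.lower() in review.lower().split():
            total += int(score_text)
            count += 1
    if count == 0:
        return None  # the word never appears in the labeled data
    return total / count


def predict(text, lines, cutoff=2.0):
    """Average the word scores of the text and compare against the neutral cutoff."""
    scores = [word_score(w, lines) for w in text.split()]
    scores = [s for s in scores if s is not None]
    if not scores:
        return "unknown"
    average = sum(scores) / len(scores)
    return "positive" if average > cutoff else "negative"


# Hypothetical stand-in for the data file: "<score> <review text>" per line.
lines = [
    "4 a breath of fresh air",
    "3 fresh and charming",
    "1 it made me yawn",
    "0 i want my money back",
]
print(predict("fresh air", lines))     # words seen only in high-scoring reviews
print(predict("made me want", lines))  # words seen only in low-scoring reviews
```

Each call to `word_score` scans all of the data again, which mirrors the repeated file reads of the minimal CS 1 version.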
The data (with some pre-processing from us) is from a Sentiment Analysis project at Stanford (which used a much more sophisticated algorithm) and has been used for a Kaggle machine learning competition.
We have provided two examples of projects based on this idea that we have used in a CS 1 course and a CS 2 course, though there are many extensions that could be made for these or other higher-level courses.
- Movie review data file. We removed all of the partial reviews from the Kaggle data and reformatted it to make it a little easier for students to read into their programs.
CS 1 Assignment Handout. In this assignment, students use the data to determine the sentiment of individual words and practice common early CS 1 concepts like control structures, file I/O, accumulators/counters, min/max algorithm, and methods.
CS 1 Starter Code. This code shows how to read the different fields of the movie review data and search for words within reviews. This is short and can be developed live with students or given ahead of time.
CS 2 Assignment Handout. In this assignment, students predict the sentiment of larger pieces of text. The assignment requires appropriate data structures (e.g. hash tables, custom classes) to increase the search speed and reduce the need for excessive file access.
CS 2 Starter Code. This code shows how to read the movie review data. It also provides the .h files for the custom class and hash table functions that need to be implemented.