Personalized Book Recommendation System

Michelle Craig
University of Toronto
mcraig@cs.toronto.edu

Overview

Virtually every student has had an online experience where a website makes personalized recommendations in hopes of future sales or ongoing traffic. Amazon.com tells you "Customers Who Bought This Item Also Bought", YouTube makes suggestions for other videos to watch, and NetFlix ran a contest with a million dollar prize for an algorithm that would improve their movie recommendations. In this assignment, students write a program that makes personalized book recommendations using algorithms with increasing levels of sophistication.

After reading a pre-collected set of ratings for a list of books, the program makes recommendations for a particular reader based on a small set of sample ratings from that reader and the preferences of other readers in the community. The assignment was inspired by machine learning research used to predict book preferences for Chapters.Indigo.ca. It provides the opportunity to discuss similarity measures for non-trivial objects and alludes to machine learning techniques.

Meta Information

Summary	Read book ratings for a set of users from a file. Define a similarity measure for any pair of readers. Based on reader-similarity, use ratings from the reader community to recommend new books for a particular user.
Topics	arrays, reading from files. It requires at least a one-dimensional array of Strings and either a 2D array of integers or a 1D array of objects. It could also be designed to use dictionaries and/or include a GUI.
Audience	CS1. It has also been used in senior high-school classes.
Difficulty	This is an intermediate assignment, taking 2 or 3 weeks for a CS1 student. It can be made more or less complicated by allowing the students varying degrees of flexibility in defining their reader-similarity measure and data structures.
Strengths	Students are inspired by the fact that the Netflix contest offered significant money for the real-life version of this problem. They like to see their programs make personalized predictions for themselves and their friends. Instructors like the fact that it can be assigned when students haven't learned about more complex data-structures and that it can be a starting point for talking about other CS ideas: weighted averages, comparing non-trivial objects, designing distance measures between objects and writing comparators, sparse data and machine learning. Instructors also like that is very flexible allowing them to meet other curriculum expectations as needed.
Weaknesses	If the rating scheme and distance measure (between two readers) are left up to the student, they may design something that doesn't make good predictions. Without direction, students tend to design 1-5 or 0-5 based scheme similar to giving a book "3 stars". This makes generating a similarity measure more complicated. If you are giving the students a file of pre-collected ratings, you have to give them the rating scheme. In order to work, the assignment needs real data. You can use the book list and associated ratings files that I've collected or you can recruit a community of people to rate a large set of items. You can't generate the input data randomly. The Netflix contest is still reasonably current, but it will soon be old news. That won't make the assignment less doable, but it won't be as nifty as it becomes less current.
Dependencies	No external libraries required.
Variants	Instead of using the book data that I've provided here, some teachers had students collect their own data set so that the predictions could be about something other than books. It can be a lot of work to get enough data but this can also be a very cool thing to do with a large class. You can encourage each student in the class to submit their ratings with a simple web form. The amount of freedom given to the students to design the rating scheme and the similarity function (between pairs of readers) can be varied. See the cautions in the weaknesses section. Including a GUI component works well, but the assignment is also fine with only a text-based interface. While the prediction algorithms can be implemented using a 2D array of integer ratings, many of these ratings are zero leading to a sparse matrix. One way to make the assignment harder is to require the use of a more sophisticated data-structure to store the ratings data without explicitly storing all the zeros. This can drive home the idea of having different representations for the same data. Reading the input file can be made slightly more difficult by combining the book list data and the ratings data into a single input file that must be parsed by the program.

History

This assignment was originally designed as a final project for the university-track grade 12 computing course in Ontario (ICS4U). It was one component in an initiative by the Toronto District School Board to help CS teachers adopt a new Ontario Ministry of Education curriculum. It was designed to address a number of specific curriculum expectations while not requiring programming concepts beyond the scope of the ministry-defined course. In particular, it does not require the use of a GUI nor does it require objects, dictionaries or other data-structures more sophisticated than a 2D array. The templates for the high school versions of the handouts include a lot of material about the software-development process that was specifically included to address ministry curriculum expectations and omitted when the assignment was reconfigured for CS1.

Data Set

Because this assignment was originally designed for high school students, the data set focuses on this audience. I worked with a community librarian to identify 55 books that in some sense covered the spectrum of books read by typical high school seniors in Canada. I then recruited 86 readers to rate the books in the set using the following rating scheme:

Rating Meaning

-5 Hated it!

-3 Didn't like it

0 Haven't read it

1 ok - neither hot nor cold about it

3 Liked it!

5 Really liked it!

Rating	Meaning
-5	Hated it!
-3	Didn't like it
0	Haven't read it
1	ok - neither hot nor cold about it
3	Liked it!
5	Really liked it!

Here are the resulting data files.

The list of 55 books
Ratings file for the above book list (from 86 different readers)

Solution Code

I have solutions to the assignment in both Java and Python and am happy to distribute them to instructors who are considering using the project. I ask that no solutions be posted on the web.

Example Handouts

Diane Horton gave this assignment at the University of Toronto in CS1 and we are aware that it was given in various local high schools. The CS1 handout below is almost exactly what was used with the omission of instructions about our local submission and marking mechanisms. The high school handouts are templates that were customized by different teachers. Teachers were intended to take material from the teacher supplement handout and paste it into the student handout template as needed.

Other Related Resources

I presented this assignment at a conference for high school teachers using these slides I am willing to share the Keynote source for them if someone wants to customize them and use them for their own presentation.
The web page to contribute your rating for these books to the data set I am happy to share the very-simple webform to use to collect ratings to build your own data set.
Diane Horton provided these sample program to her CS1 students when she assigned this project in Python. Neither of these are fundamental to the basic assignment, but if you use her variation, these files might be helpful.
- pickle_useful.py A program demonstrating the use of Pickle to save data between runs of a program
- sort_demo.py A program demonstrating the use of Python's list.sort function.

Extra Advice

It can be hard to grasp the concept of using only the ratings from the community to predict the future rating for a given user but that is the most interesting computer science idea in the assignment. A surprising number of people claimed that they understood the basis for the assignment, but then clearly didn't. This was obvious when they made comments such as expressing a desire to replace the names of books in the provided data file with the names of current movies (paired at random!)

Extra info about this assignment: