Virtually every student has had an online experience where a website makes personalized recommendations in hopes of future sales or ongoing traffic. Amazon.com tells you "Customers Who Bought This Item Also Bought", YouTube makes suggestions for other videos to watch, and NetFlix ran a contest with a million dollar prize for an algorithm that would improve their movie recommendations. In this assignment, students write a program that makes personalized book recommendations using algorithms with increasing levels of sophistication.
After reading a pre-collected set of ratings for a list of books, the program makes recommendations for a particular reader based on a small set of sample ratings from that reader and the preferences of other readers in the community. The assignment was inspired by machine learning research used to predict book preferences for Chapters.Indigo.ca. It provides the opportunity to discuss similarity measures for non-trivial objects and alludes to machine learning techniques.
Summary |
Read book ratings for a set of users from a file. Define a similarity measure
for any pair of readers. Based on reader-similarity, use ratings from the reader
community to recommend new books for a particular user. |
Topics |
arrays, reading from files. It requires at least a one-dimensional array of Strings and either a 2D array of integers or a 1D array of objects. It could also be designed to use dictionaries and/or include a GUI. |
Audience |
CS1. It has also been used in
senior high-school classes. |
Difficulty |
This is an intermediate assignment, taking 2 or 3
weeks for a CS1 student. It can be made more or less complicated by
allowing the students varying degrees of flexibility in defining their
reader-similarity measure and data structures. |
Strengths |
|
Weaknesses |
|
Dependencies |
No external libraries required. |
Variants |
|
This assignment was originally designed as a final project for the university-track grade 12 computing course in Ontario (ICS4U). It was one component in an initiative by the Toronto District School Board to help CS teachers adopt a new Ontario Ministry of Education curriculum. It was designed to address a number of specific curriculum expectations while not requiring programming concepts beyond the scope of the ministry-defined course. In particular, it does not require the use of a GUI nor does it require objects, dictionaries or other data-structures more sophisticated than a 2D array. The templates for the high school versions of the handouts include a lot of material about the software-development process that was specifically included to address ministry curriculum expectations and omitted when the assignment was reconfigured for CS1.
Because this assignment was originally designed for high school students, the data set focuses on this audience. I worked with a community librarian to identify 55 books that in some sense covered the spectrum of books read by typical high school seniors in Canada. I then recruited 86 readers to rate the books in the set using the following rating scheme:
Rating | Meaning |
---|---|
-5 | Hated it! |
-3 | Didn't like it |
0 | Haven't read it |
1 | ok - neither hot nor cold about it |
3 | Liked it! |
5 | Really liked it! |
I have solutions to the assignment in both Java and Python and am happy to distribute them to instructors who are considering using the project. I ask that no solutions be posted on the web.
Diane Horton gave this assignment at the University of Toronto in CS1 and we are aware that it was given in various local high schools. The CS1 handout below is almost exactly what was used with the omission of instructions about our local submission and marking mechanisms. The high school handouts are templates that were customized by different teachers. Teachers were intended to take material from the teacher supplement handout and paste it into the student handout template as needed.
It can be hard to grasp the concept of using only the ratings from the community to predict the future rating for a given user but that is the most interesting computer science idea in the assignment. A surprising number of people claimed that they understood the basis for the assignment, but then clearly didn't. This was obvious when they made comments such as expressing a desire to replace the names of books in the provided data file with the names of current movies (paired at random!)
Extra info about this assignment: