CSC108H Project, Winter 2010
A Recommendation System

Due someday, somemonth, sometime

Purpose

The project brings together everything you've learned this term to build something that I hope you'll be proud of. Its purpose is to give you practise using dictionaries, creating GUIs, deciding on some aspects of a program's specifications, and designing an entire program from scratch.

Note that 80% of your mark for the project is for completing a basic set of requirements, and the last 20% is for adding enhancements of your choosing.

Introduction: Making recommendations

If you've ever bought a book online, the bookseller's website has probably told you what other books you might like. This is handy for customers, but also very important for business. In September, online movie-rental company Netflix awarded one million dollars to the winners of the Netflix Prize. The competition simply asked for an algorithm that would perform 10% better than their own algorithm. Making good predictions about people's preferences was that important to this company. It is also a very current area of research in machine learning, which is part of the area of computer science called artificial intelligence.

A basic approach

So how might we write a program to make recommendations for books? Consider a user named Rabia. How is it that the program should predict books Rabia might like? The simplest approach would be to make almost the same prediction for every customer. In this case the program would simply calculate the average rating for all the books in the database, sort the books by rating and then from that sorted list, suggest the top 5 books that Rabia hasn't already rated. With this simple approach, the only information unique to Rabia used by the prediction algorithm was whether or not Rabia had read a specific book.

A more sophisticated approach

We could make a better prediction about what Rabia might like by considering her actual ratings in the past and how these ratings compare to the ratings given by other customers. Consider how you decide on movie recommendations from friends. If a friend tells you about a number of movies that s(he) enjoyed and you also enjoyed them, then when your friend recommends another movie that you have never seen, you probably are willing to go see it. On the other hand, if you and a different friend always tend to disagree about movies, you are not likely to go to see a movie this friend recommends.

A program can calculate how similar two users are by treating each of their ratings as a vector and calculating the dot product of these two vectors. (Remember that the dot product is just the sum of the products of each of the corresponding elements.) For example, suppose we had 3 books in our data-base and Rabia rated them [5, 3,-5], Suelyn rated them [1, 5,-3], Bob rated them [5, -3, 5], and Kalid rated them [1, 3, 0]. The similarity between Rabia and Bob is calculated as: (5 x 5) + (3 x -3) + (-5 x 5) = 25 - 9 - 25 = -9. The similarity between Rabia and Suelyn is: (5 x 1) + (3 x 5) + (-5 x -3) = 5 + 15 + 15 = 35. The similarity between Rabia and Kalid is (5 x 1) + (3 x 3) + (-5 x 0) = 5 + 9 + 0 = 14. We see that if both people like a book (rating it with a positive number) it increases their similarity and if both people dislike a book (both giving it a negative number) it also increases their similarity.

Once you have calculated the pair-wise similarity between Rabia and every other customer, you can then identify whose ratings are most similar to Rabia's. In this case Suelyn is most similar to Rabia, so we would recommend to Rabia the top books from Suelyn's list that Rabia hadn't already rated.

Your task for this project is to write a program that takes people's book ratings and makes book recommendations to them using this technique. We will refer to the people who use the program as "members".

Basic requirements

These are the requirements you must satisfy for the first 80% of your grade.

Information

  1. Books

    Your program must keep track of the following information about each book in the system:

  2. Members

    Your program must keep track of the following information about each member:

    Your program must also keep track of which member is currently logged in, if any. (For example, upon program startup or immediately after a member logs out, there is no one currently logged in.)

  3. Ratings

    Your program must keep track of book ratings. Each rating has three pieces of information: the book, the member, and the rating. Each rating must be stored as an int with one of these values: -5, -3, 0, 1, 3, 5. They are to be interpreted as follows:
    Rating Meaning
    -5 Hated it!
    -3 Didn't like it
    0 Haven't read it
    1 ok - neither hot nor cold about it
    3 Liked it!
    5 Really liked it!
    Although the data on which you test your program will likely be somewhat small, in a real system their would be a great many books and a great many users, and no user would have rated a very large fraction of the books. If you picture the ratings data as a table (rows being members and columns being books, with rating values in the cells of the table), most of the table would be empty. We call this "sparse" data. Keep this fact in mind when designing your structure for storing ratings.

Graphical User-Interface (GUI)

Your program must use a GUI for all input and output with the user.

Your program must give feedback indicating the result of every operation the user performs. For example, if the user tries to add a new member with an account name that is already in the system, it might display the message "Sorry -- that account name is already in use. Try again." It will be up to you to decide how you want to handle all the situations that may arise, except in the few cases where I specify the behaviour.

Operations

Below are the basic operations your program must provide. You will get to decide (1) how to handle all the "border" conditions that may arise and (2) what the GUI will look like.

When it starts, your program must load all the existing data (created in previous runs of the program) from a pickle file called data.pkl. (If the file doesn't exist -- because this is the first run of the program -- your program must initialize all the data structures to empty.) And when the user chooses to exit the program, it must automatically save the updated date to that file for use the next time the program runs.

Your program must provide the operations below to the user.

Many details are unspecified. For example, when rating a book, do you choose it from a list or must you type its name? When you ask to see recommended books, how many do you see? Part of your task is to make these decisions. Part of your mark will depend on the quality of those decisions.

Enhancements

I have provided you with a baseline description of the application. You will decide how challenging to make this assignment by extending your application with new features ("enhancements") of your choice. These enhancements will be worth the final 20% of your project grade.

The additional features you implement could be practically anything! Here are a few suggestions:

Please don't ask how many marks your enhancements will earn: we can't evaluate your code before the due date! For good marks on this aspect, we do expect to see significant enhancements.

To receive credit for your enhancements, you must submit a written description that explains how your enhancements went beyond the baseline requirements. Submit your description in a plain text file named enhancements.txt. (To ensure that your file is in plain text, you should use Wing as your editor.) Your submission should not exceed 150 words. If you do not submit this file, with this name, your enhancements will receive a mark of zero.

If you did not add enhancements, submit enhancements.txt but say in it "None"; you will still be able to earn up to 80%.

Implementation tips

You may find that there are a few values that never change, and that you need to use throughout your program. For example, the name of your pickle file will never change while the program runs. When you have an unchanging value, you should use a named variable for it, rather than just sticking the value in everywhere. That makes it easier to change such a value in the future: you just change it in one place rather than hunting around for all occurrences of the value. It is accepted practise to give such variables names that are in all upper case and to define them at the top of the module so that they can be accessed throughout the module without having to pass them around. This is an exception to our general rule against use of global variables, and should only be used for a small number of values that will never change. Generally, all functions should receive the information they need to know in order to do their job through parameters, and should tell back their result through a return value (and possibly also through mutable parameters). This is a critical aspect of good programming style. If you are at all unsure about it, please ask me!

Suggested milestones

To help you break the project down and keep on track, here is a series of suggested milestones.

Milestone 1: Plan

The specifications provided tell you what operations your program must offer, but not all the details of how it should behave under every circumstance. Go through all of the operations and make those decisions. For example, if the user wants to create a new account but gives an account name that has already been taken, what should happen? Write all your decisions down.

Next, plan what your enhancements will be. If you have ambitious plans, you would be wise to have some simpler enhancements in mind, just in case you get behind.

Finally, you should work with your partner to create a schedule for milestone completion. Write it down and refer back to it often.

This would be a good time to check the course discussion board for any relevant postings. Get in the habit of checking it every time you work on the project.

Milestone 2: Write the outline of your program

Write the outline of your recommender program, using the "recipe" we have used in lecture and lab for designing GUI programs. Include all the top-level operations that you want to provide in your basic version. (You can easily add new operations, if you need them for your enhancements, later.)

Once you have the outline written, run the code to see if everything is hooked together properly. You are on your way!

Milestone 3: Decide on your data structures

You will need some structures to represent all the information about books, members, and ratings. Decide on what structures you will use. There will be many options, and part of your grade will be for making good choices. Try to choose structures that will make it easy to perform the kinds of operations you will need to do. For example, will you ever need to search the members by name, to add multiple ratings for the same book, to look up all books that ever received a high rating, or to determine the average rating a user gave?

Milestone 4: Implement the operations

Now write the bodies of the functions for your operations. Remember to use top-down design to break your functions down into coherent pieces that are nice and small. For each helper function, be sure to write the docstring before you write the code. In fact, ideally you will do it in this order: write docstring, design test cases, write the function, test the function.

This milestone will require you to start writing code that involves the data structures you designed. This is a good time to think about how well your data structure(s) are working for you, and consider whether any revisions would be appropriate. Think also about all the special cases that can arise with your operations (e.g., trying to delete a member who doesn't exist), and how you are handling them. You may want to make some revisions.

Remember that each operation should give feedback about what your program just did (or didn't) do. You might display, for instance, "Successfully added new member Diane Horton" or "There is no member Fred Flintstone; unable to delete."

Milestone 5: Thoroughly test your program

Although you have been testing pieces of your program throughout its development, now you should verify that it works properly as a whole.

You should reread the docstring for each function and make sure that you have tested it under every condition that the function is meant to handle. And if you think of cases during your testing that are not mentioned in your docstring, update it.

This would also be a good time to re-read the entire handout and make sure that you have met all the project requirements.

Milestone 6: Implement and test your enhancements

By this time, you have had the experience of implementing the basic requirements and you can make a better estimate as to how much work your planned enhancements will be. Put this together with how many days you have left -- you may not quite be on your planned schedule -- and decide whether to revise your plans. It is crucial that you make this decision explicitly, rather than just cross your fingers and hope it will all work out. One of the biggest mistakes students will tell you they made on their csc108 project is not being realistic about their enhancements.

Now that you have a realistic plan, go ahead and implement your enhancements.

Some final advice

If you develop your program in small, incremental steps, and you test your code as you go, then even if you run out of time at the end, you will still have a working program to hand in. It may not implement every feature, but what it does have will work, which will mean you can get part marks for correctness.

Marking

To mark your assignment, we will run your program by hand. We will also, as always, examine your Python code.

To mark your enhancements, we will read enhancements.txt and run your application to test the features that are described. Only enhancements described in that file will be marked.

These are the aspects of your work that we will focus on in the marking:

What to Hand In

Instructions about local submission mechanism.

Hand in

Do not hand in your data.pkl file. We need to be able to run your code starting with no existing data.