University of Chicago Laboratory Schools
Visualization of a solution: lines represent number of common word
sequences between files. Can you find the plagiarists?
This lab presents a real problem that really requires a software solution...neato!
Catch the plagiarists! Your goal is to try to (quickly) determine the similarities between documents in a set to see if you can find out if plagiarism is going on within the group.
More formally: given a set of text documents (see below), try to determine whether or not, or the degree to which, any pair of documents contains heavily copied portions of text. If one document shares an inordinate number of phrases in common with another, it's likely the document is plagiarized (or is the result of the all-too-often occurrence of finding out that one student "helped" another by giving him a copy of his work, only to have the other turn it in as his own).
Really, all you need is the problem statement and the document sets. I provide here a sample assignment sheet which gives my students some direction if they're feeling totally overwhelmed. It also shows the milestones I use to make sure they're making progress (these milestones really need some work. If you come up with better milestones please email me to let me know what you did).
Really, this assignment is quite accessible for first-year CS students. It's fantastic to watch the students really grapple with the pros and cons of making certain design decisions. Students seem to understand afterward that a.) there are MANY different ways the same problem can be attacked both good and bad and b.) design is hard, but worthwhile. And students really do go after this assignment because there's a hook: they want to find the plagiarized documents in the set that I give them. It's like hide and seek.
I would like to give credit to Ross O'Connell (PhD Candidate, University of Michigan, Physics Dept., Particle Theory Group) for alerting me to this problem.
Assignment can be modified to cover one, a few, or many CS1/CS2 topics including file I/O, text parsing and most importantly, designing and applying data structures. You can even delve into some difficult graphics issues if you want to go that route. Depending on the strategy a student uses issues of sorting may also come to the fore. As a capstone-type project it can require application of all topics typically taught in CS1/2.
Probably appropriate for CS2ish-type-people. But a less-deep version could easily be concocted for the latter part of a CS1 class. The more data structures the students know, the more interesting and rewarding they'll find the assignment.
Difficulty depends how open you leave the assigment. I present it as a real-world problem to my students without any CS mumbo-jumbo and allow them to creatively apply what they know - really, I give them the problem statement, point them to the document sets and let 'em go. This approach is very rewarding for the students but it does take more time.
You could concievably provide students with some of the work already done in order to allow them to focus on one particular aspect of the problem. For example: do the text parsing for them so they can get down to the work of designing data structures.
I give my H.S. students about four weeks to do this assignment with milestones that need to be achieved each week. Two weeks is probably appropriate at the university level. You could conceivably do it in pieces if you want tighter control over where students will go with it.
This assignment seems to hit the sweet spot between a toy-application and real-world problem. As a teacher I love it because it really forces burgeoning CS students to think deeply about choosing an efficient algorithm to solve the problem and designing data structures to support it. It's also a neat way to show how a good algorithm/data structure can make all the difference in the world.
This assignment's greatest strength is that it offers a lot to a highly differentiated group of students and abilities. It's a deep problem that's easy to state, intriguing and do-able for a wide range of students. Students who are just trying to pass can get by with good results, and know-it-alls are confronted with some real-world issues that are challenging to tackle. The most adventurous students also try to produce a graphical GUI and graphical output (as pictured above and in the assignment sheet).
Part of the reason I'm submitting this assignment to Nifty is because I have not found a tremendous weakness, but here are some things that can be difficult:
Requires an introductory understanding of common data structures and algorithms. This assignment is really about intelligent application of common data structures and algorithms - students will be forced to reckon with the implications of choosing one strategy and/or data structure over another. As such, it's helpful to have library classes (like those in java.util) at your disposal.
Variation is built into the assignment since there are so many facests to it. If you don't want your students to do the whole thing from scratch you can have them just do pieces of it to focus on a particular subject. Choices abound at every turn. It can be just an I/O assignment, or just a text parsing assignment. You can provide them with the text parsing already done and make it an assignment to compare/contrast data structures. It can be an assignment about memory management and using the file system to handle large programs. You could do this in parallel! You could simply use the data generated as the basis of an interesting graphics assignment. Lots of options.