Automated authorship detection is the process of using a computer program to analyze a large collection of texts, one of which has an unknown author, and to guess the author of that unattributed text. The basic idea is to use different statistics from the text -- called "features" in the machine learning community -- to form a linguistic "signature" for each text. One example of a simple feature is the number of words per sentence. Some authors may prefer short sentences while others tend to write sentences that go on and on with lots and lots of words and not very concisely, just like this one. Once we have calculated the signatures of two different texts, we can determine their similarity and calculate a likelihood that they were written by the same person.
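As a rough illustration of the words-per-sentence feature, here is a minimal Python sketch. The function name `average_sentence_length` and the rule that a sentence ends at '.', '!' or '?' are assumptions made for this example, not definitions taken from the handout.

```python
import re


def average_sentence_length(text):
    """Return the average number of words per sentence in text.

    Assumption for this sketch: a sentence ends at '.', '!' or '?',
    and words are whatever is separated by whitespace.
    """
    # Split on runs of sentence-ending punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    total_words = sum(len(s.split()) for s in sentences)
    return total_words / len(sentences)


sample = "I came. I saw. I conquered the whole city!"
print(average_sentence_length(sample))  # 9 words over 3 sentences -> 3.0
```

An author who favours long sentences would score noticeably higher on this feature than one who writes tersely.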
In this assignment, students will write a number of small functions that each calculate an individual linguistic feature. They will apply their functions to a mystery text, combine the resulting features into a signature, compare that signature to a set of known signatures for famous authors, and predict the author of the mystery file.
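The step of combining features into a signature might look like the following sketch. The two features shown (average word length and the ratio of different words to total words) and every function name are illustrative assumptions; the handout defines the actual feature set.

```python
import string


def clean_words(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def average_word_length(text):
    """Hypothetical feature: mean number of characters per word."""
    words = clean_words(text)
    return sum(len(w) for w in words) / len(words) if words else 0.0


def different_to_total(text):
    """Hypothetical feature: distinct words divided by total words."""
    words = clean_words(text)
    return len(set(words)) / len(words) if words else 0.0


def make_signature(text, features):
    """Apply each feature function to text and collect the float results."""
    return [feature(text) for feature in features]


FEATURES = [average_word_length, different_to_total]
signature = make_signature("The cat saw the dog.", FEATURES)
print(signature)  # [3.0, 0.8]
```

Keeping each feature as its own small function means the signature is just a list of floats, so nothing more complicated than an array is needed.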
Summary |
Define small functions that each operate on a piece of text and calculate
a linguistic feature represented by a floating point value. Combine these
values to produce an author signature. Define a similarity measure
for pairs of authors. Compare the calculated signature for a mystery text
with signatures for known authors read from input files and predict the author
of the unattributed text. |
Topics |
arrays, reading from files |
Audience |
CS1 |
Difficulty |
This is an intermediate assignment, taking approximately 2 weeks for a CS1 student.
It can be made more or less complicated by varying the particular linguistic features.
The version we gave had five features, none of which required the use of a data structure
more complicated than an array.
|
Strengths |
|
Weaknesses |
|
Dependencies |
No external libraries required. |
Variants |
|
Here is the handout we used at the University of Toronto in Fall 2010.
I used publicly accessible data from Project Gutenberg to calculate linguistic signatures for 13 different authors whose names I hoped my students would recognize. I then picked other works by five of these authors and created "mystery" files. They weren't much of a mystery to the students, since the author's name appears in the first few lines of each file. Using the linguistic features we required of our students, the algorithm correctly guesses the author on four of the five mystery texts. This leaves room to discuss why it doesn't always work and how it could be improved. The weights used in the similarity measure defined in the handout above were hand-tuned to give this result on this data. If you generate additional known-author signatures and new mystery files (which is easy to do!), you may need to experiment with new weights to get the performance you want your students to see.
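To show why the weights matter, here is a sketch of one plausible weighted similarity measure: a weighted sum of absolute feature differences, where a lower score means a closer match. The exact formula is defined in the handout and may differ; the function names, signatures, and weights below are all made up for illustration and are not the values used in the course.

```python
def compare_signatures(sig_a, sig_b, weights):
    """Weighted sum of absolute feature differences (lower = more similar).

    Assumed form for this sketch; the handout gives the real definition.
    """
    return sum(w * abs(a - b) for a, b, w in zip(sig_a, sig_b, weights))


def predict_author(mystery, known, weights):
    """Return the known author whose signature scores closest to mystery."""
    return min(
        known,
        key=lambda name: compare_signatures(mystery, known[name], weights),
    )


# Hypothetical three-feature signatures, invented for this example.
known = {
    "author_a": [4.0, 0.10, 20.0],
    "author_b": [5.0, 0.20, 15.0],
}
mystery = [4.2, 0.18, 16.0]

equal_weights = [1.0, 1.0, 1.0]
first_feature_heavy = [10.0, 1.0, 0.1]

print(predict_author(mystery, known, equal_weights))        # author_b
print(predict_author(mystery, known, first_feature_heavy))  # author_a
```

The same mystery signature is attributed to different authors under the two weightings, which is why new data may call for re-tuned weights.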
The data files are linked from the handout above and available here.
I have solutions to the assignment in Python 2 and am happy to distribute them to instructors who are considering using the project. I ask that no solutions be posted on the web. Even if you don't want the Python solutions, I would love to hear from you by email if you use this assignment (or a variation of it) for your students.
Automated authorship detection has uses in plagiarism detection, email filtering, and social-science research, and as forensic evidence in court cases. Also called authorship attribution, it is an active research field, and the state-of-the-art linguistic features are considerably more complicated than the ones calculated in this assignment.
Extra info about this assignment: