Authorship Detection

Michelle Craig
University of Toronto
mcraig@cs.toronto.edu

Overview

Automated authorship detection is the process of using a computer program to analyze a large collection of texts, one of which has an unknown author, and to make a guess about the author of that unattributed text. The basic idea is to use various statistics computed from the text -- called "features" in the machine learning community -- to form a linguistic "signature" for each text. One example of a simple feature is the number of words per sentence. Some authors may prefer short sentences, while others tend to write sentences that go on and on with lots and lots of words and not very concisely, just like this one. Once we have calculated the signatures of two texts, we can measure their similarity and estimate the likelihood that they were written by the same person.
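
As a concrete illustration, here is a minimal Python sketch of the words-per-sentence feature. It assumes a simple operational definition in the spirit of the assignment (a sentence ends at '.', '!' or '?'; a word is a whitespace-separated token); the handout's actual definitions may differ.

    def avg_words_per_sentence(text):
        # Normalize the sentence terminators so we can split once.
        for terminator in '!?':
            text = text.replace(terminator, '.')
        # Keep only non-empty sentences.
        sentences = [s for s in text.split('.') if s.strip()]
        if not sentences:
            return 0.0
        total_words = sum(len(s.split()) for s in sentences)
        return total_words / float(len(sentences))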

In this assignment, students will write a number of small functions that each calculate an individual linguistic feature. They will apply their functions to a piece of mystery text and combine the features into a signature, which they will compare against a set of known signatures for famous authors to predict the author of the mystery file.
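
In outline, the signature-building step might look like the sketch below; the function name and structure here are illustrative, not taken from the handout.

    def calculate_signature(text, feature_functions):
        # Apply each one-argument feature function to the text and
        # collect the resulting floats into a signature list.
        return [feature(text) for feature in feature_functions]

    # Example usage with the feature sketched above:
    # signature = calculate_signature(mystery_text, [avg_words_per_sentence])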

Meta Information


Summary
Define small functions that each operate on a piece of text and calculate a linguistic feature represented by a floating-point value. Combine these values to produce an author signature. Define a similarity measure for pairs of signatures. Compare the calculated signature for a mystery text with signatures for known authors read from input files, and predict the author of the unattributed text.
Topics
arrays, reading from files
Audience
CS1
Difficulty
This is an intermediate assignment, taking a CS1 student approximately two weeks. It can be made more or less complicated by varying the particular linguistic features. The version we gave had five features, none of which required a data structure more complicated than an array.
Strengths
  • Students enjoyed seeing their programs correctly guess the real authors when they themselves could not easily tell from the text alone.
  • Instructors like the fact that it can be assigned before students have learned about more complex data structures but still uses real linguistic features. For example, the Hapax Legomena ratio is the ratio of words that occur exactly once in the text to the total number of words in the text. The numerator can be calculated by adding and removing words from two one-dimensional lists (a minimal sketch appears after this list).
  • Because students write several small, independent functions, the assignment is manageable for a CS1 student. This structure also simplifies testing, both for the student and ultimately for the instructor.
  • The assignment can be a starting point for talking about other CS ideas like weighted averages, natural language understanding, machine learning and the challenges of working with real data rather than clean and tidy toy input.
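
To illustrate the two-list strategy mentioned above, here is a minimal sketch of the Hapax Legomena ratio; the name and details are illustrative rather than the handout's specification.

    def hapax_legomena_ratio(text):
        seen_once = []   # words seen exactly once so far
        seen_more = []   # words seen two or more times
        words = text.split()
        for word in words:
            if word in seen_more:
                continue                  # already known to repeat
            if word in seen_once:
                seen_once.remove(word)    # second occurrence: promote it
                seen_more.append(word)
            else:
                seen_once.append(word)
        if not words:
            return 0.0
        return len(seen_once) / float(len(words))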
Weaknesses
  • It is too complicated for CS1 students to correctly extract individual sentences from real text, so we used operational definitions of a word, a sentence, and a phrase that were easy to understand and compute. This frustrated some students and meant that some of the linguistic features, when applied to real data, gave incorrect results.
  • The linguistic features don't always differentiate well enough between authors. Because the features we used were constrained to require only one-dimensional arrays, we omitted some fairly powerful options; for example, we did nothing with word or punctuation frequency. If the assignment were used later in the term, when dictionaries or maps are available, it would be easy to add linguistic features that work better (a dictionary-based sketch follows this list).
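
For instance, once dictionaries are available, a word-frequency table makes such features almost trivial. This is a hypothetical sketch (a dictionary-based rewrite of the list version above), not part of the actual starter code:

    def word_frequencies(text):
        # Map each whitespace-separated word to its occurrence count.
        counts = {}
        for word in text.split():
            counts[word] = counts.get(word, 0) + 1
        return counts

    def hapax_legomena_ratio(text):
        # Ratio of words occurring exactly once to total words.
        words = text.split()
        if not words:
            return 0.0
        counts = word_frequencies(text)
        singles = [w for w in counts if counts[w] == 1]
        return len(singles) / float(len(words))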
Dependencies
No external libraries required.
Variants
  • Add new linguistic features such as frequency of various punctuation marks or some common words.
  • Require more sophisticated handling of punctuation in the files, or don't provide the punctuation-stripping function as part of the starter code. (Of course, students may find it on the Nifty site.) A plausible version of such a helper is sketched below.
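
If you do provide a punctuation-stripping helper, it might look like the following sketch; this is a guess at such a helper, not the actual starter code.

    import string

    def clean_up(s):
        # Strip leading and trailing punctuation and lowercase the result.
        return s.strip(string.punctuation).lower()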

Example Handout

Here is the handout we used at the University of Toronto in Fall 2010.

Data Set

I used publicly accessible data from Project Gutenberg to calculate linguistic signatures for 13 different authors whose names I hoped my students would recognize. I then picked other works by five of these authors and created "mystery" files. They weren't much of a mystery to the students, since the author's name appears in the first few lines of each file. Using the linguistic features we required of our students, the algorithm correctly guesses the author for four of the five mystery texts. This leaves room to discuss why it doesn't always work and how it could be improved. The weights used in the similarity measure defined in the handout above were hand-tuned to give this result on this data. If you generate additional known-author signatures and new mystery files (which is easy to do!), you may need to experiment with new weights to get the performance you want your students to see.
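
For concreteness, the comparison step might be sketched as follows. The weights here are placeholders showing the shape of the computation; the hand-tuned values from the handout will differ.

    def compare_signatures(sig1, sig2, weights):
        # Weighted sum of absolute differences between two signatures;
        # lower scores mean more similar signatures.
        return sum(w * abs(a - b)
                   for w, a, b in zip(weights, sig1, sig2))

    def guess_author(mystery_sig, known_sigs, weights):
        # known_sigs maps author name -> signature; return the author
        # whose signature is closest to mystery_sig.
        best_name, best_score = None, None
        for name, sig in known_sigs.items():
            score = compare_signatures(mystery_sig, sig, weights)
            if best_score is None or score < best_score:
                best_name, best_score = name, score
        return best_name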

The data files are linked from the handout above and available here.

Solution Code

I have solutions to the assignment in Python 2 and am happy to distribute them to instructors who are considering using the project. I ask that no solutions be posted on the web. Even if you don't want the Python solutions, I would love to hear from you by email if you use this assignment (or a variation of it) for your students.

Connection to Current CS Research and Development

Automated authorship detection has uses in plagiarism detection, email filtering, social-science research, and forensic evidence in court cases. Also called authorship attribution, it is an active research field, and state-of-the-art linguistic features are considerably more complicated than the ones calculated in this assignment.
