One type of question encountered on the SAT and the TOEFL (Test of English as a Foreign Language) standardized tests is the “Synonym Question,” where students are asked to pick the synonym of a query word out of a list of choices. For example, given the query word “vexed” and the choices “annoyed,” “amused,” “frightened,” and “excited,” the correct answer is “annoyed.”
In this assignment, students build an intelligent system that learns to answer such questions simply by processing a large corpus of English text (i.e., without using thesauruses or dictionaries). The resulting system obtains the correct answer (out of two possible answers) to SAT synonym questions more than 70% of the time. It achieves this by learning to approximate the semantic similarity of any pair of words. The semantic similarity between two words is a measure of the relatedness of their meanings. For example, the semantic similarity between “car” and “vehicle” is high, while the semantic similarity between “car” and “flower” is low. The system guesses that the choice with the highest semantic similarity to the query word is the correct answer to the synonym question.
The system approximates semantic similarity of any pair of words by constructing a descriptor for each word, and then comparing the descriptors of the two words to each other in order to determine the semantic similarity score. The descriptor of a word is computed by reading in a large corpus of text (we use famously large novels, such as War and Peace and Remembrance of Things Past, downloaded from Project Gutenberg), and recording the number of times that each word in the novel appears in a sentence together with the word of interest. The intuition behind this method is related to the Distributional Hypothesis in linguistics: semantically similar words tend to appear in similar contexts, and so they would have similar descriptors.
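The counting step can be sketched as follows (a minimal illustration; the function and variable names are ours, not the assignment's starter code, and whether a word repeated within one sentence counts once or several times is a design choice — here each co-occurring word is counted once per sentence):

```python
def build_descriptor(word, sentences):
    """Build the descriptor of `word`: a dict mapping every other word to
    the number of sentences in which it co-occurs with `word`.

    `sentences` is the corpus represented as a list of lists of words.
    """
    descriptor = {}
    for sentence in sentences:
        if word in sentence:
            for other in set(sentence):  # count each word once per sentence
                if other != word:
                    descriptor[other] = descriptor.get(other, 0) + 1
    return descriptor

corpus = [["the", "car", "drove", "on", "the", "road"],
          ["a", "vehicle", "drove", "on", "a", "road"],
          ["the", "flower", "bloomed"]]
print(build_descriptor("car", corpus))
```

Because the descriptor is stored as a dictionary, it acts as a sparse vector: words that never co-occur with `word` simply have no entry.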
The assignment allows students to build a system that automatically performs a very difficult task (figuring out word similarity without encoding a lot of knowledge about language by hand), a task all students can appreciate (having tried themselves to infer word meaning from context), using just the tools taught in CS1. For cohorts where the students are learning Linear Algebra while taking CS1/CS2, the assignment provides a nice illustration of how university-level math that first-year science and engineering students are becoming familiar with can be used in Artificial Intelligence (AI). The assignment introduces students to techniques that are currently used in AI, and provides a remarkable illustration of the power of statistical methods in AI. Finally, the assignment provides an opportunity for students to work in a context where large amounts of data need to be processed: good implementations take about 10 seconds to run, while inefficient implementations take minutes or even hours. In the advanced version of the assignment, students are asked to come up with an efficient implementation, and most succeed; the efficient algorithm is fairly simple, and students who are stuck usually need only a brief verbal explanation of what needs to be done.
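The handout leaves the efficient algorithm to the students, but one plausible version of it (a sketch under our assumptions, not necessarily the sample solution) replaces re-scanning the whole corpus for every query word with a single pass that builds all descriptors at once:

```python
def build_all_descriptors(sentences):
    """Build descriptors for every word in a single pass over the corpus,
    instead of re-scanning the whole corpus for each word of interest.

    For each sentence, increment the co-occurrence count of every ordered
    pair of distinct words appearing in it (each pair counted once per
    sentence).
    """
    descriptors = {}
    for sentence in sentences:
        unique_words = set(sentence)
        for w1 in unique_words:
            d = descriptors.setdefault(w1, {})
            for w2 in unique_words:
                if w2 != w1:
                    d[w2] = d.get(w2, 0) + 1
    return descriptors
```

The single pass is what brings the running time down: the per-word approach re-reads the entire corpus for every word it needs a descriptor for, while this version touches each sentence once.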
We have used different flavors of this assignment in the middle and at the end of the first Introduction to Programming class for Engineering Science students (CS1 with some elements of CS2) at the University of Toronto. The intermediate-level assignment is more scaffolded: students write several small functions to read text from files, to process the text, to compute counts of words and store those counts in a Python dictionary (hash table), and to find the word whose descriptor is most similar to the descriptor of the query word. The advanced-level assignment is less scaffolded, and requires students to write several functions that benefit from being broken down into helper functions. In the advanced-level assignment, we guide students as they experiment with the system to get a better appreciation for how it works. For a small number of marks, students are asked to make sure that their implementation is efficient.
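One natural way to compare two descriptors, and the one that connects to the Linear Algebra material mentioned above, is the cosine of the angle between the two sparse count vectors. The sketch below assumes cosine similarity; the actual handout may specify a different measure:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two descriptors stored as sparse
    vectors (dicts mapping words to counts)."""
    dot = sum(count * d2[word] for word, count in d1.items() if word in d2)
    norm1 = math.sqrt(sum(c * c for c in d1.values()))
    norm2 = math.sqrt(sum(c * c for c in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # a word with an empty descriptor matches nothing
    return dot / (norm1 * norm2)

def best_answer(query_descriptor, choice_descriptors):
    """Guess the choice whose descriptor is most similar to the query's."""
    return max(choice_descriptors,
               key=lambda choice: cosine_similarity(query_descriptor,
                                                    choice_descriptors[choice]))
```

Storing vectors as dictionaries keeps the dot product proportional to the number of words the two descriptors share, rather than to the size of the whole vocabulary.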
Summary | Write small (optionally: medium-sized) functions to read text from files, process the text, and compute descriptors (counts) from the text. Write small functions to read and process the Synonym Questions (optionally: this can be provided to students). Write functions to combine all of the above in order to test the system and obtain its performance (optionally: if the student's first attempt runs slowly, come up with a solution that takes less time). |
Topics | Lists, files, text processing with Python (str.split(), str.strip(), etc.), dictionaries, nested loops. Optionally: basic time-efficiency. Optionally: using dictionaries to store sparse vectors. |
Audience | CS1/CS2. Some optional parts of the assignment are particularly suitable for students who have been exposed to introductory Linear Algebra. |
Difficulty | This can be either an intermediate or an advanced CS1 assignment, requiring about 10 hours of work. The assignment could also be used for students who have already taken CS1 and are switching to a new programming language. |
Strengths | |
Weaknesses | |
Variants | An advanced and an intermediate version are available (see below). Further simplifications are possible. |
Intermediate version: handout (LaTeX source), starter code, synonym question tasks.
Advanced version: handout (LaTeX source), starter code, synonym question tasks.
Note that the intermediate version could be further simplified by providing the students with more handout code (solution code in Python is available upon request).
Solutions in Python are available. Autotesting code is also available upon request.
This assignment cannot be the first one in which students work with string processing. When using the assignment, we also assigned this 2-3 hour lab, which students completed before the assignment (only the first two problems there were mandatory). We of course covered string processing in class as well.
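For a sense of the string processing involved, here is a simplified sketch using str.split() and str.strip(); the actual lab and handout may prescribe different tokenization rules:

```python
def sentence_to_words(sentence):
    """Split one sentence into lowercase words, stripping surrounding
    punctuation from each word with str.strip()."""
    words = [word.strip(".,;:!?\"'()").lower() for word in sentence.split()]
    return [w for w in words if w != ""]

def text_to_sentences(text):
    """Naively split text into sentences on periods; a fuller solution
    would also handle '!', '?', and tricky cases like abbreviations."""
    return [s.strip() for s in text.split(".") if s.strip() != ""]
```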
In 2014, we saw that many students started experimenting with the assignment on their own: feeding the system more novels to improve its performance, applying simple heuristics, and working on making the code run faster. This was one of the motivations for us to change the assignment in 2015 so that more students would work on these aspects of the problem. This was our first experiment with asking introductory programming students to explore the behaviour of their code, and we believe many students benefited from that experience.
Many extensions are possible for the interested students. One example is to explore the use of techniques frequently seen in computational linguistics. For example, students can experiment with removing stop words. Another possible direction, which excited many students, is to use similar techniques to the ones used in the assignment to compute similarity between sentences or whole texts.
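For instance, stop-word removal can be as simple as filtering those dimensions out of a descriptor. A sketch (the stop-word list here is a small illustrative one, not the list a computational linguistics resource would supply):

```python
# A small illustrative stop-word list; real experiments would use a longer one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "it"}

def remove_stop_words(descriptor):
    """Return a copy of a descriptor with stop-word dimensions removed,
    so that similarity scores are not dominated by very common words."""
    return {word: count for word, count in descriptor.items()
            if word not in STOP_WORDS}
```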
The material from this assignment can be used for in-class exercises or examples in an end-of-term lecture on the use of programming in an artificial intelligence setting. This was done in the CS1 course at McGill University, COMP 202. See the sample lecture slides here (pptx available upon request).