Automatically Solving SAT/TOEFL Synonym Questions with Computational Linguistics

Michael Guerzhoy (University of Toronto, `guerzhoy@cs.toronto.edu`),
Jackie Chi Kit Cheung (McGill University, `jcheung@cs.mcgill.ca`), and
François Pitt (University of Toronto, `fpitt@cs.toronto.edu`)

Overview

One type of question encountered in the SAT and the TOEFL (Test of English as a Foreign Language) standartized tests is the “Synonym Question,” where students are asked to pick the synonym of a query word out of a list of choices. For example:

vexed
1. annoyed
2. amused
3. frightened
4. excited
(Answer: (a) annoyed)

In this assignment, students build an intelligent system that learns to answer such questions simply by processing a large corpus of English text (i.e., without using thesauruses or dictionaries). Students build a system which obtains the correct answer (out of two possible answers) to SAT synonym questions more than 70% of the time. The system achieves this by learning to approximate the semantic similarity of any pair of words. The semantic similarity between two words is a measure of the relatedness of their meanings. For example, the semantic similarity between “car” and “vehicle” is high, while the semantic similarity between “car” and “flower” is low. The system guesses that the word which has the highest semantic similarity to the query word is the correct answer to the synonym question.

The system approximates semantic similarity of any pair of words by constructing a descriptor for each word, and then comparing the descriptors of the two words to each other in order to determine the semantic similarity score. The descriptor of a word is computed by reading in a large corpus of text (we use famously large novels, such as War and Peace and Remembrance of Things Past, downloaded from Project Gutenberg), and recording the number of times that each word in the novel appears in a sentence together with the word of interest. The intuition behind this method is related to the Distributional Hypothesis in linguistics: semantically similar words tend to appear in similar contexts, and so they would have similar descriptors.

The assignment allows students to build a system that automatically performs a very difficult task (figure out word similarity without encoding a lot of knowledge about language by hand), which is a task all students can appreciate (having tried themselves to infer word meaning from context), using just the tools taught in CS1. For cohorts where the students are learning Linear Algebra while taking CS1/CS2, the assignment provides a nice illustration of how university-level math that first-year science and engineering students are becoming familiar with can be used in Artificial Intelligence (AI). The assignment introduces students to techniques that are currently used in AI, and provides a remarkable illustration of the power of statistical methods in AI. Finally, the assignment provides an opportunity for students to work in a context where large amounts of data need to be processed, with good implementations taking about 10 seconds to run, and inefficient implementations taking minutes or even hours to run; in the advanced version of the assignment, students are asked to come up with an efficient implementation, and most succeed (the efficient algorithm is fairly simple, and students who are stuck only require a brief verbal explanation to understand what needs to be done).

We have used different flavors of this assignment in the middle and at the end of the first Introduction to Programming class for Engineering Science students (CS1 with some elements of CS2) at the University of Toronto. The intermediate-level assignment is more scaffolded; students write several small functions to read text from files, to process the text; to compute counts of words and store those counts in a Python dictionary (hash table); and to find the word whose descriptor is the most similar to the descriptor of the query word. The advanced-level assignment is less scaffolded, and requires students to write several functions that would benefit from being broken down into helper functions. In the advanced-level assignment, we guide students as they experiment with the system to get a better appreciation for how it works. For a small amount of marks, students are asked to make sure that their implementation is efficient.

Meta Information

Summary	Write small (optionally: medium-sized) functions to read text from files, process the text, and compute descriptors (counts) from the text. Write small functions to read and process the Synonym Questions (optionally: this can be provided to students). Write functions to combine all of the above in order to test the system and obtain its performance (optionally: if the student's first attempt runs slowly, come up with a solution that takes less time).
Topics	Lists, files, text processing with Python (`str.split()`, `str.strip()`, etc.), dictionaries, nested loops. Optionally: basic time-efficiency. Optionally: using dictionaries to store sparse vectors.
Audience	CS1/CS2. Some optional parts of the assignment are particularly suitable to students who were exposed to introductory Linear Algebra.
Difficulty	This is can be either an intermediate or an advanced CS1 assignment, requiring about 10 hours of work. The assignment could also be used for students who already took CS1 and are switching to a new programming language.
Strengths	We received some very positive feedback for this assignment. Students like the fact that the synonym question task seems highly non-trivial, but a computer program that they wrote can tackle it. Students get excited about the algorithm, and several students have been in touch with us to tell us about their extensions of the algorithm. The system requires processing large files to obtain good performance. This is an opportunity to teach students to test their functions using their own (small) input; and to debug programs that should process large files by making small test files first. The system the students build is broken up into well-defined functions that are automatically testable, making grading easy. Nevertheless, especially In the advanced version, students benefit from writing their own helper functions. The assignment allows for interesting class discussion of the transition from rule-based classical AI to statistical AI in the last two decades. Modern AI is exemplified by the system the students build, as well as by YouTube's video recommendation engine (for example). The advanced-level assignment contains a very rare case where it is possible for students to come up with a very inefficient algorithm and a fairly efficient algorithm with just CS1 knowledge (the algorithms in questions are all simpler than sorting). For the more mathematically-inclined students, the assignment is an opportunity to see the linear algebra content they learned in action. In one version of the assignment, all the linear algebra is done for the students, so that linear algebra knowledge is not actually necessary.
Weaknesses	In the advanced-level version, we ask that students come up with an efficient algorithm for computing the descriptors. Although the algorithm is very simple, a significant number of the students require some sort of hint. In our experience, a short verbal description can get students on the right track: what is required is to set up an accumulator dictionary then update it iteratively. In our class of 300 students, this was labor-intensive but hopefully rewarding both for the students who had an “a-ha” moment, and the students who needed a hint to get there. While it is completely possible to give this assignment to students without linear algebra knowledge (we have done that when giving the assignment earlier in the semester), students who are taking linear algebra concurrently with their computer programming course will benefit more from the assignment.
Variants	An advanced and an intermediate version are available (see below). Further simplifications are possible.

Example Handouts

Intermediate version: handout (LaTeX source), starter code, synonym question tasks.

Advanced version: handout (LaTeX source), starter code, synonym question tasks.

Note that the intermediate version could be further simplified by providing the students with more handout code (solution code in Python is available upon request).

Solution Code

Solutions in Python are available. Autotesting code is also available upon request.

Supporting the assignment

This assignment cannot be the first one where students work with string processing. When using the assignment, we have also assigned this 2-3 hour lab, which students completed before the assignment (note that only the first two problems there were mandatory). We of course covered string processing in class as well.

Lessons learned

In 2014, we saw that many students started experimenting with the assignment on their own: trying to feed the system more novels to improve the performance of the algorithm, applying simple heuristics to improve the performance, and working on making the code run faster. This was one of the motivations for us to change the assignment in 2015 to have more students work on these aspects of the problem. This was our first experiment with asking introductory programming students to explore the bahaviour of their code, and we believe many students benefited from that experience.

Possible extensions

Many extensions are possible for the interested students. One example is to explore the use of techniques frequently seen in computational linguistics. For example, students can experiment with removing stop words. Another possible direction, which excited many students, is to use similar techniques to the ones used in the assignment to compute similarity between sentences or whole texts.

Using the material in lecture

The material from this assignment can be used for in-class exercises or examples in an end-of-term lecture on the use of programming in an artificial intelligence setting. This was done in the CS1 course at McGill University, COMP 202. See the sample lecture slides here (pptx available upon request.)