Autocomplete

Kevin Wayne
Princeton University
wayne@princeton.edu

Overview

Students write a program to process autocomplete queries for a given set of suggestion strings and associated weights. That is, given a query string, find all suggestion strings that start with the query string, sorted by weight. Autocomplete is a pervasive user-interface feature in many modern apps and services, ranging from text messaging to IDEs to web search. For example, Google uses autocomplete to display suggestions in real time, as the user types a web search query. Google autocomplete

In one version of the assignment, students implement autocomplete in two stages:

Preprocess the suggestions: Sort the suggestions alphabetically.
Process an autocomplete query: Binary search to find all suggestions that start with the given query; sort the matching suggestions in descending order by weight.

The system sort suffices for sorting. However, the variant of binary search required is nonstandard, so students must design and implement their own version. Along the way, student must define three custom orders.

The assigment is nifty because autocomplete is a quintessential application of sorting that is familiar, authentic, and engaging. Autocomplete also exemplifies how applying a core HCI principle can dramatically improve user-interface design. Armed with only sorting, binary search, and a hundred lines of code, students can compose an effective solution to an important problem.

Resources

Assignment handout (as offered in Spring 2015).
Real-world datasets (culled from the Internet Movie Database, U.S. Social Security, Google Trillion Word Corpus, Million Song Dataset, NASDAQ, and other sources).

Metadata

Summary	Students compose a program to process autocomplete queries for a set of suggestions.
Topics	sorting, binary search, arrays, string processing, HCI
Audience	late CS1 or early CS2
Difficulty	easy to moderate
Strengths	Spotlights a familiar and important application, for which performance is critical. Brings sorting and binary search to life—two topics that are often dull and insufficiently motivated in CS1 and CS2. The accompanying interactive GUI and real-world datasets add concreteness, authenticity, and amusement. Requires only about 100 lines of code (and no starter code).
Weaknesses	Implementing binary search is tricky. Defining a custom order is awkward in Java 7.
Dependencies	Requires a basic I/O library to read text from a file, ideally one that supports Unicode.
Variants	Many variations of this assignments are suitable for CS1 and CS2. Simplify the assignment to allow use of system binary search. Refocus the assignment on lists/arrays and string processing by replacing binary search with sequential search (suitable for small datasets or small numbers of queries). Achieve improved performance using fancier data structures (such as tries, suffix arrays, segment trees, and Cartesian trees). Extend the autocomplete data type to support adding/removing suggestion strings or changing the associated weights. Delve into Unicode and related internationalization issues (such as using a diacritic-insensitive order). We also believe this assignment has great potential for adaption in several upper-level courses (HCI, Networks, Algorithms, and ML).