DNA Sequence Matching

Authors: Brian R. King and Edward Talmage (elt006@bucknell.edu), Bucknell University

Description

Biologists regularly need to match DNA sequences, but these matches are typically not perfect, since DNA changes, and we are often looking for similar sequences, not just exact matches. There are many different ways to calculate similarity between DNA sequences, and we care about algorithmic efficiency, since there we must compare against many known sequences and the sequences are often very long.

This assignment asks students to implement several string comparison algorithms to compare a given query DNA sequence against a database of labeled DNA sequences. Some of these algorithms are ones they have seen in an abstract, formal setting in class, others they must read about on their own. The purposes of the assignment are to connect the abstract algorithm to a real-world problem, to allow students to get "hands-on" with an algorithm and learn its ins and outs by actually making it run, and to allow discussion about external factors that affect how we implement and use algorithms in writing code to solve real problems.

For grading, since the course is not primarily programming-focused, most of the assessment is based on an in-person demonstration. I sit down with students and have them run their code on the originally-provided query, as well as some extra tests I provide only at that time. We then discuss the results and their interpretation. This allows me to ask questions like "How good a match is the best match?" which are important as they force students to think about the context in which they are using the algorithm. To further reduce grading load and student workload with other assignments in the course, I assign the project for groups of 3-4 students, which works fairly well. With more dedicated time, this would be very feasible as an individual assignment.

We have used this assignment with a total of approximately 200 students across at least 5 terms, in a roughly junior-level Algorithm Analysis course. Student feedback is consistently very positive, with the project regularly receiving praise in course evaluations.

Metadata

Summary	Students must implement several string-comparison dynamic programming algorithms to determine similarity between a given DNA sequence and a database of known sequences, then choose the database sequence which most closely matches the query.
Topics	Dynamic programming, specifically practice actually implementing classic algorithms covered in a theoretical way in class.
Audience	This assignment is aimed at an Algorithms course that is studying dynamic programming.
Difficulty	This is not a difficult assignments. A moderately skilled coder who understand dynamic programming could complete it in less than one day. With teams, new material, and integration, I usually allow two weeks, with other work concurrent.
Strengths	In a course focused on pseudocode and formal analysis, practically-minded students really appreciate getting to see a dynamic programming algorithm in action. The practical repetition and attention to detail required to debug code are also helpful in cementing student understanding. DNA sequence matching, especially with real sequence data, is a good opportunity to talk about real-world applications of algorithms, and some of the external concerns that affect real software engineering.
Weaknesses	A significant portion of the work time, especially for groups, is on integration, interface, and other topics that are not the primary focus. Also, DNA matching is a complex topic many students know little about, so they struggle to evaluate the quality of their solutions.
Dependencies	Basic programming (CS2 level), assumes a basic understanding of dynamic programming, though students without that understanding could blindly implement the algorithms, since we allow students to look up pseudocode.
Variants	The assignment is written to be open-ended for students to add more sequence-comparison algorithms if they wish. One could assign other algorithms or require students to develop their own. One could also require students to do more formal testing and comparison of the different algorithms or their performance. Students often implement various interface features, including bringing in outside knowledge to provide a GUI, for extra credit. This can help balance load when some members of a team are more experienced coders, as they can add bonus features while a less-experienced team member works more slowly.
Teaching Notes	I have each team meet with me to do an in-person demo using their own computer before their final submission. This gives a space to discuss the meaning, usefulness, and limitations of the algorithms we use. The demo environment also allows students to work in any programming language and gives them a chance to discuss issues and fix bugs before the final submission. Finally, my grading load is reduced since I mostly know which code is working, and spend less time trying to figure out how to deal with package dependencies, etc.

DNA Sequence Matching

Description

Metadata

Assignment Files