By Christopher Tralie ctralie -at- alumni.princeton.edu
A "dendrogram" is a binary tree that results from merging similar items in a hierarchy. In this assignment, students build a dendrogram that represents a "phylogenetic tree" of species across the animal kingdom. Starting with raw DNA strings from mitochondrial DNA obtained from https://www.ncbi.nlm.nih.gov/nuccore/, students first compare all pairs of DNA between species using the Needleman-Wunsch algorithm, with substitution costs between amino acids specified by "BLOSUM" tables obtained at this link. Amazingly, simply from this information, students are then able to build up the "tree of life" using Kruskal's algorithm by merging species in descending order of similarity according to these results from Needleman-Wunsch. Students then plot this tree to visualize the results, and they programmatically extract various clusters using recursion. Among other things, students quantitatively show that humans are in a cluster with our primeate cousins, finding the cluster: ['Bonobo', 'Chimpanzee', 'Gorilla', 'Human', 'Neanderthal', 'Orangutan']
It's worth mentioning that, while I have collaborated with numerous biologists in my career, I am by no stretch of the imagination a biologist! And most of my students don't have much biology background either. But luckily, we were all able to approach this since the focus is mainly on the algorithm concepts; the biology information is mostly "for fun" here!
Summary | This assignment is a marathon through most of the curriculum in data structures and algorithms. It culminates in a phylogenetic tree describing a hierarchical representation of the relationship between 71 species, using data obtained from mitochondrial DNA listed in NIH databases. To get to that endpoint, students must first implement a dynamic programming algorithm to solve Needleman-Wunsch comparison between pairs of mitochondrial DNA strings to compare two species. Then, they run single-linkage clustering, using Kruskal's algorithm merging species in decreasing order of similarity, to construct a dendrogram corresponding to the phylogenetic "tree of life." Finally, they plot this tree, and they recursively traverse it to gather lists of clusters of species whose pairwise similarities exceed some predetermined threshold. |
Audience | I have run this assignment four times as the second to last assignment in a data structures and algorithms course. It requires students to know most of the content in such a course, as explained more in the "difficulty" section: |
Difficulty |
This assignment is the most difficult assignment for students in my data structures class, especially because I give so little starter code and they have to design the tree class and do the plotting in matplotlib themselves. However, I mitigate this by scaffolding in various ways before and during the assignment:
|
Strengths |
This is in many ways a summative assessment of skills from all of data structures; it ties together trees, recursion, dynamic programming, and graph algorithms (Kruskal's with union find). The code design is also open ended, so students have a chance to make sure they can design everything from scratch, much like they might do in a coding interview. This assignment moves students towards mastery of recursion, especially once they get to drawing and extracting clusters from the tree. It's also a "CS+X" experience, bringing computational biology into the fold. Students are blown away by the fact that we can start with raw strings from mitochondrial DNA and end up with a "tree of life" that they can look at and which makes sense. And they will have done it all themselves from scratch starting with the strings, so there is very little mystery, and they feel empowered! |
Weaknesses |
This requires some specialized knowledge of plotting with matplotlib that could be seen as beyond the scope of data structures, and sometimes some adjustments need to be made to the figure size to fit the huge tree on the screen, which sometimes confuses students. I've mitigated this by showing simpler examples of mapltolib in class, as well as providing a robust discussion board for students while they're working on the assignment. When implementing Needleman-Wunsch, students sometimes "overfit" to the edit distance example, and they end up initializing the base cases by assuming unit costs for matching. More generally, the introduction of variable costs that students have to look up in a dictionary can be slightly confusing, though I tried to mitigate this with the interactive applet, where the costs are formatted exactly as they would be in a python dictionary. I usually just have to give students a nudge to examine that applet carefully and compare to their results. It's difficult for students to test the intermediate steps of the "building phylogenetic trees" task until they plot it, and plotting is tricky for students, so they have to think carefully about how to test out simpler examples and print out information about the merge events in Kruskal's algorithm. It could be worth providing them with a smaller set of animals (e.g. just the primates) so they can check their tree nodes by hand by printing them before plotting is ready. There are also limitations to using the "max rule" for single linkage clustering that affect the quality of the clusters, though that's beyond the scope of this assignment... |
Dependencies / Variants |
I went with python and matplotlib because they're relatively simple and I run my data structures course in python, but this could be just as easily run in almost any other language. One particularly good alternative to python/matplotlib could be Javascript/d3. In an earlier variant of the assignment, I had students plot multiple renditions of the trees, since, unlike a binary search tree, the ordering of children is arbitrary. In this case, students made pseudo-random choices about which subtree was on the left and which subtree was on the right at each node. |
Teaching Notes |
One hazard is that students get stuck on Needleman-Wunsch, or that they get incorrect results that could affect their code downstream. For this reason, I split the assignment submission into 3 submissions, the first of which is Needelman-Wunsch (second submission is building a phylogenetic tree, and the third submission is clustering). If they don't get Needleman-Wunsch correct, I send them a solution so they can keep moving. Also, since it takes a lot of time to compute all pairs of Needelman-Wunsch distances for all of the species, I have them cache the results in a JSON file before they move onto the next part. |