This current assignment was developed in 2002, used for two years, then resurrected in the Spring of 2008 and used since then. It's turned into a very different assignment over that time. These differences leverage discussion of tradeoffs more than the original assignment, and were motivated in part by opportunities (and differences) provided by Java compared to C++. The assignment is meant to illustrate pragmatically the benefits of using a linked list.
|Restriction Enzymes||Assignment Background||PCR: Polymerase Chain Reaction|
In this assignment you'll experiment with different implementations of a simulated restriction enzyme cutting (or cleaving) a DNA molecule. Three scientists shared the Nobel Prize in 1978 for the discover of restriction enzymes. They're also an essential part of the process called PCR polymerase chain reaction which is one of the most significant discoveries/inventions in chemistry and for which Kary Mullis won the Nobel Prize in 1993.
The simulation is a simplification of the chemical process, but provides an example of the utility of linked lists in implementing a data structure. The linked list you'll write and reason about is an example of a chunk list. You can do more work with chunk lists for extra credit.
Kary Mullis, the inventor of PCR, is an interesting character. To see more about him see this archived copy of a 1992 interview in Omni Magazine, this 1994 interview as parf of virus myth, his personal website which includes information about his autobiography Dancing Naked in the Mind Field, though you can read this free Nobel autobiography as well.
In particular you'll need to do three things.
String.splitto find all occurrences of an enzyme and replaces the enzyme with another strand of DNA. The benchmarking is done by the class
DNABenchMarkand is explained in more detail below. You'll run virtual experiments in code to show two things:
SimpleStrand.cutAndSpliceis O(N) where N is the size of the recombined strand returned. In your README describe the results of the process and reasoning you use to demonstrate the O(N). You should have empirical results and timings that demonstrate the O(N) behavior. Graphing the data can be useful in showing behavior.
DNABenchMarkshould only use strings whose lengths are a power of two. Report this result in your README by providing the power-of-two string you can use without running out of memory with the input file ecoli.dat which has 646 cut points in an original strand of 4,639,221 base-pairs with the restriction enzyme EcoRI: "gaattc".
Describe how you determined this result and how long your program takes on your machine to create the recombinant strand is constructed using a splicee string whose length is the greatest (power of 2) that can be used to create the recombinant cut-and-spliced strand before memory is an issue. If you double the size of the heap available to the Java runtime (-Xmx1024M) does your machine support the next power-of-two strand (i.e., the one that couldn't run with -Xmx1024m)? If so, how long does it take with ecoli.dat. You should also run with -Xmx2048M if your machine has enough memory to support this.
Be sure to report on running with consecutive powers of two for both -Xmx arguments and size-of-string. If your machine doesn't support -Xmx512M start with 256M instead.
DNABenchMarkon the data file ecoli.dat
dna length = 4639221 SimpleStrand: splicee 256 time 0.314 before 4,639,221 after 4,639,221 recomb 4,800,471 SimpleStrand: splicee 512 time 0.253 before 4,639,221 after 4,639,221 recomb 4,965,591 SimpleStrand: splicee 1,024 time 0.235 before 4,639,221 after 4,639,221 recomb 5,295,831 SimpleStrand: splicee 2,048 time 0.253 before 4,639,221 after 4,639,221 recomb 5,956,311 SimpleStrand: splicee 4,096 time 0.254 before 4,639,221 after 4,639,221 recomb 7,277,271 SimpleStrand: splicee 8,192 time 0.287 before 4,639,221 after 4,639,221 recomb 9,919,191 SimpleStrand: splicee 16,384 time 0.375 before 4,639,221 after 4,639,221 recomb 15,203,031 SimpleStrand: splicee 32,768 time 0.498 before 4,639,221 after 4,639,221 recomb 25,770,711 SimpleStrand: splicee 65,536 time 0.784 before 4,639,221 after 4,639,221 recomb 46,906,071 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2882) at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390) at java.lang.StringBuilder.append(StringBuilder.java:119) at SimpleStrand.cutAndSplice(SimpleStrand.java:58) at DNABenchMark.strandSpliceBenchmark(DNABenchMark.java:77) at DNABenchMark.main(DNABenchMark.java:109)
LinkStrandthat implements the
IDnaStrandinterface so that your implementation passes the JUnit tests in
TestStrand. See the DNA howto for help with JUnit. Passing these tests doesn't guarantee full credit since the tests are about correctness, not about efficiency. Your code should avoid recalculating values that can be determined by other means. In particular the length of a
LinkStrandstrand should be calculated in O(1) time once the strand is created. Implementing
LinkStrandis the bulk of the coding work for this assignment. You'll need to implement every method and use the JUnit tests to help determine if your methods are correctly implemented.
LinkStrandin the benchmarking code, change the string "SimpleStrand" to "LinkStrand" in the
mainmethod of the class
For example, note that there are 646 breaks for ecoli.dat when using EcoRI "gaattc" as the restriction enzyme. You'll need to make many runs to show this, not just one. Describe in your README how you show this and make sure your submitted code supports your experiments and conclusion.
Restriction enzymes cut a strand of DNA at a specific location, the binding site, typically separating the DNA strand into two pieces. In the real chemical process a strand can be split into several pieces at multiple binding sites, we'll simulate this by repeatedly dividing a strand.
Given a strand of DNA "aatccgaattcgtatc" and a restriction enzyme like EcoRI "gaattc", the restriction enzyme locates each occurrence of its pattern in the DNA strand and divides the strand into two pieces at that point, leaving either blunt or sticky ends as described below. In the simulation there's no difference between a blunt and sticky end, and we'll use a single strand of DNA in the simulation rather than the double-helix/double-strand that's found in the physical/real process.
Restriction enzymes have two properties or features: the pattern of DNA that marks a site at which separation occurs and a number/index that indicates how many characters/nucleotides of the pattern attach to the left-part of the split strand. For example, the diagram below shows a strand split by EcoRI. The pattern for EcoRI is "gaattc" and the index of the split is one indicating that the first nucleotide/character of the restriction enzyme adheres to the left part of the split.
In some experiments, and in the simulation you'll run, another strand of DNA will be spliced into the separated strand. The strand spliced in matches the separated strand at each end as shown in the diagram below where the spliced-in strand matches with G on the left and AATTC on the right as you view the strands.
When the spliced-in strand joins the split strand we see a new, recombinant strand of DNA as shown below. The shaded areas indicate where the original strand was cleaved/cut by the restriction enzyme.
Your code will be a software simulation of this recombinant process: the restriction enzyme will cut a strand of DNA and new DNA will be spliced-in to to create a recombinant strand of DNA. In the simulation the code simply replaces every occurrence of the restriction enzyme with new genetic material/DNA --- your code models the process with what is essentially string replacement.
spliceeto create a recombinant strand. The strand spliced-in replaces the enzyme. In the simulation the enzyme is removed each time it occurs/finds a binding site. The characters representing the enzyme are replaced by
splicee. (This simulates the process of splitting the original DNA by the restriction enzyme, leaving sticky/blunt ends, and binding the new DNA to the split site. However, in the simulation the restriction enzyme is removed.)
This code is from the class
SimpleStrand you're given. The
String representing DNA, which is
myInfO, is split at every
occurrence of the string parameter
enzyme. The spaces are
added before the split as shown below to ensure that characters
representing the enzyme that are found at the beginning or ending
of the DNA are found as part of calling
As the spliced-in strand
splicee grows in size the code
above will take longer to execute even with the same original strand of
DNA and the same restriction enzyme. Creating the recombinant strand
using the code above is an O(N) operation
where N is the size of the resulting,
recombinant strand (you have to justify this in your README).
As part of this assignment you must develop an alternate implementation
of DNA, rather than a simple String, that makes the complexity of the
splicing independent of the size of the spliced-in strand. Each splice
operation, simulated by the call to
append above, should be
O(1) rather than O(S) for a splicee strand with S
characters/base-pairs. In your new implementation, the complexity of
creating the recombinant strand will be O(B) where B is the number
of breaks/splits created by the restriction enzyme. For a recombinant
strand of size N where N
>> B (>> means much bigger than) this is
significantly more efficient both in time and (especially) memory.
You'll be developing/coding a class
LinkStrand that implements
a Java interface
IDnaStrand. The class
simulates cutting a strand of DNA
by a restriction enzyme and appending/splicing-in a new strand. The code
supplied with this project gives specifics as to the interfaces and the
howto for this assignment also supplies more
You must use a linked-list to support the operations -- specifically the
LinkStrand should maintain pointers to a linked list
used to represent a strand. You should keep and maintain a pointer to
the first Node of the linked list and to the last node of
the linked list. These pointers are maintained as class
invariants -- they are true after any method in the class executes
(and thus before any method in the class executes).
A Strand of DNA is initially representing by a linked list with one
Node, the Node stores one string representing the entire strand of
Every linked list representing DNA maintains pointers to the first and
last nodes of the linked list. Initially, before any cuts/splices
have been made,
myLast point to the same node since there is only
one node even if it contains thousands of characters
representing DNA. The diagram below shows a list with at
least two nodes in it. If any nodes are appended, the value of
myLast must be updated to ensure that it correctly points
to the last node of the new list.
The diagram below shows the results of cutting an original strand of DNA at three points and then splicing-in the strand "GTGATAATTC" at each of the locations at which the original strand was cut.
initializeFromusing the same code copied, or leverage the common code in one method. When creating or initializing a new
LinkStrandonly one node is created, the entire string representing the DNA is in one node.
appendshould not concatenate strings, but can create new nodes. If you're appending a
LinkStrandyou don't need to create any nodes, you can simply relink the appended to nodes to the nodes in the list being appended to (think about that). If you're appending another kind of strand, e.g.,
SimpleStrand, convert the strand to a string and append it -- see the implementation in
cutAndSpliceassume there's only one node, though it might contain a huge string of DNA. If there's more than one node you can throw an exception, e.g.,:
cutAndSpliceshould not call
String.split. Instead you should repeatedly call
String.indexOfusing the two-parameter version that specifies the index at which the search will start. Each time you find an occurrence you'll be adding to the linked list/Strand that's returned in the method you're writing. One model is partially shown below.
LinkStrand.cutWitha call to
cutWithhas a side-effect of altering the strand being cut (the strand changes to represent the portion before the enzyme cleave/cut. The method also returns a newly-created strand representing the portion after the cut.
If no cut occurs because the restriction enzyme finds no binding site
then the strand being cut is unaltered and an empty or zero-length DNA
strand is returned. As part of the code provided in this project we
provide a class
SimpleStrand that implements the
IDnaStrand interface on which you can model your
|Benchmark code and explanation for O(N)||10 points|
| ||20 points|
|Benchmark code and explanation for O(B) in creating recombinant strands.||10 points|
Submit your project using the assignment name linked-dna.