• Polymorphism rate is defined as the number of nucleotide base changes between two
different members of a species. This occurs at an average rate of 1 in 1,000 bases in
page 2 of 7
humans; therefore, the nucleotide sequences of any two humans is roughly 99.9%
identical. The polymorphism rate may be substantially higher in other species.
Additionally, bases possessing the highest polymorphism rates are typically least
important, often falling in non-functional DNA where mutations in the nucleotide
sequence have least impact on the organism's survival. Thus, similarity is preserved.
DNA fragments
Known
Vector location
Circular genome
(bacterium, plasmid) + Clone
= (restriction
site)
page 3 of 7
o Read: A read is a 500-700 nucleotide sequence from the leftmost or rightmost ends
of an insert that is output by PHRED. Each nucleotide is accompanied by a quality
score defined as
− 10 × log10 Prob( Error ) .
The quality score corresponds to the probability that a given nucleotide is correctly
reported. A quality score of 40 corresponds to an error probability of 0.001, which is
considered the gold standard.
o Double-barreled Sequencing: sequencing from both the leftmost and rightmost
ends of a clone; this is done because it is impossible to tell whether the forward or
backward strand is being sequenced.
4 Shotgun Sequencing
• The method most commonly used for genomic DNA sequencing is called Shotgun. The
sample DNA is first randomly cleaved multiple times into several thousand base pair
segments. Each segment provides one or two reads or approximately 500 base pairs
from the leftmost and rightmost ends. Enough reads are obtained to cover the region to
be sequenced with seven to ten-fold redundancy to ensure complete tiling. Coverage C
is defined as
n⋅l
C=
L
L = length of genomic segment
n = number of reads
l = length of each read.
Assuming a uniform distribution of reads, the Lander-Waterman model predicts a
coverage of 10 to result in 1 gapped region per 106 nucleotides. Overlaps among the
reads are detected an extended to reconstruct the original genomic sequence.
genomic segment
~500 bp ~500 bp
5 Sequencing Strategies
• Hierarchical Sequencing: A large
collection (coverage of 10-20) of roughly
100,000 base pair BAC clones is obtained.
These clones are physically mapped
relative to each other onto the genome.
The map is used to select a minimum
tiling path of clones to be sequenced. Each clone in the path is sequenced by shotgun
and reassembled. The clone sequences are then assembled into the complete genome.
This strategy fundamentally assumes that 100,000 base pair segments will contain
fewer repeats than the complete genome.
Physical mapping can be accomplished by hybridization or digestion.
o Hybridization utilizes many DNA probes (p1, p2, ..., pn) each consisting of short
words that attach to complementary sequences in the set of clones. Each clone Ci is
page 6 of 7
treated with all probes, and all attachments (Ci, pj) are recorded. If the same probes
attach to clones X and Y, it can be assumed that these clones overlap.
Overlap between clones can be determined
using a matrix of m probes by n clones.
Cell (i, j) equals 1 if pi hybridizes to Cj
and 0 otherwise. The probes of the filled
matrix are then reordered to put the
matrix in consecutive-ones form where
all 1's are consecutive in each row and
column. This can be solved with time
complexity O(m3) where m > n. The
ordering and overlap of the clones can
be easily deduced from the consecutive-
ones matrix.
An additional computational problem results from the possibility of a probe
hybridizing in multiple places in the genome, generating a false overlap.
Incomplete tiling by the clones may also introduce gaps in the reordered matrix.
Thus, the parsimonious sequence of probes implying minimal probe repetition
must be found. Or in other words, find the shortest string of probes such that
each clone appears as a substring. This problem is APX-hard; solutions are
typically greedy, probabilistic, and require significant manual curation.
o Digestion uses restriction enzymes to cut each clone where specific words appear.
After each clone has been cut separately with an enzyme, the fragments are run on a
gel to measure their lengths. Clones Ca and Cb that produce a set of fragments of
identical lengths {li, lj, lk} following enzymatic digestion can be assumed to overlap.
Double Digestion: The process can be repeated digesting with enzymes A and
B individually and then with both enzymes simultaneously.
• The Walking Method: This method sequences the genome clone-by-clone without
following an initial physical map.
First, a very redundant library of
BAC clones with sequenced clone-
ends is built. Several "seed" clones
are randomly selected from this
library and sequenced. Sequencing
continues by "walking" off the seeds
using clone-ends to choose library
clones that extend left and right.
Optimally, clones having the least
overlap should be selected to
sequence in each ensuing step. If a
selected clone turns out to be a false overlap, it can be used as a new seed instead.
o Although walking off a single seed would minimize redundant sequencing, it would
be very impractical, requiring almost 15,000 walking steps. By walking off several
seeds in parallel, a genome can be sequenced in approximately 5 walking steps with
less than 20% redundancy.
page 7 of 7
o Most inefficiency results from redundant sequencing in which a small gap is closed
using a much larger clone. This inefficiency can be minimized by integrating the
use of a second library of shorter clones.