A collection of sequences does not, by itself, increase the scientist's understanding of the biology of organisms. However, comparing sequences with known functions with these new sequences is one way of understanding the biology of that organism from which the new sequence comes. Thus, sequence analysis can be used to assign function to genes and proteins by the study of the similarities between the compared sequences. Nowadays there are many tools and techniques that provide the sequence comparisons (sequence alignment) and analyze the alignment product to understand the biology. Sequence analysis in molecular biology and bioinformatics is an automated, computerbased examination of characteristic fragments, e.g. of a DNA strand. It basically includes relevant topics: 1. The comparison of sequences in order to nd similarity and dissimilarity in compared sequences (sequence alignment) 2. Identication of gene-structures, reading frames, distributions of introns and exons and regulatory elements 3. Finding and comparing point mutations or the single nucleotide polymorphism (SNP) in organism in order to get the genetic marker. 4. Revealing the evolution and genetic diversity of organisms. 5. Function annotation of genes.
Bioinformtica e Estrutura Molecular
73
Bioinformatics - BLAST
What we really want to be able to do is go from DNA sequence to protein function!
1st step: Are there other proteins with similar sequences ? To nd out we must search the existing protein sequence databases and compare our sequence with all the other sequences in the database
>sp|P00974|BPT1_BOVIN Pancreatic trypsin inhibitor OS=Bos taurus PE=1 SV=2 MKMSRLCLSVALLVLLGTLAASTPGCDTSNQAKAQRPDFCLEPPYTGPCKARIIRYFYNA KAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGAIGPWENL
74
Bioinformatics - BAST
Basic Local Alignment Search Tool
http://blast.ncbi.nlm.nih.gov/Blast.cgi
75
Bioinformatics - BLAST
Basic Local Alignment Search Tool
http://www.nature.com/scitable/topicpage/Basic-Local-Alignment-Search-Tool-BLAST-29096
Bioinformtica e Estrutura Molecular
76
Bioinformatics - BLAST
Basic Local Alignment Search Tool
The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.
Bioinformtica e Estrutura Molecular
The E-value gives an indication of the statistical signicance of a given pairwise alignment and reects the size of the database and the scoring system used. The lower the E-value, the more signicant the hit. A sequence alignment that has an E-value of 0.05 means that this similarity has a 5 in 100 (1 in 20) chance of occurring by chance alone.
77
Bioinformatics - BLAST
Basic Local Alignment Search Tool
Number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and if applicable, the number of gaps in the alignment
The line between the two sequences indicates the similarities between the sequences. If the query and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the substitution matrix, are indicated with +.
78
Bioinformatics - BLAST
Basic Local Alignment Search Tool
What is a substitution matrix ?
79
Bioinformatics - BLAST
Basic Local Alignment Search Tool
BLOSUM (BLOcks of Amino Acid SUbstitution Matrix[1]) is a substitution matrix used for sequence alignment of proteins.
Bioinformtica e Estrutura Molecular
80
Bioinformatics - BLAST
Basic Local Alignment Search Tool
Why do we run sequence alignments ?
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
- To observe patterns of conservation (or variability). To nd the common motifs present in both sequences. - To assess whether it is likely that two sequences evolved from the same sequence. - To nd out which sequences from the database are similar to the sequence at hand. Alignment involves: 1. Construction of the best alignment between the sequences. 2. Assessment of the similarity from the alignment. There are three different types of sequence alignment: Global alignment/Local alignment/Multiple sequence alignment
Bioinformtica e Estrutura Molecular
81
Bioinformatics - BLAST
Basic Local Alignment Search Tool
Why do we run sequence alignments ?
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
- to infer homology from similarity measurements. If 2 sequences are homologous then they should have regions of similarity If 2 sequences are similar then they are not necessarily homologous. Homology - Two genes share a common evolutionary history - Divergence from a common ancestral sequence. Similarity - Observable quantity - Two sequences have regions containing similar terms
Homologies can be inferred from similarity measurements. - Relationship is putative (more evidence required) - Human (zeta) Crystallin and quinone oxidoreductase sequence similarity suggests homology and experimental evidence shows that they have different function
Bioinformtica e Estrutura Molecular
82
Bioinformatics - BLAST
Basic Local Alignment Search Tool
Why do we run sequence alignments ?
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
- to infer homology from similarity measurements. If 2 sequences are homologous then they should have regions of similarity If 2 sequences are similar then they are not necessarily homologous.
Conserved regions in pairwise alignments: - Conserved regions within distantly related organisms are usually most responsible for structure, function or gene expression. - Substrate binding site, disulde bridges, etc. - For conserved regions between recent divergent organisms this may not be the case. - EX) mouse and rat - Higher number of conserved regions overall
Bioinformtica e Estrutura Molecular
83
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Global Alignment: the best alignment over the entire length of two sequences and is suitable when the two sequences are of similar length, with a signicant degree of similarity throughout. Example:! !
S P I I M I L L L A R A R I T Y -
Local alignment: Involving stretches that are shorter than the entire sequences, possibly more than one. Suitable when comparing substantially different sequences, which possibly differ signicantly in length, and have only a short patches of similarity.
M I I L L L A R A R
82
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Multiple alignment: Simultaneous alignment of more than two sequences. Suitable when searching for subtle conserved sequence patterns in a protein family, and when more than two sequences of the protein family are available.
S P I I M I L L L A R A R A R I I T Y T Y
M O L
83
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Intuitively we seek an alignment to maximize the number of residue-to-residue matches. Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved. A gap, which indicates a residue-to-nothing match, may be introduced in either sequence. A gap-to-gap match is meaningless and is not allowed.
Bioinformtica e Estrutura Molecular
84
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Denition: Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved. A gap, which indicates a residue-to-nothing match, may be introduced in either sequence. A gap-to-gap match is meaningless and is not allowed.
85
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Give two sequences we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment). The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences. 1. The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps. 2. The alignment score is the sum of substitution scores and gap penalties.
86
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Use +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps The best alignment by eye from before:
A A T T G G G C A G G T T
score: +1+1+1+01+1+1=4
An alternative alignment:
A A
T -
G T
G G
C A
G G
T T
score: +1+01+11+1+1=2
87
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A concise way to express the residue substitution costs can be achieved with a N N matrix (N is 4 for DNA and 20 for proteins). The substitution matrix for the simple scoring scheme:
C T A G C 1 -1 -1 -1 C C T A G 2 1 -1 -1 T -1 1 -1 -1 T 1 2 -1 -1 A -1 -1 1 -1 A -1 -1 2 1 G -1 -1 -1 1 G -1 -1 1 2
A better scheme A, G are purines (pyrimidine ring fused to an imidazole ring), T, C are pyrimidines (one six-membered ring). Assume we believe that from evolutionary standpoint purine/pyrimidine mutations are less likely to occur compared to purine/purine (pyrimidine/pyrimidine) mutations. Can we capture this in a substitution matrix?
88
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Protein substitution matrices are signicantly more complex than DNA scoring matrices. Proteins are composed of twenty amino acids, and physico-chemical properties of individual amino acids vary considerably. A protein substitution matrix can be based on any property of amino acids: size, polarity, charge, hydrophobicity. In practice the most important are evolutionary substitution matrices.
89
90
91
R
-1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3
N
-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3
D
-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
C
0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q
-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
E
-1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
G
0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
H
-2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3
I
-1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
L
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
K
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
M
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
F
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
P
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
S
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
T
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0
W
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3
Y
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1
V
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
92
R
-1 5
N
-2 0 6
D
-2 -2 1 6
C
0 -3 -3 -3 9
Q
-1 1 0 0 -3 5
E
-1 0 0 2 -4 2 5
G
0 -2 0 -1 -3 -2 -2 6
H
-2 0 1 -1 -3 0 0 -2 8
I
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
L
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
K
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
M
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
F
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
P
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
S
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
T
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
W
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
Y
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
V
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
93
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap/extra residue - Interrupts the entire polymer chain - In DNA shifts the reading frame
94
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
The conventional wisdom: the creation of a new gap should be strongly disfavored. However, once created insertions/deletions of chunks of more than one residue should be much less expensive (i.e. insertion of domains often occurs). A simple yet effective solution is afne gap penalties: ! ! ! ! ! ! ! ! ! ! ! ! ! (n) = o (n 1)e where o is opening penalty and e is extension penalty encourages gap extension
95
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A TG T AG TG T A T AG T AC A TGC A A TG T AG - - - - - - - T AC A TGC A
A TG T AG TG T A T AG T AC A TGC A A TG T A - - G - - T A - - - CA TGCA
96
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Consider two alternative alignments of ANRGDFS and ANREFS with the gap opening penalty of 10 (-10 score):
A A N N R R G D E F F S S
score: 4+6+510+2+6+4 = 17
score: 4+6+5210+6+4 = 13
A A
N N
R R
G E
D -
F F
S S
The scoring scheme provides us with the quantitative measure of how good is some alignment relative to alternative alignments. However the scoring scheme does not tell us how to nd the best alignment.
97
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Generate the list all possible alignments between two sequences, score them Select the alignment with the best score The number of possible global alignments between two sequences of length N is
22N N
For two sequences of 250 residues this is 10
1078 number of atoms in the universe
149
98
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences. The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953!
99
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A divide-and-conquer strategy: Break the problem into smaller subproblems. Solve the smaller problems optimally. Use the sub-problem solutions to construct an optimal solution for the original problem. Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems (e.g. nding the best chess move)
100
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A matrix D(i,j) indexed by residues of each sequence is built recursively, such that
D(i1,j1) + s(xi,yj) D(i,j) = max D(i1,j) + g D(i,j1) + g subject to a boundary conditions. s(i, j) is the substitution score for residues i and j, and g is the gap penalty.
101
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
We consider all possible pairs of residue from two sequences (this gives rise to a 2D matrix representation). We will have two matrices: the score matrix and traceback matrix. The Needleman-Wunsch algorithm consists of three steps: 1. Initialization of the score matrix 2. Calculation of scores and lling the traceback matrix 3. Deducing the alignment from the traceback matrix
102
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
The alignment of two sequences SEND and AND with the BLOSUM62 substitution matrix and gap opening penalty of 10 (no gap extension):
S A A A
E A N N
N N N D
D D D D -
103
The cells of the score matrix are labelled C(i,j) where i = 1,2,...,N and j = 1,2,...,M
A N D
C(2,1) C(2,2) C(2,3) C(2,4) C(2,5) C(3,1) C(3,2) C(3,3) C(3,4) C(3,5) C(4,1) C(4,2) C(4,3) C(4,4) C(4,5)
A N D
104
The rst row and column of the two matrices are lled in the initialization step qdiag! =! C(0,1) + s(1,2) qup! =! C(0,2) + g qleft! =! C(1,1) + g = 0 - 10 = -10
diag ! left !! up !! ! the letters from two sequences are aligned ! a gap is introduced in the left sequence ! a gap is introduced in the top sequence
S
0 -10
E
-20
N
-30
D
-40 done
S
left
E
left
N
left
D
left
A N D
C(2,2) C(2,3) C(2,4) C(2,5) C(3,2) C(3,3) C(3,4) C(3,5) C(4,2) C(4,3) C(4,4) C(4,5)
A N D
up up up
105
The rst row and column of the two matrices are lled in the initialization step qdiag! =! C(0,2) + s(1,3) qup! =! C(0,3) + g qleft! =! C(1,2) + g = -10 - 10 = -20
diag ! left !! up !! ! the letters from two sequences are aligned ! a gap is introduced in the left sequence ! a gap is introduced in the top sequence
S
0 -10
E
-20
N
-30
D
-40 done
S
left
E
left
N
left
D
left
A N D
C(2,2) C(2,3) C(2,4) C(2,5) C(3,2) C(3,3) C(3,4) C(3,5) C(4,2) C(4,3) C(4,4) C(4,5)
A N D
up up up
106
where S(i,j) is the substitution score for letters i and j and g is the gap penalty
S
0 -10
E
-20
N
-30
D
-40 done
S
left
E
left
N
left
D
left
A N D
C(2,2) C(2,3) C(2,4) C(2,5) C(3,2) C(3,3) C(3,4) C(3,5) C(4,2) C(4,3) C(4,4) C(4,5)
A N D
up up up
107
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
The value of the cell C(i,j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells:
C(i-1,j-1)
C(i-1,j)
C(i,j-1)
C(i,j)
108
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
! the letters from two sequences are aligned ! a gap is introduced in the left sequence ! a gap is introduced in the top sequence
S
0 -10 ?
E
-20
N
-30
D
-40 done
S
left ?
E
left
N
left
D
left
A N D
A N D
up up up
109
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
qdiag! =! C(1,1) + S(S,A)! =! 0 + 1 ! ! = 1 qup! =! C(1,2) + g!! ! ! = -10 + (-10) != -20 qleft! =! C(2,1) + g!! ! ! = -10 + (-10)! = -20
S
0 -10 ?
E
-20
N
-30
D
-40 done
S
left ?
E
left
N
left
D
left
A N D
A N D
up up up
110
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
qdiag! =! C(1,1) + S(S,A)! =! 0 + 1 ! ! = 1 qup! =! C(1,2) + g!! ! ! = -10 + (-10) != -20 qleft! =! C(2,1) + g!! ! ! = -10 + (-10)! = -20 Highest score is 1 from the diag cell
S
0 -10 1
E
-20
N
-30
D
-40 done
S
left diag
E
left
N
left
D
left
A N D
A N D
up up up
111
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
qdiag! =! C(1,2) + S(E,A)! = -10 + -1 ! = -11 qup! =! C(1,3) + g!! ! ! = -20 + (-10) != -30 qleft! =! C(2,2) + g!! ! ! = 1 + (-10)! = -9 !
S
0 -10 1
E
-20 ?
N
-30
D
-40 done
S
left diag
E
left ?
N
left
D
left
A N D
A N D
up up up
112
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Finally when all cells in the scores and traceback matrices are lled in
S
0 -10 1 -9 -19
E
-20 -9 -1 -11
N
-30 -19 -3 2
D
-40 -29 -13 3 done
S
left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
A N D
up up up
113
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
Traceback = the process of deduction of the best alignment from the traceback matrix. The traceback always begins with the last cell to be lled with the score, i.e. the bottom right cell. One moves according to the traceback value written in the cell. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. The traceback is completed when the rst, top-left cell of the matrix is reached (done cell).
114
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
S
done left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
115
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
The alignment is deduced from the values of cells along the traceback path, by taking into account the values of the cell in the traceback matrix: diag ! !the letters from two sequences are aligned left ! !a gap is introduced in the left sequence up ! !a gap is introduced in the top sequence Sequences are aligned backwards.
done
S
left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
Bioinformtica e Estrutura Molecular
up up up
116
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
S
done left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
D D
117
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
S
done left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
N D N D
118
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
S
done left diag diag up
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
E N D - N D
119
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
S E N D A - N D
120
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
S E N D A - N D
121
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
A A A
A N N
N N D
D D D -
E
left left diag diag
N
left left diag diag
D
left left left diag
A N D
up up up
S E N D A - N D
122
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
It was much easier to align SEND and AND by the exhaustive search! As we consider longer sequences the situation quickly turns against the exhaustive search: Two 12 residue sequences would require considering 1 million alignments. Two 150 residue sequences would require considering 1088 alignments ( 1078 is the estimated number of atoms in the Universe).
For two 150 residue sequences the Needleman-Wunsch algorithm requires lling a 150 150 matrix. Assuming it takes 0.01 s to calculate the score and traceback for all possible alignments it would still take............. ca. 3 x 1079 years to complete
Bioinformtica e Estrutura Molecular
123
The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likc, The University of Melbourne
The alignment is over the entire length of two sequences: the traceback starts from the lower right corner of the traceback matrix, and completes in the upper left cell of the matrix. The Needleman-Wunsch algorithm works in the same way regardless of the length or complexity of sequences, and guarantees to nd the best alignment. The Needleman-Wunsch algorithm is appropriate for nding the best alignment of two sequences which are (i) of the similar length; (ii) similar across their entire lengths.
124
http://www.nature.com/scitable/topicpage/Basic-Local-Alignment-Search-Tool-BLAST-29096
Bioinformtica e Estrutura Molecular
125
126
127
Number of identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and if applicable, the number of gaps in the alignment
The line between the two sequences indicates the similarities between the sequences. If the query and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the substitution matrix, are indicated with +.
128
129
130
131
132
133
134
135
136
137
138
139
! aligning all sequences in pairs ! placing the most similar sequence pairs (highest % sequence identity) close
! together in a phylogenic tree and the most dissimilar farther apart
! building a MSA using the most similar sequence pair and aligning them ! extending the MSA by aliogning pairs in decreasing order of similarity
A phylogenetic tree or evolutionary tree is a tree showing the evolutionary relationships among various biological species or other entities that are known to have a common ancestor.
140
By measuring the number of differences between each pair of species (Hamming distances) we can deveop a clustering proceedure to develop an unrooted phylogenic tree
141
ATSS
Species A Species B Species C Species D
ATGS
TTSG
TSGG
142
ATSS
Species A Species B Species C Species D
ATGS
TTSG
TSGG
0 0 0 0
143
ATSS
Species A Species B Species C Species D
ATGS 1 0
TTSG 2
TSGG 4
0 0
144
ATSS
Species A Species B Species C Species D
ATGS 1 0
TTSG 2 3 0
TSGG 4 3
145
ATSS
Species A Species B Species C Species D
ATGS 1 0
TTSG 2 3 0
TSGG 4 3 2 0
146
ATSS
Species A Species B Species C Species D
ATGS 1 0
TTSG 2 3 0
TSGG 4 3 2 0
147
ATSS
ATGS
148
TTSG
TSGG
149
TTSG
TSGG
0 0
150
TTSG
TSGG
0.5(2+3)=2.5 0.5(4+3)=3.5 0 2 0
151
TTSG 2.5 0
TSGG 3.5 2 0
152
TTSG
TSGG
153
ATSS
ATGS
TTSG
TSGG
154
ATSS
ATGS
TTSG
TSGG
155
The length between 2 sequences is half the Hamming distance between them
ATSS ATSS ATGS TTSG TSGG 0 ATGS 1 0 TTSG 2 3 0 TSGG 4 3 2 0 {ATSS, ATGS} {ATSS, ATGS} 0 TTSG 2.5 0 TSGG 3.5 2 0
1.5
1.5
TTSG TSGG
0.5
0.5
ATSS
ATGS
TTSG
TSGG
156
157