Searching
Similarity Homology
1) 25% similarity over 100 AAs is strong
evidence for homology
2) Homology is an evolutionary statement
which means descent from a common
ancestor
common 3D structure
usually common function
Sequence Complexity*
MCDEFGHIKLAN.
High Complexity
ACTGTCACTGAT.
Mid Complexity
NNNNTTTTTNNN.
Low Complexity
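One way to put a number on the three examples above is the Shannon entropy of the residue composition. This is only a sketch: the slides do not say which complexity measure is meant, so entropy is an assumption here, and `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy (bits per residue) of the letter composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    # Sum of -p * log2(p) over every distinct letter in the sequence.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# The three example strings from the slide, without the trailing period.
for label, seq in [("high", "MCDEFGHIKLAN"),
                   ("mid", "ACTGTCACTGAT"),
                   ("low", "NNNNTTTTTNNN")]:
    print(label, seq, round(shannon_entropy(seq), 2))
```

A sequence made of many different letters scores high, while a run of one or two letters scores low, which is the property low-complexity filters rely on.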
Rbn  KETAAAKFERQHMD SST SAASSSNYCNQMMKSRNLTKDRCKPMNTFVHESLA
Lsz  KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNT QATNRNTDGSTDYGILQINSRWWCNDGRTP GSRN

Rbn  DVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKY
Lsz  LCNIPCSALLSSDITASVNC AKKIVSDGDGMNAWVAWR

Rbn  PNACYKTTQANKHIIVACEGNPYVPHFDASV
Lsz  NRCKGTDVQA WIRGCRL
Dot Plot
shows regions of similarity as diagonals
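A dot plot can be sketched in a few lines. The version below is a minimal illustration, not the program used for the slides; `dot_plot` and its `window` parameter are hypothetical names. It marks a cell whenever the two windows starting at that position are identical, so a shared region shows up as a diagonal run of `*`.

```python
def dot_plot(a, b, window=1):
    """Return one text row per position in a; '*' marks a match with b."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            # Compare the window starting at a[i] with the one starting at b[j].
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows

for line in dot_plot("GENETICS", "GENESIS"):
    print(line)
```

Raising `window` above 1 removes isolated single-residue matches and leaves mainly the diagonals.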
Annexin
 89  L G N E L S Q D E S G A A A I F T V Q L  108
 82  L P S A L K S A L S G H L E T V I L G L  101
154  L E K D I I S D T S G D F R K L M V A L  173
240  L E S I K K E V K G D L E N A F L N L  258
314  L Y Y Y I Q Q D T K G D Y Q K A L L Y L  333
Consensus
     L x P x x x P D x S G x h x x h x V L L
3) dynamic programming
optimal computer solution, not approximate
BLAST
Multiple Sequence Alignment
Dynamic Programming*
Dynamic Programming
Match scores (identical residues = 10):

       G    E    N    E    S    I    S
  G   10    0    0    0    0    0    0
  E    0   10    0    0    0    0    0
  N    0    0   10    0    0    0    0
  E    0   10    0   10    0    0    0
  T    0    0    0    0    0    0    0
  I    0    0    0    0    0   10    0
  C    0    0    0    0    0    0    0
  S    0    0    0    0   10    0   10

Summed scores:

       G    E    N    E    S    I    S
  G   60   40   30   20   20   10    0
  E   40   50   30   20   20   10    0
  N   30   30   40   20   20   10    0
  E   20   30   20   30   20   10    0
  T   20   20   20   20   20   10    0
  I    0    0    0   10    0   20    0
  C   10   10   10   10   10   10    0
  S    0    0    0    0   10    0   10

Alignment:

  G E N E T I S
  | | | | * | |
  G E N E S I S
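The matrix-filling idea can be sketched as a standard global alignment (Needleman-Wunsch style) routine. This is an illustration under stated assumptions, not the exact procedure from the slides: it fills the table from the top-left rather than summing from the bottom-right, and the defaults (match = 10, mismatch = 0, gap = 0) are chosen only to mirror the 10/0 scores used for GENETICS vs GENESIS. The function name is hypothetical.

```python
def needleman_wunsch(a, b, match=10, mismatch=0, gap=0):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    def s(x, y):
        return match if x == y else mismatch

    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, len(b) + 1):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j, out_a, out_b = len(a), len(b), [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return F[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

score, row_a, row_b = needleman_wunsch("GENETICS", "GENESIS")
print(score)
print(row_a)
print(row_b)
```

With these scores the optimal total is 60 (six identical pairs), the same value as the G/G cell of the summed matrix. Several alignments can tie at that score, so the printed rows are one optimal solution, not the only one.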
Scoring Similarity
1) Can only score aligned sequences
2) DNA is usually scored as identical or not
3) modified scoring for gaps - single vs. multiple base
gaps (gap extension)
4) AAs have varying degrees of similarity
a. # of mutations to convert one to another
b. chemical similarity
c. observed mutation frequencies
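Points 1 to 3 can be illustrated by scoring two sequences that are already aligned. The sketch below assumes identity scoring for DNA and a separate cost for opening and extending a gap; all the numeric values and the name `score_alignment` are hypothetical choices for illustration, not values from the slides.

```python
def score_alignment(a, b, match=1, mismatch=0, gap_open=-5, gap_extend=-1):
    """Score two aligned strings of equal length; '-' marks a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    score = 0
    state = None  # which sequence the previous column had a gap in, if any
    for x, y in zip(a, b):
        current = "a" if x == "-" else "b" if y == "-" else None
        if current is not None:
            # Continuing a gap is cheaper than opening a new one.
            score += gap_extend if current == state else gap_open
        else:
            score += match if x == y else mismatch
        state = current
    return score

# One two-base gap versus two separate one-base gaps, six matches each.
print(score_alignment("ACGTACGT", "ACG--CGT"))
print(score_alignment("ACGTACGT", "AC-TA-GT"))
```

Both alignments have six identical bases and two gap positions, but the single longer gap scores higher because only one gap is opened.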
PAM Matrices
Developed by M.O. Dayhoff (1978)
PAM = Point Accepted Mutation
Matrix assembled by looking at patterns of
substitutions in closely related proteins
1 PAM corresponds to 1 amino acid change per
100 residues
1 PAM = 1% divergence, roughly 1 million years in
evolutionary history (the rate varies between protein families)
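Matrices for larger distances are obtained by multiplying the 1 PAM matrix by itself (PAM250 is PAM1 raised to the 250th power). The sketch below shows that idea on a made-up three-letter alphabet; the numbers in `TOY_PAM1` are hypothetical and are not Dayhoff's values.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result

# Toy "1 PAM" matrix: each letter stays the same with probability 0.99,
# i.e. 1 accepted change per 100 residues (hypothetical values).
TOY_PAM1 = [[0.990, 0.005, 0.005],
            [0.005, 0.990, 0.005],
            [0.005, 0.005, 0.990]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(v, 3) for v in toy_pam250[0]])
```

Because a site can change more than once and can change back, 250 PAM does not mean that every residue differs; a fraction of positions is still identical.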
FASTA
1) Derived from logic of the dot plot
compute best diagonals from all frames of
alignment
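The "best diagonal" idea can be sketched as follows: index every short word of one sequence, look up each word of the other, and count hits per diagonal (query position minus target position). This is a simplified illustration of the first stage only, not the FASTA program itself; `best_diagonal` and the word length `k` are hypothetical names.

```python
from collections import Counter

def best_diagonal(query, target, k=2):
    """Return (diagonal, hit_count) for the diagonal with most k-word hits."""
    # Index every k-letter word of the target by its start position.
    index = {}
    for j in range(len(target) - k + 1):
        index.setdefault(target[j:j + k], []).append(j)
    # Each shared word adds one hit to the diagonal i - j.
    hits = Counter()
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            hits[i - j] += 1
    return hits.most_common(1)[0] if hits else None

print(best_diagonal("CGTAC", "AACGTACGG"))
```

Here the query lines up with the target two positions in, so most word hits fall on diagonal -2, the same diagonal a dot plot would show.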
FASTA Algorithm
FASTA Alignments
z-sc E(1018780)..
                     60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                    120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                    180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                    240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
FASTA Format
simple format used by almost all programs
>header line with a [return] at end
Sequence (no specific requirements for line
length, characters, etc)
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
GAGTCACCAAAACCTGGGACAGGCTCATGCTCCAGGACAATTGCTGTGGCGTAAATGGTC
CATCAGACTGGCAAAAATACACATCTGCCTTCCGGACTGAGAATAATGATGCTGACTATC
CCTGGCCTCGTCAATGCTGTGTTATGAACAATCTTAAAGAACCTCTCAACCTGGAGGCTT
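Reading this format takes very little code. The parser below is a minimal sketch that follows the two rules above (a `>` header line, then sequence lines of any length); `parse_fasta` is a hypothetical helper, and real projects would normally use an existing library instead.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = """>URO1 uro1.seq
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Sequence lines are simply joined, so the line length used in the file does not matter.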
BLAST**
Basic Local Alignment Search Tool
Developed in 1990 and 1997 (S. Altschul)
A heuristic method for performing local
alignments through searches of high scoring
segment pairs (HSPs)
1st to use statistics to predict significance of initial
matches - saves on false leads
Offers both sensitivity and speed
BLAST
Uses word matching like FASTA
Similarity matching of words (3 aas, 11 bases)
does not require identical words.
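The word-matching step can be sketched by cutting the query into overlapping words and looking each one up in a database sequence. This is a simplified illustration that uses exact matches only; BLAST itself also accepts similar words scored with a substitution matrix, which is left out here. The function names and the subject string are hypothetical.

```python
def words(seq, k=3):
    """All overlapping words of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def seed_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word is shared."""
    index = {}
    for j, w in enumerate(words(subject, k)):
        index.setdefault(w, []).append(j)
    return [(i, j) for i, w in enumerate(words(query, k))
            for j in index.get(w, ())]

query = "MEAAVKEEISVEDEAVD"
print(words(query)[:9])
print(seed_hits(query, "GGAVKEEGG"))  # made-up subject sequence
```

Each hit is a seed; neighboring seeds on the same diagonal are what get extended into a high scoring segment pair.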
BLAST Algorithm
Break query into words:
MEAAVKEEISVEDEAVD
MEA EAA AAV AVK VKE KEE EEI EIS ISV ...

Break database sequences into words:
RTT SDG SRW QEL VKI DKI LFC AAV PFR
AAQ KSS LLN RWY GKG NIS WDV KVR DEI
MEA EAA AAV AVK KLV KEE EEI EIS ISV
TDVRWMSETGIIDVFLLLGPSISDVFRQYASLTGTQALPPLFSLGYHQSRWNY
IWLDIEEIHADGKRYFTWDPSRFPQPRTMLERLASKRRVKLVAIVDPH

Seq_XYZ: HVTGRSAF_FSYYGYGCYCGLGTGKGLPVDATDRCC
Query:   QSVFDYIYYGCYCGWGLG_GK__PRDA
E-val = 10^-13
Query: 48 KEYRVILPVSVDEYQVGQLYSVAEASKNXXXXXXXXXXXXXXPYEK----DGE--KGQYT 101
          K+ RV+LP+SV+EYQVGQL+SVAEASK               P++     +G+  KGQYT
Sbjct: 70 KKSRVVLPMSVEEYQVGQLWSVAEASKAETGGGEGVEVLKNEPFDNVPLLNGQFTKGQYT 129
can be confusing
reduces overall significance score
BLAST 2 algorithm
The NCBI BLAST website and GCG
(NETBLAST) now both use BLAST 2 (also
known as gapped BLAST)
Formatting Results
BLAST Output
BLAST Output
FASTA/BLAST Statistics
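E() value is the number of matches this good expected by chance
in a database of this size; for small values it is nearly equal
to the standard P value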
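The link between the two numbers can be checked directly. Assuming chance hits follow a Poisson distribution (the usual model for these statistics), the probability of seeing at least one such hit is P = 1 - exp(-E). The function name below is hypothetical.

```python
import math

def p_from_e(e_value):
    """Probability of at least one chance hit, given the expected count E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)

for e in (1e-100, 1e-5, 0.01, 1.0, 10.0):
    print(e, p_from_e(e))
```

For small E the two values are practically the same, while for E of 1 or more they diverge: P can never exceed 1, but E can.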
BLAST is Approximate
BLAST performs similarity searches very
quickly because it takes shortcuts.
looks for short, nearly identical words (11 bases)
Interpretation of output
very low E() values (e-100) are homologs or
identical genes
moderate E() values are related genes
a long list of gradually declining E() values
indicates a large gene family
long regions of moderate similarity are more
significant than short regions of high identity
Biological Relevance
It is up to you, the biologist, to scrutinize these
alignments and determine if they are significant.
Were you looking for a short region of nearly
identical sequence or a larger region of general
similarity?
Are the mismatches conservative ones?
Are the matching regions important structural
components of the genes or just introns and
flanking regions?
PSI BLAST
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/psi1.html
Genome Alignment
How to match a protein or mRNA to genomic
sequence?
There is a Genome BLAST server at NCBI
Each of the Genome websites has a similar search
function
Smith-Waterman searches
A more sensitive brute force approach to
searching
much slower than BLAST or FASTA
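The Smith-Waterman approach uses the same dynamic programming table as a global alignment, with two changes: a cell is never allowed to go below zero, and the answer is the highest cell anywhere in the table. The sketch below returns only the best local score; the scoring values and the function name are hypothetical choices for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b."""
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere instead of carrying a penalty.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("TTGENETT", "CCGENECC"))
```

Every cell of the table is filled for every database sequence, which is why this search is exhaustive and also why it is much slower than the word-based shortcuts.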
Conclusions
Sequence alignments and database searching are
key to all of bioinformatics
There are four different methods for doing
sequence comparisons: 1) Dot Plots; 2) Dynamic
Programming; 3) Fast Alignment; and 4) Multiple
Alignment
Understanding the significance of alignments
requires an understanding of statistics and
distributions