Aligned sequences:   A D A
                     A D B

Amino acid                                        A   B   D
Observed changes                                  1   1   0
Frequency of occurrence (in total composition)    3   1   2
Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

The frequencies in the middle column are taken from Dayhoff (1978); the
frequencies in the right column are taken from the 1991 recompilation of the
mutation matrices, representing a database of observations that is
approximately 40 times larger than that available to Dayhoff.
Third step: Relative Mutabilities
• To obtain a complete picture of the mutational process,
the amino acids that do not mutate are also taken into
account, i.e., what is the chance, on average, that a given
amino acid will mutate at all?
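The small example with the aligned sequences ADA and ADB can be reproduced in a few lines. This is a minimal sketch, assuming relative mutability is taken as the plain ratio of observed changes to occurrences, without any further scaling; the function name `relative_mutability` is hypothetical.

```python
from collections import Counter

def relative_mutability(seq1, seq2):
    """Ratio of observed changes to occurrences for each amino acid.

    Assumes two gap-free aligned sequences of equal length (illustrative only).
    """
    changes = Counter()
    occurrences = Counter()
    for a, b in zip(seq1, seq2):
        # Every residue in the column counts as one occurrence.
        occurrences[a] += 1
        occurrences[b] += 1
        if a != b:
            # A mismatch counts as one observed change for both residues.
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in sorted(occurrences)}

mutability = relative_mutability("ADA", "ADB")
print(mutability)  # A: 1/3, B: 1.0, D: 0.0
```

A is seen three times and changes once, B is seen once and changes once, and D is seen twice and never changes, which matches the counts in the example.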
AABCDA...BBCDA
DABCDA.A.BBCBB
BBBCDABA.BCCAA
AAACDAC.DCBCDB
CCBADAB.DBBDCC
AAACAA...BBCCC
Collecting substitution statistics
1. Count amino acid pairs in each column;
e.g., for a column containing A, A, B, A, C, A:
– 6 AA pairs, 4 AB pairs, 4 AC, 1 BC, 0 BB, 0 CC.
– Total = 6+4+4+1 = 15
2. Normalize results to obtain probabilities
(pX’s and qXY’s)
3. Compute log-odds score matrix from
probabilities:
s(X,Y) = log (qXY / (pX pY))
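The counting and normalization steps can be sketched for a single column as follows. This is an illustration under stated assumptions: the column is taken to hold four A's, one B and one C (which gives the 6 AA, 4 AB, 4 AC and 1 BC pairs of the example), qXY is the frequency of the unordered pair, and the score uses the formula exactly as written above. Conventions that split off-diagonal pairs between (X,Y) and (Y,X) differ by a factor of 2 inside the logarithm for X ≠ Y. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def column_pair_counts(column):
    """Count unordered residue pairs within one alignment column."""
    counts = Counter()
    for x, y in combinations(column, 2):
        counts["".join(sorted(x + y))] += 1
    return counts

def log_odds_scores(counts):
    """Normalize pair counts to qXY and pX, then apply s = log(qXY / (pX pY))."""
    total = sum(counts.values())
    q = {pair: n / total for pair, n in counts.items()}
    p = Counter()
    for (x, y), q_xy in q.items():
        # Each pair contributes half of its frequency to each of its residues
        # (an identical pair therefore contributes its full frequency).
        p[x] += q_xy / 2
        p[y] += q_xy / 2
    # Pairs that were never observed (here BB and CC) have no finite score
    # and are simply absent from the result.
    s = {pair: log(q_xy / (p[pair[0]] * p[pair[1]])) for pair, q_xy in q.items()}
    return q, dict(p), s

counts = column_pair_counts("AABACA")
q, p, s = log_odds_scores(counts)
print(dict(counts))  # {'AA': 6, 'AB': 4, 'AC': 4, 'BC': 1}
```

The resulting pX values equal the composition of the column (4/6 for A, 1/6 each for B and C), which is a quick sanity check on the normalization.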
Estimation of a BLOSUM matrix
• The BLOCKS database contains local
multiple gap-free alignments of proteins.
ID FIBRONECTIN_2; BLOCK
COG9_CANFA GNSAGEPCVFPFIFLGKQYSTCTREGRGDGHLWCATT
COG9_RABIT GNADGAPCHFPFTFEGRSYTACTTDGRSDGMAWCSTT
FA12_HUMAN LTVTGEPCHFPFQYHRQLYHKCTHKGRPGPQPWCATT
HGFA_HUMAN LTEDGRPCRFPFRYGGRMLHACTSEGSAHRKWCATTH
$S_{A,C} = \log_{10} \dfrac{\text{Pair-freq(obs)}}{\text{Pair-freq(expected)}} = \log_{10} \dfrac{0.01\%}{0.21\%} \approx -1.3$
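The arithmetic behind this score is easy to check. The sketch below assumes a base-10 logarithm, which is the base that reproduces the value -1.3 from the two frequencies; the function name `log_odds` is hypothetical.

```python
from math import log10

def log_odds(observed, expected):
    """Log-odds of an observed pair frequency against its expected frequency."""
    return log10(observed / expected)

# Both frequencies are percentages, so the percent signs cancel in the ratio.
score = log_odds(0.01, 0.21)
print(round(score, 1))  # -1.3
```

A negative score means the pair is seen less often than expected by chance.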
Constructing a BLOSUM matrix
1. Counting mutations
2. Tallying mutation frequencies
3. Matrix of mutation probabilities
4. Calculate abundance of each residue (marginal probability)
5. Obtaining a BLOSUM matrix
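The five steps can be sketched end to end on a toy block. This is a minimal illustration, not the published procedure in full: the block `BLOCK` is hypothetical (not taken from BLOCKS), no clustering is applied, the probability matrix is assumed to be symmetric with each mixed pair split equally between (i,j) and (j,i) so that marginals are row sums, and scores are assumed to be 2·log2 of the ratio, rounded to integers.

```python
from collections import defaultdict
from itertools import combinations
from math import log2

# Hypothetical toy block: gap-free rows of equal length.
BLOCK = ["ACA", "ACB", "ABA", "ACA"]

def blosum_scores(block):
    # Steps 1-2: count residue pairs in every column and tally them in a
    # symmetric table (half a count to (x, y) and half to (y, x)).
    counts = defaultdict(float)
    for column in zip(*block):
        for x, y in combinations(column, 2):
            counts[(x, y)] += 0.5
            counts[(y, x)] += 0.5
    # Step 3: matrix of pair probabilities.
    total = sum(counts.values())
    prob = {pair: n / total for pair, n in counts.items()}
    # Step 4: marginal probability of each residue (row sum of the matrix).
    marginal = defaultdict(float)
    for (x, _), p_xy in prob.items():
        marginal[x] += p_xy
    # Step 5: rounded log-odds scores; pairs never observed are left out
    # because their log-odds is undefined.
    scores = {
        (x, y): round(2 * log2(p_xy / (marginal[x] * marginal[y])))
        for (x, y), p_xy in prob.items()
    }
    return prob, dict(marginal), scores

prob, marginal, scores = blosum_scores(BLOCK)
for pair in sorted(scores):
    print(pair, scores[pair])
```

Splitting mixed pairs symmetrically keeps the marginals equal to the residue composition of the block and makes the score table symmetric by construction.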
Constructing BLOSUM r
• To avoid bias in favor of a certain protein, first eliminate
sequences that are more than r% identical
• The elimination is done by either
– removing sequences from the block, or
– finding a cluster of similar sequences and replacing it by a new
sequence that represents the cluster.
• BLOSUM r is the matrix built from blocks with no more than r%
identity
– E.g., BLOSUM62 is the matrix built using sequences with no more than
62% identity.
– Note: BLOSUM62 is the default matrix for protein BLAST
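The first elimination option, removing sequences from the block, can be sketched as a greedy filter. This is only an illustration under assumptions: identity is taken as the percentage of matching positions between two gap-free rows of equal length, sequences are processed in input order, and the toy block and the names `percent_identity` and `filter_block` are hypothetical.

```python
def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("block rows must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def filter_block(block, r):
    """Keep a sequence only if it is at most r% identical to every kept one."""
    kept = []
    for seq in block:
        if all(percent_identity(seq, other) <= r for other in kept):
            kept.append(seq)
    return kept

# Hypothetical toy block; the last row duplicates the first.
BLOCK = ["ACDEFGHIKL", "ACDEFGHIKV", "MCDEWGHRKV", "ACDEFGHIKL"]

print(filter_block(BLOCK, 62))  # ['ACDEFGHIKL', 'MCDEWGHRKV']
print(filter_block(BLOCK, 95))  # ['ACDEFGHIKL', 'ACDEFGHIKV', 'MCDEWGHRKV']
```

A lower r removes more near-duplicates, so the remaining block reflects more divergent sequences; the result of a greedy filter like this one depends on the order of the rows.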
Obtaining BLOSUM62 Matrix
$S_{ij} = 2 \cdot \log_2 \dfrac{p_{ij}}{p_i \, p_j}$
PAM & BLOSUM
The PAM family
BLOSUM matrices with higher numbers and PAM matrices with low
numbers are both designed for comparisons of closely related
sequences.
BLOSUM matrices with low numbers and PAM matrices with high
numbers are designed for comparisons of distantly related proteins.
If distant relatives of the query sequence are specifically being sought,
the matrix can be tailored to that type of search.