
Comparing Two Protein Sequences

Cédric Notredame (12/01/16)

Our Scope

Look once Under the Hood


Pairwise Alignment methods are POWERFUL
Pairwise Alignment methods are LIMITED

If You Understand the LIMITS, they Become VERY POWERFUL

Outline

-WHY Does It Make Sense To Compare Sequences ?
-HOW Can we Compare Two Sequences ?
-HOW Can we Align Two Sequences ?
-HOW can I Search a Database ?

Why Does It Make Sense To Compare Sequences ?
Sequence Evolution

Why Do We Want To Compare Sequences

wheat
?????

--DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
  | | |||||||| || | |||    ||| | ||||| ||||
KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

EXTRAPOLATE

Homology?
??????


SwissProt


Why Does It Make Sense To Align Sequences ?

-Evolution is our Real Tool.
-Nature is LAZY and Keeps re-using Stuff.
-Evolution is mostly DIVERGENT
Same Sequence -> Same Ancestor

Why Does It Make Sense To Align Sequences ?

Same Sequence -> Same Origin -> Same Function and Same 3D Fold

Many Counter-examples!

Comparing Is Reconstructing Evolution



An Alignment is a STORY

ADKPKRPLSAYMLWLN

ADKPKRPLSAYMLWLN        ADKPKRPLSAYMLWLN

Mutations + Selection

ADKPKRPKPRLSAYMLWLN     ADKPRRPLS-YMLWLN

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation (column 5, R/K), Insertion (columns 8-10, KPR), Deletion (column 13, A)
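The story told by the two aligned sequences can be read back column by column. The sketch below is only an illustration: the function name `describe_columns` is hypothetical, and because the ancestor is not known to the program it cannot tell an insertion from a deletion, so both are reported as `indel`.

```python
def describe_columns(top, bottom):
    """List the columns of a pairwise alignment that are not plain matches.

    Columns are numbered from 1. A column holding a gap ('-') in either
    sequence is an 'indel'; a column with two different residues is a
    'mutation'. Identical columns are skipped.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    events = []
    for column, (a, b) in enumerate(zip(top, bottom), start=1):
        if a == "-" or b == "-":
            events.append((column, "indel"))
        elif a != b:
            events.append((column, "mutation"))
    return events


if __name__ == "__main__":
    for column, kind in describe_columns("ADKPRRP---LS-YMLWLN",
                                         "ADKPKRPKPRLSAYMLWLN"):
        print(column, kind)
```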

Evolution is NOT Always Divergent

Chen et al, 97, PNAS, 94, 3811-16

AFGP with (ThrAlaAla)n, Similar To Trypsinogen
AFGP with (ThrAlaAla)n, NOT Similar to Trypsinogen

SIMILAR Sequences BUT DIFFERENT origin

Evolution is NOT always Divergent

But in MOST cases, you may assume it is
Similar Function DOES NOT REQUIRE Similar Sequence

Same Sequence -> Same Origin -> Same Function and Same 3D Fold

Similar Sequence: Historical Legacy

How Do Sequences Evolve ?

Each Portion of a Genome has its own Agenda.

How Do Sequences Evolve ?


CONSTRAINED Genome Positions Evolve SLOWLY
EVERY Protein Family Has its Own Level Of Constraint

Family             KS     KA
Histone3           6.4    0
Insulin            4.0    0.1
Interleukin I      4.6    1.4
Globin             5.1    0.6
Apolipoprot. AI    4.5    1.6
Interferon G       8.6    2.8

Rates in Substitutions/site/Billion Years as measured on Mouse Vs Human (80 Million years)
Ks: Synonymous Mutations, Ka: Non-Synonymous (Non-Neutral) Mutations.
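A quick way to compare the level of constraint between families is the ratio Ka/Ks: the lower it is, the fewer amino-acid changes are accepted for a given amount of silent change. The sketch below simply divides the values of the table above; the names `RATES` and `rank_by_constraint` are illustrative.

```python
# (Ks, Ka) copied from the table above, in substitutions/site/billion years.
RATES = {
    "Histone3": (6.4, 0.0),
    "Insulin": (4.0, 0.1),
    "Interleukin I": (4.6, 1.4),
    "Globin": (5.1, 0.6),
    "Apolipoprot. AI": (4.5, 1.6),
    "Interferon G": (8.6, 2.8),
}


def ka_over_ks(ks, ka):
    """Ratio of the non-synonymous rate to the synonymous rate."""
    return ka / ks


def rank_by_constraint(rates):
    """Families sorted from most constrained (lowest Ka/Ks) to least."""
    return sorted(rates, key=lambda family: ka_over_ks(*rates[family]))


if __name__ == "__main__":
    for family in rank_by_constraint(RATES):
        ks, ka = RATES[family]
        print(f"{family:16s} Ka/Ks = {ka_over_ks(ks, ka):.3f}")
```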

Different molecular clocks for different proteins: another prediction

How Do Sequences Evolve ?


The Amino Acids Venn Diagram
To Make Things Worse, Every Residue has its Own Personality

[Figure: Venn diagram grouping the amino acids into Aliphatic, Aromatic, Hydrophobic, Polar and Small]

How Do Sequences Evolve ?

In a structure, each Amino Acid plays a Special Role

On the surface, CHARGE MATTERS
In the core, SIZE MATTERS

[Figure: OmpR, C-ter Domain]

How Do Sequences Evolve ?


Accepted Mutations Depend on the Structure

In the core:
Big -> Big
Small -> Small
NO DELETION

On the surface:
Charged -> Charged
Small <-> Big or Small
DELETIONS

How Can We Compare Sequences ?
Substitution Matrices

How Can We Compare Sequences ?


To Compare Two Sequences, We need:
-Their Structure
-Their Function
We Do Not Have Them !!!

How Can We Compare Sequences ?


We will Need To Replace Structural Information With Sequence Information.

Same Sequence -> Same Origin -> Same Function and Same 3D Fold

It CANNOT Work ALL THE TIME !!!

How Can We Compare Sequences ?


To Compare Sequences, We need to Compare Residues
We Need to Know How Much it COSTS to SUBSTITUTE
-an Alanine with an Isoleucine
-a Tryptophan with a Glycine

The table that contains the costs for all the possible substitutions is called the SUBSTITUTION MATRIX
How to derive that matrix?

How Can We Compare Sequences ?


Using Knowledge Could Work

[Figure: Venn diagram grouping the amino acids into Aliphatic, Aromatic, Small, Hydrophobic and Polar]

But we do not know enough about Evolution and Structure.
Using Data works better.

How Can We Compare Sequences ?


Making a Substitution Matrix
-Take 100 nice pairs of Protein Sequences, easy to align (80% identical).
-Align them
-Count each mutation in the alignments
   -25 Tryptophans into Phenylalanine
   -30 Isoleucines into Leucine
-For each mutation, set the substitution score to the log odds ratio:

   Score = Log( Observed / Expected by chance )
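The recipe above can be sketched in a few lines. This is a minimal illustration, not the procedure used to build PAM or BLOSUM: the function name `log_odds_scores`, the base-2 logarithm, and the way "expected by chance" is estimated (the product of the residue frequencies, doubled for two different residues because pairs are unordered) are all assumptions of this sketch.

```python
import math
from collections import Counter


def log_odds_scores(columns):
    """Log-odds score for every residue pair seen in aligned columns.

    `columns` is a list of (residue, residue) pairs taken from gap-free
    columns of trusted alignments. Pairs are treated as unordered.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in columns)
    residue_counts = Counter(residue for pair in columns for residue in pair)
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())

    scores = {}
    for (a, b), count in pair_counts.items():
        observed = count / n_pairs
        p_a = residue_counts[a] / n_residues
        p_b = residue_counts[b] / n_residues
        # Chance of drawing the unordered pair {a, b} from the background.
        expected = p_a * p_b if a == b else 2 * p_a * p_b
        scores[(a, b)] = math.log2(observed / expected)
    return scores


if __name__ == "__main__":
    toy = [("W", "W")] * 8 + [("W", "F")] * 2 + [("F", "F")] * 10
    for pair, score in sorted(log_odds_scores(toy).items()):
        print(pair, round(score, 2))
```

Pairs seen more often than chance get a positive score, pairs seen less often a negative one. Pairs that never occur in the sample get no score at all here; real matrices need far more data for that reason.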

You're kidding! I was struck by lightning twice too!!

Gary Larson, The Far Side


How Can We Compare Sequences ?


Making a Substitution Matrix

-The Diagonal Indicates How Conserved a residue tends to be.
-W is VERY Conserved
-Some Residues are Easier To mutate into other similar ones
-Cysteines that make disulfide bridges and those that do not get averaged


How Can We Compare Sequences ?


Using a Substitution Matrix

Given two Sequences and a substitution Matrix, We must Compute the CHEAPEST Alignment

ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation (column 5, R/K), Insertion (columns 8-10, KPR), Deletion (column 13, A)

Scoring an Alignment
Most popular Substitution Matrices
-PAM250
-Blosum62 (Most widely used)

Raw Score

TPEA
APGA
Score = 1 + 6 + 0 + 2 = 9

Question: Is it possible to get such a good alignment by chance only?
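The raw score is just the sum of the matrix entries read column by column. The sketch below reproduces the sum of the slide; the four pair scores are the four terms of that sum taken in column order, not a complete substitution matrix, and the names `PAIR_SCORES` and `raw_score` are illustrative.

```python
# The four terms of the slide's sum, one per alignment column.
PAIR_SCORES = {
    ("T", "A"): 1,
    ("P", "P"): 6,
    ("E", "G"): 0,
    ("A", "A"): 2,
}


def raw_score(seq1, seq2, scores):
    """Sum the substitution scores over the columns of a gap-free alignment."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    total = 0
    for a, b in zip(seq1, seq2):
        # A substitution matrix is symmetric: look the pair up either way.
        total += scores[(a, b)] if (a, b) in scores else scores[(b, a)]
    return total


if __name__ == "__main__":
    print(raw_score("TPEA", "APGA", PAIR_SCORES))
```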

Insertions and Deletions

Gap Penalties
-Gap Opening Penalty
-Gap Extension Penalty

                   gap
Seq A   GARFIELDTHE----CAT
        |||||||||||    |||
Seq B   GARFIELDTHELASTCAT

Opening a gap is more expensive than extending it

How Can We Compare Sequences ?


Limits of the substitution Matrices
-They ignore non-local interactions and Assume that identical residues are equal
-They assume evolution rate to be constant

How Can We Compare Sequences ?


Limits of the substitution Matrices

Substitution Matrices Cannot Work !!!


How Can We Compare Sequences ?


Limits of the substitution Matrices

I know... But at least, could I get some idea of when they are likely to do all right?

How Can We Compare Sequences ?


The Twilight Zone

[Figure: %Sequence Identity plotted against alignment Length. Above about 30% identity over 100 residues: Similar Sequence, Similar Structure, Same 3D Fold. Below, in the Twilight Zone: Different Sequence, Structure ????]

How Can We Compare Sequences ?


The Twilight Zone

Substitution Matrices Work Reasonably Well on Sequences that have more than 30 % identity over more than 100 residues
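The rule of thumb above is easy to check on an existing alignment. The sketch below is one possible reading of it: identity is counted over the columns where neither sequence has a gap (other conventions exist), and the names `percent_identity` and `above_twilight_zone` are illustrative. The 30% and 100-residue thresholds are the ones quoted on the slide.

```python
def percent_identity(aligned1, aligned2):
    """Return (% identity, number of gap-free columns) of a pairwise alignment."""
    pairs = [(a, b) for a, b in zip(aligned1, aligned2)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0, 0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs), len(pairs)


def above_twilight_zone(aligned1, aligned2, min_identity=30.0, min_length=100):
    """True when the alignment is both long enough and similar enough."""
    identity, length = percent_identity(aligned1, aligned2)
    return identity > min_identity and length > min_length


if __name__ == "__main__":
    top = "--DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER"
    bottom = "KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA"
    print(percent_identity(top, bottom))
    print(above_twilight_zone(top, bottom))
```

On the wheat example shown earlier the identity is high but the alignment is much shorter than 100 residues, so the length condition fails and the rule alone does not vouch for it.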

How Can We Compare Sequences ?


Which Matrix Shall I use ?
-The Initial PAM matrix was computed on 80% similar Proteins
-It has been extrapolated to more distantly related sequences: Pam 250, Pam 350
-Other Matrices Exist: BLOSUM 42, BLOSUM 62

How Can We Compare Sequences ?


Which Matrix Shall I use ?

PAM: Distant Proteins -> High Index (PAM 350)
BLOSUM: Distant Proteins -> Low Index (Blosum30)

Choosing The Right Matrix may be Tricky
GONNET 250 > BLOSUM62 > PAM 250.
But This will depend on:
-The Family.
-The Program Used and Its Tuning.
-Insertions, Deletions?

HOW Can we Align Two Sequences ?
-Dot Matrices
-Global Alignments
-Local Alignments

Dot Matrices
QUESTION: What are the elements shared by two sequences ?

Dot Matrices
>Seq1
THEFATCAT
>Seq2
THELASTCAT

[Figure: dot matrix with Seq1 (THEFATCAT) on the horizontal axis and Seq2 on the vertical axis, drawn for a given Window and Stringency]

Dot Matrices
-Sequences
-Window size
-Stringency
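The three ingredients listed above are enough to draw a dot matrix. The sketch below is a minimal version under one common reading of the parameters, which is an assumption here: a dot is drawn when at least `stringency` of the `window` residue pairs starting at that position are identical. The function names `dot_matrix` and `render` are illustrative.

```python
def dot_matrix(seq1, seq2, window=1, stringency=1):
    """Boolean dot matrix: rows follow seq2, columns follow seq1.

    Cell [i][j] is True when at least `stringency` of the `window`
    residue pairs starting at seq2[i] and seq1[j] are identical.
    """
    rows = len(seq2) - window + 1
    cols = len(seq1) - window + 1
    dots = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq2[i + k] == seq1[j + k] for k in range(window))
            row.append(matches >= stringency)
        dots.append(row)
    return dots


def render(dots):
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if d else "." for d in row) for row in dots)


if __name__ == "__main__":
    print(render(dot_matrix("THEFATCAT", "THELASTCAT")))
    print()
    print(render(dot_matrix("THEFATCAT", "THELASTCAT", window=3, stringency=3)))
```

With window=1 every shared letter gives a dot, so the plot is noisy; a larger window with a high stringency keeps only the diagonals.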

Dot Matrices
Stringency

[Figure: the same dot matrix drawn with three settings]
Window=1, Stringency=1
Window=11, Stringency=7
Window=25, Stringency=15

Dot Matrices

http://myhits.isb-sib.ch/cgi-bin/dotlet

Dot Matrices
Limits
-Visual aid
-Best Way to EXPLORE the Sequence Organisation
-Does NOT provide us with an ALIGNMENT

wheat
?????

--DPNKPKRAMTSFVFFMSEFRSEFKQKHSKLKSIVEMVKAAGER
  | | |||||||| || | |||    ||| | ||||| ||||
KKDSNAPKRAMTSFMFFSSDFRS----KHSDL-SIVEMSKAAGAA

Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)

Affine Gap Penalty

[Figure: Cost of a gap plotted against its length L; opening the gap costs GOP, each extension costs GEP]

Parsimony: Evolution takes the simplest path (So We Think)
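An affine gap penalty can be written as a one-line function. The convention used below, cost = GOP + L x GEP for a gap of length L, is one of several in use (some programs charge GEP only from the second position on), and the default values 10 and 1 are placeholders chosen for the example, not values from the slides.

```python
def affine_gap_cost(length, gop=10.0, gep=1.0):
    """Cost of a single gap of `length` residues under an affine penalty.

    Assumed convention: GOP is paid once, GEP is paid for every gap position.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


if __name__ == "__main__":
    one_long_gap = affine_gap_cost(4)
    two_short_gaps = affine_gap_cost(1) + affine_gap_cost(3)
    print(one_long_gap, two_short_gaps)
```

Because opening costs more than extending, one gap of four residues is cheaper than two gaps covering the same four residues, which is the parsimony argument of the slide.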


Global Alignments
-Take 2 Nice Protein Sequences
-A good Substitution Matrix (blosum)
-A Gap opening Penalty (GOP)
-A Gap extension Penalty (GEP)
-DYNAMIC PROGRAMMING

>Seq1
THEFATCAT
>Seq2
THEFASTCAT

DYNAMIC PROGRAMMING ->

THEFA-TCAT
THEFASTCAT

Global Alignments
DYNAMIC PROGRAMMING

Brute Force Enumeration

FAST against FAT, one of the many possible alignments:

----FAT
FAST---

Number of alignments to enumerate: (L1+L2)! / ( (L1)! * (L2)! )
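The formula above grows very quickly, which is why brute force is not an option. The sketch below evaluates it with Python's `math.comb`, since (L1+L2)! / (L1)!(L2)! is the binomial coefficient "L1+L2 choose L1"; the function name `n_alignments` is illustrative.

```python
import math


def n_alignments(l1, l2):
    """Evaluate (L1+L2)! / (L1! * L2!) for two sequence lengths."""
    return math.comb(l1 + l2, l1)


if __name__ == "__main__":
    print(n_alignments(4, 3))       # FAST against FAT
    print(n_alignments(100, 100))   # two short proteins
```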

Global Alignments
DYNAMIC PROGRAMMING

Dynamic Programming (Needleman and Wunsch)

Match=1, MisMatch=-1, Gap=-1

          F    A    S    T
     0   -1   -2   -3   -4
F   -1    1    0   -1   -2
A   -2    0    2    1    0
T   -3   -1    1    1    2

F A S T
F A - T

Global Alignments
DYNAMIC PROGRAMMING

-Global Alignments are very sensitive to gap Penalties (GOP, GEP)
-Global Alignments do not take into account the MODULAR nature of Proteins

C: Vitamin K dep. Ca Binding
K: Kringle Domain
G: Growth Factor module
F: Finger Module

Local Alignments

GLOBAL Alignment

LOCAL Alignment

Smith And Waterman (SW)=LOCAL Alignment
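The difference with a global alignment fits in one line of the recursion: a cell is never allowed to go below zero, and the best cell anywhere in the matrix is the answer. The sketch below returns only the best local score; it reuses Match=1, MisMatch=-1, Gap=-1 from the global example as placeholder values, and the function name `smith_waterman_score` is illustrative.

```python
def smith_waterman_score(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between s1 and s2 (flat gap cost)."""
    best = 0
    previous = [0] * (len(s2) + 1)  # row of the matrix above the current one
    for a in s1:
        current = [0]
        for j, b in enumerate(s2, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            cell = max(0,                    # restart: local alignment only
                       diagonal,             # (mis)match
                       previous[j] + gap,    # gap in s2
                       current[j - 1] + gap) # gap in s1
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXXXCATYYYY", "ZZCATZZ"))
```

Two proteins that share a single module can therefore get a good local score even when the rest of their sequences has nothing in common.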


Local Alignments
We now have a PairWise Comparison Algorithm,
We are ready to search Databases


Database Search

[Figure: the QUERY (Q) is compared by the Comparison Engine (SW) with every sequence of the Database; each hit gets a score and an E-value]

E-values
How many times do we expect such an Alignment by chance?
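The question behind the E-value can be explored empirically by shuffling. The sketch below is not how database search programs compute E-values (they rely on a statistical model of scores); it only estimates how often a shuffled version of the target scores at least as well as the real one. The score used here, a count of identical positions without gaps, and the names `identity_score` and `chance_fraction` are assumptions made to keep the example short.

```python
import random


def identity_score(a, b):
    """Number of identical positions between two ungapped sequences."""
    return sum(x == y for x, y in zip(a, b))


def chance_fraction(query, target, n_shuffles=1000, seed=0):
    """Fraction of shuffled targets scoring at least as well as the real one."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = identity_score(query, target)
    letters = list(target)
    as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # same composition, random order
        if identity_score(query, letters) >= observed:
            as_good += 1
    return as_good / n_shuffles


if __name__ == "__main__":
    print(chance_fraction("THEFASTCAT", "THEFASTCAT"))
```

A fraction close to zero means the match is unlikely to be a product of composition alone; a fraction close to one means it is no better than noise.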


CONCLUSION


Sequence Comparison

-Thanks to evolution, We CAN compare Sequences
-There is a relation between Sequence and Structure.
-Substitution matrices only work well with similar Sequences (More than 30% id).
-The Easiest way to Compare Two Sequences is a dotplot.

A few Addresses
