
MATH0011 Lecture 3

2007/02/16

MATH0011
Numbers and Patterns in Nature and Life

Lecture 3

Molecular Evolution (1)

http://147.8.101.93/MATH0011/

Contents of this and next lectures

DNA mutations
Probability and matrix models
Phylogenetic distance
Phylogenetic trees and their construction

Introduction

Taxonomy: classification of organisms into a hierarchical scheme, with each taxon being an element in the taxonomy.

Traditional taxonomy

Construct a hierarchical classification by making intuitive or subjective decisions concerning similarities of characteristics among species.
Aristotle divided animals into those that have red blood and those that do not, then subdivided them into subgroups whose young are born alive, or in eggs, or as pupae, etc.


Numerical taxonomy

Make a list of characters, and assign numerical values to them for each species considered. Then, based on the quantities obtained, classify the species.

Evolutionary taxonomy

A powerful tool: tracking DNA (and other similar substances) of species to determine the phylogenetic relationships among the species.
Theory based on the following assumptions:
DNA of organisms mutates (slightly) over generations.
Species descended from a common ancestor should have DNA sequences similar to each other.

DNA structure

DNA (deoxyribonucleic acid)
primary chemical component of chromosomes in cells.
material of which genes are made.
sometimes called the molecule of heredity because parents transmit copied portions of their own DNA to offspring during reproduction.

Pairs of molecules entwine like vines to form a double helix.
Each vine-like molecule is a strand of DNA: a chemically linked chain of nucleotides, each consisting of a sugar, a phosphate and one of four kinds of aromatic bases: adenine (A), thymine (T), cytosine (C), and guanine (G).
A DNA molecule has a directional sense: ATCC is different from CCTA. Thus each strand may be considered as simply a sequence composed of the four letters A, T, C, G.


DNA structure

In a DNA double helix, two strands come together through complementary pairing of the bases through hydrogen bonding.
Each base forms hydrogen bonds readily to only one other: A to T and C to G.
Thus the nucleotide sequences of the two strands are complementary to each other. E.g.:
AGCCGCGTATTAG
TCGGCGCATAATC

DNA replication

First, the hydrogen bonds between the two strands break, and the strands separate.
A complementary strand forms on each of the two separated strands, so two new double strands are formed.
Though they rarely happen, errors in the formation of new double strands may occur: DNA mutation.
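The pairing rule above (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the names `PAIRING` and `complement` are made up here, and the function pairs bases site by site, exactly as in the two-line example on the slide, without reversing the strand.

```python
# Base-pairing rule from the slide: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the strand that pairs base-by-base with the given one."""
    return "".join(PAIRING[base] for base in strand)


print(complement("AGCCGCGTATTAG"))  # TCGGCGCATAATC
```

Applying `complement` twice returns the original strand, which mirrors how each separated strand can rebuild its partner during replication.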

Examples of DNA mutations

Suppose original sequence is CGTGACTTCC
base substitution: CGGGACTTCC
base insertion: CGATGACTTCC
base deletion: CG GACTTCC
sequence insertion: CGTATTAGGACTTCC
sequence deletion: CG CTTCC
sequence duplication: CGTGACTTGACTTCC
sequence inversion: CGTTCAGTCC

DNA mutations

Base substitution is the most common mutation.
For simplicity, we shall assume no DNA mutations other than base substitution in our model.
transitions: purine (A or G) → purine, or pyrimidine (C or T) → pyrimidine
transversions: purine → pyrimidine, or pyrimidine → purine
Transitions are more frequently observed; transversions are less frequent.
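As a small illustration of the transition / transversion distinction, the sketch below classifies a single base substitution. The function name `classify_substitution` is hypothetical and chosen only for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Classify a single-base change as a transition or a transversion."""
    if old == new:
        return "no change"
    pair = {old, new}
    # A transition stays within one chemical class; a transversion crosses.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


# The base substitution example CGTGACTTCC -> CGGGACTTCC replaces T by G.
print(classify_substitution("T", "G"))  # transversion
print(classify_substitution("A", "G"))  # transition
```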


DNA mutations

Problem: deduce the amount of mutation that occurred during evolutionary descent of DNA sequences.
E.g.: ancestor sequence S0: ACCTGCGCTA...
intermediate sequence S1: ACGTGCACTA...
descendent sequence S2: ACGTGCGCTA
If only S0 and S2 are compared, one would notice only 1 mutation (ratio = 1/10).
If S1 is also considered, then 3 mutations are observed (ratio = 3/10).
Here G → A → G at the 7th site is a hidden mutation.

Also, if there is a site at which G → T → A occurs, then only 1 mutation is noticed if only the initial and final sequences are observed.
Fixing methods:
Either assume that mutations are very rare, so that no more than 1 mutation will occur at the same site.
Or use a suitable mathematical tool: Probability and Matrix
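The counting in the S0, S1, S2 example can be checked with a short sketch. The helper name `count_differences` is an assumption made for this illustration; it simply counts the sites at which two aligned sequences differ, using only the ten bases shown on the slide.

```python
def count_differences(x: str, y: str) -> int:
    """Number of sites at which two aligned sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(a != b for a, b in zip(x, y))


S0 = "ACCTGCGCTA"  # ancestor
S1 = "ACGTGCACTA"  # intermediate
S2 = "ACGTGCGCTA"  # descendant

observed = count_differences(S0, S2)  # comparing only the end points
actual = count_differences(S0, S1) + count_differences(S1, S2)
print(observed, actual)  # 1 3
```

The gap between the two numbers is the hidden mutation at the 7th site.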

Introduction to Probability

Probability of a certain outcome is a number, denoted by P(outcome) or simply P, which indicates the likelihood of occurrence of that outcome.
Equivalently, the probability of an outcome gives the expected percentage of trials in which that outcome will occur, if the trial is repeated many, many times.
0 ≤ P ≤ 1
P = 0 : the outcome will never occur.
P = 1 : the outcome will occur 100% of the time.

Some simple examples

When flipping a fair coin, P(heads) = 1/2, P(tails) = 1/2.
When tossing a fair die, the probability of getting a 4 is P(4) = 1/6.
True or false: When you buy a Mark 6 lottery ticket, there are only 2 outcomes: either you win the lottery or you lose. Therefore, the probability that you win the lottery is P(win) = 1/2 ???


Event

An event: a set that groups several outcomes together.
We say an event occurs if any of the outcomes in the event occurs.
Example: when tossing a die, we may consider the event of getting a 1 or a 2, or the event of getting an even number:
E_{1 or 2} = {1,2}, E_even = {2,4,6}. Also, E_3 = {3}.
P(E_{1 or 2}) = 2/6, P(E_even) = 3/6, P(E_3) = P(3) = 1/6
Example: for DNA bases,
E_purine = {A,G}, E_pyrimidine = {C,T}, E_{not A} = {G,C,T}

A simple example on DNA sequence

Suppose a 20-base sequence reads as
AGCCTACTGGCCAGGACCTC
What is the probability that the next base, in site 21, is an A?
Suppose we have no further information on this DNA. We may assume the bases have been chosen at random, and treat each base as a trial of some random process.
As there are 4 As in this 20-base sequence, we estimate the probability of an A in site 21 as 4/20.
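The 4/20 estimate is a relative frequency, and the same count can be made for every base. The sketch below is one way to do it; the variable names `counts` and `estimates` are chosen for this illustration only.

```python
from collections import Counter
from fractions import Fraction

sequence = "AGCCTACTGGCCAGGACCTC"

# Count how often each base occurs, then divide by the sequence length.
counts = Counter(sequence)
estimates = {base: Fraction(counts[base], len(sequence)) for base in "AGCT"}

print(counts["A"], estimates["A"])  # 4 1/5
```

`Fraction` reduces 4/20 to 1/5, which is the value 0.2 used for P(A) later in the lecture.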

Union of events

Union of events E and F, denoted by E ∪ F: the event that either E or F occurs. Equivalently, E ∪ F is the set of outcomes that appear in either E or F, or in both.
Example: tossing a die: E_even ∪ E_{≤3} = {2,4,6} ∪ {1,2,3} = {1,2,3,4,6}
Disjoint events: events E and F are said to be mutually exclusive (or disjoint) if it is impossible for them to happen simultaneously.
E.g., when tossing a die, E_1 and E_even are mutually exclusive, but E_{≤3} and E_even are not.

Probability of union of events

Addition rule (for union of disjoint events)
If events E and F are mutually exclusive, then
P(E ∪ F) = P(E) + P(F)
Note: the above does not hold for events that are not mutually exclusive. In fact if E and F are not mutually exclusive, then
P(E ∪ F) ≠ P(E) + P(F)
Extension of the addition rule: if events E_1, E_2, ..., E_n are pairwise disjoint, then
P(E_1 ∪ E_2 ∪ ... ∪ E_n) = P(E_1) + P(E_2) + ... + P(E_n)
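The addition rule can be checked on the die example by treating events as Python sets, where `|` is set union. This is a minimal sketch assuming a fair die; the helper `prob` and the name `E_le3` (for the event {1,2,3}) are introduced only for this illustration.

```python
from fractions import Fraction

OUTCOMES = frozenset(range(1, 7))  # faces of a fair die


def prob(event) -> Fraction:
    """Probability of an event when all six outcomes are equally likely."""
    return Fraction(len(event), len(OUTCOMES))


E_1 = {1}
E_even = {2, 4, 6}
E_le3 = {1, 2, 3}

# Disjoint events: the addition rule holds.
disjoint_union = prob(E_1 | E_even)
disjoint_sum = prob(E_1) + prob(E_even)

# Overlapping events (both contain 2): the rule fails.
overlap_union = prob(E_even | E_le3)
overlap_sum = prob(E_even) + prob(E_le3)

print(disjoint_union, disjoint_sum)  # 2/3 2/3
print(overlap_union, overlap_sum)    # 5/6 1
```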


Complementary event

Complementary event of E, denoted by E^c, is the event composed of all those outcomes not in E.
E.g., when tossing a die, (E_even)^c = E_odd = {1,3,5}, (E_4)^c = {1,2,3,5,6}
Fact: P(E^c) = 1 − P(E)
Example: in the previous DNA sequence example, since P(A) = 0.2, one has P({C,G,T}) = 0.8.

Intersection of events

Intersection of events E and F, denoted by E ∩ F: the event that both E and F occur. Equivalently, E ∩ F is the set of outcomes that appear in both E and F.
Example: tossing a die: E_even ∩ E_{≤3} = {2,4,6} ∩ {1,2,3} = {2}
Example: flipping a coin and tossing a die together, there are 12 possible outcomes:
(head,1), (head,2), ..., (head,6), (tail,1), (tail,2), ..., (tail,6).
E_head = {(head,1), (head,2), ..., (head,6)}
E_2 = {(head,2), (tail,2)}, E_head ∩ E_2 = {(head,2)}

Independent events

Two events are said to be independent if whether or not one of them has occurred does not change the probability of occurrence of the other event.
E.g., in the previous example of coin-and-die, E_2 and E_head are independent events. But E_head and E_tail are not independent events.
Another example: tossing two fair dice. All possible outcomes: (1,1), (1,2), ..., (1,6), (2,1), ..., (2,6), ..., (6,6), a total of 36 outcomes. Question: are E_{sum=12} and E_{one die=5} independent events?

Probability of intersection of events

Multiplication rule (for intersection of independent events):
If E and F are independent events, then
P(E ∩ F) = P(E) P(F)
Note: If E and F are not independent, then
P(E ∩ F) ≠ P(E) P(F)
Extension of the multiplication rule: if events E_1, E_2, ..., E_n are independent, then
P(E_1 ∩ E_2 ∩ ... ∩ E_n) = P(E_1) P(E_2) ... P(E_n)
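The coin-and-die example can be enumerated directly to test the multiplication rule. The sketch below assumes a fair coin and a fair die, so that all 12 outcomes are equally likely; `prob` and `independent` are hypothetical helper names, and `&` is Python's set intersection.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely (coin, die) outcomes.
OUTCOMES = set(product(["head", "tail"], range(1, 7)))


def prob(event) -> Fraction:
    return Fraction(len(event), len(OUTCOMES))


def independent(e, f) -> bool:
    """Check the multiplication rule P(E and F) = P(E) P(F)."""
    return prob(e & f) == prob(e) * prob(f)


E_head = {o for o in OUTCOMES if o[0] == "head"}
E_tail = {o for o in OUTCOMES if o[0] == "tail"}
E_2 = {o for o in OUTCOMES if o[1] == 2}

print(independent(E_head, E_2))     # True
print(independent(E_head, E_tail))  # False
```

The same two helpers could be reused on the 36 outcomes of two dice to explore the question posed on the slide.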


Application to modeling of DNA mutation

Example: suppose we focus on a particular site in a DNA sequence, and we care only about whether the base is a purine (A or G) or a pyrimidine (C or T).
Suppose we know that with each generation, the base at this site has a 1.5% chance of undergoing a transversion (a change). Then
P(E_change) = 0.015, P(E_{no change}) = 0.985
Assume changes in different generations are independent of each other.
Consider changes over 2 generations, with 4 possibilities: change / no change, followed by change / no change
P(E_{change, change}) = (.015)(.015) = .000225
P(E_{change, no change}) = (.015)(.985) = .014775
P(E_{no change, change}) = (.985)(.015) = .014775
P(E_{no change, no change}) = (.985)(.985) = .970225
Probability of seeing no change from the original base in generation 0 to generation 2 is:
P(E_{no change, no change}) + P(E_{change, change}) = .970225 + .000225 = .97045.
This is slightly greater than the chance of no change having actually occurred.
Hidden mutation is recovered!
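The two-generation calculation can be reproduced exactly with rational arithmetic. This is a sketch under the slide's assumptions (1.5% chance of change per generation, independent generations); the names `step`, `paths` and `same_after_two` are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

p_change = Fraction(15, 1000)  # 1.5% chance of a transversion per generation
step = {"change": p_change, "no change": 1 - p_change}

# Independence across generations lets us multiply the one-step probabilities.
paths = {(a, b): step[a] * step[b] for a, b in product(step, repeat=2)}

# The purine/pyrimidine class is unchanged after 2 generations if it never
# changed, or if it changed and then changed back (a hidden mutation).
same_after_two = paths[("no change", "no change")] + paths[("change", "change")]

for outcome, p in paths.items():
    print(outcome, float(p))
print("same class after 2 generations:", float(same_after_two))
```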

Conditional probability

Conditional probability of event F given E (here E, F are two events), P(F | E): the probability of event F, given that event E has occurred.
E.g., tossing two fair dice:
E_{sum=10} = {(4,6),(5,5),(6,4)}
P(E_{first die=4} | E_{sum=10}) = 1/3
Mathematical definition of conditional probability:
P(F | E) = P(F ∩ E) / P(E).
Exercise: What is P(E_{both dice=5})?
What is P(E_{both dice=5} | E_{sum=10})?

Caution:
P(E | F) ≠ P(E ∩ F) in general
P(E | F) ≠ P(F | E) in general

Independent events (again)

Mathematical definition of independent events: events E and F are said to be independent if
P(E ∩ F) = P(E) P(F)
or equivalently, P(F | E) = P(F).
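The definition P(F | E) = P(F ∩ E) / P(E) can be tried out on the two-dice example by enumerating the 36 outcomes. The helpers `prob` and `conditional` are hypothetical names used only in this sketch, which assumes both dice are fair.

```python
from fractions import Fraction
from itertools import product

OUTCOMES = set(product(range(1, 7), repeat=2))  # 36 equally likely pairs


def prob(event) -> Fraction:
    return Fraction(len(event), len(OUTCOMES))


def conditional(f, e) -> Fraction:
    """P(F | E) = P(F and E) / P(E), assuming P(E) > 0."""
    return prob(f & e) / prob(e)


E_sum10 = {o for o in OUTCOMES if sum(o) == 10}
E_first4 = {o for o in OUTCOMES if o[0] == 4}

print(conditional(E_first4, E_sum10))  # 1/3
print(conditional(E_sum10, E_first4))  # 1/6
print(prob(E_first4 & E_sum10))        # 1/36
```

The three printed values differ, which illustrates both cautions on the slide: P(E | F) is in general neither P(E ∩ F) nor P(F | E).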


Example: Consider 40-base aligned sequences:
Ancestral sequence S0:
ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT
Descendent sequence S1:
ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC
Treating each site as a trial of the same probabilistic process, one wants to estimate P(S1=i | S0=j), where i, j = A, G, C, or T.

There are 9 sites in S0 which are As. At these 9 sites of S1, there are 7 As, 1 G, 0 Cs and 1 T. Hence we estimate
P(S1=A | S0=A) = 7/9, P(S1=G | S0=A) = 1/9,
P(S1=C | S0=A) = 0, P(S1=T | S0=A) = 1/9.

We may compile a frequency table for events S1=i and S0=j (where i, j = A, G, C, or T):
Each cell in the table is the number of occurrences of S1=i and S0=j at the same site in the sequences.
Each column sum is the total number of times base j (= A, G, C or T) occurs in S0.
Each row sum is the total number of times base i (= A, G, C or T) occurs in S1.

Dividing the number in each cell of the frequency table by the corresponding column sum, one obtains the conditional probabilities P(S1=i | S0=j).
Table of conditional probabilities P(S1=i | S0=j).
Note: all column sums = 1 in this table.
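The frequency table and the table of conditional probabilities can be computed from the two 40-base sequences given earlier. The sketch below is one possible way to do so; the nested dictionaries `counts` and `cond` are a layout chosen for this illustration, indexed as `counts[i][j]` with i the base in S1 (row) and j the base in S0 (column).

```python
from fractions import Fraction

BASES = "AGCT"  # the ordering A, G, C, T
S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"

# counts[i][j] = number of sites with S1 = i and S0 = j
counts = {i: {j: 0 for j in BASES} for i in BASES}
for j, i in zip(S0, S1):
    counts[i][j] += 1

# Column sum for j = number of times base j occurs in S0.
column_sums = {j: sum(counts[i][j] for i in BASES) for j in BASES}

# cond[i][j] estimates P(S1 = i | S0 = j).
cond = {i: {j: Fraction(counts[i][j], column_sums[j]) for j in BASES}
        for i in BASES}

print("S1\\S0", *BASES, sep="\t")
for i in BASES:
    print(i, *(counts[i][j] for j in BASES), sep="\t")
print([str(cond[i]["A"]) for i in BASES])  # ['7/9', '1/9', '0', '1/9']
```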


Basic Matrix model of molecular evolution

We assume that:
Only base substitutions may occur (i.e., no deletions, insertions, inversions, duplications).
Each site of the ancestral sequence, behaving identically and independently of every other site, appears randomly as A, G, C, or T according to probabilities P_A, P_G, P_C, P_T.
Note: each P_j ≥ 0, P_A + P_G + P_C + P_T = 1.
Over one time step, at each site, the base is subject to possible substitution according to conditional probabilities P(S1=i | S0=j). (E.g., if the base is G, then there is a chance of P(S1=T | S0=G) that it will change to T after 1 time step.)

Probability vector and transition matrix

Write probability vector of ancestral sequence
p0 = (P_A, P_G, P_C, P_T)
Arrange conditional probabilities of base substitutions into transition matrix:
\[
M = \begin{pmatrix}
P_{A|A} & P_{A|G} & P_{A|C} & P_{A|T} \\
P_{G|A} & P_{G|G} & P_{G|C} & P_{G|T} \\
P_{C|A} & P_{C|G} & P_{C|C} & P_{C|T} \\
P_{T|A} & P_{T|G} & P_{T|C} & P_{T|T}
\end{pmatrix}
\]
Here we write P(S1=i | S0=j) as P_{i|j} for simplicity.
Note:
Always use the ordering A, G, C, T.
Column sums of M are all 1.

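Multiplying M and p0 gives:
\[
M p_0 = \begin{pmatrix}
P_{A|A} & P_{A|G} & P_{A|C} & P_{A|T} \\
P_{G|A} & P_{G|G} & P_{G|C} & P_{G|T} \\
P_{C|A} & P_{C|G} & P_{C|C} & P_{C|T} \\
P_{T|A} & P_{T|G} & P_{T|C} & P_{T|T}
\end{pmatrix}
\begin{pmatrix} P_A \\ P_G \\ P_C \\ P_T \end{pmatrix}
\]
Note that
P_{i|j} P_j = P(S1=i | S0=j) P(S0=j) = P(S1=i and S0=j)
Therefore the 1st entry in the result of Mp0 is:
P_{A|A}P_A + P_{A|G}P_G + P_{A|C}P_C + P_{A|T}P_T
= P(S1=A and S0=A) + P(S1=A and S0=G) + P(S1=A and S0=C) + P(S1=A and S0=T)
= P(S1=A).
Similarly, the 2nd, 3rd, 4th entries of Mp0 are respectively P(S1=G), P(S1=C), and P(S1=T).
Hence Mp0 gives the probability vector p1 of the probabilities of the base at each site of descendent sequence S1 being A, G, C, or T.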
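To see the product Mp0 numerically, the sketch below uses a made-up transition matrix and a made-up probability vector. The numbers are hypothetical and are not taken from the lecture; they were chosen only so that every column of `M` and the vector `p0` sum to 1, with rows and columns in the order A, G, C, T.

```python
# Hypothetical transition matrix: M[i][j] = P(S1 = base i | S0 = base j),
# with rows and columns ordered A, G, C, T. Each column sums to 1.
M = [
    [0.90, 0.05, 0.02, 0.03],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.03, 0.02, 0.05, 0.90],
]
# Hypothetical base distribution of the ancestral sequence (P_A, P_G, P_C, P_T).
p0 = [0.1, 0.2, 0.3, 0.4]


def mat_vec(matrix, vector):
    """Matrix times column vector: entry i is sum over j of M[i][j] * p[j]."""
    return [sum(row[j] * vector[j] for j in range(len(vector)))
            for row in matrix]


p1 = mat_vec(M, p0)
print([round(x, 3) for x in p1])  # [0.118, 0.202, 0.298, 0.382]
```

Because the columns of `M` sum to 1, the entries of `p1` again sum to 1, so `p1` is itself a probability vector.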


Markov model for molecular evolution

We further assume that:
The probabilistic mutation process over subsequent time steps is the same as that of the 1st time step.
For each site, what happens during each subsequent time step depends only on what the base was at the beginning of that time step, and not on what that base was in previous time steps (i.e., the process has no memory).
This kind of model is a Markov model.

Powers of the matrix M

Let St denote the descendent sequence after t time steps.
Fact: for any t = 1,2,..., the matrix M^t (i.e., product of t copies of M) is equal to:
\[
M^t = \begin{pmatrix}
P(S_t=A \mid S_0=A) & P(S_t=A \mid S_0=G) & P(S_t=A \mid S_0=C) & P(S_t=A \mid S_0=T) \\
P(S_t=G \mid S_0=A) & P(S_t=G \mid S_0=G) & P(S_t=G \mid S_0=C) & P(S_t=G \mid S_0=T) \\
P(S_t=C \mid S_0=A) & P(S_t=C \mid S_0=G) & P(S_t=C \mid S_0=C) & P(S_t=C \mid S_0=T) \\
P(S_t=T \mid S_0=A) & P(S_t=T \mid S_0=G) & P(S_t=T \mid S_0=C) & P(S_t=T \mid S_0=T)
\end{pmatrix}
\]
Hence entries of M^t give conditional probabilities P(St=i | S0=j) where i, j = A, G, C, or T.

Properties of Markov matrices

The transition matrix M, also known as a Markov matrix in the Markov model, has the properties:
it is a square matrix,
all entries are nonnegative, and
all column sums = 1.
Theorem:
A Markov matrix always has 1 as an eigenvalue, and the magnitudes of all its other eigenvalues are not greater than 1. It also has an eigenvector with all nonnegative entries corresponding to the eigenvalue 1.
If all entries of the Markov matrix are positive, then 1 is its strictly dominant eigenvalue, with a unique corresponding eigenvector (up to scalar multiples) with all entries positive.
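The behaviour of M^t can be explored numerically. The sketch below uses a hypothetical Markov matrix with all entries positive (the numbers are invented for illustration, not taken from the lecture) and computes powers by repeated multiplication; `mat_mul` and `mat_pow` are helper names introduced here.

```python
# Hypothetical Markov matrix (order A, G, C, T); every column sums to 1.
M = [
    [0.90, 0.05, 0.02, 0.03],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.90, 0.05],
    [0.03, 0.02, 0.05, 0.90],
]


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """M^t for t >= 1, as a product of t copies of M."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result


M2 = mat_pow(M, 2)
# Entry (A, A) of M^2 is P(S2 = A | S0 = A); it includes paths such as
# A -> G -> A, where the base changes and then changes back.
print(round(M2[0][0], 4))  # 0.8138

M200 = mat_pow(M, 200)
column_sums = [sum(M200[i][j] for i in range(4)) for j in range(4)]
print([round(s, 6) for s in column_sums])   # each column still sums to 1
print([round(x, 6) for x in M200[0]])       # entries approach 0.25
```

For this particular matrix the rows also sum to 1, so the eigenvector for eigenvalue 1 has equal entries, and every column of M^t approaches (0.25, 0.25, 0.25, 0.25) as t grows. A different Markov matrix would in general approach a different limiting vector.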


References

Mathematical Models in Biology: An Introduction, E.S. Allman and J.A. Rhodes, Cambridge University Press, 2004.
Life's Other Secret, I. Stewart, John Wiley & Sons, 1998.
An Introduction to Mathematical Taxonomy, G. Dunn and B.S. Everitt, Cambridge University Press, 1982.
