MATH0011 Lecture 3
2007/02/16

MATH0011
Numbers and Patterns in Nature and Life
Lecture 3
Molecular Evolution (1)
http://147.8.101.93/MATH0011/

Contents of this and next lectures
DNA mutations
Probability and matrix models
Phylogenetic distance
Phylogenetic trees and their construction

Introduction
Taxonomy: the classification of organisms into a hierarchical scheme, with each taxon being an element in the taxonomy.

Traditional taxonomy
Construct a hierarchical classification by making intuitive or subjective decisions concerning similarities of characteristics among species.
Aristotle divided animals into those that have red blood and those that do not, then subdivided them into subgroups according to whether their young are born alive, in eggs, as pupae, etc.

Numerical taxonomy
Make a list of characters and assign numerical values to them for each species considered. Then classify the species based on the values obtained.

Evolutionary taxonomy
A powerful tool: tracking the DNA (and other similar substances) of species to determine the phylogenetic relationships among them.
The theory is based on the following assumptions:
The DNA of organisms mutates (slightly) over generations.
Species descended from a common ancestor should have DNA sequences similar to each other.

DNA structure
DNA (deoxyribonucleic acid):
the primary chemical component of chromosomes in cells;
the material of which genes are made;
sometimes called the "molecule of heredity", because parents transmit copied portions of their own DNA to offspring during reproduction.

Pairs of molecules entwine like vines to form a double helix.
Each vine-like molecule is a strand of DNA: a chemically linked chain of nucleotides, each consisting of a sugar, a phosphate and one of four kinds of aromatic bases: adenine (A), thymine (T), cytosine (C), and guanine (G).
A DNA molecule has a directional sense: ATCC is different from CCTA. Thus each strand may be considered simply as a sequence composed of the four letters A, T, C, G.

DNA structure
In a DNA double helix, the two strands are held together by complementary pairing of the bases through hydrogen bonding.
Each base forms hydrogen bonds readily with only one other: A with T and C with G.
Thus the nucleotide sequences of the two strands are complementary to each other. E.g.:
AGCCGCGTATTAG
TCGGCGCATAATC

DNA replication
First, the hydrogen bonds between the two strands break and the strands separate.
A complementary strand then forms on each of the two separated strands, giving two new double strands.
Although it happens rarely, errors may occur in the formation of the new double strands: this is DNA mutation.

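The pairing rule for the two strands (A with T, C with G) can be sketched in a few lines of Python. This is only an illustration: the names `PAIRING` and `complement` are our own, and the function pairs bases position by position, as in the two-line example on the slide, without modelling strand direction.

```python
# Complementary base pairing: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a strand (same left-to-right order)."""
    return "".join(PAIRING[base] for base in strand)


top = "AGCCGCGTATTAG"       # first strand from the slide
bottom = complement(top)    # partner strand built from the pairing rule
print(top)
print(bottom)
```

Applying `complement` twice returns the original strand, which reflects the fact that each of the two separated strands carries enough information to rebuild the other during replication.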
Examples of DNA mutations
Suppose the original sequence is CGTGACTTCC. (A "-" below marks a deleted base.)
base substitution: CGGGACTTCC
base insertion: CGATGACTTCC
base deletion: CG-GACTTCC
sequence insertion: CGTATTAGGACTTCC
sequence deletion: CG---CTTCC
sequence duplication: CGTGACTTGACTTCC
sequence inversion: CGTTCAGTCC

DNA mutations
Base substitution is the most common mutation.
For simplicity, we shall assume no DNA mutations other than base substitution in our model.
transitions: purine (A or G) → purine, or pyrimidine (C or T) → pyrimidine
transversions: purine → pyrimidine, or pyrimidine → purine
Transitions are observed more frequently than transversions.

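As a small illustration of the transition/transversion distinction, the following Python sketch classifies a single-base substitution. The function name `classify_substitution` and the return strings are our own choices, and the input is assumed to consist only of the letters A, G, C, T.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(old: str, new: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    if old == new:
        return "no change"
    pair = {old, new}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"      # stays within the same chemical class
    return "transversion"        # switches between purine and pyrimidine


# The base-substitution example from the slide: CGTGACTTCC -> CGGGACTTCC.
original = "CGTGACTTCC"
mutated = "CGGGACTTCC"
changes = [
    (site + 1, a, b, classify_substitution(a, b))
    for site, (a, b) in enumerate(zip(original, mutated))
    if a != b
]
print(changes)  # [(3, 'T', 'G', 'transversion')]
```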
DNA mutations
Problem: deduce the amount of mutation that occurred during the evolutionary descent of DNA sequences.
E.g.:
ancestor sequence S0: ACCTGCGCTA...
intermediate sequence S1: ACGTGCACTA...
descendant sequence S2: ACGTGCGCTA...
If only S0 and S2 are compared, one notices only 1 mutation (ratio = 1/10).
If S1 is also considered, then 3 mutations are observed (ratio = 3/10).
Here G → A → G at the 7th site is a hidden mutation.

Also, if there is a site at which G → T → A occurs, then only 1 mutation is noticed if only the initial and final sequences are observed.
Ways to deal with this:
Either assume that mutations are very rare, so that no more than 1 mutation occurs at the same site.
Or use a suitable mathematical tool: probability and matrices.

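The counts in the S0, S1, S2 example can be checked with a short Python sketch that compares aligned sequences site by site. The helper name `count_differences` is ours; only the first ten bases shown on the slide are used.

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


S0 = "ACCTGCGCTA"  # ancestor
S1 = "ACGTGCACTA"  # intermediate
S2 = "ACGTGCGCTA"  # descendant

# Comparing only the ancestor and the final descendant.
observed = count_differences(S0, S2)
# Following the lineage step by step reveals the hidden mutation.
actual = count_differences(S0, S1) + count_differences(S1, S2)

print(observed, actual)  # 1 3
```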
Introduction to Probability
The probability of a certain outcome is a number, denoted by P(outcome) or simply P, which indicates the likelihood of occurrence of that outcome.
Equivalently, the probability of an outcome gives the expected fraction of trials in which that outcome will occur, if the trial is repeated many, many times.
0 ≤ P ≤ 1
P = 0 : the outcome will never occur.
P = 1 : the outcome will occur 100% of the time.

Some simple examples
When flipping a fair coin, P(heads) = 1/2, P(tails) = 1/2.
When tossing a fair die, the probability of getting a 4 is P(4) = 1/6.
True or false: When you buy a Mark 6 lottery ticket, there are only 2 outcomes: either you win the lottery or you lose. Therefore, the probability that you win the lottery is P(win) = 1/2 ???

Event
An event: a set that groups several outcomes together.
We say an event occurs if any of the outcomes in the event occurs.
Example: when tossing a die, we may consider the event of getting a 1 or a 2, or the event of getting an even number:
E_{1 or 2} = {1,2}, E_even = {2,4,6}. Also, E_3 = {3}.
P(E_{1 or 2}) = 2/6, P(E_even) = 3/6, P(E_3) = P(3) = 1/6
Example: for DNA bases,
E_purine = {A,G}, E_pyrimidine = {C,T}, E_{not A} = {G,C,T}

A simple example on DNA sequence
Suppose a 20-base sequence reads
AGCCTACTGGCCAGGACCTC
What is the probability that the next base, at site 21, is an A?
Suppose we have no further information on this DNA. We may assume the bases have been chosen at random, and treat each base as a trial of some random process.
As there are 4 As in this 20-base sequence, we estimate the probability of an A at site 21 as 4/20.

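The estimate for the 20-base sequence can be reproduced by counting bases. The sketch below uses Python's standard `collections.Counter` and `fractions.Fraction`; the variable names are ours, and the same counting gives estimates for the other three bases as well.

```python
from collections import Counter
from fractions import Fraction

sequence = "AGCCTACTGGCCAGGACCTC"

# Count how often each base occurs among the 20 observed sites.
counts = Counter(sequence)

# Relative frequency of each base, used as the estimated probability
# for the base at the next site (site 21).
estimates = {base: Fraction(counts[base], len(sequence)) for base in "AGCT"}

for base, p in estimates.items():
    print(base, counts[base], p)
```

Note that `Fraction` reduces 4/20 to 1/5, which is the value 0.2 used for P(A) later in the lecture.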
Union of events
Union of events E and F, denoted by E ∪ F: the event that either E or F occurs. Equivalently, E ∪ F is the set of outcomes that appear in E or in F, or in both.
Example: tossing a die:
E_even ∪ E_{≤3} = {2,4,6} ∪ {1,2,3} = {1,2,3,4,6}
Disjoint events: events E and F are said to be mutually exclusive (or disjoint) if it is impossible for them to happen simultaneously.
E.g., when tossing a die, E_1 and E_even are mutually exclusive, but E_{≤3} and E_even are not.

Probability of union of events
Addition rule (for union of disjoint events):
If events E and F are mutually exclusive, then
P(E ∪ F) = P(E) + P(F)
Note: the above does not hold for events that are not mutually exclusive. In fact, if E and F are not mutually exclusive, the outcomes they share are counted twice on the right-hand side, so that
P(E ∪ F) < P(E) + P(F)
whenever those shared outcomes have positive probability.
Extension of the addition rule: if events E_1, E_2, ..., E_n are pairwise disjoint, then
P(E_1 ∪ E_2 ∪ ... ∪ E_n) = P(E_1) + P(E_2) + ... + P(E_n)

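The addition rule can be checked on the die examples by treating events as Python sets. This is a sketch under the assumption of one fair six-sided die; the helper `prob` and the name `E_le3` (for the event of a result of at most 3) are ours.

```python
from fractions import Fraction

DIE = frozenset(range(1, 7))  # outcomes of one fair die


def prob(event):
    """Probability of an event (a set of faces) for one fair die."""
    return Fraction(len(event & DIE), len(DIE))


E_1 = {1}
E_even = {2, 4, 6}
E_le3 = {1, 2, 3}

# Disjoint events: the addition rule holds.
disjoint_ok = prob(E_1 | E_even) == prob(E_1) + prob(E_even)

# Overlapping events: the shared outcome 2 would be counted twice.
union_prob = prob(E_even | E_le3)
naive_sum = prob(E_even) + prob(E_le3)

print(disjoint_ok, union_prob, naive_sum)
```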
Complementary event
The complementary event of E, denoted by E^c, is the event composed of all those outcomes not in E.
E.g., when tossing a die, (E_even)^c = E_odd = {1,3,5}, (E_4)^c = {1,2,3,5,6}.
Fact: P(E^c) = 1 - P(E)
Example: in the previous DNA sequence example, since P(A) = 0.2, one has P({C,G,T}) = 0.8.

Intersection of events
Intersection of events E and F, denoted by E ∩ F: the event that both E and F occur. Equivalently, E ∩ F is the set of outcomes that appear in both E and F.
Example: tossing a die:
E_even ∩ E_{≤3} = {2,4,6} ∩ {1,2,3} = {2}
Example: flipping a coin and tossing a die together, there are 12 possible outcomes:
(head,1), (head,2), ..., (head,6), (tail,1), (tail,2), ..., (tail,6).
E_head = {(head,1), (head,2), ..., (head,6)}
E_2 = {(head,2), (tail,2)}, E_head ∩ E_2 = {(head,2)}

Independent events
Two events are said to be independent if whether or not one of them has occurred does not change the probability of occurrence of the other.
E.g., in the previous coin-and-die example, E_2 and E_head are independent events, but E_head and E_tail are not.
Another example: tossing two fair dice. All possible outcomes: (1,1), (1,2), ..., (1,6), (2,1), ..., (2,6), ..., (6,6), a total of 36 outcomes. Question: are E_{sum=12} and E_{one die=5} independent events?

Probability of intersection of events
Multiplication rule (for intersection of independent events):
If E and F are independent events, then
P(E ∩ F) = P(E) P(F)
Note: if E and F are not independent, then
P(E ∩ F) ≠ P(E) P(F)
Extension of the multiplication rule: if events E_1, E_2, ..., E_n are independent, then
P(E_1 ∩ E_2 ∩ ... ∩ E_n) = P(E_1) P(E_2) ... P(E_n)

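The coin-and-die example can be enumerated directly. The sketch below assumes all 12 outcomes are equally likely and uses the multiplication rule as the test for independence; the helper names `prob` and `independent` are ours.

```python
from fractions import Fraction
from itertools import product

# All 12 equally likely outcomes of flipping a coin and tossing a die.
coin_die = set(product(["head", "tail"], range(1, 7)))


def prob(event, space):
    """Probability of an event when all outcomes in space are equally likely."""
    return Fraction(len(event), len(space))


def independent(e, f, space):
    """Check the multiplication rule P(E and F) = P(E) P(F)."""
    return prob(e & f, space) == prob(e, space) * prob(f, space)


E_head = {o for o in coin_die if o[0] == "head"}
E_tail = {o for o in coin_die if o[0] == "tail"}
E_2 = {o for o in coin_die if o[1] == 2}

print(E_head & E_2)                           # {('head', 2)}
print(independent(E_head, E_2, coin_die))     # True
print(independent(E_head, E_tail, coin_die))  # False
```

The same two helpers can be applied to the 36 outcomes of two dice to explore the question posed on the slide.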
Application on modeling of DNA mutation
Example: suppose we focus on a particular site in a DNA sequence, and we care only about whether the base is a purine (A or G) or a pyrimidine (C or T).
Suppose we know that with each generation, the base at this site has a 1.5% chance of undergoing a transversion (a change). Then
P(E_change) = 0.015, P(E_{no change}) = 0.985
Assume that changes in different generations are independent of each other.
Consider changes over 2 generations, with 4 possibilities: change / no change, followed by change / no change.
P(E_{change, change}) = (.015)(.015) = .000225
P(E_{change, no change}) = (.015)(.985) = .014775
P(E_{no change, change}) = (.985)(.015) = .014775
P(E_{no change, no change}) = (.985)(.985) = .970225
The probability of seeing no change from the original base in generation 0 to generation 2 is:
P(E_{no change, no change}) + P(E_{change, change}) = .970225 + .000225 = .97045.
This is slightly greater than .970225, the probability that no change actually occurred.
The hidden mutation is accounted for!

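The two-generation calculation can be reproduced exactly with fractions. This sketch takes the 1.5% transversion chance and the independence between generations from the example above; the dictionary names are ours.

```python
from fractions import Fraction
from itertools import product

p_change = Fraction(15, 1000)  # 1.5% chance of a transversion per generation
step = {"change": p_change, "no change": 1 - p_change}

# Independent generations: multiply the one-step probabilities.
two_gen = {(first, second): step[first] * step[second]
           for first, second in product(step, repeat=2)}

# The base class looks unchanged after 2 generations if it never changed,
# or if it changed twice (the hidden mutation).
looks_unchanged = (two_gen[("no change", "no change")]
                   + two_gen[("change", "change")])

for outcome, p in two_gen.items():
    print(outcome, float(p))
print("looks unchanged:", float(looks_unchanged))
```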
Conditional probability
Conditional probability of event F given E (where E and F are two events), P(F | E): the probability of event F, given that event E has occurred.
E.g., tossing two fair dice:
E_{sum=10} = {(4,6),(5,5),(6,4)}
P(E_{first die=4} | E_{sum=10}) = 1/3
Mathematical definition of conditional probability:
P(F | E) = P(F ∩ E) / P(E).
Exercise: What is P(E_{both dice=5})?
What is P(E_{both dice=5} | E_{sum=10})?
Caution:
P(E | F) ≠ P(E ∩ F) in general
P(E | F) ≠ P(F | E) in general

Independent events (again)
Mathematical definition of independent events: events E and F are said to be independent if
P(E ∩ F) = P(E) P(F)
or equivalently (when P(E) > 0), P(F | E) = P(F).

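The two-dice example and the two cautions can be checked by enumeration. The sketch assumes two fair dice with 36 equally likely outcomes and applies the definition P(F | E) = P(F ∩ E) / P(E); the helper names are ours.

```python
from fractions import Fraction
from itertools import product

dice = set(product(range(1, 7), repeat=2))  # 36 equally likely outcomes


def prob(event):
    return Fraction(len(event), len(dice))


def cond_prob(f, e):
    """P(F | E) = P(F and E) / P(E); assumes P(E) > 0."""
    return prob(f & e) / prob(e)


E_sum10 = {o for o in dice if sum(o) == 10}
E_first4 = {o for o in dice if o[0] == 4}

print(cond_prob(E_first4, E_sum10))  # 1/3
print(cond_prob(E_sum10, E_first4))  # 1/6, so P(E|F) differs from P(F|E)
print(prob(E_first4 & E_sum10))      # 1/36, so P(E|F) differs from P(E and F)
```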
Example: Consider 40-base aligned sequences:
Ancestral sequence S0:
ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT
Descendant sequence S1:
ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC
Treating each site as a trial of the same probabilistic process, one wants to estimate P(S1=i | S0=j), where i, j = A, G, C, or T.
There are 9 sites in S0 which are As. At these 9 sites of S1, there are 7 As, 1 G, 0 Cs and 1 T. Hence we estimate
P(S1=A | S0=A) = 7/9, P(S1=G | S0=A) = 1/9,
P(S1=C | S0=A) = 0, P(S1=T | S0=A) = 1/9.

We may compile a frequency table for the events S1=i and S0=j (where i, j = A, G, C, or T). Counting from the two sequences above:

          S0=A   S0=G   S0=C   S0=T
S1=A        7      0      1      1
S1=G        1      9      2      0
S1=C        0      2      7      2
S1=T        1      0      1      6

Each cell in the table is the number of sites at which S1=i and S0=j occur together.
Each column sum is the total number of occurrences of base j (= A, G, C or T) in S0.
Each row sum is the total number of occurrences of base i (= A, G, C or T) in S1.

Dividing the number in each cell of the frequency table by the corresponding column sum, one obtains the conditional probabilities P(S1=i | S0=j).
Table of conditional probabilities P(S1=i | S0=j):

          S0=A   S0=G   S0=C   S0=T
S1=A      7/9    0      1/11   1/9
S1=G      1/9    9/11   2/11   0
S1=C      0      2/11   7/11   2/9
S1=T      1/9    0      1/11   6/9

Note: all column sums = 1 in this table.

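The frequency table and the table of conditional probabilities can be computed directly from the two 40-base sequences. The sketch below is one way to do it in Python; the variable names (`freq`, `cond`, and so on) are ours, and rows and columns follow the ordering A, G, C, T with rows indexed by S1 and columns by S0.

```python
from collections import Counter
from fractions import Fraction

BASES = "AGCT"
S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"

# Count site-by-site pairs (base in S1, base in S0).
pair_counts = Counter(zip(S1, S0))

# freq[r][c] = number of sites with S1 = BASES[r] and S0 = BASES[c].
freq = [[pair_counts[(i, j)] for j in BASES] for i in BASES]

# Column sums = number of occurrences of each base in S0.
col_sums = [sum(freq[r][c] for r in range(4)) for c in range(4)]

# Divide each cell by its column sum to estimate P(S1=i | S0=j).
cond = [[Fraction(freq[r][c], col_sums[c]) for c in range(4)]
        for r in range(4)]

for base, row in zip(BASES, freq):
    print("S1=" + base, row)
print("column sums:", col_sums)
```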
Basic Matrix model of molecular evolution
We assume that:
Only base substitutions may occur (i.e., no deletions, insertions, inversions, duplications).
Each site of the ancestral sequence, behaving identically and independently of every other site, appears randomly as A, G, C, or T according to probabilities P_A, P_G, P_C, P_T.
Note: each P_j ≥ 0, and P_A + P_G + P_C + P_T = 1.
Over one time step, at each site, the base is subject to possible substitution according to the conditional probabilities P(S1=i | S0=j). (E.g., if the base is G, then there is a chance of P(S1=T | S0=G) that it will change to T after 1 time step.)

Probability vector and transition matrix
Write the probability vector of the ancestral sequence (regarded as a column vector):
p0 = (P_A, P_G, P_C, P_T)
Arrange the conditional probabilities of base substitutions into the transition matrix:
$$
M = \begin{pmatrix}
P_{A|A} & P_{A|G} & P_{A|C} & P_{A|T} \\
P_{G|A} & P_{G|G} & P_{G|C} & P_{G|T} \\
P_{C|A} & P_{C|G} & P_{C|C} & P_{C|T} \\
P_{T|A} & P_{T|G} & P_{T|C} & P_{T|T}
\end{pmatrix}
$$
Here we write P(S1=i | S0=j) as P_{i|j} for simplicity.
Note:
Always use the ordering A, G, C, T.
Column sums of M are all 1.

Multiplying M and p0 gives:
$$
M p_0 = \begin{pmatrix}
P_{A|A}P_A + P_{A|G}P_G + P_{A|C}P_C + P_{A|T}P_T \\
P_{G|A}P_A + P_{G|G}P_G + P_{G|C}P_C + P_{G|T}P_T \\
P_{C|A}P_A + P_{C|G}P_G + P_{C|C}P_C + P_{C|T}P_T \\
P_{T|A}P_A + P_{T|G}P_G + P_{T|C}P_C + P_{T|T}P_T
\end{pmatrix}
$$
Note that
P_{i|j} P_j = P(S1=i | S0=j) P(S0=j) = P(S1=i and S0=j)
Therefore the 1st entry of Mp0 is:
P_{A|A} P_A + P_{A|G} P_G + P_{A|C} P_C + P_{A|T} P_T
= P(S1=A and S0=A) + P(S1=A and S0=G) + P(S1=A and S0=C) + P(S1=A and S0=T)
= P(S1=A).
Similarly, the 2nd, 3rd and 4th entries of Mp0 are respectively P(S1=G), P(S1=C), and P(S1=T).
Hence Mp0 gives the probability vector p1, whose entries are the probabilities that the base at each site of the descendant sequence S1 is A, G, C, or T.

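As a numerical check of the claim that Mp0 gives p1, the sketch below estimates M and p0 from the 40-base sequences S0 and S1 of the earlier example and multiplies them. Using the observed base frequencies of S0 as p0 is our assumption for the illustration, and `mat_vec` is our own helper, written in plain Python to avoid external libraries.

```python
from collections import Counter
from fractions import Fraction

BASES = "AGCT"  # always use the ordering A, G, C, T
S0 = "ACTTGTCGGATGATCAGCGGTCCATGCACCTGACAACGGT"
S1 = "ACATGTTGCTTGACGACAGGTCCATGCGCCTGAGAACGGC"
n = len(S0)

pairs = Counter(zip(S1, S0))   # (base in S1, base in S0) at each site
count0 = Counter(S0)

# Transition matrix: M[r][c] estimates P(S1 = BASES[r] | S0 = BASES[c]).
M = [[Fraction(pairs[(i, j)], count0[j]) for j in BASES] for i in BASES]

# Probability vector of the ancestral sequence (assumed: observed frequencies).
p0 = [Fraction(count0[j], n) for j in BASES]


def mat_vec(matrix, vector):
    """Matrix times column vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


p1 = mat_vec(M, p0)
print(p1)  # equals the observed base frequencies of S1
```

With these estimates, each product P_{i|j} P_j is simply the fraction of sites with S1=i and S0=j, so the entries of Mp0 add up to the base frequencies observed in S1.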
Markov model for molecular evolution
Let St denote the descendant sequence after t time steps.
We further assume that:
The probabilistic mutation process over each subsequent time step is the same as that of the 1st time step.
For each site, what happens during each subsequent time step depends only on what the base was at the beginning of that time step, and not on what the base was at earlier time steps (i.e., the process has no memory).
This kind of model is a Markov model.

Powers of the matrix M
Fact: for any t = 1, 2, ..., the matrix M^t (i.e., the product of t copies of M) is equal to:
$$
M^t = \begin{pmatrix}
P(S_t=A \mid S_0=A) & P(S_t=A \mid S_0=G) & P(S_t=A \mid S_0=C) & P(S_t=A \mid S_0=T) \\
P(S_t=G \mid S_0=A) & P(S_t=G \mid S_0=G) & P(S_t=G \mid S_0=C) & P(S_t=G \mid S_0=T) \\
P(S_t=C \mid S_0=A) & P(S_t=C \mid S_0=G) & P(S_t=C \mid S_0=C) & P(S_t=C \mid S_0=T) \\
P(S_t=T \mid S_0=A) & P(S_t=T \mid S_0=G) & P(S_t=T \mid S_0=C) & P(S_t=T \mid S_0=T)
\end{pmatrix}
$$
Hence the entries of M^t give the conditional probabilities P(St=i | S0=j), where i, j = A, G, C, or T.

Properties of Markov matrices
The transition matrix M, also known as a Markov matrix in the Markov model, has the properties:
it is a square matrix,
all entries are nonnegative, and
all column sums = 1.
Theorem:
A Markov matrix always has 1 as an eigenvalue, and the magnitudes of all its other eigenvalues are not greater than 1. It also has an eigenvector with all nonnegative entries corresponding to the eigenvalue 1.
If all entries of the Markov matrix are positive, then 1 is its strictly dominant eigenvalue, with a unique corresponding eigenvector (up to scalar multiples), all of whose entries are positive.

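To illustrate powers of a Markov matrix without a full 4-base example, the sketch below uses the simpler two-state purine/pyrimidine model from earlier in the lecture, with a 1.5% chance of change per time step. Reducing the model to a 2 x 2 matrix is our simplification, and `mat_mul` and `mat_pow` are our own helpers.

```python
from fractions import Fraction


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, t):
    """M^t for t = 1, 2, ... (product of t copies of M)."""
    result = m
    for _ in range(t - 1):
        result = mat_mul(result, m)
    return result


p = Fraction(15, 1000)            # chance of a purine/pyrimidine change per step
M = [[1 - p, p],
     [p, 1 - p]]                  # columns sum to 1: a Markov matrix

M2 = mat_pow(M, 2)
print(float(M2[0][0]))            # 0.97045, as in the two-generation example

M200 = mat_pow(M, 200)
print([float(x) for x in M200[0]])  # both entries close to 0.5
```

The entry `M2[0][0]` reproduces the probability .97045 of seeing the same base class after two generations. For large t every column of M^t approaches (1/2, 1/2), which is an eigenvector of this M for the eigenvalue 1, in line with the theorem.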