
Chapter Y

Bioinformatics of Phylogeny
The focus of this chapter is to address the bioinformatics of
phylogeny. Relying on the classical and modern views on evolution
pertinent to origin and diversity of species, details on phylogenetic
concepts and phylogenetic analysis are presented. Relevant software
classifications and available phylogenetic programs are discussed.

y.1 EVOLUTIONARY THEORY


The historic visit of Charles Darwin to the Galapagos Islands in 1835
led him to observe that each island differed from the others in its
natural fauna, and that all of the islands had distinct populations of
finches largely adapted to the natural environment of each individual
island. Further, birds with certain types of beaks flourished on
particular islands, implying that a selection process had been acting
on the populations. Darwin presented his observations in his
celebrated book, On the Origin of Species by Means of Natural
Selection, or the Preservation of Favoured Races in the Struggle for
Life [y.1], with the bold proposition that “all the organic beings
which have ever lived on this earth have descended from some one
primordial form”. He supported his hypothesis with the different
species of finch he had observed on the Galapagos Islands, and with
extensive examples from nature such as the vestigial wings of
flightless beetles, arguing that these could only be remnants of some
common ancestor.
Thus emerged the evolutionary history of all organisms, based on
Darwin’s general framework for inferring relationships among species
from morphological, paleontological and biogeographical details. Ever
since, the art of understanding the relationships among species across
the diversity of life seen on Earth has grown systematically.
Evolution is also known as biological, genetic or organic evolution.
It refers to the changes and modifications observed “in the inherited
traits of a population of organisms through successive generations”.
Here a “trait” is a specific anatomical, biochemical or behavioral
characteristic stemming from gene-environment interaction.
The observed changes arise from two sets of processes in a complex
system: (i) processes that introduce variation into a population; and
(ii) processes that eliminate variation. Under these opposing,
interacting processes, variants having specified traits “become more,
or less, common”.
In biological evolution, the variation indicated above is caused by
mutation, which introduces genetic modifications. These changes are
heritable and are conveyed through generations via reproduction,
giving rise “to alternative traits in organisms”. Another source of
variation is genetic recombination, wherein genes are shuffled into
new combinations, leading to organisms exhibiting different traits.
Variation can also be enhanced under certain circumstances as a result
of the transfer of genes between species.
When an organism takes up permanent residence within another, with the
two becoming a single functional organism, the process is known as
endosymbiosis (mitochondria and plastids are believed to have resulted
from endosymbiosis). Rarely, but significantly, variation may arise
from the “wholesale incorporation of genomes through endosymbiosis”.
Largely, the variants may become more common or rarer in a
population due to natural selection. The reason is that the natural
selection process allows two streams of traits: Traits that aid survival
and reproduction to become more common and those that hinder
survival and reproduction to become rarer.
In nature, the available resources are limited and organisms may
produce more offspring than their environment can support. Natural
selection therefore comes into play, inasmuch as only a small
proportion of individuals in each generation will survive and
reproduce. Over the generations, natural selection filters the
heritable variation in traits, successively retaining beneficial
changes through differential survival and reproduction. This iterative
process adjusts the traits, making them better suited to the
environment of the organism. Such adjustments are termed adaptations.
(However, not all change is adaptive.)
Genetic drift is another causative mechanism of evolution, leading to
random changes in the prevalence of traits in a population. Genetic
drift matters most when traits do not strongly influence survival,
particularly in small populations, where chance plays a
disproportionate role in the frequency of traits passed on to the next
generation.
The concept of genetic drift is popular in the applications of
the so-called neutral theory of molecular evolution, and it plays a role
in the molecular clocks of phylogenetic studies.
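The effect of population size on drift described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: it uses a simplified resampling model in which each offspring inherits a neutral variant with probability equal to the variant's current frequency, and the function name `simulate_drift`, the population sizes and the seed are all hypothetical choices made for illustration.

```python
import random


def simulate_drift(pop_size, start_freq, generations, seed=0):
    """Track the frequency of one neutral variant under pure random drift.

    Assumed model (not from the chapter): every generation is resampled
    from the previous one, with no selection and no new mutation.
    """
    rng = random.Random(seed)
    freq = start_freq
    history = [freq]
    for _ in range(generations):
        # Each offspring carries the variant with probability `freq`.
        carriers = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = carriers / pop_size
        history.append(freq)
    return history


# Hypothetical comparison: same starting frequency, different sizes.
small = simulate_drift(pop_size=10, start_freq=0.5, generations=50)
large = simulate_drift(pop_size=1000, start_freq=0.5, generations=50)
print("final frequency, small population:", small[-1])
print("final frequency, large population:", large[-1])
```

In runs of this kind the small population typically wanders far from its starting frequency, and may lose or fix the variant outright, whereas the large population tends to stay close to it.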
Speciation is a key component of evolution. It refers to the process
by which a single ancestral species splits and diversifies into
multiple new species, which can occur in several modes. A long series
of speciation events is responsible for the descent of all living (and
extinct) species from a common ancestor. These events are marks set
on a diverse "tree of life" (considered to have grown over the 3.5
billion years of life on Earth). Evidence of speciation is seen in
anatomical, genetic and other similarities between groups of
organisms, the geographical distribution of related species, the
fossil record etc.

y.1.1 Natural Selection


The underlying mechanism through which evolution takes place is called
natural selection. It refers to a process that acts on populations of
individual species and produces the diversity of life as seen now.
Natural selection hypothesizes that, in nature, any population
produces more offspring than the environment can support. As such,
there is an inevitable struggle for survival within the population,
and those individuals that are “best adapted to the environment” have
a better chance of surviving and producing offspring. A natural
selection process acting over a large number of generations leads to a
population that is highly adapted to its particular environmental
conditions. In certain situations and environments, a population may
become reproductively isolated, producing a new species.
In a biological ecosystem, there are biotic factors, meaning factors
related to life (species such as plants, animals, fungi, bacteria
etc.). To start with, the locale of such living factors is barren and
unoccupied. Subsequently, new organisms colonize the environment.
However, their successful proliferation (survival and reproduction)
relies on favorable ambient conditions in the area. Such environmental
conditions are dictated by abiotic factors such as habitat (pond,
lake, ocean, desert, mountain etc.) and weather features like
temperature, rain, snow etc. When a variety of species are present in
such an ecosystem, the actions of these species can influence the
lives of fellow species in the area, and these influences are deemed
biotic factors. Thus, biotic and abiotic factors combine to create a
complex biological ecosystem, that is, a community of living and
nonliving things considered as a unit.
Natural selection is not fixed across a population over time. Most
often, its template for the fortunate individuals changes constantly,
as dictated by external (exogenous) factors or abiotic influences
(such as climatic conditions, the emergence of new predators,
accessibility of food etc.). Further, no matter to what extent a
population has adapted to its environment, adaptation provides no
guarantee of the future survival of the species. An example is the
Irish Elk, known for its formidable antlers, whose existence is
suspected to have been due to constant sexual selection among fighting
males attempting to gain access to females. However, as the climate
changed quickly at the end of the last ice age, the combination of a
general reduction in food supply and increased hunting by humans led
the species to suffer from a condition similar to osteoporosis, which
is thought to have led eventually to its extinction [Stuart et al.,
2004].
As has been aptly said, “nothing in biology makes sense except in the
light of evolution” [Dobzhansky, T. 1973. Nothing in Biology Makes
Sense Except in the Light of Evolution. The American Biology Teacher,
35:125-129]. Thus, the theory of evolution points to a unique,
unifying and cohesive force that explains the origin and existence of
all forms of life. As stated in [FSU], “it is to the life sciences
what the long sought holy grail of the unified field theory is to
astrophysics”.
Every form of life is a descendant of a common ancestral origin. The
evolution of species is not static; it is a dynamic process that
unfolds over time. Darwin’s description of this process is one of
variation sorted through drift and selection in lineages that diverge.
In short, evolution implies “descent with modification”, so that
organisms bear a history; and the modifications observed are
stochastic epochs of that history, stemming from the statistical
nature of mutational changes, whether damaging or otherwise.

y.1.2 Modern views on Evolution: Synthetic Theory


With his theory of evolution, Darwin could not, however, explain the
underlying process and operational aspects of natural selection. This
was mainly due to the absence of a genetic account of inheritance,
which was later proposed by Mendel (in 1866), revealing the secrets of
the heredity and variability that prevail in populations. Only in the
20th century did neo-Darwinism come into being, with the synthetic
theory of evolution that blends the fields of genetics and evolution
[Huxley, 1974]. This concept formalizes genetic mutation as the
essential ingredient in producing the variation on which natural
selection acts. Further, with the advent of protein sequence data (in
the early 1960s), it became evident that the level of mutation among
populations could be much higher than that expected from advantageous
mutations alone [Barrell and Sanger, 1967]. Subsequently, a number of
theories were put forward to explain this observation. For example,
Kimura’s neutral theory of evolution specifies that most mutations at
the genetic level are neutral (meaning that they are neither
advantageous nor disadvantageous with respect to natural selection)
and that their fate is determined largely by the mutation rate and
effective population size [Kimura, 1968]. This concept concurs with
the theory of natural selection, wherein it is proposed that only a
minute fraction of mutations are adaptive and eventually affect the
fitness of an individual. In a fairly large population, substitutions
(namely, the change of one amino acid or nucleotide base to another)
occur as a very gradual (temporally long) process: many different
mutations arise, but only a very small proportion is actually
sustained in the overall population. In addition to indels and
substitutions, rearrangements of genetic material in a species along
the evolutionary time-scale are also possible.
Implicit knowledge of genetics has been in use since prehistoric times
through the selective breeding and domestication of animals. The early
20th century saw the emergence of genetics as a subject (with “geno”
meaning “to give birth” in Greek), a term proposed by William Bateson
to describe the study of inheritance, variation and heredity. As
described in earlier chapters, the subsequent contribution by Watson
and Crick (in 1953, on the physical and chemical structure of the DNA
that resides in every cell of a living system) provided details on the
exact format of the instructions for duplicating the organism, in
terms of the genetic information carried by the chain of simple
molecules of the DNA.
In simple terms, as described in earlier chapters, the prevalence of
genetic information and the possibility of its corruption can be
summarized as follows. The DNA consists of two complementary
sugar-phosphate strands wound helically around each other, each strand
carrying a chain of nucleotides whose bases are Adenine (A), Guanine
(G), Cytosine (C) and Thymine (T). The set of genetic information
pertinent to an organism that prevails in the DNA (or RNA) is referred
to as the genome. The DNA with its genomic profile is copied and
passed from one generation to the next. However, replication is not
always perfect, and errors appear as unexpected insertions and
deletions (known for short as indels), in which a nucleotide is
inserted into the replicated stretch of DNA or a particular nucleotide
is missing from its locale in the DNA chain. Further, a substitution
mutation occurs whenever one nucleotide is replaced by another
nucleotide at the same position.
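The three kinds of replication error just described (substitution, insertion and deletion) can be made concrete on a short DNA string. This is a minimal sketch only: the sequence `ATGCCGTA`, the chosen position and the helper names are hypothetical and serve purely as illustration.

```python
def substitute(seq, pos, base):
    """Replace the nucleotide at `pos` with `base` (length unchanged)."""
    return seq[:pos] + base + seq[pos + 1:]


def insert(seq, pos, base):
    """Insert `base` before position `pos` (length grows by one)."""
    return seq[:pos] + base + seq[pos:]


def delete(seq, pos):
    """Remove the nucleotide at `pos` (length shrinks by one)."""
    return seq[:pos] + seq[pos + 1:]


original = "ATGCCGTA"  # hypothetical stretch of DNA
print(substitute(original, 3, "T"))  # ATGTCGTA
print(insert(original, 3, "G"))      # ATGGCCGTA
print(delete(original, 3))           # ATGCGTA
```

Note that only the two indel operations change the length of the sequence, which is why indels shift every downstream position while a substitution does not.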
In the post-genomic era, rapid sequencing methods [Sanger and Coulson,
1975], the invention of the polymerase chain reaction, PCR [Saiki et
al., 1985], and the automation of DNA sequencing [Hunkapiller et al.,
1991] became realities. Scientists today can sequence whole prokaryote
genomes in a few hours [Margulies et al., 2005], enabling whole
genomes to be examined in the phylogenetic analysis described in the
following sections.
In summary, the modern view of evolution synthesizes Mendelism and
Darwinism (starting in the 1930s), though with a de-emphasis on
phylogenetics. Zimmerman (1920s onward) and Hennig (1950s onward) then
brought in phylogenetic systematics, the school of the cladists.
Then came the molecule: perhaps the first reference to molecules as a
means for deciphering phylogeny is

y.2 PHYLOGENY
Considering the diversity of life, with an estimated 5 to 100 million
species of organisms living on Earth, the evidence gathered (from
morphological, biochemical and gene sequence data) “suggests that all
organisms on Earth are genetically related, and the genealogical
relationships of living things can be represented by a vast
evolutionary tree”; this tree of life depicts the phylogeny of
organisms. (The term “phylon” in Greek means a race or a tribe; as
such, phylogeny or phylogenesis implies the race history of an animal
or a plant type.) In essence, phylogeny is a collection of information
about the origin and diversity of species. That is, considering the
history of lineages of organisms changing dynamically through time, it
implies that “different species arise from previous forms via descent,
and that all organisms, from the smallest microbe to the largest plants
and vertebrates, are connected by the passage of genes along the
branches of the phylogenetic tree that links all of life”. Thus
evolutionary history depicts the history of the development of
biological organisms, functions, molecules etc. through random
mutations under selective (invariably nonrandom) pressure. As such,
evolutionary history presumes that an existing organism has descended
from ancestral organisms, and that inevitable mutations take place at
the nucleotide level, leading to the observed biological diversity
[Hennig, W. 1965. Phylogenetic Systematics. Ann. Rev. Entomol.
10:97-116; Hennig, W. 1966. Phylogenetic Systematics (tr. D. Davis and
R. Zangerl), Univ. of Illinois Press, Urbana, reprinted 1979;
Zuckerkandl and Pauling. 1965. Molecules as Documents of Evolutionary
History. Journal of Theoretical Biology 8:357-366].
In short, living systems bear a history. If so, can it be uncovered
and probed? The answer lies in the art of phylogeny. It explains how
(and perhaps why), given a parental strain of a species, this strain
diverges or splits into two other strains over a period of time. In a
broader sense, considering a family of (presumably related) sequences,
a phylogenetic analysis determines how this family might have been
derived and emerged across the phases of evolution. By placing the
sequences as the outermost branches of a tree, it depicts the
evolutionary relationships among the sequences, and the associated
branching relationships (in the inner part of the tree) reflect the
extent to which different sequences are related.
Darwin recognized that all life that has ever existed is related
through the process of natural selection and by a “great Tree of Life”
[Darwin, 1859]. His recognition reverberates today in efforts to
reconstruct the tree of life so as to discover the last universal
common ancestor of all life [Woese et al., 1990]. What is the shape of
the tree of life? This question has led to many versions and revisions
of the proposed tree of life.
Constructing the tree of life is not, however, the end or limit of
probing the evolutionary processes of life. Phylogenetic analysis
offers a number of avenues for fruitful deduction through evolutionary
considerations. Useful, need-based techniques have been developed in
recent times to build phylogenetic trees across the breadth of
biology, for example in estimating the timing of the ancestor of the
HIV virus [Korber et al., 2000], investigating the origins of deadly
flu strains [Worobey et al., 2002] and investigating the genetic
mechanisms of malaria [Kedzierski et al., 2002].

y.2.1 Convergent and Divergent Evolutions


As seen above, in the evolutionary process species can unpredictably
adopt a particular phenotype (meaning any observable feature of an
organism, stemming from one or more genes); such physical features
come to suit a purpose, regardless of the genetic variations or
changes that accommodate the evolutionary pressure for them.
The acquisition of the same biological trait in unrelated lineages
defines the process of convergent evolution. As a classical example,
the wing can be cited as a result of convergent evolution in action.
Although the last common ancestor of birds and bats did not have
wings, both groups are at present capable of flight. The mechanics of
flight place constraints on wing shape, making wings similar in
overall shape. Part of this similarity can also be due to shared
ancestry, inasmuch as evolution works only with what already exists:
wings are modified limbs (as evidenced by their bone structure).
Convergent evolution leads to traits termed analogous structures.
(These are in contrast to homologous structures, which have a common
origin.) The wings of the bat and the pterodactyl (a kind of “winged
lizard”) are analogous structures. But the bat wing is homologous to
the forearms of humans and other mammals, in the sense that an
ancestral state is shared despite serving different functions.
Homoplasy denotes similarity between species of different ancestry
that results from convergent evolution.
Further, convergent evolution is similar to, but distinct from, the
phenomena known as evolutionary relay and parallel evolution.
Evolutionary relay describes how independent species acquire similar
characteristics through their evolution in similar ecosystems at
different times, as seen for example in the dorsal fins of extinct
ichthyosaurs and of sharks. Parallel evolution takes place whenever
two independent species evolve simultaneously in the same ecological
habitat and acquire similar characteristics (as seen, for instance, in
extinct browsing horses and paleotheres).
The opposite of convergent evolution is divergent evolution. Here,
related species evolve different traits. This can happen at the
molecular level as a result of random mutation unrelated to adaptive
changes. As with convergence, a rationale for divergent evolution can
be seen in the accumulation of differences between groups, leading to
the formation of new species. That is, “divergent evolution results
from diffusion of the same species adapting to different environments,
leading to natural selection defining the success of specific
mutations”. The vertebrate limb is an example of divergent evolution:
though the limb has a common origin in several species, it has
diverged somewhat in overall structure and function. Divergent
evolution is thus readily observable in organisms in certain
higher-level characters of structure and function. It can be applied
to the characteristics of molecular biology as well. That is, it can
be seen in a pathway in two or more organisms or cell types, or in
genes and proteins (such as nucleotide or protein sequences that
derive from two or more homologous genes). Further, divergent
evolution prevails in both orthologous and paralogous genes.
(Orthologous genes result from a speciation event, and paralogous
genes result from gene duplication within a genome.) Divergent
evolution in paralogous genes therefore means that divergence can
occur between two genes within a single species.
In summary, the similarity seen in divergent evolution is due to
common origin (divergence from a common ancestral structure or
function has not yet completely obscured the underlying similarity).
In contrast, convergent evolution arises whenever ecological or
physical drivers push toward a similar solution even though the
structure or function has arisen independently, so that different
characters converge on a common solution from different points of
origin. This includes analogous structures as well.

y.2.2 Phylogenetic Trees


Studies of phenotype preceded the sequencing efforts made on DNA and
proteins. In the 1960s and 1970s, the evolutionary positioning of
species was largely done on the basis of anatomical features and by
the art of taxonomy (the science of plant and animal classification).
Subsequently, phylogeny came to be directed towards deciphering the
history of descent of a group of organisms. That history is usually
described by a branching tree-like diagram (known as a phylogenetic
tree). The phylogenetic approach offers a rigorous mathematical
background, and computational methods can be used to infer the trees,
also known as dendrograms. The underlying inference is based on
proteins and DNA that are homologous. The relevant measure is
therefore unaffected by environmental effects on phenotypic changes.
In short, a phylogenetic tree enables an understanding of evolutionary
changes at the molecular level.
In constructing the tree of life, each branch is called a clade and
living organisms are placed as leaves at the tips of the branches;
their evolutionary history can be represented by a series of ancestors
shared hierarchically by different subsets of the organisms that
survive today. As mentioned before, these living organisms are
depicted as the leaves of the dendrogram, and their history is traced
back down the branches to their various ancestors. On the time-scale,
these ancestors existed thousands, millions or possibly even hundreds
of millions of years ago. Thus, in short, the notion exists that all
of life is genetically connected through a mammoth phylogenetic tree.
Does it mean that there could be a common ancestor for humans and
beetles? Plausibly, yes! The relevant common ancestral organism could
have been some sort of worm; somewhere along the evolutionary
time-frame, this species of ancestral worm divided into two separate
(worm) species, which thereafter divided again (and again), each
division (or speciation) resulting in new, independently evolving
lineages along the branches of the tree, with the end result being two
possible leaves: humans and beetles! (Wow!)
Across the various forms derived, the new lineages mostly retain their
ancestral features. However, these are gradually modified and
supplemented with the traits needed to adjust to (and survive in) the
environment they inhabit. The phylogeny of organisms explains the
similarities and differences among plants, animals and microorganisms
that have evolved along the tree of life. Thus, the phylogenetic tree
(Figure y.1) offers a systematic framework for the various
sub-disciplines of biology in organizing biological knowledge.

Fig. y.1 An evolutionary tree-of-life: the concept of phylogeny
implies that existing species (depicted as end leaves) can be linked
by an underlying structure in a tree-like manner

As is evident from its name, a phylogenetic tree (Figure y.1) is a
conceptual representation of the tree-of-life by a tree structure with
species located at the leaf nodes. In essence, it is a diagrammatic
representation of the evolutionary relationships among a set of
species. Phylogenetic trees can be deduced from molecular sequences
(such as DNA or amino acid sequences) by comparing the similarities
between such sequences. Branches in the tree indicate that a
particular species has evolved from another through internal nodes
(which denote the epochs of speciation). Further, speciation events
are assumed to take place in a dichotomous (binary) fashion; that is,
any speciation event results in two new species.
Quantitatively, the length of a branch can be regarded as a measure of
the evolutionary distance between two nodes. Implicitly, it indicates
the degree of dissimilarity between the two (DNA) sequences on the
branches that part at a speciation node.
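One simple way to quantify the dissimilarity between two sequences is the proportion of aligned positions at which they differ. The sketch below assumes this particular measure (often called the p-distance), which the chapter does not itself prescribe, and assumes the sequences are already aligned to equal length; the function name and the example sequences are hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ.

    Assumption: both sequences are already aligned to the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Hypothetical pair differing at 2 of 8 sites.
print(p_distance("ATGCCGTA", "ATGTCGTT"))  # 0.25
```

A raw proportion of this kind underestimates the true number of changes when the same site has mutated more than once, which is why practical analyses usually apply a correction on top of it.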
Constructing a phylogenetic tree is based on the following heuristics:

• Species exhibit variations or diversity.
• However, a set of similar species can be identified and grouped.
• Such grouped species can be connected (or linked) to a common
ancestor.

First, as defined earlier, a dendrogram is a tree representation in
which all the leaves are equidistant from the root, implicitly
representing the passage of time for the sequences from the root up to
the leaf tips. The components of a phylogenetic tree and the
terminology associated with constructing such structures are as
follows:

• Tree: This is a line diagram that provides a visual representation
of a group of sequences or species and indicates their order of origin
in time. A tree consists of nodes, branches and leaves. It is a
mathematical structure that models the actual evolutionary history of
a group of sequences or organisms.

Figure y.2 Components of a phylogenetic tree: branches, nodes, branch
lengths, OTUs, an outer group and the ancestral root. (OTU:
operational taxonomic unit, i.e. a leaf; the root is the common
ancestor of all OTUs.)

• Nodes: A tree consists of nodes connected by branches. Nodes mark
the ends of branches and are represented by small circles. A node can
be internal or terminal; a terminal node is called a “leaf”, that is,
the loose-end node of a branch in the tree. Internal nodes denote
hypothetical ancestors, and one unique (internal) node can be
identified as the root of the tree, depicting the ancestor of all the
sequences. The terminal nodes represent sequences or organisms (for
which the phylogenetic data are compiled and known). Each terminal
node (leaf) is designated an operational taxonomic unit (OTU). That
is, an OTU is a terminal node in phylogenetic analysis and represents
an organism; a group of such OTUs constitutes a clade, representing a
set of several sequences and their common ancestral nodes. A
bifurcating node has exactly two distinct lineages arising from it.

Figure y.a Types of nodes in a phylogenetic tree: root node, internal
node, bifurcating node and terminal nodes

• Types of trees: There are two versions of trees, namely rooted and
unrooted trees, as illustrated in Figure y.x.

Figure y.x Phylogenetic tree configurations: (a) rooted tree and (b)
unrooted tree
A rooted tree is a structure in which the direction of evolution is
specified with respect to a “root”, a single node from which the tree
branches off. That is, the tree is shown with a designated “root”
depicting an ancestral origin, from which, via divergence across the
phases of evolution, a multitude of species (denoted as the different
leaves (OTUs) of the tree) have evolved.
An unrooted tree simply displays the underlying connections or links
between the species. That is, no root node of ancestral implication is
specified on this structure. The nodes shown simply denote mutual
relatedness (through the branch lengths between them; a branch length
measures the extent of divergence between the nodes).
Scaling the tree (or scaled trees) implies elucidating the differences
between adjoining nodes. It is done by determining the length of the
branches involved.
Gene tree: This denotes a tree structure that results from analyzing
homologous genes.
With regard to the two versions of tree above, the number of branches
and nodes can be expressed in terms of the number of OTUs (M). The
relevant details are shown in Table y.y.

Tree type   Count      Number of branches   Number of nodes
Rooted      Interior   M - 2                M - 1
Rooted      Total      2M - 2               2M - 1
Unrooted    Interior   M - 3                M - 2
Unrooted    Total      2M - 3               2M - 2
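The entries of the table can be evaluated for any number of OTUs. The following sketch simply encodes the formulas listed above for fully bifurcating trees; the function name `tree_counts`, the dictionary keys and the minimum values of M enforced are choices made here for illustration.

```python
def tree_counts(m, rooted=True):
    """Branch and node counts for a bifurcating tree with m OTUs.

    The formulas are those of the table above; the lower bounds on m
    are an assumption added so that no count becomes negative.
    """
    if rooted:
        if m < 2:
            raise ValueError("a rooted tree needs at least 2 OTUs")
        return {
            "interior_branches": m - 2,
            "total_branches": 2 * m - 2,
            "interior_nodes": m - 1,
            "total_nodes": 2 * m - 1,
        }
    if m < 3:
        raise ValueError("an unrooted tree needs at least 3 OTUs")
    return {
        "interior_branches": m - 3,
        "total_branches": 2 * m - 3,
        "interior_nodes": m - 2,
        "total_nodes": 2 * m - 2,
    }


# Example with four OTUs.
print(tree_counts(4, rooted=True))
print(tree_counts(4, rooted=False))
```

For four OTUs this gives 6 branches and 7 nodes in the rooted case and 5 branches and 6 nodes in the unrooted case, so rooting a tree adds exactly one node and one branch.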

• Tree network: While a tree has only one path between any pair of
nodes, a tree network has more than one path between some pairs of
nodes, as shown in Figure y.r.

Figure y.r (a) A simple tree and (b) a tree network mesh

• Newick formatted phylogenetic tree: The Newick tree format (also
known as Newick notation or the New Hampshire tree format) is a
mathematical way of representing graph-theoretical trees with edge
lengths using parentheses and commas. When an unrooted tree is
represented in Newick format, an arbitrary node is chosen as its root.
Whether the tree is rooted or unrooted, its representation is
typically rooted on an internal node; rooting on a leaf node is rare
but permissible. Newick format is a shorthand notation and is
illustrated in Figure y.t.

((A,B),(C,D),(E,F));    ((A,B),(C,D));

Figure y.t Newick tree format
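To show how the parentheses and commas of the Newick notation map onto a tree, a small recursive parser can be sketched. This is an illustrative sketch, not a complete Newick reader: it assumes unquoted labels, ignores comments, strips all whitespace, and returns each node as a hypothetical `(name, length, children)` tuple; the function names are invented here.

```python
def parse_newick(text):
    """Parse a simple Newick string into (name, length, children) tuples."""
    text = "".join(text.split())  # drop all whitespace (sketch assumption)
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # skip "("
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
        # Optional label, read up to the next structural character.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length introduced by ":".
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """Collect leaf labels from left to right."""
    name, _length, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("((A,B),(C,D),(E,F));")
print(leaf_names(tree))  # ['A', 'B', 'C', 'D', 'E', 'F']
print(len(tree[2]))      # 3
```

The three children at the top level of the first example correspond to an unrooted tree written with an arbitrary internal node as its root, as described above.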

• A rooted binary tree that is rooted on an internal node has exactly
two immediate descendant nodes for each internal node. An unrooted
binary tree that is rooted on an arbitrary internal node has exactly
three immediate descendant nodes for the root node, and every other
internal node has exactly two immediate descendant nodes. A binary
tree rooted on a leaf has at most one immediate descendant node for
the root node, and each internal node has exactly two immediate
descendant nodes.

• Cladistics: The term derives from the Greek word for "branch";
cladistics is a form of biological systematics that classifies species
of organisms into hierarchical monophyletic groups. More specifically,
it is defined as the study of the pathways of evolution. How many
branches there are among a group of organisms, which branch connects
to which other branch, and what the branching sequence is are the
queries of interest to cladists. Typically, cladistics strives to
identify monophyletic clades, that is, groups that represent a species
and all its descendants. Closely related clades are called sister
groups. Other groupings are paraphyletic (comprising a common ancestor
and some, but not all, of its descendants) and polyphyletic
(comprising sister groups but not their common ancestor). Further, in
the cladistic sense, a monophyletic group is a group of organisms (a
taxon) constituting a clade that consists of an ancestor and all its
descendants.
 Taxon (and taxa): A taxon (with taxa being its plural), as
mentioned above, is a group of (one or more) organisms
which a taxonomist adjudges to be a unit. However, the
phylogenetic nomenclature of the cladistic approach does
require taxa to be monophyletic (consisting of all
descendants of some ancestor). There the taxon is not the
basic unit and the "clade" is used instead. That is, a clade
is a special form of taxon in the phylogenetic sense.
 Cladogram: A tree-like network that expresses ancestor-
descendant relationships is called a cladogram. Thus, it
describes the topology of a rooted phylogenetic tree via the
relative ancestral origins of sequences, but without any
branch-length considerations. In essence, cladistic
classifications are illustrated by cladograms (Figure y.n),
with the intention to reflect the relative recency of common
ancestry or the sharing of homologous features

Figure y.n A cladogram showing the relative order of common
ancestry

 Additive tree: It is a cladogram specified with branch
lengths. Such trees are also known as phylograms or metric
trees. An example of an additive tree is shown in Figure y.y

[Figure: tree diagram with branch labels 1, 2, 3, 7, 4 and 6]

Figure y.y Example of an additive tree. The numbers shown
are some hypothetical branch lengths

 Ultrametric tree: It is a dendrogram denoting a specific
type of additive tree in which the tips of the tree are all
equidistant from the root, as illustrated in Figure y.n,
where the relative order of common ancestry can be seen

[Figure: tree diagram with branch labels 1, 1, 2, 4, 2, 1 and 1]

Figure y.n An example of an ultrametric tree. (All the tips
of the tree are equidistant from the root. In this case, the
distance is equal to four)
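
The difference between an additive and an ultrametric tree can be made concrete by computing the root-to-tip distances. The sketch below is illustrative only: the nested-list tree encoding, the function names and the branch lengths are assumptions made here, and the example trees are hypothetical rather than transcriptions of the figures.

```python
def tip_depths(node, depth=0.0):
    """Return {leaf name: root-to-tip path length} for the subtree at `node`.

    A leaf is a string; an internal node is a list of (child, branch_length).
    """
    if isinstance(node, str):
        return {node: depth}
    depths = {}
    for child, length in node:
        depths.update(tip_depths(child, depth + length))
    return depths


def is_ultrametric(tree, tol=1e-9):
    """True when every tip lies at the same distance from the root."""
    values = list(tip_depths(tree).values())
    return max(values) - min(values) <= tol


# Hypothetical ultrametric tree: every tip is four units from the root.
clock_like = [([("A", 1), ("B", 1)], 3), ([("C", 2), ("D", 2)], 2)]
# Hypothetical additive tree that is not ultrametric.
additive = [([("A", 1), ("B", 2)], 3), ("C", 7)]

print(tip_depths(clock_like))
print(is_ultrametric(clock_like), is_ultrametric(additive))
```

Every ultrametric tree is additive, but the converse does not hold, which is what the second example shows.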

 Cladistic terminology: It includes (i) apomorphy, a derived
character shared by a species and its descendants but
absent in the ancestral species; (ii) synapomorphy, which is
used to define an ancestor and its descendants; and (iii)
plesiomorphy, defining ancestral characteristics
 Phenetics: It is the study of relationships among a group
of organisms on the basis of the degree of similarity
between them, be that similarity molecular, phenotypic, or
anatomical
 Phenogram: It is a tree-like network expressing phenetic
relationships
 Plesiomorphy: This defines characteristics of the ancestor
that are retained in all subsequent sequences. It is a
character state present in the outgroups as well as in the
ancestors
 Homoplasy: This refers to similarity that has evolved
independently without being indicative of a common
phylogenetic origin. Similarity seen in species of different
ancestry as a result of convergent evolution is an instance
of homoplasy
 Convergent evolution: It describes the acquisition of the
same biological trait in unrelated lineages
 Polytomies: A soft polytomy implies a lack of information
about the order of divergence, whereas a hard polytomy
hypothesizes that multiple divergences occurred
simultaneously
 Autapomorphy: It is a derived trait seen uniquely in a
particular taxon
 Synapomorphy: In contrast with autapomorphy, synapomorphy
denotes a derived characteristic that is shared within a
specific phylogeny. In cladistics as defined earlier, a
synapomorphy or synapomorphic character denotes a trait
which is shared ("symmorphy") by two or more taxa and their
last common ancestor, whose own ancestor in turn does not
possess the trait. A synapomorphy is thus an apomorphy (or a
derived characteristic of a clade)

Phylogenetic trees based on morphological features


Phylogenetic trees based on numerical taxonomy do not reveal the
subjective weighting given to individual characters. For example, in
a particular phylogenetic analysis, the numerical approach does not
state the relative importance of, say, skin color and tail length.
So, a phylogenetic tree that displays the morphological features
acquired across the diverging species can be drawn instead. An
example of such a morphological feature is that a chimp has fur and a
bird does not. A morphological tree can be drawn as shown in
Figure y.3.

[Figure: tree with branch labels jaws; lungs; claws, nails;
feathers; fur, mammary glands; and one node marked "?"]

Figure y.3: A morphological tree of phylogeny

y.2.2 Development of a Phylogenetic Tree


Consistent with the phylogenetic terminology, and in conceiving a
phylogenetic tree with the components illustrated in Figure y.2, the
underlying assumptions are as follows: (i) each species (or taxon) in
the tree is related to a common ancestor; (ii) the phylogenetic tree
is in essence a branching structure, that is, a bifurcating tree; and
(iii) along the evolutionary time that frames the phylogenetic tree,
mutations have taken place randomly. These assumptions lead to
eventual phylogenetic inferences of scientific interest.
Making such phylogenetic inferences therefore relies on models
that specify the process of evolution and the development of the
associated tree. To elucidate these models, the a priori details
required are based on the following queries relevant to the
sequences adopted in the modeling pursuits:

 Are the sequences in hand correct?
 Are they (true) representatives of the evolution process
involved?
 Are they homologous?
 Is the multiple alignment of the sequences in question correct?

Homologs or homologous structures are defined as those that are
derived from a common ancestral structure in two related species.
Homologs are further distinguished as follows:
Orthologs: These bear a common ancestor and carry similar
functional and structural attributes, but they are separated by
speciation, which depicts the phenomenon wherein a common
ancestor gives rise to two subgroups that slowly drift apart to
become distinct species.
Paralogs: These are homologs separated by a duplication event,
meaning that within a genome a gene had been duplicated; one of
the duplicates retained the original function and the other
duplicate could have assumed a new (or related) function.
Xenologs: These result from a lateral transfer between two
organisms, where a lateral transfer is a direct DNA transfer between
two species. Hence, one of the genomes contains a gene that does
not have the same history as the genome in which it is inserted.
With such horizontal gene transfer, similar functions are observed,
but the underlying evolutionary relationship is hard to tell.
Formation of orthologs and paralogs via duplication and
speciation is indicated by the illustration in Figure y.4.

[Figure: an ancestral gene α undergoes duplication into α and β;
a subsequent speciation yields the copies α1, β1, α2 and β2]

Figure y.4 Consistent with the occurrence of duplication and
speciation as illustrated, the pairs {α1, α2} and {β1, β2} depict
orthologs and a pair such as {α1, β2} denotes paralogs.

y.2 Methods of Phylogenetic Tree Construction


The method of constructing a phylogenetic tree involves a set of
procedures, which can be illustrated in a pseudocode format given in
Table y.1

Table y.1: Pseudocode-formatted description of constructing a
phylogenetic tree: Phylogenetic analysis
______________________________________________

CONSTRUCTING A PHYLOGENETIC TREE:

// Initial: Choose the test sequences
For choosing the sequences, CALL SUBROUTINE I
   write SeQ:
      ← Search the database and list the test
        sequences
   perform MSeQA:
      ← Perform multiple sequence alignment
   go to: SUBROUTINE II
      ← Multiple alignment preparation
next

// Choose the model of evolution by:
check Similarity SIM:
   if SIM defines "strong similarity",
      then go to PM
         ← PM corresponds to parsimony methods
         Perform SUBROUTINE III
   or else,
   check SIM:
   if SIM defines moderate similarity,
      then go to DM
         ← DM corresponds to distance methods
   or else,
   check SIM:
   if SIM defines low or no similarity,
      then go to ML
         ← ML corresponds to the maximum-likelihood
           method
ENDIF
next

PERFORM TREE-BUILDING/RECONSTRUCTION
   ← This refers to making the required
     phylogenetic tree. This is done with
     the set of multiple sequences aligned
     and prepared
   go to: SUBROUTINE III
next

EVALUATE THE QUALITY OF THE END-RESULT
   ← This is done by applying consensus
     methods to the set of trees
     constructed. The consensus method
     identifies the reliable tree that
     truly depicts the evolutionary
     history of the sequences addressed.
   go to: SUBROUTINE IV

END
______________________________________________

SUBROUTINE I: Choosing the query sequence

← This refers to choosing homologous
  test sequences
← The relevant choice is based on:
   Check: The selection is not a
      sequence fragment. Incomplete
      sequences are friendly toward
      neither multiple sequence alignment
      nor tree reconstruction. At the
      least, the same fragment should be
      used for all the multiple sequences
   Check: The sequences chosen are
      not xenologs. Unless the purpose
      is to study xenologs, avoid
      including genes that result from
      lateral transfer.
   Check: The selection is not a
      recombinant sequence. (Some
      proteins result from a combination
      of multiple proteins, as is common
      in viruses. Such proteins carry
      two ancestors (instead of one) and
      are not compatible with regular
      tree reconstruction.)
   Check: Whether the sequences belong
      to large and complex families
      containing various domains and
      repeats. (Working on smaller and
      more uniform subsets is preferable.)
   Check: Whether the sequences are
      nucleotides or proteins
   Check: Whether they are from closely
      related species (conforming to
      an extent of being at least 70%
      identical, with or without the
      underlying mutations being high)

      if the sequences do not conform to
         an extent of being at least 70%
         identical as above (meaning that
         they are more divergent)
         then use protein sequences or
            conserved nucleotides (such
            as ribosomal RNA)
      or else use the chosen nucleotide
            sequence
   return Selected sequence

next

Go to: Perform MSeQA: Subroutine II
______________________________________________

Subroutine II: MSeQA:
Multiple sequence alignment and preparation
for tree construction

← Alignments are an essential prerequisite
to many further analyses of protein families,
such as homology modeling or phylogenetic
reconstruction, or simply to illustrate
conserved and variable sites within a family.
Sequence alignment is an important and
useful procedure in bioinformatics. A multiple
sequence alignment is a scheme of writing
sequences in rows, one on top of another, so
that homologous residues are placed in the
same column; the residues in any one column
are deemed to have a common evolutionary
origin. Multiple sequence alignment is
computationally intensive. As shown before,
an exact alignment of more than about four
sequences is almost incomputable, so
heuristic programs are used in practice.
Typical multiple alignment programs are:
Clustal, T-Coffee, MAFFT, Muscle.

← Subroutine II involves two steps
(i) Retrieving homologs
(ii) Arranging multiple sequences
(alignment procedure) and preparing
the multiple sequences for tree
construction, which refers to
"cleaning up" the aligned multiple
sequences by a set of procedures

Step (i)-
Retrieve-
Given a sequence, its homologs can be retrieved
as follows:

1. Access NCBI BLAST by typing the URL
   http://www.ncbi.nlm.nih.gov/blast into
   the address line of the web browser
2. There are two options: By selecting the
   protein BLAST option, paste in the query
   protein sequence. (That is, search a
   protein database using a protein query.
   Algorithms: blastp, psi-blast, phi-
   blast.)
   Otherwise, by selecting the nucleotide
   BLAST option, paste in the query
   nucleotide sequence. (That is, search a
   nucleotide database using a nucleotide
   query. Algorithms: blastn, megablast,
   discontiguous megablast.)
3. Exercise search limitations on the BLAST
   run. (In practice, a selection of
   sequences from the same species as the
   query sequence can be searched along with
   a selection of sequences from different
   species. This can be done by using two
   separate BLAST searches with different
   search limitations.)
4. Next, the query (protein or nucleotide)
   sequence is pasted in the box below
   'Enter Query Sequence' and the name of
   the species that the query sequence comes
   from is typed in the Organism box. This
   will return hits from the species of
   interest only. (To obtain results from
   other species, run BLAST without any
   limitations.)
5. With the same search limitations selected
   before, continue with the BLAST search by
   clicking the BLAST button. After a while
   (depending on the search time), the FASTA-
   format sequences of six ORFs can be
   retrieved from the BLAST output.

Step (ii)-
Arrange/Prepare: This refers to placing the
chosen multiple sequences one below the other,
so that corresponding residues form columns,
as an alignment procedure and preparing the
multiple sequences for tree construction. The
preparation involves "cleaning up" the aligned
multiple sequences by a set of procedures

Example 1: Multiple DNA sequences of hypothetical
homologous species

AAGCA-AGGTAAATGCATGCATGGA--AGTCCTGGAATGGTA

AGAT--AGGTAAATGCAGCTAGCAT-AAGTCCTGGACCGGAT

GCAATTAGGTAAAACCAAGGTACCT--AGTCCTGGAGAGATA

GTGATTAGGTAAAACCAACGCAACGCAGTCCTGGACGTAGG

Example 2: Multiple protein sequences of hypothetical
homologous species

ASLIFR- SDAYS KNRTVIPVWNEGF Q- - DQSSL HVVVKMQEY


SDAV- - ERN
AGVA- - SDAYS KKRTVIKNSVNPVWNE DQGSLHVVVKKEN
GS
CSAVFASDAYS KKRTVVIKNSVNNVWN DQSSLHVV ELL- - - -
-
CVVVF- SDAYS KKRTVIIKNSVNPVWNE DQSSLHVVVKTKEE
SE

Prepare: This refers to removing certain sets
of columns in the arranged multiple sequences,
shown in the above examples as shaded
sections.

← Criteria for such removals?

o Sections of gap-free columns are mostly
  retained. Gaps invariably complicate
  phylogenetic tree-forming. (In tree
  construction, programs like ClustalW
  follow a complete-deletion policy,
  ignoring every column that contains
  a gap)
o Extremities of the multiple sequences
  are removed inasmuch as the N-terminus
  and C-terminus tend to be poorly
  conserved and, as such, they do not
  align well
o Gap-rich columns can be removed. They
  often correspond to loops. As such, even
  when a program returns an alignment
  with gap-rich columns, the alignment
  there may not be meaningful
o The most informative blocks should be
  retained. Ideally, in building a tree,
  a high-quality alignment of sequences
  possessing a low level of identity is
  preferable since it would contain a
  trace of the family history
o "Good blocks" (typically 20 – 30
  amino acids long with a few conserved
  positions) are useful in realizing a
  correct tree
← Programs for column removal

- Removal of columns that are unlikely to
  be correctly aligned can be done via the
  T-Coffee server. T-Coffee is a progressive
  method for sequence alignment [C.
  Notredame, D. Higgins and J. Heringa,
  "T-Coffee: A novel method for multiple
  sequence alignments", Journal of
  Molecular Biology, 302, 205-217, (2000)]
- Editing multiple alignments can be done
  with Jalview. It is a multiple alignment
  editor written in Java. It is used widely
  in a variety of web pages (for example,
  the EBI ClustalW server and the Pfam
  protein domain database) and is available
  as a general-purpose alignment editor.
← Programs for multiple alignment
  Clustal, T-Coffee, MAFFT, Muscle, …

return Aligned multiple sequence

End
_______________________________________________________
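
The complete-deletion policy mentioned in the column-removal criteria can be sketched in a few lines of Python. This is only an illustration of the idea and not the procedure of any particular program; the function name complete_deletion and the toy alignment are assumptions made here.

```python
def complete_deletion(alignment, gap_chars="-"):
    """Drop every alignment column in which at least one row has a gap.

    `alignment` is a list of equal-length aligned sequences (strings).
    """
    if not alignment:
        return []
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        col
        for col in range(len(alignment[0]))
        if all(row[col] not in gap_chars for row in alignment)
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


# Hypothetical toy alignment: columns 2 and 3 contain gaps.
toy = ["AC-GT", "A-CGT", "ACCGT"]
print(complete_deletion(toy))
```

Only the gap-free columns survive, so the rows stay aligned with one another after trimming. Removing poorly conserved extremities or gap-rich blocks would need additional, more subjective criteria that are not sketched here.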

Building the phylogenetic tree


Phylogenetic tree construction relies on two considerations: (i) an
approach based on similarities and (ii) another approach based on
dissimilarities of the observed data. The similarity aspect is
concerned with the relatedness of the compared entities. It is
divided into two types: (a) the phenetic version and (b) the
cladistic version. The phenetic version is a character-based
phenogram, for example, comparing a set of plants in terms of their
associated characters (such as petals, sepals, anthers, ovary, size
or style). Depending on the extent of comparable characters between
them, they are declared as being closely related or not.
Cladistics is a method of hypothesizing relationships among
organisms in reconstructing evolutionary trees. The basis of a
cladistic analysis is data on the characters, or traits, of the organisms
and these characters could be anatomical and physiological
characteristics, behaviors, or genetic sequences. The result of a
cladistic analysis is a tree, which represents a supported hypothesis
about the relationships among the organisms.
Summarizing the contrasts and comparisons between
phenetics and cladistics, the former offers relationships among a
group of organisms on the basis of the degree of similarity between
them. (This similarity may refer to molecular, phenotypic, or
anatomical features.) Hence, a tree-like network (called a
phenogram) is evolved so as to express the underlying phenetic
relationships.
Cladistics refers to the pathways of evolution. In this context,
the number of branches that prevail among a group of organisms,
which branch connects to which other branch, and the associated
branching sequence are the queries of interest. The relevant
tree-like network that expresses such ancestor-descendant
relationships is a cladogram, or the topology of a rooted
phylogenetic tree.
A phenogram may serve as an indicator of cladistic relationships,
but it is not per se identical to a cladogram. (Only when there
exists a linear relationship between the time of divergence and the
degree of genetic (or morphological) divergence may the two types of
trees become identical to each other.)
A class of phylogeny is molecular phylogenetics (also known
as molecular systematics), which uses the structure of molecules in
order to gain information on the evolutionary relationships of
organisms. The result of such molecular phylogenetic analysis can be
expressed in a phylogenetic tree. In pursuing molecular phylogeny,
the associated classification of methods involves distance and
character-state approaches.
Methods of the distance-based approach are based on certain
distance measures, such as the number of nucleotide or amino-acid
substitutions. Examples of distance-based methods are the
unweighted pair-group (UPGMA) method, the transformed-distance
method, and the neighbors-relation method.
The heuristic logic behind the distance method is that
evolutionary distance is a tree metric and hence defines a tree.
Typically, evolutionary distances are computed for all pairs of taxa;
and the tree is constructed by considering the relationships among
these distance data (fitting a tree to the matrix).
The methods of the character-state approach rely on the state of
the character, namely (i) the nucleotide or amino acid at a
particular site; and (ii) the presence or absence of an indel at a
certain DNA location. An example of a character-state method is the
maximum parsimony method.
Yet another statistical strategy of phylogenetic tree
reconstruction using molecular data is the maximum likelihood
method, which uses all the information available in the sequences.
Before getting into the details of the aforesaid molecular phylogeny
methods, some basic aspects of tree construction are outlined below.
The process of building a tree, in essence, refers to making
or reconstructing the required phylogenetic tree structure using the
multiple sequences aligned and prepared as indicated before. With
any given set of (N) multiple sequences, the corresponding number of
possible trees (M) becomes extremely large for high values of N.
(For example, for unrooted trees, M = 1 for N = 3; M = 3 for N = 4;
…; M = 2,027,025 for N = 10; …; M ≈ 2.8 × 10^74 for N = 50, and so
on, as will be explained later.) Therefore, unless only a small
number of sequences is considered, the total number of feasible
trees increases to an enormous extent. As such, with the sequence
data gathered, the multiple alignment is done only on a limited
number of sequences (leading to a suboptimal number of trees being
reconstructed).
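
The growth of the number of possible bifurcating trees with the number of sequences follows the standard double-factorial counts, (2N – 5)!! for unrooted and (2N – 3)!! for rooted trees. The Python fragment below is a small sketch for evaluating them; the function names are illustrative choices made here.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k (k - 2) (k - 4) ... down to 1 or 2; 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("an unrooted bifurcating tree needs at least 3 leaves")
    return double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labeled leaves."""
    if n < 2:
        raise ValueError("a rooted bifurcating tree needs at least 2 leaves")
    return double_factorial(2 * n - 3)


for n in (3, 4, 10, 50):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Because Python integers have arbitrary precision, the counts remain exact even for N = 50, where the number of unrooted trees is of the order of 10^74.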
As indicated above, building the tree implies assessing the
underlying phylogeny. This assessment can be done by two
approaches, namely, (i) the distance-based approach and (ii) the
clustering-based approach.
The distance-based approach refers to introducing a "weight"
concept to the basic tree structures indicated earlier. The
underlying concept of the distance-based pursuit is as follows:
Considering a rooted tree, the root is the most recent common
ancestor of all the taxa in the tree, and the path from the root to
a leaf signifies the evolutionary path. Such rooted trees are often
represented with a root vertex as shown in Figure y.x(a),
emphasizing that the root corresponds to the ancestral species. In
contrast, the unrooted tree (Figure y.x(b)) bears no assumption as
regards the position of an evolutionary ancestor (root) in the tree.
That is, no assumption about the origin of species prevails in
unrooted trees.
The concept of weight in distance-based methods can be
illustrated as in Figure y.x (c) where, for example, there are six leaf-
nodes (vertices), I, II, III, IV, V and VI; and, a positive weight (or
length) is assigned between any two consecutive nodes. This length,
for example, may depict the number of mutations on the evolutionary
path.
Quantitatively, the length (d) of the path between any two
vertices can be specified as the sum of the weights along the path
between them. In Figure y.x(c), for example, d between nodes I and
V is given by: d = 13 + 13 + 14 + 18 + 12 = 70. In general, given a
weighted tree (T) with n leaves (end-nodes), computation of the path
length di,j(T) between any two leaves (i, j) can be done as indicated
by the above example.

[Figure: weighted tree with leaves I, II, III, IV, V and VI and
edge weights 13, 13, 13, 14, 18, 15, 12 and 10]

Figure y.cc Weighted unrooted (star topology) tree
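
The path-length rule (sum of the edge weights between two vertices) can be sketched as a simple traversal. The tree below is hypothetical: its topology and weights are chosen here for illustration and are not a transcription of the figure, and the function name path_length is likewise an assumption.

```python
def path_length(tree, start, goal):
    """Sum the edge weights on the unique path between two nodes of a tree.

    `tree` maps each node to a dict {neighbor: edge weight}.
    """
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for neighbor, weight in tree[node].items():
            if neighbor != parent:  # never walk back along the same edge
                stack.append((neighbor, node, dist + weight))
    raise ValueError(f"no path between {start!r} and {goal!r}")


# Hypothetical weighted unrooted tree: leaves I, II, III, V; internal u, v.
tree = {
    "I": {"u": 13},
    "II": {"u": 15},
    "u": {"I": 13, "II": 15, "v": 14},
    "v": {"u": 14, "V": 12, "III": 10},
    "V": {"v": 12},
    "III": {"v": 10},
}
print(path_length(tree, "I", "V"))  # 13 + 14 + 12
```

Since a tree has exactly one path between any two nodes, the traversal needs no bookkeeping of visited nodes beyond the parent of the current node.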

Now, considering the inverse problem, suppose an (N × N) distance
matrix ∆ = [di,j(T)] for every two leaves (i, j) is available
(mostly via biological experiments). A method is then required to
search for a tree T that has N leaves and is consistent with the
data in hand. Whenever the matrix size is small (say 3 × 3) and the
matrix is symmetric and non-negative, construction of the tree could
be trivial. But for larger matrix sizes, the number of candidate
trees becomes unwieldy. This could be seen from the following
relations: Given ν leaf nodes, the number of rooted bifurcating
trees that can be designed is εR = (2ν – 3)!! = (2ν – 3)(2ν – 5)
(2ν – 7) … (3)(1); and the number of unrooted trees is
εUR = (2ν – 5)!! = (2ν – 5)(2ν – 7)(2ν – 9) … (3)(1). Illustrated in
Figure y.y are a couple of simple examples.

[Figure: panels (a) and (b); labels: Leaf, Root]

Figure y.y: Given the number of nodes (ν), realization of εR and εUR:
(a) With ν = 2 in a rooted tree and (b) with ν = 3 in an unrooted (star
topology) tree.

In the distance-based approach [N. Saitou and M. Nei, The
Neighbor-Joining Method: A New Method for Reconstruction of
Phylogenetic Trees, Molecular Biology and Evolution, 4, 1987, 406-
425], evolutionary distances are calculated for all pairs of taxa
and the phylogenetic tree is constructed by means of an algorithm
that establishes some functional relationships among the distance
values. Hence a distance matrix is deduced, which is a table that
contains the "distances" (or counts of the number of evolutionary
events) that separate each pair of sequences in the data set of
aligned multiple sequences. Popularly, the tree is built from the
distance matrix via UPGMA, which is a simple method of tree
reconstruction. It essentially assumes that the rate of evolution
is nearly constant among different evolutionary lineages. That is,
the evolutionary distance is proportional to the divergence time.
Essentially, distance-matrix methods of phylogeny are non-parametric
schemes originally adopted for phenetic data using a matrix of
pairwise distances. The distances obtained thereof are then used to
make a tree (that is, a phylogram, whose branch lengths carry the
information on the underlying evolutionary process).
Normally, as said before, the distance matrix is a result of
biological experiments (such as immunological studies) compiled as
measured values. It can also be elucidated from:

 Morphometric analysis
 Pairwise distance formulations (like the Euclidean distance
between discrete morphological characters)
 Genetic distance calculations from sequence, restriction
fragment, or allozyme data.

Raw distance values vis-à-vis phylogenetic character data can be
decided by simple counts of the number of pairwise differences in
character states. (Specifically described as the Manhattan distance,
the raw distance values in question conform to what is known as
taxicab geometry, proposed by Hermann Minkowski in the 19th century.
It refers to a form of geometry wherein the conventional Euclidean
metric is supplanted by a new metric in which the distance between
two points is the sum of the (absolute) differences of their
coordinates. The taxicab metric is also known as rectilinear
distance. The names Manhattan distance and Manhattan length allude
to the grid layout of most streets on the island of Manhattan: the
length of the shortest path that a taxicab could take between two
points in the city is equal to the distance between the points in
taxicab geometry.)
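
As a brief illustration of the taxicab metric applied to character data, the following Python sketch computes the Manhattan distance between two taxa scored for a few characters. The function name and the character scores are hypothetical and serve only to show the arithmetic.

```python
def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences of two equal-length vectors."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(p, q))


# Two taxa scored for five binary characters (hypothetical values).
taxon_a = [1, 0, 1, 1, 0]
taxon_b = [1, 1, 0, 1, 0]
print(manhattan_distance(taxon_a, taxon_b))

# Grid-street analogy: from corner (0, 0) to corner (3, 4).
print(manhattan_distance((0, 0), (3, 4)))
```

For binary (0/1) character states the Manhattan distance reduces to the simple count of characters in which the two taxa differ.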
Inasmuch as the distance-matrix approach requires "genetic
distance" evaluation between the sequences being classified, it
needs a multiple sequence alignment (MSA) as an input. This genetic
distance is often defined as the fraction of mismatches at aligned
positions, with gaps either ignored or counted as mismatches [D. M.
Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring
Harbor Laboratory Press, Cold Spring Harbor, NY: 2004].
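
The mismatch-fraction definition of genetic distance, with its two ways of treating gaps, can be sketched as below. The function name p_distance, the gaps parameter, and the choice to skip columns in which both sequences carry a gap are assumptions made for this illustration; the sequences are toy data.

```python
def p_distance(a, b, gaps="ignore"):
    """Fraction of mismatches between two aligned sequences.

    gaps="ignore"   -- columns with a gap in either sequence are skipped
    gaps="mismatch" -- a gap facing a residue counts as a mismatch
    Columns in which both sequences have a gap are always skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if gaps not in ("ignore", "mismatch"):
        raise ValueError("gaps must be 'ignore' or 'mismatch'")
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        if (x == "-" or y == "-") and gaps == "ignore":
            continue
        compared += 1
        mismatches += x != y
    return mismatches / compared if compared else 0.0


seq1 = "ACGT-A"
seq2 = "ACCTTA"
print(p_distance(seq1, seq2, gaps="ignore"))    # 1 mismatch in 5 columns
print(p_distance(seq1, seq2, gaps="mismatch"))  # 2 mismatches in 6 columns
```

Applying such a function to every pair of aligned sequences gives the all-to-all distance matrix used by the distance methods.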
Further, distance methods imply constructing an all-to-all
matrix from the sequence query set. This describes the distance
between each sequence pair. The phylogenetic tree constructed via
the distance matrix places closely related sequences under the same
interior node, and the branch lengths reproduce to a close extent
the observed distances between sequences. Further, the tree
reconstructed can be either rooted or unrooted, depending on the
type of algorithm adopted. Distance methods lay the foundation for
progressive and iterative types of MSA. But such methods do not
efficiently use the information about any local high-variation
regions that may appear across multiple sub-trees [J. Felsenstein,
Inferring Phylogenies, Sinauer Associates, Sunderland, MA: 2004].
The genetic distance concept is adopted as a data clustering
strategy in a method known as the neighbor-joining approach, which
enables reconstruction of unrooted trees. In its approach, neighbor-
joining does not assume a constant rate of evolution (that is, a
molecular clock) across lineages. In other words, the evolutionary
divergence time cannot be read directly from the number of mutations
since, as said before, mutation rates are not constant. (In
contrast, in the UPGMA method to be described later, rooted trees
are reconstructed using a constant-rate assumption, yielding an
ultrametric tree in which, as said earlier, the distances from the
root to every branch tip are equal.)
Neighbor-joining is based on the minimum-evolution
criterion for phylogenetic trees, i.e. the topology that gives the least
total branch length is preferred at each step of the algorithm.
However, neighbor-joining may not find the true tree topology with
least total branch length because it is a greedy algorithm that
constructs the tree in a step-wise fashion. Even though it is sub-
optimal in this sense, it has been extensively tested and usually finds
a tree that is quite close to the optimal tree. Nevertheless, it has been
largely superseded in phylogenetics by methods that do not rely on
distance measures and offer superior accuracy under most conditions.
The main virtue of neighbor-joining relative to these other
methods is its computational efficiency. That is, neighbor-joining is a
polynomial-time algorithm. It can be used on very large data sets for
which other means of phylogenetic analysis (e.g. minimum
evolution, maximum parsimony and maximum likelihood) are
computationally prohibitive. Unlike the UPGMA algorithm for
phylogenetic tree reconstruction, neighbor-joining does not assume
that all lineages evolve at the same rate (molecular clock hypothesis)
and produces an unrooted tree. Rooted trees can be created by using
the outgroup and the root can then effectively be placed on the point
in the tree where the edge from the outgroup connects.
Furthermore, neighbor-joining is statistically consistent under
many models of evolution. Hence, given data of sufficient length,
neighbor-joining will reconstruct the true tree with high probability.
Atteson proved that if each entry in the distance matrix differs from
the true distance by less than half of the shortest branch length in the
tree, then neighbor joining will construct the correct tree.

UPGMA method: An algorithmic procedure to construct a phylogenetic tree

The unweighted pair-group method with arithmetic mean (UPGMA)
is a method of tree construction developed originally in the context
of constructing taxonomic phenograms, that is, trees that reflect
the phenotypic similarities between operational taxonomic units
(OTUs). The underlying consideration refers to the relationships
between organisms viewed in terms of the similarity seen between
them (instead of probing their genealogy). That is, similar
organisms can be grouped (clustered) together (per the old adage
that birds of a feather flock together!).
Such a grouping effort is exercised by UPGMA. In this phenetic
approach the groups are identified by making clusters of varying
degrees of similarity (that is, on the basis of least distant to
most distant relationships). It makes the tree clock-like
(ultrametric). Computationally, tree construction by UPGMA is
fast. Hence, UPGMA can be applied even to large data sets. But it
yields a single tree obtained in a broad sense of similarity and
does not examine whether the relationships considered are
historically significant.
UPGMA can also be used to construct phylogenetic trees if
the rates of evolution are approximately constant among the different
lineages. For this purpose either the number of observed nucleotide
or amino-acid substitutions can be used.
In summary, UPGMA employs a sequential clustering
algorithm, in which local topological relationships are identified in
order of similarity and the phylogenetic tree is built in a stepwise
(ladder-like) manner. First, from among all the OTUs, the two OTUs
that are most similar to each other are identified, and these two
are combined into a new single OTU. This combined OTU is called a
composite OTU. Subsequently, a new distance matrix is constructed
with the composite OTU plus the rest of the OTUs. Again, the pair
with the highest similarity is identified to make the new composite,
and so on, until only two OTUs are left. The relevant exercise is
illustrated in the following example.

Example x.1
Consider a set of six OTUs {α, β, χ, δ, ε, φ} whose sequences are
listed below:
α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG

β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT

χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG

δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG

ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Assuming that the above set of OTUs had the following common
ancestral root (R), the corresponding evolutionary distance (ED)
matrix can be constructed as illustrated in Figure x.

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG

[Figure: tree from the root R to the taxa/OTUs α to φ, shown
alongside the ED matrix (EDM) below]

      α   β   χ   δ   ε   φ
α     0
β     2   0
χ     4   4   0
δ     6   6   6   0
ε     6   6   6   4   0
φ     8   8   8   8   8   0

Figure x Construction of the evolutionary distance matrix for the
OTU-set of Example x.1.
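
The sequential clustering performed by UPGMA on this ED matrix can be sketched in Python as follows. This is a minimal illustration and not a definitive implementation: the function name upgma, the lower-triangular input layout, and the tie-breaking rule (the first minimal pair found is merged) are assumptions made here. Each merge is placed at half the inter-cluster distance, which reflects the constant-rate (ultrametric) assumption.

```python
from itertools import combinations


def upgma(names, lower):
    """Cluster OTUs by UPGMA.

    names -- OTU labels; lower -- lower-triangular rows, lower[i][j] with j < i.
    Returns the merges in order as (members_a, members_b, height) tuples.
    """
    clusters = [(name,) for name in names]
    dist = {}
    for i, row in enumerate(lower):
        for j, value in enumerate(row):
            dist[frozenset((clusters[i], clusters[j]))] = float(value)
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (first one wins on ties).
        a, b = min(combinations(clusters, 2), key=lambda p: dist[frozenset(p)])
        height = dist[frozenset((a, b))] / 2
        merged = a + b
        for c in clusters:
            if c not in (a, b):
                # Average distance, weighted by the sizes of the two clusters.
                dist[frozenset((merged, c))] = (
                    len(a) * dist[frozenset((a, c))]
                    + len(b) * dist[frozenset((b, c))]
                ) / len(merged)
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((a, b, height))
    return merges


names = ["α", "β", "χ", "δ", "ε", "φ"]
lower = [[], [2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
for a, b, height in upgma(names, lower):
    print("+".join(a), "|", "+".join(b), "| height", height)
```

With this matrix the first composite OTU joins α and β, and the last merge attaches φ to all the other OTUs. The two pairs at distance 4 are tied, so the order in which they are merged depends on the tie-breaking rule assumed above.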

The ED values indicated in the matrix of Figure x correspond to the
extent (number) of changes in the nucleotide residue set of R in going
to each taxonal leaf (OTU). That is, ED corresponds to the number of
dissimilarities observed between the sequences being compared. This
is illustrated in the following set of tables (Tables 1 - x):

Check I: Distance of four (4) of each OTU from the root R.
(Dissimilar residues are shown bold)

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

R TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Check II: Distance of subsequent OTUs from α
(Dissimilar residues are shown bold)

α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT

α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG

α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG

α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

α GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Check III: Distance of subsequent OTUs from β
(Dissimilar residues are shown bold)

β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT
χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG

β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT
δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG

β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

β GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Check IV: Distance of subsequent OTUs from χ
(Dissimilar residues are shown bold)

χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
δ CAATGCTCCGTGGTGTTGTCGTCTGCGATATATGGCTCG

χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

χ GAACGCTGCGTCGTGTTGTCGTCTGTGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Check V: Distance of subsequent OTUs from δ
(Dissimilar residues are shown bold)

δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG

δ GAAGCCTGCGTGTGGTTGTCGTCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG

Check VI: Distance of the subsequent OTU from ε
(Dissimilar residues are shown bold)

ε GAAGGTTGCGTGGTGTTGTCTGCTGCGAGATATGGCTCG
φ TCAGGCCGCGTGGTGTTGTCGTCTGCGAGAATTGGCTCG
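The pairwise checks above amount to counting mismatched positions between two aligned sequences. A minimal sketch of that count is given below; the function name `hamming_distance` is an illustrative choice and not part of the chapter, and the sequences are R, α and β as listed in Example x.1.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq_a, seq_b) if x != y)


# Sequences as listed in Example x.1 (letter spacing removed).
ROOT = "TAAGGCTGCGTGGTGTTGTCGTCTGCGAGATATGGCTCG"
ALPHA = "GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTGG"
BETA = "GAACGCTGCGTGGTGTAGTCGTCTGCGAGATATGGCTCT"

print(hamming_distance(ROOT, ALPHA))   # distance of α from the root R
print(hamming_distance(ALPHA, BETA))   # entry (α, β) of the EDM
```

The same function can be applied to any other pair in the checks to tabulate the full matrix.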

With reference to the evolutionary distances (EDs) indicated in the
ED matrix (EDM) of Figure x, the problem in hand is to construct an
additive tree using the clustering algorithm of UPGMA. For this
purpose, as mentioned earlier, the number of observed nucleotide or
amino-acid substitutions can be used in the UPGMA sequential
clustering algorithm, where local topological relationships are
identified in the order of similarity. Hence, the phylogenetic tree
is built via the following steps:

Step I: Given an EDM, the two OTUs that are most similar (bearing
closest ED) are identified from among all the OTUs and these two
OTUs are treated as a new single composite OTU. As shown in
Figure y.1, the OTU-pair α, and β is the chosen pair by virtue of their
smallest ED of 2 in the EDM; and, the composite OTU is {α, β}.

EDM (selected entry)

      α   β
α     0
β     2   0

Fig. y.1: Selection of the two OTUs α and β that are most similar in the
given EDM (bearing the smallest ED, equal to 2)

Step II: With the composite OTU {α, β} introduced in the EDM, a
new matrix is constructed, from which the most similar pair is again
chosen and clustered together, as illustrated in Figure y.2. The
relevant set of EDs is obtained as follows:

ED between {α, β} and χ = ½ × [ED(α and χ) + ED(β and χ)]
ED between {α, β} and δ = ½ × [ED(α and δ) + ED(β and δ)]
ED between {α, β} and ε = ½ × [ED(α and ε) + ED(β and ε)]
ED between {α, β} and φ = ½ × [ED(α and φ) + ED(β and φ)]

EDM

      αβ  χ   δ   ε   φ
αβ    0
χ     4   0
δ     6   6   0
ε     6   6   4   0
φ     8   8   8   8   0

Fig. y.2: Construction of the new EDM with the composite OTU, {αβ}

Step III: As in the previous steps, among the new group of OTUs,
the pair with the highest similarity is identified and the corresponding
composite OTU is specified. This procedure is repeated until only
two OTUs are left. The corresponding EDM constructions are shown
in Figures y.3-y.x.

EDM

      αβ  χ   δε  φ
αβ    0
χ     4   0
δε    6   6   0
φ     8   8   8   0

Fig. y.3: Construction of the new EDM with the composite OTUs,
{αβ} and {δε}

EDM

      αβχ  δε  φ
αβχ   0
δε    6    0
φ     8    8   0

Fig. y.4: Construction of the new EDM with the composite OTUs,
{αβχ} and {δε}

Evolutionary distance (ED) matrix (EDM)

        αβχδε  φ
αβχδε   0
φ       8      0

Fig. y.4: Construction of the new EDM with the composite OTU,
{αβχδε}

The resulting ultrametric tree for the test EDM is illustrated in Figure
y.5.
[Figure: rooted ultrametric tree; horizontal axis: evolutionary
distance (ED) in unit steps from the root; vertical axis: Taxa/OTUs
α to φ]

Fig. y.5: Final UPGMA-based ultrametric phylogenetic tree for the
test EDM
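Steps I to III can be sketched in a few lines of code. The following is a minimal illustration, not a reference implementation: the function name `upgma`, the `frozenset`-keyed distance lookup and the returned merge list are choices made here for brevity. Each composite distance is the size-weighted average of the distances of its two parts, which equals the mean over all member pairs. Where two pairs tie for the smallest ED (as {αβ}, χ and δ, ε do in the test EDM), the order of merging is arbitrary and the resulting tree is the same.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    dist:  dict mapping frozenset({a, b}) -> distance between OTUs a and b.
    Returns (cluster_1, cluster_2, node_height) tuples in merge order.
    """
    # Each cluster is a tuple of the original OTU labels it contains.
    clusters = [(n,) for n in names]
    d = {frozenset({(a,), (b,)}): dist[frozenset({a, b})]
         for a, b in combinations(names, 2)}
    merges = []
    while len(clusters) > 1:
        # Step I: pick the closest pair among the current clusters.
        ci, cj = min(combinations(clusters, 2),
                     key=lambda pair: d[frozenset(pair)])
        height = d[frozenset({ci, cj})] / 2.0
        merged = ci + cj
        # Step II: distances from the composite OTU to every other cluster.
        for ck in clusters:
            if ck in (ci, cj):
                continue
            d[frozenset({merged, ck})] = (
                len(ci) * d[frozenset({ci, ck})]
                + len(cj) * d[frozenset({cj, ck})]
            ) / (len(ci) + len(cj))
        # Step III: replace the pair by the composite and repeat.
        clusters = [c for c in clusters if c not in (ci, cj)] + [merged]
        merges.append((ci, cj, height))
    return merges


# Test EDM of Example x.1, entered as lower-triangular rows.
otus = ["α", "β", "χ", "δ", "ε", "φ"]
rows = [[], [2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
edm = {frozenset({otus[i], otus[j]}): rows[i][j]
       for i in range(len(otus)) for j in range(i)}

merges = upgma(otus, edm)
for left, right, height in merges:
    print(left, right, height)
```

The height of the last merge is half the largest ED in the matrix, which is the distance of every leaf from the root in an ultrametric tree.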

Problem y.x
The EDM with metrics of distances depicting a hypothetical
ultrametric phylogenetic tree is shown below. Suppose the following
sequence depicts the common ancestral root (R) corresponding to the
given EDM; determine the sequences that can be specified for the
OTU-set {a, b, c, d, e, f, g} of the tree.

R: GAATGTTGCGTGGTGTTGTGGTCTGCGAGATATAACTCGAATGCCT

Evolutionary distance (ED) matrix (EDM)

      a   b   c   d   e   f   g
a     0
b     2   0
c     4   4   0
d     6   6   6   0
e     6   6   6   4   0
f     6   6   6   4   2   0
g     8   8   8   8   8   8   0

Figure x A hypothetical EDM supplied for Problem y.x.

The shortcomings of the UPGMA clustering method are: (i) it is
sensitive to unequal evolutionary rates, implying that whenever one
of the OTUs has experienced more mutations over time relative to the
others, the resulting topology of the tree will be erroneous (Why?);
(ii) clustering is feasible only when the data are ultrametric; and
(iii) ultrametric distances are constrained by the so-called
“three-point condition”. The three-point condition stipulates that for
any given set of three taxa {A, B, C}, the two largest distances are
equal; that is, ED(A-C) ≤ maximum of [ED(A-B), ED(B-C)]. The two
largest distances being equal signifies that the evolutionary rate is the
same for all branches. Should this condition of rate constancy fail
among lineages, an erroneous topology would result.
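The three-point condition can be tested mechanically by examining every triple of taxa. The sketch below is illustrative only: `is_ultrametric` and `from_lower_triangle` are hypothetical helper names, and the triple {A, B, C} with distances 5, 4 and 7 is made up here to show a violation.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Return True if every triple of taxa satisfies the three-point condition."""
    for a, b, c in combinations(names, 3):
        d_small, d_mid, d_large = sorted(
            (dist[frozenset({a, b})],
             dist[frozenset({a, c})],
             dist[frozenset({b, c})])
        )
        # The two largest of the three distances must coincide.
        if abs(d_large - d_mid) > tol:
            return False
    return True


def from_lower_triangle(names, rows):
    """Build a pairwise-distance lookup from lower-triangular rows."""
    return {frozenset({names[i], names[j]}): rows[i][j]
            for i in range(len(names)) for j in range(i)}


# EDM of Example x.1: built from an ultrametric tree.
otus = ["α", "β", "χ", "δ", "ε", "φ"]
edm = from_lower_triangle(
    otus, [[], [2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
print(is_ultrametric(otus, edm))

# Hypothetical triple with unequal rates: the two largest distances differ.
abc = ["A", "B", "C"]
skewed = from_lower_triangle(abc, [[], [5], [4, 7]])
print(is_ultrametric(abc, skewed))
```

A matrix that fails this test should not be handed to UPGMA without caution, since the method implicitly assumes the condition holds.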

Problem y.x
Consider an EDM with metrics of distances as shown below. For the
ED values of the matrix indicated, relevant topology is as shown.

Evolutionary distance (ED) matrix (EDM)

      α   β   χ   δ   ε   φ
α     0
β     5   0
χ     4   7   0
δ     7   10  7   0
ε     6   9   6   5   0
φ     8   11  8   9   8   0

Figure y.a EDM of a phylogenetic tree having no evolution rate
constancy

Check the three-point condition and show that the tree
reconstructed via UPGMA has the wrong topology illustrated below,
implying that UPGMA applied to data with unequal rates of mutation
can yield a topology completely different from the original
EDM-specific tree.

[Figure: rooted tree with leaves α, χ, β, δ, ε, φ on the Taxa/OTUs
axis and numerical branch lengths marked on each branch]

Figure y.b UPGMA-based topology for the EDM of Figure y.a

(Solution hint: Considering the divergence phases of α and β, the
taxon β has faced mutations at a much higher rate than the taxon α.
Check that, as a result, the three-point criterion is violated and the
UPGMA-based tree is erroneous. In such cases the neighbor-joining
procedure (described below) will yield the correct topology).

Example y.y
The EDM indicated below corresponds to a real data matrix for five
ribosomal RNA sequences. Each value denotes the estimated number
of nucleotide residue substitutions per position separating the
corresponding pair of the presently existing sequences.

Node        1      2      3      4      5
Taxa        Bsu    Bst    Lvi    Amo    Mlu
1  Bsu      0      0.172  0.215  0.309  0.233
2  Bst             0      0.299  0.340  0.206
3  Lvi                    0      0.280  0.394
4  Amo                           0      0.429
5  Mlu                                  0

Construct an ultrametric tree.

Solution:

[Figure: ultrametric tree with leaves, from top to bottom, Bsu, Bst,
Mlu, Lvi, Amo]

Neighbor-joining (NJ) method: An algorithmic procedure to find
the shortest tree

The NJ method reconstructs a phylogenetic tree and computes the
lengths of its branches. In each stage, the two nearest nodes of the
tree are chosen and defined as neighbors in the tree. This is
recursively followed until all of the nodes are paired together. The
algorithm was originally developed by Saitou and Nei [ ], with
subsequent corrections to the proof of the algorithm, plus some minor
changes (to the algorithm), due to Studier and Keppler [ ].

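[Figure: unrooted tree in which leaves α and β join at the internal
node X through branches la and lb, leaves χ and δ join at the internal
node Y through branches lc and ld, and X and Y are connected by an
interior branch]

Figure y.x Illustration of neighbors in the phylogenetic trees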

Phylogenetic neighbors are defined as a pair of OTUs connected
through a single internal node. For instance, consider the tree
illustrated in Figure y.x, where {α, β, χ, δ} is the set of
(non-internal) leaf nodes and {X, Y} is the internal node pair. The set
of branch lengths is {la, lb, lc, ld} as illustrated. Here α and β are
neighbors, since they are connected through a common internal node.
Likewise χ and δ are also neighbors. (However, {α, δ} and {β, χ}
are evidently not neighbors).
The underlying algorithm of tree construction assumes that the
trees are additive. An additive tree means a tree where, for
example, in Figure y.x, the distance between nodes α and β is equal
to the distance between nodes α and X plus the distance between
nodes β and X.

Stipulation of 4-point condition

A four-point condition can be stipulated to specify the general
observation that neighbors are closer than non-neighbors. Referring
to Figure y.x, and writing x for the length of the interior branch
between X and Y,

$$ d_{\alpha\chi} + d_{\beta\delta} = d_{\alpha\delta} + d_{\beta\chi} = l_a + l_b + l_c + l_d + 2x = d_{\alpha\beta} + d_{\chi\delta} + 2x $$

and the 4-point conditions thereof are:

[Sum of all distances between neighboring nodes: for example,
(dαβ + dχδ) in the sets {α, β} and {χ, δ}]
< [Sum of all distances between non-neighbor nodes: for
example, (dαχ + dβδ) in the sets {α, χ} and {β, δ}]

[Sum of all distances between neighboring nodes: for example,
(dαβ + dχδ) in the sets {α, β} and {χ, δ}]
< [Sum of all distances between non-neighbor nodes: for
example, (dαδ + dβχ) in the sets {α, δ} and {β, χ}]
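The four-point condition lends itself to a direct numerical check: of the three ways of pairing four taxa, the pairing of true neighbors gives the smallest sum and the other two sums are equal. In the sketch below the branch lengths la, lb, lc, ld and x are hypothetical values chosen only for illustration, and the function names are not from the chapter.

```python
def four_point_sums(dist, a, b, c, d):
    """Return the three pairwise-sum totals for the quartet {a, b, c, d}."""
    return {
        ((a, b), (c, d)): dist[frozenset({a, b})] + dist[frozenset({c, d})],
        ((a, c), (b, d)): dist[frozenset({a, c})] + dist[frozenset({b, d})],
        ((a, d), (b, c)): dist[frozenset({a, d})] + dist[frozenset({b, c})],
    }


def satisfies_four_point(dist, a, b, c, d, tol=1e-9):
    """Additive distances: the two largest of the three sums are equal."""
    sums = sorted(four_point_sums(dist, a, b, c, d).values())
    return abs(sums[2] - sums[1]) <= tol


# Hypothetical branch lengths: la, lb, lc, ld for the leaves and x for the
# interior branch between X and Y.
la, lb, lc, ld, x = 2, 3, 4, 1, 5
dist = {
    frozenset({"α", "β"}): la + lb,
    frozenset({"χ", "δ"}): lc + ld,
    frozenset({"α", "χ"}): la + x + lc,
    frozenset({"β", "δ"}): lb + x + ld,
    frozenset({"α", "δ"}): la + x + ld,
    frozenset({"β", "χ"}): lb + x + lc,
}

sums = four_point_sums(dist, "α", "β", "χ", "δ")
neighbors = min(sums, key=sums.get)   # pairing with the smallest sum
print(neighbors, satisfies_four_point(dist, "α", "β", "χ", "δ"))
```

With these values the two non-neighbor sums both equal la + lb + lc + ld + 2x, in agreement with the relation stated above.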

Suppose there are N OTUs with a matrix in which the evolutionary
distances between every pair of OTUs are known. Then the process of
building the tree starts with no initial clustering. That is, every OTU
is considered as being in an equal relationship with every other OTU.
This leads to initiating a non-hierarchical (star-like) structure shown
in Figure y.t(a).

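[Figure: (a) eight OTUs, 1 to 8, radiating from a single central node
X; (b) OTUs 1 and 2 joined at node X, OTUs 3 to 8 joined at node Y,
with an interior branch between X and Y]

Figure y.t (a) Star-like, non-hierarchical structure initialized; and (b)
the tree realized with the given set of OTUs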

Let the distance between the OTU-pair i and j be dij, and let lab be the
distance between nodes a and b. The sum of the branch lengths for
the non-hierarchical star-like tree centered at X in Figure y.t(a) is then
given by:

$$ S_0 = \sum_{i=1}^{N} l_{iX} = \frac{1}{N-1}\sum_{i<j} d_{ij} \qquad \text{(y.c)} $$

The above relation is due to each branch being counted (N – 1) times
when all distances are added. The next step of tree construction involves
identifying the first pair of OTUs to be clustered, implying that they
diverge from the same node in the tree. Considering an example
(with eight OTUs as above) forming the star shown in Figure
y.t(a), suppose 1 and 2 constitute the first pair being clustered at X,
and the rest of the OTUs are clustered at another node Y as illustrated in
Figure y.t(b). Now, for this tree that has the OTU pair {1, 2} diverging
from a single node (X) and the other OTU set {3, 4, 5, 6, 7, 8} from
another node (Y), the “length” (S12) of the tree can be determined.
The relevant algorithm is as follows: the length of the tree (Sij) for a
node pair {i, j} can be deduced in terms of the number of OTUs (N) and
the pairwise distances dij for every two leaves (taxa/OTUs) (i, j). Hence,
with respect to Figure y.t(b), the branch length between X and Y
(lXY) is given by:

$$ l_{XY} = \frac{1}{2(N-2)}\left[\sum_{k=3}^{N}(d_{1k}+d_{2k}) - (N-2)(l_{1X}+l_{2X}) - 2\sum_{i=3}^{N} l_{iY}\right] \qquad \text{(y.z)} $$

where the first term within the brackets, $\sum_{k=3}^{N}(d_{1k}+d_{2k})$,
is the sum of all distances that include lXY, and the terms
$(N-2)(l_{1X}+l_{2X})$ and $2\sum_{i=3}^{N} l_{iY}$ depict irrelevant
entities being excluded in computing lXY. Further, if
the interior branch (X to Y) in Figure y.t(b) is removed, there will
be two independent star-like trees, one for the OTU-pair {1, 2}
and the other for the remaining set of (N – 2) OTUs. The corresponding
branch lengths, $(l_{1X}+l_{2X})$ and $\sum_{i=3}^{N} l_{iY}$, can be
deduced by applying equation (y.c). That is,

$$ l_{1X} + l_{2X} = d_{12} \qquad \text{(y.a1)} $$

and,

$$ \sum_{i=3}^{N} l_{iY} = \frac{1}{N-3}\sum_{3\le i<j} d_{ij} \qquad \text{(y.a2)} $$

Adding these branch lengths, the sum of all branch lengths of the tree
configuration of Figure y.t(b), namely S12, is obtained as follows:

$$ S_{12} = l_{XY} + (l_{1X}+l_{2X}) + \sum_{i=3}^{N} l_{iY} = \frac{1}{2(N-2)}\sum_{k=3}^{N}(d_{1k}+d_{2k}) + \frac{1}{2}d_{12} + \frac{1}{N-2}\sum_{3\le i<j} d_{ij} \qquad \text{(y.zz)} $$

where the three terms on the right-hand side denote explicitly the
following:

$\frac{1}{2(N-2)}\sum_{k=3}^{N}(d_{1k}+d_{2k})$ = Mean distance from OTU1 and
OTU2 to the rest of the OTUs 3, 4, …, N.

$\frac{1}{2}d_{12}$ = Semi-pairwise distance from OTU1 to OTU2

$\frac{1}{N-2}\sum_{3\le i<j} d_{ij}$ = Average of all pairwise distances
between the OTUs 3 to N.

(Equation (y.zz) refers to the sum of the least-squares estimates of
branch lengths, as proved in [Saitou and Nei]).
Thus S12 is the total tree-length when OTU1 and OTU2 are
considered to form the first cluster. Likewise the tree-lengths Sij for
all other pairs (S13, S14, …, S1N, and so on) can be calculated, and the
lowest of these indicates the pair of OTUs to be clustered first
(implying the first neighbors to be joined). These first two OTU
neighbors are then merged to depict a single (composite) OTU.
For example, suppose S12 is found to be the smallest among all
values of the set {Sij}. Then OTUs 1 and 2 are designated as a pair of
(nearest) neighbors, and these are joined to form a combined OTU
(1-2). The distance between this combined OTU and
any other OTU j is given by: D(1 − 2)j = (D1j + D2j)/2 for (3 ≤ j ≤ N).
Now the number of OTUs involved is reduced by one, and
for the resulting new distance matrix the previous procedure is
applied so as to find the next pair of (nearest) neighbors. This cycle
of operation is iterated until the number of OTUs is reduced to three,
at which point only a single unrooted tree is possible.
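To make the selection of the first pair concrete, the tree length of equation (y.zz) can be evaluated for every candidate pair and the smallest value picked. The sketch below is a minimal illustration; the function name `tree_length` is hypothetical, and the four-taxon distances are the same values that appear in Example y.2 later in this section.

```python
from itertools import combinations


def tree_length(names, dist, i, j):
    """S_ij of equation (y.zz): total tree length if OTUs i and j are joined first."""
    n = len(names)
    rest = [k for k in names if k not in (i, j)]
    # Sum of distances from i and j to every other OTU.
    to_rest = sum(dist[frozenset({i, k})] + dist[frozenset({j, k})] for k in rest)
    # Sum of pairwise distances among the remaining OTUs.
    among_rest = sum(dist[frozenset({k, m})] for k, m in combinations(rest, 2))
    return (to_rest / (2 * (n - 2))
            + dist[frozenset({i, j})] / 2
            + among_rest / (n - 2))


names = ["α", "β", "δ", "ε"]
dist = {
    frozenset({"α", "β"}): 7, frozenset({"α", "δ"}): 11,
    frozenset({"α", "ε"}): 14, frozenset({"β", "δ"}): 6,
    frozenset({"β", "ε"}): 9, frozenset({"δ", "ε"}): 7,
}

lengths = {pair: tree_length(names, dist, *pair)
           for pair in combinations(names, 2)}
best = min(lengths, key=lengths.get)
print(lengths)
print(best)
```

For these distances the pairs {α, β} and {δ, ε} tie for the smallest tree length, so either may be joined first.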
In the tree construction, another computation needed is the
determination of the branch lengths. This can be done via the so-called
Fitch-Margoliash algorithm [W. M. Fitch and E. Margoliash, Construction
of phylogenetic trees, Science 155, 1967, 279-284]. The relevant
considerations are as follows:
Suppose, for example, the OTU pair {1, 2} is the first pair
being joined in the tree of Figure y.t; the corresponding l1X and l2X are
estimated by:
l1X = (d12 + d1Z − d2Z)/2 (y.ua)
l2X = (d12 + d2Z − d1Z)/2 (y.ub)

where Z denotes the group of all OTUs (leaves) excluding the OTU
pair {1, 2}. It represents intuitively the whole cluster of all the
remaining OTUs, as shown in Figure y.b.

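[Figure: tree of Figure y.t(b) with OTUs 1 and 2 joined at node x and
the remaining OTUs 3 to 8, joined at node y, enclosed as the cluster Z]

Figure y.b Cluster (Z) of all the leaves excluding {1 and 2}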

Further, d1Z and d2Z are the distances between (1 and Z) and (2 and Z)
respectively. Explicitly, they are determined by [Nei 1987]:

$$ d_{1Z} = \frac{1}{N-2}\sum_{i=3}^{N} d_{1i} \qquad \text{(y.uc)} $$

$$ d_{2Z} = \frac{1}{N-2}\sum_{i=3}^{N} d_{2i} \qquad \text{(y.ud)} $$

And l1X and l2X denote the least-squares estimates for the tree of the
illustration in Figure y.t(b).
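Equations (y.ua) to (y.ud) translate directly into code. The sketch below is illustrative: the function name `pair_branch_lengths` is hypothetical, and the distances are the four-taxon values of Example y.2, with α and β playing the roles of OTUs 1 and 2.

```python
def pair_branch_lengths(names, dist, i, j):
    """Branch lengths from OTUs i and j to their common node X.

    Implements equations (y.ua)-(y.ud): Z is the group of all OTUs other
    than i and j, and d_iZ, d_jZ are mean distances to that group.
    """
    rest = [k for k in names if k not in (i, j)]      # the cluster Z
    d_iz = sum(dist[frozenset({i, k})] for k in rest) / len(rest)
    d_jz = sum(dist[frozenset({j, k})] for k in rest) / len(rest)
    d_ij = dist[frozenset({i, j})]
    l_ix = (d_ij + d_iz - d_jz) / 2
    l_jx = (d_ij + d_jz - d_iz) / 2
    return l_ix, l_jx


names = ["α", "β", "δ", "ε"]
dist = {
    frozenset({"α", "β"}): 7, frozenset({"α", "δ"}): 11,
    frozenset({"α", "ε"}): 14, frozenset({"β", "δ"}): 6,
    frozenset({"β", "ε"}): 9, frozenset({"δ", "ε"}): 7,
}

l_alpha, l_beta = pair_branch_lengths(names, dist, "α", "β")
print(l_alpha, l_beta)
```

By construction the two branch lengths add up to the pairwise distance between the joined OTUs.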
The neighbor-joining method is, in essence, an iterative
algorithm that assumes an additive tree (summing procedure). Each
iteration is indicated in the following pseudocode:
__________________________________________________
Phylogenetic tree construction: Pseudocode on
NJ method
Input distance matrix
← Given a set of multiple-sequence data
← Call subroutine for sequence alignment:
  Global multiple alignment and selection of
  reliable data of aligned sequences
  devoid of any gaps; that is, any gap
  present is skipped or ignored (working
  alignment)
Build distance matrix between sequences
Define a distance parameter (di,j) between
the nodes 1 through N (N = 8 in the
example of Figure y.t(a))
← The distance parameter can be defined in
  different ways, such as the Jukes-Cantor
  method, the Kimura method, the p-distance
  method etc.
← As an example, the p-distance method
  defines the distance between two
  sequences as the ratio of the count of
  differing positions (of a nucleotide,
  for example) to the total number of
  positions
← The distance matrix is symmetric about
  the diagonal, so only the values above
  (or below) the diagonal need to be
  computed, and it suffices to show only
  the top half (or lower half) filled.
← Enter the distance parameter as elements
  in a matrix.
← The resulting distance matrix (DM) is used
  to identify the first branch of the tree as
  consisting of the two OTUs that have the
  shortest distance between them

Example of an (N × N) distance matrix (columns indexed by i, rows by j)

d11  d12  . . .  d1N
d21  d22  . . .  d2N
 .    .   dij     .
 .    .           .
dN1  dN2  . . .  dNN

← The columns and rows of this matrix
  denote nodes, and the values (di,j) of
  the matrix elements represent the
  distance between nodes i and j.
← An example of a (5 × 5) distance matrix
  with numerical values. (This corresponds
  to a real data matrix for five ribosomal
  RNA sequences. Each value denotes the
  estimated number of nucleotide residue
  substitutions per position separating
  the corresponding pair of the presently
  existing sequences).

Node        1      2      3      4      5
Taxa        Bsu    Bst    Lvi    Amo    Mlu
1  Bsu      0      0.172  0.215  0.309  0.233
2  Bst             0      0.299  0.340  0.206
3  Lvi                    0      0.280  0.394
4  Amo                           0      0.429
5  Mlu                                  0

Initialize
Step 1
Start off with a star tree (Figure y.y1)

[Figure: nodes 1, 2, 3, ..., 5 radiating from a common centre]
Figure y.y1: Star-like tree initialized

Step 2
Define/identify the neighbors
← The two nodes with the lowest value in the
  matrix are chosen and defined as
  neighbors. In the example shown above,
  nodes 1 (Bsu) and 2 (Bst) show the lowest
  value (0.172) in the distance matrix.
  So, they are the “nearest” and are
  identified as “neighbors”.

Merge these neighbors to form a single
composite (new) OTU and label it as (1-2)

Step 3
Compute the distances between this
composite node (1-2) and the rest of the
nodes
← This computation involves taking the
  unweighted arithmetic mean of all the
  pairwise distances of the new OTU (1-2)
  with respect to all other OTUs
  (namely, 3, 4 and 5). Hence,
  → the distance between (1-2) and 3
    is: d(1-2)3 = (d13 + d23)/2
  → the distance between (1-2) and 4
    is: d(1-2)4 = (d14 + d24)/2
  → the distance between (1-2) and 5
    is: d(1-2)5 = (d15 + d25)/2

Step 4
Construct the resulting table as shown below:

Node        1      2      3      4      5
Taxa        Bsu    Bst    Lvi    Amo    Mlu
1  Bsu      0      0.172  0.215  0.309  0.233
2  Bst             0      0.299  0.340  0.206
3  Lvi                    0      0.280  0.394
4  Amo                           0      0.429
5  Mlu                                  0

Node            1-2       3      4      5
Taxa            Bsu-Bst   Lvi    Amo    Mlu
1-2  Bsu-Bst    0         0.257  0.325  0.220
3    Lvi                  0      0.280  0.394
4    Amo                         0      0.429
5    Mlu                                0

Step 5
Repeat the process from Step 2, again looking
for the 2 nearest nodes and constructing the
resultant matrix
← Now OTU 5 (Mlu) and the composite OTU
  (1-2) (Bsu/Bst) constitute the nearest
  neighbors
← Construct the new matrix with:
  d(1-2)5 = (d15 + d25)/2 = (d1x + d2x)
  d[(1-2)5]3 = (d1x + d13)/2 + (d2x + d23)/2
  d[(1-2)5]4 = (d1x + d14)/2 + (d2x + d24)/2

Node                    (1-2)5         3      4
Taxa                    (Bsu/Bst)Mlu   Lvi    Amo
(1-2)5  (Bsu/Bst)Mlu    0              0.282  0.435
3       Lvi                            0      0.280
4       Amo                                   0

Step 6
Repeat the process from Step 2, again looking
for the 2 nearest nodes and constructing the
resultant matrix
← Now the composite OTU (1-2)5 (Bsu-Bst)Mlu and
  the composite OTU (3-4) (Lvi-Amo) form the
  nearest neighbors
← Construct the new matrix with:
  d(1-2)5 = (d1x + d2x)
← d[(1-2)5](3-4) = (d1x + d1x-3 + d2x + d2x-3)/4

Node                    (1-2)5         (3-4)
Taxa                    (Bsu-Bst)Mlu   (Lvi-Amo)
(1-2)5  (Bsu-Bst)Mlu    0              0.289
(3-4)   (Lvi-Amo)                      0

______________________________________

Inasmuch no prior information is available on

_______________________________________________________

Reconstruction of a phylogenetic tree: Iterative algorithm of
neighbor-joining (NJ) method

Input: an (n × n) distance matrix ∆i,j = [di,j(T)]
       for every two leaves (taxa) (i, j)

Do: Iteration (I) over k = 1, 2, ..., r over
    r taxa

← Each I involves…
Calculate: Q-matrix based on the current
           distance matrix.
← For a given distance d(i, j) between
  taxa i and j, and using the corresponding
  distance matrix relating r taxa, the
  Q-matrix is specified as follows:
← $Q_{i,j} = (r-2)\,d(i,j) - \sum_{k=1}^{r} d(i,k) - \sum_{k=1}^{r} d(j,k)$   (y.1)
Determine: The pair of taxa with the
           lowest value in Q
Create: A node on the tree that joins this
        pair of taxa.
← The closest neighbors are joined
Calculate: The distance of each of the
           taxa in the pair to this new node
Determine: The distance of all taxa
           outside of this pair to the new node

Go To: I (k = 1, 2, …, r), considering the
       pair of joined neighbors as a single
       taxon and using the distances
       calculated in the previous step.
Stop: Termination of the rth iteration
_______________________________________________________

Example y.1 (of NJ method)
Suppose there are three sequences (α, β, γ) with distances specified
by:

      β    γ
α    25   30
β         35

The problem is to construct the unrooted tree in which α, β and γ are
joined at a single internal node by branches of lengths a, b and c
respectively.

Solution:
From the distance matrix given:

α to β: a + b = 25
α to γ: a + c = 30
β to γ: b + c = 35

The three linearly independent equations lead to a = 10, b = 15 and
c = 20.
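The three simultaneous equations have a closed-form solution, obtained by adding two of them and subtracting the third. A minimal sketch follows; the function name `three_taxon_branches` is an illustrative choice.

```python
def three_taxon_branches(d_ab, d_ac, d_bc):
    """Solve a + b = d_ab, a + c = d_ac, b + c = d_bc for (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Distances of Example y.1: α-β = 25, α-γ = 30, β-γ = 35.
a, b, c = three_taxon_branches(25, 30, 35)
print(a, b, c)
```

The same closed form applies whenever an NJ run has been reduced to its last three nodes.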

Example y.z
A hypothetical set of multiple sequences is given below:

α CAG CGT TGG GCG ATG GCA ACC…
β CAG CGT TGG GCG ACG GTA ATC…
δ CAG CAT TGA ATG ATG ATA ATC…
ε CAG CAT TGA GTG ATA ATA ATC…

Construct the distance matrix between the sequences and hence
construct an unrooted tree using the distance matrix derived.

Solution
The distances depict the number of changes seen in the nucleotides (or
amino acids) between any two sequences under comparison. Hence,
the required matrix is:

      α   β   δ   ε
α         3   7   8
β             6   7
δ                 3
ε

Unrooted tree evolved from the given sequences

[Figure: unrooted tree with α and β on one side and δ and ε on the
other; terminal branch lengths: α = 2, β = 1, δ = 1, ε = 2]

The distance scoring used can be empirical. However, only a single
tree is realized. In building this single tree, the raw pairwise distances
may not always be perfectly additive. That is, with reference to
Example y.1, the branch lengths obtained from the three simultaneous
equations reproduce the pairwise distances exactly. Such exactness
may not prevail in raw data, for instance when there is undetected
homoplasy. (Homoplasy is a character shared by a set of species but
not present in their common ancestor). A distance method may miss
homoplasies across long distances, meaning that long evolutionary
distances may be underestimated.
Neighbor-joining can be regarded as a star decomposition
technique of heuristic nature. It is computationally not very intensive
and may produce reasonable trees. But its tree search does not impose
an optimality criterion, and as such no guarantee is given that
the “recovered tree is the one that best fits the data”. Rather, one
should use it to produce a starting tree, and then perform a tree
search exercising an optimality criterion so as to ensure that the best
tree is obtained.

Example y.2 (Deducing a Q-matrix)
Given a set of four taxa {α, β, δ, ε} as shown, deduce the
corresponding Q-matrix.

      α    β    δ    ε
α
β     7
δ     11   6
ε     14   9    7

Solution:

Q-matrix:

      α     β     δ     ε
α
β     −40
δ     −34   −34
ε     −34   −34   −40

In the example above, two pairs of taxa, {α, β} and {δ, ε}, have the
lowest value, namely −40. Either of them can be selected for the second
step of the algorithm; here the example is followed assuming that taxa α
and β are joined together. Adding more branches is iterative: each
join adds a new node to the tree, and if more sequences are included,
corresponding branches will be introduced in the tree. The number of
joining steps therefore grows linearly with the number of sequences.
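The Q-matrix of equation (y.1) can be computed from the row totals of the distance matrix. The following is a minimal sketch using the distances of Example y.2; the function name `q_matrix` and the `frozenset`-keyed lookup are illustrative choices.

```python
from itertools import combinations


def q_matrix(names, dist):
    """Q_ij = (r - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k), equation (y.1)."""
    r = len(names)
    # Row totals; d(i, i) = 0, so the diagonal contributes nothing.
    total = {i: sum(dist[frozenset({i, k})] for k in names if k != i)
             for i in names}
    return {frozenset({i, j}): (r - 2) * dist[frozenset({i, j})] - total[i] - total[j]
            for i, j in combinations(names, 2)}


names = ["α", "β", "δ", "ε"]
dist = {
    frozenset({"α", "β"}): 7, frozenset({"α", "δ"}): 11,
    frozenset({"α", "ε"}): 14, frozenset({"β", "δ"}): 6,
    frozenset({"β", "ε"}): 9, frozenset({"δ", "ε"}): 7,
}

q = q_matrix(names, dist)
for pair, value in q.items():
    print(sorted(pair), value)
```

The pair or pairs with the lowest Q value are the candidates for joining in the current iteration.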

A. Distance of the pair members to the new node

For each neighbor in the pair just joined, the distance to the new
node, with reference to the taxa set {θ and φ: the paired taxa;
ω: the newly generated node}, can be calculated by using the
following algorithm:

$$ d(\theta,\omega) = \frac{d(\theta,\phi)}{2} + \frac{1}{2(r-2)}\left[\sum_{k=1}^{r} d(\theta,k) - \sum_{k=1}^{r} d(\phi,k)\right] \qquad \text{(y.2)} $$

The distance of the other member of the pair then follows as
d(φ, ω) = d(θ, φ) − d(θ, ω).

B. Distance of the other taxa to the new node

For each taxon not considered above, the distance to the new node is
obtained from:

$$ d(\omega,k) = \frac{1}{2}\left[d(\theta,k) - d(\theta,\omega)\right] + \frac{1}{2}\left[d(\phi,k) - d(\phi,\omega)\right] \qquad \text{(y.3)} $$

where, as said before, ω is the new node, θ and φ are the
members of the pair just joined, and k is the node for which it is
required to calculate the distance.

Problem y.1:
(a) In Example y.2 determine: (i) the distance between α and
the new node; and (ii) the distance between β and the new
node. Use equation (y.2). (Answers: (i) 6; (ii) 1)

(b) In Example y.2 determine: (i) the distance between δ and
the new node; and (ii) the distance between ε and the new
node. Use equation (y.3). (Answers: (i) 5; (ii) 8)

Next step on summing procedure

Using the considerations as above, the following matrix can be
evolved iteratively for Example y.2 (with αβ acting as a new taxon)
with the result of Problem y.1(b):

      αβ   δ    ε
αβ
δ     5
ε     8    7

In the next iteration, the matrix above can be considered as the original
distance matrix. (With reference to Example y.2, one more step of
the recursion will complete the tree. Try!)
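The complete iteration (Q-matrix, choice of the pair, and the two distance updates of equations (y.2) and (y.3)) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `neighbor_joining`, the labels ω1, ω2 for new nodes and the edge-list output are choices made here, the limb-length update uses the 2(r − 2) normalisation that reproduces the answers of Problem y.1(a), and ties in Q are broken by taking the first pair encountered.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return (node, neighbor, branch_length) edges of an unrooted NJ tree."""
    nodes = list(names)
    d = dict(dist)
    edges = []
    next_id = 1
    while len(nodes) > 2:
        r = len(nodes)
        total = {i: sum(d[frozenset({i, k})] for k in nodes if k != i)
                 for i in nodes}
        # Q-matrix of equation (y.1); the pair with the lowest Q is joined.
        theta, phi = min(
            combinations(nodes, 2),
            key=lambda p: (r - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        d_pair = d[frozenset({theta, phi})]
        # Equation (y.2): distance of each pair member to the new node.
        d_theta = d_pair / 2 + (total[theta] - total[phi]) / (2 * (r - 2))
        d_phi = d_pair - d_theta
        omega = f"ω{next_id}"
        next_id += 1
        edges += [(theta, omega, d_theta), (phi, omega, d_phi)]
        # Equation (y.3): distance of every other node to the new node.
        for k in nodes:
            if k not in (theta, phi):
                d[frozenset({omega, k})] = (
                    (d[frozenset({theta, k})] - d_theta)
                    + (d[frozenset({phi, k})] - d_phi)
                ) / 2
        nodes = [k for k in nodes if k not in (theta, phi)] + [omega]
    # Two nodes remain: connect them by the final interior branch.
    a, b = nodes
    edges.append((a, b, d[frozenset({a, b})]))
    return edges


names = ["α", "β", "δ", "ε"]
dist = {
    frozenset({"α", "β"}): 7, frozenset({"α", "δ"}): 11,
    frozenset({"α", "ε"}): 14, frozenset({"β", "δ"}): 6,
    frozenset({"β", "ε"}): 9, frozenset({"δ", "ε"}): 7,
}

edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

Summing the branch lengths along the path between any two leaves of the resulting tree recovers the corresponding entry of the input matrix, as expected for additive distances.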

The neighbor-joining (NJ) topology that gives the least total branch
length is preferred at each step of the algorithm. Hence NJ relies on
the minimum-evolution criterion for phylogenetic trees. But the NJ
solution may not lead to the true tree topology with this least total
branch length specification, inasmuch as it is a greedy algorithm. That
is, it develops the tree in a step-wise fashion. It is suboptimal in this
sense, but often it finds a tree close to the optimal topology. Yet there
are other methods that do not rely on distance measures and give
results with better accuracy.
The NJ method is computationally efficient and runs in polynomial
time, so very large data sets can be handled by the NJ approach.
(Other phylogenetic analyses such as minimum evolution,
maximum parsimony and maximum likelihood are computationally
intensive.) Though NJ computation leads to an unrooted topology,
rooted trees can be obtained by using an outgroup; the root can then
be inserted at the point in the tree where the edge from the outgroup
connects. That is, inasmuch as phylogenetic inference via the NJ
method yields an unrooted tree, it is not possible to tell from this
result alone which of the OTUs branched off before all the others. So,
in order to root a tree one should add an “outgroup” to the data set.
An outgroup is an OTU for which external information (for example,
paleontological information) indicates that it branched off before all
other taxa (see Figure y.1). In practice, distance-matrix methods
suggest the inclusion of at least one outgroup sequence, which is
known to be only distantly related to the sequences of interest in the
query set. This procedure corresponds to an experimental control
(contrasting an experimental group with a control group). When the
chosen outgroup is appropriate, it will show a much greater genetic
distance (and hence will possess a longer branch) than any other
sequence in the set; as such, it will appear near the root of a rooted
tree, or will indicate the nearest node as the possible root node in an
unrooted topology (Figure y.1).
Further, choosing an appropriate outgroup requires the
selection of a sequence that is moderately related to the sequences of
interest. If the outgroup is too closely related, it loses its
significance. Likewise, when the outgroup is too distantly related, it
simply adds noise to the analysis. In prescribing an outgroup, care
must be exercised so as to omit cases where the species from which the
sequences were taken are distantly related but the gene encoded by
the sequences is highly conserved across lineages. Another
confounding aspect of outgroup selection and usage is horizontal gene
transfer (between otherwise divergent bacteria).

The NJ strategy is statistically consistent under many models of
evolution. Hence, given data of sufficient length, neighbor-joining
will reconstruct the true tree with high probability. It is shown in [K.
Atteson, The performance of neighbor-joining methods of
phylogenetic reconstruction, Algorithmica, 25, 1999, 251-278] that,
if each entry in the distance matrix differs from the true distance by
less than half of the shortest branch length in the tree, then NJ will
construct the correct tree. In order to correct for the greater
inaccuracy in measuring distances between distantly-related
sequences, an alternative procedure for tree reconstruction from a
distance matrix is the Fitch-Margoliash (FM) method [W. M. Fitch and
E. Margoliash, Construction of phylogenetic trees, Science 155, 1967,
279-284]. It is described below.

Fitch-Margoliash method
This method uses the genetic distance and follows a weighted least
squares method for clustering. That is, more weight in the tree
construction process is given to closely-related sequences. This
approach would reduce the inaccuracy associated in the distance
measurement on distantly-related sequences. In the relevant
computation, the distances used as input to the algorithm are first
normalized so as to prevent large deviational artifacts in (computing)
relationships between closely-related versus distantly-related groups.
Further, the distances adopted for this method must be linear. This
linearity criterion for distances requires that the expected values of
the branch lengths for two individual branches must equal the
expected value of the sum of the two branch distances. This property
applies to biological sequences only when they have been corrected
for the possibility of back mutations at individual sites. (This
correction can be done via substitution matrix such as that derived
from the Jukes-Cantor model of DNA evolution. The distance
correction is, however, necessary in practice only when the evolution
rates differ among branches).
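As an illustration of such a correction, the observed proportion of differing sites (the p-distance) can be converted into a Jukes-Cantor distance, d = −(3/4) ln(1 − 4p/3), which allows for multiple substitutions at the same site. The sketch below is a minimal example; the function names and the two short sequences are hypothetical and serve only to show the arithmetic.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return diffs / len(seq_a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if not 0.0 <= p < 0.75:
        raise ValueError("Jukes-Cantor distance is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical pair of aligned sequences differing at 2 of 7 positions.
p = p_distance("GATTACA", "GACTATA")
print(p, jukes_cantor(p))
```

The corrected value is always at least as large as p, and the gap widens as p grows, which is why the correction matters most for distantly-related sequences.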
Though the least-squares criterion makes the FM approach more
accurate, it is computationally heavier than the NJ methods. Further,
in addition to the least-squares criterion, another improvement that
corrects for correlations between distances (stemming from many
closely related sequences in the data set) can be envisaged in the FM
technique at increased computational cost. Finding the optimal
least-squares tree with any correction factor is NP-complete [W. H.
E. Day, Computational complexity of inferring phylogenies from
dissimilarity matrices. Bulletin of Mathematical Biology 49, 1986,
461-467], so heuristic search methods like those used in
maximum-parsimony analysis are applied to the search through tree
space.
(In the theory of computational complexity, NP (nondeterministic
polynomial time) is the class of problems for which a given solution
can be verified quickly, that is, in polynomial time. The relevant
class here is NP-complete (abbreviated NP-C or NPC): these are
problems in NP such that, if any one of them can be solved quickly (in
polynomial time), then so can every problem in NP. Although a
proposed solution can be verified quickly, no efficient way of
finding a solution is known. That is, the time required to solve the
problem using any currently known algorithm increases considerably as
the size of the problem grows. As such, NP-complete problems are
mostly addressed by using approximation algorithms.)
Apart from the details perceived from a group of query sequences,
suppose some independent information regarding the relationship
between such sequences or groups is available. It can be used
profitably in curtailing the (rooted and/or unrooted) tree search
space.
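
To give a feel for how quickly the tree search space grows, the
sketch below counts the bifurcating topologies on n labelled taxa
using the standard formulas (2n-5)!! for unrooted trees and (2n-3)!!
for rooted trees. The function names are hypothetical and chosen only
for this illustration.

```python
def odd_double_factorial(k: int) -> int:
    """Product k * (k - 2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies on n >= 3 labelled taxa."""
    if n < 3:
        raise ValueError("need at least three taxa")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies on n >= 2 labelled taxa."""
    if n < 2:
        raise ValueError("need at least two taxa")
    return odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

Already at ten taxa there are about two million unrooted topologies,
which is why exhaustive evaluation gives way to heuristic searches.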
In general, pairwise distance data are an underestimate of the
path-distance between taxa on a phylogram. Pairwise distances
effectively "cut corners" in a manner analogous to geographic
distance: the distance between two cities may be 100 miles "as the
crow flies," but a traveler may actually be obligated to travel 120
miles because of the layout of roads, the terrain, stops along the way,
etc. Between pairs of taxa, some character changes that took place in
ancestral lineages will be undetectable, because later changes have
erased the evidence (often called multiple hits and back mutations in
sequence data). This problem is common to all phylogenetic
estimation, but it is particularly acute for distance methods, because
only two samples are used for each distance calculation; other
methods benefit from evidence of these hidden changes found in
other taxa not considered in pairwise comparisons. For nucleotide
and amino acid sequence data, the same stochastic models of
nucleotide change used in maximum likelihood analysis can be
employed to "correct" distances, rendering the analysis "semi-
parametric."
Several simple algorithms exist to construct a tree directly
from pairwise distances, including UPGMA and neighbor joining
(NJ), but these will not necessarily produce the best tree for the data.
To counter potential complications noted above, and to find the best
tree for the data, distance analysis can also incorporate a tree-search
protocol that seeks to satisfy an explicit optimality criterion. Two
optimality criteria are commonly applied to distance data, minimum
evolution (ME) and least squares inference. Least squares is part of a
broader class of regression-based methods lumped together here for
simplicity. These regression formulae minimize the residual
differences between path-distances along the tree and pairwise
distances in the data matrix, effectively "fitting" the tree to the
empirical distances. In contrast, ME accepts the tree with the shortest
sum of branch lengths, and thus minimizes the total amount of
evolution assumed. ME is closely akin to parsimony, and under
certain conditions, ME analysis of distances based on a discrete
character dataset will favor the same tree as conventional parsimony
analysis of the same data.
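
The two optimality criteria can be made concrete with a small sketch
for a four-taxon unrooted tree ((A,B),(C,D)). It evaluates, for given
branch lengths, the weighted least-squares residual, the sum over
pairs of (D_ij - d_ij)^2 / D_ij^P, where D is the observed distance
and d the path distance on the tree, together with the total tree
length used by ME. Taking P = 2 is the weighting usually attributed
to Fitch and Margoliash, and P = 0 gives unweighted least squares.
The branch lengths and distances below are hypothetical, and the
sketch only scores a tree; it does not optimize branch lengths or
search over topologies.

```python
def quartet_path_distances(a, b, c, d, internal):
    """Path distances on the unrooted tree ((A,B),(C,D)).

    a, b, c, d are the terminal branch lengths and `internal` is the
    length of the branch joining the two cherries.
    """
    return {
        ("A", "B"): a + b,
        ("A", "C"): a + internal + c,
        ("A", "D"): a + internal + d,
        ("B", "C"): b + internal + c,
        ("B", "D"): b + internal + d,
        ("C", "D"): c + d,
    }


def weighted_residual(observed, fitted, power=2):
    """Sum of (D_ij - d_ij)^2 / D_ij^power over all taxon pairs."""
    return sum(
        (observed[pair] - fitted[pair]) ** 2 / observed[pair] ** power
        for pair in observed
    )


def tree_length(*branches):
    """Sum of branch lengths: the quantity minimized under minimum evolution."""
    return sum(branches)


if __name__ == "__main__":
    branches = (2, 1, 2, 3, 3)  # hypothetical a, b, c, d, internal
    fitted = quartet_path_distances(*branches)
    observed = dict(fitted)
    observed[("A", "D")] = 8.4  # one pairwise distance perturbed by noise
    print("FM residual:", weighted_residual(observed, fitted))
    print("tree length:", tree_length(*branches))
```

In a full analysis, each candidate topology would first have its
branch lengths fitted, and the topologies would then be compared on
the resulting residual (least squares) or tree length (ME).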
Phylogeny estimation using distance methods has produced a number
of controversies. UPGMA assumes an ultrametric tree (a tree where
all the path-lengths from the root to the tips are equal). If the rate of
evolution were equal in all sampled lineages (a molecular clock), and
if the tree were completely balanced (equal numbers of taxa on both
sides of any split, to counter the node density effect), UPGMA
should not produce a biased result. These expectations are not met by
most datasets, and although UPGMA is somewhat robust to their
violation, it is not commonly used for phylogeny estimation.
Neighbor-joining is a form of star decomposition and, as a heuristic
method, is generally the least computationally intensive of these
methods. It is very often used on its own, and in fact quite frequently
produces reasonable trees. However, it lacks any sort of tree search
and optimality criterion, and so there is no guarantee that the
recovered tree is the one that best fits the data. A more appropriate
analytical procedure would be to use NJ to produce a starting tree,
then employ a tree search using an optimality criterion, to ensure that
the best tree is recovered.
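
The first step of neighbor joining can be sketched as follows. For n
taxa, the pair (i, j) minimizing Q(i, j) = (n - 2) d(i, j) - r(i) -
r(j) is joined, where r(i) is the sum of distances from taxon i to
all others. The function name and the five-taxon distance matrix are
hypothetical illustrations; a complete implementation would repeat
this step on a reduced matrix until the tree is resolved.

```python
def nj_select_pair(labels, dist):
    """One neighbor-joining step on a symmetric distance matrix.

    Returns (label_i, label_j, q_value, branch_i, branch_j), where the
    two branch lengths connect the joined taxa to their new internal node.
    """
    n = len(labels)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    totals = [sum(row) for row in dist]

    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    q, i, j = best
    branch_i = 0.5 * dist[i][j] + (totals[i] - totals[j]) / (2 * (n - 2))
    branch_j = dist[i][j] - branch_i
    return labels[i], labels[j], q, branch_i, branch_j


if __name__ == "__main__":
    taxa = ["a", "b", "c", "d", "e"]
    # Hypothetical pairwise distances, for illustration only.
    matrix = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    print(nj_select_pair(taxa, matrix))
```

Note that the pair chosen here (a and b, at distance 5) is not the
closest pair in the matrix (d and e, at distance 3); the Q criterion
also accounts for how far each taxon is from all the others.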

Many scientists eschew distance methods. In some cases, this is for
esoteric philosophical reasons. A commonly cited reason is that
distances are inherently phenetic rather than phylogenetic, in that
they do not distinguish between ancestral similarity
(symplesiomorphy) and derived similarity (synapomorphy). This
criticism is not entirely fair: most current implementations of
parsimony, likelihood, and Bayesian phylogenetic inference use
time-reversible character models, and thus accord no special status to
derived or ancestral character states. Under these models, the tree is
estimated unrooted; rooting, and consequently determination of
polarity, is performed after the analysis. The primary difference
between these methods and distances is that parsimony, likelihood,
and Bayesian methods fit individual characters to the tree, whereas
distance methods fit all the characters at once. There is nothing
inherently less phylogenetic about this approach.

More practically, distance methods are avoided because the
relationship between individual characters and the tree is lost in the
process of reducing characters to distances. Since these methods do
not use character data directly, information locked in the
distribution of character states can be lost in the pairwise
comparisons. Also, some complex phylogenetic relationships may
produce biased distances. On any phylogram, branch lengths will be
underestimated because some changes cannot be discovered at all
when some species are not sampled, owing either to experimental
design or to extinction (a phenomenon called the node density effect).
Moreover, even if pairwise distances from genetic data are
"corrected" using stochastic models of evolution as mentioned above,
they may more easily sum to a different tree than one produced from
analysis of the same data and model using maximum likelihood. This
is because pairwise distances are not independent; each branch on a
tree is represented in the distance measurements of all taxa it
separates. Error resulting from any characteristic of that branch that
might confound phylogeny (stochastic variability, change in
evolutionary parameters, an abnormally long or short branch length)
will be propagated through all of the relevant distance measurements.
The resulting distance matrix may then better fit an alternate
(presumably less optimal) tree.

Despite these potential problems, distance methods are extremely
fast, and they often produce a reasonable estimate of phylogeny.
They also have certain benefits over the methods that use characters
directly. Notably, distance methods allow use of data that may not be
easily converted to character data, such as DNA-DNA hybridization
assays. They also permit analyses that account for the possibility that
the rate at which particular nucleotides are incorporated into
sequences may vary over the tree, using LogDet distances. For some
network-estimation methods (notably NeighborNet), the abstraction
of information about individual characters in distance data is an
advantage. When considered character by character, conflict between
a character and a tree due to reticulation cannot be told from conflict
due either to homoplasy or error. However, pronounced conflict in
distance data, which represents an amalgamation of many characters,
is less likely due to error or homoplasy unless the data are strongly
biased, and is thus more likely to be a result of reticulation.

Distance methods are extremely popular among some molecular
systematists, a substantial number of whom use NJ without an
optimization stage almost exclusively. With the increasing speed of
character-based analyses, some of the advantages of distance
methods will probably wane. However, the nearly instantaneous NJ
implementations, the ability to incorporate an evolutionary model in
a speedy analysis, LogDet distances, network estimation methods,
and the occasional need to summarize relationships with a single
number all mean that distance methods will probably stay in the
mainstream for a long time to come.

The clustering-based approach


Since the raw data is a string of character states (for example, the
nucleotide sequence), the character-state methods are regarded as
more powerful than distance methods; in transforming such
character-state data into distance matrices, some information is
lost. However, while the maximum parsimony method indeed uses the
raw data, it usually uses only a small fraction of the available data.
For instance, in the example that we have used, only three sites are
informative and were used, while six sites are excluded from the
analysis. For this reason, this method is often less efficient than some
distance-matrix methods (e.g., see Saitou and Nei 1986). Of course,
if the number of informative sites is large, the maximum parsimony
method is generally very effective. Distance methods give only one
tree, while parsimony analyses many trees and may suggest multiple,
equally parsimonious trees, none of which is necessarily the right
one. The maximum likelihood method does not suffer from these
limitations. It uses the entire sequence information, it analyses many
trees and proposes the tree with the highest likelihood. Moreover, the
method seems to be robust and relatively insensitive to violations of
the evolutionary model used or to unequal rates of evolution or
nucleotide bias. The method, however, is not suitable for large
datasets due to its CPU-intensive nature.
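
The notion of an informative site can be illustrated with a short
sketch. A site is taken to be parsimony-informative when at least two
different character states each occur in at least two sequences. The
four-sequence, nine-site alignment below is hypothetical (it is not
the example referred to above), constructed so that three sites are
informative and six are not; gap handling is ignored in this sketch.

```python
from collections import Counter


def informative_sites(alignment):
    """Return the 0-based indices of parsimony-informative columns."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")

    sites = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        # Informative: at least two states, each seen in two or more sequences.
        shared_states = sum(1 for c in counts.values() if c >= 2)
        if shared_states >= 2:
            sites.append(col)
    return sites


if __name__ == "__main__":
    # Hypothetical alignment of four sequences over nine sites.
    alignment = [
        "AAGAGTTCA",
        "AGCCGTTCT",
        "AGATATCCA",
        "AGAGATCCT",
    ]
    sites = informative_sites(alignment)
    print("informative sites:", sites)
    print("excluded sites:", len(alignment[0]) - len(sites))
```

Invariant columns and columns where only one sequence differs are the
ones discarded, which is why parsimony may end up using only a small
fraction of the alignment.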

Subroutine III
← Making of the required phylogenetic tree:

Mainly, phylogenetic tree-reconstruction is
done by two computer-based methods: (i) Use of
ClustalW, or (ii) use of Phylip.
(There are also other programs: PAUP*,
PhyloWin, CLC, BioEdit, MacClade etc.)

The constructed tree can be visualized by:
TreeView, NJplot, CLC…

Program     Simple or               Remarks                             Free or
            sophisticated?                                              commercial?
----------  ----------------------  ----------------------------------  -----------
ClustalW    Easy, fast and          Good tool with different analysis
            hassle-free             methods. Essentially, it is a
                                    multiple sequence alignment (MSA)
                                    tool that enables assembling of a
                                    tree; but it is not a phylogenetic
                                    tree. It is a guide tree that
                                    ClustalW uses to make the MSA.
                                    Only with EBI ClustalW is a
                                    phylogenetic tree enabled.

Phylip      Sophisticated and       Contains many programs. Both        Free
            involves running        distance- and character-based
            several programs        analyses are feasible.
                                    Bootstrapping is possible. No
                                    graphical interface.

PAUP*       Apparently good         Good tool with different analysis   Commercial
                                    methods.

MacClade                            Good tool with different analysis   Commercial
                                    methods. Works on Macintosh only.

BioEdit     Simple: by selecting    Phylogeny methods are built-in.     Free package
            the sequences in the    Phylip routines can be called.
            alignment and choosing  Bootstrapping is not facilitated.
            the wanted phylogeny,
            the result is given.
            No need to learn the
            command line.

PhyloWin    Fairly simple           It enables maximum likelihood on    Free
                                    nucleotide sequences. Allows
                                    maximum parsimony computation.

CLC                                 Has a routine to make               Commercial
Workbench                           distance-based trees. It enables
                                    neighbor-joining and UPGMA.

TreeView                            Does visualization of trees.
                                    Its link: Wiki

NJplot                              Does visualization of trees.
                                    Its link: wiki

______________________________________________

This involves statistical methods in
reconstructing the tree from molecular data.
The tree-reconstruction process conforms to
optimum phylogenetic estimation. As seen
before, depending on the number of
sequences taken, there are a number of
feasible tree configurations. Therefore,
methods of favoring one phylogenetic tree over
another have been developed [Felsenstein,
2004]. Each of these existing methods can be
classed into one of three main categories:
the parsimony method, the distance-based or
distance-matrix method, and statistical methods
(specifically, the maximum likelihood method).

Thus, in the case of inferring phylogenetic trees from paralogous
sequences, internal nodes in the tree represent duplication events.
Relevantly, the phylogenetic trees are viewed as unrooted trees
inasmuch as the choice of the two nodes between which to place
the root is not obvious. There are two techniques that are usually
used to infer where the root of the tree is located. One approach is
outgroup rooting (Felsenstein, 2004). Whenever three monophyletic
groups of organisms are compared, and two of them are more closely
related to each other than either is to the third, the third group is
known as the outgroup. Generally this technique is used to determine
the location of the root of the tree, as the outgroup is assumed a
priori to have diverged earlier from the common ancestor than the
other groups in the tree. Another method is to assume a molecular
clock when constructing the tree. By assuming a molecular clock, we
are saying that evolution on all lineages of the tree happened at the
same rate (homogeneous rate of evolution). This makes the distance
from root to tip equal for all paths emanating from the root.
Therefore, we can find the root of the tree by locating the place in the
tree where the distance along all lineages to all tips is approximately
equal (Felsenstein, 2004).

Figure 1.3: A rooted binary tree of animals (a) and its unrooted
equivalent (b).
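
Whether a molecular clock is a plausible basis for rooting can be
checked, at least roughly, from the distances themselves. If all
root-to-tip path lengths are equal, the pairwise distances are
ultrametric: for every three taxa, the two largest of the three
pairwise distances are equal. The sketch below tests this three-point
condition within a tolerance; the function name, the tolerance and
both small matrices are hypothetical choices made for illustration.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-9):
    """Three-point test: in every triple the two largest distances agree."""
    n = len(dist)
    for i, j, k in combinations(range(n), 3):
        _, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True


# Hypothetical distances consistent with a clock on ((A,B),(C,D)).
clock_like = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]

# Hypothetical distances with unequal rates among lineages.
not_clock_like = [
    [0, 5, 9],
    [5, 0, 10],
    [9, 10, 0],
]

if __name__ == "__main__":
    print(is_ultrametric(clock_like))
    print(is_ultrametric(not_clock_like))
```

With real data the condition is never met exactly, so a looser
tolerance would be needed; when it fails badly, outgroup rooting is
the safer of the two techniques described above.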
