
Multiple Sequences Alignment

Homology: Definition
Homology: similarity that is the result of inheritance from a common ancestor.
Paralogs: related genes within an organism.
Orthologs: related genes in different species.
An alignment is a hypothesis of positional homology between bases or amino acids.

Why are multiple sequence alignments used?

Related proteins can often suggest the likely function, structure, and evolution of a sequence.
Multiple alignment is more sensitive than pairwise alignment for detecting homologs.
It reveals conserved residues or motifs.
Database searches effectively perform multiple sequence alignment.
The regulatory regions of many genes contain consensus sequences for transcription factor binding sites.

Information in Multiple Alignment


Conserved regions
  Regions that are invariant across the whole alignment.
  These usually indicate regions with a specific function.
  They can be totally or partially conserved.
Phylogenetic analysis
  Tells you which sequences are closest.
  Sequences are arranged from the most closely related to the most distantly related.

Multiple Sequences Alignment -- Goal


To generate a concise, information-rich summary of sequence data.
Used to illustrate the similarity between a group of sequences.
Used to illustrate the dissimilarity between a group of sequences.
Alignments can be treated as models that can be used to test hypotheses.

Alignment can be easy or difficult

Easy

Difficult: due to insertions or deletions

The Methods of Multiple Sequences Alignment

Multiple Sequences Alignment - methods


Methods of solving the multiple alignment problem:
. Manual
. Dynamic Programming
. Hidden Markov Models (HMMs)
. Progressive Alignment

Manual Alignment
Manual alignment may be used when:
. the alignment is easy;
. there is some extraneous information;
. automated alignment methods have run into the local-minimum problem;
. an automated alignment can be improved by hand.

Dynamic Programming Alignment


Dynamic Programming
Consider 2 protein sequences of 100 amino acids in length.
If it takes $100^{2}$ seconds to align these sequences exhaustively, it will take $100^{3}$ seconds to align 3 sequences, $100^{4}$ seconds to align 4 sequences, and so on.
Aligning 20 sequences exhaustively would take $100^{20} = 10^{40}$ seconds, roughly $3 \times 10^{32}$ years.

Limited to a small number of sequences.
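
The growth described above can be checked with a few lines of arithmetic. This is a minimal sketch under an assumed cost model (one second per cell of an N-dimensional dynamic programming table, so N sequences of length L cost L^N seconds); the function name and the 365-day year are illustrative choices, not part of the original slides.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365  # ignoring leap years


def exhaustive_alignment_seconds(seq_len: int, n_seqs: int) -> int:
    """Assumed cost model: one second per cell of an n-dimensional DP table."""
    return seq_len ** n_seqs


for n in (2, 3, 4, 20):
    secs = exhaustive_alignment_seconds(100, n)
    years = secs / SECONDS_PER_YEAR
    print(f"{n:>2} sequences: {secs:.3e} s = {years:.3e} years")
```

Under this model, 20 sequences of length 100 need 10^40 seconds, which is on the order of 10^32 years, so exhaustive dynamic programming is only practical for a handful of sequences.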

Pairwise Alignment
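Aligning two sequences: GATTC and GAATTC
Scoring: match +1, mismatch 0, indel -1

Alignment 1 (no internal gap):
GATTC-
GAATTC
Column scores: 1 1 0 1 0 -1
Score = 2

Alignment 2 (one internal gap):
GA-TTC
GAATTC
Column scores: 1 1 -1 1 1 1
Score = 4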
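
The scoring scheme above (match +1, mismatch 0, indel -1) can be sketched in code. The first function scores an alignment that is already gapped, column by column; the second fills a standard global (Needleman-Wunsch style) dynamic programming table to find the best achievable score. Function names are illustrative, and the gapped strings are written here with explicit `-` characters.

```python
def score_columns(top: str, bottom: str, match: int = 1,
                  mismatch: int = 0, indel: int = -1) -> int:
    """Score an existing gapped alignment column by column."""
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, indel: int = -1) -> int:
    """Best global alignment score via a full dynamic programming table."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel          # leading gaps in b
    for j in range(1, cols):
        table[0][j] = j * indel          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + indel, table[i][j - 1] + indel)
    return table[-1][-1]


print(score_columns("GATTC-", "GAATTC"))          # 2
print(score_columns("GA-TTC", "GAATTC"))          # 4
print(global_alignment_score("GATTC", "GAATTC"))  # 4
```

The dynamic programming result confirms that 4 is the best score under this scheme: five matches are the maximum possible, and at least one indel is needed because the sequences differ in length.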

Hidden Markov Models


HMMER was written by Sean Eddy (http://hmmer.wustl.edu).
Runs on UNIX platforms.
Profile HMMs are probabilistic models that describe the likelihood that an amino acid residue occurs at each position of an alignment.
Two main uses:
. search a sequence database with a single profile HMM;
. search a single query sequence against a library of HMMs.
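
To make the idea of position-specific likelihoods concrete, the sketch below estimates, for each alignment column, the probability of each residue from observed counts plus a pseudocount. This is only a toy version of the emission part of a profile; it is not HMMER's API or its estimation procedure, it has no insert or delete states and no transition probabilities, and it uses a four-letter DNA alphabet purely to keep the example short.

```python
from collections import Counter


def column_probabilities(alignment, alphabet, pseudocount=1.0):
    """Per-column residue probabilities with additive smoothing (gaps ignored)."""
    n_cols = len(alignment[0])
    profile = []
    for col in range(n_cols):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({res: (counts[res] + pseudocount) / total for res in alphabet})
    return profile


toy_alignment = ["ACGT", "ACGA", "AC-T"]   # hypothetical toy data
profile = column_probabilities(toy_alignment, "ACGT")
for position, probs in enumerate(profile, start=1):
    print(position, {res: round(p, 2) for res, p in probs.items()})
```

A fully conserved column (position 1 here) gets a high probability for one residue, while a variable column spreads the probability across several residues.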

Progressive Alignment
Devised by Feng and Doolittle in 1987.
A heuristic method; as such, it is not guaranteed to find the optimal alignment.
Based on pairwise alignment.
The most successful implementation is Clustal (by Des Higgins).

ClustalW

ClustalW - Introduction
. General purpose: the comparison or alignment of DNA or protein sequences.
. Biologists can study sequence patterns conserved through evolution and the ancestral relationships between different organisms.
. ClustalW runs on different operating systems, including WinXP, UNIX (Linux), and Macintosh.
. History: the first Clustal programme (1988) by Des Higgins; ClustalV (1992); ClustalW (1994); ClustalX.

. The latest version is ClustalW 1.83

ClustalW download & WWW


Download
http://www.imtech.res.in/pub/mirror_sites/ebi/dos/clustalw/
http://iubio.bio.indiana.edu/soft/iubionew/molbio/dna/analysis/ClustalW/
ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalW/

WWW
http://www.ebi.ac.uk/clustalw (version 1.83)

Three main stages for ClustalW:

1. Pairwise alignment: calculate the distance matrix
2. Guide tree: unrooted Neighbour-Joining tree, then rooted NJ tree (guide tree) and sequence weights
3. Progressive alignment: align following the guide tree

Pairwise Alignment
. Each sequence is aligned pairwise with every other sequence.
  For example, for n sequences,
  $\binom{n}{2} = \frac{n(n-1)}{2}$
  pairwise alignments are calculated.
. Accurate scores come from full dynamic programming alignment using two gap penalties (opening and extension) and a full amino acid weight matrix.
. Each pairwise alignment is completely independent of the others.
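
The all-against-all step can be sketched by enumerating every unordered pair of sequences; the count matches n(n-1)/2. The sequence names are placeholders chosen for illustration.

```python
from itertools import combinations


def all_pairs(names):
    """Every unordered pair of sequences that must be aligned."""
    return list(combinations(names, 2))


names = ["seqA", "seqB", "seqC", "seqD"]   # hypothetical sequence names
pairs = all_pairs(names)
n = len(names)
print(len(pairs), n * (n - 1) // 2)        # both are 6
for left, right in pairs:
    print(left, "vs", right)
```

Because each pair is independent of the others, this stage is a natural candidate for running in parallel.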

Calculate distance matrix

The pairwise scores (from either the slow or the fast method) are initially calculated as percent identity and are converted to distances by dividing by 100 and subtracting from 1.0, giving the number of differences per site.
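
The conversion from percent identity to distance is a one-line formula. The sketch below also shows one way to compute percent identity from a gapped pairwise alignment; counting only columns where both sequences have a residue is an assumption made here for illustration.

```python
def percent_identity(top: str, bottom: str) -> float:
    """Percent of identical residues over columns without a gap (assumed convention)."""
    compared = identical = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            continue
        compared += 1
        identical += x == y
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Distance = 1.0 - (percent identity / 100): differences per site."""
    return 1.0 - pid / 100.0


pid = percent_identity("GATTC-", "GAATTC")
print(pid, identity_to_distance(pid))
```

Here three of the five compared columns are identical, so the identity is 60% and the distance is about 0.4 differences per site.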

Guide Tree: unrooted NJ tree

[Figure: unrooted NJ tree with branch lengths, e.g. 0.17 and 0.13]

Generate a Neighbor-Joining guide tree from these pairwise distances. This guide tree gives the order in which the progressive alignment will be carried out.
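
How a distance matrix turns into a joining order can be sketched with a compact Neighbor-Joining loop. This is a simplified illustration, not ClustalW's implementation: it records only the order of joins (no branch lengths), stores distances in a dictionary keyed by `frozenset` pairs, and uses made-up distances for four hypothetical sequences.

```python
def neighbor_joining_order(names, dist):
    """Return the list of joins made by Neighbor-Joining, in order.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    nodes = list(names)
    d = dict(dist)
    merges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all other nodes.
        total = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimising the NJ criterion Q = (n-2)*d(a,b) - total(a) - total(b).
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        new = f"({a},{b})"
        # Distance from the new internal node to every remaining node.
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - d[frozenset((a, b))]
                )
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        merges.append((a, b))
    merges.append((nodes[0], nodes[1]))
    return merges


raw = {("A", "B"): 2, ("A", "C"): 4, ("A", "D"): 4,
       ("B", "C"): 4, ("B", "D"): 4, ("C", "D"): 2}   # hypothetical distances
dist = {frozenset(pair): value for pair, value in raw.items()}
merges = neighbor_joining_order(["A", "B", "C", "D"], dist)
print(merges)
```

With these distances the close pairs A/B and C/D are joined before the two groups are joined to each other. When two candidate pairs tie on the criterion, this sketch simply takes the first one it encounters.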

Guide Tree: rooted NJ tree

Sequence weights depend on the distance from the root of the tree, but sequences that have a branch in common with other sequences share the weight derived from that shared branch.
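
One way to read this weighting rule is: walk from the root to each sequence, adding each branch length divided by the number of sequences that share that branch. The sketch below implements that reading on a small rooted tree. The tree encoding (a leaf is a string, an internal node is a list of `(subtree, branch_length)` pairs) and the branch lengths are assumptions made for illustration, and any final normalisation of the weights is left out.

```python
def count_leaves(tree) -> int:
    """Number of sequences below a node."""
    if isinstance(tree, str):
        return 1
    return sum(count_leaves(child) for child, _ in tree)


def sequence_weights(tree, inherited=0.0, weights=None):
    """Weight = sum over root-to-leaf branches of length / sequences sharing the branch."""
    if weights is None:
        weights = {}
    if isinstance(tree, str):
        weights[tree] = inherited
        return weights
    for child, length in tree:
        share = length / count_leaves(child)
        sequence_weights(child, inherited + share, weights)
    return weights


# Hypothetical rooted guide tree: ((A:0.1, B:0.1):0.2, C:0.4)
rooted = [
    ([("A", 0.1), ("B", 0.1)], 0.2),
    ("C", 0.4),
]
weights = sequence_weights(rooted)
print(weights)
```

A and B split the 0.2 branch they share, so each ends up with a lower weight than the more isolated sequence C; this down-weights groups of near-duplicate sequences.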

Progressive Alignment

Align the two most closely related sequences first.
This alignment is then fixed and will never change.
Once a gap, always a gap.
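
The "once a gap, always a gap" rule can be illustrated with a small helper: when a later step needs a new gap in an already aligned group, the gap is inserted as a whole column in every sequence of the group, and existing gaps are never removed. The helper name and the choice of position are hypothetical; a real progressive aligner would pick the position by dynamic programming against the incoming sequence or profile.

```python
def add_gap_column(aligned_group, position):
    """Insert a gap column into every sequence of a fixed alignment."""
    return [seq[:position] + "-" + seq[position:] for seq in aligned_group]


group = ["GA-TTC", "GAATTC"]          # alignment fixed in an earlier step
widened = add_gap_column(group, 3)    # a later step needs an extra column
print(widened)
```

The earlier gap in the first sequence is still present after the new column is added, and the two sequences keep their relative alignment.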

Summary
There are three main stages for ClustalW

Thompson J.D., Higgins D.G., Gibson T.J. (1994). Nucleic Acids Res. 22:4673-4680.

ClustalW spends around 96% of its running time in the first stage (pairwise alignment of the n sequences); the rest is spent in the second and third stages.

Perform ClustalW alignment

ClustalW

Main menu

Input file

Input File
Prepare the input file:
. Sequences should all be in one file.
. Seven formats are accepted: NBRF/PIR, EMBL/SwissProt, FASTA, GDE, Clustal, GCG/MSF, RSF.
. The file can be edited with a text editor such as Notepad.
. FASTA is the most common format.
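
As an illustration of the FASTA layout, each record starts with a `>` header line followed by one or more lines of sequence. The sketch below shows a small example file as a string and a minimal reader for it; the record names and the fallback naming for empty headers are made up for the example.

```python
def parse_fasta(text: str) -> dict:
    """Read FASTA records into {name: sequence}; the name is the first header word."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


example = """>seq1 first example sequence
GATTC
>seq2 second example sequence
GAA
TTC
"""
sequences = parse_fasta(example)
print(sequences)
```

All sequences to be aligned go into one such file, one record after another.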

Main Menu
1. Sequence input from disk
2. Multiple alignment
3. Profile / structure alignment
4. Phylogenetic tree

Multiple alignment menu
1. Do complete multiple alignment now (slow/fast)
2. Produce guide tree only
3. Do alignment using old guide tree file
4. Slow / fast pairwise alignment
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps before alignment
8. Screen display
9. Output format options

Profile / structure alignment menu
1. Input 1st profile
2. Input 2nd profile / sequences
3. Align 2nd profile to 1st profile
4. Align sequences to 1st profile

Phylogenetic tree menu
1. Input alignment
2. Exclude positions with gaps
3. Correct for multiple substitutions
4. Draw tree now
5. Bootstrap tree

Toggle slow/fast pairwise alignment


Slow/accurate alignment
. Fine for short sequences.
. Uses full dynamic programming; with more than 100 sequences or sequence lengths over 1000, it becomes extremely slow.

Fast/approximate alignment
. Gains speed by considering:
  - only exactly matching fragments
  - only the best diagonals

Pairwise Alignment Parameter (1)


Slow alignment:

. Gap Open Penalty (GOP): the penalty for opening a gap (initial gap penalty).
. Gap Extension Penalty (GEP): the penalty for extending a gap by 1 residue.
  Example:
  ACGTAAATTTTTGG
  ACGT------TTGG
  GOP applies where the gap opens; GEP applies as the gap is extended.

. Protein Weight Matrix: Gonnet, BLOSUM, PAM

. DNA Weight Matrix: the scores assigned to matches and mismatches

[Figure: examples of the Gonnet, BLOSUM, and PAM scoring matrices]
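
The interplay of the two gap penalties can be sketched with the common affine form, penalty = GOP + GEP x gap length. Whether the first gap residue is also charged an extension penalty varies between programs, so the formula here is an assumption, and the numeric penalty values are placeholders rather than ClustalW defaults.

```python
import re


def gap_lengths(aligned: str) -> list:
    """Lengths of each run of '-' in a gapped sequence."""
    return [len(run.group()) for run in re.finditer(r"-+", aligned)]


def affine_gap_penalty(gap_length: int, gap_open: float, gap_extend: float) -> float:
    """Assumed affine form: opening cost once, extension cost per gap residue."""
    if gap_length <= 0:
        return 0.0
    return gap_open + gap_extend * gap_length


gapped = "ACGT------TTGG"   # the second sequence of the example, gaps written as '-'
for length in gap_lengths(gapped):
    print(length, affine_gap_penalty(length, gap_open=10.0, gap_extend=0.1))
```

With these placeholder values, the single six-residue gap costs about 10.6, whereas six separate one-residue gaps would cost about 60.6. This is why a high opening penalty favours a few long gaps over many short ones.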

Pairwise Alignment Parameters (2)


Fast alignment:

. K-Tuple Size: the size of the exactly matching fragments.
  Increase for speed (max = 2 for protein, 4 for DNA); decrease for sensitivity.

. Top Diagonals: the number of best diagonals (those with the most K-tuple matches) used in the alignment.
  Decrease for speed; increase for sensitivity.

. Window Size: the number of diagonals around each of the best diagonals.
  Decrease for speed; increase for sensitivity.
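
The role of K-tuples and diagonals can be sketched as follows: index every K-tuple of one sequence, look up each K-tuple of the other, and count matches per diagonal (offset between the two positions). This is a simplified illustration of the idea, not ClustalW's fast alignment code; the function names and the `top` parameter are illustrative.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(a: str, b: str, k: int = 2) -> Counter:
    """Count exact k-tuple matches on each diagonal (offset j - i)."""
    positions = defaultdict(list)
    for i in range(len(a) - k + 1):
        positions[a[i:i + k]].append(i)
    diagonals = Counter()
    for j in range(len(b) - k + 1):
        for i in positions.get(b[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


def top_diagonals(diagonals: Counter, top: int = 1) -> list:
    """Keep only the diagonals with the most k-tuple matches."""
    return diagonals.most_common(top)


diagonals = ktuple_diagonals("GATTC", "GAATTC", k=2)
print(dict(diagonals))               # {0: 1, 1: 3}
print(top_diagonals(diagonals, 1))   # [(1, 3)]
```

A larger K means fewer chance matches and faster lookups but can miss weak similarity; keeping more top diagonals, or a wider window around them, trades speed for sensitivity.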

Multiple Alignment Parameter


. Increasing the Gap Opening Penalty makes gaps less frequent.
. Increasing the Gap Extension Penalty makes gaps shorter.
. Delay Divergent Sequences: delays the alignment of the most distantly related sequences until the most closely related sequences have been aligned.
. DNA Transition Weight: the score given to transitions (A<->G, C<->T), between 0 and 1. A weight of 0 treats transitions as mismatches; a weight of 1 treats them as matches. For distantly related DNA sequences the weight should be close to 0; for closely related DNA sequences a higher weight can be used.
. Protein Weight Matrix: the matrix actually used depends on how similar the sequences to be aligned at this alignment step are.
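
The effect of the DNA transition weight can be sketched as a scoring function in which a transition (A<->G or C<->T) scores somewhere between a mismatch and a match. The match and mismatch values and the example weights are assumptions chosen for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(x: str, y: str, transition_weight: float = 0.5,
                   match: float = 1.0, mismatch: float = 0.0) -> float:
    """Score a pair of bases; transitions are interpolated between mismatch and match."""
    if x == y:
        return match
    pair = {x, y}
    if pair <= PURINES or pair <= PYRIMIDINES:
        # Transition: weight 0 behaves like a mismatch, weight 1 like a match.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch  # transversion


for weight in (0.0, 0.5, 1.0):
    print(weight, dna_pair_score("A", "G", weight), dna_pair_score("A", "C", weight))
```

Transversions such as A<->C always score as mismatches here; only the transition score moves with the weight.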

Output File
CLUSTAL output : [filename].aln

GUIDE TREE : [filename].dnd
