Protein Sequence Analysis

Analysis of Protein Sequences
Darren Soanes
Evolution of Amino Acid Sequences

Amino acid sequences change due to mutations in DNA sequence. Amino acid sequences evolve more slowly than DNA sequences. Evolutionary selection occurs on protein sequences.
DNA mutations (1)

Synonymous substitution - change in DNA sequence that does not affect the amino acid sequence, often in the third position of a codon, e.g. CCG (Pro) → CCA (Pro).
Non-synonymous substitution - change in DNA sequence that does affect the amino acid sequence, often in the first or second position of a codon, e.g. CCG (Pro) → CAG (Gln).
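The two kinds of substitution can be told apart by translating the codon before and after the change. The following is a minimal sketch using the standard genetic code; the function name classify_substitution is chosen here for illustration and is not from the slides.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitution(codon_before, codon_after):
    """Compare the amino acids encoded by two codons (hypothetical helper)."""
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "non-synonymous"


print(classify_substitution("CCG", "CCA"))  # synonymous (Pro -> Pro)
print(classify_substitution("CCG", "CAG"))  # non-synonymous (Pro -> Gln)
```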
Genetic Code
DNA mutations (2)

Non-synonymous substitution - also called a missense mutation.
Nonsense mutation - a stop codon is introduced into the middle of a sequence, e.g. TGG (Trp) → TAG (Stop).
Insertion / deletion (indel) - causes a frame shift if the number of bases inserted or deleted is not a multiple of three.
Nonsense and frame-shift mutations usually produce non-functional proteins.
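The effect of nonsense mutations and indels is easiest to see by translating a short sequence. This is a sketch only: the DNA sequences below are made up for illustration, and translate is a hypothetical helper that reads frame 1 and stops at the first stop codon.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate in frame 1; a stop codon is shown as '*' and ends translation."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        protein.append(amino_acid)
        if amino_acid == "*":
            break
    return "".join(protein)


original = "ATGCCGTGGAAAGGC"          # made-up example sequence
nonsense = "ATGCCGTAGAAAGGC"          # TGG (Trp) changed to TAG (Stop)
frame_shift = "ATGCACGTGGAAAGGC"      # one base inserted after the fourth base
in_frame = "ATGCCGGCTTGGAAAGGC"       # three bases (GCT) inserted between codons

print(translate(original))     # MPWKG
print(translate(nonsense))     # MP*
print(translate(frame_shift))  # MHVER
print(translate(in_frame))     # MPAWKG
```

The single-base insertion changes every amino acid downstream of it, whereas the three-base insertion adds one residue and leaves the rest of the protein unchanged.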
Changes in amino acid sequence (1)

Deleterious mutations reduce the fitness of a protein; natural selection keeps their frequencies low. Whether a mutation is deleterious depends on the environment, e.g. sickle cell anaemia is due to a mutation in β-haemoglobin. Individuals homozygous for this allele usually do not survive childhood. Heterozygotes have some protection against malaria, so in tropical regions the frequency of this allele is relatively high.

Changes in amino acid sequence (2)

Neutral mutations do not change the fitness of a protein and may become widespread in a population due to genetic drift. They are often due to conservative substitutions, where one amino acid is changed to another with similar properties, e.g. Arg → Lys.

Changes in amino acid sequence (3)

Advantageous mutations increase the fitness of a protein. Their frequency will greatly increase in a population due to natural selection. Whether a mutation is advantageous depends on the environment: a protein that works better at high temperatures will be an advantage if the environmental temperature increases, but not if it decreases.
Families of amino acids
Amino acid substitution matrices (1)

Substitutions between amino acids that are similar in properties are more common. Cysteine, glycine and tryptophan rarely change. Substitution matrices measure how likely one amino acid is to change to another.
Amino acid substitution matrices (2)

Amino acid substitution matrices are empirically derived by alignment of sets of closely related protein sequences. Examples include Dayhoff, BLOSUM (used in BLAST searches), WAG and JTT. Different matrices are suitable for proteins encoded by the mitochondrial genome, e.g. MtREV.
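Matrix entries of this kind are generally log-odds scores: how often a pair of amino acids is observed aligned, relative to how often it would be expected by chance. The sketch below shows the arithmetic only. The frequencies are invented for illustration, and the scale factor of 2 (half-bit units) is an assumption, not a value taken from the slides.

```python
import math


def log_odds_score(observed_freq, expected_freq, scale=2.0):
    """Rounded log-odds score: positive if the pair is seen more often than by chance."""
    return round(scale * math.log2(observed_freq / expected_freq))


# Invented frequencies, for illustration only.
print(log_odds_score(0.02, 0.01))   # pair seen twice as often as expected -> 2
print(log_odds_score(0.005, 0.01))  # pair seen half as often as expected -> -2
```

A positive score marks a substitution that is tolerated more often than chance would predict (typically between similar amino acids); a negative score marks one that is rarely seen.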
BLOSUM 62 Matrix
Rates of amino acid change

The rate of substitution varies at different positions in an amino acid sequence. A proportion of sites are likely to be invariant; these generally have an essential role in the function of the protein. A gamma distribution models the variation of rates at different sites. Sites are sorted into gamma rate categories.
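To give a feel for what gamma rate categories are, the sketch below draws relative rates from a gamma distribution with mean 1, splits them into equally sized categories and reports the mean rate of each. This is an approximation by random sampling, written for illustration; it is not how any particular phylogenetics program computes its categories, and the shape value 0.5 is an arbitrary example.

```python
import random


def gamma_rate_categories(alpha, n_categories=4, n_samples=100_000, seed=0):
    """Approximate mean relative rate of each equal-probability gamma category."""
    rng = random.Random(seed)
    # Shape alpha with scale 1/alpha gives a distribution with mean rate 1.
    rates = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    size = n_samples // n_categories
    return [sum(rates[i * size:(i + 1) * size]) / size for i in range(n_categories)]


# A small shape parameter means strong rate variation between sites:
# most sites change very slowly and a few change quickly.
print(gamma_rate_categories(0.5))
```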
Structure of thrombin showing catalytic triad (conserved in serine proteases)
Phylogenetic analysis
Phylogenetic analysis programs take an alignment of protein sequences and attempt to produce a phylogenetic tree showing the evolutionary relationships between the sequences. The user can select the amino acid substitution matrix and the number of gamma rate categories; the program will estimate the proportion of invariant sites. Programs use these parameters and the protein alignment to estimate the evolutionary distance between sequences. They calculate the topology and branch lengths of the final tree.
Distance Methods
Evolutionary distance is calculated for all pairs of taxa.
UPGMA - assumes the rate of substitution is constant.
Least squares - allows different rates of substitution in different branches.
Minimum evolution (ME) - the topology with the smallest sum of branch lengths is chosen. Can take a long time to compute.
Neighbour joining (NJ) - a simplified version of ME that is much quicker.
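As an illustration of a distance method, here is a minimal UPGMA sketch: the two closest clusters are merged repeatedly, and the distance from a merged cluster to any other is the size-weighted average of its parts. It returns the topology only (branch lengths are omitted to keep it short), and the taxa and distances are invented for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """dist maps frozenset({a, b}) -> distance. Returns the topology as nested tuples."""
    sizes = {name: 1 for name in names}  # cluster -> number of taxa it contains
    d = dict(dist)
    while len(sizes) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        for c in sizes:
            if c not in (a, b):
                # Average distance, weighted by the number of taxa in each cluster.
                d[frozenset((merged, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Invented pairwise distances between four taxa.
distances = {
    frozenset(("A", "B")): 2, frozenset(("A", "C")): 6, frozenset(("B", "C")): 6,
    frozenset(("A", "D")): 10, frozenset(("B", "D")): 10, frozenset(("C", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], distances))  # ('D', ('C', ('A', 'B')))
```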
Maximum parsimony
For each topology, the smallest number of amino acid substitutions that could explain the evolutionary process is calculated. The topology that requires the fewest substitutions is chosen as the best one.
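One standard way to count the minimum number of substitutions for a single alignment column on a given rooted binary tree is Fitch's algorithm, sketched below. The slides do not say which algorithm a given program uses, so this is an illustration of the idea rather than a description of any specific tool; the trees and residues are invented.

```python
def fitch(tree, states):
    """Return (possible ancestral residues, minimum substitutions) for one column.

    tree is a taxon name or a pair (left, right); states maps taxon -> residue.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    shared = left_set & right_set
    if shared:
        # The two subtrees can agree on a residue: no extra substitution needed.
        return shared, left_cost + right_cost
    # No residue in common: one substitution is required on this branch pair.
    return left_set | right_set, left_cost + right_cost + 1


column = {"A": "K", "B": "K", "C": "R", "D": "R"}  # one invented alignment column
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch(tree_1, column)[1])  # 1
print(fitch(tree_2, column)[1])  # 2
```

Summing this count over every column of the alignment gives the parsimony score of a topology; here tree_1 would be preferred.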
Maximum likelihood (ML)

For each topology, the likelihood that the known sequences could have evolved on that tree is calculated (with branch lengths and substitution rate parameters optimised). The topology with the best likelihood score is chosen. It takes a long time to compute the likelihood of every possible tree, so heuristic methods such as quartet puzzling are used to reduce the number of candidate trees. Programs that use ML methods: PhyML, RAxML, TREE-PUZZLE (uses quartet puzzling).
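The reason an exhaustive search is impractical is the number of possible topologies. For n taxa there are (2n - 5)!! distinct unrooted bifurcating trees, i.e. the product 3 x 5 x ... x (2n - 5). A short sketch of that count (the function name is chosen here for illustration):

```python
def count_unrooted_trees(n_taxa):
    """Number of distinct unrooted bifurcating tree topologies for n_taxa >= 3."""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):  # odd numbers 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```

With 10 taxa there are already 2,027,025 topologies, which is why heuristic searches are used.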
Bootstrapping
Tests the reliability of a tree. The initial protein alignment is resampled (by sampling columns at random, with replacement) to give many pseudo-replicate alignments of the same length. Tree construction is repeated for each resampled alignment. For each group of taxa in the original tree, it is determined what percentage of the resampled trees contain the same group.
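The resampling step can be sketched as follows. Columns are drawn at random with replacement until the replicate is as long as the original alignment; the tree-building step that would follow is not shown. The alignment is invented and bootstrap_alignment is a hypothetical helper name.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; alignment maps name -> sequence."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[i] for i in columns) for name, seq in alignment.items()
    }


alignment = {"sp1": "MKLVAG", "sp2": "MKLIAG", "sp3": "MRLISG"}  # invented example
rng = random.Random(1)
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Because the same column indices are used for every sequence, each resampled column is still a real column of the original alignment.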
Bayesian methods
A large sample of trees with high likelihood is taken. Posterior probabilities are calculated for different events of interest (e.g. a particular grouping of taxa). A Markov chain Monte Carlo (MCMC) method is used to generate the sample of trees. MrBayes uses these methods.
Taxon sampling
Take initial protein sequence. Decide which range of species you are interested in. Use BLAST to find homologous sequences in databases, either NCBI database or individual genome databases.
Multiple sequence alignment

Take a FASTA file of the sequences you are interested in. Align the sequences using ClustalW, MUSCLE or T-Coffee.
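For reference, a FASTA file is plain text in which each record starts with a ">" header line followed by one or more lines of sequence. A minimal reader might look like the sketch below; the record names and sequences are invented, and in practice an established library would normally be used instead.

```python
def parse_fasta(text):
    """Return {record name: sequence}; the name is the first word after '>'."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = """>seq1 hypothetical protein, species 1
MKLVAGTT
PLLA
>seq2 hypothetical protein, species 2
MKLIAGSTPLLA
"""
print(parse_fasta(example))
```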
Sampling of conserved blocks

To get reliable trees, non-aligned and poorly conserved regions of the alignment need to be removed. Gblocks selects highly conserved blocks of sequence.
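The idea of keeping only well-aligned, conserved positions can be illustrated with a much simpler filter than Gblocks itself. The sketch below is not the Gblocks algorithm: it simply drops any column that contains a gap or in which the most common residue falls below an arbitrary identity threshold. The alignment and the threshold are assumptions for the example.

```python
def conserved_columns(alignment, min_identity=0.5):
    """Keep gap-free columns whose most common residue reaches min_identity."""
    seqs = list(alignment.values())
    keep = []
    for i in range(len(seqs[0])):
        column = [seq[i] for seq in seqs]
        if "-" in column:
            continue  # drop columns containing gaps
        most_common = max(column.count(residue) for residue in set(column))
        if most_common / len(column) >= min_identity:
            keep.append(i)
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}


alignment = {"sp1": "MK-LVA", "sp2": "MKALVG", "sp3": "MR-LIS"}  # invented example
print(conserved_columns(alignment))
# {'sp1': 'MKLV', 'sp2': 'MKLV', 'sp3': 'MRLI'}
```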
Which substitution model should I use?

ModelGenerator takes your sequence alignment and determines the best amino acid substitution model to use. It is quite slow to run on the web, but is available as a Java program to download.
Creating tree
Take the alignment produced by Gblocks and use your program of choice to generate a tree (using the substitution model suggested by ModelGenerator and specifying the number of gamma rate categories; 4 is usually sufficient). File formats can be a problem, as different programs use different formats; use Readseq to convert between them. Use a tree viewing program to look at a graphical representation of the tree (TreeView, TreeDyn).
Interpreting protein trees

Protein trees show the evolutionary relationships between protein encoding genes, not species. Comparison with a species tree is needed so that examples of gene duplication, gene loss and lateral gene transfer can be identified.
Gene duplication
Sequence homology (1)

Programs such as BLAST measure the degree of similarity between sequences. Genes are said to be homologous if they share a common evolutionary ancestor. Orthologues are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologues retain the same function in the course of evolution. (e.g. myoglobin in mammals).
Sequence homology (2)

Paralogous genes are related by duplication within a genome. Paralogues often evolve new functions, even if these are related to the original one.
In-paralogues - paralogues that were duplicated after a speciation and are therefore in the same species.
Out-paralogues - paralogues that were duplicated before a speciation; not necessarily in the same species.
Orthology and paralogy
Paralogues
A, B and C are different species and are different paralogues of the same gene
Out-paralogues
In-paralogues
Evolution of globin superfamily in humans
Using proteins to create a species tree

Need to identify sets of orthologues that are present in one copy in all the species of interest, often housekeeping genes that evolve slowly. Protein sequences are concatenated so there is one sequence per species each containing many proteins. Sequences are aligned and phylogenetic tree produced.
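The concatenation step can be sketched as below: for each species, the sequences of the chosen single-copy orthologues are joined end to end in the same protein order, and any species missing from one of the sets is left out. The species names and sequences are invented, and concatenate_orthologues is a hypothetical helper.

```python
def concatenate_orthologues(protein_sets):
    """protein_sets is a list of {species: sequence}, one dict per orthologue set.

    Returns {species: concatenated sequence} for species present in every set.
    """
    shared = set(protein_sets[0])
    for protein_set in protein_sets[1:]:
        shared &= set(protein_set)
    return {
        species: "".join(protein_set[species] for protein_set in protein_sets)
        for species in sorted(shared)
    }


# Invented sequences for two orthologue sets.
protein_1 = {"species_a": "MKLV", "species_b": "MKLI", "species_c": "MRLI"}
protein_2 = {"species_a": "GSTP", "species_b": "GATP"}
print(concatenate_orthologues([protein_1, protein_2]))
# {'species_a': 'MKLVGSTP', 'species_b': 'MKLIGATP'}
```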
Fungal species trees
[Figures: species trees labelled "30 proteins" and "60 proteins". Group labels: oomycete (not fungi), microsporidia, plant, zygomycete, basidiomycetes, ascomycetes, yeasts, filamentous ascomycetes.]
Lateral gene transfer (purine-cytosine permease)
[Figure: tree with groups labelled oomycete and fungi.]
Eukaryotic Tree of Life
[Figure: labels Phytophthora sojae and Aspergillus oryzae.]
Summary
Phylogenetic methods can be used to analyse protein sequences and produce models of the evolution of a particular protein encoding gene. Comparison with a species tree can identify events such as duplication, gene loss and lateral gene transfer.
Workshop task
Looking at evolution of genes encoding two types of phosphoglycerate mutase in fungi.
Two types of phosphoglycerate mutase (PGM)
Both catalyse the same overall reaction:

3-phosphoglycerate (3PG) ⇌ 2-phosphoglycerate (2PG)
Cofactor-dependent PGM (dPGM) uses 2,3-bisphosphoglycerate (2,3BPG) as a cofactor:

3PG + P-Enzyme ⇌ 2,3BPG + Enzyme ⇌ 2PG + P-Enzyme
Cofactor-independent PGM (iPGM) has two bound Mn(II) ions at its active site:

3PG + Enzyme ⇌ PG + P-Enzyme ⇌ 2PG + Enzyme
Two types of phosphoglycerate mutase (PGM)

dPGM is found in yeasts and vertebrates.
iPGM is found in filamentous fungi, plants and some invertebrates.
Both can be found in bacteria.
There is no sequence similarity between the two forms of the enzyme.
Structure of iPGM
Structure of dPGM
Task
Use BLAST search to find PGM protein sequences in a sample of fungal species. Use these to create phylogenetic trees showing the evolution of genes encoding these enzymes.
Taxon sampling (get sequences using BLAST)
Alignment (ClustalW)
Sampling conserved positions (Gblocks)
Determine substitution model (ModelGenerator)
Create tree (PhyML)
Visualise tree (TreeDyn)
