
2017

Bioinformatics
Homework – 2
MULTIPLE ALIGNMENT WITH
CLUSTAL W AND JALVIEW
Gebze Technical University

Ebru AKHARMAN
142204026
13.12.2017

Ebru AKHARMAN - 142204026, Gebze Technical University, Turkey

AIM:
Multiple sequence alignments are essential in computational analysis of protein sequences and structures,
with applications in structure modeling, functional site prediction, phylogenetic analysis and sequence
database searching. Molecular biology is becoming a computationally intensive field of contemporary science
and faces some of the current grand scientific challenges. In this context, tools that effectively identify, store,
compare and analyze large and growing numbers of biological sequences are increasingly important.
Multiple sequence alignment (MSA) has assumed a key role in comparative structure and function analysis of
biological sequences. It often leads to fundamental biological insight into sequence-structure-function
relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field,
and many useful tools have been developed for constructing alignments. The main purpose of this task is to
identify the similarities among the proteins with Clustal W and to visualize them with Jalview.
INTRODUCTION:
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly
improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each
sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most
divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to
the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap
penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular secondary
structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap
penalties to encourage the opening up of new gaps at these positions.

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular
biology. Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or
demonstrate homology between new sequences and existing families of sequences; to help predict the
secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an
essential prelude to molecular evolutionary analysis.

The rate of appearance of new sequence data is steadily increasing and the development of efficient and
accurate automatic methods for multiple alignment is, therefore, of major importance. Multiple alignments of
protein and nucleotide sequences are important tools in studying the sequences. The basic information they
provide is the identification of conserved regions. This is very useful in designing experiments to test and
modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying
new members of protein families.

Clustal W is a general purpose multiple sequence alignment program for DNA or proteins. Clustal W produces
biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for
the selected sequences and lines them up so that the identities, similarities and differences can be seen.
Evolutionary relationships can be seen by viewing cladograms or phylograms.
( http://www.genome.jp/tools-bin/clustalw )

How to use Clustal W:

Step 1: The multiple sequences are uploaded in FASTA format.
Step 2: The options are left at their defaults.
Step 3: If desired, a sequence title may be entered.
Step 4: The run button is pressed to start the Clustal W alignment.
Step 5: The plain text version of the alignment is temporarily stored in an .aln file.
" * " means that the residues in that column are identical in all sequences of the alignment. " : " means that
conserved substitutions have been observed, according to the colour table. " . " means that semi-conserved
substitutions are observed.
Step 6: The phylogenetic trees are saved.

Jalview is a multiple alignment editor written in Java. It is used widely in a variety of web pages but is also
available as a general purpose alignment editor and analysis workbench. It is a good tool for viewing
multiple alignments, using colors and shading to highlight different characteristics of the sequences such as
hydrophobicity, percentage identity, etc. ( http://www.jalview.org/download )

Conserved sequence: A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has
remained essentially unchanged, and so has been conserved, throughout evolution. More generally, any sequence
of bases (or amino acids) in comparable segments of different nucleic acids (or proteins) that tends to show
similarity greater than that due to chance alone. For example, if one position is occupied by the same base in all
comparable DNA sequences, then that position is said to be completely conserved. If the same base occurs at a
given position in, say, 75% of samples examined, it would be described as partially conserved. By extension, the
conservation of other positions in a sequence is assessed in the same way, usually by computer analysis. The
degree to which sequences are conserved can indicate the extent of structural and functional similarities
between different genes or between different proteins and provides clues to their possible evolutionary
relations.
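
The idea of complete and partial conservation can be made concrete with a short sketch. The Python below assumes the sequences are already aligned to equal length; the function name `column_conservation` and the toy DNA strings are illustrative only and are not data from this assignment.

```python
from collections import Counter

def column_conservation(aligned, gap="-"):
    """Return (most common residue, fraction of sequences carrying it) per column."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*aligned):
        residues = [c for c in column if c != gap]
        if not residues:
            # Column made only of gaps: nothing to score.
            profile.append((gap, 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # The fraction is taken over all sequences, so gaps lower the value.
        profile.append((residue, count / len(column)))
    return profile

# Toy alignment (made up for illustration).
toy = ["ACGT", "ACGA", "ACCA", "ACGA"]
profile = column_conservation(toy)
```

In this toy case the first two columns come out completely conserved (fraction 1.0), while the last two are partially conserved at 0.75, which corresponds to the 75% example in the definition above.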

Image 1: Nucleotide Sequence

The protein encoded by the given nucleotide sequence is identified as follows.

Step 1:

Image 2: Blast Home Page

On the BLAST home page, "Nucleotide BLAST" is selected because the query is a nucleotide sequence.

Step 2:

Image 3: Sequence Alignment in Nucleotide Blast

The nucleotide sequence is entered in FASTA format in the query box shown. The sequence is given a name and a
database is selected.

Step 3:

Image 4: The Parameters

The parameters are set as shown in order to maximize the identity of the reported matches.

Step 4:

Image 5: Formatting Options

The formatting options are changed to obtain a clearer view of the aligned sequences in BLAST. Matches are
shown as dots and mismatches are highlighted in red.

Step 5:

Image 6: Sequence Alignment

The colours indicate the alignment score of each match. Red fields correspond to scores of 200 or more, pink
fields to scores between 80 and 200, green fields to scores between 50 and 80, and blue fields to scores
between 40 and 50.
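
The colour key can be written as a small lookup. This is only a sketch: it assumes the thresholds refer to alignment scores as in the BLAST graphic summary, it treats each lower bound as inclusive, and the function name `blast_colour` is made up for illustration. The black band for scores below 40 is part of the BLAST key but is not mentioned in the description above.

```python
def blast_colour(score):
    """Map an alignment score to the colour used in the BLAST graphic summary."""
    if score >= 200:
        return "red"
    if score >= 80:
        return "pink"
    if score >= 50:
        return "green"
    if score >= 40:
        return "blue"
    # Assumption: scores below 40 are drawn in black.
    return "black"
```

For example, `blast_colour(250)` gives `"red"` and `blast_colour(45)` gives `"blue"`.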

Step 6:

Image 7: The Results

In this way we can see that the nucleotide sequence encodes the DNA polymerase of Thermus aquaticus. We
select the hit with the lowest "E Value" and the highest "Identity" and click its accession number, which
opens the GenBank record.
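
The selection rule (lowest E value, then highest identity) can be sketched as a sort key. The `Hit` record, the function `best_hit` and the three example hits are hypothetical placeholders; they are not values from the actual BLAST output.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    accession: str
    e_value: float
    identity: float  # percent identity

def best_hit(hits):
    """Pick the hit with the lowest E value; ties go to the highest identity."""
    return min(hits, key=lambda h: (h.e_value, -h.identity))

# Hypothetical hits, for illustration only.
hits = [
    Hit("HYPOTHETICAL_1", 1e-50, 97.0),
    Hit("HYPOTHETICAL_2", 0.0, 99.5),
    Hit("HYPOTHETICAL_3", 0.0, 100.0),
]
chosen = best_hit(hits)
```

Here `chosen` is the third record, because it shares the lowest E value with the second one and has the higher identity.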

Step 7:

Image 8: GenBank Tab

In the GenBank record, the "protein_id" link next to the "translation" field, above "ORIGIN", is clicked. The
page that opens shows the protein sequence encoded by the gene.

Step 8:

Image 9: Protein Sequence

The protein sequence of Thermus aquaticus DNA polymerase is thus obtained. From the results list in Step 6,
10 protein sequences with differing degrees of similarity are selected.

Step 9:

Image 10: Clustal W Program

The 10 protein sequences found are copied into a single text document. This text document is opened in the
Clustal W program and the protein option is selected.
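
Collecting the sequences into one multi-FASTA text document can be sketched as below. The helper names `to_fasta` and `parse_fasta`, the line width of 60 and the two placeholder records are assumptions for illustration; the real input would be the sequences downloaded from GenBank.

```python
def to_fasta(records, width=60):
    """Format (header, sequence) pairs as one multi-FASTA string."""
    blocks = []
    for header, sequence in records:
        lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        blocks.append(">" + header + "\n" + "\n".join(lines))
    return "\n".join(blocks) + "\n"

def parse_fasta(text):
    """Read a multi-FASTA string back into (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Placeholder records (toy sequences, not the real proteins).
records = [("seq1_placeholder", "MKTAYIAKQR"), ("seq2_placeholder", "MKTAHIAKQRLG")]
fasta_text = to_fasta(records, width=5)
```

Parsing `fasta_text` again returns the original records, which is a quick way to check that the file handed to the alignment program is well formed.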

Step 10:

Image 11: Clustal W Parameters

The current "Gap Extension Penalty" value for protein sequences is 0.2; the other parameters are left unchanged.
"Execute Multiple Alignment" is then clicked.
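
For readers who prefer the command-line program to the form shown above, the same settings could be expressed roughly as follows. This is a sketch under assumptions: the executable name `clustalw2` and the file names are placeholders, and the option spellings (`-INFILE`, `-ALIGN`, `-TYPE`, `-GAPEXT`, `-OUTFILE`) should be checked against the help text of the installed ClustalW version. The function only builds the argument list; it does not run anything.

```python
def clustalw_command(infile, outfile, gap_ext=0.2, executable="clustalw2"):
    """Build an argument list for a protein multiple alignment with ClustalW."""
    return [
        executable,            # assumed executable name
        f"-INFILE={infile}",   # multi-FASTA input
        "-ALIGN",              # perform a full multiple alignment
        "-TYPE=PROTEIN",       # protein option, as selected in Step 9
        f"-GAPEXT={gap_ext}",  # gap extension penalty from Step 10
        f"-OUTFILE={outfile}",
    ]

# Placeholder file names.
command = clustalw_command("sequences.fasta", "clustalw.aln")
```

The resulting list could be passed to `subprocess.run(command, check=True)` on a machine where ClustalW is installed.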

Step 11:

Image 12: Protein Sequences

Since homologous sequences are evolutionarily related, we can first build a guide tree of these sequences from
their pairwise similarities and then follow the tree to carry out the multiple alignment of the entire set. The
stages can be summarized as below:
1) All pairs of sequences to be aligned are compared by pairwise alignment and a score matrix of distances or
similarities is produced, indicating the divergence / similarity of each pair.
2) A guide tree is built from the score matrix with branch lengths proportional to the score of each pair. In this
example, the neighbor-joining (NJ) method is used to build the unrooted and rooted trees.
3) Multiple alignment is carried out by starting with the most closely related pairs, aligning them and then
including other more distant pairs progressively according to the branching order in the guide tree. Gaps
present in earlier alignments remain fixed later on.

The progressive approach, first proposed by Feng and Doolittle, works efficiently, and the quality of the
alignments is usually excellent and reliable as long as the sequences are not too divergent. However, the choice
of alignment parameters remains a major problem for this approach. Traditionally, for a multiple alignment, one
weight matrix and two gap penalties (for gap opening and gap extension respectively) are chosen and kept fixed
throughout the alignment process. However, for very divergent sequences, different choices of these parameters
may greatly affect the final alignment:
(1) All residue weight matrices give most weight to identities, which guarantees that they will find approximately
the correct solution for closely related sequences. On the other hand, the scores given to mismatches are
critically important for highly divergent sequences, and appropriate matrices have to be chosen to get the truly
optimal alignment.
(2) For divergent sequences, the range of suitable gap penalty values is very narrow and may vary with the
degree of divergence. Thus applying the same penalty at different alignment stages will cause inaccurate results.
(3) Gaps occur with different probabilities at different positions. For example, factors such as hydrophilic
stretches and residue specificities should be taken into consideration to modify the position-specific gap penalty.
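
Stage 1 of the progressive method (pairwise comparison and a distance matrix) can be sketched in a few lines. This is a simplified illustration, not what Clustal W does internally: it uses a plain Needleman-Wunsch alignment with match/mismatch scores and a linear gap penalty instead of a substitution matrix with separate opening and extension penalties, and it defines identity over the compared (non-gap) positions. All names and the toy sequences are made up.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    def sub(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def pairwise_identity(a, b):
    """Percent identity over positions where neither sequence has a gap."""
    x, y = global_align(a, b)
    compared = [(p, q) for p, q in zip(x, y) if p != "-" and q != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(p == q for p, q in compared) / len(compared)

def distance_matrix(seqs):
    """Symmetric matrix of distances, d = 1 - identity / 100."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = 1.0 - pairwise_identity(seqs[i], seqs[j]) / 100.0
    return dist

def closest_pair(dist):
    """Indices of the least distant pair, i.e. the pair to align first."""
    n = len(dist)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: dist[p[0]][p[1]])

# Toy sequences (made up for illustration).
toy = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "TTGTACAA"]
dist = distance_matrix(toy)
first_pair = closest_pair(dist)
```

The pair returned by `closest_pair` is where a progressive alignment would start; building the full guide tree (for example by neighbor joining) is left out of this sketch.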

Image 13: Alignment Score

Image 14: Alignment Score (continued)

The accession numbers of the protein sequences used are listed below. Sequence 1 belongs to question 1; the
other sequences were found in Step 6. Sequence 1 is the main protein sequence and is compared with the other
protein sequences. The results are shown as a column graph.

Sequence 1: BAA06775
Sequence 2: AAA27507
Sequence 3: ACJ07015
Sequence 4: ACH89345
Sequence 5: BAA06033
Sequence 6: BAM64800
Sequence 7: ACJ07014
Sequence 8: AAC46079
Sequence 9: ACJ07018
Sequence 10: ACJ07021
Sequence 11: ACJ07016

Graph 1: The Rate of Score Graph (column graph; x-axis: Protein Sequences 1 to 11; y-axis: Alignment Score, 0 to 100)
Bar labels shown in the graph: 100; 99.8798; 98.7981; 86.0577; 86.6587; 86.899; 83.6145; 85.9036; 76.9231;
76.8029; 76.5625

Looking at this graph, we can say that Sequence 2 is the most similar to Sequence 1 and that Sequence 10 is the
least similar to Sequence 1. On this basis, the relationship between Sequences 1 and 2 is the closest, and they
may have come from a common ancestor.


Looking at the phylogenetic tree, we can say that sequence 1 and sequence 2 are close relatives.

Image 15: Phylogenetic Tree

" * " means that the residues in that column are identical in all sequences of the alignment. " : " means that
conserved substitutions have been observed, according to the colour table. " . " means that semi-conserved
substitutions are observed. By looking at the alignment results in these areas, we can say that there are highly
conserved columns.

Image 16: Comment of Alignment
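
The meaning of the three symbols can be illustrated with a sketch that rebuilds the annotation line for a small alignment. The strong and weak residue groups below are reproduced from the Clustal W documentation as recalled here and should be checked against the version in use; treating any column that contains a gap as unmarked is an assumption, and the toy alignment is made up.

```python
# Groups whose members count as conserved (":") or semi-conserved (".") substitutions.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND", "SNDEQK",
               "NDEQHK", "NEQHRK", "FVLIM", "HFY"]

def consensus_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column."""
    residues = set(column.upper())
    if "-" in residues:
        return " "  # assumption: columns containing gaps get no symbol
    if len(residues) == 1:
        return "*"
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."
    return " "

def consensus_line(aligned):
    """Annotation line for a list of equal-length aligned sequences."""
    return "".join(consensus_symbol("".join(col)) for col in zip(*aligned))

# Toy protein alignment (made up for illustration).
toy = ["MSKAW", "MTRAP", "MAQGD"]
line = consensus_line(toy)
```

For the toy alignment the first column is identical, the second and third fall inside a strong group, the fourth inside a weak group, and the last column gets no symbol.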


Image 17: 3D Ribbon Model of Thermus aquaticus DNA Polymerase

This image was generated with the Chimera program using the Thermus aquaticus DNA polymerase ID obtained
from the PDB database.

Image 18: Sequence 1 and Sequence 6 Comparison

Sequence 6 is shown in blue and Sequence 1 in brown. Sequence 6 is from Thermus thermophilus and Sequence 1
is from Thermus aquaticus. The image shows the regions common to the two sequences.

After the alignment with Clustal W, the clustalw.aln file is downloaded and opened in Jalview. The protein
sequences are colored by BLOSUM62 score so that the conserved regions can be seen more clearly. The darkest
areas are the most highly conserved. We can judge how conserved a region is by looking at its color tone.

Image 19: Conserved Regions with Blosum62

With WebLogo, we can create a logo of the area selected in red. The selected area is copied and pasted into the
sequence input field of WebLogo ( http://weblogo.threeplusone.com/create.cgi ).

Image 20: The Logo of Selected Area

Each column of the logo contains a single letter. We can therefore say that the selected area is highly
conserved. For comparison, a poorly conserved region is examined below.


Image 21: Poorly Conserved Region

By looking at the color tone of the selected area, we can see that this area is poorly conserved.

Image 22: The Logo of Selected Area

In this logo, more than one letter is stacked in the same column. This indicates that different proteins have
different amino acids at the same position. Thus, the selected area is shown to be poorly conserved.
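
The difference between the two logos can be expressed numerically as the information content of a column, which is what the letter stack height in a sequence logo represents. The sketch below assumes a 20-letter amino acid alphabet, ignores gaps, and omits the small-sample correction that WebLogo applies; the function name and the toy columns are illustrative.

```python
import math
from collections import Counter

def column_information(column, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus the Shannon entropy."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return math.log2(alphabet_size) - entropy

# Toy columns: one fully conserved, one with four different residues.
conserved = column_information("AAAA")
variable = column_information("ACDE")
```

A fully conserved column reaches the maximum of log2(20), about 4.32 bits, and shows a single tall letter; a column with several different residues scores lower and shows a short stack of letters.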

RESULT:
What are the structurally distinguishing features of conserved protein sequence regions? The structure of many
diverse proteins is currently known. In addition, many more protein sequences have been determined. These
data can be used to study the relations between structure and conserved sequence features of proteins, such as
protein secondary structure, which is a basic structural attribute that defines structural folds.
Sequence conservation of homologous sequences is rarely homogenous along their length; as sequences
diverge, their conservation is localized to specific regions. Typically, evolutionary conserved regions are
important both structurally and functionally. To obtain the general structural features of conserved regions of
all proteins, it is necessary to decide which scale of protein clustering, conserved regions, and structure
features to analyze. Natural choices are generically defined protein families, ungapped protein sequence motifs
(blocks) that separate proteins into either conserved or random regions, and the four basic secondary structure
elements (SSEs), namely, alpha helices, beta strands, structured turns, and loops. It is also of paramount
importance to analyze data from a very large and diverse group of proteins, avoiding conclusions drawn from
biased or limited data and using exact statistics to identify subtle but significant features.

In this assignment, the given nucleotide sequence was identified using blastn and the corresponding protein
sequence was obtained via GenBank. Other protein sequences were obtained with the aid of BLAST. The protein
sequences found were aligned with Clustal W. Later, the alignment was visualized with Jalview and colored with
BLOSUM62. The BLOSUM62 matrix is consistent with strong evolutionary pressure to conserve protein function.
As expected, the most common substitution for any amino acid is itself. Overall, positive scores are less common
than negative scores, suggesting that most substitutions negatively affect protein function. The most highly
conserved amino acids are cysteine, tryptophan and histidine, which have the highest scores. Interestingly,
these amino acids have unique chemistries and often play important structural or catalytic roles in
proteins. Looking at the results obtained, we can say that the proteins are about 85% identical.
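
The figure of about 85% can be checked against the values shown in Graph 1. The sketch assumes that the ten values below 100 are the alignment scores of Sequences 2 to 11 against Sequence 1 and that the value 100 is Sequence 1 compared with itself; the assignment of each value to a particular sequence is not needed for the mean.

```python
from statistics import mean

# Scores read from Graph 1, excluding the self-comparison value of 100.
scores = [99.8798, 98.7981, 86.0577, 86.6587, 86.899,
          83.6145, 85.9036, 76.9231, 76.8029, 76.5625]

average_score = mean(scores)
```

Under these assumptions the mean is roughly 85.8, which is in line with the statement that the proteins are about 85% identical.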

