Bioinformatics
Homework – 2
MULTIPLE ALIGNMENT WITH
CLUSTAL W AND JALVIEW
Gebze Technical University
Ebru AKHARMAN
142204026
13.12.2017
AIM:
Multiple sequence alignments are essential in computational analysis of protein sequences and structures,
with applications in structure modeling, functional site prediction, phylogenetic analysis and sequence
database searching. Molecular biology is becoming a computationally intense realm of contemporary science
and faces some of the current grand scientific challenges. In this context, tools that effectively identify, store,
compare and analyze large and growing numbers of biological sequences are becoming increasingly important.
Multiple sequence alignment (MSA) has assumed a key role in comparative structure and function analysis of
biological sequences. It often leads to fundamental biological insight into sequence-structure-function
relationships of nucleotide or protein sequence families. Significant advances have been achieved in this field,
and many useful tools have been developed for constructing alignments. The main purpose of this task is to
identify the similarities among a set of proteins with Clustal W and to visualize them with Jalview.
INTRODUCTION:
The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly
improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each
sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most
divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according
to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced
gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than in regular
secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally
reduced gap penalties to encourage the opening up of new gaps at these positions.
The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular
biology. Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or
demonstrate homology between new sequences and existing families of sequences; to help predict the
secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an
essential prelude to molecular evolutionary analysis.
The rate of appearance of new sequence data is steadily increasing and the development of efficient and
accurate automatic methods for multiple alignment is, therefore, of major importance. Multiple alignments of
protein and nucleotide sequences are an important tool in studying the sequences. The basic information they
provide is the identification of conserved regions. This is very useful in designing experiments to test and
modify the function of specific proteins, in predicting the function and structure of proteins, and in identifying
new members of protein families.
Clustal W is a general purpose multiple sequence alignment program for DNA or proteins. Clustal W produces
biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for
the selected sequences and lines them up so that the identities, similarities and differences can be seen.
Evolutionary relationships can be seen by viewing cladograms or phylograms.
( http://www.genome.jp/tools-bin/clustalw )
How to work with Clustal W?
Step 1: The multiple sequences are uploaded in FASTA format.
Step 2: The options are left at their defaults.
Step 3: If desired, a sequence title may be entered.
Step 4: Press the button that runs the Clustal W alignment.
Step 5: The plain text version of the alignment will be temporarily stored in a file.
"*" means that the residues in that column are identical in all sequences of the alignment. ":" means that
conserved substitutions have been observed, according to the colour table. "." means that semi-conserved
substitutions are observed.
Step 6: Save the phylogenetic trees.
Jalview is a multiple alignment editor written in Java. It is used widely in a variety of web pages but is
also available as a general purpose alignment editor and analysis workbench. It is a good tool for viewing
multiple alignments, using colors and shading to highlight different characteristics of the sequences such as
hydrophobicity and percentage identity. ( http://www.jalview.org/download )
Conserved sequence: A base sequence in a DNA molecule (or an amino acid sequence in a protein) that has
remained essentially unchanged, and so has been conserved, throughout evolution. More broadly, the term covers
any sequences of bases (or amino acids) in comparable segments of different nucleic acids (or proteins) that tend
to show similarity greater than that due to chance alone. For example, if one position is occupied by the same base in all
comparable DNA sequences, then that position is said to be completely conserved. If the same base occurs at a
given position in, say, 75% of samples examined, it would be described as partially conserved. By extension, the
conservation of other positions in a sequence is assessed in the same way, usually by computer analysis. The
degree to which sequences are conserved can indicate the extent of structural and functional similarities
between different genes or between different proteins and provides clues to their possible evolutionary
relations.
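The definition above can be made concrete with a small sketch that measures, for each alignment column, the fraction of sequences sharing the most common residue. The function names, the toy DNA sequences and the 0.5 cut-off between "partially conserved" and "variable" are assumptions for illustration only; the text does not define a cut-off.

```python
from collections import Counter


def column_conservation(column):
    """Return (most common residue, fraction of sequences carrying it)."""
    counts = Counter(column)
    residue, count = counts.most_common(1)[0]
    return residue, count / len(column)


def classify(fraction, partial_cutoff=0.5):
    """Label a column; the cut-off is an arbitrary illustrative choice."""
    if fraction == 1.0:
        return "completely conserved"
    if fraction >= partial_cutoff:
        return "partially conserved"
    return "variable"


# Toy aligned DNA sequences (hypothetical, not the assignment data).
toy_dna = ["ACGT", "ACGA", "ACTT", "ACGT"]
for position, column in enumerate(zip(*toy_dna), start=1):
    residue, fraction = column_conservation(column)
    print(position, residue, f"{fraction:.0%}", classify(fraction))
```

With these four toy sequences, positions 1 and 2 are completely conserved and positions 3 and 4 are conserved in 75% of the sequences.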
Step 1:
Nucleotide BLAST is selected from the BLAST home page, because the query is a nucleotide sequence.
Step 2:
The nucleotide sequence is entered in FASTA format in the query box shown. A name is given to the sequence and a
database is selected.
Step 3:
The parameters are selected as follows so that the hits with the highest identity are found.
Step 4:
The formatting options are changed to obtain a more readable view of the aligned sequences in BLAST. Matches are
shown as dots, and mismatches are shown in red.
Step 5:
The colors in the graphic summary correspond to alignment scores. Red regions indicate an alignment score of 200
or more. Pink regions indicate scores between 80 and 200, green regions scores between 50 and 80, and blue
regions scores between 40 and 50.
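The colour key of Step 5 can be expressed as a simple lookup. This is a sketch that assumes the thresholds apply to the alignment score, as in the colour key of the BLAST graphic summary, and that scores below 40 are drawn in black; the function name is hypothetical.

```python
def blast_score_colour(score):
    """Map an alignment score to the colour used in the graphic summary."""
    if score >= 200:
        return "red"
    if score >= 80:
        return "pink"
    if score >= 50:
        return "green"
    if score >= 40:
        return "blue"
    return "black"   # assumption: scores below 40 are shown in black


for example_score in (35, 45, 60, 150, 900):
    print(example_score, blast_score_colour(example_score))
```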
Step 6:
In this way we can see that the nucleotide sequence encodes the DNA polymerase of Thermus aquaticus. We
select the hit with the lowest "E Value" and the highest "Identity" and click its "Accession Number". This
opens the GenBank record.
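The selection rule of Step 6 (lowest E value first, then highest identity) can be sketched as a sort key. The hit list below is entirely hypothetical: the accession labels, E values and identities are placeholders and not results from this assignment.

```python
# Hypothetical hit table; the values are placeholders, not real BLAST output.
hits = [
    {"accession": "HIT_A", "evalue": 1e-50, "identity": 92.0},
    {"accession": "HIT_B", "evalue": 0.0, "identity": 100.0},
    {"accession": "HIT_C", "evalue": 0.0, "identity": 99.5},
]


def best_hit(hit_list):
    """Pick the hit with the lowest E value; break ties by highest identity."""
    return min(hit_list, key=lambda hit: (hit["evalue"], -hit["identity"]))


print(best_hit(hits)["accession"])   # HIT_B
```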
Step 7:
In the GenBank record, the "protein_id" link next to the "translation" field (above "ORIGIN") is clicked. The
page that opens shows the protein sequence encoded by the gene.
Step 8:
The protein sequence of Thermus aquaticus DNA polymerase is thus obtained. From the results in Step 6, 10 protein
sequences with differing degrees of similarity are selected.
Step 9:
The 10 protein sequences found are copied into a text document. This text document is opened in the Clustal W
program. The protein option is selected.
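Before the text document is submitted, it can be useful to check that it really contains the expected number of FASTA records. The following is a minimal sketch of a FASTA reader; the function name and the two short example records are hypothetical, and real files may need more careful handling of headers.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the name; fall back to a counter.
            name = header[0] if header else f"seq{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {record: "".join(parts) for record, parts in records.items()}


# Two made-up records, used only to show the format.
example = """>seq1 example protein
MKTLLV
AAGH
>seq2
MRTLIV
"""
sequences = parse_fasta(example)
print(len(sequences), sequences["seq1"])   # 2 MKTLLVAAGH
```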
Step 10:
The "Gap Extension Penalty" value for protein sequences is 0.2. The other parameters are left unchanged. "Execute
Multiple Alignment" is clicked.
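The role of the gap extension penalty can be illustrated with a simple affine gap model, in which a gap pays a fixed opening cost plus a smaller cost per gapped residue. This is a sketch under assumptions: the opening penalty of 10.0 is an assumed value (only the extension penalty of 0.2 is given in Step 10), conventions differ on whether the first residue also pays the extension cost, and Clustal W additionally adjusts penalties by position, which is ignored here.

```python
def affine_gap_cost(gap_length, gap_open=10.0, gap_extend=0.2):
    """Penalty for a single gap of gap_length residues under an affine model."""
    if gap_length <= 0:
        return 0.0
    # One opening cost per gap, plus an extension cost for every gapped residue.
    return gap_open + gap_extend * gap_length


for length in (1, 5, 20):
    print(length, round(affine_gap_cost(length), 2))
```

Because the extension penalty is small compared with the opening penalty, one long gap is much cheaper than many short gaps of the same total length.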
Step 11:
The accession numbers of the protein sequences used are listed below.
Sequence 7: ACJ07014
Sequence 8: AAC46079
Sequence 9: ACJ07018
Looking at this graph, we can say that Sequence 2 is the most similar to Sequence 1 and that Sequence 10 is the
least similar to Sequence 1. From this, it can be said that the relationship between Sequences 1 and 2 is the
closest, and that Sequences 1 and 2 may have come from the same ancestor.
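The similarity ranking described above can be sketched by computing the percent identity between aligned sequences. Counting only columns in which neither sequence has a gap is one common convention and is an assumption here; the function names and the toy sequences are hypothetical and are not the sequences of this assignment.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def rank_by_similarity(reference, others):
    """Return the names in 'others' ordered from most to least similar."""
    return sorted(others,
                  key=lambda name: percent_identity(reference, others[name]),
                  reverse=True)


# Toy aligned sequences (hypothetical).
reference = "MKTLLVAG"
others = {"seq2": "MKTLLVSG", "seq10": "MRS-IVAA"}
print(rank_by_similarity(reference, others))   # ['seq2', 'seq10']
```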
Looking at the phylogenetic tree, we can say that sequence 1 and sequence 2 are close relatives.
This image was generated with the Chimera program, using the Thermus aquaticus DNA polymerase ID obtained
from the PDB database.
Sequence 6 is shown in blue and Sequence 1 in brown. Sequence 6 is from Thermus thermophilus and Sequence 1 is
from Thermus aquaticus. The image shows the regions shared by the two sequences.
After the alignment with Clustal W, the clustalw.aln file is downloaded and opened in the Jalview
program. The protein sequences are colored by BLOSUM62 score so that the conserved regions can be seen more
clearly. The darkest areas are the most highly conserved, so the degree of conservation of a region can be
judged from its color tone.
With WebLogo, we can create a sequence logo of the area selected in red. The selected area is copied and pasted
into the sequence input box of WebLogo ( http://weblogo.threeplusone.com/create.cgi ).
Each column of the logo contains a single letter. In this case, we can say that the selected area is conserved at a
high level. For comparison, a second area is examined next.
By looking at the color tone of the selected area, we can see that this area is poorly conserved.
In the logo, more than one letter is stacked in the same column. This indicates that different proteins have
different amino acids at that position. Thus, the selected area is shown to be poorly conserved.
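The difference between the two logos can be quantified with the information content of each column, which determines the height of the letter stack in a sequence logo. The sketch below uses the basic formula (log2 of the alphabet size minus the Shannon entropy of the column) and leaves out the small-sample correction that logo tools may apply; the function name and the toy columns are hypothetical.

```python
import math
from collections import Counter


def information_content(column, alphabet_size=20):
    """Information content of one alignment column in bits (gaps ignored)."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy


# A fully conserved column versus a column with four different residues.
print(round(information_content("LLLL"), 2))   # 4.32
print(round(information_content("LIVM"), 2))   # 2.32
```

A column with a single letter reaches the maximum of about 4.32 bits for proteins, while a column that mixes several amino acids scores lower, which is why its letters appear as a shorter, stacked column in the logo.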
RESULT:
What are the structurally distinguishing features of conserved protein sequence regions? The structure of many
diverse proteins is currently known. In addition, many more protein sequences have been determined. These
data can be used to study the relations between structure and conserved sequence features of proteins, such as
protein secondary structure, which is a basic structural attribute that defines structural folds.
Sequence conservation of homologous sequences is rarely homogeneous along their length; as sequences
diverge, their conservation is localized to specific regions. Typically, evolutionary conserved regions are
important both structurally and functionally. To obtain the general structural features of conserved regions of
all proteins, it is necessary to decide which scale of protein clustering, conserved regions, and structure
features to analyze. Natural choices are generically defined protein families, ungapped protein sequence motifs
(blocks) that separate proteins into either conserved or random regions, and the four basic secondary structure
elements (SSEs), namely, alpha helices, beta strands, structured turns, and loops. It is also of paramount
importance to analyze data from a very large and diverse group of proteins, avoiding conclusions drawn from
biased or limited data and using exact statistics to identify subtle but significant features.
In this assignment, the given nucleotide sequence was identified using blastn and the corresponding protein
sequence was obtained from GenBank. The other protein sequences were obtained with the aid of BLAST. The protein
sequences found were aligned with Clustal W. Later, the alignment was visualized with Jalview and colored with BLOSUM62.
The BLOSUM62 matrix is consistent with strong evolutionary pressure to conserve protein function. As
expected, the most common substitution for any amino acid is itself. Overall, positive scores are less common
than negative scores, suggesting that most substitutions negatively affect protein function. The most highly
conserved amino acids are cysteine, tryptophan and histidine, which have the highest self-substitution scores.
Interestingly, these amino acids have unique chemistries and often play important structural or catalytic roles in
proteins. Looking at the results obtained, we can say that the proteins are 85% identical.
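The statement about the most highly conserved amino acids can be checked against the diagonal of the BLOSUM62 matrix, that is, the score for each amino acid aligned with itself. The values below are the BLOSUM62 diagonal as commonly tabulated, written out by hand here; they should be verified against a reference copy of the matrix. The variable names are hypothetical.

```python
# Self-substitution (diagonal) scores of BLOSUM62, keyed by one-letter code.
# Assumption: values copied from the commonly tabulated matrix.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}

# Sort by score, highest first, and keep the top three.
top_three = sorted(BLOSUM62_DIAGONAL.items(),
                   key=lambda item: item[1], reverse=True)[:3]
print(top_three)   # [('W', 11), ('C', 9), ('H', 8)]
```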
Resources:
https://www.ncbi.nlm.nih.gov/pubmed/16679011
https://www.ncbi.nlm.nih.gov/pubmed/18566763
https://www.ncbi.nlm.nih.gov/pubmed/18592193
https://webcache.googleusercontent.com/search?q=cache:wqxSkx558BkJ:https://en.wikipedia.org/wiki/Conserved_sequence+&cd=1&hl=tr&ct=clnk&gl=tr
https://www.ncbi.nlm.nih.gov/pubmed/8743695
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/?page=6
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC308517/?page=2
http://www.encyclopedia.com/science/dictionaries-thesauruses-pictures-and-press-releases/conserved-sequence
http://guava.physics.uiuc.edu/~nigel/courses/598BIO/498BIOonline-essays/hw3/files/hw3_li.pdf
https://www-bimas.cit.nih.gov/clustalw/clustalw.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1781454/