Bioinformatics
Homework – 3
Finding Families, Motifs,
Patterns and Clans of
Proteins with PROSITE and
PFAM
Gebze Technical University
EBRU AKHARMAN
142204026
18.12.2017
The protein sequence analysed here comes from the species Gallus gallus (chicken). PFAM is used to find the
clan and family of this protein sequence. The steps below are followed.
STEP 1:
The protein sequence is pasted into the indicated area and the "GO" button is clicked.
STEP 2:
Above are the details of the matches that were found. We separate Pfam-A matches into two tables, containing
the significant and insignificant matches. A significant match is one where the bits score is greater than or equal
to the gathering threshold for the Pfam domain. Hits which do not start and end at the end points of the
matching HMM are highlighted. The Pfam graphic below shows only the significant matches to your sequence.
Clicking on any of the domains in the image will take you to a page of information about that domain. Pfam does
not allow any amino-acid to match more than one Pfam-A family, unless the overlapping families are part of the
same clan. In cases where two members of the same clan match the same region of a sequence, only one match
is shown, that with the lowest E-value. A small proportion of sequences within the enzymatic Pfam families have
had their active sites experimentally determined. Using a strict set of rules, chosen to reduce the rate of false
positives, we transfer experimentally determined active site residue data from a sequence within the same
Pfam family to your query sequence. These are shown as "Predicted active sites".
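The two rules described above (the significance cut-off and the handling of overlapping matches within one clan) can be illustrated with a short Python sketch. This is not Pfam's own code: the `Hit` record, its field names and the example values are hypothetical, and only the rules stated in the text are implemented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Hit:
    """One Pfam-A match (hypothetical record; field names are assumptions)."""
    family: str
    clan: Optional[str]          # None if the family belongs to no clan
    start: int                   # position on the query, 1-based inclusive
    end: int
    bit_score: float
    gathering_threshold: float
    e_value: float


def split_by_significance(hits: List[Hit]) -> Tuple[List[Hit], List[Hit]]:
    """A match is significant when bit score >= gathering threshold."""
    significant = [h for h in hits if h.bit_score >= h.gathering_threshold]
    insignificant = [h for h in hits if h.bit_score < h.gathering_threshold]
    return significant, insignificant


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two matches share at least one query residue."""
    return a.start <= b.end and b.start <= a.end


def resolve_clan_overlaps(hits: List[Hit]) -> List[Hit]:
    """Among overlapping matches from the same clan, keep the lowest E-value."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.e_value):
        clash = any(
            k.clan is not None and k.clan == hit.clan and overlaps(k, hit)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


# Made-up example values, for illustration only.
hits = [
    Hit("FamA", "ClanX", 3, 375, 500.0, 25.0, 1e-150),
    Hit("FamB", "ClanX", 10, 200, 30.0, 25.0, 1e-5),
    Hit("FamC", None, 50, 90, 10.0, 22.0, 0.3),
]
significant, insignificant = split_by_significance(hits)
print([h.family for h in resolve_clan_overlaps(significant)])
```

In this toy input FamA and FamB are both significant and overlap within the same clan, so only the match with the lower E-value is kept for display.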
For Pfam-A hits we show the alignments between your search sequence and the matching HMM. You can show
individual alignments by clicking on the "Show" button in each row of the result table, or you can show all
alignments using the links above each table. This alignment row for each hit shows the alignment between your
sequence and the matching HMM. The alignment fragment includes the following rows:
#HMM: consensus of the HMM. Capital letters indicate the most conserved positions
#MATCH: the match between the query sequence and the HMM. A '+' indicates a positive score which can be
interpreted as a conservative substitution
#PP: posterior probability. The degree of confidence in each individual aligned residue. 0 means 0-5%, 1 means
5-15% and so on; 9 means 85-95% and a '*' means 95-100% posterior probability
#SEQ: query sequence. A '-' indicates a deletion in the query sequence with respect to the HMM. Columns are
coloured according to the posterior probability
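The encoding of the #PP row can be made concrete with a small Python sketch that turns each symbol back into its percentage range, following the scheme described above (0 means 0-5%, 1 means 5-15%, and so on up to '*' for 95-100%). The function names are hypothetical, and skipping any character that is not a digit or '*' (for example a gap marker) is an assumption about the row's layout.

```python
from typing import List, Tuple


def pp_range(symbol: str) -> Tuple[float, float]:
    """Return the (low, high) posterior probability range in percent."""
    if symbol == "*":
        return (95.0, 100.0)
    if symbol == "0":
        return (0.0, 5.0)
    if len(symbol) == 1 and symbol in "123456789":
        centre = int(symbol) * 10.0
        return (centre - 5.0, centre + 5.0)
    raise ValueError(f"not a posterior probability symbol: {symbol!r}")


def weakly_aligned(pp_line: str, min_percent: float = 50.0) -> List[int]:
    """1-based columns whose lower probability bound is below min_percent."""
    columns = []
    for index, symbol in enumerate(pp_line, start=1):
        if symbol != "*" and symbol not in "0123456789":
            continue  # assumption: other characters mark gap columns
        low, _high = pp_range(symbol)
        if low < min_percent:
            columns.append(index)
    return columns


print(pp_range("9"))
print(weakly_aligned("89*.27*"))
```

A check of this kind helps to see which residues of a hit are aligned with low confidence before a conclusion is drawn from them.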
According to the result obtained, the protein sequence belongs to the actin protein family. Actin is
a family of globular multi-functional proteins that form microfilaments. It is found in essentially all eukaryotic
cells (the only known exception being nematode sperm), where it may be present at a concentration of over
100 μM. An actin protein's mass is roughly 42-kDa, with a diameter of 4 to 7 nm, and it is
the monomeric subunit of two types of filaments in cells: microfilaments, one of the three major components of
the cytoskeleton, and thin filaments, part of the contractile apparatus in muscle cells. It can be present as either
a free monomer called G-actin (globular) or as part of a linear polymer microfilament called F-
actin (filamentous), both of which are essential for such important cellular functions as the mobility and
contraction of cells during cell division. Actin participates in many important cellular processes,
including muscle contraction, cell motility, cell division and cytokinesis, vesicle and organelle movement, cell
signaling, and the establishment and maintenance of cell junctions and cell shape. Many of these processes are
mediated by extensive and intimate interactions of actin with cellular membranes. In vertebrates, three main
groups of actin isoforms, alpha, beta, and gamma have been identified. The alpha actins, found in muscle
tissues, are a major constituent of the contractile apparatus. The beta and gamma actins coexist in most cell
types as components of the cytoskeleton, and as mediators of internal cell motility. It is believed that the
diverse range of structures formed by actin enabling it to fulfill such a large range of functions is regulated
through the binding of tropomyosin along the filaments. A cell’s ability to dynamically form microfilaments
provides the scaffolding that allows it to rapidly remodel itself in response to its environment or to the
organism’s internal signals, for example, to increase cell membrane absorption or increase cell adhesion in
order to form cell tissue. Other enzymes or organelles such as cilia can be anchored to this scaffolding in order
to control the deformation of the external cell membrane, which allows endocytosis and cytokinesis. It can also
produce movement either by itself or with the help of molecular motors. Actin therefore contributes to
processes such as the intracellular transport of vesicles and organelles as well as muscular
contraction and cellular migration. It therefore plays an important role in embryogenesis, the healing of
wounds and the invasivity of cancer cells. The evolutionary origin of actin can be traced to prokaryotic cells,
which have equivalent proteins. Actin homologs from prokaryotes and archaea polymerize into different helical
or linear filaments consisting of one or multiple strands. However, the in-strand contacts and nucleotide
binding sites are preserved in prokaryotes and in archaea. Lastly, actin plays an important role in the control
of gene expression.
A large number of illnesses and diseases are caused by mutations in alleles of the genes that regulate the
production of actin or of its associated proteins. The production of actin is also key to the process
of infection by some pathogenic microorganisms. Mutations in the different genes that regulate actin
production in humans can cause muscular diseases, variations in the size and function of the heart as well
as deafness. The make-up of the cytoskeleton is also related to the pathogenicity of
intracellular bacteria and viruses, particularly in the processes related to evading the actions of the immune
system.
Structure:
Its amino acid sequence is also one of the most highly conserved of the proteins as it has changed little over the
course of evolution, differing by no more than 20% in species as diverse as algae and humans. It is therefore
considered to have an optimised structure. It has two distinguishing features: it is an enzyme that
slowly hydrolyzes ATP, the "universal energy currency" of biological processes. However, the ATP is required in
order to maintain its structural integrity. Its efficient structure is formed by an almost unique folding process.
In addition, it is able to carry out more interactions than any other protein, which allows it to perform a wider
variety of functions than other proteins at almost every level of cellular life. Myosin is an example of a protein
that bonds with actin. Another example is villin, which can weave actin into bundles or cut the filaments
depending on the concentration of calcium cations in the surrounding medium.
Architecture of the nucleus – the interaction of actin with alpha II-spectrin and other proteins is important
for maintaining the proper shape of the nucleus.
Transcription – actin is involved in chromatin reorganization and transcription initiation, and interacts with
the transcription complex. Actin takes part in the regulation of chromatin structure and interacts with RNA
polymerases I, II and III. In Pol I transcription, actin and myosin (MYO1C, which binds DNA) act as
a molecular motor. For Pol II transcription, β-actin is needed for the formation of the preinitiation complex.
Pol III contains β-actin as a subunit. Actin can also be a component of chromatin remodelling complexes as
well as pre-mRNP particles (that is, precursor messenger RNA bundled in proteins), and is involved
in nuclear export of RNAs and proteins.
Regulation of gene activity – actin binds to the regulatory regions of different kinds of genes. Actin's ability
to regulate gene activity is used in the molecular reprogramming method, which allows differentiated cells
to return to their embryonic state.
Translocation of the activated chromosome fragment from the region under the membrane to euchromatin,
where transcription starts. The movement requires the interaction of actin and myosin.
Integration of different cellular compartments – actin is a molecule that integrates cytoplasmic and
nuclear signal transduction pathways. An example is the activation of transcription in response to serum
stimulation of cells in vitro.
Due to its ability to undergo conformational changes and to interact with many proteins, actin acts as a
regulator of the formation and activity of protein complexes such as the transcriptional complex.
STEP 3:
The actin-like ATPase domain forms an alpha/beta canonical fold. The domain can be subdivided into 1A, 1B,
2A and 2B subdomains. Subdomains 1A and 2A share the same RNAseH-like fold (a five-stranded beta-sheet
decorated by a number of alpha-helices). Domains 1A and 2A are conserved in all members of this superfamily,
whereas domains 1B and 2B have a variable structure and are even missing from some homologues. Within the
actin-like ATPase domain the ATP-binding site is highly conserved. The phosphate part of the ATP is bound in a
cleft between subdomains 1A and 2A, whereas the adenosine moiety is bound to residues from domains 2A and
2B. This clan contains 31 families and the total number of domains in the clan is 150464. The clan was built by
RD Finn. This clan contains the following 31 member families:
Clicking a link in the area labelled "Members" opens the corresponding Wikipedia information. For example:
The motifs and patterns of the given protein sequence are found with PROSITE by the following steps.
STEP 1:
The PROSITE web page is reached via ExPASy. ScanProsite is clicked for more scanning options.
STEP 2:
Depending on the situation, one of the 3 options is selected. In this section, "Option 1" is selected because the
protein sequence is scanned for motifs.
The protein sequence is pasted into the indicated region and the tick mark is removed from the field "Exclude
motifs with a high probability of occurrence from the scan". Then the "Start to Scan" button is clicked.
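Why some motifs have a "high probability of occurrence" can be seen with a rough calculation. The Python sketch below estimates how often a short pattern would match by chance; it assumes that all 20 amino acids are equally likely and independent at every position, which is a simplification, and the function names are hypothetical. The example uses the N-glycosylation pattern N-{P}-[ST]-{P} (PROSITE PS00001) and a sequence of 375 residues, a length inside the 374 to 379 range typical of actins.

```python
def chance_match_probability(allowed_counts, alphabet_size=20):
    """Probability that one window matches, given how many residues
    are allowed at each pattern position (uniform-frequency assumption)."""
    probability = 1.0
    for allowed in allowed_counts:
        probability *= allowed / alphabet_size
    return probability


def expected_matches(allowed_counts, sequence_length, alphabet_size=20):
    """Expected number of chance matches in a sequence of the given length."""
    window = len(allowed_counts)
    positions = max(sequence_length - window + 1, 0)
    return positions * chance_match_probability(allowed_counts, alphabet_size)


# N-{P}-[ST]-{P}: 1 residue allowed, then 19, then 2, then 19.
n_glycosylation = [1, 19, 2, 19]
print(round(expected_matches(n_glycosylation, 375), 2))
```

Under these assumptions more than one chance match is expected in a protein of this size, so a hit to such a pattern is weak evidence on its own. This is consistent with the scan offering an option to exclude these motifs.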
STEP 3:
Actins [1,2,3,4] are highly conserved contractile proteins that are present in all eukaryotic cells. In vertebrates
there are three groups of actin isoforms: α, β and γ. The α actins are found in muscle tissues and are a major
constituent of the contractile apparatus. The β and γ actins coexist in most cell types as components of the
cytoskeleton and as mediators of internal cell motility. In plants [5] there are many isoforms which are
probably involved in a variety of functions such as cytoplasmic streaming, cell shape determination, tip growth,
graviperception, cell wall deposition, etc. Actin exists either in a monomeric form (G-actin) or in a polymerized
form (F-actin). Each actin monomer can bind a molecule of ATP; when polymerization occurs, the ATP is
hydrolyzed. Actin is a protein of 374 to 379 amino acid residues. The structure of actin has been highly
conserved in the course of evolution.
Recently some divergent actin-like proteins have been identified in several species. These proteins are:
Centractin (actin-RPV) from mammals, fungi (yeast ACT5, Neurospora crassa ro-4) and Pneumocystis
carinii (actin-II). Centractin seems to be a component of a multi-subunit centrosomal complex involved
in microtubule based vesicle motility. This subfamily is also known as ARP1.
ARP2 subfamily which includes chicken ACTL, yeast ACT2, Drosophila 14D and C. elegans actC.
ARP3 subfamily which includes actin 2 from mammals, Drosophila 66B, yeast ACT4 and fission yeast
act2.
ARP4 subfamily which includes yeast ACT3 and Drosophila 13E.
We developed three signature patterns. The first two are specific to actins and span positions 54 to 64 and 357
to 365. The last signature picks up both actins and the actin-like proteins and corresponds to positions 106 to
118 in actins.
The representation of each line is listed below:
A sequence logo is a graphical display of a multiple sequence alignment consisting of colour-coded stacks of
letters representing amino acids at successive positions. Sequence logos provide a richer and more precise
description of sequence similarity than consensus sequences and can rapidly reveal significant features of the
alignment that could otherwise be difficult to perceive. The total height of a logo position depends on the
degree of conservation in the corresponding multiple sequence alignment column.
Very conserved alignment columns produce high logo positions. The height of each letter in a logo position is
proportional to the observed frequency of the corresponding amino acid in the alignment column. The letters of
each stack are ordered from most to least frequent, so that it is possible to read the consensus sequence from the
top of the stacks. For patterns, each position is shown in the logo, whereas for profiles only match positions are
considered, i.e. the length of the logo corresponds to the length of the profile.
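The quantities behind a sequence logo can be sketched in Python as follows. The letter frequencies per alignment column follow the description above directly. Measuring the conservation of a column as its information content in bits (log2 of 20 minus the Shannon entropy) is a common convention for logos and is an assumption here, not a statement of how the PROSITE logos are drawn; no small-sample correction is applied, and the toy alignment is made up.

```python
import math
from collections import Counter


def column_frequencies(alignment):
    """Per-column residue frequencies, most frequent first; gaps are ignored."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    columns = []
    for i in range(length):
        counts = Counter(row[i] for row in alignment if row[i] not in "-.")
        total = sum(counts.values())
        columns.append({aa: n / total for aa, n in counts.most_common()})
    return columns


def information_content(frequencies, alphabet_size=20):
    """Conservation of one column in bits (assumed measure of stack height)."""
    if not frequencies:
        return 0.0
    entropy = -sum(p * math.log2(p) for p in frequencies.values())
    return math.log2(alphabet_size) - entropy


# Toy alignment, for illustration only.
alignment = ["WIA", "WVA", "WIS", "WCA"]
for position, freqs in enumerate(column_frequencies(alignment), start=1):
    consensus = next(iter(freqs))  # top letter of the stack
    print(position, consensus, round(information_content(freqs), 2), freqs)
```

A fully conserved column gives the tallest stack, and reading the first letter of each column reproduces the consensus sequence, as the text describes.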
This picture, obtained from the PDB, shows the native Gallus gallus actin protein.
This image was obtained with the Chimera program and shows the PS00406 pattern in the actin protein. The
region matching PS00406 is highlighted in green.
PS00432 Pattern in Prosite Format:
The field marked "PA" in the original PROSITE data format contains the PS00432 pattern.
This image was obtained with the Chimera program and shows the PS00432 pattern in the actin protein. The
region matching PS00432 is highlighted in green.
The field marked "PA" in the original PROSITE data format contains the PS01132 pattern.
This image was obtained with the Chimera program and shows the PS01132 pattern in the actin protein. The
region matching PS01132 is highlighted in green.
It has been known for a long time that potential N-glycosylation sites are specific to the consensus sequence
Asn-Xaa-Ser/Thr. It must be noted that the presence of the consensus tripeptide is not sufficient to conclude
that an asparagine residue is glycosylated, due to the fact that the folding of the protein plays an important role
in the regulation of N-glycosylation. It has been shown that the presence of proline between Asn and Ser/Thr
will inhibit N-glycosylation; this has been confirmed by a statistical analysis of glycosylation sites, which also
shows that about 50% of the sites that have a proline C-terminal to Ser/Thr are not glycosylated. It must also
be noted that there are a few reported cases of glycosylation sites with the pattern Asn-Xaa-Cys; an
experimentally demonstrated occurrence of such a non-standard site is found in the plasma protein C.
The field marked "PA" in the original PROSITE data format contains the PS00001 pattern.
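To show how a pattern from the "PA" field is read, the Python sketch below translates basic PROSITE pattern syntax into a regular expression and reports the matching positions. It is a minimal illustration and not the ScanProsite implementation: the function names are hypothetical, only the elements x, a single residue, [..], {..} and repeat counts such as x(2) or x(1,2) are handled, and the test sequence is made up. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site (PS00001), which expresses the rule above that Asn-Xaa-Ser/Thr is required and that proline after Asn or after Ser/Thr is excluded.

```python
import re

# One pattern element: x, a residue, [allowed], {forbidden}, optional (n) or (n,m)
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate basic PROSITE pattern syntax into a regular expression."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            piece = "."                          # any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"      # any residue except these
        else:
            piece = core                         # residue or [allowed set]
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


def find_sites(pattern, sequence):
    """All matches, overlapping ones included, as (start, end, text), 1-based."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("N-{P}-[ST]-{P}", "MANGSAANPTANKTP"))
```

In the toy sequence only the first asparagine is reported: the second is followed by proline, and the third has proline directly after the threonine. The lookahead is used so that overlapping matches are not missed.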
This image was obtained with the Chimera program and shows the PS00001 pattern in the actin protein. The
region matching PS00001 is highlighted in green.
An appreciable number of eukaryotic proteins are acylated by the covalent addition of myristate (a C14-
saturated fatty acid) to their N-terminal residue via an amide linkage. The sequence specificity of the enzyme
responsible for this modification, myristoyl CoA:protein N-myristoyl transferase (NMT), has been derived from
the sequence of known N-myristoylated proteins and from studies using synthetic peptides. It seems to be the
following:
Note: We deliberately include as potential myristoylated glycine residues, those which are internal to a
sequence. It could well be that the sequence under study represents a viral polyprotein precursor and that
subsequent proteolytic processing could expose an internal glycine as the N-terminal of a mature protein.
The field marked "PA" in the original PROSITE data format contains the PS00008 pattern.
This image was obtained with the Chimera program and shows the PS00008 pattern in the actin protein. The
region matching PS00008 is highlighted in green.
PS00005 Pattern in Prosite Format:
In vivo, protein kinase C exhibits a preference for the phosphorylation of serine or threonine residues found
close to a C-terminal basic residue. The presence of additional basic residues at the N- or C-terminal of the
target amino acid enhances the Vmax and Km of the phosphorylation reaction.
The field marked "PA" in the original PROSITE data format contains the PS00005 pattern.
This image was obtained with the Chimera program and shows the PS00005 pattern in the actin protein. The
region matching PS00005 is highlighted in green.
Casein kinase II (CK-2) is a protein serine/threonine kinase whose activity is independent of cyclic nucleotides
and calcium. CK-2 phosphorylates many different proteins. The substrate specificity of this enzyme can be
summarized as follows:
The field marked "PA" in the original PROSITE data format contains the PS00006 pattern.
This image was obtained with the Chimera program and shows the PS00006 pattern in the actin protein. The
region matching PS00006 is highlighted in green.
Substrates of tyrosine protein kinases are generally characterized by a lysine or an arginine seven residues to
the N-terminal side of the phosphorylated tyrosine. An acidic residue (Asp or Glu) is often found at either three
or four residues to the N-terminal side of the tyrosine. There are a number of exceptions to this rule such as the
tyrosine phosphorylation sites of enolase and lipocortin II.
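The rule in this paragraph can be checked residue by residue. The Python sketch below reports every tyrosine that has a lysine or arginine seven residues before it and an acidic residue three or four residues before it. The function name and the test sequences are hypothetical, and, as the text notes, real phosphorylation sites may not follow the rule.

```python
def tyrosine_kinase_sites(sequence):
    """1-based positions of tyrosines that satisfy the rule from the text:
    K/R at position -7 and D/E at position -3 or -4 relative to the tyrosine."""
    sites = []
    for i, residue in enumerate(sequence):
        if residue != "Y" or i < 7:
            continue  # not a tyrosine, or too close to the N-terminus
        basic = sequence[i - 7] in "RK"
        acidic = sequence[i - 3] in "DE" or sequence[i - 4] in "DE"
        if basic and acidic:
            sites.append(i + 1)
    return sites


# Made-up test sequences, for illustration only.
print(tyrosine_kinase_sites("AAKAADAAAY"))  # acidic residue at -4
print(tyrosine_kinase_sites("RAAAEAAY"))    # acidic residue at -3
print(tyrosine_kinase_sites("RAAAAAAY"))    # no acidic residue: no site
```

Because the rule is short and permissive, a match found this way only marks a candidate site and does not show that the tyrosine is phosphorylated.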
The field marked "PA" in the original PROSITE data format contains the PS00007 pattern.
This image was obtained with the Chimera program and shows the PS00007 pattern in the actin protein. The
region matching PS00007 is highlighted in green.
PS00004 Pattern in Prosite Format:
There have been a number of studies relative to the specificity of cAMP- and cGMP-dependent protein kinases.
Both types of kinases appear to share a preference for the phosphorylation of serine or threonine residues
found close to at least two consecutive N-terminal basic residues. It is important to note that there are quite a
number of exceptions to this rule.
The field marked "PA" in the original PROSITE data format contains the PS00004 pattern.
This image was obtained with the Chimera program and shows the PS00004 pattern in the actin protein. The
region matching PS00004 is highlighted in green.
FOR MOTIFS:
STEP 1:
The protein sequence is pasted into the indicated region and the tick mark is removed from the field "Exclude
motifs with a high probability of occurrence from the scan". "Output Format" is changed to "table" in "Step
3". Then the "Start to Scan" button is clicked.
STEP 2:
"Image 38" represents motifs in the Gallus gallus Actin protein sequence.
RESULT:
PROSITE is a database of protein families and domains. It is based on the observation that, while there is a huge
number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a
limited number of families. Proteins or protein domains belonging to a particular family generally share
functional attributes and are derived from a common ancestor.
It is apparent, when studying protein sequence families, that some regions have been better conserved than
others during evolution. These regions are generally important for the function of a protein and/or for the
maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such
groups of similar sequences, it is possible to derive a signature for a protein family or domain, which
distinguishes its members from all other unrelated proteins. A pertinent analogy is the use of fingerprints by
the police for identification purposes. A fingerprint is generally sufficient to identify a given individual.
Similarly, a protein signature can be used to assign a newly sequenced protein to a specific family of proteins
and thus to formulate hypotheses about its function. PROSITE currently contains patterns and profiles specific
for more than a thousand protein families or domains. Each of these signatures comes with documentation
providing background information on the structure and function of these proteins.
Proteins are generally composed of one or more functional regions, commonly termed domains. Different
combinations of domains give rise to the diverse range of proteins found in nature. The identification of
domains that occur within proteins can therefore provide insights into their function. Pfam also generates
higher-level groupings of related entries, known as clans. A clan is a collection of Pfam entries which are
related by similarity of sequence, structure or profile-HMM.
In this assignment, the Gallus gallus actin protein sequence was examined with PFAM and PROSITE. The family
and clan of the actin protein sequence were found with the PFAM web tool. The motifs and patterns of the actin
protein sequence were found with the PROSITE web tool, and information is given about the patterns found. In
addition, the patterns were located on the 3D image of the protein with the program named "Chimera".
RESOURCES:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808889/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808866/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4280011/
http://vlab.amrita.edu/?sub=3&brch=273&sim=1426&cnt=1
https://biology.stackexchange.com/questions/7785/differences-between-protein-motifs-and-protein-domains
https://webcache.googleusercontent.com/search?q=cache:Ja5YLeqfdPEJ:https://en.wikipedia.org/wiki/Sequence_motif+&cd=3&hl=tr&ct=clnk&gl=tr
https://webcache.googleusercontent.com/search?q=cache:vQ0T0jvIUGEJ:https://en.wikipedia.org/wiki/Protein_family+&cd=3&hl=tr&ct=clnk&gl=tr