
BIOL 5272 M: Data Analysis I: DNA Sequencing (Chromatogram 010.abi)


Interpreting a sequencing chromatogram can yield information about a
protein's function, structure, localisation, and evolution. Bioinformatics
methods, databases, and tools make it possible to do this task at home.
Dedicated software converts the chromatogram into FASTA format. BLAST
algorithms then help to identify the species, locate conserved domains,
establish phylogeny, perform DNA mapping, and carry out other tasks. Other
tools provide protein structure homology modelling, while accessible
databases hold a wealth of information from the literature on the evolution
and biological functions of the protein. Ultimately, the goal is to improve
our understanding and to direct further studies of biological processes.

Sequence 010.abi
Sequencing chromatogram 010.abi was assigned for interpretation. Figure 1
shows the chromatogram as opened with the free software FinchTV
(http://mac.softpedia.com/get/Math-Scientific/FinchTv/shtml). The sequence
corresponds to part of a gene for an as yet unidentified protein, so further
analysis is required to retrieve as much information as possible on the
protein's function, structure, localisation, and evolution.

Figure 1. Chromatogram of sequence 010.abi opened with the free software FinchTV.
Conversion to FASTA format using the same software renders the following
result:
GACATAATGCGATTGGGTTTGATTCTTGGTCAGAATGGACTCGTATATATTAGTAAGTAAAGTGTGACA
CTTAGCCAACTGCCCATAAGCCTTCCTGCTCTTGTAAACTGGAATAACAAGTTCTAAAATGCTTGCACA
AAAATGGTAGAGCTCAGCTTGAGAGAAAAGCTTATTAGCAAGCTGTAAATACTTGACGGCGGAGTCAAC
TGTTAGCTTTGAAGCACCATAACCCTCGACTTCAGCTGCAGATGCCTCTGTTGTAAACTCGCCACTTAC
CATCGGGCAAATCTTACGCAAGGCAGACACATGATCCTTGCTCCACACACCATCATTCCTAGCCACCAA
GGCCTGCATTATCACACCTGCAACCGCCACAGCACACTGTGCAGCTTCTGCCCAAGACTGCATTTCCTG
GTGGGCATCACATAGATGCAATAACCACATTATGTGAAGATCCGGAACAGGAGCAAACGCCATTCCAAG
TTTGTAGAAGCTCTCTGCAGCAGCATATCGATCCATAGCCATCACAGATCCCAAAAGGGCATGCCCGAG
GCTGGCATCAAGAGCAAGAACAAGGCTGTCAGAAAGATGTTTGACTTCAGCCCATGACCAGCGGTTCTC
TGTAAATTTCTCGGGAATAATCAACAGTGTATCATCAGGAAGACCACACTCCCTCAACAAATTGACACT
CTTAGCTTCATCAGCCATTTCGCTTAATGACTGTTGAAGTCGCCGTGCTTCTCCACTCTCTTCTAATGT
ATTATCTGACTTCATATGAGTCACTTGGACATCCGACATGAGTTCTGACAGTGTAATCGTCAGCAAAGC
CCTCAGCCTAGCAGTCTGCATAAAGTATAGCGAGCTCTTAACTAGTATTTGAAGACCAATAACAGCCCT
TTTCCTGACACTGTCATTGCGGTAAACAGCAAGCCGGAGAAGATGGAAGGCTATCTGTTTTAAGAAGCG
ATCATTTTCCCTGGCCCATCAGTGTAGCCCCATGAAGATCAAAGATTCTGTTGAAAATTGGAAAAGAAG
CTTTCCAGAAAGCTAATGACTGGTTTCGGGAGAGAAACTCGTCAGTATTGTAGTAATGCAGTCCAATTT
GCCATAGTCGGTCGCAATATTGTGAGGAGCTGCATGATGAAAAATTTTCAGTGATCTCAAAACTGCAGC
TAACAGTGCACTACATCTCCTCCCAAAGTTCAGTTTTGCTCAAATGGGTGCAGCGATTCTTCTTGTTGA
CTTGGTCTGGAGCTTCATTCGGAAAGAATTGGGCTGACTAAGCCATCTCACTTAGGTACAATCCAGGTG
TGACGATAAGCGA
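
The conversion step can be illustrated with a minimal Python sketch. It is not how FinchTV works internally; it only shows what a FASTA record looks like and how a basic statistic such as GC content can be read off the sequence. The record name `010.abi_fragment`, the line width, and the helper names are assumptions made for illustration, and only the first 30 bases of the read above are used.

```python
import textwrap


def to_fasta(name, sequence, width=70):
    """Return a FASTA record: a '>' header line followed by wrapped sequence lines."""
    cleaned = "".join(sequence.split()).upper()  # drop whitespace and line breaks
    lines = textwrap.wrap(cleaned, width)        # fixed-width sequence lines
    return ">" + name + "\n" + "\n".join(lines) + "\n"


def gc_content(sequence):
    """Return the fraction of G and C bases in the sequence (0.0 for empty input)."""
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        return 0.0
    return (cleaned.count("G") + cleaned.count("C")) / len(cleaned)


# First 30 bases of the read shown above, used as a short illustrative fragment.
fragment = "GACATAATGCGATTGGGTTTGATTCTTGGT"
record = to_fasta("010.abi_fragment", fragment, width=10)
print(record)
print(gc_content(fragment))
```

Pasting the full read into `fragment` would produce a record that can be submitted to VecScreen or BLAST as is.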

Screening for Contamination


A contaminated sequence does not faithfully represent the genetic information of
the source organism because it contains foreign segments such as vectors,
adapters, linkers, and PCR primers. Contamination affects data analysis because
it may elongate or add an open reading frame and so alter the predicted
translation product(s). Tools for screening nucleic acid sequences for
contamination, such as NCBI's VecScreen, are therefore valuable before further
analysis. They help to avoid time and effort wasted on fruitless analyses, wrong
conclusions about sequence significance, misassembly of sequences, false
clustering of expressed sequence tags, delayed release of sequences in public
databases, and, finally, pollution of those databases. Figure 2 indicates that
sequence 010.abi has no identified contamination and is thus ready for further
analyses.

Figure 2. VecScreen detected no contaminants in the given sequence.
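
The idea behind contamination screening can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not VecScreen itself: VecScreen runs a BLAST search against a vector database, whereas the sketch below only looks for exact shared stretches of bases on either strand. The adapter sequence, the `min_len` threshold, and all function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_contaminants(query, contaminants, min_len=12):
    """Report contaminants that share an exact stretch of at least min_len
    bases with the query, on either strand.

    Returns a list of (name, strand, position_in_query) tuples.
    """
    query = query.upper()
    hits = []
    for name, seq in contaminants.items():
        for strand, probe in (("+", seq.upper()), ("-", reverse_complement(seq))):
            for start in range(len(probe) - min_len + 1):
                kmer = probe[start:start + min_len]
                pos = query.find(kmer)
                if pos != -1:
                    hits.append((name, strand, pos))
                    break  # one hit per strand is enough for a report
    return hits


# Hypothetical adapter sequence, invented purely for this example.
known = {"hypothetical_adapter": "ACGTACGTTGCAAGCT"}

clean_read = "GACATAATGCGATTGGGTTTGATTCTTGGT"
dirty_read = "TTTT" + known["hypothetical_adapter"] + "GACATAATGCGATTGGG"

print(find_contaminants(clean_read, known))
print(find_contaminants(dirty_read, known))
```

An exact-match scan like this misses contaminants with sequencing errors, which is one reason an alignment-based tool is preferable in practice.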

Finding the Best Possible Sequence Producing Significant Alignment
BLAST (Basic Local Alignment Search Tool) compares primary biological
sequence information with a library or database of sequences and identifies the
sequences producing significant alignments above a certain threshold. Figure 3
shows the top three possible sequences producing significant alignment to the
query sequence.

Figure 3. The top three possible sequences producing significant alignment to the assigned sequence 010.abi.
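
To make the significance threshold more concrete, the following Python sketch relates a BLAST bit score to an expect (E) value using the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. The query and database lengths used here are illustrative assumptions, not values taken from the search above, and real BLAST applies corrected effective lengths, so the numbers are only indicative.

```python
import math


def log10_expect(bit_score, query_len, db_len):
    """Return log10 of the expect value E = m * n * 2**(-S')."""
    return math.log10(query_len) + math.log10(db_len) - bit_score * math.log10(2)


def expect_value(bit_score, query_len, db_len):
    """Return the expect value, reported as 0.0 when it is too small for a float."""
    log_e = log10_expect(bit_score, query_len, db_len)
    if log_e < -300:  # far below what a double-precision float can represent usefully
        return 0.0
    return 10.0 ** log_e


# Assumed lengths: a query of about 1,300 bases and a database of 10**10 bases.
print(log10_expect(2093, 1300, 10**10))
print(expect_value(2093, 1300, 10**10))
print(expect_value(30, 1000, 10**6))
```

With a bit score in the thousands the expect value is so small that it is displayed as 0.0, whereas a bit score of 30 against even a small database gives an E value close to 1, meaning such a hit is expected by chance.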
The three sequences share the same maximum score (2093), total score (2093),
query cover (96%), E value (0.0), and identity (97%) but have different
accessions, which calls for further analysis. Table 1 shows that although the
three sequences differ in their open reading frames, the most probable reading
frame is identical for all three. Its translation is as follows:
MENNNLGLRFRKLPRQPLALPKLDPLLDENLEQWPHLNQLVQCY
GTEWVKDVNKYGHYENIRPDSFQTQIFEGPDTDTETEIRLASARSATIEEDVASISGR
PFSDPGSSKHFGQPPLPAYEPAFDWENERAMIFGQRTPESPAASYSSGLKISVRVLSL
AFQSGLVEPFFGSIALYNQERKEKLSEDFYFQIQPTEMQDAKLSSENRGVFYLDAPSA
SVCLLIQLEKTATEEGGVTSSVYSRKEPVHLTEREKQKLQVWSRIMPYRESFAWAVVP
LFDNNLTTNTGESASPSSPLAPSMTASSSHDGVYEPIAKITSDGKQGYSGGSSVVVEI
SNLNKVKESYSEESIQDPKRKVHKPVKGVLRLEIEKHRNGHGDFEDLSENGSIINDSL
DPTDRLSDLTLMKCPSSSSGGPRNGCSKWNSEDAKDVSRNLTSSCGTPDLNCYHAFDF
CSTTRNEPFLHLFHCLYVYPVAVTLSRKRNPFIRVELRKDDTDIRKQPLEAIYPREPG
VSLQKWVHTQVAVGARAASYHDEIKVSLPATWTPSHHLLFTFFHVDLQTKLEAPRPVV
VGYASLPLSTYIHSRSDISLPVMRELVPHYLQESTKERLDYLEDGKNIFKLRLRLCSS
LYPTNERVRDFCLEYDRHTLQTRPPWGSELLQAINSLKHVDSTALLQFLYPILNMLLH
LIGNGGETLQVAAFRAMVDILTRVQQVSFDDADRNRFLVTYVDYSFDDFGGNQPPVYP
GLATVWGSLARSKAKGYRVGPVYDDVLSMAWFFLELIVKSMALEQARLYDHNLPTGED
VPPMQLKESVFRCIMQLFDCLLTEVHERCKKGLSLAKRLNSSLAFFCYDLLYIIEPCQ
VYELVSLYMDKFSGVCQSVLHECKLTFLQIISDHDLFVEMPGRDPSDRNYLSSILIQE
LFLSLDHDELPLRAKGARILVILLCKHEFDARYQKAEDKLYIAQLYFPFVGQILDEMP
VFYNLNATEKREVLIGVLQIVRNLDDTSLVKAWQQSIARTRLYFKLMEECLILFEHKK
AADSILGGNNSRGPVSEGAGSPKYSERLSPAINNYLSEASRQEVRLEGTPDNGYLWQR
VNSQLASPSQPYSLREALAQAQSSRIGASAQALRESLHPILRQKLELWEENVSATVSL
QVLEITENFSSMAASHNIATDYGKLDCITTILTSFFSRNQSLAFWKAFFPIFNRIFDL
HGATLMARENDRFLKQIAFHLLRLAVYRNDSVRKRAVIGLQILVKSSLYFMQTARLRA
LLTITLSELMSDVQVTHMKSDNTLEESGEARRLQQSLSEMADEAKSVNLLRECGLPDD
TLLIIPEKFTENRWSWAEVKHLSDSLVLALDASLGHALLGSVMAMDRYAAAESFYKLG
MAFAPVPDLHIMWLLHLCDAHQEMQSWAEAAQCAVAVAGVIMQALVARNDGVWSKDHV
SALRKICPMVSGEFTTEASAAEVEGYGASKLTVDSAVKYLQLANKLFSQAELYHFCAS
ILELVIPVYKSRKAYGQLAKCHTLLTNIYESILDQESNPIPFIDATYYRVGFYGEKFG
KLDRKEYVYREPRDVRLGDIMEKLSHIYESRMDSNHILHIIPDSRQVKAEDLQAGVCY
LQITAVDAVMEDEDLGSRRERIFSLSTGSVRARVFDRFLFDTPFTKNGKTQGGLEDQW
KRRTVLQTEGSFPALVNRLLVTKSESLEFSPVENAIGMIETRTTALRNELEEPRSSDG
DHLPRLQSLQRILQGSVAVQVNSGVLSVCTAFLSGEPATRLRSQELQQLIAALLEFMA
VCKRAIRVHFRLIGEEDQEFHTQLVNGFQSLTAELSHYIPAILSEL
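
How an open reading frame is picked out of a nucleotide sequence can be sketched as follows. This is a minimal illustration assuming the standard genetic code and a simple "longest ATG-to-stop" rule; the ORF tool used for Table 1 may apply different criteria, and the function names are invented for this example.

```python
BASES = "TCAG"
# Standard genetic code, ordered by first, second, and third base in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_orf(dna):
    """Return (protein, strand, frame) for the longest ATG-to-stop ORF in six frames."""
    best = ("", "+", 0)
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Drop the last piece: it is not terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start > len(best[0]):
                    best = (segment[start:], strand, frame)
    return best


print(translate("ATGGAAAACAACAAC"))   # codons chosen to spell the first residues above
print(longest_orf("CCATGGCTTAAGG"))
```

Applied to each of the three complete cds entries, a routine of this kind would be expected to return the same longest protein if the coding regions are identical.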

Table 1. A comparison between the best open reading frames (ORFs) for
the top three possible sequences producing significant alignment to the
query sequence.
Columns: Top 3 Possible Sequences | Open Reading Frames | The Best Open Reading Frame
Rows: 1*, 2*, 3* [ORF diagrams not reproduced]

1* Arabidopsis thaliana DOCK family guanine nucleotide exchange factor
SPIKE1 mRNA, complete cds = Arabidopsis thaliana putative guanine
nucleotide exchange factor (SPK1) mRNA, complete cds
2* Arabidopsis thaliana mRNA for hypothetical protein, clone: RAFL16-07F02
3* Arabidopsis thaliana putative guanine nucleotide exchange factor
(SPK1) mRNA, complete cds

The Identification and the Characterisation of the Query Protein from the
Identical ORF of the Top Three Possible Sequences Producing Significant
Alignment to the Original Query Sequence 010.abi
Because all three possible sequences producing significant alignment to the
query sequence share the same open reading frame, the next task is to determine
whether the amino acid sequence encoded by that open reading frame produces
significant alignments to known proteins. Putative conserved domains and the
best sequences producing significant alignments to the open reading frame amino
acid sequence can be identified using the BLASTP 2.2.30+ program, as seen in
Figure 4.

Figure 4. The identification of the query protein from the identical open
reading frame of the top three possible sequences producing significant
alignment to the query sequence
BLASTP shows that DOCK family guanine nucleotide exchange factor SPIKE 1 from
Arabidopsis thaliana (accession: NP_193367.7) produces the most significant
alignment, with a maximum score of 3793, a total score of 3793, 100% query
cover, an E value of 0.0, and 100% identity.
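
The identity and query cover figures reported by BLAST can be reproduced from an alignment with a short Python sketch. It assumes identity is the number of matching columns divided by the alignment length (gaps included) and query cover is the share of the query that falls inside the aligned region. The toy alignment and coordinates below are invented for illustration.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        return 0.0
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def query_cover(query_len, aligned_start, aligned_end):
    """Percentage of the query inside the aligned region (1-based, inclusive)."""
    return 100.0 * (aligned_end - aligned_start + 1) / query_len


# Toy gapped alignment: 4 matching columns out of 6.
print(percent_identity("ACGT-A", "ACCTTA"))
# A 100-residue query aligned from position 3 to position 98.
print(query_cover(100, 3, 98))
```

A full-length, error-free match of the translated ORF to its own database entry gives 100% for both measures, which is what the BLASTP result above reports.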

DOCK family guanine nucleotide exchange factor SPIKE 1 from Arabidopsis thaliana (accession: NP_193367.7)
Putative guanine nucleotide exchange factor
Gene Names
SPK1 (Ordered Locus Names: At4g16340)
Also known as: DL4200C; FCAALL.346; SPIKE1; SPK1

Summary
The mutant is seedling lethal, with trichome, leaf-shape, and cotyledon defects;
putative cytoskeletal protein

Proteomes
UP000006548: Chromosome 4

Figure 5. The genomic context of the SPK1 gene on chromosome 4 of Arabidopsis thaliana.

The Arabidopsis Information Resource (TAIR): AT4G16340

Functions

GTPase binding
GTP binding
Guanyl-nucleotide exchange factor activity

Located in
Cytosol, plasma membrane, extrinsic component of membrane, endoplasmic
reticulum exit site, nucleus

Domain hits
DHR2_DOCK (accession: cd11684): Dock Homology Region 2, a GEF
domain, of Dedicator of Cytokinesis proteins

DHR2 is one of the two domains of DOCK proteins, a family of atypical guanine
nucleotide exchange factors (GEFs) that lack the usual Dbl homology (DH) domain.
As GEFs, they activate the small GTPases Rac and Cdc42 by exchanging bound GDP
for free GTP. DHR2 carries the catalytic GEF activity for Rac and/or Cdc42.
Marchler-Bauer A et al. (2011), "CDD: a Conserved Domain Database for the functional
annotation of proteins.", Nucleic Acids Res. 39(D)225-9.

Marchler-Bauer A et al. (2009), "CDD: specific functional annotation with the
Conserved Domain Database.", Nucleic Acids Res. 37(D)205-10.

Marchler-Bauer A, Bryant SH (2004), "CD-Search: protein domain annotations on the
fly.", Nucleic Acids Res. 32(W)327-331.

Marchler-Bauer A et al. (2013), "CDD: conserved domains and protein
three-dimensional structure.", Nucleic Acids Res. 41(D1):D384-52.

C2_DOCK180_related (accession: cd08679): C2 domains found in Dedicator
of CytoKinesis1 (DOCK 180) and related proteins
Dock180 was first identified as a c-Crk-interacting protein important in actin
cytoskeletal changes. It is now known that many C2 domains are
calcium-dependent membrane-targeting modules that bind a wide variety of
substances. Most C2 domain proteins are either signal transduction enzymes, such
as protein kinase, or membrane trafficking proteins, such as synaptotagmin 1.

Cellular signaling of Dock family proteins in neural function. Cell. Signal. 2010 Feb; 22(2):175-182
[Regulation of cell morphology and motility by Dock family proteins]. Seikagaku 2009 Aug; 81(8):711-716
Structural basis of membrane targeting by the Dock180 family of Rho family guanine exchange factors
(Rho-GEFs). J. Biol. Chem. 2010 Apr 23; 285(17):13211-13222

Ded_cyto (accession: pfam06920): Dedicator of cytokinesis
Ded_cyto represents a conserved region of about 200 residues found in dedicator
of cytokinesis proteins, which are potential guanine nucleotide exchange factors
that activate several small GTPases by exchanging bound GDP for free GTP.

DOCK-C2 (accession: pfam14429): C2 domain in Dock 180 and Zizimin proteins
These proteins are atypical GTP/GDP exchange factors for the GTPases Rac and
Cdc42 and are implicated in phagocytosis and cell migration.

Structural basis of membrane targeting by the Dock180 family of Rho family guanine exchange factors
(Rho-GEFs). J. Biol. Chem. 2010 Apr 23; 285(17):13211-13222
Identification of novel families and classification of the C2 domain superfamily elucidate the origin and
evolution of membrane targeting activities in eukaryotes. Gene 2010 Dec 1; 469(1-2):18-30

Model Structure
(Provided by ModBase)

template:
the best match after the second BLAST search
copy the accession number into NCBI for more information on the query protein
TAIR (bottom of the page) -- Arabidopsis thaliana (function, localisation)
UniProt -- enter the protein name