Lecture 02: Biological Databases

Lecture 02
Bioinformatics:
Introduction to Biological Databases
Source: Jin Xiong, Essential Bioinformatics, Cambridge, 2006

Databases
Databases originally took the form of flat files.
Retrieval becomes increasingly difficult as the volume of data grows.
Database management systems are therefore needed:
Relational databases
Use a set of tables (relations), each made up of columns and rows
Object-oriented databases
Example: Flat File
Name, State, Course Number, Course Name
Amir, Brebes, FisDas 11, Fisika Dasar
Siti, Sidoarjo, Bio 12, Biologi
Tuti, Makasar, Chem 21, Kimia Dasar
Budi, Ambon, Math 21, Matematik
• Requires a minimum amount of database design and memory
• Searching means reading through the whole file
• Database management systems (DBMS)
• Manageable for a small database
• Inefficient for large data sets
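Why a flat file becomes inefficient can be seen in a minimal sketch. It assumes the comma-separated layout of the example above; the function name `find_in_flat_file` is hypothetical and chosen only for illustration. Every lookup walks the records from the top, so the cost grows with the size of the file.

```python
# Flat-file records from the example: one comma-separated line per student.
FLAT_FILE = """\
Name, State, Course Number, Course Name
Amir, Brebes, FisDas 11, Fisika Dasar
Siti, Sidoarjo, Bio 12, Biologi
Tuti, Makasar, Chem 21, Kimia Dasar
Budi, Ambon, Math 21, Matematik
"""


def find_in_flat_file(text, name):
    """Scan the records one by one until the requested name is found."""
    lines = text.strip().splitlines()
    header = [field.strip() for field in lines[0].split(",")]
    records_read = 0
    for line in lines[1:]:
        records_read += 1
        values = [value.strip() for value in line.split(",")]
        record = dict(zip(header, values))
        if record["Name"] == name:
            return record, records_read
    return None, records_read


record, records_read = find_in_flat_file(FLAT_FILE, "Budi")
print(record["Course Name"], "found after reading", records_read, "records")
```

The last record is only reached after every earlier record has been read, which is tolerable for four students but not for millions of sequence entries.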
Example: Relational Database
The same records as in the flat file:
Name, State, Course Number, Course Name
Amir, Brebes, FisDas 11, Fisika Dasar
Siti, Sidoarjo, Bio 12, Biologi
Tuti, Makasar, Chem 21, Kimia Dasar
Budi, Ambon, Math 21, Matematik
stored as three linked tables:

Student#   Name   State
1          Amir   Brebes
2          Siti   Sidoarjo
3          Tuti   Makasar
4          Budi   Ambon

Student#   Course
1          FisDas 11
2          Bio 12
3          Chem 21
4          Math 21

Course      Course Name
FisDas 11   Fisika Dasar
Bio 12      Biologi
Chem 21     Kimia Dasar
Math 21     Matematik

• Field (column): individual fields
• Value (row)
• Attribute
• Structured Query Language (SQL)
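The three-table layout and an SQL query over it can be sketched with Python's built-in `sqlite3` module. The table and column names (`student`, `enrollment`, `course`, `student_no`) are assumptions made for this illustration; the slide only gives the column headings.

```python
import sqlite3

# In-memory database holding the three tables from the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE student (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT);
    CREATE TABLE enrollment (student_no INTEGER, course TEXT);
    CREATE TABLE course (course TEXT PRIMARY KEY, course_name TEXT);
    """
)
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Amir", "Brebes"), (2, "Siti", "Sidoarjo"),
     (3, "Tuti", "Makasar"), (4, "Budi", "Ambon")],
)
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?)",
    [(1, "FisDas 11"), (2, "Bio 12"), (3, "Chem 21"), (4, "Math 21")],
)
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [("FisDas 11", "Fisika Dasar"), ("Bio 12", "Biologi"),
     ("Chem 21", "Kimia Dasar"), ("Math 21", "Matematik")],
)


def course_names_for(name):
    """Join the three tables to find the course names taken by one student."""
    rows = conn.execute(
        """
        SELECT course.course_name
        FROM student
        JOIN enrollment ON enrollment.student_no = student.student_no
        JOIN course ON course.course = enrollment.course
        WHERE student.name = ?
        """,
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


print(course_names_for("Siti"))
```

The query names only the fields it needs, and the database engine decides how to locate the matching rows; nothing in the calling code reads the tables from top to bottom.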
Example: Relational Database
A limitation of relational databases is that the tables used do not describe complex hierarchical relationships between data items.
Object-Oriented Databases
• In object-oriented programming languages, an object can be regarded as a unit that combines data with the mathematical routines that operate on those data.
• The database is structured so that objects are linked by a set of pointers that define predetermined relationships between the objects.
• Such databases are usually built with an object-oriented programming language such as C++.
Example: Object-Oriented Database
• Three objects are constructed and linked by pointers.
• Retrieving information depends on navigating through the objects with the help of the pointers.
• For simplicity, some pointers are not shown here.
Example: Object-Oriented Database
• More flexible
• Has a weaker mathematical foundation than relational databases
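Navigation by pointers can be sketched in a few lines. This is only an illustration: the slides mention C++, whereas the sketch uses Python object references in place of pointers, and the class names `Student` and `Course` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Course:
    number: str
    name: str


@dataclass
class Student:
    name: str
    state: str
    # Each list element is a reference (the "pointer") to a Course object.
    courses: list = field(default_factory=list)


fisdas = Course("FisDas 11", "Fisika Dasar")
amir = Student("Amir", "Brebes", [fisdas])


def course_names(student):
    """Follow the references from a student to the linked course objects."""
    return [course.name for course in student.courses]


print(course_names(amir))
```

No join is needed: the relationship is stored in the object itself, which is what makes the model flexible and also what ties every query to the links that were defined in advance.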
Biological Databases
Biological databases still use all three of the database models above.
By content, biological databases can be grouped into:
Primary databases
Original biological data submitted by researchers.
E.g.: GenBank and the Protein Data Bank (PDB)
Secondary databases
Data that have been computationally processed, with additional information added manually.
Translated protein sequence databases containing functional annotation belong to this category.
E.g.: SWISS-PROT; Protein Information Resource (PIR)
Specialized databases
Serve a particular research interest.
E.g.: FlyBase, the HIV sequence database, and the Ribosomal Database Project
Primary Databases
• The largest databases store raw nucleic acid sequence data generated and submitted by researchers worldwide:
– GenBank
– The European Molecular Biology Laboratory (EMBL) database
– The DNA Data Bank of Japan (DDBJ)
• All are freely available
• A minimal level of annotation
• Some data from literature published in the 1980s were entered manually by database management staff.
Primary Databases (2)
• Most scientific journals now require deposition in GenBank, EMBL, or DDBJ to ensure that fundamental molecular data are freely available.
• The three databases exchange data with one another and together form the International Nucleotide Sequence Database Collaboration.
• Although the data are the same, the presentation formats differ slightly.
• Only the PDB provides 3-D structures of biological macromolecules, namely the atomic coordinates of macromolecules (both proteins and nucleic acids) obtained from X-ray crystallography and NMR.
• A flat-file format is used to present the protein name, authors, experimental details, secondary structure, cofactors, and atomic coordinates.
• The PDB web interface also provides tools for simple image manipulation.
Secondary Databases
• Sequence annotation information in the primary databases is often minimal.
• To turn the raw sequence information into more sophisticated biological knowledge, much postprocessing of the sequence information is needed.
• This creates the need for secondary databases.
Databases in Bioinformatics
Sequence databases
Sequence analysis
Functional genomics
Literature databases
Structural databases
Metabolic pathway databases
Specialized databases

Pitfalls of Biological Databases
Errors in sequence databases
Redundancy in the primary sequence databases
False or incomplete gene annotations
Errors in Nucleotide Sequences
Sequencing errors
Frame-shifts
Contamination with sequences from cloning vectors
Exceptional care is needed for sequences produced before the 1990s
Redundancy
• Repeated submission of identical or overlapping sequences by the same or different authors
• Revision of annotations
• Dumping of expressed sequence tag (EST) data
• Poor database management

Bioinformatics Databases
• Growing steadily in number
• Growing rapidly in size
• Specialization
– Which genomes they contain (mouse, human, all of them)
– Which types of information about the genome they contain
• Contain information such as
– Sequences: of bases and of residues
– Structure: 3D conformations of known proteins
– Families: which sets of genes are known to be homologous
– Annotations: which processes each gene is involved in
– And lots of other information
The definitive source
• More than 1300 databases
• http://nar.oxfordjournals.org/content/39/suppl_1.toc
DNA Sequence Databases
• Main repositories:
– GenBank (US) (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
– EMBL (Europe) (http://www.ebi.ac.uk/embl/)
– DDBJ (Japan) (http://www.ddbj.nig.ac.jp/)
• Primary databases
• The DNA sequences held by the three are identical

EMBL Database
Number of entries (current: 199,575,971)
Graphs created on 22 November 2010
http://www.ebi.ac.uk/embl/Services/DBStats/
ENTREZ (www.ncbi.nlm.nih.gov)
• NCBI (USA): National Center for Biotechnology Information
• Entrez: http://www.ncbi.nlm.nih.gov/Entrez/
• PubMed: the biomedical literature
• Nucleotide: sequence database (GenBank)
• Protein: sequence database
• Structure: three-dimensional macromolecular structures
• Genome: complete genome assemblies
• PopSet: population study data sets
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: gene expression and microarray datasets
• 3D Domains: domains from Entrez Structure
• UniSTS: markers and mapping data
• SNP: single nucleotide polymorphisms
• CDD: conserved domains
• Journals: journals in Entrez
• UniGene: gene-oriented clusters of transcript sequences
• PMC: full-text digital archive of life sciences journal literature

PubMed is…
• National Library of Medicine's search service
• >20 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via side bar)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Sequence Databases
Annotated sequence databases
SWISS-PROT, GenBank, etc.
Usage: identifying function, retrieving information
Low-annotation sequence databases
EST databases, high-throughput genome sequences
Usage: discovery of new genes
General Protein Databases
• SWISS-PROT
– Manually curated
– High-quality annotations, less data
• GenPept/TrEMBL
– Translated coding sequences from GenBank/EMBL
– Few annotations, more up to date
• PIR
– Phylogenetic-based annotations
• All three have now combined their efforts to form UniProt (http://www.uniprot.org)
Low-Annotation Databases
ESTs (Expressed Sequence Tags)
Low-quality sequences generated by high-volume sequencing of the 3’ or 5’ ends of cDNAs
High-throughput genome sequences
Produced by mass sequencing of genomic DNA
Non-Redundant Databases
Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Examples:
NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
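Combining several sources into one non-redundant set can be sketched as below. This is a simplified illustration, not the procedure the database providers use: it only collapses sequences that are exactly identical (ignoring letter case), and the entry identifiers and short sequences are made up for the example.

```python
def build_nonredundant(*databases):
    """Merge {identifier: sequence} mappings, one entry per distinct sequence.

    Returns a mapping from each distinct sequence (upper case) to the list of
    identifiers under which it was submitted.
    """
    merged = {}
    for database in databases:
        for identifier, sequence in database.items():
            merged.setdefault(sequence.upper(), []).append(identifier)
    return merged


# Hypothetical entries standing in for records from two source databases.
source_one = {"entry_A": "ATGGCCAAG", "entry_B": "ttgacatag"}
source_two = {"entry_C": "atggccaag", "entry_D": "GGGTTTAAA"}

nr = build_nonredundant(source_one, source_two)
print(len(nr), "distinct sequences from 4 entries")
```

Overlapping or near-identical sequences, which the redundancy discussion also mentions, would need sequence comparison rather than exact matching.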
Sequence & Structure Databases
• PDB (Protein Data Bank)
– Stores three-dimensional atomic coordinates for biological molecules, including proteins and nucleic acids
– Data obtained by X-ray crystallography, NMR, or computer modelling
– http://www.rcsb.org/pdb/
• MMDB (Molecular Modeling Database)
– Over 28,000 3D macromolecular structures, including proteins and polynucleotides
– (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
• SCOP (Structural Classification of Proteins)
– Classification of proteins according to structural and evolutionary relationships

File Formats
• GenBank/GB: GenBank flat-file format
• NBRF format
• EMBL: EMBL flat-file format
• Swiss-Prot
• GCG: single-sequence format of GCG software
• DNAStrider: for a common Mac program
• Pearson/FASTA: a common format used by FASTA programs and others
• Phylip3.2: sequential format for Phylip programs
• Phylip: interleaved format for Phylip programs (v3.3, v3.4)
• Plain/Raw: sequence data only (no name, document, numbering)
• MSF: multiple sequence format used by GCG software
• PAUP's multiple sequence (NEXUS) format
• ASN.1: format used by NCBI
EMBL Format
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
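Each line of an EMBL entry begins with a two-letter line code (ID, AC, DE, OS, OC, and so on), and XX lines act as spacers. A minimal sketch of reading an entry by line code follows; it uses a few lines of the entry above, assumes the usual flat-file layout of a two-letter code followed by spaces, and is not a full parser (feature-table and sequence lines would need separate handling). The function name `group_by_line_code` is hypothetical.

```python
# A few header lines from the entry shown above.
EMBL_LINES = """\
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
""".splitlines()


def group_by_line_code(lines):
    """Collect the text of each line under its two-letter line code."""
    fields = {}
    for line in lines:
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not content:
            continue  # XX lines only separate blocks
        fields.setdefault(code, []).append(content)
    # Lines that repeat a code (for example OC) are continuation lines.
    return {code: " ".join(parts) for code, parts in fields.items()}


record = group_by_line_code(EMBL_LINES)
accessions = [acc.strip() for acc in record["AC"].split(";") if acc.strip()]
print(accessions, "|", record["OS"])
```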
GenBank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
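The sequence lines under ORIGIN carry a position number followed by blocks of ten bases, and the BASE COUNT line summarizes the composition of the whole entry. A small sketch of recovering the bare sequence and counting bases is given below; it uses only the first two sequence lines of the entry above, so its counts describe those 120 bases and not the full 5028 bp record. The function name `sequence_from_origin` is hypothetical.

```python
from collections import Counter

# First two sequence lines of the ORIGIN block shown above.
ORIGIN_LINES = [
    "  1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    " 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]


def sequence_from_origin(lines):
    """Drop the leading position number and the spaces between base blocks."""
    return "".join("".join(line.split()[1:]) for line in lines)


sequence = sequence_from_origin(ORIGIN_LINES)
base_count = Counter(sequence)
print(len(sequence), "bases:", dict(base_count))
```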
Swissprot format
Specialized Sequence Databases
• Focus on a specific type of sequence
• Sequences are often modified or specially annotated
• Usage depends on the database
• Examples:
– Ribosomal RNA databases
– Immunology databases

Protein Domain Databases
• Pfam (http://www.sanger.ac.uk/Software/Pfam/)
– Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
• SMART (Simple Modular Architecture Research Tool)
– Identification and annotation of genetically mobile domains and the analysis of domain architectures
– (http://smart.embl-heidelberg.de/help/smart_about.shtml)
• CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
– Combines the SMART and Pfam databases
– Easier and quicker search

Sequence Motif Databases
• ScanProsite (http://www.expasy.org/prosite) and PRINTS (http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
• Store conserved motifs occurring in nucleic acid or protein sequences
• Motifs can be stored as consensus sequences, alignments, or statistical representations such as residue frequency tables
Ribosomal RNA Databases
• RDP (Michigan State University, USA)
– http://rdp.cme.msu.edu/html/
• rRNA database (University of Antwerp, Belgium)
– http://rrna.uia.ac.be/
• Ribosomal RNA sequences are pre-aligned according to their secondary structure
• Usage: creating data sets for molecular phylogeny, especially for microbial taxonomy and identification
Immunological Sequence Databases
• The Kabat Database of Sequences of Proteins of Immunological Interest
– www.hgmp.mrc.ac.uk/Bioinformatics/Databases/kabatp-help.html
– Sequences are classified according to antigen specificity and are available in pre-aligned format
• The Immunogenetics database (IMGT)
– http://imgt.cnusc.fr:8104/
– Focuses on immunoglobulins, T-cell receptors, and MHC genes

Genome Databases
• Focus on one organism or group of organisms:
– Colibase (E. coli and related species) http://colibase.bham.ac.uk/
– GDB (human) http://www.gdb.org/
– FlyBase (Drosophila) http://flybase.bio.indiana.edu/
– WormBase (C. elegans) http://wormbase.org
– AtDB (Arabidopsis) http://www.arabidopsis.org
– SGD (S. cerevisiae) http://genome-www.stanford.edu/Saccharomyces/
Expression Databases
RNA expression
Results of microarray experiments measuring the change in specific mRNA content under certain conditions
ArrayExpress (EBI) and GEO (NCBI)
Not user-friendly
Proteome databases
2D gel electrophoresis images representing the protein content of a cell or tissue under specific conditions
SWISS-2DPAGE at http://us.expasy.org/ch2d/

Other Database Types
• Literature
– MEDLINE (http://ncbi.nlm.nih.gov/PubMed/)
– HighWire (http://www.highwire.org)
• Variation
– dbSNP (http://ncbi.nlm.nih.gov/SNP/)
– HGBase (http://hgbase.interactiva.de)
• Metabolic pathways
– KEGG (http://kegg.genome.ad.jp/kegg/)
– WIT (http://wit.mcs.anl.gov/WIT2)
• Organisms and nomenclature
– Taxonomies (e.g.: http://ncbi.nlm.nih.gov/Taxonomy/)
– Mendel (http://mbclserver.rutgers.edu/CPGN)
Methods for Accessing Data
local installation
screen scraping
BioPerl
FTP sites
Local Installations
SRS
Need to obtain a license from Lion Bioscience
Download data from FTP sites
Ensembl
"framework to organize biology around the sequences of large genomes"
www.ensembl.org
Screen Scraping
URL spoofing
construction of URLs that replicate the query
html parsing
extraction of results from html pages returned by query
Requirements
html module
knowledge of query mechanism
Method NOT advocated by most data providers
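Constructing a URL that replicates a query can be sketched as follows. Because scraping HTML pages is discouraged, the sketch targets NCBI's E-utilities `efetch` service, a programmatic interface that NCBI provides; the endpoint and its `db`, `id`, `rettype`, and `retmode` parameters come from general knowledge of E-utilities and are not part of the original slides. The helper name `efetch_url` is hypothetical, and the code only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Base address of the E-utilities efetch service (assumption, see lead-in).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, accession, rettype="fasta"):
    """Build the query URL for one record; no network access happens here."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


# U49845 is the accession of the GenBank example entry in these slides.
url = efetch_url("nucleotide", "U49845")
print(url)
```

Before running such queries in bulk, the provider's usage policy should be checked.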

BioPerl
BioPerl is a collection of modules that facilitates the development of Perl scripts for bioinformatics applications.
www.bioperl.org
SWISSPROT
European/Swiss Bioinformatics Institute 1986
Highly accurate, hand curated resource

http://www.ebi.ac.uk/swissprot/
Aims:
Have a high level of annotation
Often by the people who have been working with the gene
Have a low level of redundancy
Have a high level of integration with other databases

TrEMBL
SWISSPROT’s Big Brother

http://www.ebi.ac.uk/trembl/
All genes which have been left out of SWISSPROT

Computer annotated rather than human annotated
PROSITE
http://ca.expasy.org/prosite/
Families of proteins
Can search using regular expressions
Similar to Unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as:
[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns
So we can search over families
1574 documents about 1308 different patterns
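The interpretation of the example pattern can be checked by translating it into an ordinary regular expression. The converter below is a minimal sketch that covers only the elements used in the example (square brackets, `x`, repeat counts in parentheses, and curly-brace exclusions); other PROSITE syntax, such as the `<` and `>` terminus anchors, is not handled. The function name and the test peptides are made up for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into Python regex syntax."""
    regex = pattern.replace("-", "")                    # separators carry no meaning
    regex = regex.replace("x", ".")                     # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {ED} = anything but E or D
    regex = regex.replace("(", "{").replace(")", "}")   # x(4) = four repeats
    return regex


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)

# Hypothetical peptides: the first ends the motif with P, the second with E.
matching = re.search(regex, "GGAQVKLMNPTT")
rejected = re.search(regex, "AQVKLMNE")
print(matching.group(0) if matching else None, rejected)
```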

PFAM
http://pfam.sanger.ac.uk/
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs
Hidden Markov Models (see later lecture)
Given a new sequence
Find families which the sequence might fit into
Sequence Coverage
11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)

SCOP and CATH
SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
Structural Classification of Proteins
Hierarchically ordered and manually curated
38221 PDB entries
110800 domains
CATH (http://www.cathdb.info/)
Classification of protein domain structures
124 folds
226 superfamilies
1148 sequence families
14473 domains
Using Databases with the FASTA Format
May need to know the FASTA format
For residue sequences
First line must start with a > sign
First line contains identification information for gene
Other lines contain the residue sequence
OK to have a ragged right format
Usually OK to have lower case (but check)
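The rules above can be sketched as a small parser. It accepts ragged line lengths and lower-case residues (which it normalizes to upper case), and it splits an NCBI-style header on the `|` character. The record used is the glutathione peroxidase example from these slides; the function names are hypothetical.

```python
FASTA_TEXT = """\
>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
slcgttvrdytqmndlqrrlgprglvvlgfpcnqfghqenakneeilncl
yvrpgggfepnfmlfekcevngekahplfaflrevlptpsddatalmtdp
kfitwspvcrndvswnfekflvgpdgvpvrrysrrfltidiepdietlls
qgasa
"""


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a new record begins
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:].strip(), []
        elif header is not None:          # sequence line of the current record
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def split_ncbi_header(header):
    """Split 'gi|...|sp|...|NAME description' into its fields and description."""
    identifier, _, description = header.partition(" ")
    return identifier.split("|"), description


records = parse_fasta(FASTA_TEXT)
fields, description = split_ncbi_header(records[0][0])
print(fields, description, len(records[0][1]))
```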

Example FASTA Format
>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
slcgttvrdytqmndlqrrlgprglvvlgfpcnqfghqenakneeilncl
yvrpgggfepnfmlfekcevngekahplfaflrevlptpsddatalmtdp
kfitwspvcrndvswnfekflvgpdgvpvrrysrrfltidiepdietlls
qgasa
Header fields:
gi|121664: GenInfo number, assigned by the NCBI
sp: indicates that SWISS-PROT was the source database
P00435|GSHC_BOVIN: SWISS-PROT identifiers (accession number and entry name)
GLUTATHIONE PEROXIDASE: molecule name
Analyzing Results Using Perl Scripts
Database servers now perform:
Increasingly specific analysis of your results
But you will eventually need to do your own analysis
The ideal programming language is Perl
Designed to manipulate text and files
Can be used to manipulate strings
Will be used in the coursework
Perl Tutorial
PDB Format
The PDB format consists of a collection of fixed format records that describe :
Atomic coordinates,
Chemical and biochemical features
Experimental details of the structure determination
Some structural features such as
Secondary structure assignments,
Hydrogen bonding
Biological assemblies
Active sites
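Because PDB records are fixed-format, fields are read by column position rather than by splitting on spaces. The sketch below parses one ATOM record using the column ranges of the standard PDB coordinate format (serial number in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, 47-54). The record itself is made up for illustration and built piece by piece so that the column layout is visible; the function name is hypothetical.

```python
# An illustrative ATOM record (values are invented), assembled field by field.
SAMPLE_ATOM = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "MET"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "  27.340"    # columns 31-38: x coordinate
    "  24.430"    # columns 39-46: y coordinate
    "   2.614"    # columns 47-54: z coordinate
)


def parse_atom_record(line):
    """Extract selected fields from an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "residue_name": line[17:20].strip(),
        "chain_id": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


atom = parse_atom_record(SAMPLE_ATOM)
print(atom)
```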
The challenge
(Boguski, 1999)
In 1995, the number of genes in the database started to exceed the number of papers on molecular biology and genetics in the literature!
Data Types
Primary data: sequence (stored in primary databases)
– AATGCGTATAGGC (DNA)
– DMPVERILEALAVE (amino acid)
Secondary data: secondary protein structure, e.g., alpha-helices, beta-strands (stored in secondary databases)
– "motifs": regular expressions, blocks, profiles, fingerprints
Tertiary data: tertiary protein structure (stored in tertiary databases)
– atomic co-ordinates
– domains, folding units

• By data source, bioinformatics databases can be divided into:
– primary databases
– secondary databases
– composite/tertiary databases
– literature databases

Primary databases
• Data repository
• Data come from direct submission by researchers ("junk", inconsistencies, etc.)
• Raw data
• Redundant
• Continuously updated (daily/weekly)
Examples:
– NCBI GenBank (DNA sequences and their translations)
– UniProt/SwissProt (protein sequences)
– PDB (crystal structures of proteins and macromolecules)
– dbEST (mRNA sequence fragments)
– dbGSS (genome survey sequences)
– Trace Archive (DNA sequencing output data)
– SAGEMap, GEO (microarray experiment data)
Secondary databases
• Knowledge repository
• Data are derived from the primary databases
• Data are annotated by curators (usually manually)
• Non-redundant
• There is a time lag for synchronization with the source primary databases
Examples:
– NCBI RefSeq (DNA sequences and their translations)
– UniProt/TrEMBL (protein sequences)
– Ensembl (eukaryotic genome sequences)
– MMDB (crystal structures of proteins and macromolecules)
Composite databases
• Draw on primary and secondary databases
• Include non-biomolecular data
• Designed for end-user convenience
• Many curators
• Long time lag for synchronization with the primary/secondary databases
Examples:
– EuPathDB (database of eukaryotic pathogens, such as Plasmodium)
– MitoMAP (mitochondrial genome mapping database)
– Mammalian ncRNA (ncRNA sequences from mammals)
– REBASE (restriction enzyme database)
– SGD (fungal database)
– KEGG (pathway database, etc.)
– HIVdb (HIV-1 resistance database)
– Reactome (pathway database)
– Gramene (comparative resources for plants)
– PlantGDB (plant genome database)
Literature databases
• PubMed (http://www.ncbi.nih.gov/pubmed)
– International journals in the natural sciences and medicine
– Abstracts, authors, journals
• PubMed Central (http://www.pubmedcentral.nih.gov)
– A subset of PubMed
– Full text
• NCBI Bookshelf (http://www.ncbi.nih.gov/books)
• NCBI OMIM - Online Mendelian Inheritance in Man
– Catalog of human (and mouse) genetic disorders and genes
• NCBI OMIA - Online Mendelian Inheritance in Animals
– Catalog of genetic disorders and genes in animals (other than humans and mice)
Top organisms in GenBank
(Release 191)
Organism base pairs
Homo sapiens 16,310,774,187
Mus musculus 9,974,977,889
Rattus norvegicus 6,521,253,272
Bos taurus 5,386,258,455
Zea mays 5,062,731,057
Sus scrofa 4,887,861,860
Danio rerio 3,120,857,462
Strongylocentrotus purpuratus 1,435,236,534
Macaca mulatta 1,256,203,101
Oryza sativa Japonica Group 1,255,686,573
Nicotiana tabacum 1,197,357,811
Xenopus (Silurana) tropicalis 1,249,938,611
Drosophila melanogaster 1,119,965,220
Pan troglodytes 1,008,323,292
Arabidopsis thaliana 1,144,226,616
Canis lupus familiaris 951,238,343
Vitis vinifera 999,010,073
Gallus gallus 899,631,338
Glycine max 906,638,854
Triticum aestivum 898,689,329
NCBI nucleotide DB (GenBank & RefSeq)
NCBI RefSeq nomenclature
NM_* RefSeq for coding sequences (mRNA)
NR_* RefSeq for transcribed non-coding RNA
NT_* RefSeq for genomic contigs of chromosomes
NW_* RefSeq for alternative genomic contigs of chromosomes
NG_* RefSeq for gene families/clusters or pseudogenes
NC_* RefSeq for complete/circular genomes
NP_* RefSeq for protein sequences
XM_* RefSeq for hypothetical mRNA sequences
XP_* RefSeq for hypothetical protein sequences
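The prefix table lends itself to a simple lookup. The sketch below encodes the prefixes listed above and classifies an accession by the text before the underscore; the function name is hypothetical, and the accession strings in the usage lines serve only to exercise the prefixes.

```python
# RefSeq accession prefixes as listed in the nomenclature table above.
REFSEQ_PREFIXES = {
    "NM": "mRNA (coding sequence)",
    "NR": "transcribed non-coding RNA",
    "NT": "genomic contig of a chromosome",
    "NW": "alternative genomic contig of a chromosome",
    "NG": "gene family/cluster or pseudogene",
    "NC": "complete/circular genome",
    "NP": "protein sequence",
    "XM": "hypothetical mRNA sequence",
    "XP": "hypothetical protein sequence",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if not RefSeq."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore: not in RefSeq accession style
    return REFSEQ_PREFIXES.get(prefix.upper())


print(classify_refseq("NM_000546"))
print(classify_refseq("U49845"))
```

An accession without an underscore, such as the GenBank accession U49845 used earlier in these slides, is reported as not being a RefSeq identifier.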

(Some) bioinformatics databases
Gene/Genome Annotation
• A collection of relevant information
• Obtained and verified manually from experiments (taken from research journals)
• The information includes:
– Organism/sample
– Organelle
– Chromosomal location and orientation
– Enhancer, promoter, and exon-intron regions
– Transcripts (including alternatives arising from alternative splicing) and their proteins
– Gene variation (SNPs)
– Function
– Protein motifs/patterns
– Phenotype/disease correlation
– Other information
Textual information
Sequence information