
Lecture 02

Bioinformatics:
Introduction to Biological Databases

Source: Jin Xiong, Essential Bioinformatics, Cambridge, 2006


Databases

Early databases were flat files.
Retrieval becomes increasingly difficult as the amount of data grows.
Database management systems (DBMS) are needed:
Relational databases
Use a set of tables (relations) made up of columns and rows
Object-oriented databases
Example: Flat File
Name, State, Course Number, Course Name
Amir, Brebes, FisDas 11, Fisika Dasar
Siti, Sidoarjo, Bio 12, Biologi
Tuti, Makasar, Chem 21, Kimia Dasar
Budi, Ambon, Math 21, Matematik

• Requires a minimum amount of database design and memory
• Searching means reading through the whole file
• Database management systems (DBMS)
• Manageable for small databases
• Inefficient for big data
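To make the "read through" cost concrete, here is a minimal Python sketch that searches the flat-file example record by record. The function name `find_by_name` and the comma-separated layout are assumptions made for illustration; the slides do not prescribe a particular file layout.

```python
import csv
import io

# The example records from the slide, stored as one comma-separated flat file.
FLAT_FILE = """Name,State,Course Number,Course Name
Amir,Brebes,FisDas 11,Fisika Dasar
Siti,Sidoarjo,Bio 12,Biologi
Tuti,Makasar,Chem 21,Kimia Dasar
Budi,Ambon,Math 21,Matematik
"""


def find_by_name(flat_text, name):
    """Scan the records in order; a flat file has no index to jump to."""
    rows_read = 0
    for row in csv.DictReader(io.StringIO(flat_text)):
        rows_read += 1
        if row["Name"] == name:
            return row, rows_read
    return None, rows_read


record, rows_read = find_by_name(FLAT_FILE, "Budi")
print(record["Course Name"], "found after reading", rows_read, "records")
```

The number of records read grows with the position of the entry, which is why this approach stays manageable only for small files.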
Example: Relational Database
The flat file (Name, State, Course Number, Course Name) is split into three linked tables:

Student#  Name  State
1         Amir  Brebes
2         Siti  Sidoarjo
3         Tuti  Makasar
4         Budi  Ambon

Student#  Course
1         FisDas 11
2         Bio 12
3         Chem 21
4         Math 21

Course     Course Name
FisDas 11  Fisika Dasar
Bio 12     Biologi
Chem 21    Kimia Dasar
Math 21    Matematik

• Field (column): individual fields
• Value (row)
• Attribute
• Queried with Structured Query Language (SQL)
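As a rough illustration of the relational layout and of an SQL query over it, the following sketch loads the three tables into an in-memory SQLite database. The table and column names (`student`, `enrollment`, `course`, `student_no`, `course_no`) are assumptions chosen for this example and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_no INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE course (course_no TEXT PRIMARY KEY, course_name TEXT);
CREATE TABLE enrollment (
    student_no INTEGER REFERENCES student(student_no),
    course_no TEXT REFERENCES course(course_no)
);
""")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)", [
    (1, "Amir", "Brebes"), (2, "Siti", "Sidoarjo"),
    (3, "Tuti", "Makasar"), (4, "Budi", "Ambon"),
])
conn.executemany("INSERT INTO course VALUES (?, ?)", [
    ("FisDas 11", "Fisika Dasar"), ("Bio 12", "Biologi"),
    ("Chem 21", "Kimia Dasar"), ("Math 21", "Matematik"),
])
conn.executemany("INSERT INTO enrollment VALUES (?, ?)", [
    (1, "FisDas 11"), (2, "Bio 12"), (3, "Chem 21"), (4, "Math 21"),
])


def course_name_for(connection, student_name):
    """Follow the shared keys across the three tables with a JOIN."""
    row = connection.execute(
        """SELECT c.course_name
           FROM student s
           JOIN enrollment e ON e.student_no = s.student_no
           JOIN course c ON c.course_no = e.course_no
           WHERE s.name = ?""",
        (student_name,),
    ).fetchone()
    return row[0] if row else None


print(course_name_for(conn, "Siti"))
```

Each fact is stored once, and the query recombines the tables through their shared key columns instead of scanning one long file.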
Example: Relational Database

In a relational database, the tables used do not describe complex hierarchical relationships between data items.
Object-Oriented Databases
 In object-oriented programming languages, an object can be regarded as a unit that combines data with the mathematical routines that operate on that data.
 The database is structured so that objects are linked by a set of pointers that define predetermined relationships between the objects.
 Such databases are typically built with an object-oriented programming language such as C++.
Example: Object-Oriented Database

Three objects are constructed and linked by pointers.

Retrieving information depends on navigating through these objects with the help of the pointers.

For simplicity, some pointers are not shown here.
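The idea of objects linked by pointers can be sketched with plain Python object references. The three classes below (`Student`, `State`, `Course`) are an assumed decomposition of the student example, not the layout of the original figure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    name: str


@dataclass
class Course:
    number: str
    name: str


@dataclass
class Student:
    name: str
    state: State                      # reference ("pointer") to a State object
    courses: List[Course] = field(default_factory=list)  # references to Course objects


# Three objects, linked by references.
brebes = State("Brebes")
fisdas = Course("FisDas 11", "Fisika Dasar")
amir = Student("Amir", brebes, [fisdas])


def course_names(student):
    """Retrieve information by navigating from one object to the next."""
    return [course.name for course in student.courses]


print(amir.state.name, course_names(amir))
```

No join is needed here: the relationship is fixed in advance by the reference stored in each object, which is the property the slide describes.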
Example: Object-Oriented Database

• More flexible
• Lacks the rigorous mathematical foundation of relational databases
Biological Databases
All three database models described above are still in use.
By content, biological databases can be grouped into:
Primary databases
Original biological data deposited by researchers.
E.g., GenBank and the Protein Data Bank (PDB)
Secondary databases
Computationally processed data, with additional information added manually.
Databases of translated protein sequences that carry functional annotation belong to this category.
E.g., SWISS-PROT; Protein Information Resource (PIR)
Specialized databases
Serve a particular research interest.
E.g., FlyBase, HIV Sequence Database, and Ribosomal Database Project
Primary Databases
 The largest databases, storing raw nucleic acid sequence data generated and submitted by researchers worldwide:
 GenBank
 The European Molecular Biology Laboratory (EMBL) database
 The DNA Data Bank of Japan (DDBJ)
 All are freely available
 Minimal level of annotation
 Some data taken from literature published in the 1980s were entered manually by database management staff.
Primary Databases (2)
 Most scientific journals now require submission to GenBank, EMBL, or DDBJ to ensure that fundamental molecular data are freely available.
 The three databases exchange data with one another and together form the International Nucleotide Sequence Database Collaboration.
 Although the data are the same, the presentation formats differ slightly.
 Only the PDB stores 3-D structures of biological macromolecules, i.e., the atomic coordinates of macromolecules (both proteins and nucleic acids) obtained by X-ray crystallography and NMR.
 A flat-file format is used to present the protein name, authors, experimental details, secondary structure, cofactors, and atomic coordinates.
 The PDB web interface also provides tools for simple image manipulation.
Secondary Databases

 Sequence annotation information in the primary databases is often minimal.
 To turn the raw sequence information into more sophisticated biological knowledge, much postprocessing of the sequence information is needed.
 This creates the need for secondary databases.
Databases in Bioinformatics
Sequence databases
Sequence analysis
Functional genomics
Literature databases
Structural databases
Metabolic pathway databases
Specialized databases

Pitfalls of Biological Databases

Errors in sequence databases
Redundancy in the primary sequence databases
False or incomplete gene annotations
Errors in Nucleotide Sequences

Sequencing errors
Frame-shifts
Contamination with sequences from cloning vectors
Exceptional care is needed for sequences produced before the 1990s
Redundancy
 Repeated submission
 Identical or overlapping sequences from the same or different authors
 Revision of annotations
 Dumping of expressed sequence tag (EST) data
 Poor database management

Bioinformatics Databases
 Growing steadily in number
 Growing amazingly in size
 Specialization
 Which genome they contain (mouse, human, all of them)
 Which types of information about the genome they contain
 Contain information such as
 Sequences: of bases and of residues
 Structure: 3D conformations of known proteins
 Families: which sets of genes are known to be homologous
 Annotations: which processes each gene is involved in
▪ And lots of other information
The definitive source…
 More than 1300 databases
 http://nar.oxfordjournals.org/content/39/suppl_1.toc
DNA Sequence Databases
 Main repositories:
 GenBank (US)
▪ (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
 EMBL (Europe)
▪ (http://www.ebi.ac.uk/embl/)
 DDBJ (Japan)
▪ (http://www.ddbj.nig.ac.jp/)
 Primary databases
 DNA sequences are identical

EMBL Database
Number of entries: 199,575,971 (graphs created on 22 November 2010)
http://www.ebi.ac.uk/embl/Services/DBStats/
ENTREZ
www.ncbi.nlm.nih.gov
http://www.ncbi.nlm.nih.gov/Entrez/
 NCBI (USA): National Center for Biotechnology Information

 PubMed: the biomedical literature

 Nucleotide: sequence database (GenBank)

 Protein: sequence database


 Structure: three-dimensional macromolecular structures

 Genome: complete genome assemblies

 PopSet: population study data sets

 OMIM: Online Mendelian Inheritance in Man

 Taxonomy: organisms in GenBank

 Books: online books

 ProbeSet: gene expression and microarray datasets

 3D Domains: domains from Entrez Structure

 UniSTS: markers and mapping data

 SNP: single nucleotide polymorphisms

 CDD: conserved domains

 Journals: journals in Entrez

 UniGene: gene-oriented clusters of transcript sequences

 PMC: full-text digital archive of life sciences journal literature


PubMed is…
• National Library of Medicine's search service
• >20 million citations in MEDLINE
• links to participating online journals
• PubMed tutorial (via side bar)
Entrez integrates…
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
Sequence Databases

Annotated sequence databases
SWISS-PROT, GenBank, etc.
Usage: identifying function, retrieving information

Low-annotation sequence databases
EST databases, high-throughput genome sequences
Usage: discovery of new genes
General Protein Databases
 SWISS-PROT

 Manually curated

 high-quality annotations, less data


 GenPept/TREMBL

 Translated coding sequences from GenBank/EMBL

 Few annotations, more up to date


 PIR

 Phylogenetic-based annotations
 All 3 now combining efforts to form UniProt
(http://www.uniprot.org)
Low-Annotation Databases
ESTs (Expressed Sequence Tags)
Low-quality sequences generated by high-volume sequencing of the 3’ or 5’ end of cDNAs

High-throughput genome sequences
Produced by mass sequencing of genomic DNA
Non-Redundant Databases
Sequence data only: cannot be browsed, can only be searched using a sequence
Combine sequences from more than one database
Examples:
NR Nucleic (GenBank + EMBL + DDBJ + PDB DNA)
NR Protein (SWISS-PROT + TrEMBL + GenPept + PDB protein)
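One simple way to picture how a non-redundant set is assembled is to merge records from several sources and keep each distinct sequence once. The sketch below assumes exact sequence identity as the redundancy criterion and uses made-up identifiers and sequences; the real NR builds may apply different rules.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, records) pairs, keeping one entry per distinct sequence.

    Each record is an (identifier, sequence) tuple. The result maps every
    distinct sequence to the list of source identifiers that carried it.
    """
    merged = {}
    for source_name, records in sources:
        for identifier, sequence in records:
            key = sequence.upper()  # treat upper and lower case as the same sequence
            merged.setdefault(key, []).append(f"{source_name}:{identifier}")
    return merged


# Hypothetical records: the first sequence appears in both sources.
source_a = ("SourceA", [("A001", "MDFIVAIF"), ("A002", "MTQLQISL")])
source_b = ("SourceB", [("B777", "mdfivaif")])

nr = merge_non_redundant(source_a, source_b)
for sequence, origins in nr.items():
    print(sequence, origins)
```

Because the merged set is keyed by sequence rather than by entry, it suits sequence searches but not browsing, which matches the restriction stated on the slide.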
Sequence & Structure Databases
 PDB (Protein Data Bank)
 Stores 3-dimensional atomic coordinates for biological molecules, including proteins and nucleic acids
 Data obtained by X-ray crystallography, NMR, or computer modelling
 http://www.rcsb.org/pdb/
 MMDB (Molecular Modeling Database)
 Over 28,000 3D macromolecular structures, including proteins and polynucleotides
 (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
 SCOP (Structural Classification of Proteins)
 Classification of proteins according to structural and evolutionary relationships

File Formats
 GenBank/GB, GenBank flatfile format
 NBRF format
 EMBL, EMBL flatfile format
 Swissprot
 GCG, single sequence format of GCG software
 DNAStrider, for a common Mac program
 Pearson/Fasta, a common format used by Fasta programs and others
 Phylip3.2, sequential format for Phylip programs
 Phylip, interleaved format for Phylip programs (v3.3, v3.4)
 Plain/Raw, sequence data only (no name, document, numbering)
 MSF, multiple sequence format used by GCG software
 PAUP's multiple sequence (NEXUS) format
 ASN.1, format used by NCBI
EMBL Format
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
SV   X56734.1
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   15-MAR-1999 (Rel. 59, Last updated, Version 9)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   MEDLINE; 91322517.
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.).";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   M.A. Hughes, UNIVERSITY OF NEWCASTLE UPON TYNE, MEDICAL SCHOOL, NEW CASTLE
RL   UPON TYNE, NE2 4HH, UK
XX
DR   GOA; P26204.
DR   MENDEL; 11000; Trirp;1162;11000.
DR   SWISS-PROT; P26204; BGLS_TRIRP.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /db_xref="taxon:3899"
FT                   /mol_type="mRNA"
FT                   /organism="Trifolium repens"
FT                   /tissue_type="leaves"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT   CDS             14..1495
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="SWISS-PROT:P26204"
FT                   /note="non-cyanogenic"
FT                   /EC_number="3.2.1.21"
FT                   /product="beta-glucosidase"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /evidence=EXPERIMENTAL
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
     aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata       240
     tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta       300
     caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc       360
     ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaa
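Every line of an EMBL entry begins with a two-letter line code (ID, AC, DE, OS, ...), which makes the format easy to split by code. The following is a minimal sketch that groups the lines of a short excerpt of the record above; the function name `group_by_line_code` is hypothetical, and the sketch ignores sequence lines and multi-line feature qualifiers.

```python
from collections import defaultdict

# A short excerpt of the entry shown above.
EMBL_EXCERPT = """\
ID   TRBG361    standard; mRNA; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
OS   Trifolium repens (white clover)
"""


def group_by_line_code(text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not content:
            continue  # XX lines are empty spacers
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(EMBL_EXCERPT)
accessions = [a.strip() for a in fields["AC"][0].split(";") if a.strip()]
print(accessions)
print(fields["OS"][0])
```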
GenBank Format
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  MEDLINE   95176709
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  MEDLINE   96194260
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241
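In a GenBank flat file, top-level keywords (LOCUS, DEFINITION, ACCESSION, ...) start in the first column and continuation lines are indented. A minimal sketch of reading the header on that basis follows; the function name `parse_header` is hypothetical, and in this simplified version indented sub-keywords such as ORGANISM would simply be folded into the preceding keyword.

```python
import re

# The first header lines of the record shown above.
GENBANK_EXCERPT = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
"""


def parse_header(text):
    """Map each top-level keyword to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A keyword starts in column 1; the rest of the line is its value.
            current, _, value = line.partition(" ")
            fields[current] = value.strip()
        elif current is not None:
            # Indented lines continue the value of the previous keyword.
            fields[current] += " " + line.strip()
    return fields


header = parse_header(GENBANK_EXCERPT)
length_bp = int(re.search(r"(\d+) bp", header["LOCUS"]).group(1))
print(header["ACCESSION"], length_bp)
print(header["DEFINITION"])
```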
Swissprot format
Specialized Sequence Databases
 Focus on a specific type of sequence
 Sequences are often modified or specially annotated
 Usage depends on the database
 Examples:
 Ribosomal RNA databases
 Immunology databases

Protein Domain Databases
 Pfam (http://www.sanger.ac.uk/Software/Pfam/)
 Collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families
 SMART (a Simple Modular Architecture Research Tool)
 Identification and annotation of genetically mobile domains and the analysis of domain architectures
 (http://smart.embl-heidelberg.de/help/smart_about.shtml)
 CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi)
 Combines the SMART and Pfam databases
 Easier and quicker search

Sequence Motif Databases

 Scan Prosite
(http://www.expassy.org/prosite) and PRINTS
(http://bioinf.man.ac.uk/dbbrowser/PRINTS/)
 Store conserved motifs occurring in nucleic acid or
protein sequences
 Motifs can be stored as consensus sequences,
alignments, or using statistical representations
such as residue frequency tables
Ribosomal RNA Databases
 RDP (Michigan State University, USA)
 http://rdp.cme.msu.edu/html/
 rRNA database (University of Antwerp, Belgium)
 http://rrna.uia.ac.be/
 Ribosomal RNA sequences are pre-aligned according to their secondary structure
 Usage: creating data sets for molecular phylogeny, especially for microbial taxonomy and identification
Immunological Sequence Databases
 The Kabat Database of Sequences of Proteins of Immunological Interest
 www.hgmp.mrc.ac.uk/Bioinformatics/Databases/kabatp-help.html
 Sequences are classified according to antigen specificity and are available in pre-aligned format
 The Immunogenetics database (IMGT)
 http://imgt.cnusc.fr:8104/
 Focuses on immunoglobulins, T-cell receptors and MHC genes

Genome Databases
 Focus on one organism or group of organisms:

 Colibase (E. coli and related species)


http://colibase.bham.ac.uk/
 GDB (human) http://www.gdb.org/

 Flybase (Drosophila) http://flybase.bio.indiana.edu/

 WormBase (C. elegans) http://wormbase.org

 AtDB (Arabidopsis) http://www.arabidopsis.org

 SGD (S. cerevisiae)


http://genome-www.stanford.edu/Saccharomyces/
Expression Databases
RNA expression

Results of microarray experiments measuring the change in specific mRNA content under certain conditions

ArrayExpress (EBI) and GEO (NCBI)

Not user friendly

Proteome databases

2D gel electrophoresis images representing the protein content of a cell or tissue under specific conditions

SWISS-2DPAGE at http://us.expasy.org/ch2d/


Other Database Types
 Literature

 MEDLINE (http://ncbi.nlm.nih.gov/PubMed/)

 HighWire (http://www.highwire.org)
 Variation

 dbSNP (http://ncbi.nlm.nih.gov/SNP/)

 HGBase (http://hgbase/interactiva/de)
 Metabolic pathways

 KEGG (http://kegg.genome.ad.jp/kegg/)

 WIT (http://wit.mcs/anl.gov/WIT2)
 Organisms and nomenclature

 Taxonomies (e.g.: http://ncbi.nlm.nih.gov/Taxonomy/ )

 Mendel (http://mbclserver.rutgers.edu/CPGN)
Methods for Accessing Data

local installation
screen scraping
BioPerl
FTP sites
Local Installations

SRS

Need to obtain a license from Lion Bioscience

Download data from FTP sites

Ensembl

"framework to organize biology around the sequences of large


genomes"

www.ensembl.org
Screen Scraping
URL spoofing

construction of URLs that replicate the query

html parsing

extraction of results from html pages returned by query

Requirements

html module

knowledge of query mechanism

Method NOT advocated by most data providers
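The two steps named on this slide, constructing a URL that replicates a query and then parsing the returned HTML, can be sketched with the Python standard library alone. Everything specific here is an assumption for illustration: the base URL `https://example.org/search`, the parameter names `db` and `term`, and the `<td class="title">` markup are invented, and the sketch parses a canned HTML string instead of contacting any server.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "https://example.org/search"  # hypothetical query endpoint


def build_query_url(term, database):
    """Replicate a web-form query by encoding its fields into the URL."""
    return BASE_URL + "?" + urlencode({"db": database, "term": term})


class TitleCellParser(HTMLParser):
    """Collect the text of every <td class="title"> cell in a result page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "title") in attrs:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


# Stand-in for the HTML page a server might return for the query.
SAMPLE_HTML = (
    '<table><tr><td class="title">beta-glucosidase</td>'
    "<td>X56734</td></tr></table>"
)

url = build_query_url("beta-glucosidase", "nucleotide")
parser = TitleCellParser()
parser.feed(SAMPLE_HTML)
print(url)
print(parser.titles)
```

The parser depends entirely on the page layout, so any redesign of the site breaks it; this fragility is one reason providers prefer that users rely on FTP downloads or supported programmatic access.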


BioPerl

BioPerl is a collection of modules that


facilitates the development of Perl scripts
for bioinformatics applications.

www.bioperl.org
SWISSPROT
European/Swiss Bioinformatics Institute 1986

Highly accurate, hand curated resource


http://www.ebi.ac.uk/swissprot/
Aims:

Have a high level of annotation

Often by the people who have been working with the gene

Have a low level of redundancy

Have a high level of integration with other databases


TrEMBL

SWISSPROT’s Big Brother


http://www.ebi.ac.uk/trembl/

All genes which have been left out of SWISSPROT


Computer annotated rather than human annotated
PROSITE
Families of proteins

Can search using regular expressions http://ca.expasy.org/prosite/

Similar to unix commands using wildcards, etc.

E.g., [AC]-x-V-x(4)-{ED}

Interpreted as:

[Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}

Families exhibit these patterns

So we can search over families
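The pattern syntax shown above maps closely onto ordinary regular expressions. The sketch below converts the elements that appear in the slide's example (`x`, `[..]`, `{..}`, single residues, and a repeat count in parentheses) into a Python regex. The function name `prosite_to_regex` is hypothetical, the test sequences are made up, and other PROSITE features such as the `<` and `>` anchors are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            core = "."                        # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {ED} means any residue except E or D
        # Bracketed sets such as [AC] and single residues pass through unchanged.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(bool(re.search(regex, "AAVGGGGK")))  # hypothetical sequence that fits the pattern
print(bool(re.search(regex, "AAVGGGGE")))  # ends in Glu, which {ED} forbids
```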

1574 documents about 1308 different patterns


PFAM
http://pfam.sanger.ac.uk/

Maintained by the Sanger Centre (Cambridge)

Protein families aligned using HMMs

Hidden Markov Models (see later lecture)

Given a new sequence

Find families which the sequence might fit into

Sequence Coverage

11912 families

Split into Pfam-A (high quality) and Pfam-B (low quality)


SCOP and CATH
SCOP http://scop.mrc-lmb.cam.ac.uk/scop/
Structural Classification of Proteins

Hierarchically ordered and manually curated

38221 PDB Entries

110800 Domains

CATH http://www.cathdb.info/

Classification of protein domain structures

124 folds

226 Superfamily

1148 Sequence family

14473 Domain
Using Databases with the FASTA Format

May need to know the FASTA format

For residue sequences

First line must start with a > sign

First line contains identification information for gene

Other lines contain the residue sequence

OK to have a ragged right format

Usually OK to have lower case (but check)
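The rules above are enough to write a small reader. This is a minimal sketch, assuming the whole file fits in memory; the function name `read_fasta` is hypothetical, and the record is a shortened version of the glutathione peroxidase example from these slides (only its first and last sequence lines).

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("FASTA text must start with a '>' header line")
        else:
            # Ragged line lengths are fine; case is normalized to upper case.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
qgasa
"""

records = read_fasta(EXAMPLE)
header, sequence = records[0]
identifier_fields = header.split()[0].split("|")
print(identifier_fields)
print(sequence)
```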


Example FASTA Format

>gi|121664|sp|P00435|GSHC_BOVIN GLUTATHIONE PEROXIDASE
mcaaqrsaaalaaaaprtvyafsarplaggepfnlsslrgkvllienvak
slcgttvrdytqmndlqrrlgprglvvlgfpcnqfghqenakneeilncl
yvrpgggfepnfmlfekcevngekahplfaflrevlptpsddatalmtdp
kfitwspvcrndvswnfekflvgpdgvpvrrysrrfltidiepdietlls
qgasa

Header fields:
gi|121664: GenInfo number, assigned by the NCBI
sp: indicates that SWISS-PROT was the source database
P00435|GSHC_BOVIN: SWISS-PROT identifier
GLUTATHIONE PEROXIDASE: molecule name
Analyzing Results Using PERL Scripts
Database servers now do:

Increasingly specific analysis of your results

But you will eventually need to do analysis

Ideal programming language is PERL

Designed to manipulate text and files

Can use it to play around with (manipulate) strings

Will be using it in the coursework

PERL Tutorial
PDB Format
The PDB format consists of a collection of fixed format records that describe :

Atomic coordinates,

Chemical and biochemical features

Experimental details of the structure determination

Some structural features such as

Secondary structure assignments,

Hydrogen bonding

Biological assemblies

Active sites
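"Fixed format" means that each field of a record occupies set character columns, so a record is read by slicing rather than by splitting on whitespace. The sketch below builds one ATOM record and reads it back using the standard ATOM column positions (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-38, 39-46 and 47-54). The function names and the coordinate values are made up for illustration.

```python
def make_atom_line(serial, atom, res_name, chain, res_seq, x, y, z):
    """Lay out a hypothetical ATOM record in the fixed PDB columns."""
    return (
        f"{'ATOM':<6}{serial:>5} {atom:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:>8.3f}{y:>8.3f}{z:>8.3f}"
    )


def parse_atom_line(line):
    """Read the fields back by column position (0-based slices)."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


# Made-up coordinates for the backbone nitrogen of a first methionine residue.
line = make_atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_line(line)
print(line)
print(atom)
```

Slicing by column keeps working even when adjacent numeric fields touch with no space between them, which can happen with large coordinate values.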
The challenge

(Boguski, 1999)

In 1995, the number of genes in the databases started to exceed the number of
papers on molecular biology and genetics in the literature!
Data types
primary data: sequence (primary database)
AATGCGTATAGGC (DNA)
DMPVERILEALAVE (amino acid)

secondary data: secondary protein structure (secondary db)
"motifs": regular expressions, blocks, profiles, fingerprints
e.g., alpha-helices, beta-strands

tertiary data: tertiary protein structure (tertiary db)
atomic co-ordinates
domains, folding units
 Based on the source of their data, bioinformatics databases can be divided into:
 primary databases
 secondary databases
 composite/tertiary databases
 literature databases

Primary databases
• Data repository
• Data come from direct submission by researchers (including "junk", inconsistencies, etc.)
• Raw data
• Redundant
• Continuously updated (daily/weekly)

Examples:
– NCBI GenBank (DNA sequences and their translations)
– UniProt/SwissProt (protein sequences)
– PDB (crystal structures of proteins and macromolecules)
– dbEST (fragments of mRNA sequences)
– dbGSS (genome survey sequences)
– Trace Archive (raw DNA sequencing data)
– SAGEMap, GEO (microarray experiment data)
Secondary databases

• Knowledge repository
• Data are derived from the primary databases
• Data are annotated by curators (usually manually)
• Non-redundant
• There is a time lag in synchronizing with the source primary databases

Examples:
– NCBI RefSeq (DNA sequences and their translations)
– UniProt/TrEMBL (protein sequences)
– Ensembl (eukaryotic genome sequences)
– MMDB (crystal structures of proteins and macromolecules)
Composite databases

• Drawn from primary & secondary databases
• Include non-biomolecular data
• Convenient for end users
• Many curators
• Long synchronization lag with the primary/secondary databases

Examples:
– EuPathDB (database of eukaryotic pathogens such as Plasmodium)
– MitoMAP (mitochondrial genome mapping database)
– Mammalian ncRNA (ncRNA sequences from mammals)
– REBASE (restriction enzyme database)
– SGD (fungal database)
– KEGG (pathway database, etc.)
– HIVdb (HIV-1 resistance database)
– Reactome (pathway database)
– Gramene (comparative resources for plants)
– PlantGDB (plant genome database)
Literature databases

• PubMed (http://www.ncbi.nih.gov/pubmed)
– International journals in the natural sciences & medicine
– Abstract, authors, journal

• PubMed Central (http://www.pubmedcentral.nih.gov)
– Subset of PubMed
– Full text

• NCBI Bookshelf (http://www.ncbi.nih.gov/books)

• NCBI OMIM - Online Mendelian Inheritance in Man
– Catalog of human (& mouse) genetic diseases and genes

• NCBI OMIA - Online Mendelian Inheritance in Animals
– Catalog of genetic diseases and genes in animals (other than human & mouse)
Top organisms in GenBank
(Release 191)
Organism base pairs
Homo sapiens 16,310,774,187
Mus musculus 9,974,977,889
Rattus norvegicus 6,521,253,272
Bos taurus 5,386,258,455
Zea mays 5,062,731,057
Sus scrofa 4,887,861,860
Danio rerio 3,120,857,462
Strongylocentrotus purpuratus 1,435,236,534
Macaca mulatta 1,256,203,101
Oryza sativa Japonica Group 1,255,686,573
Nicotiana tabacum 1,197,357,811
Xenopus (Silurana) tropicalis 1,249,938,611
Drosophila melanogaster 1,119,965,220
Pan troglodytes 1,008,323,292
Arabidopsis thaliana 1,144,226,616
Canis lupus familiaris 951,238,343
Vitis vinifera 999,010,073
Gallus gallus 899,631,338
Glycine max 906,638,854
Triticum aestivum 898,689,329
NCBI nucleotide DB (GenBank & RefSeq)
NCBI RefSeq nomenclature

NM_* RefSeq for coding sequences (mRNA)

NR_* RefSeq for transcribed non-coding RNA

NT_* RefSeq for genomic contigs of a chromosome

NW_* RefSeq for alternative genomic contigs of a chromosome

NG_* RefSeq for gene families/clusters or pseudogenes

NC_* RefSeq for complete/circular genomes

NP_* RefSeq for protein sequences

XM_* RefSeq for hypothetical mRNA sequences

XP_* RefSeq for hypothetical protein sequences
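Because the prefix alone tells you what kind of molecule a RefSeq record describes, the table above translates directly into a lookup. In this sketch the function name `classify_accession` is hypothetical and the accession numbers are made up for illustration.

```python
# Prefix table taken from the slide above.
REFSEQ_PREFIXES = {
    "NM_": "coding sequence (mRNA)",
    "NR_": "transcribed non-coding RNA",
    "NT_": "genomic contig of a chromosome",
    "NW_": "alternative genomic contig of a chromosome",
    "NG_": "gene family/cluster or pseudogene",
    "NC_": "complete/circular genome",
    "NP_": "protein sequence",
    "XM_": "hypothetical mRNA sequence",
    "XP_": "hypothetical protein sequence",
}


def classify_accession(accession):
    """Return the record type implied by the first three characters."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "not a RefSeq prefix in this table")


# Made-up accession numbers used only to exercise the lookup.
for accession in ("NM_123456", "xp_654321", "U49845"):
    print(accession, "->", classify_accession(accession))
```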


(A selection of) bioinformatics databases
Gene/Genome Annotation

• A collection of relevant information
• Obtained and verified manually from experiments (taken from research journals)
• The information includes:
– Organism/sample
– Organelle
– Chromosomal location and orientation
– Enhancer, promoter, and exon-intron regions
– Transcripts (including alternatives produced by alternative splicing) and their proteins
– Gene variation (SNP)
– Function
– Protein motifs/patterns
– Phenotype/disease correlation
– Other information
Textual information
Sequence information
