
The Hierarchical Structure of an organism

Level of Organization

organism organs tissues cells

chromosomes

Chromosomes are made up of protein and DNA

http://www.brooklyn.cuny.edu/bc/ahp/LAD/C4/C4_Chromosomes.html

Chromosomes

Figure 4-14. Two closely related species of deer with very different chromosome numbers. In the evolution of the Indian muntjac, initially separate chromosomes fused, without having a major effect on the animal. These two species have roughly the same number of genes. (Adapted from M.W. Strickberger, Evolution, 3rd edition, 2000, Sudbury, MA: Jones & Bartlett Publishers.)

DNA is a macromolecule composed of four basic units

distancelearning.ksi.edu/ demo/bio378/lecture.html

A chromosome is a long sequence

A simple view of the central dogma

All organisms on earth use the same operating system

Central Dogma of Gene Expression

Genomics
genome

transcriptome

proteome

metabolome

physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU

Genetic information is carried as three base genetic code


Four bases (A, G, C, T/U) must encode 20 amino acids (a.a.), so a combination is required: 4^3 = 64
The triplet code unit is called a CODON, and reading must begin at a precise site
Of the 64 codons, 61 specify individual a.a. and three are STOP codons

The start codon is AUG (methionine)
The code is universal, synonymous, degenerate
Reading frame
3rd base in codon: wobble
Frameshifts/deletions/insertions (MUTATIONS)
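The codon arithmetic above can be checked with a short sketch. It assumes the standard genetic code (NCBI translation table 1) written as a compact string in TCAG order; the variable names are illustrative only.

```python
from itertools import product

BASES = "TCAG"  # DNA alphabet; use "UCAG" for RNA
# Standard genetic code (assumed: NCBI translation table 1), in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet to its one-letter amino acid ('*' marks a stop codon).
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE))                   # number of triplets, 4**3
print(len(sense_codons), len(amino_acids))
print(stop_codons)
```

Because 61 sense codons map onto 20 amino acids, most amino acids have several synonymous codons, which is what "degenerate" refers to.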

The genetic code

All organisms use the same genetic code

Three different reading frames
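As a minimal sketch of what the three forward reading frames mean, the hypothetical helper below splits one sequence into codons starting at offsets 0, 1 and 2. The input is the first 12 nucleotides of the insulin coding region shown later in these slides.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split a sequence into complete codons for each of the three forward frames."""
    frames = []
    for offset in range(3):
        # Stop early enough that every codon has exactly three bases.
        codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
        frames.append(codons)
    return frames

frames = reading_frames("AUGGCCCUGUGG")
for number, codons in enumerate(frames, start=1):
    print(number, codons)
```

Shifting the start by one base changes every codon downstream, which is why insertions or deletions that are not multiples of three cause frameshift mutations.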

Codon Bias
Gene finders are often organism specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons)
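Codon bias can be illustrated by counting how often each codon is used in an in-frame coding sequence. The function name and the toy sequence below are assumptions for illustration, not data from a real organism.

```python
from collections import Counter

def codon_usage(cds: str) -> dict[str, float]:
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

# Toy sequence: five leucine codons, four of them CUG.
usage = codon_usage("CUGCUGCUGUUACUG")
print(usage)
```

An organism-specific gene finder would estimate such frequencies from known genes of that organism, which is one reason a model trained on one genome transfers poorly to another.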

Important Features
DNA contains the "genetic template" for proteins.
DNA is found in the nucleus.
Protein synthesis occurs in the cytoplasm, on the ribosome.
"Genetic information" must be transferred to the cytoplasm, where proteins are synthesized.

There are between 30,000 and 40,000 genes in the human genome

DNA *

genetic code

protein

The human gene inventory corresponds to ~1.5% of the genome (coding regions)

The average gene

The average human gene is 1.4 kb long, but distributed in exons over an average of 30 kb.
There are about 11 genes per Mb of DNA, with chromosome 19 having the highest density (close to 30).
Ashurst, J.L. and Collins, J.E. 2003. Annu Rev Genomics Hum Genet 4: 69-88.

Component Regions of RNA

The leading and trailing regions, the 5′ and 3′ untranslated regions (UTRs), are important for cellular positioning, message stability, and the efficiency with which protein can be made from the mRNA. The open reading frame codes for the protein.

Most eukaryotic genes contain introns


Insulin gene

[Figure: map of the insulin gene (scale bar 100 bp), from 5′ end to 3′ end: promoter, transcription start (+1), exon 1, intron 1, exon 2, intron 2, exon 3, poly-A signal (+1411).]
[Figure: the primary transcript (1431 nt) undergoes capping, polyadenylation and splicing to give the mRNA (465 nt + poly-A tail).]

Insulin Gene Primary Transcript Sequence


Position 1 (5′ end): exon 1

intron 1

AGC CC U CC AGG AC AGGCU GC AUC AG AAG AGGCC AUC AAGC AGgucuguuccaagggccuuugcgucaggugggc uc aggguuccaggguggcuggaccccaggccccagcucugcagcagggaggacguggcugggcucgugaagcaugugg gggugagcccaggggccccaaggcagggcaccuggccuucagccugccucagcccugccugucucccagAUCACUG UC C U U C U GC C AU GGC C C U GU GG AU GCGC C U C C U GC C C CU GC U GGC GC U GC UGGC C C U C U GG GG AC C U G AC C C AGC C GC AGC C U U U G U G A AC C A AC AC C U GU GC GGC U C AC AC C U G GU GG A AGC U C U C U AC C U AG U G U G C GGGG A AC G AGG C U UC UUC U AC AC AC C C A AG ACCC GC C GGG AGGC AG AGG ACC U GC AGGgugagccaaccgcccauugcugccccuggcc gcccccagccacccccugcuccuggcgcucccacccagcaugggcagaagggggcaggaggcugccacccagcagg g g g u c a g g u g c a c u u u u u u a a a a a g a a g u u c u c u u g g uc a c g u c c u a a a a g u g a c c a g c u c c c u g u g g c c c a g u c a gaaucucagccugaggacgguguuggcuucggcagccccgagauacaucagagggugggcacgcuccucccuccac ucgccccucaaacaaaugccccgcagcccauuucuccacccucauuugaugaccgcagauucaaguguuuuguuaa guaaaguccugggugaccuggggucacagggugccccacgcugccugccucugggcgaacaccccaucacgcccgg aggagggcguggcugccugccugagugggccagaccccugucgccagccucacggcagcuccauagucaggagaug gggaagaugcuggggacaggcccuggggagaaguacugggaucaccuguucaggcucccacugugacgcugccccg gggcgggggaaggaggugggacaugugggcguuggggccuguagguccacacccagugugggugacccucccucua accuggguccagcccggcuggagaugggugggagugcgaccuagggcuggcgggcaggcgggcacugugucucccu gacuguguccuccugugucccucugccucgccgcuguuccggaaccugcucugcgcggcacguccuggcagUGGGG C AG GU GG AG C U G GG C G G GG GC C C U G G U G C AG G C AGC C U GC AGC C C U U GG C C C U G G AG GG G U C C C U G C A G A A GC GU G

exon 2

intron 2

exon 3

GC AU U G U G G A AC A AU GC U GU AC C AG C AU C U GC U C C C U C U AC C AGC U G G AG A AC U AC U G C A AC U AG AC G C AGC C U G C AG G C AG C C C C AC AC C C G C C G C C U C C U G C AC C G AG AG AG AU G G A AU A A AG C C C U U G A AC C A GC

Position 1431 (3′ end)

Insulin Gene mature mRNA sequence


Positions marked: 1 (5′ end), 59

AG C C C U C C A G G AC A G G C U G C AU C AG A A G AG G C C AU C A AG C AG AU C AC U G U C C U U C U G C C AU G G C C C U G U G G AU G C G C C U C C U G C C C C U G C U G G C G C U G C U G G C C C U C U G G G G AC C U G AC C C A G C C G C AG C C U U U G U G A AC C A AC A C C U G U G C G G C U C AC A C C U G G U G G A AG C U C U C U AC C U AG U G U G C G G G G A AC G AG G C U U C U U C U A C AC A C C C A AG AC C C G C C G G G AG G C AG AG G A C C U G C A G G U G G G G C A G G U G G A G C U G G G C G G G G G C C C U G G U G C AG G C AG C C U G C A G C C C U U G G C C C U G G AG G G G U C C C U G C AG A AG C G U G G C AU U G U G G A AC A AU G C U G U AC C AG C AU C U G C U C C C U C U AC C AG C U G G AG A A C U AC U G C A A C U AG AC G C A G C C U G C A G G C AG C C C C AC A C C C G C C G C C U C C U G C A C C G AG AG AG AU G G A AU A A A G C C C U UG A AC C AGC AAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Positions marked: 392, 465 (3′ end)

1-59 = 5′ UTR; 60-388 = protein-coding region; 392-465 = 3′ UTR

RNA Processing, including capping and splicing, is co-transcriptional

TYPES OF INTRONS
GU-AG introns; AU-AC introns

The GT-AG (or GU-AG) Rule


Intron boundaries are defined by the nucleotides GU (GT in DNA) and AG; this is called the GT-AG rule.
Splicing enhancers (and silencers) are found in the exons.
The majority of animal and plant introns are removed by the spliceosome that recognizes GT-AG introns.
However, plants and animals (but not fungi) have a second, alternative spliceosome that is responsible for splicing non-canonical introns.
After removal from the primary transcript, virtually all introns are degraded.
Alternative splicing is thought to explain our complexity despite our limited number of protein-coding genes (~30,000).
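The GT-AG rule can be expressed as a simple boundary check. This is a sketch with a hypothetical helper name and toy introns. Matching the dinucleotides is necessary for a canonical intron but far from sufficient, since GU and AG occur by chance throughout any sequence.

```python
def follows_gu_ag_rule(intron: str) -> bool:
    """True if an intron starts with GU and ends with AG (case-insensitive)."""
    intron = intron.upper().replace("T", "U")  # accept DNA (GT-AG) as well as RNA
    return len(intron) >= 4 and intron.startswith("GU") and intron.endswith("AG")

# Toy introns for illustration only.
print(follows_gu_ag_rule("gucuguuccaagcccag"))  # GU ... AG, the major class
print(follows_gu_ag_rule("auauccuuuuccuuaac"))  # AU ... AC, the minor class
print(follows_gu_ag_rule("GTAAGTTTTCAG"))       # the same rule written in DNA
```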

How does splicing start?

Andrew P. Read Human Molecular Genetics

The Splicing Reaction


An unusual 2′-5′ linkage is made between the branch point nucleotide and the 5′ splice site. The free 3′ end of the 5′ exon then displaces the 3′ splice site. This liberates the so-called lariat structure, which is degraded.

The Branch Point Attack in More Detail

A mature mRNA transcript looks like this

Alternative Splicing

The pre-mRNA contains the introns and the exons encoded in the DNA. For the mRNA to produce a functional protein, the introns must be removed. In removing the introns, a variety of exon combinations are possible, i.e., different combinations of exons may be joined together to generate different forms of the same protein.
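A minimal sketch of how exon combinations multiply: the hypothetical functions below join exons in order and enumerate every isoform that keeps the first and last exon while including or skipping each internal exon. The exon strings are placeholders, and the sketch assumes at least two exons.

```python
from itertools import combinations

def splice(exons, keep):
    """Join the exons whose indices are listed in `keep`, in order."""
    return "".join(exons[i] for i in keep)

def skipping_isoforms(exons):
    """All isoforms keeping the first and last exon plus any subset of internal exons."""
    internal = range(1, len(exons) - 1)
    isoforms = []
    for n in range(len(internal) + 1):
        for subset in combinations(internal, n):
            isoforms.append(splice(exons, [0, *subset, len(exons) - 1]))
    return isoforms

exons = ["AAA", "CCC", "GGG", "UUU"]  # hypothetical placeholder exons
isoforms = skipping_isoforms(exons)
print(isoforms)
```

With k internal cassette exons this simple model gives 2^k isoforms. Real genes are constrained by splicing signals and reading frame, so only some combinations are observed.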

Alternative splicing can generate different polypeptides

female

487 amino acid polypeptide

male

549 amino acid polypeptide

α-tropomyosin splicing in different cell types

A huge amount of diversity can be derived from a single gene

[Figure: Dscam gene structure; alternative exon counts in the four variable clusters: 12, 48, 33, 2]

Summary
A Drosophila homolog of human Down syndrome cell adhesion molecule (DSCAM), an immunoglobulin superfamily member, is required for the formation of axon pathways in the embryonic central nervous system. cDNA and genomic analyses reveal the existence of multiple forms of Dscam with a conserved architecture containing variable Ig and transmembrane domains. Alternative splicing can potentially generate more than 38,000 Dscam isoforms. This molecular diversity may contribute to the specificity of neuronal connectivity.
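The figure of "more than 38,000" isoforms can be reproduced from the four bracketed counts on the preceding slide. The sketch assumes these are the numbers of mutually exclusive alternatives in each variable exon cluster and that the clusters are chosen independently.

```python
import math

# Counts from the preceding slide (assumed: alternatives per variable exon cluster).
alternatives_per_cluster = [12, 48, 33, 2]

# One alternative is chosen per cluster, so the choices multiply.
isoforms = math.prod(alternatives_per_cluster)
print(isoforms)
```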

Forms of alternative splicing


Exon skipping / inclusion

Alternative 3′ splice site

Alternative 5′ splice site

Mutually exclusive exons

Intron retention
Figure legend: constitutive exon; alternatively spliced exons

Alt splicing as a mechanism of gene regulation

Functional domains can be added/subtracted: protein diversity
Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
Can modify the activity of transcription factors, affecting the expression of genes
Observed in nearly all metazoans
Estimated to occur in 30%-60% of human genes

How to study alternative splicing?

Mapping the human genome


Cloning and Sequencing
1953  DNA double helix (Watson and Crick)
1972  Recombinant DNA (Berg et al.)
1977  DNA sequencing (Maxam and Gilbert, and Sanger)
1980  Physical mapping by RFLPs (Botstein, Davis, and White)
1985  PCR (Mullis)
1986  Automated DNA sequencing machine (Hood and Smith)
1987  YACs (Burke, Olson, and Carle)
      Fluorescent chain-terminating dideoxynucleotides (DuPont)
      Commercial DNA sequencing machine (Applied Biosystems)
1991  Expressed Sequence Tag (EST, Venter et al.)
1992  Bacterial Artificial Chromosomes (BACs, Shizuya et al.)
1997  Capillary sequencing machine (Molecular Dynamics)

What is bioinformatics?
Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.

Mapping the human genome


Bioinformatics
1970  Global alignment, dynamic programming (Needleman and Wunsch)
1981  Local alignment (Smith and Waterman)
1990  Basic Local Alignment Search Tool (BLAST, Altschul et al.)
1994  Hidden Markov Model (HMM, Krogh et al.): protein domain, gene structure
1995  Phred/phrap (Phil Green and Brent Ewing)
      Phred: assigns a confidence score to each sequenced nucleotide
      Phrap: assembles sequences

Align genome and cDNA/EST: sim4, spidey, BLAT
Gene prediction: GeneScan, FgeneSH, Genie, GeneWise
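The Phred confidence score mentioned above is conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not give the formula, so the sketch below states it as the usual convention; the function names are illustrative.

```python
import math

def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p: float) -> float:
    """Convert an error probability into a Phred quality score."""
    return -10 * math.log10(p)

for q in (10, 20, 30):
    print(q, phred_to_error_prob(q))  # each 10 points is a tenfold drop in error rate
```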

Length of the human genome


[Figure: bar chart of chromosome length in Mbp (y-axis, 20 to 240) for each chromosome (x-axis: chromosome number 1-22, X, Y)]

Estimated base-pairs in the genome: 3,231,365,992

Basic characteristics of the human genome
It is estimated that 88% is in finished form.
Currently in finished form: 2,843,602,073 base-pairs
A 839,853,043
G 581,490,131
T 841,181,383
C 581,077,516
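The base counts on this slide can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above and derives the finished fraction and the overall GC content; the variable names are illustrative.

```python
# Per-base counts of the finished sequence, as quoted on the slide.
finished_counts = {
    "A": 839_853_043,
    "G": 581_490_131,
    "T": 841_181_383,
    "C": 581_077_516,
}
estimated_genome_size = 3_231_365_992

total = sum(finished_counts.values())
gc_fraction = (finished_counts["G"] + finished_counts["C"]) / total
finished_share = total / estimated_genome_size

print(total)                     # should equal the finished base-pair count
print(round(gc_fraction, 3))     # overall GC content of the finished sequence
print(round(finished_share, 2))  # fraction of the genome in finished form
```

The four counts sum exactly to the quoted 2,843,602,073 finished base-pairs, and the GC content comes out at roughly 41%.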


[Figure: plot of T, G, C and A along the genome; colors correspond to chromosomes; axis in Mbp (mega base-pairs), 10 to 70]

GC content of the human genome

IHGSC. Nature (2001) 409 860-921

The average chromosome 6

6p21: Major Histocompatibility Complex
167,000,000 bp
2,190 genes
61% of the genes have CpG islands 2 kb upstream and 1 kb downstream of the 5′ and 3′ ends of genes.
Mean gene length: 32,530 bp
Mean exon length: 318 bp
Mean transcripts per gene: 1.79
Mean exons per gene: 5.28
Mungall, A.J. et al. 2003. The DNA sequence and analysis of human chromosome 6. Nature 425: 805-811.

Human genome: individuals 0.1% different


For every person . . .

Lots of variation!
3.2 x 10^9 bp/genome x 0.001 changes/bp = 3.2 x 10^6 changes/genome
Two major types of variation:
SNPs, RFLPs
Repeated DNA - short to long repeats

Distribution of DNA According to Function

Genome structure: GC content

GC vs. genes

GC vs. introns/exons

Protein-coding genes

Recurrence of domains
Sequence definition Distinct regions of protein sequence that are highly conserved in evolution.

The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Pawson, T. et al., Trends in Cell Biology Vol.11 No.12 December 2001

Recurrence of chromosomal segments

Science v. 291, pp 1304-1351

Recurrence of segments

Science v. 291, pp 1304-1351

There are dead genes in the genome

www.people.virginia.edu/ ~rjh9u/hbmut.html

Finding genes -- computer searches


Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.
Criteria:
Protein start, stop signals, splicing signals . . .
Codon bias
Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)

Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .

Completed Genomes (as of 2004)


Virus: 1,377 viral genomes and 36 viroids

Organelle: mitochondria (~600); chloroplasts (~40)
Microbial: archaea (~20); bacteria (~700)
Eukaryota:
Yeast: Saccharomyces cerevisiae (baker's); Schizosaccharomyces pombe (fission)
Metazoa: Homo sapiens (human); Pan troglodytes (chimp); Mus musculus (mouse); Rattus norvegicus (rat); Gallus gallus (chicken); Drosophila melanogaster (fly); Anopheles gambiae (mosquito); Caenorhabditis elegans (worm); Fugu rubripes (puffer fish); Danio rerio (zebrafish)
Plants: Arabidopsis thaliana (thale cress); Oryza sativa (rice); Avena sativa (oat); Glycine max (soybean); Hordeum vulgare (barley); Lycopersicon esculentum (tomato); Triticum aestivum (bread wheat); Zea mays (corn)
Others: Encephalitozoon cuniculi; Guillardia theta nucleomorph; Plasmodium falciparum; Leishmania major

Similar genes are found across organisms

Protein kinase, cAMP-dependent, catalytic, alpha


Mus          TCTTAGACAAGCAGAAGGTGGTGAAGCTAAAGCAGATCGAGCACACTCTGAATGAGAAGC
Rat          TCTTGGACAAGCAGAAGGTGGTGAAGCTGAAGCAGATCGAGCACACTCTGAATGAGAAGC
Chinese      TCTTGGACAAACAGAAGGTGGTGAAGCTGAAGCAGATTGAGCACACTCTAAATGAGAAGC
Oryctolagus  TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATCGAGCACACCCTGAACGTTAAAC
Canis        TCCTCGACAAACAGAAGGTCGTGAAGCTGAAACAGATTGAGCATACCCTGAACGAAAAGC
Ovis         TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATTGAGCACACCCTGAACGAGAAGC
B.taurus     TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATTGAGCACACCCTGAATGAGAAGC
Homo         TCCTCGACAAACAGAAGGTGGTGAAACTGAAACAGATCGAACACACCCTGAATGAAAAGC
             ** * ***** ******** ***** ** ** ***** ** ** ** ** ** *  ** *
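The row of asterisks under the alignment marks columns where every species has the same base. A minimal sketch of how such a line is derived, using a hypothetical helper and only the first ten columns of three of the sequences above (Mus, Rat, Homo):

```python
def conservation_line(aligned: list[str]) -> str:
    """Return '*' for columns where all sequences agree and ' ' otherwise."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    # zip(*aligned) walks the alignment column by column.
    return "".join("*" if len(set(col)) == 1 else " " for col in zip(*aligned))

rows = ["TCTTAGACAA", "TCTTGGACAA", "TCCTCGACAA"]  # Mus, Rat, Homo (columns 1-10)
line = conservation_line(rows)
for row in rows:
    print(row)
print(line)
```

This sketch treats the sequences as already aligned without gaps. Alignment programs also have to handle gap characters when producing such a line.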

The Minimal Genome

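[Figure: Venn diagram comparing the gene sets of E. coli, H. influenzae and M. genitalium. Counts as extracted from the figure: 1,146; 889; 239; 1,129; 1; 18; 10.]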

Organism                  # Chromosomes  # Genes  Exons                            Introns
Mycoplasma genitalium                    500      500 (1/gene)
Deinococcus radiodurans                  3200     3500 (1.02/gene)
Saccharomyces cerevisiae  16             6200     6500 (1.04/gene)                 220
C. elegans                               18,000   91,000 (5/gene)                  73,000 (4/gene)
Drosophila melanogaster                  14,000   54,000 (4/gene)                  44,000 (3/gene; 60 bp/intron)
Arabidopsis thaliana                     25,000   133,000 (5/gene; 247 bp/exon)    107,000 (4/gene; 169 bp/intron)
Homo sapiens              23             30,000   310,000 (8+/gene; 455 bp/exon)   250,000 (7/gene; 3400 bp/intron)

Unassigned value from the # Chromosomes column: 61

Needles in Haystacks...
Only 2% of the human genome is coding regions
Intron-exon structure of genes:
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)

ESTs (Expressed Sequence Tags)

Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
May contain coding and/or non-coding regions

ESTs
Cells from a specific organ, tissue or developmental stage
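[Figure: cDNA synthesis]
1. mRNA extraction: 5′ ... AAAAAA 3′
2. Add oligo-dT primer (3′ TTTTTT 5′), which anneals to the poly-A tail
3. Reverse transcriptase synthesizes a DNA strand, giving an RNA-DNA hybrid
4. Ribonuclease H and DNA polymerase replace the RNA strand with DNA
5. Double-stranded cDNA: 5′ ... AAAAAA 3′ paired with 3′ ... TTTTTT 5′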

ESTs
Double-stranded cDNA: 5′ ... AAAAAA 3′ paired with 3′ ... TTTTTT 5′
Clone cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads from the two ends of a clone give a 5′ EST and a 3′ EST

Sampling the Transcriptome with ESTs


Genomic DNA -> primary transcript -> splicing -> splice variants
Oligo-dT primer and reverse transcriptase -> cDNA clones (double stranded)
EST sequences (single-pass sequence reads) from the 5′ and 3′ ends of each clone

Large scale EST-sequencing coupled to Genome sequencing

EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs (DB searches):
Known gene
Similar to known gene
Contaminant
Novel gene

dbEST release 20 February 2004


Number of public entries: 20,039,613

Summary by organism
Homo sapiens (human)                  5,472,005
Mus musculus + domesticus (mouse)     4,056,481
Rattus sp. (rat)                        583,841
Triticum aestivum (wheat)               549,926
Ciona intestinalis                      492,511
Gallus gallus (chicken)                 460,385
Danio rerio (zebrafish)                 450,652
Zea mays (maize)                        391,417
Xenopus laevis (African clawed frog)    359,901

EST lengths
[Figure: human EST length distribution (dbEST, Sep. 2003); marked length ~450 bp]

ESTs provide expression data


eVOC Ontologies (http://www.sanbi.ac.za/evoc/)

Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample from which the sample was prepared. Examples are normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.

J Kelso et al. Genome Research 2002

Exon Size
[Figure: exon size distribution, fungi vs. vertebrate; size bins 1-100, 100-200, 200-300, 300-500, >500; y-axis 0 to 35]

Intron Size
[Figure: intron size distribution, fungi vs. vertebrate; size bins <100, <200, <1 kbp, 1 to 5 kbp, >5 kbp; y-axis 0 to 70]

Intron Prevalence
[Figure: intron prevalence in yeast, fungi and mammal; y-axis 0 to 100; series label: >1]

Gene Finding Challenges


Need the correct reading frame
Introns can interrupt an exon in mid-codon

There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak

Codon Bias
Gene finders are often organism specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons)
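A 5th-order Markov chain predicts each base from the five bases before it, so its parameters are hexamer counts. The sketch below is one simple way to estimate and apply such a model. The add-one smoothing, the function names and the toy training sequence are all assumptions, not the method of any particular gene finder.

```python
import math
from collections import Counter

def train_markov(sequences, k=5, alphabet="ACGT"):
    """Estimate P(next base | previous k bases) with add-one smoothing."""
    context_counts, kmer_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - k):
            context_counts[seq[i:i + k]] += 1       # the k-base context
            kmer_counts[seq[i:i + k + 1]] += 1      # context plus next base (hexamer for k=5)

    def prob(context, base):
        return (kmer_counts[context + base] + 1) / (context_counts[context] + len(alphabet))

    return prob

def log_likelihood(seq, prob, k=5):
    """Sum of log P(base | previous k bases) over the sequence."""
    return sum(math.log(prob(seq[i:i + k], seq[i + k])) for i in range(len(seq) - k))

coding_model = train_markov(["ATGGCCCTGTGG"])  # toy training data
score = log_likelihood("ATGGCCCTGTGG", coding_model)
print(score)
```

In practice one model would be trained on known coding sequence and another on non-coding sequence, and a candidate region scored by the difference of the two log-likelihoods.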

Overpredicting Genes
Easy to predict all exons
Report all sequences flanked by ..AG and GT.. as exons
Sensitivity = 100%
Specificity ~ 0%
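The overprediction problem can be made concrete with a sketch that reports every span flanked by AG and GT and then scores the result. It assumes the gene-prediction convention in which specificity means TP / (TP + FP); the sequence and the single "true" exon are toy values.

```python
def naive_exons(seq: str) -> set:
    """Every (start, end) span preceded by AG and followed by GT."""
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    ends = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "GT"]
    return {(s, e) for s in starts for e in ends if s < e}

def sensitivity_specificity(predicted: set, actual: set) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity (assumed convention) = TP / (TP + FP)."""
    tp = len(predicted & actual)
    sensitivity = tp / len(actual) if actual else 0.0
    specificity = tp / len(predicted) if predicted else 0.0
    return sensitivity, specificity

seq = "TTAGCCCGTAAAGGGGGTTT"   # toy sequence
true_exons = {(4, 7)}          # hypothetical annotation: seq[4:7] == "CCC"
predicted = naive_exons(seq)
sens, spec = sensitivity_specificity(predicted, true_exons)
print(sorted(predicted), sens, spec)
```

Even in this 20-base toy sequence only one of the three reported spans is real. On a genome-scale sequence the number of AG/GT pairs grows so quickly that specificity approaches zero.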

Locating ORFs
The simplest method of predicting coding regions is to search for open reading frames (ORFs)
An open reading frame begins with a start (AUG) codon and ends with one of three stop codons
Six total reading frames

Locating ORFs
Example from HW#1:
AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA

Translation of the ORF starting at the first AUG: M E L V I S I S A L I I V E (stop at UAG)
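The homework example can be reproduced with a short ORF scan: find the first AUG, then translate codon by codon until an in-frame stop. This sketch assumes the standard genetic code; `first_orf` is a hypothetical helper that only examines the first AUG on the given strand.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code (assumed: NCBI translation table 1), in UCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def first_orf(rna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            return "".join(protein)
        protein.append(aa)
    return ""  # no in-frame stop codon, so not a complete ORF

example = "AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA"
peptide = first_orf(example)
print(peptide)
```

A fuller ORF finder would try every AUG in all six reading frames and keep ORFs above a length threshold.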

Locating ORFs
Prokaryotes:
DNA sequences coding for proteins are generally transcribed into mRNA, which is translated into protein with very little modification
Locating an open reading frame from a start codon to a stop codon can give a strong indication of protein-coding regions

Longer ORFs are more likely to predict protein-coding regions than shorter ORFs.
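One way to see why longer ORFs are more convincing: in random sequence, 3 of the 64 codons are stops, so the chance of reading n codons without hitting a stop shrinks geometrically. The sketch assumes all 64 codons are equally likely, which real genomes only approximate.

```python
def prob_no_stop(n_codons: int, stop_codons: int = 3, total_codons: int = 64) -> float:
    """Probability that n random codons in a row contain no stop codon."""
    return ((total_codons - stop_codons) / total_codons) ** n_codons

for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under this assumption fewer than 1% of random 100-codon stretches are free of stop codons, so an open frame of that length or longer is unlikely to arise by chance.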

Locating ORFs
Eukaryotes:
mRNA undergoes processing to remove introns before the protein is translated
The ORF corresponding to a gene may be interrupted by intronic regions that contain stop codons
Posttranscriptional modification makes gene prediction more difficult

Locating Similar Sequences


Take the new DNA sequence and translate it in all six reading frames
Compare each translation to protein sequence databases
This locates known open reading frames
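The six-frame translation step can be sketched as follows: three frames on the given strand and three on its reverse complement. The standard genetic code is assumed and the function names are illustrative. The database comparison itself is not shown.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (assumed: NCBI translation table 1), in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def six_frame_translation(dna: str) -> dict[str, str]:
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

frames = six_frame_translation("ATGGCCCTGTGG")  # start of the insulin coding region
for name, peptide in frames.items():
    print(name, peptide)
```

Each of the six translations would then be searched against a protein database, so a match identifies both the coding strand and the frame.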

Locating Similar Sequences

Top ten challenges for bioinformatics


[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing

[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli

[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes


[5] Accurate ab initio protein structure prediction
