Intro To Molbiol and Tics

The Hierarchical Structure of an organism
Level of Organization
organism organs tissues cells
chromosomes
Chromosomes are made up of protein and DNA
http://www.brooklyn.cuny.edu/bc/ahp/LAD/C4/C4_Chromosomes.html
Chromosomes
Figure 4-14. Two closely related species of deer with very different chromosome numbers. In the evolution of the Indian muntjac, initially separate chromosomes fused, without having a major effect on the animal. These two species have roughly the same number of genes. (Adapted from M.W. Strickberger, Evolution, 3rd edition, 2000, Sudbury, MA: Jones & Bartlett Publishers)
DNA is a macromolecule composed of four basic units
distancelearning.ksi.edu/demo/bio378/lecture.html
A chromosome is a long sequence
A simple view of the central dogma
All organisms on earth use the same operating system
Central Dogma of Gene Expression
Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU
Genetic information is carried as a three-base genetic code

Four bases (A, G, C, T/U) must encode 20 amino acids, so a combination of bases is required: 4^3 = 64 possible triplets.
Each triplet is called a CODON, and reading must begin at a precise site.
Of the 64 codons, 61 specify individual amino acids and three are STOP codons.
The start codon is AUG (methionine).
The code is universal, synonymous, and degenerate.
Reading frame; wobble at the 3rd base of the codon; frameshifts, deletions, and insertions (MUTATIONS).
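The codon arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the 64-letter string is the standard genetic code written in the conventional T, C, A, G ordering, and the variable names are chosen here for illustration only.

```python
from itertools import product

BASES = "TCAG"  # conventional base ordering for the standard codon table
# One letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4^3 triplets to its amino acid.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
amino_acids = {aa for aa in CODON_TABLE.values() if aa != "*"}

print(len(CODON_TABLE), "codons,", len(sense_codons), "sense,",
      len(stop_codons), "stop,", len(amino_acids), "amino acids")
```

Because 61 sense codons encode only 20 amino acids, most amino acids have several synonymous codons, which is what "degenerate" means here.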
The genetic code
All organisms use the same genetic code
Three different reading frames
Codon Bias
Gene finders are often organism-specific.
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons).
Important Features
DNA contains the "genetic template" for proteins.
DNA is found in the nucleus.
Protein synthesis occurs in the cytoplasm, on the ribosome.
"Genetic information" must be transferred to the cytoplasm, where proteins are synthesized.
There are between 30,000 and 40,000 genes in the human genome
DNA *
genetic code
protein
The human gene inventory corresponds to ~1.5% of the genome (coding regions)
The average gene
The average human gene is 1.4 kb long, but distributed in exons over an average of 30 kb. There are about 11 genes per Mb of DNA, with chromosome 19 having the highest density (close to 30 genes per Mb).
Ashurst, J.L. and Collins, J.E. 2003. Annu Rev Genomics Hum Genet 4: 69-88.
Component Regions of RNA
The leading and trailing regions, the 5' and 3' untranslated regions (UTRs), are important for cellular positioning, message stability, and the efficiency with which protein can be made from the mRNA. The open reading frame codes for the protein.
Most eukaryotic genes contain introns

[Figure: structure and processing of the insulin gene. Gene (scale bar 100 bp): 5' end, promoter, transcription start (+1), exon 1, intron 1, exon 2, intron 2, exon 3, poly-A signal (+1411), 3' end. The primary transcript (1431 nt) undergoes capping, polyadenylation, and splicing to give the mRNA (465 nt + poly-A tail).]
Insulin Gene Primary Transcript Sequence

5' end (position 1): exon 1, followed by intron 1 (exons in uppercase, introns in lowercase)
AGC CC U CC AGG AC AGGCU GC AUC AG AAG AGGCC AUC AAGC AGgucuguuccaagggccuuugcgucaggugggc uc aggguuccaggguggcuggaccccaggccccagcucugcagcagggaggacguggcugggcucgugaagcaugugg gggugagcccaggggccccaaggcagggcaccuggccuucagccugccucagcccugccugucucccagAUCACUG UC C U U C U GC C AU GGC C C U GU GG AU GCGC C U C C U GC C C CU GC U GGC GC U GC UGGC C C U C U GG GG AC C U G AC C C AGC C GC AGC C U U U G U G A AC C A AC AC C U GU GC GGC U C AC AC C U G GU GG A AGC U C U C U AC C U AG U G U G C GGGG A AC G AGG C U UC UUC U AC AC AC C C A AG ACCC GC C GGG AGGC AG AGG ACC U GC AGGgugagccaaccgcccauugcugccccuggcc gcccccagccacccccugcuccuggcgcucccacccagcaugggcagaagggggcaggaggcugccacccagcagg g g g u c a g g u g c a c u u u u u u a a a a a g a a g u u c u c u u g g uc a c g u c c u a a a a g u g a c c a g c u c c c u g u g g c c c a g u c a gaaucucagccugaggacgguguuggcuucggcagccccgagauacaucagagggugggcacgcuccucccuccac ucgccccucaaacaaaugccccgcagcccauuucuccacccucauuugaugaccgcagauucaaguguuuuguuaa guaaaguccugggugaccuggggucacagggugccccacgcugccugccucugggcgaacaccccaucacgcccgg aggagggcguggcugccugccugagugggccagaccccugucgccagccucacggcagcuccauagucaggagaug gggaagaugcuggggacaggcccuggggagaaguacugggaucaccuguucaggcucccacugugacgcugccccg gggcgggggaaggaggugggacaugugggcguuggggccuguagguccacacccagugugggugacccucccucua accuggguccagcccggcuggagaugggugggagugcgaccuagggcuggcgggcaggcgggcacugugucucccu gacuguguccuccugugucccucugccucgccgcuguuccggaaccugcucugcgcggcacguccuggcagUGGGG C AG GU GG AG C U G GG C G G GG GC C C U G G U G C AG G C AGC C U GC AGC C C U U GG C C C U G G AG GG G U C C C U G C A G A A GC GU G
exon 2
intron 2
exon 3
GC AU U G U G G A AC A AU GC U GU AC C AG C AU C U GC U C C C U C U AC C AGC U G G AG A AC U AC U G C A AC U AG AC G C AGC C U G C AG G C AG C C C C AC AC C C G C C G C C U C C U G C AC C G AG AG AG AU G G A AU A A AG C C C U U G A AC C A GC
3' end (position 1431)
Insulin Gene mature mRNA sequence

5' end (position 1); the 5' UTR ends at position 59
AG C C C U C C A G G AC A G G C U G C AU C AG A A G AG G C C AU C A AG C AG AU C AC U G U C C U U C U G C C AU G G C C C U G U G G AU G C G C C U C C U G C C C C U G C U G G C G C U G C U G G C C C U C U G G G G AC C U G AC C C A G C C G C AG C C U U U G U G A AC C A AC A C C U G U G C G G C U C AC A C C U G G U G G A AG C U C U C U AC C U AG U G U G C G G G G A AC G AG G C U U C U U C U A C AC A C C C A AG AC C C G C C G G G AG G C AG AG G A C C U G C A G G U G G G G C A G G U G G A G C U G G G C G G G G G C C C U G G U G C AG G C AG C C U G C A G C C C U U G G C C C U G G AG G G G U C C C U G C AG A AG C G U G G C AU U G U G G A AC A AU G C U G U AC C AG C AU C U G C U C C C U C U AC C AG C U G G AG A A C U AC U G C A A C U AG AC G C A G C C U G C A G G C AG C C C C AC A C C C G C C G C C U C C U G C A C C G AG AG AG AU G G A AU A A A G C C C U UG A AC C AGC AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Positions 1-59 = 5' UTR; 60-388 = protein-coding region; 392-465 = 3' UTR
RNA Processing, including capping and splicing, is co-transcriptional
TYPES OF INTRONS
GU-AG INTRONS; AU-AC INTRONS
The GT-AG (or GU-AG) Rule

Intron boundaries are defined by the nucleotides GU (GT in DNA) at the 5' end and AG at the 3' end. This is called the GT-AG rule.
Splicing enhancers (and silencers) are found in the exons.
The majority of animal and plant introns are GT-AG introns, which are removed by the major spliceosome.
However, plants and animals (but not fungi) have a second, alternative spliceosome that is responsible for splicing non-canonical introns.
After removal from the primary transcript, virtually all introns are degraded.
Alternative splicing is thought to explain our complexity despite our limited number of protein-coding genes (~30,000).
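The boundary rule can be expressed as a simple check on the first and last two nucleotides of a candidate intron. The sketch below is illustrative only: the function name `intron_class` is made up here, and real splice-site recognition also depends on the branch point and surrounding sequence context, which this ignores.

```python
def intron_class(intron: str) -> str:
    """Classify an intron by its terminal dinucleotides (DNA or RNA input)."""
    seq = intron.upper().replace("T", "U")  # normalise to the RNA alphabet
    if len(seq) < 4:
        raise ValueError("intron too short to contain both splice sites")
    ends = (seq[:2], seq[-2:])
    if ends == ("GU", "AG"):
        return "GU-AG"          # canonical intron
    if ends == ("AU", "AC"):
        return "AU-AC"          # the second intron type named on the slide
    return "non-canonical"


print(intron_class("gucuguuccaagggccuuugcgucag"))  # toy intron, starts GU and ends AG
```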
HOW THE SPLICING STARTS?
Andrew P. Read Human Molecular Genetics
The Splicing Reaction

An unusual 5'-2' linkage is made between the branch point nucleotide and the 5' splice site. Then the free 3' end of the 5' exon attacks the 3' splice site and displaces the intron. This liberates the so-called lariat structure, which is degraded.
The Branch Point Attack in More Detail
A mature mRNA transcript looks like this
Alternative Splicing
The pre-mRNA contains the introns and the exons encoded in the DNA. For the mRNA to produce a functional protein, the introns must be removed. In removing the introns, a variety of exon combinations are possible, i.e., different combinations of exons may be joined together to generate different forms of the same protein.
Alternative splicing can generate different polypeptides
female
487 amino acid polypeptide
male
549 amino acid polypeptide
α-tropomyosin splicing in different cell types
A huge amount of diversity can be derived from a single gene
[Figure: the gene has four clusters of alternative exons, with 12, 48, 33, and 2 alternatives respectively.]
Summary
A Drosophila homolog of human Down syndrome cell adhesion molecule (DSCAM), an immunoglobulin superfamily member, is required for the formation of axon pathways in the embryonic central nervous system. cDNA and genomic analyses reveal the existence of multiple forms of Dscam with a conserved architecture containing variable Ig and transmembrane domains. Alternative splicing can potentially generate more than 38,000 Dscam isoforms. This molecular diversity may contribute to the specificity of neuronal connectivity.
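The figure of "more than 38,000" isoforms follows from multiplying the number of alternatives at each variable position. The sketch below assumes that the numbers 12, 48, 33, and 2 shown on the slide are the counts of mutually exclusive alternative exons in each cluster, and that one exon is chosen independently from each cluster.

```python
from math import prod

# Number of alternative exons in each variable cluster (taken from the slide).
alternative_exons = [12, 48, 33, 2]

# One exon is picked per cluster, so the counts multiply.
isoforms = prod(alternative_exons)
print(isoforms)  # 38016
```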
Forms of alternative splicing

Exon skipping / inclusion
Alternative 3' splice site
Alternative 5' splice site
Mutually exclusive exons
Intron retention
Constitutive exon Alternatively spliced exons
Alt splicing as a mechanism of gene regulation
Functional domains can be added or subtracted, increasing protein diversity.
It can introduce early stop codons, resulting in truncated proteins or unstable mRNAs.
It can modify the activity of transcription factors, affecting the expression of genes.
It is observed in nearly all metazoans.
It is estimated to occur in 30%-60% of human genes.
How to study alternative splicing?
Mapping the human genome

Cloning and Sequencing
1953  DNA double helix (Watson and Crick)
1972  Recombinant DNA (Berg et al.)
1977  DNA sequencing (Maxam and Gilbert; Sanger)
1980  Physical mapping by RFLPs (Botstein, Davis, and White)
1985  PCR (Mullis)
1986  Automated DNA sequencing machine (Hood and Smith)
1987  YACs (Burke, Olson, and Carle); fluorescent chain-terminating dideoxynucleotides (DuPont); commercial DNA sequencing machine (Applied Biosystems)
1991  Expressed Sequence Tag (EST, Venter et al.)
1992  Bacterial Artificial Chromosomes (BACs, Shizuya et al.)
1997  Capillary sequencing machine (Molecular Dynamics)
What is bioinformatics?
Interface of biology and computers.
Analysis of proteins, genes and genomes using computer algorithms and computer databases.
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
Mapping the human genome

Bioinformatics
1970  Global alignment, dynamic programming (Needleman and Wunsch)
1981  Local alignment (Smith and Waterman)
1990  Basic Local Alignment Search Tool (BLAST, Altschul et al.)
1994  Hidden Markov Model (HMM, Krogh et al.): protein domains, gene structure
1995  Phred/Phrap (Phil Green and Brent Ewing)
      Phred: assigns a confidence score to each sequenced nucleotide
      Phrap: assembles sequences
Aligning genome and cDNA/EST: sim4, Spidey, BLAT
Gene prediction: GENSCAN, FGENESH, Genie, GeneWise
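The Phred confidence score mentioned above is conventionally defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The slide does not spell out the formula, so the sketch below is an illustration of that standard definition; the function names are chosen here and are not part of the Phred program itself.

```python
import math


def phred_quality(error_probability: float) -> float:
    """Convert a base-call error probability into a Phred quality score."""
    if not 0.0 < error_probability <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(error_probability)


def error_probability(quality: float) -> float:
    """Convert a Phred quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


# A 1-in-1000 chance of a wrong base call corresponds to a quality of about 30.
print(round(phred_quality(0.001)), error_probability(20))
```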
Length of the human genome

[Figure: bar chart of chromosome length (axis from 20 to 240) against chromosome number (1-22, X, Y)]
Estimated base-pairs in the genome: 3,231,365,992
Basic characteristics of the human genome
It is estimated that 88% is in finished form.
Currently in finished form: 2,843,602,073 base-pairs
A 839,853,043
G 581,490,131
T 841,181,383
C 581,077,516
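These counts are internally consistent, which a short calculation confirms. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative.

```python
# Base counts of the finished sequence, as quoted on the slide.
base_counts = {
    "A": 839_853_043,
    "G": 581_490_131,
    "T": 841_181_383,
    "C": 581_077_516,
}
estimated_total = 3_231_365_992  # estimated base-pairs in the genome

finished = sum(base_counts.values())
gc_fraction = (base_counts["G"] + base_counts["C"]) / finished
finished_fraction = finished / estimated_total

print(f"finished: {finished:,} bp")
print(f"GC content of finished sequence: {gc_fraction:.1%}")
print(f"fraction finished: {finished_fraction:.0%}")
```

The four base counts add up to the finished total, the finished fraction comes out at 88%, and the GC content of the finished sequence is roughly 41%.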
[Figure: chart with legend T, G, C, A; axis in Mbp (mega base-pairs), ticks 10 to 70; colors correspond to chromosomes]
GC content of the human genome
IHGSC. Nature (2001) 409 860-921
The average chromosome 6
6p21: Major Histocompatibility Complex
167,000,000 bp
2,190 genes
61% of the genes have CpG islands 2 kb upstream and 1 kb downstream of the 5' and 3' ends of genes.
Mean gene length: 32,530 bp
Mean exon length: 318 bp
Mean transcripts per gene: 1.79
Mean exons per gene: 5.28
Mungall, A.J. et al. 2003. The DNA sequence and analysis of human chromosome 6. Nature 425: 805-811.
Human genome: individuals 0.1% different

For every person . . .
Lots of variation!
3.2 × 10^9 bp/genome × 0.001 changes/bp = 3.2 × 10^6 changes/genome
Two major types of variation:
SNPs, RFLPs
Repeated DNA - short to long repeats
Distribution of DNA According to Function
Genome structure: GC content
GC vs. genes
GC vs. introns/exons
Protein-coding genes
Recurrence of domains
Sequence definition: distinct regions of protein sequence that are highly conserved in evolution.
The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Pawson, T. et al., Trends in Cell Biology Vol.11 No.12 December 2001
Recurrence of chromosomal segments
Science v. 291, pp 1304-1351
Recurrence of segments
Science v. 291, pp 1304-1351
There are dead genes in the genome
www.people.virginia.edu/~rjh9u/hbmut.html
Finding genes -- computer searches

Computer searches locate most genes in prokaryotes, Archaea, and yeast, but only ~1/3 of human genes are identified correctly.
Criteria:
Protein start and stop signals, splicing signals . . .
Codon bias
Comparisons to other genomes (mouse, rat, fish, fly, mosquito, worm, yeast . . .)
Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .
Completed Genomes (as of 2004)

Virus: 1,377 viral genomes and 36 viroids
Organelle: mitochondria (~600); chloroplasts (~40)
Microbial: archaea (~20); bacteria (~700)
Eukaryota:
Yeast: Saccharomyces cerevisiae (baker's); Schizosaccharomyces pombe (fission)
Metazoa: Homo sapiens (human); Pan troglodytes (chimp); Mus musculus (mouse); Rattus norvegicus (rat); Gallus gallus (chicken); Drosophila melanogaster (fly); Anopheles gambiae (mosquito); Caenorhabditis elegans (worm); Fugu rubripes (puffer fish); Danio rerio (zebrafish)
Plants: Arabidopsis thaliana (thale cress); Oryza sativa (rice); Avena sativa (oat); Glycine max (soybean); Hordeum vulgare (barley); Lycopersicon esculentum (tomato); Triticum aestivum (bread wheat); Zea mays (corn)
Others: Encephalitozoon cuniculi; Guillardia theta nucleomorph; Plasmodium falciparum; Leishmania major
Similar genes are found across organisms
Protein kinase, cAMP-dependent, catalytic, alpha

Mus          TCTTAGACAAGCAGAAGGTGGTGAAGCTAAAGCAGATCGAGCACACTCTGAATGAGAAGC
Rat          TCTTGGACAAGCAGAAGGTGGTGAAGCTGAAGCAGATCGAGCACACTCTGAATGAGAAGC
Chinese      TCTTGGACAAACAGAAGGTGGTGAAGCTGAAGCAGATTGAGCACACTCTAAATGAGAAGC
Oryctolagus  TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATCGAGCACACCCTGAACGTTAAAC
Canis        TCCTCGACAAACAGAAGGTCGTGAAGCTGAAACAGATTGAGCATACCCTGAACGAAAAGC
Ovis         TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATTGAGCACACCCTGAACGAGAAGC
B.taurus     TCCTCGACAAACAGAAGGTGGTGAAGCTGAAACAGATTGAGCACACCCTGAATGAGAAGC
Homo         TCCTCGACAAACAGAAGGTGGTGAAACTGAAACAGATCGAACACACCCTGAATGAAAAGC
             ** * ***** ******** ***** ** ** ***** ** ** ** ** ** *  ** *
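The row of asterisks under the alignment marks columns where every sequence has the same base. A minimal sketch of how such a line can be derived is shown below; the function name `conservation_line` is hypothetical, and the input is a toy alignment built from the first six bases of three of the sequences above.

```python
def conservation_line(aligned: list[str]) -> str:
    """Return '*' for columns identical in all sequences, ' ' otherwise."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(
        "*" if len(set(column)) == 1 else " "
        for column in zip(*aligned)  # iterate over alignment columns
    )


toy_alignment = ["TCTTAG", "TCTTGG", "TCCTCG"]
marks = conservation_line(toy_alignment)
print(marks)
print(f"{marks.count('*') / len(marks):.0%} of columns fully conserved")
```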
The Minimal Genome
[Figure: Venn diagram comparing the gene sets of E. coli, H. influenzae, and M. genitalium. Numbers in the extracted figure: 1,146; 889; 239; 1,129; 1; 18; 10.]
Organism                  # Chromosomes  # Genes  Exons                           Introns
Mycoplasma genitalium                    500      500 (1/gene)
Deinococcus radiodurans                  3200     3500 (1.02/gene)
Saccharomyces cerevisiae  16             6200     6500 (1.04/gene)
C. elegans                               18,000   91,000 (5/gene)                 73,000 (4/gene)
Drosophila melanogaster                  14,000   54,000 (4/gene)                 44,000 (3/gene; 60 bp/intron)
Arabidopsis thaliana                     25,000   133,000 (5/gene; 247 bp/exon)   107,000 (4/gene; 169 bp/intron)
Homo sapiens              23             30,000   310,000 (8+/gene; 455 bp/exon)  250,000 (7/gene; 3400 bp/intron)
(The values 61 and 220 also appear in the extracted table, but their cells cannot be determined.)
Needles in Haystacks...
Only 2% of the human genome is coding regions.
Intron-exon structure of genes:
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)
ESTs (Expressed Sequence Tags)
Single-pass sequencing of a small (end) piece of cDNA.
Typically 200-500 nucleotides long.
It may contain coding and/or non-coding regions.
ESTs
[Figure: making ESTs]
Cells from a specific organ, tissue or developmental stage
mRNA extraction (5' ... AAAAAA 3')
Add oligo-dT primer (3' TTTTTT 5') and reverse transcriptase: RNA-DNA hybrid
Ribonuclease H and DNA polymerase: second-strand synthesis
Double-stranded cDNA
Clone cDNA into a vector
Multiple cDNA clones
Single-pass sequence reads: 5' EST and 3' EST
Sampling the Transcriptome with ESTs

[Figure: genomic DNA, primary transcript, splicing, splice variants; oligo-dT primer and reverse transcriptase; cDNA clones (double stranded); EST sequences (single-pass sequence reads from the 5' and 3' ends)]
Large scale EST-sequencing coupled to Genome sequencing
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs (DB searches): known gene; similar to known gene; contaminant; novel gene
dbEST release 20 February 2004

Number of public entries: 20,039,613
Summary by organism
Homo sapiens (human)                   5,472,005
Mus musculus + domesticus (mouse)      4,056,481
Rattus sp. (rat)                         583,841
Triticum aestivum (wheat)                549,926
Ciona intestinalis                       492,511
Gallus gallus (chicken)                  460,385
Danio rerio (zebrafish)                  450,652
Zea mays (maize)                         391,417
Xenopus laevis (African clawed frog)     359,901
EST lengths
Human EST length distribution (dbEST, Sep. 2003): ~450 bp
ESTs provide expression data

eVOC Ontologies (http://www.sanbi.ac.za/evoc/)
Anatomical System: The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
Cell Type: The precise cell type from which a sample was prepared. Examples are B-lymphocyte, fibroblast and oocyte.
Pathology: The pathological state of the sample from which the sample was prepared. Examples are normal, lymphoma, and congenital.
Developmental Stage: The stage during the organism's development at which the sample was prepared. Examples are embryo, fetus, and adult.
Pooling: Indicates whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
J Kelso et al. Genome Research 2002
Exon Size
[Figure: exon size distribution in bins 1-100, 100-200, 200-300, 300-500, and >500, for fungi and vertebrates]
Intron Size
[Figure: intron size distribution in bins <100, <200, <1 kbp, 1 to 5 kbp, and >5 kbp, for fungi and vertebrates]
Intron Prevalence
[Figure: intron prevalence (scale 0-100) for yeast, fungi, and mammals; series label ">1"]
Gene Finding Challenges

Need the correct reading frame
Introns can interrupt an exon in midcodon
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak
Codon Bias
Gene finders are often organism-specific.
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons).
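A 5th-order Markov chain predicts each base from the five bases before it, so its parameters are estimated from hexamer counts. The sketch below shows one way this could look; it is a simplified, homogeneous model with add-one smoothing (an assumption made here), whereas practical gene finders typically use separate models for each codon position. All function names are illustrative.

```python
import math
from collections import Counter


def train_markov(sequences, order=5, alphabet="ACGT", pseudocount=1.0):
    """Estimate log P(base | previous `order` bases) from training sequences."""
    context_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for i in range(order, len(seq)):
            context = seq[i - order:i]
            context_counts[context] += 1
            transition_counts[context, seq[i]] += 1  # hexamer = context + base

    def log_prob(context, base):
        numerator = transition_counts[context, base] + pseudocount
        denominator = context_counts[context] + pseudocount * len(alphabet)
        return math.log(numerator / denominator)

    return log_prob


def log_likelihood(seq, log_prob, order=5):
    """Sum the conditional log-probabilities along the sequence."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def coding_score(seq, coding_model, background_model, order=5):
    """Log-odds score: positive values favour the coding model."""
    return (log_likelihood(seq, coding_model, order)
            - log_likelihood(seq, background_model, order))


# Toy training data, purely for illustration.
coding = train_markov(["ATGGCC" * 20])
background = train_markov(["ATATAT" * 20])
print(coding_score("ATGGCCATGGCCATGGCC", coding, background))
```

Training the two models on sequences from a particular organism is what makes such a gene finder organism-specific, since hexamer frequencies (and codon bias) differ between genomes.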
Overpredicting Genes
It is easy to predict all exons: report all sequences flanked by ..AG and GT.. as exons.
Sensitivity = 100%
Specificity ~ 0%
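The trade-off can be made concrete with a small calculation. The sketch below assumes the convention common in gene-prediction evaluation, where sensitivity is the fraction of true exons that were predicted and specificity is the fraction of predicted exons that are true; the coordinates are invented for illustration.

```python
def exon_accuracy(predicted, true):
    """Return (sensitivity, specificity) for sets of (start, end) exons."""
    predicted, true = set(predicted), set(true)
    true_positives = len(predicted & true)
    sensitivity = true_positives / len(true) if true else 0.0
    # Gene-finding convention (assumed here): specificity = TP / (TP + FP).
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Two real exons, and an over-predictor that reports every candidate interval.
true_exons = {(10, 20), (50, 80)}
over_predicted = {(s, e) for s in range(0, 100, 10) for e in range(s + 10, 101, 10)}

sensitivity, specificity = exon_accuracy(over_predicted, true_exons)
print(f"sensitivity = {sensitivity:.0%}, specificity = {specificity:.1%}")
```

Reporting every candidate recovers all true exons but buries them among false positives, which is why a useful gene finder has to score candidates rather than list them.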
Locating ORFs
The simplest method of predicting coding regions is to search for open reading frames (ORFs).
An open reading frame begins with a start (AUG) codon and ends with one of three stop codons.
There are six reading frames in total (three on each strand).
Locating ORFs
Example from HW#1:
AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA
M E L V I S I S A L I I V E (ORF from the AUG start codon to the UAG stop codon)
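The example can be reproduced with a small ORF scanner. This is a minimal sketch that scans only the three forward frames of the given RNA and reports every AUG-to-stop stretch; the function name `find_orfs` is made up here, and the codon table is the standard genetic code written in U, C, A, G order.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_orfs(rna: str):
    """Return (start_index, peptide) for each AUG...stop ORF in frames 0, 1, 2."""
    rna = rna.upper().replace("T", "U")
    orfs = []
    for frame in range(3):
        peptide, start = None, None
        for i in range(frame, len(rna) - 2, 3):
            aa = CODON_TABLE[rna[i:i + 3]]
            if peptide is None:
                if aa == "M":              # open an ORF at the first AUG
                    peptide, start = ["M"], i
            elif aa == "*":                # close the ORF at a stop codon
                orfs.append((start, "".join(peptide)))
                peptide = None
            else:
                peptide.append(aa)
    return orfs


sequence = "AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA"
print(find_orfs(sequence))
```

For the homework sequence this finds a single ORF, starting at the AUG at 0-based index 6.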
Locating ORFs
Prokaryotes: DNA sequences coding for proteins are generally transcribed into mRNA, which is translated into protein with very little modification.
Locating an open reading frame from a start codon to a stop codon can give a strong indication of a protein-coding region.
Longer ORFs are more likely to predict protein-coding regions than shorter ORFs.
Locating ORFs
Eukaryotes: mRNA undergoes processing to remove introns before the protein is translated.
The ORF corresponding to a gene may be interrupted by intronic regions that contain stop codons.
Post-transcriptional modification makes gene prediction more difficult.
Locating Similar Sequences

Take the new DNA sequence and translate it in all six reading frames.
Compare each translation to protein sequence databases.
This locates known open reading frames.
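The first step, translating in six frames, can be sketched as follows. The database comparison itself is not shown; the function names and the frame labels (+1 to +3, -1 to -3) are conventions chosen for this illustration, and the codon table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translations would then be used as a query against a protein database.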
Locating Similar Sequences
Top ten challenges for bioinformatics

[1] Precise models of where and when transcription will occur in a genome (initiation and termination)
[2] Precise, predictive models of alternative RNA splicing

[3] Precise models of signal transduction pathways; ability to predict cellular responses to external stimuli
[4] Determining protein:DNA, protein:RNA, protein:protein recognition codes

[5] Accurate ab initio protein structure prediction
