Level of Organization
chromosomes
http://www.brooklyn.cuny.edu/bc/ahp/LAD/C4/C4_Chromosomes.html
Chromosomes
Figure 4-14. Two closely related species of deer with very different chromosome numbers. In the evolution of the Indian muntjac, initially separate chromosomes fused, without having a major effect on the animal. These two species have roughly the same number of genes. (Adapted from M.W. Strickberger, Evolution, 3rd edition, 2000, Sudbury, MA: Jones & Bartlett Publishers.)
distancelearning.ksi.edu/demo/bio378/lecture.html
Genomics
genome
transcriptome
proteome
metabolome
physiome
Dinner discussion: Integrative Bioinformatics & Genomics VU
Starting codon is AUG (methionine)
Code is universal, synonymous, degenerate
Reading frame
3rd base in codon: wobble
Frameshifts/deletions/insertions (mutations)
Codon Bias
Gene finders are often organism-specific
Coding regions are often modelled by a 5th-order Markov chain (hexamers/dicodons)
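To make the hexamer idea concrete, here is a minimal sketch of a 5th-order Markov model, assuming a toy training set. The names `train_markov5` and `log_likelihood` and the pseudocount are illustrative choices, not taken from any particular gene finder.

```python
import math
from collections import defaultdict

ORDER = 5  # each base is conditioned on the previous five, i.e. hexamer counts


def train_markov5(sequences, pseudocount=1.0):
    """Estimate P(next base | previous 5 bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(ORDER, len(seq)):
            counts[seq[i - ORDER:i]][seq[i]] += 1

    def prob(context, base):
        # Pseudocounts keep unseen hexamers from getting probability zero.
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + 4 * pseudocount
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob):
    """Log-probability of seq under the model, skipping the first 5 bases."""
    return sum(
        math.log(prob(seq[i - ORDER:i], seq[i]))
        for i in range(ORDER, len(seq))
    )
```

A gene finder built this way would typically train one model on known coding sequence from the target organism and another on non-coding sequence, then score a candidate region by the difference of the two log-likelihoods. Because the hexamer statistics reflect codon bias, a model trained on one organism may transfer poorly to another.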
Important Features
DNA contains the "genetic template" for proteins.
DNA is found in the nucleus.
Protein synthesis occurs in the cytoplasm, on the ribosome.
"Genetic information" must be transferred to the cytoplasm, where proteins are synthesized.
DNA *
genetic code
protein
The human gene inventory corresponds to ~1.5% of the genome (coding regions)
The average human gene is 1.4 kb long, but distributed in exons over an average of 30 kb. There are about 11 genes per Mb of DNA, with chromosome 19 having the highest density (close to 30).
Ashurst, J.L. and Collins, J.E. 2003. Annu Rev Genomics Hum Genet 4: 69-88.
The leading and trailing regions, the 5′ and 3′ untranslated regions (UTRs), are important for cellular positioning, message stability, and the efficiency with which protein can be made from the mRNA. The open reading frame codes for the protein.
Insulin gene
[Figure: map of the insulin gene (scale bar 100 bp). Labels: 5′ end, intron 1, exon 2, intron 2, exon 3, 3′ end, poly-A signal at +1411. The primary transcript (1431 nt) undergoes capping, polyadenylation, and splicing; the mature mRNA ends in a poly-A tail.]
AGC CC U CC AGG AC AGGCU GC AUC AG AAG AGGCC AUC AAGC AGgucuguuccaagggccuuugcgucaggugggc uc aggguuccaggguggcuggaccccaggccccagcucugcagcagggaggacguggcugggcucgugaagcaugugg gggugagcccaggggccccaaggcagggcaccuggccuucagccugccucagcccugccugucucccagAUCACUG UC C U U C U GC C AU GGC C C U GU GG AU GCGC C U C C U GC C C CU GC U GGC GC U GC UGGC C C U C U GG GG AC C U G AC C C AGC C GC AGC C U U U G U G A AC C A AC AC C U GU GC GGC U C AC AC C U G GU GG A AGC U C U C U AC C U AG U G U G C GGGG A AC G AGG C U UC UUC U AC AC AC C C A AG ACCC GC C GGG AGGC AG AGG ACC U GC AGGgugagccaaccgcccauugcugccccuggcc gcccccagccacccccugcuccuggcgcucccacccagcaugggcagaagggggcaggaggcugccacccagcagg g g g u c a g g u g c a c u u u u u u a a a a a g a a g u u c u c u u g g uc a c g u c c u a a a a g u g a c c a g c u c c c u g u g g c c c a g u c a gaaucucagccugaggacgguguuggcuucggcagccccgagauacaucagagggugggcacgcuccucccuccac ucgccccucaaacaaaugccccgcagcccauuucuccacccucauuugaugaccgcagauucaaguguuuuguuaa guaaaguccugggugaccuggggucacagggugccccacgcugccugccucugggcgaacaccccaucacgcccgg aggagggcguggcugccugccugagugggccagaccccugucgccagccucacggcagcuccauagucaggagaug gggaagaugcuggggacaggcccuggggagaaguacugggaucaccuguucaggcucccacugugacgcugccccg gggcgggggaaggaggugggacaugugggcguuggggccuguagguccacacccagugugggugacccucccucua accuggguccagcccggcuggagaugggugggagugcgaccuagggcuggcgggcaggcgggcacugugucucccu gacuguguccuccugugucccucugccucgccgcuguuccggaaccugcucugcgcggcacguccuggcagUGGGG C AG GU GG AG C U G GG C G G GG GC C C U G G U G C AG G C AGC C U GC AGC C C U U GG C C C U G G AG GG G U C C C U G C A G A A GC GU G
exon 2
intron 2
exon 3
GC AU U G U G G A AC A AU GC U GU AC C AG C AU C U GC U C C C U C U AC C AGC U G G AG A AC U AC U G C A AC U AG AC G C AGC C U G C AG G C AG C C C C AC AC C C G C C G C C U C C U G C AC C G AG AG AG AU G G A AU A A AG C C C U U G A AC C A GC
3 1431
AG C C C U C C A G G AC A G G C U G C AU C AG A A G AG G C C AU C A AG C AG AU C AC U G U C C U U C U G C C AU G G C C C U G U G G AU G C G C C U C C U G C C C C U G C U G G C G C U G C U G G C C C U C U G G G G AC C U G AC C C A G C C G C AG C C U U U G U G A AC C A AC A C C U G U G C G G C U C AC A C C U G G U G G A AG C U C U C U AC C U AG U G U G C G G G G A AC G AG G C U U C U U C U A C AC A C C C A AG AC C C G C C G G G AG G C AG AG G A C C U G C A G G U G G G G C A G G U G G A G C U G G G C G G G G G C C C U G G U G C AG G C AG C C U G C A G C C C U U G G C C C U G G AG G G G U C C C U G C AG A AG C G U G G C AU U G U G G A AC A AU G C U G U AC C AG C AU C U G C U C C C U C U AC C AG C U G G AG A A C U AC U G C A A C U AG AC G C A G C C U G C A G G C AG C C C C AC A C C C G C C G C C U C C U G C A C C G AG AG AG AU G G A AU A A A G C C C U UG A AC C AGC AAAAAAAAAAAAAAAAAAAAAAAAAAAAA
465
392
TYPES OF INTRONS
GU-AG introns; AU-AC introns
Alternative Splicing
The pre-mRNA contains the introns and the exons encoded in the DNA. For the mRNA to produce a functional protein, the introns must be removed. In removing the introns, a variety of exon combinations are possible, i.e., different combinations of exons may be joined together to generate different forms of the same protein.
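The combinatorics can be sketched in a few lines. This assumes a simplified cassette-exon model in which some exons are always included and each remaining exon is either kept or skipped. The function name `splice_isoforms` and the toy exon sequences are hypothetical.

```python
from itertools import combinations


def splice_isoforms(exons, constitutive):
    """Enumerate mRNAs that keep every constitutive exon and any subset of
    the remaining (cassette) exons, preserving genomic order."""
    optional = [i for i in range(len(exons)) if i not in constitutive]
    isoforms = []
    for k in range(len(optional) + 1):
        for kept in combinations(optional, k):
            included = sorted(set(constitutive) | set(kept))
            isoforms.append("".join(exons[i] for i in included))
    return isoforms


# Toy exon sequences for illustration only, not real data.
exons = ["AUGGCC", "GAAUUA", "CUGAAA", "UGGUAA"]
isoforms = splice_isoforms(exons, constitutive={0, 3})
for mrna in isoforms:
    print(mrna)
```

With two optional exons the model yields four isoforms. Real splicing is more constrained (mutually exclusive exons, alternative splice sites, intron retention), so this is only a way to see how quickly the number of forms grows.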
female
male
(12)
(48)
(33)
(2)
Summary
A Drosophila homolog of human Down syndrome cell adhesion molecule (DSCAM), an immunoglobulin superfamily member, is required for the formation of axon pathways in the embryonic central nervous system. cDNA and genomic analyses reveal the existence of multiple forms of Dscam with a conserved architecture containing variable Ig and transmembrane domains. Alternative splicing can potentially generate more than 38,000 Dscam isoforms. This molecular diversity may contribute to the specificity of neuronal connectivity.
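The figure of more than 38,000 is consistent with simple multiplication if, as the numbers 12, 48, 33, and 2 on the slide suggest, the gene has four variable positions with that many mutually exclusive alternatives each. A quick check under that assumption:

```python
import math

# Assumed reading of the slide: alternatives available at each variable position.
alternatives_per_position = [12, 48, 33, 2]

# One alternative is chosen independently at each position.
n_isoforms = math.prod(alternatives_per_position)
print(n_isoforms)  # 38016
```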
Intron retention
Constitutive exon Alternatively spliced exons
Functional domains can be added or subtracted, increasing protein diversity
Can introduce early stop codons, resulting in truncated proteins or unstable mRNAs
Can modify the activity of transcription factors, affecting the expression of genes
Observed in nearly all metazoans
Estimated to occur in 30%-60% of human genes
What is bioinformatics?
Interface of biology and computers
Analysis of proteins, genes and genomes using computer algorithms and computer databases
Genomics is the analysis of genomes. The tools of bioinformatics are used to make sense of the billions of base pairs of DNA that are sequenced by genomics projects.
1995
Phred/Phrap (Phil Green and Brent Ewing)
Phred: assigns a confidence score to each sequenced nucleotide
Phrap: assembles sequences
Align genome and cDNA/EST: sim4, Spidey, BLAT
Gene prediction: GENSCAN, FGENESH, Genie, GeneWise
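The slide does not say how the confidence score is defined. The standard Phred quality value is Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A small sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_quality(p_error):
    """Phred quality score for a base-call error probability."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Inverse conversion: error probability for a given quality score."""
    return 10.0 ** (-q / 10.0)


for p in (0.1, 0.01, 0.001):
    print(p, phred_quality(p))
```

Under this definition, Q20 corresponds to one expected error in 100 bases and Q30 to one in 1,000.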
Chromosome number
Basic characteristics of the human genome
It is estimated that 88% is in finished form.
Currently in finished form: 2,843,602,073 base pairs
A 839,853,043
G 581,490,131
T 841,181,383
C 581,077,516
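The base counts can be checked against the quoted total and used to derive the overall GC content. A short calculation using only the numbers on the slide:

```python
# Base counts for the finished sequence, as given on the slide.
counts = {
    "A": 839_853_043,
    "G": 581_490_131,
    "T": 841_181_383,
    "C": 581_077_516,
}

total = sum(counts.values())
gc_fraction = (counts["G"] + counts["C"]) / total

print(total)                  # 2843602073
print(round(gc_fraction, 3))  # 0.409
```

The four counts sum exactly to the stated 2,843,602,073 bp, and G+C makes up about 41% of the finished sequence.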
6p21: Major Histocompatibility Complex
167,000,000 bp
2,190 genes
61% of the genes have CpG islands 2 kb upstream and 1 kb downstream of the 5′ and 3′ ends of genes.
Mean gene length: 32,530 bp
Mean exon length: 318 bp
Mean transcripts per gene: 1.79
Mean exons per gene: 5.28
Mungall, A.J. et al. 2003. The DNA sequence and analysis of human chromosome 6. Nature 425: 805-811.
Lots of variation!
3.2 x 10^9 bp/genome x 0.001 changes/bp = 3.2 x 10^6 changes/genome
Two major types of variation:
SNPs, RFLPs
Repeated DNA - short to long repeats
GC vs. genes
GC vs. introns/exons
Protein-coding genes
Recurrence of domains
Sequence definition: distinct regions of protein sequence that are highly conserved in evolution.
The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Pawson, T. et al., Trends in Cell Biology Vol.11 No.12 December 2001
Recurrence of segments
www.people.virginia.edu/~rjh9u/hbmut.html
Some hard problems: small genes, post-translational modifications, unique genes, spliced genes, alternative splicing, gene rearrangements (e.g. IgGs) . . .
Organelle: mitochondria (~600); chloroplasts (~40)
Microbial: archaea (~20); bacteria (~700)
Eukaryota:
Yeast: Saccharomyces cerevisiae (baker's); Schizosaccharomyces pombe (fission)
Metazoa: Homo sapiens (human); Pan troglodytes (chimp); Mus musculus (mouse); Rattus norvegicus (rat); Gallus gallus (chicken); Drosophila melanogaster (fly); Anopheles gambiae (mosquito); Caenorhabditis elegans (worm); Fugu rubripes (puffer fish); Danio rerio (zebrafish)
Plants: Arabidopsis thaliana (thale cress); Oryza sativa (rice); Avena sativa (oat); Glycine max (soybean); Hordeum vulgare (barley); Lycopersicon esculentum (tomato); Triticum aestivum (bread wheat); Zea mays (corn)
Others: Encephalitozoon cuniculi; Guillardia theta nucleomorph; Plasmodium falciparum; Leishmania major
E. coli 1,146
18
10
M.genitalium
Organism
# Chromosomes
# Genes
Exons
Introns
500
Exons: 500 (1/gene); 3500 (1.02/gene); 6500 (1.04/gene); 91,000 (5/gene); 54,000 (4/gene); 133,000 (5/gene, 247/exon); 310,000 (8+/gene, 455 bp/exon)
3200
61
16
6200
220
18,000
Introns: 73,000 (4/gene); 44,000 (3/gene, 60 bp/intron); 107,000 (4/gene, 169 bp/intron); 250,000 (7/gene, 3400 bp/intron)
14,000
25,000
23
30,000
Needles in Haystacks...
Only 2% of the human genome is coding regions
Intron-exon structure of genes:
Large introns (average 3365 bp)
Small exons (average 145 bp)
Long genes (average 27 kb)
Single-pass sequencing of a small (end) piece of cDNA
Typically 200-500 nucleotides long
May contain coding and/or non-coding regions
ESTs
Cells from a specific organ, tissue or developmental stage
[Figure: EST generation. mRNA (5′ ... AAAAAA 3′) is extracted from the cells; an oligo-dT primer (TTTTTT) is added and reverse transcriptase copies the RNA into DNA; the resulting double-stranded cDNA clones are sequenced from either end to give 5′ ESTs and 3′ ESTs.]
EST sequencing
Is fast and cheap
Gives direct information about the gene sequence
Partial information
Resulting ESTs (DB searches): known gene; similar to known gene; contaminant; novel gene
20,039,613
450,652
EST lengths
~450 bp
Human EST length distribution (dbEST, Sep. 2003)
The tissue, organ or anatomical system from which the sample was prepared. Examples are digestive, lung and retina.
The precise cell type from which the sample was prepared. Examples are B lymphocyte, fibroblast and oocyte.
The pathological state of the tissue from which the sample was prepared. Examples are normal, lymphoma, and congenital.
The stage during the organism's development at which the sample was prepared. Examples are embryo, fetus, and adult.
Whether the tissue used to prepare the library was derived from single or multiple samples. Examples are pooled, pooled donor and pooled tissue.
[Figure: Exon Size. Histogram with bins 1-100, 100-200, 200-300, 300-500, >500; series: Fungi, Vertebrate.]
[Figure: Intron Size. Histogram with bins <100, <200, <1 kbp, 1 to 5 kbp, >5 kbp; series: Fungi, Vertebrate.]
[Figure: Intron Prevalence. Vertical axis 0 to 100.]
There is no hard and fast rule for identifying donor and acceptor splice sites
Signals are very weak
Overpredicting Genes
Easy to predict all exons:
Report all sequences flanked by ..AG and GT.. as exons
Sensitivity = 100%
Specificity ~ 0%
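A rough sketch of this naive predictor shows why it overpredicts. It assumes the gene-prediction convention in which specificity means TP / (TP + FP), the fraction of predictions that are real. The function names and the toy sequence are hypothetical.

```python
import re


def naive_exon_candidates(dna):
    """Return every (start, end) span that is preceded by AG and followed by GT."""
    # A candidate exon starts right after an AG (acceptor side)...
    starts = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    # ...and ends right before a GT (donor side).
    ends = [m.start() for m in re.finditer("(?=GT)", dna)]
    return [(s, e) for s in starts for e in ends if e > s]


def sensitivity(tp, fn):
    """Fraction of real exons that were predicted."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted exons that are real (gene-finding convention)."""
    return tp / (tp + fp)


toy = "CCAGATGGTAAAGCCGTTT"  # made-up sequence for illustration
print(naive_exon_candidates(toy))
```

Even this 19-base toy sequence yields three candidates. Since AG and GT dinucleotides are frequent in real DNA, the number of candidate spans grows roughly with the product of their counts, so every true exon is reported but almost all predictions are false.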
Locating ORFs
The simplest method of predicting coding regions is to search for open reading frames (ORFs)
An open reading frame begins with a start (AUG) codon and ends with one of three stop codons
Six total reading frames
Example from HW#1:
AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA
M E L V I S I S A L I I V E
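The search can be sketched as a six-frame scan over the HW#1 sequence above. This is a minimal illustration, assuming the standard genetic code and treating the reverse complement as the other three frames. The function names are not from the course material.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(rna):
    return rna.translate(str.maketrans("AUCG", "UAGC"))[::-1]


def find_orfs(rna):
    """Return (strand, frame, protein) for every AUG...stop ORF in six frames."""
    orfs = []
    for strand, seq in (("+", rna), ("-", reverse_complement(rna))):
        for frame in range(3):
            protein = None  # None means we are not currently inside an ORF
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE[seq[i:i + 3]]
                if protein is None:
                    if aa == "M":      # AUG opens an ORF
                        protein = "M"
                elif aa == "*":        # a stop codon closes it
                    orfs.append((strand, frame, protein))
                    protein = None
                else:
                    protein += aa
    return orfs


seq = "AUUGCAAUGGAAUUAGUAAUCUCUAUUUCCGCCCUUAUUAUAGUUGAAUAGAUAGCCGUA"
for orf in find_orfs(seq):
    print(orf)
```

In the first forward frame the scan finds the AUG at the third codon and reads through to the UAG stop, giving the peptide shown on the slide. ORFs that start but never reach a stop codon before the sequence ends are not reported by this sketch.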
Prokaryotes:
DNA sequences coding for proteins are generally transcribed into mRNA, which is translated into protein with very little modification
Locating an open reading frame from a start codon to a stop codon can give a strong indication of protein-coding regions
Longer ORFs are more likely to predict protein-coding regions than shorter ORFs.
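One way to see why length matters is to ask how often a stretch of random sequence would avoid all stop codons by chance. The estimate below assumes all 64 codons are equally likely and independent, which real genomes do not satisfy exactly. The function name is illustrative.

```python
def prob_no_stop(n_codons, n_stop=3, n_total=64):
    """Chance that n_codons random codons in a row contain no stop codon."""
    return (1 - n_stop / n_total) ** n_codons


for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under this model a 100-codon run without a stop occurs by chance less than 1% of the time, so a long ORF is unlikely to be an accident, while a 10-codon ORF is unremarkable.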
Eukaryotes:
mRNA undergoes processing to remove introns before the protein is translated
The genomic region corresponding to a gene may contain stop codons within its introns, interrupting the ORF
Posttranscriptional modification makes gene prediction more difficult