Friend or Foe?
Tim Shank 4/2/03
Comparative Genomics
Genomics Projects
Population Genomics
Pharmacogenomics
Microbial Genomics
Functional Genomics
Field of genomics that links complex genotypes and phenotypes by comparing the flow of genotypic and phenotypic information in breeding and natural populations (Andrew Benson, U. Neb)
Genomic variation within species permits the construction of detailed linkage maps using polymorphic markers and, through crossing experiments between individuals with different phenotypes, the identification of genes responsible for phenotypic variation (e.g., disease susceptibility, drug toxicity) (Andrew Clark, PSU)
Do topographic and hydrographic features like transform faults and currents disrupt or facilitate gene flow between demes?
Does the pattern of colonization and mode of dispersal affect the retention of genetic diversity in marine animals?
Dispersal models
Continuous populations
Isolation-by-distance
Discrete populations
Stepping-stone
Island model
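FST approaches
$F_{ST} = \dfrac{\sigma_A^2}{\bar{p}_A (1 - \bar{p}_A)}$
where $\sigma_A^2$ is the variance in the frequency of allele A among populations and $\bar{p}_A$ is its mean frequency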
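As a minimal sketch of the FST definition (variance in allele frequency among populations, divided by p(1 - p) at the mean frequency), the Python below computes FST from a list of per-deme allele frequencies. The function name and the example frequencies are hypothetical, and weighting every deme equally is an assumption; real estimators also correct for sample size.

```python
from statistics import mean, pvariance

def fst_from_frequencies(freqs):
    """Wright's F_ST = var(p) / (p_bar * (1 - p_bar)) across demes.

    Assumes equal weighting of demes and no sampling-error correction.
    """
    p_bar = mean(freqs)
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("allele is fixed or absent in all demes; F_ST undefined")
    return pvariance(freqs) / (p_bar * (1.0 - p_bar))

# Hypothetical frequencies of allele A in four demes
demes = [0.2, 0.4, 0.6, 0.8]
fst = fst_from_frequencies(demes)
print(round(fst, 3))
```

When every deme has the same allele frequency the variance is zero, so the function returns 0, which corresponds to no population subdivision.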
Wright (1951) [The genetical structure of populations. Ann. Eugen. 15:323-354.] noted that the following relationship holds when populations reach an equilibrium between genetic drift and migration:
$F_{ST} \approx \dfrac{1}{4Nm + 1}$
where N is the variance effective population size of the average population, and m is the average proportion of immigrants in each population
Problem: Useful parameter space is for FST values between 0.1 and 0.4
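The drift-migration relationship can be explored numerically. The sketch below, with hypothetical function names, converts between Nm and FST using FST = 1 / (4Nm + 1) and prints a few values; it assumes the equilibrium island model holds, which is a strong assumption for real populations.

```python
def fst_from_nm(nm):
    """Equilibrium F_ST under Wright's island model for a given Nm."""
    return 1.0 / (4.0 * nm + 1.0)

def nm_from_fst(fst):
    """Invert the island-model relationship to estimate Nm from F_ST."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("F_ST must lie in (0, 1]")
    return (1.0 / fst - 1.0) / 4.0

for nm in (0.25, 0.5, 1, 2, 5, 10):
    print(f"Nm = {nm:>5}: FST = {fst_from_nm(nm):.3f}")

print(f"FST = 0.1 -> Nm = {nm_from_fst(0.1):.3f}")
print(f"FST = 0.4 -> Nm = {nm_from_fst(0.4):.3f}")
```

Because the curve flattens as Nm grows, small errors in a low FST estimate translate into large changes in the inferred Nm, which is one way to read the restriction to intermediate FST values.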
[Figure: FST (y-axis) plotted against Nm (x-axis)]
Nm is a virtual number
[Figure: map with compass rose showing sites labeled 13, 9, 2, Galapagos Rift, 11, and 10]
Reject expectations of "island model"
Consistent with stepping-stone model
Inference: a species with more limited dispersal abilities
Black et al. 1994 Gene flow among vestimentiferan tube worm (Riftia pachyptila) populations from hydrothermal vents of the Eastern Pacific. Marine Biology 120: 33-39.
Molecular Toolkit: markers for inferring population structure and gene flow
Allozymes
multiple, independent, codominant loci; relatively easy; low cost
RFLPs
variation in restriction fragment lengths; polymorphic due to restriction site mutation
mtDNA
relatively easy; maternally inherited; effectively haploid; non-recombining; modest cost; amenable to genealogical analysis
Drawbacks: linked loci and pseudoreplication
AFLPs
can get 100s of loci relatively easily
Drawbacks: dominance; recombination; state characters; mutation models not available
minisatellites
repeats of 10-40 bp units; polymorphic due to unequal crossing over
DNA microsatellites
Repeat unit 2-3 bp; nuclear; can get dozens of loci relatively easily; method of choice for parentage
Drawbacks: recombination; state characters; start-up time is great; issues of homoplasy in geographical studies; mutation must be taken into account in gene flow models
STS, sequence-tagged site (physical marker)
A short DNA segment that occurs only once in the genome and whose exact location and order of bases are known. (They can be used as primers for PCR reactions.)
EST, expressed sequence tag (physical marker)
A short (100-300 bp) part of a cDNA which can be used to fish the rest of the gene out of the chromosome by matching base pairs with part of the gene.
PCR-based method
Target Sequence
= arbitrary primer (e.g. ggcattactc)
Analyze PCR products by agarose gel electrophoresis.
Marker is dominant (presence/absence of band).
No prior sequence knowledge required.
Many variations on the theme (e.g., RAMP, ISSR).
2. Run product on gel with denaturing gradient (parallel or perpendicular to direction gel runs)
3. Product begins denaturing at a certain point, depending on base sequence: greatly retards migration and allows discrimination of alleles based on small sequence differences
4. Denaturing gradient gels can be difficult to produce: use perpendicular gradient to identify optimal conditions, move to CDGE: constant denaturant gel electrophoresis
Fairly simple analysis (cutting can be a hassle)
Requires sequence information from several alleles (or luck)
[Figure: Allele 1 and Allele 2]
3. Alleles can be differentiated by size based on loss or gain of a restriction site; may be able to analyze on agarose gel
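The idea of telling alleles apart by gain or loss of a restriction site can be sketched in a few lines. The example below is illustrative only: the two allele sequences are made up, the function name is hypothetical, and EcoRI (recognition site GAATTC, cutting after the first G) is chosen simply as a familiar enzyme.

```python
def digest(seq, site="GAATTC", cut_offset=1):
    """Return fragment lengths after cutting seq at every occurrence of site.

    cut_offset is the cut position within the site (1 for EcoRI: G^AATTC).
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(len(seq) - start)
    return fragments

# Hypothetical PCR products: allele 2 has lost the site through one base change
allele_1 = "A" * 30 + "GAATTC" + "T" * 20
allele_2 = "A" * 30 + "GAGTTC" + "T" * 20

print(digest(allele_1))  # two fragments: the site is present
print(digest(allele_2))  # one uncut fragment: the site is absent
```

On a gel, the cut allele would appear as two shorter bands and the uncut allele as a single full-length band; a heterozygote would show all three.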
Microsatellites
reiterated short sequences [of DNA] tandemly arrayed, with variations in copy number accounting for a profusion of distinguishable alleles - (Avise 1994)
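To make the definition concrete, here is a rough sketch that scans a DNA string for tandemly repeated units of 2 to 4 bp and reports the copy number of each array. The function name, the minimum of five copies, and the test sequence are all assumptions for illustration; dedicated repeat finders handle imperfect repeats, which this does not.

```python
import re

def find_microsatellites(seq, min_repeats=5):
    """Find perfect tandem repeats of 2-4 bp units.

    Returns a list of (start, repeat_unit, copy_number) tuples.
    """
    # A 2-4 bp unit followed by at least (min_repeats - 1) further copies
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Hypothetical sequence containing a (CA)8 array
sequence = "GGT" + "CA" * 8 + "TTG"
print(find_microsatellites(sequence))
```

Two alleles at such a locus would differ in the reported copy number, which is what is scored as fragment length on a gel.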
Microsatellite Types
1. Dinucleotide
Animals: CA
Plants: TA, GA
2. Trinucleotide
GTG, CAG, and AAT
Related to disease and cancers
3. Tetranucleotide
GATA/GACA
Highly polymorphic
Microsatellite Uses
1. Population Genetics
   1. Gene flow
   2. Stock structure
2. Genetic Probes
   1. Larvae
   2. Gut contents
   3. Scat
   4. Source populations
3. 4.
Microsatellite Advantages
1. Highly polymorphic
2. Codominant
3. In every organism examined to date
4. Very abundant
5.
6. 7. 8. 9.
Microsatellite Disadvantages
1. Expensive
2. Time consuming
3. Several loci are needed to obtain sufficient statistical power
4. Current analysis methods do not distinguish between changes in flanking regions vs. changes within the microsatellite regions
Mutation Mechanisms
1.
2. Recombination
   A. Unequal crossing over (UCO)
   B. Gene conversion
Microsatellite Mutations
$10^{-3}$ to $10^{-6}$ events per locus per generation (point mutation: $10^{-9}$ to $10^{-10}$)
Varies by
repeat type
base composition of the repeat
taxonomic group
length of the allele
Mutation Models
1. Infinite Allele Model (IAM)
a mutation involves the gain or loss of any number of repeats and always results in an allelic state not already present in the population
2.
3.
4.
DNA Extraction
Digestion
Add Linkers
PCR
Hybridize to Beads
CACA GTGT
PCR
Cloning
Blots/ Hybridizations
Plasmid Preps
Enzyme Digest Isolated Plasmids
Check Insert Size
Dot Blot Hybridizations
References
www.biotech.ufl.edu/WorkshopsCourses/mm_manual.htm
Avise, J.C. 1994. Molecular Markers, Natural History and Evolution. Chapman and Hall, New York. 511 pp.
Balloux, F. and N. Lugon-Moulin. 2002. The estimate of population differentiation with microsatellite markers. Molecular Ecology. 11: 155-165.
Goldstein, D.B. and C. Schlötterer (Editors). 1999. Microsatellites: Evolution and Applications. Oxford University Press, Oxford. 352 pp.
Jarne, P and P.J.L. Lagoda. 1996. Microsatellites, from molecules to populations and back. Trends in Ecology and Evolution 11(10): 424-429.
Slatkin, M. 1995. A measure of population subdivision based on microsatellite allele frequencies. Genetics 139: 457-462.
Acrylamide gel with 5 microsatellite loci and internal size standard
Simultaneous analysis of a dozen loci
[Table: comparison of marker methods; column headings include Ease of scoring and Gel]
Methods: RAPD, AFLP, Microsats, CAPS, DGGE/SSCP, TaqMan
Gel: RAPD, agarose; AFLP, polyacrylamide; Microsats, polyacrylamide; CAPS, agarose; DGGE/SSCP, polyacrylamide; TaqMan, none
* Depends
Linkage disequilibrium: alleles at different loci are found together more or less often than expected based on their frequencies (and location in the genome).
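A small worked example may help here. The sketch below computes the standard disequilibrium coefficient D (observed haplotype frequency minus the product of the allele frequencies) and the squared correlation r² from a list of two-locus haplotypes. The haplotype counts and the allele labels A/a and B/b are invented for illustration.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    Each haplotype is a two-character string such as "AB" or "ab".
    D = p_AB - p_A * p_B; r^2 = D^2 / (p_A q_A p_B q_B).
    """
    n = len(haplotypes)
    p_a = sum(h[0] == "A" for h in haplotypes) / n
    p_b = sum(h[1] == "B" for h in haplotypes) / n
    p_ab = sum(h == "AB" for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else float("nan")
    return d, r_squared

# Hypothetical sample of 10 chromosomes
sample = ["AB"] * 4 + ["Ab"] * 1 + ["aB"] * 1 + ["ab"] * 4
d, r2 = linkage_disequilibrium(sample)
print(f"D = {d:.2f}, r^2 = {r2:.2f}")
```

If alleles at the two loci were associated at random, the AB haplotype would occur at frequency p_A × p_B and D would be zero.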
Goldstein and Weale 2001 Population genomics: linkage disequilibrium holds the key. Current Biology 11:576-579
Understanding current gene flow and mating systems by direct methods (e.g., maternity analysis, paternity analysis)
Need high polymorphism, codominance, repeatability, low cost per sample
Microsatellites, SNPs
Pharmacogenomics: polymorphism-based approaches for the discovery and development of new medications; translating polymorphisms into new genomic medicine*
Need rapid, low-cost, repeatable way to distinguish alleles when screening large numbers of individuals
SNPs and sequencing
*New York Times, Nov. 2002
MtDNA favored out of Africa hypothesis but lacked statistical support for deep African branches
53 human mtDNA sequences (16,500 bp)
examined timing of evolutionary events
mtDNA evolving in a clocklike fashion
Linkage disequilibrium not evident
3 deepest branches lead exclusively to sub-Saharan
Neighbor-joining phylogram based on complete mtDNA genome sequences (excluding D-loop). 1000 bootstrap replicates shown on nodes.
Asterisk refers to the MRCA of the youngest clade containing both African and non-African individuals.
Note star-like vs. deep branching topology: larger Ne or longer genetic history in Africa; bottleneck in non-Africans.
Exodus from Africa began about 100,000 years ago
Divergence of Africans and non-Africans occurred 52,000 ± 28,000 years ago
Human genome mining to produce 507,152 high-confidence SNP candidates as uniform resource for describing nucleotide diversity and regional variation within and between human populations
So What's a SNP?
A mutation that causes a single base change is known as a Single Nucleotide Polymorphism (SNP)
SNPs are the simplest form and most common source of genetic polymorphism in the human genome
90% of all human DNA polymorphisms; 1 SNP in 1000 bp; 1.42 million
SNP Haplotype is a particular pattern of sequential SNPs (or alleles) found on a single chromosome
Microarrays, mass spectrometry and sequencing are all used to accomplish grouping or blocking of SNPs (haplotyping)
Haplotype Determination Problem: find all haplotypes given a genome and all identified SNPs (algorithm development)
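As a toy illustration of what a SNP haplotype is, the sketch below finds the variable positions in a set of aligned sequences and reads off each sequence's pattern of alleles at those positions. It assumes the sequences are already phased (each string is one chromosome), which sidesteps the hard part of the haplotype determination problem; the sequences and function names are hypothetical.

```python
from collections import Counter

def snp_positions(seqs):
    """Return 0-based alignment columns where the sequences differ."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def snp_haplotypes(seqs):
    """Return each sequence's alleles at the SNP positions, as a string."""
    sites = snp_positions(seqs)
    return ["".join(seq[i] for i in sites) for seq in seqs]

# Hypothetical phased chromosomes, already aligned
chromosomes = ["ACGTACGT", "ACCTACGA", "ACGTACGA", "ACCTACGA"]
print(snp_positions(chromosomes))
print(Counter(snp_haplotypes(chromosomes)))
```

With unphased diploid genotypes, the same SNP calls would be compatible with several haplotype pairs, which is why dedicated algorithms are needed.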
The SNP Consortium is an alliance of pharmaceutical and computer companies managed by Lincoln Stein at Cold Spring Harbor Lab. The SNP Consortium Ltd. is a non-profit foundation organized for the purpose of providing public genomic data. Its mission is to develop up to 300,000 SNPs distributed evenly throughout the human genome and to make the information related to these SNPs available to the public without intellectual property restrictions. The project started in April 1999 and is anticipated to continue until the end of 2001.
We describe a map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome, providing an average density on available sequence of one SNP every 1.9 kilobases. These SNPs were primarily discovered by two projects: The SNP Consortium and the analysis of clone overlaps by the International Human Genome Sequencing Consortium. The map integrates all publicly available SNPs with described genes and other genomic features. We estimate that 60,000 SNPs fall within exon (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.
Built a set of pairwise sequence alignments by analyzing the overlapping regions of large-insert clones
Looked for mismatches; called SNPs at a PolyBayes probability threshold of 0.80
SNP marker density grouped by overlapping regions
Modeled the marker density distribution
Distribution of polymorphic sites profoundly impacted by changes in population size
Increased pop size yields abundance of new lineages with more mutation
Decreased pop size raises likelihood of relatedness, resulting in over-representation of sequence identity
Evaluated degree of fit between observed density distribution and probability predicted using the log likelihood of the data for a given model
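One generic way to score how well a model predicts an observed distribution is the multinomial log likelihood. The sketch below is not the analysis used in the study; it assumes the data are counts of regions falling into density bins and that each model supplies a probability for each bin. The counts, model names, and probabilities are invented.

```python
import math

def log_likelihood(observed_counts, model_probs):
    """Multinomial log likelihood of binned counts, dropping the constant term."""
    total = 0.0
    for count, prob in zip(observed_counts, model_probs):
        if count > 0:
            total += count * math.log(prob)
    return total

# Hypothetical counts of regions with low, medium, and high SNP density
observed = [50, 30, 20]

# Hypothetical bin probabilities predicted by two demographic models
models = {
    "model_a": [0.5, 0.3, 0.2],
    "model_b": [1 / 3, 1 / 3, 1 / 3],
}

scores = {name: log_likelihood(observed, probs) for name, probs in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

The model with the higher (less negative) log likelihood fits the observed distribution better; the dropped constant is the same for every model, so it does not affect the comparison.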
r indicates the per nucleotide, per generation recombination rate
Superior fit of the modeled parameters (with or without recombination) suggests a severe, 2- to 7-fold collapse of population size 40,000 years (1600 generations) ago, followed by a modest recovery
% of successful trials for each model, at each data fraction. Assessments based on the amount of data required for rejection by χ² test.
Interestingly, the fit between observations and best-fitting models decays with more data.
Compared the C57BL/6J mouse genome sequence with 59 finished segments of the 129/Sv inbred strain
Discovered nearly 70,000 SNPs on blocks of high SNP density (40 SNPs per 10 kb) separated by blocks of low density (0.5 SNPs per 10 kb)
Surveyed panels of inbred mouse strains to find that distinct SNP haplotypes were shared among common inbred populations
Surveyed wild strains showed that 67% of each of the inbred genomes are derived from European mice and 33% from Asian mice
How about other organisms? or new model organisms; organisms that exemplify phenomena not well studied in human/worm/mouse?
Three-Spined Sticklebacks
morphological evolution
populations isolated after last glaciation have diverged morphologically and in sequence ((CA)n microsatellites)
strategy: cross benthic and limnetic fish; intercross F1s; follow morphological traits and polymorphisms in F2s
see Peichel et al (2001) The genetic architecture of divergence between threespine stickleback species. Nature 414: 901-5.
Zebrafish
Genes
Postlethwait et al. 1994 A genetic linkage map for zebrafish. Science 264: 699-703.
Woods et al. 2000 A comparative map of the zebrafish genome. Genome Research 10: 1903-1914.
Geisler et al. 1999 A radiation hybrid map of the zebrafish genome. Nature Genetics 23: 86-89.
Microsatellites
Shimoda et al. 1999 Zebrafish genetic map with 2000 microsatellite markers. Genomics 58: 219-232.
Vertical lines = 25 linkage groups
Red dots correspond to SNPs represented on the oligonucleotide microarray
Problem:
Need to determine genetic relationships in populations without known pedigrees
Microsatellites are the current method of choice for estimating relationships among close kin within a population, but the number of independently segregating microsatellite markers is limited
SNPs may provide a large number of segregating loci with a large number of alleles at even frequencies
Goal:
To assess known pairwise relationships via single nucleotide polymorphisms where we already have parallel microsatellite results.
Recent advances in microarray technology permit genotyping of large numbers of individuals at 100s to 1000s of SNP loci (reviewed by Kwok 2001); this could be big!
Need to know: do SNPs equal or exceed the power of practical numbers of microsatellite loci in estimating relationships?
Glaubitz et al. 2003
Computer simulations designed to evaluate SNPs' ability to discriminate a variety of pairwise relationships likely to occur in natural populations, with comparisons to microsatellites from Blouin et al. 1996
SNPs segregate independently: ideal genome with 20 autosomes, 5 SNPs per chromosome; 10,000 individuals with random genotypes
Constructed 5 categories of relationship types
Constructed an array of pedigrees; estimated pairwise relatedness at a single locus (r1)
Evaluated the performance of 100 simulated SNPs by estimating the misclassification rate of relationships
The figure illustrates that different pairwise relationships can have different amounts of inherent variance in relatedness
The parent-offspring (PO) and unrelated (U) relationships have 0 inherent variance (share one or no alleles)
Full sibs (FS) have the largest variance; second-order relatives cannot be distinguished from each other via estimation of r
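The point about inherent variance can be checked with the standard probabilities that a pair shares 0, 1, or 2 alleles identical by descent (IBD) at a locus. The sketch below is a single-locus calculation, not the simulation used in the study; the dictionary and function names are hypothetical.

```python
# Probability of sharing 0, 1, or 2 alleles identical by descent at one locus
IBD_PROBS = {
    "parent-offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "half sibs": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}

def relatedness_moments(ibd_probs):
    """Mean and variance of per-locus relatedness (fraction of alleles IBD)."""
    values = (0.0, 0.5, 1.0)  # relatedness when sharing 0, 1, or 2 alleles
    mean_r = sum(p * v for p, v in zip(ibd_probs, values))
    var_r = sum(p * (v - mean_r) ** 2 for p, v in zip(ibd_probs, values))
    return mean_r, var_r

for name, probs in IBD_PROBS.items():
    mean_r, var_r = relatedness_moments(probs)
    print(f"{name:>16}: mean r = {mean_r:.3f}, variance = {var_r:.4f}")
```

Half sibs, grandparent-grandchild pairs, and avuncular pairs all have the same sharing probabilities (0.5, 0.5, 0), so an estimate of r alone cannot separate them.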
100 independently segregating SNPs determined parent-offspring pairs about as well as 16 or fewer microsatellite loci when both parents are unknown
Even under the optimistic scenario of 100 independent loci, results show little promise for discriminating higher order relationships on the basis of pairwise relatedness.
Conclusion:
SNPs have limited potential for the delineation of genealogical relationships
My two cents:
To take full advantage of the vast abundance of SNPs in metazoan genomes and their potential automation, we will need analytical methods that account for tight genetic linkage (McPeek and Sun 2000) and known recombination frequencies.
Until then, SNP population genomics will likely only be used on model organisms.
Understanding current gene flow and mating systems by direct methods (e.g., maternity analysis, paternity analysis)
Need high polymorphism, codominance, repeatability, low cost per sample
Microsatellites, allozymes
Pharmacogenomics: polymorphism-based approaches for the discovery and development of new medications; translating polymorphisms into new genomic medicine*
Need rapid, low-cost, repeatable way to distinguish alleles when screening large numbers of individuals
SNPs and sequencing
*New York Times, Nov. 2002
Pharmacogenomics
The use of DNA sequence information to measure and predict the reaction of individuals to drugs.
Pharmacogenetics is the study of this variation at the level of a single gene, while pharmacogenomics studies variation at the genome-wide level.
Observation: there is great individual variation in response to drugs (genetically determined).
It is possible to measure many thousands of SNPs simultaneously in a small blood sample from a patient.
Can compare genotypes for SNP markers linked to virtually any trait.
Evolving Paradigm for Discovery of Genetic Polymorphisms associated with aberrant drug disposition or effects
More discoveries through polymorphisms in candidate genes (metabolism; transport; targets of candidate medication)
[Figure: bar chart of drug targets. Cumulative number of targets known today: approx. 500. New targets expected from Human Genome Project: 5,000-10,000.]
Source: Drews J. Nat Biotechnol 1996;14.
Functional Classifications
Disease genes classed by function and their relative representations
Some Examples
10% of African Americans have polymorphic alleles of glucose-6-phosphate dehydrogenase that lead to haemolytic anemia when they are given the anti-malarial drug primaquine.
Succinylcholine Toxicity
0.04% of individuals are homozygous for alleles of pseudocholinesterase that are unable to inactivate the muscle relaxant drug succinylcholine, leading to respiratory paralysis.
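A homozygote frequency like this can be turned into an approximate allele and carrier frequency. The sketch below assumes Hardy-Weinberg proportions and a single class of non-functional allele, neither of which is stated on the slide; the function name is hypothetical.

```python
import math

def carrier_frequency(homozygote_freq):
    """Allele and heterozygote frequencies implied by a homozygote frequency.

    Assumes Hardy-Weinberg proportions: q^2 homozygotes, 2pq heterozygotes.
    """
    q = math.sqrt(homozygote_freq)
    p = 1.0 - q
    return q, 2.0 * p * q

# 0.04% of individuals homozygous
q, carriers = carrier_frequency(0.0004)
print(f"allele frequency = {q:.3f}, carrier frequency = {carriers:.4f}")
```

Under these assumptions the allele frequency is about 2% and roughly 4% of individuals would carry one copy, far more than the 0.04% who are homozygous.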
Isoniazid Metabolism
There are many polymorphic alleles of the N-acetyltransferase (NAT2) gene with reduced (or accelerated) ability to inactivate the drug isoniazid.
Some individuals developed peripheral neuropathy in reaction to this drug.
Some alleles of the NAT2 gene are also associated with susceptibility to various forms of cancer.
Cytochrome P450
~10% of the Caucasian population is homozygous for alleles of the Cytochrome P450 gene CYP2D6 that do not metabolize the hypertension drug debrisoquine, which can lead to dangerous vascular hypotension.
ACE
Patients homozygous for an allele with a deletion in intron 16 of the gene for angiotensin-converting enzyme (ACE) showed no benefit from the hypertension drug enalapril while other patients benefit.
In early clinical trials, it is possible to identify people who react well and react poorly.
Can also speed clinical trials by testing on those who are likely to respond well.
The cDNAs are hybridized to microarrays on which every gene that has been cloned is present [the DNA is spotted on the microslides and each spot corresponds to DNA from a different gene]
If a particular gene is expressed, then it will be present and labelled in the cDNA pool. It can then hybridize to the spot on the plate corresponding to that particular gene
The results from such an experiment look like this, where the color of the spot tells you something about that gene's expression and drug therapy optimization.
The data can then be analyzed and sorted into tables that show which genes are expressed in response to the stimulus and which are turned off.
This sort of experiment can be done with any collection of RNAs that you want to compare. It is particularly useful for comparing normal to mutant or disease states: e.g., it tells you which genes are turned on in cancerous cells, which may give you a clue as to how cancer works.
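Sorting spots into "turned on" and "turned off" is often done on the log ratio of the two sample intensities. The sketch below is a simplified illustration: the gene names and intensities are invented, the two-fold cutoff is an arbitrary assumption, and real analyses normalize the data and use replicates.

```python
import math

def classify_spots(intensities, threshold=1.0):
    """Label each gene by its log2(treated / control) intensity ratio.

    threshold=1.0 corresponds to a two-fold change in either direction.
    """
    calls = {}
    for gene, (treated, control) in intensities.items():
        log_ratio = math.log2(treated / control)
        if log_ratio >= threshold:
            calls[gene] = "on"
        elif log_ratio <= -threshold:
            calls[gene] = "off"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical (treated, control) spot intensities
spots = {"geneA": (800.0, 200.0), "geneB": (100.0, 400.0), "geneC": (300.0, 280.0)}
calls = classify_spots(spots)
print(calls)
```

The resulting table of calls is the kind of sorted output described above, ready to compare between normal and disease samples.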
Link Gene Expression to Genome Sequence
Identify promoter and 5' sequence for a group of co-expressed genes.
Scan for known transcription factor binding sites.
Predict new regulatory sites based on common sequence elements.
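The scanning step can be sketched with simple pattern matching. In the example below the promoter sequences and gene names are made up; the TATA box (TATAAA) and E-box (CACGTG) consensus sequences are used as familiar motifs, and exact string matching is a simplification of how binding sites are usually modeled.

```python
import re

def scan_sites(promoters, motifs):
    """Return {motif_name: [(gene, position), ...]} for every motif match."""
    hits = {}
    for gene, seq in promoters.items():
        for name, pattern in motifs.items():
            for match in re.finditer(pattern, seq):
                hits.setdefault(name, []).append((gene, match.start()))
    return hits

def shared_motifs(hits, promoters):
    """Motifs found in every promoter of the co-expressed group."""
    all_genes = set(promoters)
    return sorted(name for name, found in hits.items()
                  if {gene for gene, _ in found} == all_genes)

# Hypothetical upstream sequences for two co-expressed genes
promoters = {"gene1": "GGCTATAAAGGCACGTGTT", "gene2": "TTTATAAACCGG"}
motifs = {"TATA box": "TATAAA", "E-box": "CACGTG"}

hits = scan_sites(promoters, motifs)
print(hits)
print(shared_motifs(hits, promoters))
```

Motifs shared by all members of a co-expressed group are candidates for the common regulatory elements mentioned above.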
Diagnostic arrays
-Examples of factors showing variability that could be detected on arrays -Provide information of status of SNPs and gene expression profiles
Target sensitivity
Toxicity Heterogeneity of disease mechanisms
We know very little about the importance that variations in regulatory and intronic sequences have and how they differ between populations
Issues: