
Bioinformatics analysis of the epigenetic landscape of Drosophila: identification of Polycomb Response Elements

Gaëlle Cordier, Master's degree in Developmental Biology and Genetics, Universitat de Barcelona


SUPERVISOR: Dr. Enrique Blanco García

AUTHOR: Gaëlle Cordier

Barcelona, September 7, 2012

INDEX

INTRODUCTION

AIMS

MATERIALS AND METHODS

RESULTS

CONCLUSIONS

BIBLIOGRAPHY

INTRODUCTION

The gene regulation machinery

Regulation of gene expression comprises the processes that cells use to control how the information in genes is turned into gene products (RNAs and proteins). There are several checkpoints at which the expression of a gene can be modulated, from transcription to the structural reorganization of RNAs or the post-translational modification of proteins. Among all these processes, DNA-to-RNA transcription is particularly critical. Transcriptional regulation is accomplished by a dual mechanism that acts at two levels: (a) at the DNA sequence, where transcription factors interact with regulatory DNA sequences (promoters and enhancers) located in the neighborhood of genes; (b) at the chromatin structure, where modifications of the histones in nucleosomes can alter the condensation level of the chromatin. Consequently, when and where a gene is transcribed depends on the coordinated interplay between both regulatory codes (genomic and epigenomic), which makes particular regulatory regions more or less accessible to specific subsets of transcription factors under certain circumstances.

Epigenetic regulation through histone modification: the role of Polycomb Group proteins (PcG)

The term epigenetics refers to all the changes in gene expression caused by mechanisms other than alterations in the underlying DNA sequence. Although such modifications do not affect the nucleotide sequence, they are inherited from cell to cell in a process known as epigenetic memory, whose role is crucial in cell differentiation and in the maintenance of the differentiated state by keeping the appropriate genes activated or repressed. Chromatin remodeling is one way of regulating genes, and it can be accomplished through DNA methylation and histone modification. In the latter case, chemical modifications of the histone tails, such as methylation or acetylation, modulate the chromatin structure, making it more or less accessible to transcription factors and thus ensuring the activated or silenced state of a given gene. One methylation mark associated with gene repression is H3K27me3 (trimethylation of lysine 27 of histone H3), while H3K4me3 (trimethylation of lysine 4 of histone H3) is a methylation mark associated with gene activation (Pirrotta, 2011).

Polycomb Group and Trithorax Group proteins (PcG and TrxG, respectively) are key components of this mechanism. They act antagonistically: PcG proteins repress gene expression, while TrxG proteins activate genes. Both groups were originally discovered in Drosophila melanogaster as regulators of the homeotic genes (also known as Hox genes), which are in charge of the specification of the body plan (see FIG. 1). They are responsible for maintaining the homeotic pattern after the Hox genes have been turned off: PcG proteins maintain silent expression states, while TrxG proteins maintain active expression patterns in the appropriate spatial domains (Cavalli et al., 2009). In fact, PcG proteins derive their name from the fact that the first sign of a decrease in PcG function is often a homeotic transformation of posterior legs towards anterior legs, which have a characteristic comb-like set of bristles (Pirrotta, 2011). Mutations in PcG and TrxG proteins can therefore lead to changes in the homeotic pattern, which can have drastic consequences for the development of the body (see FIG. 2).

Figure 1: Maintenance of the homeotic pattern through Hox genes in Drosophila: embryo (left) and adult (right). Adapted from W.H. Freeman and Company, 2008.

Figure 2: Drosophila Hox mutants. A) Antennapedia (Antp) mutant: wild type on the left, mutant with fully developed legs in place of antennae on the right (Photo by FR Turner, Indiana); B) Ultrabithorax (Ubx) mutant with two pairs of wings instead of the normal single pair, as the third thoracic segment is transformed into another second thoracic segment (Adapted from Günter P. Wagner, 2007).

Nowadays, both PcG and TrxG proteins are known to be part of a genome-wide mechanism that regulates many other target genes (including transcription factors) involved in a plethora of processes such as cell differentiation, cell growth, and development (Cavalli et al., 2007; Pirrotta, 2011; Ringrose, 2007).

PcG complexes and components

To date, multiple members of the PcG/TrxG families have been identified (see TAB. 1). The PcG proteins form two classes of multiprotein complexes: the Polycomb Repressive Complex 1 (PRC1) and the Polycomb Repressive Complex 2 (PRC2). PRC2 is directly responsible for introducing the H3K27me3 modification; its key component is Enhancer of Zeste (E(Z)), which contains a SET domain with histone H3 methyltransferase activity. PRC1, in turn, binds to this methylation mark, stabilizing the whole complex; its key component is Polycomb (Pc), which binds to the H3K27me3 mark through a chromodomain. It could be said that there is a kind of histone code, where PRC2 complexes act as writers while PRC1 complexes act as readers (Pirrotta, 2011). TrxG proteins, on the other hand, form a single complex. However, they act similarly to PcG proteins, introducing the H3K4me3 mark also through a SET domain (Pirrotta, 2011).

COMPLEX | COMPONENT | NAME | DOMAIN | FUNCTION
PcG: PRC1 | Pc | Polycomb | Chromodomain | H3K27me3 binding
PcG: PRC1 | Sce | Sex combs extra | RING domain | E3 ubiquitin ligase (?)
PcG: PRC1 | Psc | Posterior sex combs | RING domain | Sce cofactor
PcG: PRC1 | Ph (Ph-p & Ph-d) | Polyhomeotic (proximal & distal) | SPM domain | Protein-protein interaction
PcG: PRC2 | E(Z) | Enhancer of zeste | SET domain | H3K27 methyltransferase
PcG: PRC2 | Esc | Extra sex combs | WD40 | E(Z) cofactor
PcG: PRC2 | SU(Z)12 | Suppressor of zeste 12 homolog | Zn finger | ?
PcG: PRC2 | Pcl | Polycomb-like | PHD domain | ?
TrxG | Trx | Trithorax | SET domain | H3K4 methyltransferase
TrxG | Ash1 | Absent, small or homeotic discs 1 | SET domain | H3K4 methyltransferase
TrxG | Ash2 | Absent, small or homeotic discs 2 | PHD domain | ?
TrxG | Brm | Brahma | - | ATP-dependent nucleosome sliding
TrxG | Mor | Moira | - | Brm cofactor
TrxG | Osa | Osa | - | ?

Table 1: List of PcG and TrxG components.

PRE BINDING PROTEINS

COMPONENT | NAME | GROUP | KNOWN FUNCTION / SUPPOSED ROLE IN PREs
PHO | Pleiohomeotic | PcG (PhoRC complex) | DNA binding / PRC2 (E(Z)) recruitment
DSP1 | Dorsal switch protein 1 | --- | Involved in regulation of homeotic and other genes / General role in PcG recruitment and silencing
GAF | GAGA factor | TrxG | Chromatin modelling / Displacement of nucleosomes to allow binding
Zeste | Zeste | TrxG | Transcriptional activation of many genes / Activation and silencing at PRE/TREs
Grh | Grainy head protein | --- | Developmental expression patterns / Modulation of PRE/TRE function in different tissues or at different times of development
Sp1/KLF | --- | --- | Tissue specific expression patterns / Modulation of PRE/TRE function in different tissues or at different times of development

Table 2: List of PRE-binding proteins (Schwartz and Pirrotta, 2007; Ringrose, 2007).

The Polycomb Response Element (PRE)

To act on their target genes, both PRC2 and PRC1 must be recruited to the DNA, and this function is accomplished by the Polycomb Response Elements (PREs). These cis-regulatory DNA elements are sequences of several hundred base pairs whose position relative to their target genes is variable: sometimes they overlap the promoter, whereas in other cases they are located tens of kilobases away (Nègre et al., 2006). Recent studies indicate that many of them could in fact be regulating promoters of non-annotated genes (Enderle et al., 2011). Recruitment is performed via several sequence-specific DNA-binding proteins (see TAB. 2). Although PREs do not exhibit a conserved sequence, they do present short consensus motifs that are binding sites for these proteins (see FIG. 3). The number, order and presence or absence of these motifs are characteristic of each PRE, but motif clustering is thought to be relevant. In fact, the idea that clusters of such motifs are characteristic features of PREs was exploited by Ringrose and colleagues to produce an algorithm that computationally predicts putative PREs in the D. melanogaster genome (Ringrose et al., 2003). With regard to function, the particular role of each of these PRE-binding proteins is not entirely clear, but it has been suggested that the protein PHO is essential for PRC2 and PRC1 recruitment (Cavalli et al., 2007; Schwartz and Pirrotta, 2007). While several PREs have been identified in the fruit fly, there is no clear evidence of a sequence-based recruiting element in mammalian genomes. It is not known whether this is due to a totally different mechanism of recruitment or to a more diffuse one (Schwartz and Pirrotta, 2007; Pirrotta, 2011).

Figure 3: bxd PRE (Ubx gene) from several Drosophila species. PHO, GAF and Zeste binding sites are shown. Adapted from Schwartz and Pirrotta, 2007.

Recruitment of PcG complexes and histone methylation

Following the discovery of the specific affinity of the Pc chromodomain for trimethylated H3K27, a sequential recruitment process was proposed by Wang et al. in 2004. According to this model, the DNA-binding protein recruits PRC2, which methylates surrounding nucleosomes at H3K27; PRC1 would then be recruited by the affinity of the Pc chromodomain for H3K27me3. However, the available evidence does not entirely support this scenario, and it is more likely that H3K27me3 plays a role in stabilizing the binding of PRC1 after its recruitment (Schwartz and Pirrotta, 2007; Pirrotta, 2011). In parallel, a looping model has been proposed to explain H3K27 methylation by PREs that work at distances of several tens of kilobases from their target genes, given that both PRC complexes remain localized at the PRE (Schwartz and Pirrotta, 2007). Such looping could be mediated by transient interactions of the Pc chromodomain with surrounding H3K27me3 marks, which would allow E(Z) to methylate more distant histones (see FIG. 4).

Figure 4: Recruitment of PcG proteins and H3K27 methylation (looping model). (a) The PRE can act several kilobases away from its target gene and is composed of motifs that recruit DNA-binding proteins such as PHO, GAF, and DSP1 (not shown). (b) These proteins recruit both the PRC1 and PRC2 complexes (PHO being a key component of such recruitment). PRC2 methylates surrounding histones through the SET domain of E(Z), and (c) PRC1 binds to this mark through the chromodomain of Pc. The Pc-H3K27me3 interaction stabilizes the whole PRC-PRE complex and allows PRC2 to methylate subsequent histones. PRC1 can then contact more distant nucleosomes, forming a loop that finally allows PRC2 to methylate the histones of the target gene. Adapted from Schwartz and Pirrotta, 2007.

Finding potential PREs throughout the genome: two different approaches

It would be of great interest to reveal the location of PREs in the genome. To validate a PRE in the wet lab, a transgenic reporter assay is needed to prove that there is an actual cause-effect relationship between PcG proteins and the repression of a given gene (Ringrose et al., 2008). However, experimental methods are expensive and time-consuming. Consequently, it is not feasible to obtain a comprehensive list of PREs in the whole genome by experimental means. It is therefore necessary to narrow the search by first obtaining a list of potential PREs that would be confirmed in a second step.

A first way to find potential PREs is to implement a computational screening of motif consensus sequences in the Drosophila genome (Ringrose et al., 2003). However, this approach presents some drawbacks derived from the lack of a common PRE sequence (Schwartz and Pirrotta, 2007) and from the poor predictive power of the available motif models. A second way is to look for binding sites of PcG and PRE-motif proteins and for methylation marks using high-throughput methods (e.g. ChIP-on-chip and ChIP-seq experiments). With this approach, one would expect a PRE to present all three features: binding sites for PcG proteins, binding sites for PRE-motif proteins, and a particular H3K27me3 profile. However, although several remarkable experiments have been carried out in recent years, they all differ to a greater or lesser degree in the proteins, the cell type or tissue, and the developmental stage employed. This makes it difficult to compare them and to infer whether non-overlapping results are due to actual tissue or developmental differences (Ringrose, 2007).
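To make the first, motif-based approach more concrete, the following Python sketch scans a sequence for short consensus motifs and reports windows in which several motif types cluster. It is only a simplified illustration and not the algorithm of Ringrose et al. (2003): the motif strings, the window size, the step and the `min_distinct` cut-off are assumptions chosen for the example, and only the forward strand is searched.

```python
import re

# Illustrative consensus motifs (assumptions for this sketch, not taken from the text).
MOTIFS = {
    "PHO": "GCCAT",
    "GAF": "GAGAG",
    "Zeste": "[CT]GAG[CT]G",
}

def find_motif_hits(sequence, motifs=MOTIFS):
    """Return a sorted list of (position, motif_name) for forward-strand matches."""
    hits = []
    upper = sequence.upper()
    for name, pattern in motifs.items():
        # A lookahead is used so that overlapping occurrences are also reported.
        for match in re.finditer(f"(?=({pattern}))", upper):
            hits.append((match.start(), name))
    return sorted(hits)

def score_windows(sequence, window=200, step=50, min_distinct=2):
    """Keep the windows that contain at least `min_distinct` different motif types."""
    hits = find_motif_hits(sequence)
    candidates = []
    last_start = max(len(sequence) - window, 0)
    for start in range(0, last_start + 1, step):
        end = start + window
        names = {name for pos, name in hits if start <= pos < end}
        if len(names) >= min_distinct:
            candidates.append((start, end, sorted(names)))
    return candidates

# Toy sequence: one PHO-like and one GAF-like site embedded in poly-A.
toy = "A" * 10 + "GCCAT" + "A" * 10 + "GAGAG" + "A" * 170
print(find_motif_hits(toy))
print(score_windows(toy))
```

A scan of this kind depends entirely on the quality of the motif models, which is the drawback pointed out in the text.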

AIMS

The central objective of this project is to reconstruct bioinformatically the map of potential PREs in Drosophila from published high-throughput experiments (ChIP-on-chip/ChIP-seq). To obtain a reliable cartography of PREs, we will integrate data reflecting the localization of PRE motifs, PcG protein binding sites and the epigenomic profiles of H3K27me3 and H3K4me3. Defining this map of PREs is not a simple task, however. Therefore, two additional issues must be considered beforehand in order to establish a solid framework of rules to interpret and integrate such data appropriately.

a) Determine the common features of experimentally validated PREs.

b) Evaluate how to deal with variation due to different experimental conditions of equivalent experiments.

Finally, we would like to contribute to certain biological questions that still remain open, such as:

a) Is the computational prediction based on motif composition a suitable approach for finding reliable potential PREs?

b) Regarding the location of these elements along the genome, are putative PREs actually enhancers that regulate target genes from up to thousands of nucleotides away, or do they act as promoters immediately adjacent to the gene?

MATERIALS AND METHODS

Bibliographic research and construction of training data sets

To build our PRE predictions from track intersections and to shed some light on the questions raised above, we performed a literature search in order to construct three data sets of interest: (a) experimentally validated PREs, (b) genome-wide high-throughput experiments of PcG members and epigenomic profiles, and (c) published computational predictions of PREs in the Drosophila genome.

a) Experimentally validated PREs: we gathered 20 PRE sequences from different publications (see TAB. 3). For each one we annotated the genome location and the target genes. This information is fundamental to study the composition of PREs and to evaluate the performance of each predictive approach.

b) Genome-wide high-throughput sequencing data: we recovered from the bibliography the binding profiles of several PcG members and different histone posttranslational modifications (see TAB. 4). This information is very important to discriminate which regions of the genome present key features of PcG response elements. Our predictions, in fact, result from the appropriate intersection of these data. Prior to this, we studied the variability observed in these experiments due to different biological and methodological conditions.

c) List of 167 computationally predicted PREs: From the work by Ringrose and colleagues (Ringrose et al., 2003) we retrieved the list of computational predictions based on the motif composition of PREs. We integrated these predictions into our study to evaluate the consistency between this approach and high-throughput binding profiles.


PRE | COMPLEX | COORDINATES | GENE | REFERENCE
inv1 | INVECTED-ENGRAILED | chr2R:7356766-7358046 | inv | Cunningham et al., 2010, Mol. Cell. Biol. 30, 820-8
inv4 | INVECTED-ENGRAILED | chr2R:7362343-7363955 | inv | Cunningham et al., 2010, Mol. Cell. Biol. 30, 820-8
en | INVECTED-ENGRAILED | chr2R:7415790-7417331 | en | Kassis, 1994, Genetics 136, 1025-1038; Kassis et al., 1989, Mol. Cell. Biol. 9, 4304-4311
vg | --- | chr2R:8792591-8793190 | vg | Ringrose et al., 2011, Epigenetics & Chromatin 4, 4
cycA | --- | chr3L:11825715-11826839 | cycA | Martinez et al., 2006, Genes & Development 20, 501-513
bx | BITHORAX | chr3R:12528059-12529718 | Ubx | Ringrose et al., 2003, Developmental Cell 5, 759-771 (Fig. 1D)
bxd | BITHORAX | chr3R:12589351-12590909 | Ubx | Chan et al., 1994, EMBO J. 13, 2553-2564
iab-2 | BITHORAX | chr3R:12636160-12639160 | abd-A | Shimell et al., 2000, Dev. Biol. 218, 38-52
Mcp | BITHORAX | chr3R:12694874-12699475 | abd-B | Busturia et al., 1997, Development 124, 4343-4350
iab-6_HS1 | BITHORAX | chr3R:12707732-12708886 | abd-B | Pérez-Lluch, 2008, Nucleic Acids Res. 36, 6926-33
iab-6_HS2 | BITHORAX | chr3R:12711833-12712760 | abd-B | Pérez-Lluch, 2008, Nucleic Acids Res. 36, 6926-33
Fab-7 | BITHORAX | chr3R:12723000-12726610 | abd-B | Mihaly et al., 1997, Development 124, 1809-1820
iab-7 | BITHORAX | chr3R:12726611-12727470 | abd-B | Mishra et al., 2001, Mol. Cell. Biol. 21, 1311-1318
iab-8 | BITHORAX | chr3R:12744582-12749964 | abd-B | Barges et al., 2000, Development 127, 779-790
hh | - | chr3R:18950425-18964425 | hh | Chanas & Maschat, 2005, Mechanisms of Development 122, 975-987
Scr8.2Xba | ANTENNAPEDIA | chr3R:2656551-2661848 | scr | Gindhart and Kaufman, 1995, Genetics 139, 797-814
Scr10Xba.2 | ANTENNAPEDIA | chr3R:2711834-2713052 | scr | Gindhart and Kaufman, 1995, Genetics 139, 797-814
Scr10Xba.1 | ANTENNAPEDIA | chr3R:2718853-2721411 | scr | Gindhart and Kaufman, 1995, Genetics 139, 797-814
phd | POLYHOMEOTIC | chrX:2017450-2019824 | ph-d | Fauvarque and Dura, 1993, Genes Dev. 7, 1508-1520; Bloyer et al., 2003, Dev. Biol. 261, 426-442
php | POLYHOMEOTIC | chrX:2031476-2033445 | ph-p | Fauvarque and Dura, 1993, Genes Dev. 7, 1508-1520; Bloyer et al., 2003, Dev. Biol. 261, 426-442

Table 3: List of 20 experimentally validated PREs gathered from different publications. The table shows the different PREs, the complex they belong to, their corresponding coordinates, and their associated genes.

REFERENCE | CELL LINE | TECHNIQUE | BINDING ELEMENTS | NCBI GEO
Cavalli et al., 2009, PLoS Biol. 7, 146-163 | Embryo 4-12h | ChIP-on-chip | PC, PH, TRX-N, TRX-C, H3K27me3, H3K4me3, GAF, PHO, PHOL, DSP1 | A-MEXP-1251
The modENCODE Consortium, 2010, Science 330, 1787-1797 | Embryo-derived cell line | ChIP-on-chip | E(Z), GAGA | N.A.
Enderle et al., 2011, Genome Research 21, 216-226 | S2 cells | ChIP-seq | PC, PH, PSC, TRX, H3K4me3 | GSE24521
Pérez-Lluch et al., 2011, Nucleic Acids Research 39, 4628-4639 | Wing imaginal disc | ChIP-seq | H3K4me3, H3K27me3 | GSE24115
Nègre et al., 2011, Nature 471, 527-531 | Embryo 12-24h, L3 | ChIP-seq | GAGA, H3K27me3 | GSE23537, GSE15292

SITES (values in the order in which they are listed): 2110, 441, 4868, 167, 2480, 4893, 3019, 3152, 2951, 1982, 974, 4339; 2274; 5868, 22620, 5155, 141

Table 4: Binding profiles of different PcG members and histone post-translational modifications recovered from the bibliography.



Genome analysis

The UCSC Genome Browser (Kent et al., 2002) was used to store the training data sets as custom tracks and to build our set of results from selected intersections of them. The UCSC graphical interface was used to capture the screenshots included in this work. We worked on the D. melanogaster genome assembly of Apr. 2006 (BDGP R5/dm3). The UCSC Batch Coordinate Conversion tool (liftOver) was used to convert published results that were only available in previous genome assemblies (dm1/dm2). The UCSC Table Browser was used to carry out the intersections and correlations among tracks. We used different UCSC tracks to infer novel knowledge from our final predictions (e.g. which genes are targets of a particular predicted PRE, phylogenetic sequence conservation levels, genome location in terms of promoters, introns, exons and intergenic regions, or potential association with non-coding elements): (a) RefSeq repository of genes (Pruitt et al., 2012): 14161 genes and 22738 transcripts are annotated in the fruit fly genome. To match a gene with a predicted PcG response element, we defined a maximum distance of 10 kb; (b) from RefSeq, we extracted the coordinates of exons and introns. In addition, we defined Promoters as the regions 1000 bp upstream of the TSS of genes, Downstream as the regions 1000 bp downstream of genes, and Intergenic as the regions free of exons, introns and promoters; (c) Drosophila Conservation track (Felsenstein and Churchill, 1996): based on an evolutionary alignment of 12 Drosophila species, we measured the degree of conservation of our predictions. As an alternative to the conservation track, we employed the catalog of 6779 highly conserved non-coding elements (HCNEs; Engström et al., 2007). The FlyBase Batch Download tool (the FlyBase Consortium, 2012) was used to convert lists of gene names between different nomenclatures (e.g. from RefSeq Genes to FlyBase Genes annotations).
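The promoter definition and the 10 kb gene-matching rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the procedure carried out in the UCSC Table Browser: the `Gene` record, the gene names and all coordinates are hypothetical, intervals are treated as 0-based and half-open, and the distance is measured between the nearest edges of the region and the gene.

```python
from collections import namedtuple

# Hypothetical gene record; coordinates are 0-based, half-open.
Gene = namedtuple("Gene", "name chrom start end strand")

def promoter(gene, length=1000):
    """Region `length` bp upstream of the TSS, taking the strand into account."""
    if gene.strand == "+":
        return (gene.chrom, max(gene.start - length, 0), gene.start)
    return (gene.chrom, gene.end, gene.end + length)

def distance(chrom, start, end, gene):
    """Gap in bp between a region and a gene (0 if they overlap, None if on another chromosome)."""
    if chrom != gene.chrom:
        return None
    return max(gene.start - end, start - gene.end, 0)

def target_genes(region, genes, max_distance=10_000):
    """Names of the genes lying within `max_distance` bp of the region."""
    chrom, start, end = region
    found = []
    for gene in genes:
        gap = distance(chrom, start, end, gene)
        if gap is not None and gap <= max_distance:
            found.append(gene.name)
    return found

# Toy annotation with made-up genes.
genes = [
    Gene("geneA", "chr2R", 5000, 8000, "+"),
    Gene("geneB", "chr2R", 30000, 32000, "-"),
    Gene("geneC", "chr3R", 5000, 8000, "+"),
]
print(promoter(genes[0]))
print(promoter(genes[1]))
print(target_genes(("chr2R", 10000, 10500), genes))
```

Because the promoter of a gene on the minus strand lies after its end coordinate, handling the strand explicitly avoids mislabeling downstream regions as promoters.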
The DAVID web site (Huang et al., 2009) was used to perform Gene Ontology analysis of the genes potentially associated with our predictions. We adapted different file formats to upload information into the UCSC Genome Browser. The BED format displays the data as discrete boxes (e.g. ChIP-enriched regions) defined by a start and an end position on the chromosomes. This is the format selected to store our results and the list of validated PREs, computational predictions and genome-wide sequencing data. Intersections between custom tracks must be performed in this format. The WIG and BEDGRAPH formats, in contrast, display continuous data (e.g. ChIP profiles) as a graph. These formats are particularly useful to observe the actual methylation and PcG protein occupancy profiles along the genome. They are also the formats required to calculate correlations between tracks and to display the occupancy profile on an idealized gene. To define ChIP-enriched regions on the genome-wide profiles of those publications that do not provide the coordinates of such elements, we manually established a threshold for each experiment that ensures that no fundamental genome region is missed (data not shown).

Occupancy profile in an idealized gene

To build the occupancy profiles of PcG proteins, PRE-binding elements and epigenetic marks on an idealized gene, we used the set of Perl and R scripts published by Pérez-Lluch and colleagues (Pérez-Lluch et al., 2011). This procedure estimates the average number of ChIP reads of a WIG track along the full set of genes in the fruit fly genome. The read counts of each ChIP were manually normalized to adjust the height of the maximum peak.
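The idea behind an idealized-gene profile can be sketched in Python as shown here. This is not the Perl/R pipeline of Pérez-Lluch et al. (2011), only an illustration under assumptions: the signal is a per-base list of values per chromosome, each gene is split into a fixed number of bins and oriented 5' to 3', and the averaged profile is scaled so that its maximum equals 1. All function names and the toy data are hypothetical.

```python
def gene_profile(signal, start, end, strand, bins=10):
    """Mean signal in `bins` equal slices of a gene, oriented 5' to 3'."""
    values = signal[start:end]
    if strand == "-":
        values = values[::-1]  # read minus-strand genes from their TSS
    n = len(values)
    profile = []
    for b in range(bins):
        lo = b * n // bins
        hi = max((b + 1) * n // bins, lo + 1)
        chunk = values[lo:hi]
        profile.append(sum(chunk) / len(chunk))
    return profile

def idealized_profile(signal_by_chrom, genes, bins=10):
    """Average the per-gene profiles and scale the result to a maximum of 1."""
    totals = [0.0] * bins
    used = 0
    for chrom, start, end, strand in genes:
        if end - start < bins:
            continue  # gene too short to be split into `bins` slices
        profile = gene_profile(signal_by_chrom[chrom], start, end, strand, bins)
        totals = [t + p for t, p in zip(totals, profile)]
        used += 1
    if used == 0:
        return totals
    mean = [t / used for t in totals]
    peak = max(mean)
    return [m / peak for m in mean] if peak > 0 else mean

# Toy signal: both genes carry their enrichment at the 5' end.
signal = {"chrA": [4, 4, 0, 0, 0, 0, 2, 2]}
genes = [("chrA", 0, 4, "+"), ("chrA", 4, 8, "-")]
print(idealized_profile(signal, genes, bins=2))
```

Scaling to the maximum makes profiles from experiments with very different read depths comparable in shape, at the cost of hiding their absolute enrichment.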


RESULTS

Biological mechanism

Before studying the composition of the set of 20 validated PREs gathered from the literature (see Materials and Methods and TAB. 3) and evaluating the accuracy of computational predictions, we aimed to learn more about the interaction between the Polycomb Repressive Complexes, the PRE-binding proteins and H3K27me3 using published ChIP data (see TAB. 4). These experiments involved different biological samples and developmental stages. However, when we plotted the occupancy profiles of equivalent genome-wide experiments, a similar distribution over the same area of the gene was observed (see FIG. 5). In fact, the correlation coefficient (CC) between such ChIP samples is high. For instance, the CC of H3K4me3 from Enderle et al. and Pérez-Lluch et al. is 0.71 (87% of target genes in common), and the CC of H3K27me3 from modENCODE and Pérez-Lluch et al. is 0.62 (51% of target genes in common) (see Supplementary TAB. 1 for further details).
[Figure 5 panels: GAF, H3K27me3, Pc and H3K4me3; data sets shown: modENCODE, Nègre, Cavalli, Pérez-Lluch and Enderle.]

Figure 5: Occupancy profiles of equivalent experiments under different biological conditions. The X axis represents gene distribution and the Y axis represents the number of ChIP reads.
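The two figures of merit used to compare equivalent experiments, a correlation coefficient between tracks and the share of target genes in common, can be illustrated with the Python sketch below. It assumes a Pearson coefficient computed over signal values sampled at the same genomic positions, and it defines the shared fraction relative to the smaller gene set; both choices are assumptions of this example, since the text does not specify them, and the gene names are made up.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of values."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both tracks need the same length (at least 2 values)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a flat track has no defined correlation
    return cov / math.sqrt(var_x * var_y)

def shared_fraction(targets_a, targets_b):
    """Fraction of the smaller target-gene set that is also found in the other set."""
    a, b = set(targets_a), set(targets_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return len(a & b) / smaller

# Toy tracks sampled at the same positions, and toy target-gene lists.
track_1 = [1, 2, 3, 4, 5]
track_2 = [2, 4, 6, 8, 10]
print(pearson(track_1, track_2))
print(shared_fraction({"g1", "g2", "g3", "g4"}, {"g2", "g3", "gX"}))
```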

Therefore, we have chosen a representative ChIP profile for each of the equivalent experiments giving preference to samples that provide a better resolution of enriched regions (see TAB. 5 and Supplementary FIG. 1).
ELEMENT | REF. | SAMPLE | METHOD | NCBI-GEO
Pc, Psc, Ph | Enderle et al., 2011 | S2 cells (Embryo 20-24h) | ChIP-seq | GSE24521
E(Z) | The modENCODE Consortium, 2010 | Embryo-derived cell line | ChIP on chip | N.A.
H3K27me3, H3K4me3 | Pérez-Lluch et al., 2011 | Wing imaginal disc | ChIP-seq | GSE24115
PHO, GAF, DSP1, Trx (Trx-C) | Cavalli et al., 2009 | Embryo 4-12h | ChIP on chip | A-MEXP-1251

Table 5: Selected ChIP profiles for each of the equivalent experiments, which represent PRC1 (Pc+Psc+Ph), PRC2 (E(Z)), methylation profiles (H3K4me3/H3K27me3), PRE motif proteins (PHO+GAF+DSP1) and TrxG proteins (Trx).

We next focused our interest on four different regulatory scenarios (see FIG. 6): PRE components; PRC1 and PRC2; Repression; Activation.

- PRE components: DSP1, GAF and PHO peak at the same location near the TSS of genes, indicating that most PREs are located in the proximal promoters of genes, as reported in previous works (Cavalli et al., 2009; Enderle et al., 2011). In fact, the CC between DSP1, GAF and PHO is on average about 0.75 at the genome-wide level (see Supplementary TAB. 1).

- PRC1 and PRC2: The PRC1 (Pc, Ph and Psc) and PRC2 (E(Z)) components are located in the vicinity of genes as well, consistent with the idea that both complexes interact. In addition, we detected increasing spreading of E(Z) within the genes, contrary to what was observed by Schwartz and colleagues, who detected an apparent presence of Pc beyond the PRE (Schwartz and Pirrotta, 2007). Moreover, the CC between E(Z) and Pc is moderate (0.38).

- Repression: When plotting the distribution of the main actors involved in gene silencing, we observed a similar peak pattern for the PRE-binding protein PHO and the PRC1/PRC2 components (Pc and E(Z)). Additionally, while the PcG recruitment machinery is mostly located in the promoter, the H3K27me3 silencing mark is positioned downstream, covering the gene, which supports the recruitment model described above. In agreement with that, we observe a low CC between PHO/Pc and H3K27me3 (0.27 and 0.13, respectively), and a high CC between PHO and E(Z). The CC between E(Z) and H3K27me3 is higher (0.49), probably because E(Z) also spreads over the gene.

- Activation: The distribution of the leading actors in gene activation, analogous to the repression profile, is located near the TSS of genes. Again, the Trx recruitment machinery (PHO, Trx) is found upstream of the H3K4me3 activating mark, as expected (Pirrotta, 2011). Interestingly, PHO, which is suggested to participate in the recruitment of Trx, seems to colocalize with another Trx cofactor called ASH2 (Pérez-Lluch et al., 2011). As observed in the repression profile for H3K27me3, the CC between PHO and H3K4me3 is low (0.36), while the CC between PHO and Trx is 0.73.

[Figure 6 panels: PRE (DSP1, GAF, PHO); PRC (E(Z), Ph, Pc, Psc); REPRESSION (Pc, PHO, E(Z), H3K27me3); ACTIVATION (Trx, H3K4me3, PHO, ASH2).]

Figure 6: Occupancy profiles of different regulatory scenarios

Experimentally validated PREs

To test the key features that are assumed to define a PRE, we evaluated the overlap between our set of 20 experimentally validated PREs and the most representative genome-wide ChIP profiles described in the previous section. We performed the intersection against Pc (PRC1), E(Z) (PRC2), PHO/DSP1/GAF (PRE-binding proteins) and H3K27me3. A strong overlap between the 20 validated PREs and E(Z), PHO or the H3K27me3 mark was observed (see FIG. 7). Actually, we would not expect H3K27me3 to colocalize with the recruitment machinery, since the distribution of the methylation mark lies after the Pc, E(Z) and PHO peaks. We consider that this might be due to the limited resolution of the peak enrichment detection software. In addition, we observed a smaller overlap for DSP1 and GAF, which could reflect that not all PREs contain such elements.
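The overlap test described above amounts to counting how many validated PREs are touched by at least one enriched region of a given ChIP track. The Python sketch below illustrates this under assumptions: the three PRE intervals are taken from Table 3, the peak list is invented for the example, and intervals are treated as half-open, which is a simplification of the coordinate conventions used by the UCSC tools.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def count_overlapping(regions, peaks):
    """Number of `regions` overlapped by at least one interval in `peaks`."""
    return sum(1 for region in regions if any(overlaps(region, p) for p in peaks))

# Three validated PREs from Table 3 (inv1, en, phd).
validated = [
    ("chr2R", 7356766, 7358046),
    ("chr2R", 7415790, 7417331),
    ("chrX", 2017450, 2019824),
]

# Hypothetical enriched regions, made up for this example.
toy_peaks = [
    ("chr2R", 7357000, 7357500),
    ("chr3R", 2017500, 2018000),
    ("chrX", 2019800, 2021000),
]

hits = count_overlapping(validated, toy_peaks)
print(f"{hits} of {len(validated)} validated PREs overlap a peak")
```

Any overlap of a single base counts as a hit here, so broad marks such as H3K27me3 domains can overlap a PRE even when their summit lies elsewhere, which is in line with the resolution caveat raised in the text.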

Figure 7: Venn diagrams showing the intersection between our set of 20 validated PREs, different PRC1/PRC2/PRE components and H3K27me3.

Computationally predicted PREs

We repeated the same procedure on the set of 167 computationally predicted PREs (see FIG. 8) from Ringrose and colleagues (2003). We measured a poor overlap between ChIP experiments and predictions, as already reported in previous publications (Cavalli et al., 2009; Schwartz and Pirrotta, 2007). For example, 134 predicted PREs (about 80%) overlapped neither E(Z), PHO nor H3K27me3 enriched regions, which were previously described as key components of the 20 known PRE regions. In fact, only 4 out of the 20 experimentally validated PREs are included in the 167 predictions.

Figure 8: Venn diagrams showing the intersections between the computational predictions and Pc (Enderle et al., 2011), E(Z) (The modENCODE Consortium, 2010), PHO, DSP1 and GAF (Cavalli et al., 2009), H3K27me3 (Pérez-Lluch et al., 2011) and the set of 20 validated PREs.

Construction of our set of predictions

[Table 6 layout: rows are the 20 validated PREs (inv1, inv4, en, vg, cycA, bx, bxd, iab-2, Mcp, iab-6 HS1, iab-6 HS2, Fab-7, iab-7, iab-8, hh, Scr8.2Xba, Scr10Xba.2, Scr10Xba.1, ph-d, ph-p); columns are PRE MOTIFS (PHO, GAF, DSP1), PcG COMPLEX (PRC1: Pc; PRC2: E(Z)) and METHYLATION (H3K27me3).]

Table 6: Enriched regions that have been detected on the 20 validated PREs (marked in grey).

From each selected ChIP experiment shown in Table 5 we defined a number of representative elements. Next, we evaluated this collection of experiments on the validated set of 20 PREs (see TAB. 6). As we can see, PHO, E(Z) and H3K27me3 seem to be particularly relevant in detecting the majority of the validated PREs, whereas other elements such as GAF and DSP1 do not appear to be essential in defining a PRE. To produce our own set of predicted PREs from these ChIP profiles, we define a formal framework in which an alphabet of symbols encodes the key components of PREs on the genome (one symbol per ChIP experiment):
Σ = {a, b, c, d, e, f, g, h}

where:

a = GAF
b = DSP1
c = PHO
d = H3K27me3
e = Pc
f = E(Z)
g = H3K4me3
h = TRX-C

Let us now define the language L that constitutes the complexes governing gene silencing/activation via a PRE-dependent mechanism:

L = {PRE_MOTIFS, PRC1, PRC2, PREd, TRX}

where each word is the intersection (∩) of the genomic regions of its components:

PRE_MOTIFS = a ∩ b ∩ c
PRC1 = PRE_MOTIFS ∩ d ∩ e
PRC2 = PRE_MOTIFS ∩ d ∩ f
PREd = PRC1 ∩ PRC2
TRX = PRE_MOTIFS ∩ g ∩ h

and where PREd corresponds to our final predictions. However, since GAF and DSP1 are not essential features of the validated PREs (as depicted in Table 6), we also described a second word in which PRE_MOTIFS is defined by PHO only (PREdPHO). As expected, PREdPHO produces more predictions than PREd (see TAB. 7), and retrieves more validated PREs as well (10 versus 7). However, our predictions failed to find two of the validated PREs that were expected to be detected. This is due to occasional inexact overlap among the custom tracks when performing the intersections, so it should be taken into account that some true positives may be missed by this method. We have characterized these two sets in terms of genome distribution (see TAB. 8), phylogenetic conservation, number of target genes, and gene ontology enrichment (see FIG. 9).
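The construction of the words PREd and PREdPHO can be sketched in Python as successive interval intersections. This is an illustration under assumptions, not the Table Browser procedure: the `intersect`, `word` and `predict_pres` helpers are hypothetical names, the toy tracks are invented, and an intersection here keeps only the bases shared by both tracks.

```python
from functools import reduce

def intersect(track_a, track_b):
    """Pairwise intersection of two lists of (chrom, start, end) regions."""
    result = []
    for chrom_a, start_a, end_a in track_a:
        for chrom_b, start_b, end_b in track_b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:
                result.append((chrom_a, start, end))
    return sorted(set(result))

def word(*tracks):
    """A word of the language: the intersection of all the given tracks."""
    return reduce(intersect, tracks)

def predict_pres(t, motif_symbols=("a", "b", "c")):
    """PREd = PRC1 ∩ PRC2, with PRE_MOTIFS built from `motif_symbols`."""
    pre_motifs = word(*(t[s] for s in motif_symbols))
    prc1 = word(pre_motifs, t["d"], t["e"])
    prc2 = word(pre_motifs, t["d"], t["f"])
    return word(prc1, prc2)

# Toy tracks keyed by symbol (a = GAF, b = DSP1, c = PHO, d = H3K27me3, e = Pc, f = E(Z)).
tracks = {
    "a": [("chr2R", 100, 300)],
    "b": [("chr2R", 150, 350)],
    "c": [("chr2R", 120, 280), ("chr3R", 500, 700)],
    "d": [("chr2R", 200, 1000), ("chr3R", 400, 900)],
    "e": [("chr2R", 180, 260), ("chr3R", 550, 650)],
    "f": [("chr2R", 220, 400), ("chr3R", 600, 800)],
}

print("PREd:   ", predict_pres(tracks))
print("PREdPHO:", predict_pres(tracks, motif_symbols=("c",)))
```

In this toy example the PHO-only word recovers a second region that lacks GAF and DSP1 signal, which mirrors why PREdPHO yields more predictions than PREd.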


SYMBOL | NAME | SITES
a | GAF | 4317
b | DSP1 | 3103
c | PHO | 4738
d | H3K27me3 | 5868
e | Pc | 1193
f | E(Z) | 972

WORD | SITES
PRE_MOTIFS | 1304
PRC1 | 155
PRC2 | 190
TRX | 32
PREd | 136
PREdPHO | 238

Table 7: Number of sites retrieved by each of the symbols and words.

GENOMIC REGION | PREd (136) | PREdPHO (238)
PROMOTERS | 69 (51%) | 115 (48%)
INTRONS | 28 (21%) | 40 (17%)
EXONS | 32 (24%) | 57 (24%)
INTERGENIC | 43 (32%) | 83 (35%)

Table 8: Genome distribution of our predictions.

[Figure 9: bar chart with the categories Transcription regulation, Pattern specification, Post-embryonic development and Cell fate; series PREd and PREdPHO; axis from 0 to 100.]

Figure 9: Gene ontology enrichment of the genes retrieved by the two sets of predictions. The log p score for PREd and PREdPHO respectively is 43 and 54 for transcription regulation, 11 and 17 for pattern specification, 10 and 12 for post-embryonic development, and 6 and 19 for cell fate.

In both cases, about half of the predictions are located over gene promoters, in agreement with what has been reported by Enderle and colleagues (2011); about 30% are found in intergenic regions, and roughly 20% each within introns and exons (note that the percentages do not sum to 100, since a prediction may overlap more than one region). Regarding conservation, we found no evidence that the predictions are particularly conserved (data not shown); this might be due to evolutionary plasticity, as discussed by Ringrose and colleagues (2008). The number of genes retrieved is 127 for PREd and 221 for PREdPHO, of which about 50% are associated with transcription regulation, between 10% and 20% with pattern specification, between 10% and 20% with post-embryonic development, and between 10% and 15% with cell fate. This is consistent with what has been observed in previous works (Schwartz et al., 2006; Cavalli et al., 2007). FIG. 10 shows UCSC screenshots for three of these target genes.

When we compared our list of genes with those reported in other publications, we noticed poor concordance with the genes retrieved from computationally predicted PREs (Ringrose et al., 2003) and from Pc hits (Tolhuis et al., 2006). However, more than half were also identified by Cavalli and colleagues (2007), probably because of the greater complexity of their predictions, which integrated more than one element of the PcG complex. In either case, we think that this generally poor overlap has two main causes: a) we might have found putative PREs that were not identified before; b) we do not find strong evidence supporting several of the already reported putative PREs.
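Concordance between two gene lists of this kind can be expressed as the fraction of our target genes that also appear in a published list. The sketch below is illustrative only: the function name `concordance` and the example lists are hypothetical (eve, Gsc, Doc2 and Doc3 are the genes shown in FIG. 10; the other entries are placeholders), and it assumes that both lists use the same gene identifiers.

```python
def concordance(ours, published):
    """Fraction of the genes in `ours` that are also present in `published`."""
    ours, published = set(ours), set(published)
    if not ours:
        return 0.0  # avoid dividing by zero for an empty list
    return len(ours & published) / len(ours)


# Hypothetical lists, for illustration only.
our_genes = ["eve", "Gsc", "Doc2", "Doc3"]
other_list = ["eve", "Doc2", "geneX", "geneY"]  # placeholder published list

shared = sorted(set(our_genes) & set(other_list))
print(shared, concordance(our_genes, other_list))
```

The measure is asymmetric by design: it describes how much of our list is confirmed elsewhere, not how much of the published list we recover.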

Figure 10.1: UCSC screenshot of a region of chromosome 2R. From top to bottom: gene even-skipped (eve), PHO (a), DSP1 (b), GAF (c), H3K27me3 (d), Pc (e), E(Z) (f), and PREd.


Figure 10.2: UCSC screenshots of a region of chromosome 2L (left) and of chromosome 3L (right). From top to bottom: genes goosecoid (Gsc) / Doc2 and Doc3, PHO (a), DSP1 (b), GAF (c), H3K27me3 (d), Pc (e), E(Z) (f), and PREd.


CONCLUSIONS
Distribution of PRE elements and recruitment model

As previously reported, we find that PRE motifs as well as the recruitment machinery are located over the PRE, while the methylation marks spread over the gene. Our observations also support the idea of a looping model: the presence of E(Z) beyond the PRE could reflect the interaction of the SET domain of E(Z) with more distant nucleosomes.

Computationally predicted PREs

Computational prediction of PREs based on motif composition has not proved to be a good approach: the predictions match neither the ChIP experiments nor the validated PREs. As reported by the authors (Ringrose and Paro, 2007), this might be due to two main reasons: a) not all PREs contain all the motifs that have been described; b) some PREs may be composed of other motifs that have not been identified yet.

Our predictions: target genes and genome distribution

In contrast to the computationally predicted PREs, our predictions have strong experimental support, since they result from integrating several ChIP experiments from different research groups; future studies may want to use this compilation of putative PREs in order to validate them through a transgenic reporter assay. We have found between 136 and 238 putative PREs in the Drosophila genome that may be regulating between 127 and 221 genes. In agreement with what has been reported in previous publications, the majority of these potential target genes are related to transcription regulation and post-embryonic development. In addition, about 50% of our predictions are located over promoters, while about 30% are found in intergenic regions. According to Enderle and colleagues (2011), several of these intergenic PREs could in fact be regulating non-annotated genes.

BIBLIOGRAPHY

Barges et al., 2000, Development 127, 779-790
Bloyer et al., 2003, Dev. Biol. 261, 426-442
Busturia et al., 1997, Development 124, 4343-4350
Cavalli et al., 2007, Cell 128, 735-745
Cavalli et al., 2009, PLoS Biol 7, 146-163
Chanas and Maschat, 2005, Mechanisms of Development 122, 975-987
Chan et al., 1994, EMBO J. 13, 2553-2564
Cunningham et al., 2010, Mol. Cell. Biol. 30, 820-8
Enderle et al., 2011, Genome Research 21, 216-226
Fauvarque and Dura, 1993, Genes Dev. 7, 1508-1520
Gindhart and Kaufman, 1995, Genetics 139, 797-814
Kassis, 1994, Genetics 136, 1025-1038
Kassis et al., 1989, Mol. Cell. Biol. 9, 4304-4311
Martinez et al., 2006, Genes & Development 20, 501-513
Mihaly et al., 1997, Development 124, 1809-1820
Mishra et al., 2001, Mol. Cell. Biol. 21, 1311-1318
Nègre et al., 2006, PLoS Biol 4, 917-932
Nègre et al., 2011, Nature 471, 527-531
Pirrotta, 2011, Handbook of Epigenetics, 107-121
Pérez-Lluch, 2008, Nucleic Acids Res. 36, 6926-33
Pérez-Lluch et al., 2011, Nucleic Acids Research 39, 4628-4639
Ringrose, 2007, Current Opinion in Cell Biology 19, 290-297
Ringrose and Paro, 2007, Development 134, 223-232
Ringrose et al., 2003, Developmental Cell 5, 759-771
Ringrose et al., 2008, PLoS Biol 6, 2130-2143
Ringrose et al., 2011, Epigenetics & Chromatin 4, 4
Schwartz and Pirrotta, 2007, Nature Reviews 8, 9-22
Schwartz et al., 2006, Nature Genetics 6, 700-705
Shimell et al., 2000, Dev. Biol. 218, 38-52
The modENCODE Consortium, 2010, Science 330, 1787-1797
Tolhuis et al., 2006, Nature Genetics 6, 694-699


SUPPLEMENTARY MATERIAL
Equivalent samples:

TRACK 1            TRACK 2                 C.C.
H3K4me3 Enderle    H3K4me3 Pérez-Lluch     0.71
                   H3K4me3 Cavalli         0.70
H3K27me3 Nègre     H3K27me3 Pérez-Lluch    0.62
                   H3K27me3 Cavalli        0.74
GAF Cavalli        GAF modENCODE           0.74
                   GAF Nègre               0.52
Pc Cavalli         Pc Enderle              0.21
Trx Cavalli        Trx Enderle             0.22
Ph Cavalli         Ph Enderle              0.30

Representative samples:

TRACK 1: Pc Enderle, E(Z) modENCODE, H3K4me3 Pérez-Lluch, GAF Cavalli, PHO Cavalli

TRACK 2                 C.C.
H3K27me3 Pérez-Lluch    0.13
GAF modENCODE           0.27
E(Z) modENCODE          0.38
H3K27me3 Pérez-Lluch    0.49
H3K27me3 modENCODE      0.33
H3K27me3 Pérez-Lluch    0.01
DSP1 Cavalli            0.79
Pc Enderle              0.26
E(Z) modENCODE          0.43
H3K27me3 Pérez-Lluch    0.27
H3K4me3 Pérez-Lluch     0.36
Trx Cavalli             0.73
DSP1 Cavalli            0.87
GAF Cavalli             0.66

Supplementary Table 1: CC (window data = 1000) between ChIP experiments of equivalent samples (first table) and representative samples (second table).
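As an illustration of the kind of calculation summarized in Supplementary Table 1, the following Python sketch computes a Pearson correlation coefficient between two signal profiles after averaging them over fixed-size windows. It is an assumption that the CC values are of this type; the function names, the toy signals and the window size used in the example are hypothetical and do not reproduce the tool or the data used here.

```python
from math import sqrt
from typing import List, Sequence


def window_means(signal: Sequence[float], window: int) -> List[float]:
    """Average the signal over consecutive, non-overlapping windows."""
    means = []
    for i in range(0, len(signal), window):
        chunk = signal[i:i + window]  # the last window may be shorter
        means.append(sum(chunk) / len(chunk))
    return means


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return cov / sqrt(var_x * var_y)


# Toy per-base signals (hypothetical); a window of 2 stands in for 1000.
track_1 = [0, 0, 4, 6, 8, 8, 1, 1]
track_2 = [1, 1, 3, 5, 9, 7, 0, 2]

cc = pearson(window_means(track_1, 2), window_means(track_2, 2))
print(round(cc, 2))
```

Averaging over windows before correlating makes the comparison less sensitive to small shifts between profiles of different resolution.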

Supplementary Figure 1: Comparison between two Pc ChIP profiles (Bithorax complex). Note the better resolution of Enderle's Pc profile (orange) with respect to Cavalli's Pc profile (purple).
