Bioinformatics Advance Access originally published online on December 3, 2008
Bioinformatics 2009 25(3):353-357; doi:10.1093/bioinformatics/btn622
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A predictive model for identifying mini-regulatory modules in the mouse genome
Biotechnology Program, CIS, University of Pennsylvania, 3330 Walnut Street, Philadelphia, PA 19104, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Rapidly advancing genome technology has allowed access to a large number of diverse genomes and annotation data. We have defined a systems model that integrates assembly data, comparative genomics, gene predictions, mRNA and EST alignments and physiological tissue expression. Using these as predictive parameters, we engineered a machine learning approach to decipher putative active regions in the genome.
Results: Analysis of genomic sequences showed nucleosome-free region (NFR) modules containing a higher percentage of conserved regions, RNA-encoding sequences, CpG islands, splice sites and GC-rich areas. In contrast, random in silico fragments revealed higher percentages of DNA repeats and a lower conservation. The larger conserved sequences from the Vista enhancer browser (VEB) showed a greater percentage of short DNA sequence matches and RNA coding regions in multiple species.
Our model can predict small regulatory regions in the genome with >95% prediction accuracy using NFR modules and >85% prediction accuracy with VEB elements. Ultimately, this systems model can be applied to any organism to identify candidate transcriptional modules on a genome scale.
Contact: myar{at}seas.upenn.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Discovering transcriptional regulatory regions or modules is prerequisite to deciphering gene regulatory networks in the mammalian genome. Regulatory modules (100–1000 bp) may have multiple transcription factor binding sites or elements that are located throughout the genome (Sinha et al. 2003). Current state-of-the-art approaches integrate highly parallel sequencing technologies and ChIP arrays that work in tandem with computational models (GuhaThakurta 2006). Although computational methods for modeling and targeting DNA regulatory elements have been developed over the past 30 years, recent studies using comparative genomics have reported that even ultra conserved regions (<200 bp) have a very high false detection rate of regulatory elements (Visel et al. 2008).
Despite the arsenal of bioinformatics tools already developed for promoter and gene-finding regions, it is still estimated that there are 105 more regulatory elements than coding genes in most eukaryotic genomes (Yuh et al. 1998). Significantly reducing this genome search space on the order of log scales requires that we use an array of fast and simple approximations. The ENCODE Consortium advanced this idea of collective computational analysis by using a compendium of models to identify functional elements inside 1% of the genome (ENCODE Project Consortium 2007).
A major issue in identifying functional sites in the genome is that regulatory elements are contingent on higher order chromatin structure that is dictated by selective constraints imposed by the cellular environment (Elnitski et al. 2006). Therefore, a model that integrates features of a DNA sequence reflecting genomic, cellular and physiological properties may have high predictive potential (Fig. 1).
|
It is well established that 2% of the cellular genome is replete with untranslated regions, non-coding genes, chromosomal structural elements, conserved areas and nucleosome-free regions (NFRs) (Chiaromonte et al. 2003,Waterston et al. 2002). In a recent study, we identified small functional transcriptional modules derived from NFRs from the mouse genome (Yaragatti et al. 2008). In this article, we propose a parallel approach using encoded regions (150–250 bp) to train an in silico model with the hope of identifying novel regulatory modules in the genome.
Quantitative representation of the genome is rooted in systems biology including the mapping of DNA sequences into vector space (Muller and Koonin 2003). Likewise, methods using numerical feature vectors characterizing DNA sequences have been proposed that use position weight matrices and motifs based on biophysical considerations of protein–DNA interactions (Djordjevic et al. 2003). Faithful to this paradigm of the digital genome, our proposed model transforms genomic sequences into functional bits that can be analyzed in a high dimensional vector space.
We have introduced a systems model that integrates information from sequence features to physiological tissue expression with a widely used learning algorithm known as a support vector machine (SVM). In its most reduced form, genome annotation data have the capacity of classifying genomic DNA into discrete classes that can be either active or inactive. To learn this concept, the SVM is shown a set of labeled DNA vectors from two groups and is trained to differentiate between them. Upon successful training, the SVM will be able to predict functionality of new DNA vectors with a minimal number of errors (Vapnik 1995). SVMs often make significantly fewer errors than other learning algorithms such as neural networks, hierarchical clustering, self-organizing maps and Bayesian methods (Brown et al. 2000).
In this article, we have implemented a systems model using mouse genomic fragments characterized with expression and sequence data to predict putative transcriptional regulatory modules.
| 2 SYSTEMS AND METHODS |
|---|
|
|
|---|
2.1 Sequence database
All sequences used in our analysis can be found in Supplementary Table 2. The NFR sequences were experimentally validated in a mammalian F9 cell line to increase the level of GFP expression when inserted upstream of a promoter site (Yaragatti et al. 2008). The average size of these fragments was 149 bp.
Additional sequences were randomly taken from the Vista enhancer browser (VEB) database (http://enhancer.lbl.gov/). These elements were computationally predicted and tested using a LacZ vector and expression in embryonic mouse blastocysts (Pennachio et al. 2006). The average size of these fragments was 1200 bp.
Mouse chromosomes were downloaded from the NCBI website (ftp://ftp.ncbi.nih.gov/genomes/M_musculus). We used a PERL script to concatenate the file into a single linear sequence. In order to emulate the enzymatic approach used in (Yaragatti et al. 2008), we designed a program to search for fragments with GGCC sites flanking both ends. Fragments with length more than 150 and less than 250 were selected and all sequences were mapped to the mouse genome (UCSC mm9, July 2007 Assembly). We labeled these sequences as in silico.
The dataset for LIBSVM requires fixed-length vectors. Each vector is labeled with a +1 to denote functional transcriptional regulatory elements and –1 to denote in silico random fragments. Each vector contains values representative of its features (41 dimensions).
2.2 Support vector machine
We used LIBSVM, which is a publicly available online library for training and predicting with SVMs (Chih-Chung and Chih-Jen 2001). The dataset was imported into the application and we created the model using the linear kernel with default parameters. The model was saved and used later for the prediction phase.
In the training phase, the SVM algorithm searches for a hyperplane in the input space that separates the positive from the negative samples. Based on learning theory, when many hyperplanes exist, an optimal procedure selects the hyperplane that is farthest from any training vector. This optimal hyperplane is known as the maximum margin hyperplane and can be calculated by quadratic programming. In prediction mode, the model was applied to the test data to predict the classification of unknown data.
The linear kernel was chosen for our analysis where training vectors, xi, are mapped into a higher dimensional space by the function
.
After solving for the optimal separating hyperplane, predictions can be made for samples that are not in the training set. This involves predicting the associated +1/–1 label for each test sample by determining which side of the hyperplane it falls on.
Given a training set of instance-label pairs (xi,yi), i = 1,...,l where xi
Rn and y
{–1, +1}, the SVM solves the following optimization problem:
|
|
|
|
2.3 Statistical analysis
All the vectors in our dataset are separated into two parts (training and test), of which, the test set is considered unknown when training the classifier. In 10-fold cross-validation, the training set is divided into 10 subsets of equal size. Sequentially one subset is tested using the classifier trained on the nine remaining subsets. Thus, each instance of the whole training set is predicted once, so the cross-validation accuracy is the percentage of data that are correctly classified. The goal of cross-validation here is to assess out of sample accuracy.
Logistic regression analysis and t-tests were performed using SPSS software (version 15.0). All data are reported with P < 0.05. Discriminant function analysis was also performed over the dataset using the built in SPSS function. A leave-one-out feature was used to calculate the cross-validation accuracy of the discriminant analysis function.
We created ROC curves using an R script that took the SVM classifier's prediction scores and plotted recall versus false positive rate at each classification threshold.
| 3 RESULTS AND DISCUSSION |
|---|
|
|
|---|
This article introduces a quantitative scheme to functionally represent regulatory sequences and a robust predictive model to capture specific transcriptional regulatory regions in a genome for experimental analysis.
3.1 Current approaches
There is an abundance of bioinformatics tools currently available for identifying regulatory modules based on sequence matching using probabilistic models and gene expression features (Blanchette et al., 2006; Liu et al., 2008; Segal et al., 2003; Sinha et al., 2003). However, the paradigm in sequence analysis assumes evolutionary constraints and hinges on the idea that comparing sequence alignments from multiple species will separate regulatory elements from background sequences (Cliften et al. 2003,Kellis et al. 2003,Liu et al. 2008). Notwithstanding this likelihood, an ab initio analysis compared the widely used phastCon scores available at the UCSC genome browser (http://genome.ucsc.edu/). Our comparison of sequence alignment scores for functional regulatory modules shows a significant loss in precision with an increasing number of vertebrate species (Fig. 2). As expected, the random genomic fragments showed no significant change with an increasing number of species.
|
Studies have shown that conserved areas in the genome have shown correlations with regions containing regulatory elements (Pennachio et al. 2006). Although conservation is a key feature for regulatory elements, incorporating more genomes into the phastCon algorithm yields a lower overall score as shown with our functional sequences. Thus, short regulatory elements may be missed if the genomic sequences that are being aligned come from distant organisms. If the organisms are evolutionarily close, however, the alignments could be extensive and therefore generate a large number of inaccurate elements. Regardless, multiple species alignments may still be equally useful in finding regulatory modules, but a complementary method may increase its overall accuracy. Using the same dataset, we attempt an SVM approach with the hope of showing more precision than using a sequence alignment-based approach.
3.2 Analysis of features
We performed a post hoc analysis to uncover characteristic differ-ences between NFR, in silico, and VEB sequences. In a t-test using a 95% confidence interval, we interpreted values (t > 3) as significant. All data can be referred to Supplementary Table 1 and more information about these features can be found at the UCSC Genome Browser website (Karolchik et al. 2008).
Transcriptionally active areas include promoter regions and tran-scription start sites that often have conserved regions, a large percentage of CpG islands, as well as GC-rich areas (Siepel et al. 2005). Significant differences between NFR modules and random genomic fragments included CpG Islands (t = 10.49), GC-rich areas (t = 12.6) and mammalian conservation (t = 6.39). However, VEB fragments only showed changes with mammalian conservation (t = 11.08) and in GC-rich areas (t = 4.59). When we compared NFR to VEB, we saw large differences in GC-rich areas (t = 26.79) and CpG islands (t = 10.66).
Sequence tagged sites (STS) are
200 bp, while expressed sequence tagged sites (ESTs) are
500 bp and both serve as markers for transcribed regions in the genome. These tagged sites can be found in GenBank. Additionally, NFR modules displayed a higher change in spliced (t = 6.52) and unspliced ESTs (t = 3.87).
Non-coding sequences compose over a third of the human genome and are typically derived from interspersed transposable elements during genomic evolution (Smit 1996). We defined features of these sequences as neutral since they are not directly associated with transcriptional activity. Tandem repeats have been associated with some regulatory functions or pathological roles; however, short interspersed elements (SINEs), long terminal repeats (LTRs) and low complexity are characteristics of degenerated sequences. Microsatellites include di- and trinucleotide repeats as well as other simple repeats maybe associated with roles in disease as well. Likewise, VEB sequences showed changes in SINE (t = 3.31), LTR (t = 3.42), low complexity DNA (t = 3.36) and short sequence matches (t = 9.85). Surprisingly, comparing NFR modules to VEB sequences revealed a compelling difference (t = 23.10). We report that NFR modules contain a higher number of short sequence matches (t = 5.66).
We interpreted physiological mouse tissue expression by using the Affymetrix chip data, however, only the Affy MOE430 chip (t = 3.59) showed significant differences. The mouse tissues covered include an array of brain and other physiological tissues (Karolchik et al. 2008).
RNA coding features are based on regions where reverse transcription into corresponding cDNA was reported. The unspliced ESTs refer to sequences that did not use 5'-caps, thus there is a possibility of unprocessed mRNA or non-mRNA contamination. Non-mouse RNA displayed translated blat alignments of vertebrate and invertebrate mRNA from GenBank of organisms other than mouse. NFR modules exhibited a higher fold difference of mouse RNAs (t = 4.37) and RNAs from other species (t = 2.67). Vista sequences show a high change in RNAs from other species (t = 9.45). NFR sharply differed from VEB sequences with predicted RNA transcripts from other species (t = 13.89).
A category described as protein-coding features is derived from an array of algorithms used to predict genic regions in the genome. NFR modules exhibited strong differences from random fragments based on models including MGC Genes (t = 4.61), Aceview Genes (t = 4.09), N-SCAN (t = 3.02), GeneTrap (t = 3.91), Exoniphy (t = 3.66) and Alternative Events (t = 3.34). VEB sequences showed a difference in SGP genes (t = 3.34).
By reason of a 10-fold size difference, the VEB sequences are capable of capturing more features imposing a bias in the t-test analysis. Albeit, their performance in the SVM analysis shows that their functional behavior is similar to NFR elements. This exemplifies the advantage of using a machine learning approach rather than a simpler classification technique that associates the distribution of features with functional and random sequences.
3.3 De novo prediction of modules
In order to understand the features associated with active sequences, we implemented a forward stepwise logistic regression to classify random and NFR sequences (random = 0, NFR = 1). We found that the model improves performance as we go from step 1 to step 5 with corresponding increasing values of R2 and decreasing –2log likelihood ratios (computed out of sample). Thus, using just five features (GC%, CpG islands, splice sites, short sequence matches and conservation), we can classify NFR and random sequences with 86.0% accuracy (P < 0.05).
Similarly, with random and VEB sequences (random = 0, VEB = 1), we found that the model improves performance as we go from step 1 to step 8. Thus, using just eight features (CCDS, non-mouse RNA, DNA repeats, tandem repeats, short sequence matches, SINEs, low complexity and conservation), we can classify random and VEB sequences with 84.4% accuracy (P < 0.05).
The logistic regression model is a reductionist way to identify novel modules, which is counterintuitive to a systems approach. Using the minimum number of features may not target distal regulatory regions in the genome (i.e. regions not associated with gene/promoter activity). Thus, using a more appropriate linear classifier such as an SVM may be more advantageous.
Overall, SVM using a linear kernel showed higher cross-validation scores than logistic regression (Fig. 3A) or than SVMs with polynomial, radial basis or sigmoid kernels (data not shown).
|
In an initial test, we trained our SVM with only random sequences and ran test simulations on the NFR and VEB sequences. The prediction accuracies were significantly low (<50%) suggesting that random elements contain a combination of features that make them poor predictors of functional elements (data not shown). Our analysis using 50% of functional sequences and 50% random sequences verify our assumption that it can be predicted with higher accuracy (NFR > 95%, VEB > 85%). ROC analysis (Fig. 4) was implemented to compare the characterization of features against both sets of functional elements.
|
As a control for our SVM results, we used SPSS to perform discriminant analysis, a classification function, on our dataset (Fig. 3B). As expected, the group classification accuracies are comparatively high with corresponding leave-one-out cross-validation values. This suggests that the results from the SVM model are not just random occurrences and that subsets of sequences have correlation to certain features.
A key aspect in our findings is that NFR modules slightly edged the VEB sequences in performance despite being about 10-fold smaller in length. Apropos of our model, this novel aspect is precedent since it enables us to identify very small regions in the genome making it easier to target specific transcriptional binding sites.
3.4 Random versus functional DNA
We have shown that using randomly selected fragments from the genome lead to poor predictors of functional transcriptional modules (<50% accuracy). The random fragments may have contained genic, intergenic and non-coding regions of the genome, but it was critical that they were not selected with any bias or criteria in mind except for size. Thus, they do not necessarily function as negative control elements, but rather neutral elements that were not pre-selected in any way (i.e. conserved or nucleosome-free). The parameters used in our SVM do not exclusively define features of regulatory elements, which is why we defined it as a systems model. In fact, based on the sequences we used, the view that our model merely classifies genic versus non-genic regions may have some validity. Regardless, we underscore that this is an ideal high-throughput approach for generating experimentally testable transcriptional regulatory modules.
Our initial experimental approach for isolating NFR modules in (Yaragatti et al. 2008) was ideal in design, but we reported only 20% efficiency in recovering functional sequences. Unequivocally, the approach used to generate VEB sequences suffers from biased targeting of conserved areas, yields very large fragments and has reported a high false positive rate (Pennachio et al. 2006). In addition, both experimental systems only tested for enhancer activity leaving much of the genomic landscape uncharacterized. Thus, the computational model introduced here could easily complement these functional assays by predicting loci-specific genomic sequences.
| 4 CONCLUSION |
|---|
|
|
|---|
The integration of sequence information using bit vectors is a gateway for functionally representing modules in a genome and a platform for future analysis. Ultimately, we hope this systems model works in tandem with current experimental approaches to uncover novel transcriptional regulatory elements. Using this quantitative framework, we hope to develop a computational toolbox for analyzing sequences and inferring gene regulatory networks.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We would like to specially thank Chris Stoeckert, Sridhar Hannanhalli and Ze Wang for reviewing this manuscript. We are indebted to Claudio Basilico and Lisa Dailey for their help. We also thank Leroy Hood and Scott Diamond for their support.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
Received on June 26, 2008; revised on October 17, 2008; accepted on November 29, 2008
| References |
|---|
|
|
|---|
Blanchette M. Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res (2006) 5:656–668.
Brown MP. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl Acad. Sci. USA (2000) 97:262–267.
Chiaromonte F, et al. The share of human genomic DNA under selection estimated from human–mouse genomic alignments. Cold Spring Harb. Symp. Quant. Biol (2003) 68:245–254.
Chih-Chung C, Chih-Jen L. LIBSVM, a library for support vector machines. last accessed date September 2008. Available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cliften P. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science (2003) 301:71–76.
Djordjevic M. A biophysical approach to transcription factor binding site discovery. Genome Res (2003) 13:2381–2390.
Elnitski L. Locating mammalian transcription factor binding sites: a survey of computational and experimental techniques. Genome Res (2006) 12:1455–1464.
ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (2007) 447:799–816.[CrossRef][Web of Science][Medline]
GuhaThakurta D. Computational identification of transcriptional regulatory elements in DNA sequence. Nucleic Acids Res (2006) 12:3585–3598.
Karolchik D. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res (2008) 36:D773–D779.
Kellis M. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature (2003) 423:241–254.[CrossRef][Medline]
Liu Y. Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res (2008) 3:451–458.
Müller HM, Koonin SE. Vector space classification of DNA sequences. J. Theor. Biol. (2003) 2:161–169.
Pennacchio LA. In vivo enhancer analysis of human conserved non-coding sequences. Nature (2006) 444:499–502.[CrossRef][Medline]
Segal E. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics (2003) 19:i273–i282.[Abstract]
Siepel A. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res (2005) 8:1034–1050.
Sinha S. A probabilistic method to detect regulatory modules. Bioinformatics (2003) 19:i292–i301.[Abstract]
Smit AF. The origin of interspersed repeats in the human genome. Curr. Opin. Genet. Dev. (1996) 6:743–748.[CrossRef][Web of Science][Medline]
Vapnik V. The Nature of Statistical Learning Theory. (1995) New York, NY: Springer.
Visel A. Ultraconservation identifies a small subset of extremely constrained developmental enhancers. Nat. Genet. (2008) 40:158–160.[CrossRef][Web of Science][Medline]
Waterston RH. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.[CrossRef][Medline]
Yaragatti M. Identification of active transcriptional regulatory modules by the functional assay of DNA from nucleosome-free regions. Genome Res (2008) 6:930–938.
Yuh CH. Genomic cis-regulatory logic: experimental and computational analysis of a sea urchin gene. Science (1998) 279:1896–1902.
This article has been cited by other articles:
![]() |
D. Y. Lee and K. C. P. Li Systems Diagnostics: The Systems Approach to Molecular Imaging Am. J. Roentgenol., August 1, 2009; 193(2): 287 - 294. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




