

Bioinformatics Advance Access originally published online on December 5, 2006
Bioinformatics 2007 23(3):281-288; doi:10.1093/bioinformatics/btl620

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Genomic sweeping for hypermethylated genes

Liang Goh, Susan K. Murphy, Sayan Mukherjee and Terrence S. Furey*

Institute for Genome Sciences & Policy, Duke University

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 ALGORITHM
 4 IMPLEMENTATION
 5 DISCUSSION
 REFERENCES
 

Motivation: Genes silenced by the aberrant methylation of nearby CpG islands can contribute to the onset or progression of cancer and represent potential biomarkers for diagnosis and prognosis. Relatively few have thus far been validated as hypermethylated in cancer among over 14 000 candidates with promoter region CpG islands. A descriptive set of genes known to be unmethylated in cancer does not exist. This lack of a negative set and a large number of candidates necessitated the development of a new approach to identify novel genes hypermethylated in cancer.

Results: We developed a general method, cluster_boost, that in an imbalanced data setting predicts new minority class members given limited known samples and a large set of unlabeled samples. Synthetic datasets modeled after the hypermethylated genes data show that cluster_boost can successfully identify minority samples within unlabeled data. Using genome sequence features, cluster_boost predicted candidate hypermethylated genes among 14 000 genes of unknown status. In primary ovarian cancers, we determined the methylation status for 15 genes with different levels of support for being hypermethylated. Results indicate cluster_boost can accurately identify novel genes hypermethylated in cancer.

Availability: Software and datasets are freely available at http://labs.genome.duke.edu/FureyLab/cluster_boost.php

Contact: tsfurey{at}duke.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
DNA methylation is an important epigenetic modification that is associated with transcriptional regulation, chromatin structure and embryonic development [for a review, see Robertson (2005) and references within]. The aberrant hypermethylation of CpG islands in promoter regions of key genes resulting in their transcriptional silencing has been associated with the onset and progression of human cancers (Robertson, 2005). The identification of genes consistently hypermethylated in cancer will contribute to our understanding of these diseases. These specific genes also represent potential biomarkers for the diagnosis and prognosis of certain cancers and targets for new therapies (Laird, 2003).

Several studies have investigated computationally predicting the methylation status of CpG dinucleotides or CpG islands (Feltus et al., 2003; Bhasin et al., 2005; Feltus et al., 2006; Bock et al., 2006), but these regions were not necessarily in promoters of genes. Predictions were based on the presence of DNA sequence motifs (Bhasin et al., 2005; Feltus et al., 2003, 2006) and other DNA attributes, such as sequence repeats and DNA structure characteristics (Bock et al., 2006). Training data included either the methylation status of CpG islands in normal tissue (Bhasin et al., 2005; Bock et al., 2006) or in fibroblast clones overexpressing DNMT1 (Feltus et al., 2003, 2006). The use of genomic features in creating accurate classifiers has similarly been demonstrated in other epigenetic gene silencing mechanisms, such as imprinted genes (Greally, 2002; Luedi et al., 2005) and X-inactivated genes (Wang et al., 2006).

Large scale experimentation using microarray technology (Adorjan et al., 2002; Weber et al., 2005; Hatada et al., 2006), bead arrays (Bibikova et al., 2006), and cloning and sequencing of methylated genomic sequence (Rollins et al., 2006) can be used to assay the methylation status of thousands of genomic regions at a time. Despite these technological advances, we still do not have a global representation of the genomic targets of gene specific hypermethylation in cancer. We also lack understanding of why certain genes are targets for hypermethylation in disease while other genes are not.

We have developed a novel computational approach, named cluster_boost, to sweep the genome for potential hypermethylated genes in cancer. Experimental evidence suggests that few genes are prone to hypermethylation (Weber et al., 2005), but thousands are possible candidates. Therefore, our algorithm is specifically designed for predicting members of a minority set (hypermethylated genes) within a large unlabeled dataset (remaining genes with promoter CpG islands). The approach adapts strategies from machine learning techniques developed for imbalanced data. These data, characterized by a disproportionate distribution of samples in the positive and negative classes, have been primarily studied in areas of fraud detection, target discrimination, text classification and computer security infringement (Kubat et al., 1998; Pednault et al., 2000; Japkowicz, 2003b; Chawla, 2003).

Previous studies have investigated imbalanced data in biological contexts (Choe et al., 2000; Qian et al., 2003; Yeo et al., 2005; Plant et al., 2006). Similar to classical imbalanced problems, these data contained samples for each class enabling a predictive model to be created. Our cluster_boost algorithm is the first designed not only for imbalanced data but also unlabeled data where only samples from one class are available.


    2 SYSTEMS AND METHODS
To evaluate the algorithm, we created two different sets of synthetic data that are modeled after the hypermethylated genes dataset and described below. Applying cluster_boost to these datasets provides rough measures of the sensitivity and specificity of this method as well as a means to approximate parameter settings.

Using genome sequence features, we used cluster_boost to predict genes prone to hypermethylation in cancer. A subset of these were tested experimentally in primary ovarian cancers. The specific experiments performed in this validation step are detailed below.

2.1 Hypermethylated genes and sequence features
We compiled a set of 63 genes previously reported to be hypermethylated in cancer (Supplementary Table S1). The majority of these are listed at the MD Anderson Cancer Center website (http://www.mdanderson.org/departments/methylation/) with a few additional genes extracted from literature. Each of these 63 hypermethylated genes has a promoter CpG island within 1.5 kb of its transcription start site (TSS). The CpG island annotation was taken from the UCSC Genome Browser (Kent, 2002) and is defined as described previously (Gardiner-Garden and Frommer, 1987). Genes defined in the Known Genes annotation in the UCSC Genome Browser that have a similarly placed promoter CpG island were considered potentially hypermethylated genes. The methylation status of these 14 249 genes in cancer is not known.

For each hypermethylated and unlabeled gene, we defined a window consisting of bases 100 kb upstream and 10 kb downstream of the TSS. For each window, 64 DNA sequence features were extracted from the UCSC Genome Browser (Supplementary Table S2). In general, they reflect the concentration of different sequence elements, primarily repeat sequences and transcription related elements, within the window.

2.2 Synthetic datasets
The two types of synthetic datasets created roughly modeled the hypermethylated genes dataset. For both, each sample is represented by a 64-value feature vector. There are three types of samples in these datasets: minority class samples with known labels; minority class samples with unknown labels; and majority class samples with unknown labels. The second and third categories comprise the set of unlabeled data from which we attempt to identify the minority samples. The number of samples from the minority class that are contained within the unlabeled data varied between 1 and 20% of the samples in this set.

2.2.1 Synthetic datasets SD1
We constructed samples for the minority and majority classes as follows. Let µ_hi and σ_hi, i = 1, ..., 64, be the mean and SD for each of the 64 sequence features describing the hypermethylated genes. Let sep be a separation parameter such that for each feature i, N(µ_hi, σ_hi) is the sampling distribution for the minority class and N(sep*σ_hi + µ_hi, σ_hi) is the distribution for the majority class. The parameter sep, therefore, controls the degree of separability of samples from the two classes. Figure 1 displays the effect of this separation parameter. (Samples are defined by only three features for visualization purposes.)


Fig. 1 Synthetic data with different separation values, sep = 1, 2, 3. Each sample is represented by three features. For each feature i, i = 1, ..., 3, N(µ_hi, σ_hi) was the sampling distribution for the minority class and N(sep*σ_hi + µ_hi, σ_hi) was the distribution for the majority class.

 
Twelve datasets consisting of 14 050 samples were created. Of these, 50 were the labeled minority class samples, and the remaining 14 000 samples were considered unlabeled samples. For each dataset, a separation parameter of either 1, 2 or 3 was employed. The unlabeled data were comprised of either 1, 5, 10 or 20% minority class samples and the remaining were majority class samples.
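The SD1 construction can be sketched in a few lines. The following is a minimal illustration rather than the authors' code: the function name make_sd1, its parameters and the use of Python's standard random module are assumptions made for this example, and mu and sigma stand for the per-feature mean and SD of the known hypermethylated genes.

```python
import random

def make_sd1(mu, sigma, sep, n_labeled=50, n_unlabeled=14000,
             minority_frac=0.05, seed=0):
    """Return (labeled_minority, unlabeled, hidden_labels) for one SD1-style dataset.

    mu, sigma: per-feature mean and SD (assumed inputs, one value per feature).
    hidden_labels marks which unlabeled samples came from the minority class (1) or not (0).
    """
    rng = random.Random(seed)

    def draw(shift):
        # One sample: feature i is drawn from N(shift * sigma[i] + mu[i], sigma[i]).
        return [rng.gauss(shift * s + m, s) for m, s in zip(mu, sigma)]

    # Labeled minority samples use shift 0, i.e. N(mu, sigma).
    labeled = [draw(0.0) for _ in range(n_labeled)]

    # Unlabeled pool: a fraction of hidden minority samples, the rest majority.
    n_hidden = round(minority_frac * n_unlabeled)
    unlabeled = [draw(0.0) for _ in range(n_hidden)]
    unlabeled += [draw(sep) for _ in range(n_unlabeled - n_hidden)]
    hidden = [1] * n_hidden + [0] * (n_unlabeled - n_hidden)

    # Shuffle samples and hidden labels together so position carries no information.
    order = list(range(n_unlabeled))
    rng.shuffle(order)
    unlabeled = [unlabeled[i] for i in order]
    hidden = [hidden[i] for i in order]
    return labeled, unlabeled, hidden
```

Calling make_sd1 once for each combination of sep in (1, 2, 3) and minority_frac in (0.01, 0.05, 0.10, 0.20) would yield twelve datasets of the kind described above.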

2.2.2 Synthetic datasets SD2
The features for SD1 datasets were created using normal distributions, but few sequence features in the known set of hypermethylated genes have values with this property of normality. To better simulate the hypermethylated genes data, we used feature vectors for these known hypermethylated genes to create new minority samples as follows:

(1) Randomly select 30% of the known, hypermethylated genes (minority class).
(2) Within this subset, determine the five nearest neighbors of the first gene in the subset based on Euclidean distance in feature space.
(3) Replace 30% of the 64 feature values for the selected gene with the mean value of that feature calculated using the five nearest neighbors.
(4) Repeat steps 1–3 to generate the desired number of minority samples.
(5) Randomly replace unlabeled samples with these new minority class samples.

A combination of the newly created minority samples and existing unlabeled data comprised the 14 249 samples in the unlabeled dataset. The number of new minority samples varied such that 1, 5, 10 or 20% of this unlabeled dataset were these synthetic samples. The 63 methylated genes were used as the known minority class.
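Steps 1 to 4 above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function names, the default arguments and the choice to pick the replaced features at random are assumptions. Step 5 (placing the new samples into the unlabeled set) is omitted.

```python
import math
import random

def make_sd2_sample(known, rng, subset_frac=0.3, k=5, replace_frac=0.3):
    """Create one synthetic minority sample from known minority feature vectors."""
    # Step 1: random subset (30%) of the known minority samples,
    # never smaller than k + 1 so that k neighbours exist.
    size = max(k + 1, round(subset_frac * len(known)))
    subset = rng.sample(known, size)
    seed_gene, others = subset[0], subset[1:]

    # Step 2: the k nearest neighbours of the first gene (Euclidean distance).
    neighbours = sorted(others, key=lambda g: math.dist(seed_gene, g))[:k]

    # Step 3: overwrite 30% of the features (chosen at random, an assumption)
    # with the mean of that feature over the neighbours.
    new = list(seed_gene)
    n_feat = len(new)
    for i in rng.sample(range(n_feat), round(replace_frac * n_feat)):
        new[i] = sum(g[i] for g in neighbours) / len(neighbours)
    return new

def make_sd2_minority(known, n_new, seed=0):
    """Step 4: repeat steps 1 to 3 until n_new synthetic samples exist."""
    rng = random.Random(seed)
    return [make_sd2_sample(known, rng) for _ in range(n_new)]
```

Each synthetic sample therefore differs from one real hypermethylated gene in at most 30% of its features, which keeps it close to the observed, non-normal feature distributions.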

2.3 Experimental validation
To validate predictions of hypermethylated genes, we tested for methylation in multiple primary ovarian cancers (N = 19 to N = 69, depending on the gene) using genomic DNA that was modified by sodium bisulfite as described previously (Huang et al., 2006). Sodium bisulfite converts unmethylated cytosines to uracils leaving methylated cytosines unaffected. Methylation-specific (MS) PCR was performed for each sample using one common primer that anneals to both methylated and unmethylated bisulfite modified DNA along with two primers that are specific to either methylated or unmethylated converted sequence. Bisulfite treated CpGenome Universal Methylated DNA (Chemicon International) was used as a positive control for methylation.


    3 ALGORITHM
The discrepancy in the number of hypermethylated (minority class) versus non-methylated (majority class) genes creates a classification problem involving an imbalanced dataset. Several methods have been designed to compensate for imbalanced data (Cardie and Howe, 1997; Kubat et al., 1998; Pednault et al., 2000; Japkowicz, 2003b; Chawla, 2003; Chen et al., 2004) as detailed below. The proposed techniques are aimed at ensuring that the uneven distribution of training data does not result in a biased classifier. As with traditional classifiers, these require known samples from each class to be used as training data.

Our novel cluster_boost algorithm has been designed for the general problem of predicting minority class members in this setting of imbalanced and unlabeled data. The algorithm uses a combination of k-means clustering followed by classification using the boosting algorithm. We briefly provide some background on each of these general algorithms as well as how they are employed in our cluster_boost method.

3.1 Algorithms for imbalanced data
Techniques for handling imbalanced data can be categorized into supervised and unsupervised. Supervised techniques employ traditional classification algorithms, but attempt to compensate for the smaller number of samples in the minority class by either undersampling the majority class or over-sampling the minority class or both (Chawla, 2003; Japkowicz, 2003b). Cost or weight functions have also been employed to deal with this disparity in sample size (Cardie and Howe, 1997; Chen et al., 2004). Optimal sampling ratios or cost functions have not been defined for the general case and have been dependent on particular datasets. The overall benefit of these methods has been a subject of debate (Chawla, 2003; Japkowicz, 2003a). Unsupervised techniques generally employ recognition-based predictors trained on samples from only one of the two classes, usually the majority class (Japkowicz et al., 1995; Kubat et al., 1998). In some cases, the second class is used to refine the learned class boundary.

Previous research investigated ratios of minority to majority samples in the range of 1:5 to 1:25 (Guo and Viktor, 2004; Zhang et al., 2004). It has been suggested that it is not so much the imbalance but rather the inability to learn important hidden traits represented by a small number of samples in the minority or majority classes that is the cause of poor performance by standard classifiers (Japkowicz, 2003a). Thus, it is important to ensure that hidden traits in data are present during model training.

3.2 The cluster_boost algorithm
We designed an algorithm, cluster_boost, that predicts new members of a minority class from a large set of unlabeled samples. The general strategy consists of iteratively constructing imbalanced data classification problems. The unlabeled data are used to create a series of imperfect majority training sets to be used with a known minority training set. Each unlabeled sample is in a majority class training set for one experiment and in the test set for the remaining experiments. Unlabeled samples consistently classified into the minority class are predicted to be members of that class.

The algorithm is summarized in Figure 2 and consists of three main steps. First, the unlabeled data are clustered based on their feature values. These clusters should reflect aspects of the distribution of this unlabeled data in feature space. Samples are selected from these clusters in a balanced manner to create imperfect majority training sets, hopefully preserving important hidden traits in all sets. Second, a series of m classification experiments are performed using each of the majority training sets. Together, these will classify each unlabeled sample m – 1 times. Third, the final prediction set is determined based on the combined results of the m classification experiments.
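The second and third steps can be pictured as a vote-counting loop. The sketch below is a simplified illustration under stated assumptions: minority_votes and the train callable are hypothetical names, and train stands in for any classifier (the paper uses boosting) that returns a predicate telling whether a sample is classified into the minority class.

```python
def minority_votes(minority_X, unlabeled_X, majority_sets, train):
    """Count, per unlabeled sample, how often it is classified as minority.

    minority_X:    feature vectors of the known minority samples.
    unlabeled_X:   feature vectors of all unlabeled samples.
    majority_sets: list of m lists of indices into unlabeled_X, one
                   imperfect majority training set per experiment.
    train(pos, neg): hypothetical callable returning predict(x) -> bool,
                   True meaning "classified into the minority class".
    """
    votes = [0] * len(unlabeled_X)
    for train_idx in majority_sets:
        in_training = set(train_idx)
        predict = train(minority_X, [unlabeled_X[i] for i in train_idx])
        for i, x in enumerate(unlabeled_X):
            # A sample is tested only in the experiments it was not trained on,
            # so it receives at most m - 1 votes.
            if i not in in_training and predict(x):
                votes[i] += 1
    return votes
```

Keeping the classifier behind a single callable makes it straightforward to swap boosting for another learner without touching the voting logic.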


Fig. 2 Algorithm for cluster_boost. Inputs are a small set of known minority samples along with a large set of unlabeled samples. The output is a set of unlabeled samples predicted to be in the minority class.

 
3.2.1 Clustering and creation of majority class sets
The unlabeled data consists of the full set of majority class samples and some small, unknown number of samples from the minority class. From this data, we extract samples to form a series of imperfect majority training classes. In this way, we create majority training classes that are better balanced with respect to the known minority training class. At the same time, clustering allows possible hidden traits of the majority class to be preserved in each of the majority training sets.

We clustered unlabeled samples using the unsupervised k-means clustering algorithm. k-means clustering is non-deterministic and seeks to cluster samples into k clusters from k random starting points. Other clustering methods, such as PAM, clara and Clest (Dudoit and Fridlyand, 2002) were evaluated and found to be either too computationally intensive or better suited for smaller sample sizes. Initially, samples are clustered using a range of values for k to obtain a distribution of the cluster sizes. To ensure that the final clusters are robust, we repeat clustering for each k several times. We select a final k based on balancing the following properties:

  1. Cluster sizes are consistently observed in every iteration.
  2. Every cluster must be large enough to ensure adequate representation in all majority training sets, but not so large as to be dominant.
  3. Clusters are compact as measured by the sum of distances of each sample within a cluster to its barycentre.

[Algorithm 3.1: scoring procedure for selecting k]

The final k is chosen based on Algorithm 3.1 that scores each k based on the above properties.
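The compactness measure in the third property can be written down directly. The sketch below covers only that measure and does not attempt to reproduce the full scoring of Algorithm 3.1; the function names are assumptions made for illustration, and each cluster is assumed to be a non-empty list of equal-length feature vectors.

```python
import math

def barycentre(cluster):
    """Feature-wise mean of the samples in one cluster."""
    n = len(cluster)
    return [sum(x[i] for x in cluster) / n for i in range(len(cluster[0]))]

def compactness(clusters):
    """Sum, over all clusters, of Euclidean distances from each sample to its cluster barycentre."""
    total = 0.0
    for cluster in clusters:
        centre = barycentre(cluster)
        total += sum(math.dist(x, centre) for x in cluster)
    return total
```

Lower values indicate tighter clusters. Because this sum generally shrinks as k grows, it would need to be weighed against the cluster-size properties rather than minimized on its own.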

Imperfect majority training sets for classification experiments are then created by first randomly dividing each cluster into m partitions, m being the number of classification experiments to be performed. Each of the m majority training sets is simply a combination of exactly one partition from each of the clusters such that each partition belongs to exactly one training set.
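A minimal sketch of this partitioning step is shown below. The function name and the strided slicing used to form near-equal partitions are assumptions for illustration; clusters is taken to be a list of lists of sample identifiers.

```python
import random

def majority_training_sets(clusters, m, seed=0):
    """Split every cluster into m random partitions; training set j
    receives partition j of each cluster."""
    rng = random.Random(seed)
    sets = [[] for _ in range(m)]
    for cluster in clusters:
        members = list(cluster)
        rng.shuffle(members)
        for j in range(m):
            # Strided slice: partition sizes within a cluster differ by at most one.
            sets[j].extend(members[j::m])
    return sets
```

Every unlabeled sample lands in exactly one training set, and each training set draws from every cluster in proportion to that cluster's size.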

3.2.2 Classification with the boosting algorithm
Boosting is an ensemble method based on repeated presentation of difficult samples for training so that the classifiers will learn these hard samples well (Freund and Schapire, 1996). Classifiers trained using the boosting algorithm and imbalanced datasets with ratios <1:25 have shown robust performance (Guo and Viktor, 2004; Joshi et al., 2001). This approach is well suited for imbalanced data as most discrimination-based classifiers have a tendency to learn the majority class well but the minority class poorly. We used a modified Adaboost algorithm that adjusts weights with respect to errors between expected and actual outputs (instead of a hard-limit function for output) for a feed-forward back-propagation MLP, as shown in Algorithm 3.2.

[Algorithm 3.2: modified Adaboost with a feed-forward back-propagation MLP]

3.2.3 Determination of predictions
Each sample from the unlabeled set is classified m – 1 times in m classification experiments. In general, we expect that the more times a sample is classified into the minority class, the greater the probability it belongs to that class. Using synthetic datasets, we evaluated the accuracy of samples classified as minority samples at different classification thresholds, where the threshold denotes the minimum number of experiments a sample must be classified in the minority class. We calculated a detection accuracy (Dacc) defined as the percentage of true minority samples for each threshold, 0 to m – 1. From this, we can estimate a minimum sensitivity and specificity at each threshold for the hypermethylated genes data.
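The thresholding step can be illustrated as follows. This sketch assumes one reading of Dacc, namely the percentage of samples at or above a threshold that are true minority samples; the function names and the votes and is_minority inputs are hypothetical.

```python
def predictions_at_threshold(votes, threshold):
    """Indices of samples classified as minority in at least `threshold` experiments."""
    return [i for i, v in enumerate(votes) if v >= threshold]

def detection_accuracy(votes, is_minority, threshold):
    """Percentage of samples at or above the threshold that are true minority samples.

    Returns None when no sample reaches the threshold.
    """
    picked = predictions_at_threshold(votes, threshold)
    if not picked:
        return None
    hits = sum(1 for i in picked if is_minority[i])
    return 100.0 * hits / len(picked)
```

On synthetic data, where the true labels are known, evaluating detection_accuracy for every threshold from 0 to m – 1 gives the curve from which a working threshold can be chosen.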


    4 IMPLEMENTATION
We first applied cluster_boost to two sets of synthetic data, SD1 and SD2 (see Methods), to assess the accuracy of the algorithm. We then predicted a set of novel hypermethylated genes in cancer. A small set of these predicted genes have been experimentally tested in primary ovarian cancers.

4.1 Synthetic data SD1
Experiments involving the SD1 synthetic datasets demonstrated the general ability of cluster_boost to identify unlabeled minority class samples. Each dataset consisted of 50 known samples from the minority class and a set of 14 000 unlabeled samples of which a certain percentage were generated from the minority class (see Methods). SD1 datasets evaluated the effectiveness of the algorithm on datasets with different degrees of separability as controlled by a separation parameter, sep. Also, the effect of the number of unlabeled minority samples was explored.

Each of the unlabeled datasets was initially clustered, with 32 clusters being created each time. The smallest cluster across all experiments was 102 samples while the largest was 815, with 87% of cluster sizes being between 200 and 700 samples.

We performed m = 20 classifications resulting in 19 classifications of each unlabeled sample. The training set was accurately learned even with imperfect majority class training sets. Training accuracy [(TP + TN)/(TP + TN + FP + FN)] was 94.5–99.1% with a sensitivity [TP/(TP + FN)] of 100% and specificities [TN/(TN + FP)] between 94.5 and 99.1%. Due to noise in the majority training class set, we did not expect 100% specificity for the training data. In fact, we found that increasing the number of unlabeled minority samples slightly decreased training set specificity reflecting the average amount of noise in this data.
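For reference, the three training statistics follow directly from the confusion-matrix counts. The helper below is a small illustrative sketch; the function name is an assumption, and it presumes both classes are present so that no denominator is zero.

```python
def training_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # fraction of minority samples recovered
        "specificity": tn / (tn + fp),   # fraction of majority samples recovered
    }
```

Because the majority training sets are known to contain some hidden minority samples, a number of the counted false positives may in fact be correct calls, which is why specificity below 100% is expected here.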

Table 1 shows average test set statistics for the 12 SD1 datasets. It is interesting to note that the separability of the data has little effect on the results. In contrast, a higher percentage of minority samples in the unlabeled data causes a noticeable degradation of sensitivity rates. This is likely due to the increase in contaminating samples in the majority training data.


Table 1 Results for cluster_boost on synthetic data SD1

 
The high specificity rates for the test set indicate that new minority samples are being identified accurately. We expect the sensitivity to be less than perfect due to the relatively small number of minority training samples and the noisy majority training set. Again we see that sensitivity depends more on the percentage of unlabeled data that is from the minority set than on the separability of the two classes.

We can control the number and accuracy of predictions by setting a minimum threshold for the number of times a gene is predicted to be in the minority class in all classification experiments. As expected, the more often a gene is selected to be in the minority class, the greater the probability it is in that class. In Table 1, we show the minimum threshold that achieves a Dacc of 100%. While the percentage of minority samples declines as the number of unlabeled minority samples increases, a high number of accurate predictions are still made. In an iterative way, new validated minority samples could be used to increase the labeled set and decrease the percentage of minority samples in the unlabeled set, eventually leading to the identification of most or all of the minority samples.

4.2 Synthetic data SD2
The second set of synthetic data, SD2, was created to more closely mimic the hypermethylated genes data. Data on the known 63 hypermethylated genes was used to create additional minority samples that were placed in the unlabeled dataset (see Methods). We varied the number of new samples that were created and repeated the entire experiment 10 times for each new sample size. Sizes of the 26 clusters created in each run ranged between 102 and 1391 samples with 85% between 200 and 1000. Again, each unlabeled gene was predicted 19 times (m = 20).

Table 2 displays the average results for each of the 10 iterations at each new sample size. These results only consider the synthetic data to be in the minority class, though certainly some of the real unlabeled genes also belong to this class. Therefore, the specificity is likely to be an underestimate, and the Dacc is probably higher as some of the ‘majority’ class samples are, in fact, true unlabeled minority samples (hypermethylated genes).


Table 2 Results for cluster_boost on synthetic data SD2

 
Figure 3 shows the average distribution of cumulative samples found at each classification threshold for datasets with 10% synthetic unlabeled data. Similar distributions are seen for datasets with 1, 5 and 20% synthetic unlabeled data (Supplementary Figure S1). The fraction of the synthetic samples at each threshold is indicated. Table 2 shows that at least 84% of synthetic samples are predicted to be in the minority class at least once.


Fig. 3 Cumulative number of samples at each classification threshold for SD2 datasets with 10% of unlabeled data synthetically created for the minority class. It is likely that some of the real unlabeled data belongs to the minority class.

 
4.3 Hypermethylated genes data
The results from the synthetic datasets show that cluster_boost can successfully identify unlabeled minority samples. Therefore, we applied cluster_boost to the set of 63 known hypermethylated genes and 14 249 unlabeled genes. The unlabeled data were grouped into 30 clusters ranging in size between 102 and 1389. To determine how well cluster_boost is able to learn the training data, we performed a series of training set validation tests also exploring support vector machine (SVM) and linear discriminant analysis (LDA) classifiers in addition to boosting.

First, we assessed training accuracy given the more conventional cross-validation approach. The unclustered unlabeled data was randomly split into m = 20 partitions. Classifiers were trained with all but one of the 20 partitions of unlabeled data and with the 63 known hypermethylated genes. We then determined how well the classifier learned this data by classifying the training data. This was repeated 20 times, each time holding out a different unlabeled partition. We also tested cluster_boost and modified cluster_boost classifiers substituting SVMs (cluster_SVM) and LDA (cluster_LDA) for boosting in the classification step. Using the 30 clusters of the unlabeled data, m = 20 partitions were created as described previously. Each training set consisted of a single partition of unlabeled data and the 63 known hypermethylated genes. Again, this was repeated 20 times with each unlabeled partition being in the training set once.

The results of these training set validations are shown in Table 3. In general, LDA does not perform as well as boosting and SVMs. For boosting and SVM classifiers, we see that specificity rates for more standard cross validation training data are high, but their sensitivity is low. This is likely due to the extremely imbalanced nature of the training data. Both cluster_boost and cluster_SVM have very high sensitivity and specificity with cluster_boost having a slightly higher sensitivity. Since we are most interested in accurately identifying new hypermethylated genes, the near perfect sensitivity of cluster_boost is extremely important.


Table 3 Training set validation experiments comparing boosting, SVM and LDA classifiers using the hypermethylated gene data

 
Following the cluster_boost algorithm, we performed 20 classification experiments resulting in 19 predictions for each unlabeled gene. Figure 4 shows the cumulative number of genes found at each threshold forming a similar distribution as for the SD2 datasets. From results using the SD2 data (Table 2), we estimate that the 69 predictions at a classification threshold of 13 will have an accuracy of at least 50% and possibly as high as 90%. Table 4 shows the 41 genes classified as hypermethylated in at least 15 of the experiments. A complete list of genes and the number of times each was classified as being hypermethylated can be found in Supplementary Table S3.


Fig. 4 Number of potentially hypermethylated genes found at each classification threshold.

 


Table 4 Genes classified as hypermethylated in at least 15 of 19 classification experiments

 
For 15 genes, shown in Table 5, the methylation was determined experimentally. The selection of these genes was not based solely on predictions by cluster_boost. We did seek to validate genes that had been classified as hypermethylated at different thresholds to obtain a more general assessment of the accuracy of the results.


Table 5 Genes experimentally tested by methylation specific PCR

 
Each gene was assayed for the presence of methylation in primary ovarian cancer tissues. Genes were considered to exhibit promoter methylation when at least 5 of the analyzed cancer specimens produced an amplification product using the primer set specific to the methylated sequence (see Methods). Some genes were found methylated in fewer than five tissues and were considered as having an unknown methylation status. As shown in Table 5, all five genes that were classified as hypermethylated at least five times proved to be methylated in ovarian cancer tissues. In contrast, only two of the nine genes classified as hypermethylated four or fewer times were confirmed as methylated. The methylated genes not predicted often may contain features not well represented in the small known set. Therefore, including these in the known training set may improve accuracy and help identify new candidates not previously predicted. A more complete description of these results will be presented elsewhere.

It is interesting to note that ~7000 (49%) of genes are never classified into the hypermethylated class. The results for the synthetic datasets suggest that these should contain a much lower percentage of hypermethylated genes than the full unlabeled data. Therefore, this set may be a reasonable approximation of an unmethylated dataset to be used in more traditional classification experiments, or alternatively as a subset of the unlabeled data from which to choose majority training sets for cluster_boost.


    5 DISCUSSION
Reasons for the aberrant acquisition of promoter methylation are still unknown. Our analyses, in addition to those of others, suggest that there is something special about the sequence context of certain promoter regions that predisposes them to becoming methylated. An initial analysis of the sequence features used to characterize genes shows that Alu repeat elements, and more generally the class SINE repeat elements, are depleted in promoter regions of hypermethylated genes. This has been reported by others for methylated CpG islands (Feltus et al., 2003), though the opposite has also been claimed (Weber et al., 2005). We also find a reduced amount of transcription in regions surrounding hypermethylated genes as indicated by decreased gene density and amount of sequence that is transcribed and translated. Lastly, we find an enrichment of single nucleotide polymorphisms (SNPs). A ranked list of the sequence features based on signal-to-noise ratio (SNR) calculations considering the full dataset and the average rank based on SNR calculations in each of the 20 training sets is included as Supplementary Table S4.

DNA sequence features had previously been shown to be factors for CpG island methylation, though because of data availability, these studies have been confined to the analysis of small numbers of genes or CpG islands. Instead of limiting ourselves to the same constraints, we decided to develop an approach that would allow us to do a genomic sweep for hypermethylated genes. By moving beyond traditional data mining toward mining for the unknown, we hope to ask questions that current data constraints do not otherwise permit.

While other computational methods have been developed to predict methylation status, none have directly addressed this problem in cancer. Three array-based methods have assayed for differentially methylated regions in cancer cell lines and/or tissues (Weber et al., 2005; Bibikova et al., 2006; Hatada et al., 2006). Each provided a list of genes with evidence of their hypermethylation in some cancer, with 18 (Weber set) (Weber et al., 2005), 36 (Bibikova set) (Bibikova et al., 2006) and 400 (Hatada set) (Hatada et al., 2006) unique genes in each list. These lists were not directly used in the construction of our known hypermethylated genes set, though 1, 8 and 3 genes from these lists, respectively, overlapped our known set.

We identified genes predicted in these other studies and looked for those that we also predicted in our set of 69 above the classification threshold of 13. Due to our requirement of a promoter CpG island, not all of the genes in these other prediction lists were in our unlabeled set. Only 3 of our predictions were also in the Hatada set, and none were in the other two. This lack of agreement was seen amongst the other sets as well with only one in common between the Weber and Bibikova sets (one of our known hypermethylated genes), one in common between the Weber and Hatada sets, and six (one a known hypermethylated gene) in common between the Bibikova and Hatada sets. In general, this indicates that there are likely many hypermethylated genes to be uncovered and that these discoveries will benefit from the application of several methods, both experimental and computational.
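
The pairwise comparison of prediction lists described above amounts to set intersections, optionally restricted to genes present in the unlabeled set. The sketch below shows one way to compute them. The gene identifiers are placeholders, not genes from any of the cited lists, and the `universe` parameter is a hypothetical stand-in for the promoter CpG island requirement.

```python
from itertools import combinations


def pairwise_overlaps(gene_lists, universe=None):
    """Return {(name_a, name_b): shared genes} for every pair of lists.

    If `universe` is given, each list is first restricted to it, mimicking
    a requirement such as "gene must have a promoter CpG island".
    """
    if universe is not None:
        gene_lists = {k: v & universe for k, v in gene_lists.items()}
    return {
        (a, b): gene_lists[a] & gene_lists[b]
        for a, b in combinations(sorted(gene_lists), 2)
    }


# Placeholder identifiers only; not real gene lists.
toy_lists = {
    "ours": {"G1", "G2", "G3"},
    "weber": {"G4", "G5"},
    "hatada": {"G3", "G5", "G6"},
}

overlaps = pairwise_overlaps(toy_lists)
restricted = pairwise_overlaps(toy_lists, universe={"G1", "G2", "G3", "G4"})
```

Sorting the list names before pairing gives each pair a single, predictable key, so `("hatada", "ours")` appears but `("ours", "hatada")` does not.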


    Acknowledgments
 
The authors gratefully acknowledge the excellent technical contributions of Yaqing Wen, Lauren R. Simel, Alison H. Gusberg and Carole Grenier. S.K.M. received support from the DoD Ovarian Cancer Research Program, award number W81XWH-05-1-0053.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Christos Ouzounis

1 TP = true positives; TN = true negatives; FP = false positives; FN = false negatives.

Received on July 24, 2006; revised on November 15, 2006; accepted on December 1, 2006

    REFERENCES
 

    Adorjan, P., et al. (2002) Tumour class prediction and discovery by microarray-based DNA methylation analysis. Nucleic Acids Res., 30, e21.

    Bhasin, M., et al. (2005) Prediction of methylated CpGs in DNA sequences using a support vector machine. FEBS Lett., 579, 4302–4308.

    Bibikova, M., et al. (2006) High-throughput DNA methylation profiling using universal bead arrays. Genome Res., 16, 383–393.

    Bock, C., et al. (2006) CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS Genet., 2, e26.

    Cardie, C. and Howe, N. (1997) In Kaufmann, M. (Ed.). Improving minority class prediction using case-specific feature weights. International Conference on Machine Learning. AAAI Press, pp. 57–65.

    Chawla, N. (2003) In Chawla, N. (Ed.). C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure. International Conference on Machine Learning. Washington DC, AAAI Press.

    Chen, C., et al. (2004) Using random forest to learn imbalanced data. Technical Report.

    Choe, W., et al. (2000) Neural network schemes for detecting rare events in human genomic DNA. Bioinformatics, 16, 1062–1072.

    Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol., 3, 1–21.

    Feltus, F.A., et al. (2003) Predicting aberrant CpG island methylation. Proc. Natl Acad. Sci. USA, 100, 12253–12258.

    Feltus, F.A., et al. (2006) DNA motifs associated with aberrant CpG island methylation. Genomics, 87, 572–579.

    Freund, Y. and Schapire, R. (1996) In Chawla, N. (Ed.). Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning. Bari, Italy, pp. 148–156.

    Gardiner-Garden, M. and Frommer, M. (1987) CpG islands in vertebrate genomes. J. Mol. Biol., 196, 261–282.

    Greally, J.M. (2002) Short interspersed transposable elements (SINEs) are excluded from imprinted regions in the human genome. Proc. Natl Acad. Sci. USA, 99, 327–332.

    Guo, H. and Viktor, H.L. (2004) Learning from imbalanced data sets with boosting and data generation: the databoost-im approach. SIGKDD Explor. Newsl., 6, 30–39.

    Hatada, I., et al. (2006) Genome-wide profiling of promoter methylation in human. Oncogene, 25, 3059–3064.

    Huang, Z., et al. (2006) High-throughput detection of M6P/IGF2R intronic hypermethylation and LOH in ovarian cancer. Nucleic Acids Res., 34, 555–563.

    Japkowicz, N., et al. (1995) A novelty detection approach to classification. Proceedings of the Fourteenth Joint Conference on Artificial Intelligence, pp. 518–523.

    Japkowicz, N. (2003a) In Chawla, N. (Ed.). Class imbalances: are we focusing on the right issue? International Conference on Machine Learning. Washington DC, AAAI Press.

    Japkowicz, N. (2003b) In Chawla, N. (Ed.). Learning from imbalanced data sets: a comparison of various strategies. International Conference on Machine Learning. Washington DC, AAAI Press.

    Joshi, M.V., et al. (2001) In Chawla, N. (Ed.). Evaluating boosting algorithms to classify rare classes: comparison and improvements. International Conference on Data Mining. Washington DC, AAAI Press.

    Kent, W., et al. (2002) The human genome browser at UCSC. Genome Res., 12, 996–1006.

    Kubat, M., et al. (1998) Machine learning for the detection of oil spills in satellite radar images. Mach. Learn., 30, 195–215.

    Laird, P.W. (2003) The power and the promise of DNA methylation markers. Nat. Rev. Cancer, 3, 253–266.

    Luedi, P.P., et al. (2005) Genome-wide prediction of imprinted murine genes. Genome Res., 15, 875–884.

    Pednault, E.P.D., et al. (2000) In Japkowicz, N. (Ed.). Handling imbalanced data sets in insurance risk modeling. AAAI Workshop. Austin, Texas, AAAI Press, pp. 58–63.

    Plant, C., et al. (2006) Enhancing instance-based classification with local density: a new algorithm for classifying unbalanced biomedical data. Bioinformatics, 22, 981–988.

    Qian, J., et al. (2003) Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics, 19, 1917–1926.

    Robertson, K.D. (2005) DNA methylation and human disease. Nat. Rev. Genet., 6, 597–610.

    Rollins, R.A., et al. (2006) Large-scale structure of genomic methylation patterns. Genome Res., 16, 157–163.

    Wang, Z., et al. (2006) Evidence of influence of genomic DNA sequence on human X chromosome inactivation. PLoS Comp. Biol., In press.

    Weber, M., et al. (2005) Chromosome-wide and promoter-specific analyses identify sites of differential DNA methylation in normal and transformed human cells. Nat. Genet., 37, 853–862.

    Yeo, G.W., et al. (2005) Identification and analysis of alternative splicing events conserved in human and mouse. Proc. Natl Acad. Sci. USA, 102, 2850–2855.

    Zhang, J., et al. (2004) Learning rules from highly unbalanced data sets. IEEE International Conference on Data Mining. Brighton, UK, IEEE Computer Society Press.





