Skip Navigation


Bioinformatics Advance Access originally published online on April 6, 2005
Bioinformatics 2005 21(12):2867-2874; doi:10.1093/bioinformatics/bti417
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/12/2867    most recent
bti417v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Levin, A. M.
Right arrow Articles by Kardia, S. L. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Levin, A. M.
Right arrow Articles by Kardia, S. L. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A model-based scan statistic for identifying extreme chromosomal regions of gene expression in human tumors

Albert M. Levin 1,*, Debashis Ghosh 2, Kathleen R. Cho 3 and Sharon L. R. Kardia 1

1Department of Epidemiology, University of Michigan Ann Arbor, MI 48104-3028, USA
2Department of Biostatistics, University of Michigan Ann Arbor, MI 48109-2029, USA
3Department of Pathology, University of Michigan Ann Arbor, MI 48019-2216, USA

*To whom correspondence should be addressed at Department of Human Genetics, School of Medicine, University of Michigan, 1241 E. Catherine Street, Rm#5912, Ann Arbor, MI 48109-0618, USA.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: The analysis of gene expression data in its chromosomal context has been a recent development in cancer research. However, currently available methods fail to account for variation in the distance between genes, gene density and genomic features (e.g. GC content) in identifying increased or decreased chromosomal regions of gene expression.

Results: We have developed a model-based scan statistic that accounts for these aspects of the complex landscape of the human genome in the identification of extreme chromosomal regions of gene expression. This method may be applied to gene expression data regardless of the microarray platform used to generate it. To demonstrate the accuracy and utility of this method, we applied it to a breast cancer gene expression dataset and tested its ability to predict regions containing medium-to-high level DNA amplification (DNA ratio values >2). A classifier was developed from the scan statistic results that had a 10-fold cross-validated classification rate of 93% and a positive predictive value of 88%. This result strongly suggests that the model-based scan statistic and the expression characteristics of an increased chromosomal region of gene expression can be used to accurately predict chromosomal regions containing amplified genes.

Availability: Functions in the R-language are available from the author upon request.

Contact: fcouples{at}umich.edu


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
In human tumors, gains as well as losses of genomic DNA occur commonly (Loeb and Christians, 1996). The application of genome-wide analysis techniques such as chromosome painting, comparative genomic hybridization (CGH) and loss of heterozygosity have given insights into the extensive nature of this genomic instability occurring in tumors (Gray and Collins, 2000). With the proliferation of gene expression microarray technologies and the availability of the human genome sequence, findings suggest that expression array data may be used to detect regions of amplification and deletion containing genes that influence tumor initiation and progression (Bayani et al., 2002; Beer et al., 2002; Godfried, 2002; Hyman et al., 2002; Kauraniemi et al., 2001; Miller et al., 2003; Pollack et al., 2002; Wolf et al., 2004). As such, the identification of chromosomal regions of increased or decreased expression has recently been a fruitful area of cancer research, and the number of methods being put forth to study these phenomena is growing (Caron et al., 2001; Crawley and Furge, 2002; Husing et al., 2003; Kano et al., 2003; Pollack et al., 2002; Zhou et al., 2003).

Generally, these methods utilize only the relative position of genes on an array-based gene map. This strategy implicitly assumes that the distances between the genes are fixed and constant, which further assumes that gene density is constant within and across chromosomes. Due to the spatial nature of the mechanism of amplification, genes that occur closer to one another are more likely to be included in the same amplicon. Therefore, in the context of studying gene expression patterns to identify amplified genes, a method that accounts for variation in both the distance between consecutive genes and gene density is necessary. Further, the GC content of a region has been shown to have an influence on gene density (Bernardi, 2000) and has been shown to be associated with patterns of increased gene expression (Godfried, 2002; Versteeg et al., 2003).

For these reasons, we have developed a method that accounts for variation in the distance between consecutive genes, gene density and GC content to identify extreme chromosomal regions of gene expression (E-CHaRGE). We demonstrate the method by applying it to a breast cancer dataset (Pollack et al., 2002) to assess the correspondence between increased and decreased E-CHaRGEs (I- and D-CHaRGEs, respectively) and measured DNA copy number. Finally, we demonstrate the ability of I-CHaRGEs to predict chromosomal regions containing gene amplification via a logistic regression model.


    SYSTEMS AND METHODS
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Scan statistic methodology
The model-based scan statistic developed here expands upon the method developed by Wagner (1999) and is influenced by the landmark paper in this area of scan statistic research by Dembo and Karlin (1992). Without loss of generality, the method is outlined for the analysis of gene expression profiles. Briefly, the scan statistic for gene expression patterns uses two data types—base pair gene position and gene expression level. For a significant I- or D-CHaRGE to be detected, there must be statistical evidence of both a spatial clustering of genes and a clustering of similar gene expression levels (i.e. either above or below a certain threshold for the respective E-CHaRGE) in that cluster of genes. A simple Poisson Process model can be used to detect spatial clusters of genes on a chromosome, but a compound Poisson Process model is necessary to incorporate the gene expression data. A further extension to the compound Poisson Process model is developed below to incorporate variation in gene density, conditional on GC content, into the final model-based scan statistic.

A scan statistic for gene clustering using a simple Poisson Process
A simple Poisson Process is a counting process denoted by {N(t), t ≥ 0}, where N(t) is a count of the number of events that occur in some time or space of length t. To identify clusters of genes on a chromosome (ignoring expression levels), the occurrence of a gene on the chromosome is considered to be the event of interest, and N(t) is the number of genes which occur over a given base pair length t on a chromosome described by a single parameter, {lambda}g, which is the rate of occurrence of genes over a distance of t base pairs [i.e. N(t)/t]. The expected number of genes, E[N(t)], over a given base pair length t is equal to {lambda}gt. For a given group of N(t) genes, we can test the null hypothesis that the N(t) = E[N(t)] against the alternative that N(t) > E[N(t)]. Rejecting the null in favor of the alternative would be consistent with a particular group of genes being identified as a significant cluster.

Reformulating this Poisson Process in terms of a scan statistic, we consider the r distances between the N(t) = r + 1 genes, where r represents the number of intervals between N(t) genes. Let Xi represent the position of the i-th gene on a chromosome (i = 1,...,n = total number of genes on the chromosome), then the distance between gene i and gene i + 1 can be described as Yi = Xi+1Xi. For a group of r + 1 genes, the distance from gene i to gene i + r may be expressed as the sum of the r distances, Si,r, between these r + 1 genes—that is, . Since {N(t), t ≥ 0} is a simple Poisson Process, the Yis are independent and identically distributed as exponential random variables with parameter {lambda}g, and Si,r is distributed as a gamma random variable with rate parameter {lambda}g, scale parameter r and density as follows:

where {Gamma}(r) is the gamma function—that is, {Gamma}(r) = (r – 1)!. This density allows one to compute the probability of a cluster of r + 1 genes over a base pair distance of Si,r. Specifically, the probability of observing a cluster of r + 1 genes over a base pair distance as short or shorter than the observed value of Si,r is computed from the density f(si,r) as follows:

(1)
One can then evaluate the statistical significance of the calculated probability by comparing it to some previously determined {alpha}-level (e.g. {alpha} = 0.05). If the observed probability is smaller than this selected {alpha}-level, then the group of genes is identified as a significant gene cluster.

A scan statistic for detecting I-CHaRGEs
To incorporate gene expression data into this scan statistic framework to identify I-CHaRGEs, we consider the original Poisson Process describing genes on chromosomes {N(t), t ≥ 0} to be partitioned into two independent Poisson Processes: one for genes exceeding a particular threshold, N1(t), t ≥ 0, and a second for those that do not, N0(t), t ≥ 0. As the genes on a microarray may have different dynamic ranges and variances, transforming the genes so that they all have the same distribution facilitates the definition of a common threshold value, equally applicable to all genes. Making the assumption that the expression of each gene is normally distributed, applying a standard normal transformation would achieve this goal. Let Zij be the standardized gene expression value computed from the non-transformed gene expression value Gij for each gene i across all tumors j (where j = 1,..., M, and M is the total number of tumors in the sample). Explicitly, Zij is calculated from Gij as follows:

where and sd (Gi) are the mean and standard deviation of the expression level for gene i, respectively. Therefore, all Zij will have a common normal distribution, with a mean equal to 0 and variance equal to 1.

Based on Zij, an indicator variable Iij is defined to classify the expression level for a particular gene i and tumor j as being increased or not—for example, Iij=1 if Zij > 1.96 and Iij = 0 otherwise. Because the increased expression Poisson Process {N1(t), t ≥ 0} is a subset of the original process, we may say that genes with increased gene expression occur at a portion of the rate {lambda}g. That is, genes with increased gene expression occur at a rate equal to {lambda}1 = {lambda}gp1, where p1 is the probability that a gene's expression exceeds the specified threshold. Setting the threshold at +1.96, for example, results in p1 = P(Iij = 1) = 0.025. Likewise, genes without an increased gene expression occur at a rate equal to {lambda}0 = {lambda}g(1 – p1) such that {lambda}1+ {lambda}0 = {lambda}g(1–p1)+{lambda}g(p1) = {lambda}g.

Based on these two independent Poisson Processes, we define the compound Poisson Process, {U(t), t ≥ 0}, for identifying I-CHaRGEs. U(t) is a sum of the independent and identically distributed Iis for a given tumor, defined as . As one can see, U(t) counts the number of genes that are increased above the set threshold over a base pair distance t containing a total of N(t) genes.

The base pair width of that cluster in a particular tumor is then represented by and is used to calculate the probability of an increased expression cluster based on the gamma density as follows:

(2)
where k (similar to r above) is the number of intervals between U(t) increased expression genes. In the case where U(t) = N(t) (i.e. all genes observed over a base pair distance of size t are increased), k = r and the probability calculation is virtually identical to the probability calculated using Equation (1), with the exception of substituting {lambda}1 for {lambda}g. The case where k < r corresponds to less than all of the genes in a region being over-expressed.

For the detection of D-CHaRGEs, the method described here for the detection of I-CHaRGEs may be applied reciprocally.

A model-based scan statistic conditional upon gene density and GC content
Up to this point, we have assumed a single genomic rate parameter, {lambda}g, to determine the probability of observing a particular I-CHaRGE. Using a single genomic rate parameter will underestimate the probability of a cluster in gene-dense regions (where the distance between genes is smaller on average) and overestimate the probability in gene-poor regions (where the distance between genes is larger on average). By estimating {lambda}g based on its position in the genome, {lambda}i, we can account for variation in gene density by modeling gene occurrence throughout the genome as an ‘inhomogeneous’ compound Poisson Process (Ross, 1996) and simply replacing {lambda}1 with {lambda}1i = {lambda}ip1 in Equation (2). The complication lies in the estimation of the {lambda}is.

A primary issue in the estimation of the {lambda}is is that they must be on a scale determined by the array-based map to ultimately generate appropriate E-CHaRGE probability estimates. For this reason, the initial estimates of the {lambda}is will be generated directly from the array-based map. However, these estimates of gene density are subject to error due to the gene sampling strategy used in the design of a particular microarray. For example, a gene-rich portion of the genome might be under-sampled on a particular array, and the resulting {lambda}i for that region may be underestimated. To account for a portion of this error, we propose conditioning of the initial estimates of the {lambda}is based on local genomic GC content.

Using GC content has a number of appealing characteristics. First, GC content has long been shown to correlate closely with gene density (Bernardi, 2000). This has been corroborated in recent papers that also explore gene expression in the chromosomal context (Caron et al., 2001; Versteeg et al., 2003). Further, GC content is the sole parameter used by the gene prediction program GENSCAN (Burge and Karlin, 1997) to determine the expected rate of gene occurrence for an input sequence (i.e. gene density), which is also modeled by GENSCAN as an exponential random variable. Finally, the human genome sequence is readily available, and thus, information on GC content is known and complete.

We utilized a two-stage linear modeling strategy to estimate the {lambda}is. In the first stage, we generate estimates of the {lambda}is based on the average distances between genes on the array-based map. Because the rate parameter is estimated as the inverse of the average distance between two genes, it is impossible to estimate unique {lambda}is for each gene i on the map. Therefore, we propose the division of the genome into bins of equal length (e.g. 10 Mb), where the genes that fall within a particular bin j (j = 1,...,nbins= total number of 10 Mb bins) share the same rate parameter {lambda}j (number of genes per base pair). Summing the distances between adjacent genes within the jth bin and dividing it by the number of such distances gives us the average distance between genes in the jth bin, which is equal to an estimate of 1/{lambda}j as follows:

where nj is the number of genes in the jth bin, {nu}j is the index for the first gene in the jth bin and yi is the distance between gene i and gene i + 1, as above. In the second stage, we model the as a function of GC content for each bin j, xGC,j, using the following linear model:

Therefore, to estimate the probability of an E-CHaRGE within the jth bin conditional on gene density and GC content, the predicted values from this model, multiplied by p1, can then be substituted into Equation (2). If a potential E-CHaRGE spans more than a single bin, a weighted average of the predicted rate measures from each bin would be an appropriate rate parameter estimate, where the weights would correspond to the number of genes in the potential E-CHaRGE from each bin. A weighted average would smooth the transition of the rate parameter between bins and reduce the possibility of biased probabilities due to edge effects.


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Map construction and data preparation
Genome-wide DNA copy number variation and gene expression level data measured in a series of primary human breast carcinomas (n = 37) by Pollack et al. (2002) were downloaded from the Supplemental Materials section of the Proc. Natl Acad. Sci. USA website for this paper. The dataset consisted of log2 ratio (tumor/normal) measurements for 6095 UNIGENE clusters, referred to henceforth as ‘genes’. To fill in the missing values in both the aCGH (array comparative genomic hybridization) copy number and expression datasets, we used a k nearest-neighbors algorithm for imputation (Troyanskaya et al., 2001). As suggested in the manuscript describing this algorithm (Troyanskaya et al., 2001), an intermediate (10 ≤ k ≤ 20) value of k = 15 was employed.

The chromosomal map of the genes was created by aligning the 6095 cDNA sequences (downloaded from GenBank) to the human genome (build #34) using the Blast-like alignment tool (Kent, 2002) running in DNA mode. Of the original 6095 genes, 5883 genes had acceptable alignments in the genome and make up the final map for analysis. Following the chromosomal positioning of the genes, a 5-gene moving median method was used to smooth the aCGH data to reduce the noise associated with individual gene measurements. Each chromosome of each of the 37 tumors was smoothed in this manner.

One of the assumptions of the Poisson Process model is that the distances between genes are distributed as independent and identical exponential random variables. As can be seen from the QQ-plot in Figure 1A, the distribution of the 5860 base pair distances between these 5883 genes deviated from the expected exponential distribution (i.e. it was right skewed). By employing a square root transformation of distances, the approximation to an exponential distribution was greatly improved (Fig. 1B).



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 1 QQ-plots comparing distance between gene distributions to randomly drawn exponential distributions. (A) The distribution of the 5860 base pair distances between adjacent genes spread throughout the genome (excluding chromosome Y) is compared to a randomly drawn exponential distribution, with a rate parameter equal to 2.01e – 06. The rate parameter is calculated as the inverse of the mean base pair distance between genes, which is equal to 496 892 bp. The dashed line drawn on the plot has an intercept = 0 and slope = 1. Deviation from this 45° line indicates deviation from the expected exponential distribution. (B) The distribution of the 5860 square root base pair distances between adjacent genes is compared to a randomly drawn exponential distribution, with a rate parameter equal to 0.0018 (the inverse of the mean of the 5860 square root distances between genes = 542.77).

 
For each of the 5883 genes of the array-based map, GC content was determined using annotation from the University of California at Santa Cruz Genome Browser (http://genome.ucsc.edu). For non-overlapping 20 kb regions across the genome, GC percent is estimated as the average number of G and C bases per 1000 base pairs over the 20 kb region (‘gcPercent’ file in the annotation database). Each gene was then assigned the GC content value of the 20 kb region within which it aligned. After the division of the genome into 10 Mb bins (n = 299), the GC content for each bin was determined by averaging the GC content of each gene within the bin. A table describing each of the 10 Mb bins (number of genes, base pair boundaries and average GC content) and the results of the two-stage linear modeling procedure to estimate the bin specific rate parameters is provided as supplemental data.

The model-based scan statistic results
The model-based scan statistic was applied to the genome-wide expression data for each of the 37 primary breast tumors using a standardized mRNA threshold value of 1.65 (+ for I-CHaRGEs and – for D-CHaRGEs), and statistically significant E-CHaRGEs were determined using a significance level {alpha} = 0.01. Using these criteria, a total of 377 I-CHaRGEs and 165 D-CHaRGEs were identified among the 217,671 expression measurements (5883 genes multiplied by 37 tumors). The relationship between these E-CHaRGEs and the smoothed DNA copy number measurements within them is displayed in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Percentage of aCGH DNA ratio measurements corresponding to I- and D-CHaRGEsa by DNA category.

 
We have used the DNA ratio ranges of Pollack et al. (2002) to classify the 217,671 aCGH copy number measurements into the following qualitative copy number categories: deletion = DNA ratio <0.8; normal = 0.8 ≤ DNA ratio < 1.2; low amplification = 1.2 ≤ DNA ratio < 2.0; medium amplification = 2.0 ≤ DNA ratio < 3.0; and high amplification = DNA ratio > 3.0. In Table 1, we show that the 165 D-CHaRGEs covered a total of 868 DNA measurements. Of the 4019 DNA measurements falling into the deletion category, 1.17% (n = 47) corresponded to D-CHaRGEs. In comparison, the 377 I-CHaRGEs covered a total of 2130 DNA measurements. Of the 62 DNA measurements within the high-amplification category, 67.74% (n = 42) corresponded to I-CHaRGEs. Further, of the 285 medium amplification measurements, 45.61% (n = 130) corresponded to I-CHaRGEs. This result shows a striking association between medium-to-high levels of copy number amplification and certain I-CHaRGEs.

Of the 377 I-CHaRGEs identified, 168 contained at least one gene with a DNA ratio measurement >1.2. The frequency distribution of these 168 I-CHaRGEs is displayed by chromosome in Figure 2. Of these 168 I-CHaRGEs, 33 contained at least one gene with a DNA ratio measurement in the medium-to-high amplification range (DNA ratio > 2.0). Whereas 135 I-CHaRGEs associated with low amplification seem to be spread more throughout the genome, these 33 I-CHaRGEs corresponding to medium-to-high level amplification seem to be found primarily on chromosomes 1 (n = 4), 8 (n = 6) and 17 (n = 15). To further explore the association between I-CHaRGEs and amplification within specific tumors, we focus on the results from chromosome 17 (Fig. 3).



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 2 Frequency of I-CHaRGEs corresponding to DNA amplification by chromosome. A total of 168 I-CHaRGEs contained at least one gene with a DNA ratio measurement >1.2. Of these 168 I-CHaRGEs, 33 contained at least one gene with a DNA ratio measurement in the medium-to-high amplification range (DNA ratio >2.0) and are shown as black bars. The remaining 135 I-CHaRGEs contained at least one gene with a DNA ratio measurement in the low amplification range (1.2 ≤ DNA ratio <2.0) and are shown as white bars.

 


View larger version (13K):
[in this window]
[in a new window]
 
Fig. 3 Relationship between increased DNA copy number and I-CHaRGEs identified on chromosome 17. The 359 genes mapping to chromosome 17 are ordered by position from the p-terminus (p-ter) to the q-terminus (q-ter) according to the structure of chromosome 17 displayed at the top of the figure. A total of 12 primary breast tumors contained I-CHaRGEs which overlap with at least one gene with a DNA ratio measurement ≥1.2. The identifier for each of these 12 tumors is displayed down the left column of the figure. Each tumor possesses two lines of information. The top indicates those genes in I-CHaRGEs (black bars). The bottom indicates those genes with a DNA copy number ratio between 1.2 and 2.0 (light gray bars) and >2.0 (dark gray bars).

 
On chromosome 17, a total of 30 I-CHaRGEs were identified, and 23 (77%) of these were associated with DNA amplifications. These 23 I-CHaRGEs were identified in 12 tumors, which are displayed in Figure 3. There is significant overlap between I-CHaRGEs and low and medium-to-high copy number amplifications on chromosome 17. However, there is heterogeneity within these results as well. For instance, not all the genes identified by I-CHaRGEs show DNA copy number measures in the amplification range (e.g. the two I-CHaRGEs on the p-arm of tumor Norway 101). Conversely, not all the DNA copy number measures in the amplification range are associated with I-CHaRGEs (e.g. the region from 17q21.33 to the q-terminus of tumor Norway 65).

Despite the heterogeneity, the majority of contiguous medium-to-high copy level amplifications on chromosome 17 are associated with I-CHaRGEs. Further, based on the results from Table 1, this association is genomic and not just restricted to chromosome 17. Therefore, we have termed these particular I-CHaRGEs, associated with medium-to-high level copy number amplifications, AMP I-CHaRGEs, and sought to determine whether there is a particular pattern in the I-CHaRGE characteristics (number of genes, mean expression, median expression, max expression and min expression) of AMP I-CHaRGEs that distinguish them from non AMP I-CHaRGEs (s).

To formally explore the association between I-CHaRGE characteristics and whether or not an I-CHaRGE is an AMP I-CHaRGE, we built a logistic regression AMP I-CHaRGE classifier. Using a stepwise variable selection procedure, the best logistic regression model was determined based on the Akaike information criteria (AIC). The final model contained the number of genes, mean expression and min expression characteristics. This AMP I-CHaRGE classifier is displayed in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 A logistic regression model for the classification of I-CHaRGEs associated with medium-to-high copy amplification: an AMP I-CHaRGEa classifier.

 
As can be seen from the parameter estimates (Table 2), both the number of genes and the mean expression of an I-CHaRGE have a positive association with the probability of being an AMP I-CHaRGE, while min expression is negatively associated. Therefore, the pattern of I-CHaRGE characteristics that distinguishes AMP I-CHaRGEs from s is that AMP I-CHaRGEs contain more genes, have a higher mean expression value and may possess at least one gene with very low expression relative to other tumors.

Using a predicted probability >0.65 for identifying an AMP I-CHaRGE, we validated this model using a 10-fold cross-validation procedure (Table 3). For the 377 I-CHaRGEs identified (using an expression threshold of 1.65 and an {alpha} = 0.01), nine AMP I-CHaRGEs (n = 33) and 341 s (n = 344) were correctly classified, giving an overall classification rate of 0.93. The classifier made a total of 11 AMP I-CHaRGE predictions, and nine were accurate, translating to a predictive value positive rate of 0.82 and a false discovery rate of 0.18 (1 minus the positive predictive value). Despite a relatively low sensitivity (0.26, or 9 of 33 AMP I-CHaRGEs correctly identified) of the AMP I-CHaRGE classifier, the cross-validated prediction result strongly suggests that the model-based scan statistic and the characteristics of individual I-CHaRGEs can be used to accurately predict chromosomal regions containing amplified genes.


View this table:
[in this window]
[in a new window]
 
Table 3 Ten-fold cross-validation results for the AMP I-CHaRGEa classifier by expression threshold and {alpha}-level combinations

 
We also explored the effect of altering the expression threshold and {alpha} -level used to identify I-CHaRGEs on the performance of the AMP I-CHaRGE classifier (Table 3). In general, there was a high degree of similarity among each classification measure across expression threshold and {alpha} -level combinations. The predictive value positive measures were generally high, with a median value of 0.81 and a range from 0.75 to 0.89. Similarly, the specificity measures were also high, with a median value of 1.00 and a range from 0.97 to 1.00. The sensitivity measures were lower, with a median value of 0.25 and a range from 0.16 to 0.41. This overall similarity of classification measures suggests that the AMP I-CHaRGE classifier is robust to alterations in the expression threshold and {alpha}-level used to identify I-CHaRGEs.


    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 
Our model-based scan statistic represents a significant advancement in methods to detect chromosomal patterns of gene expression. With the incorporation of the distances between genes and variation in gene density conditional upon GC content, we have accounted for genomic complexity that will result in a more precise estimation of E-CHaRGE probabilities and corresponds to the spatial nature of DNA amplification, one mechanism that may account for I-CHaRGEs. In our application to data from a breast cancer study where gene expression and DNA copy number were measured in parallel (Pollack et al., 2002), notable features underlying the relationship between gene expression and gene amplification were revealed. Generally, many I-CHARGEs were associated with multiple contiguous gene amplifications. Such I-CHaRGEs were not restricted to one chromosome or tumor, but rather, they were spread throughout the genomes of all 37 tumors studied. Further, a majority of medium-to-high copy number amplifications were associated with I-CHaRGEs, which we termed AMP I-CHaRGEs. These results indicate a regional impact of amplification on gene expression that may be captured through a method such as the model-based scan statistic, which allows for the inclusion of important genomic features (e.g. GC content) related to amplification.

A major advantage of our model-based scan statistic is that inferences are made for each chromosomal region of each individual tumor rather than being restricted to inferences on collections of tumors (Caron et al., 2001; Zhou et al., 2003). Using all I-CHaRGEs identified in each tumor, we were able to demonstrate that individual attributes of I-CHaRGEs (number of genes, mean expression and min expression) could be used to accurately predict AMP I-CHaRGEs. While other studies have demonstrated that increased regions of gene expression correspond to confirmed cancer-specific regions of recurrent amplification (Crawley and Furge, 2002; Zhou et al., 2003), this is the first demonstration that the expression data may be used by itself to predict gene amplification in the tumors under study. This suggests that microarray gene expression data and the model-based scan statistic may be used together to target particular tumors for molecular genetic studies to identify new amplicons.

In addition to its prediction accuracy, the AMP I-CHaRGE classification model also reflects a pattern of expression corresponding to putative amplicons where the majority of genes have increased expression but one or more may have very low expression. This model reflects the results of studies of gene expression within identified amplicons in a variety of primary human tumors (Bayani et al., 2002; Hyman et al., 2002; Linn et al., 2003; Pollack et al., 2002; Wolf et al., 2004), which suggest that this situation is more likely to be the rule than the exception. However, the relatively low sensitivity estimates of this model suggest that improvements in the classification of AMP I-CHaRGEs can be made. For instance, the use of more sophisticated classification methodology (e.g. neural networks, classification trees, etc.) that exploit other features of the regional expression pattern may lead to improved sensitivity and potentially increase our understanding about patterns of gene expression within medium-to-high copy number amplicons.

In comparison to previous methods to detect chromosomal patterns of gene expression, the model-based scan statistic approach has the advantage of being parametric. Thus, the statistical significance of an E-CHaRGE is determined in comparison to a theoretical null distribution rather than having to estimate it using permutation strategies where the results are data dependent and cannot be easily compared across different datasets. A downside to the parametric nature of the method is that the required gene expression standardization restricts inferences about increased and decreased expression relative to the mean of the tumor sample under study. This situation will make it impossible to identify the case where a majority of tumors have an identical E-CHaRGE. While this fault is non-trivial, we expect this situation to be exceedingly rare, as there was no single gene in the breast cancer dataset where a majority of tumors were either two-fold over- or under-expressed relative to normal tissue. Further, standardization allows for the application of this methodology to tumor sets where normal tissue controls are unavailable for study.

There are important features that must be addressed before utilizing the model-based scan statistic. First, for the Poisson Process model to be appropriate, the distances between events must be exponentially distributed or transformed to approximate that distribution. With the increasing number of genes measured on expression arrays, deviations from the underlying exponential distribution are expected to be less extreme. Next, under the Poisson Process model, the assumption is made that the lengths of genes are negligible in comparison to the distances between them. Given the large ratio of non-coding to coding DNA in the genome, this assumption is typically valid. However, in some areas of the genome, extremely large genes may violate this assumption, resulting in distorted probability estimates. Finally, we chose to use gene density estimates based on the array-based map, and to account for a portion of the error in these estimates, we conditioned them on GC content to incorporate a varying rate parameter into the model-based scan statistic. In our application, we estimated the rate parameter for 10 Mb bins. Alteration of the bin size from 2 to 20 Mbs had little impact upon the E-CHaRGEs identified (data not shown). GC content, however, is just one genomic covariate that has been shown to be associated with gene density. As more is known about the actual gene distribution of the genome, this knowledge can be readily incorporated in this methodology.


    Acknowledgments
 
Funding for this project was supported by the genomic sciences training grant T32 HG00040, AHA0130567Z, HL54457 and CA84953. D.G. acknowledges the support of NIH 1R01GM72007-01 from the DMS/DBS/NIGMS Biological Mathematics Program. A.M.L acknowledges the contributions of Dr. Chiang-Ching Huang to the ideas presented in the manuscript. A.M.L also acknowledges the helpful comments of the reviewers of this manuscript, especially reviewer #2.

Received on April 28, 2004; revised on March 24, 2005; accepted on March 29, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 SYSTEMS AND METHODS
 RESULTS
 DISCUSSION
 REFERENCES
 

    Bayani, J., et al. (2002) Parallel analysis of sporadic primary ovarian carcinomas by spectral karyotyping, comparative genomic hybridization, and expression microarrays. Cancer Res., 62, 3466–3476[Abstract/Free Full Text].

    Beer, D.G., et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816–824[Web of Science][Medline].

    Bernardi, G. (2000) The compositional evolution of vertebrate genomes. Gene, 259, 31–43[CrossRef][Web of Science][Medline].

    Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94[CrossRef][Web of Science][Medline].

    Caron, H., et al. (2001) The human transcriptome map: clustering of highly expressed genes in chromosomal domains. Science, 291, 1289–1292[Abstract/Free Full Text].

    Crawley, J.J. and Furge, K.A. (2002) Identification of frequent cytogenetic aberrations in hepatocellular carcinoma using gene-expression microarray data. Genome Biol., 3, RESEARCH0075[Medline].

    Dembo, A. and Karlin, S. (1992) Poisson approximations for r-scan processes. Ann. Appl. Probab., 2, 329–357.

    Godfried, M.B., et al. (2002) The N-myc and c-myc downstream pathways include the chromosome 17q genes nm23-H1 and nm23-H2. Oncogene, 21, 2097–2101[CrossRef][Medline].

    Gray, J.W. and Collins, C. (2000) Genome changes and gene expression in human solid tumors. Carcinogenesis, 21, 443–452[Abstract/Free Full Text].

    Husing, J., et al. (2003) Combining DNA expression with positional information to detect functional silencing of chromosomal regions. Bioinformatics, 19, 2335–2342[Abstract/Free Full Text].

    Hyman, E., et al. (2002) Impact of DNA amplification on gene expression patterns in breast cancer. Cancer Res., 62, 6240–6245[Abstract/Free Full Text].

    Kano, M., et al. (2003) Expression imbalance map: a new visualization method for detection of mRNA expression imbalance regions. Physiol. Genomics, 13, 31–46[Abstract/Free Full Text].

    Kauraniemi, P., et al. (2001) New amplified and highly expressed genes discovered in the ERBB2 amplicon in breast cancer by cDNA microarrays. Cancer Res., 61, 8235–8240[Abstract/Free Full Text].

    Kent, W.J. (2002) BLAT—the BLAST-like alignment tool. Genome Res., 12, 656–664[Abstract/Free Full Text].

    Linn, S.C., et al. (2003) Gene expression patterns and gene copy number changes in dermatofibrosarcoma protuberans. Am. J. Pathol., 163, 2383–2395[Abstract/Free Full Text].

    Loeb, L.A. and Christians, F.C. (1996) Multiple mutations in human cancers. Mutat. Res., 350, 279–286[Web of Science][Medline].

    Miller, C.T., et al. (2003) Amplification and overexpression of the dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 2 (DYRK2) gene in esophageal and lung adenocarcinomas. Cancer Res., 63, 4136–4143[Abstract/Free Full Text].

    Pollack, J.R., et al. (2002) Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc. Natl Acad. Sci. USA, 99, 12963–12968[Abstract/Free Full Text].

    Ross, S.M. Stochastic Processes, (1996) , New York Wiley.

    Troyanskaya, O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525[Abstract/Free Full Text].

    Versteeg, R., et al. (2003) The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res., 13, 1998–2004[Abstract/Free Full Text].

    Wagner, A. (1999) Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes. Bioinformatics, 15, 776–784[Abstract/Free Full Text].

    Wolf, M., et al. (2004) High-resolution analysis of gene copy number alterations in human prostate cancer using CGH on cDNA microarrays: impact of copy number on gene expression. Neoplasia, 6, 240–247[CrossRef][Web of Science][Medline].

    Zhou, Y., et al. (2003) Genome-wide identification of chromosomal regions of increased tumor expression by transcriptome analysis. Cancer Res., 63, 5781–5784[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
A. Buness, R. Kuner, M. Ruschhaupt, A. Poustka, H. Sultmann, and A. Tresch
Identification of aberrant chromosomal regions from gene expression microarray studies applied to human breast cancer
Bioinformatics, September 1, 2007; 23(17): 2273 - 2280.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. V. Sun, D. M. Jacobsen, and S. L. R. Kardia
ChromoScan: a scan statistic application for identifying chromosomal regions in genomic studies
Bioinformatics, December 1, 2006; 22(23): 2945 - 2947.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Callegaro, D. Basso, and S. Bicciato
A locally adaptive statistical procedure (LAP) to identify differentially expressed chromosomal regions
Bioinformatics, November 1, 2006; 22(21): 2658 - 2666.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/12/2867    most recent
bti417v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (8)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Levin, A. M.
Right arrow Articles by Kardia, S. L. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Levin, A. M.
Right arrow Articles by Kardia, S. L. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?