Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2006 22(23):2898-2904; doi:10.1093/bioinformatics/btl500
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Large scale data mining approach for gene-specific standardization of microarray gene expression data

Sukjoon Yoon 1,2,*, Young Yang 1,2, Jiwon Choi 1 and Jeeweon Seong 1

1 Department of Biological Sciences, Sookmyung Women's University Hyochangwongil 52, Youngsan-gu, Seoul, Republic of Korea, 140-742
2 Research Center for Women's Diseases (RCWD), Sookmyung Women's University Hyochangwongil 52, Youngsan-gu, Seoul, Republic of Korea, 140-742

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Identifying changes in gene expression in multifactorial diseases, such as breast cancer, is a major goal of DNA microarray experiments. Here we present a new data mining strategy to better analyze marginal differences in gene expression between microarray samples. The idea is based on the notion that considering a gene's behavior across a wide variety of experiments can improve the statistical reliability of identifying genes with moderate changes between samples.

Results: The availability of a large collection of array samples sharing the same platform in public databases, such as NCBI GEO, enabled us to re-standardize the expression intensity of a gene using its mean and variation in the wide variety of experimental conditions. This approach was evaluated via the re-identification of breast cancer-specific gene expression. It successfully prioritized several genes associated with breast tumor, for which the expression difference between normal and breast cancer cells was marginal and thus would have been difficult to recognize using conventional analysis methods. Maximizing the utility of microarray data in the public database, it provides a valuable tool particularly for the identification of previously unrecognized disease-related genes.

Availability: A user friendly web-interface (http://compbio.sookmyung.ac.kr/~lage/) was constructed to provide the present large-scale approach for the analysis of GEO microarray data (GS-LAGE server).

Contact: yoonsj{at}sookmyung.ac.kr


    1 INTRODUCTION
One of the most popular uses of DNA microarrays is the comparison of differences in gene expression under two distinct experimental conditions (treated versus untreated samples, diseased versus normal tissue, mutant versus wild-type organisms, etc.) (Breitling et al., 2004). In this type of experimental setup, a major challenge is the identification of those genes whose expression is significantly different between two conditions (Aittokallio et al., 2003). Many sophisticated statistical methods have been tested, in attempts to achieve a more reliable identification of differentially regulated genes (Huber et al., 2002; Irizarry et al., 2003; Yang et al., 2001; Yang et al., 2002). In fact, many of the existing statistical methods for microarray analysis have been developed by using datasets, in which changes in gene expression are abundant, with many genes having a high magnitude of change that far exceeds the observed variability in expression (Ramaswamy et al., 2001). However, for many physiological and metabolic conditions, the changes in gene expression are often moderate compared to the array-wide variability in expression, thus leading to modest P-values, such that existing statistical models often miss most of the real changes (Mootha et al., 2003).

Therefore, we sought to develop an analytical approach that provides experimental biologists with a more thorough understanding of the statistical significance of any list of genes produced under conditions where the real changes are of modest magnitude but the expression level is still high in both experimental conditions. An important issue is the normalization of the relative expression of genes across a series of microarray experiments (Colantuoni et al., 2002). Normalization across arrays has been extensively used to minimize systematic variations in specific samples (Bolstad et al., 2003; Gautier et al., 2004; Huber et al., 2002; Irizarry et al., 2003; Yang et al., 2002). The selection of appropriate controls for normalization has been proposed for comparisons of expression levels across samples (Yang et al., 2002). A set of controls (microarray sample pool) with minimal sample-specific bias over a large intensity range was introduced to aid in intensity-dependent normalization. In Affymetrix GeneChip arrays, each gene is represented by a set of 11–20 pairs of probes (a perfect match and a mismatch), and the intensities for each probe set are summarized by the log scale robust multi-array analysis (RMA) (Irizarry et al., 2003). It has been reported (Li and Wong, 2001) that the variation of a specific probe across multiple arrays could be considerably smaller than the variance across probes within a probe set. The RMA method effectively accounted for this strong probe affinity effect, and consequently improved the ability to detect differentially expressed genes between samples.

Since using multiple arrays for normalization had improved the detection of differentially expressed genes between samples, we attempted to further develop a gene-specific, multi-array standardization method using a large collection of expression data available from the public database. Our focus was to better understand the biological relevance of detected differences in gene expression, rather than to improve the fluorescence intensity normalization that many existing methods (e.g. the RMA method) focus on. With the rapid increase of microarray expression data in the public database over the past few years, it has become possible to monitor the general expression level of a gene in diverse biological samples under various conditions. Through the creation of a database-wide expression profile for individual genes (or probesets), which allows an estimation of the gene-specific distribution of expression level in various experimental conditions, it is possible to standardize individual gene expression intensities in a specific assay by using their unique database-wide means and standard deviations. This consideration of a gene's behavior in a wide variety of biological conditions gives us new insight into interpreting the expressional difference between given samples. From a biological point of view, the expressional difference of a gene with a small DB-wide expressional variation deserves more attention than that of a gene with a large DB-wide variation. Retrieving a large amount of expression data available in the public database, NCBI GEO (Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo) (Barrett et al., 2005), we developed a web-based computational tool that applies GEO-wide means and standard deviations to re-standardize individual gene expression levels in specific samples.

The organization of gene expression data in GEO is schematically presented in Figure 1. Submitted samples are assembled into biologically meaningful and statistically comparable GEO DataSets (GDS). Samples within a GDS refer to the same platform, i.e. a common set of elements is assayed. Each individual entity is assigned a unique and stable accession number; the accession number prefix indicates whether the record is a GEO Platform (GPL), DataSet (GDS) or Sample (GSM). GEO is the largest fully public repository for gene expression data. It currently holds over 70 000 samples (GSMs) generated from more than 2000 different DNA chips (GPLs).


Fig. 1 Schematic view of the organization of GEO (Gene Expression Omnibus) data. GEO samples indicated by bold letters were analyzed in this study. GSM2240 and GSM21241 were generated with normal epithelial cells. GSM21239 and GSM21238 are from a breast cancer cell line, HCC1954. The GEO platform GPL91 includes a total of 73 datasets (GDSs). GDS817 includes six GSMs, four of which were comparatively analyzed in this study. GPL91 includes 1850 GSMs in total. For the calculation of µGPL and σGPL for each gene, expression data in all 1850 GSMs from 73 GDSs were used.

 
It is assumed that the gene expression data in the GEO database are deposited after log ratio transformation for dual channel data and log(signal − background) transformation for single channel data. We tried to re-standardize single channel data based on the GEO-wide normal distribution model, N(µGPL, σGPL), where the mean (µGPL) and standard deviation (σGPL) of a gene were calculated from all the available expression data of the gene sharing the same GEO platform (i.e. DNA chip). This approach provides a unique gene-specific standardization based on the expression distribution of the gene in a collection of GDSs sharing the same platform. Data from dual channel experiments do not allow GEO-wide mean expression levels and standard deviations to be calculated for individual genes, since the deposited expression data are in the form of a ratio (R/G ratio) between a test and a reference set. Thus, the focus of this study was on the analysis of log(background-subtracted signal intensity) data from single channel experiments.

In order to calculate means and standard deviations from a large collection of heterogeneous datasets (GDSs), the individual array data must be locally standardized in advance to achieve a common scale. Thus, in practice, we used a two-step standardization procedure for single channel microarray data. Within-array standardization (array-specific Z-score calculation) was followed by gene-specific multi-array standardization using the GEO-wide mean and standard deviation of individual genes (gene-specific Z-score calculation). For the demonstration, a GEO DataSet (GDS) including samples of normal cells and a breast cancer cell line was analyzed by the present two-step procedure. The possibility of obtaining meaningful information from the second step, gene-specific standardization, was investigated in comparison with the result from the one-step array-specific standardization. The second step standardization intrinsically prioritizes genes that have small expressional variations in the database (small σGPL). Although the differential expression of a gene that has a large database-wide variation (large σGPL) may be underestimated by the present method, it is typically well recognized by conventional methods, such as direct t-tests and ranking tools provided by the GEO website. In this sense, the present method is complementary to existing analytical tools, and thus contributes to maximizing the utility of gene expression data deposited in the public database.


    2 METHODS
2.1 Microarray gene expression data acquisition
Microarray gene expression data were obtained from the Entrez Gene Expression Omnibus (GEO) ftp site (ftp.ncbi.nlm.nih.gov/pub/geo) (Barrett et al., 2005; Edgar et al., 2002). A set of single channel microarray data on normal and breast cancer cell lines was analyzed for the demonstration. The GEO Accession no. GDS817, which includes six samples (GSMs), is a collection of microarray experiments for the comparison of gene expression between breast cancer and normal epithelial cell lines (Fig. 1). Two samples in GDS817 are experiments done with the breast cancer cell line, HCC1954. Another two involve a normal epithelial cell line. We analyzed the difference in gene expression between these two types of cell lines. The GDS817 experiments were carried out with the Affymetrix U95A DNA chip (GEO Accession no. GPL91), which includes 12 651 probesets. GDS817 contains expression data records for only 12 625 probesets. Thus, our analysis was limited to this subset of probesets in GPL91. In addition to GDS817, a total of 72 additional GDSs sharing the common platform, GPL91, were retrieved from the current version of GEO for the gene-specific GEO-wide standardization procedure. In summary, expression data for 12 625 genes in a total of 1850 GSMs from 73 GDSs, which share the common platform GPL91, were retrieved and analyzed in this study.

2.2 Gene specific large-scale analysis of gene expression (GS-LAGE)
It is standard practice to correct foreground intensities by background subtraction (Edwards, 2003). For single channel experiment data deposited in GEO, it is assumed that the values were submitted as normalized (scaled) signal count data [e.g. log(signal − background) transformation]. However, in order to calculate the mean and standard deviation of the expression level of a gene using a large collection of datasets (GDSs) that were prepared and deposited by different research groups, the individual array data from GEO needed to be re-standardized to achieve a common scale. Thus, we first carried out within-array Z-transformation on each collected GEO sample by using the mean expression level and its standard deviation on a per-array basis, as below.

$$u_i = \frac{x_i - \mu_{GSM}}{\sigma_{GSM}}$$
where xi: the expression level of the ith probeset (gene) recorded in the given GEO sample (GSM); µGSM: the mean expression level of all genes in the given GSM; σGSM: the standard deviation of all genes in the given GSM.

ui is the standardized intensity of the ith gene in the given GSM. This GSM-based standardization process was repeated for all 1850 GSMs sharing the GPL91 platform, to permit comparisons between samples to be made. We next calculated the mean expression level and its standard deviation of each gene across arrays using the Z-transformed expression data from 1850 GSMs.
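The within-array step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a sample is represented as a plain list of log-scale intensities, the function name `within_array_z` and the toy values are hypothetical, and the population standard deviation is used because the text does not say whether n or n − 1 was used.

```python
from statistics import fmean, pstdev

def within_array_z(intensities):
    """Return u_i = (x_i - mu_GSM) / sigma_GSM for one sample (GSM)."""
    mu_gsm = fmean(intensities)
    sigma_gsm = pstdev(intensities)  # assumption: population SD
    if sigma_gsm == 0:
        raise ValueError("all intensities are identical; Z-score undefined")
    return [(x - mu_gsm) / sigma_gsm for x in intensities]

# Toy sample with four probesets (made-up values, not GEO data).
u = within_array_z([2.0, 4.0, 4.0, 6.0])
```

After this step every sample has mean 0 and standard deviation 1 across its probesets, which is what puts samples from different contributors on a common scale.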

$$\mu_{i,GPL} = \frac{1}{n}\sum_{j=1}^{n} u_{i,j}$$
where ui,j: the standardized expression level of the ith gene recorded in the jth GSM; n: the total number of GSMs in the given GPL (in this case, n = 1850).

µi,GPL is thus the GPL-wide mean of the ith gene. The Z-transformed expression level (ui) in a specific assay (GSM) was next re-scaled by the model established from the GPL-wide mean and standard deviation.

$$v_i = \frac{u_i - \mu_{i,GPL}}{\sigma_{i,GPL}}$$
where σi,GPL: the standard deviation of the GPL-wide expression levels of the ith gene.
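As a rough sketch of the second, gene-specific step, the following Python fragment computes the platform-wide mean and standard deviation of each gene and re-scales one sample. The data layout (a dict of equally ordered lists), the function name `gene_specific_z`, the sample ids and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev

def gene_specific_z(u_by_sample, target):
    """Re-scale one sample's Z-scores by each gene's platform-wide mean and SD.

    u_by_sample: dict mapping sample id -> list of within-array Z-scores,
                 all lists in the same probeset order.
    target: id of the sample whose v_i values are wanted.
    """
    samples = list(u_by_sample.values())
    v = []
    for i in range(len(samples[0])):
        across = [s[i] for s in samples]   # u_{i,j} over all samples j
        mu_i = fmean(across)               # mu_{i,GPL}
        sigma_i = pstdev(across)           # sigma_{i,GPL}; population SD assumed
        if sigma_i == 0:
            raise ValueError(f"gene {i} does not vary across samples")
        v.append((u_by_sample[target][i] - mu_i) / sigma_i)
    return v

# Three toy samples, two genes: gene 0 is stable, gene 1 fluctuates.
u_by_sample = {
    "GSM_a": [0.9, -1.0],
    "GSM_b": [1.0, 0.0],
    "GSM_c": [1.1, 1.0],
}
v = gene_specific_z(u_by_sample, "GSM_c")
```

In this toy case both genes end up with the same v in `GSM_c`, although gene 0 deviates from its mean by only 0.1 and gene 1 by 1.0. The small deviation of the stable gene is weighted up, and the large deviation of the fluctuating gene is weighted down.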

The expression level, vi, of a gene thus represents the re-scaling based on the observation of its expressional behavior in a large collection of diverse experiments. It is assumed that the expressional variation of a gene in the database follows the normal distribution, N(µi,GPL, σi,GPL), when n is large. To check this assumption, we implemented an iterative procedure to remove outliers from the final µi,GPL and σi,GPL calculations. An outlier was defined as an expression level whose distance from the original µGPL exceeds three times the original σGPL. The final µi,GPL and σi,GPL values were then used for a chi-square goodness-of-fit test between the observed distribution and its ideal normal distribution. Genes whose expressional variation deviated significantly from the normal distribution (P < 0.05 in the chi-square test) were removed from the second standardization (v calculation).
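The outlier removal and normality check could look roughly like the sketch below. It reflects one reading of the description, not the authors' implementation: the mean and standard deviation are recomputed after each trimming pass until no value lies more than three standard deviations away, and the goodness-of-fit statistic uses equal-probability bins. The function names, the bin count and the toy data are assumptions.

```python
from statistics import NormalDist, fmean, pstdev

def trim_outliers(values, k=3.0):
    """Repeatedly drop values farther than k SDs from the mean until none remain."""
    kept = list(values)
    while True:
        mu, sigma = fmean(kept), pstdev(kept)
        inside = [x for x in kept if abs(x - mu) <= k * sigma]
        if len(inside) == len(kept):
            return kept, mu, sigma
        kept = inside

def chi_square_statistic(values, mu, sigma, bins=10):
    """Chi-square statistic against N(mu, sigma) using equal-probability bins."""
    dist = NormalDist(mu, sigma)
    # Interior bin edges at the 1/bins, 2/bins, ... quantiles of the fitted normal.
    edges = [dist.inv_cdf(b / bins) for b in range(1, bins)]
    observed = [0] * bins
    for x in values:
        observed[sum(x > e for e in edges)] += 1  # index of the bin holding x
    expected = len(values) / bins
    return sum((o - expected) ** 2 / expected for o in observed)

# Toy platform-wide values for one gene, with a single extreme measurement.
values = [-1.0, -0.5, 0.0, 0.5, 1.0] * 4 + [25.0]
kept, mu, sigma = trim_outliers(values)
stat = chi_square_statistic(kept, mu, sigma, bins=4)
```

Turning the statistic into a P-value requires the chi-square distribution with bins − 3 degrees of freedom when both the mean and the standard deviation are estimated from the data. With 10 bins, for example, the 0.05 critical value for 7 degrees of freedom is about 14.07.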

In this study, we compared ui and vi in identifying changes in gene expression between normal and breast cancer cells. ui is assumed to be a quantity provided by the original contributor, which only considers the expression data within the GDS for the preparation. The difference in normalized gene expression between normal and breast cancer cells was calculated as below.

$$\Delta u_i' = \frac{u_i^{+} - u_i^{-}}{\mathrm{avr}(|\Delta u|)}$$
where ui+: the standardized expression level of the ith gene in normal cells; ui–: the standardized expression level of the ith gene in breast cancer cells; avr(|Δu|): the absolute average difference between u+ and u–.

Δui' represents the relative change in gene expression in comparison to the average change in the given assay. On the other hand, the relative change in gene expression was calculated using the Δvi' value, where the intensity level was re-scaled based on the gene-specific behavior in a wide variety of experiments.

$$\Delta v_i' = \frac{v_i^{+} - v_i^{-}}{\mathrm{avr}(|\Delta v|)}$$
where vi+: the standardized expression level of the ith gene in normal cells; vi–: the standardized expression level of the ith gene in breast cancer cells; avr(|Δv|): the absolute average difference between v+ and v–.
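Both relative-change scores share the same form, so a single helper can illustrate them. The sketch below assumes one value per gene and condition (for instance a replicate average, which the text does not specify) and takes the difference as normal minus cancer; the function name `relative_change` and the toy numbers are hypothetical.

```python
from statistics import fmean

def relative_change(plus, minus):
    """Return (x_i^+ - x_i^-) / avr(|delta x|) for every gene i."""
    deltas = [p - m for p, m in zip(plus, minus)]
    avg_abs = fmean([abs(d) for d in deltas])  # avr(|delta x|) over all genes
    return [d / avg_abs for d in deltas]

u_plus = [1.0, 0.5, -0.2, 2.0]   # normal cells (toy values)
u_minus = [0.5, 0.5, 0.2, 0.0]   # breast cancer cells (toy values)
delta_u_prime = relative_change(u_plus, u_minus)
```

Passing the v values instead of the u values gives Δv' in the same way. By construction the mean absolute score is 1, so a score of 2 means a change twice as large as the average change in that comparison.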

2.3 Validation using an Affymetrix spike-in study dataset
For the validation study, a dataset from the spike-in study by Affymetrix was retrieved from the Affycomp website (http://affycomp.biostat.jhsph.edu/). In this dataset, human cRNA fragments matching 16 probesets on the HGU95A GeneChip were added to the hybridization mixture of the arrays at concentrations ranging from 0 to 1024 pM. The same hybridization mixture, obtained from a common tissue source, was used for all arrays. The details of the spike-in data are found in the literature (Cope et al., 2004; Irizarry et al., 2003) and on the website (http://affycomp.biostat.jhsph.edu/). The fluorescence intensities were normalized by the RMA method. These RMA data were used for further analysis by the present two-step standardization method after log2 transformation.

2.4 Gene expression data analysis by GEO-provided tools
For comparison with the present method, we also analyzed the gene expression data between breast cancer and normal cell lines using GEO-provided analytical tools. Three methods were used to generate lists of probesets that showed significantly higher expression levels in the breast cancer cell line than in the normal epithelial cell line. The options were selected as below so that the final lists included a similar number of entries. Here, ‘A’ represents gene expression data on the breast cancer cell line (HCC1954) and ‘B’ represents gene expression data on the normal epithelial cell line.

  1. One-tailed t-test (A > B) (0.010 significance level) selected a total of 254 probesets.
  2. Query mean group A versus B by values (4-fold higher) selected a total of 266 probesets.
  3. Query mean group A versus B by ranks (3-fold higher) selected a total of 204 probesets.


    3 RESULTS
Since the downloaded data for the demonstration were the normalized and combined Affymetrix data from GEO (GPL91), the MA plot of test samples from GDS817 was already in good shape (Fig. 2A). The present two-step standardization procedure was applied to these datasets in order to achieve a better resolution in detecting modest expressional differences in genes with relatively small DB-wide standard deviation, σGPL. We first carried out within-array Z-transformation on each collected sample by using the mean expression level and its standard deviation on a per-array basis. The resulting scatter plot (Fig. 2B) is the same as the MA plot (Fig. 2A). After the first step, array-specific standardization, the difference in gene expression between normal and cancer cells showed a symmetrical distribution along the diagonal axis (Fig. 2B). After u+ and u– were re-scaled via the gene-specific normal distribution, i.e. N(µi,GPL, σi,GPL) for the ith gene, the relatively large deviations in gene expression between normal and breast cancer cells were shifted to the high intensity region (Fig. 2C). This observation suggests that genes with large µi,GPL have more variation in σGPL than those with small µi,GPL.


Fig. 2 Comparison of one-step and two-step standardizations for measuring gene expression difference between normal and breast cancer cells. u is calculated by the one-step array-specific standardization while v is from the two-step, array-specific standardization followed by gene-specific standardization. See the methods section for details of the u and v calculations. (A) MA plot of original GDS817 data on normal versus breast cancer cell lines. (B) u+ and u– represent the expression levels in normal and breast cancer cells, respectively. (C) v+ and v– represent the expression levels in normal and breast cancer cells, respectively.

 
To further investigate the difference between Figure 2B and C, we plotted the mean (µGPL) and standard deviation (σGPL) of the expression of each gene in 1850 experiments (Fig. 3). These means and standard deviations were used for the second standardization (v calculation). It has been well recognized that the variance of the measured spot intensities increases with their mean. The standard deviation increases roughly linearly with the mean (Huber et al., 2002). The present plot of the DB-wide analysis also indicates that expressional variation (i.e. σGPL in the plot) increases as µGPL increases. In addition, the plot shows that the vertical distribution of σGPL becomes wider as µGPL increases. This relatively large variation in σGPL values in the high µGPL region contributed to the difference in plot shape between Figure 2B and C. For a given µGPL value, a large σGPL resulted in a small v value, while a small σGPL value resulted in a large v in the second standardization. This analysis confirms that the present model, based on N(µi,GPL, σi,GPL), provides additional resolution, particularly for recognizing the expressional difference of genes with relatively small σGPL and high expressional intensity, i.e. large µGPL.
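To make the effect of σGPL concrete, consider a hypothetical gene measured at u = 0.8 in a given sample, with a platform-wide mean of µGPL = 0.5. These numbers are illustrative only and are not taken from GPL91. The same deviation from the mean yields a three times larger v when the gene is stable across the database (σGPL = 0.1) than when it fluctuates (σGPL = 0.3):

$$v = \frac{0.8 - 0.5}{0.1} = 3.0 \qquad \text{versus} \qquad v = \frac{0.8 - 0.5}{0.3} = 1.0$$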


Fig. 3 Gene-specific mean intensity and its standard deviation. The mean and standard deviation of up to 1850 expression values for a gene were calculated using all the standardized expression data (u) of the given gene in the same platform, GPL91, found in the GEO database. A total of 10 792 genes were plotted. However, a small number of genes with a mean of >1.5 were omitted from this plot, to achieve a better resolution in the area of major distribution.

 
Figure 2A and B show that the single step standardization actually represents the original normalization of Affymetrix GeneChip data. However, the second step standardization generated a substantial shift in the distribution from that of the single step standardization of gene expression. From a biological point of view, a large expressional change between specific samples for a gene that has a large σGPL may be less meaningful than a moderate expressional change for a gene having a consistent expression in the database (low σGPL). Typical array-specific standardization and a consequent comparison of the intensity of gene expression between samples lack this kind of biological consideration. In this sense, our two-step analysis provides a unique tool for identifying additional genes with moderate changes between samples that are not highly prioritized by conventional assay-specific standardization methods.

We compared the performance of the two methods (i.e. Δu' versus Δv' scorings) in evaluating differences in gene expression between normal and breast cancer cells (Fig. 4A). A significant number of genes with low and high expression levels were evaluated differently by these two methods. The result shows that the utility of the two-step standardization method lies in identifying those genes for which the difference in expression between samples is underestimated by the single step, array-specific standardization method. In the upper boundary region of the distribution shown in Figure 4A, the Δv' calculation gives up to 2-fold higher estimates than the Δu' calculation for the gene expression difference between samples. Among 10 792 test genes from GPL91, 90.5% showed a discrepancy of less than 1.0 between the two methods (i.e. |Δv' – Δu'| < 1.0), while 9.5% of genes showed a discrepancy of greater than 1.0 (Fig. 4B). A total of 2.4% of genes showed a discrepancy of greater than 2.0 between the two methods.
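The discrepancy summary reported here amounts to counting genes whose two scores differ by more than a threshold. A minimal sketch, assuming the two score lists are in the same gene order (the function name `discrepancy_fractions` and the toy scores are hypothetical):

```python
def discrepancy_fractions(delta_u_prime, delta_v_prime, thresholds=(1.0, 2.0)):
    """Fraction of genes whose |delta v' - delta u'| exceeds each threshold."""
    gaps = [abs(v - u) for u, v in zip(delta_u_prime, delta_v_prime)]
    return {t: sum(g > t for g in gaps) / len(gaps) for t in thresholds}

du = [0.2, 1.0, -0.5, 3.0, 0.1]   # toy delta u' scores
dv = [0.4, 2.5, -0.6, 0.5, 0.1]   # toy delta v' scores
fractions = discrepancy_fractions(du, dv)
```

For these made-up scores, two of the five genes differ by more than 1.0 and one of them by more than 2.0.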


Fig. 4 The discrepancy between the two methods in calculating gene expression difference between samples. (A) The gene expression difference based on the array-specific standardization (Δu') was plotted against the difference based on the gene-specific standardization method (Δv'). (B) The discrepancy between the two methods was plotted against the gene frequency. The total number of genes is 10 792.

 
We also compared the result of the present method with that of the RMA method. The RMA method was developed for better normalization of fluorescence intensity data by accounting for the probe affinity effect, while the present method aims to provide better biological insight when interpreting the expressional difference of genes in database records that are assumed to have already been properly normalized. We thus applied the present method to data that were already normalized by the RMA method. For this comparative analysis, a dataset from the spike-in study by Affymetrix was used. Human cRNA fragments matching 16 probesets on the HGU95A GeneChip were added to the hybridization mixture of the arrays at concentrations ranging from 0 to 1024 pM. The same hybridization mixture, obtained from a common tissue source, was used for all arrays (see Methods section for details). The fluorescence intensity data were normalized by the RMA method, and then the present two-step method was applied to the normalized data. Observed concentrations are plotted against nominal concentrations (Fig. 5). In this analysis, the observed intensities are averaged at each nominal concentration value, resulting in a single mean curve. Since the log2 scale was applied to the concentrations, observed concentrations should be linear in true concentrations. We therefore fit a simple linear model to the scatterplot data and report the R2 coefficient. The result shows that the present two-step standardization (LAGE) data have an R2 coefficient similar to that of the original RMA data. This confirms that the additional standardization by N(µGPL, σGPL) does not change the linearity of the original data. However, the rank order of test genes by observed differential expression (Δv' and ΔRMA) between samples showed a large disagreement between the present method and the RMA method (Table 1). When the expressional difference between samples was compared among 14 genes, the RMA method provided a more consistent performance than the present method. These 14 test genes showed large variations in their µGPL and σGPL values. The variation in ΔRMA among the 14 genes is random, i.e. it shows no correlation with their σGPL. However, the present method gives additional weight to the expressional difference of those genes (e.g. 407_at) that have relatively small σGPL, while it assigns low significance to the expressional difference of genes with a large σGPL (e.g. 33818_at). We believe that this σGPL-dependent prioritization of gene expression difference improves the identification of previously unknown disease-related genes from database searches.
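The linearity check described above, a simple linear fit with its R2 coefficient, can be sketched as follows. The data points are invented for illustration and are not the Affycomp values; the function name `linear_fit_r2` is hypothetical.

```python
from statistics import fmean

def linear_fit_r2(x, y):
    """Least-squares line y = a + b*x; returns (a, b, R^2)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

nominal_log2 = [0.0, 1.0, 2.0, 3.0, 4.0]    # log2 nominal concentration (toy)
observed_log2 = [6.1, 6.9, 8.0, 9.1, 9.9]   # mean observed log2 intensity (toy)
a, b, r2 = linear_fit_r2(nominal_log2, observed_log2)

# Shifting and scaling y by constants leaves R^2 unchanged.
rescaled = [(y - 8.0) / 2.0 for y in observed_log2]
_, _, r2_rescaled = linear_fit_r2(nominal_log2, rescaled)
```

R2 is invariant under shifting and scaling of the observed values by constants. To the extent that the second standardization acts on a gene's values as such a shift and scale, a similar R2 before and after is the expected outcome, while the ranking of genes against each other can still change because each gene is scaled by its own σGPL.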


Fig. 5 Average observed log2 intensity plotted against nominal log2 concentration for each spiked-in gene for all arrays in Affymetrix spike-in experiment. The resulting slope estimates are plotted against average log intensity across all concentrations. LAGE represents the present two-step standardization method.

 



 
Table 1 Prioritization of test spike-in genes by the two-step standardization

 
To demonstrate the usefulness of the present method, we compared the performance of the two-step method with the methods provided by the GEO website (see Methods section for details) in evaluating differences in gene expression between normal and breast cancer cells. A total of 10 791 probesets were first ranked based on the Δv' score. Then, the 100 top-ranked probesets were examined to determine whether they were also prioritized by the GEO-provided analytical methods. As a result, 22 probesets in the top ranking list by Δv' score were unique and not found in the GEO analysis reports, while 78 probesets were found in both the Δv'-ranking list and the GEO analysis reports (Table 2). These two sets of genes (not overlapping and overlapping with the GEO lists) both showed relatively low µGPL and σGPL in comparison with those of the total 10 791 probesets in GPL91. However, the non-overlapping set of probesets showed a lower GPL-wide expression variation (average σGPL = 0.10) than the overlapping set (average σGPL = 0.12). This result shows that the consideration of DB-wide expressional variations (σGPL values) contributed to the identification of 22 additional probesets that are hardly prioritized by the other methods.
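The comparison with the GEO-provided lists reduces to ranking probesets by score and checking membership in the other lists. A minimal sketch, assuming that a larger score means higher priority and that each GEO tool output is available as a set of probeset ids (the function name `unique_top_ranked`, the ids and the scores are hypothetical):

```python
def unique_top_ranked(scores, other_lists, top_n=100):
    """Split the top_n probesets by score into those absent from / present in other lists."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    reported = set().union(*other_lists)   # everything any other tool selected
    unique = [p for p in ranked if p not in reported]
    shared = [p for p in ranked if p in reported]
    return unique, shared

scores = {"p1": 3.2, "p2": 2.9, "p3": 2.5, "p4": 0.4}   # toy delta v' scores
geo_lists = [{"p1", "p4"}, {"p3"}]                       # toy GEO tool outputs
unique, shared = unique_top_ranked(scores, geo_lists, top_n=3)
```

Here `unique` holds the top-ranked probesets that none of the other tools reported, which corresponds to the set of 22 probesets discussed above.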



 
Table 2 Top ranking genes (probesets) by Δv' score in comparison with GEO analyses

 
A total of 22 probesets (actually 20 different genes) that were exclusively found in the top ranking list by the two-step method needed to be further analyzed to determine whether they include breast cancer-related genes that were not identified by the GEO analysis reports. A subset of 12 probesets that showed a large difference in ranking between Δu' and Δv' is listed (Table 3). From a literature search, we found that 3 of these 12 genes were specifically elevated in cancer-related cells. For example, the MRE11 (gene ID: AF073362)–Rad50–NBS1 complex is a cell cycle checkpoint protein complex, and tumor cells have defects in cell cycle checkpoint proteins. It is known that two components of the MRE11–Rad50–NBS1 complex, RAD50 and NBS1, are breast cancer susceptibility genes associated with genomic instability (Heikkinen et al., 2006), and that the MRE11 gene is mutated in an ataxia-telangiectasia-like disorder (Stewart et al., 1999). Insufficient information was available in the current NCBI database to determine whether the other nine genes exclusively found in the top ranking list of Δv' scoring are associated with breast cancer pathogenesis. Table 3 shows that the difference in expressional intensity between normal and breast cancer cell lines for these genes was estimated to be at least 1.5-fold higher by the Δv' measure than by the Δu' measure. It can be concluded that relatively low σGPL values compensate for the moderate difference in expression between samples and consequently rank genes in a different order. Further experimental study is needed to confirm the association of the selected nine genes with breast tumors.



 
Table 3 Cancer-related genes exclusively identified by the present {Delta}v'-scoring method

 
The relative merit of the present method depends on its ability to identify genes that are differentially expressed while avoiding the classification of highly fluctuating genes (i.e. genes with large {sigma}GPL) as differentially expressed (i.e. on its false positive, or Type I error, rate). Because the false positive rate increases exponentially toward the bottom of the ranking (Norris and Kahn, 2006), medium-level fold changes (moderate {Delta}u') in gene expression have usually not been considered for further experimental validation. In this sense, the new approach can enrich the hit list with genes whose expression difference between normal and breast cancer cells is moderate.

To provide public access to this two-step standardization method for GEO gene expression data, we constructed a user-friendly web-based database, GS-LAGE (Gene Specific Large-scale Analysis of Gene Expression), which includes all single-channel microarray experiments listed in the GEO database (Fig. 6). It can be accessed at http://compbio.sookmyung.ac.kr/~lage/index.html. It provides comparative values of {Delta}u' and {Delta}v' for each gene between user-selected experimental samples, and should be a valuable tool for the in silico identification of previously unknown specific (or differential) gene expression patterns in disease-related samples.


 
Fig. 6 Schematic view of GS-LAGE. GEO data in GS-LAGE (Gene Specific Large-scale Analysis of Gene Expression) are periodically updated and simultaneously annotated with BLAST search results.

 

    Acknowledgments
 
The authors appreciate helpful and stimulating discussions with Dr. Young Ju Suh. This work was supported by the SRC/ERC program of MOST/KOSEF (R11-2005-017-01003-0) and by grant No. R01-2006-000-10515-0 from the Basic Research Program of the Korea Science & Engineering Foundation.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on May 22, 2006; revised on September 7, 2006; accepted on September 30, 2006

    REFERENCES
 

    Aittokallio, T., et al. (2003) Computational strategies for analyzing data in gene expression microarray experiments. J. Bioinform. Comput. Biol., 1, 541–586.

    Barrett, T., et al. (2005) NCBI GEO: mining millions of expression profiles: database and tools. Nucleic Acids Res., 33, D562–D566.

    Bolstad, B.M., et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.

    Breitling, R., et al. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett., 573, 83–92.

    Colantuoni, C., et al. (2002) SNOMAD (Standardization and NOrmalization of MicroArray Data): web-accessible gene expression data analysis. Bioinformatics, 18, 1540–1541.

    Cope, L.M., et al. (2004) A benchmark for Affymetrix GeneChip expression measures. Bioinformatics, 20, 323–331.

    Edgar, R., et al. (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res., 30, 207–210.

    Edwards, D. (2003) Non-linear normalization and background correction in one-channel cDNA microarray studies. Bioinformatics, 19, 825–833.

    Gautier, L., et al. (2004) Affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307–315.

    Heikkinen, K., et al. (2006) RAD50 and NBS1 are breast cancer susceptibility genes associated with genomic instability. Carcinogenesis, 27, 1593–1599.

    Huber, W., et al. (2002) Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18, Suppl. 1, S96–S104.

    Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31, e15.

    Li, C. and Wong, W.H. (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc. Natl Acad. Sci. USA, 98, 31–36.

    Mootha, V.K., et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273.

    Norris, A.W. and Kahn, C.R. (2006) Analysis of gene expression in pathophysiological states: balancing false discovery and false negative rates. Proc. Natl Acad. Sci. USA, 103, 649–653.

    Ramaswamy, S., et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154.

    Stewart, G.S., et al. (1999) The DNA double-strand break repair gene hMRE11 is mutated in individuals with an ataxia-telangiectasia-like disorder. Cell, 99, 577–587.

    Yang, M.C., et al. (2001) A statistical method for flagging weak spots improves normalization and ratio estimates in microarrays. Physiol. Genomics, 7, 45–53.

    Yang, Y.H., et al. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res., 30, e15.





