Bioinformatics Advance Access originally published online on October 25, 2005
Bioinformatics 2005 21(24):4427-4429; doi:10.1093/bioinformatics/bti729

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oxfordjournals.org

ISER: selection of differentially expressed genes from DNA array data by non-linear data transformations and local fitting

R. D. Drummond 1,3, A. Pinheiro 2, C. S. Rocha 1 and M. Menossi 1,3,*

1Laboratório de Genoma Funcional, Centro de Biologia Molecular e Engenharia Genética, Universidade Estadual de Campinas, Campinas 13083-875, PO Box 6010, Brazil
2Departamento de Estatística, Instituto de Matemática, Estatística e Computação Científica, Universidade Estadual de Campinas, Campinas 13083-859, Brazil
3Departamento de Genética e Evolução, Instituto de Biologia, Universidade Estadual de Campinas, Campinas 13083-971, Brazil

*To whom correspondence should be addressed.


    ABSTRACT

Summary: This report describes an algorithm (intensity-dependent selection of expression ratios, or ISER) developed to analyse DNA array data by optimizing the selection of genes with the most significant variations in expression between two RNA samples. The algorithm is designed for use when little or no replication of array hybridizations is available.

Availability: ISER is written in the R language, and its code and an on-line version are freely available at https://ipe.cbmeg.unicamp.br/pub/PMmA

Contact: menossi{at}unicamp.br

Supplementary information: https://ipe.cbmeg.unicamp.br/pub/ISER

DNA array technology is the main strategy for assessing gene expression profiles on a large scale. An important application of DNA arrays is the identification of genes that show significant changes in their expression in different RNA samples. However, because of the characteristic noise of array data and the usually limited number of experimental replicates, a statistical framework to select differentially expressed genes is not easily determined (Lee et al., 2000). This is especially the case in macroarray experiments (Freeman et al., 2000), where the requirement for larger amounts of RNA for each hybridization can be a limiting factor in the number of replicates, and each labelled cDNA sample is hybridized to a different nylon membrane, thereby increasing the data variability (in contrast to microarrays, in which two samples labelled with different dyes are hybridized on the same slide; Schena et al., 1996).

Some studies have developed strategies to select differentially expressed genes (Tusher et al., 2001; Kerr et al., 2000), but they usually require three or more experimental replicates to evaluate the data variability inherent to the method. However, in array experiments, it is often necessary to compare two samples without replicates, and in this case a frequently adopted strategy is to select those genes whose signals in the two samples differ by a fold change of two or more (or any other arbitrarily fixed value; Schena et al., 1996). Other methods have also been developed to select differentially expressed genes when little experimental replication is available, e.g. the methods described by Mutch et al. (2002) and Loguinov et al. (2004). In this report, we describe an algorithm (intensity-dependent selection of expression ratios, or ISER) designed to optimize the selection of genes that show the most significant variations in expression. One of the novelties of this method is the non-linear transformation to normality via a sliding window of variable size, which allows the use of a normal distribution instead of heavier-tailed distributions and ensures the accuracy of the P-values.
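As a point of reference for the fixed fold criterion mentioned above, a minimal sketch in Python is shown here. The function name `fixed_fold_selection` and the example intensities are hypothetical and are not taken from ISER or from any of the cited methods; genes with a non-positive signal are simply skipped, which is an assumption of this sketch.

```python
def fixed_fold_selection(sample_a, sample_b, fold=2.0):
    """Return indices of genes whose ratio b/a is >= fold or <= 1/fold."""
    selected = []
    for i, (a, b) in enumerate(zip(sample_a, sample_b)):
        if a <= 0 or b <= 0:
            continue  # ratio undefined; skipped in this sketch (assumption)
        ratio = b / a
        if ratio >= fold or ratio <= 1.0 / fold:
            selected.append(i)
    return selected


# Hypothetical signal intensities for four genes in two RNA samples.
control = [100.0, 250.0, 40.0, 900.0]
treated = [210.0, 260.0, 15.0, 880.0]
print(fixed_fold_selection(control, treated))
```

The threshold is the same at every signal intensity, which is the property that the intensity-dependent limits of ISER are meant to relax.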

The algorithm is based on the commonly adopted assumption that most of the genes probed in arrays have a constant expression in the samples studied (Quackenbush, 2002) so that differentially expressed genes are identified as those that are outliers in the global distribution of expression ratios. Another (empirically supported) aspect that ISER deals successfully with is that the variance of the expression ratios depends on the mean log intensity of the gene (Mutch et al., 2002). The initial motivation for ISER is to find a data transformation that results in a normal distribution of the expression ratios (Box and Cox, 1964), which are generally considered to have a lognormal distribution (Friddle et al., 2000). In this case, ISER will return a log transformation. However, if small or large variations from this transformation are present, the sliding window will adapt it locally to the data.

The input for ISER is a text file containing the signal intensities (raw or normalized, if desired) for both RNA samples of each gene. The algorithm follows the steps below.

  1. To avoid working with zero values, a small constant ($10^{-4}$ times the minimum non-zero expression value) is added to all expression values. For each gene, the expression ratio and the geometric mean of the signals from both samples are calculated.
  2. A data transformation is found that brings the transformed ratio values closest to the normal distribution. This is done by choosing the best exponent $\lambda$ in the Box–Cox family of transformations:

    $$x^{(\lambda)} = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log x, & \lambda = 0 \end{cases}$$

    The most suitable transformation is then applied to the expression ratios and to the geometric means of the expression values, to yield what are referred to, respectively, as the transformed ratios and transformed intensities.

  3. The genes are sorted according to their transformed intensities, in ascending order. A sliding window over this order is used to locally estimate the mean and standard deviation of the transformed expression ratios. The size of this window is determined locally so as to optimize the value of the Kolmogorov–Smirnov test for normality (Lehman, 1994), applied to the transformed ratios of the genes inside the window. Within each sliding window, the algorithm computes the trimmed mean and standard deviation (with respect to the trimmed mean) of the transformed ratios, both based on their 60% central values. These estimates and the P-value defined by the user are employed to determine the local upper and lower limits of the transformed ratios, outside of which the genes are considered as differentially expressed.
  4. A linear spline is used to approximate the relationship between these limits and the mean of the transformed intensities in each sliding window, thereby establishing the intensity-dependent limits for the transformed ratios to be selected as differentially expressed.
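The four steps above can be sketched in Python as follows. This is an illustrative simplification, not the ISER code (which is written in R): all function names are hypothetical; the exponent is chosen by a grid search on the Box–Cox profile log-likelihood, which is an assumed criterion; the window has a fixed size instead of being tuned with the Kolmogorov–Smirnov test; the P-value is treated as two-sided; and the trimmed standard deviation is used without any rescaling.

```python
import math
import random
from statistics import NormalDist


def ratios_and_means(sample_a, sample_b):
    """Step 1: offset zeros, then compute ratios and geometric means."""
    eps = 1e-4 * min(v for v in sample_a + sample_b if v > 0)
    a = [v + eps for v in sample_a]
    b = [v + eps for v in sample_b]
    ratios = [y / x for x, y in zip(a, b)]
    geo_means = [math.sqrt(x * y) for x, y in zip(a, b)]
    return ratios, geo_means


def box_cox(x, lam):
    """Box-Cox transform; lam = 0 gives the log transformation."""
    return math.log(x) if abs(lam) < 1e-12 else (x ** lam - 1.0) / lam


def best_lambda(values, grid=tuple(i / 10 for i in range(-20, 21))):
    """Step 2 (assumed criterion): maximise the profile log-likelihood."""
    n, sum_log = len(values), sum(math.log(v) for v in values)

    def loglik(lam):
        t = [box_cox(v, lam) for v in values]
        m = sum(t) / n
        var = sum((y - m) ** 2 for y in t) / n
        return -0.5 * n * math.log(var) + (lam - 1.0) * sum_log

    return max(grid, key=loglik)


def trimmed_stats(values, keep=0.6):
    """Mean and SD of the central `keep` fraction (no rescaling applied)."""
    s = sorted(values)
    cut = int(round(len(s) * (1.0 - keep) / 2.0))
    core = s[cut:len(s) - cut]
    m = sum(core) / len(core)
    sd = math.sqrt(sum((v - m) ** 2 for v in core) / (len(core) - 1))
    return m, sd


def local_limits(ratios, geo_means, p_value=0.05, window=50, step=25):
    """Step 3 with a FIXED window size (a simplification of this sketch)."""
    lam = best_lambda(ratios)
    t_ratio = [box_cox(r, lam) for r in ratios]
    t_int = [box_cox(g, lam) for g in geo_means]
    order = sorted(range(len(ratios)), key=lambda i: t_int[i])
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # two-sided (assumption)
    knots = []
    for start in range(0, len(order) - window + 1, step):
        idx = order[start:start + window]
        m, sd = trimmed_stats([t_ratio[i] for i in idx])
        centre = sum(t_int[i] for i in idx) / len(idx)
        knots.append((centre, m - z * sd, m + z * sd))
    return lam, knots


def limits_at(x, knots):
    """Step 4: linear interpolation of (lower, upper) between window means."""
    if x <= knots[0][0]:
        return knots[0][1:]
    for (x0, lo0, hi0), (x1, lo1, hi1) in zip(knots, knots[1:]):
        if x <= x1:
            w = (x - x0) / (x1 - x0)
            return lo0 + w * (lo1 - lo0), hi0 + w * (hi1 - hi0)
    return knots[-1][1:]


# Synthetic data for illustration only: 300 genes with no true differences.
rng = random.Random(0)
base = [rng.lognormvariate(5.0, 1.0) for _ in range(300)]
sample_a = [g * rng.lognormvariate(0.0, 0.2) for g in base]
sample_b = [g * rng.lognormvariate(0.0, 0.2) for g in base]
ratios, geo_means = ratios_and_means(sample_a, sample_b)
lam, knots = local_limits(ratios, geo_means)
print(lam, len(knots))
```

A gene would then be flagged when its transformed ratio falls outside the pair returned by `limits_at` for its transformed intensity. Replacing the fixed `window` with a size chosen per position by a normality test would bring the sketch closer to the description in step 3.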

The performance of ISER was tested on simulated datasets generated from the ANOVA model for DNA arrays described by Kerr et al. (2000). These datasets mimicked the results of array experiments with 3000 genes, of which different numbers were defined in advance as differentially expressed (induced/repressed). The model also simulated an intensity-dependent bias; further details are given in the Supplementary Information. The same datasets were analysed with the algorithm described by Mutch et al. (2002) and with the fixed fold criterion. Local fitting is expected to perform better at selecting differentially expressed genes when few replicates are available, and indeed both ISER and the method of Mutch et al. (2002) achieved better results than the fixed fold criterion (Fig. 1). Moreover, the ISER algorithm is more sensitive to intensity-dependent bias in the data and outperformed both of the other methods (Fig. 1). This held for every false positive rate and for each of the dataset specifications analysed. We also compared the performance of ISER on raw and loess-normalized values (see Supplementary Information); the results showed no significant difference.
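For readers who wish to reproduce this kind of comparison, one point of a receiver operating characteristic curve can be computed from a simulated dataset as sketched below in Python. The function name `roc_point` and the numbers in the example (150 truly changed genes out of 3000, 177 genes selected) are hypothetical and are not the specifications used in the study.

```python
def roc_point(selected, truly_changed, n_genes):
    """Return (false positive rate, true positive rate) for one selection."""
    selected, truly_changed = set(selected), set(truly_changed)
    true_pos = len(selected & truly_changed)
    false_pos = len(selected - truly_changed)
    positives = len(truly_changed)
    negatives = n_genes - positives
    return false_pos / negatives, true_pos / positives


# Hypothetical example: genes 0..149 are truly changed in a 3000-gene array.
truth = range(150)
selection = list(range(120)) + list(range(1000, 1057))
fpr, tpr = roc_point(selection, truth, n_genes=3000)
print(fpr, tpr)
```

Repeating the calculation over a range of selection thresholds (P-values for ISER, fold values for the fixed fold criterion) traces out one curve per method.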



Fig. 1 Receiver operating characteristic (ROC) curves for the three methods used to analyse simulated datasets. Each grey line is the mean curve for a subset of datasets with the same specifications (see Supplementary Information). For each method of analysis, the black line is the median of the grey lines.

 
ISER was also tested with real datasets from Nogueira et al. (2003), who studied the genetic response of sugarcane to low temperatures. It confirmed the differential expression of 46 of the 59 genes originally selected by those authors and identified an additional set of 147 genes as differentially expressed (see Supplementary Information). This experiment was performed with two replicates, and the original data analysis consisted of calculating, for each replicate and each time point (3, 6, 12, 24 and 48 h), the mean and standard deviation of the logarithms of the expression ratios, log(treated/control), and selecting those genes that, in both replicates, showed a logarithm of the expression ratio more than 1.65 SD from the mean and a fold change >2. Since this analysis is not sensitive to intensity-dependent bias in the data, it may select more false-positive genes than ISER, and this might be the case for the 13 originally selected genes that were rejected by the algorithm. In fact, the majority (9) of these genes were selected as induced at the 48 h time point, but in one of the experimental replicates for this time point their intensities fell in a region where ISER detected greater variability in the data, so they were not selected. On the other hand, ISER additionally selected several genes of biological relevance, including transcription factors and genes related to cellular communication (transmembrane kinases), stress response (heat shock protein and others) and protein metabolism. A complete table of the differentially expressed genes identified by the algorithm is available in the Supplementary Information.
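The original two-replicate criterion described above (more than 1.65 SD from the mean log ratio and a fold change >2, in both replicates) can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of Nogueira et al.: the function name is hypothetical, base-2 logarithms and the sample standard deviation are assumed, the fold-change test is applied symmetrically to induced and repressed genes, and the intensities are invented.

```python
import math
from statistics import mean, stdev


def sd_and_fold_selection(control, treated, n_sd=1.65, fold=2.0):
    """Indices whose log2 ratio is > n_sd SDs from the mean and beyond `fold`."""
    logs = [math.log2(t / c) for c, t in zip(control, treated)]
    centre, spread = mean(logs), stdev(logs)
    return {i for i, lr in enumerate(logs)
            if abs(lr - centre) > n_sd * spread and abs(lr) > math.log2(fold)}


# Invented intensities for 20 genes at one time point, two replicates.
control = [100.0] * 20
rep1_treated = [800.0] + [100.0] * 19
rep2_treated = [600.0] + [100.0] * 4 + [12.5] + [100.0] * 14

rep1 = sd_and_fold_selection(control, rep1_treated)
rep2 = sd_and_fold_selection(control, rep2_treated)
both = rep1 & rep2  # a gene must pass in both replicates
print(sorted(both))
```

Because the mean and standard deviation are computed once over all genes, the cut-off is the same at every signal intensity, which is the limitation the paragraph above refers to.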

In several studies using DNA arrays, a subset of the arrayed genes is considered as housekeeping, i.e. genes that are supposed to have a constant expression in the samples being studied. ISER can use this assumption instead of supposing that most of the genes probed in the arrays have a constant expression. In this case, steps 2, 3 and 4 of the algorithm are carried out using only the housekeeping genes, thus establishing the intensity-dependent criterion of selection of the differentially expressed genes, based solely on those genes. This criterion is then applied to all the genes probed in the arrays.

Although ISER is designed to deal with data from single-replicate experiments comparing two RNA samples, it can also be used when more samples are compared or replicates are present. In this case, the algorithm is applied separately to each pair of samples and to each replicate, and selects the genes that, for any pair of samples, are identified as differentially expressed in all replicates or in a large (pre-defined) proportion of them. The proposed algorithm avoids lognormal assumptions and shows similar performance for raw and normalized data. ISER is reasonably fast and does not have cumbersome memory requirements, thus providing researchers with a very useful tool for analysing array data.
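The replicate rule described above, in which a gene is kept when it is selected in all replicates or in a pre-defined proportion of them, can be sketched in Python as follows. The function name `consensus_selection`, the parameter `min_fraction` and the gene labels are hypothetical.

```python
from collections import Counter


def consensus_selection(selections, min_fraction=1.0):
    """Genes selected in at least `min_fraction` of the replicate selections."""
    counts = Counter(gene for sel in selections for gene in set(sel))
    needed = min_fraction * len(selections)
    return sorted(gene for gene, count in counts.items() if count >= needed)


# Hypothetical per-replicate selections for one pair of samples.
replicate_selections = [
    {"g1", "g2", "g3"},
    {"g1", "g3"},
    {"g1", "g4"},
]
print(consensus_selection(replicate_selections))       # all replicates
print(consensus_selection(replicate_selections, 0.6))  # at least 60% of them
```

The same function could be applied once per pair of samples when more than two samples are compared.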


    Acknowledgments
 
We would like to thank the two anonymous reviewers for their helpful comments and suggestions. R.D.D. was supported by a fellowship from a federal funding agency (CAPES—Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brazil), C.S.R. was supported by a fellowship from the UNIEMP Institute (Brazil) and M.M. received a research fellowship from another federal funding agency (CNPq—Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil). This work was partially supported by grants 02/01167-1 and 03/07244-0 from the São Paulo State funding agency (FAPESP—Fundação de Amparo à Pesquisa do Estado de São Paulo, Brazil).

Conflict of Interest: none declared.

Received on May 25, 2005; revised on October 17, 2005; accepted on October 17, 2005

    REFERENCES

    Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. J. R. Stat. Soc. Ser. B, 26, 211–252.

    Freeman, W.M., et al. (2000) Fundamentals of DNA hybridization arrays for gene expression analysis. BioTechniques, 29, 1042–1055.

    Friddle, C.J., et al. (2000) Expression profiling reveals distinct sets of genes altered during induction and regression of cardiac hypertrophy. Proc. Natl Acad. Sci. USA, 97, 6745–6750.

    Kerr, M.K., et al. (2000) Analysis of variance for gene expression microarray data. J. Comp. Biol., 7, 819–837.

    Lee, M.L., et al. (2000) Importance of replication in microarrays gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc. Natl Acad. Sci. USA, 97, 9834–9839.

    Lehman, E. (1994) Testing Statistical Hypotheses, 2nd edn. Chapman & Hall, New York.

    Loguinov, A.V., et al. (2004) Exploratory differential gene expression analysis in microarray experiments with no or limited replication. Genome Biol., 5, R18.

    Mutch, D.M., et al. (2002) The limit fold change model: a practical approach for selecting differentially expressed genes from microarray data. BMC Bioinformatics, 3, 17.

    Nogueira, F.T., et al. (2003) RNA expression profiles and data mining of sugarcane response to low temperatures. Plant Physiol., 132, 1811–1824.

    Quackenbush, J. (2002) Microarray data normalization and transformation. Nat. Genet., 32, 496–501.

    Schena, M., et al. (1996) Parallel human genome analysis: microarray-based expression monitoring of 1000 genes. Proc. Natl Acad. Sci. USA, 93, 10614–10619.

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121.

