Bioinformatics Advance Access originally published online on March 30, 2006
Bioinformatics 2006 22(10):1272-1274; doi:10.1093/bioinformatics/btl108
SScore: an R package for detecting differential gene expression without gene expression summaries
1 Department of Biostatistics, Virginia Commonwealth University Box 980032, Richmond, VA 23298-0032, USA
2 Department of Pharmacology and Toxicology, Virginia Commonwealth University Box 980032, Richmond, VA 23298-0032, USA
3 Department of Neurology, Virginia Commonwealth University Box 980032, Richmond, VA 23298-0032, USA
4 Center for the Study of Biological Complexity, Virginia Commonwealth University Box 980032, Richmond, VA 23298-0032, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: SScore is an R package that facilitates the comparison of gene expression between Affymetrix GeneChips using the S-score algorithm. The S-score algorithm uses probe level data directly to assess differences in gene expression, without requiring a preliminary separate step of probe set expression summary estimation. Therefore, the algorithm avoids introduction of error associated with the expression summary estimation process and has been demonstrated to improve the accuracy of identifying differentially expressed genes. The S-score produces accurate results even when few or no replicates are available.
Availability: The R package SScore is available from Bioconductor at http://www.bioconductor.org
Contact: rkennedy{at}vcu.edu
| INTRODUCTION |
|---|
The S-score algorithm (Zhang et al., 2002) was developed as an alternative to the MAS4 algorithm for identifying differentially expressed genes between paired Affymetrix GeneChips™. Unlike commonly used class comparison methods such as SAM (Tusher et al., 2001), this algorithm does not require the estimation of an expression summary over a probe set using, for example, MAS5 (Hubbell et al., 2002) or RMA (Irizarry et al., 2003). Instead, the S-score method uses the probe pair intensities directly. Therefore, any error that may be introduced by the estimation of probe set expression summaries from the probe pair signals is avoided. Further, the S-score method has been demonstrated to have better sensitivity and reliability in detecting differentially expressed genes in small datasets.
The basic assumption of the S-score algorithm is an error model for the expression of probe pair signals in which the detected signal is assumed proportional to the probe pair signal for highly expressed genes, while approaching a background noise level (rather than 0) for genes with low levels of expression. These probe pair level error estimates are then used in the calculation of a measure of relative change in gene expression, called the significance score or S-score. These relative changes are summed over the probe pairs to form the S-score of a probe set, which is a single measure of the significance of change for the gene in question. Under conditions of no differential expression between chips, the S-score follows a standard normal distribution, so it is easy to obtain p-values for each probe set compared. Since probe level data are used in forming the test statistic, S-scores can be reliably used for identifying differential expression between two GeneChips™. This makes the S-score method particularly advantageous in the analysis of preliminary data, such as pilot data for grant applications.
The S-score algorithm was originally coded in C++ and later ported to Borland Delphi. In order to extend its use, we have implemented the S-score algorithm in R, an open source programming environment (R Development Core Team, 2005). This integration allows R functions for preprocessing and visualization to be used with the S-score algorithm, which was not possible with the stand-alone version. The R version also offers various options for customizing the analysis that were not previously available. Finally, because this implementation is open source, it may be modified to meet the needs of individual users.
| IMPLEMENTATION |
|---|
The SScore package accepts data from Affymetrix *.CEL files that have been read into the R programming environment using the affy (Gautier et al., 2004) library and stored in an AffyBatch object. The Bioconductor (Gentleman et al., 2004) affy package is automatically loaded by SScore to provide functions for reading Affymetrix data files into R.
The current implementation of the S-score algorithm allows the comparison of two Affymetrix GeneChips™ at a time. Comparisons of multiple chips using the S-score in conjunction with SAM (Tusher et al., 2001) have been described previously (Kerns et al., 2003). Future versions of SScore will extend the model to allow comparison of three or more chips simultaneously. The function SScore is used to generate S-scores for a single two-chip comparison, i.e. a two-column AffyBatch object:
[Code listing not preserved in this copy: call to SScore on a two-column AffyBatch object]
The result is an object of type exprSet. The exprs slot contains a single column with the S-scores for the comparison, while the se.exprs slot contains the CorrDiff values. (The CorrDiff is the correlation between each gene on one GeneChip and the corresponding gene on the second GeneChip.)
The function SScoreBatch can be used to process multiple GeneChips. It requires that all files have been read into a single AffyBatch object, which will contain one column for each *.CEL file. GeneChips are processed two at a time in batch fashion. This is accomplished through the compare matrix, which specifies the pairs of chips to compare.
[Code listing not preserved in this copy: call to SScoreBatch with a compare matrix]
The compare matrix is an N × 2 matrix, where N is the number of comparisons being made. Each row contains the column numbers of the two chips in the AffyBatch object that are being compared. For example, if the compare matrix is set up as
[Code listing not preserved in this copy: compare matrix whose first rows are (2, 5), (2, 6) and (5, 9)]
then the first comparison would be between the chips in columns 2 and 5 in the AffyBatch object, the second comparison would be between the chips in columns 2 and 6, the third comparison would be between the chips in columns 5 and 9, and so forth. If the compare matrix has more than two columns, only the first two will be used for finding the column numbers in the AffyBatch object. Each column of the returned object, eset, will contain the results of a single two-chip comparison. The first column of eset will contain the comparison corresponding to the first row of the compare matrix, the second column of eset will contain the comparison corresponding to the second row of the compare matrix, and so forth.
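The row-to-comparison mapping described above can be sketched in base R. This is only an illustration under stated assumptions: the three rows are the ones named in the text, the variable names `compare` and `labels` are chosen here for convenience, and the sketch does not call the SScore package.

```r
# Hypothetical compare matrix: one row per comparison, each row holding the
# AffyBatch column numbers of the two chips being compared. Only the three
# rows mentioned in the text are included.
compare <- matrix(c(2, 5,
                    2, 6,
                    5, 9),
                  ncol = 2, byrow = TRUE)

# Row i of the compare matrix corresponds to column i of the result.
labels <- paste0("comparison ", seq_len(nrow(compare)),
                 ": column ", compare[, 1], " vs column ", compare[, 2])
print(labels)
```

Building the matrix with `byrow = TRUE` keeps each chip pair together on one line of source, which makes it easier to check against the intended experimental design.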
Both SScore and SScoreBatch require that the scale factor and standard difference threshold be specified for each GeneChip. These may be given through the SF and SDT parameters, respectively, with the values available from the Affymetrix GeneChip Operating Software (GCOS) output (using the formula SDT = 4 * RawQ * SF). Alternatively, SScore and SScoreBatch will calculate these values automatically if they are omitted from the function call. Both SF and SDT are vectors with length equal to the number of columns in the AffyBatch object, and contain a single numeric value for each chip. The scale factor is used to scale each intensity on the chip to a target intensity of 500, which is the default target intensity in the GCOS program. The standard difference threshold, as a function of the noise for a given probe array hybridization, is used as an estimate of background noise.
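As a small worked example of the formula SDT = 4 * RawQ * SF, the following base R sketch builds the two per-chip vectors. The RawQ and scale factor values are invented for illustration, and the names `raw.q`, `chip1` and `chip2` are hypothetical; real values would be taken from the GCOS output.

```r
# Hypothetical per-chip values as they might be read from GCOS reports;
# these numbers are made up for illustration only.
raw.q <- c(chip1 = 2.5, chip2 = 3.0)  # RawQ (noise estimate) for each chip
SF    <- c(chip1 = 1.2, chip2 = 0.8)  # scale factor for each chip

# Standard difference threshold, using the formula quoted in the text.
SDT <- 4 * raw.q * SF

# SF and SDT each hold one numeric value per chip, in AffyBatch column order.
print(SF)
print(SDT)
```

Both vectors must have one entry per column of the AffyBatch object, so it is worth checking their lengths before supplying them.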
Both SScore and SScoreBatch provide additional options for fine-tuning the analysis and output. The rm.outliers, rm.mask and rm.extra options perform the same as they do in the ReadAffy function. The digits option specifies the number of significant decimal places for the S-score and CorrDiff values, which are rounded as needed. The default uses full precision with no rounding. Finally, the verbose option indicates whether additional information on the analyses is printed. This includes the chip type, sample names, values of alpha and gamma, and the SF and SDT values. These customization options are not available in the stand-alone SScore implementations.
| USING S-SCORES IN GENE EXPRESSION ANALYSIS |
|---|
Under conditions of no differential expression, the S-scores follow a standard normal (Gaussian) distribution with a mean of 0 and SD of 1 (Zhang et al., 2002). This makes it straightforward to calculate p-values for testing the null hypothesis of no change against the alternative hypothesis of differential gene expression. Cutoff values for the S-scores can be set to achieve the desired level of significance. As an example, an absolute S-score value of 3 (signifying 3 SD from the mean, a typical cutoff value) would correspond to a two-sided p-value of approximately 0.003. Under this scenario, the significant genes can be found as
R> sscores <- exprs(eset) ## extract the S-score values
R> signif <- geneNames(eset)[abs(sscores) >= 3] ## find those greater than 3 SD
Similarly, the p-values can be calculated as
R> sscores <- exprs(eset) ## extract the S-score values
R> p.one.sided <- 1 - pnorm(abs(sscores)) ## find the corresponding one-sided p-values
R> p.two.sided <- 2 * (1 - pnorm(abs(sscores))) ## find the corresponding two-sided p-values
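The same calculations can be tried without any microarray data. The sketch below uses a small, invented vector of S-scores in place of `exprs(eset)`; the probe set names and values are hypothetical. It computes the two-sided p-values as `2 * pnorm(-abs(s))`, which is mathematically equivalent to `2 * (1 - pnorm(abs(s)))` but avoids subtracting from 1 when scores are large.

```r
# Hypothetical S-scores for five probe sets; in practice these would be
# extracted from the result of an S-score comparison.
sscores <- c(ps1 = -3.4, ps2 = 0.2, ps3 = 3.0, ps4 = -1.1, ps5 = 4.7)

# Two-sided p-values under the standard normal null distribution.
p.two.sided <- 2 * pnorm(-abs(sscores))

# Probe sets whose S-score is at least 3 SD from zero.
hits <- names(sscores)[abs(sscores) >= 3]

print(round(p.two.sided, 4))
print(hits)
```

For the probe set with an S-score of exactly 3, the two-sided p-value is about 0.0027, which matches the cutoff of roughly 0.003 quoted in the text.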
Probe sets identified as being significantly differentially expressed can be easily explored using the Bioconductor annotate package. The S-score algorithm does account for the correlations among probes within a two-chip comparison. However, it does not adjust p-values for multiple comparisons when comparing more than one pair of chips.
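Because no multiple-comparison adjustment is applied, users may wish to adjust the p-values themselves. One common option, which is not part of the S-score algorithm or the SScore package, is the `p.adjust` function from base R. The sketch below uses invented p-values and shows the Benjamini-Hochberg and Bonferroni methods purely as examples; the choice of method is left to the analyst.

```r
# Hypothetical two-sided p-values pooled from several comparisons.
p.values <- c(0.0005, 0.003, 0.02, 0.04, 0.30, 0.75)

# Benjamini-Hochberg adjustment (controls the false discovery rate);
# Bonferroni is shown alongside as a more conservative alternative.
p.bh   <- p.adjust(p.values, method = "BH")
p.bonf <- p.adjust(p.values, method = "bonferroni")

print(data.frame(p.values, p.bh, p.bonf))
```

Adjusted p-values are never smaller than the raw ones, so a probe set that passes a raw cutoff may no longer pass after adjustment.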
| CHANGES FROM THE STAND-ALONE VERSION |
|---|
The S-score algorithm has been previously implemented as a stand-alone executable for the Windows operating system, using Borland Delphi (Miles, 2003, http://www.brainchip.vcu.edu/expressionda.htm). Users of the stand-alone version will notice small differences in results compared with the SScore package as it is implemented in R, though these should not significantly affect gene expression analyses. These fall generally into three categories: (1) differences in calculations introduced by the rounding methods used in Delphi and R; (2) differences in processing of data files and (3) differences introduced by the use of Affymetrix routines for determining some parameter values. All changes are thoroughly described in the documentation accompanying the package.
| SUMMARY |
|---|
The S-score algorithm represents a new approach to the analysis of gene expression data, in which the significance is calculated directly from probe level data. The SScore package provides an implementation of this algorithm in the R programming environment. The testing approach used in SScore offers high sensitivity without loss of specificity in detecting differentially expressed genes. The current implementation is limited to two-chip comparisons, though workaround solutions have been published. Future versions will utilize mixed-effects modeling to extend the S-score, facilitating multi-chip comparisons.
| Acknowledgments |
|---|
The development of the S-score algorithm and its original implementation in C++ is the work of Dr Li Zhang. The Delphi implementation of the S-score is the work of Dr Robnet Kerns.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on December 5, 2005; revised on February 24, 2006; accepted on March 18, 2006
| REFERENCES |
|---|
Gautier, L., et al. (2004) affy: analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20, 307-315.
Gentleman, R.C., et al. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol., 5, R80.
Hubbell, E., et al. (2002) Robust estimators for expression analysis. Bioinformatics, 18, 1585-1592.
Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31, e15.
Kerns, R.T., et al. (2003) Application of the S-score algorithm for analysis of oligonucleotide microarrays. Methods, 31, 274-281.
Miles, M.F. (2003) Informatics tools: Expression data analysis.
R Development Core Team (2005) R: A Language and Environment for Statistical Computing, Version 2.2.0. R Foundation for Statistical Computing, Vienna, Austria.
Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116-5121.
Zhang, L., et al. (2002) A new algorithm for analysis of oligonucleotide arrays: Application to expression profiling in mouse brain regions. J. Mol. Biol., 317, 225-235.