Bioinformatics Vol. 18 no. 3 2002
Pages 389-394
© 2002 Oxford University Press
Assessing the significance of consistently mis-regulated genes in cancer associated gene expression matrices
1 Division of Mechatronics, Chalmers University of Technology, Göteborg, Sweden
2 Department of Pharmacology, Uniformed Services University of the Health Sciences, Bethesda, MD, USA
3 Cancer Genetics Branch, National Human Genome Research Institute, NIH, Bethesda, MD, USA
4 Children's Hospital Informatics Program, Harvard Medical School, Boston, MA, USA
Received on April 20, 2001; revised on October 1, 2001; accepted on November 11, 2001
Motivation: The simplest level of statistical analysis of cancer-associated gene expression matrices aims to find genes that are consistently up- or down-regulated within a given set of tumor samples. Given the high level of gene expression diversity detected in cancer, one needs to assess the probability that the consistent mis-regulation of a given gene is due to chance. Furthermore, it is important to determine the number of samples required for a meaningful statistical analysis of massively parallel gene expression measurements.
Results: The probability of consistent mis-regulation is calculated in this paper for binarized gene expression data, using combinatorial considerations. For practical purposes, we also provide a set of accurate approximate formulas for determining the same probability in a computationally less intensive way. When the pool of mis-regulatable genes is restricted, the probability of consistent mis-regulation can be overestimated. We show, however, that this effect has little practical consequence for cancer-associated gene expression measurements published in the literature. Finally, to aid experimental design, we provide estimates of the number of samples required to ensure that the detected consistent mis-regulation is not due to chance. Our results suggest that fewer than 20 sufficiently diverse tumor samples may be enough to identify consistently mis-regulated genes in a statistically significant manner.
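To illustrate the kind of combinatorial calculation described above, the following is a minimal sketch under a deliberately simplified toy model; it is not the paper's equation (4), which is not reproduced in this section. The assumptions are ours: each of `m_samples` binarized samples has exactly `k_misregulated` mis-regulated genes, drawn independently and uniformly from `n_genes`. The function and parameter names are hypothetical. Under these assumptions, the chance that at least one gene is mis-regulated in every sample follows from inclusion-exclusion, and a Poisson-style approximation gives a cheaper estimate.

```python
from fractions import Fraction
from math import comb, exp


def prob_consistent_by_chance(n_genes, k_misregulated, m_samples):
    """Exact P(some gene is mis-regulated in all samples) in the toy model.

    Toy-model assumption (not taken from the paper): every sample
    mis-regulates exactly k_misregulated genes, chosen uniformly at random
    and independently of the other samples.
    """
    subsets_per_sample = comb(n_genes, k_misregulated)
    p = Fraction(0)
    for j in range(1, k_misregulated + 1):
        # P(a fixed set of j genes is mis-regulated in one sample).
        p_one = Fraction(comb(n_genes - j, k_misregulated - j), subsets_per_sample)
        # Inclusion-exclusion over all j-subsets of genes.
        p += (-1) ** (j + 1) * comb(n_genes, j) * p_one ** m_samples
    return p


def poisson_approximation(n_genes, k_misregulated, m_samples):
    """Cheaper estimate: 1 - exp(-E), E = expected number of shared genes."""
    expected = n_genes * (k_misregulated / n_genes) ** m_samples
    return 1.0 - exp(-expected)


def min_samples(n_genes, k_misregulated, alpha, max_samples=1000):
    """Smallest sample count whose chance probability drops below alpha.

    Returns None if no such count exists up to max_samples.
    """
    for m in range(1, max_samples + 1):
        if prob_consistent_by_chance(n_genes, k_misregulated, m) < alpha:
            return m
    return None


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    print(float(prob_consistent_by_chance(100, 10, 3)))
    print(poisson_approximation(100, 10, 3))
    print(min_samples(100, 10, 0.01))
```

Exact rational arithmetic via `Fraction` is used because the alternating inclusion-exclusion sum can lose precision in floating point; the approximation is reasonable only when the expected number of shared genes is small and `k_misregulated` is much smaller than `n_genes`.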
Availability: A Mathematica™ implementation of the main equation of the paper, equation (4), is available at www.me.chalmers.se/~mwahde/bioinfo.html.
Contact: mwahde@me.chalmers.se, zszallasi@chip.org