Bioinformatics Vol. 17 no. 6 2001
Pages 520-525
© 2001 Oxford University Press
Missing value estimation methods for DNA microarrays
1 Stanford Medical Informatics
2 Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
3 Department of Biochemistry, Stanford University School of Medicine, and Howard Hughes Medical Institute, Stanford, CA, USA
4 Departments of Statistics and Health Research and Policy, Stanford University, Stanford, CA, USA
Received on November 13, 2000; revised on February 22, 2001; accepted on February 26, 2001
Motivation: Gene expression microarray experiments can generate data sets with multiple missing expression values. Unfortunately, many algorithms for gene expression analysis require a complete matrix of gene array values as input. For example, methods such as hierarchical clustering and K-means clustering are not robust to missing data, and may lose effectiveness even with a few missing values. Methods for imputing missing data are needed, therefore, to minimize the effect of incomplete data sets on analyses, and to increase the range of data sets to which these algorithms can be applied. In this report, we investigate automated methods for estimating missing data.
Results: We present a comparative study of several methods for the estimation of missing values in gene microarray data. We implemented and evaluated three methods: a Singular Value Decomposition (SVD) based method (SVDimpute), weighted K-nearest neighbors (KNNimpute), and row average. We evaluated the methods using a variety of parameter settings and over different real data sets, and assessed the robustness of the imputation methods to the amount of missing data over the range of 1-20% missing values. We show that KNNimpute appears to provide a more robust and sensitive method for missing value estimation than SVDimpute, and both SVDimpute and KNNimpute surpass the commonly used row average method (as well as filling missing values with zeros). We report results of the comparative experiments and provide recommendations and tools for accurate estimation of missing microarray data under a variety of conditions.
Availability: The software is available at http://smi-web.stanford.edu/projects/helix/pubs/impute/
Contact: russ.altman{at}stanford.edu
* To whom correspondence should be addressed.
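To make the difference between two of the compared approaches concrete, the following is a minimal Python sketch of row-average imputation and a weighted K-nearest-neighbors imputation over the rows (genes) of an expression matrix. It is not the authors' software. The function names, the use of NaN to mark missing entries, Euclidean distance over the target row's observed columns, inverse-distance weights, and the fallback to the row average are all assumptions made for illustration; the abstract states only that the neighbors are weighted.

```python
import numpy as np


def row_average_impute(X):
    """Fill each missing entry (NaN) with the mean of the observed values in its row."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    row_means = np.nanmean(X, axis=1)
    rows, cols = np.where(np.isnan(X))
    filled[rows, cols] = row_means[rows]
    return filled


def knn_impute(X, k=2, eps=1e-12):
    """Weighted K-nearest-neighbor imputation over rows (illustrative sketch).

    For each missing entry (i, j), candidate neighbors are rows that have a
    value in column j and in every column that row i has observed. The k
    closest candidates (Euclidean distance on row i's observed columns) are
    averaged with inverse-distance weights (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.where(missing.any(axis=1))[0]:
        observed = ~missing[i]
        for j in np.where(missing[i])[0]:
            candidates = np.array(
                [r for r in range(X.shape[0])
                 if r != i and not missing[r, j] and not missing[r, observed].any()],
                dtype=int,
            )
            if candidates.size == 0:
                # Assumed fallback: no usable neighbor, so use the row average.
                filled[i, j] = np.nanmean(X[i])
                continue
            diffs = X[candidates][:, observed] - X[i, observed]
            dist = np.sqrt((diffs ** 2).sum(axis=1))
            nearest = np.argsort(dist)[:k]
            weights = 1.0 / (dist[nearest] + eps)
            filled[i, j] = np.sum(weights * X[candidates[nearest], j]) / np.sum(weights)
    return filled


# Toy matrix: rows are genes, columns are arrays; NaN marks a missing value.
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 2.0, np.nan],
    [5.0, 2.0, 7.0],
])
print(row_average_impute(toy))
print(knn_impute(toy, k=2))
```

In this toy matrix the row average ignores the other genes entirely, whereas the neighbor-based estimate borrows the missing column from the rows whose observed values are closest, which is the intuition behind the comparison reported above.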