Skip Navigation


Bioinformatics Advance Access originally published online on February 24, 2005
Bioinformatics 2005 21(10):2417-2423; doi:10.1093/bioinformatics/bti345
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2417    most recent
bti345v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (18)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sehgal, M. S. B.
Right arrow Articles by Dooley, L. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sehgal, M. S. B.
Right arrow Articles by Dooley, L. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data

Muhammad Shoaib B. Sehgal *, Iqbal Gondal and Laurence S. Dooley

Gippsland School of Computing and Information Technology, Monash University VIC 3842, Australia

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 

Motivation: Microarray data are used in a range of application areas in biology, although often it contains considerable numbers of missing values. These missing values can significantly affect subsequent statistical analysis and machine learning algorithms so there is a strong motivation to estimate these values as accurately as possible before using these algorithms. While many imputation algorithms have been proposed, more robust techniques need to be developed so that further analysis of biological data can be accurately undertaken. In this paper, an innovative missing value imputation algorithm called collateral missing value estimation (CMVE) is presented which uses multiple covariance-based imputation matrices for the final prediction of missing values. The matrices are computed and optimized using least square regression and linear programming methods.

Results: The new CMVE algorithm has been compared with existing estimation techniques including Bayesian principal component analysis imputation (BPCA), least square impute (LSImpute) and K-nearest neighbour (KNN). All these methods were rigorously tested to estimate missing values in three separate non-time series (ovarian cancer based) and one time series (yeast sporulation) dataset. Each method was quantitatively analyzed using the normalized root mean square (NRMS) error measure, covering a wide range of randomly introduced missing value probabilities from 0.01 to 0.2. Experiments were also undertaken on the yeast dataset, which comprised 1.7% actual missing values, to test the hypothesis that CMVE performed better not only for randomly occurring but also for a real distribution of missing values. The results confirmed that CMVE consistently demonstrated superior and robust estimation capability of missing values compared with other methods for both series types of data, for the same order of computational complexity. A concise theoretical framework has also been formulated to validate the improved performance of the CMVE algorithm.

Availability: The CMVE software is available upon request from the authors.

Contact: Shoaib.Sehgal{at}infotech.monash.edu.au


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
DNA microarrays are extensively used to probe the genetic expression of tens of thousands of genes under a variety of conditions, as well as in the study of many biological processes varying from human tumors (Sehgal et al., 2004a) to yeast sporulation (Troyanskaya et al., 2001). There are several statistical, mathematical and machine learning algorithms (Gustavo et al., 2003; Ramaswamy et al., 2001; Shipp et al., 2002) that exploit these data for diagnosis (Furey et al., 2000; Brown et al., 1997), drug discovery and protein sequencing for instance. The most commonly used methods include data dimension reduction techniques (Sehgal et al., 2004e), class prediction techniques (Sehgal et al., 2004bc; Golub et al., 1999) and clustering methods (Munagala et al., 2004).

Despite the wide usage of microarray data, they frequently contain missing values with up to 90% of genes affected (Ouyang et al., 2004). Missing values can occur for various reasons, such as spotting problems, slide scratches, blemishes on the chip, hybridization error, and image corruption or simply dust on the slide (Oba et al., 2003). It has been proven (Sehgal et al., 2004d), (Acuna et al., 2004) that missing values affect class prediction and data dimension reduction techniques, such as support vector machines (SVMs), neural networks (NNs), principal component analysis (PCA) and singular value decomposition (SVD). The problem can be managed in many different ways from repeating the experiment, although this is often not feasible for economic reasons, to simply ignoring the samples containing missing values, although this is inappropriate because usually there are only a very limited number of samples available. The best solution is to attempt to accurately estimate the missing values, but unfortunately most approaches use zero impute (replace the missing values by zero) or row average/median (replacement by the corresponding row average/median), neither of which take advantage of data correlations, thereby leading to high estimation errors (Troyanskaya et al., 2001). Current research demonstrates that if the correlation between data is exploited then missing value prediction error can be reduced significantly (Sehgal et al., 2004d; Hellem et al., 2004). Several methods including K-nearest neighbour (KNN) impute, least square imputation (LSImpute) (Hellem et al., 2004) and Bayesian PCA (BPCA) (Oba et al., 2003) have been used; however, the prediction error generated using these methods still impacts on the performance of statistical and machine learning algorithms including class prediction, class discovery and differential gene identification algorithms (Sehgal et al., 2004e). There is, thus, considerable potential to develop new techniques that will provide minimal prediction errors for different types of microarray data including both time and non-time series sequences.

This paper presents a collateral missing value estimation (CMVE) algorithm which combines multiple value matrices for particular missing data and optimizes its parameters using linear programming and least square (LS) regression. CMVE is compared with other well-established techniques, including KNN, LSImpute and BPCA, with their performance rigorously tested for the prediction of randomly introduced missing values, with probabilities ranging from 0.01 to 0.2 for the BRCA1, BRCA2, sporadic mutation microarray data (mutations present in ovarian cancer), which is non-time series data (Amir et al., 2001). The reason for introducing missing values is that the number of actual missing values in the BRCA1, BRCA2 and sporadic mutation data are negligibly small compared with the size of the dataset—only 0.01, 0.003 and 0.01% values, respectively. Since randomly introduced missing values may not be distributed in the same way as actual missing values (Oba et al., 2003), a separate experiment was performed, with CMVE and the other three estimation algorithms being applied to the yeast sporulation time series dataset (Spellman et al., 1998), which contains 1.7% missing values.

The normalized root mean square (NRMS) error (Ouyang et al., 2004) metric was used to quantitatively evaluate the estimation performance of each technique, with results demonstrating the improved accuracy and robustness of CMVE over a wide range of randomly introduced missing values. In addition, while computational complexity is not as critical a factor as accuracy for missing value imputation because estimation is performed only once during the data collection (Troyanskaya et al., 2001; Hellem et al., 2004), the order of computational complexity for CMVE proved to be exactly the same as the LSimpute and KNN algorithms.

The remainder of the paper is organized as follows: Section 2 presents a brief overview of existing estimation techniques, with their respective advantages and disadvantages, while the new CMVE algorithm and methodology is detailed in Section 3. Section 4 provides the theoretical framework for the improved performance of CMVE compared with the KNN, LSImpute and BPCA algorithms, while Section 5 analyses fully the respective estimation performance of all four imputation methods. Section 6 provides some conclusions.


    2 OVERVIEW OF EXISTING MISSING VALUE ESTIMATION TECHNIQUES
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
The following convention is adopted for all the imputation algorithms described in this paper. The microarray data have the form of an m x n matrix Y, where m is the number of genes and n is the number of samples. The YIJ component of Y represents the expression level of gene I for sample J.

An overview is now presented of the strengths and limitations of the three estimation techniques used for comparative purposes in assessing the performance of CMVE.

2.1 KNN estimation
The KNN method imputes missing values by selecting genes with expression values similar to the gene of interest (Toyanasaka et al., 2001). In order to estimate the missing value YIJ of gene I in sample J, k genes are selected whose expression vectors are similar to genetic expression of I in samples other than J. The similarity measure between two expression vectors Y1 and Y2 is determined by the Euclidian distance {psi} over the observed components in sample J.

(1)
The missing value is then estimated as the weighted average of the corresponding entries in the selected k expression vectors:

(2)

(3)
where and X is the input matrix containing gene expressions. Equations (2) and (3) show that each gene contribution is weighted by the similarity of its expression to gene I.

The Euclidean distance measure used by KNN is sensitive to outlier values which may be present in microarray data; although log-transforming the data significantly reduces their effects on gene similarity determination (Toyanasaka et al., 2001). The choice of a small k degrades the performance of the classifier as the imputation process overemphasizes a few dominant genes in estimating the missing values. Conversely, a large neighbourhood may include genes that are significantly different from those containing missing values; thereby, degrading the estimation process and commensurately the classifier's performance. Empirical results have demonstrated that for small datasets k = 10 is the best choice (Accuna et al., 2004), while Toyanasaka et al. (2001) observed that KNN is insensitive to values of k in the range 10–20.

The computational complexity of KNN is O(m2n), where m and n are the number of genes and samples, respectively, and while this is the same order as the LSImpute algorithm, Section 2.3 will show it is higher than BPCA. A vital feature of KNN is that it does not consider negative correlations between data, which can lead to estimation errors.

2.2 Least square impute estimation
LSImpute is a regression-based estimation method that exploits the correlation between genes. To estimate the missing value YIJ of gene I from gene expression matrix Y, the k-most correlated genes are first selected whose expression vectors are similar to gene I from Y in all samples except J, containing non-missing values for gene I. The LS regression method then estimates the missing value YIJ. By possessing the flexibility to adjust the number of predictor genes k in the regression, LSImpute performs best when data have a strong local correlation structure and for the same order of computational complexity O(m2n), as KNN.

2.3 Bayesian PCA based estimation
BPCA estimates missing values Ymiss in data matrix Y using those genes Yobs having no missing values. The probabilistic PCA (PPCA) is calculated using Bayes theorem and the Bayesian estimation calculates posterior distribution of model parameter {theta} and input matrix X containing gene expression samples using:

(4)
where p({theta}) is known as the prior distribution which contributes a priori preference to {theta} and X.

Missing values are estimated using a Bayesian estimation algorithm, which is executed for both {theta} and Ymiss (similar to the Expectation–Maximization repetitive algorithm) and calculates the posterior distributions for {theta} and Ymiss, q({theta}) and q(Ymiss) (Oba et al., 2003). Finally, the missing values in the gene expression matrix are imputed using:

(5)

(6)
where {theta}true is the posterior of the missing value.

By exploiting only the global correlation in the datasets, BPCA has the advantage of prediction speed incurring a computational complexity O(mn), which is one degree less than for both KNN and LSImpute. For imputation purposes, however, improved estimation accuracy is always a greater priority than speed.


    3 THE CMVE ALGORITHM
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
The complete CMVE algorithm, which is detailed in Figure 1, introduces the concept of multiple parallel estimations of missing values. For instance, if value YIJ of gene I and sample J is missing, multiple estimates ({Phi}1, {Phi}2 and {Phi}3) are generated and the final estimate {chi} distilled from these estimates. The covariance function is employed since unlike KNN, it is unbiased in considering both positive and negative correlation values. The covariance function CoV is formally defined as:

(7)
where {omega} is the predictor gene vector and {nu} the expression vector of gene I which has the missing values. The absolute diagonal covariance CoV is first computed for a gene vector {nu}, where every gene except I is iteratively considered as {omega} (Step 2 in Fig. 1). The genes are then ordered with respect to their values and the first k-ranked covariate genes Rk selected, whose expression vectors have the most similarity to gene I from Y in all samples except J (Step 4). The LS regression method Harvey and Arthur (2004) is then applied to estimate value {Phi}1 for YIJ (Step 5) as:

(8)
where {xi} is the error term that minimizes the variance in the LS model (parameters {alpha} and ß). For a single regression, the estimate of {alpha} and ß are, respectively,

where is the empirical covariance between X and Y, YJ is the gene with the missing value and XJ is the predictor gene in Rk.



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 1 The CMVE algorithm.

 
is the empirical variance of X with and being the respective means over X1,...,Xn and Y1,...,Yn, so the LS estimate of Y given X is expressed as:

The two other missing value estimates {Phi}2 and {Phi}3 (Step 6) are, respectively, given by:

(9)

(10)
where {varphi} is the vector that minimizes {xi}0 in Equation (12), {eta} is the normal residual and {xi} is the actual residual. These three parameters are obtained from the non-negative least square (NNLS) algorithm (Charles et al., 1974). The objective is now to find a linear combination of models that best fit Rk and I. The objective function in NNLS minimizes, using linear programming techniques, the prediction error {xi}0 so that:

(11)
i.e. min({xi}0) is a function that locates the normal vector {varphi} with minimum prediction error {xi}0 and residual {eta}. The value of {xi}0 in Equation (11) is obtained from

(12)
where SV are the singular values of the difference vector between the dot product Rk and prediction coefficients {varphi} with the gene expression row I. The tolerance used in the linear programming to compute vector {varphi} is given by

(13)
where k is the number of predictor genes, n the number of samples in the dataset and C is the normalization factor. The final estimate {chi} for YIJ is formed using

(14)
where {rho} = {Delta} = {Lambda} = 0.33 ensures an equal weighting to the respective estimates {Phi}1, {Phi}2 and {Phi}3. The rationale for this choice is that as each estimate is highly data-dependent, it avoids any bias toward one particular estimate.

The reason (14) has a lower NRMS error is because the imputation matrix {Phi}1 uses LS regression, while matrices {Phi}2 and {Phi}3 use NNLS, which is superior for estimating positive correlated values. NNLS is unable, however, to estimate negative values and given microarray data, possesses both negative and positive values, this was the motivation to embed the LSImpute-based matrix into the gene expression prediction, thus combining the advantages of both algorithms to more accurately estimate the missing values.


    4 THEORETICAL FOUNDATIONS OF CMVE
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
This section explores the theoretical principles underpinning the reasons behind why the CMVE algorithm performs better when compared with KNN, LSImpute and BPCA techniques in estimating missing values. For completeness, a computational complexity analysis of CMVE is provided and is shown to be exactly of the same order as both LSImpute and KNN.

PROPOSITION 1
KNN only considers positive correlations.

If there are two sets {alpha} and ß which are inversely proportional to each other, then the distance d between {alpha} and ß will be larger in those sets which are proportional to each other. Several distance functions can be used for KNN, with the most common being Euclidian distance which is given by

(15)
so d is always higher when {alpha} is inversely proportional to ß, rather than when they are directly proportional to each other.

PROPOSITION 2
The CMVE algorithm considers both positive and negative correlation values.

Assume two sets {nu} and {omega} that are inversely proportional, such that CoV < 0 {forall}{nu}, {omega}. From (7) it is clear that if a high correlation exists between the gene values (either directly proportional and positive correlation or inversely proportional and negative correlation) then a higher absolute CoV value will exist.

PROPOSITION 3
The probability P({Theta}) of the normalized imputation error {Theta} of missing values using CMVE is always less than that for BPCA, LSImpute and KNN.

The probability P({Theta}) of the normalized imputation error {Theta} of the missing value for correlated data is directly proportional to the number of missing values M (Mclean, 2000). Assume P1 and P2 are the probabilities of normalized imputation errors of CMVE ({Theta}1) and the three comparative algorithms ({Theta}2) such that:

(16)

(17)
Since the comparative methods do not estimate any future missing value predictions, such algorithms only consider M missing values for each prediction. In contrast, CMVE uses estimated values for the future prediction of missing values so each estimate increases the number of predictor genes to be considered, while concomitantly decreasing the prediction probabilities in Equation (17). Hence

(18)

PROPOSITION 4
CMVE always has a lower estimation error of missing values in the case of transitive gene dependency (Gene A -> B -> C) than BPCA, LSImpute and KNN.

Assume that gene Ga1 is correlated with S1 such that

(19)
Similarly, gene Gb1 is correlated with S2 as:

(20)
If the values of both Ga1 and Gb1 are missing then Gb1 can be predicted using set S2 and subsequently used to predict Ga1 more accurately using S1 by including Gb1 rather than ignoring it.

CMVE, unlike the other imputation techniques considers estimated values in predicting future missing values. LSImpute replaces the gene missing value with an average value to compute the CoV matrix (Hellem et al., 2004). The NRMS error using this approach is always higher than CMVE, since each iteration (Fig. 1, Steps 1–7) lowers this error. KNN and BPCA consider that missing genes have no correlation with the missing value gene as they ignore these missing values while searching the estimation space. In contrast, CMVE includes these genes when searching for the most correlated gene. This may incur a small accumulative error in future predictions, but it will always be less than when either the average value of the gene is used or the gene is totally ignored.

PROPOSITION 5
CMVE generates a lower estimation error than BPCA when genes have dominant local correlation.

BPCA assumes only a global correlation structure and has a similar effect in selecting a high value of k for CMVE. Owing to this assumption, BPCA does not provide accurate estimates when genes have dominant local correlation Oba et al. (2003), because in predicting missing values, information from all genes is considered, many of which have little or no correlation with the gene with the missing value. In contrast, the CMVE variable k can be adjusted depending upon the type of the data, ensuring that only those genes with strong correlations are considered, which concomitantly reduces the estimation error. The empirical results presented in the next section demonstrate that a value of k = 10 is suitable for local correlated data.

Computational complexity analysis. The order of computational complexity for CMVE is exactly the same as for the KNN and LSImpute algorithms.

The critical operation for the CMVE, KNN and LSImpute algorithms is the search for the most correlated genes. These algorithms search for correlated genes with the gene that has missing values. Each estimation takes linear time O(n); therefore, for m genes and n samples the complexity order is O(m2n) for all algorithms. KNN uses a weighted average of k correlated genes to estimate the missing values, while CMVE and LSImpute use regression and linear programming for estimation, although these additional overheads are negligible compared with the time taken to search for the most correlated genes. Similar to KNN and LSImpute, CMVE also only searches once per estimation for correlated genes.

As discussed in Section 2.3, BPCA has a computational complexity of O(mn) as it only considers the global correlation structure of the data. This is pyrrhic, however, because the corresponding estimation accuracy is significantly inferior whenever data have a localized correlation structure.


    5 RESULTS ANALYSIS
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
To test the different imputation algorithms, four different types of microarray data were used including both time series and non-time series data. The dataset contained 18, 16, 27 and 77 samples of BRCA1, BRCA2, sporadic mutations (neither BRCA1 nor BRCA2) of ovarian cancer data (non-time series) and yeast sporulation data (time series), respectively. Each ovarian cancer data sample contained logarithmic microarray data of 6445 genes while there were 6179 genetic expressions per sample for yeast dataset. The rationale for selecting cancer data is that in such data some of the genes are up/downregulated hence it is very difficult to determine their expression levels from non-regulated genes. The missing value estimation techniques were tested by randomly removing data values and then computing the estimation error. In the experiments, between 1 and 5% of the values were removed from each dataset samples and the NRMS error {theta} was computed by

(21)
where M is the original data matrix and Mest is the estimated matrix using KNN, LSImpute, BPCA and CMVE. This particular metric was used for error estimation because {xi} = 1 for zero imputation (Ouyang et al., 2004).

To compare the performance of the CMVE, KNN and LSImpute imputation algorithms, k = 10 was used throughout the experiments. The rationale for this was that Olga et al. (2001) observed that KNN was insensitive to values of k in the range 10–20 and the best estimation results were observed in this range. Hellem et al. (2003) also suggested using k = 10 for LSImpute. Figure 2 plots the minimum overall prediction error rates for CMVE over a range of k values for the different test datasets, with results showing that k in the range 10–15 (highlighted) is the most appropriate. Lower k values include only a small set of correlated genes for prediction leading to prediction errors as other correlated genes are ignored. Conversely, when k is high, genes which have either very little or no correlation with the gene having missing values will be included in the prediction, again leading to erroneous results (Troyanskaya et al., 2001).



View larger version (24K):
[in this window]
[in a new window]
 
Fig. 2 NRMS error over a wide range of k in the CMVE algorithm for 5% missing values.

 
To fully test the robustness of the new CMVE algorithm, experiments were performed for missing values up to 20% (Figs 39). Figure 8a and 8b show the error values for 10% missing values which especially reveal (Fig. 8a) the significant deterioration in the results of KNN for the sporadic dataset.



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 3 NRMS error for 1% missing values for ovarian cancer data.

 


View larger version (32K):
[in this window]
[in a new window]
 
Fig. 9 (a) NRMS error for 20% missing values for ovarian cancer data. (b) NRMS error (log scale) for 20% missing values for ovarian cancer data.

 


View larger version (29K):
[in this window]
[in a new window]
 
Fig. 8 (a) NRMS error for 10% missing values for ovarian cancer data. (b) NRMS error of CMVE, BPCA and LSImpute for 10% missing values for ovarian cancer data.

 
To clarify the performance of CMVE, Figure 8b plots the error results without KNN, which consistently confirm the lower error values compared with LSImpute and BPCA. Figure 9a and 9b show the corresponding results for 20% missing values, which again reveal the superiority and greater robustness of the CMVE algorithm for missing value imputation. Note, for the sake of clarity a logarithmic scale is used in Figure 9b.

Whenever there is a high number of missing values in a gene, sparse covariance matrices will ensue and with them, the increased likelihood of ill-conditioning. The CMVE algorithm avoids ill-conditioning by ensuring the removal of all genes with >20% missing values before imputation.

As highlighted in Section 1, experiments performed on datasets with randomly introduced missing values may not truly reflect the nature of actual microarray data missing values. All four imputation algorithms were, therefore, tested on the yeast time series data containing 1.7% missing values. Since NRMS errors could not be calculated for these actual missing values, the gene value adjacent to the gene with the missing value was replaced before applying the imputation algorithms. This had the effect of a delay function, while retaining the same distribution of missing values. The results in Table 1 again confirm the superior performance of CMVE, particularly when an additional 4, 5 and 10% of missing values are introduced into the data, with the corresponding average improvements being 60, 72 and 64%, respectively. The imputation results also reveal some other broader noteworthy issues. KNN for instance, performed better when missing values were randomly introduced because KNN only considers positive correlations and certain randomly introduced missing values will inevitably contain negative correlations with other genetic data. Similarly, LSImpute exhibited an improved performance compared with BPCA (Oba et al., 2003) confirming the discussion underpinning Proposition 5.


View this table:
[in this window]
[in a new window]
 
Table 1 NRMS errors for the actual missing value distribution (AMVD) of 1.7% missing values and additional 1–10% randomly introduced values in the yeast dataset

 

    6 CONCLUSIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 
This paper has presented a new CMVE algorithm based on the novel concept of multiple imputations. Experimental results confirmed that CMVE consistently provided superior estimation accuracy compared with the existing missing value imputation algorithms, including KNN, LSImpute and BPCA. This performance improvement was especially evident when estimating higher numbers of missing values in both time series and non-time series data. The algorithm's theoretical basis, which was the exploitation of a combination of global and local correlations in a given dataset, repeatedly proved to be a more effective and robust strategy than the distance function used by KNN, with no increase in the order of computational complexity for all values of k. The results corroborate the fact that CMVE can be successfully applied to accurately impute missing values before any microarray data experiment, crucially without any bias being introduced into the estimation process.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 4 NRMS error for 2% missing values for ovarian cancer data.

 



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 5 NRMS error for 3% missing values for ovarian cancer data.

 



View larger version (26K):
[in this window]
[in a new window]
 
Fig. 6 NRMS error for 4% missing values for ovarian cancer data.

 



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 7 NRMS error for 5% missing values for ovarian cancer data.

 

Received on November 21, 2004; revised on January 30, 2005; accepted on February 18, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 OVERVIEW OF EXISTING...
 3 THE CMVE ALGORITHM
 4 THEORETICAL FOUNDATIONS OF...
 5 RESULTS ANALYSIS
 6 CONCLUSIONS
 REFERENCES
 

    Acuna, E. and Rodriguez, C. (2004) The treatment of missing values and its effect in the classifier accuracy. In Banks, D. (Ed.), et al. Classification, Clustering and Data Mining Applications, , Berlin, Heidelberg Springer-Verlag, pp. 639–648.

    Amir, A.J., et al. (2002) Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. J. Natl Cancer Inst., 94, 981–990[Abstract/Free Full Text].

    Brown, W.N., et al. (1997) Knowledge-based analysis of microarray gene expression data using support vector machines. Proc. Natl Acad. Sci. USA, 97, 262–267.

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Gustavo, B. and Monard, C.M. (2003) An analysis of four missing data treatment methods for supervised learning. Appl. Artif. Intell., 17, 519–533.

    Furey, T.S., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914[Abstract/Free Full Text].

    Harvey, M. and Arthur, C. Fitting Models to Biological Data Using Linear and Nonlinear Regression, (2004) , Oxford Oxford University Press.

    Hellem, B.T., et al. (2004) LSimpute: accurate estimation of missing values in microarray data with least squares methods. Nucleic Acids Res., 32, e34[Abstract/Free Full Text].

    Lawson, C.L. and Hanson, R.J. Solving Least Squares Problems, (1974) , Englewood Cliffs, NJ Prentice-Hall, Inc.

    Munagala, K., et al. (2004) Cancer characterization and feature set extraction by discriminative margin clustering. BMC Bioinformatics, 5, 21[CrossRef][Medline].

    McLean, A. (2000) The predictive approach to teaching statistics. J. Stat. Education, 8, .

    Oba, S., et al. (2003) A bayesian missing value estimation method for gene expression profile data. Bioinformatics, 19, 2088–2096[Abstract/Free Full Text].

    Ouyang, M., et al. (2004) Gaussian mixture clustering and imputation of microarray data. Bioinformatics, 20, 917–923[Abstract/Free Full Text].

    Ramaswamy, S., et al. (2001) Multiclass cancer diagnosis using tumour gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154[Abstract/Free Full Text].

    Sehgal, M.S.B., et al. (2004a) Support vector machine and generalized regression neural network based classification fusion models for cancer diagnosis. In HIS'04, Japan.

    Sehgal, M.S.B., et al. (2004b) A collimator neural network model for the classification of genetic data. ICBA 04, USA.

    Sehgal, M.S.B., et al. (2004c) Communal neural network for ovarian cancer mutation classification. Complex 04, Australia.

    Sehgal, M.S.B., et al. (2004d) K-Ranked covariance based missing values estimation for microarray data classification. HIS'04, Japan.

    Sehgal, M.S.B., et al. (2004e) Statistical neural networks and support vector machine for the classification of genetic mutations in ovarian cancer. IEEE CIBCB 04, USA.

    Shipp, M.A., et al. (2002) Diffuse large B-cell lymphoma outcome prediction by gene expression profiling and supervised machine learning. Nat. Med., 8, 68–74[CrossRef][ISI][Medline].

    Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297[Abstract/Free Full Text].

    Troyanskaya, O., et al. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics, 17, 520–525[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol. Cell. ProteomicsHome page
N. Pavelka, M. L. Fournier, S. K. Swanson, M. Pelizzola, P. Ricciardi-Castagnoli, L. Florens, and M. P. Washburn
Statistical Similarities between Transcriptomics and Quantitative Shotgun Proteomics Data
Mol. Cell. Proteomics, April 1, 2008; 7(4): 631 - 644.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Stacklies, H. Redestig, M. Scholz, D. Walther, and J. Selbig
pcaMethods a bioconductor package providing PCA methods for incomplete data
Bioinformatics, May 1, 2007; 23(9): 1164 - 1167.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. S. V. Wong, F. K. Wong, and G. R. Wood
A multi-stage approach to clustering and imputation of gene expression profiles
Bioinformatics, April 15, 2007; 23(8): 998 - 1005.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Hua and Y. Lai
An ensemble approach to microarray data-based gene prioritization after missing value imputation
Bioinformatics, March 15, 2007; 23(6): 747 - 754.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
X. Gan, A. W.-C. Liew, and H. Yan
Microarray missing data imputation based on a set theoretic framework and biological knowledge
Nucleic Acids Res., March 20, 2006; 34(5): 1608 - 1619.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. Tuikkala, L. Elo, O. S. Nevalainen, and T. Aittokallio
Improving missing value estimation in microarray data with gene ontology
Bioinformatics, March 1, 2006; 22(5): 566 - 572.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
R. Jornsten, H.-Y. Wang, W. J. Welsh, and M. Ouyang
DNA microarray data imputation and significance analysis of differential expression
Bioinformatics, November 15, 2005; 21(22): 4155 - 4161.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2417    most recent
bti345v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (18)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sehgal, M. S. B.
Right arrow Articles by Dooley, L. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sehgal, M. S. B.
Right arrow Articles by Dooley, L. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?