Bioinformatics Advance Access originally published online on January 31, 2007
Bioinformatics 2007 23(6):747-754; doi:10.1093/bioinformatics/btm010
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

An ensemble approach to microarray data-based gene prioritization after missing value imputation

Dong Hua 1 and Yinglei Lai 2,*

1Department of Computer Science, The George Washington University, 801 22nd Street, Suite 704 and 2Department of Statistics and Biostatistics Center, The George Washington University, 2140 Pennsylvania Avenue, N.W. Washington, DC 20052, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 

Motivation: Microarrays have been widely used to discover novel disease related genes. Some types of microarray, such as cDNA arrays, usually contain a considerable portion of missing values. When missing value imputation and gene prioritization are sequentially conducted, it is necessary to consider the distribution space of prioritization scores due to the existence of missing values. We propose an ensemble approach to address this issue. A bootstrap procedure enables us to generate a resample multivariate distribution of the prioritization scores and then to obtain the expected prioritization scores.

Results: We used a published microarray two-sample data set to illustrate our approach. We focused on the following issues after missing value imputation: (i) concordance of gene prioritization and (ii) control of true and false positives. We compared our approach with the traditional non-ensemble approach to missing value imputation. We also evaluated the performance of non-imputation approach when the theoretical test distribution was available. The results showed that the ensemble imputation approach provided clearly improved performances in the concordance of gene prioritization and the control of true/false positives, especially when sample sizes were about 5–10 per group and missing rates were about 10–20%, which was a common situation for cDNA microarray studies.

Availability: The Matlab codes are freely available at http://home.gwu.edu/~ylai/research/Missing.

Contact: ylai{at}gwu.edu


    1 INTRODUCTION
Microarrays enable us to simultaneously monitor gene expression at a genomic scale (Der et al., 1998). They provide a tool for extracting biologically significant information, such as changes in the expression profiles of genes between distinct conditions (e.g. normal versus cancer), and have therefore been used in studies across a broad range of biological disciplines, including cancer classification (Golub et al., 1999), identification of the unknown effects of a specific therapy (Perou et al., 2000), identification of genes relevant to a certain diagnosis or therapy (Cho et al., 2003) and cancer prognosis (Shipp et al., 2002; van't Veer et al., 2002). Due to their relatively high costs, the sample sizes of microarray studies are generally small, which may lead to considerable false positive rates. Since microarrays are widely used in pilot studies before the follow-up large sample validation studies, it is crucial to control the false positives in genes prioritized by microarray studies.

Some types of microarray, such as cDNA arrays, usually contain a considerable proportion of missing values. These missing values arise for various reasons, including insufficient resolution, image corruption, dust or scratches on the slides, or experimental error during the laboratory process (Kim et al., 2005). Effective imputation methods, which aim to recover these missing values, are important: on the one hand, it is costly to repeat the experiments; on the other hand, repeating the experiments cannot guarantee data completeness.

Before data analysis, missing value imputation is generally required. Many algorithms for gene expression data analysis, such as support vector machines (Vapnik, 1995), and multivariate statistical analysis methods such as principal component analysis (Golub and van Loan, 1996), singular value decomposition (Alter et al., 2000) and generalized singular value decomposition (Alter et al., 2003), require a complete data set as the input. This is also the case when microarrays are used in pilot studies for gene prioritization. Scores from a certain statistical test are generally used to rank genes. The sample sizes of different genes must be uniform so that the test scores of different genes are comparable. One may consider using the corresponding P-values to rank genes, in which case the sample sizes of different genes can be different. This approach is feasible when we know the theoretical test distribution. If such a distribution is unknown, which is usually the case in practice, we have to consider the permutation method for evaluating P-values. However, it is generally difficult to accurately evaluate the P-values for genes with missing observations when their sample sizes are small.

Recently, many methods have been developed for missing value imputations. These include a SVD-based method and a weighted k-nearest neighbors imputation (Troyanskaya et al., 2001), Bayesian approaches (Oba et al., 2003; Zhou et al., 2003), a fixed rank approximation algorithm (FRAA) (Friedland et al., 2003), a least squares method (Bo et al., 2004), a local least squares imputation (Kim et al., 2005), a collateral imputation method (Sehgal et al., 2005) and a SVM and orthogonal coding scheme based method (Wang et al., 2006). Furthermore, Kim et al. (2004) proposed to reuse the imputed data to improve missing value estimation, Tuikkala et al. (2006) proposed to consider gene ontology information for improving missing value estimation, and Gan et al. (2006) proposed to consider a set theoretic framework and biological knowledge to improve missing value estimation.

The impact of missing value imputation on differentially expressed gene identification has also been recently studied (Jornsten et al., 2005; Scheel et al., 2005). For multi-sample microarray data, gene prioritization is equivalent to detecting differentially expressed genes. When missing value imputation and gene prioritization are sequentially conducted, it is necessary to consider the distribution space of prioritization scores due to the existence of missing values. However, this issue has not been addressed since all the aforementioned missing value imputation methods only provide one estimate for each missing observation.

Ensemble methods, such as boosting (Freund and Schapire, 1997) and random forest (Breiman, 2001), have been widely used in the field of machine learning. When the number of predictor variables is relatively large, these methods can usually achieve satisfactory classification performance through combining a group of weak classifiers. In this study, we propose an ensemble approach to address the issue of microarray data based gene prioritization after missing value imputation. We first describe some preliminaries. Then, we detail a bootstrap based procedure. A two-sample microarray data set is used to illustrate our approach.


    2 METHODS
2.1 Preliminaries
Throughout the article, we will use $G \in \mathbb{R}^{m \times n}$ to represent a gene expression data matrix with m genes (rows) and n experiments (columns). G may consist of the observed component X and the missing component Y:

$$G = X \oplus Y$$

We will use $\hat{Y}$ to denote the estimation of Y by an imputation method I:

$$\hat{Y} = I(X)$$

The existing imputation methods designed for microarray missing value estimation focus on a once-for-all estimation of the missing values based on the observed gene expression values. The subsequent gene prioritization and other analyses are entirely separated from the missing value imputation. Typically, a complete matrix G' is obtained by replacing the missing component Y with its estimate $\hat{Y}$. Then, the complete data matrix G' is used for the subsequent analyses, which have no further connection to the imputation process once G' is constructed. We will use C to denote the operation of the subsequent analyses. The traditional workflow can be represented by $C(X \oplus \hat{Y})$, which is a non-ensemble approach.

In this article, we introduce a novel ensemble approach, which is capable of incorporating any imputation method for missing value estimation. The target outcome for the missing component Y is treated as a random quantity $Y_{\omega}$ that follows a multivariate distribution $\Omega$. We walk through $\Omega$, i.e. the instantiations of $Y_{\omega}$, through a bootstrap procedure. In this way, we can obtain an estimate for the operation C through an ensemble over $C(X \oplus Y_{\omega})$. The whole process can be represented by $E_{\Omega}[C(X \oplus Y_{\omega})]$. Generally, $E_{\Omega}[C(X \oplus Y_{\omega})] \neq C(X \oplus \hat{Y})$ for a single imputed estimate $\hat{Y}$ of Y.

We evaluate the proposed approach using a two-sample microarray data set for various sample sizes and missing rates. The proposed approach is a framework, which can incorporate any imputation method I and any analysis operation C. We chose the local least squares imputation coupled with the L2-norm-based similarity measure (referred to as LLSimpute/L2) because of its satisfactory performance (Kim et al., 2005), although other imputation methods can be selected in practice. We employ the simple Student's t-test or the widely used SAM t-test for gene prioritization as the operation C.

Our approach can be summarized in three steps: (i) bootstrapping imputation for individual samples; (ii) constructing resample complete matrices; (iii) averaging resample prioritization vectors. Figure 1 gives a flow chart for this approach. The details are described as follows.


Fig. 1. Flow chart for the ensemble approach to gene prioritization after missing value imputation: resample imputed data vectors are first generated in a bootstrap manner for each sample; resample imputed data matrices are then generated by randomly selecting one resample vector for each sample; resample prioritization vectors are finally calculated based on these resample matrices, and their average is considered as the estimated vector for gene prioritization.

 
2.2 Bootstrapping imputation for individual samples
The bootstrap method, first proposed by Efron (1979), enables us to generate a resample distribution of estimates in a non-parametric manner. Since different samples are generally unrelated, we perform a bootstrap procedure for each column (sample) to generate a resample distribution of missing value estimates.

Without loss of generality, we describe the bootstrap procedure for the first column $g_{\cdot 1}$. First, we sample n – 1 numbers $s_2, s_3, \ldots, s_n$ from 2, 3, ..., n with replacement. Then, we use the n – 1 resampled columns $(g_{\cdot s_2}, g_{\cdot s_3}, \ldots, g_{\cdot s_n})$ to impute the missing values in the column $g_{\cdot 1}$ (see Section 2.5 for the description of the imputation method). These two steps are repeated b times, giving b resample columns $\tilde{g}_{\cdot 1}^{(1)}, \ldots, \tilde{g}_{\cdot 1}^{(b)}$. Notice that the non-missing values in the original column are not changed in these resample columns.

After performing the above procedure for all individual columns, we obtain n × b resample vectors $\tilde{g}_{\cdot j}^{(i)}$ (j = 1, ..., n; i = 1, ..., b), with b resample replicates for each sample (column). b will be referred to as the size of bootstrap in the rest of the article.
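The per-column bootstrap described above can be sketched as follows. This is a minimal illustration rather than the authors' Matlab implementation: the function names, the use of `None` for missing entries and the `row_mean_imputer` stand-in (which is not LLSimpute) are all assumptions made for the example.

```python
import random

def row_mean_imputer(target, donors):
    """Hypothetical stand-in imputer (NOT LLSimpute): fill each missing entry
    of `target` with the mean of that gene's observed values in `donors`."""
    filled = list(target)
    for i, value in enumerate(target):
        if value is None:
            observed = [d[i] for d in donors if d[i] is not None]
            filled[i] = sum(observed) / len(observed) if observed else 0.0
    return filled

def bootstrap_impute_column(columns, j, b, imputer, rng):
    """Return b resample versions of column j (one list of gene values each)."""
    others = [c for c in range(len(columns)) if c != j]
    replicates = []
    for _ in range(b):
        # n - 1 draws with replacement from the other columns
        picked = [rng.choice(others) for _ in others]
        donors = [columns[s] for s in picked]
        replicates.append(imputer(columns[j], donors))
    return replicates

# Four samples (columns) measured on three genes; None marks a missing value.
columns = [
    [1.0, None, 3.0],
    [1.2, 2.0, 2.8],
    [0.8, 2.4, None],
    [1.1, 1.8, 3.3],
]
reps = bootstrap_impute_column(columns, 0, b=5, imputer=row_mean_imputer,
                               rng=random.Random(0))
```

Observed entries of the target column pass through unchanged; only the missing entry varies between replicates, which is what produces the resample distribution.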

2.3 Constructing resample complete matrices
For each individual column j, j = 1, 2, ..., n, we randomly select a resample replicate $\tilde{g}_{\cdot j}^{(i_j)}$ from the b resample replicates generated by the above procedure. In this way, we construct a resample matrix $\tilde{G} = (\tilde{g}_{\cdot 1}^{(i_1)}, \ldots, \tilde{g}_{\cdot n}^{(i_n)})$. (There are $b^n$ possible combinations.) We obtain r resample matrices by repeating the above step r times. (Notice that the non-missing values in the original matrix are not changed in these resample matrices.) r will be referred to as the size of ensemble in the rest of the article.
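A short sketch of this construction step, assuming the replicates are stored as a list with one entry per column, each entry holding that column's b imputed versions. The function name and data layout are illustrative choices, not taken from the original code.

```python
import random

def build_resample_matrices(replicates, r, rng):
    """replicates[j] holds the b imputed versions of column j.
    Returns r matrices, each a list of n columns with one replicate per column."""
    matrices = []
    for _ in range(r):
        matrices.append([rng.choice(col_reps) for col_reps in replicates])
    return matrices

# n = 3 columns, b = 2 replicates each, two genes per column.
replicates = [
    [[1.0, 2.1], [1.0, 1.9]],
    [[0.5, 2.0], [0.7, 2.0]],
    [[1.5, 2.2], [1.5, 2.6]],
]
matrices = build_resample_matrices(replicates, r=4, rng=random.Random(1))
n_combinations = len(replicates[0]) ** len(replicates)  # b ** n
```

With b = 2 and n = 3 there are only 8 distinct matrices; with b = 100 and n = 10 the space is far too large to enumerate, which is why r matrices are drawn at random instead.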

2.4 Averaging resample prioritization vectors
Using the method for gene prioritization (see Section 2.6 for details), we can obtain a resample score vector for each resample matrix. We take the average of these r resample vectors as the estimate of the prioritization vector. When the size of ensemble r is large, we consider averaging by mean. However, a large ensemble is usually time-consuming to compute. Generally, we set r = 100 and average by median, which is less sensitive to outliers and is especially useful when P-values are used as prioritization scores. Notice that if there is no missing value in the original data, then this average vector will be the same as the prioritization vector calculated based on the original data.
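The averaging step can be illustrated with a small sketch. The function name and the `how` parameter are hypothetical; the scores below are made-up P-values for three genes across three resample matrices.

```python
from statistics import mean, median

def average_score_vectors(score_vectors, how="median"):
    """Combine r resample score vectors gene by gene."""
    combine = median if how == "median" else mean
    return [combine(gene_scores) for gene_scores in zip(*score_vectors)]

scores = [
    [0.01, 0.40, 0.20],
    [0.03, 0.50, 0.90],   # third gene is an outlier in this resample
    [0.02, 0.45, 0.25],
]
by_median = average_score_vectors(scores, how="median")
by_mean = average_score_vectors(scores, how="mean")
```

For the third gene the median stays at 0.25 while the mean is pulled to about 0.45 by the single outlying resample, which illustrates why the median is the more robust choice for a modest r.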

2.5 Imputing missing values
The LLSimpute/L2 method proposed by Kim et al. (2005) is used for missing value imputation. We briefly describe it as follows. Without loss of generality, we consider the first gene and calculate the L2-norm-based similarity measures between this gene and the remaining m – 1 genes. These m – 1 genes are ranked according to the calculated similarity measures and the top k genes are identified as the k nearest neighbors. The missing values in the first gene vector are then estimated from these neighbors through the least squares method. We can perform this procedure for all gene vectors and have all missing values imputed. In this study, we follow the convention and choose k = 10.
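The idea of local least squares imputation can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the LLSimpute code of Kim et al.: candidate neighbor genes are assumed to be complete, the least squares problem is solved through normal equations with a tiny ridge term (a hypothetical stabilizer added here for numerical safety), and all function names are invented for the example.

```python
def solve_linear(a, y):
    """Solve a x = y by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [y[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def lls_impute_gene(target, candidates, k, ridge=1e-8):
    """Impute missing (None) entries of one gene from its k nearest genes."""
    obs = [j for j, v in enumerate(target) if v is not None]
    mis = [j for j, v in enumerate(target) if v is None]

    def l2_distance(gene):
        return sum((gene[j] - target[j]) ** 2 for j in obs) ** 0.5

    neighbours = sorted(candidates, key=l2_distance)[:k]
    # Normal equations for: minimise || sum_i x_i * neighbour_i - target ||
    # over the observed positions of the target gene.
    gram = []
    for a, p in enumerate(neighbours):
        row = []
        for b, q in enumerate(neighbours):
            dot = sum(p[j] * q[j] for j in obs)
            row.append(dot + (ridge if a == b else 0.0))
        gram.append(row)
    rhs = [sum(p[j] * target[j] for j in obs) for p in neighbours]
    x = solve_linear(gram, rhs)
    filled = list(target)
    for j in mis:
        filled[j] = sum(w * p[j] for w, p in zip(x, neighbours))
    return filled

# The target gene is exactly twice the first candidate, so the missing
# third value should come out close to 2 * 3.0.
candidates = [[1.0, 2.0, 3.0, 4.0], [5.0, 1.0, 0.0, 2.0]]
filled = lls_impute_gene([2.0, 4.0, None, 8.0], candidates, k=1)
```

The target gene is expressed as a linear combination of its neighbors on the observed positions, and the same coefficients are then applied to the neighbors' values at the missing positions.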

2.6 Prioritizing genes
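For a two-sample data set, we first consider the Student's t-test for prioritizing genes. We use $n_1$ and $n_2$ to denote the sample sizes of the first and the second groups, respectively, with $n_1 + n_2 = n$. For each gene, we use $x_{11}, x_{12}, \ldots, x_{1n_1}$ and $x_{21}, x_{22}, \ldots, x_{2n_2}$ to denote its measurements (observed for non-missing or imputed for missing) in the first and the second groups, respectively. The Student's t-test is given by $t = \Delta / s$, where $\Delta = \bar{x}_1 - \bar{x}_2$ and $s = s_p \sqrt{1/n_1 + 1/n_2}$; $\bar{x}_1 = \frac{1}{n_1} \sum_{j=1}^{n_1} x_{1j}$, $\bar{x}_2 = \frac{1}{n_2} \sum_{j=1}^{n_2} x_{2j}$ and

$$s_p^2 = \frac{\sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2}{n_1 + n_2 - 2}$$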

The above test is performed for all m genes. For the data analyzed in this study, we observed through the Quantile–Quantile (Q–Q) plot that the theoretical t-distribution was consistent with the permutation distribution. Therefore, P-values from their corresponding t-distributions are used as the score vector for gene prioritization. This allows us to evaluate the performance of the non-imputation approach, which calculates the t-test only based on the observed data.
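As a small worked example, the sketch below computes the t statistic for one gene, assuming the standard pooled-variance form of the two-sample Student's t-test. The function name is illustrative; converting t into a P-value would additionally require the CDF of the t-distribution with $n_1 + n_2 - 2$ degrees of freedom, which is not included here.

```python
from math import sqrt

def student_t(x1, x2):
    """Pooled-variance two-sample t statistic; returns (t, s)."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((v - m1) ** 2 for v in x1) + sum((v - m2) ** 2 for v in x2)
    pooled_var = ss / (n1 + n2 - 2)
    s = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (m1 - m2) / s, s

# Means 2 and 3, pooled variance 1, so s = sqrt(2/3) and t = -1 / s.
t, s = student_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

In the non-imputation approach, the same function would simply be applied to the observed values of each gene, so that $n_1$ and $n_2$ differ from gene to gene.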

We also consider the SAM t-test (Tusher et al., 2001) for prioritizing genes. This test adds a fudge factor to the denominator of the Student's t-test: $t = \Delta / (s + s_0)$, where $s_0$ is numerically determined according to the given data. This test has been widely used in two-sample microarray data analyses since it can generally improve the control of false positives by excluding genes with relatively small variances. However, the theoretical distribution of this test is unknown, and we have to use the permutation method to evaluate the P-values of tests. Since it is difficult to evaluate the P-values for genes with missing observations (missing values are actually not allowed in the implemented R function sam), the performance of the non-imputation approach cannot be evaluated.
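The effect of the fudge factor can be seen in a toy example. In this sketch $s_0$ is passed in as a fixed number purely for illustration; in SAM it is determined from the data, and that selection procedure is not reproduced here. The function name and the values are made up.

```python
def sam_scores(deltas, sds, s0):
    """SAM-style statistic delta / (s + s0) for each gene."""
    return [d / (s + s0) for d, s in zip(deltas, sds)]

# Gene 0: tiny difference with a tiny standard error.
# Gene 1: large difference with a moderate standard error.
deltas = [0.1, 1.0]
sds = [0.01, 0.5]
plain = sam_scores(deltas, sds, s0=0.0)    # ordinary t statistic
damped = sam_scores(deltas, sds, s0=0.2)   # with a fudge factor
```

Without the fudge factor gene 0 ranks first only because its variance is very small; with $s_0 = 0.2$ the ranking reverses in favor of gene 1.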

Remark: Although test scores and their corresponding P-values have equivalent effects in prioritizing genes when genes have uniform sample sizes, we recommend using P-values since they provide additional information about the significance of the tests.


    3 RESULTS
We use the two-sample ZAP-70 dataset (Wiestner et al., 2003), which is publicly available at http://llmpp.nih.gov/cll/, to evaluate our ensemble imputation approach against the non-ensemble imputation approach as well as the non-imputation approach. We focus on the following two questions with the consideration of various sample sizes and missing rates:

  • Will the ensemble imputation approach provide better concordant prioritization of genes?
  • Will the ensemble imputation approach provide improved control of true and false positives?

3.1 ZAP-70 Dataset
There are 12447 genes and 107 cases in this two-sample data set, which was collected for a study identifying a chronic lymphocytic leukemia subtype with unmutated immunoglobulin (Ig) genes, inferior clinical outcome and distinct gene expression profile. The sample sizes of the Ig-mutated and Ig-unmutated groups are 79 ($n_1$) and 28 ($n_2$), respectively. The overall missing rate is 12.1%. There are 2508 genes with no missing value. We denote this subset as $G \in \mathbb{R}^{2508 \times 107}$ and use it for our evaluation study with different sample sizes (5+5, 10+10 and 15+15) and missing rates (5, 10 and 20%).

Without loss of generality, we briefly describe the procedure for data matrix generation for the sample size 5+5 coupled with the 5% missing rate. First, we randomly choose five columns (samples) from each group to form a new complete data matrix $G_c \in \mathbb{R}^{2508 \times 10}$. Then, we randomly knock out entries as missing with 5% probability. The newly constructed incomplete matrix $G_m \in \mathbb{R}^{2508 \times 10}$, which contains missing values, is used for imputation. Based on these matrices, we consider both the ensemble and the non-ensemble approaches to gene prioritization after missing value imputation. We use $v_C = (v_{C1}, v_{C2}, \ldots, v_{Cm})^T \in \mathbb{R}^{m \times 1}$, $v_E = (v_{E1}, v_{E2}, \ldots, v_{Em})^T \in \mathbb{R}^{m \times 1}$ and $v_S = (v_{S1}, v_{S2}, \ldots, v_{Sm})^T \in \mathbb{R}^{m \times 1}$ to denote the prioritization score vectors generated from the complete data matrix $G_c$, and from the ensemble and non-ensemble imputation approaches based on the incomplete data matrix $G_m$, respectively. For the analysis based on the Student's t-test, since theoretical P-values are used for gene prioritization, it is feasible to consider the non-imputation approach: prioritization scores are calculated based only on the observed data in $G_m$. We use $v_N = (v_{N1}, v_{N2}, \ldots, v_{Nm})^T \in \mathbb{R}^{m \times 1}$ to denote this score vector.
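The data generation procedure above can be sketched as follows, using a tiny made-up matrix in place of the 2508 × 107 data. The function names, the use of `None` for knocked-out entries and the toy dimensions are assumptions for the example.

```python
import random

def subsample_columns(matrix, group1_cols, group2_cols, per_group, rng):
    """Pick `per_group` sample columns from each group (without replacement)."""
    chosen = rng.sample(group1_cols, per_group) + rng.sample(group2_cols, per_group)
    return [[row[c] for c in chosen] for row in matrix]

def knock_out(matrix, missing_rate, rng):
    """Mark each entry as missing (None) independently with the given probability."""
    return [[None if rng.random() < missing_rate else v for v in row]
            for row in matrix]

rng = random.Random(42)
# Toy complete data: 6 genes by 8 samples (columns 0-4 in group 1, 5-7 in group 2).
full = [[float(10 * g + s) for s in range(8)] for g in range(6)]
complete = subsample_columns(full, [0, 1, 2, 3, 4], [5, 6, 7], per_group=2, rng=rng)
incomplete = knock_out(complete, missing_rate=0.05, rng=rng)
```

Keeping `complete` alongside `incomplete` is what makes the evaluation possible: scores from the complete matrix serve as the reference against which the imputation approaches are compared.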

To answer the aforementioned two questions, we conduct the following two evaluations. We set the size of bootstrap b = 100, which enables us to generate a possibly huge ensemble. In order to determine the size of ensemble, for different sizes of ensemble, we calculated the Pearson correlation between the prioritization vectors vC and vE . Both the Student's and SAM t-tests were considered for prioritizing genes. Figure 2 shows the relationship between the correlation and size of ensemble: the correlation tends to be stable when the size of ensemble is close to 100 for all different configurations of sample size and missing rate. Therefore, we fix the size of ensemble r = 100 for the following evaluations.


Fig. 2. Improvement of prioritization concordance with the increase of size of ensemble. The Pearson correlation is calculated between the prioritization vector based on the complete data and that based on the incomplete data with the ensemble imputation approach. Different sample sizes and missing rates are considered. Both the Student's and SAM t-tests are considered for prioritizing genes.

 
When the SAM t-test is considered for prioritizing genes, it is difficult to evaluate the performance of non-imputation approach. Therefore, we only compare the ensemble imputation approach with the non-ensemble imputation approach.

3.2 Concordance of gene prioritization
One issue about gene prioritization after missing value imputation is whether the prioritization vector, which is used to select genes, is concordant with the one obtained when there are no missing values. We use the Pearson correlation coefficient (Pearson, 1894) to measure the concordance. We first calculate the Pearson correlation coefficients $\rho_E$ between $v_E$ and $v_C$, and $\rho_S$ between $v_S$ and $v_C$. If $v_N$ is available, then $\rho_N$ between $v_N$ and $v_C$ is also calculated. Since these correlations are usually large, even a small improvement is considered significant. Therefore, we measure the improvement from the ensemble approach by the Pearson Correlation Improvement Rate (PCIR) between $\rho_E$ and $\rho_S$, defined as:

$$\mathrm{PCIR} = \frac{\rho_E - \rho_S}{1 - \rho_S} \times 100\%$$

PCIR between {rho}E and {rho}N can also be similarly calculated.
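A small numerical sketch of this measure is given below. It assumes that PCIR expresses the gain in correlation as a fraction of the remaining gap to perfect correlation, i.e. $(\rho_E - \rho_S)/(1 - \rho_S)$; this form is an assumption made for the example, and the score vectors are invented.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pcir(rho_new, rho_ref):
    """Assumed form: improvement relative to the gap between rho_ref and 1."""
    return (rho_new - rho_ref) / (1.0 - rho_ref)

v_c = [0.01, 0.20, 0.50, 0.90]   # scores from the complete data
v_e = [0.02, 0.22, 0.48, 0.88]   # ensemble imputation (made-up values)
v_s = [0.05, 0.30, 0.40, 0.80]   # non-ensemble imputation (made-up values)
rho_e, rho_s = pearson(v_c, v_e), pearson(v_c, v_s)
improvement = pcir(rho_e, rho_s)
```

Under this form, moving from a correlation of 0.90 to 0.95 counts as a 50% improvement, since half of the remaining gap to 1 has been closed.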

As shown in Table 1, the proposed ensemble imputation approach consistently outperforms the non-ensemble imputation approach as well as the non-imputation approach: we always obtain positive PCIRs. Compared to the non-ensemble imputation approach, the ensemble imputation approach achieves 50% and higher PCIRs when the sample size is 5+5, and 20% and higher PCIRs when the sample sizes are 10+10 and 15+15. These are observed when either the Student's or SAM t-test is used for prioritizing genes.


Table 1. Improvement of the ensemble method in prioritization concordance

 
3.3 Control of true and false positives
When genes are selected after missing value imputation for the follow-up studies, it is necessary to understand the impact of missing value imputation on the control of true and false positives. Since only the selected genes will be used for the follow-up large sample validation studies, it is crucial to control the false positives in genes selected from microarray studies. Controlling the true positives is also important since it is undesirable to miss too many truly differentially expressed genes.

One difficulty in addressing this issue is that we do not know which genes are truly differentially or non-differentially expressed. In this study, for each configuration, we define gold standards based on the prioritization vector $v_C$ from the complete data $G_c$. Previous studies showed that an accurate estimate of the number (N) of differentially expressed genes can be obtained when the sample size is relatively large (Lai, 2006). Therefore, we first use the original large and complete data $G \in \mathbb{R}^{2508 \times 107}$ to estimate N with a recently proposed method (Lai, 2006). Then, we define the genes with the top N ranks as gold standard positives and the remaining genes as gold standard negatives. With these gold standards defined, we can generate the widely used receiver operating characteristic (ROC) curves (true positive rate against false positive rate for different cutoff points) to compare different approaches. Since different results may be obtained when different test statistics are used for gene prioritization, the above procedure is performed separately for the Student's and the SAM t-tests.
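The ROC construction from a gold standard can be sketched as follows. The sketch assumes that scores are P-values (smaller means higher priority), that N is given, and that there are no ties; the function name and the numbers are illustrative.

```python
def roc_points(gold_scores, test_scores, n_top):
    """ROC points (FPR, TPR) for `test_scores`, where the gold standard
    positives are the `n_top` genes with the smallest `gold_scores`."""
    m = len(gold_scores)
    gold_order = sorted(range(m), key=lambda i: gold_scores[i])
    positives = set(gold_order[:n_top])
    test_order = sorted(range(m), key=lambda i: test_scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in test_order:          # lower the cutoff one gene at a time
        if i in positives:
            tp += 1
        else:
            fp += 1
        points.append((fp / (m - n_top), tp / n_top))
    return points

gold = [0.001, 0.2, 0.03, 0.9]    # complete-data P-values; genes 0 and 2 are positives
test = [0.002, 0.04, 0.5, 0.8]    # P-values after imputation (made-up values)
points = roc_points(gold, test, n_top=2)
```

Each approach (ensemble, non-ensemble, non-imputation) would contribute its own `test_scores` against the same gold standard, and the resulting curves can then be overlaid for comparison.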

When the Student's t-test is used for gene prioritization, we obtain the estimate N = 1013 (40.4%); when the SAM t-test is used for gene prioritization, we obtain the estimate N = 984 (39.3%). Figures 3 and 4 show the ROC curves for different configurations when the Student's and the SAM t-test are used as test statistic. Although the curves become lower as missing rate increases, the ensemble imputation approach consistently outperforms the non-ensemble imputation and non-imputation approaches. The advantage is especially distinct when the sample sizes are about 5–10 per group and the missing rates are about 10–20%, which is a common situation for cDNA microarray studies.


Fig. 3. Improvement of the ensemble method in receiver operating characteristic (ROC) curves when the Student's t-test is used for gene prioritization. Different sample sizes and missing rates are considered. x and y axes represent false and true positive rates, respectively.

 

Fig. 4. Improvement of the ensemble method in receiver operating characteristic (ROC) curves when the SAM t-test is used for gene prioritization. Different sample sizes and missing rates are considered. x and y axes represent false and true positive rates, respectively.

 

    4 DISCUSSION
We proposed an ensemble approach to address the issue of gene prioritization after missing value imputation. Compared with the traditional missing value imputation methods, which only provide one estimate for each missing observation, our approach considers the distribution space of prioritization scores due to the existence of missing values. We simulated the distribution space through a bootstrap procedure. To compare different approaches, we evaluated their performances in the concordance of gene prioritization and the control of true/false positives. A published two-sample microarray gene expression data set was used for our evaluations. The results confirmed the advantages of our proposed approach; the results also allowed us to compare the non-ensemble imputation approach with the non-imputation approach. From Table 1 and Figure 3, the non-imputation approach showed comparable or even better performances in many cases when it was compared with the non-ensemble imputation approach.

We also evaluated the classification performance of genes selected from a pilot study. Support vector machine (Vapnik, 1995) was used as the classifier and different numbers of genes with top ranks were selected. The results showed that a relatively low classification error rate could be achieved, regardless of the choice of approach to missing value imputation, when a certain number (30–100) of genes were included. This is not surprising since the classification performance depends on not only the differentiability of selected genes but also the combination of these genes.

Our proposed ensemble approach is novel. It is capable of incorporating any imputation method for missing value estimation and any statistical test for gene prioritization. In this study, for simplicity, we chose to use the Student's or SAM t-test for gene prioritization. Because of its satisfactory performance (Kim et al., 2005), we chose the local least squares coupled with L2 norm based similarity measure for missing value imputation. Kim et al. (2005) also proposed an automatic estimator for the number k of nearest neighbor genes, and showed its satisfactory performances. However, this procedure is very time consuming. We actually performed this procedure for some of our evaluations and observed similar results. Since our purpose was to introduce an ensemble approach to gene prioritization after missing value imputation, we simply followed the convention and fixed k = 10 to save our computation time.

It should be noted that the benefits of our proposed approach may be affected by the choice of method for imputing missing values. Many imputation methods have been proposed and each one has its advantage in a certain situation. It is necessary to conduct further studies so that the impacts of different imputation methods can be well understood. It should also be noted that the current approach may not be applicable to time series microarray data (Gan et al., 2006) because of the dependence structure among observations from different time points. We are currently investigating a further development so that we can generalize this ensemble idea to integrate missing value imputation with gene prioritization for time series microarray data.

If the missing rate is almost zero, then there will be no clear difference between the ensemble and the non-ensemble approaches; if almost all the data are missing, then the missing value imputation will not work well and both approaches will perform poorly. If the sample size is extremely small, then the missing value imputation will not work well and there will be no clear difference between the two approaches; if the sample size is relatively large, then the result of gene prioritization will be relatively robust and insensitive to the missing values, and therefore there will again be no clear difference between the two approaches. The above discussion implies that there are certain ranges of the sample size and the missing rate within which the ensemble approach will provide considerable improvements over the non-ensemble approach (as we observed in Table 1 and Figs 3 and 4). To better understand the performance of the ensemble approach, it is necessary to conduct more theoretical and simulation studies.


    ACKNOWLEDGEMENT
We thank the associate editor and two anonymous reviewers for their valuable comments. This work was supported by an NIH grant DK-75004.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on August 31, 2006; revised on December 26, 2006; accepted on January 14, 2007

    REFERENCES

    Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. (2000) 97. USA: Proc. Natl Acad. Sci. 10101–10106.

    Alter O, et al. Generalized singular value decomposition for comparative analysis of genome-scale expression datasets of two different organisms. (2003) 100. USA: Proc. Natl Acad. Sci. 3351–3356.

    Bo TH, et al. LSimpute: accurate estimation of missing values in microarray data with least squares methods. In: Nucleic Acids Res. (2004) 32:e34.[Abstract/Free Full Text]

    Breiman L. Random forests. In: Mach. Learn. (2001) 45:5–32.[CrossRef]

    Cho JH, et al. New gene selection method for classification of cancer subtypes considering within class variation. In: FEBS Lett. (2003) 551:3–7.[CrossRef][Web of Science][Medline]

    Der SD, et al. Identification of genes differentially regulated by interferon alpha, beta, or gamma using oligonucleotide arrays. (1998) 95. USA: Proc. Natl Acad. Sci. 15623–15628.

    Efron B. Bootstrap methods: another look at the jackknife. In: Ann. Stati. (1979) 7:1–26.[CrossRef]

    Friedland S, et al. A simultaneous reconstruction of missing data in DNA microarrays. (2003) Institute for Mathematics and its Applications Preprint Series No. 1948.

    Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J. Compu. Sys. Sci. (1997) 55:119–139.[CrossRef]

    Gan X, et al. Microarray missing data imputation based on a set theoretic framework and biological knowledge. In: Nucleic Acids Res. (2006) 34:1608–1619.[Abstract/Free Full Text]

    Golub GH, van Loan CF. Matrix Computations (1996) Baltimore, MD: Johns Hopkins University Press.

    Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–537.

    Jornsten R, et al. DNA microarray data imputation and significance analysis of differential expression. Bioinformatics (2005) 21:4155–4161.

    Kim H, et al. Missing value estimation for DNA microarray gene expression data: local least squares imputation. Bioinformatics (2005) 21:187–198.

    Kim KY, et al. Reuse of imputed data in microarray analysis increases imputation efficiency. BMC Bioinformatics (2004) 5:160.

    Lai Y. A statistical method for estimating the proportion of differentially expressed genes. Comput. Biol. Chem. (2006) 30:193–202.

    Oba S, et al. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics (2003) 19:2088–2096.

    Pearson K. Contributions to the mathematical theory of evolution. Phil. Trans. R. Soc. Lond. (1894) 185:71–110.

    Perou CM, et al. Molecular portraits of human breast tumors. Nature (2000) 406:747–752.

    Scheel I, et al. The influence of missing value imputation on detection of differentially expressed genes from microarray data. Bioinformatics (2005) 21:4272–4279.

    Sehgal MSB, et al. Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data. Bioinformatics (2005) 21:2417–2423.

    Shipp MA, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. (2002) 8:68–74.

    Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17:520–525.

    Tuikkala J, et al. Improving missing value estimation in microarray data with gene ontology. Bioinformatics (2006) 22:566–572.

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA (2001) 98:5116–5121.

    van 't Veer LJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002) 415:530–536.

    Vapnik V. The Nature of Statistical Learning Theory (1995) New York: Springer-Verlag.

    Wang X, et al. Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme. BMC Bioinformatics (2006) 7:32.

    Wiestner A, et al. ZAP-70 expression identifies a chronic lymphocytic leukemia subtype with unmutated immunoglobulin genes, inferior clinical outcome, and distinct gene expression profile. Blood (2003) 101:4944–4951.

    Zhou X, et al. Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics (2003) 19:2302–2307.





