Skip Navigation


Bioinformatics Advance Access originally published online on November 2, 2005
Bioinformatics 2006 22(3):317-325; doi:10.1093/bioinformatics/bti738
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/3/317    most recent
bti738v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, W.
Right arrow Articles by Bertrand, K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, W.
Right arrow Articles by Bertrand, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A method for predicting disease subtypes in presence of misclassification among training samples using gene expression: application to human breast cancer

Wensheng Zhang 1, Romdhane Rekaya 1,2,* and Keith Bertrand 1

1Department of Animal and Dairy Science, University of Georgia Athens, GA 30602, USA
2Department of Statistics, University of Georgia Athens, GA 30602, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 

Motivation: An accurate diagnostic and prediction will not be achieved unless the disease subtype status for every training sample used in the supervised learning step is accurately known. Such an assumption requires the existence of a perfect tool for disease diagnostic and classification, which is seldom available in the majority of the cases. Thus, the supervised learning step has to be conducted with a statistical model that contemplates and handles potential mislabeling in the input data.

Results: A procedure for handling potential mislabeling among training samples in the prediction of disease subtypes using gene expression data was proposed. A real data-based simulation study about the estrogen receptor status (ER+/ER–) of breast cancer patients was conducted. The results demonstrated that when 1–4 training samples (N = 30) were artificially mislabeled, the proposed method was able not only in correcting the ER status of mislabeled training samples but also more importantly in predicting the ER status of validation samples as well as using ‘true’ training data.

Availability: The programs (in Matlab) used for analysis are publicly available at http://nce.ads.uga.edu/~romdhane

Contact: rrekaya{at}uga.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 
It is widely recognized that an accurate diagnosis is crucial to the design of an adequate treatment protocol and ultimately to the survival of the patient. Sometimes, the accurate assignment of patients to specific disease subclasses is a difficult and expensive process and requires the collective expertise of a number of professionals (Yeoh et al., 2002). Several deadly diseases such as cancer are complex and very heterogeneous (Ross et al., 2003). Even for a specific cancer type, such as leukemia, a few subclasses with varying phenotypes exist (Golub et al., 1999; Ramaswarmy et al., 2001; West et al., 2001). Using typical diagnostic technologies, the different subtypes (subclasses) of a specific disease can be difficult to tell apart because they tend to look alike under microscopic analysis and share the same symptoms or markers used in the diagnostic, or simply because one or more subclasses of the disease are unknown. As a result, such similarity can lead to misdiagnosis and improper treatment (Khan et al., 2001). Recently, the use of microarray expression profiling for the classification and subtype discovery of cancer and other diseases has been proposed and frequently investigated (Alizadeh et al., 2000; Dudoit et al., 2002; Golub et al., 1999; Khan et al., 2001; Ross et al., 2003; Ramaswarmy et al., 2001; Tibshirani et al., 2002; West et al., 2001; Yeoh et al., 2002). However, the problem of potential mislabeling in the training set has not yet been solved. That is, the categorical outcomes of some samples may be mislabeled due to using less than perfect classical techniques. Consequently, using these phenotypes to determine the set of genes that discriminate between disease subtypes in training a classifier could lead to erroneous results.

Furthermore, while many algorithms for training classifiers and methods for selecting a subset of discriminative genes for sample classification have been proposed (Antonov et al., 2004; et al., 2002; Golub et al., 1999; Dudoit et al., 2002; Tusher et al., 2001) and microarray data have been successfully applied in improving tumor classification, the robustness of these procedures to potential misclassification has received little attention. For example, Golub et al. (1999) using 50 genes that best discriminated between acute myeloid leukemia and acute lymphoblastic leukemia correctly classified 36 out of the 38 training set samples and 29 out of the 34 validation set samples.

Although a high prediction rate was achieved using microarray expression profiling, it is clear that 10–15% of the samples were not correctly classified. Many later studies (Dudoit et al., 2002; et al., 2002; Li et al., 2001; Statnikov et al., 2005; Tibshirani et al., 2002) tried to improve the classification accuracy using different classification procedures, but no one, to our knowledge, has investigated the effects of potential mislabeling prior to the implementation of the supervised learning step.

We believe that an accurate diagnostic prediction will not be achieved unless we are completely sure about the status of the disease subtype for every training sample used in the supervised learning step. Such an assumption requires the existence of a perfect tool for diagnostic classification, which is seldom available. Usually, unsupervised learning, such as clustering, by itself is not adequate in discriminating subclasses, especially those that are similar and very close together (Ross et al., 2003). Thus, the supervised learning step has to be conducted with a procedure/statistical model that contemplates and handles potential mislabeling of the disease subtype status in the input data. Furthermore, mislabeling is not rare and has in fact been investigated by researchers in other fields. Rekaya et al. (2001) developed a Bayesian procedure for addressing mislabeling of binary responses. An analysis of animal breeding data showed that, with their procedure, cows with mislabeled fertility status were identified with high probability.

Several studies (West et al., 2001; West 2003; Spang et al., 2002) have developed Bayesian regression models for predicting subtypes of tumors using microarray data, in which the classification confidence was expressed as a probability with a corresponding confidence interval. In one study, West et al.'s (2001) work was extended to deal with potential mislabeling in the training dataset. The proposed procedure was tested with a published dataset for predicting Estrogen Receptor (ER) status of human breast cancer (West et al., 2001).


    2 MATERIALS AND METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 
2.1 Latent variable model
Assume Y = (Y1, Y2, ... , Yn) to be a vector of independent random binary variables where each Yi (i = 1, 2, ... , n) follows a Bernoulli process. Let X be a structure matrix representing exploratory factors and xi be the i-th column. Let {ell}i be a continuous and unobserved latent random variable that relates to the binary response via the following relationship:

Formula
Furthermore, the liability {ell}i could be modeled through a simple linear regression model as follows:

Formula 1(1)

In this case, the link function of the systematic component Formula 1 with the binary response Yi is structured via a probit model (Albert and Chib, 1993; Johnson and Albert, 1999; West et al., 2000). Thus,

Formula 2(2)
where {Phi}(·) is the standard normal cumulative distribution function and Formula 2 is the linear predictor of Yi. For classification of binary responses, the probability pi can be used, leading to the following discrimination rule:

Formula 3(3)

It is clear from Equation (3) that the class boundary is linear and several techniques including support vector machine approach could be used to detect the most likely mislabeled samples. However, for studying the influence of mislabeling in disease diagnosis, a statistical method with clear proprieties that relies on classification probabilities and their associated confidential intervals is more appropriate.

In practical applications, a realization of Y, say y, is generally observed. In this study, the vector y represents ER status of human breast cancer samples (determined by clinical tests) with element yi = 1 when the ER status of i-th sample is positive and yi = 0 when the status is negative. The binary data consisted of ER status of 49 patients determined by immunohistochemistry (IHC) and immunoblotting (IB) tests and the matrix X consisted of expression levels of 7129 genes for each of the 49 individuals.

2.2 Dimension reduction
Typically, gene expression experiments involve several thousand genes where the majority of them have no effect in the biological process under study. Furthermore, the number of samples is always much smaller than the number of genes leading to the well known problem of large p small n (West et al., 2000). Consequently, it is better, at least in a statistical sense, to reduce the number of genes to be used in the inference. In this study, the most influential genes were selected based on the absolute values of the point biserial correlation coefficients (rb) computed as follows:

Formula 4(4)
where Mp and Mq are the average expressions of a given gene in each of the two binary response classes, p and q are the correspond proportions and St is the pooled standard deviation. Based on our preliminary exploration of the data, it was evident that the prediction of the binary status was consistent when the number of selected genes varied from 100 to 300. Thus, only the case of 100 genes selected was considered in this study.

Even with only 100 genes being used, the training set had a smaller number of samples and an additional step was required to reduce the dimension of gene expression profiling. Singular value decomposition (SVD) technique was used for that purpose (West et al., 2001; West, 2003). Let X be the matrix of expression profiling having 100 rows (number of selected genes) and 49 columns (number of training samples). SVD consists in decomposing X as follows:

Formula 5(5)
where A is the p x n SVD loading matrix and has orthogonal columns so that A'A = I; D = diag(d1, ... , dn), a diagonal matrix of positive singular values, ordered as di ≥ d2 ≥ ... dn ≥ 0; and F is an n x n SVD factor matrix, an orthogonal matrix with FF' = F'F = I and the i-th column is denoted by Fi.

Consequently, Equation (2) can be rewritten as follows:

Formula 6(6)
where Formula 6. Through SVD, the dimension of the parameter vector is reduced from p (the number of genes) in ß to n (the number of samples). Biologically, the dimension reduction can be interpreted as changing the p gene effects into n supper-gene factors. In other words, the n supper-gene factors contained all the information originally contained in the p genes.

2.3 Inference on parameters
Inference about ß was indirectly obtained through the appropriate transformation of {gamma}. Consequently, point and interval estimates of the quantity of main interest, pi (Yi = 1), were easily achieved using Bayesian approach implementation via Markov chain Monte Carlo (MCMC) methods (West et al., 2001; West, 2003).

The observed binary outcome vector y = (y1, y2, ..., yn)' was used in the sampling of the latent variables from their truncated normal distribution through the following relationship:

Formula 7(7)
where ui ~ U(0, 1).

2.4 Dealing with potential misclassification
Let r = (r1, r2, ... , rn) be the ‘true’ binary outcome vector of samples in the training set. The vector rcan be regarded as a realization of the independent random vector Y = (Y1, Y2, ..., Yn) conditional on the gene expression profiling matrix, X, and subject to model (2) and the discriminative rule (3). Assume the observed data vector y, is a copy of r but with one or more elements being mislabeled. Misclassification or mislabeling occurs if some yi are switched, e.g. ri = 0 becomes yi = 1 (i.e. a zero was coded as one) or vice versa. Following the notation by Rekaya et al. (2001), let {alpha} = ({alpha}1, {alpha}2, ... , {alpha}n)', where {alpha}i is an indicator variable for observation i such that {alpha}i = 1 if yi was switched and {alpha}i = 0 otherwise. Suppose each {alpha}i follows a Bernoulli model with success probability {pi} (global misclassification probability) as a prior such that Formula 7. Assuming independence between {alpha} and r, their joint distribution, given {gamma}, {pi} and the expression level X, is

Formula 8(8)
Thus,

Formula 9(9)
where pi (FiD{gamma}) = {Phi}i (FiD{gamma}). Furthermore, the following relationship between ri and yi, given {alpha}i, can be established as follows:

Formula 10(10)
such that ri = yi when there is no mislabeling of the true binary response.

Using the relationships in (9) and (10), the joint probability of {alpha} and y given {gamma} and {pi}, is

Formula 11(11)

To complete the Bayesian formulation a conjugate beta prior was assumed for the probability of misclassification as follows:

Formula 12(12)
where a was set to 1 so that the beta distribution has a unique zero-mode and b was set to a small number (such as 3) after some empirical evaluations.

Following Rekaya et al. (2001), the conditional posterior distribution of each {alpha}i was given by

Formula 13(13)
and

Formula 14(14)
where Formula 14

Thus, each {alpha}i (i = 1,2, ... , n) is sampled from a discrete distribution with probabilities as in Equations (13) and (14). Furthermore, it is not hard to show that the fully conditional distribution of {pi} was as follows:

Formula 15(15)

2.4.1 Implementation
Estimates of parameters of interest, including the misclassification probability, pi({alpha}i = 1), for every binary observation were obtained through successive sampling from the above-mentioned conditional distributions. Further, pi({alpha}i = 1) could be considered as a realization of a random variable that follows a certain probability distribution. Thus, a confidence interval for pi ({alpha}i = 1) could be approximated using Chebychev's Inequality (Casella and Berger, 2001). More importantly, the later could be used to detect possible outliers. In this study, observations with outlying behavior were considered as indicators of potential misclassification, and they were not discarded from the analysis. In gene expression data, the removal of certain samples will result in loss of information and in the case of heterogeneous diseases it could lead to a potential change in the features of the data, which in turn could reduce its prediction capacity.

For the specific case of using microarray data for binary response classification, it is reasonable to assume that the reported data are largely correct. Thus, based on the observed data, y, the individual with the highest pi({alpha}i = 1) will be switched from its current ER status to its ‘true’ status using Equation (3) and the resulting new binary data, y(1), will be used to infer the gene effects as well the misclassification probability of each observation. Based on the latter, the observation with the highest pi ({alpha}i = 1) will see its binary status switched and a new binary data y(2), will be generated. The process will be repeated for several rounds (say k rounds) and the resulting data y(k) will be considered free of mislabeling and it will be used to infer pi(Yi = 1) and the high posterior density interval (HPD) for samples in the training and validation datasets.

A remaining problem is that, after all mislabeled observations have been corrected after the (k – 2)-th round, the program may continue switching Formula 15 between 0 and 1 because it is always possible that one pi({alpha}i = 1) has an irregular large value. In this case, the process of model selection is forced to stop at round k and yk will be used to infer pi(Yi = 1) and the corresponding confidence interval.

A graphical illustration of the algorithm is presented in Figure 1. The letters A and B were used to indicate samples with ER+ and ER– status, respectively. In Fig 1A, the expression level of the 100 most influential genes and the original ER status were used to estimate the parameter vector, beta1, and to compute the misclassification probability of every sample. Based on the latter, Sample 5 (in red) was identified as the most likely mislabeled observation. Consequently, Sample 5 was switched (in blue) from B (ER+) to A (ER–) as indicated in Fig 1B. Using the new binary data and the expression matrix (X2) of newly selected 100 most influential genes, the parameter vector beta2 and the misclassification probability of every sample were computed. Sample 1 (with red color in Fig 1B) had the highest misclassification probability and was then switched (with blue in Fig 1C) from A to B. The same process was repeated again. However, only sample 1 kept going back and forth between ER+ (A) and ER– (B) as illustrated in Fig 1C and Fig 1D. At that point, the algorithm was stopped and sample 1 maintained its original status (A). It is worth mentioning that in the majority of scenarios, the procedure automatically stopped when no sample showed an outlying behavior.


Figure 1
View larger version (12K):
[in this window]
[in a new window]
 
Fig. 1 Graphical illustration of the algorithm used for dealing with mislabeling. Only eight samples in the training set were used to simplify the presentation. The vertical axis represents the misclassification probability. (A) Sample 5 (in red) was identified as the most likely mislabeled observation, (B) sample 5 (in blue) was switched from its original ER– (B) to ER+ (A) status and the most likely mislabeled sample, given the new input data, was observation 1 (in red), (C) sample 1 was switched from A (ER+) to B (ER–), (D) sample 1 was identified again as the most likely mislabeled observation and was switched back to its original status (A). At this point, the algorithm was stopped and only sample 5 has switched status.

 
2.5 Summary of the procedure
Set k = 0 and y(0) = y.

Step 1. Select the 100 most discriminative genes according to the absolute values of the point biserial correlation coefficients of the gene expression and the observed binary outcomes y in the training samples as described in Section 2.2. Generate the design matrix X, which contains the expression levels of the selected 100 genes for the training and validation samples.

Step 2. Fit the regression models Formula 15 as described by West et al. (2000), where the binary outcomes of validation samples are considered as missing. Given the liabilities, {ell}, solving the above regression model is straightforward. Get the estimate or prediction of classification probability pi(Yi = 1) for each sample (i = 1, 2, ... , n1, n1 + 1, n1 + 2, ... , n2) where n1 and n2 are the number of samples in the training and validation datasets, respectively.

Step 3. Calculate the misclassification probability pi({alpha}i = 1) for each training sample at each iterate of Step 2. For the observations in the validation set, simply set pi({alpha}i = 1) equal to zero.

Step 4. Compute the empirical Bayesian estimate of pi({alpha}i = 1) for each sample in the training set. Determine the observation with the highest pi({alpha}i = 1), set m = i and switch Formula 15 from 0 to 1or from 1 to 0. Set k = k + 1 and y(k) = y(k–1).

Step 5. Repeat Steps 1–4 until the observation m with the highest pm({alpha}m = 1) falls within the 90% confidence interval computed using Chebychev's Inequality or if y(k) = y(k–2).

Step 6. Output the point estimates of predictor µi, classification probabilities pi(Yi = 1), and their respective 90% HPD intervals calculated using Chen et al.'s algorithm (Chen et al., 2000).


    3 A REAL DATA-BASED SIMULATION STUDY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 
3.1 Dataset, primary treatment and statistical tools
A publicly available expression data (http://data.cgt.duke.edu/west.php) that had been analyzed by West et al. (2001) was used in the present study. The dataset consisted of expression levels of 7129 genes of 49 human primary breast cancer samples generated using the Affymetrix platform. ER status for every sample was determined using two clinical methods. Five cases had unsolved ER status based on IHC and IB tests and were placed into the validation dataset. From the remaining 44 samples, 2 samples that were assumed as having failed hybridization were also placed into the validation dataset. Furthermore, 12 randomly selected samples from the remaining 42 samples were added to the validation dataset. As a result, 30 samples (15 ER+ and 15 ER–) were included in the training set and 19 samples (7 ER+, 7 ER–, 5 unresolved by the two clinical tests) were placed in the validation dataset.

Preprocessing of the data consisted of threshold treatment to the Affymetrix average difference (AD) with 1 as the lower limit and 16 000 as the upper limit. On the remaining data, a logarithm base 2-transformation was applied. Furthermore, genes with the highest transformed AD being smaller than two times the minimum AD were deleted. Bayesian implementation via Gibbs sampling converged, in general, in less than 200 iterations. In this study, a more conservative implementation was carried out. A long chain of 20 000 iterations was implemented with the first 5000 iterations discarded as burn-in. In order to reduce autocorrelation, one in every 50 samples of the remaining 15 000 draws was maintained for post-Gibbs processing.

3.2 Influence of mislabeling on gene selection
Using expression information and the observed ER status of the 30 samples in the training set (L0), the 100 most discriminative genes for ER status prediction were selected. Randomly, the ER status of 1–4 samples were artificially mislabeled, such that ER+ was set to ER– and vice versa. Thus, four new training datasets were created and labeled L1, L2, L3 and L4 corresponding to 1, 2, 3 and 4 switched samples, respectively. Each of these four datasets was used to determine the list of 100 most discriminative genes. The process was replicated 10 times. Table 1 presents summary statistics of the number of common genes between L0 and the other four lists (L1–L4). It is clear that the number of mislabeled samples has significantly influenced the selection of most discriminative genes. In fact, when only one sample was mislabeled, roughly 20% of the most discriminative genes were not identified as such. When two samples had their ER status switched, corresponding to a mislabeling rate (MP) of 6.7%, almost one-third of the genes in L0 were not selected. The situation got worse when the number of mislabeled samples increased. For an MP of 10% (three mislabeled samples) and 13.3% (four mislabeled samples), 37.3 and 48.6% of the most discriminative genes were not selected, respectively. These results highlighted the fact that even with a low-to-moderate MP (3–13%), mislabeling could have detrimental effect in identifying the most influential genes for ER status classification. In a statistical model using microarray data for disease classification and prediction, the selected set of most influential genes plays the role of exploratory variables. Therefore, mislabeling influences statistical diagnosis mainly via the change of the structure of the fitted model. This change will necessarily lead to poor classification and low prediction capacity.


View this table:
[in this window]
[in a new window]
 
Table 1 Percentage of the 100 most discriminative genes when 1 to 4 tumors in the 30 samples training set had their ER status mislabeled

 
3.3 Influence of misclassification on cancer status prediction
The practical value of using gene expression information in disease diagnosis depends on the capacity of the fitted statistical model in classifying the clinical status of individual samples. With the current data, samples were classified based on their probability of being ER+ or ER–. Thus, if the i-th sample had a probability greater than 0.5 of being ER+, it will be classified as such, otherwise it would be considered as ER–. Apparently, this hard classification is not adequate in helping doctors in the diagnosis because uncertainty is ignored. West et al. (2001) suggested that doctors should be provided with the uncertainty of the classification for a specific individual or sample. Here, a simple assessment criterion is presented. That is, if the statistical classification is consistent with biological diagnosis and the 90% HPD interval of the classification probability is between (0.0, 0.5) and (0.5, 1.0), the sample will be considered as correctly classified with low uncertainty. Otherwise, it will be considered as incorrect and/or with high uncertainty.

To evaluate the impact of mislabeling in classification probability and uncertainty, the true (L0) as well as four ‘noisy’ training (L1–L4) datasets were used to predict the ER status for validation samples. The analysis was conducted using the procedure described in Section 2.5.

For clarity, samples in the validation set were divided into four groups based on results obtained in West et al. (2001): GV1 included samples 11, 16, 40, 43 and 47 for which the two clinical tests agreed on the ER status, but the statistical predictions showed high uncertainty; GV2 included samples 31, 33, 45 and 46, for which the two clinical tests disagreed on the ER status, however the statistical predictions had low uncertainty; GV3 included sample 14, for which the two clinical tests disagreed and the statistical predictions had high uncertainty; and GV4 included samples 4, 7, 8, 18, 26, 29, 44, 48 and 49, for which the two clinical tests agreed on the ER status and the statistical predictions had low uncertainty. When the true training set (L0) was used, the statistical classification of samples in the validation set was the same as in West et al. (2001) and is shown in the fourth column of Table 2. The bold sign (+ or –) indicates predictions with high statistical uncertainty as defined above and the +/– sign in the third column indicates conflicting ER status based on the two clinical tests. For samples in the training set and in group GV4, their ER status was correctly resolved by the clinical tests and the statistical prediction. For samples in group GV2, for which the two clinical tests conflicted, the statistical prediction seems to have determined their ER status with low uncertainty (high certainty). However, for samples in groups GV1 and GV3, the statistical prediction had a high uncertainty and sometimes was in contradiction with the clinical determination of ER status.


View this table:
[in this window]
[in a new window]
 
Table 2 Effects of misclassification on the prediction of ER status of human breast cancer samples

 
When the training sets included mislabeled samples (L1–L4), the capacity of the model in predicting the ER status of samples in the validation set deteriorated. Major changes were observed when samples in groups GV2 and GV4 were considered. For samples in group GV2, for which the clinical tests disagreed and the statistical prediction had a low uncertainty, mislabeling has made their statistical prediction highly uncertain. In fact, when only one sample was mislabeled in the training set (L1), three (31, 45, 46) out the four samples in GV2 had their ER status changed from certain to uncertain as indicated by the number of times their ER status was resolved out of 10 replications. Furthermore, the prediction of the ER status of all the samples became uncertain when more than two observations were mislabeled in the training set. More interestedly, for samples in GV4 for which both the clinical tests and statistical prediction agreed on their ER status, mislabeling had a marked effect on their predicted status. With only one mislabeled sample in the training set (L1), two samples (8, 29) out of the nine had their predicted ER status unresolved based on the 10 replicates. When more than two samples in the training set had their ER status mislabeled, six samples had their ER status clearly unresolved and the remaining three samples (7, 26, 49) were correctly classified only 6 or 7 times out of 10. All nine samples had their ER status unresolved when four samples were mislabeled in the training set (L4). For samples in groups GV1 and GV3, no major changes in prediction of ER status were observed between using the ‘true’ (L0) and any of the four ‘noisy’ (L1–L4) training sets. This result was expected given that the ER status of these samples was unclear even when the true data were used.

3.4 Dealing with mislabeling
Results from the previous section showed the sensitivity of ER status prediction for samples in the validation set when mislabeling is present in the training data even at a low rate. Thus, it is of crucial importance for disease diagnosis using microarray data to establish a procedure robust to potential errors in the training set. Using the method presented in this study, the results indicated that mislabeling in the training set could be dealt with or at least have its effects reduced. Figure 2A and B shows the ER status classification based on the fitted regression for the training samples (Fig. 2A) and the predictive probability of each sample in the validation set (Fig. 2B) when two training samples were mislabeled (marked with ‘M’). For the training set, all samples except the two mislabeled were correctly placed into their respective ER classes. However, for the validation set only 7 out of 19 samples were correctly classified with acceptable uncertainty. When the proposed method was applied to the same training data, the two mislabeled sample were corrected (Fig. 2C) and all training samples were correctly classified. Furthermore, 14 out of the 19 validation samples had their ER status resolved (Fig. 2D) including all samples with certain ER status (greens and blacks) and even the samples for which the clinical tests conflicted. Based on the results in Figure 2A–D, the proposed method was not only able to correct the ER status for mislabeled samples in the training set but also more importantly was effective in improving the prediction of ER status for samples in the validation set. The high uncertainty for samples 11, 14, 40, 43 and 47 was due to their irregular gene expression pattern (West et al., 2001; Pittman et al., 2004) and no change was observed using the proposed method.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 2 (AD) The regression fit and prediction profiles when two training samples had their ER status mislabeled. Model-1: misclassification was present in the training set but was not addressed. Model-2: misclassification was addressed using the proposed method. Numbers in the plots indicate sample ID. Green marks (Black marks) indicate samples with ER+ (ER–) status as determined by clinical tests. Orange marks indicate samples for which the two clinical tests conflicted. The vertical dot lines indicate the 90% HPD of the classification probabilities. The two mislabeled samples were number 1 and 36 (marked with blue ‘M’). In plot C, sample 1 (sample 36) is hidden within the ER+ (ER–) group.

 
A similar trend was observed when four samples in the training set were mislabeled (Fig. 3). When the training data were assumed as correct (Fig. 3A), not only was the ER status of the mislabeled samples unresolved but also the regression fit of the non-mislabeled samples had high uncertainty, although their ER status was correctly classified. For the validation set, almost all the samples were incorrectly and/or ambiguously classified (Fig. 3B). However, using the proposed method, all samples in the training set, including the mislabeled samples were correctly classified (Fig. 3C). Furthermore, the majority of validation samples (14 out of 19) were classified with low uncertainty and the pattern was similar to what was obtained when the true training data were used (Fig. 3D). These results clearly indicate that, at ~14% misclassification rate in the training set, the proposed method had performed extremely well. In fact, even at a mislabeling rate of 20% (results not shown), the proposed procedure showed strong capacity for fixing potential mislabeled observation(s) and for predicting the ER status of validation samples with high accuracy.


Figure 3
View larger version (20K):
[in this window]
[in a new window]
 
Fig. 3 (AD) The regression fit and prediction profiles when four training samples had their ER status mislabeled. The four misclassified samples were number 5, 27, 30 and 42 (marked with blue ‘M’). In plot C, samples 5 and 30 are hidden within ER+ group and samples 27 and 42 are hidden within ER– group. Everything else is as defined in Figure 2.

 
3.5 Dealing with more practical data
In the previous sections, the ER status of all training samples was confirmed by both clinical and statistical tests, unless mislabeling was artificially introduced. However in a more realistic and practical environment, the training data may not be as reliable. In fact, some samples may not have their ER status completely resolved (classification probability near 0.5) and/or the association between their binary status and gene expression profiling may follow an irregular pattern. Thus with field data, similar samples as well as samples with mislabeled ER status could be jointly present in the training set. In order to test the robustness of the proposed procedure to deal with this kind of field data, four samples from GV1 group (11, 16, 40 and 43) were moved from the validation to the training set. Furthermore, two samples (32 and 37) had their ER status mislabeled. Figure 4 illustrates the regression patterns and prediction profiles in the training and validation datasets, assuming the training was correct (Fig. 4A and B) and using the proposed method (Fig. 4C and D). Assuming the training data were correct, the ER status of the two mislabeled samples (32 and 37) was not resolved (Fig. 4A) and some samples (16, 34, 40, 43) had a relatively higher uncertainty attached to their regression fit. For the validation set, the classification resolution was as bad, if not worse, as when only two mislabeled samples were included into the training set (Fig. 2B). This is not surprising given the added noise to the training set in this case. Using the proposed procedure, the two mislabeled samples were correctly classified (Fig. 4C). However, samples 16 and 40 were classified as ER+ which is contrary to the clinical tests, but were in agreement with the predicted status for these two samples by West et al. (2001). Given the robustness for the proposed method in dealing with misclassification, it seems that these two samples had some outlying behavior, and the gene expression profiling was more similar to ER+ tumors. Thus, the conflict between the clinical and statistical tests could be due to failed hybridizations and/or inaccurate clinical results for unknown reasons; thus leading to mislabeling of these samples. More importantly however, the proposed method was able to ascertain the ER status of the majority of the validation samples (Fig. 4D).


Figure 4
View larger version (19K):
[in this window]
[in a new window]
 
Fig. 4 (AD) The regression fit and prediction profiles when the training set included two samples with artificially mislabeled ER status and four samples with unclear ER status. The two mislabeled samples were number 32 and 37 (marked with blue ‘M’). In plot C, sample 32 (sample 37) is hidden in the ER+ (ER–) group. The four samples with unclear ER status were number 11, 16, 40 and 43. In plot C, samples 16 and 40 are hidden in the ER+ group. Everything else is as defined in Figure 2.

 

    4 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 
For the prediction of ER status using gene expression profiling, mislabeling or misclassification among training samples negatively influenced the selection of most discriminative genes, and consequently the structure of the fitted model. As a result, validation samples, which have been classified with low uncertainty using the correct training data, had unsolved ER status and even were incorrectly classified. To deal with the problem, an iterative procedure was proposed. It is an extension of West et al. (2000) binary regression method and relies on identifying and then switching the binary outcome of the observation with the highest misclassification probability in each round. Furthermore, the most discriminative genes used in the next round were selected based on the ‘new’ data. This allowed the selected classification model to move, each round, closer to the true one at low risk and without wasting any information.

The procedure proposed in this paper showed strong capacity for dealing with potential mislabeling in the training set. In fact, all mislabeled samples in the training set were identified and corrected. More importantly, the proposed method was able to predict the ER status of validation samples using a mislabeled training set with similar accuracy as if the true training set was used. Furthermore, even in a more realistic scenario when the training set contained mislabeled observations and samples with irregular expression patterns, the proposed procedure performed well in the training and validation datasets. The results suggest that the proposed method is adequate and effective for practical applications using microarray gene expression profiling for disease diagnostics and classification.

In order to test the robustness of the proposed procedure, it was also applied to the frequently used acute myeloid leukemia and acute lymphoblastic leukemia dataset (Golub et al., 1999). Based on our preliminary results, it seems that samples 2 and 67 in the original data are likely to be mislabeled. In part, these findings are in agreement with the results of Golub et al. (1999) in which sample 67 was considered as unpredictable.


    Acknowledgments
 
The authors are grateful to Professor Mike West, Institute of Statistics and Decision Sciences at Duke University, for providing additional information about the data. Special thanks to the reviewers for their constructive comments and suggestions.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alvis Brazma

Received on June 15, 2005; revised on August 25, 2005; accepted on October 20, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 A REAL DATA-BASED...
 4 CONCLUSIONS
 REFERENCES
 

    Albert, J. and Chib, S. (1993) Bayesian analysis of binary polychotomous response data. J. Am. Stat. Assoc, . 88, 669–670[CrossRef].

    Alizadeh, A.A., et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403, 503–511[CrossRef][Medline].

    Antonov, A.V., et al. (2004) Optimization models for cancer classification: extracting gene interaction information from microarray expression data. Bioinformatics, 20, 644–652[Abstract/Free Full Text].

    BØ, T.H. and Jonassen, I. (2002) New feature subset selection procedures for classification of expression profiles. Genome Biol, . 3, research0017.1–0017.11.

    Casella, G. and Berger, R.L. Statistical Inference, (2001) , Duxbury Thomson Learning.

    Chen, M.-H., Shao, Q.-M., Ibrahim, J.G. Monte Carlo Methods in Bayesian Computation, (2000) , NY Springer, pp. 313–332.

    Dudoit, S., et al. (2002) Comparison of discrimination methods for classification of tumors using gene expression data. J. Am. Statist Assoc, . 97, 77–87[CrossRef][ISI].

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression prediction. Science, 286, 531–537[Abstract/Free Full Text].

    Johnson, V.E. and Albert, J.H. Ordinary Data Model, (1999) , NY Springer.

    Khan, J., et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med, . 7, 673–679[CrossRef][ISI][Medline].

    Li, H. and Hong, F. (2001) Cluster-Rasch models for microarray expression data. Genome Biol, . 2, research 0031.1–0031.13.

    Pittman, J., et al. (2004) Bayesian analysis of binary prediction tree models for retrospectively sampled outcomes. Biostatistics, 5, 587–601[Abstract].

    Ramaswamy, S., et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl Acad. Sci. USA, 98, 15149–15154[Abstract/Free Full Text].

    Rekaya, R., et al. (2001) Threshhold model for misclassified binary responses with applications to animal breeding. Biometrics, 57, 1123–1129[Medline].

    Ross, M.E., et al. (2003) Classification of pediatric acute lymphoblastic leukemia by gene expression profiling. Blood, 102, 2951–2959[Abstract/Free Full Text].

    Spang, R., et al. (2002) Prediction and uncertainty in the analysis of gene expression profiles. Silico Biol, . 2, 369–381.

    Statnikov, A., et al. (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 21, 631–643[Abstract/Free Full Text].

    Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567–6572[Abstract/Free Full Text].

    Tusher, V.G., et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].

    West, M., Nevins, J.R., Marks, J.R., Spang, R., Blanchette, C., Zuan, H. (2000) DNA microarray data analysis and regression modeling for genetic expression profiling. Technical Report 00-15. Institute of Statistics and Decision Sciences, Duke University.

    West, M., et al. (2001) Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA, 98, 11462–11467[Abstract/Free Full Text].

    West, M. (2003) Bayesian factor regression models in the ‘Large p, Small n’ paradigm. Bayesian Statistics, 7, 723–732.

    Yeoh, E.-J., et al. (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1, 133–143[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
22/3/317    most recent
bti738v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zhang, W.
Right arrow Articles by Bertrand, K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zhang, W.
Right arrow Articles by Bertrand, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?