Bioinformatics Advance Access originally published online on October 22, 2007
Bioinformatics 2007 23(24):3374-3381; doi:10.1093/bioinformatics/btm497
Boosting multiclass learning with repeating codes and weak detectors for protein subcellular localization
1Faculty of Life Sciences and Institute of Genomes, National Yang-Ming University, Taipei, 2Brain Research Center, University System of Taiwan, Hsin-Chu, 3Department of Biomedical Engineering, Chung Yuan Christian University, Jhongli, 4Institute of Information Science, Academia Sinica, Taipei, Taiwan and 5Department of Cell Biology/Biophysics, EMBL Heidelberg, 69117 Heidelberg, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Determining locations of protein expression is essential to understand protein function. Advances in green fluorescent protein (GFP) fusion proteins and automated fluorescence microscopy allow for rapid acquisition of large collections of protein localization images. Recognition of these cell images requires an automated image analysis system. Previous work concentrated on designing a set of optimal features and then applying standard machine-learning algorithms. Recent advances in machine learning and computer vision can be applied to improve the performance. One trend is the progress in multiclass learning with error-correcting output codes (ECOC). Another trend is the use of a large number of weak detectors with boosting for detecting objects in images of real-world scenes.
Results: We take advantage of these advances to propose a new learning algorithm, AdaBoost.ERC, coupled with weak and strong detectors, to improve the performance of automatic recognition of protein subcellular locations in cell images. We prepared two image data sets of CHO and Vero cells and downloaded a HeLa cell image data set in the public domain to evaluate our new method. We show that AdaBoost.ERC outperforms other AdaBoost extensions. We demonstrate the benefit of weak detectors by showing significant performance improvements over classifiers using only strong detectors. We also empirically test our method's capability of generalizing to heterogeneous image collections. Compared with previous work, our method performs reasonably well for the HeLa cell images.
Availability: CHO and Vero cell images, their corresponding feature sets (SSLF and WSLF), our new learning algorithm, AdaBoost.ERC, and Supplementary Material are available at http://aiia.iis.sinica.edu.tw/
Contact: chunnan{at}iis.sinica.edu.tw
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Localization of proteins in living cells is directly related to their functions. It has been shown that mis-localization of proteins correlates with several diseases. Analyzing protein localization for thousands of genes is a tremendous task, so the development of an automatic, large-scale analysis method is important. There are two approaches available for protein subcellular localization. One approach is prediction based on protein sequences (Eisenhaber and Bork, 1998; Garrels, 1996; Nakai and Horton, 1999). The accuracy of these predictions ranges from 60% to 80% for benchmark data sets of proteins whose localizations are known in advance. This approach is inherently limited by the available training data and still needs experimental confirmation. The other approach is cell imaging. Since GFP technology makes proteins fluorescent, and construction of expression vectors, transfection and cell imaging can be automated, large-scale subcellular localization of GFP-tagged fusion proteins can be accomplished in practice. Many groups have been using cell imaging to determine subcellular localization and have established image-based protein localization databases, e.g. LIFEdb and the Yeast Protein Localization Database (YPLDB) (Bannasch et al., 2004; Habeler et al., 2002; Simpson et al., 2000). However, human classification of fluorescence cell micrographs is still subjective, time consuming and dependent on expertise. Therefore, to minimize inconsistency and ambiguity, systematic determination of protein subcellular locations from fluorescence microscopy images is required.
Over the past decade, many machine-learning methods to automate the determination of subcellular location from fluorescence microscope images have been developed (Boland and Murphy, 2001; Boland et al., 1997, 1998; Conrad et al., 2004; Huang and Murphy, 2004b). These methods have been shown to convincingly outperform human examination. Generally speaking, the dominant strategy has been to search for an optimal set of features that can discriminate different classes of subcellular structures based on knowledge from cell biology or image processing. In addition to features that can be extracted from protein images, features from parallel corresponding DNA images and other channels were considered. Then, different combinations of feature sets are tested with standard machine-learning algorithms designed mainly for binary classification. In this article, we present an alternative strategy that takes advantage of recent advances in machine learning and computer vision. In machine learning, we present AdaBoost.ERC, a new multiclass learning algorithm that combines AdaBoost with a randomly generated code to handle multiclass learning problems. In computer vision, we propose to use a large set (on the order of thousands) of randomly extracted features called weak detectors (Murphy et al., 2006; Sudderth et al., 2005) to capture subtle characteristics beyond visual discrimination and to complement traditional knowledge-based strong detector features. Experimental results show that AdaBoost.ERC and weak detectors always significantly improve the performance of classifiers using only the strong detector features.
| 2 REVIEW OF PREVIOUS WORK |
|---|
We briefly review previous work in automatic protein subcellular localization and extensions of AdaBoost for multiclass learning.
2.1 Automatic protein subcellular localization
Realization of large-scale protein subcellular localization by high-throughput microscopy requires a variety of advanced methods including cell labeling, image acquisition, image processing and pattern classification (Glory and Murphy, 2007). This section focuses on previous work in image processing and pattern classification.
Murphy and his colleagues have been making pioneering contributions to this problem for a decade (Boland and Murphy, 2001; Boland et al., 1997, 1998; Huang and Murphy, 2004b). They have been using large collections of HeLa cell immunofluorescence images containing 10 distinct subcellular location patterns as their benchmark. In search of an optimal feature set, they have developed 13 sets of subcellular location features (SLF) for 2D images. The feature sets consist of different combinations of a wide variety of feature types, including morphological, edge, texture, geometric, moment and wavelet features. They also included features extracted from a reference DNA channel to improve the accuracy of their classifiers. With the DNA channel, it is easy to detect the boundary of the nucleus, which simplifies the problem somewhat, but at the cost of an additional DNA stain. In contrast, our new method can achieve high average accuracies without the DNA channel. Recently, they have proposed new approaches to alleviate this problem (Chebira et al., 2007; Huang and Murphy, 2004a).
Another research group that has also been making contributions to this problem is the team from EMBL (Bannasch et al., 2004; Conrad et al., 2004; Simpson et al., 2000). As described in Conrad et al. (2004), they captured a collection of 2 182 images containing 11 different classes of subcellular locations. They compared the performance of two well-known machine-learning algorithms with three feature selection methods. The performance of their best classifier was 82.6%.
2.2 Boosting for multiclass learning
The AdaBoost algorithm (Freund and Schapire, 1995; Schapire, 1999) is one of the most popular machine-learning algorithms for binary classification. AdaBoost works by repeatedly applying a weak learner to learn a weak classifier from reweighted training examples at each iteration. AdaBoost has been extended to handle multiclass learning by incorporating the idea of error-correcting output codes (ECOC) (Dietterich and Bakiri, 1995). Unlike other general strategies such as one-against-all and one-against-one, ECOC can take advantage of the error-correcting property of codes originally designed for noise tolerance in digital signal transmission. ECOC for multiclass learning works as follows. Let T be the length of a codeword and K be the number of classes. We can create an error-correcting output coding matrix $M \in \{-1, +1\}^{K \times T}$ such that each column corresponds to a binary partitioning of the K classes. M(k, ·), the k-th row of M, is the codeword for class k. The sign of the t-th bit in M(k, ·) indicates whether class k is in the positive or negative partition of the t-th partitioning. Then, for each column M(·, t), we train a binary classifier to classify the data as defined in the t-th partitioning. In the end, we obtain T binary classifiers $h_1, \ldots, h_T$. An unseen example is classified as class k if the Hamming distance between M(k, ·) and the output of the T classifiers, a $\{-1, +1\}$ codeword of length T, is the smallest among all K classes.
The error-correcting property of ECOC emerges when any pair of rows in M differs in at least d bits, which implies that even if as many as $\lfloor (d-1)/2 \rfloor$ binary classifiers misclassify, the closest match still corresponds to the correct class (Guruswami and Sahai, 1999). There are many ways to construct a coding matrix with the error-correcting property. For example, we can pick a subset of a recursively defined Hadamard matrix. A great deal of effort has been devoted to choosing an optimal coding matrix independent of the underlying data set. However, James (1998) showed that matrices with random codes produced error rates as low as those of the best optimal matrices and proposed to learn the matrix from the training data.
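The decoding rule and the error-correcting bound described above can be illustrated with a minimal sketch. The 4-class, 7-bit coding matrix below is a hand-picked, hypothetical example (it is not taken from the article), and the function names are illustrative only.

```python
import numpy as np

def hamming_distance(codeword, outputs):
    """Number of positions where two {-1, +1} vectors disagree."""
    return int(np.sum(codeword != outputs))

def min_row_distance(M):
    """Smallest pairwise Hamming distance d between the rows of M."""
    K = M.shape[0]
    return min(hamming_distance(M[a], M[b])
               for a in range(K) for b in range(a + 1, K))

def ecoc_decode(M, outputs):
    """Index of the row of M that is closest to the classifier outputs."""
    distances = [hamming_distance(row, outputs) for row in M]
    return int(np.argmin(distances))

# Hypothetical coding matrix: K = 4 classes (rows), T = 7 partitionings (columns).
M = np.array([
    [+1, +1, +1, +1, +1, +1, +1],
    [-1, +1, -1, +1, -1, +1, -1],
    [+1, -1, -1, +1, +1, -1, -1],
    [-1, -1, +1, +1, -1, -1, +1],
])

d = min_row_distance(M)       # every pair of rows differs in 4 bits
correctable = (d - 1) // 2    # floor((d - 1) / 2) binary errors can be tolerated

# Simulate one binary classifier making a mistake on an example of class 2.
noisy_outputs = M[2].copy()
noisy_outputs[0] *= -1
decoded = ecoc_decode(M, noisy_outputs)
```

With this matrix, a single wrong bit still decodes to class 2, in line with the bound quoted from Guruswami and Sahai (1999).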
ECOC can be applied to extend any learner of binary classifiers to multiclass learning problems, but it is especially suitable for AdaBoost because each weak classifier in AdaBoost can be considered as a column in M (Schapire, 1997). Many AdaBoost extensions based on ECOC have been proposed, including AdaBoost.ECC (Guruswami and Sahai, 1999), AdaBoost.ERP (Ling, 2006) and JointBoost (Torralba et al., 2004). Among them, AdaBoost.ECC can use any coding matrix; AdaBoost.ERP is also independent of the choice of coding matrix but dynamically adjusts the matrix according to the training error rates; and JointBoost proactively learns an optimal coding matrix from the training data. At each iteration, JointBoost picks a subset of classes to share a classifier that maximally reduces the error on the weighted training examples for all the classes.
| 3 METHODS |
|---|
Our solution of pattern recognition for protein subcellular localization consists of a novel learning algorithm based on the idea of combining AdaBoost and ECOC and a novel set of features that combine both strong and weak detectors. In this section, we start by deriving our learning algorithm and our features. Then, we describe how we captured the images for our experimental evaluation.
3.1 Boosting with repeating codes
Given a measure $D = \{(x_i, y_i)\} \subseteq \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is an arbitrary input space and $\mathcal{Y} = \{1, \ldots, K\}$ is the label space, multiclass learning problems involve finding an unknown function H such that the value of $E_{(x,y) \sim D}[H(x) \neq y]$ is minimum. We consider ECOC to solve multiclass learning problems. Again, let M be the coding matrix. Given an unseen example x, an ECOC classifier generates an ensemble output $H(x) = (h_1(x), \ldots, h_T(x))$ from T binary classifiers. The Hamming distance between H(x) and M(k, ·), the k-th row of M, is defined as
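$$\Delta(M(k), H(x)) = \sum_{t=1}^{T} \frac{1 - M(k,t)\, h_t(x)}{2}, \qquad (1)$$

and x is assigned to the class whose codeword is closest to H(x):

$$\hat{y} = \arg\min_{k} \Delta(M(k), H(x)). \qquad (2)$$

For an example (x, y) to be classified correctly, $\Delta(M(y), H(x))$ must be smaller than $\Delta(M(k), H(x))$ for any $k \neq y$.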
It has been shown that a classifier's error rate is related to a quantity called margin. In our case, margin for an example (x, y) and class k is defined as follows (Sun et al., 2005):
|
| (3) |
The idea is that a learning algorithm may achieve a low error rate if it maximizes the margins of training examples. Based on this idea, AdaBoost.ECC and AdaBoost.ERP both try to optimize an exponential objective function based on the margins
|
| (4) |
|
| (5) |
where $\epsilon_t$ is the error rate of classifier $h_t$. Both AdaBoost.ECC and AdaBoost.ERP try to maximize the negative gradient, because moving along the negative gradient minimizes the objective function $C_M(H)$. Two terms determine the value of the negative gradient: $U_t$ and $\epsilon_t$. The larger $U_t$ is, the stronger the error-correcting ability, while $\epsilon_t$ is the error rate of the weak classifier trained on each column of M. To optimize the objective function, we should both maximize $U_t$ and minimize $\epsilon_t$. Unfortunately, empirical studies have shown that there is a trade-off between $U_t$ and $\epsilon_t$ (Ling, 2006; Schapire, 1997).
Consider the problem of how to maximize Ut. Ut can be rewritten as
|
| (6) |
Instead of maximizing $U_t$, we propose a more straightforward method, called AdaBoost.ERC, i.e. AdaBoost with Error-correcting Repeating Codes, which attempts to minimize $\epsilon_t$ directly. Unlike AdaBoost.ERP, AdaBoost.ERC simply repeats a randomly generated binary partitioning as the columns in the coding matrix. For each random partitioning, AdaBoost.ERC applies AdaBoost to reweight its weak classifier for each repeated column. Therefore, even if a random partitioning creates a hard binary classification problem for the weak learner, we can still decrease the error rate, and thus increase the margin, by increasing the number of repeated columns. The algorithm of AdaBoost.ERC is given in Figure 1.
By repeating codes with AdaBoost to minimize error rates, AdaBoost.ERC has the advantage that the codeword length required to achieve a certain level of accuracy is shorter than that of competing algorithms. In our preliminary study, we compared AdaBoost.ERC with other AdaBoost-ECOC extensions (Lin and Hsu, 2006). The experimental results showed that AdaBoost.ERC significantly outperformed those extensions for all test data sets. Its performance continued to improve as the number of repeats increased.
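The idea of repeating a partitioning and boosting a decision stump on it can be sketched as follows. This is only one simplified reading of the description in this section, not the algorithm of Figure 1: it assumes that each partitioning gets its own plain binary AdaBoost run of `repeats` rounds and contributes one bit to the codeword, and it ignores any weight sharing across partitionings that the actual algorithm may use. All function names are hypothetical, and the coding matrix is fixed by hand here for reproducibility, whereas the article draws partitionings at random.

```python
import numpy as np

def fit_stump(X, y, w):
    """Decision stump with the lowest weighted error for labels y in {-1, +1}."""
    best = (np.inf, 0, 0.0, 1)  # (error, feature, threshold, polarity)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            pred = np.where(X[:, f] <= thr, -1, 1)
            err = float(np.sum(w[pred != y]))
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if e < best[0]:
                    best = (e, f, thr, polarity)
    return best

def stump_predict(X, f, thr, polarity):
    return polarity * np.where(X[:, f] <= thr, -1, 1)

def adaboost_binary(X, y, rounds):
    """Plain binary AdaBoost; returns (alpha, feature, threshold, polarity) tuples."""
    w = np.full(len(y), 1.0 / len(y))
    ensemble = []
    for _ in range(rounds):
        err, f, thr, pol = fit_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # keep alpha finite
        alpha = 0.5 * np.log((1.0 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, f, thr, pol))
        w /= w.sum()
        ensemble.append((alpha, f, thr, pol))
    return ensemble

def boosted_bit(X, ensemble):
    """Sign of the weighted vote of the stumps learned for one partitioning."""
    score = sum(a * stump_predict(X, f, thr, pol) for a, f, thr, pol in ensemble)
    return np.where(score >= 0, 1, -1)

def fit_repeating_code(X, labels, M, repeats):
    """One boosted learner per column of M, each built from `repeats` stumps."""
    return [adaboost_binary(X, M[labels, t], repeats) for t in range(M.shape[1])]

def predict_repeating_code(X, M, learners):
    H = np.column_stack([boosted_bit(X, e) for e in learners])  # shape (n, T)
    dist = (H[:, None, :] != M[None, :, :]).sum(axis=2)         # shape (n, K)
    return dist.argmin(axis=1)

# Toy 1-D data with three classes and a hand-picked 3 x 2 coding matrix.
X = np.array([[0.0], [0.5], [2.0], [2.5], [4.0], [4.5]])
labels = np.array([0, 0, 1, 1, 2, 2])
M = np.array([[-1, -1],
              [+1, -1],
              [+1, +1]])
learners = fit_repeating_code(X, labels, M, repeats=3)
train_pred = predict_repeating_code(X, M, learners)
new_pred = predict_repeating_code(np.array([[0.2], [2.2], [4.2]]), M, learners)
```

In this toy setting both partitionings are separable by a single stump, so extra repeats add nothing; the repeats matter when a random partitioning yields a binary problem that one stump cannot solve.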
3.2 Features
We extract two types of features from raw cell images to serve as the input of our classifier. Strong detectors are knowledge-driven features that are supposed to provide strong hints for classification, while weak detectors are randomly extracted patterns to allow the classifier to consider subtle characteristics of each class. We consider a feature as a detector because AdaBoost.ERC learns a decision stump as its weak classifier, which is a decision tree with only one feature.
3.2.1 Strong detectors
The raw cell images pass through a series of processing steps before feature extraction. The image size is first normalized to 500 x 500 by bilinear rescaling. The color band (red, green or blue) of the image is detected and used for gray-level conversion. As an optional preprocessing step, a non-linear sigmoid function (Wen et al., 2001) was applied to adjust the contrast of the gray-level image. The sigmoid function is defined as $1/(1 + \exp\{-g(N - t)\})$, where $1 \le g \le 10$ is the gain, $0 \le t \le 1$ is a threshold and N is the normalized image. In our experiments, g and t were set to 6 and 0.6, respectively.
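The sigmoid contrast adjustment can be written in a few lines. The sketch below assumes the image has already been normalized to the range [0, 1]; the function name is illustrative, and the defaults are the gain and threshold reported above.

```python
import numpy as np

def sigmoid_contrast(normalized_image, gain=6.0, threshold=0.6):
    """Apply 1 / (1 + exp(-g (N - t))) element-wise to a normalized image N."""
    N = np.asarray(normalized_image, dtype=float)
    return 1.0 / (1.0 + np.exp(-gain * (N - threshold)))

# Gray levels below the threshold are pushed toward 0, those above toward 1.
levels = np.array([0.0, 0.3, 0.6, 0.9, 1.0])
adjusted = sigmoid_contrast(levels)
```

A pixel exactly at the threshold maps to 0.5, and the mapping is monotonic, so the ordering of gray levels is preserved.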
A search algorithm is then used to identify and remove fragmented, incomplete cells on the edges of the image. Figure 1 in the Supplementary Material shows the flowchart of the search algorithm. Then, we apply an automatic Region-Of-Interest (ROI) selection procedure. The pixel values of each row are accumulated to form the vertical gray-level profile. Likewise, the pixel values of each column are accumulated to form the horizontal gray-level profile. By setting an appropriate threshold on each profile, the boundary of a rectangular ROI can be detected. Figure 2 in the Supplementary Material shows an example of ROI detection.
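The profile-based ROI selection can be sketched as below. The article does not state how the threshold is chosen, so the `fraction` parameter (a fraction of each profile's maximum) is an assumption made purely for illustration, as is the function name.

```python
import numpy as np

def roi_from_profiles(gray, fraction=0.1):
    """Bounding rectangle (top, bottom, left, right) from gray-level profiles.

    `fraction` is a hypothetical threshold, expressed relative to the maximum
    of each profile. Returns None when no row or column exceeds it.
    """
    gray = np.asarray(gray, dtype=float)
    row_profile = gray.sum(axis=1)  # one value per row: vertical profile
    col_profile = gray.sum(axis=0)  # one value per column: horizontal profile
    rows = np.flatnonzero(row_profile > fraction * row_profile.max())
    cols = np.flatnonzero(col_profile > fraction * col_profile.max())
    if rows.size == 0 or cols.size == 0:
        return None
    return int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1])

# Synthetic 10 x 10 image with one bright rectangular region.
image = np.zeros((10, 10))
image[3:6, 2:8] = 1.0
roi = roi_from_profiles(image)
```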
Next, we apply the Top-hat and Bottom-hat morphological filters to reduce large, high gray-level clusters and to enhance the edges of subcellular structures (Movafeghi et al., 2004). To mimic progressive visual perception, before applying the morphological filters we convert the image to bi-leveled images using four thresholds: 0.1, 0.3, 0.5 and 0.7. After applying the morphological filters, we likewise convert the resulting images using the thresholds 0.2, 0.4, 0.6 and 0.8. This yields eight bi-leveled images that are subjected to the feature extraction process.
To identify subcellular vesicles, our strong detectors consist of both geometric and texture features. Table 1 in the Supplementary Material gives the list of geometric features. The texture features are extracted based on the gray-level co-occurrence matrix (GLCM) method proposed by Haralick (1979). Twelve GLCMs, with distances 1, 2 and 10 and angles 0°, 45°, 90° and 135°, are computed from the bi-leveled images. Then, various co-occurrence quantities including entropy, energy, contrast, homogeneity and correlation are evaluated from each co-occurrence matrix to produce our texture features. The definitions of these quantities are given in Table 2 in the Supplementary Material.
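A co-occurrence matrix for a bi-leveled image and a few of the quantities named above can be computed as in the following sketch. The formulas used for entropy, energy, contrast and homogeneity are the commonly used Haralick-style definitions and are an assumption here, since the article's own definitions are in the Supplementary Material; correlation is omitted for brevity. Function and variable names are illustrative.

```python
import numpy as np

# (row, col) unit steps for each angle; rows grow downward in image coordinates.
UNIT_STEPS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, distance, angle, levels=2):
    """Normalized co-occurrence matrix of an integer-valued image."""
    img = np.asarray(image, dtype=int)
    dr, dc = (distance * s for s in UNIT_STEPS[angle])
    rows, cols = img.shape
    r0, r1 = max(0, -dr), min(rows, rows - dr)
    c0, c1 = max(0, -dc), min(cols, cols - dc)
    counts = np.zeros((levels, levels))
    if r0 >= r1 or c0 >= c1:  # offset does not fit inside the image
        return counts
    ref = img[r0:r1, c0:c1]                        # reference pixels
    nbr = img[r0 + dr:r1 + dr, c0 + dc:c1 + dc]    # their neighbours at the offset
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    return counts / counts.sum()

def texture_features(P):
    """Assumed Haralick-style quantities computed from a normalized GLCM."""
    i, j = np.indices(P.shape)
    nonzero = P[P > 0]
    return {
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
        "energy": float(np.sum(P ** 2)),
        "contrast": float(np.sum(P * (i - j) ** 2)),
        "homogeneity": float(np.sum(P / (1.0 + np.abs(i - j)))),
    }

# The twelve (distance, angle) settings mentioned in the text.
settings = [(d, a) for d in (1, 2, 10) for a in (0, 45, 90, 135)]

# A 4 x 4 checkerboard: horizontal neighbours always differ, diagonal ones never do.
board = np.indices((4, 4)).sum(axis=0) % 2
horizontal = texture_features(glcm(board, 1, 0))
diagonal = texture_features(glcm(board, 1, 45))
```

On the checkerboard, the contrast is 1 for the horizontal offset and 0 for the diagonal offset, which shows how the same image yields different texture values at different angles.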
The above method extracts a total of 155 geometric features and 500 texture features from a cell image. We conducted a backward stepwise discriminant analysis to select the 134 most discriminant features as our final set of strong detectors.
3.2.2 Weak detectors
The weak detectors for each image are extracted in four steps as follows:
- Randomly select five images of each class as templates.
- Randomly extract a set of five fragments from each template. The fragments vary in size from 9 x 9 to 25 x 25.
- Convolve each fragment i with a set of four filters: the identity (which keeps the original image), the x derivative, the y derivative and a Laplacian filter.
- For a given training or test image, apply normalized cross-correlation between the image and fragment i to find where fragment i occurs, and record the maximum correlation as the i-th component of the weak detector vector for that image.
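The last two steps can be sketched as follows. Several details are assumptions rather than the article's implementation: the derivative and Laplacian filters are approximated with finite differences via `np.gradient`, the same filter is applied to both the image and the fragment before correlating, and the brute-force sliding window is only practical for small arrays. All function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def filter_bank(patch):
    """Four responses: original, x derivative, y derivative, Laplacian."""
    p = np.asarray(patch, dtype=float)
    dx = np.gradient(p, axis=1)
    dy = np.gradient(p, axis=0)
    laplacian = np.gradient(dx, axis=1) + np.gradient(dy, axis=0)
    return [p, dx, dy, laplacian]

def max_ncc(image, fragment):
    """Maximum normalized cross-correlation of `fragment` over all positions."""
    windows = sliding_window_view(np.asarray(image, dtype=float), fragment.shape)
    wc = windows - windows.mean(axis=(2, 3), keepdims=True)  # zero-mean windows
    fc = fragment - fragment.mean()                          # zero-mean fragment
    num = (wc * fc).sum(axis=(2, 3))
    den = np.sqrt((wc ** 2).sum(axis=(2, 3)) * (fc ** 2).sum())
    ncc = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(ncc.max())

def weak_detector_vector(image, fragments):
    """One component per (fragment, filter) pair."""
    image_responses = filter_bank(image)
    vector = []
    for fragment in fragments:
        for img_f, frag_f in zip(image_responses, filter_bank(fragment)):
            vector.append(max_ncc(img_f, frag_f))
    return np.array(vector)

# A fragment cut from the image itself correlates perfectly at its own position.
rng = np.random.default_rng(0)
image = rng.random((30, 30))
fragment = image[5:14, 7:16]  # a 9 x 9 fragment
features = weak_detector_vector(image, [fragment])
```

Each fragment contributes four components here, one per filter, which matches the fragment-times-filter counting used for the weak detector sets later in the article.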
Since the fragments are extracted from the ROI as described in Section 3.2.1, some of them may be extracted from the background. In the experiments described in this article, we simply identified those fragments manually and extracted replacements to ensure that all fragments come from the cells. An automatic cell segmentation method can be applied to avoid this problem, but it is beyond the scope of this article.
3.3 Preparation of fluorescence microscope images
We created a collection of 815 2D images of CHO cells with eight distinct subcellular location patterns in Taiwan to test the performance of our classifier. We stained our CHO cells with specific fluorescent subcellular probes (nucleus: Hoechst 33342) or by overexpressing fluorescent fusion proteins (actin: EYFP-Actin; peroxisome: ECFP-Peroxi; ER: EYFP-ER; microtubule: EYFP-Tub; Golgi: EYFP-Golgi; nucleolus: EYFP-Nuc and mitochondrion: DsRed-Mito) to label particular subcellular compartments. Forty-eight hours after transfection in carbonate-free culture medium equilibrated with 10 mM HEPES, we acquired the images of subcellular structures with a Zeiss Axiovert 25CF microscope with a 100x NA 1.3 oil EC Plan-Neofluar® objective using DAPI, CFP or YFP filters. Images were captured with an Axiocam CCD camera (color) using the Axiovision software (Zeiss, Jena, Germany).
We have another collection of 4 439 2D images of Vero cells from EMBL in Germany. Vero cells were imaged at 16, 24 and 40 h after transfection in carbonate-free culture medium equilibrated with 10 mM HEPES on a Leica DM/IBRE microscope with a 63x NA 1.4 PL Apo objective using custom-designed CFP and YFP filters. Images were captured with a Hamamatsu CCD camera (ORCA 1) using the Openlab 2.0 software (Improvision, Coventry, UK) (Simpson et al., 2000).
In order to compare the results, we focused on eight classes of subcellular localization structures (nucleus, actin, peroxisome, ER, microtubule, Golgi, nucleolus and mitochondrion) shared by both data sets and removed images of the other classes. In the end, we selected 662 CHO cell images and 1451 Vero cell images as our benchmark data sets.
We also downloaded the 2D HeLa image data set from Murphy's Lab (see footnote 1). Figure 3 shows some example images from the CHO, Vero and HeLa data sets. Notice that the HeLa images have been processed so that each image contains exactly one cell (Boland and Murphy, 2001), while CHO and Vero images may contain more than one cell.
Huang and Murphy (2004b) describe how they achieved 92.3% accuracy for the HeLa images. Their classifier is an unweighted majority-voting ensemble consisting of eight classifiers trained by state-of-the-art classifier learning algorithms, including ANN, SVM and AdaBoost. They removed three of them to maximize the cross-validation accuracy. The feature set they used was SLF16, which consists of 47 features selected by stepwise discriminant analysis (SDA) from a set of 180 features (see footnote 2). Among them, six are DNA-channel features. We obtained this 180-feature set and refer to it as 180-FS when we compare the performance of our methods with theirs in the following sections. Table 1 summarizes the statistics and available feature sets of the three data sets used in our experiments.
| 4 RESULTS AND DISCUSSION |
|---|
This section reports the results of our experimental demonstration of the effectiveness of our new method. To distinguish different types of features, we refer to the strong detectors described in Section 3.2.1 as Strong Subcellular Location Features (SSLF) and the weak detectors described in Section 3.2.2 as Weak Subcellular Location Features (WSLF) in the following discussion.
4.1 Comparing multiple-class boosting algorithms
We start by comparing our new multiclass learning algorithm AdaBoost.ERC to the previous AdaBoost extensions AdaBoost.ECC, AdaBoost.ERP and JointBoost. We used the CHO images with both SSLF and WSLF as the feature sets in this experiment. To compare the performance fairly, we generated a code at random and applied it to the ECOC-based extensions: AdaBoost.ECC, AdaBoost.ERP and AdaBoost.ERC. We also specified the same codeword length for all of them so that they would share the same code in the comparison. There is no need to specify a code for JointBoost. We ran all four algorithms and measured their performance each time they used a new class partitioning, which amounts to increasing the length of the codeword for each class by one. We used the codeword length rather than other quantities, such as the number of iterations or the number of weak classifiers learned, to ensure a fair comparison of these algorithms. All algorithms use the decision stump as the weak learner.
Figure 4 shows the average test accuracy over stratified 10-fold cross-validation trials as a function of the codeword length. The result shows that our new algorithm AdaBoost.ERC outperforms AdaBoost.ECC, AdaBoost.ERP and JointBoost for the task of recognizing protein subcellular structures.
4.2 Impact of weak detectors
Next, we demonstrate the impact of weak detectors by experimentally comparing the performance of AdaBoost.ERC with different types of feature sets, namely, SSLF, WSLF and their combinations, for both CHO and Vero image sets. For the CHO images, we used three different sizes of WSLF: 800, 1 280 and 2 048, referred to as WSLF1, WSLF2 and WSLF3, respectively.
We ran AdaBoost.ERC over stratified 10-fold cross-validation trials and obtained the average classification rates with different combinations of SSLF and WSLF sets and different codeword lengths. The best performing classifier used the combination of SSLF and WSLF1 with 248 bits for the codeword, and its average classification rate reached 94.7%. Figure 5 shows the average classification rates of the classifiers using different feature sets. Comparing the three classifiers that used WSLF only, we found that the performance improves as the size of WSLF increases, but not by much. However, when we combined SSLF with WSLF, the smallest WSLF achieved the best performance. Hence, for computational efficiency, we used WSLF1 in combination with SSLF in the following experiments. WSLF1 consists of 100 weak detectors for each class.
We conducted the same experiment for the Vero images but only considered WSLF1 as our weak detector. We obtained an average classification rate of 84.8% with SSLF, 79.1% with WSLF1 and 89.1% by combining SSLF and WSLF1. Again, the experimental results show that incorporating WSLF as additional features significantly improves the classification performance.
4.3 Comparison with other subcellular localization methods
We report our experiments with the 2D HeLa images from Murphy's Lab. We have tried our best to compare our methods with the best results reported in Huang and Murphy (2004b). However, it is difficult for us to exactly reproduce their classifier ensemble with so many parameters to adjust. Also, we did not use our CHO and Vero images for the comparison because it was impossible for us to reproduce the same DNA-channel enhanced feature sets given the available resource at hand. Instead, we used 180-FS (Huang and Murphy, 2004b) as described in Section 3.3 as the strong detectors for AdaBoost.ERC and compared its performance with the best results using that feature set reported in the literature.
Since the 2D HeLa images contain 10 different classes, we generated a total of 100 weak detectors for each class to obtain 1000 features for this image data set. From these features, we randomly selected five different sizes of WSLF: 0, 200, 500, 700 and 1000. Then, we combined these WSLF sets with 180-FS as the feature sets and applied AdaBoost.ERC to classify the HeLa images. Note that we did not perform SDA to select features as in Huang and Murphy (2004b). Table 2 gives the results, which show that AdaBoost.ERC with every size of WSLF achieved a better accuracy than the best reported accuracy, even when we removed the DNA features. Moreover, without WSLF, AdaBoost.ERC still outperformed the Mixtures-of-Experts regardless of whether DNA channel features were used. The best classifier is AdaBoost.ERC with 141 bits of the codeword using a combination of 180-FS and WSLF of size 500, with its classification rate reaching 93.6%. Note that more recently, Chebira et al. (2007) achieved a better result of 95.3% for the HeLa images with a different set of features. Huang and Murphy (2004a) obtained 94.8% accuracy on a multi-cell version of the HeLa images. Both achieved their results without using the DNA channel.
Table 3 presents the confusion matrix of our best performing classifier for the 2D HeLa images. The matrix shows that most classes can be distinguished quite well, even for the classes that have been consistently confusing human inspectors, such as lysosomes and transferrin receptor (Lam/TfR) and two Golgi proteins (Gia/Gpp) (Boland and Murphy, 2001; Boland et al., 1997, 1998; Huang and Murphy, 2004b). They overlap often in subcellular localization and are difficult to distinguish even for human inspectors.
Furthermore, in order to verify the benefit of weak detectors, we conducted paired t-tests for three different image data sets as shown in Table 4, which indicates that statistically significant improvements were obtained for all three image sets. Therefore, combining weak detectors with strong detectors has a significant positive effect on the classification rate.
4.4 Generalization to heterogeneous sources
In this section, we report the results of our experiment to test whether our method can generalize to images of different cell types and microscope imaging methods. Initially, we used CHO images as training data and Vero images as test data. Not surprisingly, we obtained poor accuracy (38.9%). Conversely, training with Vero images and testing with CHO images yielded better but still poor accuracy (54.7%). Evidently, the degree of heterogeneity is too high. Therefore, we combined both CHO and Vero images as the training examples and tried again. We randomly selected 5 template images from CHO and 5 from Vero to generate a set of 10 (templates) x 5 (fragments) x 4 (filters) x 8 (classes) = 1600 weak detectors as our WSLF, larger than the set we used for homogeneous images. Figure 6 shows the results of our experiment. With training examples from both collections, AdaBoost.ERC achieved average accuracy rates of 84.2% and 79.4% for CHO and Vero images using only SSLF and only WSLF, respectively, and improved to 88.5% when using both SSLF and WSLF. When applied separately to CHO and Vero images, the average accuracy of AdaBoost.ERC is 90.9%, so the relative degradation is merely 2.6%. The results show that combining strong and weak detectors improves the performance for images with heterogeneous cell types, but it is still difficult to generalize to an unseen cell type or imaging method. Recently, Chen and Murphy (2007) conducted a similar study for 3D HeLa and NIH 3T3 cell images. They suggested some generic strategies that can be applied on top of our method to resolve this problem.
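The counts and percentages quoted in this section can be checked with a short calculation. Two points are assumptions rather than statements from the article: that the 90.9% figure is the image-count-weighted mean of the separate CHO (94.7%, 662 images) and Vero (89.1%, 1451 images) results from Sections 3.3 and 4.2, and that the 2.6% degradation is measured relative to 90.9%. Both readings happen to reproduce the reported numbers.

```python
def weak_detector_count(templates_per_class, fragments_per_template,
                        n_filters, n_classes):
    """Size of a weak detector set built from templates, fragments and filters."""
    return templates_per_class * fragments_per_template * n_filters * n_classes

homogeneous = weak_detector_count(5, 5, 4, 8)     # single-source setting (WSLF1)
heterogeneous = weak_detector_count(10, 5, 4, 8)  # 5 CHO + 5 Vero templates per class

# Assumed reading: image-count-weighted mean of the separate accuracies.
n_cho, n_vero = 662, 1451
acc_cho, acc_vero = 94.7, 89.1
separate = (n_cho * acc_cho + n_vero * acc_vero) / (n_cho + n_vero)

# Assumed reading: degradation relative to the separately trained average.
combined = 88.5
relative_drop = 100.0 * (90.9 - combined) / 90.9

print(homogeneous, heterogeneous, round(separate, 1), round(relative_drop, 1))
# prints: 800 1600 90.9 2.6
```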
| 5 CONCLUSIONS AND FUTURE WORK |
|---|
We have presented a new multiclass boosting algorithm, AdaBoost.ERC, and weak detectors for the problem of subcellular localization. A large number of weak detectors, when combined with knowledge-driven strong detectors, allow AdaBoost.ERC to recognize protein subcellular location structures with high accuracy. Our experimental results show that this method is robust and accurate for all three image data sets. It is more appropriate as a true solution for real lab applications than the ensemble method proposed in previous work. In our future work, we will try to extend our method to 3D images.
| ACKNOWLEDGEMENTS |
|---|
We wish to thank Robert Murphy for providing us with the 2D HeLa images and the elaborate features (180-FS). This work was partly performed while C.-C.L served as an assistant professor at the Department of Biomedical Sciences, Chung Shan Medical University, Taichung, Taiwan. C.-C.L is supported in part by the National Science Council (NSC), Taiwan, under Grant No. NSC94-2311-B-010-008. Y.-S.L and C.-N.H are supported in part by the National Research Program in Genomic Medicine (NRPGM), NSC, Taiwan, under Grant No. NSC95-3112-B-001-017 (Advanced Bioinformatics Core), and in part under Grant No. NSC95-2221-E-001-038.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Jonathan Wren
1 See http://murphylab.web.cmu.edu/data/2Dhela_images.html
2 See http://murphylab.web.cmu.edu/data/2Dhela_images_download.html
Received on June 1, 2007; revised on August 28, 2007; accepted on September 30, 2007
| REFERENCES |
|---|
Bannasch D, et al. LIFEdb: a database for functional genomics experiments integrating information from external sources, and serving as a sample tracking system. Nucleic Acids Res (2004) 32:D505–D508.
Boland MV, Murphy RF. A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics (2001) 17:1213–1223.
Boland MV, et al. Automated classification of cellular protein localization patterns obtained via fluorescence microscopy. (1997) Proceedings of the 19th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. 594–597.
Boland MV, et al. Automated recognition of patterns characteristic of subcellular structures in fluorescence microscopy images. Cytometry (1998) 33:366–375.
Chebira A, et al. A multiresolution approach to automated classification of protein subcellular location images. BMC Bioinformatics (2007) 8:210.
Chen X, Murphy RF. Interpretation of protein subcellular location patterns in 3D images across cell types and resolutions. Lect. Notes Bioinformatics (2007) 4414:328–342.
Conrad C, et al. Automatic identification of subcellular phenotypes on human cell arrays. Genome Res (2004) 14:1130–1136.
Dietterich TG, Bakiri G. Solving multiclass learning problems via error-correcting output codes. J. Artif. Intell. Res (1995) 2:263–286.
Eisenhaber F, Bork P. Wanted: subcellular localization of proteins based on sequence. Trends Cell Biol (1998) 8:169–170.
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. (1995) Proceedings of the Second European Conference on Computational Learning Theory. Berlin, Germany: Springer-Verlag. 23–37.
Garrels JI. YPD-A database for the proteins of Saccharomyces cerevisiae. Nucleic Acids Res (1996) 24:46–49.
Glory E, Murphy RF. Automated subcellular location determination and high throughput microscopy. Dev. Cell (2007) 12:7–16.
Guruswami V, Sahai A. Multiclass learning, boosting, and error-correcting codes. (1999) Proceedings of the 12th Annual Conference on Computational Learning Theory. New York, USA: ACM Press. 145–155.
Habeler G, et al. YPL.db: the yeast protein localization database. Nucleic Acids Res (2002) 30:80–83.
Haralick RM. Statistical and structural approaches to texture. Proc. IEEE (1979) 67:786–804.
Huang K, Murphy RF. Automated classification of subcellular patterns in multicell images without segmentation into single cells. (2004a) Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging (ISBI 2004). Arlington, VA, USA. 1139–1142.
Huang K, Murphy RF. Boosting accuracy of automated classification of fluorescence microscope images for location proteomics. BMC Bioinformatics (2004b) 5:78.
James G. Majority vote classifiers: theory and applications. In: Ph.D. Thesis (1998) Stanford, CA, USA: Department of Statistics, Stanford University.
Lin Y-S, Hsu C-N. Boosting multiclass learning with repeating codes. In: Technical report TR-IIS-07-014 (2007) Institute of Information Science, Academia Sinica. Also in Proceedings 2006 TAAI Conference on Artificial Intelligence and Applications (TAAI 2006).
Ling L. Multiclass boosting with repartitioning. (2006) Proceedings of the 23rd International Conference on Machine Learning. New York, USA: ACM Press. 569–576.
Movafeghi A, et al. Quality improvement of digitized radiographs by filtering technique development based on morphological transformations. In: 2004 IEEE Nuclear Science Symposium Conference Record (2004) 3. Rome, Italy. 1845–1849.
Murphy K, et al. Object detection and localization using local and global features. In: Toward Category-Level Object Recognition, volume 4170 of Lecture Notes in Computer Science—Ponce J, et al, eds. (2006) New York, USA: Springer-Verlag. 393–412.
Nakai K, Horton P. Psort: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends in Biochem. Sci (1999) 24:34–35.
Schapire RE. Using output codes to boost multiclass learning problems. (1997) Proceedings of the 14th International Conference on Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 313–321.
Schapire RE. A brief introduction to boosting. (1999) Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI 1999). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 1401–1406.
Simpson JC, et al. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep (2000) 1:287–292.
Sudderth EB, et al. Learning hierarchical models of scenes, objects, and parts. (2005) 2. Proceedings of the 2005 IEEE International Conference on Computer Vision (ICCV-2005). Washington, DC, USA: IEEE Computer Society. 1331–1338.
Sun Y, et al. Unifying the error-correcting and output-code adaboost within the margin framework. (2005) Proceedings of the 22nd International Conference on Machine Learning. New York, USA: ACM Press. 872–879.
Torralba A, et al. Sharing features: efficient boosting procedures for multiclass object detection. (2004) 2. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR04). Washington, DC, USA: IEEE Computer Society. 762–769.
Wen C-H, et al. Adaptive quartile sigmoid function operator for color image contrast enhancement. (2001) Proceedings of the Ninth Color Imaging Conference: Color Science and Engineering: Systems, Technologies, Applications. Scottsdale, AZ, USA: IS&T - The Society for Imaging Science and Technology. 280–285.




