Automated image analysis of protein localization in budding yeast
1Center for Bioimage Informatics, Department of Biomedical Engineering, 2Department of Machine Learning and 3Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, USA
*To whom correspondence should be addressed.
ABSTRACT
Motivation: The yeast Saccharomyces cerevisiae was the first eukaryotic organism to have its genome completely sequenced. Since then, several large-scale analyses of the yeast genome have provided extensive functional annotations of individual genes and proteins. One fundamental property of a protein is its subcellular localization, which provides critical information about how the protein works in a cell. An important project was therefore the creation of the yeast GFP fusion localization database at the University of California, San Francisco, USA (UCSF). This database provides localization data for 75% of the proteins believed to be encoded by the yeast genome. These proteins were classified into 22 distinct subcellular location categories by visual examination. Based on our past success at building automated systems to classify subcellular location patterns in mammalian cells, we sought to create a similar system for yeast.
Results: We developed computational methods to automatically analyze the images created by the UCSF yeast GFP fusion localization project. The system was trained to recognize the same location categories that were used in that study. We applied the system to 2640 images, and the system gave the same label as the previous assignments to 2139 images (81%). When only the highest confidence assignments were considered, 94.7% agreement was observed. Visual examination of the proteins for which the two approaches disagree suggests that at least some of the automated assignments may be more accurate. The automated method provides an objective, quantitative and repeatable assignment of protein locations that can be applied to new collections of yeast images (e.g. for different strains or the same strain under different conditions). It is also important to note that this performance could be achieved without requiring colocalization with any marker proteins.
Availability: The original images analyzed in this article are available at http://yeastgfp.ucsf.edu, and source code and results are available at http://murphylab.web.cmu.edu/software
Contact: murphy{at}cmu.edu
1 INTRODUCTION
Yeast was the first eukaryote whose genome was completely sequenced. Large-scale analyses have been performed by different research groups, and extensive annotations and experimentally determined properties have been organized into various databases, such as the Saccharomyces Genome Database (SGD) (Cherry et al., 1998) and the Comprehensive Yeast Genome Database (CYGD) (Güldener et al., 2005).
A key property of a protein is its subcellular localization, which provides critical information about how the protein works inside a cell. This property is frequently determined by visual interpretation of fluorescence microscope images. A major advance therefore came from the creation of the yeast GFP fusion localization database (Huh et al., 2003) at the University of California, San Francisco, USA (UCSF). In order to maximally preserve the wild-type levels of protein expression, this GFP-tagged library was constructed by inserting the coding sequence of Aequorea victoria GFP (S65T) (Tsien, 1998) into the yeast genome in-frame immediately preceding the stop codon of each open reading frame (ORF). Annotation of the images was done manually: two human scorers first independently classified these proteins into one or more of 12 subcellular localization categories and then refined these categories by performing a series of co-localization experiments with monomeric red fluorescent proteins (mRFP) (Campbell et al., 2002) whose location had been characterized previously. They defined 22 unique location categories (one of which was ambiguous) and assigned each protein to one category or to a mixture of categories. This database provides localization data for 75% of the proteins encoded by the yeast genome.
Over the past 10 years, we have developed automated systems for comprehensive and quantitative analysis of protein subcellular location patterns in mammalian cells. By utilizing informative numerical descriptors for high resolution images and statistical machine-learning methods for pattern recognition, these systems can recognize the major subcellular patterns (Boland et al., 1998, 2001; Murphy et al., 2000) and can learn new patterns directly from fluorescence microscope images (Chen and Murphy, 2005; Chen et al., 2003). Previous work showed that our system is able to distinguish similar patterns better than visual examination (Murphy et al., 2003).
In this work, we present computational methods to classify the protein images in the yeast GFP fusion localization database. The results demonstrate that highly accurate and objective assignments can be made by a fully automated system.
2 METHODS
An overview of our method is presented in Figure 1 and described briefly below. Our initial aim was to use machine-learning algorithms to classify proteins into one of 22 clearly predefined localization categories. Since some of the proteins in the UCSF database were assigned to an ambiguous category or to a mixture of localization categories, we initially excluded them to obtain a set of proteins showing only one location pattern. We also excluded the 'punctate_composite' class given the inherent ambiguity of this designation. We used a novel graphical model approach to segment each image into single cell regions and extracted numerical features to describe the pattern in each cell. These were used to train and test support vector machine classifiers. Plurality voting was used to combine results from different cells and classifiers to obtain a unique localization category for each image.
2.1 UCSF image collection and image selection
The UCSF yeast GFP fusion localization database (Huh et al., 2003) was created by attempting to tag 6234 annotated ORFs identified in SGD (Cherry et al., 1998) by homologous recombination. A total of 6029 ORFs were confirmed to have been successfully tagged, of which 4156 ORFs showed positive GFP signals. We downloaded images for all but one of these from the online database (http://yeastgfp.ucsf.edu); ORF YMR128W was missing the DAPI channel image. Each set of images consists of three channels of the same field of yeast cells: a DAPI image, a GFP image and a DIC (differential interference contrast) image. The GFP channel shows the location pattern of the tagged protein, the DAPI channel reflects the DNA distribution and the DIC image is useful for identifying cell boundaries.
After removing 29 image sets that failed to be segmented properly, 156 image sets that are labeled as ambiguous and 1294 image sets that had multiple location categories, we had 2713 image sets available to train and test our automated analysis system. Each of these sets was in one of the 21 categories shown in Table 1. However, the number of image sets in each category is not uniform. The four major categories, cytoplasm, nucleus, mitochondrion and ER, contain 2059 image sets, while the category ER_to_Golgi has only 6 image sets.
2.2 Feature extraction and image segmentation
2.2.1 Field-level features
To analyze the location patterns in an image, we can extract features that summarize the pattern at the level of the entire field (assuming that a field contains only a homogeneous population of cells expressing a single labeled protein). The advantage of field-level features is that they make it unnecessary to segment each image into subregions containing individual cells, and we have previously shown that such features can be used to obtain good classification accuracy for the major subcellular patterns in mammalian cells (Huang and Murphy, 2004a). The feature set used previously consisted of 26 features: 13 morphological features and 13 Haralick texture features, calculated using 256 gray levels for images resampled to 1.15 µm/pixel. We extended this feature set by adding 60 Gabor texture features and calculated the Haralick texture features using 128 gray levels at the original image resolution.
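As an illustration of what a Haralick-style texture feature measures, the following minimal sketch quantizes an image to a chosen number of gray levels, builds a gray-level co-occurrence matrix, and computes two statistics from it. It is not the feature code used in this work: the function names are hypothetical, only horizontally adjacent pixel pairs are counted, only two of the 13 statistics are shown, and the image is assumed to be at least two pixels wide.

```python
from typing import List


def quantize(image: List[List[int]], levels: int, max_value: int = 255) -> List[List[int]]:
    """Map pixel values in [0, max_value] onto integer gray levels 0..levels-1."""
    return [[min(levels - 1, pixel * levels // (max_value + 1)) for pixel in row]
            for row in image]


def cooccurrence(image: List[List[int]], levels: int) -> List[List[float]]:
    """Symmetric, normalized co-occurrence matrix for horizontally adjacent pixels."""
    counts = [[0.0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):
            counts[a][b] += 1  # pair read left to right
            counts[b][a] += 1  # same pair read right to left (symmetry)
    total = sum(sum(r) for r in counts)
    return [[c / total for c in r] for r in counts]


def angular_second_moment(p: List[List[float]]) -> float:
    """Sum of squared matrix entries; large for uniform textures."""
    return sum(v * v for row in p for v in row)


def contrast(p: List[List[float]]) -> float:
    """Gray-level differences weighted by how often they occur side by side."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))
```

Using fewer gray levels (for example 128 rather than 256) gives a smaller, less sparse co-occurrence matrix, at the cost of merging nearby intensities.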
2.2.2 Cell-level features
If images are segmented into single cell regions, additional features that are not appropriate for whole fields can be calculated. These include Zernike moment features, additional morphological features and wavelet features. The details for different feature sets (which we term SLF) have been reviewed previously (Huang and Murphy, 2004b). The advantage of the cell-level features is that more information in one image can be extracted and utilized. The disadvantage is the requirement for a cell segmentation step, which is crucial because its accuracy determines the accuracy of the downstream analysis. We first developed a novel segmentation method based on a graphical model approach (Chen et al., 2006) which is particularly suitable for this yeast dataset. We then extracted 245 cell-level features for each cell subregion. These features were calculated after background subtraction and normalization of the total intensity of the image to one. Thus differences in protein expression and background (reflected in brightness and contrast of the image) are removed. This feature set contains 49 Zernike moment features, 16 morphological features, 30 wavelet features, 6 DNA features, 5 edge features, 1 non-object feature and 78 Haralick texture features using 128 gray levels at the original image resolution and various resolutions with downsampling factors ranging from 2 to 6.
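The effect of background subtraction followed by normalization of the total intensity can be sketched as follows. This is an illustrative example only: the function name is hypothetical, and the background level is assumed to be supplied by the caller, since the text does not state how it was estimated.

```python
from typing import List


def normalize_cell(pixels: List[List[float]], background: float) -> List[List[float]]:
    """Subtract a background level, clip at zero, and scale intensities to sum to one."""
    corrected = [[max(0.0, float(v) - background) for v in row] for row in pixels]
    total = sum(sum(row) for row in corrected)
    if total == 0:
        raise ValueError("cell region has no signal above background")
    return [[v / total for v in row] for row in corrected]


# Two hypothetical cells with the same spatial pattern but different
# background and expression level yield the same normalized image.
dim_cell = normalize_cell([[10, 12], [14, 10]], background=10)
bright_cell = normalize_cell([[100, 120], [140, 100]], background=100)
```

Because both cells map to the same normalized intensities, features computed afterwards depend on the spatial pattern rather than on brightness or contrast.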
2.2.3 Image segmentation with graphical model approach
Yeast cells are often clustered together, so it is hard to define good initial seeding for conventional techniques, such as the seeded watershed algorithm (Bengtsson et al., 2004; Velliste and Murphy, 2002) and Voronoi segmentation (Rodenacker and Bischoff, 1990). We developed a graphical-model-based segmentation approach (Chen et al., 2006), which assumes two parallel images are available for each field: an image containing information about the nuclear positions (such as an image of a DNA probe) and an image containing information about the cell boundaries (such as a DIC image). The nuclear information provides an initial assignment of whether each pixel belongs to the background or one of the cells. The boundary information is used to estimate the probability that any two pixels in the graph are separated by a cell boundary. This graphical model segmentation is fast and accurate, and it is especially useful for segmenting a field containing cells that are touching each other and where nuclear and cell boundary images are noisy.
2.3 Support vector machine classification
Support vector machines were originally designed for binary classification by finding a maximum margin hyperplane between two classes (Cortes and Vapnik, 1995). They can be extended to solve multi-class classification problems by combining several binary classifiers. For the work described here, we used the LIBSVM package (http://www.csie.ntu.edu.tw/~cjlin/libsvm) with the one-against-one method (Hsu and Lin, 2002) for multi-class classification. To improve accuracy for small classes, we used class-dependent weights. The weight for a given class was the size of the smallest class divided by the size of that class. For SVMs, the choice of parameters greatly affects classification performance. For an SVM with a radial basis function kernel, there are two parameters to be adjusted, C (the penalty parameter of the error term) and γ (the width of the Gaussian function in the kernel):

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$
For parameter tuning with cell-level features, we randomly split all the data into 6 folds, 5 of them for training and one of them for testing, ran stepwise discriminant analysis (SDA) to select the most discriminative features on the training set, and conducted a grid search on C and γ using 6-fold cross-validation. The best set of parameters for the cell-level classification was $(C, \gamma) = (2^{15}, 2^{-7})$. We have previously evaluated eight different feature reduction methods in the context of subcellular pattern analysis and obtained the best performance using SDA (Huang et al., 2003).
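The class weighting rule and the parameter grid can be sketched in a few lines. This is a minimal illustration rather than the code used in this work. The function names are hypothetical, the kernel is assumed to take the standard Gaussian form exp(-gamma * ||x - y||^2), the class sizes other than the 6 image sets of ER_to_Golgi are made-up placeholders, and the exponent ranges of the grid are assumptions (the text reports only the selected pair).

```python
import math
from typing import Dict, List, Sequence, Tuple


def class_weights(class_sizes: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by (size of smallest class) / (size of this class)."""
    smallest = min(class_sizes.values())
    return {name: smallest / size for name, size in class_sizes.items()}


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian (radial basis function) kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def parameter_grid(c_exponents: Sequence[int],
                   gamma_exponents: Sequence[int]) -> List[Tuple[float, float]]:
    """All (C, gamma) pairs on a grid of powers of two."""
    return [(2.0 ** c, 2.0 ** g) for c in c_exponents for g in gamma_exponents]


# Placeholder class sizes; only ER_to_Golgi = 6 comes from the text.
weights = class_weights({"large_class": 600, "ER_to_Golgi": 6})

# Assumed exponent ranges; each pair would be scored by cross-validation.
grid = parameter_grid(range(-5, 16, 2), range(-15, 4, 2))
```

With this rule the smallest class has weight 1 and larger classes receive proportionally smaller weights, so errors on rare classes are penalized more heavily relative to their frequency.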
2.4 Plurality voting
Given a yeast image that has been segmented into several cell regions, and the classification result for each cell, we can use information from more than one cell to improve the classification accuracy. Previous work (Boland and Murphy, 2001) has shown that the classification accuracy for sets of HeLa cells can be improved from 83% to 98% by allowing cells to vote for a single classification for the entire slide (and choosing the class that receives a plurality of the vote). We adopt the same strategy here. We used 25 seeds to randomly split the data into 6 folds, and used each fold to train a classifier. Each image was classified by 25 classifiers, so we can perform another level of plurality voting over the classifiers to decide a single location category for each image. For each image in classes not included in training (the ambiguous and the punctate_composite classes), its label was decided by plurality voting over 150 classifiers (25 classifiers times 6 folds). This strategy works when we can assume that a field contains a homogeneous location pattern. Since each classification is made independently at each run of cross-validation, the plurality voting does not cause overfitting problems. However, the multiple runs of cross-validation needed to test all the images are computationally time-consuming.
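The two levels of voting described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, ties are broken alphabetically (the text does not say how ties were handled), and the confidence is taken to be the fraction of classifiers that agree with the winning label.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def plurality(labels: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent label and the fraction of votes it received.

    Ties are broken alphabetically, which is an assumption of this sketch.
    """
    counts = Counter(labels)
    best = max(counts.values())
    winner = min(label for label, n in counts.items() if n == best)
    return winner, best / len(labels)


def classify_image(cell_labels_per_classifier: List[Sequence[str]]) -> Tuple[str, float]:
    """First vote over the cells seen by each classifier, then over classifiers."""
    per_classifier = [plurality(cells)[0] for cells in cell_labels_per_classifier]
    return plurality(per_classifier)


# Three hypothetical classifiers, each labeling the three cells of one image.
label, confidence = classify_image([
    ["ER", "ER", "cytoplasm"],
    ["ER", "cytoplasm", "cytoplasm"],
    ["ER", "ER", "ER"],
])
```

In this toy example two of the three classifiers vote for ER, so the image is labeled ER with a confidence of two thirds.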
3 RESULTS
3.1 Field-level classification
We first evaluated the classification of entire, unsegmented fields of cells. Using 6-fold cross-validation on the four largest classes, we obtained a 92.7% overall accuracy (Table 2). Here the 6-fold cross-validation simply split the data for each class evenly into 6 folds. However, the classification accuracy decreased significantly when this approach was extended to all 21 classes (data not shown). The overall accuracy (number of correctly classified images out of all images) was 76.2% but the adjusted accuracy (the average over the diagonal of the confusion matrix) was 33.9%. This suggests that although most images from large classes were classified correctly, most images from small classes were classified incorrectly. It also illustrates the difficulty of classifying the uneven data described in Section 2.1. This is at least partially due to the small number of training images available for the small classes (since only one set of features is calculated for each field).
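The gap between the two accuracy measures can be reproduced with a small example. The sketch below assumes that rows of the confusion matrix are true classes, columns are predicted classes, and that the adjusted accuracy averages the diagonal after each row is normalized by its class size. The function names and counts are hypothetical.

```python
from typing import List


def overall_accuracy(confusion: List[List[int]]) -> float:
    """Correctly classified images divided by all images."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def adjusted_accuracy(confusion: List[List[int]]) -> float:
    """Mean of the per-class accuracies (diagonal of the row-normalized matrix)."""
    per_class = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(per_class) / len(per_class)


# Hypothetical counts: one large class (100 images) and one small class (5 images).
confusion = [
    [95, 5],  # large class: 95 correct
    [4, 1],   # small class: 1 correct
]
overall = overall_accuracy(confusion)    # 96 / 105
adjusted = adjusted_accuracy(confusion)  # (0.95 + 0.20) / 2
```

The large class dominates the overall accuracy, while the adjusted accuracy weights every class equally and therefore exposes poor performance on small classes.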
3.2 Cell-level classification
We next evaluated cell-level classification as described above. The results are shown in Table 3. A total of 2139 out of 2640 images (81%) were classified correctly. Most images in the large classes were classified correctly, but many images in the small classes were not (probably due to the small number of training images available for these classes). However, the accuracy of the classifications can be significantly improved by considering only those that have high confidence. This can be done by setting a threshold on the fraction of classifiers that must agree in order for a classification to be assigned. When this number is varied from 28% to 100%, the precision of the classification results (the number of assignments that agree with the UCSF assignments divided by the number of assignments made) increases from 81.02% to 94.70%. The recall (fraction of all images whose classification agrees with the UCSF assignments) is over 80% at this high accuracy (Fig. 2). The accuracies in Table 3 assume that the UCSF labels were all correct. Since the vast majority is likely to be correct, the results strongly confirm the feasibility of automated classification of yeast images, even without using co-localization. However, it is possible that at least some of the visually assigned labels are incorrect. We therefore next examined the difference between our automatically-derived labels and the labels from the UCSF database.
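The precision and recall defined above can be computed for any confidence threshold as in the following sketch. The function name and the example predictions are hypothetical. The confidence of each assignment is assumed to be the fraction of classifiers that agreed on it.

```python
from typing import List, Tuple

# (predicted label, reference label, fraction of classifiers that agreed)
Prediction = Tuple[str, str, float]


def precision_recall(predictions: List[Prediction], threshold: float) -> Tuple[float, float]:
    """Precision and recall when only assignments at or above the threshold are made."""
    assigned = [p for p in predictions if p[2] >= threshold]
    agree = sum(1 for predicted, reference, _ in assigned if predicted == reference)
    precision = agree / len(assigned) if assigned else float("nan")
    recall = agree / len(predictions)  # denominator is all images, assigned or not
    return precision, recall


# Hypothetical predictions for five images.
predictions = [
    ("nucleus", "nucleus", 1.0),
    ("ER", "ER", 1.0),
    ("cytoplasm", "ER", 0.52),
    ("vacuole", "vacuole", 0.6),
    ("ER", "cytoplasm", 1.0),
]
all_assignments = precision_recall(predictions, threshold=0.0)
unanimous_only = precision_recall(predictions, threshold=1.0)
```

Raising the threshold discards low-confidence assignments, which can raise precision while lowering recall, since fewer images receive a label.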
3.3 Identification of potentially incorrect labels
Of the 2640 images classified in Table 3, there are 501 images whose computer-assigned labels did not match the previously assigned labels. Of these, 96 were classified with 100% confidence (25 out of 25 classifiers agreed). A list of these proteins, together with the complete list of all 501, is available at http://murphylab.web.cmu.edu/data. We plan to further explore these proteins and evaluate evidence from other databases to gain further insight into which labels may be correct. As a preliminary analysis, we show images for two of the proteins in Figures 3 and 4. According to the CYGD database, the ORF YAL009W in Figure 3 is a meiotic protein which should have strong expression in the M (mitosis) phase. Our method classified it as a vacuole protein with 52% confidence (13 of the 25 classifiers classified it as vacuole, 11 classified it as cytoplasm and one classified it as mitochondrion). We hypothesize that these cells are not in the M phase, so the protein is not expressed in the nucleus. Another example is the ORF YGR130C in Figure 4. According to the CYGD database, YGR130C is a protein of unknown function localized to the cytoplasm, and it is classified as a punctate_composite protein in the UCSF database. The image shows that the protein is localized at the cell periphery, which is consistent with the automated classification (91 of the 150 classifiers classified it as cell_periphery, 45 classified it as cytoplasm and 14 classified it as ER). These two examples suggest that automated classification can provide a better measurement of protein location patterns, at least in some cases. Validation of the other proteins with potentially incorrect labels is still under investigation.
3.4 Classification of proteins with ambiguous class
Besides the proteins identified as being in only one location, we applied the classifier to the 156 proteins in the ambiguous category. We sought location labels for these proteins in the CYGD database, but the labels in CYGD are all mixtures of ambiguous and other categories. The results of automated classification of these proteins are available at http://murphylab.web.cmu.edu/data. An example is shown in Figure 5. ORF YFL034W was assigned to the ambiguous class in both the UCSF and CYGD databases, while our system classifies it as an ER protein with 46.67% confidence (70 of the 150 classifiers classified it as ER, 48 classified it as cytoplasm and 32 classified it as cell_periphery). The results for the 37 proteins for which all automated classifiers agree are listed in Table 4.
4 CONCLUSIONS AND DISCUSSIONS
We have presented computational methods to annotate localization categories of yeast proteins in the yeast GFP fusion localization database. This measurement of protein localization is objective, quantitative and repeatable. Our results have a high accuracy of 81.02% for all proteins and 94.70% precision at 80.22% recall, compared to human labels for those proteins with pure location pattern. The high accuracy confirms the feasibility of automated classification of yeast images.
The most important future work is to seek evidence which explains the differences between the human and computer generated labels. This can be done by looking at the literature for these proteins, checking information from other yeast databases or performing location prediction from protein sequences. In addition, there are 1294 images with mixed location patterns. We plan to extend the existing system to handle more than one location pattern in the same cell.
This work can be placed in the context of growing efforts to automate analysis of yeast phenotypes. For example, automated analysis of yeast morphology changes in response to mating pheromone has been described (Narayanaswamy et al., 2006).
The system we have described should be valuable for automatically identifying changes in Saccharomyces cerevisiae under different conditions (e.g. various carbon sources) and for assigning location on a genome-wide basis for other yeast species.
ACKNOWLEDGEMENTS
We thank Adam Carroll and Erin O'Shea for making the yeast images available. This work was supported in part by NSF grant EF-0331657, by NIH National Technology Center for Networks and Pathways grant U54 RR022241, and by NIH National Center for Biomedical Computing grant U54 DA021519.
Conflict of Interest: none declared.
REFERENCES
Bengtsson E, et al. (2004) Robust cell image segmentation methods. Pattern Recogn. Image Anal., 14, 157–167.
Boland MV, Murphy RF. (2001) A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics, 17, 1213–1223.
Boland MV, et al. (1998) Automated recognition of patterns characteristic of subcellular structures in fluorescence microscopy images. Cytometry, 33, 366–375.
Campbell RE, et al. (2002) A monomeric red fluorescent protein. Proc. Natl Acad. Sci. USA, 99, 7877–7882.
Chen S-C, et al. (2006) A novel graphical model approach to segmenting cell images. Proc. IEEE Symp. Comput. Intell. Bioinform. Comput. Biol. (CIBCB 06), 1079, 1–8.
Chen X, Murphy RF. (2005) Objective clustering of proteins based on subcellular location patterns. J. Biomed. Biotechnol., 2005, 87–95.
Chen X, et al. (2003) Location proteomics: building subcellular location trees from high-resolution 3D fluorescence microscope images of randomly-tagged proteins. Proc. SPIE, 4962, 298–306.
Cherry JM, et al. (1998) SGD: Saccharomyces Genome Database. Nucleic Acids Res., 26, 73–79.
Cortes C, Vapnik V. (1995) Support vector networks. Mach. Learn., 20, 1–25.
Güldener U, et al. (2005) CYGD: the comprehensive yeast genome database. Nucleic Acids Res., 33, D364–D368.
Hsu C-W, Lin C-J. (2002) A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Net., 13, 415–425.
Huang K, Murphy RF. (2004a) Automated classification of subcellular patterns in multicell images without segmentation into single cells. Proc. 2004 IEEE Int. Symp. Biomed. Imaging, 1139–1142.
Huang K, Murphy RF. (2004b) From quantitative microscopy to automated image understanding. J. Biomed. Opt., 9, 893–912.
Huang K, et al. (2003) Feature reduction for improved recognition of subcellular location patterns in fluorescence microscope images. Proc. SPIE, 4962, 307–318.
Huh W-K, et al. (2003) Global analysis of protein localization in budding yeast. Nature, 425, 686–691.
Murphy RF, et al. (2000) Towards a systematics for protein subcellular location: quantitative description of protein localization patterns and automated analysis of fluorescence microscope images. Proc. Int. Conf. Intell. Syst. Mol. Biol., 8, 251–259.
Murphy RF, et al. (2003) Robust numerical features for description and classification of subcellular location patterns in fluorescence microscope images. J. VLSI Signal. Proc., 35, 311–321.
Narayanaswamy R, et al. (2006) Systematic profiling of cellular phenotypes with spotted cell microarrays reveals mating-pheromone response genes. Genome Biol., 7, R6.
Rodenacker K, Bischoff P. (1990) Quantification of tissue sections: graph theory and topology as modelling tools. Pattern Recognit. Lett., 11, 275–284.
Tsien RY. (1998) The green fluorescent protein. Ann. Rev. Biochem., 67, 509–544.
Velliste M, Murphy RF. (2002) Automated determination of protein subcellular locations from 3D fluorescence microscope images. Proc. IEEE Int. Symp. Biomed. Imaging (ISBI-2002), 867–870.