Bioinformatics Advance Access originally published online on January 20, 2006
Bioinformatics 2006 22(10):1158-1165; doi:10.1093/bioinformatics/btl002
MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition
1 Division for Simulation of Biological Systems, WSI/ZBIT, Eberhard Karls University Tübingen Sand 14, D-72076 Tübingen, Germany
2 Department of Biochemistry and Center for Bioinformatics, Saarland University D-66041 Saarbrücken, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Functional annotation of unknown proteins is a major goal in proteomics. A key annotation is the prediction of a protein's subcellular localization. Numerous prediction techniques have been developed, typically focusing on a single underlying biological aspect or predicting a subset of all possible localizations. An important step is taken towards emulating the protein sorting process by capturing and bringing together biologically relevant information, and addressing the clear need to improve prediction accuracy and localization coverage.
Results: Here we present a novel SVM-based approach for predicting subcellular localization, which integrates N-terminal targeting sequences, amino acid composition and protein sequence motifs. We show how this approach improves the prediction based on N-terminal targeting sequences, by comparing our method TargetLoc against existing methods. Furthermore, MultiLoc performs considerably better than comparable methods predicting all major eukaryotic subcellular localizations, and shows better or comparable results to methods that are specialized on fewer localizations or for one organism.
Availability: http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc/
Contact: hoeglund{at}informatik.uni-tuebingen.de
| 1 INTRODUCTION |
|---|
Assigning subcellular localization to a protein is an important step towards elucidating its interaction partners, function and its potential role(s) in the cellular machinery (Rost et al., 2003). Despite recent technological advancements, experimental determination of subcellular localization remains time-consuming and laborious. Hence, computational methods for assigning localization on a proteome-wide scale offer an attractive complement.
Free diffusion in the cell is prohibited by membranes, leading to the subdivision of specific biochemical microenvironments into several subcellular organelles. Intracellular transport of proteins into these organelles is a highly regulated and specific process (Pfeffer and Rotheman, 1987). Signals for protein sorting exist either in the form of primary sequences, usually N-terminal targeting sequences (Rusch and Kendall, 1995; Emanuelsson et al., 2000) and internal sequence motifs (Cokol et al., 2000), or in the form of 3D protein-surface patches (Helenius and Aebi, 2001). Proteins localized in the same organelle have been reported to show a similar overall amino acid composition and are thought to have evolved to function optimally in that specific environment (Andrade et al., 1998). Furthermore, a functional domain can be specific to proteins in one organelle, e.g. DNA-binding domains are present almost exclusively in nuclear proteins (Höglund et al., 2005).
Methods for predicting subcellular localization can be categorized according to the underlying theory, e.g. classification based on N-terminal targeting sequences, overall amino acid composition and sequence homology (Dönnes and Höglund, 2004). TargetP (Emanuelsson et al., 2000) uses neural networks for discriminating four localizations: chloroplast, mitochondrial, secretory pathway and other proteins, based on their N-terminal sequence information. An alternative and comparable method, iPSORT (Bannai et al., 2002), uses biologically interpretable rules of the N-terminal sequences for assigning the same localizations as TargetP. A number of different computational approaches using the overall amino acid composition have been presented, including neural networks (Reinhardt and Hubbard, 1998), Hidden Markov Models (HMMs) (Yuan, 1999), Support Vector Machines (SVMs) (Hua and Sun, 2001; Park and Kanehisa, 2003; Chou and Cai, 2002) and nearest neighbours (Cai and Chou, 2003; Ying and Yanda, 2004). Marcotte et al. explored the possibility of assigning subcellular localization based on the distribution of protein homologues and their phylogenetic profiles (Marcotte et al., 2000). Recent studies tend to combine several sources of information for prediction. Cai and Chou presented a method using gene ontology and functional domain composition (Cai and Chou, 2004). ELSpred is an SVM-based method using PSI-BLAST scores and amino acid composition (Bhasin and Raghava, 2004), which was recently complemented by HSLpred and PSLpred for predicting four human and prokaryotic localizations, respectively (Garg et al., 2005; Bhasin et al., 2005). PSLT is based on InterPro motifs and specific membrane domains (Scott et al., 2004). A recently published method, LOCtree (Nair and Rost, 2005), integrates various sequence and predicted structural properties. The prediction shows promising results, but covers a limited number of localizations.
The most comprehensive prediction system reported so far, PSORT, is based on a collection of expert if-then rules originating from experimental and computational observations (Nakai and Kanehisa, 1992; Nakai and Horton, 1999). PSORT covers the main 11 eukaryotic organelles and their intraorganellar localizations.
In the development of new prediction methods, special attention should be given to sequence homology within the dataset and the method chosen for performance evaluation (Höglund et al., 2005). Including highly homologous sequences in the training dataset will lead to recognition of identical sequences (thereby overfitting the prediction method), rather than common general features. Leave-one-out cross-validation is useful when the data at hand are limited. However, for larger datasets it becomes impractical due to the required computation time, which makes 5-fold cross-validation a more appropriate alternative (Hastie et al., 2001).
Here we present a new integrated approach for predicting subcellular localization. The approach considers N-terminal targeting sequences, amino acid composition and the presence of specific protein sequence motifs obtained from established motif databases. These features form the input for a set of SVMs used for predicting the localization. This novel approach was used for developing two prediction systems, TargetLoc and MultiLoc. TargetLoc predicts four plant and three non-plant localizations based on N-terminal targeting sequences, whereas MultiLoc covers all 11 eukaryotic subcellular localizations. Both methods have been compared with the respective current state-of-the-art methods TargetP, iPSORT and PSORT. TargetLoc shows an improved discrimination of chloroplast and mitochondrial proteins, which is a well-known challenge due to evolutionary similarity between the two (Creissen et al., 1995). MultiLoc shows an overall prediction accuracy of about 75% in a cross-validation test, which can be directly compared with the overall accuracy of slightly below 60% obtained by the PSORT method. TargetLoc and MultiLoc are described in detail, followed by the results of the benchmark studies and predictions on two novel datasets. We show that our integrative approach, which utilizes several different protein-specific features, gives a robust and accurate prediction method. Both prediction systems have been implemented as web services and are accessible at http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc/
| 2 MATERIALS AND METHODS |
|---|
The aim of our work was to investigate if the novel approach, including additional biologically relevant information, could significantly improve prediction of localization. First, our approach was tested on classification of N-terminal localization categories. We developed the novel method TargetLoc, which was compared with the two methods TargetP and iPSORT using the well-known TargetP dataset (Emanuelsson et al., 2000). Secondly, we applied our ideas to a more challenging problem, predicting all main eukaryotic subcellular localizations. The new prediction method, MultiLoc, was evaluated using an extensive dataset obtained from the Swiss-Prot database and finally compared with the PSORT method. The architecture of the TargetLoc and MultiLoc prediction systems will be outlined and illustrated in the following subsections. The individual building blocks of the two methods are described in detail, followed by information about the datasets used for training and testing, the machine learning procedures and the measures used for performance evaluation.
2.1 TargetLoc architecture
Prediction of localization categories based on N-terminal targeting sequences is relatively reliable and directly connected to the underlying biological process. However, N-terminal-based discrimination is only possible for the mitochondrial (mi), chloroplast (ch), secretory pathway (SP) and other (OT) categories. TargetLoc was designed to integrate several biologically relevant features for generating a protein profile vector (PPV) representing each protein. The features in the PPV originate from a set of specialized methods designed to detect protein-specific features. Novel in TargetLoc (compared with other methods based on N-terminal prediction) is that, in addition to the prediction based on the N-terminal sequences (here performed by SVMTarget), information about the overall amino acid composition (SVMaac) and about protein-specific motifs (MotifSearch) is collected in the PPV. All methods mentioned here are described in detail in the following. The overall architecture of TargetLoc is illustrated in Figure 1. A query sequence is processed by a first layer of three different methods for detecting protein features: SVMTarget, SVMaac and MotifSearch. The results from these specific methods are collected in the PPV. The PPV is a representation of each protein and is used as input for the final layer of SVMs in TargetLoc.
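As a rough illustration of how first-layer outputs might be gathered into a PPV, the sketch below concatenates per-category scores and motif flags into one flat vector. The text does not specify the exact layout of the PPV, so the function name `build_ppv`, the dictionary inputs, the ordering of the features and the example values are all assumptions made for illustration only.

```python
def build_ppv(target_probs, aac_probs, motif_hits):
    """Concatenate first-layer outputs into one flat feature vector.

    All three arguments are dicts (hypothetical layout). Keys are sorted so
    that every protein yields its features in the same order.
    """
    vector = []
    for scores in (target_probs, aac_probs):
        vector.extend(float(scores[name]) for name in sorted(scores))
    # Motif detections are encoded as 1.0 (found) or 0.0 (not found).
    vector.extend(1.0 if motif_hits[name] else 0.0 for name in sorted(motif_hits))
    return vector


# Made-up first-layer outputs for one non-plant query protein.
ppv = build_ppv(
    target_probs={"mi": 0.1, "SP": 0.7, "OT": 0.2},
    aac_probs={"mi": 0.2, "SP": 0.6, "OT": 0.2},
    motif_hits={"NLS": False, "KDEL": True, "SKL": False},
)
```

A fixed feature order matters here because the final layer of SVMs compares vectors position by position.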
2.2 MultiLoc architecture
The basic idea of integrating additional biologically relevant information to improve the prediction of localization categories was extended to enable discrimination between the main 11 eukaryotic subcellular localizations. Discrimination of all putative localizations is highly desirable from a biological point of view. However, it presents a challenging computational task for sparsely populated localizations and for localizations for which no unique features exist. MultiLoc addresses this challenge by utilizing the same approach as was used in TargetLoc, but incorporating more information in order to enable extended prediction of further localizations. The architecture of our novel prediction method MultiLoc is presented in Figure 2. Several additional features have been incorporated into the first layer so that MultiLoc can discriminate the extended number of localizations. Furthermore, a method (SVMSA) for detecting signal anchors (SAs) has been implemented (described in detail below). MultiLoc was trained using a new homology-reduced dataset and the prediction performance of the three versions (animal, fungal and plant) was compared with that of the PSORT method.
2.3 Subprediction methods
The individual building blocks used in the development of TargetLoc and MultiLoc (illustrated in Fig. 1 and Fig. 2, respectively) are presented here.
2.3.1 SVMTarget
Similarly to TargetP, SVMTarget (plant and non-plant versions) predicts localization categories based on N-terminal targeting sequences. The plant version predicts four categories (chloroplast, mitochondrial, secretory pathway and other) based on the type of targeting sequence: chloroplast transit peptides (cTP), mitochondrial transit peptides (mTP), targeting peptides of the secretory pathway (SP) and other proteins lacking N-terminal targeting sequences. The three non-plant categories mitochondrial, secretory pathway and other are predicted based on the recognition of the same types as above except for the cTP. The main differences between SVMTarget and TargetP are the encoding of the N-terminal sequences and the enhanced discrimination of cTPs and mTPs. TargetP uses the primary amino acid sequence, whereas SVMTarget uses the partial amino acid composition.
SVMTarget was constructed to reflect three basic biological observations; the plant version is illustrated in Figure 3. First, each type of N-terminal targeting sequence has a characteristic amino acid composition (rather than a similar primary amino acid sequence), which is different from the other N-terminal targeting sequences and the mature protein sequence. The second observation is that the different N-terminal targeting sequences have different average lengths. SPs are normally about 10 to 15 amino acids shorter than mTPs, and cTPs are usually significantly longer than mTPs. Finally, it has been shown that the main difference between cTPs and mTPs lies in the amino acid composition of the 15 most N-terminal residues.
The first layer is a set of binary predictors, each specialized to recognize the differences in the amino acid composition between two types of N-terminal sequences. SVMs have been trained to recognize the first 60 (for mTPs and SPs) and the first 100 (for cTPs) N-terminal amino acids, by sliding a window of length W over the amino acids in the targeting sequence (Fig. 3). W is dependent on the length (L) of the experimentally observed targeting sequences and has been manually optimized for each localization (55 for cTPs, 35 for mTPs and 23 for SPs). The cTP/mTP classification is not constructed using windows; instead, it uses the fact that mTPs tend to have a higher fraction of positively charged amino acids in their N-terminal end than cTPs (Bannai et al., 2002). The second layer in the architecture is a set of SVMs, used for final classification based on the output of the first layer. The one-versus-one classification procedure uses probability estimates for determining the localization. In order to evaluate the performance of SVMTarget, training and testing were done using a strict 5-fold cross-validation procedure and the same dataset as the TargetP method.
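The windowed encoding described above can be sketched as follows. This is a minimal illustration, not the SVMTarget implementation: the text gives L and W but not how each window is turned into SVM input, so the 20-dimensional composition vector per window, the function names, and the choice of lysine (K) and arginine (R) as the positively charged residues are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(segment):
    """Return the fraction of each standard amino acid in a segment."""
    counts = Counter(segment)
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def window_compositions(sequence, region_length, window_length):
    """Slide a window of length W over the first L residues.

    Returns one composition vector per window start position.
    """
    region = sequence[:region_length]
    last_start = max(len(region) - window_length, 0)
    return [
        composition(region[start:start + window_length])
        for start in range(last_start + 1)
    ]


def positive_fraction(sequence, n_terminal=15):
    """Fraction of K and R (assumed 'positive') in the N-terminal end."""
    head = sequence[:n_terminal]
    if not head:
        return 0.0
    return sum(head.count(aa) for aa in "KR") / len(head)
```

With the values from the text, an SP predictor would call `window_compositions(seq, 60, 23)` and a cTP predictor `window_compositions(seq, 100, 55)`, while `positive_fraction(seq)` stands in for the window-free cTP/mTP feature.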
2.3.2 SVMaac
Overall amino acid composition contributes to the PPV through a set of binary localization-specific SVMs (SVMaac), which is illustrated in Figures 1 and 2, respectively. TargetLoc has four plant (chloroplast, mitochondrial, secretory pathway and other) and three non-plant (mitochondrial, secretory pathway and other) binary classifiers for the overall amino acid composition. In MultiLoc the number of binary classifiers corresponds to the number of localizations, nine for animal and fungi, and ten for plant. Each binary classifier discriminates between one localization and all others. Additionally there is a classifier that specifically discriminates cytoplasmic (cy) from nuclear (nu) proteins.
2.3.3 SVMSA
Membrane proteins of the secretory pathway can have a signal anchor (SA) sequence instead of the N-terminal targeting sequence. SAs are localized further away from the N-terminal end of the protein and they usually have a longer hydrophobic part compared with SPs. The SAs may escape detection by methods like SVMTarget and TargetP. To address this problem a novel classifier, SVMSA, specifically designed to recognize SAs was constructed. The basic architecture of SVMSA is similar to that of SVMTarget (which was presented in Fig. 3). The differences are that there is only one classifier at the first level and that the first 100 amino acids (L) and a window length (W) of 21 are used, compared with the first 60 amino acids and windows of length 23, 35 or 55 in SVMTarget.
Out of the 2595 proteins of the homology-reduced secretory pathway category, exactly 300 proteins contain an SA sequence instead of an SP sequence. This dataset was used for training the SVMSA against an equal number of proteins lacking SAs and SPs (obtained from the cytoplasmic, chloroplast, mitochondrial, nuclear and peroxisomal (pe) categories), using a 5-fold cross-validation procedure. The recognition of SAs is very reliable, with an overall accuracy >90%. The TargetP dataset does not contain proteins with SAs, hence SVMSA was only used as an integrated part of the PPV in MultiLoc but not in TargetLoc.
2.3.4 MotifSearch
Sequence motifs and structural domains provide essential biological information about a protein. Detection of such information was facilitated through the development of MotifSearch, which has been integrated into both TargetLoc and MultiLoc and is illustrated in Figures 1 and 2. MotifSearch relies on the information mainly from the PROSITE (Bairoch and Bucher, 1994) and NLSdb (Cokol et al., 2000; Nair et al., 2003) databases.
Most nuclear proteins carry a nuclear localization signal sequence (NLS), which can be recognized by nuclear import receptors. There are two major types of NLSs, the monopartite (NLSm) and the bipartite (NLSb). NLSms are short (4–8 amino acids) and rich in positively charged amino acids, whereas NLSbs consist of two parts, each with a length of two to four amino acids, that are connected by a spacer sequence. NLSdb is a database containing experimentally known and potential NLSs (Cokol et al., 2000). In addition to the specific NLSdb entries, the NLSm consensus pattern K(K|R)X(K|R) is detected. The PROSITE database contains information about protein sequence motifs, such as structural and functional domains. Four types of PROSITE motifs were found to have a high discrimination power on a scan of the MultiLoc homology-reduced dataset. These motifs were included in the MotifSearch method: the endoplasmic reticulum retention signal KDEL, the C-terminal targeting signal for the peroxisome SKL, 25 different DNA-binding domains (dbD) and 16 plasma membrane receptor domains (pmD).
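A minimal sketch of this kind of motif scan is given below, using Python's standard `re` module. It only covers the NLSm consensus pattern quoted above and the literal KDEL and SKL signals; the real PROSITE patterns are more permissive and the NLSdb and domain lookups are omitted. Treating KDEL as a C-terminal signal, as well as the function name `motif_flags`, are assumptions.

```python
import re

# Monopartite NLS consensus from the text: K(K|R)X(K|R), X = any residue.
NLS_MONO = re.compile(r"K[KR].[KR]")


def motif_flags(sequence):
    """Return simple presence/absence flags for three sequence signals."""
    seq = sequence.upper()
    return {
        "NLSm": NLS_MONO.search(seq) is not None,
        # Assumption: both signals are only checked at the C-terminal end.
        "KDEL": seq.endswith("KDEL"),
        "SKL": seq.endswith("SKL"),
    }
```

Flags of this form are convenient because they can be written directly into a feature vector as 0 or 1.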
2.4 Datasets
2.4.1 TargetP
The TargetP datasets were obtained from the TargetP web site and used for training and benchmarking TargetLoc against other comparable methods predicting N-terminal localization categories. These datasets contain a total of 3678 proteins representing four plant (chloroplast, mitochondrial, secretory pathway and other) and three non-plant localizations (mitochondrial, secretory pathway and other). The secretory pathway category comprises proteins from the endoplasmic reticulum (er), extracellular space (ex), Golgi apparatus (go), lysosome (ly), plasma membrane (pm) and vacuole (va). Cytoplasmic and nuclear proteins belong to the category of other proteins.
2.4.2 MultiLoc
The extensive dataset was obtained by extracting all animal, fungal and plant protein sequences from the SWISS-PROT (Bairoch and Apweiler, 2000) database release 42 (2003–2004), using the keywords Metazoa, Fungi or Viridiplantae, respectively, in the OC (organism classification) field. These proteins were further assigned to 1 of 11 possible eukaryotic subcellular localizations, based on the annotation in the CC (comments) field. Plant proteins can be localized in the chloroplast, cytoplasm, endoplasmic reticulum, extracellular space, Golgi apparatus, mitochondrion, nucleus, peroxisome, plasma membrane and vacuole. Fungal cells share the same subcellular localizations as plant cells, except that they lack the chloroplast. Finally, animal cells share all localizations with fungal cells, but have lysosomes instead of vacuoles.
A few localizations such as the cytoplasm, extracellular space and the nucleus are densely populated, hence all proteins with annotations containing uncertainties such as "potential", "by similarity" or "probable" were excluded. Mitochondrial and chloroplast proteins were only accepted if the keyword "transit" followed by an annotated cleavage site was present in the FT (feature) field. Proteins of the secretory pathway category (endoplasmic reticulum, extracellular space, Golgi apparatus, lysosome, plasma membrane and vacuole) were selected if the keywords "signal" or "signal-anchor" and annotated start and stop sites were present in the FT field. Plasma membrane proteins were required to have the keywords "domain" and "extracellular" as well as "domain" and "cytoplasmic" in the FT fields. These proteins were not accepted if the keywords "domain" and "lumenal" were present in one of the FT fields. A total of 9761 sequences were extracted, with no restrictions on the level of homology (All; Table 1). Training the prediction models on datasets containing sequences with too high similarity will lead to recognition of nearly identical sequences, rather than general features. Hence, a homology-reduced dataset was created using the ClustalW (Thompson et al., 1994) algorithm, by removing proteins from the original dataset until it contained no sequences with a pair-wise similarity >80% (Table 1).
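The homology reduction step can be pictured as a greedy filter that keeps a sequence only if it is not too similar to any sequence already kept. The sketch below is an illustration under stated assumptions: the text does not describe the order in which proteins were removed, and `toy_similarity` (based on Python's `difflib`) is a stand-in for a ClustalW pair-wise alignment score, not a substitute for it.

```python
from difflib import SequenceMatcher


def toy_similarity(a, b):
    """Stand-in similarity in [0, 1]; NOT a ClustalW alignment score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def reduce_homology(sequences, threshold=0.80, similarity=toy_similarity):
    """Greedily keep sequences whose similarity to every kept one is <= threshold."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, other) <= threshold for other in kept):
            kept.append(seq)
    return kept
```

The `similarity` argument is pluggable so that a proper alignment-based score can replace the toy function; note that the result of a greedy filter depends on the input order.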
2.5 SVM training and performance evaluation
The mathematical and technical details of SVMs are not explained here, but have been described in detail by Vapnik (1999). In this study the LIBSVM (C.-C. Chang and C.-J. Lin, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/) software was used. The radial basis function kernel was used in all SVMs at all stages of the classification process and optimized by tuning the c and g parameters. The one-versus-one procedure (appropriate for multi-class classification) was adopted throughout the SVM training, in favor of the one-versus-all procedure. The probability estimates by LIBSVM were used for choosing the most probable classifications. Five-fold cross-validation was applied throughout the training and evaluation of both TargetLoc and MultiLoc. Special care was taken to ensure that no protein sequence used for training SVMTarget, SVMaac or SVMSA was used in the evaluation of either the TargetLoc or the MultiLoc performance. Five-fold cross-validation is a robust method for performance evaluation and is used in favor of leave-one-out cross-validation when the dataset is large enough, since it better avoids the danger of overfitting (Hastie et al., 2001).
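The 5-fold cross-validation scheme can be sketched in plain Python as below. This only shows how proteins might be divided into folds; it does not train an SVM. Balancing the localization classes across folds (stratification), the function names and the `seed` parameter are assumptions, since the text does not say how the folds were drawn.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each example index to one of k folds, balancing classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled members of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) once for each held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            index
            for number, fold in enumerate(folds)
            if number != held_out
            for index in fold
        )
        yield train, test
```

Each protein appears in exactly one test fold, so every prediction used for evaluation comes from a model that never saw that protein during training.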
The original dataset split used for training and testing the TargetP method is not available. Hence, in addition to the 5-fold cross-validation, a randomization process was performed by randomly splitting the data five times. This procedure delivers five parallel models for each classifier and enables a statistically sound and fair comparison between the TargetLoc, SVMTarget and TargetP methods. Specificity (SP), sensitivity (SE), and Matthews correlation coefficient (MCC) (Matthews, 1975) (a measure capturing both SP and SE) were calculated for all prediction methods. Furthermore, the overall accuracy (correct[%]) and standard deviation of the performances were calculated.
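For concreteness, the per-category measures can be computed from one-versus-rest counts as in the sketch below. The MCC follows the standard definition of Matthews (1975), and sensitivity is taken as TP/(TP+FN). Specificity is left out because the text does not state which definition was used; the function names and the example labels are illustrative assumptions.

```python
import math


def class_metrics(true_labels, predicted_labels, target):
    """Sensitivity and MCC for one category treated as positive vs. the rest."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == target and predicted == target:
            tp += 1
        elif truth != target and predicted == target:
            fp += 1
        elif truth == target and predicted != target:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"SE": sensitivity, "MCC": mcc}


def overall_accuracy(true_labels, predicted_labels):
    """Percentage of proteins assigned to their annotated category."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```

Repeating the calculation over the five random splits and taking the mean and standard deviation of the results gives the kind of summary described above.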
| 3 RESULTS |
|---|
3.1 TargetLoc
The performance of the new method TargetLoc was compared against TargetP and iPSORT using the TargetP dataset (Table 2). Five random splits of the datasets were used for training and evaluation, showing very low standard deviations. TargetP and iPSORT predict the four plant categories (chloroplast, mitochondrial, secretory pathway and other) with overall accuracies of 85.3 and 83.4%, respectively, whereas TargetLoc reaches an overall accuracy of 89.7%. A further method has reported an overall accuracy of 92.3%; however, this result cannot be directly compared with the others due to differences in the experimental validation (Chou and Cai, 2004). An increased performance is also observed for the three non-plant categories (mitochondrial, secretory pathway and other), where TargetLoc reaches an overall accuracy of 92.5%. The corresponding performances for TargetP and iPSORT are 90.0 and 88.5%, respectively. The most important improvement by TargetLoc compared with previously reported methods is in the discrimination between chloroplast and mitochondrial proteins, and classification of the other category. The MCCs for the chloroplast, mitochondrial, secretory pathway and other categories are reported to be 0.72, 0.77, 0.90 and 0.77 for TargetP, which have been significantly improved to 0.78, 0.84, 0.93 and 0.86 by TargetLoc.
3.2 MultiLoc
The overall accuracies of the three MultiLoc versions, animal, fungal and plant, reach about 75%. These results reflect a considerable improvement and should be compared with the corresponding values for PSORT, which range between 58 and 60% (Table 3). The fungal version of MultiLoc can be compared with the yeast version of PSORT, since they predict the same localizations. The SE, SP and MCC values for each version and localization of both MultiLoc and PSORT are presented in detail in Table 3. Using the MultiLoc animal version the MCC ranges between 0.44 for peroxisomal and 0.83 for mitochondrial proteins and the other two versions show similar results. These results can be directly compared with those of PSORT in the same table. The PSORT performance on the homology-reduced dataset is in agreement with earlier reports of PSORT performance, which was slightly below 60% using a smaller dataset (Nakai and Kanehisa, 1992; Nakai and Horton, 1999). The performance of the PSORT animal version varies widely with an MCC between 0.11 for proteins of the endoplasmic reticulum and 0.73 for plasma membrane proteins. The PSORT fungal version predicts the Golgi apparatus and vacuole localizations with very low MCC values of 0.04 and 0.08, respectively. The corresponding MCCs for MultiLoc show a clear improvement to 0.60 and 0.42.
The overall accuracy of MultiLoc is significantly higher and less dependent on the localization category compared with that of PSORT. Low standard deviations indicate that the different prediction models are robust. The effect of bringing together different sources of information is even more prominent when predicting 11 localizations. MultiLoc has an overall accuracy >74% for all three versions, which is a significant improvement when compared with that of the PSORT performance of <60%.
| 4 DISCUSSION |
|---|
We have shown that our new approach for predicting protein subcellular localization significantly improves the robustness and prediction reliability. In this approach several sources of biological information are integrated, thereby covering several aspects of the protein sorting process. Its successful application is exemplified through our new prediction methods TargetLoc and MultiLoc, which have been compared with other current state-of-the-art methods.
Predicting localization of proteins based on their N-terminal amino acid sequence is considered to be very reliable, as the overall prediction accuracy reaches about 85% with TargetP (Emanuelsson et al., 2000). The N-terminal amino acid sequences are important and highly characteristic for the precursor proteins of the chloroplast, mitochondrial and secretory pathway categories. Proteins of the other category, on the other hand, lack this N-terminal precursor sequence and can easily be mixed up with mature proteins from one of the three first categories.
Our new integrative approach, TargetLoc, predicts the same localization categories as TargetP but differs in a number of fundamental ways. Three complementary sources of biologically relevant information are brought together, namely, N-terminal sequence information, overall amino acid composition and protein-specific sequence motifs. In SVMTarget the encoding of the N-terminal targeting sequence reflects the meaningful amino acid composition (Clausmeyer et al., 1993), rather than the primary sequence. Proteins in the chloroplast and mitochondrial categories are specifically discriminated by SVMTarget through the inclusion of the composition differences known to exist within the 15 most N-terminal amino acids (Creissen et al., 1995). The overall amino acid composition is used for capturing subtle differences between the different categories (SVMaac). MotifSearch identifies secretory pathway and other proteins by their relatively high probability to carry one of the protein sequence motifs characteristic to the mixed group of proteins in these categories. TargetLoc has improved the overall accuracy compared with TargetP from 85.3 to 89.7% and from 90.0 to 92.5% for the plant and the non-plant versions, respectively.
Predicting these N-terminal categories is reliable and useful; nevertheless, the approach suffers from two major drawbacks. First, only four localizations are predicted for plant proteins, although there are at least ten main localizations, and only three for non-plant proteins, although there are nine main localizations. The further intraorganellar sorting is completely neglected. Second, precursor protein sequences of the chloroplast, mitochondrial and secretory pathway categories are not always available and the cleavage sites of the targeting sequences are not trivial to identify (Frishman et al., 1999). MultiLoc was developed in order to meet the need for a fine-grained and reliable prediction system for protein subcellular localization. Utilizing biological knowledge for modeling the biological sorting process, an extensive homology-reduced dataset and the strong predictive power of SVMs proved successful. As strict cross-validation was used for evaluation, the performance of MultiLoc (75%) can be directly compared with that of PSORT (<60%). The low standard deviations indicate that MultiLoc is robust. The major improvements were made for the cytoplasm, endoplasmic reticulum, Golgi apparatus, lysosome, peroxisome and vacuole localizations, which have been a major challenge so far.
Several biological aspects of protein sorting are yet to be understood. In the meantime it is useful to include as much information about each protein as possible when designing prediction models (Dönnes and Höglund, 2004). MultiLoc performs outstandingly well for a wide range of localizations, reaching levels of accuracy where it becomes interesting to investigate the incorrectly predicted proteins in greater detail. It is likely that some of these incorrectly predicted proteins are to be found in multiple localizations (multiplex localizations) (Creissen et al., 1995; Menand et al., 1998; Cai and Chou, 2004), which has also been considered computationally (Chou and Cai, 2005). We have also taken an important step towards high-reliability predictions of all eukaryotic subcellular localizations. Interesting computational challenges lie ahead, the main one still being to mirror the biological events in the protein sorting process. The concept of a PPV makes the prediction method easily extendable. Future improvements of TargetLoc and MultiLoc include careful selection of additional protein features for the PPV and extension of the prediction to cover the further intraorganellar sorting of mitochondrial and chloroplast proteins. Integrating multiple sources of information and simulating the structure of the sorting process will probably be the key to inferring protein function from subcellular localization.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on November 10, 2005; revised on December 9, 2005; accepted on January 12, 2006
| REFERENCES |
|---|
Andrade, M.A., et al. (1998) Adaption of protein surfaces to subcellular location. J. Mol. Biol, . 276, 517525[CrossRef][Web of Science][Medline].
Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement in TrEMBL in 2000. Nucleic Acids Res, . 28, 4548
Bairoch, A. and Bucher, P. (1994) PROSITE: recent developments. Nucleic Acids Res, . 22, 35833589[Web of Science][Medline].
Bannai, H., et al. (2002) Extensive feature detection of N-terminal protein sorting signals. Bioinformatics, 18, 298305
Bhasin, M. and Raghava, G.P. (2004) ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Res., 32 (Web Server issue), W414-W419.
Bhasin, M., et al. (2005) PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics.
Cai, Y.D. and Chou, K.C. (2003) Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochem. Biophys. Res. Commun., 305, 407-411.
Cai, Y.D. and Chou, K.C. (2004a) Predicting 22 protein localizations in budding yeast. Biochem. Biophys. Res. Commun., 323, 425-428.
Cai, Y.D. and Chou, K.C. (2004b) Predicting subcellular localization of proteins in a hybridization space. Bioinformatics, 20, 1151-1156.
Chou, K.C. and Cai, Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765-45769.
Chou, K.-C. and Cai, Y.-D. (2004) Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition. J. Cell. Biochem., 91, 1197-1203.
Chou, K.-C. and Cai, Y.-D. (2005) Predicting protein localization in budding yeast. Bioinformatics, 21, 944-950.
Clausmeyer, S., et al. (1993) Protein import into chloroplasts. The hydrophilic lumenal proteins exhibit unexpected import and sorting specificities in spite of structurally conserved transit peptides. J. Biol. Chem., 268, 13869-13876.
Cokol, M., et al. (2000) Finding nuclear localization signals. EMBO Rep., 1, 411-415.
Creissen, G., et al. (1995) Simultaneous targeting of pea glutathione reductase and of a bacterial fusion protein to chloroplasts and mitochondria in transgenic tobacco. Plant J., 8, 167-175.
Dönnes, P. and Höglund, A. (2004) Predicting protein subcellular localization: past, present, and future. Genomics, Proteomics, Bioinformatics, 2, 209-215.
Emanuelsson, O., et al. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016.
Frishman, D., et al. (1999) Starts of bacterial genes: estimating the reliability of computer predictions. Gene, 234, 257-265.
Garg, A., et al. (2005) Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J. Biol. Chem., 280, 14427-14432.
Hastie, T., et al. (2001) The Elements of Statistical Learning. Springer-Verlag, NY.
Helenius, A. and Aebi, M. (2001) Intracellular functions of N-linked glycans. Science, 291, 2364-2369.
Höglund, A., et al. (2005) From prediction of subcellular localization to functional classification: discrimination of DNA-packing and other nuclear proteins. Online J. Bioinformatics, 6, 51-64.
Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
Marcotte, E.M., et al. (2000) Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl Acad. Sci. USA, 97, 12115-12120.
Matthews, B.W. (1975) Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442-451.
Menand, B., et al. (1998) A single gene of chloroplast origin codes for mitochondrial and chloroplastic methionyl-tRNA synthetase in Arabidopsis thaliana. Proc. Natl Acad. Sci. USA, 95, 11014-11019.
Nair, R., et al. (2003) NLSdb: database of nuclear localization signals. Nucleic Acids Res., 31, 397-399.
Nair, R. and Rost, B. (2005) Mimicking cellular sorting improves prediction of subcellular localization. J. Mol. Biol., 348, 85-100.
Nakai, K. and Horton, P. (1999) PSORT: a program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci., 24, 34-36.
Nakai, K. and Kanehisa, M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.
Park, K.-J. and Kanehisa, M. (2003) Prediction of protein subcellular location by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics, 19, 1656-1663.
Pfeffer, S.R. and Rothman, J.E. (1987) Biosynthetic transport and sorting by the endoplasmic reticulum and Golgi. Annu. Rev. Biochem., 56, 829-852.
Reinhardt, A. and Hubbard, T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230-2236.
Rost, B., et al. (2003) Automatic prediction of protein function. Cell. Mol. Life Sci., 60, 2637-2650.
Rusch, S.L. and Kendall, D.A. (1995) Protein transport via amino-terminal targeting sequences: common themes in diverse systems. Mol. Membr. Biol., 12, 295-307.
Scott, M., et al. (2004) Predicting subcellular localization via protein motif co-occurrence. Genome Res., 14, 1957-1966.
Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Vapnik, V.N. (1999) The Nature of Statistical Learning Theory. Wiley, NY.
Ying, H. and Yanda, L. (2004) Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics, 20, 21-28.
Yuan, Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451, 23-26.