Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(7):944-950; doi:10.1093/bioinformatics/bti104
Predicting protein localization in budding Yeast
1Gordon Life Science Institute San Diego, CA 92130, USA
2Shanghai Jiaotong University, Biomedical Engineering Shanghai 200030, China
3Tianjin Institute of Bioinformatics and Drug Discovery (TIBDD) Tianjin, China
4Biomolecular Sciences Department, UMIST Manchester, M60 1QD, UK
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Most existing methods for predicting protein subcellular location deal with only two to five localizations, and only a few of them can be effectively extended to cover 12-14 localizations. This is because the more locations are involved, the poorer the success rate becomes. Besides, some proteins may occur in several different subcellular locations, i.e. bear the feature of multiplex locations. So far there is no method that can effectively treat the difficult multiplex location problem. The present study was initiated in an attempt to address (1) how to efficiently identify the localization of a query protein among many possible subcellular locations, and (2) how to deal with the case of multiplex locations.
Results: By hybridizing gene ontology, functional domain and pseudo amino acid composition approaches, a new method has been developed that can be used to predict subcellular localization of proteins with the multiplex location feature. A global analysis of the proteins in budding yeast classified into 22 locations was performed by jack-knife cross-validation with the new method. The overall success rate thus obtained is 70%. In contrast, the corresponding rates obtained by some other existing methods were only 13-14%, indicating that the new method is very powerful and promising. Furthermore, predictions were made for the four proteins whose localizations could not be determined by experiments, as well as for the 236 proteins whose localizations in budding yeast were ambiguous according to experimental observations. According to our predicted results, many of these ambiguous proteins were found to have the same score and ranking for several different subcellular locations, implying that they may simultaneously exist, or move around, in these locations. This finding is intriguing because it reflects the dynamic feature of these proteins in a cell that may be associated with some special biological functions.
Contact: kchou{at}san.rr.com
Supplementary information: www.pami.sjtu.edu.cn/kcchou
| 1 INTRODUCTION |
|---|
One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of the compartments that organize them in the cellular environment. To realize this, it is indispensable to first identify the subcellular locations of proteins. However, it is time-consuming and costly to determine the localization of a newly found protein in a cell purely by experiment, particularly now that the number of protein sequences is growing extremely fast. For instance, the total number of protein sequences in the SWISS-PROT databank was only 3939 in 1986, and the number has now jumped to 162,781 according to version 44.7 released on October 11, 2004. This is more than 41 times the size in 1986! With the explosion in the number of sequences, it is highly desirable to develop an automated method to quickly identify the subcellular location of a newly found protein. Indeed, many efforts have been made (Cedano et al., 1997; Chou, 2001; Chou and Cai, 2002, 2003a, b; Chou and Elrod, 1999b; Emanuelsson et al., 2000; Feng, 2001; Hua and Sun, 2001; Nakai and Kanehisa, 1991, 1992; Nakashima and Nishikawa, 1994; Pan et al., 2003; Park and Kanehisa, 2003; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003) during the last decade or so. The development in this area has generally followed two trends. One is to improve the prediction quality by extracting more and more useful information from a protein sequence, progressing from the amino acid composition (Cedano et al., 1997; Reinhardt and Hubbard, 1998), to the amino acid pair composition (Park and Kanehisa, 2003), to the pseudo amino acid composition (Chou, 2001; Pan et al., 2003) and to the functional domain composition (Cai et al., 2003; Chou and Cai, 2002).
The other trend is to enhance the practical application value by enlarging the coverage scope, such as from the scope of covering only two subcellular locations (Nakashima and Nishikawa, 1994) to five locations (Cedano et al., 1997) to 12 locations (Chou and Elrod, 1999b; Park and Kanehisa, 2003), and to 14 locations (Chou and Cai, 2003b). Recently, using the GFP (green fluorescent protein) fluorescence technique, Huh et al. (2003) made a global analysis of protein localization in budding yeast, classifying the proteins into 22 distinct subcellular localization categories. Compared with the previous datasets, the dataset determined experimentally by these authors not only covers the largest scope so far, but also reflects the fact that some proteins may occur in several different subcellular locations; i.e. have the attribute with multiplex locations. Actually, all the previous methods were developed to deal with only the mono-location case where a given protein is assumed to belong to one, and only one, subcellular location. Now we are facing a multi-location problem. How to deal with the case of multiplex locations is a big challenge that was always artificially avoided in the previous treatments. The present study is devoted to addressing this problem.
| 2 SYSTEMS AND METHODS |
|---|
The experimental classification results by Huh et al. (2003) can be downloaded from the website http://www.yeastgfp.ucsf.edu. After excluding those whose sequences are not available, we have 4115 proteins, of which four (YFL030W, YJL057C, YJL107C and YLR426W) have no assigned subcellular location and 236 have ambiguous locations. Thus, we have 4115 − 4 − 236 = 3875 proteins left. The remaining 3875 proteins, which are clearly classified into 22 distinct subcellular locations, can serve as a solid basis for further development in predicting protein subcellular locations. Meanwhile, the 3875 proteins will also serve as a training dataset to predict the 4 + 236 = 240 proteins whose subcellular locations are unknown or ambiguous. A breakdown of the 3875 proteins into 22 subcellular locations is given in Table 1, from which we can see that, owing to the fact that some proteins coexist in several different subcellular locations (the multiplex location feature mentioned above), the total number of different proteins, $N$, is smaller than the total number of classified proteins, $\tilde{N}$. The relationship between these two is given by

$$\tilde{N} = N + \sum_{m=1}^{22} (m-1)\,N(m) \tag{1}$$

where $N(m)$ is the number of proteins that occur simultaneously in $m$ different subcellular locations. For instance, of the $N = 3875$ proteins provided by Huh et al. (2003), 2698 ($= N(1)$) occur in only one subcellular location, 1106 ($= N(2)$) in two different locations, 63 ($= N(3)$) in three different locations, 7 ($= N(4)$) in four different locations, 1 ($= N(5)$) in five different locations, and 0 in $m$ ($= 6, 7, \ldots, 22$) different locations. Substituting these numbers into Equation (1) we have

$$\tilde{N} = 3875 + (0 \times 2698 + 1 \times 1106 + 2 \times 63 + 3 \times 7 + 4 \times 1) = 3875 + 1257 = 5132 \tag{2}$$

which is the total number of classified proteins derived from Table 1.
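The bookkeeping between different proteins and classified proteins can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative, and the single-location count is taken as 2698, the value for which the five counts add up to the 3875 different proteins.

```python
# Number of proteins found in exactly m locations (zero for m = 6..22).
# Assumption: the m = 1 count is 2698, so that the counts sum to 3875.
proteins_by_multiplicity = {1: 2698, 2: 1106, 3: 63, 4: 7, 5: 1}

# Each protein is counted once.
n_different = sum(proteins_by_multiplicity.values())

# Each protein is counted once per location it occupies.
n_classified = sum(m * n for m, n in proteins_by_multiplicity.items())

# Extra entries contributed by proteins with more than one location.
n_extra = sum((m - 1) * n for m, n in proteins_by_multiplicity.items())

print(n_different, n_extra, n_classified)
```

The difference between the two totals is exactly the number of extra location entries contributed by proteins that occupy more than one location.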
The key to improving the prediction quality of the protein subcellular location is to grasp the core features of a protein that are intimately related to the current theme, and then use these features to represent it. In this sense, we can use the source of Gene Ontology (GO) Consortium (Ashburner et al., 2000) as a vehicle to formulate the prediction algorithm. The term ontology was originally borrowed from philosophy, where an ontology is a systematic account of existence. In other words, an ontology is an explicit specification of a conceptualization. In the GO database, gene products are organized according to the following three principles in a species-independent manner: cellular components, molecular function and biological process.
The first principle is directly related to the subcellular localization, while the other two are associated with the molecular function of a protein and its acting object, and hence are also closely relevant to the subcellular location of a protein (Alberts et al., 1994; Chou and Elrod, 1999a). Accordingly, it is anticipated that the prediction quality will be significantly improved if the GO database is used to define proteins according to the following steps.
Step 1 Mapping InterPro (Apweiler et al., 2001) entries to GO, one can get a list of data called InterPro2GO (ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro2go/), where each InterPro entry corresponds to a GO number. Since a protein may have one or more molecular functions, be used in one or more biological processes, and be associated with one or more cellular components, the relationships between InterPro and GO may be one-to-many. For instance, the InterPro entry IPR_000003 corresponds to GO_0003677, GO_0004879, GO_0005496, GO_0006355 and GO_0005634. Also, since the current GO database is still far from complete, some InterPro entries (such as IPR_000001, IPR_000002 and IPR_000004) do not have corresponding GO numbers in the InterPro2GO list.
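The one-to-many mapping of Step 1 can be pictured as a dictionary lookup. The sketch below is illustrative only: the dictionary holds just the single example quoted above, and the function name `go_terms_for` is hypothetical rather than part of any InterPro tool.

```python
# Toy excerpt of an InterPro-to-GO mapping, using only the example in the text.
interpro2go = {
    "IPR_000003": [
        "GO_0003677", "GO_0004879", "GO_0005496", "GO_0006355", "GO_0005634",
    ],
}

def go_terms_for(interpro_entries, mapping):
    """Collect every GO number reachable from a protein's InterPro entries."""
    terms = set()
    for entry in interpro_entries:
        # Entries without a GO mapping (e.g. IPR_000001) contribute nothing.
        terms.update(mapping.get(entry, []))
    return terms

hits = go_terms_for(["IPR_000001", "IPR_000003"], interpro2go)
no_hits = go_terms_for(["IPR_000002", "IPR_000004"], interpro2go)
```

A protein whose InterPro entries all lack GO numbers ends up with an empty set, which is the "no hit" situation handled by the later steps.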
Step 2 The GO numbers in the InterPro2GO database are not consecutive, and hence an operation to reorganize and compress the GO numbers obtained in Step 1 is needed. For example, after such an operation, the original GO numbers GO_0000012, GO_0000015, GO_0000030, ..., GO_0046413 would become GO-compress_0000001, GO-compress_0000002, GO-compress_0000003, ..., GO-compress_0001930, respectively. The database thus obtained is called the GO-compress database or the 1930D GO database, whose dimension has been reduced to 1930 from the 46,413 of the original GO database.
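The compression of Step 2 amounts to renumbering the GO numbers that actually occur so that they become consecutive. A minimal sketch, assuming the renumbering follows ascending GO order (the function name is hypothetical):

```python
def compress_go_ids(go_ids):
    """Map each distinct GO number to a consecutive 1-based index."""
    # Zero-padded identifiers sort lexicographically in numeric order.
    unique_sorted = sorted(set(go_ids))
    return {go_id: index for index, go_id in enumerate(unique_sorted, start=1)}

observed = ["GO_0000030", "GO_0000012", "GO_0000015", "GO_0000012"]
go_index = compress_go_ids(observed)
```

With the full InterPro2GO list as input, the resulting dictionary would have one entry per distinct GO number, which is what fixes the dimension of the compressed space.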
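Step 3 Each of the 1930 GO numbers will serve as a base to define a protein P in terms of the following 1930D (dimensional) vector:

$$P = [a_1, a_2, \ldots, a_i, \ldots, a_{1930}]^{T} \tag{3}$$

where $a_i = 1$ if a hit is found for the protein against the $i$th GO-compress number, and $a_i = 0$ otherwise.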
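The vector of Step 3 can be sketched as follows, assuming each component is a binary hit indicator. The three-entry index stands in for the full 1930-entry GO-compress index, and the function name is hypothetical.

```python
def go_vector(protein_go_ids, go_index):
    """Build a binary vector with a 1 at every GO-compress position that is hit."""
    vector = [0] * len(go_index)
    for go_id in protein_go_ids:
        position = go_index.get(go_id)  # None when the GO number is not indexed
        if position is not None:
            vector[position - 1] = 1
    return vector

# Toy index (1-based positions), standing in for the 1930D GO-compress space.
toy_index = {"GO_0000012": 1, "GO_0000015": 2, "GO_0000030": 3}

with_hits = go_vector({"GO_0000030", "GO_0000012"}, toy_index)
without_hits = go_vector({"GO_9999999"}, toy_index)
```

A protein with no hits yields an all-zero vector, which is the case that triggers the fall-back to the functional domain space.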
Step 4 If no hit (i.e. no corresponding GO number) is found in the entire 1930D GO-compress space, the protein P formulated by Equation (3) will correspond to a naught vector. To cope with such a circumstance, the protein should be defined in the 7785D FunD (Functional Domain composition) space (Apweiler et al., 2001), as given below:

$$P = [d_1, d_2, \ldots, d_i, \ldots, d_{7785}]^{T} \tag{4}$$

where $d_i = 1$ if a hit is found for the protein against the $i$th functional domain, and $d_i = 0$ otherwise.
Step 5 If no hit is found even in the entire 7785D FunD space, the protein should be defined in the (20 + $\lambda$)D PseAA (Pseudo Amino Acid composition) space, as given below:

$$P = [p_1, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^{T} \tag{5}$$

where the first 20 components reflect the conventional amino acid composition, and it is the additional $\lambda$ components in Equation (5) that incorporate some sequence-order effects into the vector representation of a protein. Generally speaking, the larger the number $\lambda$, the more sequence-order effects are incorporated. However, $\lambda$ cannot exceed the length of a protein (i.e. the number of its total residues). Also, if $\lambda$ is too large, the overall success rate by jack-knife tests might be reduced (Chou, 2001). Therefore, for different training datasets, $\lambda$ may have different optimal values. For the current study, the optimal value of $\lambda$ is 37. Given a protein, the (20 + 37) = 57 pseudo amino acid components in Equation (5) can be easily derived by following the procedures described in Chou (2001), the paper that introduced the concept of pseudo-amino acid composition. Thus, a protein that corresponds to a naught vector in both the 1930D GO space [Equation (3)] and the 7785D FunD space [Equation (4)] can always be explicitly defined in the 57D PseAA space [Equation (5)].
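To make the (20 + λ)D representation concrete, here is a sketch in the spirit of the pseudo amino acid composition of Chou (2001): 20 normalized frequencies followed by λ weighted sequence-order correlation factors. The weight of 0.05, the function name and the `toy_properties` scale are assumptions for illustration; the toy scale is a placeholder and not a real physicochemical property table.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def pseudo_amino_acid_composition(sequence, properties, lam, weight=0.05):
    """Return a (20 + lam)-component vector for a sequence of standard residues."""
    length = len(sequence)
    if not 0 <= lam < length:
        raise ValueError("lam must be smaller than the sequence length")
    # Conventional amino acid composition: normalized occurrence frequencies.
    freqs = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    # theta_j: mean squared property difference between residues j apart.
    thetas = []
    for j in range(1, lam + 1):
        total = 0.0
        for i in range(length - j):
            a, b = properties[sequence[i]], properties[sequence[i + j]]
            total += sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
        thetas.append(total / (length - j))
    denominator = sum(freqs) + weight * sum(thetas)
    return [f / denominator for f in freqs] + [weight * t / denominator for t in thetas]

# Placeholder property scale (hypothetical values, one property per residue).
toy_properties = {aa: [float(i)] for i, aa in enumerate(AMINO_ACIDS)}
pseaa = pseudo_amino_acid_composition("MKVLAAGIVGLLLA", toy_properties, lam=3)
```

Because every sequence has an amino acid composition, this representation is never a naught vector, which is why it can serve as the last fall-back.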
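The prediction was performed with the ISort (Intimate Sorting) predictor, which is briefly described below. Suppose there are $N$ proteins ($P_1, P_2, \ldots, P_N$) which have been classified into categories $1, 2, \ldots, \mu$. Now, for a query protein $P$, how can we predict which category it belongs to? To deal with this problem, let us define the following scale function to measure the similarity between $P$ and $P_i$ ($i = 1, 2, \ldots, N$):

$$\Lambda(P, P_i) = \frac{P \cdot P_i}{\|P\|\,\|P_i\|} \tag{6}$$

where $P \cdot P_i$ is the dot product of the vectors $P$ and $P_i$, and $\|P\|$ and $\|P_i\|$ are their respective moduli. When $P \equiv P_i$, we have $\Lambda(P, P_i) = 1$, meaning they have perfect or 100% similarity. Generally speaking, the similarity is within the range of 0 and 1; i.e. $0 \le \Lambda(P, P_i) \le 1$. Accordingly, the ISort predictor can be formulated as follows. If the similarity between $P$ and $P_k$ ($k = 1, 2, \ldots, N$) is the highest, i.e.

$$\Lambda(P, P_k) = \max\{\Lambda(P, P_1), \Lambda(P, P_2), \ldots, \Lambda(P, P_N)\} \tag{7}$$

then the query protein $P$ is predicted to belong to the same category as $P_k$.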
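The nearest-neighbour rule behind ISort can be sketched in a few lines. The sketch assumes that the scale function of Equation (6) is the normalized dot product (cosine) of the two vectors, and that ties at the top score are all reported; the function names and toy vectors are illustrative.

```python
import math

def similarity(p, q):
    """Normalized dot product of two vectors (0 when either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def isort_predict(query, training):
    """Return the locations of every training entry that attains the top score."""
    scores = [(similarity(query, vector), location) for vector, location in training]
    best = max(score for score, _ in scores)
    return sorted({loc for score, loc in scores if math.isclose(score, best)})

# Toy training set: the first two entries are one protein listed in two locations.
training = [
    ([1, 0, 1, 0], "nucleus"),
    ([1, 0, 1, 0], "cytoplasm"),
    ([0, 1, 0, 1], "mitochondrion"),
]
predicted = isort_predict([1, 0, 1, 1], training)
```

Because the same vector appears once per location, a query closest to that protein inherits all of its locations at once.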
During the course of prediction, the following self-consistency principle should be followed. If a query protein can be defined in the 1930D GO space [Equation (3)], then the prediction should be carried out based on those proteins in the training set that can also be defined in the same 1930D GO space. If all of the components for the query protein in the 1930D GO space are zero and hence it is defined by shifting to the 7785D functional domain space [Equation (4)], then the prediction should be conducted on the basis that all the rule parameters are derived from the same 7785D space. Finally, if all the components for the query protein in the 7785D functional domain space are also zero and its definition must be made by shifting to the (20 + $\lambda$)D PseAA space [Equation (5)], then the prediction should be carried out according to the principle that all the proteins in the training dataset be defined in the same PseAA space as well.
Accordingly, the current ISort predictor actually consists of three subpredictors: (1) the ISort-1930D GO predictor that operates in the compressed 1930D gene ontology space, (2) the ISort-7785D FunD predictor that operates in the 7785D functional domain composition space, and (3) the ISort-57D PseAA predictor that operates in the 57D pseudo-amino acid composition space with $\lambda = 37$. The entire process is called the GO-FunD-PseAA hybridization approach.
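The fall-back order and the self-consistency principle can be sketched as a small dispatch routine. This is one possible reading, not the authors' code: a training protein is taken to be "defined" in the GO or FunD space when its vector there is non-zero, and every protein is taken to be defined in the PseAA space. The dictionary keys and function names are hypothetical.

```python
def choose_space(protein):
    """Pick the first space, in GO -> FunD -> PseAA order, with a non-zero vector."""
    if any(protein["go"]):
        return "GO"
    if any(protein["fund"]):
        return "FunD"
    return "PseAA"

def defined_in(protein, space):
    """Whether a protein has a usable (non-zero) representation in a space."""
    if space == "GO":
        return any(protein["go"])
    if space == "FunD":
        return any(protein["fund"])
    return True  # a PseAA vector exists for every sequence

def consistent_training_set(query, training):
    """Return the query's space and the training proteins defined in that space."""
    space = choose_space(query)
    return space, [p for p in training if defined_in(p, space)]

training = [
    {"name": "A", "go": [1, 0], "fund": [0, 1]},
    {"name": "B", "go": [0, 0], "fund": [1, 0]},
    {"name": "C", "go": [0, 0], "fund": [0, 0]},
]
query = {"name": "Q", "go": [0, 0], "fund": [1, 1]}
space, usable = consistent_training_set(query, training)
```

Here the query has no GO hit, so it is handled in the FunD space and compared only with training proteins that have functional domain hits.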
| 3 SOME REMARKS ABOUT THE MONO-LOCATION AND MULTI-LOCATION PREDICTIONS |
|---|
As mentioned at the beginning, all the previous studies (Cedano et al., 1997; Chou, 2001; Chou and Cai, 2002, 2003a, b; Chou and Elrod, 1999b; Emanuelsson et al., 2000; Feng, 2001; Hua and Sun, 2001; Nakai and Kanehisa, 1991, 1992; Nakashima and Nishikawa, 1994; Pan et al., 2003; Park and Kanehisa, 2003; Reinhardt and Hubbard, 1998; Zhou and Doctor, 2003) were confined to within the scope of mono-location prediction. Here we are facing a multi-location problem, i.e. some proteins may coexist in several different subcellular locations. To deal with this kind of situation, it is instructive to highlight the difference between the mono-location and multi-location predictions according to the following points.
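Training dataset For the mono-location case where a given protein belongs to one, and only one, subcellular location, the total number of samples in the training dataset can be expressed as

$$N = \sum_{m=1}^{\mu} n_m \tag{8}$$

where $n_m$ is the number of proteins in the $m$th subset. For the multi-location case, however, we have

$$N < \tilde{N} = \sum_{m=1}^{\mu} n_m \tag{9}$$

where $N$ is the number of different proteins and $\tilde{N}$ the number of classified proteins, because a protein may simultaneously occur in several different subsets. This implies that $N$ in Equations (6) and (7) should be replaced by $\tilde{N}$ during the process of prediction.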
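One way to picture the multi-location training set is to list each protein once per location it occupies, so that the number of training entries equals the sum of the subset sizes. The sketch below assumes that reading; the record layout and function name are illustrative, and the location assignments are toy values.

```python
def expand_multilabel(proteins):
    """Turn (name, vector, locations) records into one entry per location."""
    return [
        (name, vector, location)
        for name, vector, locations in proteins
        for location in locations
    ]

# Toy records with made-up location assignments.
proteins = [
    ("protein_1", [1, 0, 0], ["nucleus"]),
    ("protein_2", [0, 1, 1], ["cytoplasm", "nucleus", "bud"]),
]
entries = expand_multilabel(proteins)
```

Two different proteins give four training entries here, which mirrors how 3875 different proteins give 5132 classified proteins.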
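Success rate Suppose the proteins in budding yeast form a set $S$, which is the union of the 22 subsets; i.e.

$$S = S_1 \cup S_2 \cup \cdots \cup S_{22} \tag{10}$$

Suppose also that the prediction of a mono-location predictor $\Omega$ on $P_k^m$, the $k$th protein in the $m$th subset, is the location belonging to the $\nu$th subset; i.e.

$$\Omega(P_k^m) = \nu \tag{11}$$

Then the overall success rate is given by

$$\Theta = \frac{\sum_{m=1}^{22} \sum_{k=1}^{n_m} \delta\big(\Omega(P_k^m), m\big)}{\sum_{m=1}^{22} n_m} \tag{12}$$

where $n_m$ is the number of proteins in the $m$th subset and

$$\delta(\nu, m) = \begin{cases} 1, & \text{if } \nu = m \\ 0, & \text{otherwise} \end{cases} \tag{13}$$

If $\Omega$ is a multi-location predictor, instead of Equation (11) we should have

$$\Omega(P_k^m) = \Psi_k^m \tag{14}$$

where $\Psi_k^m$ is not a number but a set that is formed by one or more of the 22 subsets in Equation (10). Thus, the overall success rate is defined by

$$\Theta = \frac{\sum_{m=1}^{22} \sum_{k=1}^{n_m} \Delta\big(\Psi_k^m, m\big)}{\sum_{m=1}^{22} n_m} \tag{15}$$

where the $\Delta$ function is defined by

$$\Delta\big(\Psi_k^m, m\big) = \begin{cases} 1, & \text{if } S_m \in \Psi_k^m \\ 0, & \text{otherwise} \end{cases} \tag{16}$$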
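A multi-location success rate of this kind can be sketched by scoring every classified entry, i.e. every (protein, location) pair, as a hit when the true location lies in the predicted set. This counting rule is an assumption made for illustration; the names and toy data are hypothetical.

```python
def overall_success_rate(classified_entries, predicted_sets):
    """Fraction of (protein, location) entries whose location was predicted."""
    hits = sum(
        1 for protein, location in classified_entries
        if location in predicted_sets[protein]
    )
    return hits / len(classified_entries)

# Protein "a" is classified in two locations, protein "b" in one.
classified_entries = [("a", "nucleus"), ("a", "cytoplasm"), ("b", "ER")]
predicted_sets = {"a": {"nucleus"}, "b": {"ER", "Golgi"}}
rate = overall_success_rate(classified_entries, predicted_sets)
```

Note that the denominator is the number of classified entries, not the number of different proteins, so a protein with several locations contributes several trials.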
Score of the scale function $\Lambda(P, P_i)$ The prediction is governed by the score of the similarity scale function according to Equation (6). Its interpretation is quite straightforward for the mono-location case; i.e. if $\Lambda(P, P_2)$ has the highest score, then the query protein $P$ is predicted to belong to the same location as $P_2$, the 2nd protein in the training dataset. For the multi-location case, however, the following two points should be noted. First, if $P_2$ belongs to three different subcellular locations, then three identical highest scores are expected, each corresponding to one of the three locations, and the query protein $P$ is predicted to belong to these three locations as well. Secondly, as additional information, the results for the 2nd highest score and the 3rd highest score are also provided here.
| 4 RESULTS AND DISCUSSION |
|---|
The computation was performed on a Silicon Graphics IRIS Indigo workstation (Elan 4000). According to Steps 1-5 as described in Section 2, we obtained the following results (Table 2). For the 3875 different protein sequences in budding yeast, 2571 got hits in the GO database and hence were defined in the 1930D GO space, 539 of the remainder got hits in the FunD database and were hence defined in the 7785D FunD space, and finally the 765 proteins left were defined in the 57D PseAA space. For the 5132 classified proteins, the corresponding breakdown numbers are also given in Table 2. This means that if only the GO database was used, 3875 − 2571 = 1304 proteins in budding yeast would have no definition, leading to a failure to identify their localization. Even after incorporating the InterPro FunD database, we still have 765 proteins without definition (Table 2). That is why it is so important to hybridize with the pseudo-amino acid composition (PseAA), by which not only can a protein always be defined but its sequence-order effects may also be considerably reflected (Chou, 2001). Thus, the hybrid algorithm was operated according to the following procedure: if a query protein was defined in the GO database, then the ISort-1930D GO predictor was used to predict its subcellular location; if the query protein could not be defined in the GO database but could be defined in the InterPro FunD database, then the ISort-7785D FunD predictor was used; if the query protein could be defined neither in the GO database nor in the InterPro FunD database, then the ISort-57D PseAA predictor was used.
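The coverage figures quoted above can be cross-checked with simple arithmetic. The snippet only restates numbers from the text; the variable names are illustrative.

```python
total_proteins = 3875
defined_in_go = 2571
defined_in_fund = 539
defined_in_pseaa = 765

# Proteins left undefined if only the GO database were used.
undefined_after_go = total_proteins - defined_in_go

# Proteins still undefined after adding the functional domain database.
undefined_after_fund = undefined_after_go - defined_in_fund

print(undefined_after_go, undefined_after_fund)
```

The proteins that remain after both databases are exactly those assigned to the PseAA space, so the three groups partition the dataset.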
As is well known, in statistical prediction the single independent dataset test, the sub-sampling test and the jack-knife test are the three methods often used for cross-validation. Of these three, the jack-knife test is deemed the most rigorous and objective [see the review by Chou and Zhang (1995) for a comprehensive discussion about this, and the monograph by Mardia et al. (1979) for the underlying mathematical principle]. Therefore, the jack-knife test has been used by more and more investigators (Feng, 2001; Hua and Sun, 2001; Pan et al., 2003; Yuan, 1999; Zhou, 1998; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003) in examining the power of various prediction methods. With the current approach, the success rates by the jack-knife cross-validation for the 5132 classified proteins in budding yeast are given in Table 3, from which we can see that the overall success rate is 70.07%. It is instructive to mention the following two points. First, by following the procedures described in Equations (1), (9) and (14)-(16), those predictors which were established based on the amino acid composition, such as the Least Euclidean Distance algorithm (Nakashima and Nishikawa, 1994; Nakashima et al., 1986), the Least Hamming Distance algorithm (Chou, 1989) and the ProtLoc predictor (Cedano et al., 1997), can be extended to deal with the multi-location case as well. However, the corresponding success rates obtained by those predictors were only 13.89, 14.03 and 13.95%, respectively. This implies that the success rate of the present approach is more than 56 percentage points higher. Secondly, as shown in Table 3, if ranking II (results with the 2nd highest score) and ranking III (results with the 3rd highest score) were also counted, the likelihood of hitting the localization of a protein in budding yeast could be as high as 90%.
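A jack-knife (leave-one-out) test can be sketched as follows. The sketch makes several assumptions that the text does not spell out: each protein is left out together with all of its location entries, the prediction is the union of the locations of the most similar remaining protein or proteins, cosine similarity is used, and a hit is counted for each true location that is predicted. All names and the toy data are illustrative.

```python
import math

def cosine(p, q):
    """Normalized dot product of two vectors (0 when either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def jackknife_success_rate(proteins):
    """proteins: list of (vector, set_of_locations); returns hits / location entries."""
    hits = total = 0
    for i, (vector, true_locations) in enumerate(proteins):
        rest = proteins[:i] + proteins[i + 1:]  # leave the i-th protein out
        scores = [cosine(vector, other) for other, _ in rest]
        best = max(scores)
        predicted = set()
        for score, (_, locations) in zip(scores, rest):
            if math.isclose(score, best):
                predicted |= locations
        hits += len(true_locations & predicted)
        total += len(true_locations)
    return hits / total

toy_proteins = [
    ([1, 0, 0], {"nucleus"}),
    ([1, 0.1, 0], {"nucleus", "cytoplasm"}),
    ([0, 1, 0], {"ER"}),
    ([0, 1, 0.1], {"ER"}),
]
rate = jackknife_success_rate(toy_proteins)
```

In this toy set four of the five location entries are recovered; the missed one is the second location of the two-location protein, whose nearest neighbour carries only one location.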
Now let us use the 5132 classified proteins (Huh et al., 2003) as the training dataset to predict the four proteins whose subcellular locations could not be determined by experiments and the 236 proteins whose subcellular locations were ambiguous (Huh et al., 2003). The predicted results for the four location-unknown proteins are given in Table 4, where the roman numerals (I, II and III) reflect the ranking of likelihood. For example, nuclear periphery (I) has the highest likelihood for the subcellular location of protein YFL030W, and the next highest is mitochondrion (II), followed by cell periphery (III). The predicted results for the 236 ambiguous proteins are given in Online Supplementary Materials A. To help readers understand the data listed there, the predicted results for the first five of the 236 proteins are summarized in Table 5 according to the format of Table 4. As we can see from Table 5, of the five proteins listed there, four have the same ranking for several different subcellular locations, meaning that these proteins may coexist, or move around, in these locations. For example, protein YAR027W has ranking I for the following 20 locations: actin, bud, bud neck, cell periphery, cytoplasm, early Golgi, endosome, ER, ER to Golgi, Golgi, late Golgi, microtubule, mitochondrion, nuclear periphery, nucleolus, nucleus, punctate composite, spindle pole, vacuolar membrane and vacuole. This implies that YAR027W may coexist, or move around, in the 20 subcellular locations. Inspection of the data in Online Supplementary Materials A shows that many of the proteins there have the same ranking for different subcellular locations. That is why the 236 proteins were described by Huh et al. (2003) as ambiguous in subcellular location. According to our predicted results, these location-ambiguous proteins should be interpreted as those which coexist, or move around, in several different subcellular locations.
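The ranking I, II and III presentation can be sketched by grouping locations that share the same score. The sketch assumes one best similarity score per location is already available; the scores below are invented for illustration and are not taken from Tables 4 or 5.

```python
def rank_locations(best_score_by_location, top=3):
    """Group locations by score and return the groups for the top distinct scores."""
    distinct_scores = sorted(set(best_score_by_location.values()), reverse=True)[:top]
    return [
        sorted(loc for loc, score in best_score_by_location.items() if score == s)
        for s in distinct_scores
    ]

# Hypothetical scores for one query protein.
scores = {"nucleus": 0.9, "cytoplasm": 0.9, "ER": 0.7, "Golgi": 0.5, "vacuole": 0.2}
rank_i, rank_ii, rank_iii = rank_locations(scores)
```

Locations tied at the same score share a rank, which is how a single protein can receive ranking I for many locations at once.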
| 5 CONCLUSION |
|---|
The key to enhancing the success rate of predicting protein subcellular location is to grasp the core features of proteins that are intimately related to their biological functions. This can be realized by defining a protein based on the GO (Ashburner et al., 2000) and functional domain database (Apweiler et al., 2001) developed recently. However, the current GO and functional domain database do not give a complete coverage so that some proteins cannot be meaningfully defined. Although the problem will be eventually solved as the GO and functional domain database increase in size, to deal with such a situation right now, a hybrid approach was introduced by combining them with the pseudo amino acid composition (Chou, 2001). With the latter, not only a protein can always be explicitly defined but also its sequence-order effects can be considerably incorporated. That is why a hybridization of these three approaches can yield the success rate that is far beyond the reach of the other existing methods, as demonstrated by a rigorous cross-validation test.
In particular, the subcellular locations of the four proteins whose localizations could not be determined by experiments (Huh et al., 2003) have been explicitly predicted. Predictions were also made for the 236 proteins whose locations in budding yeast were ambiguous by experimental observations. According to our predicted results, many of these proteins belong to several different subcellular locations, implying that they might simultaneously exist, or move around, in these locations. This finding is intriguing because it reflects the dynamic feature of these proteins in a cell that may have very special biological functions.
Just as the emergence of structural bioinformatics has greatly stimulated both basic research and drug discovery (Chou, 2004), it is anticipated that the development of protein subcellular location prediction, particularly for cases with the multiplex location feature, will have important impacts not only on basic research but also on the pharmaceutical industry and medical practice. Proteins with such a dynamic feature are particularly interesting, and identifying differences in how proteins move within healthy and diseased cells is one critical way that doctors could diagnose disorders and gauge response to treatment.
| Acknowledgments |
|---|
The authors wish to thank the anonymous reviewers whose constructive comments have greatly improved the presentation of this paper.
Received on September 23, 2004; revised on October 13, 2004; accepted on October 18, 2004
| REFERENCES |
|---|
Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., Watson, J.D. (1994) Molecular Biology of the Cell, 3rd edn, Ch. 1. Garland Publishing, New York, London.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, L., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J.A., Zdobnov, E.M. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res., 29, 37-40.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., Harris, M.A., Hill, D.P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J.C., Richardson, J.E., Ringwald, M., Rubin, G.M., Sherlock, G. (2000) Gene ontology: tool for the unification of biology. Nat. Genet., 25, 25-29.
Cai, Y.D., Zhou, G.P., Chou, K.C. (2003) Support vector machines for predicting membrane protein types by using functional domain composition. Biophys. J., 84, 3257-3263.
Cedano, J., Aloy, P., Pérez-Pons, J.A., Querol, E. (1997) Relation between amino acid composition and cellular location of proteins. J. Mol. Biol., 266, 594-600.
Chou, P.Y. (1989) Prediction of protein structural classes from amino acid composition. In Fasman, G.D. (Ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549-586.
Chou, K.C. (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins, 21, 319-344.
Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo-amino-acid-composition. Proteins, 43, 246-255 (Erratum, 2001, 44, 60).
Chou, K.C. (2004) Review: structural bioinformatics and its impact to biomedical science. Curr. Med. Chem., 11, 2105-2134.
Chou, K.C. and Cai, Y.D. (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem., 277, 45765-45769.
Chou, K.C. and Cai, Y.D. (2003a) A new hybrid approach to predict subcellular localization of proteins by incorporating Gene ontology. Biochem. Biophys. Res. Commun., 311, 743-747.
Chou, K.C. and Cai, Y.D. (2003b) Prediction and classification of protein subcellular location: sequence-order effect and pseudo amino acid composition. J. Cell. Biochem., 90, 1250-1260 (Addendum, 2004, 91 (5) 1085).
Chou, K.C. and Elrod, D.W. (1999a) Prediction of membrane protein types and subcellular locations. Proteins, 34, 137-153.
Chou, K.C. and Elrod, D.W. (1999b) Protein subcellular location prediction. Protein Eng., 12, 107-118.
Chou, J.J. and Zhang, C.T. (1993) A joint prediction of the folding types of 1490 human proteins from their genetic codons. J. Theor. Biol., 161, 251-262.
Chou, K.C. and Zhang, C.T. (1995) Review: prediction of protein structural classes. Critical Rev. Biochem. Mol. Biol., 30, 275-349.
Emanuelsson, O., Nielsen, H., Brunak, S., von Heijne, G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016.
Feng, Z.P. (2001) Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition. Biopolymers, 58, 491-499.
Hua, S. and Sun, Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
Huh, W.K., Falvo, J.V., Gerke, L.C., Carroll, A.S., Howson, R.W., Weissman, J.S., O'Shea, E.K. (2003) Global analysis of protein localization in budding yeast. Nature, 425, 686-691.
Mardia, K.V., Kent, J.T., Bibby, J.M. (1979) Multivariate Analysis, Chs. 11-13. Academic Press, London, pp. 322-381.
Nakai, K. and Kanehisa, M. (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins, 11, 95-110.
Nakai, K. and Kanehisa, M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.
Nakashima, H. and Nishikawa, K. (1994) Discrimination of intracellular and extracellular proteins using amino acid composition and residue-pair frequencies. J. Mol. Biol., 238, 54-61.
Nakashima, H., Nishikawa, K., Ooi, T. (1986) The folding type of a protein is relevant to the amino acid composition. J. Biochem., 99, 152-162.
Pan, Y.X., Zhang, Z.Z., Guo, Z.M., Feng, G.Y., Huang, Z.D., He, L. (2003) Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. J. Protein Chem., 22, 395-402.
Park, K.J. and Kanehisa, M. (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acid and amino acid pairs. Bioinformatics, 19, 1656-1663.
Reinhardt, A. and Hubbard, T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230-2236.
Yuan, Z. (1999) Prediction of protein subcellular locations using Markov chain models. FEBS Lett., 451, 23-26.
Zhou, G.P. (1998) An intriguing controversy over protein structural class prediction. J. Protein Chem., 17, 729-738.
Zhou, G.P. and Assa-Munt, N. (2001) Some insights into protein structural class prediction. Proteins, 44, 57-59.
Zhou, G.P. and Doctor, K. (2003) Subcellular location prediction of apoptosis proteins. Proteins, 50, 44-48.