Bioinformatics Advance Access originally published online on November 5, 2004
Bioinformatics 2005 21(8):1644-1652; doi:10.1093/bioinformatics/bti103
Predicting gene function through systematic analysis and quality assessment of high-throughput data
1 Department of Physiological Chemistry, University Medical Center Utrecht, PO Box 85060, 3508 AB Utrecht, The Netherlands
2 Department of Innovation Studies, Copernicus Institute, Utrecht University, Utrecht, The Netherlands
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Determining gene function is an important challenge arising from the availability of whole genome sequences. Until recently, approaches based on sequence homology were the only high-throughput method for predicting gene function. Use of high-throughput generated experimental data sets for determining gene function has been limited for several reasons.
Results: Here a new approach is presented for integration of high-throughput data sets, leading to prediction of function based on relationships supported by multiple types and sources of data. This is achieved with a database containing 125 different high-throughput data sets describing phenotypes, cellular localizations, protein interactions and mRNA expression levels from Saccharomyces cerevisiae, using a bit-vector representation and information content-based ranking. The approach takes characteristic and qualitative differences between the data sets into account, is highly flexible, efficient and scalable. Database queries result in predictions for 543 uncharacterized genes, based on multiple functional relationships each supported by at least three types of experimental data. Some of these are experimentally verified, further demonstrating their reliability. The results also generate insights into the relative merits of different data types and provide a coherent framework for functional genomic datamining.
Availability: Free availability over the Internet.
Contact: f.c.p.holstege{at}med.uu.nl
Supplementary information: http://www.genomics.med.uu.nl/pub/pk/comb_gen_network
| INTRODUCTION |
|---|
Whole-genome sequencing projects form a tremendous resource for biological discovery (Grunenfelder and Winzeler, 2002). Besides genome annotation (Rust et al., 2002), an important challenge arising from the availability of genomes is elucidation of the function of thousands of newly discovered genes, the prime goal of functional genomics (Brent, 2000). The Gene Ontology (GO) Consortium provides gene annotations for most organisms (Harris et al., 2004). Curating literature and annotating genes accordingly is a labor-intensive task, and traditional approaches for determining gene function cannot keep up with the current rate of gene discovery. For example, although the Saccharomyces cerevisiae genome has been available since 1996, 3497 genes are still classified as unknown according to GO molecular function or cellular component categories. Classical approaches for assigning gene function include sequence homology, gene fusion events, gene order conservation and phylogenetic profiles (Huynen et al., 2000). Although these approaches are extremely powerful, one drawback is the lack of experimental evidence. In addition, homology on its own may result in imprecise annotation or return only weak similarity with well-characterized genes. As in most areas of biological research, complementary approaches are required to determine gene function more precisely.
To address the rate of gene discovery, high-throughput approaches are being developed for biological experimentation. These include mRNA expression-profiling (Brown and Botstein, 1999; Chu et al., 1998; DeRisi et al., 1997; Ferea et al., 1999; Galitski et al., 1999; Gasch et al., 2000; Hughes et al., 2000; Lockhart and Winzeler, 2000; Roberts et al., 2000; Spellman et al., 1998; Travers et al., 2000; Young, 2000), determination of gene-deletion phenotypes (Giaever et al., 2002; Winzeler et al., 1999), cellular localization of proteins (Huh et al., 2003; Kumar et al., 2002), protein-protein interactions (ppi) (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), assays for biological activity using protein arrays (Washburn et al., 2001; Zhu et al., 2001), synthetic lethal screens (Tong et al., 2004) and RNA interference screens (Berns et al., 2004; Kamath et al., 2003). When used on their own, such high-throughput data sets have already proved their usefulness for inferring hypotheses about gene function. High-throughput data can also be used in combination (Asthana et al., 2004; Bader et al., 2004; Ge et al., 2001, 2003; Jansen et al., 2002; Kemmeren et al., 2002; Marcotte et al., 1999; Parsons et al., 2004; Pereira-Leal et al., 2004; Schlitt et al., 2003; Walhout et al., 2002). Most previous studies have examined the relationships revealed by studying particular combinations of data. Such studies indicate that combining several high-throughput data types will likely aid gene function prediction.
For dealing with and integrating many different sources of high-throughput data, several issues need addressing. First, differences in data quality within and between data sets exist because of their high-throughput nature (Ge et al., 2003; Kemmeren and Holstege, 2003; Kemmeren et al., 2002). The presence of false positives therefore needs to be assessed and dealt with. Secondly, each data type has different properties. For instance, phenotype data are categorical, whereas mRNA expression data are continuous. The underlying statistical distribution is different and needs to be addressed accordingly. Thirdly, the availability of these data sets should be organized so that researchers can efficiently mine the data and prioritize hypotheses. Relevant in this respect are speed, visualization, flexibility and ease of use. Finally, it makes sense to develop methods that are easily scalable to accommodate an ever-increasing amount of data, and sufficiently generic to incorporate new data types.
Here we present a novel method for integration of several high-throughput data sets to accurately predict gene function. The method is based on the use of bit-vectors and has been applied to a database containing the majority of presently available high-throughput data for S.cerevisiae. The method addresses most of the issues described above and is highly flexible, scalable and efficient. To demonstrate the power of the approach, some of the functional predictions are verified. The broad-ranging combination of data types used here has not previously been mined in parallel. Besides predicting gene function, the approach can also be used for studying networks of functional relationships. The results reveal interesting characteristics of the high-throughput data sets and their capacity for predicting gene function, and demonstrate that combining high-throughput data sets is a powerful approach for functional genomic analyses.
| SYSTEMS AND METHODS |
|---|
Significant mRNA coexpression
mRNA coexpression is calculated based on the standard correlation measure. All distance matrix calculations were performed using Expression Profiler (http://www.ebi.ac.uk/expressionprofiler/) (Kapushesky et al., 2004). To determine a significant level of coexpression, the expression data were randomized and then used in an otherwise identical analysis, as described previously (Kemmeren et al., 2002). Here, gene pairs with a positive correlation and a P-value smaller than 0.0001 were considered significantly coexpressed and set to one for subsequent analyses. Expression data sets with an information content of less than 90 were excluded from further analysis.
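The randomization step described above can be sketched as follows. This is a minimal illustration in plain Python, not the Expression Profiler procedure: the function names, the choice to shuffle each profile independently, and the way the cutoff is read from the null distribution are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no coexpression signal
    return sxy / math.sqrt(sxx * syy)


def null_correlations(profiles, n_rounds, rng):
    """Pairwise correlations after shuffling every profile independently (assumed scheme)."""
    genes = sorted(profiles)
    null = []
    for _ in range(n_rounds):
        shuffled = {g: rng.sample(profiles[g], len(profiles[g])) for g in genes}
        for i, a in enumerate(genes):
            for b in genes[i + 1:]:
                null.append(pearson(shuffled[a], shuffled[b]))
    return null


def correlation_cutoff(null, p_value):
    """Correlation at the (1 - p_value) quantile of the randomized distribution."""
    ranked = sorted(null)
    index = min(len(ranked) - 1, math.ceil((1 - p_value) * len(ranked)))
    return ranked[index]


def coexpression_bits(profiles, cutoff):
    """Map each gene pair to 1 if its correlation exceeds the cutoff, else 0."""
    genes = sorted(profiles)
    return {
        (a, b): int(pearson(profiles[a], profiles[b]) > cutoff)
        for i, a in enumerate(genes)
        for b in genes[i + 1:]
    }
```

Resolving a threshold of P < 0.0001 in this way needs well over 10 000 randomized correlations; with fewer, the cutoff simply falls back to the largest value seen in the null distribution.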
Information content
The information content (IC) for each individual data set was calculated by
[Equation images not recovered: definitions of the IC and of the relationship confidence derived from it.]
Gene Ontology category assignment and graph drawing
Gene Ontology category assignments were performed using the GO Term Finder Perl module (http://search.cpan.org/dist/GO-TermFinder/) developed by the Stanford Microarray Database (SMD). In short, a hypergeometric test with a standard Bonferroni correction was used to identify those GO categories that show a significant overlap with the genes obtained from the different networks. Of these significant GO categories, only the top six scoring categories per gene are shown in the networks. All networks used throughout this study were drawn using AT&T's GraphViz tool (http://www.research.att.com/sw/tools/graphviz/).
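The enrichment test can be illustrated with a short sketch. It is not the GO Term Finder implementation: the function names, the significance level `alpha` and the input layout (a dictionary from category to member genes) are assumptions for the example; only the hypergeometric test, the Bonferroni correction and the limit of six categories come from the description above.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement from `population`."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enriched_categories(gene_set, annotations, population, alpha=0.05, top=6):
    """Return up to `top` categories that stay significant after Bonferroni correction."""
    n_tests = len(annotations)  # Bonferroni: multiply each P-value by the number of tests
    hits = []
    for category, members in annotations.items():
        k = len(gene_set & members)
        if k == 0:
            continue
        p = hypergeom_sf(k, population, len(members), len(gene_set))
        corrected = min(1.0, p * n_tests)
        if corrected < alpha:
            hits.append((corrected, category))
    return [category for _, category in sorted(hits)[:top]]
```

Exact integer binomials from `math.comb` keep the tail probability free of overflow for genome-sized populations, at the cost of speed compared with a dedicated statistics library.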
Functional predictions
Functional predictions are based on the GO annotations from SGD obtained on June 24, 2003. Those relationships containing one gene with a known function and one gene with the annotation molecular function unknown or cellular component unknown were used for inferring the functional annotation from the better-characterized protein to the uncharacterized protein.
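The transfer of annotation from a characterized gene to an uncharacterized partner can be sketched as below. The data structures and function names are hypothetical, and tallying the partner terms per gene is an illustrative choice rather than the procedure used in the study.

```python
from collections import Counter, defaultdict

# Terms that mark a gene as uncharacterized, as named in the text above.
UNKNOWN = {"molecular function unknown", "cellular component unknown"}


def is_uncharacterized(gene, annotation):
    """True if the gene carries one of the 'unknown' annotations."""
    return bool(annotation.get(gene, set()) & UNKNOWN)


def is_known(gene, annotation):
    """True if the gene has at least one term and none of the 'unknown' ones."""
    terms = annotation.get(gene, set())
    return bool(terms) and not (terms & UNKNOWN)


def predict_functions(relationships, annotation):
    """Tally, per uncharacterized gene, the terms of its characterized partners."""
    votes = defaultdict(Counter)
    for a, b in relationships:
        for target, source in ((a, b), (b, a)):
            if is_uncharacterized(target, annotation) and is_known(source, annotation):
                votes[target].update(annotation[source])
    return {gene: counts.most_common() for gene, counts in votes.items()}
```

Relationships between two characterized genes, or between two uncharacterized genes, contribute nothing to the tally, which mirrors the restriction to pairs with exactly one known partner.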
Analysis of function for uncharacterized genes
Yeast gene deletion strains were from Research Genetics and derived from S288C (BY4741 MATa his3Δ1 leu2Δ0 met15Δ0 ura3Δ0) as described (Winzeler et al., 1999).
Thermotolerance assay
Cells were grown in YEPD to mid-exponential phase (OD600 = 0.6) at 30°C and were then shifted to 47°C, while shaking. Aliquots were removed at regular intervals and put on ice. Approximately 500 cells at 0, 15 and 20 min and approximately 50 000 cells at 25 min were plated in triplicate on YEPD plates. The resulting colonies were counted after 3 days of growth at 30°C.
Growth at high salt concentrations
Cells were grown at 30°C in liquid media. The concentration was then adjusted to approximately 1 x 10^6 cells/ml in YEP. Dilution steps of 3 were spotted on YEPD plates, YEPD plates containing 1.75 M NaCl and YEPD plates containing 1.5 M KCl.
All supplemental material is available at http://www.genomics.med.uu.nl/pub/pk/comb_gen_network/.
| RESULTS |
|---|
A framework for integrated analysis of high-throughput data
To combine different high-throughput data types, a general framework was devised (Fig. 1). The database scheme includes ppi (Gavin et al., 2002; Ho et al., 2002; Ito et al., 2001; Uetz et al., 2000), mRNA expression (exp) (Chu et al., 1998; DeRisi et al., 1997; Ferea et al., 1999; Galitski et al., 1999; Gasch et al., 2000; Hughes et al., 2000; Roberts et al., 2000; Spellman et al., 1998; Travers et al., 2000), phenotype (phen) (Ross-Macdonald et al., 1999; Winzeler et al., 1999) and cellular localization data (loc) (Huh et al., 2003; Kumar et al., 2002). For integrated analysis of the data, each data set is transformed into a binary representation. The value 0 denotes the absence, 1 the presence of a relationship for a protein pair, based on that particular data set. Relationships for protein interaction data obtained from tagging approaches (Gavin et al., 2002; Ho et al., 2002) can be represented in two different ways (Bader and Hogue, 2002). The first method connects the tagged protein with all other proteins found in the screen (spoke). The second approach creates an all versus all connection scheme (matrix). Due to the versatility of the bit-vector approach, both models are represented and can be chosen in subsequent analyses.
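The difference between the spoke and matrix models can be made concrete with a short sketch. The function names and the representation of a purification as one bait plus a list of co-purified proteins are assumptions for illustration.

```python
from itertools import combinations


def spoke_pairs(bait, preys):
    """Spoke model: connect the tagged bait to every co-purified protein."""
    return {tuple(sorted((bait, prey))) for prey in preys if prey != bait}


def matrix_pairs(bait, preys):
    """Matrix model: connect every protein in the purification to every other."""
    members = sorted(set(preys) | {bait})
    return set(combinations(members, 2))
```

For a bait with n co-purified proteins, the spoke model yields n pairs and the matrix model n(n + 1)/2, so the matrix model always contains the spoke pairs plus all prey-prey links.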
Relationships obtained from mRNA expression data are based on the presence of a significant degree of coexpression (Kemmeren et al., 2002) and are calculated at different levels of granularity. This means that coexpression is calculated either across single experiments such as a single time course or across all data sets. Again, both representations can be chosen for subsequent queries. Connections obtained from phenotype data are derived from gene knockout experiments showing the same phenotype for a given gene pair. For localization data, a relationship is based on two proteins sharing the same cellular compartment. Proteins that are found in multiple cellular compartments are only connected to those proteins that are also found in an identical set of compartments. The rationale and consequences of these choices are discussed later. Together, initial preprocessing results in 125 data sets summarized in a single table using a bit-vector representation. The advantage of this approach is that subsequent selections can be based on individual data sets, data type, or any arbitrary combination of these.
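A bit-vector table of this kind can be sketched with Python integers. The six data set names, the bit layout and the `query` helper are hypothetical; the actual database holds 125 data sets and its schema is not described at this level of detail.

```python
# Hypothetical layout: one bit per data set, grouped by data type.
DATASETS = ["ppi_spoke", "ppi_matrix", "exp_all", "exp_timecourse", "phen", "loc"]
BIT = {name: 1 << i for i, name in enumerate(DATASETS)}

TYPE_MASK = {
    "ppi": BIT["ppi_spoke"] | BIT["ppi_matrix"],
    "exp": BIT["exp_all"] | BIT["exp_timecourse"],
    "phen": BIT["phen"],
    "loc": BIT["loc"],
}


def encode(supporting):
    """Pack the names of the supporting data sets into one integer bit-vector."""
    vector = 0
    for name in supporting:
        vector |= BIT[name]
    return vector


def supported_types(vector):
    """Data types with at least one supporting data set in the vector."""
    return {t for t, mask in TYPE_MASK.items() if vector & mask}


def query(table, all_of=(), any_of=()):
    """Pairs whose vector hits every mask in all_of and, if given, at least one in any_of."""
    selected = []
    for pair, vector in table.items():
        if all(vector & mask for mask in all_of) and (
            not any_of or any(vector & mask for mask in any_of)
        ):
            selected.append(pair)
    return sorted(selected)
```

Because each mask test is a single bitwise AND, selections on an individual data set, a data type, or any combination of these cost the same, and adding a data set only means assigning one more bit.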
Increasing the reliability of high-throughput data queries
Currently the database holds 3.7 million relationships (Table 1). These are all based on single types of high-throughput experimental observations with varying degrees of accuracy and information content. As expected, the number of pairwise relationships is reduced dramatically when those supported by combinations of different data sets are taken into account (Table 1). In contrast, the percentage of previously known ppi and complexes increases from 0.1 to 28.1% and from 0.24 to 34.4%, respectively, when relationships supported by multiple data sets are taken into account. One goal of integrated analyses is to lower the influence of false positives. The chance of finding false positives within the overlap between different data sets is drastically reduced. For example, when combining four independent data sets, each with a false-positive rate of 0.05, the chance of finding a false positive that occurs in all four data sets becomes 0.05^4 = 6.25 x 10^-6. Previously established ppi represent relationships of higher confidence than those from high-throughput studies. The large increase in their representation within the overlaps between data sets shows that integrated analyses are well suited for determining relationships of higher confidence.
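The false-positive arithmetic can be checked with exact fractions. The helper name is hypothetical, and the calculation assumes that errors in the data sets are independent, which real high-throughput data only approximate.

```python
from fractions import Fraction


def joint_false_positive_rate(rates):
    """Probability that a false positive occurs in every data set, assuming independence."""
    joint = Fraction(1)
    for rate in rates:
        joint *= Fraction(rate)  # decimal strings such as "0.05" convert exactly
    return joint


# Four data sets, each with a false-positive rate of 0.05.
rate = joint_false_positive_rate(["0.05"] * 4)
print(rate, float(rate))
```

Any positive correlation between the errors of different data sets, for example two screens sharing the same sticky protein, raises the joint rate above this product.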
To assess the amount of valuable information within each data set, the IC is calculated for each individual data set (see Systems and Methods). In principle, the less data a data set derives from a potentially large data space, the more informative it is. In line with other information theoretical approaches, such as Shannon's information measure (Shannon, 1997), an increase in the information content corresponds to a decrease in uncertainty. Using this IC, an implicit ranking can be made for the different data types (Table 1). This table shows that the protein interaction data contain the most valuable information (99.9), followed by phenotype (99.8), mRNA coexpression (97.8) and localization (93.8). Based on the IC, a measure can then be calculated for each individual relationship (relationship confidence, see Systems and Methods). A higher relationship confidence marks relationships with a high amount of information, i.e. a low degree of uncertainty. Both the relationship confidence and a restriction on the data type can then be used to obtain a reliable set of functional relationships.
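The equation for the IC is not reproduced in this text, so the sketch below rests on an assumed form: the percentage of the possible gene-pair space that a data set leaves unconnected. This matches the statement that sparser data from a large data space are more informative, and it gives values close to 100 for sparse data sets, but it is an illustration and not the published definition. The function names and example counts are hypothetical.

```python
def information_content(n_relationships, n_genes):
    """Assumed IC: percentage of possible gene pairs NOT connected by the data set."""
    possible = n_genes * (n_genes - 1) // 2
    if possible == 0:
        raise ValueError("need at least two genes")
    return 100.0 * (1 - n_relationships / possible)


def rank_by_ic(datasets, n_genes):
    """Sort data set names from most to least informative under the assumed IC."""
    scored = {name: information_content(n, n_genes) for name, n in datasets.items()}
    return sorted(scored, key=scored.get, reverse=True)


def keep_informative(datasets, n_genes, minimum=90.0):
    """Drop data sets whose assumed IC falls below the cutoff (90 for expression data)."""
    return {
        name: n
        for name, n in datasets.items()
        if information_content(n, n_genes) >= minimum
    }
```

Under this assumed form, a data set that links every gene to every other scores 0 and is discarded by any positive cutoff, which is the behavior the exclusion rule for expression data sets is meant to achieve.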
Predicting functional annotation
The high-throughput data sets can be mined in a variety of ways. Figure 2A shows all 2398 relationships supported by significant mRNA coexpression and ppi. A more stringent query is represented in Figure 2B. Only relationships supported by ppi (matrix), mRNA coexpression, shared phenotype, and localization are represented. This creates a smaller, but densely interconnected network (Fig. 2B). Use of GO (Harris et al., 2004) shows that genes involved in similar processes are highly interconnected and cluster together in the network structure. Using such queries, hypotheses can be formulated about the function, based on neighboring proteins.
Depending on how the high-throughput data are used, different selection criteria can be applied. When used as an exploratory method to find novel relationships, relaxed criteria can be applied, such as sharing at least ppi and mRNA coexpression (Fig. 2A). However, for improving current functional annotation of uncharacterized genes, more stringent criteria are warranted. For this purpose four different queries have been performed which can be used for high confidence functional predictions (Table 1). The most stringent query selects only those relationships supported by all four data types, using the spoke model for ppi. The least stringent of these queries selects those relationships supported by ppi (matrix), mRNA coexpression and either shared phenotype or cellular localization (Table 1). For each query, the relationships are ranked by relationship confidence, allowing for a more granular subselection or prioritization of hypotheses (see also Supplemental Tables 1-8). More predictions can be made if the criteria are further relaxed.
With these four queries, all based on support from three completely independent kinds of experimental data, 2323 relationships are found, 543 of which apply to genes for which there is no current molecular function or cellular component annotation (Table 1). Two examples are depicted in Figure 3A,B. KRE33 (Fig. 3A) is described as killer toxin resistant with no GO annotation. It has relationships with 20 genes, each relationship based on a physical interaction, mRNA coexpression and either a shared phenotype or cellular localization. Of the 20 linked genes, 13 are part of the U3 snoRNP complex involved in rRNA processing. Four of the other genes are also involved in rRNA metabolism, making it extremely likely that KRE33 is involved in these processes. FUN11 (Fig. 3B) stands for Function Unknown Now and also has no GO annotations. It is linked to five functionally characterized genes, all of which are involved in translation initiation. From this it is clear that FUN11 too can be tentatively assigned to a role in this process.
Another category of genes for which this approach is useful comprises genes with vague or perhaps incorrect annotation. Figure 3C,D shows examples of this. YDR091c (Fig. 3C) is annotated as a putative member of the ATP-binding cassette superfamily of non-transporters. Ten of the 15 genes with which it has reliable links are involved in translation initiation, making it extremely likely that the same holds for YDR091c. PPH22 (Fig. 3D) is a serine-threonine protein phosphatase, annotated with a role in the G1/S transition of the mitotic cell cycle. Three of the five genes with which PPH22 is connected are proteasome components, arguing strongly for a similar annotation for PPH22.
The examples above are from a total of 543 predictions for genes with unknown function, supported by multiple pairwise relationships, each of these based on at least three types of experimental data. Thirty-seven of the 543 novel functional predictions concern relationships between single gene pairs. Because such relationships are supported by at least three different types of high-throughput data, it is still likely that the functional prediction is correct. To demonstrate this further, some of these predictions were selected and tested experimentally.
YGR205W suppresses thermotolerance
The first example is a link between HSP104 and YGR205w. Hsp104 is a heat shock protein involved in the rescue of stress-damaged proteins. Together with Hsp70 and Hsp40 it forms a chaperone system that can refold denatured proteins (Glover and Lindquist, 1998). Cells lacking Hsp104 do not acquire thermotolerance when given a mild pre-heat treatment (Ferreira et al., 2001). A putative link between Hsp104 and Ygr205w is indicated by mRNA coexpression, ppi and identical localization. Together, these three kinds of experimental evidence predict that the uncharacterized ORF YGR205w is also involved in the stress response. Interestingly, mRNA coexpression is observed during various stress conditions, including heat shock and stationary phase (Fig. 4B). To validate the prediction, a YGR205w deletion strain was tested for thermotolerance. Cells lacking HSP104 show decreased resistance to heat shock when compared with wild type (Fig. 4C) (Ferreira et al., 2001). In agreement with the prediction, deletion of YGR205w also shows a thermotolerance phenotype. Rather than precisely mimicking HSP104 deletion, deletion of YGR205w results in increased thermotolerance (Fig. 4C), indicating that it is involved in the same pathway as HSP104, but with an opposite role. Based on this evidence, YGR205w can now be annotated as involved in negative regulation of thermotolerance.
ASC1 and YDJ1 are associated with protein folding, translation and ribosome biogenesis
Another intriguing example of genes for which novel functional annotation is predicted is the pair ASC1 and YDJ1. In the network of high-throughput data relationships, ASC1 is surrounded by three groups of genes (Fig. 5A). Shared mRNA coexpression, ppi and phenotypes are found with four genes involved in rRNA processing. Multiple links are also found with six translation initiation factors. Given the number of related genes, it is likely that ASC1 is involved in both rRNA processing and translation, despite having no annotation. A third category is represented by two genes with dnaJ homolog regions, one of which, ZUO1, is a ribosome-associated chaperone. We reasoned that Ydj1 might also act as a nascent protein chaperone because of its ppi, shared mRNA expression and localization with Asc1. mRNA coexpression for ASC1 and YDJ1 is observed during multiple conditions (Fig. 5B). In accordance with a role for ASC1 in stress-induced protein misfolding, ASC1 deletion results in a severe growth defect at elevated NaCl concentrations (Dunn et al., 2004). In agreement with the prediction that ASC1 and YDJ1 are both involved in protecting cells from elevated salt concentrations, YDJ1 deletion also results in a growth defect at elevated NaCl and KCl concentrations (Fig. 5C).
Interactive datamining tool
The examples above illustrate the utility of combining different high-throughput data types. The focus here is on functional annotation, but this approach of a single database, bit-vector queries and IC ranking is equally useful for examining the network structures and characteristics resulting from many different combinations of the 125 data sets included in the database. A datamining tool has therefore been set up on the accompanying website (http://www.genomics.med.uu.nl/pub/pk/comb_gen_network/). This tool allows detailed study of relationships found when starting from all genes, a certain complex, a single protein, or a number of protein pairs. Information about the confidence, supporting data types, mRNA expression profiles, gene neighborhood and GO annotation can be retrieved. Restrictions can be placed on the type of data support needed and individual data sets can be included or excluded at will.
| DISCUSSION |
|---|
These results demonstrate how several issues arising from the availability of genome sequences and high-throughput experimental data can be addressed through data integration. A general framework is provided that can incorporate any data type represented in a binary format. The approach takes into account characteristics of different data types, heterogeneity in data quality and IC. Besides predicting gene function, the study also sheds light on the properties of different data types and sets.
One interesting aspect is the effect of the level of granularity on the relationships found within mRNA expression data. For some gene pairs, mRNA coexpression is only found when looking at individual experiments such as a single time course, rather than the entire collection of expression data. This agrees with the fact that some interacting proteins are only coexpressed under specific conditions. This has been taken into account in the bit-vector representation. Another aspect involves the compartmentation of the cellular localization data. Here we have chosen to base relationships on an exact match of the localization pattern. If partly matching locations had been permitted, the IC of the localization data would be too low to be meaningful. This is due to the low number of possible cellular localizations and in this sense the localization data will remain less useful until cellular localization data become more fine-grained.
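The effect of exact versus partial matching of localization patterns can be seen in a small sketch. The protein names and compartments are hypothetical, and the function is an illustration of the two matching rules rather than the code used to build the database.

```python
from itertools import combinations


def localization_pairs(compartments, exact=True):
    """Link proteins with identical compartment sets, or with any shared compartment."""
    pairs = set()
    for a, b in combinations(sorted(compartments), 2):
        loc_a, loc_b = compartments[a], compartments[b]
        if not loc_a or not loc_b:
            continue  # proteins without a recorded localization are never linked
        linked = loc_a == loc_b if exact else bool(loc_a & loc_b)
        if linked:
            pairs.add((a, b))
    return pairs


# Hypothetical example with two compartments.
example = {
    "P1": {"nucleus"},
    "P2": {"nucleus", "cytoplasm"},
    "P3": {"nucleus", "cytoplasm"},
    "P4": {"cytoplasm"},
}
exact_links = localization_pairs(example, exact=True)
partial_links = localization_pairs(example, exact=False)
```

In this toy case exact matching yields one relationship and partial matching five out of six possible pairs, which illustrates why permitting partial matches with only a few compartments connects almost everything and leaves little information.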
Important steps have been taken in the past to integrate different types of genome-scale data (Ge et al., 2001; Jansen et al., 2002; Kemmeren et al., 2002; Marcotte et al., 1999; Parsons et al., 2004; Schlitt et al., 2003), usually in pairwise combinations and typically for the purpose of examining genome-scale network characteristics. Four different data types have been used here and the system can be extended to include other data types such as synthetic lethality, chromosomal localization, regulatory motifs, sequence homology, metabolic pathways, etc. There is not a large degree of overlap between the different data types, especially when selecting relationships supported by all four data types (Table 1). Although not all gene function relationships can be expected to demonstrate support from all data types, it is still striking, given that most of the underlying data sets are presented as genome-wide. Many clearly are not completely genome-wide, often due to constraints of the methods used. When comparing multiple data sets this soon leads to a significant loss of genome coverage.
Another reason that has been offered in the past for the lack of overlap between genome-wide data sets, especially for those of the same type, is heterogeneity in quality (Kemmeren and Holstege, 2003; Kemmeren et al., 2002). Two measures have been taken here to deal with this. Restricting the analyses to relationships supported by multiple data types considerably improves reliability as is obvious from the enrichment for previously known interactions (Table 1). A second step is the use of a relationship confidence based on the IC of the data sets. Using the IC a clear distinction between the different data types can be made (Table 1). As expected protein interaction data, the most direct type of evidence for a functional relationship, have the highest IC, directly followed by phenotype data.
The approach presented here is suitable for generating hypotheses about individual genes and can aid annotation initiatives such as GO (Harris et al., 2004) or genome databases such as SGD (Christie et al., 2004). To better utilize existing knowledge, the annotations themselves should also be annotated as to how information is obtained. These so-called evidence codes already exist within GO, albeit at a somewhat undifferentiated level. For example, no distinction is made between a physical interaction arising from a two-hybrid or coimmunoprecipitation assay, or whether the approach was high-throughput. Initiatives such as BIND and IntAct are tackling this problem for protein interactions (Bader et al., 2003; Hermjakob et al., 2004). Another aspect involves the annotations based on computational methods. If this information is not stored, valuable information is lost. On the other hand, if such annotations are used by another computational method, such predictions will lead to less trustworthy annotations. Therefore more details about how annotation is obtained need to be available and computational methods need to use this information.
Most of the approaches to integrate different types of high-throughput data have been performed in Saccharomyces cerevisiae. As more data are generated, similar approaches will be required for other organisms. Besides being easily extendable to other data types, the framework presented here is well suited for other organisms, as it already takes into account scalability, flexibility and standardization. The results presented here show that alongside existing methods, integrating diverse types of functional genomic data is a powerful method for tackling gene function annotation.
| Acknowledgments |
|---|
We thank Philip Lijnzaad, Thomas Schlitt, Harm van Bakel, and Arnaud Leijen for discussions and technical support. Supported by grants from the Netherlands Organization for Scientific Research (NWO); 05050205, 016026009, 05071002 and by the European Union fifth framework project TEMBLOR.
Received on August 27, 2004; revised on October 1, 2004; accepted on October 15, 2004
| REFERENCES |
|---|
Asthana, S., King, O.D., Gibbons, F.D., Roth, F.P. (2004) Predicting protein complex membership using probabilistic network reliability. Genome Res., 14, 11701175
Bader, G.D. and Hogue, C.W. (2002) Analyzing yeast proteinprotein interaction data obtained from different sources. Nat. Biotechnol., 20, 991997[CrossRef][ISI][Medline].
Bader, G.D., Betel, D., Hogue, C.W. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 31, 248250
Bader, J.S., Chaudhuri, A., Rothberg, J.M., Chant, J. (2004) Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22, 7885[CrossRef][ISI][Medline].
Berns, K., Hijmans, E.M., Mullenders, J., Brummelkamp, T.R., Velds, A., Heimerikx, M., Kerkhoven, R.M., Madiredjo, M., Nijkamp, W., Weigelt, B., et al. (2004) A large-scale RNAi screen in human cells identifies new components of the p53 pathway. Nature, 428, 431437[CrossRef][Medline].
Brent, R. (2000) Genomic biology. Cell, 100, 169183[CrossRef][ISI][Medline].
Brown, P.O. and Botstein, D. (1999) Exploring the new world of the genome with DNA microarrays. Nat. Genet., 21, 3337[CrossRef][ISI][Medline].
Christie, K.R., Weng, S., Balakrishnan, R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Feierbach, B., Fisk, D.G., Hirschman, J.E., et al. (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res., 32, D311314
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., Herskowitz, I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699705
DeRisi, J.L., Iyer, V.R., Brown, P.O. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680686
Dunn, B., Spellman, F.T., Schwarz, P., Terraciano, J., Troyanovich, J., Walker, J., Greene, S., Shaw, J., DiDomenico, K., Wang, B., et al. (2004) Genetic footprinting: a functional analysis of the S.cerevisiae genome. Personal Communication.
Ferea, T.L., Botstein, D., Brown, P.O., Rosenzweig, R.F. (1999) Systematic changes in gene expression patterns following adaptive evolution in yeast. Proc. Natl. Acad. Sci. USA, 96, 97219726
Ferreira, P.C., Ness, F., Edwards, S.R., Cox, B.S., Tuite, M.F. (2001) The elimination of the yeast [PSI+] prion by guanidine hydrochloride is the result of Hsp104 inactivation. Mol. Microbiol., 40, 13571369[CrossRef][ISI][Medline].
Galitski, T., Saldanha, A.J., Styles, C.A., Lander, E.S., Fink, G.R. (1999) Ploidy regulation of gene expression. Science, 285, 251254
Gasch, A.P., Spellman, P.T., Kao, C.M., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.O. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 42414257
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141147[CrossRef][Medline].
Ge, H., Liu, Z., Church, G.M., Vidal, M. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet., 29, 482486[CrossRef][ISI][Medline].
Ge, H., Walhout, A.J., Vidal, M. (2003) Integrating omic information: a bridge between genomics and systems biology. Trends Genet, 19, 551560[CrossRef][ISI][Medline].
Giaever, G., Chu, A.M., Ni, L., Connelly, C., Riles, L., Veronneau, S., Dow, S., Lucau-Danila, A., Anderson, K., Andre, B., et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418, 387-391.
Glover, J.R. and Lindquist, S. (1998) Hsp104, Hsp70, and Hsp40: a novel chaperone system that rescues previously aggregated proteins. Cell, 94, 73-82.
Grunenfelder, B. and Winzeler, E.A. (2002) Treasures and traps in genome-wide data sets: case examples from yeast. Nat. Rev. Genet., 3, 653-661.
Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., Mungall, C., et al. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258-D261.
Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452-455.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-183.
Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109-126.
Huh, W.K., Falvo, J.V., Gerke, L.C., Carroll, A.S., Howson, R.W., Weissman, J.S., O'Shea, E.K. (2003) Global analysis of protein localization in budding yeast. Nature, 425, 686-691.
Huynen, M., Snel, B., Lathe, W., 3rd and Bork, P. (2000) Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res., 10, 1204-1210.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., Sakaki, Y. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. USA, 98, 4569-4574.
Jansen, R., Greenbaum, D., Gerstein, M. (2002) Relating whole-genome expression data with protein-protein interactions. Genome Res., 12, 37-46.
Kamath, R.S., Fraser, A.G., Dong, Y., Poulin, G., Durbin, R., Gotta, M., Kanapin, A., Le Bot, N., Moreno, S., Sohrmann, M., et al. (2003) Systematic functional analysis of the Caenorhabditis elegans genome using RNAi. Nature, 421, 231-237.
Kapushesky, M., Kemmeren, P., Culhane, A.C., Durinck, S., Ihmels, J., Korner, C., Kull, M., Torrente, A., Sarkans, U., Vilo, J., et al. (2004) Expression profiler: next generation - an online platform for analysis of microarray data. Nucleic Acids Res., 32, W465-W470.
Kemmeren, P. and Holstege, F.C. (2003) Integrating functional genomics data. Biochem. Soc. Trans., 31, 1484-1487.
Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., Holstege, F.C. (2002) Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell, 9, 1133-1143.
Kumar, A., Agarwal, S., Heyman, J.A., Matson, S., Heidtman, M., Piccirillo, S., Umansky, L., Drawid, A., Jansen, R., Liu, Y., et al. (2002) Subcellular localization of the yeast proteome. Genes Dev., 16, 707-719.
Lockhart, D.J. and Winzeler, E.A. (2000) Genomics, gene expression and DNA arrays. Nature, 405, 827-836.
Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., Eisenberg, D. (1999) A combined algorithm for genome-wide prediction of protein function. Nature, 402, 83-86.
Parsons, A.B., Brost, R.L., Ding, H., Li, Z., Zhang, C., Sheikh, B., Brown, G.W., Kane, P.M., Hughes, T.R., Boone, C. (2004) Integration of chemical-genetic and genetic interaction data links bioactive compounds to cellular target pathways. Nat. Biotechnol., 22, 62-69.
Pereira-Leal, J.B., Enright, A.J., Ouzounis, C.A. (2004) Detection of functional modules from protein interaction networks. Proteins, 54, 49-57.
Roberts, C.J., Nelson, B., Marton, M.J., Stoughton, R., Meyer, M.R., Bennett, H.A., He, Y.D., Dai, H., Walker, W.L., Hughes, T.R., et al. (2000) Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science, 287, 873-880.
Ross-Macdonald, P., Coelho, P.S., Roemer, T., Agarwal, S., Kumar, A., Jansen, R., Cheung, K.H., Sheehan, A., Symoniatis, D., Umansky, L., et al. (1999) Large-scale analysis of the yeast genome by transposon tagging and gene disruption. Nature, 402, 413-418.
Rust, A.G., Mongin, E., Birney, E. (2002) Genome annotation techniques: new approaches and challenges. Drug Discov. Today, 7, S70-S76.
Schlitt, T., Palin, K., Rung, J., Dietmann, S., Lappe, M., Ukkonen, E., Brazma, A. (2003) From gene networks to gene function. Genome Res., 13, 2568-2576.
Shannon, C.E. (1997) The mathematical theory of communication (1963). MD Comput., 14, 306-317.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., Futcher, B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273-3297.
Tong, A.H., Lesage, G., Bader, G.D., Ding, H., Xu, H., Xin, X., Young, J., Berriz, G.F., Brost, R.L., Chang, M., et al. (2004) Global mapping of the yeast genetic interaction network. Science, 303, 808-813.
Travers, K.J., Patil, C.K., Wodicka, L., Lockhart, D.J., Weissman, J.S., Walter, P. (2000) Functional and genomic analyses reveal an essential coordination between the unfolded protein response and ER-associated degradation. Cell, 101, 249-258.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627.
Walhout, A.J., Reboul, J., Shtanko, O., Bertin, N., Vaglio, P., Ge, H., Lee, H., Doucette-Stamm, L., Gunsalus, K.C., Schetter, A.J., et al. (2002) Integrating interactome, phenome, and transcriptome mapping data for the C.elegans germline. Curr. Biol., 12, 1952-1958.
Washburn, M.P., Wolters, D., Yates, J.R., III. (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol., 19, 242-247.
Winzeler, E.A., Shoemaker, D.D., Astromoff, A., Liang, H., Anderson, K., Andre, B., Bangham, R., Benito, R., Boeke, J.D., Bussey, H., et al. (1999) Functional characterization of the S.cerevisiae genome by gene deletion and parallel analysis. Science, 285, 901-906.
Young, R.A. (2000) Biomedical discovery with DNA arrays. Cell, 102, 9-15.
Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone, P., Lan, N., Jansen, R., Bidlingmaier, S., Houfek, T., et al. (2001) Global analysis of protein activities using proteome chips. Science, 293, 2101-2105.