Bioinformatics Advance Access originally published online on August 27, 2007
Bioinformatics 2007 23(20):2692-2699; doi:10.1093/bioinformatics/btm403
Exploring the functional landscape of gene expression: directed search of large microarray compendia
1Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory and 2Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: The increasing availability of gene expression microarray technology has resulted in the publication of thousands of microarray gene expression datasets investigating various biological conditions. This vast repository is still underutilized due to the lack of methods for fast, accurate exploration of the entire compendium.
Results: We have collected Saccharomyces cerevisiae gene expression microarray data containing roughly 2400 experimental conditions. We analyzed the functional coverage of this collection and we designed a context-sensitive search algorithm for rapid exploration of the compendium. A researcher using our system provides a small set of query genes to establish a biological search context; based on this query, we weight each dataset's relevance to the context, and within these weighted datasets we identify additional genes that are co-expressed with the query set. Our method exhibits an average increase in accuracy of 273% compared to previous mega-clustering approaches when recapitulating known biology. Further, we find that our search paradigm identifies novel biological predictions that can be verified through further experimentation. Our methodology provides the ability for biological researchers to explore the totality of existing microarray data in a manner useful for drawing conclusions and formulating hypotheses, which we believe is invaluable for the research community.
Availability: Our query-driven search engine, called SPELL, is available at http://function.princeton.edu/SPELL
Contact: ogt{at}genomics.princeton.edu
Supplementary information: Several additional data files, figures and discussions are available at http://function.princeton.edu/SPELL/supplement
| 1 INTRODUCTION |
|---|
The recent, rapid expansion in the amount of functional genomics data created by the biology community promises to provide broad understanding of protein function and regulation on a systems level. In particular, the increased accessibility and lower cost of gene expression microarrays have led to the publication of hundreds of studies in a variety of organisms. However, these data have thus far remained vastly underutilized. While much work has been done investigating individual datasets, advancement of knowledge in the field requires intuitive methods for biology researchers to quickly and easily explore the totality of existing data, to identify the datasets and publications relevant to their area of interest, and to locate the important information within those datasets. For example, a biologist interested in DNA damage repair should not be limited to analysis of a single dataset concerned with exposure to DNA damaging agents, but rather should be able to quickly determine which published microarray experiments elicit a DNA damage response, find the relevant portions of those datasets and then examine those data to draw conclusions and form hypotheses.
No existing approach for microarray analysis allows for fast, intuitive exploration of the large, diverse collection of published gene expression data. The utility and necessity of exploration-based techniques have been demonstrated for microarray data on the much smaller scale of one or a few datasets. General clustering techniques and bi-clustering methods have been successfully used to allow biologists to find relevant information in this small-scale setting. However, these methods are not appropriate for application to very large-scale microarray compendia due to sensitivity to noise that is compounded when aggregating data, an inability to work with data generated under diverse conditions, and/or prohibitively slow running times.
Typical clustering approaches group genes together to minimize a distance function between genes. While these distances can be quickly calculated across the concatenation of many datasets, their biological accuracy greatly decreases when taken over heterogeneous conditions. This approach is sometimes referred to as mega-clustering in the literature (Baldwin et al., 2003; Gasch et al., 2000; Saldanha et al., 2004) and while appropriate in limited experimental settings involving small numbers of biologically related datasets, it is not appropriate for analysis of large-scale, heterogeneous collections of gene expression data (Madeira and Oliveira, 2004). Signals present in only a few of the datasets in a compendium are lost when the total data collection is large, causing clustering techniques to capture only the global signals in the compendium and miss more specific signals. Thus, clustering is best limited to initial exploratory analysis of single datasets.
Bi-clustering methods seek gene similarity in only a subset of available conditions, which is more appropriate for functionally heterogeneous data (Cheng and Church, 2000; Madeira and Oliveira, 2004). However, the most basic formulations of bi-clustering allow for the selection of any subset of conditions, which is often not biologically meaningful when the selected conditions bear no relationship to each other. As data compendia increase in size, it becomes increasingly likely that these bi-clustering formulations find patterns in the noise, because arbitrary subsets of conditions in which genes exhibit similar levels of expression are easier to find by pure chance as the number of conditions increases. Further, the general bi-clustering problem is NP-complete (Madeira and Oliveira, 2004), meaning that these methods can require unreasonable running times to find complete solutions, particularly on large data collections.
As the general bi-clustering problem is often intractable, a variety of heuristics and normalization steps are utilized in practice. For example, some approaches obtain faster running times by limiting the types of bi-clusters they can identify (Tanay et al., 2002), or by focusing on specific types of data, such as time courses (Madeira and Oliveira, 2005). Other bi-clustering methods achieve tractable complexity by starting with a query set of related seed genes and iteratively growing out maximal bi-clusters around the seed (Ihmels et al., 2002).
Another approach for microarray data exploration is a query-driven search process, such as the feature selection-based Gene Recommender algorithm (Owen et al., 2003). This approach has proven very useful on the scale of smaller data compendia; however, it is not as effective when applied to very large-scale collections. As with some formulations of bi-clustering, feature selection techniques may find noisy patterns among unrelated conditions, and can require lengthy computation times for complete analysis.
To address all of these shortcomings, we propose a more scalable, context-specific search methodology that enables biology researchers to explore the entirety of very large microarray compendia in a biologically meaningful manner. Our approach offers many-fold higher biological accuracy and running speeds many times faster than current techniques. We have also categorized the functional coverage and biases of a large S.cerevisiae microarray collection to assess which biological areas are well characterized in the current microarray compendium and which areas are open to further study. Based on this compendium of data, we demonstrate the effectiveness and usefulness of our approach for information exploration and hypothesis formulation. We have implemented our algorithm in an interactive, web-based search engine available at http://function.princeton.edu/SPELL.
| 2 METHODS |
|---|
In this section, we briefly discuss our collection of microarray data and our functional coverage analysis of this compendium. We then discuss in detail our fast, context-sensitive search procedure, called SPELL.
2.1 Creation of the Saccharomyces cerevisiae gene expression data compendium
We collected 117 microarray datasets from 81 publications totaling 2394 array hybridizations from a variety of sources (Brazma et al., 2003; Cherry et al., 1998; Edgar et al., 2002; Le Crom et al., 2002; Sherlock et al., 2001). Missing values were imputed using the KNN impute algorithm with K = 10 using Euclidean distance (Troyanskaya et al., 2001) and technical replicates (i.e. spot repeats and dye swaps) were averaged together, resulting in data files of complete matrices with one entry per gene appearing in the dataset (see Supplementary Materials for details).
Gene similarities are calculated within a dataset containing n conditions using the Pearson correlation coefficient, $\rho$, as defined by:
$$\rho(x, y) = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$
where x and y are the expression vectors of two genes, $\bar{x}$ and $\bar{y}$ are their means, and $\sigma_x$ and $\sigma_y$ are SDs. However, the distribution of all pair-wise Pearson correlations varies greatly from one dataset to the next. This is a function of several factors, including the number of experimental conditions in a dataset, the biological process targeted, and the microarray technology employed. In order to better compare correlations between datasets, we apply Fisher's z-transform (Fisher, 1915). The Fisher z-transformed correlations, z, are defined as:
$$z(x, y) = \frac{1}{2}\ln\left(\frac{1 + \rho(x, y)}{1 - \rho(x, y)}\right)$$
where $\rho$ is defined as above. As a final step, we standardize these quantities by subtracting the mean correlation within each dataset and dividing by the corresponding SD, which results in approximately normal distributions [N(0,1)] of correlations within each dataset under the assumption, based on empirical observation, that the true underlying distribution of the data is approximately normal (see Supplementary Material for examples).
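The three steps described in this section (Pearson correlation, Fisher z-transform, per-dataset standardization) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary input format and the `eps` clipping of correlations near ±1 are assumptions introduced here, and genes with constant expression (zero SD) are not handled.

```python
import math
from itertools import combinations
from statistics import mean, stdev


def pearson(x, y):
    """Sample Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample SDs (n - 1 denominator)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((n - 1) * sx * sy)


def fisher_z(rho, eps=1e-6):
    """Fisher z-transform; rho is clipped (an assumption) so |rho| = 1 stays finite."""
    rho = max(-1.0 + eps, min(1.0 - eps, rho))
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))


def standardized_z_scores(dataset):
    """Return {(gene_a, gene_b): z-score}, standardized within one dataset.

    `dataset` is a hypothetical dict mapping gene name -> list of expression values.
    """
    pairs = list(combinations(sorted(dataset), 2))
    z = {p: fisher_z(pearson(dataset[p[0]], dataset[p[1]])) for p in pairs}
    values = list(z.values())
    mu, sd = mean(values), stdev(values)
    # Subtract the dataset mean and divide by the dataset SD.
    return {p: (v - mu) / sd for p, v in z.items()}
```

Because the standardization is applied per dataset, the resulting scores from datasets with different numbers of conditions can be compared on a common scale.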
2.2 Functional coverage analysis
As motivation for our search algorithm presented in the next section, and in order to characterize which biological processes are represented in the compendium, we analyzed the functional coverage of each dataset over a variety of Gene Ontology (GO) terms (Ashburner et al., 2000) using the z-test for significance. Given the background of all pair-wise z-scores within a dataset, d, for each GO term, g, we calculated all pair-wise correlations for the $n_g$ genes annotated to the term and found the mean sample correlation, $\mu_g$. The z-test statistic for each GO term/dataset pair, $Z_{g,d}$, was calculated as:
$$Z_{g,d} = \frac{\mu_g - \mu_b}{\sigma_b / \sqrt{n_g(n_g - 1)/2}}$$
where $\mu_b$ is the background mean and $\sigma_b$ is the background SD. Approximate significance of these z-statistics was computed based on an upper-tailed hypothesis test (Montgomery et al., 2001). The calculated P-values are approximate due to the assumption of underlying normality in the data and because correlations among genes annotated to the same GO term are not necessarily independent. For display in Figure 1, the resulting matrix of pseudo P-values was hierarchically clustered in both dimensions (see Supplementary Material for complete matrix). In addition to the z-test presented here, we have calculated significance using the non-parametric Kolmogorov–Smirnov test (see Supplementary Material for results).
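As a rough illustration of the coverage test described above, the following Python sketch computes an upper-tailed z-test for one GO term in one dataset. It is not the authors' code: the function name and input format are hypothetical, and it assumes that the sample size of the test is the number of gene pairs annotated to the term and that the background SD is the population SD of all pair-wise z-scores.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def coverage_z_test(z_scores, term_genes):
    """Upper-tailed z-test of a GO term's mean correlation against the background.

    `z_scores` maps sorted gene pairs (a, b) to correlation z-scores for one
    dataset; `term_genes` is the set of genes annotated to the GO term.
    Returns (Z, approximate P-value).
    """
    background = list(z_scores.values())
    mu_b, sigma_b = mean(background), pstdev(background)
    term_pairs = list(combinations(sorted(term_genes), 2))
    mu_g = mean(z_scores[p] for p in term_pairs)
    # Assumed sample size: the number of pair-wise correlations in the term.
    z_stat = (mu_g - mu_b) / (sigma_b / math.sqrt(len(term_pairs)))
    # Upper tail of the standard normal: P(N(0,1) >= z_stat).
    p_value = 0.5 * math.erfc(z_stat / math.sqrt(2.0))
    return z_stat, p_value
```

Applying this function to every GO term/dataset combination would yield a matrix of pseudo P-values of the kind the text describes clustering for display.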
2.3 Search algorithm details
Motivated by our characterization of the functional coverage of the compendium, we have devised a search procedure to leverage the compendium's diversity. Our search algorithm is based on two components: a signal balancing technique that enhances biological information; and dataset relevance weighting to identify functional patterns within datasets that are meaningful given a set of user-provided query genes. (Note that this algorithm is independent of the functional coverage analysis presented in Section 2.2.) We refer to this algorithm as SPELL (Serial Patterns of Expression Levels Locator). A schematic overview of this method is shown in Figure 2.
2.3.1 Identification of functional patterns through signal balancing
While correlations between the original data vectors in microarray datasets are biologically meaningful, the high levels of noise in these datasets can lead to spurious results, particularly in the context of very large compendia. Singular value decomposition (SVD) has been applied to several other problems in microarray analysis, and it has been shown that this process can lead to substantial noise reduction (Alter et al., 2000; Wall et al., 2003). We apply SVD in a novel way to re-balance the signals present in datasets.
Briefly, SVD factors an original m × n data matrix, X, into three component matrices of the form:
$$X = U \Sigma V^{T}$$
where $\Sigma$ contains the singular values of X along its diagonal in decreasing order and U and $V^{T}$ contain the left- and right-singular vectors, respectively. In practice, $V^{T}$ defines an orthonormal basis for the columns of X in decreasing order of corresponding singular values, while U defines the projection of each original data vector in this new basis.
In contrast to typical applications of SVD for microarray analysis, we calculate correlations between gene coefficients in U rather than re-project to an approximation of X. In this case, U can be interpreted as the balanced projection of X onto its right singular basis, where the balancing weights are inversely proportional to the singular values defined by $\Sigma$, i.e. $U = XV\Sigma^{-1}$. Correlations between genes in U equally weight each dimension of the orthonormal basis and balance their contributions such that the least prominent patterns are amplified and more dominant patterns are dampened. This process helps reveal biological signals, as some of the dominant patterns in many microarray datasets are not biologically meaningful (see Supplementary Material for comprehensive evaluation of this signal balancing approach).
We apply this signal balancing approach to each dataset in our compendium separately. All correlations calculated during our search procedure in the next section are calculated in the resulting signal balanced U matrices rather than the original data matrices.
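The statement that U weights every basis dimension equally can be made explicit with a short derivation. It assumes the thin SVD with all retained singular values non-zero, which the text does not state explicitly.

```latex
\begin{align*}
X &= U \Sigma V^{T}, \qquad V^{T} V = I \\
X V &= U \Sigma V^{T} V = U \Sigma \\
U &= X V \Sigma^{-1} \\
U^{T} U &= I \quad\Rightarrow\quad \sum_{g=1}^{m} U_{gk}^{2} = 1
  \ \text{for every basis dimension } k
\end{align*}
```

By comparison, column k of the unbalanced projection $XV = U\Sigma$ has a sum of squares equal to the squared k-th singular value, so dominant patterns would otherwise contribute far more to any correlation between genes than minor ones.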
2.3.2 Query-based search
Given a compendium of signal balanced microarray datasets, D, and a query set of genes of interest, Q, our approach assigns a relevance weight to every dataset in the compendium. We then identify additional genes closely related to the query set within the weighted datasets. Given a set of query genes, $q_i \in Q$, we determine a relevance weight, $w_d$, for each dataset, d, in our compendium as the mean of all pair-wise z-transformed correlations, z, among the query genes:
$$w_d = \frac{2}{|Q|(|Q| - 1)} \sum_{i < j} z_d(q_i, q_j)$$
where $z_d(q_i, q_j)$ is the z-transformed correlation between query genes $q_i$ and $q_j$ in dataset d.
Given these weights for each dataset, we calculate a per-gene score, $s_x$, as the mean of weighted correlations to the query set for each gene x, across all D datasets in the compendium as:
$$s_x = \frac{1}{|D|\,|Q|} \sum_{d \in D} \sum_{q \in Q} w_d \, z_d(x, q)$$
Once scores are calculated for all genes, the results are sorted and the top results are returned. The effect of this process is to select those datasets most relevant to the biological context defined by the query and identify additional genes related in these datasets.
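The two-stage search described in this section (weight each dataset by the query genes' mutual correlation, then score every other gene by its weighted correlation to the query) can be sketched as follows. This is a simplified reading of the prose, not the SPELL implementation: the function name, the `frozenset`-keyed input format, the `top_k` parameter and the exact normalization are assumptions, and the query must contain at least two genes.

```python
from itertools import combinations
from statistics import mean


def spell_like_search(datasets, query, top_k=10):
    """Rank non-query genes by weighted correlation to the query set.

    `datasets` is a hypothetical list of dicts, one per dataset, mapping
    frozenset({gene_a, gene_b}) to a z-transformed correlation; every
    dataset is assumed to contain every gene pair.
    """
    query = sorted(set(query))
    genes = set()
    for z in datasets:
        for pair in z:
            genes.update(pair)
    candidates = sorted(genes - set(query))

    # Relevance weight: mean pair-wise correlation among the query genes.
    weights = [
        mean(z[frozenset(p)] for p in combinations(query, 2)) for z in datasets
    ]

    # Per-gene score: mean of weighted correlations to the query genes,
    # taken over all datasets.
    scores = {
        x: mean(
            w * z[frozenset((x, q))]
            for w, z in zip(weights, datasets)
            for q in query
        )
        for x in candidates
    }
    ranked = sorted(candidates, key=lambda g: scores[g], reverse=True)
    return weights, [(g, scores[g]) for g in ranked[:top_k]]
```

In this sketch a dataset in which the query genes are uncorrelated receives a weight near zero, so correlations observed in that dataset contribute little to the final ranking.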
2.4 Performance evaluation methodology
In order to evaluate our method's performance, we assessed the ability of our approach to recapitulate known biology by examining a set of 126 functionally distinct GO terms selected by an expert curation of the hierarchy performed by Myers et al. (2006). These GO terms were identified as specific enough that predicted annotations could be validated through laboratory testing, yet general enough to reasonably expect high-throughput data to be informative. We excluded very small terms (fewer than 10 annotated genes), as results can be misleading with such small numbers of positive examples.
We estimated precision-recall characteristics of our method through extensive cross-validation. For each GO term examined, we executed a separate search with each possible pair of annotated genes as the query set (i.e. leave-two-in cross-validation). Each of these queries resulted in an ordered list of all genes in the genome as ranked by the algorithm tested. We combined these lists by calculating the average rank of each gene across all lists (excluding the query genes) and producing an ordered master list for each GO term from best average rank to worst. Precision-recall curves were generated based on the master list's performance over the GO term examined, and average precision was used as a summary statistic for comparisons. To create precision-recall graphs averaged across GO terms, mean precisions were calculated at the scale of the smallest recall step examined (i.e. the inverse of the number of genes annotated to the largest GO term tested). The average precision, AP, for each GO term, G, is calculated as:
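$$AP(G) = \frac{1}{|G|} \sum_{i=1}^{|G|} \frac{i}{\mathrm{rank}(g_i)}$$
where $g_i$ is the i-th highest-ranked gene annotated to G and $\mathrm{rank}(g_i)$ is its position in the master list.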
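The evaluation procedure above (merge the per-query rankings by average rank, then summarize with average precision) might be sketched as below. The function names are hypothetical, and two details are assumptions rather than statements from the text: ranks are counted after removing each list's query genes, and ties in average rank are broken alphabetically.

```python
def master_list(ranked_lists, query_sets):
    """Order genes by average rank across lists, skipping each list's query genes."""
    totals, counts = {}, {}
    for ranking, query in zip(ranked_lists, query_sets):
        rank = 0
        for gene in ranking:
            if gene in query:
                continue  # query genes are excluded from that list's ranking
            rank += 1
            totals[gene] = totals.get(gene, 0) + rank
            counts[gene] = counts.get(gene, 0) + 1
    # Best (lowest) average rank first; gene name breaks ties.
    return sorted(totals, key=lambda g: (totals[g] / counts[g], g))


def average_precision(ranking, annotated):
    """Mean of the precision values at the rank of each annotated gene."""
    hits, precisions = 0, []
    for position, gene in enumerate(ranking, start=1):
        if gene in annotated:
            hits += 1
            precisions.append(hits / position)
    # Annotated genes missing from the ranking count as never recovered.
    return sum(precisions) / len(annotated)
```

For example, a ranking in which the two annotated genes appear at positions 1 and 3 gives an average precision of (1/1 + 2/3)/2.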
In addition to testing the performance of our SPELL algorithm, we compare our results with commonly used mega-clustering techniques based on both raw Pearson correlation and Fisher z-transformed, standardized z-scores. For Pearson correlation, results were calculated across the concatenation of all data into a single large matrix. For z-scores, results were calculated in individual datasets and the z-scores were averaged together. We also compared SPELL with another unsupervised, query-driven search technique, the Gene Recommender algorithm (Owen et al., 2003). However, as this algorithm was not designed for analysis on this scale over such a large collection of data, the running time limited this comparison to the 82 smallest of the 126 GO terms used in other comparisons. In all cases, the same cross-validation and bootstrapping procedure was used. Several results of these comparisons are shown in Figures 3 and 4 (see Supplementary Material for complete results).
| 3 IMPLEMENTATION |
|---|
Our SPELL methodology is implemented in a web-accessible search engine at http://function.princeton.edu/SPELL. Our interface allows a researcher to provide a list of query genes; the search engine then reports which datasets are most relevant to that query, lists additional genes related to the query within the relevant conditions and displays the expression levels of these genes. Links to extra information about each dataset, the original publications, and gene information are also provided. Queries are processed in seconds, which allows researchers to quickly locate and observe the relevant portions of the data compendium.
In addition to processing initial searches, users can refine and direct their search in a serial fashion, which allows researchers to more fully explore the data compendium by observing which biological conditions induce stronger or weaker correlations among varying sets of query genes. Thus a user can target the query to particular biological processes, which is especially valuable when investigating genes that are involved in multiple functions. A screenshot of this search engine is shown in Figure 5.
| 4 RESULTS AND DISCUSSION |
|---|
4.1 Functional coverage analysis of the microarray compendium
To map out the functional landscape of existing gene expression microarray data in S.cerevisiae, we have collected a large data compendium and examined it for coverage of known pathways and biological processes. Our collection contains 117 distinct datasets spanning 2394 array hybridizations. To our knowledge, this is the largest single microarray data compendium for S.cerevisiae.
In general, we expect different datasets to activate different pathways depending on the experimental condition studied. For example, stress response datasets should show a strong signal for ribosomal processes, but not necessarily meiosis, for which a sporulation time course may be better suited. We quantified this effect for our S.cerevisiae microarray compendium over a broad selection of biological processes as defined by GO and the Saccharomyces Genome Database (SGD) annotations (Cherry et al., 1998). For each GO term and dataset combination, we examined the statistical difference between the expression correlation among annotated genes and the background correlation among all genes within the dataset (see Methods section for details). The results of this evaluation are summarized in Figure 1 (see Supplementary Material for full matrix).
This analysis illustrates both which datasets are informative of each biological area and which biological areas are represented in the compendium at large. Some subsets of GO terms are significant in nearly all datasets, such as ribosomal processes (Fig. 1B). In contrast, many biological processes are active in only a few datasets, generally those where experimental conditions were specifically targeting the process in question. An example of this is GO terms that relate to the process of meiosis (Fig. 1C), which are significant in only a few, targeted datasets.
Finally, our analysis identifies several functional groups not significantly represented in our compendium, and thus likely not covered by currently available microarray data. These fall into several categories: pathways not believed to be transcriptionally regulated, functions that do not occur in many lab strains and finally, functional areas which may not have been targeted by a specific assay to induce co-regulation (see Supplementary Materials for complete results).
4.2 Query-driven search
Our approach to analysis relies on signal balancing coupled with context-sensitive search to provide fast, accurate performance. Given a set of query genes from a user, we weight the relevance of each dataset based on the query genes' correlation within that dataset. We then calculate the context-weighted correlation of every other gene back to the query set to identify the genes most related to the query set to report as results. Note that this approach is unsupervised in that the search process is independent of the functional coverage analysis discussed above.
By considering correlations only in entire logical datasets (e.g. a heat shock time course), we harness the biological diversity in the collection in a meaningful way. As we know that different datasets contain signals from different biological processes, it is vital to examine signals in those subsets of the compendium that are relevant to a particular area. By determining dataset relevance based on the query set's correlation, our method uses the data itself to determine which datasets are important for a specific query, rather than relying on a literature search or curation. This approach allows specific signals that may be present in only a few datasets in the compendium to be found without explicit prior knowledge of what the compendium contains. Another important benefit of examining correlations only in functionally coherent units is that this approach is able to compare and combine information from datasets generated using diverse technologies. Regardless of inter-dataset differences in signal or noise, our method is able to isolate and identify the most important information.
4.3 Performance evaluation in 126 biological areas
We have evaluated the ability of SPELL and other methods to reconstruct a known pathway given only a subset of genes in that pathway as input (see Methods section for details). We find that SPELL recovers known process proteins with substantially higher accuracy than other commonly used approaches (see Figs 3 and 4). For instance, measured in average precision, SPELL improves by a mean of 273% over the typical Pearson correlation concatenation approach. In 35 of the 126 GO terms examined, performance increases by more than 200%, in 71 cases performance increases by more than 100% and in a total of 101 cases performance increases by more than 50%. We find a performance decrease in only 5 GO terms, none of which has a biological signal in our gene expression compendium. Specifically, 4 of these 5 GO terms were identified as underrepresented in the collection during our functional coverage analysis, meaning no datasets in the compendium can be confidently deemed relevant to these processes. The remaining GO term where performance decreased is DNA recombination, which contains many genes with very high sequence similarity (transposons), causing cross-hybridization effects that make dataset co-expression not biologically meaningful. Thus, for all GO terms examined where a biologically meaningful signal is present in the microarray compendium, our approach leads to an increase in biological accuracy over mega-clustering.
We also compared the performance of SPELL with another unsupervised search approach, Gene Recommender (Owen et al., 2003). On average, SPELL exhibits a 67% performance increase over this approach and is dramatically faster (Fig. 4). In this analysis using a very large data collection, SPELL demonstrates a substantial improvement in biological accuracy over both simple mega-clustering techniques and the sophisticated feature selection-based Gene Recommender algorithm.
4.4 Novel biological predictions and confirmation
The results of our cross-validation and bootstrapping analysis can also be used to make novel gene function predictions. We examined the high-precision, low-recall area of the SPELL results to identify potential functions for genes currently lacking any annotations to the GO biological process branch. In many cases we have found supporting evidence for these predictions in the literature, and/or conducted laboratory experiments that support the hypotheses.
4.4.1 Multiple functions of un-annotated gene ARP8 are predicted by SPELL
SPELL makes 13 novel functional predictions for the gene, ARP8, which fall into three categories: processes related to the cell cycle, processes related to transcription by RNA polymerase II and processes related to cellular morphogenesis and structure (see Supplementary Material for complete list). Although this gene is not annotated to the GO biological process branch, several studies have been conducted that support these predictions.
Arp8 is a component of the 12-protein complex INO80. INO80 is a chromatin remodeling complex that is involved in regulation of transcription and in DNA damage response (Shen et al., 2000). The role of ATP-dependent chromatin remodeling complexes in transcriptional regulation is well documented (Cairns, 2005), and thus it comes as no surprise that an important component of the INO80 complex was predicted to the GO terms involved in transcriptional regulation. Perhaps more interesting, SPELL also predicted a recently characterized function of INO80: its role in both repairing double-stranded DNA breaks and homologous recombination (van Attikum and Gasser, 2005). Mutants which cripple INO80 function have been shown to be sensitive to DNA damaging agents, and temperature-sensitive alleles of INO80 arrest at G2/M (Shen et al., 2000). Thus, the series of GO terms related to progress through the cell cycle are extremely relevant to the function of Arp8 in the INO80 complex.
A novel predicted function for the ARP8 gene was a role in cellular morphogenesis and cytoskeleton organization. Using a complete deletion of the ARP8 gene from the yeast deletion set (Giaever et al., 2002), we grew four independent colonies of both wild-type yeast and an arp8Δ strain in rich media. We measured the cell volume for these cultures and found a dramatic increase in cell volume to 66.7 ± 2.1 fl for arp8Δ, up from 36.9 ± 0.7 fl for wild type. Furthermore, by observing these cultures with microscopy we discovered that arp8Δ cells had an abnormal, enlarged ellipsoid shape compared to the rounded shape of wild-type yeast as shown in Figure 6. These data verify that the ARP8 gene plays a critical role in maintaining normal cellular shape and size, which supports these predictions of our system.
The ability of SPELL to identify several distinct functions of ARP8 demonstrates the effectiveness of our methodology. By searching through the available data in a context-sensitive manner, our approach has the ability to identify signals in biologically diverse subsets of the compendium in a meaningful way.
4.4.2 SPELL predicts YDL089W is involved in sporulation
Another biological prediction made by our system is that the previously uncharacterized ORF YDL089W is involved in sporulation. Several lines of evidence strongly support this prediction. First, overexpression of YDL089W suppresses the sporulation defect of a csm1Δ strain (Wysocka et al., 2004). Csm1 is involved in chromosome segregation during meiosis and Csm1 was demonstrated to have a physical interaction with YDL089W. Furthermore, a protein chip screen for targets of the Cdc28 kinase (an important regulator of chromosome segregation at G2/M) found YDL089W as a target (Ubersax et al., 2003). These results experimentally support our prediction that YDL089W plays a role in sporulation.
4.4.3 Support for other novel GO biological process annotation predictions by SPELL
SPELL predicts that the un-annotated protein SET7 is involved with protein amino acid alkylation. The most common alkylation event in cells is the transfer of a methyl group to an amino acid. The SET domain has been shown to catalyze the methylation of lysine residues (Xiao et al., 2003). The assignment of the process amino acid alkylation to SET7 is consistent with the lysine methylation function of the Set7 protein.
Another novel annotation prediction that is consistent with recently published data is the assignment of TVP38 to glycoprotein metabolism. The Tvp38 protein was recently identified as one of nine novel components in the Golgi apparatus where much of protein glycosylation occurs (Inadome et al., 2005). Furthermore, the copurification with glycosylation proteins found in this study strongly supports this functional prediction.
4.4.4 Effectiveness of SPELL for novel biological process annotations
The biological diversity of these verified predictions demonstrates the effectiveness of our approach. Novel gene functions as diverse as double-stranded break repair, sporulation, glycosylation and transcriptional regulation have been correctly predicted by our approach using only publicly available gene expression microarray data. We believe systems such as SPELL that can enable fast generation of meaningful hypotheses given existing data will play a key role in directing future laboratory work.
| 5 CONCLUSIONS |
|---|
As the biology community is producing a very large amount of gene expression data, it is critical to develop fast, biologically relevant search methods to enable researchers to leverage all of the available data in their own analyses. To this end, we have gathered the largest single collection of S.cerevisiae microarray data and studied the representation of various pathways and functions within the datasets contained in this collection. Our study exhibits the biological diversity of publicly available data and also points to several biological areas which are not yet covered by the gene expression collection.
We propose a general, effective search method for harnessing very large gene expression data compendia. We have implemented this method, called SPELL, in a web-based, context-sensitive search engine for the large-scale S.cerevisiae data collection. When recapitulating known biology, the accuracy of our approach improves on existing mega-clustering techniques by more than 250% on average. Further, our system makes several novel biological predictions that we have verified through recent publications in the literature and additional laboratory tests. While we believe that our system will be very useful for biologists, there is still room for the development of additional methods for query-driven data exploration. For example, modifications to bi-clustering algorithms or the further development of feature selection techniques may also be useful paths for future research. These types of approaches will prove invaluable for the research community by providing an easy, direct link to biologically relevant information that exists within published gene expression data.
| ACKNOWLEDGEMENTS |
|---|
The authors would like to thank the members of the Botstein, Kruglyak and Dunham laboratories for advice and input on the system. We also thank John Wiggins and Mark Schroeder for excellent technical support. O.G.T. is an Alfred P. Sloan Research Fellow. This research was partially supported by NSF grant CNS-0406415, NSF CAREER award DBI-0546275 to O.G.T., NIH grant R01 GM071966, NSF grant IIS-0513552, NIH grant T32 HG003284, NIGMS Center of Excellence grant P50 GM071508 and a Google Research Award.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on May 4, 2007; revised on August 2, 2007; accepted on August 2, 2007
| REFERENCES |
|---|
Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA (2000) 97: 10101–10106.
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000) 25: 25–29.
Baldwin DN, et al. A gene-expression program reflecting the innate immune response of cultured intestinal epithelial cells to infection by Listeria monocytogenes. Genome Biol. (2003) 4: R2.
Brazma A, et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res. (2003) 31: 68–71.
Cairns BR. Chromatin remodeling complexes: strength in diversity, precision through specialization. Curr. Opin. Genet. Dev. (2005) 15: 185–190.
Cheng Y, Church GM. Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. (2000) 8: 93–103.
Cherry JM, et al. SGD: Saccharomyces Genome Database. Nucleic Acids Res. (1998) 26: 73–79.
Edgar R, et al. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. (2002) 30: 207–210.
Fisher RA. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika (1915) 10: 507–521.
Gasch AP, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell (2000) 11: 4241–4257.
Giaever G, et al. Functional profiling of the Saccharomyces cerevisiae genome. Nature (2002) 418: 387–391.
Ihmels J, et al. Revealing modular organization in the yeast transcriptional network. Nat. Genet. (2002) 31: 370–377.
Inadome H, et al. Immunoisolation of the yeast Golgi subcompartments and characterization of a novel membrane protein, Svp26, discovered in the Sed5-containing compartments. Mol. Cell. Biol. (2005) 25: 7696–7710.
Le Crom S, et al. yMGV: helping biologists with yeast microarray data mining. Nucleic Acids Res. (2002) 30: 76–79.
Madeira SC, Oliveira AL. A Linear Time Biclustering Algorithm for Time Series Gene Expression Data. (2005) Proceedings of the 5th Workshop on Algorithms in Bioinformatics (WABI05). 39–52.
Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. (2004) 1: 24–45.
Montgomery C, et al. Engineering Statistics (2001) New York: John Wiley & Sons, Inc.
Myers CL, et al. Finding function: evaluation methods for functional genomic data. BMC Genomics (2006) 7: 187.
Owen AB, et al. A gene recommender algorithm to identify coexpressed genes in C. elegans. Genome Res. (2003) 13: 1828–1837.
Primig M, et al. The core meiotic transcriptome in budding yeasts. Nat. Genet. (2000) 26: 415–423.
Saldanha AJ, et al. Nutritional homeostasis in batch and steady-state culture of yeast. Mol. Biol. Cell (2004) 15: 4089–4104.
Shen X, et al. A chromatin remodelling complex involved in transcription and DNA processing. Nature (2000) 406: 541–544.
Sherlock G, et al. The Stanford Microarray Database. Nucleic Acids Res. (2001) 29: 152–155.
Tanay A, et al. Discovering statistically significant biclusters in gene expression data. Bioinformatics (2002) 18 (Suppl. 1): S136–S144.
Troyanskaya O, et al. Missing value estimation methods for DNA microarrays. Bioinformatics (2001) 17: 520–525.
Ubersax JA, et al. Targets of the cyclin-dependent kinase Cdk1. Nature (2003) 425: 859–864.
van Attikum H, Gasser SM. ATP-dependent chromatin remodeling and DNA double-strand break repair. Cell Cycle (2005) 4: 1011–1014.
Wall E, et al. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis, Berrar P, et al., eds. (2003) Boston, MA: Kluwer Academic Publishers. 91–109.
Wysocka M, et al. Saccharomyces cerevisiae CSM1 gene encoding a protein influencing chromosome segregation in meiosis I interacts with elements of the DNA replication complex. Exp. Cell Res. (2004) 294: 592–602.
Xiao B, et al. SET domains and histone methylation. Curr. Opin. Struct. Biol. (2003) 13: 699–705.