Bioinformatics Advance Access originally published online on October 27, 2004
Bioinformatics 2005 21(7):1037-1045; doi:10.1093/bioinformatics/bti074
Local correlation of expression profiles with gene annotations: proof of concept for a general conciliatory method
1Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, 2781-901 Oeiras, Portugal
2Department of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston, SC 29425, USA
3Departments of Ophthalmology and Physiology & Neuroscience, Medical University of South Carolina, Charleston, SC 29425, USA
4Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC 29425, USA
*To whom correspondence should be addressed.
Abstract
Motivation: Integrated analysis of expression data and gene ontology annotations is a prime example of biological data that need co-explanatory interpretation. This particular application is used to validate a new method for integrated analysis of varied biological information.
Results: The proposed method consists of determining local correlation coefficients and the corresponding P-values calculated per biological entity. This measure considers the combined intensity and significance of the agreement or disagreement between two data sources about the same biological entity. The method is applied to the integrated analysis of gene expression and annotation of two gene sets, one from yeast and the other from mouse. The potential of the method to generate accurate mechanistic hypotheses is also demonstrated. In particular, negative correlation results pose a new kind of biological hypothesis. Method performance was compared with annotation enrichment methods, and optimal conditions for the superiority of local correlation results are discussed.
Availability: The Matlab functions described in this article are available at http://bioinformatics.musc.edu/~frpinto/
Contact: almeidaj{at}musc.edu
Supplementary information: Further information, tables and figures are available at http://bioinformatics.musc.edu/~frpinto/
INTRODUCTION
The availability of large quantities of biological data documenting complementary attributes of common biological entities requires the establishment of objective conciliatory methods for data integration. Such methods should allow for a more global view of the data, enabling a more powerful study of biological systems as a whole by establishing a quantitative approach to conciliate analysis. The fundamental rationale for this approach is not that congruence needs to be forced on disparate data. On the contrary, it reflects the emerging realization that fundamentally significant and unifying principles for the organization and behaviour of biological systems can be deduced from consilience (Wilson, 1999) spanning multiple sources of information deriving from the same genetic components.
In recent years, several methods for the integration of biological data have been proposed. Some were specifically designed for the integration of gene sequence and gene expression data (Chiang et al., 2001; Caselle et al., 2002), mainly with the aim of discovering novel promoter regulatory elements. Other studies use information from gene annotations to evaluate the performance of clustering algorithms (Gibbons and Roth, 2002; Gat-Viks et al., 2003) or to simplify gene expression data interpretation (Grosu et al., 2002). Data documenting biological networks have also been analysed in conjunction with gene expression data through co-clustering (Hanisch et al., 2002). Other authors proposed a more general framework for the integration of heterogeneous data sources, although it requires expert knowledge for data accuracy evaluation (Troyanskaya et al., 2003). With the final goal of objectively discovering transcript modules, a common strategy that uses the expectation-maximization (EM) algorithm to learn probabilistic models from data has been applied to integrate sequence and protein interaction data with gene expression data (Segal et al., 2003a,b).
This article proposes a new method for integration and analysis of diverse biological data. The proposed method does not make assumptions regarding the mechanism underlying the relationships among the different sources of information. Its only requirement is the previous identification of cross-tabulated distance matrices. Starting with distance measures rather than with attribute variables enables the inclusion of biological data with no absolute space coordinates, such as biological network data. These matrices can be computed using data from diverse sources such as gene sequences (Vinga and Almeida, 2003), protein-protein interaction networks (Schwikowski et al., 2000), metabolic networks (Ravasz et al., 2002) or functional annotations (Gene Ontology Consortium, 2004). The matrices used have distance values between the same set of objects in the two different data spaces. These objects can be genes, proteins or other biological entities of interest. Basically, this method addresses the question: Is there an agreement between data from the two information sources? That is, is the position of each object relative to the others similar in the two data spaces? For example, if the objects of study are genes, it can be asked whether a given gene shares sequence properties with the same subset of genes with which it is closely connected as regards protein-protein interactions, or with other genes that are distant in that interaction network.
The method proposed examines each object individually and evaluates the relation between the distances of other objects to that object in both spaces. Due to the complexity of biological systems, it will be shown that examining all objects simultaneously makes it less likely to find significant correlations between two distance variables, even if a relation between them does exist. Studying the relation between the relative positions of the same object in two different representation spaces removes the effect of position variation in both spaces, focusing instead on the conservation of local context. Consequently, an objective quantitative measure of correlation should allow for the detection of different but consistent relations between the two data sources in different regions of both spaces.
The proposed method was developed with the intent to identify groups of genes, pathways and processes concertedly reacting to experimental perturbations. It can also be more narrowly used for the identification of groups or modules of genes with similar behaviour in different datasets. Conversely, it also assists with the prediction of annotations from the dynamics of gene expression by identifying congruence beyond linear correlation of expression profiles.
In this report, the proposed method is illustrated for the integration of data from gene expression and from gene annotation in yeast and mouse data. Three distinct alternative formulations are considered to evaluate the local agreement of the two distances. This local agreement will be designated local correlation as a reflection of its quantification by correlation coefficients.
METHODS
Method outline
Throughout this article, the term distance is used to refer to the measure of dissimilarity between two genes in the same data space (expression or annotation space), total correlation expresses the level of agreement between two distance matrices and local correlation expresses the agreement between corresponding distance values relative to a given gene in the two distance matrices.
A quick overview of the proposed method is represented in Figure 1. From the two data sources, gene expression profiles and functional annotations, distance matrices are generated. The two paired distance matrices are jointly used in the computation of local correlation coefficients, one for each gene. With the aim of establishing confidence measures for the local correlation values, the two distance matrices are re-sampled many times by bootstrapping. The bootstrapped matrix pairs, where the original column-wise relations are disrupted, are used to compute local correlation values under the null hypothesis of local correlation absence. These values allow the determination of an empirical null distribution of local correlation coefficients. This distribution is then used to assign P-values to the local correlation values obtained with the two original matrices. The list of local correlation values and corresponding P-values carries the novel information about the genes that results from the joint analysis of the two data sources. This new information can be read in multiple ways: set P-value thresholds (one for positive and the other for negative local correlations) and analyse the selected genes or annotation terms, or re-group genes by one of the data sources and look for the resulting groups or patterns of P-values. Both strategies were followed in this work. Genes were clustered by expression profiles, and the re-ordered list of local correlation P-values was searched for groups of positively or negatively locally correlated genes. More detailed descriptions of each of these steps are provided in the next sections.
Gene expression data
The present work uses gene expression data from two distinct biological systems: yeast (Saccharomyces cerevisiae) and mouse (Mus musculus). Data were obtained using DNA chips from Affymetrix. For both systems, wild-type organisms were compared with mutants along time series. The yeast mutants (lcb100 strain) are unable to synthesize sphingolipids under heat stress, while the mouse mutants (rd/rd mouse) have permanently open cation channels in the photoreceptor membrane leading to photoreceptor degeneration. The supplementary text (available at http://bioinformatics.musc.edu/~frpinto/) contains a more detailed description of the generation and previous analysis of these data sets, which is also reported by the authors (Cowart et al., 2003, for yeast data; Rohrer et al., submitted for publication, for mouse data).
These data sets were chosen for testing the local correlation method of analysis because (1) they are typical one-lab experiments studying one biological perturbation and (2) the authors have expertise in the associated fields of research. The former guarantees that the new method is widely applicable, and the latter enables subsequent validation of the new method's findings.
Gene ontology annotations
The annotations for the two genomes used in this work were obtained from the Gene Ontology Consortium (Gene Ontology Consortium, 2004) database (http://www.geneontology.org) between August and September 2003. Annotations for molecular function, biological process and cellular component were used separately. The relation scheme between the Gene Ontology (GO) terms used was also imported. This scheme sets hierarchical relations among all the GO terms, establishing a rooted directed acyclic graph (DAG), in which each term is a node. The term Gene Ontology is the root node, establishing relations with three child nodes corresponding to function, process and cellular component. Each of the three nodes may be the parent of more specific nodes. In general, the terms most distant from the root node are the most specific ones.
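The rooted DAG described above can be sketched in a few lines of Python. This is only an illustration: the three top-level terms come from the text, while `process_a`, `process_b`, `process_ab` and `process_ab_specific` are hypothetical terms, not actual GO entries. The sketch assumes each term is stored with the list of its parents.

```python
# Toy rooted DAG in the spirit of GO. Terms named "process_*" are hypothetical
# placeholders, not actual GO terms. Each term maps to the list of its parents.
PARENTS = {
    "gene_ontology": [],
    "molecular_function": ["gene_ontology"],
    "biological_process": ["gene_ontology"],
    "cellular_component": ["gene_ontology"],
    "process_a": ["biological_process"],
    "process_b": ["biological_process"],
    "process_ab": ["process_a", "process_b"],  # two parents: a DAG, not a tree
    "process_ab_specific": ["process_ab"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def depth(term, parents=PARENTS):
    """Length of the longest path from the root down to `term`."""
    if not parents[term]:
        return 0
    return 1 + max(depth(p, parents) for p in parents[term])
```

In this toy graph the deepest term is also the most specific one, which mirrors the remark that specificity generally grows with distance from the root.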
Computation of distance matrices in gene expression space
In the case of expression data, the space coordinates are expression log ratios for each experimental time point. Therefore, metric distances (Everitt and Dunn, 2001) were used to quantify expression dissimilarity between any two genes. For the yeast gene dataset, Euclidean distances were calculated, and for the mouse gene set, correlation distances were used.
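As a minimal sketch of the two expression distances mentioned above, the following Python code builds a gene-by-gene distance matrix from expression profiles. The text does not define the correlation distance; the sketch assumes the common definition of one minus the Pearson correlation between two profiles, which may differ from the authors' Matlab implementation. All function names are illustrative.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def correlation_distance(x, y):
    """One minus the Pearson correlation of two profiles (assumed definition)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


def distance_matrix(profiles, metric):
    """Symmetric n x n matrix of pairwise distances, with zeros on the diagonal."""
    n = len(profiles)
    return [
        [0.0 if i == j else metric(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]
```

With log-ratio profiles as rows, `distance_matrix(profiles, euclidean_distance)` would play the role of the yeast expression matrix and `distance_matrix(profiles, correlation_distance)` that of the mouse one. Profiles with zero variance would need special handling, which is omitted here.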
Computation of distance matrices in functional annotation space
The functional annotation space does not have explicit coordinates because each gene is characterized by one or more terms defining a space that is neither quantitative nor coordinated. Nevertheless, the inter-relation of function terms in a DAG still allows the direct calculation of a distance measure between each two terms. In this work, a measure based on probabilistic and information theory concepts, designated semantic distance, is used. This measure was previously defined and discussed by Lord et al. (2003) and is briefly described in the supplementary text (http://bioinformatics.musc.edu/~frpinto/). For both yeast and mouse data sets, three separate distance matrices were computed, corresponding to function, process and cellular component annotation spaces.
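The exact semantic distance is defined in the supplementary text and is not reproduced here. Purely as an illustration of how a DAG plus annotation counts can yield a term-to-term distance, the sketch below assumes an information-content measure in the Jiang-Conrath style: the information content of a term is minus the logarithm of the fraction of annotations that fall on the term or its descendants. The DAG, the counts and all names are hypothetical, and the authors' measure may differ. How term distances are combined for genes carrying several terms is not covered.

```python
import math

# Hypothetical toy DAG (term -> parents) and direct annotation counts.
PARENTS = {
    "root": [],
    "process": ["root"],
    "transport": ["process"],
    "metabolism": ["process"],
    "aa_transport": ["transport"],
    "ion_transport": ["transport"],
}
DIRECT_COUNTS = {
    "root": 0, "process": 0, "transport": 0,
    "metabolism": 8, "aa_transport": 1, "ion_transport": 1,
}


def ancestors_or_self(term):
    """The term together with all of its ancestors in the DAG."""
    seen = {term}
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_probabilities():
    """p(term): share of annotations on the term or any of its descendants."""
    totals = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for t in ancestors_or_self(term):
            totals[t] += count
    return {t: totals[t] / totals["root"] for t in totals}


def semantic_distance(a, b, prob):
    """Assumed Jiang-Conrath-style distance: IC(a) + IC(b) - 2 * IC(shared)."""
    def info(t):
        return -math.log(prob[t])
    shared = ancestors_or_self(a) & ancestors_or_self(b)
    return info(a) + info(b) - 2 * max(info(t) for t in shared)
```

In this toy example two rare sibling terms under `transport` come out closer to each other than either is to `metabolism`, because their most informative shared ancestor is itself fairly specific.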
Local correlation calculation
After the calculation of both distance matrices, for the yeast and mouse datasets, it becomes feasible to calculate a correlation of the two sets of distance measures for each gene. In this work, three alternative formulations for local correlation are described and evaluated. The first one is the reference Pearson linear correlation coefficient; the other two were devised by the authors and are described below. The aim of this comparative study is to characterize the sensitivity of the local correlation method to the actual coefficient chosen to capture correlations. In the following coefficient definitions, n is the number of objects under study (genes, in this work), and d1 and d2 are the n × n distance matrices obtained from two different data/information sources (gene expression and gene annotations in this paper). The expression d(i,j) refers to a distance value in the matrix d, at the intersection of row i and column j. The letters i and j are consistently used as indexes of matrix rows and columns, respectively. d(i,·) identifies the i-th row of matrix d. A software library allowing the reader to evaluate the three correlation coefficients for small user-definable datasets is made publicly available with the supplemental material (http://bioinformatics.musc.edu/~frpinto/).
Pearson linear correlation coefficient
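This coefficient can be calculated using the conventional expression presented as Equation (1), where $\bar{x}$ denotes the mean value of $x$:

$$r_i = \frac{\sum_{j \neq i} \left[d_1(i,j) - \overline{d_1(i,\cdot)}\right]\left[d_2(i,j) - \overline{d_2(i,\cdot)}\right]}{\sqrt{\sum_{j \neq i} \left[d_1(i,j) - \overline{d_1(i,\cdot)}\right]^2 \; \sum_{j \neq i} \left[d_2(i,j) - \overline{d_2(i,\cdot)}\right]^2}} \qquad (1)$$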
The Pearson correlation is measured for each of the genes individually, which means that for gene i only row i of the d1 and d2 matrices is used. The sums run over n − 1 terms rather than n because distances to self [d(i,j) when i = j] are not included. The Pearson coefficient varies between −1 and 1, reaching values near 0 when the values of the two distances do not covary, and values close to −1 and 1 when the values covary in the inverse and the same direction, respectively.
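The row-wise calculation described above can be sketched as follows. This is an illustrative Python version, not the authors' Matlab function; the name `local_pearson` and the choice to return NaN for a constant row are assumptions.

```python
import math


def local_pearson(d1, d2):
    """One Pearson coefficient per object, using row i of each matrix and
    skipping the distance of the object to itself."""
    n = len(d1)
    coefficients = []
    for i in range(n):
        x = [d1[i][j] for j in range(n) if j != i]
        y = [d2[i][j] for j in range(n) if j != i]
        mx, my = sum(x) / len(x), sum(y) / len(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        syy = sum((b - my) ** 2 for b in y)
        denom = math.sqrt(sxx * syy)
        # A constant row has no defined correlation; flag it with NaN.
        coefficients.append(sxy / denom if denom > 0 else float("nan"))
    return coefficients
```

The result is a list with one value per gene, which is the quantity later converted into a P-value by bootstrapping.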
Ratio correlation coefficient
This measure consists of calculating a ratio between averages of distances to a given object, weighted by the corresponding distances in the other space, relative to the same average of distances without weighting, as detailed in Equation (2). Note, once again, that the sums in Equation (2) are always across j, meaning that all distances used are in the rows d1(i,·) and d2(i,·), i.e. all distance values express distances from object i to the others.
[Equation (2): ratio correlation coefficient; equation image not recovered]
Score correlation coefficient
The third cross-space correlation proposed is a non-parametric measure of the neighbourhood similarity of each object in the two spaces. In fact, the order of the neighbours is compared. This coefficient can be seen as a discrete adaptation of a correlation integral (Baker and Gollub, 1996). The procedure used to compute the score correlation coefficient is presented in a simplified way in the following pseudo-code.
rnk1 = matrix of column-wise ranks of matrix d1
rnk2 = matrix of column-wise ranks of matrix d2
for j = 1 to n
    order1 = sort rnk1(·,j) by its ascending order
    order2 = sort rnk2(·,j) by the ascending order of rnk1(·,j)
    score(j) = 0
    for i = 1 to n
        neib = number of common ranks in order1 and order2 in the first i elements of both vectors
        score(j) = score(j) + neib
    end
    score(j) = (score(j) - minscore)/(maxscore - minscore)
end
In the case of ties in the rank attribution, the measure is calculated twice. In the first calculation, the ranks are distributed so as to reach the highest rank concordance between both spaces, leading to the highest score value. In the second calculation, the ranks are distributed so as to reach the lowest possible concordance between the ranks in both spaces, leading to the lowest score value. The final score is the arithmetic average of these two extreme values. To normalize the score values between 0 and 1, the theoretical values of score are calculated for the two extreme situations, one in which order1 and order2 are equal, leading to maxscore, and the other where order1 and order2 have inversely ordered elements, leading to minscore. In this way, objects with a score of 1 have their neighbours ordered in exactly the same way in both spaces. Objects with a score of 0 have their neighbours inversely ordered; the ones that are closer in one space are far away in the other space. This coefficient is expected to be more robust to scale changes in distance values.
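The pseudo-code above can be turned into a short runnable sketch. The Python version below makes three simplifying assumptions that are not in the original: the object itself is left out of its own neighbour list, ties are not handled (the two-pass averaging described above is omitted), and maxscore and minscore are obtained by running the same counting routine on identical and reversed orderings rather than from closed-form expressions. At least three objects are needed for the normalization to be defined.

```python
def neighbour_overlap(order_a, order_b):
    """Sum, over i, of the number of items shared by the first i entries
    of the two orderings."""
    total = 0
    seen_a, seen_b = set(), set()
    for a, b in zip(order_a, order_b):
        seen_a.add(a)
        seen_b.add(b)
        total += len(seen_a & seen_b)
    return total


def score_coefficient(d1, d2, j):
    """Normalised agreement of neighbour order around object j (no tie handling)."""
    n = len(d1)
    others = [k for k in range(n) if k != j]
    order1 = sorted(others, key=lambda k: d1[k][j])  # neighbours by distance in space 1
    order2 = sorted(others, key=lambda k: d2[k][j])  # neighbours by distance in space 2
    raw = neighbour_overlap(order1, order2)
    max_score = neighbour_overlap(order1, order1)        # identical orderings
    min_score = neighbour_overlap(order1, order1[::-1])  # inverse orderings
    return (raw - min_score) / (max_score - min_score)
```

Comparing the sets of nearest neighbours directly is equivalent to comparing ranks when there are no ties, which is why the rank matrices of the pseudo-code do not appear explicitly here.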
Bootstrapping
After the calculation of local correlation measures, it is necessary to determine whether the values calculated are significantly different from the expected values in the absence of correlation between the two distances. To accomplish this, probability distributions of the local correlation measures under the null hypothesis of no correlation were calculated. Although for one of the measures employed in this work, the Pearson correlation coefficient, this distribution is explicitly definable (Daniel, 1999), all three measures were treated equally, and the null distributions were numerically determined by bootstrapping (Quinn and Keough, 2002). The pairs of corresponding distances in the two spaces, expression and annotation, are re-sampled 5000 times with replacement. Other randomization strategies were tested, such as re-sampling without replacement or without maintaining joint distance distributions, but similar results were obtained. The bootstrap sample size used is considered to estimate P-values with a precision >0.01 (Quinn and Keough, 2002).
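A minimal sketch of how such an empirical null distribution and P-value could be obtained is given below. The resampling scheme is one possible reading of the description: all off-diagonal (d1, d2) pairs are pooled, which keeps the joint distance distribution, and n − 1 pairs are drawn with replacement to form each pseudo-row. The P-value convention (fraction of null values at least as large as the observed one, so that low values indicate positive and high values negative local correlation) is inferred from the thresholds used in the Discussion. Function names and the handling of constant samples are assumptions.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0


def bootstrap_null(d1, d2, coefficient, n_resamples=5000, seed=0):
    """Null distribution of a row-wise coefficient from pooled distance pairs."""
    rng = random.Random(seed)
    n = len(d1)
    # Pool every off-diagonal pair so the joint distribution is preserved
    # while the link to any particular object is broken.
    pairs = [(d1[i][j], d2[i][j]) for i in range(n) for j in range(n) if i != j]
    null_values = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in range(n - 1)]  # with replacement
        xs = [p[0] for p in sample]
        ys = [p[1] for p in sample]
        null_values.append(coefficient(xs, ys))
    return null_values


def empirical_p_value(observed, null_values):
    """Fraction of null coefficients greater than or equal to the observed one."""
    return sum(1 for v in null_values if v >= observed) / len(null_values)
```

The same null distribution can be reused for every gene, since it does not depend on which row the observed coefficient came from.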
A bootstrapping procedure was also used for the calculation of confidence intervals for total correlation coefficients.
Other complementary methods are described in the supplemental text available at http://bioinformatics.musc.edu/~frpinto/.
IMPLEMENTATION
All algorithms and computations referred to in the Methods section were implemented (or were already available) in Matlab (Version 6.5), requiring its Statistics Toolbox. Matlab functions for the calculation of local correlation measures, as well as semantic distances, are made publicly available at http://bioinformatics.musc.edu/~frpinto/. Datasets are also included to illustrate usage.
DISCUSSION
Distribution of local correlation coefficient values
The distributions of local correlation coefficient under the null hypothesis of correlation absence were estimated for different experimental distance value distributions. It was found that local correlation coefficient distributions were robust to changes in distance value distributions and were always symmetric and unimodal. Slight variations in coefficient distributions with the distance distributions appear to be due to changes in total correlations between the two distance matrices. These results assure the applicability of the local correlation method to diverse experimental conditions. More details on this part of the analysis are available in the supplementary text (http://bioinformatics.musc.edu/~frpinto/).
Agreement between local correlation coefficients
The cases of positive local correlation were analysed separately from the negative ones (Figs 2 and 3). The numbers in the lower left corner of each subplot indicate the frequency with which all [subplots (a) and (e)] or each [subplots (b)–(d) and (f)–(h)] of the coefficients detect [subplots (a)–(d)] or not [subplots (e)–(h)] local correlations. They show that most times the three coefficients agree, or at least two of them do. Strong detections (P < 0.1 or P > 0.9) by one coefficient only are rare, except for the score coefficient. This coefficient also shows a greater tendency to detect positive local correlations. The shapes of the positively locally correlated relations between expression and semantic distances are plotted in Figure 2. All the genes represented in Figure 2 had P-values <0.50 for all the local correlation coefficients. The gene set for which all the P-values are <0.10 (Fig. 2a) clearly satisfies the general notion of positive correlation between two distances. Figure 2d and f also present clearly positively correlated distance relations, revealing that the large number of genes detected selectively by the score coefficient constitutes a merit of this measure (Fig. 2d), while the ratio coefficient is not able to detect many genes with confirmed positive local correlation (Fig. 2f). Figure 3a shows what the three coefficients agree to be a negatively locally correlated distance relation. Once again, it is the score coefficient that selectively detects a larger number of genes that can be visually identified as negative relations in Figure 3d. But the same coefficient is not able to detect a similar number of genes that also show clear, but not intense (Fig. 3h), negative local correlation between expression and semantic distances. As in the positive correlation case, the ratio coefficient fails to detect clear and intense negative local correlations (Fig. 3f).
When only one or two local correlations are detected, the quantile-quantile distance profiles typically confirm the existence of a correlation. This fact encourages us to use the three coefficients complementarily, as they seem to be sensitive to different kinds of consistent local relations between data spaces.
Detection of perturbed functions, processes or cellular localizations
The search for genes with strong (P < 0.1 or P > 0.9) or mild (P < 0.2 or P > 0.8) local correlations (defined by at least one of the three coefficients) identifies the ones with a consistent relation between their annotations and expression profiles, either positive or negative (Table 1). The authors hypothesize that these genes are related to significant biological information. One may test this hypothesis by looking for the detection of expected phenomena in the performed experiments. In yeast, sphingolipids are known to be involved in the induction of a transient cell-cycle arrest in response to heat stress (Jenkins and Hannun, 2001). In fact, among the list of genes that are differentially expressed in the yeast mutant (unable to synthesize sphingolipids during heat stress), our method identified three cell-cycle-related genes (Table 1). Sphingolipids have also been associated with the regulation of protein degradation in the heat stress response, via endocytosis vacuolar degradation and 26S proteasome pathways (Chung et al., 2000). Possibly related to this fact, the score and Pearson coefficients identified two endocytosis-related genes. Knowing this, it was easier to accept that the sphingolipids seemed to regulate some amino-acid metabolism genes (Table 1), apparently shifting from an anabolic to a catabolic state (Cowart et al., 2003). Additionally, genes involved in amino-acid transport were identified, among other nutrient transport genes (Table 1). The regulation of nutrient uptake by sphingolipids was previously observed (Chung et al., 2001). These experimental results allowed the inclusion of protein biosynthesis, glycogen metabolism, lipid metabolism, one carbon compound and purine base metabolism as possible sphingolipid-regulated processes during heat stress response in yeast (Table 1). The mouse experiment followed the development and degeneration of photoreceptor cells in the retina of RD mutant mice.
This mutation is known to permanently open cGMP-gated cation channels, allowing high Na+ and Ca2+ influxes to the photoreceptor cells. This phenomenon leads ultimately to cell death, and this experiment aimed at the clarification of the involved pathways (Rohrer et al., submitted for publication). The local correlation method was efficient at the detection of the expected perturbations of ion-channel genes, signal transduction genes and vision/phototransduction-related genes (Table 1). Several transcription and translation factors were identified (Table 1), with the genes selected being involved in the regulation of photoreceptor degeneration. Two apoptosis-related genes were also found to be strongly locally correlated. Supporting this finding, four proteolysis enzyme genes were identified by the local correlation method, one of them being simultaneously apoptosis related (Table 1). Interestingly, genes involved in immune inflammatory response, cytoskeleton related, development and cell-cycle regulation were highlighted by the local correlation method (Table 1), suggesting the participation of these pathways in the photoreceptor degeneration process. With the objective of visually helping the detection of these genes of interest, the lists of genes from each set (yeast and mouse) were reorganized according to expression profile similarity, enabling the local correlation coefficients of genes with similar expression to be easily analysed and the detection of concordant annotations. The gene sorting by similar expression was performed through hierarchical clustering of gene expression matrices. The detailed results are tabulated in the supplementary material and are graphically displayed in Figures S3 and S4 (available in supplementary text, http://bioinformatics.musc.edu/~frpinto/). 
These figures show that gene expression clusters tend to be alternatively dominated by positive or negative locally correlated genes, for function, process or cellular component annotations.
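The selection rule used at the start of this section (strong or mild local correlation, called by at least one of the three coefficients) can be written down as a small sketch. The thresholds are those quoted in the text; the function name, the dictionary layout and the choice to report both a positive and a negative call when coefficients disagree are assumptions made for illustration.

```python
def classify_gene(p_values, strong=0.1, mild=0.2):
    """Return the local correlation calls for one gene.

    p_values maps a coefficient name (e.g. "pearson", "ratio", "score")
    to its bootstrap P-value. A call is made if at least one coefficient
    passes the threshold.
    """
    lowest = min(p_values.values())
    highest = max(p_values.values())
    calls = []
    if lowest < strong:
        calls.append(("strong", "positive"))
    elif lowest < mild:
        calls.append(("mild", "positive"))
    if highest > 1 - strong:
        calls.append(("strong", "negative"))
    elif highest > 1 - mild:
        calls.append(("mild", "negative"))
    return calls
```

An empty list means that none of the three coefficients detected a local correlation for that gene at the chosen thresholds.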
Negative local correlations generate a new kind of biological hypothesis
In the previous section, positive and negative local correlations have been equally treated. But one important feature of the local correlation method is that these two kinds of relationships between data spaces may pose different hypotheses about the biological role of the detected genes. Positive local correlation generates the more familiar hypothesis in data assimilation. The positively locally correlated genes are hypothetical members of a module, in the sense that they are composed of several genes with similar annotation whose expression is regulated in block under experimental conditions. That is what appears to happen with the nutrient transport and protein-biosynthesis genes in the yeast experiment and with the vision- and cell-cycle-related genes in the mouse experiment. If a group of genes with similar annotations is divided into positive and negative local correlation subgroups, and if, simultaneously, those subgroups have expression profiles belonging to different expression clusters, the modular regulation hypothesis is still strong, involving distinct but coordinated expression programs for gene module members. This is probably the case with the amino-acid metabolism genes in the yeast experiment, where the regulation of catabolic enzyme genes should be different from that of the anabolic ones (Cowart et al., 2003). The last situation is when negative local correlation predominates. Here, genes with similar annotation are co-expressed with genes that are very differently annotated. This poses a new kind of biologically meaningful hypothesis: these genes are linking the regulation of different cellular processes or functions. Looking through Table 1 one finds this kind of situation with signal transduction and immune inflammatory response genes, i.e. classes of genes known to simultaneously interact with different pathways. The same pattern is not seen for transcription or translation factors.
This may happen because many of these genes are simultaneously annotated with the processes that they regulate, when known. The authors believe that these findings are powerful demonstrations of the usefulness of the proposed method.
Comparison with alternative methods
The local correlation method is most directly comparable with the commonly used two-step method of clustering genes by expression profiles and then assessing annotation term enrichment (or impoverishment) in each cluster. Several applications are available to perform the second step of this analysis. They typically use similar statistical approaches (chi-square, Fisher exact, binomial or hypergeometric tests), differing in the way used to correct for multiple hypothesis testing (Bonferroni corrections, bootstrapping/Monte Carlo procedures or none at all). FuncAssociate (http://llama.med.harvard.edu/cgi/func/funcassociate) was chosen here for comparison with the local correlation method because it provides adequate correction for the multiple testing problem (it performs a Monte Carlo simulation to adjust Fisher exact test P-values), supports both model organisms studied in this article and is easily used through a Web-based interface (Berriz et al., 2003). Both complete differentially expressed gene sets and separated gene expression clusters were analysed by FuncAssociate. The mouse gene set was divided into six expression clusters according to the results presented by Rohrer et al. (submitted for publication), and the yeast gene set was partitioned into two main clusters easily identified in the dendrogram of Figure S4 (available in the supplemental material). The program outputs are also available in the supplemental material. The P-value threshold was set to 0.1, so that both methods could be compared with the same strictness level, although the two P-values have different meanings: the FuncAssociate P-value is the probability of a random gene list taken from the genome having the same or higher enrichment (or impoverishment) of a specific term; the local correlation P-value is the probability of the correlation coefficient for a specific gene being the same or higher (or lower) if there is no correlation between the data from the two sources related to that gene.
It should be stressed that the bootstrapping procedure used to generate the local correlation P-values takes into account the multiple hypothesis nature of the method. For the yeast gene analysis, the FuncAssociate method only detected one carbon compound metabolism genes as overrepresented, and only for one of the yeast clusters did it detect two more enriched terms. These two terms were very specific and related to only two genes, which were also related to one carbon compound metabolism. The method also detected some underrepresented terms (that are unrelated to the detection of negative local correlations). These terms are very general, and for that reason, the biological meaning of these detections is difficult to address. In the analysis of the mouse gene set, local correlation and FuncAssociate gave more comparable results. Both detected vision-related processes, cell-cycle regulation, signal transduction and immune response annotations. The local correlation method detected a significantly higher number of features only in the yeast data set. This suggests that the observed higher detection rate is not an intrinsic property of the local correlation method. Strengthened by the fact that most of the detections made solely by the local correlation method are supported by current biological knowledge of the yeast experimental system (Table 1), the higher detection rate may be justified by the difference in the features searched by both methods. The enrichment in a given annotation term among a set of co-expressed genes implies the occurrence of a positive local correlation. Conversely, a positive local correlation can be significant (it is a rare event under the null hypothesis of correlation absence, as measured by a low P-value) without implying the enrichment of any annotation term. Negative local correlations are also a feature not detected by FuncAssociate or similar methods.
Additionally, the output of FuncAssociate shows some redundancy, since several levels of GO are presented that relate to the same genes. The local correlation method handles different levels of annotation naturally, because information about the relatedness of every pair of terms is contained in the semantic distance matrix. The difference in the results of the two methods for the yeast data set may also be related to the size of the gene list. The statistical procedures of over- or underrepresentation detection methods, like FuncAssociate, are less reliable for smaller gene lists. The local correlation method does not have this constraint, so it can be readily applied to simple perturbation experiments in which differentially expressed genes are rarer or are stringently selected. This does not mean that the presented method cannot be applied to larger data sets. In fact, it has been observed that analysing an increasing number of microarray experiments (for a given organism) increases the total correlation between gene expression, annotation and transcription regulation (Allocco et al., 2004). Under those conditions, total and local correlation findings become more consistent with each other, and the performance of the proposed method will become similar to that of the alternatives.
| CONCLUSION |
|---|
The main objective of the proposed method was to quantify the relation between distinct data sources (attributes of biological objects) from the perspective of one biological entity at a time, instead of trying to capture a similar relation between sets of similar biological entities. It was shown that the local correlation method is robust to large variations in the distance distributions from each data source. The three alternative correlation coefficients tested were successful in recognizing positive and negative local correlations, as can be observed in the quantile–quantile plots. Additionally, they identified complementary types of consistent relations, with the score coefficient being the one that identified the most valid local correlations, as verified through the quantile–quantile profiles. Tested with experimental data, the local correlation method successfully uncovered the known mechanistic relationships and additionally uncovered relationships supporting plausible new hypotheses. Moreover, negative local correlations pose a new kind of biologically meaningful hypothesis. In comparison with alternative methods, it proved to be more efficient, especially in single perturbation experiments with a modest number of arrays measured. The local correlation method was applied here to the integrated analysis of gene expression profiles and their respective annotations, but owing to the method's modularity, its applicability is much wider. The method accepts as input virtually any kind of biological data (encoded as distance matrices) and can search for different kinds of relations by changing the correlation coefficient used. It could be particularly useful in the integrated analysis of biological network data. Those data contain information that is not accessible in a Cartesian coordinate format, but that can be easily quantified in matrices cross-tabulating all the distances between network nodes.
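As an illustration of how network data could be encoded as a distance matrix, the sketch below cross-tabulates shortest-path lengths between all pairs of nodes in an unweighted network using breadth-first search. The choice of shortest-path length as the distance, the adjacency-list input format and the function name are assumptions made for this example; other network distances could be substituted.

```python
from collections import deque


def shortest_path_matrix(adjacency):
    """Return (nodes, dist), where dist[i][j] is the number of edges on the
    shortest path between nodes[i] and nodes[j] in an unweighted network,
    or infinity when no path exists."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    inf = float("inf")
    dist = [[inf] * size for _ in range(size)]

    for source in nodes:
        s = index[source]
        dist[s][s] = 0
        queue = deque([source])
        # Breadth-first search reaches each node by a shortest path first.
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                n = index[neighbour]
                if dist[s][n] == inf:
                    dist[s][n] = dist[s][index[current]] + 1
                    queue.append(neighbour)
    return nodes, dist


if __name__ == "__main__":
    # Hypothetical toy network; "D" is an isolated node.
    network = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
    nodes, dist = shortest_path_matrix(network)
    for name, row in zip(nodes, dist):
        print(name, row)
```

Unreachable pairs are left at infinity here; before correlating such a matrix with another distance matrix, they would need to be replaced by a finite value or excluded.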
Future work is expected to expand local correlation analysis to the simultaneous inclusion of more than two data sources.
| SUPPLEMENTARY DATA |
|---|
Supplementary data for this paper are available on Bioinformatics online.
| Acknowledgments |
|---|
The authors thank Margarida Carrolo for help with manuscript preparation, Nuno Sepúlveda for fruitful discussions and Kathryn Hulse for excellent technical help generating the mouse microarray data. Francisco Rodrigues Pinto is financially supported by the Portuguese Foundation for Science and Technology through grant SFRH/BD/6488/2001. The authors also acknowledge support by SAPIENS/34794/99 from Fundação para a Ciência e a Tecnologia (FCT) of the Portuguese Ministério da Ciência e do Ensino Superior, and by NIH (USA) grants: the NHLBI Proteomics Initiative through contract N01-HV-28181, and GM 63625. Baerbel Rohrer is funded by NIH grant EY13520 and a grant from the Karl Kirchgessner Foundation.
Received on April 4, 2004; revised on August 31, 2004; accepted on September 20, 2004
| REFERENCES |
|---|
Agresti, A. (1990) Categorical Data Analysis. Wiley, New York.
Allocco, D.J., Kohane, I.S., Butte, A.J. (2004) Quantifying the relationship between co-expression, co-regulation and gene function. BMC Bioinformatics, 5, 18.
Argraves, G.L., Barth, J.L., Argraves, W.S. (2003) The MUSC DNA microarray database. Bioinformatics, 19, 2473–2474.
Baker, G.L. and Gollub, J.P. (1996) Chaotic Dynamics: An Introduction. Cambridge University Press, Cambridge.
Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F., Nielsen, H. (2000) Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16, 412–424.
Berger, J. (2003) Could Fisher, Jeffreys, and Neyman have agreed on testing? Stat. Sci., 18, 1–32.
Berriz, G.F., King, O.D., Bryant, B., Sander, C., Roth, F.P. (2003) Characterizing gene sets with FuncAssociate. Bioinformatics, 19, 2502–2504.
Caselle, M., Di Cunto, F., Provero, P. (2002) Correlating overrepresented upstream motifs to gene expression: a computational approach to regulatory element discovery in eukaryotes. BMC Bioinformatics, 3, 7.
Chiang, D.Y., Brown, P.O., Eisen, M.B. (2001) Visualizing associations between genome sequences and gene expression data using genome-mean expression profiles. Bioinformatics, 17 (suppl.), S49–S55.
Chung, N., Jenkins, G., Hannun, Y.A., Heitman, J., Obeid, L.M. (2000) Sphingolipids signal heat stress-induced ubiquitin-dependent proteolysis. J. Biol. Chem., 275, 17229–17232.
Chung, N.J., Mao, C.G., Heitman, J., Hannun, Y.A., Obeid, L.M. (2001) Phytosphingosine as a specific inhibitor of growth and nutrient import in Saccharomyces cerevisiae. J. Biol. Chem., 276, 35614–35621.
Cowart, L.A., Okamoto, Y., Pinto, F.R., Gandy, J.L., Almeida, J.S., Hannun, Y.A. (2003) Roles for sphingolipid biosynthesis in mediation of specific programs of the heat stress response determined through gene expression profiling. J. Biol. Chem., 278, 30328–30338.
Daniel, W.W. (1999) Biostatistics: A Foundation for Analysis in the Health Sciences. Wiley, New York.
Everitt, B.S. and Dunn, G. (2001) Applied Multivariate Data Analysis, 2nd edn. Arnold, London.
Gat-Viks, I., Sharan, R., Shamir, R. (2003) Scoring clustering solutions by their biological relevance. Bioinformatics, 19, 2381–2389.
Gene Ontology Consortium. (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res., 32, D258–D261.
Gibbons, F.D. and Roth, F.P. (2002) Judging the quality of gene expression-based clustering methods using gene annotation. Genome Res., 12, 1574–1581.
Grosu, P., Townsend, J.P., Hartl, D.L., Cavalieri, D. (2002) Pathway processor: a tool for integrating whole-genome expression results into metabolic networks. Genome Res., 12, 1121–1126.
Hanisch, D., Zien, A., Zimmer, R., Lengauer, T. (2002) Co-clustering of biological networks and gene expression data. Bioinformatics, 18 (suppl.), S145–S154.
Hosack, D.A., Dennis, G., Jr., Sherman, B.T., Lane, H.C., Lempicki, R.A. (2003) Identifying biological themes within lists of genes with EASE. Genome Biol., 4, P4.
Jenkins, G., Richards, A., Wahl, T., Mao, C., Obeid, L., Hannun, Y. (1997) Involvement of yeast sphingolipids in the heat stress response of Saccharomyces cerevisiae. J. Biol. Chem., 272, 32566–32572.
Jenkins, G.M. and Hannun, Y.A. (2001) Role for de novo sphingoid base biosynthesis in the heat-induced transient cell cycle arrest of Saccharomyces cerevisiae. J. Biol. Chem., 276, 8574–8581.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science, 298, 799–804.
Lord, P.W., Stevens, R.D., Brass, A., Goble, C.A. (2003) Semantic similarity measures as tools for exploring the gene ontology. Pac. Symp. Biocomput., 601–612.
NIST/SEMATECH e-Handbook of Statistical Methods. NIST. (2003) Quantile–quantile plot. http://www.itl.nist.gov/div898/handbook/eda/section893/qqplot.htm.
Quinn, G.P. and Keough, M.J. (2002) Experimental Design and Data Analysis for Biologists. Cambridge University Press, Cambridge.
Ravasz, E., Somera, A.L., Mongru, D.A., Oltvai, Z.N., Barabási, A.-L. (2002) Hierarchical organization of modularity in metabolic networks. Science, 297, 1551–1555.
Rohrer, B., Pinto, F., Zhang, L., Hulse, K., Lohr, H., Seeliger, M.W., Almeida, J. Multi-destructive pathways triggered in photoreceptor cell death of the RD mouse as determined through gene expression profiling. (submitted for publication).
Schwikowski, B., Uetz, P., Fields, S. (2000) A network of protein–protein interactions in yeast. Nat. Biotechnol., 18, 1257–1261.
Segal, E., Wang, H., Koller, D. (2003a) Discovering molecular pathways from protein interaction and gene expression data. Bioinformatics, 19 (suppl.), i264–i272.
Segal, E., Yelensky, R., Koller, D. (2003b) Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics, 19 (suppl.), i273–i282.
Shah, N.H. and Fedoroff, N.V. (2004) CLENCH: a program for calculating Cluster ENriCHment using the Gene Ontology. Bioinformatics, 20, 1196–1197.
Steuer, R., Kurths, J., Daub, C.O., Weise, J., Selbig, J. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18 (suppl.), S231–S240.
Troyanskaya, O.G., Dolinski, K., Owen, A.B., Altman, R.B., Botstein, D. (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc. Natl Acad. Sci. USA, 100, 8348–8353.
Vinga, S. and Almeida, J.S. (2003) Alignment-free sequence comparison – a review. Bioinformatics, 19, 513–523.
Wilson, E.O. (1999) Consilience: The Unity of Knowledge. Random House.
Zhang, L., Miles, M.F., Aldape, K.D. (2003) A model of molecular interactions on short oligonucleotide microarrays. Nat. Biotechnol., 21, 818–821.