Bioinformatics Advance Access originally published online on March 9, 2006
Bioinformatics 2006 22(11):1359-1366; doi:10.1093/bioinformatics/btl087
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Comparing gene expression networks in a multi-dimensional space to extract similarities and differences between organisms
1 Laboratoire de Génétique Moléculaire, CNRS UMR 8541, Ecole Normale Supérieure 46 rue d'Ulm, 75230 Paris cedex 05, France
2 Equipe de Bioinformatique Génomique et Moléculaire, INSERM U726, Université Paris 7 case 7113, 2 Place Jussieu, 75251 Paris cedex 05, France
3 Laboratoire de Recherche en Informatique, UMR CNRS 8623, Faculté des Sciences d'Orsay, Université Paris Sud 91405 Orsay, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Molecular evolution, which is classically assessed by comparison of individual proteins or genes between species, can now be studied by comparing co-expressed functional groups of genes. This approach, which better reflects the functional constraints on the evolution of organisms, can exploit the large amount of data generated by genome-wide expression analyses. However, it requires new methodologies to represent the data in a more accessible way for cross-species comparisons.
Results: In this work, we present an approach based on Multi-dimensional Scaling techniques, to compare the conformation of two gene expression networks, represented in a multi-dimensional space. The expression networks are optimally superimposed, taking into account two criteria: (1) inter-organism orthologous gene pairs have to be nearby points in the final multi-dimensional space and (2) the distortion of the gene expression networks, the organization of which reflects the similarities between the gene expression measurements, has to be circumscribed. Using this approach, we compared the transcriptional programs that drive sporulation in budding and fission yeasts, extracting some common properties and differences between the two species.
Availability: The source code is freely distributed to academic users upon request to the authors. More information can be found online at http://www.biologie.ens.fr/lgmgml/publication/comp3d/.
Contact: lelandais{at}biologie.ens.fr
Supplementary information: Supplementary data are available at http://www.biologie.ens.fr/lgmgml/publication/comp3d/SupData.php
| 1 INTRODUCTION |
|---|
|
|
|---|
Comparing genomic properties of different species at varying evolutionary distances is a powerful approach for studying biological and evolutionary principles. Because whole-genome sequences are available for a large number of organisms (Bernal et al., 2001), gene and protein sequences have received the most attention as the basis for investigating evolutionary changes. It has been valuable to develop methodologies based on comparative analysis for identifying coding and functional non-coding sequences, as well as sequences that are unique for a given organism [see Frazer et al. (2003) for review]. But evolution involves biological variations at many levels and one of the next major steps is to understand how the genes interact to perform particular biological processes. In this context, the accumulation of microarray data from multiple species (Barrett et al., 2005) provides new opportunities to (1) discover how the genes interact to perform specific biological process and (2) to study the evolution of expression network properties. Previous studies have initiated comparative analyses of large datasets of expression profiles from different organisms (Bergmann et al., 2004; Lefebvre et al., 2005; Stuart et al., 2003). Authors generally attempt to consolidate the classical compendium approach (Hughes et al., 2000; Kim et al., 2001) by identifying gene pairs exhibiting co-expression in multiple organisms and across a large number of arrays in each species (Frazer et al., 2003). In spite of very interesting results that demonstrate the potential of cross-species comparisons using expression data, the global approach that consists of the integration of large sets of unrelated microarray data precludes a precise comparison of the sets of genes involved in a particular cellular process. Analyses based on selected experiments that are as close as possible between species, appear to be a more promising approach. They constitute an alternative way to gain in informativity and accuracy in order to understand how conserved are regulatory sub-networks. This principle was first pursued by Alter et al. (2003), who compared time points during the cell cycle between yeast and human. But since this pioneer study, the standardization of microarray protocol (Knudsen and Daston, 2005), now accessible to numerous research groups, and the DNA sequences data from closely related species (Dujon et al., 2004; Kellis et al., 2003), lead to a rapid accumulation of new expression data directly comparable between organisms. The challenge of comparing the transcriptional response of several organisms to equivalent environmental conditions is only starting to be addressed, but many aspects of this question are fascinating. For instance, will orthologous genes occupy similar positions in expression networks? To answer this question, it is necessary to (1) be able to analyze the intrinsic structure of a gene expression network and (2) compare the organization of two networks in different organisms. In the present paper, we present an approach based on Multi-dimensional Scaling (MDS) techniques that is particularly well-suited for solving these two problems. It permits the representation of the expression data in a space with a reduced dimension [for instance three-dimensional (3D), for visualization] while preserving well the relationships between the genes in highest dimension. The comparison of two gene expression networks is then performed through an original optimization of the superimposition in a multi-dimensional space. In addition, we provide different indexes for assessing the quality of the optimal superimposition. We demonstrate the utility of our approach by a comparative analysis of the transcriptional program that drives the developmental process of sporulation in the yeasts Saccharomyces cerevisiae (Chu et al., 1998) and Schizosaccharomyces pombe (Mata et al., 2002). We conclude that although the transcriptional programs in the two yeasts exhibit common overall properties that certainly mirror the robustness of the gene expression networks during evolution, they are not identical. In particular, orthologous genes were separated according to the conservation of their expression between the two yeasts and we could characterize different degrees of cross-species conservation over time.
| 2 METHODS |
|---|
|
|
|---|
2.1 Notation
For the sake of clarity, we will name the two genomes to be compared
and
. Using microarray data, gene expression matrices can be drawn up independently for each genome. The rows correspond to individual genes, the columns are individual microarray experiments and cells contain a measure of the gene activity. Note that the microarray experiments are either time-course gene expression data or gene expression values measured under different experimental conditions. In these matrices, let Yi,n [or Yi(tn)] be the gene expression level of the i-th gene in the n-th experiment (or the gene expression value at the time tn), for i = 1, ..., G1 and n = 1,...,N1 where G1 denotes the total number of genes in
and N1 the number of experiments (or time points). Likewise, let Yj,m [or Yj(t'm)] be the gene expression level of the j-th gene in the m-th experiment (or the gene expression value at the time t'm), for j = 1,...,
and m = 1, ... , N2 where G2 denotes the total number of genes in
and N2 the number of time points (or experiments). Therefore, each gene can be given a coordinate called its expression profile which is, for the i-th gene, the vector
and for the j-th gene,
. This coordinate represents a point in an N-dimensional space, where N is the number of experiments (N1 or N2).
2.2 Similarity measure between intra-organism gene expression profiles
For each genome (
and
), the matrices of distances between the gene expression profiles (denoted D1 and D2) can be calculated independently. In our study, the inter-gene distances are simply Euclidean distances after normalizing each gene expression vector to have a mean 0 and a variance of 1. Thus, considering two genes, i and i', belonging to the genome
the distance between their corresponding expression profiles is:
![]() | (1) |
and
.
2.3 Representation of a gene expression network in a space of reduced dimension
MDS is a well-established technique for dimensionality reduction and visualization. It can be used to display the high-dimensional microarray data in a space of lower dimension. Using a distance matrix like D1 or D2, this approach enables the transformation of a N-dimensional space into a p-dimensional space (p < N), by finding the optimum low-dimensional configuration. The criterion to be optimized, i.e. numerically minimized to find the optimum configuration, maintains the small distances while approximating the large distances. Its expression for the genome
is as follows (i
i'):
![]() | (2) |
transforms the criterion into a dimension-free number.
Consequently, each gene of
(or
) can be located in a low-dimensional space in such a way that the geometric distances in the low-dimensional space are as close as possible to the real distances measured between the gene expression profiles in the N1 (or N2)-dimensional space. For each genome, the resulting gene expression network reflects the intrinsic structure of the system studied by microarray experiments: two genes with similar expression profiles will be represented by nearby points, whereas two genes with very different patterns of expression will be distant from one another.
2.4 Definition of inter-organism pairs of genes: orthology assignments
To compare the transcriptional responses in different organisms, pairs of related genes, one in each organism, have to be defined. Several types of information can be used. In this study we use orthology relationships to connect expression data and compare two transcriptional programs. We apply the INPARANOID program for generating orthologous gene pairs between genomes
and
, as described previously (O'Brien et al., 2005; Remm et al., 2001). To summarize, the algorithm starts the detection of orthologs with calculation of all pairwise similarity scores between the complete sets of protein sequences from the two genomes. This is done with the BLAST program (Altschul et al., 1990). Then, sequence pairs with mutual best hits are detected and serve as central points around which additional orthologs from both species are clustered. The idea is that if the sequences are orthologs, they should score higher with each other than with any other sequence. Finally, overlapping groups are resolved. Note that for some sequences, there is more than one ortholog in one or both species. One gene from genome
can have a significant similarity score with several genes in genome
, and vice versa. Our methodology is able to manage these multiple inter-organism links.
2.5 Cross-species comparison of two gene expression networks in a multi-dimensional space
To compare the gene expression networks of two genomes in the same multi-dimensional space, we developed a procedure for optimal superimposition. This process needs first to define inter-organism pairs of genes (i, j) having correspondence (see previous section). Then, the selected pairs of genes constitute a set of attractor pairs in the multi-dimensional space. In other words, the strategy aims at moving one genome (for instance
) towards the other (genome
) such that inter-organism gene pairs are brought as close as possible in the multi-dimensional space, while limiting the distortion of the displaced genome by keeping at the best inter-gene distances of the matrix D1 in the translation of points. More precisely, to find the optimum superimposition configuration, we define a criterion E* which is numerically minimized and has two components: the first, E, quantifies the proximity of inter-organism related genes (i.e. the attractor pairs), and the second, C, is a constraint measuring the conservation of the inter-gene distances of the moving genome. The quantity C0 corresponds to the initial value of C, i.e. before displacing the point subset. Thus, the expression of E* is
![]() | (3) |
is a weighting. A large
-value leads to a displacement of the set of points of genome
, without distorting the expression network. The only authorized displacements are thus orthogonal transformations because of the important constraint of maintaining all the inter-gene distances. In this case, the proximity between attractor genes can remain imperfect. Then, to refine the superimposition between attractors, plastic transformations are allowed by decreasing the
-value. A
-value of 0 in the criterion leads to the perfect superimposition of the attractors in the multi-dimensional space, but also a large distortion of the gene expression network of the moving genome
. During the superimposition procedure, the quantity E is initially maximal and decreases during the displacement whereas the other quantity (C C0) initially equals zero and increases. The
-value is chosen to maintain a balance between the levels of the two components E and (C C0).
The proximity, E, between inter-organism associated genes depends on the geometric distances in the multi-dimensional space between the points that compose attractor pairs. Its expression is similar to that used in the MDS technique:
![]() | (4) |
i, j indicates whether a given gene pair is defined as an attractor. It equals 1 when the genes i and j are linked (orthologs in our study), otherwise it equals 0. The term
allows normalization of the quantity E. We have introduced a particular weighting in the criterion E. The term mj is the mean distance between the j-th gene and its l nearest neighbor genes (l is fixed at 10 in our study). As the geometric distances di,j between the attractor points must theoretically equal 0 (if we assume that the gene expression networks are entirely identical), the distances must be weighted with a quantity not biased by the large distances, explaining the choice of this weighting. Note that the component E can be divided into individual values Ei for each gene i in the displaced genome
:
![]() | (5) |
i'):
![]() | (6.1) |
![]() | (6.2) |
2.6 The optimization result and its significance
The reduction of an N-dimensional space into a p-dimensional space (p < N) involves a loss of information. The optimization procedure approximates, for each pair of genes (i, i'), the inter-gene distance Di,i' between the gene expression profiles to a geometric distance di,i' between points in the p-dimensional space. Thus, considering a pair of genes (i,i') belonging to the same genome, say
, we can calculate the following indexes:
![]() | (7.1) |
![]() | (7.2) |
i,i' value means that the points corresponding to the genes i and i' are correctly mapped in the multi-dimensional space according to the measures of their gene expression. Moreover, the
i value enables to capture the relevance of the position for a particular gene i in the light of all the other genes of genome
. Note that the constraint C is
[see Equation. (6)].
These two indices can be calculated after the initial mapping of genes into a multi-dimensional space, thus assessing the loss of information during dimensionality reduction. But they can also be used to quantify, for each gene, the local distortion of the gene expression network during the superimposition procedure described above. For instance, if a gene i is attracted in a direction that does not agree with its position in the gene expression network, it will result in a large local deformation of the network and a substantial increase in its
i value (
i(final) =
i(optim)
i(init)). In contrast, if a gene is attracted in a direction that agrees with its position in the gene expression network, there will only be a small local deformation of the network and a small increase in its
i value.
2.7 Technical information
All programs were implemented in the R programming language (http://cran.r-project.org/). Functions were numerically minimized using the quasi-Newton method (R function available in the BASE package). All calculations were carried out on a Dell precision 360 equipped with a 2.6 GHz processor running a Linux DEBIAN distribution. 3D visualizations and snapshots were performed using the RGL package (http://rgl.neoscientists.org/News.html).
2.8 A case study of sporulation: comparison of the transcriptional programs in yeasts
To illustrate our approach, we compared two species: the budding yeast Sacharromyces cerevisiae and the fission yeast S.pombe. Both have been entirely sequenced (Goffeau et al., 1996; Wood et al., 2002) and comprehensively annotated. These two model organisms were chosen because two very similar sets of microarray time-course experiments (one for each organism) were available. Indeed, two different laboratories have used DNA microarrays to study the transcriptional program that drives the developmental process of sporulation, in which diploid cells undergo meiosis to produce haploid germ cells. The first study, published in 1998 (Chu et al., 1998) led to a precise description of different temporal gene expression patterns of about 500 genes induced in S.cerevisiae. More recently, it has been demonstrated that around 1000 genes are also regulated in successive waves of transcription in S.pombe (Mata et al., 2002) during meiosis and sporulation. Both Chu et al. (1998) and Mata et al. (2002) measured the changes in the concentrations of mRNA transcripts for each gene at successive times after transfer of wild-type diploid yeast cells to a nitrogen-deficient medium that induces sporulation. We have collected the measured log2(Ratio) expression values for each time point: 7 for S.cerevisiae, tn: {0, 30 min, 2, 5, 7, 9, 11 h} and 13 for S.pombe, tm: {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 h}.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Overall properties of the gene expression networks derived from the sporulation datasets
Selection of up-regulated genes. From all the genes for which expression data were available, we only considered those which showed significant changes in mRNA levels during the time-course analysis, as described in Vaquerizas et al. (2005) (root mean square >1.13). According to the notations defined in Methods, the resulting expression matrices comprised G1 = 518 genes and N1 = 7 time points for the S.cerevisiae genome (
), and
= 724 genes and N2 = 13 time points for the S.pombe genome (
). Representations in a 3D space show similar organization in the two yeasts. In this case study of sporulation, we decided to use a 3D space for two reasons. First, principal component analyses applied to the S.cerevisiae and S.pombe sporulation data (Chapman et al., 2002) showed that, in both cases, >93% of the variability is found in three dimensions (Supplementary Figure S1). 3D representations are thus good pictures of the high-dimensional representations of the gene expression networks. Second, it has the advantage of allowing the visualization of each step of the cross-species comparison without any subsequent transformations. Therefore, we mapped the yeast sporulation data in a 3D space using the MDS procedure for dimensionality reduction described in Methods. Results are presented in Figure 1A. Note that because there are more measurements in the S.pombe (13 points) than in the S.cerevisiae (7 points) time-course, the pairwise relationships between gene expression profiles are better discriminated in the S.pombe 3D-gene expression network. Nevertheless, Figure 1A suggests that the overall organizations of the 3D-gene expression networks for the two species are similar (see also Supplementary Figure S2). For instance, we can clearly distinguish in both cases a highly connected region from all other genes, and there is a similar temporal organization (see below).
|
Temporal organization of sporulation in yeast. Genes induced during the time-course experiments were classified according to their expression profiles (Quackenbush, 2001) into four temporal classes by k-means clustering (Fig. 1B). In both species, these four temporal classes coincide with the major biological processes of sporulation as described in Mata et al. (2002): response to nutritional changes (Nitrogen class of genes), premeiotic S phase and recombination (Early class), meiotic division (Middle class) and spore formation (Late class). To visualize the successive waves of transcription during the sporulation process on the 3D representation, each gene was colored according to its temporal class, showing that genes can be placed in a sequence that reflects the time of initial induction during the time-course (see also Supplementary Figure S3).
Common properties of the transcriptional programs of sporulation between the two yeasts. Taken at face value, the parallel observation of comparable temporal classes in S.pombe and S.cerevisiae (Fig. 1B) suggests that the overall process of sporulation is identical in fission and budding yeasts. However, to address evolutionary aspects, it is valuable to determine whether orthologous genes in the two yeasts are implicated in the biological processes of sporulation. Orthology defines the relationship between genes in different species that originate from a single gene in the last common ancestor of these species (Fitch, 2000; Sonnhammer and Koonin, 2002). Such gene pairs are most likely to share the same function and should, in principle, occupy a similar position in the gene expression networks derived from the sporulation process. To test this hypothesis, the structure of the orthologous gene expression networks were compared in a 3D space, in order to systematically characterize both similarities and differences across species.
3.2 Cross-species comparison of orthologous genes
Gene orthology and sporulation in S.cerevisiae and S.pombe. Orthology relationships among the genes were inferred using the INPARANOID algorithm (O'Brien et al., 2005; Remm et al., 2001) (see Methods). We detected 5053 orthologous gene pairs between the complete genomes of S.cerevisiae and S.pombe. Then, we determined whether orthologous genes induced during sporulation in S.cerevisiae (518 genes) were also up-regulated in S.pombe (724 genes). We identified 122 orthologous gene pairs, which correspond to G1 = 85 S.cerevisiae genes and G2 = 98 S.pombe genes. We thus constructed a set of orthologs between the two yeasts, all implicated in the sporulation process, and serving as the kernel for the following cross-species comparison. All the other genes, i.e. those with no identifiable orthologous counterpart represent a set of organism-specific genes (see Discussion).
Optimal superimposition in a 3D space. First, each set of orthologous genes was independently represented in a 3D space according to the expression measurements (Fig. 2A). For both species, highly connected regions can still be distinguished from all other genes. We also checked that the gene distribution into the previously characterized temporal classes was unchanged following selection of inter-organism gene pairs (Supplementary Materials S4). Next, the 3D-gene expression networks were mapped in the same space and associated genes between the two yeasts were connected with red segments (Fig. 2B). The length of a segment is a good visual indicator of the spatial proximity of orthologs. Then, to analyze the concordance between orthology relationship and expression data, we modified the superimposition of the two 3D-networks using the optimization procedure described in Methods. The main idea is to move one of the two genomes, in this case the S.cerevisiae genome, such that orthologous genes become nearby points in the 3D space, whilst limiting the distortion of the S.cerevisiae gene expression network. The parameter
(see Methods) establishes equilibrium between these two constraints (Fig. 3). A large
-value (>3) leads to a small distortion of the S.cerevisiae gene expression network (C C0 is low), but the superimposition between orthologous gene pairs is defective (E is high). Inversely, a
-value close to 0 leads to a perfect superimposition of the orthologous gene pairs (E is minimal), but also a large distortion of the S.cerevisiae gene expression network (C C0 is maximal). White segments connect co-expressed gene pairs for which the correlation between the gene expression values is better than 0.95, visualizing the gene expression network deformation (Supplementary Figures S5). For
values near 1.5 and below, the quantity (C C0) suddenly increases (Fig. 3): the 3D-gene expression network is extremely flexible and cannot counteract attractions between inter-organism associated genes. Subsequent analyses in this paper use a
-value of 2 (Fig. 2C), as good compromise between proximity of orthologous gene pairs and distortion of the displaced gene expression network.
|
|
Expression in the two yeasts and sequence conservation are significantly correlated. The superimposition of the two 3D-gene expression networks was optimized. Figure 2C gives a graphical representation of the cross-species comparison result: expression data and sequence conservation (i.e. orthologous links) are connected in the same space, allowing a quantitative assessment of the similarities and differences between the transcriptional programs of the two yeasts. Note that the highly connected regions of the two networks are superimposed. This is because of the numerous attractor pairs that linked them. It demonstrates that these orthologs occupy similar positions in the gene expression networks and are expressed similarly in the two species. Then, to assess the significance of overall 3D-proximity between orthologs after optimization, we used a statistical analysis comparing the criterion E. This method is based on a bootstrap procedure which consists of comparing the final proximity of orthologous gene pairs, E(real), to random controls, E(random), obtained by superposing the two gene-expression networks using randomly rearranged orthologous links (Fig. 3). For all values of
, E(real) was >3 SD lower than the mean of the E(random) distributions. More precisely, for a
-value of 2, the calculated Z-score was 4.7 (Fig. 4) meaning that orthology relationships and expression during the sporulation process have significant concordance between S.cerevisiae and S.pombe.
|
3.3 Budding and fission yeasts show different degrees of cross-species conservation over the sporulation process
Comparative analysis of the temporal classes: particularities of the Early genes. In the final superimposition (Fig. 2C) several inter-organism associated genes are distant from one another after the optimization procedure. This suggests that some genes with sequence conservation have different locations in the 3D-gene expression networks in the two species, i.e. different roles during the sporulation process. We first examined whether the overall concordance between sequence conservation and expression (see previous paragraph) is similar for the four temporal classes of genes: Nitrogen, Early, Middle and Late. The criterion E was decomposed into four components: ENitrogen, EEarly, EMiddle and ELate, representing the contribution of genes belonging to the same temporal class, in the displaced genome
(E = ENitrogen + EEarly + EMiddle + ELate). Each component was then compared with random controls obtained by reshuffling the orthologous gene list (Fig. 4). For the classes Nitrogen, Middle and Late, the calculated values were >3 SD lower than the mean of the corresponding random distributions. Findings for the Early class were, however, different. The EEarly value was +0.5 SD higher than the mean of the random distribution, indicating that there was no significant concordance between sequence conservation and expression in the two yeasts. Interestingly, this finding agrees with a previous study (Mata and Bahler, 2003), which described the important role in speciation for genes induced during meiotic prophase, when homologous chromosomes pair and recombine. Indeed, organism-specific proteins with roles in homologous chromosome pairing during meiotic prophase may help to prevent fruitful meiosis between closely related organisms, and they may thus favor speciation.
Similarities and differences in ortholog expression. Next, we scrutinized each orthologous gene pair. To find systematically the most interesting patterns (similar or dissimilar), we classified the S.cerevisiae genes according to their individual Ei values in E. A large Ei value means that the genes i and j are distant in the final 3D space, because of the constraint of limiting the distortion of the displaced gene expression network. In these cases, the two orthologs (i, j) exhibit different locations in the two gene expression networks. A low Ei value means that the genes i and j are close and thus presumably exhibit similar expression profiles. To avoid false positive results, it is necessary to verify that the proximity of two genes does not result from major distortion of the displaced gene expression network (S.cerevisiae). For this reason we defined an index,
i (see Methods and Supplementary Materials S6), which quantifies for each gene i (attracted to the area where its related gene j is located) the resulting local deformation of the initial gene expression network after the superimposition procedure. Each of the 85 genes from the S.cerevisiae genome were plotted according to their Ei and
i values (Fig. 5A). Average Ei values were computed in cases of multiple links. On the resulting graph, the lower left group includes S.cerevisiae genes with expression profiles similar to those of their S.pombe orthologs (Fig 5B and 13); other genes, those with high Ei or
i values, exhibit gene expression profiles different from those of their associated genes in S.pombe (Fig 5B, 4 and 5). Note that the case of the orthologous gene pair number 3 (YHR087W and SPBC21C3.19) deserves a special mention. Even if the expression profiles are not strictly identical (the correlation value is only 0.36), their corresponding final Ei and
i values are low. Indeed, these orthologous genes are both induced at the end of the time course and thus occupy a similar external position in the S.pombe and S.cerevisiae gene expression networks (Fig. 2, horizontal arrow). This demonstrates the usefulness of our approach, which goes beyond a simple direct comparison of the gene expression profiles for orthologous pairs (Supplementary Figure S7). It integrates expression data with gene orthology to compare the global organization of two gene expression networks, thus taking into account all the intra- and inter-species relationships between genes.
|
Genes contained in each temporal class are shown in a different grey scale on the plot. Genes belonging to the Middle class (located in the highly connected regions) and implicated in the meiotic division, for example B-type cyclins and components of the Anaphase Promoting Complex (Hartwell et al., 1974), are over-represented in the set of genes with similar expression in the two yeasts. This may reflect a particular conservation for cell-cycle genes, involved in a core eukaryotic process (Rustici et al., 2004; Spellman et al., 1998). This analysis also reveals genes that have sequence conservation, but display different expression pattern. It demonstrates that despite an overall conservation, the sporulation process is not strictly identical in the two yeasts.
Integration of sequence and expression data provides new insight into cross-species comparisons. Using our methodology, we were able to identify differences between the sporulation processes of the two species, despite conserved functions, revealing the plasticity of the functional modules. Comparing genomic properties of different organisms is of fundamental importance and our approach yielded new information on essential genes. Several examples, which exhibit characteristic situations and illustrate the potential of combining sequence and expression data to address particular evolutionary issues, are presented in Supplementary Materials S8 and S9 (http://www.biologie.ens.fr/lgmgml/publication/comp3d/SupData.php).
| 4 DISCUSSION |
|---|
|
|
|---|
Direct comparison of gene expression networks involving similar biological processes in two different organisms is still in infancy. One of the main obstacles is the difficulty of normalizing data that have been obtained in different experimental conditions. In this paper we present a new methodology based on MDS techniques which allows graphical visualization as an integral part of statistical modelling and data analysis. We believe that this multi-dimensional space representation of whole-genome expression data can be considered as a useful topology model to stimulate both intuitive and rational comparison of orthologous sets of data. Simulations, performed to evaluate the ability of our approach to identify orthologous pairs of genes whose expression is conserved have led to promising results (see Supplementary Materials S10). The procedure for optimal superimposition has a good tolerance to noise and is able to precisely identify classes of genes whose expression is well-conserved between two species. Our method relies, of course, on the availability of accurate gene expression data. Therefore, unlike previous cross-species comparative analyses of large datasets of expression profiles (Bergmann et al., 2004; Stuart et al., 2003), we decided in common with Alter et al. (2003) and McCarroll et al. (2004) to focus our approach on strictly comparable datasets. The choice of the sporulation data in the fission (Mata et al., 2002) and budding (Chu et al., 1998) yeasts was favored because the same nutritional stress triggers the two sporulation programs and accurate time-series experiments describe the connected transcriptome variations. We also compared the gene expression networks of the yeasts S.cerevisiae and S.pombe, using the cell cycle data described in Spellman et al. (1998) and Rustici et al. (2004) (see Supplementary Materials S11).
The purpose of this work was to present a rational approach to the scrutiny of the differences and similarities in these orthologous gene expression programs, taking into account all the intra- and inter-species relationships between genes. In that respect, we could take advantage of previous comparisons of S.cerevisiae and S.pombe sporulation and cell-cycle programs (Mata et al., 2002; Rustici et al., 2004) to assess the validity of our method. Together, these two examples demonstrate the effectiveness of our methodology, since both cross-species comparisons lead to interesting results that are in agreement with the general conclusions of these pioneering studies. For instance, the specific behavior of the genes induced during the meiotic prophase (with roles in homologous chromosome pairing) is coherent between our analysis and that of Mata et al. (2002). We also agree concerning the characterization of organism-specific genes activated during the sporulation and cell-cycle programs. This gives evidence that our approach can provide better comparisons of orthologous gene expression networks. For instance, we could reveal hidden orthologous relationships between sporulation-induced genes (see Supplementary Figure S12). In addition it is important to emphasize that the 3D space, used here for the sake of simplicity, can easily be extended to a multi-dimensional space, opening the way to deeper analyses. Nevertheless, the 3D space representation will remain a useful visualization tool to point out the most interesting features.
One of the long-term objectives of this work is to develop, from gene expression profile data, automatic approaches for characterizing groups of genes which are involved in similar biological functions in different organisms. In that respect our 3D transcriptome representation can be considered as analogous to 3D protein structure in which we try to identify specific structural motifs (Chou, 2004). Whether rigorous modeling of expression profile comparisons between genetic programs of multiple organisms is possible on a genomic scale remains to be seen. Designing strictly comparable transcriptome analyses to start from homogeneous data is certainly an important prerequisite to achieve this goal. Nevertheless, the data will not provide new insights automatically. They will require new tools to be rigorously scrutinized. Thus, for instance, identification of orthologous groups of expression will have to be connected with transcriptional sub-networks. Our approach may represent one of the first steps in this direction.
| Acknowledgments |
|---|
The authors wish to thank Catherine Etchebest for helpful advice and comments on the manuscript. Frédéric Devaux and Alexandre de Brevern for discussions and Boris Barbour for English correction. This work was supported by ACI-IMPBIO-2004-45. This paper is dedicated to the Prof. Serge Hazout, who died in April 2005.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: David Rocke
Received on October 25, 2005; revised on February 9, 2006; accepted on March 4, 2006
| REFERENCES |
|---|
|
|
|---|
Alter, O., et al. (2003) Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc. Natl Acad. Sci. USA, 100, 33513356
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403410[CrossRef][Web of Science][Medline].
Barrett, T., et al. (2005) NCBI GEO: mining millions of expression profilesdatabase and tools. Nucleic Acids Res, . 33, D562D566
Bergmann, S., et al. (2004) Similarities and differences in genome-wide expression data of six organisms. PLoS Biol, . 2, E9[CrossRef][Medline].
Bernal, A., et al. (2001) Genomes OnLine Database (GOLD): a monitor of genome projects world-wide. Nucleic Acids Res, . 29, 126127
Chapman, S., et al. (2002) Using biplots to interpret gene expression patterns in plants. Bioinformatics, 18, 202204
Chou, K.C. (2004) Structural bioinformatics and its impact to biomedical science. Curr. Med. Chem, . 11, 21052134[Web of Science][Medline].
Chu, S., et al. (1998) The transcriptional program of sporulation in budding yeast [Erratum (1998) Science, 282, 1421.]. Science, 282, 669705.
Dujon, B., et al. (2004) Genome evolution in yeasts. Nature, 430, 3544[CrossRef][Medline].
Fitch, W.M. (2000) Homology a personal view on some of the problems. Trends Genet, . 16, 227231[CrossRef][Web of Science][Medline].
Frazer, K.A., et al. (2003) Cross-species sequence comparisons: a review of methods and available resources. Genome Res, . 13, 112
Goffeau, A., et al. (1996) Life with 6000 genes. Science, 274, 546, 563567.
Hartwell, L.H., et al. (1974) Genetic control of the cell division cycle in yeast. Science, 183, 4651
Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109126[CrossRef][Web of Science][Medline].
Kellis, M., et al. (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature, 423, 241254[CrossRef][Medline].
Kim, S.K., et al. (2001) A gene expression map for Caenorhabditis elegans. Science, 293, 20872092
Knudsen, T.B. and Daston, G.P. (2005) MIAME guidelines. Reprod Toxicol, 19, 263[Medline].
Lefebvre, C., et al. (2005) Balancing protein similarity and gene co-expression reveals new links between genetic conservation and developmental diversity in invertebrates. Bioinformatics, 21, 15501558
Mata, J. and Bahler, J. (2003) Correlations between gene expression and gene conservation in fission yeast. Genome Res, . 13, 26862690
Mata, J., et al. (2002) The transcriptional program of meiosis and sporulation in fission yeast. Nat. Genet, . 32, 143147[CrossRef][Web of Science][Medline].
McCarroll, S.A., et al. (2004) Comparing genomic expression patterns across species identifies shared transcriptional profile in aging. Nat. Genet, . 36, 197204[CrossRef][Web of Science][Medline].
O'Brien, K.P., et al. (2005) Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res, . 33, D476D480
Quackenbush, J. (2001) Computational analysis of microarray data. Nat. Rev. Genet, . 2, 418427[CrossRef][Web of Science][Medline].
Remm, M., et al. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol, . 314, 10411052[CrossRef][Web of Science][Medline].
Rustici, G., et al. (2004) Periodic gene expression program of the fission yeast cell cycle. Nat. Genet, . 36, 809817[CrossRef][Web of Science][Medline].
Sonnhammer, E.L. and Koonin, E.V. (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet, . 18, 619620[CrossRef][Web of Science][Medline].
Spellman, P.T., et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 32733297
Stuart, J.M., et al. (2003) A gene-coexpression network for global discovery of conserved genetic modules. Science, 302, 249255
Vaquerizas, J.M., et al. (2005) GEPAS, an experiment-oriented pipeline for the analysis of microarray gene expression data. Nucleic Acids Res, . 33, W616W620
Wood, V., et al. (2002) The genome sequence of Schizosaccharomyces pombe. Nature, 415, 871880[CrossRef][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||













