Skip Navigation


Bioinformatics Advance Access originally published online on April 19, 2005
Bioinformatics 2005 21(13):2988-2993; doi:10.1093/bioinformatics/bti457
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/13/2988    most recent
bti457v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (23)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Al-Shahrour, F.
Right arrow Articles by Dopazo, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Al-Shahrour, F.
Right arrow Articles by Dopazo, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Discovering molecular functions significantly related to phenotypes by combining gene expression data and biological information

Fátima Al-Shahrour , Ramón Díaz-Uriarte 1 and Joaquín Dopazo *

Department of Bioinformatics, and Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe Autopista del Saler 16, 46013 Valencia, Spain
1Bioinformatics Unit, Centro Nacional de Investigaciones Oncológicas (CNIO) Melchor Fernández Almagro 3, 28029 Madrid, Spain

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: The analysis of genome-scale data from different high throughput techniques can be used to obtain lists of genes ordered according to their different behaviours under distinct experimental conditions corresponding to different phenotypes (e.g. differential gene expression between diseased samples and controls, different response to a drug, etc.). The order in which the genes appear in the list is a consequence of the biological roles that the genes play within the cell, which account, at molecular scale, for the macroscopic differences observed between the phenotypes studied. Typically, two steps are followed for understanding the biological processes that differentiate phenotypes at molecular level: first, genes with significant differential expression are selected on the basis of their experimental values and subsequently, the functional properties of these genes are analysed. Instead, we present a simple procedure which combines experimental measurements with available biological information in a way that genes are simultaneously tested in groups related by common functional properties. The method proposed constitutes a very sensitive tool for selecting genes with significant differential behaviour in the experimental conditions tested.

Results: We propose the use of a method to scan ordered lists of genes. The method allows the understanding of the biological processes operating at molecular level behind the macroscopic experiment from which the list was generated. This procedure can be useful in situations where it is not possible to obtain statistically significant differences based on the experimental measurements (e.g. low prevalence diseases, etc.). Two examples demonstrate its application in two microarray experiments and the type of information that can be extracted.

Availability: The software used for the association of significant Gene Ontology (GO) terms to sets of genes is available at http://www.fatigo.org and http://www.babelomics.org. Software for ranking genes according to phenotypes is available in GEPAS (http://www.gepas.org). The multtest program from the bioconductor package is available at http://www.bioconductor.org/repository/devel/package/html/multtest.html.

Contact: jdopazo{at}ochoa.fib.es


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
High throughput methodologies, such as gene expression (Eisen et al., 1998), produce vast amounts of data that can be used to relate molecular level differences in gene behaviour in response to macroscopic phenotypes. Microarray gene expression data can be processed using different methods, based on distinct criteria, such that ordered lists of genes can be generated. The rationale behind this step is that the disposition found for the genes must be a direct consequence of the distinct functions carried out by them in the cell under the different experimental conditions or distinct phenotypes studied. In order to understand the molecular basis that differentiate the phenotypes, two steps are typically followed. First, gene expression values are examined by means of an appropriate statistical test. Genes with extreme differential expression values (over the threshold defined in the test) are believed to have a differential behaviour in the experiment (that can involve discrete classes—e.g. healthy versus diseased samples—or continuous traits—such as the concentration of a metabolite, drug dosages, time series or even survival data). Then, available biological information for these genes are used, a posteriori, in order to understand the biological processes operating at molecular level that account for the experiment. For example, van't Veer et al. (2002) selected 231 prognostic reporter genes according to their correlation coefficients with an average good prognosis gene expression profile. The functional annotations of the genes were then studied to gain insight into the underlying biological mechanism leading to rapid metastases. Genes involved in cell cycle, invasion and metastasis, angiogenesis and signal transduction were found to be significantly upregulated in the poor prognosis signature (van't Veer et al., 2002). Other authors have used DNA microarrays to identify clinically relevant subtypes of cancers (see, for example, Lapointe et al., 2004) and then found genes differentially expressed among the classes that are associated with distinct biological properties. In particular, Cunliffe et al. (2003) used gene ontology to highlight functionally distinct biological responses to different mitogens in a study of response of breast cancer to growth regulators.

Despite the prior availability of this biological information, thresholds for defining differential behaviour of genes are obtained relying exclusively on experimental measurements of gene expression, with functional study as the subsequent step. Only a few attempts have been made to use the biological information as part of the gene selection process. Mootha et al. (2003) used Gene Set Enrichment Analysis (GSEA) to detect modest but coordinate changes in the expression of groups of functionally related genes. They applied GSEA to a dataset in which no genes were over conventional thresholds, based only on the comparison of gene expression values. They were able to find a group of genes involved in phosphorylation with decreased expression in human diabetic muscle.

Here we present a different procedure that uses biological information in combination with the experimental measurements to understand the biological processes operating at molecular level that explain macroscopic differences between phenotypes, or experiments using DNA microarray or other high throughput techniques, such as proteomics. The procedure proposed here uses Gene Ontology (GO) (Ashburner et al., 2000) terms as functional labels for genes. GO provides information on molecular function, biological processes and cellular components for a number of different organisms. In addition we use information on metabolic pathways taken from KEGG (Kanehisa et al., 2004).

The procedure consists of the application of a test for extracting significant over- or under-represented terms associated with groups of genes (see Al-Shahrour et al., 2004). The test is sequentially applied to different partitions of an ordered list of genes, previously obtained by studying their differential expression according to the phenotypes analysed. Then, labels accounting for biological information are used to test if groups of genes functionally related are simultaneously over- or under-expressed (or, in other words, if these genes tend to be concentrated in one of the ends of the list instead of being uniformly distributed across it.) Contrarily to GSEA, and other alternative methods based on comparison of distributions through Kolmogorov–Smirnov or related tests, this method does not require an extreme non-uniform distribution of genes. It is capable of finding different types of asymmetries in the distribution of groups of genes across the list of data. The strong multiple-testing nature of the statistical contrast has been taken into account. Two examples show the use of the procedure and demonstrate its sensitivity. The proposed procedure seems to have a better performance than GSEA.


    2 METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
2.1 General description of the method
The proposed method has three steps (Fig. 1). First, a list of genes is ordered using experimental information on their differential expression according to the phenotype studied in the experiment (Fig. 1A). For example, genes can be ordered on the basis of their differential expression between two experimental conditions (pre- and post-drug administration, healthy versus diseased samples, etc.). The second step (Fig. 1B and C) involves the study of the distribution of functional terms in different partitions of this list using a test described below (Al-Shahrour et al., 2004). Then, a plot with the significant terms obtained upon the application of the test can be used to detect significant asymmetrical distributions of genes, responsible for diverse biological processes, across the list (Fig. 1D). Finding significant asymmetrical distributions of functional terms across the list strongly suggests that groups of genes having in common functional labels are significantly over- or under-expressed as a block.



View larger version (48K):
[in this window]
[in a new window]
 
Fig. 1 Summary of the steps of the proposed method: (A) list of genes ordered according their differential behaviour in the experimental conditions studied by means of a statistical test, (B) establishment of different partitions across the list, (C) extraction of the biological terms differentially expressed in each partition and (D) plot of the super-adjusted terms with significant differential expression. (E) Genes labelled with two different terms. The gene on the left is uniformly distributed across the list whereas the gene on the right is mainly present in one of the tails of the list and, consequently, this term is likely to be related to the experimental trait studied.

 
The last significant partition defines the border for the number of genes belonging to a functional class that can be considered to display a significant non-uniform distribution across the list. For example, if we assume that in Figure 1 the second partition is the last significant one for genes in the right column (Fig. 1E), the conclusions would be: (1) this function is asymmetrically distributed across the rank and (2) six of these genes can be considered to have a coordinated, significant non-uniform distribution, while the other four do not show an asymmetrical behaviour with respect to the rank. It is not unusual to find such a tail of genes. The reason for this is that many genes participate in more than one biological function and they might be carrying out different biological roles unrelated to the experimental conditions that originated the rank of genes.

2.2 Extraction of significantly under- and over-represented functional terms in a set of genes: the FatiGO/FatiWise algorithm
The core of the method proposed is based on an algorithm to test whether a set of genes, labelled with terms (biological information), contain significant enrichments on one or several of these terms with respect to another set of genes of reference. We have used the FatiGO (Al-Shahrour et al., 2004) tool to perform such test. FatiGO works with tables of correspondence between genes and GO terms. Kegg pathways are implemented in the program FatiWise, included in the GEPAS (Herrero et al., 2004) and BABELOMICS (Al-Shahrour et al., 2005) packages. The application of FatiGO and FatiWise to consecutive partitions of the list of genes constitutes the step represented in Figure 1B. Both programs use a Fisher's exact test for 2 x 2 contingency tables for comparing two groups of genes and extracting a list of GO terms whose distribution among the groups is significantly different. Given that many GO terms are simultaneously tested, the results of the test are corrected for multiple-testing to obtain an adjusted P-value. FatiGO and FatiWise return adjusted P-values based on three different ways of accounting for multiple testing: step-down minP method (Westfall and Young, 1993) and two False Discovery Rate (FDR) methods (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) (Fig. 1C).

2.3 Step 1: preparing the data: obtaining lists of genes ordered according to phenotypes
Typical DNA microarray data usually consist of a matrix of gene expression values where columns represent different experiments, corresponding to different phenotypes, and rows are genes. Phenotypes are understood here in a broad sense: columns can correspond to discrete classes (e.g. case control studies, distinct tissues, etc.), to continuous traits (e.g. the level of a metabolite, survival data, etc.), to the different responses of a cell culture to distinct drugs, etc.

The first step of the method is to obtain an ordered list of genes based on their differential expression with respect to the trait studied. This ordered list can be obtained by means of the application of tests for differential expression between two classes (t-test) or among multiple classes (ANOVA). Continuous traits can be studied by correlation, and survival data can be analysed using cox models. The POMELO tool, included in the GEPAS package (Herrero et al., 2003) implements all these tests. The value of the statistic can be used as ranking criteria (Figure 1A).

2.4 Step 2: scanning ordered lists of genes with FatiGO/FatiWise and final selection of significant terms
Once an ordered list of genes has been obtained, the method checks whether blocks of genes with common functional properties are uniformly distributed across it or, on the contrary, are either cumulated in one of the tails or present in any sort of asymmetrical distribution. A uniform distribution of genes with a particular functional property across the list implies that this property is most probably unrelated to the trait studied. Contrarily, an accumulation of genes with a common function in the upper or lower range of the list is a clear indication of a possible relationship between this function and the trait (Fig. 1E). And this is true at the level of blocks of functionally related genes even in the case of no one single gene satisfies an individual test based solely on expression values.

The second step involves the study of different partitions of this list using the FatiGO/FatiWise test described above (Fig. 1B). Here we have to perform one test per partition and, consequently, we confront a problem with a severe multiple-testing component. The procedure proposed here implies producing as many partitions of the list of genes as defined by the step selected. Since FatiGO adjusts P-values for only one partition at a time and we are conducting P partitions, we need a more rigorous adjustment. We then have to correct multiple terms (M) in multiple partitions (P). In other words, we have to correct unadjusted P-values for M x P tests. We have used the multtest package (Pollard et al., 2004) of bioconductor (Gentleman, 2004, http://www.bioconductor.org) to obtain adjusted P-values by FDR (Benjamini and Yekutieli, 2001) from the unadjusted P-values. Multtest inputs all the uncorrected P-values obtained, and outputs FDR-adjusted P-values.

Finally, a plot with the significant terms obtained for each partition upon the application of the FatiGO/Fatiwise test is produced. This plot can be used to detect significant asymmetrical distributions of blocks of functionally related genes across the list (Fig. 1D). Only terms significant after the super-adjustment process will be plotted. Since the test is based on the differences of percentages, the plot will represent these differences as a function of the value of the statistic. If some biological process is activated as a block, genes playing this role will tend to be cumulated in the most extreme values of the statistic (which measures differential expression). As a general trend, significant differences in percentages will tend to be higher for extreme values of the statistic and will tend to be lower as partitions tend to move to values of the statistic nearer to zero. This happens because we dilute the term by adding other genes less differentially expressed. Actually, what is important in the results is not the plot itself, but the term which is found as differentially distributed in the list of genes. The plot helps to establish a scale of trustworthiness for the terms. Thus, terms significant across a wide range of partitions are more trustworthy than terms that are significant only for a unique partition.

2.5 How many partitions? A heuristic approach
The decision on the number of partitions to be made is an heuristic one. More the partitions more number of tests (which will cause an artificial decrease in the power and the sensitivity of the method after the adjustment of P-values). On the other hand, a very small number of partitions (say ten or less) can cause a loss of sensitivity because we could not detect regions of the list enriched in genes belonging to some functional category.

Partitions can be made in different ways (based on percentiles, the value of the statistic, etc.) In the case of studying differential gene expression the use of the value of the statistic is probably the best choice, because it reflects differences in expression as well as significance of these differences. Our experience shows that a number between 20 and 50 partitions often gives optimal results in terms of sensitivity and results recovered. The method aims to find blocks of genes (belonging to a functional category) asymmetrically distributed across an ordered list of genes. Consequently, the sensitivity of the method is related to the number of genes detected belonging to these categories. An example with a diabetes dataset (Mootha et al., 2003) was used to illustrate the influence of the number of partitions on the number of genes detected for the above mentioned functionally related blocks. In the example, different functional terms (pathways in this case) showed a significant asymmetrical distribution (i.e. they were related to the trait studied). The goal is to find the level of granularity at which the maximum number of genes belonging to a functional category can be detected by the FatiGO/Fatiwise algorithm. Figure 2 shows how, in some cases, even 20 partitions were sufficient to find all the genes belonging to a functional category asymmetrically distributed. In all cases, the selection of a window step that produces 50 partitions detected a number of genes for each category that is not further improved by using more partitions. Based on empirical results we suggest the use of a level of granularity between 20 and 50 partitions.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 2 Relation between the number of genes sharing a given label (GO categories in this case) showing a significant differential distribution between partitions and the number of partitions performed. A number of 50 partitions is enough for finding all the genes in each GO category differentially expressed as a block. The use of more partitions does not increase the number of genes found (except in the case of ‘primary active transporter activity’ where another gene is found for 250 partitions).

 

    3 RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
3.1 Differential gene expression in human diabetes samples
To illustrate the use of the method, we had chosen a study of gene expression in human diabetes (Mootha et al., 2003). The authors applied the GSEA (a version of the Kolmogorov–Smirnov test for comparing distributions) to the data composed of 43 samples (17 with normal tolerance to glucose, 8 with impaired tolerance and 18 with type 2 diabetes mellitus, DM2) and found that genes related to oxidative phosphorylation (that were not detected by conventional statistical tests) were downregulated in patients with diabetes.

We ordered the genes according to their differential expression among normal tolerance to glucose versus impaired tolerance together with DM2. A t-test, as implemented in the POMELO tool from the GEPAS package (Herrero et al., 2003) was used for this. The value of the statistic was used as the ranking criteria for ordering the list. As in the case of Mootha et al. (2003) we were unable to find individual genes with a significant differential expression (differentially expressed genes with an adjusted P-value lower than 0.05).

We have used the value of the t-statistic to carry out the partitions. Starting from the highest values, which correspond, in this case, to genes showing higher expression in normal tolerance to glucose samples (right-hand side of the plots in Figs 3 and 4), we produced ~50 partitions of the whole set of genes (by using steps of 0.2 U of the t-statistic). Then, for each partition we compared the genes with highest expression in normal tissues versus the remaining ones using the FatiWise algorithm from the GEPAS package (Herrero et al., 2004). Whenever a biological term corresponding to any KEGG pathway (Kanehisa et al., 2002) is significantly over- or under-represented in one of the compared partitions versus the other one, it was plotted in the figures.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 3 Over- and under-represented pathway terms in genes differentially expressed when comparing the normal samples with MD2 and samples with impaired tolerance to glucose. The right part of the x-axis corresponds to higher levels of expression in the normal samples, as measured by the statistics (see text). The y-axis represents the differences in percentages of each term in the corresponding partition (normal part minus diseased part).

 


View larger version (22K):
[in this window]
[in a new window]
 
Fig. 4 Over- and under-represented GO terms corresponding to level 4 of molecular function in genes differentially expressed when comparing the normal samples to MD2 and samples with impaired tolerance to glucose. The right part of the x-axis corresponds to higher levels of expression in the normal samples, as measured by the statistics (see text) while in the other part of the axis would be the genes highly expressed in diseased samples. The y-axis represents the difference between the percentages of each term in normal part and the diseased part of the corresponding partition.

 
When we studied the distribution of genes corresponding to different pathways (Fig. 3) we found oxidative phosphorylation over-represented in normal tissue (and consequently under-represented in diseased tissue) with a clear trend of reduction towards diseased tissues. Actually, 82 out of a total of 95 genes labelled as oxidative phosphorylation were found in the last significant partition which approximately correspond to half of those highly expressed in healthy samples. This was already reported by Mootha et al. (2003). But, in addition, our approach detected another term: ubiquinone biosynthesis. This term is also under-represented in diseased tissues. Ubiquinone is an endogenous antioxidant and is a part of the transport electron chain which is altered in diabetes processes (Wold et al., 2003). In fact, supplements of this coenzyme are used as therapy in DM2. In this case, we found all the genes (30 in this case) labelled with this term. The proposed approach seems to be more sensitive to the GSEA (Mootha et al., 2003), which was unable to identify ubiquinone biosynthesis probably because of the low number of genes with this annotation.

When we studied the distribution of GO terms, using the FatiGO algorithm (Al-Shahrour et al., 2004) we found similar information from the ‘viewpoint’ of terms of molecular functions and also some new findings and more detailed explanations on how the biology of the cells is affected when healthy versus diseased situations are compared (Figure 4).

We have found molecular function terms at level 4, such as oxidoreductase activity acting on NADH or NADPH or NADH dehydrogenase activity, accounting, as described by means of GO categories, for the same biological roles as oxidative phosphorylation in pathways and displaying the same trend as in the plot of pathways in Figure 3. Terms, such as cation transporter activity, metal ion transporter activity and primary active transporter activity are also significantly lower in diseased patients. Alterations in transport activities might be a consequence of the oxidative stress caused by the diabetes.

This case is a representative example of a quite common situation in gene expression data analysis of clinical samples. In many cases, the biological variability among patients' samples is so high that it is difficult to find genes showing significant differential expression attending only to their experimental values of gene expression. Using biological information in combination with experimental data allows focusing on sets of related genes, instead of on individual genes, and thereby find significant differential behaviours.

3.2 Roles carried out by genes related to survival in breast cancer
Additionally we have studied the results by van't Veer et al. (2002) in which gene expression patterns of 98 primary breast cancers were analysed. Two patients groups could be distinguished: 34 of them developed distant metastases within 5 years (bad prognosis group) and the other 44 continued free of the disease after a period of at least 5 years (good prognosis group). The authors selected 231 prognostic genes according to their correlation coefficients with an average good prognosis profile.

As in the previous example, we ordered the genes according to their differential expression with respect to the prognosis group using a t-test implemented in the POMELO tool. The value of the statistic was used for ordering the list. We performed approximately 50 partitions and we analyzed GO terms corresponding to molecular function and biological process.

Figure 5 shows biological processes significantly over-represented. Here the interpretation of the plot is the opposite to one in the previous section. FatiGO performs a two-tailed test and in this case we found the significance on the part corresponding to the bad prognostic. Negative values of the statistic correspond to genes highly expressed in the bad prognosis group. Since differences in percentages are calculated as the percentage in the good prognosis part of the partition minus the percentage in the bad prognosis part, a negative difference implies that these genes are significantly over-represented in the bad prognosis part (or under-represented in the good prognosis part). Thus, genes labelled as biological processes, such as cell proliferation (GO level 4) and cell growth and/or maintenance (GO level 3), are over-represented among the genes over-expressed in the bad prognosis part of the plot. On the other hand, genes labelled as response to biotic stimulus (GO level 3), which mainly correspond to genes labelled as cell communication and response to external stimulus (GO level 4) are under-represented among the genes highly expressed in bad prognostic patients.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 5 Over- and under-represented GO terms corresponding to levels 3 and 4 of biological process in genes differentially expressed when comparing bad prognosis patients versus good prognosis patients. The right part of the x-axis corresponds to higher levels of expression in the good prognosis patients, as measured by the statistics (see text) while in the other part of the axis would be the genes highly expressed in bad prognosis patients. The y-axis represents the difference between the percentages of each term in good and bad prognosis parts of the corresponding partition.

 
Anecdotically, this last term (response to external stimulus) illustrates a curious effect that sometimes occur when the source of biological information used to label genes has a hierarchical structure in which upper levels correspond to more general definitions and lower levels have a higher degree of precision (GO terms follow exactly this structure). The use of very general terms (high in the hierarchy) can produce bimodal distribution of genes because two children terms of this particular term can have opposite behaviours. A typical example would be the term apoptosis, which has among their descendants anti-apoptosis and induction of apoptosis, defined as ‘a process that directly inhibits any of the steps of the apoptosis’ and ‘a process that directly activates any of the steps of the apoptosis’, respectively. It is easy to imagine an experiment in which genes corresponding to both children terms will be activated, but in the opposite sides of the rank. In the genes that are studied with the general term apoptosis, they will clearly show a bimodal distribution. Nevertheless, due to the few partitions in which the terms response to biotic stimulus, cell communication and response to external stimulus showed a significant asymmetrical distribution it cannot be discarded that this finding is due to chance.

When functions are studied, it can be observed that terms accounting for the significant biological processes also appear as significant. Figure 6 shows how genes belonging to GO functional categories related to communication, such as receptor activity (GO level 3), transmembrane receptor activity (GO level 4) or transcription factor activity (GO level 3) are significantly under-represented among the genes highly expressed in the bad prognosis patients. On the other hand, functional terms related to cell proliferation, such as: nucleotide binding (GO level 3), more accurately defined as purine nucleotide binding (GO level 4), or transferase activity (GO level 3), in particular transferase activity transferring phosphorus-containing groups (GO level 4) or kinase activity (GO level 3), are over-represented in genes highly expressed in bad prognosis patients.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 6 Over- and under-represented GO terms corresponding to levels 3 and 4 of molecular function in genes differentially expressed when comparing bad prognosis patients versus good prognosis patients. The right part of the x-axis corresponds to higher levels of expression in the good prognosis patients, as measured by the statistics (see text) while in the other part of the axis would be the genes highly expressed in bad prognosis patients. The y-axis represents the difference between the percentages of each term in good and bad prognosis parts of the corresponding partition.

 
Both the observed trends, those corresponding to biological processes and to molecular functions, are consistent with the biology of the metastatic cells. On one hand, we observe a significant activation of proliferation processes (cell proliferation and cell growth and/or maintenance) as well as the corresponding molecular functions. On the other hand, processes related to communication tend to be switched off. Both biological processes are the basis of the metastatic uncontrolled cellular growth. This example is a good biological proof of concept of the efficiency of the method proposed.


    4 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
As genomic technologies are becoming popular, the possibility of studying problems at the system level instead of at the gene level is becoming a reality. A more interesting question than ‘what is this gene?’ within the context of system biology is ‘what do the genes in this group have in common?’ Unfortunately, most of the data analysis tools were designed under the paradigm of ‘one gene at a time’ and cannot properly address these questions. By means of two examples based on microarray data we have tried to show how a very simple method can be used to answer functional questions relevant for sets of genes. Many phenotypic traits can be studied at gene expression level by designing series of experiments. Genes can be ranked according to their differential expression with respect to the trait (differential expression in cases versus controls, correlation with respect to the level of a metabolite, etc.). These ordered lists of genes can be studied at the level of blocks of genes having in common some functional labels by the proposed method. This method can also be applied to other large-scale genomic techniques, such as proteomics, interactomics, etc.

Genome-scale data analysis must be carried out with caution due to the high likelihood of finding relationships just by chance (Ge et al., 2003). Multiple testing is an important issue that, unfortunately, is not very often addressed (Slonim, 2002). If the multiple-testing nature of the statistical contrast is not taken into account an increase in the rate of false positives (i.e. terms identified as over- or under-represented that, in reality, are not significant) occurs. Statistical tests for finding differentially expressed genes in multiple-testing problems require both high differences in the values and a large number of samples to provide reliable results. Unfortunately, there are cases in which the differences in experimental values are too low to be detected by conventional tests. There are also circumstances in which the availability of a sufficient (usually large) number of samples is a serious problem (e.g. in the case of low prevalence diseases) that impairs the application of conventional statistic approaches. The method presented here provides a statistical framework to study the biological processes that accounts, at molecular level, for the traits studied in gene expression experiments. The method combines experimental and biological information in a way that focuses on groups of genes, related by means of a functional label, instead of focusing on single genes. Contrary to other approaches based on direct comparison of distributions (e.g. using a Kolmogorov–Smirnov test, or the GSEA), our method is able to find asymmetrical distributions of genes with a common biological label across a ranking provided this asymmetry is not too extreme. The argument in favour of using Kolmogorov–Smirnov-based tests is that only one test per term is required. This is an advantage considering the step of multiple-testing adjustment. Nevertheless, finding significantly asymmetrical distributions by this method requires considerable differences in the distribution of terms with respect to the uniform distribution (which is the null hypothesis). On the other hand, the advantage of using the proposed method is that, despite the apparent disadvantage due to the necessity of performing a strong adjustment in the P-values, the sensitivity in detecting modest asymmetries clearly counterbalances the global sensitivity. The final outcome is a procedure of improved sensitivity, able to detect asymetrical distributions of groups of genes, even in the case of modest asymetries.

In the example shown, the procedure proposed here has been more sensitive than the GSEA previously proposed by Mootha et al. (2003) when analysing the same dataset.

The procedure described here is of a straightforward application with publicly available software. FatiGO (Al-Shahrour et al., 2005), FatiWise and Pomelo are part of the of GEPAS (Herrero et al., 2003, 2004), a suite of web tools for microarray gene expression data analysis; (http://www.gepas.org) and can be used in combination with the multtest package (Pollard et al., 2004 http://www.bepress.com/ucbbiostat/paper164) of Bioconductor (Gentleman, 2004, http://www.bioconductor.org) to apply the method described here to microarray data.


    Acknowledgments
 
F.A. is supported by Grant BIO2001-0068 from MCYT. This work has been partly supported by grants from Fundación Ramón Areces, Fundación BBVA and Fundació La Caixa. The Functional Genomics node from the INB is supported by Fundación Genoma España.

Received on November 13, 2004; revised on March 20, 2005; accepted on April 18, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Al-Shahrour, F., et al. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578–580[Abstract/Free Full Text].

    Al-Shahrour, F., et al. (2005) BABELOMICS: a suite of web-tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Res., (in press).

    Ashburner, M., et al. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25–29[CrossRef][ISI][Medline].

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., B 57, 289–300.

    Benjamini, Y. and Yekutieli, D. (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29, 1165–1188[CrossRef].

    Cunliffe, H.E., et al. (2003) The gene expression response of breast cancer to growth regulators: patterns and correlation with tumor expression profiles. Cancer Res., 63, 7158–7166[Abstract/Free Full Text].

    Eisen, M., et al. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA, 95, 14863–14868[Abstract/Free Full Text].

    Ge, H., et al. (2003) Integrating ‘omic’ information: a bridge between genomics and systems biology. Trends Genet., 19, 551–560[CrossRef][ISI][Medline].

    Gentleman, R.C., et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol., 5, R80[CrossRef][Medline].

    Herrero, J., et al. (2003) GEPAS, a web-based resource for microarray gene expression data analysis. Nucleic Acids Res., 31, 3461–3467[Abstract/Free Full Text].

    Herrero, J., et al. (2004) New challenges in gene expression data analysis and the extended GEPAS. Nucleic Acids Res., 32, W485–W491[Abstract/Free Full Text].

    Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res., 32, D277–D280[Abstract/Free Full Text].

    Lapointe, J., et al. (2004) Gene expression profiling identifies clinically relevant subtypes of prostate cancer. Proc. Natl Acad. Sci. USA, 101, 811–816[Abstract/Free Full Text].

    Mootha, V.K., et al. (2003) PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., 34, 267–273[CrossRef][ISI][Medline].

    U.C. Berkeley Division of Biostatistics Working Paper Series Pollard, K.S., et al. (2004) Multiple testing procedures and applications to genomics. Working Paper 164.

    Slonim, D.K. (2002) From patterns to pathways: gene expression data analysis comes of age. Nat. Genet., 32, Suppl, 502–508.

    van't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536[CrossRef][Medline].

    Westfall, P.H. and Young, S.S. Resampling-Based Multiple Testing, (1993) , NY John Wiley & Sons.

    Wold, L.E., et al. (2003) Insulin-like growth factor I (IGF-1) supplementation prevents diabetes-induced alterations in coenzymes Q9 and Q10. Acta Diabetol., 40, 85–90[Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
J. Tarraga, I. Medina, J. Carbonell, J. Huerta-Cepas, P. Minguez, E. Alloza, F. Al-Shahrour, S. Vegas-Azcarate, S. Goetz, P. Escobar, et al.
GEPAS, a web-based tool for microarray data analysis and interpretation
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W308 - W314.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
F. Al-Shahrour, J. Carbonell, P. Minguez, S. Goetz, A. Conesa, J. Tarraga, I. Medina, E. Alloza, D. Montaner, and J. Dopazo
Babelomics: advanced functional profiling of transcriptomics, proteomics and genomics experiments
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W341 - W346.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
P. Minguez, F. Al-Shahrour, D. Montaner, and J. Dopazo
Functional profiling of microarray experiments using text-mining derived bioentities
Bioinformatics, November 15, 2007; 23(22): 3098 - 3099.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. Gamberoni, E. Lamma, G. Lodo, J. Marchesini, N. Mascellani, S. Rossi, S. Storari, L. Tagliavini, and S. Volinia
Fun&Co: identification of key functional differences in transcriptomes
Bioinformatics, October 15, 2007; 23(20): 2725 - 2732.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
F. Tai and W. Pan
Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms
Bioinformatics, July 15, 2007; 23(14): 1775 - 1782.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
F. Al-Shahrour, P. Minguez, J. Tarraga, I. Medina, E. Alloza, D. Montaner, and J. Dopazo
FatiGO +: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W91 - W96.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
R. Diaz-Uriarte, A. Alibes, E. R. Morrissey, A. Canada, O. M. Rueda, and M. L. Neves
Asterias: integrated analysis of expression and aCGH data using an open-source, web-based, parallelized software suite
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W75 - W80.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
S.-B. Kim, S. Yang, S.-K. Kim, S. C. Kim, H. G. Woo, D. J. Volsky, S.-Y. Kim, and I.-S. Chu
GAzer: gene set analyzer
Bioinformatics, July 1, 2007; 23(13): 1697 - 1699.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Huang and T. W. S. Chow
Identifying the biologically relevant gene categories based on gene expression and biological data: an example on prostate cancer
Bioinformatics, June 15, 2007; 23(12): 1503 - 1510.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. M. Kemp, N. R. Nirmala, and J. D. Szustakowski
Extending the pathway analysis framework with a test for transcriptional variance implicates novel pathway modulation during myogenic differentiation
Bioinformatics, June 1, 2007; 23(11): 1356 - 1362.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
J. J. Goeman and P. Buhlmann
Analyzing gene expression data in terms of gene sets: methodological issues
Bioinformatics, April 15, 2007; 23(8): 980 - 987.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Nam, S.-B. Kim, S.-K. Kim, S. Yang, S.-Y. Kim, and I.-S. Chu
ADGO: analysis of differentially expressed gene sets using composite GO annotation
Bioinformatics, September 15, 2006; 22(18): 2249 - 2253.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
F. Al-Shahrour, P. Minguez, J. Tarraga, D. Montaner, E. Alloza, J. M. Vaquerizas, L. Conde, C. Blaschke, J. Vera, and J. Dopazo
BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W472 - W476.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
D. Montaner, J. Tarraga, J. Huerta-Cepas, J. Burguet, J. M. Vaquerizas, L. Conde, P. Minguez, J. Vera, S. Mukherjee, J. Valls, et al.
Next station in microarray data analysis: GEPAS.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W486 - W491.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
M. Scheer, F. Klawonn, R. Munch, A. Grote, K. Hiller, C. Choi, I. Koch, M. Schobert, E. Hartig, U. Klages, et al.
JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information.
Nucleic Acids Res., July 1, 2006; 34(Web Server issue): W510 - W515.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Huang and W. Pan
Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data
Bioinformatics, May 15, 2006; 22(10): 1259 - 1268.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
W. Pan
Incorporating gene functions as priors in model-based clustering of microarray gene expression data
Bioinformatics, April 1, 2006; 22(7): 795 - 801.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/13/2988    most recent
bti457v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (23)
Right arrowRequest Permissions
Google Scholar