Skip Navigation


Bioinformatics Advance Access originally published online on November 8, 2005
Bioinformatics 2006 22(2):233-241; doi:10.1093/bioinformatics/bti764
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/2/233    most recent
bti764v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Novak, B. A.
Right arrow Articles by Jain, A. N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Novak, B. A.
Right arrow Articles by Jain, A. N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

Pathway recognition and augmentation by computational analysis of microarray expression data

Barbara A. Novak and Ajay N. Jain *

UCSF Cancer Research Institute and Comprehensive Cancer Center, University of California at San Francisco San Francisco, CA 94143-0128, USA

*To whom correspondence should be addressed at UCSF Cancer Center, 2340 Sutter Street, #S336, San Francisco, CA 94115, USA


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: We present a system, QPACA (Quantitative Pathway Analysis in Cancer) for analysis of biological data in the context of pathways. QPACA supports data visualization and both fine- and coarse-grained specifications, but, more importantly, addresses the problems of pathway recognition and pathway augmentation.

Results: Given a set of genes hypothesized to be part of a pathway or a coordinated process, QPACA is able to reliably distinguish true pathways from non-pathways using microarray expression data. Relying on the observation that only some of the experiments within a dataset are relevant to a specific biochemical pathway, QPACA automates selection of this subset using an optimization procedure. We present data on all human and yeast pathways found in the KEGG pathway database. In 117 out of 191 cases (61%), QPACA was able to correctly identify these positive cases as bona fide pathways with p-values measured using rigorous permutation analysis. Success in recognizing pathways was dependent on pathway size, with the largest quartile of pathways yielding 83% success. In cross-validation tests of pathway membership prediction, QPACA was able to yield enrichments for predicted pathway genes over random genes at rates of 2-fold or better the majority of the time, with rates of 10-fold or better 10–20% of the time.

Availability: The software is available for academic research use free of charge by email request.

Contact: ajain{at}jainlab.org

Supplementary information: Data used in the paper may be downloaded from http://www.jainlab.org/downloads.html


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Rapidly accumulating quantitative gene expression data has facilitated the quantitative investigation of biological pathways. The study of a specific pathway often requires painstaking research into each possible element of the network. Consequently, existing knowledge of gene regulatory networks is limited, resulting in hypotheses that are incomplete or at various levels of refinement. Quantitative analytical approaches seek to fill this information gap. With modern experimental methods, it is possible to survey the expression of thousands of genes in a single experiment, while proteomic methods are beginning to offer a solution to quantifying protein levels (Gavin et al., 2002; Uetz et al., 2000). Given sufficiently dense information on the concentrations of specific molecular species in a cell type under many conditions and multiple time-points, it should be possible to investigate many biological pathways automatically. Currently, comprehensive quantitative data are available primarily for gene expression, but not generally for translated proteins or metabolites. Such data are noisy, and regulation of gene expression is complex, often controlled by combinatorial factors (Holstege et al., 1998). Despite this under-constrained and noisy system, many groups have begun to approach the problems in pathway-based analysis using plentiful and relatively inexpensive RNA expression data. Much of this early work in pathway inference has focused on yeast or other simple model organisms, which are more tractable than the human system because they are better understood and characterized and have fewer genes to assess. Additionally, these organisms are easily manipulated, allowing the analysis of systematically perturbed system states, and thereby further reducing the inherent complexity in the networks being studied. Several of these attempts have yielded biologically reasonable results (Friedman et al., 2000; Ideker et al., 2001; Paley and Karp, 2002).

For cancer-specific pathway delineation, one would like to perform perturbation-based biological experiments broadly in a human system. For example, given an easily manipulated cell line relevant to breast cancer, we would like to measure the concentration of both expressed transcripts and translated proteins, both over time and under multiple conditions e.g. gene knockouts, drug treatments, RNAi treatment, etc. Given access to data that include information on metabolite concentration and assuming that the data were at an appropriate granularity, it should be possible to work out a great deal of pathway biology. Given such ideal data, many specific biochemical interactions and transformations could be identified. Datasets that approach this theoretical construct, however, are not even available in yeast and other lower organisms, much less in the human case. In model organisms, such as yeast, it is possible to generate experimental data that are geared to examining a particular condition or set of conditions that are expected to be particularly relevant to a given set of genes. For the yeast transcriptional network, Ihmels et al. (2002) have used this notion of relevant sample sets to create ‘transcriptional modules’ that overcome the limitations of simple clustering. Their method involves selecting relevant experiments based on an initial set of genes, then finding the gene signature that corresponds to this experiment signature, thus forming context-dependent and potentially overlapping modules of genes.

In human cancer, we frequently find datasets involving multiple instances of the cancer phenotype, across many individuals (and possibly multiple tumor types), each representing a unique experiment of nature. Given that each sample in such a dataset has a different set of genomic, genetic and epigenetic disregulations, there is no reason to expect, a priori, that all of the samples will be relevant to a quantitative study of a particular set of genes in a particular pathway. In yeast, systematic perturbations have shown that only certain perturbations are relevant to certain pathways. In Hughes' yeast work, perturbations of genes in a particular pathway yielded the most informative changes in gene expression within that pathway (but not the largest) (Hughes et al., 2000). For example, deletion of erg genes had large effects on amino acid biosynthesis (as did many other deletions), but the most specific (though smaller) effects were on the expression of genes in the erg pathway.

Here we present QPACA (Quantitative Pathway Analysis in Cancer), a system for pathway visualization and analysis. QPACA addresses three aspects of the general problem: (1) representation and visualization of pathways in the context of biological data, (2) recognition of gene sets that are part of a pathway or coordinated process and (3) augmentation of pathways by prediction of pathway membership. The analysis method directly addresses the issues inherent in analysis of human systems without making limiting assumptions about the structure of pathways or discretizing data.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
2.1 Pathway representation
Computationally, the attributes of a pathway representation can be modeled with a simple directed graph structure, which is a collection of objects (nodes) and their interrelationships (edges). Additionally, directed graphs allow the use of an extensive set of pre-existing algorithms for navigation, visualization and structure analysis. Each node in the pathway graph is composed of an assembly of one or more gene products, small molecules, processes or other assemblies. In this way, it is possible to represent single components of the pathway as well as multimolecular complexes and families in an extensible fashion. The edges in the graph represent either activation or inhibition interactions between the nodes. To model more complex interactions, a special event node can be used to represent a particular event (such as phosphorylation). The pathway representation and language is both general and extensible, placing no biologically implausible constraints on pathway description. Figure 1 shows a representative image automatically generated by QPACA based on the receptor tyrosine kinase signaling (RTK) pathway, as defined by local experts. Currently, QPACA includes a set of filters for accessing KEGG Markup Language (KGML) (Kanehisa et al., 2004) formatted pathways, and filters for BioPAX (http://www.biopax.org) formatted pathways will be included in a future version. This representation not only enables visualization of data in the context of pathways, but, more importantly, also allows computational analysis of pathway composition and structure.


Figure 1
View larger version (33K):
[in this window]
[in a new window]
 
Fig. 1 A representative pathway image. The image was automatically generated by QPACA based on the receptor tyrosine kinase signaling (RTK) pathway, as defined by local experts, using a simple and an extensible text-based language. Black lines/arrows indicate excitatory interactions, while red lines/tees indicate inhibitory interactions. Nodes in the graph can represent gene products (rectangles), molecules (diamonds), or processes (green text). Joined rectangles indicate that a particular node can be represented by more than one gene product. Complexes of gene products are indicated by black dots with blue lines joining them to the members of the complex. The RTK pathway consists of the following 57 genes: EGFR, ERBB2, EPHA2, HGFR, SRC, SOS1, GAP1IP4BP, RGL3, RALA, RAF1, MAP2K1, MAPK3, ELK1, CCND1, CDK4, CCND2, CDK6, RB1, E2F1, E2F2, E2F3, E2F4, CCNE1, CCNA2, CDK2, ARF1, MDM2, CDKN2A, CDKN1A, TP53, MYC, MYCN, HRAS, NRAS, GSK3B, CDKN1B, TNFSF6, MLLT7, BAD, AKT1, AKT2, RPS6KB1, PDK1, PTEN, PIK3CB, PIK3CD, PIK3R1, PIK3R2, PIK3CA, RACGAP1, PLCG1, DAG1, PKC, AP1M2, FOSB, RASGRP1 and RAP1GDS1.

 
2.2 Optimization method
Given that it is not possible to know which samples are likely to be the most informative for a given set of genes in the typical human dataset, we have developed a method of optimization to automatically select these relevant samples by maximizing a scoring metric (Fig. 2). The goal of this method is to identify a specific subset of biological samples, where the variation of gene expression across samples bears on the function of the genes within a set under study.


Figure 2
View larger version (39K):
[in this window]
[in a new window]
 
Fig. 2 Algorithm. Sample subselection is achieved through the use of an optimization method that takes as input a large microarray dataset, a set of genes and a number of iterations. The method follows a simple hill-climbing optimization approach. Initially, a random subset of samples is chosen and scored. Then, in each subsequent iteration, a random sample is swapped into the subset and a new score is calculated. If the new score is better than the old score, the new subset is retained for the next iteration; otherwise the subset reverts to its previous state. The scoring metric was the median of the gene–gene correlations across all pairs of genes within a sample subset.

 
The optimization method employed in sample subselection follows a simple hill-climbing approach. A dataset consists of a matrix of log-expression changes (denoted E) of a gene g isin G{1, ..., N} over experimental samples s isin S{1, ..., M}, where N and M are the number of genes and samples, respectively. The optimization finds the optimal sample subset, Spath, for a specific subset of genes, Gpath. Following is a pseudocode procedure for the function optimize(Gpath, E):

For 20 iterations:

choose a random subset of samples Sinit sub S.

number_iterations colone 0

Scurrent colone Sinit

do until no change in Scurrent or number_iterations == 600:

snew colone pick random sample snew isin S {cap} Scurrent

sold colone pick random sample sold isin Scurrent

Stest colone (Scurrent – sold) {cup} snew

if score(Gpath, Stest, E) > score(Gpath, Scurrent, E)

Scurrent colone Stest

++number_iterations

if first iteration or score(Gpath, Scurrent, E) > score(Gpath, Spath, E)

Spath colone Scurrent

Where score (gene subset, sample subset, expression matrix) is defined as the median of all gene–gene correlations (using Pearson's correlation) across all pairs of genes within the gene subset over the sample subset given the expression matrix. The numbers 600 and 20 were empirically derived to ensure convergence. To decide whether a set of genes is part of a pathway, Gpath is composed of all genes in the dataset that belongs to that specific putative pathway.

2.3 Performance evaluation
One application of the method is to test whether a gene set is behaving in a coordinated fashion, and whether the coordination is relevant to our notion of a biological pathway. This test pertains to the common biological observation that some small set of genes appears to be behaving similarly (as in a hierarchical clustering of expression patterns), but where it is important to ask more quantitatively whether this gene set is also related through a common pathway. We tested the algorithm's ability to yield high scores on sets of known pathway genes as compared with random sets of genes (to simulate false pathways) or randomized expression data (to simulate data with no signal).

The optimization method may also be applied to the question of biological pathway augmentation. Given a pathway for which the algorithm was able to recognize the gene set as a pathway, the scoring metric can then be used to rank the remaining non-pathway genes. Highly ranked genes could be hypothesized to also belong to the pathway. We tested the algorithm's ability to find new pathway genes by conducting a series of cross-validation experiments in which a percentage of randomly chosen known pathway genes were held out during optimization and scoring of the pathway. Then, QPACA's scoring methodology was used to rank-order the held-out pathway genes and compare them with the background of non-pathway genes, both those in other KEGG pathways and those not known to belong to any other pathways.

We considered both human and yeast expression datasets in developing and assessing QPACA. The Hughes compendium of yeast data (Hughes et al., 2000) (287 deletion mutants and 13 environmental conditions) was used primarily to develop and test the algorithm, while the NCI set of 60 human cancer-derived cell lines (Staunton et al., 2001) was used to validate the results in human. The datasets contained expression levels for 6356 and 7071 genes, respectively. The data were processed to remove duplicates, genes with excessive missing values, and genes with insufficient annotations. Because the algorithm computes correlations across at most 20 samples, a fairly stringent measure of missing values was used. Each gene had to have signal for at least 80% of the total samples. In the NCI60 dataset, this threshold eliminated roughly 3600 genes. The other genes were eliminated due to insufficient annotation (~700 genes) and consolidation of duplicates (~800 genes). The yeast dataset had fewer issues with missing annotations and duplicated genes, as well as better general signal strength. After processing, 6163 genes remained in the yeast dataset, while 1954 genes remained in the human dataset. Raw expression values were converted to log2 ratios by taking the logarithm of each gene's value divided by the mean for that gene across all samples. Additionally, data were normalized by median centering within each sample (this is an automatic feature of QPACA's data processing, as it can influence results markedly if omitted). Both datasets were analyzed in the context of all pathways in the KEGG database with at least five genes in the corresponding dataset. In the human dataset, the receptor tyrosine kinase signaling (RTK) pathway, as defined by local experts, was also analyzed. Figure 1 delineates the composition of this pathway. This resulted in a total of 83 yeast pathways and 108 human pathways. The pathways varied substantially in size, with the bottom half containing fewer than 20 genes, and the top quartile containing 30 genes or more.

2.4 Permutation testing and p-value calculation
We tested the algorithm's ability to recognize pathway membership and to select samples relevant to a gene set. We compared the scores for each known pathway gene set (coupled with its experimental expression data) with three distributions: (1) scores resulting from randomly chosen gene sets of the same size as the pathway set in question; and (2) scores resulting from randomizing the gene expression data itself and (3) scores resulting from randomly chosen gene sets of the same size as the pathway set in question, restricted to genes represented within KEGG. In each permutation test, the randomized data come from the same microarray experiment as that used for the pathway case being tested. The first permutation method estimates the likelihood that any equal-sized gene set will find a sample subset which scores equal to or better than the test. If the probability is low, confidence of the metric's rationality increases. The second randomization method controls for the effects of the distribution of data values within a particular set of genes, which may yield unusual distributions of correlation values under certain degenerate conditions. Given a particular set of genes from a particular dataset, data were randomized for each gene across all samples. The third method is a further control to test for the possibility that KEGG genes as a set may be biased.

Permutation Method 1: Randomize pathway gene set, Gpath, do not change E (expression matrix).

for i in 1 to 500 permutations:

Grandom colone choose a random set of genes from G of the samesize as Gpath.

Sfinal colone optimize(Grandom, E)

ScoreArray[i] = score(Grandom, Sfinal, E)

++i

Permutation Method 2: Randomize the expression matrix, E, do not change Gpath.

for i in 1 to 500 permutations:

E' colone randomized matrix E, where each gene's values are permuted across samples (distribution of each gene is unchanged)

Sfinal colone optimize(Grandom, E')

ScoreArray[i] = score(Grandom, Sfinal, E')

++i

Permutation Method 3:Same as Method 1, but restrict Gpath to those genes found in KEGG pathways.

In each case, we computed the p-value for Gpath by comparing the score (Gpath, Spath, E) with the distribution of scores in ScoreArray where

Formula
Additionally, we also compared the optimized score(Gpath, Spath, E) with the unoptimized score(Gpath, S, E) to determine if the optimization procedure actually increased our ability to recognize pathways.

2.5 Cross-validation in biological pathway augmentation
We tested the algorithm's ability to predict additional pathway members by performing cross-validation experiments in which we held out a subset of known pathway genes. We identified the set of pathways for which all three permutation-based p-values were less than 0.1, in order to eliminate those pathways for which prediction would be unlikely to work. This matches how one would proceed in practice, where one would not choose to believe predicted augmentation of a pathway for which the recognition process failed. For each of the resulting 101 pathways, we randomly chose 10% of the known pathway genes to hold out. Then, the optimization algorithm was run on the remaining pathway genes, resulting in a score for the pathway and subset of relevant samples. Using this subset of samples, the remaining genes, both holdouts and a background of non-pathway genes, were scored by adding them to the pathway set and recomputing the score. The resulting gene scores were ranked. This procedure was carried out a total of five times for each pathway. Two separate backgrounds of non-pathway genes were examined: (1) those known to occur in other KEGG pathways and (2) those not known to occur in any other pathways. Additionally, we also tested the ability of the optimization procedure to affect the gene rankings by scoring the held-out and non-pathway genes with no optimization step.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Representative results for gene set recognition on a diverse and non-overlapping set of yeast and human experimental datasets can be seen in Tables 1 and 2. Table 1 reports the most pessimistic p-value computed from the three methods described above, while Table 2 provides the breakdown of all three individual p-values. A p-value of X can be interpreted as the probability that a random set of genes will yield a p-value of less than or equal to X. For this limited set of pathways, those for which subselection did not improve the significance of the score included the metabolic pathways (e.g. galactose metabolism) which are likely to show coordinated expression under many different conditions. In these pathways, the score was highly significant both with and without subselection. There were a few pathways, however, for which the score was not significant in either condition. In these cases (inositol phosphate metabolism and phosphatidylinositol signaling), it is possible that the perturbations in these pathways either did not show up in these particular datasets or that the effects were outside the scope of expression data, e.g. through post-translational modifications. In none of these cases did the subselection process cause a significant result to become insignificant. Note, however, that we occasionally observed significant variability in the p-values obtained using the three different methods. This is expected, to the extent that each of the different permutation methods will yield different distributions of gene expression values. In the random data case, the distribution for the control is exactly the same as in the non-control. In the other two types of permutations, this is not so, and consequently, there were differences depending on the method used.


View this table:
[in this window]
[in a new window]
 
Table 1 Results table for a representative set of pathways

 

View this table:
[in this window]
[in a new window]
 
Table 2 Breakdown of individual p-values with sample subselection for each pathway

 
Figure 3 illustrates the comprehensive results obtained from all 191 human and yeast pathways. Panel A shows the cumulative histograms for the three different permutation methods. While there are some differences in the distributions, particularly at very low p-values, the three methods do not produce substantially different interpretations of the data. Using the most pessimistic values (as in Table 1), in 61% of all cases, we observed significant scores under optimization (p ≤ 0.05). For the reason of distributional equivalence, we focused on the randomized data method. Panel B shows the effect of both the optimization procedure and larger gene set sizes for the data randomization permutation method. The optimization procedure clearly shifts the p-value distribution to the left, and the subset of pathways with larger gene set sizes also shifts the distribution to the left (both observations are highly statistically significant). Using the data randomization permutation method, for pathways with at least 30 genes, we observe 96% recognition at a false positive rate of 0.05. A more conservative test, using the most pessimistic of all permutation methods, yields a recognition rate of 83%.


Figure 3
View larger version (26K):
[in this window]
[in a new window]
 
Fig. 3 Comprehensive pathway recognition results from all 191 human and yeast pathways. (A) Cumulative histograms of permutation-based p-values for all analyzed pathways. Most pathways (117/191 or 61%) had significant scores (p ≤ 0.05) under the most stringent of these methods, indicating that this method is reproducible over a large and varied set of pathways. (B) Cumulative histograms of the p-values with optimization for all pathways (solid line), all pathways without optimization (dotted line) and for those pathways with ≥ 30 genes using optimization (dashed line). For each case, the randomization permutation method was used. The optimization method clearly improves the p-value distribution over no optimization (p << 0.001 by Mann–Whitney), and restriction to larger gene set sizes also improves the p-value distribution (p << 0.001 by Mann–Whitney).

 
These results with QPACA address the first step in computational pathway structure elucidation using gene expression data: given a set of genes hypothesized to be part of a pathway, QPACA can recognize these while rejecting random pathways. The next step, which is clearly both more challenging and more exciting, is prediction of genes that should be considered for addition to a partially elucidated pathway. For the 101 pathways from the foregoing experiments where QPACA recognized the gene sets as pathways, we performed cross-validation experiments by holding out random subsets of known pathway genes and using QPACA's scoring methodology to rank-order the held-out pathway genes within a background of non-pathway genes (see Methods for additional details). Figure 4 summarizes the results using two background gene sets: one consisting only of genes found in KEGG and the other of non-KEGG genes. Figure 4A shows smoothed histograms of scores for holdout RTK pathway genes and non-pathway genes (KEGG control gene set). The distributions are roughly normal, with the scores for the RTK pathway genes clearly shifted to the right. The RTK case was fairly typical in terms of enrichment of high scores within the pathway holdout set versus the non-pathway set. Figure 4B shows the corresponding representative receiver-operating characteristic (ROC) curve for the RTK pathway and for two other human pathways.


Figure 4
View larger version (39K):
[in this window]
[in a new window]
 
Fig. 4 Comprehensive pathway augmentation results from 101 human and yeast pathways. (A) Smoothed histogram of scores for holdout RTK pathway genes (solid line) and non-pathway genes (dashed line). The non-RTK-pathway genes were taken from the full set of KEGG genes that were not found in the annotated RTK pathway. The distributions are roughly normal, with the scores for the RTK pathway genes shifted to the right. (B) Three representative ROC curves: RTK pathway (solid line), two KEGG human pathways (dashed lines). The shift of the RTK holdout distribution to the right of the non-pathway gene distribution yields a favorable ROC curve, with deflection into the upper left corner, with an area of 0.71. HSA00190 (Oxidative phosphorylation) also shows a favorable enrichment of known pathway genes against a background of non-pathway genes. HSA04020 (calcium signaling pathway) shows an atypical case, with no enrichment of pathway genes over non-pathway genes. (C) Cumulative distributions of ROC areas for all 101 pathways tested with two different backgrounds. The solid line shows results using the KEGG gene set as the source of non-pathway genes for each pathway holdout experiment. The dashed line shows similar results, but makes use of genes not annotated within KEGG as the source of non-pathway genes. Both distributions are highly skewed to the right of 0.5, indicating that positive enrichment for pathway genes over non-pathway genes occurs in the vast majority of cases. The non-KEGG control gene population yields a slightly stronger enrichment. (D) Cumulative histograms of maximal enrichment ratios for both control gene populations. Enrichments of 2-fold or better occur for 60 and 70% of the pathways using the KEGG and non-KEGG control gene sets, respectively. Enrichments of 10-fold or better occur in 12 and 20%, respectively.

 
Figure 4C and D summarize cross-validation results for the full set of 101 tested pathways with two different backgrounds. In Figure 4C, the cumulative distributions of ROC areas for the two control gene sets are shown. Both distributions are highly skewed to the right of 0.5 (p < 10–6 by exact binomial), indicating that positive enrichment for pathway genes over non-pathway genes occurs in the vast majority of cases. The non-KEGG control gene population yields a slightly stronger enrichment. Figure 4D shows cumulative histograms of maximal enrichment ratios for both control gene populations. Given a specific portion of the top-ranked genes in a ranked list, the enrichment reflects the ratio of the number of true positive genes found divided by the number of such genes expected by chance. The maximal enrichment ratio is the highest such ratio over the full range of possible proportions. Enrichments of 2-fold or better occur for 60 and 70% of the pathways using the KEGG and non-KEGG control gene sets, respectively. Enrichments of 10-fold or better occur in 12 and 20%, respectively. Note that all of the results in Figure 4 resulted from running the subset optimization procedure described previously. On the same set of 101 pathways, without running the optimization procedure, the chief difference in the results is a marked increase in the number of pathways for which the resulting ROC areas were significantly less than 0.5 (7% of non-optimized cross-validation runs yielded ROC areas <0.485 compared with 0% of the runs with optimization).

Recall that the RTK pathway (Fig. 1) was a hand-curated representation of genes downstream of receptor tyrosine kinase signaling, particularly emphasizing pathways that affected cell-cycle regulation, proliferation and apoptosis (each particularly important in cancer). Also, note the nominally better enrichments observed in using the non-KEGG genes as a control set as compared with the KEGG gene set. We hypothesized that the high-scoring control genes from the KEGG population of nominally non-RTK genes might, in fact, include genes that properly belonged to the RTK pathway, based on existing literature evidence. We considered the top 30 highest-scoring genes (the right-hand tail of the ‘Other Genes’ distribution from Fig. 4A). Figure 5 depicts the documented relationships of 12 of these genes to the RTK pathway as originally curated. Of the 30 highest-scoring KEGG genes (all nominal false positives in our cross-validation experiment) 15 properly belong to the RTK or very closely related pathways.


Figure 5
View larger version (42K):
[in this window]
[in a new window]
 
Fig. 5 Pathway augmentation for the RTK pathway. This image shows the original annotated RTK pathway as in Figure 1 with the 15/30 top scoring genes that have documented links to the pathway. Interactions between highly scoring genes (yellow) with annotated RTK pathway genes (gray) are highlighted in green (Hall et al., 1995; Nagata et al., 1998; Liu et al., 2000; Watson et al., 1997; Moores et al., 2000; Gulbins et al., 1995; Bertagnolo et al., 1998; Das et al., 2000; Hobert et al., 1996; Yang et al., 1995; Weinmann et al., 2002; Ohtoshi et al., 2000; Humbert et al., 2002; Delcommenne et al., 1998; Zhang et al., 2001). Three additional genes (IKBKE, PLCD1 and DYRK4) are shown unconnected to the existing diagram. These genes were members of closely related pathways (MAPK signaling, Calcium signaling and Phosphatidylinositol signaling). DYRK4 is also a tyrosine kinase (Zhang et al., 2005; Becker et al., 1998).

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Our results with QPACA have addressed two steps in computational pathway structure elucidation using gene expression data. First, given a gene set postulated to be part of a pathway or coordinated process, we have demonstrated the ability to discriminate between real pathways and non-pathways. Second, given a gene set known to represent part of a pathway, we have demonstrated the ability to significantly enrich for pathway genes over non-pathway genes in making predictions of putative pathway members. We term the first problem pathway recognition, and the second problem pathway augmentation. Specifically, for the pathway recognition problem, in 61% of known pathways (83% for larger pathways), QPACA was able to answer affirmatively while rejecting random pathways over 95% of the time. For the pathway augmentation problem, QPACA was able to yield enrichments for predicted pathway genes over random genes at rates of 2-fold or better the majority of the time, with rates of 10-fold or better 10–20% of the time, depending on the background used. While enrichment rates on the level of 2-fold may or may not be of practical significance, we believe that rates of 10-fold may help prioritize experimental exploration of candidate genes for elaborating pathways of particular biological interest.

QPACA's pathway representation is flexible, extensible and accessible. By employing an automated method for identifying relevant sample subsets, we have been able to move into complex pathways relevant to major disease processes in humans. There are two major areas for future work. First is extending the recognition and prediction process to make use of new scoring metrics and optimization strategies. Given that the current metric yielded positive results, we are optimistic that a more refined scoring metric would be able to tease more information out of the data. As this work is extended, other metrics will be considered that move beyond the simple correlation metric presented here, which will allow both direction and magnitude to be taken into account in determining the score. In particular, we plan to examine metrics that include pathway structure information (e.g. by making use of graph-based algorithms that identify minimum spanning trees, maximum flow networks, etc.). We are also interested in exploring what synergy might arise by combining sequence-based analytical methods (e.g. motif finding in gene promoter sets) with the information derived from measurements of variation in gene expression or protein levels. Yet more challenging problems, including recognition and prediction of interactions (edges in the pathway graph) and full pathway induction (as opposed to just augmentation) are also areas for further research. An exciting future direction of this work would be joint subselection of both genes and samples simultaneously, which may offer an approach to the induction problem, beginning with a large set of genes and selecting a smaller highly coordinated set.


    Acknowledgments
 
The authors are grateful to Frank McCormick and Joe Gray for fruitful discussion and collaboration. The authors would like to acknowledge NIH/NCI for partial funding of the work (GM070481, CA64602 and the Cancer Bioinformatics Grid Project, CaBIG).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Steen Knudsen

Received on August 8, 2005; revised on October 7, 2005; accepted on November 3, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Becker, W., et al. (1998) Sequence characteristics, subcellular localization, and substrate specificity of DYRK-related kinases, a novel family of dual specificity protein kinases. J. Biol. Chem, . 273, 25893–25902[Abstract/Free Full Text].

    Bertagnolo, V., et al. (1998) Nuclear association of tyrosine-phosphorylated Vav to phospholipase C-gamma1 and phosphoinositide 3-kinase during granulocytic differentiation of HL-60 cells. FEBS Lett, . 441, 480–484[CrossRef][Web of Science][Medline].

    Das, B., et al. (2000) Control of intramolecular interactions between the pleckstrin homology and Dbl homology domains of Vav and Sos1 regulates Rac binding. J. Biol. Chem, . 275, 15074–15081[Abstract/Free Full Text].

    Delcommenne, M., et al. (1998) Phosphoinositide-3-OH kinase-dependent regulation of glycogen synthase kinase 3 and protein kinase B/AKT by the integrin-linked kinase. Proc. Natl Acad. Sci. USA, 95, 11211–11216[Abstract/Free Full Text].

    Friedman, N., et al. (2000) Using Bayesian networks to analyze expression data. J. Comput. Biol, . 7, 601–620[CrossRef][Web of Science][Medline].

    Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141–147[CrossRef][Medline].

    Gulbins, E., et al. (1995) Molecular analysis of Ras activation by tyrosine phosphorylated Vav. Biochem. Biophys. Res. Commun, . 217, 876–885[CrossRef][Web of Science][Medline].

    Hall, M., et al. (1995) Evidence for different modes of action of cyclin-dependent kinase inhibitors: p15 and p16 bind to kinases, p21 and p27 bind to cyclins. Oncogene, 11, 1581–1588[Web of Science][Medline].

    Hobert, O., et al. (1996) SH3 domain-dependent interaction of the proto-oncogene product Vav with the focal contact protein zyxin. Oncogene, 12, 1577–1581[Web of Science][Medline].

    Holstege, F.C., et al. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell, 95, 717–728[CrossRef][Web of Science][Medline].

    Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126[CrossRef][Web of Science][Medline].

    Humbert, S., et al. (2002) The IGF-1/Akt pathway is neuroprotective in Huntington's disease and involves Huntingtin phosphorylation by Akt. Dev. Cell, 2, 831–837[CrossRef][Web of Science][Medline].

    Ideker, T., et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929–934[Abstract/Free Full Text].

    Ihmels, J., et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet, . 31, 370–377[CrossRef][Web of Science][Medline].

    Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res, . 32, D277–D280[Abstract/Free Full Text].

    Liu, Y.F., et al. (2000) Activation of MLK2-mediated signaling cascades by polyglutamine-expanded huntingtin. J. Biol. Chem, . 275, 19035–19040[Abstract/Free Full Text].

    Moores, S.L., et al. (2000) Vav family proteins couple to diverse cell surface receptors. Mol. Cell Biol, . 20, 6364–6373[Abstract/Free Full Text].

    Nagata, K., et al. (1998) The MAP kinase kinase kinase MLK2 co-localizes with activated JNK along microtubules and associates with kinesin superfamily motor KIF3. EMBO J, . 17, 149–158[CrossRef][Web of Science][Medline].

    Ohtoshi, A., et al. (2000) Human p55(CDC)/Cdc20 associates with cyclin A and is phosphorylated by the cyclin A-Cdk2 complex. Biochem. Biophys. Res. Commun, . 268, 530–534[CrossRef][Web of Science][Medline].

    Paley, S.M. and Karp, P.D. (2002) Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics, 18, 715–724[Abstract/Free Full Text].

    Staunton, J.E., et al. (2001) Chemosensitivity prediction by transcriptional profiling. Proc. Natl Acad. Sci. USA, 98, 10787–10792[Abstract/Free Full Text].

    Uetz, P., et al. (2000) A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature, 403, 623–627[CrossRef][Medline].

    Watson, D.K., et al. (1997) FLI1 and EWS-FLI1 function as ternary complex factors and ELK1 and SAP1a function as ternary and quaternary complex factors on the Egr1 promoter serum response elements. Oncogene, 14, 213–221[CrossRef][Web of Science][Medline].

    Weinmann, A.S., et al. (2002) Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev, . 16, 235–244[Abstract/Free Full Text].

    Yang, E., et al. (1995) Bad, a heterodimeric partner for Bcl-XL and Bcl-2, displaces Bax and promotes cell death. Cell, 80, 285–291[CrossRef][Web of Science][Medline].

    Zhang, B.H., et al. (2001) Specific involvement of G(alphai2) with epidermal growth factor receptor signaling in rat hepatocytes, and the inhibitory effect of chronic ethanol. Biochem. Pharmacol, . 61, 1021–1027[CrossRef][Web of Science][Medline].

    Zhang, D., et al. (2005) DYRK gene structure and erythroid-restricted features of DYRK3 gene expression. Genomics, 85, 117–130[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/2/233    most recent
bti764v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Novak, B. A.
Right arrow Articles by Jain, A. N.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Novak, B. A.
Right arrow Articles by Jain, A. N.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?