Bioinformatics Advance Access originally published online on November 8, 2005
Bioinformatics 2006 22(2):233-241; doi:10.1093/bioinformatics/bti764
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Pathway recognition and augmentation by computational analysis of microarray expression data
UCSF Cancer Research Institute and Comprehensive Cancer Center, University of California at San Francisco San Francisco, CA 94143-0128, USA
*To whom correspondence should be addressed at UCSF Cancer Center, 2340 Sutter Street, #S336, San Francisco, CA 94115, USA
| ABSTRACT |
|---|
|
|
|---|
Motivation: We present a system, QPACA (Quantitative Pathway Analysis in Cancer) for analysis of biological data in the context of pathways. QPACA supports data visualization and both fine- and coarse-grained specifications, but, more importantly, addresses the problems of pathway recognition and pathway augmentation.
Results: Given a set of genes hypothesized to be part of a pathway or a coordinated process, QPACA is able to reliably distinguish true pathways from non-pathways using microarray expression data. Relying on the observation that only some of the experiments within a dataset are relevant to a specific biochemical pathway, QPACA automates selection of this subset using an optimization procedure. We present data on all human and yeast pathways found in the KEGG pathway database. In 117 out of 191 cases (61%), QPACA was able to correctly identify these positive cases as bona fide pathways with p-values measured using rigorous permutation analysis. Success in recognizing pathways was dependent on pathway size, with the largest quartile of pathways yielding 83% success. In cross-validation tests of pathway membership prediction, QPACA was able to yield enrichments for predicted pathway genes over random genes at rates of 2-fold or better the majority of the time, with rates of 10-fold or better 1020% of the time.
Availability: The software is available for academic research use free of charge by email request.
Contact: ajain{at}jainlab.org
Supplementary information: Data used in the paper may be downloaded from http://www.jainlab.org/downloads.html
| 1 INTRODUCTION |
|---|
|
|
|---|
Rapidly accumulating quantitative gene expression data has facilitated the quantitative investigation of biological pathways. The study of a specific pathway often requires painstaking research into each possible element of the network. Consequently, existing knowledge of gene regulatory networks is limited, resulting in hypotheses that are incomplete or at various levels of refinement. Quantitative analytical approaches seek to fill this information gap. With modern experimental methods, it is possible to survey the expression of thousands of genes in a single experiment, while proteomic methods are beginning to offer a solution to quantifying protein levels (Gavin et al., 2002; Uetz et al., 2000). Given sufficiently dense information on the concentrations of specific molecular species in a cell type under many conditions and multiple time-points, it should be possible to investigate many biological pathways automatically. Currently, comprehensive quantitative data are available primarily for gene expression, but not generally for translated proteins or metabolites. Such data are noisy, and regulation of gene expression is complex, often controlled by combinatorial factors (Holstege et al., 1998). Despite this under-constrained and noisy system, many groups have begun to approach the problems in pathway-based analysis using plentiful and relatively inexpensive RNA expression data. Much of this early work in pathway inference has focused on yeast or other simple model organisms, which are more tractable than the human system because they are better understood and characterized and have fewer genes to assess. Additionally, these organisms are easily manipulated, allowing the analysis of systematically perturbed system states, and thereby further reducing the inherent complexity in the networks being studied. Several of these attempts have yielded biologically reasonable results (Friedman et al., 2000; Ideker et al., 2001; Paley and Karp, 2002).
For cancer-specific pathway delineation, one would like to perform perturbation-based biological experiments broadly in a human system. For example, given an easily manipulated cell line relevant to breast cancer, we would like to measure the concentration of both expressed transcripts and translated proteins, both over time and under multiple conditions e.g. gene knockouts, drug treatments, RNAi treatment, etc. Given access to data that include information on metabolite concentration and assuming that the data were at an appropriate granularity, it should be possible to work out a great deal of pathway biology. Given such ideal data, many specific biochemical interactions and transformations could be identified. Datasets that approach this theoretical construct, however, are not even available in yeast and other lower organisms, much less in the human case. In model organisms, such as yeast, it is possible to generate experimental data that are geared to examining a particular condition or set of conditions that are expected to be particularly relevant to a given set of genes. For the yeast transcriptional network, Ihmels et al. (2002) have used this notion of relevant sample sets to create transcriptional modules that overcome the limitations of simple clustering. Their method involves selecting relevant experiments based on an initial set of genes, then finding the gene signature that corresponds to this experiment signature, thus forming context-dependent and potentially overlapping modules of genes.
In human cancer, we frequently find datasets involving multiple instances of the cancer phenotype, across many individuals (and possibly multiple tumor types), each representing a unique experiment of nature. Given that each sample in such a dataset has a different set of genomic, genetic and epigenetic disregulations, there is no reason to expect, a priori, that all of the samples will be relevant to a quantitative study of a particular set of genes in a particular pathway. In yeast, systematic perturbations have shown that only certain perturbations are relevant to certain pathways. In Hughes' yeast work, perturbations of genes in a particular pathway yielded the most informative changes in gene expression within that pathway (but not the largest) (Hughes et al., 2000). For example, deletion of erg genes had large effects on amino acid biosynthesis (as did many other deletions), but the most specific (though smaller) effects were on the expression of genes in the erg pathway.
Here we present QPACA (Quantitative Pathway Analysis in Cancer), a system for pathway visualization and analysis. QPACA addresses three aspects of the general problem: (1) representation and visualization of pathways in the context of biological data, (2) recognition of gene sets that are part of a pathway or coordinated process and (3) augmentation of pathways by prediction of pathway membership. The analysis method directly addresses the issues inherent in analysis of human systems without making limiting assumptions about the structure of pathways or discretizing data.
| 2 METHODS |
|---|
|
|
|---|
2.1 Pathway representation
Computationally, the attributes of a pathway representation can be modeled with a simple directed graph structure, which is a collection of objects (nodes) and their interrelationships (edges). Additionally, directed graphs allow the use of an extensive set of pre-existing algorithms for navigation, visualization and structure analysis. Each node in the pathway graph is composed of an assembly of one or more gene products, small molecules, processes or other assemblies. In this way, it is possible to represent single components of the pathway as well as multimolecular complexes and families in an extensible fashion. The edges in the graph represent either activation or inhibition interactions between the nodes. To model more complex interactions, a special event node can be used to represent a particular event (such as phosphorylation). The pathway representation and language is both general and extensible, placing no biologically implausible constraints on pathway description. Figure 1 shows a representative image automatically generated by QPACA based on the receptor tyrosine kinase signaling (RTK) pathway, as defined by local experts. Currently, QPACA includes a set of filters for accessing KEGG Markup Language (KGML) (Kanehisa et al., 2004) formatted pathways, and filters for BioPAX (http://www.biopax.org) formatted pathways will be included in a future version. This representation not only enables visualization of data in the context of pathways, but, more importantly, also allows computational analysis of pathway composition and structure.
|
2.2 Optimization method
Given that it is not possible to know which samples are likely to be the most informative for a given set of genes in the typical human dataset, we have developed a method of optimization to automatically select these relevant samples by maximizing a scoring metric (Fig. 2). The goal of this method is to identify a specific subset of biological samples, where the variation of gene expression across samples bears on the function of the genes within a set under study.
|
The optimization method employed in sample subselection follows a simple hill-climbing approach. A dataset consists of a matrix of log-expression changes (denoted E) of a gene g
G{1, ..., N} over experimental samples s
S{1, ..., M}, where N and M are the number of genes and samples, respectively. The optimization finds the optimal sample subset, Spath, for a specific subset of genes, Gpath. Following is a pseudocode procedure for the function optimize(Gpath, E):
For 20 iterations:choose a random subset of samples Sinit
S.
number_iterations
0
Scurrent
Sinit
do until no change in Scurrent or number_iterations == 600:
snew
pick random sample snew
S
Scurrent
sold
pick random sample sold
Scurrent
Stest
(Scurrent sold)
snew
if score(Gpath, Stest, E) > score(Gpath, Scurrent, E)
Scurrent
Stest
++number_iterations
if first iteration or score(Gpath, Scurrent, E) > score(Gpath, Spath, E)
Spath
Scurrent
Where score (gene subset, sample subset, expression matrix) is defined as the median of all genegene correlations (using Pearson's correlation) across all pairs of genes within the gene subset over the sample subset given the expression matrix. The numbers 600 and 20 were empirically derived to ensure convergence. To decide whether a set of genes is part of a pathway, Gpath is composed of all genes in the dataset that belongs to that specific putative pathway.
2.3 Performance evaluation
One application of the method is to test whether a gene set is behaving in a coordinated fashion, and whether the coordination is relevant to our notion of a biological pathway. This test pertains to the common biological observation that some small set of genes appears to be behaving similarly (as in a hierarchical clustering of expression patterns), but where it is important to ask more quantitatively whether this gene set is also related through a common pathway. We tested the algorithm's ability to yield high scores on sets of known pathway genes as compared with random sets of genes (to simulate false pathways) or randomized expression data (to simulate data with no signal).
The optimization method may also be applied to the question of biological pathway augmentation. Given a pathway for which the algorithm was able to recognize the gene set as a pathway, the scoring metric can then be used to rank the remaining non-pathway genes. Highly ranked genes could be hypothesized to also belong to the pathway. We tested the algorithm's ability to find new pathway genes by conducting a series of cross-validation experiments in which a percentage of randomly chosen known pathway genes were held out during optimization and scoring of the pathway. Then, QPACA's scoring methodology was used to rank-order the held-out pathway genes and compare them with the background of non-pathway genes, both those in other KEGG pathways and those not known to belong to any other pathways.
We considered both human and yeast expression datasets in developing and assessing QPACA. The Hughes compendium of yeast data (Hughes et al., 2000) (287 deletion mutants and 13 environmental conditions) was used primarily to develop and test the algorithm, while the NCI set of 60 human cancer-derived cell lines (Staunton et al., 2001) was used to validate the results in human. The datasets contained expression levels for 6356 and 7071 genes, respectively. The data were processed to remove duplicates, genes with excessive missing values, and genes with insufficient annotations. Because the algorithm computes correlations across at most 20 samples, a fairly stringent measure of missing values was used. Each gene had to have signal for at least 80% of the total samples. In the NCI60 dataset, this threshold eliminated roughly 3600 genes. The other genes were eliminated due to insufficient annotation (
700 genes) and consolidation of duplicates (
800 genes). The yeast dataset had fewer issues with missing annotations and duplicated genes, as well as better general signal strength. After processing, 6163 genes remained in the yeast dataset, while 1954 genes remained in the human dataset. Raw expression values were converted to log2 ratios by taking the logarithm of each gene's value divided by the mean for that gene across all samples. Additionally, data were normalized by median centering within each sample (this is an automatic feature of QPACA's data processing, as it can influence results markedly if omitted). Both datasets were analyzed in the context of all pathways in the KEGG database with at least five genes in the corresponding dataset. In the human dataset, the receptor tyrosine kinase signaling (RTK) pathway, as defined by local experts, was also analyzed. Figure 1 delineates the composition of this pathway. This resulted in a total of 83 yeast pathways and 108 human pathways. The pathways varied substantially in size, with the bottom half containing fewer than 20 genes, and the top quartile containing 30 genes or more.
2.4 Permutation testing and p-value calculation
We tested the algorithm's ability to recognize pathway membership and to select samples relevant to a gene set. We compared the scores for each known pathway gene set (coupled with its experimental expression data) with three distributions: (1) scores resulting from randomly chosen gene sets of the same size as the pathway set in question; and (2) scores resulting from randomizing the gene expression data itself and (3) scores resulting from randomly chosen gene sets of the same size as the pathway set in question, restricted to genes represented within KEGG. In each permutation test, the randomized data come from the same microarray experiment as that used for the pathway case being tested. The first permutation method estimates the likelihood that any equal-sized gene set will find a sample subset which scores equal to or better than the test. If the probability is low, confidence of the metric's rationality increases. The second randomization method controls for the effects of the distribution of data values within a particular set of genes, which may yield unusual distributions of correlation values under certain degenerate conditions. Given a particular set of genes from a particular dataset, data were randomized for each gene across all samples. The third method is a further control to test for the possibility that KEGG genes as a set may be biased.
Permutation Method 1: Randomize pathway gene set, Gpath, do not change E (expression matrix).
for i in 1 to 500 permutations:Grandom
choose a random set of genes from G of the samesize as Gpath.
Sfinal
optimize(Grandom, E)
ScoreArray[i] = score(Grandom, Sfinal, E)
++i
Permutation Method 2: Randomize the expression matrix, E, do not change Gpath.
for i in 1 to 500 permutations:E'
randomized matrix E, where each gene's values are permuted across samples (distribution of each gene is unchanged)
Sfinal
optimize(Grandom, E')
ScoreArray[i] = score(Grandom, Sfinal, E')
++i
Permutation Method 3:Same as Method 1, but restrict Gpath to those genes found in KEGG pathways.
In each case, we computed the p-value for Gpath by comparing the score (Gpath, Spath, E) with the distribution of scores in ScoreArray where
![]() |
2.5 Cross-validation in biological pathway augmentation
We tested the algorithm's ability to predict additional pathway members by performing cross-validation experiments in which we held out a subset of known pathway genes. We identified the set of pathways for which all three permutation-based p-values were less than 0.1, in order to eliminate those pathways for which prediction would be unlikely to work. This matches how one would proceed in practice, where one would not choose to believe predicted augmentation of a pathway for which the recognition process failed. For each of the resulting 101 pathways, we randomly chose 10% of the known pathway genes to hold out. Then, the optimization algorithm was run on the remaining pathway genes, resulting in a score for the pathway and subset of relevant samples. Using this subset of samples, the remaining genes, both holdouts and a background of non-pathway genes, were scored by adding them to the pathway set and recomputing the score. The resulting gene scores were ranked. This procedure was carried out a total of five times for each pathway. Two separate backgrounds of non-pathway genes were examined: (1) those known to occur in other KEGG pathways and (2) those not known to occur in any other pathways. Additionally, we also tested the ability of the optimization procedure to affect the gene rankings by scoring the held-out and non-pathway genes with no optimization step.
| 3 RESULTS |
|---|
|
|
|---|
Representative results for gene set recognition on a diverse and non-overlapping set of yeast and human experimental datasets can be seen in Tables 1 and 2. Table 1 reports the most pessimistic p-value computed from the three methods described above, while Table 2 provides the breakdown of all three individual p-values. A p-value of X can be interpreted as the probability that a random set of genes will yield a p-value of less than or equal to X. For this limited set of pathways, those for which subselection did not improve the significance of the score included the metabolic pathways (e.g. galactose metabolism) which are likely to show coordinated expression under many different conditions. In these pathways, the score was highly significant both with and without subselection. There were a few pathways, however, for which the score was not significant in either condition. In these cases (inositol phosphate metabolism and phosphatidylinositol signaling), it is possible that the perturbations in these pathways either did not show up in these particular datasets or that the effects were outside the scope of expression data, e.g. through post-translational modifications. In none of these cases did the subselection process cause a significant result to become insignificant. Note, however, that we occasionally observed significant variability in the p-values obtained using the three different methods. This is expected, to the extent that each of the different permutation methods will yield different distributions of gene expression values. In the random data case, the distribution for the control is exactly the same as in the non-control. In the other two types of permutations, this is not so, and consequently, there were differences depending on the method used.
|
|
Figure 3 illustrates the comprehensive results obtained from all 191 human and yeast pathways. Panel A shows the cumulative histograms for the three different permutation methods. While there are some differences in the distributions, particularly at very low p-values, the three methods do not produce substantially different interpretations of the data. Using the most pessimistic values (as in Table 1), in 61% of all cases, we observed significant scores under optimization (p
0.05). For the reason of distributional equivalence, we focused on the randomized data method. Panel B shows the effect of both the optimization procedure and larger gene set sizes for the data randomization permutation method. The optimization procedure clearly shifts the p-value distribution to the left, and the subset of pathways with larger gene set sizes also shifts the distribution to the left (both observations are highly statistically significant). Using the data randomization permutation method, for pathways with at least 30 genes, we observe 96% recognition at a false positive rate of 0.05. A more conservative test, using the most pessimistic of all permutation methods, yields a recognition rate of 83%.
|
These results with QPACA address the first step in computational pathway structure elucidation using gene expression data: given a set of genes hypothesized to be part of a pathway, QPACA can recognize these while rejecting random pathways. The next step, which is clearly both more challenging and more exciting, is prediction of genes that should be considered for addition to a partially elucidated pathway. For the 101 pathways from the foregoing experiments where QPACA recognized the gene sets as pathways, we performed cross-validation experiments by holding out random subsets of known pathway genes and using QPACA's scoring methodology to rank-order the held-out pathway genes within a background of non-pathway genes (see Methods for additional details). Figure 4 summarizes the results using two background gene sets: one consisting only of genes found in KEGG and the other of non-KEGG genes. Figure 4A shows smoothed histograms of scores for holdout RTK pathway genes and non-pathway genes (KEGG control gene set). The distributions are roughly normal, with the scores for the RTK pathway genes clearly shifted to the right. The RTK case was fairly typical in terms of enrichment of high scores within the pathway holdout set versus the non-pathway set. Figure 4B shows the corresponding representative receiver-operating characteristic (ROC) curve for the RTK pathway and for two other human pathways.
|
Figure 4C and D summarize cross-validation results for the full set of 101 tested pathways with two different backgrounds. In Figure 4C, the cumulative distributions of ROC areas for the two control gene sets are shown. Both distributions are highly skewed to the right of 0.5 (p < 106 by exact binomial), indicating that positive enrichment for pathway genes over non-pathway genes occurs in the vast majority of cases. The non-KEGG control gene population yields a slightly stronger enrichment. Figure 4D shows cumulative histograms of maximal enrichment ratios for both control gene populations. Given a specific portion of the top-ranked genes in a ranked list, the enrichment reflects the ratio of the number of true positive genes found divided by the number of such genes expected by chance. The maximal enrichment ratio is the highest such ratio over the full range of possible proportions. Enrichments of 2-fold or better occur for 60 and 70% of the pathways using the KEGG and non-KEGG control gene sets, respectively. Enrichments of 10-fold or better occur in 12 and 20%, respectively. Note that all of the results in Figure 4 resulted from running the subset optimization procedure described previously. On the same set of 101 pathways, without running the optimization procedure, the chief difference in the results is a marked increase in the number of pathways for which the resulting ROC areas were significantly less than 0.5 (7% of non-optimized cross-validation runs yielded ROC areas <0.485 compared with 0% of the runs with optimization).
Recall that the RTK pathway (Fig. 1) was a hand-curated representation of genes downstream of receptor tyrosine kinase signaling, particularly emphasizing pathways that affected cell-cycle regulation, proliferation and apoptosis (each particularly important in cancer). Also, note the nominally better enrichments observed in using the non-KEGG genes as a control set as compared with the KEGG gene set. We hypothesized that the high-scoring control genes from the KEGG population of nominally non-RTK genes might, in fact, include genes that properly belonged to the RTK pathway, based on existing literature evidence. We considered the top 30 highest-scoring genes (the right-hand tail of the Other Genes distribution from Fig. 4A). Figure 5 depicts the documented relationships of 12 of these genes to the RTK pathway as originally curated. Of the 30 highest-scoring KEGG genes (all nominal false positives in our cross-validation experiment) 15 properly belong to the RTK or very closely related pathways.
|
| 4 DISCUSSION |
|---|
|
|
|---|
Our results with QPACA have addressed two steps in computational pathway structure elucidation using gene expression data. First, given a gene set postulated to be part of a pathway or coordinated process, we have demonstrated the ability to discriminate between real pathways and non-pathways. Second, given a gene set known to represent part of a pathway, we have demonstrated the ability to significantly enrich for pathway genes over non-pathway genes in making predictions of putative pathway members. We term the first problem pathway recognition, and the second problem pathway augmentation. Specifically, for the pathway recognition problem, in 61% of known pathways (83% for larger pathways), QPACA was able to answer affirmatively while rejecting random pathways over 95% of the time. For the pathway augmentation problem, QPACA was able to yield enrichments for predicted pathway genes over random genes at rates of 2-fold or better the majority of the time, with rates of 10-fold or better 1020% of the time, depending on the background used. While enrichment rates on the level of 2-fold may or may not be of practical significance, we believe that rates of 10-fold may help prioritize experimental exploration of candidate genes for elaborating pathways of particular biological interest.
QPACA's pathway representation is flexible, extensible and accessible. By employing an automated method for identifying relevant sample subsets, we have been able to move into complex pathways relevant to major disease processes in humans. There are two major areas for future work. First is extending the recognition and prediction process to make use of new scoring metrics and optimization strategies. Given that the current metric yielded positive results, we are optimistic that a more refined scoring metric would be able to tease more information out of the data. As this work is extended, other metrics will be considered that move beyond the simple correlation metric presented here, which will allow both direction and magnitude to be taken into account in determining the score. In particular, we plan to examine metrics that include pathway structure information (e.g. by making use of graph-based algorithms that identify minimum spanning trees, maximum flow networks, etc.). We are also interested in exploring what synergy might arise by combining sequence-based analytical methods (e.g. motif finding in gene promoter sets) with the information derived from measurements of variation in gene expression or protein levels. Yet more challenging problems, including recognition and prediction of interactions (edges in the pathway graph) and full pathway induction (as opposed to just augmentation) are also areas for further research. An exciting future direction of this work would be joint subselection of both genes and samples simultaneously, which may offer an approach to the induction problem, beginning with a large set of genes and selecting a smaller highly coordinated set.
| Acknowledgments |
|---|
The authors are grateful to Frank McCormick and Joe Gray for fruitful discussion and collaboration. The authors would like to acknowledge NIH/NCI for partial funding of the work (GM070481, CA64602 and the Cancer Bioinformatics Grid Project, CaBIG).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Steen Knudsen
Received on August 8, 2005; revised on October 7, 2005; accepted on November 3, 2005
| REFERENCES |
|---|
|
|
|---|
Becker, W., et al. (1998) Sequence characteristics, subcellular localization, and substrate specificity of DYRK-related kinases, a novel family of dual specificity protein kinases. J. Biol. Chem, . 273, 2589325902
Bertagnolo, V., et al. (1998) Nuclear association of tyrosine-phosphorylated Vav to phospholipase C-gamma1 and phosphoinositide 3-kinase during granulocytic differentiation of HL-60 cells. FEBS Lett, . 441, 480484[CrossRef][Web of Science][Medline].
Das, B., et al. (2000) Control of intramolecular interactions between the pleckstrin homology and Dbl homology domains of Vav and Sos1 regulates Rac binding. J. Biol. Chem, . 275, 1507415081
Delcommenne, M., et al. (1998) Phosphoinositide-3-OH kinase-dependent regulation of glycogen synthase kinase 3 and protein kinase B/AKT by the integrin-linked kinase. Proc. Natl Acad. Sci. USA, 95, 1121111216
Friedman, N., et al. (2000) Using Bayesian networks to analyze expression data. J. Comput. Biol, . 7, 601620[CrossRef][Web of Science][Medline].
Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141147[CrossRef][Medline].
Gulbins, E., et al. (1995) Molecular analysis of Ras activation by tyrosine phosphorylated Vav. Biochem. Biophys. Res. Commun, . 217, 876885[CrossRef][Web of Science][Medline].
Hall, M., et al. (1995) Evidence for different modes of action of cyclin-dependent kinase inhibitors: p15 and p16 bind to kinases, p21 and p27 bind to cyclins. Oncogene, 11, 15811588[Web of Science][Medline].
Hobert, O., et al. (1996) SH3 domain-dependent interaction of the proto-oncogene product Vav with the focal contact protein zyxin. Oncogene, 12, 15771581[Web of Science][Medline].
Holstege, F.C., et al. (1998) Dissecting the regulatory circuitry of a eukaryotic genome. Cell, 95, 717728[CrossRef][Web of Science][Medline].
Hughes, T.R., et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109126[CrossRef][Web of Science][Medline].
Humbert, S., et al. (2002) The IGF-1/Akt pathway is neuroprotective in Huntington's disease and involves Huntingtin phosphorylation by Akt. Dev. Cell, 2, 831837[CrossRef][Web of Science][Medline].
Ideker, T., et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929934
Ihmels, J., et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat. Genet, . 31, 370377[CrossRef][Web of Science][Medline].
Kanehisa, M., et al. (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res, . 32, D277D280
Liu, Y.F., et al. (2000) Activation of MLK2-mediated signaling cascades by polyglutamine-expanded huntingtin. J. Biol. Chem, . 275, 1903519040
Moores, S.L., et al. (2000) Vav family proteins couple to diverse cell surface receptors. Mol. Cell Biol, . 20, 63646373
Nagata, K., et al. (1998) The MAP kinase kinase kinase MLK2 co-localizes with activated JNK along microtubules and associates with kinesin superfamily motor KIF3. EMBO J, . 17, 149158[CrossRef][Web of Science][Medline].
Ohtoshi, A., et al. (2000) Human p55(CDC)/Cdc20 associates with cyclin A and is phosphorylated by the cyclin A-Cdk2 complex. Biochem. Biophys. Res. Commun, . 268, 530534[CrossRef][Web of Science][Medline].
Paley, S.M. and Karp, P.D. (2002) Evaluation of computational metabolic-pathway predictions for Helicobacter pylori. Bioinformatics, 18, 715724
Staunton, J.E., et al. (2001) Chemosensitivity prediction by transcriptional profiling. Proc. Natl Acad. Sci. USA, 98, 1078710792
Uetz, P., et al. (2000) A comprehensive analysis of proteinprotein interactions in Saccharomyces cerevisiae. Nature, 403, 623627[CrossRef][Medline].
Watson, D.K., et al. (1997) FLI1 and EWS-FLI1 function as ternary complex factors and ELK1 and SAP1a function as ternary and quaternary complex factors on the Egr1 promoter serum response elements. Oncogene, 14, 213221[CrossRef][Web of Science][Medline].
Weinmann, A.S., et al. (2002) Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis. Genes Dev, . 16, 235244
Yang, E., et al. (1995) Bad, a heterodimeric partner for Bcl-XL and Bcl-2, displaces Bax and promotes cell death. Cell, 80, 285291[CrossRef][Web of Science][Medline].
Zhang, B.H., et al. (2001) Specific involvement of G(alphai2) with epidermal growth factor receptor signaling in rat hepatocytes, and the inhibitory effect of chronic ethanol. Biochem. Pharmacol, . 61, 10211027[CrossRef][Web of Science][Medline].
Zhang, D., et al. (2005) DYRK gene structure and erythroid-restricted features of DYRK3 gene expression. Genomics, 85, 117130[CrossRef][Web of Science][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


S.
0
Scurrent
snew

30 genes using optimization (dashed line). For each case, the randomization permutation method was used. The optimization method clearly improves the p-value distribution over no optimization (p << 0.001 by MannWhitney), and restriction to larger gene set sizes also improves the p-value distribution (p << 0.001 by MannWhitney).
