Bioinformatics Advance Access originally published online on November 24, 2007
Bioinformatics 2008 24(2):250-257; doi:10.1093/bioinformatics/btm575
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Probabilistic path ranking based on adjacent pairwise coexpression for metabolic transcripts analysis
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, JAPAN
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Pathway knowledge in public databases enables us to examine how individual metabolites are connected via chemical reactions and what genes are implicated in those processes. For two given (sets of) compounds, the number of possible paths between them in a metabolic network can be intractably large. It would be informative to rank these paths in order to differentiate between them.
Results: Focusing on adjacent pairwise coexpression, we developed an algorithm which, for a specified k, efficiently outputs the top k paths based on a probabilistic scoring mechanism, using a given metabolic network and microarray datasets. Our idea of using adjacent pairwise coexpression is supported by recent studies that local coregulation is predominant in metabolism. We first evaluated this idea by examining to what extent highly correlated gene pairs are adjacent and how often they are consecutive in a metabolic network. We then applied our algorithm to two examples of path ranking: the paths from glucose to pyruvate in the entire metabolic network of yeast and the paths from phenylalanine to sinapyl alcohol in monolignols pathways of arabidopsis under several different microarray conditions, to confirm and discuss the performance analysis of our method.
Contact: takigawa{at}kuicr.kyoto-u.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Many enzymatic conversions between metabolites occur in every living cell; they are necessary for obtaining energy, for breaking down nutrients for building-block molecules, and for synthesizing complex biomolecules, such as nucleic acids and proteins. Cells have potentially multiple pathways (specific reaction sequences) to convert one metabolite to another. We write each biochemical pathway from metabolite c (initial substrate) to c' (final product) as a path with n consecutive conversions:
|
| (1) |
Knowledge of metabolites, reactions and their enzyme genes for various organisms has been rapidly accumulated in pathway databases (Kanehisa et al., 2006; Keseler et al., 2005). These data enable us to examine how individual metabolites are connected via chemical reactions and what genes are implicated in those conversion processes.
However, even for fixed c and c', the number of possible paths in the entire network becomes intractably large in many practical cases, particularly when the choice of genes g1, ... , gn are also considered. One reason is that there are several nodes in the metabolic network that are highly connected to other nodes. Another reason is that there exist a lot of isozymes, enzymes that differ in sequence but catalyze the same reaction. In particular, multicellular organisms often have many isozyme genes to adapt their metabolism to individual tissues or developmental stages. Furthermore, it is inevitable that currently available data on pathways include many biologically inappropriate paths mainly due to the ambiguity in those data. For example, one reaction a + b
c + d can be formally regarded as four conversions a
c, a
d, b
c and b
d, but one or more of them are meaningless.
Thus, it is strongly required to rank these many paths for finding the informative ones among them. We consider evaluating each path by the coexpression of included genes. That is, if a path includes many coexpressed genes, this path would be more worth than others and should be ranked higher. We can measure the coexpression of genes using microarray, which is a widely used technique to simultaneously monitor expression levels of thousands of genes.
Thus, assuming that a metabolic network and microarray datasets are available, we develop a method for listing all possible paths between two given (sets of) chemical compounds, and ranking them for the microarray datasets as a sorted list according to the extent to which highly coexpressed genes are included. Note that these two (sets of) compounds can be specified arbitrarily. Metabolism is generally a dynamic and coordinated activity, whose behavior differs in organisms, organs, tissues, subcellular location and external environment conditions. Therefore, under a specific condition, only portions of the possible paths would be preferentially coregulated, and thus would include highly correlated genes. In this article, we call such paths including many highly correlated genes the most functioning paths under a given condition. Since the sorted list of path relies heavily on the microarrays chosen, a different set of microarrays would return a different set of functioning paths. More importantly, when it is practically impossible to list all possible paths, for a specified k, our method outputs only the top k paths in a practical amount of computation time.
As well as the data on pathways, a lot of microarray datasets under various conditions have also been stored in several databases (Ball et al., 2005; Barrett et al., 2005; Parkinson et al., 2005). To use both of these rich data resources, several approaches have been proposed (Curtis et al., 2005). They can be classified into the following three major types: (i) functional enrichment generates a group of associated genes, often all upregulated, and compares them with a specified pathway (Draghici et al., 2003; Mootha et al., 2003; Tian et al., 2005), (ii) visualization overlays gene expression data onto pathways and manually groups genes (Dahlquist et al., 2002; DeRisi et al., 1997; Ideker et al., 2001; Mlecnik et al., 2005), and (iii) co-clustering groups genes in terms of both expression similarity and topological distance on pathways (Hanisch et al., 2002; Huang and Pan, 2006; Pan, 2006; Vert and Kanehisa, 2003; Zhu et al., 2005). In particular, the last approach includes path ranking (Pavlidis et al., 2002; Rahnenführer et al., 2004; Zien et al., 2000) that directly attempts to find the most functioning paths between given two metabolites. All of the above work essentially assumes expression patterns of all genes in the same group to be similar to each other.
When two arbitrary metabolites are specified, however, paths between them often lie across several coregulated units, and each unit may have a different expression pattern. Alternatively, they may have expression patterns gradually varying along pathways. Recently, two studies (Ihmels et al., 2004; Kharchenko et al., 2005) revealed the predominance of local gene regulation in metabolism, indicating that many transcriptional regulations in metabolism would be organized more locally rather than simultaneously as a group. Motivated by their results, we consider here the adjacent pairwise coexpression of genes, i.e. the coexpression of gene pairs whose corresponding enzymes can catalyze two consecutive reactions. Ihmels et al. (2004) showed how metabolically adjacent gene pairs formed local units of transcriptional coregulation.
| 2 METABOLICALLY ADJACENT PAIR OF GENES |
|---|
|
|
|---|
Multiple paths in metabolic pathways are caused by branchpoints such as compound in the following example.
|
|
Recent studies suggested that genes adjacent by this type of branchpoints are predominantly coregulated. For example, Ihmels et al. (2004) reported the following two findings at the transcriptional level: (i) at most of the branchpoints, only one of the outgoing branches is significantly coregulated with the incoming branch. This means that in our example, only one of g4 and g5 can be coregulated with g1; (ii) when an incoming branch has isozymes, distinct outgoing branches are often preferentially coexpressed with alternative isozymes. For example, assume that g1 and g2 are coregulated with g4 in our example. Then, this implies that gene g3 is preferentially coexpressed with g5. These findings imply that we can evaluate the gene-level activity of each path through two kinds of coexpression: coexpression of adjacent gene pairs across a branchpoint of the path, and coexpression between genes at another point in the path. The adjacent pairwise coexpression we used as the criterion is one reasonable way to take advantage of both of these two aspects. Furthermore, this idea allows us to develop a new, efficient method to tackle a large-scale combinatorial problem.
In addition, for each reaction with multiple isozymes, Ihmels et al. (2004) pointed out the following two different functions of isozymes. (1) Redundant isozymes: isozymes could provide redundancy that may be needed for making the reaction robust against genetic mutations, or for boosting metabolite production. These redundant isozymes are expected to be coregulated. (2) Specific isozymes: alternatively, distinct isozymes could be dedicated to separate biochemical pathways using the corresponding reaction. These specific isozymes are expected to be differentially expressed with alternative processes. In the previous example, isozyme genes g1 and g2 can be coexpressed redundant isozymes, whereas g3 can be a differentially expressed specific isozyme. These two classes can be examined through our path ranking problem: for given conditions, if multiple isozyme genes for a conversion appeared in equally high-scoring paths, those genes will be redundant. On the other hand, if there are multiple alternative isozyme genes for a conversion but only a few are in high-scoring paths, then such genes are considered to be specific.
It also should be noted that the adjacent pairs of genes are usually positively coexpressed (Ihmels et al., 2004; Kharchenko et al., 2005). Hence, we examine only positive correlations for analyzing adjacent pairs, although any other correlation measure can be substituted into our problem setting.
| 3 PROBABILISTIC PATH RANKING WITH ADJACENT PAIRWISE COEXPRESSION |
|---|
|
|
|---|
We briefly illustrate the essential ideas of our method in this section, while the technical details are described in the next section.
Our goal is to return all possible paths between two given compounds in a metabolic network, and rank them as a sorted list using a microarray dataset according to the extent to which highly correlated gene pairs are included.
Let us consider an example of scoring the following path
between c0 and c4.
|
|
The numbers between the genes indicate Pearson correlation coefficients between their expression patterns over a set of microarrays. The questions to address are to what extent this path includes highly correlated gene pairs and what criterion we can use to evaluate it. To answer these questions, we have to give an exact meaning to what highly-correlated is. We consider this point by evaluating how large each value, say the first 0.8, is among all possible pairs in the data.
For simplicity, assume that we have only four genes, g1, g2, g3 and g4 in the given microarray dataset, and thus there are six possible pairings: (g1, g2), (g2, g3), (g1, g3), (g3, g4), (g2, g4), and (g1, g4). Also, suppose the correlation coefficients for these pairs are respectively 0.8, 0.7, 0.5, 0.1, –0.1 and –0.4. This situation simulates the case where pair (g1, g2) with 0.8, as well as pair (g2, g3) with 0.7, is coexpressed, and because of this, pair (g1, g3) is also weakly coexpressed (with 0.5), while the other pairs related to g4 are not coexpressed (the values are 0.1, –0.1 and –0.4). Let
be this set of all possible six pairs.
We can define the score prob(
) for the above path
as the probability of the event that
|
|
0.75) = 0.95. In our example, event X1
0.8 always holds true because 0.8 is the maximum in
0.8) = 1. At the same time, event X2
0.7 corresponds to the cases when we select one pair except (g1, g2) from
0.7) = 5/6
0.83. Similarly, we can obtain P(X3
0.1) = 3/6 = 0.5. In this way, these probabilities for respective events can be computated independently even though the correlation coefficient of (g1, g3) would also be high because of (g1, g2) and (g2, g3). The reason why these events are independent from each other is that we can always use the same set of pairs (i.e. set
) can be reduced to |
|
This score is based on the deviation of the observations from the model where pairs are independently sampled. When large correlation coefficients are observed consecutively for a path, this results in a higher path score. This is because they are improbable in independently sampled pairs and almost all independently sampled pairs will have values less than observed. In this way, our scoring scheme described above would be one reasonable answer to the question on what criterion we can use to evaluate the extent a path includes highly correlated gene pairs.
We note that we can calculate the confidence of an obtained path score by some criterion, such as P-value. For example, we can artificially generate the same number of independently sampled pairs as the number of gene pairs on the path, and then check directly how large the difference is between the two corresponding scores with repeating as necessary. Our scheme gives an explicit formula for the P-value without the need for such repeated sampling because we can derive the exact form of the distribution for the path scores, as shown in the next section.
Given two compounds, all possible paths between them are scored according to this definition and sorted as a list for path ranking. However, the number of paths is too large in many cases. By specifying k, our ranking algorithm described in the next section can output only the top k paths in the list in a practical amount of computation time.
| 4 METHODS |
|---|
|
|
|---|
4.1 Conjunctive approach: adjacent-pair factorization of path probability
Let
g'}, and calculate the similarity µ(g, g') of the pair. In this article, we used the Pearson correlation coefficient between their expression levels over microarray experiments as the similarity measure.
When we repeat this random selection independently, the obtained value µ(g, g') can be thought of as probabilistically arising as a result of random selection for (g, g')
. Hence, we can interpret the observed value µ(g, g') as a realization of a random variable X drawn independently from a fixed (cumulative) distribution function F.
Under the above setting, we define the score for a gene pair (g, g') by P(X
µ(g, g')) = F(µ(g, g')) that can be estimated by the empirical distribution
. Here, the notation
is an indicator for event A that takes 1 if A is true, 0 otherwise.
Let Xi denote a random variable for the similarity value obtained by the i-th selection. Then, since each random selection is assumed to be done independently, we can obtain the score for a path
with
of length |
| by a factorizable conjunctive probability:
|
| (2) |
|–1 are random variables independently identically distributed from a fixed distribution function F.
Furthermore, by comparing to the random situation, we also can obtain the explicit form for the P-value of a score x of the path including |
| genes as
|
| (3) |
4.2 The k maximum-likelihood estimation for path ranking
Given two metabolite sets C and C', our purpose is to rank feasible paths between any c
C and c'
C' according to the path score prob(
). More precisely, the goal is to provide a method for efficiently computing the k-best solutions of the following optimization problem imposed by maximum-likelihood estimation:
|
| (4) |
from c to c' exists. By taking advantage of factorizability for prob(
), we can reduce this problem to the k shortest loopless path problem on a non-negatively edge-weighted digraph. Let us consider a digraph G: = (V, E) that satisfies the following three properties:
- Each vertex v
V is in the form of the ordered triplet (g, c, c') corresponding to each
.
- Each edge (v, v')
E is defined as an ordered pair satisfying
and each weight is assigned by the value of –log F(µ(g, g')), which is always non-negative because 0 < F(x) < 1.
(5)
- In addition to the above, we introduce two special vertices, a super-source vertex s and a super-sink vertex t connected to vertices {(
,c,
) | c
C}
V and {(
,
,c') | c'
C '}
V, respectively, both with the weight of 0.
We can then obtain the k-best solutions of the optimization problem Equation (4) by solving the k shortest loopless s-t path problem on G due to the following equivalence:
|
| (6) |
4.3 Remark on the averaged score
This conjunctive approach prefers shorter paths, since the score of a shorter reaction chain tends to be larger than that of a longer one. To avoid this issue, we can use the averaged version of Equation (4) described as follows:
|
| (7) |
4.4 Computational issues
To solve the k shortest loopless path problem, our implementation uses the Yen–Lawler algorithm (Lawler, 1972; Yen, 1971) with relaxed heaps (Driscoll et al., 1988), which is a classical algorithm but still has the best worse-case bound of O(kn (m + n log n)) for digraphs with n nodes and m edges. Our method can search all paths between given compounds in polynomial time, whereas currently available methods are based on the exhaustive generation of all possible paths in advance (Zien et al., 2000). We note that the loopless restriction is practically important because a metabolic pathway has many loops with weights of zero. This restriction makes the problem harder than the non-restricted case, since finding k shortest paths (allowing loops) requires only O(m + n log n + k) by a well-known algorithm (Eppstein, 1998).
| 5 RESULTS |
|---|
|
|
|---|
5.1 Analysis I: metabolically adjacent pair of genes
Our method is based on highly correlated gene pairs. Thus, at first, we investigated to what extent they are adjacent, and also how often highly correlated adjacent pairs are consecutive in the entire metabolic network.
In previous works such as Ihmels et al. (2004) and Kharchenko et al. (2005), the averaged correlation coefficient at each metabolic distance, i.e. the minimum number of reactions for connecting genes, was investigated. However, because of differentially expressed isozymes, this averaged correlation coefficient can reveal only the rough relationship between the expression similarity and the metabolic distance. For this reason, we focused on only highly correlated gene pairs and then examined the distribution of their metabolic distances. More concretely, for each gene pair (g, g') in the highest 5% correlated pairs among all, we assessed the minimum number n of required intermediates c1, ... , cn such that a path exists from g to g' (or from g' to g) as
|
|
We define this pair (g, g') linked by n minimum intermediates as an n-adj pair. Among each set of pairs such that
for fixed c1, c2 and c3, i.e. the 1-adj pairs that share the same consecutive conversions, we call the pair with the maximum correlation coefficient a max-1-adj pair.
We used two microarray datasets for our analysis: (i) Gasch (Gasch et al., 2000) of 6 361 genes and 156 conditions retrieved from Stanford Microarray Database (Ball et al., 2005) with default preprocessing1 includes many stress conditions. (ii) Hughes (Hughes et al., 2000) of 6 312 genes and 300 conditions downloaded from the website2 of Rosetta Inpharmatics LLC contains diverse mutations and chemical treatments. For the metabolic network, we used all available pathways of yeast metabolism in KEGG (Kanehisa et al., 2006). The 82 pathways in category Metabolism of the KEGG Orthology had all deleted outgoing branches from the ubiquitous compound CO2. We also used all yeast genes annotated in this category, which resulted in 645 enzyme genes after excluding genes of parasites and isolated genes with no edges to any other genes. Throughout this article, whenever computing the correlation coefficient between two genes, we skipped pairs of expression values if one of the values was missing. We made use of the correlation coefficient of a gene pair unless more than 75% of the value pairs were skipped.
Figure 1a shows the distribution of correlation coefficients of all (unordered) gene pairs in two microarray datasets, Gasch and Hughes. We extracted the top 5% coexpressed gene pairs, and classified them according to their minimum number i of intermediates. The height of each bar at number i in Figure 1b indicates the ratio of the number of i-adj pairs in the top 5% of the distribution in Figure 1a to the number of all i-adj pairs. This figure shows that the ratio of i-adj pairs with a smaller i, particularly that of 1-adj pairs, is sufficiently higher than 5%. This indicates that highly correlated pairs are likely to be adjacent (to be 1-adj pairs) in both microarray datasets.
|
Next, we investigated to what extent the 1-adj pairs in the top 5% are chained. The numbers in the dark bars in Figure 1b shows the number of pairs that cannot be linked by chains of only 1-adj pairs in the top 5%. After removing the effects of 1-adj, the ratios of 2-adj to 5-adj pairs are also below around 5%. This suggests that local transcriptional coregulation in the metabolic network can be mostly represented by 1-adj pairs in a statistical sense, which also implies that correlated 1-adj pairs are often consecutive in the metabolic network.
Finally, we examined the effect by distinguishing isozyme genes to confirm another advantage of our gene-level path ranking. Figure 1c shows the distributions of the correlation coefficients of all, 1-adj, and max-1-adj pairs. As shown in the areas filled with dark colors, the probability P(X > 0.5) for each one are 11.0%, 21.7%, 32.8% for the Gasch data, and 2.3%, 6.7%, 15.4% for the Hughes data. The distribution shifted in the positive direction for both datasets. We could see that the probability for observing correlated pairs (corr > 0.5) changed significantly. Furthermore, for quantitative evaluation, we performed a two-sample Kolmogorov–Smirnov test that is a general method used to determine whether two univariate distributions differ. The P-values from the tests were always less than 2.2 x 10–16 for all possible comparisons. Thus, with very high confidence, we can reject the null hypothesis that the two distributions are the same, for all combinations. These results suggest that, even for gene-level paths of the same length originally corresponding to the same compound-level path, the coexpression of genes can drastically differ according to which isozyme genes are included in the path. This motivates us to distinguish isozyme genes for scoring paths.
All these facts show the potential in basing our method on the coexpression of adjacent pairs so that higher scoring paths include more highly correlated genes. Microarray data reflect various levels of transcriptional regulation, and focusing on only local ones can be one informative approach for the beneficial use of such datasets.
5.2 Analysis II: glycolysis of Saccharomyces cerevisiae
Using the same microarray and metabolic network for yeast as Analysis I, we next investigated the top 1000 paths returned by the proposed method from either
-D-glucose (C00267
[GenBank]
) or β-D-glucose (C00221) to pyruvate (C00022). Figure 2a and b show the score distributions of the top 1000 paths for the Gasch and Hughes datasets, respectively. The 24 and 26 highest scoring paths (indicated as dashed circles in the score distribution of the two datasets) have been combined into Figure 2c and d.
|
This result shows that the paths in Figure 2c and d are consistent with a typical route of glycolysis pathways, including the branch to Glycerone phosphate (C00111). Furthermore, another finding is that the paths that are well known in the literature were ranked at the top, while those which are ranked beyond the top 1000 are apparently unacceptable, because they come from the ambiguity of the currently available pathway database (see the Supplementary Material for details).
For comparison, we also investigated the k shortest paths (by path length) in the compound network, specifying the same two compounds, i.e. C00267 [GenBank] (or C00221) and C00022 (see the Supplementary Material for the details). The shortest path length is 9 which is the same length of the top ranked path by our method. However, we obtained 6 shortest paths of length 9, 29 paths of length 10, 66 paths of length 11, 100paths of length 12 and so on, and there are many paths of the same length that are ranked equally. This means that a larger number of gene paths will tie if we use the shortest path criterion instead of our coexpression criterion. In the above case, the number of gene paths corresponding to the shortest path in the compound network was over 1000. On the other hand, among the top 50 paths of our result for Gasch dataset, there are only 20 paths of length 9. Our method can distinguish those tied scoring paths, and often can rank only portions of many paths of the same length among the top.
Furthermore, our method found two functions of isozyme genes described in Section 2: redundant isozymes and specific isozymes. First, as shown in Figure 2c, all three isozymes, TDH1/YJL052W, TDH2/YJR009C, TDH3/YGR192C are shown for C00118
C00236, implying that all of them are redundant isozymes. Similarly, this is true for the two isozymes, ENO1/YGR254W, ENO2/YHR174W, for C00631
C00074. Next, three isozymes, GPM1/YKL152C, GPM2/YDL021W, GPM3/YOL056W are already known for C00197
C00631, but only GPM1 was shown in both Figure 2c and d, implying that GPM1 is a specific isozyme. GPM2 and GPM3 are homologous to GPM1, and the active site residues of these two proteins are the same as those of GPM1. However, neither of them complements a GPM1 deletion mutant, and the glycolysis is neither affected in the GPM2 nor GPM3 single or double deletion mutants (Heinisch et al., 1988). Our result reflects this fact since GPM2 and GPM3 appear to constitute non-functional homologs of GPM1. Similarly, both Figure 2c and d show only HXK2 for C00221/C00267
C01172/C00668 out of three candidates HXK2/YGL253W, HXK1/YFR053C, GLK1/YCL040W. HXK2 is the predominant isozyme during the growth on glucose, and appears to play the main role during glucose phosphorylation in vivo. HXK2 also shows both cytoplasmic and nuclear localization (Randez-Gil et al., 1988), whereas both HXK1 and GLK1 proteins are found in the cytosol. Cytoplasmic HXK2 protein is a key enzyme in glycolysis, whereas nuclear HXK2 protein is involved in signaling the glucose-induced repression of the HXK1 and GLK1 genes and the glucose-induced expression of its own gene, HXK2 (Rodriguez et al., 2001). In addition, when cells are shifted to a non-fermentable carbon source, the HXK2 gene is repressed and the HXK1 and GLK1 genes are rapidly de-repressed (Rodriguez et al., 2001). Thus, HXK2 are unlikely to be coexpressed with any of {HXK1,GLK1}, which is consistent with our results.
5.3 Analysis III: monolignol biosynthesis of Arabidopsis thaliana
Finally, we show an example of applying our method to the specific metabolism of higher organisms whose genetic control is considered to be specialized to each tissue or organ. We considered all possible paths from phenylalanine (C00079) to sinapyl alcohol (C02325) for syringyl (S) units of lignin, in monolignol synthesis pathways of A.thaliana.
For the microarray dataset, we used AtGenExpress Developmental Series data from NASCArrays (Craigon et al., 2004) that uses Affymetrix ATH1 arrays (25k full-genome chips). We manually selected only arrays of wild-type Columbia (Col-0) resulting in 22 814 genes and 141 conditions (each corresponding to a tissue: 45 for flowers/pollen, 54 for leaves, 21 for stems/shoots and 21 for roots). The topology and the gene annotations of this pathway were taken directly from the literature (Raes et al., 2003).
We emphasize that this problem has two challenging features: (i) Arabidopsis has many isozymes; e.g. 14 candidate genes are currently annotated for caffeic acid O-methyltransferase (COMT). (ii) There are many possible paths that differ only in the order of the genes. We thus have to determine the order as well, but conventional simultaneous coexpression would be insufficient for this purpose. The corresponding compound network of this pathway is grid-like, and thus the length of all possible paths between C00079 and C02325 are all 11. However, for a path of length 11, if each chemical reaction has m isozymes, the number of possible paths in this gene network becomes m11, obviously which can be practically intractable by a brute-force approach.
Our method selected the top 1000 paths from C00079 to C02325 for the entire 141 microarray experiments (all) as well as for the sub-experiments with each tissue (flowers/pollen, leaves, stems/shoots, roots). Figure 3 shows the distributions of path scores under the above five cases. Our method can contribute to cross-array comparison under the same principle due to probabilistic normalization in the array such that each path score is in the range [0,1]. In the comparison between different tissues, the highest path scores for each case were 0.8657 for all, 0.4422 for flowers/pollen, 0.5261 for leaves, 0.9554 for stems/shoots and 0.6408 for roots. The scores in stems/shoots were saliently high compared to the scores in other tissues, and this difference would contribute to the entire score of all. This is consistent with the fact that stems are the most lignified tissue of arabidopsis. We then examined the 23 and 47 paths with the highest scores (p < 10–8) for all and stems/shoots, respectively. These included well-studied genes only, meaning that no hypothetical genes appeared. On the other hand, we note that some of them appeared in the highest scoring paths for other tissues, whose scores were clearly lower. See the Supplementary Material for further details.
|
As in Analysis II, we can find a typical example of both redundant and specific isozymes among 13 candidate isozymes for C00811
C00223. Four enzymes, 4CL1/At1g51680, 4CL2/At3g21240, 4CL3/At1g65060 and 4CL4/At3g21230, are well known among all the candidates, and only three (4CL1, 4CL2 and 4CL4) are chosen out of the thirteen candidates (see the Supplementary Material for details). The 4CL proteins fall into two classes according to phylogenetic analysis (Ehlting et al., 1999), i.e. 4CL1, 4CL2 and 4CL4 in class I, and 4CL3 in class II. Both 4CL1 and 4CL2 are related by function to monolignol biosynthesis during developmental lignification (Raes et al., 2003), though 4CL3 corresponds to another function (Ehlting et al., 1999). These facts are consistent with our result, and we can find other similar examples in our analysis. | 6 DISCUSSION |
|---|
|
|
|---|
The idea that expression patterns of all genes in a group are assumed to be similar to each other has been widely considered for analyzing coexpression in metabolism. For example, Zien et al. (2000) developed a method for ranking metabolic paths on the basis of this idea. In contrast, recent papers showed local coexpression along metabolic pathways by averaged correlation analysis (Ihmels et al., 2004; Kharchenko et al., 2005), suggesting that analyzing local coexpression, especially 1-adj coexpression, is helpful for understanding how enzyme genes are transcriptionally coregulated for converting one metabolite to another. We also confirmed the importance of local coregulation in metabolism by using highly correlated gene pairs in microarrays. We then developed a time-efficient method based on the coexpression of adjacent gene pairs for finding highly functioning paths in a given dataset. The results indicate that the obtained paths for two different types of experiments, including the monolignol pathway still under study, are consistent with existing studies. The results also suggest that our method found some clear examples of redundant and specific isozymes that are important, especially for transcriptional coregulation at metabolic branchpoints. Other potential strengths of our method include the following: (i) it can be applied to the case of transitively varying expression along pathways such as in time series data; (ii) it is independent from the measure of the similarity between genes, and another measure can be used in our method.
In this article, we focused only on chains of pairwise coexpression as motivated in Sections 2 and 5.1. However, taking higher-order dependencies among genes into account would be an interesting direction for future work.
Currently, various kinds of microarray experiments and metabolic knowledge are being rapidly accumulated in public databases. We believe that our method can contribute to a better understanding of metabolic coregulation under a wide variety of conditions.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Dr Toshiaki Tokimatsu and Dr Masanori Arita for helpful discussions on the monolignol biosynthesis pathway of A.thaliana. This work is partly supported by the Grant-in-Aid for Young Scientists (B), The Ministry of Education, Culture, Sports, Science and Technology, Japan.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Jonathan Wren
1http://smd.stanford.edu/cgi-bin/publication/viewPublication.pl?pub_no=37 ![]()
2http://www.rii.com/publications/2000/cell_hughes.html ![]()
Received on July 7, 2007; revised on November 15, 2007; accepted on November 16, 2007
| REFERENCES |
|---|
|
|
|---|
Ahuja R, et al. Combinatorial optimization with rational objective functions: a communication. Math. Oper. Res., ( (1983) ) 8, : 314.[CrossRef][ISI].
Ball C, et al. The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res., ( (2005) ) 33, (Database issue): D580–D582.
Barrett T, et al. NCBI GEO: mining millions of expression profiles–database and tools. Nucleic Acids Res., ( (2005) ) 33, (Database issue): D562–D566.
Craigon D, et al. NASC Arrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res., ( (2004) ) 32, (Database issue): D575–D577.
Curtis R, et al. Pathways to the analysis of microarray data. Trends Biotechnol., ( (2005) ) 23, : 429–435.[CrossRef][ISI][Medline].
Dahlquist K, et al. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet., ( (2002) ) 31, : 19–20.[CrossRef][ISI][Medline].
DeRisi J, et al. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, ( (1997) ) 278, : 680–686.
Draghici S, et al. Global functional profiling of gene expression. Genomics, ( (2003) ) 81, : 98–104.[CrossRef][ISI][Medline].
Driscoll J, et al. Relaxed heaps: an alternative to Fibonacci heaps with applications to parallel computation. Commun. ACM, ( (1988) ) 31, : 1343–1354.[CrossRef].
Ehlting J, et al. Three 4-coumarate:coenzyme A ligases in Arabidopsis thaliana represent two evolutionarily divergent classes in angiosperms. Plant J, ( (1999) ) 19, : 9–20.[CrossRef][ISI][Medline].
Eppstein D. Finding the k shortest paths. SIAM J. Computing, ( (1998) ) 28, : 652–673.[CrossRef].
Feller W. An Introduction to Probability Theory and Its Applications, ( (1971) ) 2, . New York: John Wiley & Sons..
Gasch A, et al. Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, ( (2000) ) 11, : 4241–4257.
Hanisch D, et al. Co-clustering of biological networks and gene expression data. Bioinformatics, ( (2002) ) 18, (Suppl. 1): S145–S154.[Abstract].
Heinisch J, et al. Investigation of two yeast genes encoding putative isoenzymes of phosphoglycerate mutase. Yeast, ( (1988) ) 14, : 203–213.[CrossRef].
Huang D, Pan W. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics, ( (2006) ) 22, : 1259–1268.
Hughes T, et al. Functional discovery via a compendium of expression profiles. Cell, ( (2000) ) 102, : 109–126.[CrossRef][ISI][Medline].
Ideker T, et al. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, ( (2001) ) 292, : 929–934.
Ihmels J, et al. Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat. Biotechnol., ( (2004) ) 22, : 86–92.[CrossRef][ISI][Medline].
Kanehisa M, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., ( (2006) ) 34, (Database issue): D354–D357.
Keseler I, et al. EcoCyc: a comprehensive database resource for escherichia coli. Nucleic Acids Res., ( (2005) ) 33, (Database issue): D334–D337.
Kharchenko P, et al. Expression dynamics of a cellular metabolic network. Mol. Syst. Biol., ( (2005) ) 1, : 2005. 0016..
Lawler E. A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem. Manage. Sci., ( (1972) ) 18, : 401–405..
Mlecnik B, et al. PathwayExplorer: web service for visualizing high-throughput expression data on biological pathways. Nucleic Acids Res., ( (2005) ) 33, (Web Server issue): W633–W637.
Mootha V, et al. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet., ( (2003) ) 34, : 267–73.[CrossRef][ISI][Medline].
Pan W. Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics, ( (2006) ) 22, : 795–801.
Parkinson H, et al. ArrayExpress–a public repository for microarray gene expression data at the EBI. Nucleic Acids Res., ( (2005) ) 33, (Database issue): D553–D555.
Pavlidis P, et al. Exploring gene expression data with class scores. Pac. Symp. Biocomput., ( (2002) ) 474–485..
Raes J, et al. Genome-wide characterization of the lignification toolbox in Arabidopsis. Plant Physiol., ( (2003) ) 133, : 1051–1071.
Rahnenführer J, et al. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol., ( (2004) ) 3, . Article16..
Randez-Gil F, et al. Hexokinase PII has a double cytosolic-nuclear localisation in Saccharomyces cerevisiae. FEBS Lett., ( (1988) ) 425, : 475–478.[CrossRef].
Rodriguez A, et al. The hexokinase 2 protein regulates the expression of the GLK1, HXK1 and HXK2 genes of Saccharomyces cerevisiae. Biochem. J., ( (2001) ) 355, (Pt 3): 625–631.[ISI][Medline].
Tian L, et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci. USA, ( (2005) ) 102, : 13544–13549.
Vert J, Kanehisa M. Extracting active pathways from gene expression data. Bioinformatics, ( (2003) ) 19, (Suppl. 2): II238–II244.[Medline].
Yen J. Finding thek-shortest loopless paths in a network. Manage Sci., ( (1971) ) 17, : 712–716..
Zhu D, et al. Network constrained clustering for gene microarray data. Bioinformatics, ( (2005) ) 21, : 4014–4020.
Zien A, et al. Analysis of gene expression data with pathway scores. Proc. Int. Conf. Intell. Syst. Mol. Biol., ( (2000) ) 8, : 407–417.[Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




