Bioinformatics Advance Access originally published online on February 19, 2008
Bioinformatics 2008 24(7):995-1001; doi:10.1093/bioinformatics/btn056
Detecting hierarchical structure in molecular characteristics of disease using transitive approximations of directed graphs
1Computational Diagnostics Group, Institute of Functional Genomics, University of Regensburg, 93053 Regensburg, Germany and 2Max Planck Institute for Molecular Genetics, Ihnestaße 63/73, 14195 Berlin, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Molecular diagnostics aims at classifying diseases into clinically relevant sub-entities based on molecular characteristics. Typically, the entities are split into subgroups, which might contain several variants yielding a hierarchical model of the disease. Recent years have introduced a plethora of new molecular screening technologies to molecular diagnostics. As a result molecular profiles of patients became complex and the classification task more difficult.
Results: We present a novel tool for detecting hierarchical structure in binary datasets. We aim for identifying molecular characteristics, which are stochastically implying other characteristics. The final hierarchical structure is encoded in a directed transitive graph where nodes represent molecular characteristics and a directed edge from a node A to a node B denotes that almost all cases with characteristic B also display characteristic A. Naturally, these graphs need to be transitive. In the core of our modeling approach lies the problem of calculating good transitive approximations of given directed but not necessarily transitive graphs. By good transitive approximation we understand transitive graphs, which differ from the reference graph in only a small number of edges. It is known that the problem of finding optimal transitive approximation is NP-complete. Here we develop an efficient heuristic for generating good transitive approximations. We evaluate the computational efficiency of the algorithm in simulations, and demonstrate its use in the context of a large genome-wide study on mature aggressive lymphomas.
Availability: The software used in our analysis is freely available from http://compdiag.uni-regensburg.de/software/transApproxs.shtml
Contact: Juby.Jacob{at}klinik.uni-regensburg.de, Rainer.Spang{at}klinik.uni-regensburg.de
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
High-throughput genomic approaches have entered molecular diagnostics, generating large amounts of molecular data. Recently, clinical studies have been conducted, which screen a fixed set of patients with several complementary high-throughput approaches in parallel (Hummel et al., 2006). Among the most frequently used technologies in oncology are transcriptional profiling using gene expression microarrays (Alizadeh et al., 2000; van de Vijver et al., 2002; Jones et al., 2005), genomic profiling using array CGH or SNP arrays (Solinas-Toldo et al., 1997; Pinkel et al., 1998; Pollack et al., 1999). These are typically complemented by non-genome-wide molecular approaches like fluorescent in situ hybridization (FISH) and classical immunophenotyping or non-molecular characteristics like the morphology of cancer tissues, and clinical data typically including the age and sex of the patient as well as treatment response or survival. For example, a male cancer patient can carry two SNPs which are associated with his disease. At the same time his tumor displays genetic losses of four different chromosomal locations, and carries one chromosomal translocation. Moreover, the tumor cells display certain morphological properties and immunophenotypically express three antigens. Finally a gene expression profile shows a typical pattern for a molecularly defined subtype of the patient's disease. No other patient in the study has the exactly same characteristics. Nevertheless, the challenge is to assign this patient to a well-defined subgroup in an hierarchical classification scheme of his disease. To do so, intrinsic hierarchical structure in the integrated data sets needs to be made apparent.
Here we describe a novel computational approach, which is aiming at uncovering hierarchical structure in complex datasets. We aim at identifying molecular characteristics, which are stochastically implying other molecular characteristics. We explain this by an example: Assume that 95% of all tumors with the immunohistochemical property B also display the expression pattern A. Assume also that 91% of all tumors with the chromosomal amplification C and 100% of all tumors with the expression pattern D display the immunohistochemical property B. A toy dataset representing this constellation is shown in Figure 1a. The patients with property A are a noisy superset of those with property B, while this set contains the two not necessarily disjoint subsets C and D, shown in Figure 1b. Up to a small number of exceptions property B implies property A and properties C and D always imply property B. We encode this hierarchical structure by the graph shown in Figure 1c. Our algorithm aims at extracting such hierarchical structure.
|
We proceed in two steps, which we call graph construction and graph consolidation. The latter is based on the transitive approximation problem and is the main focus of the article.
The construction step: In the construction step we screen all pairs of molecular properties for potential noisy subset/superset relations and encode these in a primary graph by drawing directed edges from supersets to subsets. More precisely, for any two molecular features A and B, if
percent or more of patients with feature B also exhibit feature A then we say B is a noisy subset of A and we draw an edge from A to B. We obtain the primary graph by screening all pairs of molecular features for possible subset/superset relation using the above criteria. We note that in case two or more molecular features are identical with respect to the cases they include, then we join them to a single group in the primary graph. The choice of the parameter
is based on practical considerations. Clearly a graph that hardly contains any edge is useless as a hierarchical model and so is an almost fully connected one. We use
to scale the sparseness of the primary graph.
The consistency problem: Note that in the subset relation between the various properties, if property B is a subset of A (denoted in the graph by the edge A
B) and property C is a subset of B (denoted in the graph by the edge B
C), then for consistency reasons C is a subset of A implies also directly. Unfortunately, due to noise in the data the edge A
C can be missed during graph construction. More generally, logical consistency is requiring that a graph representing a hierarchical structure is transitively closed, while the edgewise construction approach might violate this requirement. For this reason we calculate transitive approximations of the primary graph in the graph consolidation step. This step is based on the transitive approximation problem, which can be formalized as follows: For a given directed graph g find all transitively closed graphs with minimal edit distance to g, where the edit distance between two graphs is the minimal number of edit operations (adding or deleting edges) necessary to transfer the first graph into the second. We refer to all graphs with this optimality property as optimal transitive approximations of g.
For undirected graphs, the transitive approximation problem is as old as Zahn (1964) and has been extensively studied in Delvaux and Horsten (2004) and De Clercq and Horsten (2005). Delvaux and Horsten (2004) have shown that the problem is NP-complete. The equivalent problem for directed graphs is less studied. Nevertheless, Natanzon et al. (1999) show that the optimal transitive approximation problem is NP-complete also for directed graphs.
Here we develop a heuristic algorithm to calculate good transitive approximations, which are graphs with a small but not necessarily optimal edit distance to the original graph. The challenge is to derive a computationally efficient algorithm.
The article is organized as follows: In the next section we develop a heuristic for calculating good transitive approximations and evaluate it in section 3 using random simulations. In section 4 we give technical details of the graph construction step and demonstrate the use of the two-step procedure in the context of a large genomic study on aggressive lymphomas.
| 2 TRANSITIVE APPROXIMATIONS |
|---|
|
|
|---|
2.1 Notations and definitions
Graphs and sequential constructions: Throughout we let
g to denote that ek is an edge in g and ek
g to denote ek is not an edge in g. Next, let g + ek denote the graph obtained by adding edge ek to g and g – ek, the graph obtained by removing edge ek from g. For any random ordering of the edges e1, ... , em, we obtain a sequence of subgraphs gk of g, by setting g0 to be the empty graph (graph with all nodes but no edges) and defining gk = gk–1 + ek, 0 < k
m. Note that gm = g. We call the sequence of subgraphs a sequential construction (gk)k of g.
Distances and neighborhoods: For two graphs
let d(g1, g2) denote the edit distance between them, i.e. the number of edges that are different in the two graphs. For
, we define the
-neighborhood of g, denoted by B
(g), as the set of all graphs with maximal edit distance
to g and for
, we define
.
Transitive approximations: For
, we let
*(g) denote the minimum distance of g to a transitive directed graph. We define optimal transitive approximations of g, denoted by S0(g), to be the set of all transitive directed graphs
which are at minimal distance to g. Clearly,
. Usually, there exist more than one optimal transitive approximation to a directed graph. We give an example: Consider the non-transitive graph given in Figure 2a. Adding the edge AD and deleting the edge DE results in the transitive graph, shown in Figure 2b. Deleting edges BD and CD also results in a transitive graph, which is shown in Figure 2c. Both graphs are optimal transitive approximations of the graph in Figure 2a, with a minimum distance of 2. Finally, we define suboptimal transitive approximations of g denoted by S
(g) as the set of all transitive graphs that are
edges short of being at minimal distance to g.
|
2.2 Preliminaries
LEMMA 1
Let g be a directed graph with sequential construction (gk)k. Then for any, we have
PROOF
Let l = d(gk – 1, h). We have gk = gk–1 + ek. It is easy to see that if ek is an edge of h, then d(gk, h) = l – 1 and if ek is not an edge of h then d(gk, h) = l + 1. In either case it follows that |d(gk – 1, h) – d(gk, h)| = 1.We next show that when going from gk – 1 to gk in a sequential construction, the minimum distance to transitively closed graphs can at most change by 1. Depending on the edge ek that is added, the distance can either increase or decrease by 1.
LEMMA 2
Let g be a directed graph with sequential construction (gk)k and let*(gk) denote the minimum distance of gk to the set of transitive directed graphs. Then
PROOF
We have gk = gk – 1 + ek. Clearly*(gk – 1) + 1 is an upper bound for
*(gk), since by Lemma 1 all optimal transitive approximations of gk can have at most this distance. On the other hand ek could be an edge of an optimal transitive approximation h
S0(gk – 1). Then d(gk – 1, h) =
*(gk – 1). Also since ek is an edge of h, by Lemma 1 we have d(gk, h) =
*(gk – 1) – 1. We now show that
*(gk – 1) – 1 is a lower bound. Suppose that there existed a transitively closed graph
with distance d(gk, c)
![]()
*(gk – 1) – 2. Then by using Lemma 1, we have d(gk – 1, c)
![]()
*(gk – 1) – 1, a contradiction to the optimality of
*(gk – 1).
LEMMA 3
Let g be a directed graph with sequential construction (gk)k. Then for all fS
(gk) (0 < k
m), we have
PROOF
Let fS
(gk), from the definition of suboptimal transitive approximations it follows that
From Lemma 1, we have |d(gk – 1, f) – d(gk, f)| = 1. That is, –1
(1) d(gk – 1, f) – d(gk, f)
1. Using (1) and Lemma 2, we have
Also
Hence we have
The lemma now follows from the definition of suboptimal transitive approximation.
The following corollary immediately follows from the previous lemma.
COROLLARY 1
With the assumptions of Lemma 3 and for fS0(g), we have
The next lemma shows that suboptimal approximations of gk of order at most 2 can be constructed from small modifications of suboptimal approximations of order at most 4 of the previous subgraph gk – 1 in the sequential construction. From Lemma 2 we observe that in going from gk – 1 to gk, depending on the edge ek that is added, the minimum distance can at most change by 1. Consequently, for gk with minimum distance
*(gk) we have that
*(gk) can equal (i)
*(gk – 1) – 1, (ii)
*(gk – 1) or (iii)
*(gk – 1) + 1. We have
LEMMA 4
Letbe a directed graph with sequential construction (gk)k. For each case indicated by the first column of Table 1, optimal and suboptimal transitive approximations of gk have the properties with respect to gk – 1 as indicated by the respective columns.
|
PROOF
If ek is an edge of an optimal transitive approximation in S0(gk – 1), then we have*(gk) =
*(gk – 1) – 1 and row (i) in Table 1 follows from the definitions and previous lemmas. Similarly, if ek is an edge of a graph in S1(gk – 1), the minimum distance to a transitive graph remains the same, yielding row (ii). Finally, in the case when ek is an edge of a graph in S2(gk – 1) we have
*(gk) =
*(gk – 1) + 1 and row (iii).
The previous lemma provides updating rules for suboptimal approximations of gk using small modifications of suboptimal approximations of gk – 1. Updating S0 requires S0, S1 and S2 of the previous subgraph. Updating S1 requires S0, S1, S2, S3 whereas updating S2 requires S0, S1, S2, S3 and S4. Lemma 3 shows that computing S3 and S4 require S5, respectively, S6 of the previous subgraph in the construction. In the following we give an alternative to updating S3 and S4.
LEMMA 5
With the assumptions of Lemma 4, let P = B3(S0(gk – 1))B2(S1(gk – 1))
B1(S2(gk – 1)) and Q = B4(S0(gk – 1))
B3(S1(gk – 1))
B2(S2(gk – 1))
B1(S3(gk – 1)). Moreover, let
and
. Then
with R = { f
S4(gk – 2)
S5(gk – 2) | d(f, gk – 1) =
* (gk – 1) + 3} and
with R' = { f
S5(gk – 2)
S6(gk – 2)| d(f, gk – 1) =
*(gk – 1) + 4}.
PROOF
The lemma follows directly from the definition of suboptimal transitive approximations.
2.3 Algorithm
We now present a heuristic algorithm for calculating good transitive approximations of directed graphs. By a good transitive approximation we understand a not necessarily optimal transitive approximation, which from a practical perspective is close enough to the optimum to reveal meaningful hierarchical structure. The main ingredients of the algorithm are Lemma 4 and Lemma 5.
Let
be an arbitrary directed graph with the set of vertices V = {v1, v2, ... , vn} and the set of edges E = {e1, e2, ... , em} in an arbitrary but fixed order. Let g0
g1
...
gm = g be the corresponding sequential construction of g. We will proceed iteratively starting with g0 by constructing transitive approximations of gk from transitive approximations of gl, l < k. From Corollary 1 we see that optimal transitive approximations of gk can be derived from suboptimal approximations of distance at most two of earlier subgraphs in the sequential construction by introducing at most two modifications. From Lemma 3 approximations of distance at most two can be constructed from approximations of distance at most four of earlier subgraphs, which again can be constructed from approximations of distance at most six of even earlier subgraphs and so on. In this iterative process the number of intermediate approximations increases rapidly requiring inordinate running times for exhaustive enumeration. Therefore, we resort to heuristic as a feasible line of attack. The heuristic aspect of the algorithm is to calculate in each step only suboptimal transitive approximations of distances 0, 1 and 2 to the optimum and the sets
and
ignoring possible contributions from the sets R and R'. The motivation is that for a small number of reference graphs that we examined more closely, we observed that R and R' are small and often empty sets. Nevertheless, ignoring the sets renders our algorithm inexact. Errors can occur and propagate throughout the whole procedure. Simulation experiments will show that the heuristic still produces good transitive approximations for most graphs g.
We start with initializing three sets of graphs called
,
and
. During initialization the sets refer to transitive approximations of the empty graph g0. Since it does not violate transitivity assumptions it is its own unique transitive approximation. Suboptimal transitive approximations of distance 1 and 2 are all transitive graphs with 1 and 2 edges, respectively. The minimum distance to a transitive graph is 0 and we initialize a variable
with this value. At this stage we use Lemma 5 and compute
,
which give suboptimal transitive approximations of g0, of distance 3, respectively 4 stored in variables
and
.
Next we add edge by edge along the chosen sequential construction of g and iteratively update the sets
,
and
. Assume that the sets are updated up to gk – 1. We now add ek and use Table 1 and Lemma 5 to update the sets of transitive approximations. We check successively among the current sets
,
and
for graphs which contain the edge ek. If we find a graph in
with edge ek, we know that this graph has distance
– 1 to gk. We update
and assign the graph to the update of
. We then use row (i) of Table 1 to fully update the sets
,
and
. Next we compute
and
and update
, respectively,
ignoring possible contributions from R and R'.
If instead we find a graph in
with edge ek, we know that this graph has distance
to gk. We assign the graph to the update of
and use row (ii) of Table 1 to update the sets
,
and
. Sets
and
are again updated using
and
, respectively.
If, however, none of the above holds and we find a graph in
with edge ek then this graph has distance
+ 1 to gk. We update
and assign the graph to the update of
and use row (iii) of Table 1 to update the sets
,
and
. As before we compute
to update
, respectively,
. We continue by adding the next edge from the sequential construction and proceed as before. After the last edge is added, the output of our algorithm is the set of good transitive approximations
.
| 3 EVALUATION ON SIMULATIONS |
|---|
|
|
|---|
Running time: To validate the running time performance of the algorithm, we constructed random reference graphs with fixed numbers of nodes N and preset edge probabilities P. For each of the possible N2 edges a uniformly distributed random number between zero and one was generated. If this number was smaller than P the edge was added to the graph. Thus, the expected number of edges is P * N2. We measured running times on graphs with 10, 15, 20, 25 nodes and edge probabilities from .1 to .9. For each combination of N and P, 20 random graphs were generated, and the observed running times on an AMD Athlon machine (1800 MHz clock, main memory size: 2 GB) were averaged. The results are shown in Figure 3. As expected running times increase both with the number of nodes in the graph and with its density. We obtain practically feasible running times below 20 hours for dense graphs up to 15 nodes. For sparse graphs (P < .2) even 25 nodes can be computed in < 3 h. In our application where we look at the hierarchical disease modeling, sparseness of reference graphs can be controlled by the noise parameter
. Moreover, practical hierarchical models will most likely be sparse. Running times depend on the order of edges in the sequential construction of the graphs. To quantify the range of running times for the same graph with different orderings of edges, we ran the algorithm on the graph shown in Figure 2a using 120 random orderings of its edges. Running times varied around an average of 25 s with a SD of 6. Orderings producing early subgraphs gk, which are close to being transitive generally produce short running times. In view of using the algorithm for larger problems (N > 25) further research into optimizing the edge ordering appears promising.
|
Accuracy of approximations: A rigorous validation of the accuracy of transitive approximations is not trivial. Ideally, one would use a large set of reference graphs, for which the complete set of optimal transitive approximations is known. Unfortunately, this validation data is hard to produce due to the NP-completeness of the problem. Instead, we generated random graphs, calculate their transitive closures, modify the closures by a fixed number k of edge deletions or edge insertions, run our algorithm on the modified graphs, and compare the computed distance
with k. If
is larger than k, we observe evidence for a non-optimal solution. Clearly,
k does not prove optimality of the approximation, but it also does not contradict it, while
> k indicates a non-optimal approximation. Along these lines, we generated 30 graphs of sizes N
{8, 10, 12, 15, 20} and report the results in Table 2. Notably, in no instance we observed evidence for suboptimality of the approximation. On the contrary, in many cases the algorithm found transitive graphs at distances below k.
|
| 4 A HIERARCHY FOR MOLECULARLY DEFINED GROUPS OF LYMPHOMAS |
|---|
|
|
|---|
We now show the practical use of the procedure in the context of the classification of mature aggressive lymphomas. We used the data of a large lymphoma study conducted by the research network Molecular Mechanisms of Malignant Lymphoma published in Hummel et al. (2006). A total of 220 mature aggressive B-cell lymphomas were analyzed using various molecular approaches including gene expression profiling, immunophenotyping and molecular cytogenetic analysis. The data is available from www.ncbi.nlm.nih.gov/geo through the GEO accession number GSE4475 [NCBI GEO] . All molecular characterizations yield binary labels for all lymphomas in the collection. Gene expression signatures: Gene expression properties were summarized in signatures, which were either present or absent in the profile of an individual lymphoma. Here we used the mBL signature of (Hummel et al., 2006), which characterizes molecular Burkitt lymphoma (mBL). Lymphomas not presenting the mBL signature were called non-mBL. Moreover, we used the ABC/GCB (acivated B-cells/germinal center B-cells) signatures (Rosenwald et al., 2002), which defines two subgroups of diffuse large B-cell lymphomas (DLBCL): Lymphomas with expression profiles similar to GCB and lymphomas with profiles similar to ABC. Note, that lymphomas with intermediate expression properties between ABC and GCB exist. They were excluded from both groups. Although the GCB/ABC signatures were originally only used for DLBCLs, mBL also display the GCB signature, and we have included them into the GCB group. Immunophenotyping: Histological sections of lymphoma tissues were stained with antibodies for the biomarkers CD10, CD5 and Ki-67. The expression of the corresponding proteins was classified by expert pathologists into present and absent. In case of the proliferation marker Ki-67, lymphomas with a score > 95% were termed Ki-67 positive. Cytogenetics: Chromosomal translocations were determined using interphase FISH. The group IG-MYC comprises lymphomas with a translocation of the MYC locus involving fusion of MYC to the immunoglobulins IGH, IGK or IGL, atyp.myc includes lymphomas with a breakpoint in the MYC locus, but without fusion of the gene with one of the IG genes, while bcl6Br denote lymphomas with breakpoints in the BCL6 locus and IGH-BCL2 denotes fusion of BCL2 to IGH. For every cytogenetic and immunophenotypic label we included two sets of lymphomas into the analysis, one with the lymphomas, which are positive for the label and one with the lymphomas, which are negative for it. The original dataset contains more molecular features of lymphomas than we included in the model. The omitted features were either present in almost all or almost none of the lymphomas and hence did not contribute to a hierarchical model. Also for simplicity of calculations, we excluded all lymphomas where at least one of the 17 modeled features was not assessed, leaving us with 176 remaining lymphomas.
From this data we constructed a graph based on noisy subset/superset relations, which we then corrected for transitivity by using our approximation algorithm. Graph construction was done with several values for the noise parameter
. Small
returned dense graphs with almost 90% edges, whereas large values yielded graphs with hardly any edge. The value
= 0.9 allowing for 10% noise in the subset relations yielded a primary graph with 10% edge density, which is well in the range of practically useful hierarchical models. With this graph as a reference, we ran the transitive approximation algorithm resulting in six good transitive approximations with an edit distance of four to the reference graph. In order to choose one of them, we scored them by averaging the actual accuracy of all edges in them. By accuracy of an edge we understand the percentage of lymphomas in the daughter node that are also contained in the parent node. Clearly this accuracy is bound by
from below. The hierarchical model of lymphomas resulting from the highest scoring approximation is shown in Figure 4. Some caution in interpreting the model is needed. If an entity has two daughter entities this only means that the daughter entities are essentially included in the parent. It does not mean that the two daughter entities are disjoint, nor does it mean that the two daughter nodes fully cover the lymphomas in the parent node. The model reflects known properties of mature aggressive lymphomas. For example, mBL carry the GCB signature, are CD10 positive, and do not carry a IGH-BCL2 fusion nor a breakpoint in the locus of the BCL6 gene, in line with findings of Hummel et al. (2006). As described by Rosenwald et al. (2002), lymphomas with a BCL2 break are also GCB. Moreover ABC, are CD10 negative non-mBLs with no abberations of the MYC locus. In addition to confirming published results, we can also easily identify contradictions of the present study data to other published data. For example, Hans et al. (2004) claim that CD10 expression implies the GCB type of DLBCLs. Our model does not show a directed edge between these features and in fact the modeled study data is not supporting this claim at all. Furthermore, the WHO proposes the presence of an IG-MYC fusion and a Ki-67 score larger than 95% as defining criteria for Burkitt lymphoma (Jaffe et al., 2001). Neither IG-MYC nor Ki-67 positivity implies Burkitt lymphoma. The disagreement of Ki-67 may be due to limited reproducibility of the staining of this marker (de Jong et al., 2007). Interestingly, low expression of Ki-67 is a feature of ABC non-mBL cases.
|
In order to evaluate the influence of the choice of
on the final transitive graph, we ran the transitive approximation algorithm on primary graphs obtained by varying
from 0.85 to 0.95 in steps of 0.01. For each
we then chose one transitive approximation by a scoring using the actual accuracy of edges, as described before. A comparison of the transitive approximation with
= 0.9 to the transitive approximations with
= 0.85, ... , 0.95 showed that on an average 85% of the edges in the various transitive approximations are shared by the transitive approximation with
= 0.9. | 5 DISCUSSION |
|---|
|
|
|---|
With the availability of high-dimensional data in both biological and medical research, there is an increasing need to develop new and efficient strategies to analyze such data. In genomic clinical studies, where the same patients are characterized by various molecular readouts, large and complex datasets are generated. Often these datasets contain hierarchical structure on molecular characteristics. In the present article, we have introduced a novel tool to uncover hierarchical structure in complex datasets. We aim for identifying molecular characteristics, which imply other molecular characteristics. The final hierarchical structure is summarized in a directed transitive graph.
Closely associated with constructing hierarchical models of disease is the problem of transitive approximations of directed graphs. Formally stated, given a reference graph, the problem is to find transitive approximations of the reference graph, which are transitive graphs with minimal edit distance to the reference graph. The problem is known to be NP-complete for both directed and undirected graphs. Here we have developed an efficient heuristics for directed graphs and to our knowledge it is the first of its kind. Note that in principal our algorithm can be used to approximate a reference graph g by graphs from an arbitrary set
. We did not use that
in our formal arguments. Of course our validations of the efficiency and accuracy of the algorithm refer only to the transitive approximation problem. We obtain transitive approximation for sparse graphs with up to 25 nodes in <3 h, with good accuracy. Finally we used the algorithm to retrieve a hierarchical model of mature aggressive B-cell lymphomas, which reproduces well-known results, suggest new relationships between lymphoma groups and uncovers discrepancies between different studies.
Besides the hierarchical modeling of disease the transitive approximation algorithm might also be useful in reconstructing signaling pathways using nested effects models (Markowetz et al., 2007). Here the graph of up- and down-stream relations in molecular signaling networks is estimated, from phenotypic screens of RNAi experiments. In a modular approach, this is done by deciding for each pair (or triple) of signaling molecules, whether they are up- or down-stream from each other. Combining these estimated relations for all analyzed signaling molecules yields a reference graph for the signaling network. Although consistency requires that up/down-stream relations need to be transitive, noise in the data leads to non-transitive reference graphs, such that the transitive approximation algorithm can be used to consolidate the signaling networks.
The hierarchical modeling approach can be interpreted as a generalization of the clustering problem. Clustering the binary characterization vectors of lymphomas detects sets of molecular features that are present in essentially the same lymphomas, suggesting equivalence of the features. Going beyond equivalence, we detect with our approach features that tend to imply each other. While clustering uncovers a noisy equivalence relation, our algorithm uncovers a noisy order relation.
Here we have focused on the discrete transitive approximation problem. The graph construction part of the algorithm is relatively simple. In a more elaborate modeling approach it could be replaced by a full Bayesian model, which specifies posterior probabilities for each potential edge. This would lead to a generalization of the transitive approximation problem, since then one needs to find transitively closed graphs with optimal posterior probability across all edges.
We believe that models of hierarchical structure will be a valuable complementation of the current repertoire of data mining tools. Due to the general nature of the underlying concept, we expect it to be useful in many fields of application.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Florian Markowetz for his intellectual input. We also thank all former and present members of the Computational Diagnostics Group for inspiring discussions. J.J. and R.S. were supported by the Bayerisches Genomforschungsnetz (BayGene). D.K. was supported by the Deutsche Krebshilfe in the context of the Molecular Mechanisms in Malignant Lymphomas Network Project.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Thomas Lengauer
Received on October 16, 2007; revised on November 16, 2007; accepted on February 7, 2008
| REFERENCES |
|---|
|
|
|---|
Alizadeh A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.[CrossRef][Medline]
De Clercq R, Horsten L. Closer. Synthese (2005) 146:371–393.[CrossRef][Web of Science]
de Jong D, et al. Immunohistochemical prognostic markers in diffuse large B-cell lymphoma: validation of tissue microarray as a prerequisite for broad clinical applications – a study from the Lunenburg Lymphoma Biomarker Consortium. J. Clin. Oncol (2007) 25:805–812.
Delvaux S, Horsten L. On best transitive approximations to simple graphs. Acta Informatica (2004) 40:637–655.[Web of Science]
Hans C, et al. Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood (2004) 103:275–282.
Hummel M, et al. A biologic definition of Burkitt's lymphoma from transcriptional and genomic profiling. N. Engl. J. Med (2006) 354:2419–2430.
Jaffe E, et al, eds. World Health Organization Classification of Tumours. Pathology and Genetics of Tumours of Haematopoietic and Lymphoid Tissues (2001) Lyon, France: IARC Press.
Jones J, et al. Gene signatures of progression and metastasis in renal cell cancer. Clin. Cancer. Res (2005) 11:5730–5739.
Markowetz F, et al. Nested effects models for high-dimensional phenotyping screens. Bioinformatics (2007) 23:i305–i312.
Natanzon A, et al. Complexity classification of some edge modification problems. In Workshop on Graph-Theoretic Concepts in Computer Science, WG '99. Ascona, Switzerland (1999) 65–77. citeseer.ist.psu.edu/article/natanzon99complexity.html.
Pinkel D, et al. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat. Genet (1998) 20:207–211.[CrossRef][Web of Science][Medline]
Pollack J, et al. Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nat. Genet (1999) 23:41–46.[Web of Science][Medline]
Rosenwald A, et al. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. N. Engl. J. Med (2002) 346:1937–1947.
Solinas-Toldo S, et al. Matrix-based comparative genomic hybridization: biochips to screen for genomic imbalances. Genes Chromosomes Cancer (1997) 20:399–407.[CrossRef][Web of Science][Medline]
van de Vijver M, et al. A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med (2002) 347:1999–2009.
Zahn C. Approximation symmetric relations by equivalence relations. Journal of the Soceity for Industrial and Applied Mathematics (1964) 12:840–847.[CrossRef]
This article has been cited by other articles:
![]() |
H. Frohlich, T. Beissbarth, A. Tresch, D. Kostka, J. Jacob, R. Spang, and F. Markowetz Analyzing gene perturbation screens with nested effects models in R and bioconductor Bioinformatics, November 1, 2008; 24(21): 2549 - 2550. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


We next show that when going from gk – 1 to gk in a sequential construction, the minimum distance to transitively closed graphs can at most change by 1. Depending on the edge ek that is added, the distance can either increase or decrease by 1.

B2(S1(gk – 1)) 

