Bioinformatics Advance Access originally published online on November 13, 2008
Bioinformatics 2009 25(1):98-104; doi:10.1093/bioinformatics/btn593
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Align human interactome with phenome to identify causative genes and networks underlying disease families

Xuebing Wu , Qifang Liu and Rui Jiang *

MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing 100084, China

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Understanding the complexity of the gene–phenotype relationship is vital for revealing the genetic basis of common diseases. Recent studies based on the human interactome and phenome not only uncover prevalent phenotypic overlap and genetic overlap between diseases, but also reveal a modular organization of the genetic landscape of human diseases, providing new opportunities to reduce the complexity in dissecting the gene–phenotype association.

Results: We provide systematic and quantitative evidence that phenotypic overlap implies genetic overlap. With these results, we perform the first heterogeneous alignment of human interactome and phenome via a network alignment technique and identify 39 disease families with corresponding causative gene networks. Finally, we propose AlignPI, an alignment-based framework to predict disease genes, and identify plausible candidates for 70 diseases. Our method scales well to the whole genome, as demonstrated by prioritizing 6154 genes across 37 chromosome regions for Crohn's disease (CD). Results are consistent with a recent meta-analysis of genome-wide association studies for CD.

Availability: Bi-modules and disease gene predictions are freely available at the URL http://bioinfo.au.tsinghua.edu.cn/alignpi/

Contact: ruijiang{at}tsinghua.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Recently, several large-scale studies have systematically evaluated the complex relationship between human genetic diseases and genes, revealing prevalent phenotypic overlap (van Driel et al., 2006) and genetic overlap (Rzhetsky et al., 2007) between human diseases. Our previous effort in the genome-wide inference of disease genes for 5080 human diseases reveals a modular organization of the genetic landscape of human diseases (Wu et al., 2008). These endeavors further spur the transition from the Mendelian ‘one gene – one phenotype’ rule to a ‘multi-gene – multi-phenotype’ paradigm. It is now well recognized that phenotypes are the outward manifestation of network effects among products of multiple genes. For example, a macrophage-enriched network has been shown to be responsible for a group of metabolic traits (Chen et al., 2008). As genes and diseases are highly intra- and inter-connected, the new paradigm requires a new network-based framework to reduce the complexity and to facilitate the discovery of novel disease genes (Pujana et al., 2007).

We have shown that a simple linear regression model efficiently captures the underlying architecture of the human interactome and phenome networks (Wu et al., 2008). The human disease phenome is depicted by a network of disease phenotypes, with edges weighted by phenotypic overlap scores. Similarly, the interactome is a network of genes linked by physical interactions between their protein products. The two networks are further linked by gene–phenotype associations. We have shown that the proximity between disease genes in the gene network could explain the phenotypic overlap of diseases, and the success of this model suggests a global concordance of the topology between the phenotype network and the gene network. It remains interesting to see whether a direct comparison of the network topology can identify consistent or ‘conserved’ parts between the human interactome and phenome networks. For example, it may be possible that we could find a group of phenotypically overlapped diseases (a disease module), with a corresponding group of causative genes (a gene module). In such a scenario, the causative gene network may suggest a common pathway for the disease family and explain the overlap between the diseases. In addition, the alignment could also provide an effective way to peel modular sub-structures (or bi-module here) from the modular genetic landscape of human diseases, hence greatly reducing the complexity for further analysis.

As a proof-of-concept, we compare the human interactome and phenome networks with the network alignment technique, which was originally proposed for comparing protein networks (Sharan and Ideker, 2006). Typically, network alignment works on networks from two species and seeks to identify pairs of sub-networks, one from each species, with sequence similarity between nodes (proteins) from different species. The identified pairs of sub-networks are thought to be conserved protein complexes or pathways. The alignment takes three inputs: two protein networks from different species and a set of inter-network links (sequence similarity). We call this a homogeneous alignment, because the aligned networks are of the same type (protein–protein interaction networks). Technically, however, network alignment can also be applied to heterogeneous networks, as long as there are inter-network links defining the correspondence between nodes of the two networks. In this study, we perform the first heterogeneous alignment of the human interactome and phenome networks, with inter-network links defined as the causal relationships between genes and diseases.
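
To make the three inputs concrete, the following minimal Python sketch builds a simplified alignment graph for the heterogeneous case. It is an illustration under stated assumptions, not the procedure implemented in the toolkit used in this study: each node is a known (gene, disease) link, and two nodes are connected only when the two genes interact directly and the two diseases share a phenotype edge. The function name and the toy identifiers are hypothetical.

```python
from itertools import combinations


def build_alignment_graph(gene_edges, phenotype_edges, gene_disease_links):
    """Return (nodes, edges) of a simplified alignment graph.

    Each node is a (gene, disease) pair joined by an inter-network link.
    Two nodes are connected when their genes interact AND their diseases
    overlap phenotypically (a deliberate simplification).
    """
    gene_adj = {frozenset(e) for e in gene_edges}
    pheno_adj = {frozenset(e) for e in phenotype_edges}
    nodes = sorted(set(gene_disease_links))
    edges = []
    for (g1, d1), (g2, d2) in combinations(nodes, 2):
        if g1 == g2 or d1 == d2:
            continue  # simplification: require distinct genes and diseases
        if frozenset((g1, g2)) in gene_adj and frozenset((d1, d2)) in pheno_adj:
            edges.append(((g1, d1), (g2, d2)))
    return nodes, edges


# Toy data with hypothetical identifiers (not taken from the dataset).
gene_edges = [("G1", "G2"), ("G2", "G3")]
phenotype_edges = [("D1", "D2")]
links = [("G1", "D1"), ("G2", "D2"), ("G3", "D3")]
nodes, edges = build_alignment_graph(gene_edges, phenotype_edges, links)
print(edges)
```

In this toy case only the pair (G1, D1) and (G2, D2) is connected, because G2 and G3 interact but D2 and D3 do not overlap.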

The underlying rationale for aligning the human interactome and phenome networks is the consistency between phenotypic overlap and genetic overlap. That is, phenotypic overlap between two disease phenotypes implies shared pathogenesis. This consistency assumption has not yet been verified systematically and quantitatively. However, a similar hypothesis, that similar diseases (or mutant phenotypes) are caused by functionally related genes (Oti and Brunner, 2007), has been supported by a growing body of evidence not only from model organisms (Fraser and Plotkin, 2007; Lee et al., 2008; McGary et al., 2007) but also from humans (Goh et al., 2007; Lage et al., 2007; Lim et al., 2006; van Driel et al., 2006; Wood et al., 2007), and has led to remarkable success in screening candidate disease genes (Lage et al., 2007; Wu et al., 2008). Recently, van Driel et al. (2006) quantified the pairwise phenotypic similarity/overlap among 5080 human disease phenotypes by examining the overlap of medical terms that describe the phenotypes. Later, Rzhetsky et al. (2007) estimated the genetic overlap between 161 disorders based on their frequency of co-occurrence in 1.5 million patient records. With these quantitative data, we are able to verify the correlation between phenotypic overlap and genetic overlap.


    2 METHODS
2.1 Data source
The gene network contains 34 364 manually curated protein–protein interactions of 8919 human genes, and is obtained from HPRD (Mishra et al., 2006). The phenotype network consists of 5080 human phenotypes defined in the OMIM database (McKusick, 2007) and the pairwise similarity scores are calculated by text mining, reported by van Driel et al. (2006). The gene–phenotype links are defined in the morbidmap of OMIM and 1428 can be mapped to our dataset. The genetic overlap estimation between 161 disorders is published by Rzhetsky et al. (2007). Disease category information is from a manual classification concerning the physiological system affected (Goh et al., 2007). Linkage loci with unknown molecular basis are extracted from the OMIM database (entries with prefix %). Gene position information is obtained from NCBI.

2.2 Network alignment and bi-module analysis
We use the network comparison toolkit developed by Ideker lab for network alignment (http://chianti.ucsd.edu/nct/index.php), which implements the model proposed by Sharan et al. (2005). Here, we briefly describe the framework applied to our problem. First, the input networks are assembled into a network alignment graph, and then a log likelihood ratio model is used to score the sub-networks on the weighted alignment graph. The scoring model compares the fit of a sub-network to the desired structure (linear path or clique) versus its likelihood given that each network is randomly constructed. Finally, an algorithm searches exhaustively over the alignment graph to identify high-scoring sub-networks. We have tried most of the tunable parameters in this algorithm, and found that they actually have quite limited impact. Therefore, we use their default settings. We call the identified pairs of sub-networks bi-modules, each comprising a disease module (the disease sub-network) and a gene module (the gene sub-network), together with gene–disease links between them. We perform enrichment analysis to find over-represented gene functions and disease categories for each bi-module. Gene functions (Gene Ontology terms) analysis for the gene module is carried out by DAVID (Dennis et al., 2003): http://david.abcc.ncifcrf.gov/. The P-value of enriched disease category is calculated using Fisher's exact test, which has been widely used for enrichment analysis (Al-Shahrour et al., 2007; Beissbarth and Speed, 2004).
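
As an illustration of the disease-category enrichment step, the sketch below computes a one-sided Fisher's exact test P-value as the upper tail of the hypergeometric distribution. The function name is hypothetical, and the background counts in the example (200 neurological diseases among 1000 classified diseases) are assumed purely for demonstration; only the module counts (12 of 13 diseases) come from the bi-module discussed in Section 3.2.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k: diseases in the module that belong to the category
    n: diseases in the module
    K: diseases in the background that belong to the category
    N: diseases in the background
    Returns P(X >= k) under random sampling without replacement.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# 12 of 13 module diseases in the category; background counts are assumed.
p_value = fisher_enrichment_p(k=12, n=13, K=200, N=1000)
print(f"{p_value:.3g}")
```

A one-sided test is used here because only over-representation is of interest; a two-sided variant would require a different tail definition.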

2.3 Benchmark test and prediction
We test the disease gene prediction framework using phenotype network with edge weight threshold of 0.50, 0.55, 0.60 and 0.65. For a threshold smaller than 0.5, the dataset is too large for the program to run, while for a threshold larger than 0.65, the gene–phenotype links are too few for a statistically reasonable validation. At each threshold, the remaining gene–disease links are used to construct the benchmark data. For each gene–disease link, we simulate a linkage locus around the true disease gene by including 108 neighboring genes as negative controls. This strategy for resembling known disease loci in the OMIM database has been widely used in previous studies (Lage et al., 2007; Wu et al., 2008). The 109 test genes are then treated equally by assuming links to the disease under study and go through the network alignment procedure. The genes will compete with each other in this procedure, and the one retained in the bi-module with the highest score is predicted as the causative gene. For prediction, all settings are the same as in the benchmark test, except that the genetic loci are real linkage results collected in OMIM instead of simulated loci.
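
The construction of a simulated locus can be sketched as follows. The sketch assumes that the genes of a chromosome are available as a list ordered by position and that the 108 neighbors are split evenly on both sides of the true gene, with the window shifted at chromosome ends; the text does not state how neighbors are chosen, so both choices, as well as the function name, are assumptions.

```python
def simulate_locus(ordered_genes, true_gene, n_neighbors=108):
    """Return a window of n_neighbors + 1 consecutive genes around true_gene.

    ordered_genes: gene identifiers of one chromosome, sorted by position.
    The window is centered on the true gene where possible and shifted
    inward near chromosome ends (an assumption, not stated in the text).
    """
    size = n_neighbors + 1
    if len(ordered_genes) < size:
        raise ValueError("chromosome has too few genes for this locus size")
    idx = ordered_genes.index(true_gene)
    start = idx - n_neighbors // 2
    start = max(0, min(start, len(ordered_genes) - size))
    return ordered_genes[start:start + size]


# Toy chromosome with hypothetical gene identifiers.
chromosome = [f"g{i}" for i in range(500)]
locus = simulate_locus(chromosome, "g250")
print(len(locus), locus[0], locus[-1])
```

Every gene in the returned window would then be linked to the disease under study before the alignment is run.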


    3 RESULTS
3.1 Phenotypic overlap implies genetic overlap
Assuming that there are shared genetic variations underlying multiple disorders that co-occur in individual patients significantly more (or significantly less) frequently than expected, Rzhetsky et al. inferred the genetic overlaps between 161 disorders based on 1.5 million patient records and a sophisticated statistical model (Rzhetsky et al., 2007). To investigate the phenotypic overlap between the 161 disorders, we manually map OMIM phenotypes to these disorders. Frequently, more than one OMIM phenotype will be found for one disorder. In such cases, the highest pairwise phenotypic overlap score between mapped OMIM phenotypes, one from each disorder, is used as the score for corresponding disorder pair (using score mean yields similar results). We are able to map at least one OMIM phenotype entry for 107 of the 161 disorders and assign phenotypic overlap scores to them.
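
The scoring rule for a disorder pair can be sketched as below. The function name, the dictionary layout and the toy scores are hypothetical, and treating a missing phenotype pair as a score of 0 is an assumption made only to keep the example self-contained.

```python
from itertools import product
from statistics import mean


def disorder_pair_score(phenos_a, phenos_b, overlap, how="max"):
    """Score a disorder pair from the OMIM phenotypes mapped to each disorder.

    overlap maps frozenset({phenotype_1, phenotype_2}) to a similarity score;
    pairs absent from the mapping are scored 0.0 (an assumption).
    """
    scores = [
        overlap.get(frozenset((a, b)), 0.0)
        for a, b in product(phenos_a, phenos_b)
    ]
    return max(scores) if how == "max" else mean(scores)


# Toy example: disorder A maps to two phenotypes, disorder B to one.
toy_overlap = {frozenset(("P1", "P3")): 0.75, frozenset(("P2", "P3")): 0.25}
score_max = disorder_pair_score(["P1", "P2"], ["P3"], toy_overlap)
score_mean = disorder_pair_score(["P1", "P2"], ["P3"], toy_overlap, how="mean")
print(score_max, score_mean)
```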

We compare the average genetic overlap of disorder pairs whose phenotypic overlap is larger than a threshold (T) with that of pairs whose phenotypic overlap is smaller. At each threshold, the disorder pairs are divided into two groups: those with phenotypic overlap scores smaller than the threshold and those with phenotypic overlap scores larger than the threshold. Then, the average genetic overlap score is calculated for each group separately, and the result is plotted as bars. Results for T = 0.4, 0.5 and 0.6 are plotted in Figure 1a. We find that disorder pairs with higher phenotypic overlap indeed have higher genetic overlap, and this contrast becomes sharper at higher thresholds. We also calculate Pearson's correlation coefficient (PCC) between the genetic overlap and the phenotypic overlap of the same disorder pair, both over all pairs and over pairs with different levels of phenotypic overlap: given a threshold for phenotypic overlap scores, we calculate the correlation coefficient for disorder pairs whose phenotypic overlap is larger than the threshold. We first transform the genetic overlap score by the log formula y = ln(1 + x), because the score ranges from zero to several thousand. Most of the genetic overlap scores are positive, but some are negative [the disorders co-occur less frequently than expected, interpreted as a genetic overlap via competition (Rzhetsky et al., 2007)]. For the negative ones, we use their absolute value; an analysis excluding negative scores yields similar results. Results (Fig. 1b) show that the overall correlation is weak (PCC = 0.1) but highly significant (P = 1.2 × 10^-13). Further, for disorder pairs with higher phenotypic overlap, the correlation becomes stronger, and there is a linear relationship between the correlation coefficient and the phenotypic overlap score threshold (Fig. 1b). For disorder pairs with phenotypic overlap scores larger than 0.6, the correlation coefficient is larger than 0.4. These results confirm that phenotypic overlap is a general indicator of shared pathogenesis.
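
The thresholded correlation analysis can be sketched in a few lines of Python. The sketch applies the transform y = ln(1 + |x|) to the genetic overlap score and computes Pearson's correlation coefficient over disorder pairs whose phenotypic overlap exceeds a threshold. Function names and the toy data are hypothetical and are not values from the study.

```python
from math import log, sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def thresholded_pcc(pairs, threshold):
    """pairs: iterable of (phenotypic_overlap, genetic_overlap) per disorder pair.

    Keeps pairs with phenotypic overlap above the threshold and correlates
    phenotypic overlap with ln(1 + |genetic overlap|).
    """
    kept = [(p, log(1 + abs(g))) for p, g in pairs if p > threshold]
    if len(kept) < 2:
        raise ValueError("need at least two disorder pairs above the threshold")
    ps, gs = zip(*kept)
    return pearson(ps, gs)


# Toy (phenotypic overlap, genetic overlap) values, purely illustrative.
toy_pairs = [(0.2, 5.0), (0.45, -3.0), (0.55, 40.0), (0.65, 300.0), (0.7, 1200.0)]
for t in (0.0, 0.5):
    print(t, round(thresholded_pcc(toy_pairs, t), 3))
```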


Fig. 1. Phenotypic overlap implies genetic overlap. (a) Average genetic overlap of disorder pairs with phenotypic overlap larger than a threshold (T) versus that of pairs with smaller phenotypic overlap. Results with T = 0.4, 0.5 and 0.6 are given. (b) The correlation between phenotypic overlap and genetic overlap becomes stronger as phenotypic overlap increases. Each point (circle) represents the correlation coefficient (Y) between genetic overlap and phenotypic overlap scores when considering disorder pairs with phenotypic overlap larger than a threshold (X).

 
3.2 Align human interactome and phenome networks
With the consistency between phenotypic overlap and genetic overlap justified, we perform the first heterogeneous alignment of the human interactome and phenome networks, to identify pairs of matched sub-networks, or bi-modules. To obtain meaningful results, and also to make the alignment computationally feasible, we remove phenotype links with phenotypic overlap scores <0.5, resulting in a smaller phenotype network (4256 phenotypes and 30 551 edges). The alignment identifies several hundred bi-modules, but there is significant overlap of nodes and edges between them. Using the program's default filtering procedure, we obtain 39 bi-modules with <80% duplications. Two representative bi-modules are shown in Figure 2 (see Supplementary Material 1 for all 39 bi-modules). We find that most diseases in the same module belong to the same category. For example, in Figure 2a, 12 of the 13 diseases in the module are neurological diseases, and in Figure 2b, all diseases are metabolic diseases. The enrichment for a specific disease category is not surprising, given that diseases in the same module share significant phenotypic overlap with each other. We also find that genes in the same module are enriched in specific biological processes. For example, the eight genes in Figure 2a are enriched in neurotransmitter secretion and its regulation, dopamine/catecholamine metabolic process and apoptosis, while the six genes implicated in metabolic diseases in Figure 2b highlight the cholesterol/sterol metabolic and transport process (Supplementary Tables S1 and S2). We also find that these genes are enriched in specific molecular functions and cellular components (Supplementary Material 2). These enriched common features are consistent with the pathogenesis of diseases in the module, suggesting that the causative gene network may serve as a common pathway for the disease family.
To see if these observations are general for bi-modules, we perform gene function enrichment analysis and disease category enrichment analysis for each bi-module. Table 1 lists the most enriched category and function (Gene Ontology biological process terms) for each bi-module. From the table, we can see that all bi-modules are enriched with a specific category and a specific function at a significance level of 0.1, and 38 of the 39 bi-modules are further enriched at a level of 0.02, for both function and category. These results confirm that the identified bi-modules are biologically meaningful. Again, we can see reasonable correspondences from these results. For example, bi-module 38 is enriched for ‘Immunological’ disease and the function of ‘B cell proliferation’. Another interesting relationship is that both bi-modules enriched for ‘Renal’ disease (19 and 27) are associated with genes for ‘visual behavior’.


Fig. 2. Representative bi-modules. Circles are diseases and rectangles are genes. Orange, red and blue circles indicate neurological, metabolic and unclassified diseases, respectively. An edge between two diseases indicates that they share significant phenotypic overlap (score >0.5). An edge between two genes indicates a physical interaction between their protein products. A dashed line between a gene and a disease indicates a causal relationship. (a) A neurological bi-module. 607822: AD 3, 260540: Parkinson-dementia syndrome, 104310: AD 2, 600274: Frontotemporal dementia, 607485: Frontotemporal lobar degeneration with ubiquitin-positive inclusions, 606688: Spongiform encephalopathy with neuropsychiatric features, 606889: AD 4, 168601: Parkinson disease, familial, type 1, 601104: Supranuclear palsy, progressive, 1, 127750: Dementia, Lewy body, 168600: Parkinson disease, 172700: Pick disease of brain, 600116: Parkinson disease 2, autosomal recessive juvenile. (b) A metabolic bi-module. 136120: Fish-eye disease, 144010: Hypercholesterolemia, autosomal dominant, type b, 143890: Hypercholesterolemia, autosomal dominant, 604091: Hypoalphalipoproteinemia, primary, 603813: Hypercholesterolemia, autosomal recessive, 205400: Tangier disease, 245900: Lecithin:cholesterol acyltransferase deficiency.

 

Table 1. The most enriched category and function

 
3.3 Predict disease genes via network alignment
The network alignment identifies modular sub-structures shared between the human interactome and phenome networks, reduces the complexity and facilitates their analysis. One limitation is that these sub-structures are identified within gene–disease relationships that are already known. To make novel discoveries, one can incorporate candidate genes that are assumed, but not yet confirmed, to be involved in a particular disease, such as those predicted by computational approaches or those that reside in a genomic region identified by linkage analysis or association studies. The candidate gene–disease relationships can be treated as inter-network links and some of them may be retained in bi-modules after the alignment. According to the characteristics of the bi-module, these retained candidate genes share many features with, and are closely connected to, other genes that cause the same or similar diseases. They can explain the phenotypic overlap between these diseases and thus are likely to be true disease genes. We test this hypothesis by a benchmark test with known gene–disease relationships and simulated linkage loci (see Section 2). We call this novel framework AlignPI, which is short for Align Phenome & Interactome. The performance of this approach at different thresholds of phenotypic overlap scores is summarized in Table 2. For example, at the threshold of 0.6, there are 653 known gene–disease links tested, of which 178 cases have at least one test gene matched with the test disease (i.e. retained after alignment), and the average number of matched genes is 3.3. In 129 (72.5%) of the 178 loci, the list of matched genes contains (hits) the true disease gene, and the true disease gene is correctly predicted (retained in the bi-module with the highest score) in 111 cases, yielding a relatively high precision of 0.623 (111/178) and an overall recall of 0.17 (111/653).
In summary, the novel approach greatly reduces the number of candidate genes (from 109 to 3.3 on average) and is able to find the disease gene with high precision.
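
The benchmark statistics quoted for the threshold of 0.6 follow directly from the four counts given in the text, as the short sketch below shows. The function and key names are hypothetical. Note that 111/178 evaluates to about 0.6236, which the text reports as 0.623.

```python
def benchmark_metrics(n_links, n_matched, n_hit, n_correct):
    """Summary statistics of the benchmark test.

    n_links:   known gene-disease links tested
    n_matched: loci with at least one gene retained after alignment
    n_hit:     loci whose retained gene list contains the true gene
    n_correct: loci whose top-scoring bi-module retains the true gene
    """
    return {
        "hit_rate": n_hit / n_matched,
        "precision": n_correct / n_matched,
        "recall": n_correct / n_links,
    }


# Counts reported for the phenotypic overlap threshold of 0.6.
metrics = benchmark_metrics(n_links=653, n_matched=178, n_hit=129, n_correct=111)
fold_reduction = 109 / 3.3  # candidate genes per locus before vs. after alignment
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"fold reduction: {fold_reduction:.1f}")
```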


 
Table 2. Performance at different score threshold

 
The statistics for thresholds of 0.50, 0.55 and 0.65 are also provided in Table 2, from which we can further assess the impact of this parameter on the performance of the proposed framework. We show that the threshold indeed has impact on the values of precision and recall. However, one cannot say which threshold is the best, because the threshold just introduces a tradeoff between precision and recall—that is, the precision is generally higher for larger threshold, but the recall will be lower.

We assess the significance of these results by performing a permutation test. The known gene–disease links are randomly rewired to remove the modularity in the gene–disease relationship. Then the same benchmark is performed for each randomized dataset. We repeat this procedure 30 times for the threshold of 0.60, and summarize the results in Figure 3. We can see that without modularity the performance drops drastically. Far fewer genes are found; the precision and recall are much lower; the ability to enrich disease genes in a short list is also significantly weakened. These results confirm that AlignPI is able to exploit the modularity of the gene–disease relationship for disease gene discovery.
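
One way to carry out such a rewiring is sketched below. The text does not specify the rewiring scheme, so the sketch assumes a simple shuffle of the disease endpoints across links, which keeps the number of links of every gene and every disease unchanged (duplicate links may occasionally arise). The `benchmark` argument stands for any function that maps a list of links to a performance statistic; `toy_benchmark` is a hypothetical stand-in for the alignment-based benchmark.

```python
import random
from statistics import mean, stdev


def rewire_links(links, rng):
    """Shuffle disease endpoints across gene-disease links (assumed scheme)."""
    genes = [g for g, _ in links]
    diseases = [d for _, d in links]
    rng.shuffle(diseases)
    return list(zip(genes, diseases))


def permutation_null(links, benchmark, n_perm=30, seed=0):
    """Mean and SD of a benchmark statistic over n_perm rewired datasets."""
    rng = random.Random(seed)
    values = [benchmark(rewire_links(links, rng)) for _ in range(n_perm)]
    return mean(values), stdev(values)


# Toy data with hypothetical identifiers.
links = [("G1", "D1"), ("G2", "D1"), ("G3", "D2"), ("G4", "D3"), ("G5", "D3")]
known = set(links)


def toy_benchmark(candidate_links):
    """Stand-in statistic: number of original links that survive rewiring."""
    return sum(link in known for link in candidate_links)


null_mean, null_sd = permutation_null(links, toy_benchmark)
print(toy_benchmark(links), round(null_mean, 2), round(null_sd, 2))
```

The observed statistic on the real links can then be compared with the mean and SD of the randomized runs, which is the comparison summarized in Figure 3.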


Fig. 3. Comparison of performance for real and randomized gene–disease links. The number of correctly predicted loci, the precision, the recall and the average number of matched genes are shown. In each panel, the left bar indicates results on known gene–disease links, and the right bar shows the mean and SD of the results with randomized gene–disease links.

 
3.4 Predict novel disease gene
The OMIM database collects 876 genetic loci previously associated with a particular disease but without the causal gene identified. For example, in 1997, three groups (Bowden et al., 1997; Ji et al., 1997; Zouali et al., 1997) reported linkage to a 20-cM region of 20q12–q13.1 for type II diabetes. Ten years later, we still do not know which of the 175 genes in this region accounts for the linkage. These genetic loci are a valuable resource to be explored in human disease genetics. We are able to map at least one gene for each of 591 such loci that are included in our data. The average and median numbers of genes in these loci are 337.2 and 268, respectively. The proposed framework makes predictions for 70 disease loci. On average, 7.4 genes are matched with the test phenotype (see Supplementary Material 3 for all predictions).

Here, we show the example of late-onset familial Alzheimer disease (AD) (MIM: 608907), mapped to 19p13.2 (Wijsman et al., 2004). Two of the 207 genes within this locus, LDLR and ICAM5, reside in the bi-module that has the highest score (Supplementary Fig. S1). LDLR (low density lipoprotein receptor) has already been proposed as a candidate AD gene, because it is a receptor of the product of the AD gene APOE and modulates the homeostasis of cholesterol, which itself appears to be associated with AD. Previous population studies (Gopalraj et al., 2005) suggested that the LDLR haplotype is associated with reduced odds of AD.

Compared with LDLR, the link between ICAM5 and AD is less studied. The ICAM5 protein (intercellular adhesion molecule 5, also known as telencephalin, TLN) is expressed in the somatodendritic region of neurons of the mammalian brain, and may be a critical component in neuron–microglial cell interactions in the course of normal development or as part of neurodegenerative diseases (Gahmberg et al., 2008). It is involved in immune privilege of the brain and acts as an anti-inflammatory agent (Tian et al., 2008). More directly, the immunoreactivity of ICAM5 is markedly decreased in the brain of AD patients, particularly in the hippocampal formation (Hino et al., 1997). Soluble ICAM5 has been detected in brain ischemia (Guo et al., 2000), encephalitis (Lindsberg et al., 2002) and epilepsy (Rieckmann et al., 1998). Further, ICAM5 directly binds to two AD genes, PSEN1 and PSEN2, and another member of the ICAM family has been implicated in AD (Combarros et al., 2005). This evidence strongly supports a role of ICAM5 in AD.

3.5 Crohn's disease: genome-wide screen and multi-loci effect
The above locus by locus prediction scheme seeks to find from a single locus a gene that is probably part of an existing bi-module that contains the disease under investigation. The term ‘existing’ is used because the bi-module is largely shaped by already established gene–disease relationships. The novel locus is assumed to contain only one true disease gene, thus has limited impact in defining bi-modules.

For complex diseases with heterogeneous origins, there are often multiple loci identified without the causative genes specified. It is likely that the implicated genes from these loci interact with each other and form a novel modular structure/pathway for the disease. In such a scenario, the locus by locus scheme would fail to find the causative genes from these loci. To account for the potential effect of unknown interacting loci, we could fuse candidate genes in multiple loci as if they came from a single locus so that all genes inside could be aligned simultaneously and the potential interacting effect could be automatically considered.

We test this multi-loci scheme for Crohn's disease (CD). Recently, a meta-analysis of three genome-wide association studies (Burton et al., 2007; Libioulle et al., 2007; Rioux et al., 2007) reported 40 susceptibility loci for CD (Barrett et al., 2008). These loci correspond to 37 distinct chromosome regions (Tables 2 and 3 in Barrett et al.'s paper) containing 6154 genes in total. We first apply the single-locus scheme to each locus, and no significant bi-modules are identified. The result suggests that current functional (interactome) data do not support the idea that genes in these novel loci are part of a known CD-related bi-module. However, as pointed out earlier, it is possible that the combination of several genes in these loci makes some local structure significant enough to become a novel bi-module. To test all possible combinations, we fuse all 6154 genes in the 37 regions into one region, and align CD (MIM 266600) with all genes simultaneously. This genome-wide alignment identifies 48 candidate genes that might be associated with CD (Table 3). Three of the 48 genes (STAT3, JAK2 and PTPN2, dark gray rows in Table 3) are inside the critical regions defined by genome-wide association studies (Barrett et al., 2008), and all three genes are also proposed as potential causative genes by Barrett et al. Two of these three genes (STAT3 and JAK2) are inside the same bi-module, which has the highest score. Besides these three genes, nine genes (light gray rows in Table 3) are <1 Mb away from a critical region, such as STAT5A (23-kb upstream), STAT5B (60-kb upstream) and MST1R (30-kb downstream). Of the nine genes near the critical regions, three (SUMO4, GRB10 and CARD6) are near a critical region that contains no genes. Besides these candidate genes that are consistent with genome-wide association studies, we also identify many other genes that are plausible candidates.
Of particular interest are the other two genes in the most highly scored bi-module: IL12RB1 and FYN. Further work is needed to verify the role of these genes in CD pathogenesis. Nonetheless, the above results not only demonstrate the usefulness of our novel method, but also illustrate its ability to perform genome-wide prediction and to handle multi-loci effects.
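
The fusion of multiple loci into a single candidate set amounts to a union of gene lists followed by linking every gene to the disease, as the minimal sketch below illustrates. The function name and the placeholder gene identifiers are hypothetical; only the MIM number of CD is taken from the text.

```python
def fuse_loci(loci):
    """Merge several loci into one candidate list, dropping duplicate genes
    while preserving the order of first appearance."""
    fused, seen = [], set()
    for locus in loci:
        for gene in locus:
            if gene not in seen:
                seen.add(gene)
                fused.append(gene)
    return fused


# Three toy loci with placeholder gene identifiers.
toy_loci = [["geneA", "geneB"], ["geneC"], ["geneB", "geneD"]]
fused = fuse_loci(toy_loci)

# Every fused gene becomes a candidate inter-network link to the disease.
CD_MIM = "266600"
candidate_links = [(gene, CD_MIM) for gene in fused]
print(candidate_links)
```

Because all candidate links enter the alignment together, genes from different loci that interact with each other can support one another in the same bi-module.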


 
Table 3. Potential CD genes identified

 

    4 DISCUSSION
As a proof-of-concept analysis, we have not investigated the impact of different network alignment algorithms, although about a dozen methods are available (Berg and Lassig, 2006; Flannick et al., 2006; Sharan and Ideker, 2006; Singh et al., 2008). There are several reasons for the choice of the NetworkBlast algorithm used here. First, NetworkBlast is one of the pioneering works, and has successfully led to novel biological discoveries (Suthram et al., 2005). Second, it is conceptually simple; in particular, no evolutionary model is imposed. Most of the methods developed later assume a dynamic evolutionary history between the aligned networks, an assumption that has no clear interpretation for the alignment of heterogeneous networks. Of course, it would be interesting to see whether some of the evolutionary operations (such as node duplication and edge deletion) can be used to explain the pathogenesis of disease families. Third, the alignment of NetworkBlast is local, aiming to find matched substructures between networks. We make use of this capability to identify bi-modules. Many later-developed methods perform global alignment and are thus not appropriate here.

Recently, a number of network-based methods have been proposed to predict or prioritize disease gene candidates (Ala et al., 2008; Franke et al., 2006; Koller et al., 2008; Lage et al., 2007; Oti et al., 2006; Wu et al., 2008). Though it is not our primary concern to develop a disease gene prediction method that outperforms existing ones, the good performance of the novel AlignPI framework places it among the top methods in this field. Of these network-based methods, two are of particular interest: the Bayesian predictor proposed by Lage et al. (2007) and our previous regression model CIPHER (Wu et al., 2008). These two methods are based on the same types of data as this study: phenotype similarity and protein interaction. In general, the precision of AlignPI is slightly better than that of CIPHER, though the recall is lower. CIPHER has a precision ranging from 0.47 to 0.66 and a recall ranging from 0.3 to 0.5, while AlignPI achieves a precision of 0.69 at the score threshold of 0.65, where the recall is 0.15. As a comparison, the precision of the Bayesian predictor ranges from 0.23 to 0.65, and its recall ranges from 0.13 to 0.23.

Certainly, this study has several limitations. First, the quantification of the phenotypic overlap score is imprecise and subjective. The standardization and quantification of phenotypic descriptions is a separate issue that is beyond the scope of this study (Biesecker, 2005). Second, although we have shown that an alignment algorithm designed for protein networks is also effective in aligning the phenome and interactome networks, it is still worthwhile to design algorithms specifically for this problem. For instance, the phenotype network is a weighted complete graph (all pairs are connected), while the protein network is binary and sparse. Specific algorithms are needed to address the alignment problem under this scenario.
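
The structural mismatch between the two networks can be made concrete with a small sketch. It assumes a toy phenotype similarity table and uses a simple similarity cutoff to turn the weighted complete graph into a binary one; the cutoff, the names and the values are hypothetical, and this naive binarization is only an illustration, not the procedure used in this study.

```python
from itertools import combinations


def edge_density(num_nodes, edges):
    """Fraction of all possible undirected node pairs that are connected."""
    possible = num_nodes * (num_nodes - 1) // 2
    return len(edges) / possible if possible else 0.0


def binarize(weighted_edges, threshold):
    """Keep only the pairs whose weight is at or above the threshold."""
    return {pair for pair, weight in weighted_edges.items() if weight >= threshold}


# Toy weighted complete graph over four phenotypes (invented similarity values).
phenotypes = ["P1", "P2", "P3", "P4"]
weights = [0.9, 0.2, 0.1, 0.7, 0.3, 0.6]
similarity = dict(zip(combinations(phenotypes, 2), weights))

complete_density = edge_density(len(phenotypes), similarity)  # every pair has a weight
binary_edges = binarize(similarity, threshold=0.6)
binary_density = edge_density(len(phenotypes), binary_edges)
```

Thresholding discards the weight information and makes the result depend on the chosen cutoff, which is one reason an alignment algorithm that handles weighted complete graphs directly may be preferable.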

Our framework could also be applied to model organisms, provided that systematic phenotype similarity and gene interaction data are available, for example, in Caenorhabditis elegans (Gunsalus et al., 2005). Similarly, the framework could be applied to other labeling problems, such as protein function prediction, since a correlation has also been observed between protein functional distance (semantic similarity of GO annotations) and network distance (Sharan et al., 2007).


    ACKNOWLEDGEMENTS
 
We thank the Brunner lab for providing the phenome data, and Dr MQ Zhang from CSHL and Dr FZ Sun from USC for critically reading the article. We are grateful to the anonymous reviewers, whose suggestions and comments contributed significantly to the improvement of this article.

Funding: Natural Science Foundation of China (60575014 and 60805010); Hi-Tech Research and Development Program of China (863project) (2006AA02Z325); National Basic Research Program of China (2004CB518605); a startup supporting plan at Tsinghua University.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

Received on July 16, 2008; revised on September 24, 2008; accepted on November 11, 2008

    REFERENCES
 

    Al-Shahrour F, et al. FatiGO+: a functional profiling tool for genomic data. Integration of functional annotation, regulatory motifs and interaction data with microarray experiments. Nucleic Acids Res. (2007) 35:W91–W96.

    Ala U, et al. Prediction of human disease genes by human-mouse conserved coexpression analysis. PLoS Comput. Biol. (2008) 4:e1000043.

    Barrett JC, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn's disease. Nat. Genet. (2008) 40:955–962.

    Beissbarth T, Speed TP. GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics (2004) 20:1464–1465.

    Berg J, Lassig M. Cross-species analysis of biological networks by Bayesian alignment. Proc. Natl Acad. Sci. USA (2006) 103:10967–10972.

    Biesecker LG. Mapping phenotypes to language: a proposal to organize and standardize the clinical descriptions of malformations. Clin. Genet. (2005) 68:320–326.

    Bowden DW, et al. Linkage of genetic markers on human chromosomes 20 and 12 to NIDDM in Caucasian sib pairs with a history of diabetic nephropathy. Diabetes (1997) 46:882–886.

    Burton PR, et al. Genome-wide association study of 14 000 cases of seven common diseases and 3000 shared controls. Nature (2007) 447:661–678.

    Chen Y, et al. Variations in DNA elucidate molecular networks that cause disease. Nature (2008) 452:429–435.

    Combarros O, et al. Interaction between interleukin–6 and intercellular adhesion molecule–1 genes and Alzheimer's disease risk. J. Neurol. (2005) 252:485–487.

    Dennis G, et al. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol. (2003) 4:P3.

    Flannick J, et al. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. (2006) 16:1169–1181.

    Franke L, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. (2006) 78:1011–1025.

    Fraser HB, Plotkin JB. Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol. (2007) 8:R252.

    Gahmberg CG, et al. ICAM-5 – A novel two-facetted adhesion molecule in the mammalian brain. Immunol. Lett. (2008) 117:131–135.

    Goh KI, et al. The human disease network. Proc. Natl Acad. Sci. USA (2007) 104:8685–8690.

    Gopalraj RK, et al. Genetic association of low density lipoprotein receptor and Alzheimer's disease. Neurobiol. Aging (2005) 26:1–7.

    Gunsalus KC, et al. Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature (2005) 436:861–865.

    Guo H, et al. Release of the neuronal glycoprotein ICAM-5 in serum after hypoxic-ischemic injury. Ann. Neurol. (2000) 48:590–602.

    Hino H, et al. Reduction of telencephalin immunoreactivity in the brain of patients with Alzheimer's disease. Brain Res. (1997) 753:353–357.

    Ji LN, et al. New susceptibility locus for NIDDM is localized to human chromosome 20q. Diabetes (1997) 46:876–881.

    Koller S, et al. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. (2008) 82:949–958.

    Lage K, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotechnol. (2007) 25:309–316.

    Lee I, et al. A single gene network accurately predicts phenotypic effects of gene perturbation in Caenorhabditis elegans. Nat. Genet. (2008) 40:181–188.

    Libioulle C, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genet. (2007) 3:6.

    Lim J, et al. A protein–protein interaction network for human inherited ataxias and disorders of Purkinje cell degeneration. Cell (2006) 125:801–814.

    Lindsberg PJ, et al. Release of soluble ICAM-5, a neuronal adhesion molecule, in acute encephalitis. Neurology (2002) 58:446–451.

    McGary KL, et al. Broad network-based predictability of S. cerevisiae gene loss-of-function phenotypes. Genome Biol. (2007) 8:R258.

    McKusick VA. Mendelian inheritance in man and its online version, OMIM. Am. J. Hum. Genet. (2007) 80:588–604.

    Mishra GR, et al. Human protein reference database – 2006 update. Nucleic Acids Res. (2006) 34:D411–D414.

    Oti M, Brunner HG. The modular nature of genetic diseases. Clin. Genet. (2007) 71:1–11.

    Oti M, et al. Predicting disease genes using protein–protein interactions. J. Med. Genet. (2006) 43:691–698.

    Pujana MA, et al. Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat. Genet. (2007) 39:1338–1349.

    Rieckmann P, et al. Telencephalin as an indicator for temporal-lobe dysfunction. The Lancet (1998) 352:370–371.

    Rioux JD, et al. Genome-wide association study identifies new susceptibility loci for Crohn disease and implicates autophagy in disease pathogenesis. Nat. Genet. (2007) 39:596–604.

    Rzhetsky A, et al. Probing genetic overlap among complex human phenotypes. Proc. Natl Acad. Sci. USA (2007) 104:11694–11699.

    Sharan R, Ideker T. Modeling cellular machinery through biological network comparison. Nat. Biotechnol. (2006) 24:427–433.

    Sharan R, et al. Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA (2005) 102:1974–1979.

    Sharan R, et al. Network-based prediction of protein function. Mol. Syst. Biol. (2007) 3:88.

    Singh R, et al. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl Acad. Sci. USA (2008) 105:12763–12768.

    Suthram S, et al. The Plasmodium protein network diverges from those of other eukaryotes. Nature (2005) 438:108–112.

    Tian L, et al. Shedded neuronal ICAM-5 suppresses T-cell activation. Blood (2008) 111:3615–3625.

    van Driel MA, et al. A text-mining analysis of the human phenome. Eur. J. Hum. Genet. (2006) 14:535–542.

    Wijsman EM, et al. Evidence for a novel late-onset Alzheimer disease locus on chromosome 19p13.2. Am. J. Hum. Genet. (2004) 75:398–409.

    Wood LD, et al. The genomic landscapes of human breast and colorectal cancers. Science (2007) 318:1108–1113.

    Wu X, et al. Network-based global inference of human disease genes. Mol. Syst. Biol. (2008) 4:189.

    Zouali H, et al. A susceptibility locus for early-onset non-insulin dependent (type 2) diabetes mellitus maps to chromosome 20q, proximal to the phosphoenolpyruvate carboxykinase gene. Hum. Mol. Genet. (1997) 6:1401–1408.

