Bioinformatics Advance Access originally published online on January 24, 2008
Bioinformatics 2008 24(5):724-726; doi:10.1093/bioinformatics/btm617
Jenti: an efficient tool for mining complex inbred genealogies
¹Twin Research & Genetic Epidemiology Unit, King's College London, UK and ²Institute of Genetic Medicine, European Academy, Bolzano, Italy
*To whom correspondence should be addressed.
ABSTRACT
Summary: We describe an efficient tool for mining complex inbred genealogies that identifies clusters of individuals sharing the same expected amount of relatedness. It also allows sub-pedigrees suitable for genetic mapping to be reconstructed in a systematic way.
Availability: http://www.jenti.org
Contact: m.falchi{at}imperial.ac.uk
A promising approach to dissecting the genetics of complex traits is to focus on isolated populations with a small number of founders. In these isolates the expected number of phenotype-influencing variants is likely to be reduced, and the environment shared among individuals is more uniform than in outbred populations (Peltonen et al., 2000; Shifman and Darvasi, 2001; Wright et al., 1999). The value of genetic isolates is often enhanced by the availability of extensive historical and archival records that allow the inheritance pattern of extant individuals to be tracked through the generations. Although knowledge of these relationships can increase the efficiency of a study, the study design and the statistical analysis must be planned carefully, keeping in mind the peculiarity of a sample in which most individuals are related to each other through multiple lines of descent. The main issues are the non-independence of the subjects' genotypes in population-based designs, which biases association results because of linkage, and the complexity of these large pedigrees, which often prohibits using them in their entirety in family-based designs. Here, we describe Jenti, a user-friendly tool that assists the user in selecting sub-samples suitable for genetic studies on the basis of genealogical information.
Given the genealogical connections between two individuals a and b, their genetic relatedness can be described by the kinship coefficient (or coefficient of consanguinity) $\phi(a,b)$, the probability that a randomly chosen pair of alleles, one from each individual, are inherited copies of the same ancestral allele (Malécot, 1948). The kinship coefficient is related to the expected proportion of alleles shared identical-by-descent between individuals across the genome (e.g. Glaubitz et al., 2003). All pairwise connections between the individuals of a genealogy, expressed by the kinship coefficient, can be exploited to cluster optimal sub-group(s) of individuals whose members share a given range of genetic relatedness.
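The kinship coefficient can be computed recursively from parent-child links alone. The following is a minimal sketch of the standard textbook recursion, not Jenti's own code; the toy pedigree, the `PEDIGREE` dictionary layout and the function names are assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical toy pedigree: individual -> (father, mother).
# Founders have unknown parents, written as (None, None).
PEDIGREE = {
    "gf": (None, None), "gm": (None, None),
    "p1": (None, None), "p2": (None, None),
    "s1": ("gf", "gm"), "s2": ("gf", "gm"),   # full siblings
    "c1": ("s1", "p1"), "c2": ("s2", "p2"),   # first cousins
    "i": ("c1", "c2"),                        # child of first cousins
}

@lru_cache(maxsize=None)
def depth(ind):
    """Generation depth: 0 for founders, else 1 + the deepest parent."""
    parents = [p for p in PEDIGREE[ind] if p is not None]
    return 1 + max(depth(p) for p in parents) if parents else 0

@lru_cache(maxsize=None)
def kinship(a, b):
    """Kinship coefficient between a and b (0.0 if either is unknown)."""
    if a is None or b is None:
        return 0.0
    if a == b:
        father, mother = PEDIGREE[a]
        # Self-kinship is (1 + F) / 2, where F is the kinship of the parents.
        return 0.5 * (1.0 + kinship(father, mother))
    # Recurse on the deeper individual, which cannot be an ancestor of the other.
    if depth(a) < depth(b):
        a, b = b, a
    father, mother = PEDIGREE[a]
    return 0.5 * (kinship(father, b) + kinship(mother, b))

def inbreeding(ind):
    """Inbreeding coefficient: kinship between the individual's parents."""
    return kinship(*PEDIGREE[ind])

print(kinship("s1", "s2"), kinship("c1", "c2"), inbreeding("i"))
```

Memoization (`lru_cache`) matters here: in an inbred genealogy the same ancestor pair is reached through many lines of descent, and caching avoids recomputing it each time.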
The sub-sample of individuals sharing the lowest degree of kinship with each other is likely to be representative of the allele-frequency distribution in the population, since the bias due to the relationships among individuals is minimized. Such a sample is therefore suitable for investigating background linkage disequilibrium patterns, for choosing an appropriate SNP density for a genome-wide association panel, or for SNP discovery in candidate gene/region studies. Selecting a sub-sample of the least related case-control subjects also avoids spurious associations due to linkage in an association-based study (e.g. Gianfrancesco et al., 2003).
On the other hand, sub-groups of individuals sharing the highest degree of relationship can be extracted from the population sample to partition a large genealogy into more manageable subunits. Identifying genetically homogeneous sub-groups maximizes the useful information for family-based genetic mapping studies while keeping the computations simple and efficient. Moreover, different underlying genetic models and variant frequencies can be assumed by including more or less closely related subjects in the same sub-pedigrees, thus increasing the power of mapping studies. This approach has been used to systematically partition large genealogies for linkage analyses of both quantitative and qualitative traits (Ciullo et al., 2006; Falchi et al., 2004; Liu et al., 2007).
While straightforward in principle, systematic clustering of individuals into optimal sub-groups is infeasible without an appropriate computational tool, because of the massive number of possible configurations. Indeed, if n individuals are related through a single genealogy, they provide up to n(n–1)/2 pairwise kinship relationships. A pedigree can be represented by an undirected graph whose vertices (V) correspond to individuals and whose edges (E), each connecting two vertices, are weighted by a pairwise measure of relatedness between the two individuals, such as their kinship.
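As a small illustration of this representation, the sketch below builds an adjacency structure from pairwise kinship values, keeping an edge only when the kinship falls inside a chosen range. The kinship values, the `frozenset`-keyed dictionary and the `kinship_graph` function are hypothetical choices for this example, and thresholding the weighted graph in this way is an assumption about how a kinship range would be applied.

```python
from itertools import combinations

def kinship_graph(individuals, kin, lo, hi):
    """Return {vertex: set of neighbours}, with an edge wherever lo <= kinship <= hi."""
    adj = {v: set() for v in individuals}
    for a, b in combinations(individuals, 2):
        if lo <= kin[frozenset((a, b))] <= hi:
            adj[a].add(b)
            adj[b].add(a)
    return adj

# Hypothetical pairwise kinship values for four individuals.
individuals = ["a", "b", "c", "d"]
kin = {
    frozenset(("a", "b")): 0.25,
    frozenset(("a", "c")): 0.0625,
    frozenset(("a", "d")): 0.0,
    frozenset(("b", "c")): 0.0625,
    frozenset(("b", "d")): 0.015625,
    frozenset(("c", "d")): 0.125,
}

# Number of unordered pairs: n(n-1)/2 for n individuals.
n_pairs = len(list(combinations(individuals, 2)))

# Keep only distantly related pairs (kinship at most 1/16).
adj = kinship_graph(individuals, kin, lo=0.0, hi=0.0625)
print(n_pairs, adj)
```

With four individuals there are six pairs to evaluate; the number grows quadratically, which is why the pairwise structure quickly becomes unmanageable by hand.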
Using this representation, identifying the largest subgroup of equally distantly related individuals can be seen as finding the maximum clique in a graph, which has been proven to be an NP-complete problem (Garey and Johnson, 1979). A clique is a complete sub-graph, that is, one in which each pair of vertices is joined by an edge; a maximum clique is a clique of maximum cardinality.
Jenti implements a version of the Bron and Kerbosch (1973) algorithm, which identifies the maximum clique in an undirected graph using two backtracking algorithms based on a branch and bound technique, the basis for most recent clique algorithms (Babel, 1991; Balas and Niehaus, 1996). Pedigree partitioning can be viewed as a graph partitioning problem, which is also a well-known NP-complete problem, and can be solved by iteratively searching for the maximum clique of the graph and deleting it from the graph until no vertices are left. We improved this approach, first proposed in Falchi et al. (2004), and integrated it into a flexible, user-friendly framework.
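The clique search and the iterative partitioning can be sketched as follows. This is a generic Bron-Kerbosch enumeration with pivoting, written for illustration and not taken from Jenti; the function names, the tie-breaking rule and the optional `min_size` cut-off are assumptions.

```python
def bron_kerbosch(adj, r, p, x, out):
    """Collect every maximal clique extending r into the list `out`."""
    if not p and not x:
        out.append(set(r))
        return
    # Pivot on the vertex with most neighbours in p to prune branches.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v], out)
        p = p - {v}
        x = x | {v}

def maximum_clique(adj):
    """Largest clique; ties are broken by sorted vertex names (an arbitrary rule)."""
    cliques = []
    bron_kerbosch(adj, set(), set(adj), set(), cliques)
    return min(cliques, key=lambda c: (-len(c), sorted(c)))

def clique_partition(adj, min_size=1):
    """Repeatedly extract the maximum clique and delete it from the graph."""
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    groups = []
    while adj:
        clique = maximum_clique(adj)
        if len(clique) < min_size:
            break  # remaining vertices cannot form a large enough group
        groups.append(clique)
        for v in clique:
            del adj[v]
        for nb in adj.values():
            nb -= clique
    return groups

# Hypothetical graph: a triangle a-b-c, a chain c-d-e and an isolated vertex f.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
print(clique_partition(graph))
```

Because several maximum cliques of equal size may exist, the result of such a greedy partition depends on the tie-breaking rule, so different choices can yield different partitions.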
A graphical interface (Fig. 1) assists the user step-by-step in the selection process, from the exploration and cleaning of the whole genealogical data to the manual or semi-automatic clustering of individuals in homogeneous sub-groups.
The program takes genealogical data in standard linkage format as input. Kinship and inbreeding matrices are computed on the fly using a recursive algorithm from Lange (1997). An automatic exploration feature helps determine the natural clustering of the data. Given a kinship range and an optional constraint on the minimum and maximum size of each extracted sub-group, the clustering process can identify:
- the largest sub-sample of individuals sharing with each other a particular expected amount of relatedness;
- the optimal partition in non-overlapping sub-samples.
Since several equivalent maximum cliques might be selected to nucleate or to extend a partition, the space of possible solutions is iteratively explored through permutations to maximize the overall number of informative pairwise relationships in the extracted dataset. The user can actively customize the proposed solution. Subjects manually removed from the extracted sub-group(s), along with those individuals who did not fit the user's selection, can be reprocessed using different clustering parameters and integrated within the main partition scheme.
Jenti can automatically generate the pedigree connecting all the individuals within each sub-sample, using a subset of the ancestry that preserves most of the relationships among individuals, as observed in the whole genealogy, while limiting the sub-pedigree size. Alternatively, a user-specified depth can be applied to the reconstruction of all the sub-pedigrees (or a subset of them) to increase the inbreeding information. The integration with PedVizApi allows the pedigree data to be visualized and explored interactively using either 2D or 2.5D layouts, depending on the pedigree complexity, thus supporting the decision process. The Jenti package is written entirely in Java and can therefore be run on any operating system for which Java is supported.
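The depth-limited variant of sub-pedigree reconstruction can be illustrated with a short sketch. It covers only the user-specified depth option, not the automatic selection of the most informative ancestry, and the toy pedigree, the `sub_pedigree` function and its parameters are hypothetical.

```python
# Hypothetical toy pedigree: individual -> (father, mother).
PEDIGREE = {
    "gf": (None, None), "gm": (None, None),
    "p1": (None, None), "p2": (None, None),
    "s1": ("gf", "gm"), "s2": ("gf", "gm"),
    "c1": ("s1", "p1"), "c2": ("s2", "p2"),
}

def sub_pedigree(pedigree, members, max_depth):
    """Keep the members plus their ancestors up to max_depth generations back."""
    depth = {m: 0 for m in members}
    frontier = list(members)
    while frontier:
        next_frontier = []
        for ind in frontier:
            if depth[ind] == max_depth:
                continue  # do not climb above the requested depth
            for parent in pedigree[ind]:
                if parent is not None and parent not in depth:
                    depth[parent] = depth[ind] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    # Individuals whose parents were not both kept become founders, so that
    # every row has either two known parents or none.
    result = {}
    for ind in depth:
        father, mother = pedigree[ind]
        both_kept = father in depth and mother in depth
        result[ind] = (father, mother) if both_kept else (None, None)
    return result

shallow = sub_pedigree(PEDIGREE, ["c1", "c2"], max_depth=1)
deep = sub_pedigree(PEDIGREE, ["c1", "c2"], max_depth=2)
print(len(shallow), len(deep))
```

In this toy case a depth of 1 keeps the cousins' parents but drops the shared grandparents, so the two cousins look unrelated in the sub-pedigree; a depth of 2 restores the link. This is the trade-off between sub-pedigree size and preserved relationship information described above.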
ACKNOWLEDGEMENTS
The authors would like to thank Paola Forabosco and Céline Bellenguez for their helpful comments and suggestions.
Conflict of Interest: none declared.
FOOTNOTES
Associate Editor: Keith Crandall
Received on September 11, 2007; revised on November 18, 2007; accepted on December 10, 2007
REFERENCES
Allendorf FW, Phelps SR. Use of allelic frequencies to describe population structure. Can. J. Fish. Aquat. Sci (1981) 38:1507–1514.
Babel L. Finding maximum cliques in arbitrary and in special graphs. Comput (1991) 15:321–341.
Balas E, Niehaus W. DIMACS Series. Discrete Math. Theor. Comput. Sci (1996) 26:29–52.
Bron C, Kerbosch J. Finding all cliques of an undirected graph (Algorithm 457). Commun. ACM (1973) 16:575–577.
Ciullo M, et al. New susceptibility locus for hypertension on chromosome 8q by efficient pedigree-breaking in an Italian isolate. Hum. Mol. Genet (2006) 15:1735–1743.
Falchi M, et al. A genomewide search using an original pairwise sampling approach for large genealogies identifies a new locus for total and low-density lipoprotein cholesterol in two genetically differentiated isolates of Sardinia. Am. J. Hum. Genet (2004) 75:1015–1031.
Garey M, Johnson D. Computers and Intractability: A Guide to the Theory of NP-completeness. (1979) New York: W.H. Freeman.
Gianfrancesco F, et al. Identification of a novel gene and a common variant associated with uric acid nephrolithiasis in a Sardinian genetic isolate. Am. J. Hum. Genet (2003) 72:1479–1491.
Glaubitz JC, et al. Prospects for inferring pairwise relationships with single nucleotide polymorphisms. Mol. Ecol (2003) 12:1039–1047.
Kruglyak L. Prospects for whole genome linkage disequilibrium mapping of complex disease genes. Nat. Genet (1999) 22:139–144.
Lander ES, Schork NJ. Genetic dissection of complex traits. Science (1994) 265:2037–2048.
Lange K. Mathematical and Statistical Methods for Genetic Analysis. (1997) New York: Springer-Verlag.
Liu F, et al. A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. Am. J. Hum. Genet (2007) 81:17–31.
Malécot G. Les Mathématiques de l'Hérédité. (1948) Paris: Masson et Cie.
Peltonen L, et al. Use of population isolates for mapping complex traits. Nat. Rev. Genet (2000) 1:182–190.
Shifman S, Darvasi A. The value of isolated populations. Nat. Genet (2001) 28:309–310.
Wright AF, et al. Population choice in mapping genes for complex diseases. Nat. Genet (1999) 23:397–404.
