Skip Navigation


Bioinformatics Advance Access originally published online on February 29, 2008
Bioinformatics 2008 24(7):958-964; doi:10.1093/bioinformatics/btn064
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/7/958    most recent
btn064v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sampson, J. N.
Right arrow Articles by Self, S. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sampson, J. N.
Right arrow Articles by Self, S. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Identifying trait clusters by linkage profiles: application in genetical genomics

Joshua N. Sampson 1,* and Steven G. Self 1,2

1Department of Biostatistics, University of Washington and 2Statistical Center for HIV/AIDS Research and Prevention, Seattle, WA, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Genes often regulate multiple traits. Identifying clusters of traits influenced by a common group of genes helps elucidate regulatory networks and can improve linkage mapping.

Methods: We show that the Pearson correlation coefficient, Formula, between two LOD score profiles can, with high specificity and sensitivity, identify pairs of genes that have their transcription regulated by shared quantitative trait loci (QTL). Furthermore, using theoretical and/or empirical methods, we can approximate the distribution of Formula under the null hypothesis of no common QTL. Therefore, it is possible to calculate P-values and false discovery rates for testing whether two traits share common QTL. We then examine the properties of Formula through simulation and use Formula to cluster genes in a genetical genomics experiment examining Saccharomyces cerevisiae.

Results: Simulations show that Formula can have more power than the clustering methods currently used in genetical genomics. Combining experimental results with Gene Ontology (GO) annotations show that genes within a purported cluster often share similar function.

Software: R-code included in online Supplementary Material.

Contact: joshua.sampson{at}yale.edu

Supplementary information: Supplementary materials are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
Genetic linkage analysis, or linkage mapping, is a powerful tool for locating genes influencing quantitative, or continuously varying, traits. For linkage mapping, the trait of interest is measured on a group of related individuals and then the genotypes at a set of genetic markers (i.e. single nucleotide polymorphisms) are recorded for that same group. Markers that are strongly correlated with the trait are reported as quantitative trait loci (QTL).

When a single gene regulates two or more traits, an occurrence known as pleiotropy, the power to detect that gene and the precision of its estimated location are often improved by mapping all regulated traits simultaneously (Jiang and Zeng, 1995). Given a set of genetically correlated traits, several methods are available for joint linkage analysis. Maximum likelihood approaches can be applied to multivariate distributions (Chen, 2005; Korol et al., 1996). Haley–Knott regression is easily adapted to multiple traits by using multivariate regression and ANOVA (Knott and Haley, 2000). Principal components analysis can transform the traits into a set of orthogonal, canonical variables and each component can then be analyzed by standard, single-trait methods (Mangin et al., 1998; Weller et al., 1996). These methods have been used extensively in recent years, uncovering genes influencing milk production in cows, grain yield in wheat, and multi symptom illnesses in a variety of organisms (Kraft et al. 2003; Mangin, et al. 1998; Martin et al. 2003).

Before benefiting from such methods, we must first identify a set of genetically coregulated traits. We quantify genetic coregulation by averaging the percentage of influential genes common to both traits, C(·, ·). Usually traits have been clustered because of biological relationships or prior experiments. However, our knowledge may be limited to the data collected for the linkage studies. Therefore, using only the recorded trait values and marker genotypes, we want a method to determine if all, or a subgroup, of those traits are genetically coregulated. If the traits could share only a single gene, such a method would analyse each marker separately and mimic the previously listed joint mapping methods. However, the heritable traits still being studied are influenced by multiple genes, and two traits sharing one gene would likely share a set of genes.

In this article, we formalize a novel method for identifying groups, or clusters, of traits likely to share common QTL (Schadt et al., 2005). The need for such a method has dramatically increased with the emergence of genetical genomics. Genetics and genomics can be combined by measuring the expression levels of thousands of genes simultaneously in a group of related individuals and then treating each expression level as a quantitative trait for linkage mapping (Brem et al., 2002; Li and Burmeister, 2005; Segal et al., 2003). QTL controlling expression levels, eQTL, have been successfully identified in mice, yeast, wheat, and humans (Li and Burmeister, 2005; MacLaren and Sikela, 2005; Yvert et al., 2003). Here, we will offer a way to cluster genes regulated by common eQTL, thereby improving our estimates of QTL locations and identifying collections of genes participating in the same biological pathway.

Our clustering method is based on the correlation, Formula , between the LOD score profiles of two traits. Given a large group of traits, we form clusters so the value of Formula between any two members exceeds some threshold. We will introduce this method in the context of a single genetical genomics experiment. Until now, these analyses have generally formed clusters so all pairs within a cluster have highly correlated expression levels. We can therefore compare our method to this established standard. In other cases, such as clustering genes where the expression levels were measured on different populations, there is no alternative to clustering by Formula .

The remainder of the article is divided as follows. First, we introduce notation and briefly review LOD Scores. Second, we introduce Formula as an estimate for C(j1, j2). Third, we discuss Formula as a similarity measure, and show how combining this measure and an appropriate algorithm can identify clusters of genetically coregulated traits (Hastie et al., 2001). Fourth, we apply our clustering method to simulations and the experimental results from Yvert et al.'s study of Saccharomyces cerevisiae (Yvert et al., 2003).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Notation
Assume a cross between two strains, ST–1 and ST1, of yeast producing n individuals. For subject i, i isin {1, ..., n}, let Yji be the expression level for gene j and let Formula be called the expression, or phenotypic, profile of gene j. Let subject i be genotyped at N markers, and let Gti be the genotype at position t, where Gti = 1 (Gti = –1) if the allele is from ST1 (ST–1). {Gti1, Yji1} and {Gti2, Yji2} are independent if i1 != i2. Recall, yeast are haploid and have only two possible genotypes with Formula , i.

To simplify the discussion, assume that all genes are located at markers. We say that gene t, marker t, or QTL t influences the expression of gene j if f(Yji | Gti = –1) != f(Yji | Gti = 1), ignoring the possibility of epistasis. Furthermore, let Rj {equiv} {tj1, ..., tjNj} be the positions of the Nj QTL influencing the expression of gene j. The LOD score, Formula , is the log10 of the likelihood ratio statistic testing whether t isin Rj (see Supplementary Material for definition) and can be approximated by Haley–Knott regression: Formula . Here, SST and SSE are the total and residual sum of squares from solving equation 1 (Haley et al., 1994). The vector of LOD scores, Formula will be called the LOD score profile of gene j.


Formula 1

(1)


Formula

Here Formula and Formula is the least-squares estimate of Formula .

Let 1tisinRj = 1 if t isin Rj, 0 otherwise. For trait j, j isin {j1, j2}, Formula is the proportion of QTL that are common to both traits. We measure the genetic coregulation by the geometric mean, C(j1, j2), of the two proportions.


Formula 2

(2)
Our goal is to find clusters, T1, T2, ..., Tg, of traits such that Formula is large.

2.2 Estimating C(j1, j2)
Define the LOD score correlation coefficient, Formula , for traits j1 and j2 by


Formula 3

(3)
where Formula . Let trait j1 be influenced by N1 genes and trait j2 be influenced by N2 genes. Define k1 and k2 by N1 = k1N and N2 = k2N, where 0 < k1, k2 < 1. Furthermore, let the two traits share N'' = k'N genes where 0 < k' < k1, k2. For large n, we show (Appendix A) that under three mild assumptions,


Formula 4

(4)

A second approach for estimating C(j1, j2) would be to first estimate 1tisinRj1 and 1tisinRj2 for each potential QTL t. Let Formula if Xjt is a local maximum and exceeds some threshold >0.22, 0 otherwise. Then, calculating LOD scores by composite interval mapping (CIM) promises


Formula 5

(5)
where Formula for j isin {j1, j2}. We avoid this approximation because it fails horribly for real sample sizes. Let QTL t' affect traits j1 and j2. Even when n is large, the estimated locations for QTL t' will rarely coincide perfectly resulting in Formula for all t close to t'. This approximation does not account for the distance between the two estimated positions. The obvious correction would be using a continuous measure estimating P(t isin Rj) or evidence of linkage instead of Formula . As such evidence is often quantified by Xjt, we have returned to our original approach.

Returning to our original focus, we calculate Formula for all possible pairs of traits and input those values in a proximity matrix, D. Let Dj1, j2 = C(j1, j2), where Dj1, j2 is the entry in the j1th row and j2th column of D. Given a proximity matrix, there are numerous methods available for finding groups, T1, T2, ..., Tg of traits such that Formula is large. If our estimates of C(j1, j2) are accurate, then the identified clusters will result in a large value of Formula .

2.3 Similarity measures
Finding clusters, T1, T2, ..., Tg, of traits resulting in a large value of Formula does not require estimating C(j1, j2). We can circumvent this step by identifying a statistic, or measure, D, that is highly correlated with C. Given such a measure, we can construct a proximity matrix, D, where Dj1, j2 = D(j1, j2). Applying an appropriate clustering method to D would result in groups of traits with a large value of Formula . Three candidate measures are the correlation between expression levels, the correlation between vectors Formula and Formula , and the correlation between LOD score profiles. We must avoid methods based on variance components because the genetic component is not identifiable for Yvert et al.'s yeast experiment or any experiment where the population is the progeny of a single cross.

The most prevalent method for finding genes linked to common eQTL is clustering by expression profile (Brem and Kruglyak, 2005; Eisen et al., 1998; Yvert et al., 2003). This method is equivalent to defining Formula , where Formula estimates the Pearson correlation coefficient, Formula , between Formula and Formula .


Formula 6

(6)
In Yvert et al.'s experiment, genes were clustered into groups, T1,T2, ..., Tg such that if j1, j2 isin Tk, then Formula . Here, genes could be members of multiple groups. In many cases, {rho}P, and therefore Formula should be strongly correlated with C(·, ·). Consider a linear model describing the expression levels of two genes, each controlled by a single QTL.


Formula 7

(7)
where {varepsilon}j1i and {varepsilon}j2i are independent and normally distributed with mean = 0 and variances = Formula , Formula , respectively. When both genes are influenced by the same eQTL, t = t',


Formula 8

(8)

When there is only one QTL, the following, desirable, statement is true: |{rho}P| > 0 if and only if the two traits are genetically coregulated. A discrete measure, C(j1, j2) isin {0, 1} will never perfectly correlate with a continuous measure, D(j1, j2). However, if Formula is constant for all genes, {rho}P, is an increasing function of the genetic effect sizes, βj1 and βj2. Therefore, as the influence of the shared QTL increases, the Formula will also increase, another highly desirable characteristic. Unfortunately, problems can arise when multiple genes influence each trait. Then, even the simple statement from above fails, as {rho}P = 0 no longer implies the absence of genetic correlation. We give an example later where {rho}P(j1, j2) = 0 and C(j1, j2) = 1.

A second possible measure is Formula , the Pearson correlation coefficient between Formula and Formula , where Formula is the least squares estimate of βj· in Equation (10). If we believe that two genes share a common function only when they share the same QTL and when those QTL have identical influences, Formula would be the preferred statistic. However, this second condition is superfluous if our ultimate aim is to group genes for QTL mapping, so we still prefer a measure correlated with C(j1, j2). Therefore, Formula has two drawbacks. The signs of Formula and Formula affect Formula and the size of Formula , not the evidence for linkage, affect Formula . The logical replacement for Formula , which addresses those flaws, is the F-statistic Formula . However, F = (n 2)(10–2Xjt – 1) so we are lead back to a function based on the LOD scores for our correlate to C(j1, j2).

The third possible measure, and our focus for the remainder of the article is Formula . Unlike Formula , Formula incorporates both expression and genetic data. Furthermore, Formula can compare traits measured on distinct populations. Both traits could be expression levels or one trait could be a more general characteristic, such as size or life expectancy. Finally, there is a robust correlation between Formula and C even in small samples, which will be illustrated by later examples.

2.4 Asymptotic properties of Formula
Let traits j1 and j2 have the distribution described by equation 10. To simplify calculations, we make the following three assumptions: (1) The entire genome is on a single chromosome. (2) The markers are evenly spaced at intervals of length d cM. (3) Haldane's mapping rule describes the recombination rate between loci. Under the absolute null hypothesis, Formula , we proved that Formula converges to a normal distribution as N, n -> {infty} (Supplementary Material).


Formula 9

(9)
where Formula is the asymptotic variance of Formula and var(Xj1t) = var(Xj2t) = 2/4.612. In the Supplementary Material, we show that as long as there are at least 200 subjects, the null distribution of Formula should be nearly normal and can be well approximated by the asymptotic null for a chromosome of sufficient length. For some genomes or when the number of available markers is small, the null distribution of Formula can be heavily skewed and can be better approximated by modifying the asymptotic distribution (see Supplementary Material). The appropriately calculated null distribution can provide P-values, and other measures of statistical significance, for experimental results. Violating any of the three assumptions had little effect on the null distribution (see Supplementary Material).

2.5 Clustering method
We propose a three-step method for identifying clusters of traits that share common QTL. (1) Calculate the truncated LOD score profile, Formula for each of the n traits, where Xtruncjt = min(Xjt,6). Without truncation, traits with an extreme LOD score at position t will be correlated to any trait j where Xjt is even modestly larger than E[Xjt]. Truncation also ensures that Formula is only large when traits share multiple QTL. Simulations suggested the threshold of six greatly reduced the type I error rate without noticably lowering power. (2) Form an nxn similarity matrix where the j1, j2th entry is Formula , calculated using the truncated LOD scores at all markers. (3) Use a heierarchical clustering method, (such as hclust, method=‘complete’ in R), to order the traits. Then, select groups of traits where Formula for all included trait pairs, j1 and j2, where c is a predefined threshold. These groups can be subsequently ranked by their size and/or mean value of Formula .

2.6 Simulations
We could simulate small groups of coregulated traits, and then examine whether the above method can correctly cluster those subgroups when applied to the union of all traits. However, these simulations would introduce multiple variables simultanously. Instead, we focus on groups of two coregulated traits, j1 and j2, and calculate the probability of correctly clustering those two traits together, or equivalently, calculating the Formula . We refer to Formula as the ‘power’ of our clustering method, because in this two trait example, our clustering method is equivalent to a test that rejects the null hypothesis H0 : {R1} {cap} {R2} = Ø when Formula . We define c{alpha} so the probability of clustering two unrelated traits together, the ‘type I error rate’, is {alpha}. In each of the scenarios described below, we generate 10 000 values of Formula under the null set-up and define the 95th percentile as c0.05. We then generate values under the alternative, and define the proportion exceeding c0.05 as the power. Full, as opposed to truncated, LOD scores could be used to calculate Formula because we avoided genes with extreme effects. Using identical methodology, we also calculate the power for the test rejecting H0 when Formula is large.

For each individual in a group of offspring, we simulated two phenotypes and a set of SNPs, spaced every 10 cM, along a single chromosome. The first marker was randomly assigned to be –1 or 1, indicating parental origin, and the remaining markers were generated according to Haldane's mapping function. In the first set of simulations, the phenotypes, Yj1 and Yj2 were generated according to equation 10 with Nj1 = Nj2 = 1. Data was simulated for 100 subjects when the trait heritability, H, was 0.05, 0.10 and 0.20 and for 1000 subjects when H was 0.010, 0.015 and 0.020. In this simple model, Formula . Simulations were repeated for genome lengths of 1000, 5000, and 10 000 cM. Under the null hypothesis, the genes influencing traits j1 and j2 were located at the 0.3Nth and 0.7Nth marker, respectively, and under the alternative, both genes were located at the 0.5Nth marker.

For the second set of simulations, we fixed the number of subjects (100), heritability of each QTL (i.e Formula , and genome length (5000 cM). The two phenotypes Yj1 and Yj2 were generated according to the following equation.


Formula 10

(10)
where there exists a single β such that | βjt | = β when | βjt | > 0 for j isin {j1, j2} and t isin {1, ..., N}. Also, let var({varepsilon}j1i) = var({varepsilon}j2i) = {sigma}{varepsilon}. Each phenotype had two, four or six influential QTL Formula . The total heritability, HT, was defined by Formula . The set of Nj QTL regulating phenotype j can be abbreviated by Rj {1, ..., N}. The influential genes for trait j1 were evenly spaced over the first half of the chromosome, R1 = {N/(2N1), ..., N1N/(2N1)}. Under the null hypothesis, the influential genes for trait j2 were evenly spaced along the second half of the chromosome, R2 = N/2 + {N/(2N1), ..., N1N/(2N1)}, whereas under the alternative, they were shifted by a constant so that specified percentage of QTL overlapped, R2 = constant + {N/(2N1), ..., N1N/(2N1)}. With multiple QTL, the direction of effect can influence our results. In addition to letting all elements of Formula and Formula be positive, we also examine the scenario where the signs of Formula alternate between positive and negative (i.e. βj2t21 = –βj2t22 = βj2t23 = –βj2t24). To define the concept of alternating more generally, let sjt = sign(βjt) and Formula . Then the alternating case is defined by S = 0.

2.7 Experiment
Yvert and colleagues (2003) measured the expression of 6818 genes in laboratory (BY) and wild (RM) strains of S.cerevisiae and in 112 segregants from a cross between them. Including multiple replicates for each parent, expression was measured on 130 samples. In addition to this genomics data, their lab genotyped each member of the two generations at 3114 genetic markers. Because genotypes at adjacent markers were often nearly identical, only a subset of 1063 marker locations were chosen for calculating the LOD score correlation coefficient. A complete description of the experiment has been previously published (Yvert et al., 2003).

2.8 Composite interval mapping
In the analysis of both simulations and experimental results, we calculated the LOD score profile using CIM. In addition to the locus of interest, the nearest markers, on each side, at least 15 cM away were also included in the Haley–Knott regression. For the experimental results, the ‘at least 15 cM’ requirement was replaced by ‘differing genotypes for at least 15 subjects’. All significant loci outside the surrounding interval were also included in the regression. The group of significant loci, {Omega}, was initially defined as t', where Xjt' = maxt(Xjt). Given a set {Omega}, the LOD score, Formula , for each position was then recalculated with all significant loci included in the regression. Until Formula or {Omega} contained eight loci, t' was added to {Omega}, where Formula . If Formula , and Formula , t'' was dropped from the set {Omega}. This abbreviated version of CIM was chosen to accommodate the large number of traits.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Simulations
Fundamental characteristics of Formula are revealed by simulating two traits, j1 and j2, each controlled by a single, and possibly common, gene (Table 1). As previously discussed Formula usually exceeds 0 if and only if C(j1, j2) > 0. When each trait is controlled by a distinct gene, the maximum of E[Xj1t] (E[Xj2t]) occurs at a position t where E[Xj2t] (E[Xj1t]) is small, resulting in negative correlation. As predicted, Formula increases dramatically with heritability, for example, ranging from 0.16 (H = 0.05) to 0.73 (H = 0.20) when there are 100 subjects and a 1000 cM chromosome. In general, Formula will also increase with the number of subjects and decrease with genome size. The variance of Formula decreases with chromosome length, because for any given β, Formula . The variance shrinks as the sample grows because Formula .


View this table:
[in this window]
[in a new window]

 
Table 1. The Table 1 is calculated under the null (R0 {cap} R1 = Ø) and alternative (R0 {cap} R1 != Ø) hypotheses under multiple combinations of subject number, heritability, and chromosome length. The power of the test, reject H0: R0 {cap} R1 = Ø when Table 1, is derived through simulation. For comparison, Table 1 and the power from a Table 1-based test are also listed

 
In all scenarios, the power for rejecting H0 was far greater when using tests based on Formula , compared to tests based on Formula (Table 1). The relative advantage of the former appears to increase with sample size. With 100 subjects, the power is larger by about a factor of 2, whereas with 1000 subjects, the power is larger by about a factor of 10. The relative power is increasing, in part, because Formula quickly approaches 1 as n increases, whereas Formula remains constant.

Figure 1 shows that Formula increases with C(j1, j2) and HT. With only a 100 subjects, the population size of Yvert's; experiment, HT still affects Formula and Formula is a poor approximation of C(j1, j2). However, we see that even for sample sizes where Formula is not true, tests based on Formula still have power to identify correlated traits. Table 2 shows that these tests can be far more powerful than those based on Formula . Table 2 also shows the power tends to be slightly smaller when the signs of the elements of Formula alternate, even though the loci are, for practical purposes, independent of each other. In our simulations, whenever C(j1, j2) > 0, Formula Formula implying that contrary to the marginal distributions, the joint distribution of Formula and Formula depends on the signs of Formula and Formula . To understand this phenomena, we focus on the 2 QTL example. Let t1 and t2 be the locations of the two influential genes. Although the Formula , the Formula . When this event occurs and βj1t2 = βj2t2, both Xj1t1 and Xj2t1 tend to increase. When Formula , both Xj1t1 and Xj2t1 tend to decrease. The values of Xj1t1 and Xj2t1 change together, or in unison. The same is true for the LOD scores at t2. In contrast, when βj1t2 = – βj2t2, Xj1t1 tends to be higher than E[Xj1t1] when Xj2t1 is lower than E[Xj2t1]. Here, the extra error is negatively correlated. In the Supplementary Materials, we provide the mathematical details and show that Formula E[(Xj1t1 E[Xj1t1])(Formula ])] will be proportional to Formula . The superscript indicates whether S = 0 (–) or S = 1 (+) when comparing Formula and Formula . As with the single QTL examples, we again found that power for rejecting H0 was greater when using tests based on Formula , compared to tests based on Formula . Now, we see that when S = 0 or the signs of Formula alternate, the advantage of the former is even greater. As promised in section 2.3, in the examples with S = 0, the power to detect traits sharing common QTL using Formula is only equal to the alpha level, 0.05, even when C(j1, j2) = 1.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The Figure 1 (y-axis) are compared for multiple values of C(j1, j2) (x-axis) and multiple number of QTL, N1 = N2 isin {2,4,6} (equivalently HT isin {0.05, 0.09, 0.13}). Figure 1 was simulated using single point mapping, a 5000 cM chromosome, 100 subjects, and S = 1.

 

View this table:
[in this window]
[in a new window]

 
Table 2. The power for each of the following three tests: Reject H0: R0 {cap} R1 = Ø when Table 2 (Xj · calculated by either single point, s.p., mapping or CIM), and reject when Table 2, are derived for multiple combinations of QTL number, % overlap, and S. Simulations assume a 5000 cM genome and 100 subjects

 
3.2 Experiment
We calculated the LOD score profiles for the 6818 genes measured by Yvert and colleagues (2003). Following, the steps outlined in section 2.5, we formed clusters of genes where the values for Formula exceeded 0.4 for all member pairs. The threshold was decided after approximating the absolute null distribution for Formula by permuting the subject labels for each trait, recalculating the LOD score profiles with CIM, and then using those new profiles to calculate values of Formula . Applying our clustering method to these permuted traits, we found 643, 236, 50 and 3 clusters of two, three, four and five genes, respectively, where all member pairs had Formula exceeding 0.4. No clusters contained more than five genes. Since we focused our interest on larger clusters with at least 10 genes, we did not repeat this computationally expensive permutation step to estimate P-values for the smaller clusters. This permutation method leads to a distribution of Formula under the absolute null, Formula : There are no QTL. In practice, we often found little difference between this distribution and one assuming H0 (data not shown).

From the actual, experimental values, over 34 854 pairs had a Formula . We then ranked all clusters by their average value of Formula and focused on the top 20 clusters with at least 10 genes (see Supplementary Material for the genes within each cluster). Genes within these clusters tended to have the same molecular functions and biological processes, as determined by Gene Ontology (GO) annotations (see Supplementary Material). The GO project creates a common vocabulary to describe genes and their products (www.geneontology.org). For example, the biological process of all annotated genes in the highest ranking cluster included ‘DNA metabolic process’ and ‘Organelle organization and biogenesis’. The molecular function of all of these genes included ‘Helicase activity.’ Therefore, we now have potential functions for the six genes in this group that previously had no annotation. Figure 2 illustrates that the 16 genes in that cluster share multiple QTL. At least 67% of all annotated genes in eight additional clusters shared a common molecular function or biological process. Additionally, 77% of the genes in cluster 16 participate in a metabolic process, but the annotations discrimate between RNA, DNA, and amino acid metabolic processes. The mean Formula was 0.6 for the four genes labelled as part of the amino acid metabolic process, suggesting that those share nearly identical QTL with each other, but only a portion of their QTL with genes involved in other metabolic processes.


Figure 2
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Each row represents the LOD score profile for a gene in cluster 1. LOD scores are color-coded: 0–1 (white), 1–2 (yellow), 2–3 (orange), 3–4 (red), 4–5 (purple) and 5–6 (black). Numbers/blue lines on the x-axis indicate chromosomes.

 
The overall correlation between Formula and Formula was 0.37. Among the 215 661 pairs (1%) which were ranked highest by Formula , 64 232 were also within the top 1% of all pairs as ranked by Formula . Moreover, among the 20 clusters (≥ 10 genes) which ranked highest by Formula , most had large mean values of Formula (see Supplementary Material). Therefore, in this case, where both clustering methods are applicable, the clusters with the largest values of Formula would also be identifiable when clustering by phenotype, suggesting that few coregulated pairs can be described by the alternating scenario (S = 0).


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
The LOD score correlation coefficient, Formula , quantifies the evidence that two traits share common QTL. Ideally, Formula can be interpreted as C(·, ·), the average percentage of genes which are common to both traits. However, even in the absence of this ideal, we demonstrated that Formula will still be strongly correlated with C(·, ·) and that the statistic can be used to identify clusters of coregulated traits. In this article, both traits are expression profiles and are measured on a single population. Fortunately, Formula is equally valid when assessing the genetic coregulation for any two quantitative traits that are measured on any two populations. Therefore, Formula offers the most widely applicable and statistically rigorous means for identifying genetically coregulated traits.

As more laboratories are focusing on genetical genomics, we need new methods to synthesize results. Investigators often focus on a specific subset of genes. The subset may be determined by their expression platform or by their experimental goals. Each manufacturer includes a different set of probes (i.e. different genes) on their microarrays (Verdugo and Medrano, 2006) and labs often limit their measurements to a specific type of tissue or a specific set of genes thought to be associated with a disease (MacLaren and Sikela, 2005; Yamashita et al., 2005). As an example of how Formula will immediately impact the field of genetical genomics, we offer WebQTL (http://www.genenetwork.org), a database that includes LOD score profiles for an expansive list of traits in mouse, rat and barley. This website was designed to find coregulated genes. Currently it assesses coregulation by Formula and is limited to searching traits measured on the same cross. Our introduction of Formula will now allow for previously impossible comparisons.

Clearly, Formula has limitations because it is a function of LOD scores. The quality of Formula is limited by the quality of the LOD score profiles, which are often very noisy. False positives will occur when two traits are influenced by linked, but not identical, genes. Moreover, there are other statistics which may better compare two LOD score profiles. In the future, we should explore statistics that account for the number of overlapping QTL and preprocess the LOD scores by smoothing their profiles before comparison. With enough smoothing, correlation will be based only on the largest peaks. Also, we might search for a method to better estimate P-values and FDR. Currently, our null distributions, from theory or permutation, both assume the absence of any genetic influence, and therefore are only close approximations to the desired null distributions.

At this stage, we propose the obvious two step procedure for improved QTL mapping. First, search for genetically coregulated traits. Then, perform joint linkage mapping on those traits. The P-values from standard joint linkage methods are no longer valid, as we have purposely grouped traits that appear to have similar LOD score profiles. In future research, we hope to combine clustering and linkage so we can assign a single, meaningful, significance level for each purported gene-trait pair. As we improve our methods and genetical genomics continues to gain popularity, Formula , will become increasingly important in identifying genetic coregulation.


    APPENDIX A
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
When the following three assumptions hold and the number of subjects and markers are large, Formula .

ASSUMPTION 1. k1, k2 < <1

ASSUMPTION 2. The genes are independent of each other. For any two positions, t1 and t2, P(Gt1 = 1 | Gt2) = 0.5.

ASSUMPTION 3. Genetic effects are described by the linear model in Equation (10). with its accompanying restrictions of Formula and {sigma}{varepsilon}.

We show that violating these assumptions will have minimal effect in the Supplementary Material. Without loss of generality, let genes 1, ..., N1 be the influential QTL for trait j1. As Lander and Botstein (1989) point out, the expected score per progeny, or ELOD, can be described by


Formula 11

(11)
when Formula . Moreover, we know that E[Xj1t'] {approx} 0.22 when Formula . By Assumption 3, ELOD(j1, t') is the same for all t' < N1. Let ELOD {equiv} ELOD(j1, t') when Formula . Therefore,


Formula 12

(12)
Furthermore, when Formula


Formula 13

(13)


Formula 14

(14)
When Formula ,


Formula 15

(15)


Formula 16

(16)
Here F {approx} G if F/G ->n 1. Using these results and their counterparts for j2,


Formula 17

(17)
Next, we approximate Formula using equations 13 and 15.


Formula 18

(18)
Approximations 17, its counterpart for j2, and 18 show that


Formula 19

(19)


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work was funded by National Institute of Dental and Craniofacial Research (T32DE07132). We wish to thank John Storey and Elizabeth Thompson for their valuable comments, and also thank the anonymous reviewers for their extremely insightful suggestions.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on November 3, 2007; revised on January 11, 2008; accepted on February 17, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 APPENDIX A
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Brem R, Kruglyak L. The landscape of genetic complexity across, 5700 gene expression traits in yeast. PNAS (2005) 102:1572–1577.[Abstract/Free Full Text]

    Brem RB, et al. Genetic dissection of transcriptional regulation in budding yeast. Science (2002) 296:752–755.[Abstract/Free Full Text]

    Chen Z. The full em algorithm for the mles of qtl effects and positions and their estimated variances in multiple-interval mapping. Biometrics (2005) 61:474–480.[CrossRef][Web of Science][Medline]

    Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci (1998) 95:14863–14868.[Abstract/Free Full Text]

    Haley CS, et al. Mapping quantitative trait loci in crosses between outbred lines using least squares. Genetics (1994) 136:1195–1207.[Abstract]

    Hastie T, et al. The Elements of Statistical Learning (2001) Canada: 1st edn. Springer.

    Jiang C, Zeng ZB. Multiple trait analysis of genetic mapping for quantitative trait loci. Genetics (1995) 140:1111–1127.[Abstract]

    Knott SA, Haley CS. Multitrait least squares for quantitative trait loci detection. Genetics (2000) 156:899–911.[Abstract/Free Full Text]

    Korol AB, et al. Linkage between quantitative trait loci and marker loci: resolution power of three statistical approaches in single marker analysis. Biometrics (1996) 52:426–441.[CrossRef][Web of Science][Medline]

    Kraft P, et al. Multivariate variance components analysis of longitudinal blood pressure measurements from the Framingham Heart Study. BMC Genetics (2003) 4(suppl. 1):S55–S61.[CrossRef]

    Lander ES, Botstein D. Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics (1989) 121:185–199.[Abstract/Free Full Text]

    Lehman EL, Casella G. Theory of Point Estimation (1998) 2nd. New York: Springer. Chapter 6. pp. 429–475.

    Li J, Burmeister M. Genetical genomics: combining genetics with gene expression analysis. Hum. Mol. Genet (2005) 14(suppl. 2):R163–169.[Abstract/Free Full Text]

    MacLaren EJ, Sikela JM. Cerebellar gene expression profiling and eQTL analysis in inbred mouse strains selected for ethanol sensitivity. Alcohol Clin. Exp. Res (2005) 29:1568–1579.[CrossRef][Medline]

    Mangin B, et al. Pleiotropic QTL analysis. Biometrics (1998) 54:88–99.[CrossRef][Web of Science]

    Martin LJ, et al. Phenotypic, genetic, and genome-wide structure in the metabolic syndrome. BMC Genetics (2003) 4(suppl. 1):S95–S99.[CrossRef]

    Schadt E, et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat. Genet (2005) 37:710–717.[CrossRef][Web of Science][Medline]

    Segal E, et al. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics (2003) 19(suppl. 1):i273–282.[Abstract]

    Verdugo RA, Medrano JF. Comparison of gene coverage of mouse oligonucleotide microarray platforms. BMC Genomics (2006) 7:58.[CrossRef][Medline]

    Weller J, et al. Application of a canonical transformation to detection of quantitative trait loci with the aid of genetic markers in multitrait experiments. Theor. App. Genet (1996) 92:998–1002.[CrossRef]

    Yamashita S, et al. Expression quantitative trait loci analysis of 13 genes in the rat prostate. Genetics (2005) 171:1231–1238.[Abstract/Free Full Text]

    Yvert G, et al. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet (2003) 35:57–63.[Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
T. M. Przytycka, M. Singh, and D. K. Slonim
Toward the dynamic interactome: it's about time
Brief Bioinform, January 8, 2010; (2010) bbp057v1.
[Abstract] [Full Text] [PDF]


Home page
Plant Cell PhysiolHome page
Y. Al-Ghazi, S. Bourot, T. Arioli, E. S. Dennis, and D. J. Llewellyn
Transcript Profiling During Fiber Development Identifies Pathways in Secondary Metabolism and Cell Wall Structure That May Contribute to Cotton Fiber Quality
Plant Cell Physiol., July 1, 2009; 50(7): 1364 - 1381.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/7/958    most recent
btn064v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sampson, J. N.
Right arrow Articles by Self, S. G.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sampson, J. N.
Right arrow Articles by Self, S. G.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?