Skip Navigation


Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2006 22(23):2905-2909; doi:10.1093/bioinformatics/btl501
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/23/2905    most recent
btl501v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Begun, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Begun, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Robust method for detecting differential gene expression in twin studies

Alexander Begun

Institute of Medical Informatics and Statistics, Kiel University Brunswiker Strasse 10, D-24105 Kiel, Germany


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: A steadily increasing number of experiments with microarrays stimulate the further development of the statistical methods of the analysis of gene expression data. One of the central problems in this area is detecting differential gene expression under two or more conditions. Unfortunately, up to now it has not been studied how the correlations between related individuals, such as twins influence the estimates of differential gene expression.

Results: In this paper, we discuss this problem and propose a new method that is robust with respect to correlations of gene expression data for twins.

Contact: a.begun{at}ikmb.uni-kiel.de


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The classic problem in microarray analysis is to detect differential gene expression between two conditions (e.g. affected and non-affected). This involves not only determining whether there are any differentially expressed genes but also finding those genes with differential expression. The problem requires testing a null hypothesis H0 that there is no differential expression and calculating a test statistic Z for each gene. The basic idea in all methods for estimating the null distribution of the statistic Z is to construct the so-called null statistic z and then use this statistic to determine cut-off points, claiming significance for genes, whose Z exceed these cut-off points. The most popular non-parametrical methods (methods independent on strong parametric assumptions) are the significant analysis of microarray (SAM) (Tusher et al., 2001), the empirical Bayes (EB) method (Efron et al., 2000, 2001; Smyth, 2004) and the mixture model method (MMM) (Pan et al., 2002; Pan, 2003). In all these methods, it is assumed that under H0 the Z statistics of all the genes have the same distributions. However, different procedures for constructing the null statistic z are used in these methods. In SAM this procedure is based on the permutations of arrays and computation order statistics. In EB is used a t-statistic with a Bayesian adjusted denominator. In MMM a specific permutation of arrays allows obtaining the sample of z-distributed random variables and then a finite Normal mixture model is fitted to this data. As for real data we usually have genes with differential expression, the null distribution z obtained using the permutation procedure is more dispersed and cannot estimate Z under H0 well. An additional problem rises if we use these methods for related individuals, such as twins. It can lead to wrong inferences and to falsely estimated number of differentially expressed genes. Unfortunately this problem has not been studied up to present. The twin data have advantages for genetic studies, unobtainable in regular family or case-control studies. First, environmental confounders are minimized because twin children are usually exposed to similar environments. Second, ascertainment-bias typically does not pose a problem for twins because sample selection has not occurred with reference to any specific phenotype. This can reduce the sample size and the high expenses of a microarray experiment. In this paper, we propose a method that allows avoiding the possible errors in finding differentially expressed genes for twin data. We study the properties of the new statistic using simulations. It was shown that the method is robust with respect to the correlation pattern of gene expression data for twins.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Consider first, the model for unrelated individuals. Let Yjk be the expression level of gene j in array k, j = 1, ... , G, k = 1, ... , K1, K1 + 1, ... , K1 + K2. Here, the first K1 and the last K2 arrays are obtained under the two conditions, respectively. Let

Formula
where {varepsilon}jk are independent, identically and symmetrically distributed random errors with mean 0 and the variance 1, xk = 1 for 1 ≤ k ≤ K1 and xk = 0 for K1 + 1 ≤ k ≤ K1 + K2. Determining whether a gene has differential expression is equivalent to testing for H0:bj = 0 against H1:bj != 0.

A current method for detecting differential gene expression is the two-sample equal-variance t-statistic


Formula

We will use another form of t-statistic for independent individuals. This statistic Z1j is more appropriate for constructing null statistic z and differs slightly from the statistic in the new version of MMM (Pan, 2003). Define Z1j by


Formula

with


Formula

Here [w] is the integer part of w.

The null statistic is

Formula
As {varepsilon}jk are symmetrically distributed, z1j has the same distribution as Z1j under H0.

For twins we will use the following model

Formula

where {varepsilon}jk,0, {varepsilon}jk,1, {varepsilon}jk,2 are independent, identically and symmetrically distributed random variables with mean 0 and the variance 1. The correlation between twins for gene j is in this linear model defined by the formula Formula . We divide all the twin pairs into three groups according to their concordance-discordance and condition status. We will denote these groups by Formula and Formula with the number of pairs equal to l1, l2 and l12, respectively. The first group for concordant twins under the first condition, the second group for concordant twins under the second condition and the third group for discordant twins with the first twin under the first condition and the second twin under the second condition. The full number of twin pairs is equal to n = l1 + l2 + l12. Denote Formula and Formula . We divide arbitrary groups Formula into 4 groups—Formula with number of pairs equal to Formula and Formula , respectively, and the group Md into 2 groups Formula and Formula with the number of pairs equal to Formula and Formula , respectively. Then we form new groups M11, M12, M21, M22 as follows. The group M11 includes Formula and all first twins from the group Formula . The group M12 includes Formula and all first twins from the group Formula . The group M21 includes Formula and all second twins from the group Formula . The group M22 includes Formula and all second twins from the group Formula . The number of individuals in the groups M11, M12, M21, M22 is equal to Formula and Formula , respectively. It holds that n11 + n12 + n21 + n22 = 2n.

For detecting differential gene expression, we propose the following statistic

Formula
The value of s2j (an estimate of the standard deviation of the numerator) is given below.

The null statistic we define with the formula

Formula
Since {varepsilon}jk,0, {varepsilon}jk,1, {varepsilon}jk,2 are independent, identically and symmetrically distributed random variables, the null statistic z2j has the same distribution as Z2j under H0 for all j. Indeed, denote

Formula
From the definitions of s2j (see below), Z2j and z2j it follows that these random variables depend on the random vector Formula (or Formula), Formula and Formula under H0 is equal to Formula. Using Bayes' formula and the symmetry of {varepsilon}jk,0, {varepsilon}jk,1, {varepsilon}jk,2, we can write

Formula
Similarly it can be shown that the null statistic z1j has the same distribution as Z1j under H0 for all j if all individuals are independent.

In MMM the values of z1j and z2j can be used to fit the probability density of null distributions f1 and f2 by the normal mixture model with g0 components

Formula
where {pi}m,i are the mixing proportions and {varphi}(.;µm,i,Vm,i) denotes the normal density function with mean µm,i and variance Vm,i. These approximate functions f1 and f2 allow us to determine the cut-off points cm,down and cm,upper such that the Type 1 error rate {alpha} is

Formula
Here, p is the genome-wide significance level. We will denote an interval [cm,down,cm,upper] by O{alpha}m. The unknown parameters of the mixture model can be estimated using the EM algorithm (McLachlan and Basford, 1988). This algorithm is realized in the Fortran program EMMIX, which is freely available on the web (http://www.maths.uk.edu.au/~gjm/emmix//emmix.html). Some details on the practical selection of the number of components can be found in Pan et al. (2002).

Calculating of s2j. Consider a random variable

Formula
If the group of discordant twins is not empty, then Formula correlates with Formula and Formula correlates with Formula. Denote the expectation of X by EX, Dj = {sigma}j,02 + {sigma}j,12 and Cj = {sigma}j,02. From here and from the independence of all twin pairs it follows that

Formula
and

Formula
It is easy to show that

Formula
To estimate Dj and Cj we calculate

Formula
Two last equalities hold for any kpq1 isin Mpqc and kpq2isinMpq{cap}Mqd, p,q = 1,2.

After a number of transformations we get

Formula
Substituting these equalities in Spq,D(j) and Spq,C(j) we get

Formula

The estimates Formula and Formula of Dj and Cj, respectively, we can easily calculate from two last equations replacing Formula with Formula and Formula with Formula. The value of s2j2 we now get by substituting Formula and Formula in the formula

Formula


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Using simulated data, we compare the method based on the test statistic Z1j and the null statistic z1j with the method based on the test statistic Z2j and the null statistic z2j.

Consider the two following examples. Let l1 = l2 = 4, l12 = 0 (all twin pairs are concordant) in the first example and l1 = l2 = 0, l12 = 8 (all twin pairs are discordant) in the second example. In both cases we assume that G = 1000, the first 500 genes with differential expression and the last 500 genes with no differential expression. Assume additionally that aj ~ N(0,5), aj + bj ~ N(0,5), ({sigma}i,02 + {sigma}i,12)1/2 = 5, {varepsilon}j,0 ~ N(0,1), {varepsilon}j,1 ~ N(0,1) and all the random variables are independent. We permute separately concordant twin pairs under the first condition, concordant twin pairs under the second condition, and discordant twin pairs, obtaining set {Pi} of sets Formula and Formula with cardinality |{Pi}| = 50. The following characteristics are calculated for a number of Type 1 error rates {alpha}.

Estimated number of positives (EP),

Formula
True number of positives (TP),

Formula

True number of false positives (TFP), Formula

Estimated number of false positives (EFP), Formula

False discovery rates (FDR), Formula

In Table 1 are given the results of calculations of EP and EFP for different combinations of {rho}e and {rho}n (coefficients of correlation for differentially and non-differentially expressed genes). The numbers without brackets stand for statistic Z1j and the numbers in brackets stand for statistic Z2j. It is easy to see that there is not a marked difference between the new and the old method for equal values of {rho}e and {rho}n. The new method finds more (less) false positives for concordant (discordant) twins if {rho}e > {rho}n. On the contrary, the new method finds less (more) false positives for concordant (discordant) twins if {rho}e < {rho}n. The greater the difference between {rho}e and {rho}n, the greater this effect. Figures 12 illustrate this finding. One can clearly see in these figures that the old method gives very dispersed TFP for different pairs of ({rho}e, {rho}n). At the same time all the values for the new method are concentrated in a compact area (Legend for statistic Z1: solid line—({rho}e, {rho}n) = (0, {rho}); dashed line—({rho}e, {rho}n) = ({rho},0); {circ}—({rho}e, {rho}n) = ({rho}, {rho}); *—({rho}e, {rho}n) = (0,0). Legend for statistic Z2: +—({rho}e, {rho}n) = (0, {rho}); cjs3656—({rho}e, {rho}n) = ({rho},0); {square}—({rho}e, {rho}n) = ({rho}, {rho}), {diamond}—({rho}e, {rho}n) = (0, 0), where {rho} = max({rho}e, {rho}n) > 0).


Figure 1
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 True number of false positives. Concordant pairs.

 


Figure 2
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 True number of false positives. Discordant pairs.

 


View this table:
[in this window]
[in a new window]

 
Table 1 Results of simulation study by using the old and the new (in brackets) statistics

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The value of the numerator in statistic Z1j depends on the correlation of gene expressions between twins. If we don't take into account this correlation we can get wrong results in detecting differential gene expression for twin data. The new statistic Z2j corrects fully this drawback in the case of discordant twins and partially in the case of concordant twins or in the mixed case. Our simulation studies confirm this statement. It doesn't mean that the new statistic is better than the old statistic. Sometimes the old statistic gives us a smaller but sometimes a larger number of false positives. It depends on the difference between {rho}e and {rho}n and the concordance-discordance status of twins. As we a priori do not know the values of {rho}e and {rho}n, the usage of the new statistic gives us more stable and robust results and the implementation of this statistic is, therefore, more preferable. In any case, the values of TFP for the new statistic do not deviate so much from EFP as for the old statistic. As monozygotic and heterozygotic twins have different intra-pair phenotype correlations, they should be treated separately.

The assumption that under H0, the test statistics Z2 of all the genes have the same distribution, leading to the use of the null statistics z2 of all the genes to estimate a reference distribution for the test statistics Z2 may or may not hold in practice. However, if the expression levels of different genes are from different distributions belonging to the same member of a location-scale family, then the assumption holds. Instead to assume this for all genes, one can partition genes into several groups based on their expression levels and then assume that the assumption holds within each group (Pan, 2003).

Conflict of Interest: none declared.

Received on June 26, 2006; revised on September 11, 2006; accepted on September 30, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Efron, B., Tibshirani, R., Goss, V., Chu, G. (2000) Microarrays and their use in a comparative experiment. Technical Report 37B/213 Statistical Department, Stanford University.

    Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 1151–1160[CrossRef][ISI].

    McLachlan, G.L. and Basford, K.E. (1988) Mixture Models: Inference and Application to Clustering. , New York Marcel Dekker.

    Pan, W., Lin, J., Le, C. (2002) How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol, . 3, 1–11[Medline].

    Pan, W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19, 1333–1339[Abstract/Free Full Text].

    Smyth, G.K. (2004) Linear models and empirical Bayes methods foe assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol, . 3, 1–26.

    Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 5116–5121[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/23/2905    most recent
btl501v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Begun, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Begun, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?