Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2006 22(23):2905-2909; doi:10.1093/bioinformatics/btl501
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
Robust method for detecting differential gene expression in twin studies
Alexander Begun
Institute of Medical Informatics and Statistics, Kiel University Brunswiker Strasse 10, D-24105 Kiel, Germany
 |
ABSTRACT
|
|---|
Motivation: A steadily increasing number of experiments with microarrays stimulate the further development of the statistical methods of the analysis of gene expression data. One of the central problems in this area is detecting differential gene expression under two or more conditions. Unfortunately, up to now it has not been studied how the correlations between related individuals, such as twins influence the estimates of differential gene expression.
Results: In this paper, we discuss this problem and propose a new method that is robust with respect to correlations of gene expression data for twins.
Contact: a.begun{at}ikmb.uni-kiel.de
 |
1 INTRODUCTION
|
|---|
The classic problem in microarray analysis is to detect differential
gene expression between two conditions (e.g. affected and non-affected).
This involves not only determining whether there are any differentially
expressed genes but also finding those genes with differential
expression. The problem requires testing a null hypothesis
H0 that there is no differential expression and calculating a test
statistic
Z for each gene. The basic idea in all methods for
estimating the null distribution of the statistic
Z is to construct
the so-called null statistic
z and then use this statistic to
determine cut-off points, claiming significance for genes, whose
Z exceed these cut-off points. The most popular non-parametrical
methods (methods independent on strong parametric assumptions)
are the significant analysis of microarray (SAM) (
Tusher et al., 2001),
the empirical Bayes (EB) method (
Efron et al., 2000,
2001;
Smyth, 2004)
and the mixture model method (MMM) (
Pan et al., 2002;
Pan, 2003).
In all these methods, it is assumed that under
H0 the
Z statistics
of all the genes have the same distributions. However, different
procedures for constructing the null statistic
z are used in
these methods. In SAM this procedure is based on the permutations
of arrays and computation order statistics. In EB is used a
t-statistic with a Bayesian adjusted denominator. In MMM a specific
permutation of arrays allows obtaining the sample of
z-distributed
random variables and then a finite Normal mixture model is fitted
to this data. As for real data we usually have genes with differential
expression, the null distribution
z obtained using the permutation
procedure is more dispersed and cannot estimate
Z under
H0 well.
An additional problem rises if we use these methods for related
individuals, such as twins. It can lead to wrong inferences
and to falsely estimated number of differentially expressed
genes. Unfortunately this problem has not been studied up to
present. The twin data have advantages for genetic studies,
unobtainable in regular family or case-control studies. First,
environmental confounders are minimized because twin children
are usually exposed to similar environments. Second, ascertainment-bias
typically does not pose a problem for twins because sample selection
has not occurred with reference to any specific phenotype. This
can reduce the sample size and the high expenses of a microarray
experiment. In this paper, we propose a method that allows avoiding
the possible errors in finding differentially expressed genes
for twin data. We study the properties of the new statistic
using simulations. It was shown that the method is robust with
respect to the correlation pattern of gene expression data for
twins.
 |
2 METHODS
|
|---|
Consider first, the model for unrelated individuals. Let
Yjk be the expression level of gene
j in array
k, j = 1, ... ,
G, k = 1, ... ,
K1,
K1 + 1, ... ,
K1 +
K2. Here, the first
K1 and
the last
K2 arrays are obtained under the two conditions, respectively.
Let
where
jk are independent,
identically and symmetrically distributed random errors with
mean 0 and the variance 1,
xk = 1 for 1
k
K1 and
xk = 0 for
K1 + 1
k
K1 +
K2. Determining whether a gene has differential
expression is equivalent to testing for
H0:
bj = 0 against
H1:
bj 
0.
A current method for detecting differential gene expression is the two-sample equal-variance t-statistic
We will use another form of
t-statistic for independent
individuals. This statistic
Z1j is more appropriate for constructing
null statistic
z and differs slightly from the statistic in
the new version of MMM (
Pan, 2003). Define
Z1j by
with
Here
[w] is the integer part of
w.
The null statistic is
As
jk are symmetrically distributed,
z1j has the same distribution
as
Z1j under
H0.
For twins we will use the following model
 |
where
jk,0,
jk,1,
jk,2 are independent, identically and symmetrically distributed random variables with mean 0 and the variance 1. The correlation between twins for gene j is in this linear model defined by the formula
. We divide all the twin pairs into three groups according to their concordance-discordance and condition status. We will denote these groups by
and
with the number of pairs equal to l1, l2 and l12, respectively. The first group for concordant twins under the first condition, the second group for concordant twins under the second condition and the third group for discordant twins with the first twin under the first condition and the second twin under the second condition. The full number of twin pairs is equal to n = l1 + l2 + l12. Denote
and
. We divide arbitrary groups
into 4 groups
with number of pairs equal to
and
, respectively, and the group Md into 2 groups
and
with the number of pairs equal to
and
, respectively. Then we form new groups M11, M12, M21, M22 as follows. The group M11 includes
and all first twins from the group
. The group M12 includes
and all first twins from the group
. The group M21 includes
and all second twins from the group
. The group M22 includes
and all second twins from the group
. The number of individuals in the groups M11, M12, M21, M22 is equal to
and
, respectively. It holds that n11 + n12 + n21 + n22 = 2n.
For detecting differential gene expression, we propose the following statistic
The value of
s2j (an estimate of the standard deviation of the numerator)
is given below.
The null statistic we define with the formula
Since
jk,0,
jk,1,
jk,2 are independent, identically
and symmetrically distributed random variables, the null statistic
z2j has the same distribution as
Z2j under
H0 for all
j. Indeed,
denote
From the definitions
of
s2j (see below),
Z2j and
z2j it follows that these random
variables depend on the random vector

(or

),

and

under
H0 is equal to

. Using Bayes' formula and the symmetry of
jk,0,
jk,1,
jk,2, we can
write
Similarly it can
be shown that the null statistic
z1j has the same distribution
as
Z1j under
H0 for all
j if all individuals are independent.
In MMM the values of z1j and z2j can be used to fit the probability density of null distributions f1 and f2 by the normal mixture model with g0 components
where
m,i are the mixing proportions and

(.;µ
m,i,
Vm,i) denotes
the normal density function with mean µ
m,i and variance
Vm,i. These approximate functions
f1 and
f2 allow us to determine
the cut-off points
cm,down and
cm,upper such that the Type 1
error rate

is
Here,
p is
the genome-wide significance level. We will denote an interval
[
cm,down,
cm,upper] by
O
m. The unknown parameters of the mixture
model can be estimated using the EM algorithm (
McLachlan and Basford, 1988).
This algorithm is realized in the Fortran program EMMIX, which
is freely available on the web (
http://www.maths.uk.edu.au/~gjm/emmix//emmix.html).
Some details on the practical selection of the number of components
can be found in
Pan et al. (2002).
Calculating of s2j. Consider a random variable
If the group of discordant twins is not empty, then

correlates with

and

correlates with

. Denote the expectation of
X by
EX, Dj =
j,02 +
j,12 and
Cj =
j,02. From
here and from the independence of all twin pairs it follows
that
and
It is easy to show that
To estimate
Dj and
Cj we calculate
Two last equalities hold for any
kpq1
Mpqc and
kpq2
Mpq
Mqd, p,
q = 1,2.
After a number of transformations we get
Substituting these equalities in
Spq,D(
j) and
Spq,C(
j) we get
The estimates
and
of Dj and Cj, respectively, we can easily calculate from two last equations replacing
with
and
with
. The value of s2j2 we now get by substituting
and
in the formula
 |
3 RESULTS
|
|---|
Using simulated data, we compare the method based on the test
statistic
Z1j and the null statistic
z1j with the method based
on the test statistic
Z2j and the null statistic
z2j.
Consider the two following examples. Let l1 = l2 = 4, l12 = 0 (all twin pairs are concordant) in the first example and l1 = l2 = 0, l12 = 8 (all twin pairs are discordant) in the second example. In both cases we assume that G = 1000, the first 500 genes with differential expression and the last 500 genes with no differential expression. Assume additionally that aj
N(0,5), aj + bj
N(0,5), (
i,02 +
i,12)1/2 = 5,
j,0
N(0,1),
j,1
N(0,1) and all the random variables are independent. We permute separately concordant twin pairs under the first condition, concordant twin pairs under the second condition, and discordant twin pairs, obtaining set
of sets
and
with cardinality |
| = 50. The following characteristics are calculated for a number of Type 1 error rates
.
Estimated number of positives (EP),
True number of positives (TP),
True number of false positives (TFP),
Estimated number of false positives (EFP),
False discovery rates (FDR),
In Table 1 are given the results of calculations of EP and EFP for different combinations of
e and
n (coefficients of correlation for differentially and non-differentially expressed genes). The numbers without brackets stand for statistic Z1j and the numbers in brackets stand for statistic Z2j. It is easy to see that there is not a marked difference between the new and the old method for equal values of
e and
n. The new method finds more (less) false positives for concordant (discordant) twins if
e >
n. On the contrary, the new method finds less (more) false positives for concordant (discordant) twins if
e <
n. The greater the difference between
e and
n, the greater this effect. Figures 12 illustrate this finding. One can clearly see in these figures that the old method gives very dispersed TFP for different pairs of (
e,
n). At the same time all the values for the new method are concentrated in a compact area (Legend for statistic Z1: solid line(
e,
n) = (0,
); dashed line(
e,
n) = (
,0);
(
e,
n) = (
,
);
(
e,
n) = (0,0). Legend for statistic Z2: +(
e,
n) = (0,
);
(
e,
n) = (
,0);
(
e,
n) = (
,
),
(
e,
n) = (0, 0), where
= max(
e,
n) > 0).
 |
4 DISCUSSION
|
|---|
The value of the numerator in statistic
Z1j depends on the correlation
of gene expressions between twins. If we don't take into account
this correlation we can get wrong results in detecting differential
gene expression for twin data. The new statistic
Z2j corrects
fully this drawback in the case of discordant twins and partially
in the case of concordant twins or in the mixed case. Our simulation
studies confirm this statement. It doesn't mean that the new
statistic is better than the old statistic. Sometimes the old
statistic gives us a smaller but sometimes a larger number of
false positives. It depends on the difference between
e and
n and the concordance-discordance status of twins. As we a priori
do not know the values of
e and
n, the usage of the new statistic
gives us more stable and robust results and the implementation
of this statistic is, therefore, more preferable. In any case,
the values of TFP for the new statistic do not deviate so much
from EFP as for the old statistic. As monozygotic and heterozygotic
twins have different intra-pair phenotype correlations, they
should be treated separately.
The assumption that under H0, the test statistics Z2 of all the genes have the same distribution, leading to the use of the null statistics z2 of all the genes to estimate a reference distribution for the test statistics Z2 may or may not hold in practice. However, if the expression levels of different genes are from different distributions belonging to the same member of a location-scale family, then the assumption holds. Instead to assume this for all genes, one can partition genes into several groups based on their expression levels and then assume that the assumption holds within each group (Pan, 2003).
Conflict of Interest: none declared.
Received on June 26, 2006; revised on September 11, 2006; accepted on September 30, 2006
 |
REFERENCES
|
|---|
Efron, B., Tibshirani, R., Goss, V., Chu, G. (2000) Microarrays and their use in a comparative experiment. Technical Report 37B/213 Statistical Department, Stanford University.
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc, . 96, 11511160[CrossRef][Web of Science].
McLachlan, G.L. and Basford, K.E. (1988) Mixture Models: Inference and Application to Clustering. , New York Marcel Dekker.
Pan, W., Lin, J., Le, C. (2002) How many replicates of arrays are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol, . 3, 111[Medline].
Pan, W. (2003) On the use of permutation in and the performance of a class of nonparametric methods to detect differential gene expression. Bioinformatics, 19, 13331339[Abstract/Free Full Text].
Smyth, G.K. (2004) Linear models and empirical Bayes methods foe assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol, . 3, 126.
Tusher, V.G., Tibshirani, R., Chu, G. (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA, 98, 51165121[Abstract/Free Full Text].

CiteULike
Connotea
Del.icio.us What's this?