Skip Navigation


Bioinformatics Advance Access originally published online on November 16, 2004
Bioinformatics 2005 21(7):1121-1128; doi:10.1093/bioinformatics/bti140
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/1121    most recent
bti140v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bickel, D. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bickel, D. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Probabilities of spurious connections in gene networks: application to expression time series

David R. Bickel *

Office of Biostatistics and Bioinformatics, Medical College of Georgia Augusta, GA 30912-4900, USA


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 

Motivation: The reconstruction of gene networks from gene-expression microarrays is gaining popularity as methods improve and as more data become available. The reliability of such networks could be judged by the probability that a connection between genes is spurious, resulting from chance fluctuations rather than from a true biological relationship.

Results: Unlike the false discovery rate and positive false discovery rate, the decisive false discovery rate (dFDR) is exactly equal to a conditional probability without assuming independence or the randomness of hypothesis truth values. This property is useful not only in the common application to the detection of differential gene expression, but also in determining the probability of a spurious connection in a reconstructed gene network. Estimators of the dFDR can estimate each of three probabilities: (1) The probability that two genes that appear to be associated with each other lack such association. (2) The probability that a time ordering observed for two associated genes is misleading. (3) The probability that a time ordering observed for two genes is misleading, either because they are not associated or because they are associated without a lag in time. The first probability applies to both static and dynamic gene networks, and the other two only apply to dynamic gene networks.

Availability: Cross-platform software for network reconstruction, probability estimation, and plotting is free from http://www.davidbickel.com in Statomics, a suite of R functions with a Java application.

Contact: bickel{at}prueba.info

Supplementary information: Color figures are available from http://www.davidbickel.com


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 
Variations of the false discovery rate of Benjamini and Hochberg (1995) have been successfully applied to the problem of detecting differential gene expression between two or more groups on the basis of microarray data (Efron et al., 2001; Efron and Tibshirani, 2002; Müller et al., 2004, http://www.ceremade.dauphine.fr/~xian/publications.html; Pepe et al., 2003; Storey, 2002, 2003; Bickel, 2004a,b,c, http://arxiv.org/q-bio.GN/0404032). In this context, a discovery of differential expression is the rejection of the null hypothesis that a gene is not differentially expressed, and such a discovery is false if there is really no differential expression, i.e. if the rejected null hypothesis is true. Conversely, a non-discovery of differential expression is the non-rejection of a null hypothesis, so a non-discovery is false if there is differential expression, i.e. if the non-rejected null hypothesis is false. False discovery rate methods not only tend to have much lower false non-discovery rates than methods of controlling familywise error rates, but they have a simpler interpretation: depending on which mathematical definition of the false discovery rate is used, it is either exactly equal to a probability under general conditions (Efron and Tibshirani, 2002; Fernando et al., 2004), or it is approximately equal to a probability under more restrictive conditions (Efron et al., 2001; Storey, 2002, 2003). The probability in question is the probability that a gene considered differentially expressed is not really differentially expressed. More generally, it is the probability that a discovery is false, which, by definition, is the probability that a rejected null hypothesis is true. This property of the false discovery rate will be used to answer the question, ‘What is the probability that a given relationship in a gene network reconstructed from gene expression data is spurious?’ Thus, it will be seen that the methodology of false discovery rates not only aids in the detection of differential expression, but also in another important use of microarray technology, that of reverse-engineering regulatory networks of genes.

Many models of gene networks have been applied to the analysis of microarray data (De Jong, 2002), and the following false discovery rate methods are suitable for all gene networks that specify definite relationships between genes, including any network that can be represented as an undirected or directed graph. These methods will be illustrated with a network that can be seen as a dynamic generalization of networks constructed by considering two genes to be connected if the absolute value of the correlation coefficient between their expression values is sufficiently high. There are several variations of such networks (Butte et al., 2000; Rho et al., 2003, http://www.bepress.com/sagmb/vol3/iss1/art8), all of which fall in the general class of spatial networks (Herrmann et al., 2003), an example of a type of network that could benefit from the proposed methods.

Let x1(t) and x2(t) represent the expression values of the first and second gene at time t, respectively. (There are many possible definitions of expression value; some possibilities are given below, but here it is just assumed that the expression value tends to increase with the amount of mRNA.) Without loss of generality, define the coexpression value to be the absolute value of the linear correlation coefficient. Then, if x1(t) and x2(t) are weakly stationary (t R), the coexpression value between the two genes at a lag time of {tau} is

(1)
If |r1,2({tau})| > 0, then the two genes are said to be coregulated. [Alternately, they are coregulated only if |r1,2({tau})| is greater than or equal to the minimum coregulation that is biologically meaningful in the sense of Bickel (2004a).] Let {tau}optimal be the time lag at which the coregulation is maximal, i.e. {tau}optimal satisfies |r1,2({tau}optimal)| = max{tau}|r1,2({tau})|. Then if the two genes are coregulated, gene 2 is said to lag behind gene 1 by {tau}optimal if {tau}optimal > 0, gene 1 is said to lag behind gene 2 by |{tau}optimal| if {tau}optimal < 0, and the genes are said to be contemporaneous in expression if {tau}optimal = 0. For a finite number of time points n, t {0, {Delta}t,2{Delta}t,...,(n – 1){Delta}t}, {Delta}t > 0, and the corresponding coexpression value is

(2)
with xl,j {equiv} xl((j – 1){Delta}t). Here, values are truncated from both vectors when {tau} != 0, but other boundary conditions are possible, and a boundary condition that only causes truncation from one vector is presented below. Exactly how boundary expression values are handled has little effect on the coexpression value when |{tau}| n {Delta}t/4.

The methods of the next section construct a genetic network, from which one can make inferences about gene coexpression (1) based on expression levels obtained for thousands of genes from microarrays at fixed time intervals. The first method determines which combinations of gene pairs and lag times have finite series coexpression values (2) that meet or exceed a given threshold. The next method estimates the probability that any such combination is spurious in that its coexpression value (2) was as high as observed due to chance, rather than due to a non-zero true coexpression value (1). Given the set of gene-pair/lag-time combinations, the third method determines which of the non-zero lag times differ significantly from zero, and then estimates the probability that a significant non-zero lag time is a finite-size effect, i.e. that max{tau}|r1,2({tau})| = |r1,2(0)| when genes 1 and 2 have a statistically significant non-zero lag time. Methods to describe the tails of the connectivity distributions follow. Joint inference based on the probabilities of the second and third methods is also described. Lastly, yeast cell cycle data are analyzed for the purpose of illustration.


    2 NEW METHODS
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 
2.1 Finding gene networks
Given m genes, each with expression values measured at n equally-spaced time points, all of the expression values can be represented by an m x n matrix, X. (This representation and method also applies to the more general cases of n experiments, not necessarily with a time-series design, and of n laboratory animals, human patients or other individuals.) Denote by xi,j the entry in the i-th row and j-th column of X, i.e. xi,j is the expression value of the i-th gene and the j-th time point, individual, or experiment. The term ‘expression value’ applies to any measure of a relative amount of gene expression, including logarithms or ranks of estimated mRNA levels or ratios. The ranks of the estimated mRNA levels within each gene are particularly useful since they are insensitive to outliers and to distributional differences. For the i-th gene, the rank for the j-th time point, experiment or individual is the rank of the corresponding estimated mRNA level or ratio among all the measurements for that gene, so that the rank is equal to 1 for the smallest, 2 for the second smallest, k for the k-th smallest and n for the largest estimated mRNA level or ratio for the i-th gene. This is implemented in software by applying the rank transform independently to each row of an m x n matrix of estimated mRNA levels or ratios, resulting in X, an m x n matrix of expression ranks. Two genes are considered to be coregulated if their row vectors in X satisfy a coregulation criterion that is based on a measure of similarity, such as correlation, or a measure of dissimilarity, such as Euclidean distance. In this paper, the coregulation criterion C is satisfied for two rows if the absolute value of their correlation coefficient is at least r0, where r0 is some specified value between 0 and 1. Higher values of r0 correspond to less inclusive (more stringent) criteria. For the sake of conciseness, the term ‘coregulation value’ will include all measures of association, including the absolute value of the correlation and distance-based similarities.

The following algorithm generates a network of genes from X, given a coregulation criterion C and, for time-series data, a maximum lag time {tau}max, a non-negative integer multiple of {Delta}t, the time between two consecutive measurements. Since correlations become unreliable as their lag times approach the length of the time series, it is recommended that {tau}max be not much more than a fourth of the duration of the experiment. C may be chosen to be as inclusive as possible, subject to limitations of computer resources, since it can easily be made more stringent after initial analysis, as seen below.

Step 1. If the columns of X correspond to equally spaced points in time, proceed to Step 2. Otherwise, set Y equal to X (Y <- X, i.e. let Y = X), m' <- m, and proceed to Step 4.

Step 2. Create time-shifted matrices by translating the data matrix to the left and to the right, as follows:For

let X({tau}) be the m x n matrix created by shifting X by |{tau}/({Delta}t)| columns to the right if {tau} ≥ 0 or to the left if {tau} < 0, with the values of |{tau}/({Delta}t)| columns dropped from one side of the matrix and with padding that repeats the closest column with expression values on the other side of the matrix. Thus, X({tau}) is a matrix of m row vectors, the i-th of which is (xi,1({tau}),xi,2({tau}),...,xi,n({tau})), where xi,j({tau}) is the expression level of the i-th gene and the (j{tau}/({Delta}t))-th time point unless j{tau}/({Delta}t) < 1 or j {tau}/({Delta}t) > n. For example, if {tau} = 2{Delta}t, then {forall}i{1,2,...m}xi,1(2{Delta}t) = xi,2(2{Delta}t) = xi,1 and {forall}i{1,2,...m};j{3,4,...n}xi,j(2{Delta}t) = xi,j–2. (It follows that X(0) = X.)

Step 3. The matrix Y is created by combining the rows of all of the matrices of Step 2. m' <- (2|{tau}max/({Delta}t)| + 1)m, so Y is an m' x n matrix.

Step 4. Compute all possible row–row coregulation values for Y, recording which pairs of rows that have sufficient coregulation to satisfy C. [This computationally intensive step was implemented in an R operating system call to a Java application, whereas all of the other computations herein were performed solely by functions written in R (R Development Core Team, 2003, http://www.R-project.org).]

Step 5. Interpret the results of Step 4 in terms of which genes are coregulated, and at which lag times they are coregulated. This is accomplished by mapping the rows of Y back to the rows of X and values of {tau} that were used to construct Y, noting the relationships between rows of Y found in Step 4. Thus, each pair of rows of Y that has sufficient coregulation corresponds to a pair of genes and a pair of lag times. These pairs of rows of Y will be referred to as the coregulated row pairs.

Using the results of Step 5, for all coregulated row pairs that satisfy C for different values of {tau}, record which gene of each pair lags behind the other gene of its pair, the amount of lag time between the two, and the coregulation value. For example, if a coregulated row pair has two genes that satisfy C when the row of the first gene in Y was from X({tau}1) and the row of the second gene in Y was from X({tau}2), with {tau}1 < {tau}2, then the second gene lags behind the first gene by |{tau}1{tau}2|. When more than one coregulated row pair give the same gene and relative lag time, their highest coregulation value is taken as representative of the relationship between the two genes at that relative lag time. (The highest value instead of an average value avoids a bias that would be present in an average value since the coregulated row pairs in Step 4 were chosen for having high coregulation values, leaving out row pairs with the same gene and lag information that did not satisfy C.) (For the following application to yeast data, the maximum absolute value of the correlation coefficient is the highest coregulation value.) Then, each pair of genes and relative lag time with a sufficiently high similarity can be represented by a gene–gene relationship consisting of a leading gene, a lagging gene, a relative lag time, a coregulation value, and the direction of association (positive or negative). In the yeast example, the coregulation value and the direction of the association are the magnitude and sign of the correlation coefficient, respectively. By contrast, distance-based measures of coregulation typically only account for positive association, although there are exceptions, e.g. Bickel (2003). When the relative lag time is zero, the ‘leading gene’ and ‘lagging gene’ are interchangeable. For gene–gene relationships that have the same pair of genes, only retain the relationships with coregulation values that are local maxima with respect to lag time, in order to eliminate relationships that have high coregulations only because they are close in time to relationships with higher coregulations. All k of the remaining gene–gene relationships constitute , the network of genes.

The open source code exemplifies a detailed implementation of this algorithm. The network of genes thereby determined may be represented by a directed graph if there is time information, as shown below. Otherwise, it may be represented as an undirected graph. In either case, the edges of the graph connect nodes that represent coregulated genes, with each edge representing a gene–gene relationship.

Instead of this algorithm, Kellam et al. (2002) used an evolutionary program to find the gene–gene relationships that satisfied the criterion for the correlation coefficient without the rank transform. Thus, the same measure of similarity may be used with different transforms and algorithms. Non-rank transforms may improve statistical power if the data fit a known model well, in which case the permutation test could possibly be replaced by a parametric test. Which algorithm is best for network reconstruction also depends on the data, but the suggested algorithm is widely applicable and conservative, even when not ideal.

2.2 Probabilities of spurious connections
Some of the gene–gene relationships in the network are false discoveries in that they are finite-n effects that would not appear in a large number of repeated experiments. To get an idea of how many discoveries are false, there are many ways to quantify a ‘false discovery rate’; the simple decisive false discovery rate (dFDR) of Bickel (2004a,b) will be used here. The estimated dFDR is zero if there are no gene–gene relationships. [By contrast, the closely related estimated proportion of false positives (PFP) of Fernando et al. (2004) would be undefined.] When there are gene–gene relationships in the network, the estimated dFDR is equal to the estimated proportion of false positives, the ratio of R0, the number of gene–gene relationships that would occur under a null hypothesis—to R—the number of gene–gene relationships in the network obtained from the data. A meaningful null hypothesis is that the rows and columns of X are independent. In this case, R0 could be the number of gene–gene relationships found by applying Steps 1–6 to an m x n matrix formed by randomly permuting the values within each of the rows of X. The random effects of the permutations can be reduced by forming B' independent matrices by the same permutation method, in which case R0 would be the mean number of gene–gene relationships found, i.e. the total number of gene–gene relationships found, divided by B'. The dFDR is estimated by min(1, R0/R) if R > 0 or by 0 if R = 0, giving an idea of how reliable the gene–gene relationships are. Like the PFP, the dFDR can be viewed as the probability pspurious(C) that a given gene–gene relationship that satisfies C is spurious (Bickel, 2004a,b). [The dFDR equals a conditional probability in the sense of Breiman (1992), even when the PFP is undefined.] Thus, a dFDR estimate close to 1 indicates that there is evidence for few or no real gene–gene relationships. This estimate is very conservative, and multiplying it by the estimated portion of true null hypotheses can reduce the number of false non-discoveries (Storey, 2002, 2003). As noted in the introduction, this dFDR methodology is not limited to the above algorithm. For example, parametric or non-parametric tests that assign a P-value to each gene–gene relationship would use the product of the significance level and the number of tests as R0, using an established way to estimate a false discovery rate given a list of P-values (Genovese and Wasserman, 2002; Efron and Tibshirani, 2002; Storey, 2003; Fernando et al., 2004; Bickel, 2004a,b) as described in Section 2.3 in the context of discoveries of non-zero lag times.

Given a gene network found by the above algorithm, a gene network with a more stringent coregulation criterion C' can be quickly found without repeating Steps 1–6. (C' can be said to be more stringent than C only if every pair of vectors that satisfies C' also satisfies C. For example, if the absolute value of the correlation is the coregulation value, then C might require that the coregulation value is at least 0.9, whereas C' might require that it is at least 0.99.) is the set of gene–gene relationships of with coregulation values that satisfy C'. can then be computed in the same way as .

2.3 Determining statistical significance of estimated lag times
While permuting the data of X provides , a good estimate of the probability that a gene–gene relationship is spurious due to possible independence, it does not provide , an estimate of pcontemporaneous, the probability that a gene–gene relationship has a spuriously non-zero relative lag time, a positive relative lag time resulting from the finite size of the data set rather than from one of the genes really preceding the other in gene-expression patterns. This probability is a dFDR that can be estimated by the ratio of the mean number of positive relative lag times that would occur by chance to the number of positive relative lag times in the network generated from the data. It can be based on a P-value for each positive relative lag time.

The P-value of the k-th of K relative lag times is the estimated probability that it is as great or greater than the relative lag time observed would occur if the genes had contemporaneously fluctuating expression values:

(3)
where I is an indicator function, {tau}k is the relative lag time for the k-th gene–gene relationship from X, and {tau}k,b is the relative lag time of the b-th of B bootstrap samples of the k-th gene–gene relationship, so that I({tau}k,b ≥ {tau}k) = 1 if {tau}k,b ≥ {tau}k and I ({tau}k,b ≥ {tau}k) = 0 if {tau}k,b < {tau}k. [In the terminology of Efron and Tibshirani (1993), is the achieved significance level.] Each bootstrap sample for the k-th gene–gene relationship is generated as follows when the expression values are ranks. (Some modification is needed for most other measures of expression value.) Let i(lead, k) be the index of the row in X of the leading gene, and let i(lag, k) be that of the lagging gene. Then the base level of expression for the j-th time point of those genes under the null model ise

(4)
the k-th set of residuals is

(5)
and the level of expression at the j-th time point for the l-th ‘gene’ of the b-th bootstrap sample for the k-th gene–gene relationship is

(6)
where is randomly selected with replacement from and l {1,2}. {tau}k,b is the absolute value of the lag time that maximizes the coexpression value between the first and second bootstrap gene, i.e. between the vectors and . The maximum is taken over some set of lag times that includes 0 and {tau}k.

All relative lag times with P-values less than or equal to some significance level {alpha} are statistically significant. Because of the multiple comparisons issue, the statistical significance of a relative lag time does not provide evidence for a biological lag time, except for small pcontemporaneous(C; {alpha}), the probability that a relative lag time significant at level {alpha} corresponds to a true lag time of 0. That probability is estimated by

(7)
with the dependencies on C suppressed for notational simplicity.

The probability that either type of false discovery occurs for a significant relative lag time can be estimated by the sum of the two dFDR estimates: , where is computed only for non-zero relative lag times. This may be too conservative in some cases since the two false discovery events are not mutually exclusive. A more reasonable estimate is based on the independence assumption:

(8)
However, unless both dFDR estimates are very large, the two estimates differ only negligibly. For example, if , then and .

That these are estimates, not known probabilities, must be kept in mind, especially when there are only a few time points. A careful simulation study of the mean square error of the probability estimates as a function of the number of genes and number of time points would be valuable, but no such study is planned at this time. As in other cases of false discovery rate estimation, reporting confidence intervals of estimates can prevent undue reliance on them (Bickel, 2004c).

A directed graph of all gene–gene relationships with statistically significant relative lag times has genes as nodes and a directed edge for each gene–gene relationship, from the node of the leading gene to the node of the lagging gene, with the relative lag time and the direction of association (‘+’ or ‘–’) as the edge label. Many of the directed edges may result from causative relationships between the genes, but a directed edge is insufficient to prove causation.

2.4 Scale-free networks
The extent to which a network is scale-free can be quantified by the fit of the tail of f, the probability density of the connectivity, to a power-law. The connectivity of a gene, denoted by k, is its number of connections, the number of genes with which it shares one or more gene–gene relationships. Agrawal (2002) found that networks constructed by a mutual nearest-neighbor algorithm satisfy

(9)
for with ß = –1, where Kmax is the highest connectivity in the network. Networks that satisfy Equation (9) are considered scale-free or self-similar since in the tails. The lack of a characteristic scale implies that the mean connectivity does not provide a good description of the network since it has many nodes with low connectivity and a few nodes with connectivity of one or more orders of magnitude higher.

Integrating Equation (9) gives the cumulative probability function F, the probability that the number of connections is at least K:

(10)
for sufficiently large . Agrawal (2002) set ß = –1 and used Equation (10) to obtain least-squares estimates of the parameters a1 and a2 from empirical values of . Nonlinear least-squares regression can be used to estimate ß as well as a1 and a2. Alternately, ß can be estimated by linear regression using the two-parameter model

(11)
since

(12)
is the equation of a line.


    3 APPLICATION TO CELL CYCLE DATA
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 
Spellman et al. (1998) carried out an experiment to measure the variation of expression levels of yeast genes with the cell cycle. They noted that cell cycle regulation could keep order in cell division or conserve resources. Many genes both control and are controlled by the cell cycle, and most of the control is transcriptional (Spellman et al., 1998). In the experiment, they applied a pheromone ({alpha}-factor) to a yeast culture in order to stop the cells in the same cell age (G1, the first phase). Upon removing the pheromone, they sampled the culture and took a repeated sample every 7 min for 119 min, yielding a total of n = 18 samples. For each sample, a cDNA microarray was used to measure the expression values of 6178 genes. The methods described above were applied to the within-gene ranks of expression ratios of the m = 4489 genes that have no expression values missing in any of the microarrays. Applying the rank transform before computing the correlation coefficient in effect uses Spearman's correlation coefficient, which was informative in another expression time series in yeast (Bickel, 2003). The use of ranks minimizes the effects of outliers and obviates logarithmic or other monotonic transformations.

The above algorithm of network construction was used with {tau}max = 21 min and coregulation thresholds 0.90, 0.91, 0.92, 0.93, 0.94 and 0.95. Figure 1 summarizes all of the resulting gene–gene relationships, whereas Figure 2 summarizes only those for which {tau} != 0. Each plotted estimated probability that a relationship is false is , the dFDR estimated from B' = 4 independent permutations of all rows of the matrix.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 1 (a) Number of relationships and (b) estimated probability that a relationship is false versus the coregulation threshold for all gene–gene relationships.

 


View larger version (10K):
[in this window]
[in a new window]
 
Fig. 2 (a) Number of relationships and (b) estimated probability that a relationship is false versus the coregulation threshold for all gene–gene relationships with non-zero time lags.

 
Figures 3 and 4 display the entire directed graph and partial directed graph of the network, respectively, for the coregulation threshold of 0.95, {tau} != 0, and significance level {alpha} = 0.10 for B = 500 iterations of the bootstrap procedure. Estimation of dFDRs yields and , so the overall probability for error is approximately . Of the 69 genes in gene–gene relationships that have a coregulation of at least 0.95, a non-zero lag time, and a P-value of 0.10 or lower, 20 genes were not included in the set of 800 genes that Spellman et al. (1998) reported to be cell-cycle regulated; their open reading frame labels are YBL032W, YDL165W, YIL010W, YJL097W, YMR190C, YNL118C, YNL130C, YNL280C, YPR028W, YPR066W, YOL129W, YEL071W, YOR045W, YKR098C, YCL064C, YOR393W, YMR226C, YCL021W, YGL139W, and YKL141W. YNL118C has two gene–gene relationships satisfying the criteria, as seen in Figrue 4, but each of the other three genes only has one gene–gene relationship. The expression ranks of YNL118C and of the two genes connected to it are plotted in Figure 5. As Figure 4 indicates, YNL118C is 3 time units (21 min) ahead of the two lagging genes, which is also clear from Figure 5b. YNL118C is known to prevent translation by allowing the production of active Dcp1p, the yeast mRNA decapping enzyme. While such posttranscriptional regulation would not in itself change the mRNA expression levels of the two lagging genes, YNL118C might reduce the abundance of transcription factors that regulate those genes. Some caution is needed here since the arrows in the directed graphs suggest causality without strictly implying it.



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 3 The entire network for the coregulation threshold of 0.95 and a significance level (Type 1 error rate) of 10%.

 


View larger version (33K):
[in this window]
[in a new window]
 
Fig. 4 Selected parts of the network for the coregulation threshold of 0.95 and a significance level (Type 1 error rate) of 10%.

 


View larger version (23K):
[in this window]
[in a new window]
 
Fig. 5 Each unit of time on the horizontal axis corresponds to 7 min for (a) expression ranks of three genes and (b) expression ranks of the two lagging genes with the expression ranks of the leading gene reflected. The reflection is necessary to show the relationship between the genes at a lag time of 3 units (21 min) since they are negatively correlated. The online supplementary information has color figures (Color key: YNL118C is black, YJL201W is red, and YPR076W is green).

 
Using the distributions of connectivity at different coregulation thresholds, the overall structures of the networks reconstructed from the cell cycle data were also studied. Figure 6 gives the total number of connections and the estimate of ß determined from fitting the three-parameter model of Equation (10). The results for the coregulation threshold of 0.95 are not shown since its small number of relationships makes them unreliable for regression. Estimates are displayed for the three-parameter model since it fits the data much better than either of the two-parameter models, as exemplified by Figure 7. Estimates of the scaling exponent are between –3 and –2.



View larger version (10K):
[in this window]
[in a new window]
 
Fig. 6 (a) The number of connections and (b) the scaling exponent estimate, , versus the coregulation threshold for all gene–gene relationships.

 


View larger version (16K):
[in this window]
[in a new window]
 
Fig. 7 Cumulative probability distribution, the probability that the connectivity is ≥ K, versus K for coregulation thresholds of (a) 0.90 and (b) 0.93. The curves are least-squares fits, with Equation (10) and ß = –1 represented by black, Equation (10) and ß estimated by the regression represented by blue, and Equation (12) represented by red. The blue curve was fit using the nls function (R Development Core Team, 2003), with initial parameter values ß = –1.5, a1 = 0.055 and a2 = 0.6 since those values of a1 and a2 were typical in Agrawal (2002) for ß = –1. The online supplementary information has color figures.

 

    NOTE ADDED IN PROOF
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 
After this paper was revised in March 2004, I became aware of a related approach to static networks: Schafer J and Strimmer K (2004). An empirical bayes approach to inferring large-scale gene association networks, ‘Bioinformatics’, Advance Access published on October 12, 2004; doi: 10.1093/bioinformatics/bti062.


    Acknowledgments
 
I would like to express sincere appreciation for the following contributions. Evelyn Bickel and Nan Eaton enthusiastically helped me prepare the manuscript for publication. The comments of the two anonymous reviewers led to improvements. Conversations with Eric Schadt stimulated my interest in gene networks.


    Footnotes
 
*Present address: Pioneer Hi-Bred International, 7250 NW 62nd Street, PO Box 552, Johnston, IA 50131-0552, USA. Back

Received on January 21, 2004; revised on March 27, 2004; accepted on April 23, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 NEW METHODS
 3 APPLICATION TO CELL...
 NOTE ADDED IN PROOF
 REFERENCES
 

    Agrawal, H. (2002) Extreme self-organization in networks constructed from gene expression data. Phys. Rev. Lett., 89, 268702-4.

    Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, 57, 289–300.

    Bickel, D.R. (2003) Robust cluster analysis of microarray gene expression data with the number of clusters determined biologically. Bioinformatics, 19, 818–824[Abstract/Free Full Text].

    Bickel, D.R. (2004a) Degrees of differential gene expression: detecting biologically significant expression differences and estimating their magnitudes. Bioinformatics, 20, 682–688[Abstract/Free Full Text].

    Bickel, D.R. (2004b) Error-rate and decision-theoretic methods of multiple testing: which genes have high objective probabilities of differential expression?. Stat. Appl. Genet. Mol. Biol., 3, 8.

    Bickel, D.R. (2004c) On ‘Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates’: does a large number of tests obviate confidence intervals of the FDR?'. arXiv.org e-print q-bio.GN/0404032.

    Breiman, L. Probability, (1992) , Philadelphia Society for Industrial and Applied Mathematics.

    Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R., Kohane, I.S. (2000) Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl Acad. Sci. USA, 97, , pp. 12182–12186[Abstract/Free Full Text].

    De Jong, H. (2002) Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol., 9, 67–103[CrossRef][ISI][Medline].

    Efron, B. and Tibshirani, R. An Introduction to the Bootstrap, (1993) , New York Chapman & Hall/CRC Press.

    Efron, B. and Tibshirani, R. (2002) Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol., 23, , pp. 70–86[CrossRef][ISI][Medline].

    Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001) Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc., 96, 1151–1160[CrossRef][ISI].

    Fernando, R.L., Nettleton, D., Southey, B.R., Dekkers, J.C.M., Rothschild, M.F., Soller, M. (2004) Controlling the proportion of false positives (PFP) in multiple dependent tests. Genetics, 166, 611–619[Abstract/Free Full Text].

    Genovese, C. and Wasserman, L. (2002) Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B, 64, 499–517[CrossRef].

    Herrmann, C., Berthélemy, M., Provero, P. (2003) Connectivity distribution of spatial networks. Phys. Rev. E, 68, 026128-1–026128-6.

    Kellam, P., Liu, X., Martin, N., Orengo, C., Swift, S., Tucker, A. (2002) A framework for modeling virus gene expression data. Intell. Data Anal., 6, 265–279.

    Müller, P., Parmigiani, G., Robert, C., Rousseau, J. (2004) Optimal sample size for multiple testing: the case of gene expression microarrays. J. Am. Stat. Assoc., 99, 990–1001[CrossRef].

    Pepe, M.S., Longton, G., Anderson, G.L., Schummer, M. (2003) Selecting differentially expressed genes from microarray experiments. Biometrics, 59, 133–142[CrossRef][ISI][Medline].

    R: A Language and Environment for Statistical Computing. R Development Core Team. (2003) , Vienna, Austria ISBN: 3-900051-00-3 R Foundation for Statistical Computing.

    Rho, K., Jeong, H., Kahng, B. (2003) Identification of essential and functionally modulated genes through the microarray assay. preprint, arXiV.org e-print.

    Storey, J.D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64, 479–498[CrossRef].

    Storey, J.D. (2003) The positive false discovery rate: a Bayesian interpretation and the Q-value. Ann. Stat., 31, 2013–2035[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Y. Zhou, C. Cras-Meneur, M. Ohsugi, G. D. Stormo, and M. Alan. Permutt
A global approach to identify differentially expressed genes in cDNA (two-color) microarray experiments
Bioinformatics, August 15, 2007; 23(16): 2073 - 2079.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/1121    most recent
bti140v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (4)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Bickel, D. R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Bickel, D. R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?