Bioinformatics Advance Access originally published online on April 26, 2007
Bioinformatics 2007 23(12):1519-1526; doi:10.1093/bioinformatics/btm140
Two-stage designs applying methods differing in costs
Section of Medical Statistics, Medical University of Vienna, Spitalgasse 23, A-1090 Vienna, Austria
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Two-stage pilot and integrated designs are powerful tools for investigating large numbers of hypotheses. Asymptotically optimal two-stage designs controlling the familywise error (FWE) rate or the false discovery rate (FDR) are considered when costs and effect sizes per measurement differ between stages and total costs are constrained.
Results: Depending on the cost and effect size ratios between the measurements, it is generally more powerful to apply two-stage procedures using one measurement method at both stages. For the practically relevant case in which the same method is applied at both stages but designing the second-stage measurements incurs extra costs, two-stage designs are more powerful than the single-stage design even for large cost ratios. The power of the optimal pilot and integrated two-stage designs is generally similar; however, the integrated approach is less sensitive even to severe design misspecifications in the planning phase.
Availability: R-programs (R, 2005) to calculate asymptotically optimal designs are available on: http://statistics.msi.meduniwien.ac.at/index.php?page=ao2stage
Contact: alexandra.goll{at}meduniwien.ac.at
| 1 INTRODUCTION |
|---|
In gene expression and proteomic studies, we generally deal with large numbers of hypotheses, of which only a small fraction show noticeable effects. Due to limited resources, the number of observations per hypothesis in a conventional single-stage design is low, which limits the power. It has been shown that two-stage (or multi-stage) designs are a good option to improve the power. In these sequential designs, early stages are used to screen for promising hypotheses, which are further investigated in later stages. For example, Zehetmayer et al. (2005) proposed (optimal) two-stage designs that control the false discovery rate (FDR, see Benjamini and Hochberg, 1995) for experiments with a large number of hypotheses and constraints on the total sample size. All hypotheses whose conventional univariate first-stage P-values fall below a certain common threshold are selected for the second stage. The final test decision is based on the observations pooled over both stages (integrated design); see also Bukszár and Van den Oord (2006), Satagopan and Elston (2003), Satagopan et al. (2002), Satagopan et al. (2004) and Van den Oord and Sullivan (2003). Zehetmayer et al. (2005) also investigated optimal pilot designs, where the final test is based only on the second-stage data. Further comparisons between the pilot and the integrated design can be found in Skol et al. (2006). In all these proposals, constant costs and effect sizes over stages have been assumed.
In the following, we investigate two-stage designs that, for cost reasons, use a less accurate assay in early stages and more accurate ones in later stages (see also Wang et al., 2006). For example, a quasi-quantitative, global LC-MS profiling proteomics experiment may underestimate the true effect size due to saturation or sensitivity effects inherent in these multiplexed assays, whereas a targeted, calibrated assay (e.g. ELISA) can show an effect size generally larger than that of the profiling study. In the first scenario, the experimenter has from the beginning the choice between two methods that differ in costs and effect sizes. In the second scenario, different costs per measurement may arise if the same method is applied at both stages but specific experimental devices have to be produced at higher costs per measurement for the selected markers at the second stage. In contrast to Wang et al. (2006), who constructed designs minimizing the overall costs for a given FWE rate and power, we assume that the total costs of the experiment are fixed, similar to Satagopan et al. (2002), Zehetmayer et al. (2005) or Ohashi and Clark (2005). For limited total costs, we derive both integrated and pilot designs with an asymptotically optimal power (for an increasing number of null hypotheses), either controlling the FWE rate or the FDR. The test problem is defined in Section 2 and the corresponding single-stage procedures in Section 3. In Sections 4 and 5, we define the asymptotically optimal pilot and integrated design. In Section 6, we show for the first scenario that, depending on the cost and effect size ratios between the methods, it is preferable to apply either the low-cost or the high-cost method at both stages. The second scenario is investigated in Section 7, where we calculate the cost ratios between stages for which it is worthwhile to use (optimal) two-stage designs.
We further examine how design misspecifications in the planning phase change the power of two-stage designs as compared to the standard single-stage design. A short discussion including some results under less stringent distributional assumptions is given in Section 8.
| 2 TEST PROBLEM |
|---|
Consider m1 (null) hypotheses for the mean of independent normally distributed observations with known variance: $H_{0i}: \mu_i = 0$ against $H_{1i}: \mu_i > 0$, $i = 1, \ldots, m_1$.
For deriving the test procedures, we assume independence of observations across hypotheses.
| 3 THE SINGLE-STAGE DESIGN |
|---|
We assume that there is a limit on the overall total costs C of the study. Without loss of generality, the costs per observation of the single-stage design are set to 1. In the standard single-stage design, we equally allocate
is the distribution function of the standard normal distribution. The P-values are compared to a common critical boundary
: If
0 of the m1 hypotheses considered the null hypothesis is true. To simplify later calculations, we also assume that the same mean
To control the FWE rate (the probability of rejecting at least one true null hypothesis, irrespective of how many and which are in fact true) at level $\alpha$, we apply the Bonferroni critical boundary $\alpha/m_1$. The power of such a single-stage design is defined by
, where
denotes the type 2 error as a function of the rejection boundary
,
is the distribution function of the normal distribution with mean µ and variance
and
is the (1–
) -quantile of the standard normal distribution. Note that under the assumption of a common alternative, the power is the expected fraction of null hypotheses correctly rejected.
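As a rough numerical illustration of the single-stage Bonferroni power described above, the following Python sketch may help. It rests on assumptions that are ours: unit variance, equal allocation of C/m1 observations per hypothesis at unit cost, and a common standardized effect size called `delta`. The function and parameter names are hypothetical and are not taken from the R programs mentioned in the abstract.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def single_stage_fwe_power(total_cost, m1, delta, alpha=0.05):
    """Power for one true alternative under Bonferroni control (sketch)."""
    n = total_cost / m1                    # observations per hypothesis, unit cost
    boundary = alpha / m1                  # Bonferroni boundary for the P-values
    z_crit = STD_NORMAL.inv_cdf(1.0 - boundary)
    # Assumption: under the alternative the z-score is N(delta * sqrt(n), 1).
    type2_error = STD_NORMAL.cdf(z_crit - delta * sqrt(n))
    return 1.0 - type2_error


# C = 20 000 and m1 = 1000 are used in the text; delta = 0.75 is illustrative.
print(single_stage_fwe_power(20_000, 1000, 0.75))
```

With `delta = 0` the function returns the per-hypothesis level `alpha / m1`, which is a quick sanity check of the boundary.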
To control the FDR (the expected proportion of erroneous rejections among all rejections), we apply the method of Storey (2002), which estimates the FDR. The critical value
is determined as the maximal
such that
|
| (1) |
Here,
is the estimated proportion of true null hypotheses given by
|
| (2) |
. Hence, the critical boundary is determined from the sample such that the estimated FDR never exceeds the targeted value
. Using the method of Storey, the critical boundary is a random variable. Asymptotically, for large m1,
can be determined from the equation |
|
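The data-driven boundary described in this section can be sketched in Python as follows. Because formulas (1) and (2) are not reproduced here, the sketch uses the standard form of Storey's (2002) estimators; the tuning parameter `lam`, its default of 0.5, the cap of the estimated null proportion at 1 and all function names are our assumptions.

```python
def storey_pi0(p_values, lam=0.5):
    """Estimated proportion of true nulls: #{p > lam} / ((1 - lam) * m)."""
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / ((1.0 - lam) * m))  # capped at 1 (our choice)


def storey_threshold(p_values, alpha=0.05, lam=0.5):
    """Largest observed P-value whose estimated FDR does not exceed alpha.

    Returns 0.0 if no P-value qualifies. Using observed P-values as candidate
    boundaries yields the same rejections as the maximal continuous boundary.
    """
    m = len(p_values)
    pi0 = storey_pi0(p_values, lam)
    best = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        estimated_fdr = pi0 * m * p / rank   # rank = number of P-values <= p
        if estimated_fdr <= alpha:
            best = p
    return best


# Three very small P-values among otherwise unremarkable ones.
example = [1e-5, 2e-5, 3e-5] + [i / 100 for i in range(10, 100)]
print(storey_pi0(example), storey_threshold(example))
```

All hypotheses with P-values at or below the returned threshold would be rejected.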
| 4 THE PILOT DESIGN |
|---|
4.1 The test procedure
We consider the same test problem as described in Section 2. Again, we assume there is a limit on the overall total costs C of the study. Now, a fraction r of the total costs C is used at the first stage for testing the m1 hypotheses. Thus, for balanced sample size allocation, the sample size of the first stage per hypothesis is $n_1 = rC/m_1$. All null hypotheses whose P-values fall below a threshold
1
2. Consequently,
2
4.2 Optimal designs controlling the FWE rate
To control the FWE rate, we simply apply the Bonferroni method to determine the rejection boundary for the second-stage P-value
, but in contrast to the single-stage design, the adjustment refers to the number of selected hypotheses m2, i.e. the level $\alpha$ is divided by m2.
Since m2 is independent of the second-stage data, this procedure clearly controls the FWE rate at the level $\alpha$.
We now seek values of
1 and r that maximize the power of the two-stage design controlling the FWE rate. We assume that at stage 1 for all alternative hypotheses the same mean
and at stage 2 the same mean
, holds true, respectively. Here, k is the ratio of the effect sizes between the two stages, and we assume that the high-cost method at the second stage never provides a smaller effect size than the low-cost method at stage one. The first-stage power (the probability of being selected) for a true alternative is given by
|
|
Note that under the assumption of a common alternative, this is the expected proportion of correctly selected null hypotheses among all null hypotheses for which the alternative holds.
For the second stage we select m2 hypotheses which, for large m1, is given by
|
|
Because of the independence between the two stages, the overall power of the pilot design, i.e. the expected fraction of null hypotheses correctly rejected after the second stage, is asymptotically given by
|
| (3) |
Given an FWE rate
, an initial number of hypotheses m1, overall costs C, the cost ratio c2 between stages, the proportion of true null hypotheses
0, the effect size
and the effect size ratio k between stages we can optimize
p in the two design parameters r and
1. Considering r as a continuous variable, the optimal sample sizes per stage (n1 and n2) in general will be non-integer. It is easy to see that the optimal
1 and r depend on C, m1,
and k via
and
.
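The optimization described in this subsection can be sketched numerically. The Python code below encodes our reading of the verbal description: the first-stage power is the probability that a true alternative is selected, the expected number of selected hypotheses combines false and true selections, the remaining budget is spread over the selected hypotheses at cost c2 per observation, and the second stage is tested at the Bonferroni level for m2 hypotheses. The names `gamma1` (selection boundary), `pi0` (proportion of true nulls) and `delta` (first-stage effect size), the guard on m2 and the coarse grid are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()  # standard normal distribution


def pilot_fwe_power(r, gamma1, total_cost, m1, pi0, delta, k=1.0, c2=1.0, alpha=0.05):
    """Asymptotic pilot-design power with Bonferroni control at stage 2 (sketch)."""
    n1 = r * total_cost / m1  # first-stage observations per hypothesis, unit cost
    power1 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - gamma1) - delta * sqrt(n1))
    # Expected number of selected hypotheses: false selections plus true ones.
    m2 = m1 * (pi0 * gamma1 + (1.0 - pi0) * power1)
    m2 = max(m2, 1.0)  # guard (our assumption) so that alpha / m2 stays below 1
    n2 = (1.0 - r) * total_cost / (c2 * m2)  # second-stage observations
    power2 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - alpha / m2) - k * delta * sqrt(n2))
    return power1 * power2


def optimise_pilot(total_cost, m1, pi0, delta, k=1.0, c2=1.0, alpha=0.05):
    """Coarse grid search over r and gamma1; returns (power, r, gamma1)."""
    best = (0.0, None, None)
    for i in range(5, 96):          # r = 0.05, 0.06, ..., 0.95
        r = i / 100.0
        for j in range(1, 201):     # gamma1 = 0.002, 0.004, ..., 0.400
            gamma1 = j / 500.0
            power = pilot_fwe_power(r, gamma1, total_cost, m1, pi0, delta, k, c2, alpha)
            if power > best[0]:
                best = (power, r, gamma1)
    return best


# C, m1 and c2 follow the text; pi0 = 0.99 and delta = 0.75 are illustrative.
print(optimise_pilot(20_000, 1000, 0.99, 0.75, c2=5.0))
```

A grid search is used only for transparency; a continuous optimizer would refine the result, and the values found here should not be expected to reproduce Table 1.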
4.3 Optimal designs controlling the FDR
To control the FDR, the second-stage critical boundary
2 is determined as in formulas (1) and (2), replacing m1 by m2. Asymptotically, for large m1, the first-stage selection boundary
1 and the second-stage rejection boundary
2 in the pilot design have to adhere to the equation
|
| (4) |
1))(1–ß2(
2)) is the power
p of the pilot design defined in (3) using
2 instead of
p can be optimized as function of r and
1, where
2 follows from condition (4).
| 5 THE INTEGRATED DESIGN |
|---|
5.1 The test procedure
We address the same test problem as in Section 2. Also, the screening step of the test procedure at the first stage is identical to the pilot design in the previous section. The only difference to the pilot design is that the final test decisions based on the selected null hypotheses are derived from integrated P-values
|
| (5) |
Now the test decision is again very simple: a selected null hypothesis
is rejected in the final test if
. Otherwise it is accepted. Optimizing the non-centrality parameter
of the test statistics zi leads to the optimal weight
|
| (6) |
If the same method (with the same effect size, k = 1) is used at both stages, then the weight
corresponds to that used in a group sequential two-stage design. Note that using non-optimal weights may lead to a larger power of the pilot design as compared to the integrated design when the effect size in the second stage is much larger than in the first stage (as already pointed at by Skol et al., 2006).
5.2 Optimal designs controlling the FWE rate
For the control of the FWE rate, the corresponding
is the solution of:
|
| (7) |
s is set to
denotes the density function of the standard normal distribution. Note again that n2 is random because it depends on the number of selected hypotheses (which also is random). By re-formulating the test decisions in terms of a sequential P-value |
| (8) |
1. Note that the optimal
1 and r, as in the pilot design, depend on C, m1,
and k via
5.3 Optimal designs controlling the FDR
For the control of the FDR, asymptotically the rejection boundary for the P-values in the final test is given by the solution of
|
| (9) |
s is a function of
which is given by (7). Such a two-stage procedure with a predefined sample size allocation rule controls the FDR, since it can be shown that the resulting sequential P-values
1 can be determined by maximizing the power (8) under the constraint (9). The rejection boundary
for the P-values pi of the selected null hypotheses calculated from pooling stagewise z-scores (5) with optimal weights (6) can then be found numerically by solving Equation (7).
| 6 COMPARISON OF TWO-STAGE PROCEDURES |
|---|
6.1 Pilot design
Assume first that the experimenter has two different candidate methods for the measurements from the very beginning, a low-cost standard method and a high-cost improved method. The experimenter could apply the same method at both stages (low–low or high–high) or switch to the more expensive method at the second stage (low–high). In the following, we investigate which of these three procedures is most powerful when controlling the FWE rate. Since the same test statistics are used with only the critical boundaries modified, we expect similar findings when controlling the FDR. The power of the pilot design controlling the FWE rate for the low–high procedure is given by (3). Clearly the power of a procedure using the low-cost method in both stages,
1 are identical. Since formula (3) is monotonic in c2, the two-stage procedure applying the low-cost measurement method at both stages dominates the other two procedures (low–high and high–high) if the high-cost method is not sufficiently efficient, i.e. when
= 0.05 (FWE), was used assuming an effect size for the low-cost measurement method of
= 0.5. The asymptotically optimal power is given for the three procedures. The solid lines mark the respective maximal power over the three procedures if at least one observation is left at the first stage for the optimal high–high procedure. Note that for the other two procedures, the asymptotically optimal n1 is always larger than one. Obviously, the high–high procedure has the maximum power for relatively low costs c2. For the effect size ratio k = 4, the solid curve jumps when the costs of the high-cost method become so large that the asymptotically optimal n1 falls below 1. Here, the region where the low–high procedure is preferable to both other procedures is very small; for k = 3 no such region exists. If we apply the constraint
|
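The comparison of the three procedures can be explored with a small Python sketch. It generalizes our reading of the pilot power to stage-specific costs and effect sizes: low–low uses cost 1 and effect `delta` at both stages, high–high uses cost c2 and effect k·`delta` at both stages, and low–high switches between them. The symbols `delta`, `gamma1` and `pi0`, the guard on m2 and the grid are assumptions of this sketch, not the original implementation.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()


def pilot_power(r, gamma1, total_cost, m1, pi0,
                effect1, cost1, effect2, cost2, alpha=0.05):
    """Pilot-design power with stage-specific effects and costs (sketch)."""
    n1 = r * total_cost / (cost1 * m1)
    power1 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - gamma1) - effect1 * sqrt(n1))
    m2 = max(1.0, m1 * (pi0 * gamma1 + (1.0 - pi0) * power1))  # guarded
    n2 = (1.0 - r) * total_cost / (cost2 * m2)
    power2 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - alpha / m2) - effect2 * sqrt(n2))
    return power1 * power2


def procedure_powers(r, gamma1, total_cost, m1, pi0, delta, k, c2):
    """Power of the three procedures for one fixed design (r, gamma1)."""
    design = (r, gamma1, total_cost, m1, pi0)
    return {
        "low-low": pilot_power(*design, delta, 1.0, delta, 1.0),
        "low-high": pilot_power(*design, delta, 1.0, k * delta, c2),
        "high-high": pilot_power(*design, k * delta, c2, k * delta, c2),
    }


def best_powers(total_cost, m1, pi0, delta, k, c2):
    """Grid-optimized power of each procedure (each with its own best design)."""
    best = {"low-low": 0.0, "low-high": 0.0, "high-high": 0.0}
    for i in range(1, 20):            # r = 0.05, ..., 0.95
        for j in range(1, 101):       # gamma1 = 0.004, ..., 0.400
            powers = procedure_powers(i / 20.0, j / 250.0,
                                      total_cost, m1, pi0, delta, k, c2)
            for name, value in powers.items():
                best[name] = max(best[name], value)
    return best


# delta = 0.5 follows the text; pi0 = 0.99, k = 2 and c2 = 3 are illustrative.
print(best_powers(20_000, 1000, 0.99, 0.5, 2.0, 3.0))
```

In this sketch the three procedures coincide for any fixed design when c2 = k², because a method that costs c2 times as much and has a k-fold effect then delivers the same non-centrality per unit cost.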
6.2 Integrated design
Comparing the three procedures for the integrated design, we have to modify the formula for the power
6.3 Examples: optimal designs for k = 1 and c2 > 1
The previous sections have shown that if two methods differing in costs and effect sizes are available, two-stage designs applying the same method at both stages may be preferable. Asymptotically optimal two-stage designs applying the same method at both stages (k = 1) can be derived as in Zehetmayer et al. (2005) if the costs do not differ between stages (c2 = 1), using appropriately defined total costs C. In the following, we focus on designs using the same method at both stages, with the second-stage measurement, however, incurring extra costs (c2 > 1). When c2 > 1, we have to use the power formulas (3) and (8) with k = 1 to derive asymptotically optimal designs. For k = 1 and several values of c2, Table 1 gives the design parameters of optimal pilot and integrated designs and their power when controlling the FWE rate and the FDR. Note that the optimal power values given for the integrated designs are only slightly larger than those of the pilot designs. For comparison, the power of the (asymptotic) single-stage designs with equal total costs for the control of the FWE rate and the FDR is also listed in Table 1. As Table 1 shows, the asymptotically optimal screening boundary
1 decreases with increasing costs c2. For the same costs, the screening boundary
1 slightly increases with increasing
. At the same time, the proportion of costs used for the first stage increases with
. Note that due to the complexity of the power function, the dependence on costs differs between small and large effect sizes and also depends on whether the FDR or the FWE rate is controlled. At least the asymptotically optimal number of selected hypotheses m2 increases with
and decreases with costs c2 throughout all designs considered. Note that using designs with stagewise integer sample sizes (first rounded downwards, then randomly choosing hypotheses for which the rounded sample size is increased by 1 in order to achieve constant total costs) does not noticeably decrease the power as compared to the optimal non-integer designs. Simulations (100 000 runs each) for the cases C = 20 000, m1 = 1000,
,
= 0.75, c2 = 5 and 15 from Table 1 show for the pilot design power values of
and 0.574, respectively for an FWE rate of
= 0.05 and
and 0.660 for FDR control at the same level. It has to be mentioned that for large costs the number m2 of selected hypotheses may become small, so that the finite sample size modification of formulas (1) and (2) proposed by Storey et al. (2004) has to be used in order to guarantee control of the FDR. This leads to a slight decrease in power.
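The rounding rule for integer sample sizes mentioned above can be sketched as follows: every hypothesis first receives the sample size rounded downwards, and a random subset then receives one additional observation so that the total number of observations is preserved. Rounding the total to the nearest integer and the function name are our assumptions.

```python
import random
from math import floor


def integer_sample_sizes(n_per_hypothesis, n_hypotheses, rng=None):
    """Integer sample sizes whose sum matches the non-integer design (sketch)."""
    rng = rng or random.Random()
    base = floor(n_per_hypothesis)                       # rounded downwards
    total = round(n_per_hypothesis * n_hypotheses)       # observations to allocate
    extra = total - base * n_hypotheses                  # hypotheses that get +1
    sizes = [base] * n_hypotheses
    for index in rng.sample(range(n_hypotheses), extra):
        sizes[index] += 1
    return sizes


# Illustrative values: a non-integer first-stage sample size of 14.4.
sizes = integer_sample_sizes(14.4, 1000, random.Random(0))
print(sum(sizes), min(sizes), max(sizes))
```

The same function can be applied at the second stage with the realized number of selected hypotheses.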
|
| 7 WHEN TO USE TWO-STAGE DESIGNS |
|---|
7.1 Break even point in the cost ratio
It has been shown that for large m1 and constraints on the total costs, the power of an asymptotically optimal two-stage design may be considerably larger than the power of the corresponding single-stage design (see Table 1). Again, we consider the scenario in which the same method is applied at the two stages (k = 1) and the second-stage measurement incurs extra costs (c2 > 1). We investigate when it is more efficient in terms of asymptotic power to use a two-stage design rather than the single-stage design. We tackle the problem by asking whether a cost ratio
, k and
and
0 and
for the case of controlling the FWE rate or the FDR at
= 0.05. Again, C was set to 20 000 and m1 was set to 1000. The curves are fairly similar for control of the FWE rate and the FDR, the break even point varying more when the FDR is controlled. For large effect sizes, the power of the single-stage design and the pilot design are close to 1, and consequently
0 increases)
, this advantage in power of the single-stage FDR design over the single-stage FWE design decreases, whereas the optimal two-stage design controlling the FDR still has favorable properties as compared to the two-stage FWE design. Hence, larger second-stage costs can be afforded to achieve the same power as the corresponding single-stage design. This may lead to a crossing of the two corresponding curves.
|
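A break-even cost ratio of this kind can be approximated numerically. The sketch below, for FWE control and k = 1, searches for the second-stage cost ratio at which the grid-optimized pilot design has the same power as the Bonferroni single-stage design. It relies on our reading of the power formulas; the symbols `delta`, `gamma1` and `pi0`, the guard on m2, the coarse grid and the search interval are assumptions, so the result is indicative only.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist()


def single_stage_power(total_cost, m1, delta, alpha=0.05):
    n = total_cost / m1
    return 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - alpha / m1) - delta * sqrt(n))


def best_pilot_power(total_cost, m1, pi0, delta, c2, alpha=0.05):
    """Grid-optimized pilot power for k = 1 (coarse grid, sketch only)."""
    best = 0.0
    for i in range(1, 20):                      # r = 0.05, ..., 0.95
        r = i / 20.0
        n1 = r * total_cost / m1
        for j in range(1, 61):                  # gamma1 = 0.005, ..., 0.300
            gamma1 = j / 200.0
            power1 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - gamma1) - delta * sqrt(n1))
            m2 = max(1.0, m1 * (pi0 * gamma1 + (1.0 - pi0) * power1))
            n2 = (1.0 - r) * total_cost / (c2 * m2)
            power2 = 1.0 - PHI.cdf(PHI.inv_cdf(1.0 - alpha / m2) - delta * sqrt(n2))
            best = max(best, power1 * power2)
    return best


def break_even_cost_ratio(total_cost, m1, pi0, delta, alpha=0.05, upper=1e6):
    """Cost ratio c2 at which two-stage and single-stage power coincide."""
    target = single_stage_power(total_cost, m1, delta, alpha)
    lo, hi = 1.0, upper
    if (best_pilot_power(total_cost, m1, pi0, delta, lo, alpha) <= target
            or best_pilot_power(total_cost, m1, pi0, delta, hi, alpha) >= target):
        return None                              # no break-even point in [1, upper]
    for _ in range(60):                          # bisection on a log scale
        mid = sqrt(lo * hi)
        if best_pilot_power(total_cost, m1, pi0, delta, mid, alpha) > target:
            lo = mid
        else:
            hi = mid
    return sqrt(lo * hi)


# C and m1 follow the text; pi0 = 0.99 and delta = 0.75 are illustrative.
print(break_even_cost_ratio(20_000, 1000, 0.99, 0.75))
```

Bisection is applicable because, for a fixed design, the power of the sketch decreases as the second-stage cost ratio grows, so the optimized power is also non-increasing in c2.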
7.2 Impact of design misspecifications
Whereas costs are usually known a priori, the optimal designs depend on the unknown proportion
0 and effect size
. Hence, the impact of design misspecifications in the planning phase is an important issue. In the following, again we consider the scenario C = 20 000, m1 = 1000 and
= 0.05. It is assumed that the optimal r and
1 were planned for the situation where
= 0.75,
0 and
for controlling the FDR and FWE rate. Positive values indicate superiority of the two-stage design. The example with a cost ratio c2 = 15 (cf. Wang et al., 2006) is plotted for the pilot (first row of the panels) and the integrated design (second row). Not surprisingly, the figures show that the integrated design is more robust against misspecifications of
0 and
than the pilot design: it uses the whole data set from both stages for test decisions. The most robust design is the integrated design controlling the FWE rate (Fig. 3C). Here, in the parameter subspace shown, the two-stage integrated design is always noticeably better than the single-stage design. When controlling the FDR, the ability of the single-stage design to adapt to
0 results in smaller differences between the integrated two-stage design and the single-stage design (Fig. 3D): in the upper left corner, the single-stage design outperforms the two-stage design. The pilot design controlling the FWE rate is more sensitive to design misspecifications than the pilot design controlling the FDR. The design applies non-optimal selection criteria and, when the FWE rate is controlled, no adaptation to the correct parameters is possible in the second-stage sample (Fig. 3A): in the upper left corner, the power of the single-stage design may become substantially larger than that of the two-stage pilot design. When controlling the FDR, adapting to the true parameters in the second-stage sample helps a little (Fig. 3B): the single-stage design has only slightly larger power than the two-stage pilot design in the upper left corner. Generally, a design optimal for a fraction of true null hypotheses which is larger than the true
0 can lead to a considerable loss of power as compared to the corresponding single-stage design. However, if the true
0 gets larger than the proportion used for planning and the true effect size
is close to the one used for planning, the difference between two-stage designs and the single-stage design generally increases. Optimism in the planning phase with regard to the number of true alternatives may help to avoid a loss of power due to design misspecification. If the true effect size
gets larger than the one from the planning phase for values of
0 close to the true one, the power of the two-stage and single-stage designs both approach 1 so that the differences in the contour plots decrease.
|
| 8 DISCUSSION |
|---|
We have investigated two-stage designs in the situation that large numbers of null hypotheses are tested and only a small proportion of them are expected to be false. Moreover, it was assumed that there are constraints on the total costs of the experiment. The first stage is used to screen for promising hypotheses, which are then investigated further at the second stage. We focused on an important scenario in practice, assuming that costs per measurement differ between stages: on the one hand, extra costs may arise when the same measurements have to be designed for a subset of hypotheses selected in an interim analysis and investigated at the second stage. On the other hand, the investigator may from the very beginning have the choice between a low-cost method and a high-cost method (which hopefully is more efficient in terms of the effect size under the alternatives). Given a large number of candidate hypotheses, we derived asymptotically optimal designs in terms of power using the simplifying assumption of common alternatives (either controlling the FWE rate or the FDR).
We would like to summarize the results in the following way: if two different methods are available, then, depending on the ratios between costs and effect sizes, it is preferable to run two-stage designs which apply either the low-cost or the high-cost method at both stages. Designs starting with the low-cost method and switching to the more expensive method in the interim analysis may only be advisable if there is a lack of resources, so that the first-stage sample size for the high-cost method would be too small. However, it has to be kept in mind that the best design depends on the relationship between the effect size and the cost ratios. Hence, in case of effect size misspecifications in the planning phase, the low–high method may actually be more powerful than the low–low or the high–high strategy. However, it seems natural to apply a design which is preferable under the parametric constellation considered in the planning phase. In the integrated design, the optimal way of combining data from both stages arising from different measurement methods depends on the effect size ratio between stages, which introduces a further complication for appropriately designing such experiments applying different methods.
Two-stage screening designs are a very powerful tool even if we deal with equal effect sizes at the second stage, but the costs for designing the measurements for the selected hypotheses at the second stage are fairly high. Only severe design misspecifications in the planning phase may lead to a noticeable loss of power such that the single-stage design may become superior in power. With regard to the impact of design misspecification in the proportion of true alternatives, it seems to be preferable not to assume too small proportions in the planning phase. Integrated designs which use data from both stages for the final test decisions are more robust against design misspecifications.
With respect to deviations from the underlying assumption, we calculated optimal designs for the unknown variance case using the central and non-central t-distributions instead of the corresponding normal distributions. Again, assuming
= 0.75,
and 15 from Table 1, the optimal parameters for the pilot design controlling the FWE rate are r = 0.722,
and r = 0.703,
, respectively, which are very close to those of the known variance case. The corresponding optimal power values for the unknown variance case drop to 0.681 and 0.473. For the control of the FDR, the corresponding optimal design parameters in the unknown variance case change to r = 0.748,
for c2 = 5 and to r = 0.757,
for c2 = 15. The optimal power decreases to 0.747 and 0.565, respectively. However, using the optimal parameters for the known variance case in the situation of unknown variances leads to virtually the same performance as using the optimal parameters from the unknown variance case. Note that in the unknown variance case, the decision which of the procedures (low–low, high–high or low–high) is preferable is more difficult because no common crossing point in costs as a function of c2 between the three procedures exists. However, the region where the low–high procedure is preferable still remains small.
To investigate the impact of correlation, we assumed an autoregressive correlation structure among the hypotheses. The correlation between hypotheses i and j is given by
for some
. The alternative hypotheses (
= 0.75) are randomly distributed among the hypotheses. For example, the simulated power values (100 000 runs) for c2 = 5 assuming a correlation of
= 0.2, 0.6 and 0.9 are 0.753, 0.749 and 0.728, respectively, when controlling the FWE rate, and 0.802, 0.798 and 0.777, respectively, when controlling the FDR (compare Table 1). Hence, the impact of correlation is small, as in the case of constant costs in Zehetmayer et al. (2005). For the two-sided situation, we refer to their proposal to test a set of 2m1 one-sided hypotheses.
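Correlated test statistics of this kind can be generated with a simple recursion. The following sketch assumes that the autoregressive structure means a correlation of rho^|i−j| between the unit-variance z-scores of hypotheses i and j, which is the usual first-order autoregressive form; the correlation formula itself is not reproduced above, and the function names are hypothetical.

```python
import random
from math import sqrt


def ar1_z_scores(means, rho, rng):
    """Unit-variance normal z-scores with corr(z_i, z_j) = rho ** abs(i - j)."""
    innovation_scale = sqrt(1.0 - rho * rho)
    noise = rng.gauss(0.0, 1.0)
    z_scores = [means[0] + noise]
    for mean in means[1:]:
        # AR(1) recursion keeps the marginal variance of the noise equal to 1.
        noise = rho * noise + innovation_scale * rng.gauss(0.0, 1.0)
        z_scores.append(mean + noise)
    return z_scores


def lag_one_correlation(values):
    """Sample correlation between neighbouring entries."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    covariance = sum((a - mean) * (b - mean)
                     for a, b in zip(values, values[1:])) / len(values)
    return covariance / variance


# Under the global null (all means zero) the lag-one correlation is close to rho.
z = ar1_z_scores([0.0] * 20_000, 0.6, random.Random(1))
print(lag_one_correlation(z))
```

To simulate a design, the vector of means would hold the non-centrality of each true alternative at randomly chosen positions and zero elsewhere.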
| ACKNOWLEDGEMENTS |
|---|
We thank Sonja Zehetmayer, Martin Posch and the three anonymous referees for constructive comments. This work was supported by the Austrian FWF-Fund no. P18698-n15.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on February 9, 2007; revised on March 23, 2007; accepted on April 5, 2007
| REFERENCES |
|---|
Benjamini Y, Hochberg Y. (1995) Controlling the false discovery rate – a practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 57, 289–300.
Bukszár J, Van den Oord E. (2006) Optimization of two-stage genetic designs where data are combined using an accurate and efficient approximation for Pearson's statistic. Biometrics, 62, 1132–1137.
Lehmacher W, Wassmer G. (1999) Adaptive sample size calculations in group sequential trials. Biometrics, 55, 1286–1290.
Ohashi J, Clark AG. (2005) Application of the stepwise focusing method to optimize the cost-effectiveness of genome-wide association studies with limited research budgets for genotyping and phenotyping. Ann. Hum. Genet., 69, 323–328.
R Development Core Team. (2005) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Satagopan JM, et al. (2002) Two-stage designs for gene-disease association studies. Biometrics, 58, 163–170.
Satagopan JM, Elston RC. (2003) Optimal two-stage genotyping in population-based association studies. Genet. Epidemiol., 25, 149–157.
Satagopan JM, et al. (2004) Two-stage designs for gene-disease association studies with sample size constraints. Biometrics, 60, 589–597.
Skol AD, et al. (2006) Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat. Genet., 38, 209–213.
Storey JD. (2002) A direct approach to false discovery rate. J. R. Stat. Soc. B, 64, 479–498.
Storey JD, et al. (2004) Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. J. R. Stat. Soc. B, 66, 187–205.
Van den Oord EJ, Sullivan PF. (2003) A framework for controlling false discovery rates and minimizing the amount of genotyping in the search for disease mutations. Hum. Hered., 56, 188–199.
Wang H, et al. (2006) Optimal two-stage genotyping designs for genome-wide association scans. Genet. Epidemiol., 30, 356–368.
Zehetmayer S, et al. (2005) Two-stage designs for experiments with a large number of hypotheses. Bioinformatics, 21, 3771–3777.