Bioinformatics Advance Access originally published online on August 7, 2006
Bioinformatics 2006 22(17):2158-2159; doi:10.1093/bioinformatics/btl357
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Branch and bound computation of exact p-values
Center for Biomolecular Science and Engineering, School of Engineering 1156 High Street, University of California, Santa Cruz, CA 95064, USA
| ABSTRACT |
|---|
|
|
|---|
Summary: P-value computation is often used in bioinformatics to quantify the surprise, or significance, associated with a given observation. An implementation is provided that computes the exact p-value associated with any observed sample, against a null multinomial distribution, using the likelihood-ratio statistic. The efficient branch and bound code, far exceeding the full enumeration implemented by commercial packages, is especially useful with small sample, sparse data and rare events, common scenarios in bioinformatics, where approximations are often inaccurate and inappropriate. This code base can also be adapted to compute exact p-values of other statistics in diverse sampling scenarios.
Availability: Freely available at http://www.soe.ucsc.edu/~jill/src/
Contact: jill{at}soe.ucsc.edujill
| INTRODUCTION |
|---|
|
|
|---|
Bioinformatics is largely an exploratory science. A routine question that comes up over and over again in the context of data exploration is how surprising, or significant, an observation we make about some given data really is. One way to formalize this question is to describe the data using a probabilistic model Q. For a given sample T we ask how unexpected T is, assuming it was drawn from the probabilistic model Q. To quantify the answer, we can define a test statistic D, which measures how dissimilar any given sample T is from our expectation under the model Q. Formally, D : T
R is a function from every possible sample outcome T to a non-negative real number. The larger D(T) the more different the observed sample T is from a sample we expect to obtain from Q. In this scenario, the p-value of a sample T is the chance probability of seeing samples as surprising as T under the null model Q. When the sample space is finite, this amounts to the following:
![]() | (1) |
With the advent of ever growing computational power, much algorithmic and numerical research evolved to allow an accurate estimation of exact p-values [surveyed in Agresti (2001)]. In Bejerano et al. (2004) we recently introduced a generic branch and bound approach to efficiently compute Equation (1). Here I provide an implementation that successfully applies this methodology to compute a specific p-value more rapidly, more accurately, and over a broader range of cases than is possible using the explicit summation of Equation (1), currently implemented by commercial packages.
| METHODS |
|---|
|
|
|---|
The provided implementation computes the exact p-value of the following scenario: Q = (q1, q2, ... , qnk) is a multinomial distribution over a finite set of k possible outcomes. T = (n1, n2,... , nk) is a sample of size
over the k categories. In this sample, the i-th category appears ni times. The statistic D used to measure how dissimilar sample T is from the expected sample governed by Q is the likelihood ratio statistic:
![]() |
i, ni = nqi, which is the most probable result (up to fractions) when drawing a sample according to distribution Q. In all other cases this sum is positive, growing larger the more unlikely T is according to Q.
In order to span the search space of all possible samples of size n, a standard recursive procedure is used to unfold the assignment tree of Figure 1a. The set of all
possible assignments makes up the leaves of this tree. Internal nodes of the tree hold partial assignments of the form
= (n1, ... , ni1,, ... , ), where n1, ... , ni1 have already been assigned, while categories i through k await assignment of the remainder
counts. Let [
] denote all the leaves in the subtree rooted in the node labeled
. Equivalently, this is the set of all valid assignments that can be completed from
. The key to performing branch and bound computation on this tree lies in defining two functions, Dmin, Dmax :
R such that
![]() |
, they obtain the shapes shown in Figure 1b. We can now take advantage of this knowledge by intersecting the two curves with the threshold D(T) from Equation (1). This divides the range
into up to five distinct regions. We find these regions efficiently using binary searches, and then proceed to treat each region appropriatelythe recursion descends only into nodes where D(T') may be either above or below D(T); we discard internal nodes where D(T') will invariably be smaller than D(T); and we sum up the probability of the appropriate subtrees where D(T') will invariably be at least as large as D(T). By pruning or summing internal nodes we drastically reduce the search space actually traversed. By computing the probability of an entire subtree PQ(
) directly, without enumerating all assignments T
[
] we also increase the accuracy of the computation by avoiding small quotients and imbalanced summations. In Bejerano et al. (2004) we demonstrate the speed-up and accuracy gain obtained by this approach, and show that it is particularly efficient in small sample, sparse null hypotheses and rare events, all common scenarios in bioinformatics.
|
| DESCRIPTION |
|---|
|
|
|---|
The tarball contains the C++ source code, and a readme file that describes how to configure, compile and run it. When invoked the user is asked to enter the number of categories k, the null multinomial distribution Q and an observed sample T from which n and D(T) are computed. Alternatively the user can input n, D(T) directly. The user can then choose to compute the p-value in one of five ways:
- Asymptotic
2 approximation.
- Full enumeration of Equation (1).
- Recursive branch and bound computation.
- Branch and bound coupled with binary search speed up.
- Monte Carlo simulation.
When performing an exact computation the user is asked to specify a permutation
over the k categories, such that when assigning them, instead of assigning n1 first, then n2, etc. as in Figure 1a, one first assigns n
(1), then n
(2), etc. Changing the assignment order changes how the sample space is unfolded, which for the branch and bound alternatives means that different subtrees are examined and resolved for different permutations. When coupled with the binary searches, it appears empirically that the best assignment order is the one that orders the qi's in ascending order. In this case, the implementation recommends this permutation but does not enforce it. While greatly reducing the actual search space, runtime complexity does remain exponential in k (Bejerano et al., 2004) Thus, exact computation is mostly adequate for small values of k and small to moderate values of n. Two additional speed-ups that may have an adverse affect on accuracy are offered as configuration options (see the readme file): replacing actual log/exp operations with a look-up table of selected values, and linear interpolation between them; and allowing, instead of adding subtrees in Figure 1b, to add the probability of the father node, and subtract from it the complementary subset of subtrees, when this results in performing less arithmetic operations overall. The resulting p-value, as well as additional parameters of the computation, are outputted both in the form of a table row, and as more detailed untabulated free text (sent to cout and cerr, respectively).
As discussed in (Bejerano et al., 2004), and illustrated in Figure 1c, the implementation is valuable in computing the exact likelihood-ratio p-value in different bioinformatic scenarios. The code base serves as an ideal starting point to expand into other test statistics, such as Pearson's X2, and all other statistics of the Cressie-Read family (Read and Cressie, 1988). It is also readily extensible to many other hypothesis tests where the sample space can be traversed in a similar fashion.
| Acknowledgments |
|---|
G.B. performed parts of this work at the Hebrew university, where he was supported by a grant from the ministry of science, Israel.
Conflict of Interest. none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on December 27, 2006; revised on June 23, 2006; accepted on June 24, 2006
| REFERENCES |
|---|
|
|
|---|
Agresti, A. (2001) Exact inference for categorical data: recent advances and continuing controversies. Stat. Med, . 20, 27092722[CrossRef][ISI][Medline].
Bejerano, G., et al. (2004) Efficient exact p-value computation for small sample, sparse, and surprising categorical data. J. Comput. Biol, . 11, 867886[ISI][Medline].
Rahmann, S. (2003) Dynamic programming algorithms for two statistical problems in computational biology. Lecture Notes Comp. Sci, . 2812, 151164.
Read, T.R.C. and Cressie, N.A.C. Goodness-of-Fit Statistics for Discrete Multivariate Data, (1988) , New York Springer-Verlag.
This article has been cited by other articles:
![]() |
A. Tomovic and E. J. Oakeley Position dependencies in transcription factor binding sites Bioinformatics, April 15, 2007; 23(8): 933 - 941. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




