Skip Navigation

Bioinformatics 2007 23(2):e36-e43; doi:10.1093/bioinformatics/btl323
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sun, Y.
Right arrow Articles by Buhler, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sun, Y.
Right arrow Articles by Buhler, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Biological Sequence Analysis

Designing patterns for profile HMM search

Yanni Sun * and Jeremy Buhler

Department of Computer Science and Engineering, Washington University St Louis, MO 63130, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 

Motivation: Profile HMMs are a powerful tool for modeling conserved motifs in proteins. These models are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome.

Results: We use PROSITE-like patterns as a filter to speed up the comparison between protein sequence and profile HMM. A set of patterns is designed starting from the HMM, and only sequences matching one of these patterns are compared to the HMM by full dynamic programming. We give an algorithm to design patterns with maximal sensitivity subject to a bound on the false positive rate. Experiments show that our patterns typically retain at least 90% of the sensitivity of the source HMM while accelerating search by an order of magnitude.

Availability: Contact the first author at the address below.

Contact: yanni{at}cse.wustl.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 
Detecting membership in protein families is a crucial part of annotating newly obtained proteomic sequence data. Because family membership is often associated with a known function, recognizing a new protein's family can provide evidence for its function, even when there is no close homolog known. Today, family annotation is supported by databases (Henikoff et al., 2000; Hulo et al., 2006; Bateman et al., 2004; Schultz et al., 1998; Attwood et al., 2003) containing thousands of sequence motifs, each describing one or more domains common to a protein family.

Protein motifs are subject to variation in each motif position, deletion of part of the motif, and insertion of non-motif sequence. This variability can be quantified by a profile hidden Markov model, or pHMM (Krogh et al., 1994), which assigns probabilities to the different types of variation at each position in a motif. Using well-known algorithms for comparing a sequence to an HMM (Rabiner, 1989), motifs represented as pHMMs can be recognized even in sequences distant from those used to train the models.

The growing number of known motifs, combined with the rapid appearance of new proteomes, poses a computational challenge for software that compares proteins with pHMMs. One such tool, HMMER (Eddy, 1998), requires 0.001–0.01 s on a modern CPU to compare a typical pHMM and protein using the Viterbi algorithm. A new genome yields 5000–20 000 hypothetical proteins, which must be compared with 10 000–200 000 pHMMs. Such comparisons require 1–500 CPU-days, motivating a search for ways to accelerate algorithms for protein-pHMM comparison.

A path to faster pHMM database search is to design a computationally efficient filter that rapidly eliminates protein–motif pairs for which a full pHMM comparison does not yield a significant match. In this work, we construct such a filter based on a simplified motif representation, PROSITE patterns (Hulo et al., 2006). These patterns describe two aspects of a motif: the most common residues in each of its conserved positions and the (perhaps variable) numbers of non-conserved residues between successive conserved positions.

Two difficulties arise in using PROSITE patterns as a filter for pHMM search. First, many more motifs are known as pHMMs than as patterns, so an efficient method is needed to derive patterns from pHMMs. Second, even if PROSITE itself contains a pattern for a motif, that pattern may not capture the wide range of variation allowed by the corresponding pHMM. Hence, a search filtered against these patterns may miss many motifs that would be found by an unfiltered search. While we cannot hope to capture every nuance of a pHMM with a pattern, we can design patterns to achieve an attractive trade-off between motifs missed and search efficiency.

This work describes an algorithm to convert a pHMM for a protein motif to one or more representative PROSITE patterns for use as a search filter. Our method combines three components: a knapsack-based approximation algorithm to choose conserved pHMM positions for the pattern; a forward-like algorithm to estimate gap lengths between conserved positions and a greedy covering approach to produce several patterns that jointly describe a pHMM. We validate our methods on motifs from the Pfam-A database (Bateman et al., 2004). Compared with unfiltered pHMM search, filtered search using our patterns exhibits 90% or better sensitivity for 72% of motifs tested and runs an order of magnitude faster.

1.1 Related work
PROSITE patterns have previously been used as a motif representation not only by the curators of PROSITE itself but also by software seeking to infer motifs from unaligned protein sequences. SPLASH (Califano, 2000; Hart et al., 2000) and Pratt (Jonassen, 1997) are two such tools. The authors of these tools compared their inferred patterns with hand-curated PROSITE patterns but not with the (usually more sensitive) pHMM representation. In particular, the patterns inferred in (Califano, 2000; Hart et al., 2000) were rigid—they did not permit variable-length gaps in the motif. This work bases pattern design on pHMMs, rather than the raw aligned sequences, and uses the information in them to address variable-length gaps more systematically. Rigid patterns were also used as a surrogate for more complex probabilistic models in the MITRA-PSSM motif finder (Eskin, 2004).

Other work to accelerate pHMM search has used quite different filtering approaches. In software, the HMMERhead algorithm (Portugaly et al., 2004) uses a multi-stage BLASTP-like approach. HMMERhead searches for short amino acid strings that appear in the motif with high probability, extends these matches by a partial Viterbi algorithm, and finally uses the full HMMER software on matches that pass the filter. In hardware, the TimeLogic DecypherHMM (Gordon and Sensen, 2004; Madera et al., 2004) and CompuGen Bioaccelerator (http://www.cgen.com/) engines implement a simplified pHMM Viterbi algorithm on an FPGA platform. In contrast to these approaches, our work seeks to capture as much model information as possible in the pattern matching heuristic, without resorting to more complex filters based on dynamic programming.

HMM logos were introduced in (Schuster-Boeckler et al., 2004) as a way to visualize the information contained in the parameters of a pHMM. Computing an HMM logo requires algorithms similar to those we use to compute the lengths of pattern gaps. However, PROSITE patterns are a more limited formalism than HMM logos (and hence faster for search), so our work includes additional combinatorial design steps.

Finally, the greedy covering approach we use for multiple pattern design exploits earlier work by our group (Sun and Buhler, 2004) and others (Li et al., 2004; Xu et al., 2004) on seed design for pairwise sequence comparison.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 
This section presents our algorithm to design patterns from a pHMM. During search, these patterns are scanned against a protein database, and only matching sequences are passed to a more expensive, unfiltered algorithm. We use HMMER's hmmsearch (Eddy, 1998) for unfiltered search.

After first formulating the problem, we briefly describe the pHMM formalism, then present our algorithm to design a locally optimal pattern based on a pHMM. The algorithm is presented in two parts: a method for rigid weight matrix models without gaps, followed by extensions to accommodate gaps. Finally, we extend our algorithm to pattern set design.

2.1 Problem definition
The patterns we design follow the grammar of PROSITE (Hulo et al., 2006). As an illustrative example, we designed the following pattern for protein family RNA_pol_Rpb2_3:

Formula
A pattern p consists of tokens, each of which can be a character, such as R, from sequence alphabet {Sigma}; a set of characters, such as [FILMV]; or a gap symbol ‘x’ followed by its length, such as x(3). Gaps may have either fixed or variable length; the latter specifies the range of allowed lengths, as in x(7,8). Define p's length |p| to be the number of tokens in it.

A pattern p matches a sequence S isin {Sigma}* if p matches any substring of S. A set of patterns P = {p1, ... , pn} matches S if any of its component patterns pi matches S.

The Viterbi score of a sequence S against a pHMM Formula is the score reported by running hmmsearch on S and Formula. A sequence is positive w.r.t. Formula if it has a Viterbi score higher than a given threshold against Formula. As suggested in the HMMER documentation, we use the Pfam GA cutoff for each model as our score threshold. For a pattern set P derived from pHMM Formula, P's sensitivity is defined as the probability that P matches a positive sequence w.r.t. Formula. P's match probability to a random background sequence is its fp (false positive) rate.

The general pattern design problem from a pHMM is:

Given a pHMM Formula, design a pattern set P = {p1, ... , pn} so that P's sensitivity w.r.t. Formula is maximized while maintaining P's fp rate below a threshold {theta}.

In this work, we do not bound the number n of patterns or the total false positive rate a priori but let both vary dynamically as follows. In the Pfam database, each pHMM is trained based on a seed multiple sequence alignment. We use the sequences in this alignment to evaluate the coverage of our pattern set, to decide when to stop adding new patterns. Each pattern is subject to an individual fp rate threshold {tau}. Hence, we actually address the following problem:

Given a pHMM Formula, design the smallest pattern set P = {p1, ... , pn} so that P matches every sequence in Formula's seed alignment, and each pattern has fp rate at most {tau}.

We emphasize that our core pattern design algorithm uses the pHMM, not the seed alignment. The seed sequences are used only to determine when to stop adding patterns to the set.

2.2 The profile HMM formalism
We design patterns to approximate the core model of the ‘Plan 7’ pHMM schema used by HMMER, which is illustrated in Figure 1.


Figure 1
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Core section of the Plan7 pHMM architecture.

 
A Plan 7 model consists of match states M, deletion states D, and insertion states I, flanked by starting state B and ending state E. A pHMM for a motif of length L contains L match states. Match state Mi emits the i-th motif residue, while insertion state Ii emits background residues after this i-th residue. Each state Mi defines an emission probability distribution over {Sigma}. Mi emits residue j with probability Mij. All states Ii emit residues according to a common background distribution {pi}. Ii emits residue j with probability Iij = {pi}j. Deletion state Di allows the i-th motif residue to be skipped; it is non-emitting.

At each position i of a motif, there are seven allowed transitions: Mi -> Mi+1, Mi -> Ii, Mi -> Di+1, Ii -> Ii, Ii -> Mi+1, Di -> Di+1, and Di -> Mi+1. A pHMM can generate a state path by first following a transition B -> Mi, then extending the path by transitions as described above until it reaches E following a final transition Formula. Match and insertion states s on the path emit residues according to their emission probabilities. We denote by A(s, e) the probability of state transition s -> e.

2.3 Designing a single pattern for a weight matrix
We begin by addressing the problem of pattern design for a rigid profile or weight matrix. We may transform Formula of length L into a weight matrix of size L x |{Sigma}| by ignoring insertion and deletion states.

Let Formula be the L x |{Sigma}| matrix of match-state emission probabilities derived from Formula, and let Formula be an L x |{Sigma}| matrix that is just the background residue model replicated L times. For each column 1 ≤ i ≤ L, Formula(i, j) = Mij, while Formula(i, j) = {pi}j.

Let xij be 1 if pattern p allows residue j in position i, or 0 otherwise. Consider the following problem:


Formula 1

(1)
subject to the constraint


Formula 2

(2)
This problem seeks a collection of residues for each profile position (determined by the xijs) so that the pattern formed from these collections has the highest possible sensitivity, i.e. probability of matching a sequence generated from the profile, subject to a fixed bound {tau} on its total false positive rate, i.e. its probability of matching a random sequence from the background model. A pattern gap ‘x’ at position i corresponds to setting all xij = 1 for that i and does not change either the objective or the constraint value.

We note that the above formulation considers the probability of a pattern match at only one offset, namely 0, into the motif. If the chosen pattern is shorter than the motif (i.e. has gaps at its ends), it could be applied at additional offsets into the motif, providing additional opportunities to match. The objective of Equation (1) therefore underestimates sensitivity. However, unless the motif contains repetitive structure, it is unlikely that other pattern offsets contribute significantly to overall sensitivity. We have confirmed this intuition empirically and verified that ranking candidate patterns by our objective yields results consistent with their ranking under more careful probability estimates. Similarly, the threshold {tau} is specified for only one offset in the background; the actual fp rate depends on the database size. However, one may vary {tau} to achieve a desired level of specificity.

If we limit L to 1 (i.e. a single-column pattern), our problem is equivalent to the 0–1 knapsack problem. In general, if we consider all possible subsets of |{Sigma}| different characters (including the empty subset) for each column 1 ≤ i ≤ L, the selection of columns and subsets for the pattern can be reduced to the maximum integer K-choice knapsack problem (Chandra et al., 1976) with L groups and 2|{Sigma}| objects per group. The k-th object in group i corresponds to the kth subset of characters for column i; it has value Formula and weight Formula , where Formula is 1 if character j is in the k-th subset for column i.

For constant |{Sigma}|, we obtain a polynomial-time approximation algorithm for our problem using a result of Chandra et al. (1976) for K-choice knapsack. However, this solution uses 2|{Sigma}| objects per column and so is impractical for protein. We therefore designed a new, two-step approximation algorithm based on knapsack. In the first step, we select d equally-spaced sensitivity thresholds {theta}1 ... {theta}d between 0 and 1, then use a known FPTAS for knapsack (Ibarra and Kim, 1975) to compute, for each column i, subsets Formula of characters with the smallest total weight for value at least {theta}z, 1 ≤ z ≤ d. The value d depends on both |{Sigma}| and the desired overall approximation ratio. In the second step, we use an approximation algorithm for K-choice knapsack, now with only d choices per column, to decide which columns and subsets to include in the pattern. It can be shown that our two-step algorithm yields an approximation bound SEN ≥ (1 – {varepsilon})mOPT(1+{varepsilon}) where SEN is the value of our solution, OPT is the value of an optimal solution, m is the number of columns in an optimal pattern, and {varepsilon} is the approximation factor for the knapsack problem on each column. We omit the proof due to space constraints.

In practice, we make two further heuristic improvements to our algorithm. First, note that if, for column i and two sets with indices k and k', Vik > Vik' and Wik < Wik', we can safely remove the latter set (k') from the K-choice knapsack problem, because choosing the former (k) instead can only improve the objective value. Eliminating all such ‘dominated’ sets from the problem significantly reduces its size without changing the optimum. Second, we preemptively eliminate from the problem columns i with sufficiently low information content. Uninformative columns are unlikely to contribute significantly to the final pattern. Removing such columns has the benefit of reducing the problem size, though it does break the formal approximation guarantee.

In practice, our algorithm efficiently produces patterns with high power to distinguish foreground from background sequences. We use an approximation factor {varepsilon} = 0.01 for the knapsack algorithm, resulting in d = 2000; other values may be chosen to trade-off between solution quality and computational cost.

2.4 Extension to full pHMMs
To extend our pattern design algorithm for weight matrices to full pHMMs, we recognize two facts: firstly, some motif positions can be skipped via deletion states; and secondly, background sequence may be inserted between positions via insertion states. We first deal with these facts directly, then describe an additional heuristic that focuses pattern design on regions without highly variable gaps.

2.4.1 Accommodating gaps
In the simplified model Formula, each column is equally likely to become part of a pattern, which ignores the possibility that some positions may be skipped. Let h(s) be a state s's hitting probability in Formula, which is the probability that s is contained in a random state path B -> ··· -> E. Some states may have much higher hitting probabilities than others, so their matrix columns are more likely to appear in the pattern.

We accommodate deletion in pattern design by multiplying Formula(i, j) in the knapsack algorithm above by h(Mi). Columns with higher hitting probabilities are therefore more likely to appear in the pattern. The probabilities h(Mi), h(Di) and h(Ii) are computed by a variant of the forward algorithm for HMMs; Schuster-Boeckler et al. (2004) describe a similar computation.

We now turn to the problem of insertion, which is the problem of choosing sizes for pattern gaps. Applying FP-KNAPSACK to a pHMM Formula generates a vector of positions c1, c2, ... , cm, where ci corresponds to match state Formula in Formula, along with sets of pattern characters for each ci. For example, our algorithm picked {R}, {FILMV}, {G} and {G} in positions 36, 46, 55 and 59 for the protein family RNA_pol_Rpb2_3. It remains to decide the gap length between each pair of adjacent positions ci, ci+1. This problem is trivial for a rigid motif: the gap length is ci+1ci – 1. For a full pHMM, however, insertion and deletion are possible, so we must determine the range of likely gap lengths between ci and ci+1.

We now describe the algorithm to compute the distribution of gap lengths between two states. Define f({ell}, Mi, s) to be the probability that a random sequence generated starting from match state Mi and ending with state s has length {ell}. f({ell}, Mi, s) is obtained by the following recurrence:

Formula
Initially, f(0, Mi, s) = 0 for all s, f(1, Mi, Mi) = 1, and f(1, Mi, Ii) = 0.

For match states Mi and Formula, i' > i, we compute Formula only for {ell} ≤ (i' – i + 8). Gaps with insertions longer than eight residues do not appear in our patterns because of the segmentation heuristic described below. Finally, we choose the contiguous range of all {ell} with probability bigger than a threshold (e.g. 0.05).

2.4.2 Segmenting pHMMs into conserved regions
For many protein families, biologically significant conservation is clustered within short regions of the sequence. For example, the HMM logo in Figure 2 contains two conserved regions separated by a long insertion after position 20. Note that, because the length distribution of insertions is exponential in a Plan 7 pHMM, gaps that can be long with significant probability are also highly variable. The corresponding insertion state I20 has a high hitting probability, so the two regions are effectively independent sub-motifs.


Figure 2
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 HMM logo for protein family WD40.

 
Our pattern design algorithm first segments the match states of a model into conserved intervals, then designs patterns for each region separately, and finally chooses the best pattern obtained from any single interval. Prior segmentation avoids generating patterns containing gaps of highly variable length. These gaps likely reflect spacers between domains in a model, whose large uncertainty in length can amplify a pattern's fp rate.

An interval is conserved if (1) all of its match states have high hitting probability (e.g. ≥0.95) and (2) it does not contain an insertion state that can generate long sequences with high probability. We evaluate condition 1 by computing the hitting probability for each match state as described in Section 2.4.1. To evaluate condition 2, we compute the gap length distribution between every pair of adjacent match states Mi, Mi+1 using the values f({ell}, Mi, Mi+1). We then calculate a range of gap lengths that covers 98% of the probability mass. If the gap length range between Mi and Mi+1 is wider than a threshold {psi}, we split Formula between Mi and Mi+1, design patterns for each interval separately, and keep the best pattern.

The splitting threshold {psi} is user-adjustable; by default, we set {psi} = 3. The smaller this threshold, the fewer variable length gaps the pattern contains. Allowing less variability in gaps lowers sensitivity but also lowers the fp rate, allowing a trade-off between these two properties of patterns.

2.5 Designing multiple patterns for a pHMM
We design a set of patterns for a pHMM using a greedy covering heuristic which has previously been applied successfully to design multiple seeds for pairwise comparison (Sun and Buhler, 2004; Li et al., 2004; Xu et al., 2004). The sequences in the seed alignment for each pHMM form a training set T. Initially, we design a single pattern p0 based on the pHMM Formula. A new pHMM Formula is then trained from the sequences in T that are not matched by p0, and a new pattern p1 is designed based on Formula. This process is repeated until all sequences in T are covered. Figure 3 illustrates the logic of the greedy covering procedure.


Figure 3
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Construction of multiple patterns.

 
The greedy covering algorithm designs a lossless pattern set w.r.t. the sequences in the seed alignment. This alignment generally contains few sequences, so it cannot describe all the variability of a protein family. Therefore, this strategy does not guarantee lossless filtration for all the positive sequences in a large protein database. However, since we train a new pHMM from the unmatched sequences in each round, weak signals in previous rounds are amplified in later pHMMs and captured by later patterns. We investigate the ultimate utility of this and our other design choices for a real protein database in the next section.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 
In this section, we first examine sensitivity of our designed patterns and fp rate relative to unfiltered HMMER on the Swiss-Prot database. We then compare our patterns with hand-curated PROSITE patterns. Finally, we evaluate our patterns' ability to accelerate HMMER. In what follows, we refer to our patterns as KP-patterns.

3.1 Experiments
We obtained Swiss-Prot Release 49.1, which contains 208005 protein sequences, from ftp://ftp.expasy.org. Release 19.0 of the Pfam-A database, which contains 8183 pHMMs, was downloaded from http://pfam.wustl.edu. We randomly sampled a large subset of pHMMs from Pfam, then retained 1825 sampled models that matched more than five sequences in Swiss-Prot. Using the approaches described in Section 2, we designed 1825 sets of KP-patterns based on these pHMMs, using threshold {tau} = 4 x 10–5. 80% of these KP-pattern sets contain only a single pattern. The average design time per pattern was 250 seconds on a 2.4 GHz AMD Opteron.

To evaluate our patterns, we first applied HMMER's hmmsearch program for each pHMM against Swiss-Prot, using the Pfam GA (gathering) threshold cutoffs. All the resulting positive sequences that are not included in the seed alignment constitute the test set. We computed a pattern's sensitivity based on this test set. To estimate fp rate, we ran the pattern against all of Swiss-Prot. We used an existing Perl program, ps_scan (Gattiker et al., 2002), to scan protein sequences for matches to a PROSITE pattern.

For a pHMM Formula, let Formula be the set of all positive sequences w.r.t. Formula which are not contained in Formula's seed alignment (and hence are not used for training). Suppose P is the KP-pattern set we designed for Formula. Let Formula be the set of matched sequences obtained by running ps_scan for pattern set P against Formula. P's sensitivity is defined as Formula. Similarly, let Formula be the set of all non-positive sequences for Formula in Swiss-Prot, and let Formula be the set of sequences matched by P in Formula. Then P's fp rate is defined as Formula.

3.2 KP-patterns' sensitivity and fp rate
Figure 4 shows a scatter plot of sensitivity and fp rate for the 1825 sets of KP-patterns. The majority of patterns exhibit high sensitivity (≥ 0.9) and low false positive rate (≤ 0.05). In more detail, Figure 5 shows the distribution of sensitivity across these 1825 pattern sets. Totally 72% of pattern sets have sensitivity at least 0.9. Because only patterns with high sensitivity are of practical interest, we plot an fp rate distribution only for KP-pattern sets with sensitivity at least 0.9 in Figure 6. Of these sets 84% have fp rate no higher than 0.05.


Figure 4
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 Sensitivity and fp rate for 1825 KP-pattern sets.

 


Figure 5
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Sensitivities for 1825 KP-pattern sets.

 


Figure 6
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 fp rate for 1306 KP-pattern sets with sensitivity at least 0.9.

 
Our results indicate that a small number of KP-patterns have unusually high fp rates or low sensitivity, despite being predicted to be both sensitive and specific. These patterns reflect the limitations of our sequence models; e.g. an fp rate based on a common i.i.d. background model for all proteins consistently underestimates the actual fp rate in Swiss-Prot. More concretely, we observed 66 pattern sets with sensitivity no higher than 0.5 on Swiss-Prot, even though they exhibit perfect sensitivity on their seed alignments. In these cases, training a pHMM on the positive sequences from Swiss-Prot yielded a motif model with very different properties than the original pHMM from Pfam. Future work should avoid overtraining by using more accurate modeling of motifs and background sequence when they differ substantially from Pfam's assumptions.

3.3 Comparison of KP-patterns to PROSITE patterns
The PROSITE pattern database contains hand-designed patterns incorporating the knowledge of expert curators. These patterns have been used as a benchmark for comparison with other systematically designed patterns. For these reasons, we compare the sensitivity and fp rate of hand-curated patterns (called ‘PROSITE patterns’ in what follows) relative to KP-patterns when used as filters for HMMER. Not every protein family in Pfam has a corresponding PROSITE pattern; among the 1825 pHMMs we sampled from Pfam, only 422 have corresponding PROSITE patterns. We therefore analyzed only these 422 families.

Figure 7 compares the sensitivity of a PROSITE pattern, a single KP-pattern, and the full KP-pattern set for each pHMM tested. Multiple KP-patterns yielded the highest sensitivity, but even single KP-patterns were more sensitive overall than the original PROSITE patterns. In particular, 114 KP-pattern sets and 98 single KP-patterns had sensitivity 1, while only 79 PROSITE patterns reached sensitivity 1.


Figure 7
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7 Sensitivity of PROSITE and KP-patterns for 422 motifs.

 
Although KP-patterns tend to have higher sensitivity than PROSITE patterns, most PROSITE patterns have very low fp rates of 1% or less. The most desirable trade-off between these two quantities is application-dependent. Here, we compare the two types of pattern using two combined measures of accuracy. First, define f1(P) = (Pfp + Pfn)/2, with P being the pattern set, Pfp being P's fp rate, and Pfn being P's false negative rate (1 – sensitivity). This function arithmetically averages false positives and false negatives; lower values are better. Second, define Formula where Psen and Pspe are P's sensitivity and specificity (1–fp rate). This function geometrically averages the two properties as for area in a ROC plot; higher values are better.

Figure 8 illustrates the distribution of f1(KPi) – f1(PROSITEi), where KPi and PROSITEi are KP-pattern and PROSITE pattern for protein family i, respectively. Of the 422 samples, 211 KP-patterns report lesser values for function f1 than the corresponding PROSITE patterns. Among the remaining samples, 175 favor the PROSITE pattern only slightly (difference <0.1). Figure 9 shows the distribution of f2(KPi) – f2(PROSITEi). 213 KP-patterns have f2 values larger than the corresponding PROSITE patterns. Overall, the KP-patterns are slightly better than the PROSITE patterns by measure f2 and closely competitive by measure f1.


Figure 8
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8 f1(KP-pattern)-f1(PROSITE-pattern) for 422 motifs.

 


Figure 9
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9 f2(KP-pattern)-f2(PROSITE-pattern) for 422 motifs.

 
3.4 Impact on HMMER search time
In this section, we estimate the computational cost of running hmmsearch for pHMMs against the Swiss-Prot database, with and without pattern filters. The time complexity of matching a pHMM with M states against a sequence of length L is O(L·M). Assuming a pattern p is converted to a DFA of size D(p), the cost of scanning is O(D(p) + L). D(p) can be exponential in |p| when many gaps are allowed; however, regular expression search tools are highly optimized and may not even construct the full DFA representation, so linear search of L is expected to dominate the scanning time. We quantify the empirical advantage of filtering below.

Let T be the time to run hmmsearch against Swiss-Prot with no filter, and let Tf be the time with the KP-pattern filter. Let the time to scan a pattern set P against Swiss-Prot with the ps_scan program be Ts. Finally, let Formula be the set of matched sequences to pattern set P from Swiss-Prot. We estimate the time to search with filtering enabled as Formula. The speedup of filtered over unfiltered search is T/Tf. Times were measured on a 2.4 GHz AMD Opteron running Linux.

Because we are interested only in KP-patterns with high sensitivity, we evaluated the speedup only for patterns with sensitivity no lower than 0.9. Figure 10 shows a histogram of speedups observed for these patterns, in the series labeled ‘actual speedup’. We can see that ~83% of these high-sensitivity KP-patterns yield at least 4x speedup, and that speedups of 16–32x are typical.


Figure 10
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 10 Distribution of T/Tf for 1306 KP-pattern sets with sensitivity no lower than 0.9. Series ‘hypothetical’ assumes that we can speedup up the pattern scan stage 10-fold, e.g. in FPGA firmware.

 
Empirically, we found that a significant amount of time in filtered search was spent searching with the pattern in ps_scan. We have not yet attempted to optimize this phase of search, so additional software speedup may be possible. Even greater acceleration may be possible using FPGA-based firmware designed to accelerate regular expression search, since PROSITE patterns are a subset of regular expressions. For example, Brodie et al. (2006) recently reported the ability to scan for 50 simultaneous regular expressions at 500 MB/s on a commodity FPGA.

To see the effect of accelerating pattern matching (but not the dynamic programming of HMMER itself), we recomputed speedups assuming that Ts was reduced by a factor of 10. These results are shown in Figure 10 in the series labeled ‘hypothetical’. With faster scanning, 79% of KP-patterns yield at least 16x speedup, and typical speedup is now 64–128x.


    4 CONCLUSIONS AND FUTURE WORK
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 
We have described an efficient, automated method to design PROSITE-like patterns for a large database of protein-family profile HMMs. The design algorithm robustly handles motifs with variable-length gaps. Our patterns compete favorably in overall sensitivity with hand-curated patterns from PROSITE, and most are at least 90% as sensitive as the full pHMM under HMMER. Using these patterns as search filters achieves typical accelerations of an order of magnitude, with room to improve given faster pattern scanning.

An avenue for future improvement is to find ways in which our training procedure can better capture the properties of motif instances in real protein databases. On sequences generated from the pHMMs themselves, our KP-patterns exhibited 60% higher average sensitivity than the corresponding hand-curated PROSITE patterns. This gap is much larger than that observed in Figure 7 for real sequences, so our patterns capture aspects of the pHMM not observed in real proteins. To avoid these pitfalls, our algorithm should become more skeptical about parts of the pHMM, such as the exponential length distribution of inserted sequences.

Our approach includes several user-adjustable parameters. By adjusting these parameters, the user can obtain patterns with higher sensitivity or lower fp rates. For example, increasing the gap length threshold as described in Section 2.4.2 can increase sensitivity, while decreasing the fp rate threshold {tau} can increase specificity. Future work should control these parameters more precisely so as to ensure that nearly all designed patterns have acceptably high sensitivity and low fp rates.

Finally, our pattern design approach should extend to probabilistic models in other areas of genomics. Phylogenetic HMMs and stochastic context-free grammars are sophisticated probabilistic models for DNA and RNA multiple alignments, respectively; extensions of our methods should in principle be able to speed up homology searches using these more challenging types of model.


    Acknowledgments
 
This work was supported by NSF CAREER Grant DBI-0237903.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 CONCLUSIONS AND FUTURE...
 REFERENCES
 

    Attwood, T.K., et al. (2003) PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Res, . 31, 400–402[Abstract/Free Full Text].

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res, . 32, D138–D141[Abstract/Free Full Text].

    Boeckmann, B., et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res, . 31, 365–370[Abstract/Free Full Text].

    Brodie, B., et al. (2006) A scalable architecture for high-throughput regular-expression pattern matching. proceedings of the 33rd Annual International Symposium on Computer Architecture, Baston, MA (ISCA '06), IEEE Computer Society, pp. 191–202.

    Califano, A. (2000) SPLASH: structural pattern localization analysis by sequential histograms. Bioinformatics, 16, 341–357[Abstract/Free Full Text].

    Chandra, A.K., et al. (1976) Approximate algorithms for some generalized knapsack problems. Theor. Comput. Sci, . 3, 293–304[CrossRef].

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763[Abstract/Free Full Text].

    Eskin, E. (2004) From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB '04), San Diego, CA ACM Press, pp. 115–124.

    Gattiker, A., et al. (2002) ScanProsite: a reference implementation of a PROSITE scanning tool. Appl. Bioinformatics, 1, 107–108[Medline].

    Gordon, P.M.K. and Sensen, C.W. (2004) Osprey: a comprehensive tool employing novel methods for the design of oligonucleotides for DNA sequencing and microarrays. Nucleic Acids Res, . 32, e133[Abstract/Free Full Text].

    Hart, R.K., et al. (2000) Systematic and fully automatic identification of protein sequence patterns. J. Comput. Biol, . 7, 585–600[CrossRef][Web of Science][Medline].

    Henikoff, J.G., et al. (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Res, . 28, 228–230[Abstract/Free Full Text].

    Hulo, N., et al. (2006) The PROSITE database. Nucleic Acids Res, . 34, 227–230.

    Ibarra, O.H. and Kim, C.E. (1975) Fast approximation algorithms for the knapsack and sum of subset problems. J. ACM, 22, 463–468[CrossRef].

    Jonassen, I. (1997) Efficient discovery of conserved patterns using a pattern graph. Comput. Appl. Biosci, . 13, 509–522[Abstract/Free Full Text].

    Krogh, A., et al. (1994) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol, . 235, 1501–1531[CrossRef][Web of Science][Medline].

    Li, M., et al. (2004) PatternHunter II: highly sensitive and fast homology search. J. Bioinform. Comput. Biol, . 2, 417–439[CrossRef][Medline].

    Madera, M., et al. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res, . 32, D235–D239[Abstract/Free Full Text].

    Portugaly, E. and Ninio, M. (2004) HMMERHEAD—accelerating HMM searches on large databases (Poster), RECOMB '04.

    Rabiner, L.R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286[CrossRef].

    Schultz, J., et al. (1998) SMART, a simple modular architecture research tool: identification of signaling domains. Proc. Natl Acad. sci. USA, 95, 5857–5864[Abstract/Free Full Text].

    Schuster-Boeckler, B., et al. (2004) HMM Logos for visualization of protein families. BMC Bioinformatics, 5, 7–14[CrossRef][Medline].

    Sun, Y. and Buhler, J. (2004) Designing multiple simultaneous seeds for DNA similarity search. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB '04), San Diego, CA ACM Press, pp. 76–84.

    Xu, J. (2004) Optimizing multiple spaced seeds for homology search. Lecture Notes in Computer Science, Springer 3109, , pp. 47–58.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
I. A. Bogdan, J. Rivers, R. J. Beynon, and D. Coca
High-performance hardware implementation of a parallel database search engine for real-time peptide mass fingerprinting
Bioinformatics, July 1, 2008; 24(13): 1498 - 1502.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sun, Y.
Right arrow Articles by Buhler, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sun, Y.
Right arrow Articles by Buhler, J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?