Skip Navigation


Bioinformatics Advance Access originally published online on April 28, 2005
Bioinformatics 2005 21(13):2943-2949; doi:10.1093/bioinformatics/bti468
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow supplementary data
Right arrow All Versions of this Article:
21/13/2943    most recent
bti468v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by D'Avenio, G.
Right arrow Articles by Creti, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by D'Avenio, G.
Right arrow Articles by Creti, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

SWIFT (sequence-wide investigation with Fourier transform): a software tool for identifying proteins of a given class from the unannotated genome sequence

G. D'Avenio 1,*, M. Grigioni 1, G. Orefici 2 and R. Creti 2

1Department of Technologies and Health, Parasitic and Immune-Mediated Diseases Istituto Superiore di Sanità, Rome, Italy
2Department of Infectious, Parasitic and Immune-Mediated Diseases Istituto Superiore di Sanità, Rome, Italy

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 

Background: The ever increasing number of sequenced genomes calls for new analysis techniques, which can benefit from the methodologies developed in the field of signal processing.

Methods: The present paper addresses the question of searching a pattern of amino acids (not necessarily completely specified) by means of the cross-correlation of complex sequences, obtained after suitable coding of the original amino acid sequence. Subsequently, the proposed algorithm provides a flexible strategy in setting the border between the accepted and rejected ORFs, by means of the k-means clustering of the candidate ORFs. The search for the class of proteins specified by the pattern is carried out from the most basic level, i.e. the DNA sequence, without sifting through an ensemble of previously determined ORFs. Thus, an exhaustive examination of all the occurrences of the pattern in the genome is performed.

Results: The application of the method to the search of surface proteins in Gram-positive bacteria witnesses its efficacy, in terms of both sensitivity and specificity. The comparison with the usual (and somewhat arbitrary) choice of setting a fixed value for the threshold length of the putative ORF confirms the validity of the proposed approach.

Availability: The software is available upon request to the corresponding author.

Contact: davenio{at}iss.it


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
The ever increasing number of sequenced genomes calls for improved methods to analyze them rapidly and accurately. Since the process of annotating a newly sequenced genome is long and cumbersome, it is often needed to perform preliminary analyses on the raw sequence data without the assistance of annotations. It is, therefore, very important to have software tools that can extract useful information only from the base pair sequence in a reliable way.

In the present paper, a software is presented that scans the genome in search of a user-defined pattern, characterizing a particular class of ORFs. For instance, the motif LPXTG can be used to look for the class of surface proteins (SP) in Gram-positive bacteria, since in these species it is characteristically associated with the sortase enzyme, regulating the spatial disposition of surface proteins at the cellular membrane (Fischetti et al., 1990). More generally, some variations are allowed of the pattern to be found in the genome. Hence, one can search, at the same time, for different degenerate sortase targets having a slight variation in the amino acid composition with respect to LPXTG. This heterogeneity in the characteristic pattern for cell wall attachment of SP in Gram-positive bacteria is reported in Navarre and Schneewind (1999).

The outcome of this preliminary investigation is sifted, and each ORF candidate which does not satisfy the minimum requirements for being actually expressed, given the positions of the start and stop codons closest to the queried motif, is discarded. Finally, the remaining ORF candidates undergo a clustering procedure (k-means clustering), in order to rule out the nucleotide sequences which are too short, given the sequence length distribution provided to the procedure.

The application of the proposed algorithm to the analysis of specific bacterial genomes for which the annotations are already available was carried out to underline the sensitivity and specificity of the method.


    METHODS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
The proposed method for finding genes of a given class is structured in successive steps, starting from the most basic level, i.e. the raw DNA sequence; at the end of each step, the ORF candidates are input to the successive step and the search is refined, until a reliable statistical procedure automatically accepts or rejects each possible ORF.

Implementation
The program, written in the Matlab (The MathWorks Inc. USA) software environment, performs a 6-fold scan of the genomic sequence (three frames for each strand) in search of the characteristic pattern used in the query. The raw sequence data in *.fna format, previously downloaded from the Nucleotide database of the NCBI (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/) and stored on the hard disk of local PC, are directly input to the program. When yet incompletely determined sequences are available, i.e. when a set of nucleotides is provided at a certain location (e.g. w for the pair A–T), the program translates each of these sets into a nucleotide, as an extracted random variable with the same probability for each member of that class. The substitutions and the indices of occurrence of the latter are stored in appropriate arrays for possible further reference. This procedure, though susceptible to provide less-than-accurate results, enables one to analyze a genome in its first available release, which is often affected by such uncertainties.

The search for genes of a given class is performed in three steps detailed in the following.

Pattern search
The nucleotide sequence is split into segments of fixed length (user-definable), translated into amino acid sequences and compared with the amino acid pattern. The possibility of changing the segment length has the consequence that the maximum computing speed is achieved, with the memory allocation fitting the particular hardware on which the program is run.

The amino acid sequence under investigation and the pattern, denoted as {xi} and {pi}, respectively, have to be suitably coded, in order to attain unequivocal results. In the present work, each amino acid is assigned a number in the complex plane (z = x + jy, where ), its absolute value being equal to one (Fig. 1).



View larger version (10K):
[in this window]
[in a new window]
 
Fig. 1 Example of the coding scheme of the different amino acids (‘*’ stands for the terminating codon), in the case of the simple motif LPXTG. The sequence and the motif are assigned a series of complex values, whence the cross-correlation function can be computed to yield the locations of occurrence of the motif in the sequence. Each amino acid in the motif is marked with a circle drawn over the cross-shaped marker.

 
The comparison is made by cross-correlating the complex arrays equivalent, according to the mentioned coding scheme, to the considered sequence and the pattern. In other words, the sequence and the pattern (padded with zeros to match the sequence length) are input to a function whose peaks reveal the exact locations of the pattern in the considered genome.

From the expression of the complex cross-correlation function,

(1)
it can be seen that, when for some value of i pi = xi+k, the term is always equal to |pi|2 = 1, independently from the choice of the point on the unitary circle corresponding to the amino acid pi. If pi != xi + k, then , where {alpha} represents the angle difference between the points corresponding to pi and xi + k in the complex plane. Then, rewriting the complex cross-correlation function as

and recalling that for a real number ß, exp(jß) = cos(ß) + jsin(ß), it is evident that m identical symbols in the sequence and the pattern will be found if and only if

(2)
where Re{ } denotes the operator which extracts the real part of its argument.

The complex cross-correlation is actually performed by multiplication of the fast Fourier transform (FFT) of the sequence with the FFT of the pattern, followed by the inverse FFT of the result. Sequence lengths equal to powers of 2 have been chosen for fast computation of the FFTs.

Owing to rounding errors, the equality Re{f(k)} = m cannot be tested directly; the program searches instead for the condition

(3)
where {varepsilon} is a constant slightly larger than the floating point accuracy.

A more general form of the pattern used in the query was allowed by a more flexible coding of the amino acid sequence; the details are illustrated in Supplementary data. At the end of step 1, all the occurrences of the queried pattern are saved in a list, containing the necessary information for identifying the ORFs carrying that pattern, in the subsequent steps of the procedure.

Rejection of invalid candidate ORFs
After this preliminary search for a given class of proteins, the result is subjected to further analysis, in order to discard necessarily invalid candidates:

  1. A minimum number of amino acids is imposed as the interval between the end of the pattern and the first stop codon. In the case of SPs, there are at least 18 amino acids after the end of the sortase target LPXTG, accounting for the transmembranal part (of 15 amino acids minimum length) and the charged tail of the SP, typically of 3 amino acids.
  2. Scanning the sequence backward from the identified motif, the first stop codon on the same frame and strand is found, and the considered ORF candidate is discarded if no start codon could be found in the subsequence between the mentioned stop codon and the searched motif (in fact, that putative ORF could not be actually expressed).
All possible start codons valid for a generic bacterial species have been considered. Thus, the selectivity of the process can be improved, if the set of starting signals for a particular species is known a priori and given as a parameter to the program.

Clustering of the putative proteins
The class of putative ORFs remaining after step 2 contains the actual class which corresponds to the queried pattern, since only the minimum requirements for gene expression have been met at this point, and the number of actually expressed ORFs will be smaller. In particular, roughly speaking, there is a minimum number of amino acids that renders an ORF a plausible candidate for actually being expressed. Instead of enforcing a fixed threshold for the length of the sequences (as in Paulsen et al., 2003; Janulczyk and Rasmussen, 2001), a more flexible approach was followed. After collecting the lengths of the putative ORFs in an array, they were subjected to k-means clustering, which is a dynamical procedure for obtaining a number N of clusters from a set of unknown characteristics. The k-means clustering is an iterative procedure, yielding at each step the centroid of each cluster, used in turn to recalculate the partition of the set in order to minimize a global measure of the distances of the points in the ensemble from the centroids themselves. In our case, N = 2, corresponding to the choice between discarded and accepted ORF candidates.

A flow chart of the proposed method is provided in Figure 2. Besides the main outcome of the program (given by the k-means decision step), subsequent comparisons with available annotations (provided by *.ptt files) are also shown in the chart, as specified in the following subsection. This last step is useful for assessing the sensitivity and the specificity of the approach.



View larger version (10K):
[in this window]
[in a new window]
 
Fig. 2 Flow chart of the proposed method. The outcome of the program is given by the k-means decision step. Subsequent comparisons (also shown in the chart) with available annotations (provided by *.ptt files) have been carried out in order to show the sensitivity and the specificity of the approach.

 
Evaluation of the procedure
The files of the annotations for each of the considered bacterial genomes have been downloaded from ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/, as *.ptt files. The procedure for finding the ensemble of ORFs of a given class from the genomic sequence has been tested by comparing its results with the information given by annotated bacterial genomes. Specifically, the clusters of accepted and discarded ORFs (P and N, respectively, to indicate positive or negative outcome of the procedure) were subdivided in four classes: TP, FP, TN, FN (T and F stand for true and false, respectively), after comparison with the annotations. Then, TP denotes the ORFs deemed as valid by SWIFT which are also listed in the *.ptt relative to that genome (independently from the biological function attributed to these ORFs in the associate comment); FP are the ORFs deemed as valid by SWIFT which are not listed in the *.ptt relative to that genome. TN and FN are similarly built from the program's output and the annotation file. With these positions, the sensitivity of the procedure can be calculated simply as the proportion of the true positives with respect to the total number of valid candidates, or

(4)
(Nx denotes the number of elements in the class x), whereas the specificity is the proportion of the true positives with respect to the total number of positive hits found by the program, or

(5)
If the annotations are available, the program provides, as an output in the form of a text file, the subset of the annotation file (*.ptt), corresponding to the predicted ORFs that are, at the same time, compatible with the annotated ORFs. Thus, the user can immediately verify what the positive hits correspond to, in terms of the already characterized genes. This list is formatted according to the *.ptt file, with two columns added. The first shows the particular pattern found for the considered ORF, and the second gives the pattern's location. The content of the text file can be copied and pasted in a spreadsheet, such as Microsoft Excel (see the example in Fig. 3).



View larger version (64K):
[in this window]
[in a new window]
 
Fig. 3 Example of one of the formatted output files, with an enhanced version (see the first two columns of the worksheet) of the annotation file's content which is of relevance for the search of ORFs with a given pattern.

 
A second text file reports the strings of the amino acids for each predicted ORF at the end of the procedure (regardless of whether it matches with an ORF present in the annotations or not). Also the pattern found and the location of the ORF are provided. This allows the user to input each of the positive hits in a search database, such as Blastp, to investigate on similar genes in other organisms. The same information is provided, in a separate text file, for each rejected ORF at the end of the procedure.

Finally, besides the amino acid sequences of the predicted ORFs, when the annotations are available the program also outputs, again as a triplet pattern/location/sequence, the ORFs of the single classes (TP, FP, TN, FN) to which the matched patterns can be assigned, after comparison with the annotations. A *.txt file is provided for each class. This could be useful for checking the quality of the annotations, since it is not infrequent that the latter undergo some modifications as more experimental evidence is accrued. An ORF that is deemed by the program to be FP, because it is not yet considered in the annotations, could prove itself to be actually expressed. A simple Blast search with the content of the FP.txt file could give useful hints, in this respect.


    RESULTS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
Table 1 reports the results of the algorithm, after a search for the sortase target LPXTG in a series of Gram-positive bacterial genomes.


View this table:
[in this window]
[in a new window]
 
Table 1 Results of the search of the pattern LPXTG in different bacterial genomes

 
Table 2 reports the results of the algorithm, after a search for the multiply defined sortase target LPX[T|A][G|N], to account for the presence of degenerate forms of this pattern characterizing surface proteins.


View this table:
[in this window]
[in a new window]
 
Table 2 Results of the search of the multiply defined pattern LPX[T|A][G|N] in different bacterial genomes

 
It is evident how the method yields very good results, for both sensitivity and specificity (see definitions in Methods section). This performance is mainly related to the clustering procedure: in Figure 4 a scatterplot is provided, where specificity is plotted against sensitivity for the method proposed and for the same method without k-means clustering, accepting all the candidates as actual ORFs. In the latter case (circles in Fig. 4), the specificity is much poorer, because a large number of false positives is given by the method (these FP are given by the short ORFs). Instead, the complete procedure yield a high specificity (asterisks in Fig. 4), at the price of a small reduction (or even the same value) of the sensitivity. This results in the clustering of the points corresponding to the different genomes close to the upper right corner of the plane.



View larger version (8K):
[in this window]
[in a new window]
 
Fig. 4 Scatterplot of the specificity versus sensitivity for the presented method, before (circles) and after (asterisks) k-means clustering. The motif used in the search was the sortase target LPX[T|A][G|N]. Each point corresponds to one of the genomes listed in Table 2.

 
In order to compare the different approaches with the identification of ORFs of a given class, the geometric mean of Sn and Sp (i.e. ) can serve as a figure of merit. In Figure 5 the comparison is reported between the results of the proposed procedure and those stemming from the enforcement of a fixed threshold for the validation of the predicted ORFs, after step 2 of the procedure (see Methods section). In this case, a threshold of 90 bp was selected; this value is chosen as the default in Glimmer 2.0 (Delcher et al., 1999).



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 5 Scatterplot of the geometric mean of sensitivity and specificity for the presented method. The ordinates represent the results of the proposed method, using the genomes listed in Table 2, whereas the abscissae refer to the same genome, but using a fixed threshold (90 bp) after step 2, instead of k-means clustering. The motif used in the search was the sortase target LPX[T| A][G|N]. The dashed line is the identity line; each of the genomes considered provided a better result with the proposed flexible partitioning method.

 
The values of Q for the genomes investigated with the proposed method are always higher than those with the use of a fixed threshold. The selection operated by the latter method provides good sensitivity [the number of false negatives (FN) was found to be always zero, with the choice of 90 bp] but poor specificity, in general, owing to the rather large number of false positives (FP); instead, the automatic adjustment of the threshold provided by the proposed algorithm preserves both specificity and sensitivity. A similar difference between the two approaches was found after setting a threshold of 150 bp (data not shown).

In Table 3 the number of candidate ORFs is reported after each step of SWIFT (i.e. pattern search, rejection of invalid candidate ORFs, clustering of the putative proteins), in the case of the multiply defined pattern LPX[T|A][G|N], for the same bacterial genomes listed in Tables 1 and 2. It can be seen that a strong reduction of the candidate ORFs takes place after the completion of the procedure, with respect to the sheer number of occurrences of the pattern as given by step 1. At most, 32.95% of such occurrences, in Streptococcus pyogenes M1, were deemed to belong to an actually expressed ORF.


View this table:
[in this window]
[in a new window]
 
Table 3 Number of ORF candidates after each step of SWIFT

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
Several software programs exist that allow the user to search for the ORFs of an unannotated genome. In the case of SP, for instance, in Janulczyk and Rasmussen (2001) the search of the class of cell wall-attached proteins was carried out with the help of a commercial software, looking for the traditional motif LPXTG or for a specifically designed tripartite pattern. For S.pyogenes, the authors report a total of 17 occurrences of LPXTG in the entire genome, of which 12 were recognized as probable SPs. Instead, our program found no less than 45 occurrences of LPXTG (S.pyogenes M1), and 155 of LPX[T|A][G|N], in the 6-fold scan of the genome. After the clustering step, 18 and 41 occurrences were deemed to be actually expressed in the two classes. As shown in Tables 1 and 2, the clustering of the putative ORFs, carried out by SWIFT matches the annotations closely.

The comparison with the result of Janulczyk and Rasmussen (2001) is made difficult by the fact that incomplete genomes were mostly considered in that paper, with the exception of S.pyogenes, already discussed. For instance, Streptococcus mutans was also considered by Janulczyk and Rasmussen (2001) but the analysis was carried out on 647 kbp, whereas the entire genome (UA159) is ~2 Mb long. In the cited paper, two and three validated occurrences are reported of SPs corresponding to LPXTG and LPX[T|A][G|N], respectively. Instead, in our study 12 (0) are the true (false) positive hits for LPXTG, resulting in Sp = 100%. For the more general pattern search, we found 26 actual SPs and only 2 false positive hits, with Sp = 92.86%. Both types of search resulted in a 100% value for sensitivity (Tables 1 and 2).

It must be mentioned that, after discarding invalid candidates in step 2 of the procedure, the presented results have been obtained without imposing any restraint on the position of the pattern in the amino acid sequence of the candidate ORF. With the opposite choice, the results of the comparisons with the annotations would have been negatively biased, since a certain number of legitimate ORFs containing the searched pattern would have been rejected by SWIFT, had a positional bound been enforced on the pattern itself. The software, nevertheless, allows the user to specify (if desired) a maximum distance of the pattern from the stop codon, in order to provide an additional criterion for pinpointing the ORFs that not only show a particular pattern but are also more appropriately grouped together in terms of biological significance of that pattern (in the case of SP, e.g. the sortase target is to be found at the distal part of the amino acid sequence, immediately before the hydrophobic tail).

Recently (Navarre and Schneewind, 1999; Pallen et al., 2001), the evidence has been found of multiple substrates or sortase targets; generally, each of these corresponds to a particular sortase in the genetic code. These sortase-like proteins usually have a pattern very similar to LPXTG, with few substitutions, similar to the LPX[T|A][G|N] pattern considered by Janulczyk and Rasmussen (2001). This wealth of sortase targets can be easily studied with the program herein proposed, which allows the user to consider multiple variations of the classical LPXTG pattern. The search is very specific for a given, multiply defined pattern, hence this approach is the natural complement of investigations of broader range, using, e.g. Blast on a sortase target's encoding sequence.

The first step of the algorithm is an FFT-based search for the occurrences of a given pattern. The Fourier transform has been applied to different problems in bioinformatics. Tsonis et al. (1991) employed Fourier analysis of DNA coding and non-coding sequences in an attempt to identify possible patterns in gene sequences. They found that while intronic sequences show a rather random pattern, coding sequences show periodicities and in particular, a periodicity of 3. This application of Fourier analysis is geared toward underlining periodicities in the sequence itself. On the same standpoint, a recent application of FFT in bioinformatics was presented by Sharma et al. (2004). In this paper, a software is proposed which finds repeats (contiguous or not) through an analysis of the power spectrum of a given DNA sequence. The procedure highlights structural features of the sequence, exploiting the fact that repetitions in the latter give rise to peaks in the Fourier spectrum of the sequence itself.

The approach hereby followed, instead, does not consider any periodicity of the DNA sequence nor any spectral feature of the latter. The FFT is used essentially to compute rapidly the cross-correlation function between the sequence and the user-defined pattern. The coding scheme allows also for not completely specified patterns, by means of assigning a particular point in the complex plane to the set of amino acids pertaining to a given position in the pattern.

The question can be raised whether it is convenient to use the FFT operation [whose complexity is O(N log2 N)], since there are algorithms which are theoretically better in pattern matching, such as the KMP algorithm (Knuth et al., 1977), which was demonstrated to solve the problem in O(N + K) time (N and K denote the sequence and the pattern length, respectively). Actually, although in the particular application presented, the KMP algorithm does not improve on less sophisticated approaches (such as the Brute Force algorithm), because the pattern LPXTG, as well as LPX[T|A][G|N], do not present repetitions of the alphabet characters. The latter feature is exploited by the KMP procedure to shorten the time needed for the pattern matching, with respect to the pattern–sequence comparison with the pointer into the sequence advanced by one position at each step. Therefore, in the present case the KMP algorithm does not perform better than other, less refined, procedures. To verify the soundness of the choice of the proposed FFT-based procedure, we compared the performance of the latter with that of the KMP algorithm (directly adapted from the compact C code implementation available at http://www-igm.univ-mlv.fr/~lecroq/string/node8.html), on the set of genomes considered in Tables 1 and 2.

A mean value (±s.d.) of 0.6994 (±0.0127) s/kb was given by the tests with SWIFT, whereas 0.9457 (±0.0118) s/kb was needed by KMP to perform the same task. It is evident that the FFT procedure outperforms the KMP algorithm, in terms of seconds per kbp required to sift through the sequence. The mean value of the ratio between the computation times according to the two procedures was found to be 0.7396 (±0.0138). A t-test between the computation times of the two series of pattern search rules out the possibility of a result affected by chance (P = 3.16 x 10–23).

The key of the performance of SWIFT is that the genome is split by the software in segments of optimal length. Since the time for the search in a string of N characters is O(N log2 N), the time relative to the search of h segments is hO((N/h)log2(N/h)) {cong} O(Nlog2(N/h)), so that the search on smaller fragment (corresponding to increasing values of h) tends to cut the analysis time. The latter, however, cannot decrease continuously, since there is a computational overhead for analyzing each fragment, so that a global minimum is reached at a given segment length. In our implementation we set N/h = 2048, then the analysis time of the algorithm is O(11N){equiv} O(N). The linear dependence of the analysis time with N was verified using the data relative to the genomes listed in Tables 13. In the tests with KMP, the same partitioning of the genome was made, for the sake of comparison; in this case, though, the overall performance is hO((N/h)+ K) which, since the pattern length K can be neglected with respect to N/h, can be simplified as O(N), i.e. the performance is constant, regardless of the length of the analyzed segments. Then, the two approaches hereby considered have the same asymptotical dependence on the genome length N, so that the performance tests previously reported are required to evaluate which one is more convenient for the task. Of course, had the value of N/h been different, the proposed algorithm would have shown a worse performance, but it would have maintained the characteristics of being O(N). The linearity of the computation time with respect to N was confirmed for both KMP and SWIFT by trials on the ensemble of genomes considered in Tables 13.

An interesting feature of the software is that there is no minimum gene length in bp to be set, since the k-means clustering adjusts automatically the border between accepted and rejected ORFs, in a statistically sound fashion. This is different from other software tools, such as Glimmer 2.0 (Delcher et al., 1999), which has a default minimum gene length of 90 bp. Also in Paulsen et al. (2003) the putative ORFs were selected on the basis of a 90 bp minimum length requirement, using Glimmer 2.0. This parameter could be critical in producing a certain number of false positives/false negatives, given the size distribution of the considered ORF class, which may differ largely as different genomes are considered. As shown in Figure 5, a fixed threshold was found to yield globally worse results, in terms of both sensitivity and specificity, than the proposed method. Then, if the performance of a genome-wide search of a class of proteins is to be considered, without the application of post-processing steps, such as Blast investigations on the putative ORFs (Paulsen et al., 2003), the flexibility offered by the algorithm herein presented is certainly advantageous.

From the comparison shown in Figure 5, moreover, it can be seen that the k-means clustering of the candidate ORFs, hereby adopted, could prove useful also with other methods for gene finding, relying on the choice of a definite threshold for the length of the acceptable ORFs.

Another distinctive feature of SWIFT consists in the absence of a training phase, which is often found in softwares based on hidden Markov models (Delcher et al., 1999; Pedersen and Hein, 2003). Therefore, the proposed program may be useful for comparison of results obtained with other softwares, possibly biased by the choice of training data.

The program has been described by considering the case of SP, but of course is not limited to the search of such class of proteins. Any meaningful sequence of amino acids can be chosen as the pattern. For instance, the program can be used to search for the presence of ORFs coding for class IIa bacteriocins, which contain the YGNGV amino acid motif near the N-terminus (Bennik et al., 1998). Another possible application can be the search for the zinc-binding motif, HEXXH, which is typically found in metalloproteases and metallopeptidases (Yu and Kroos, 2000). Such compounds are often involved in the pathogenic activity of bacterial species (Tonello et al., 2004).


    CONCLUSIONS
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
A program is presented for the search of the ORFs of a given class, from the raw sequence data. The possibility to set incompletely specified search patterns renders the program very useful for finding sequence data whose knowledge is not yet complete, as in the case of the sortase targets in bacterial species. The chosen validation criteria yield high values of both sensitivity and specificity, owing to the flexibility in setting the threshold for the minimum length of acceptable ORFs. This is confirmed by comparing the results of the algorithm with those obtained with a fixed threshold, instead of the k-means clustering of the putative ORFs of the considered class.

The use of the FFT for the pattern search can, in principle, leverage the massive increase in speed that can be gained with parallel computation, since it is easy to port FFT-based algorithms to computers working in parallel (e.g. GRID-based computation). It can be envisaged, then, that the proposed algorithm will be capable to fully exploit the processing power of the next computing architectures.


    SUPPLEMENTARY DATA
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 
Supplementary data for this paper are available on Bioinformatics online.

Received on August 6, 2004; revised on April 23, 2005; accepted on April 25, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 METHODS
 RESULTS
 DISCUSSION
 CONCLUSIONS
 SUPPLEMENTARY DATA
 REFERENCES
 

    Aho, A.V. and Corasick, M.J. (1975) Efficient string matching: an aid to bibliographic search. Commun. ACM, 18, 333–340[CrossRef].

    Bennik, M.H., et al. (1998) A novel bacteriocin with a YGNGV motif from vegetable-associated Enterococcus mundtii: full characterization and interaction with target organisms. Biochim. Biophys. Acta, 14, 47–58.

    Delcher, A.L., et al. (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res., 27, 4636–4641[Abstract/Free Full Text].

    Fischetti, V.A., et al. (1990) Conservation of a hexapeptide sequence in the anchor region of surface proteins from gram-positive cocci. Mol. Microbiol., 4, 1603–1605[Web of Science][Medline].

    Janulczyk, R. and Rasmussen, M. (2001) Improved pattern for genome-based screening identifies novel cell wall-attached proteins in gram-positive bacteria. Infect. Immun., 69, 4019–4026[Abstract/Free Full Text].

    Knuth, D.E., et al. (1977) Fast pattern matching in strings. SIAM J. Comput., 6, 323–350[CrossRef].

    Navarre, W.W. and Schneewind, O. (1999) Surface proteins of gram-positive bacteria and mechanisms of their targeting to the cell wall envelope. Microbiol. Mol. Biol. Rev., 63, 174–229[Abstract/Free Full Text].

    Pallen, M.J., et al. (2001) An embarrassment of sortases—a richness of substrates? Trends Microbiol., 9, 97–102[CrossRef][Web of Science][Medline].

    Paulsen, I.T., et al. (2003) Role of mobile DNA in the evolution of vancomycin-resistant Enterococcus faecalis. Science, 299, 2071–2074[Abstract/Free Full Text].

    Pedersen, J.S. and Hein, J. (2003) Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics, 19, 219–227[Abstract/Free Full Text].

    Sharma, D., et al. (2004) Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics, 20, 1405–1412[Abstract/Free Full Text].

    Tonello, F., et al. (2004) Tyrosine-728 and glutamic acid-735 are essential for the metalloproteolytic activity of the lethal factor of Bacillus anthracis. Biochem. Biophys. Res. Commun., 16, 496–502[CrossRef].

    Tsonis, A.A., et al. (1991) Periodicity in DNA coding sequences: implications in gene evolution. J. Theor. Biol., 7, 323–331.

    Yu, Y.T. and Kroos, L. (2000) Evidence that SpoIVFB is a novel type of membrane metalloprotease governing intercompartmental communication during Bacillus subtilis sporulation. J. Bacteriol., 182, 3305–3309[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow supplementary data
Right arrow All Versions of this Article:
21/13/2943    most recent
bti468v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by D'Avenio, G.
Right arrow Articles by Creti, R.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by D'Avenio, G.
Right arrow Articles by Creti, R.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?