Bioinformatics Advance Access originally published online on August 26, 2008
Bioinformatics 2008 24(21):2438-2444; doi:10.1093/bioinformatics/btn460
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Direct mapping and alignment of protein sequences onto genomic sequence
1Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Sakyo-ku, Kyoto 606-8501 and 2National Institute of Advanced Industrial Science and Technology, Computational Biology Research Center, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Finding protein-coding genes in a newly determined genomic sequence is the first step toward understanding the content written in the genome. Sequences of transcripts of homologous genes, if available, can considerably improve accuracy of prediction of genes and their structures, compared with that without such knowledge. As protein sequences are generally better conserved than nucleotide sequences, remote homologs can be used as templates, extending the applicability of evidence-based gene recognition methods. However, no tool seems to have been developed so far to simultaneously map and align a number of protein sequences on mammalian-sized genomic sequence.
Results: We have extended our computer program Spaln to accept protein sequences, as well as cDNA sequences, as queries. When the query and the target sequences are reasonably similar, e.g. between mammalian orthologs, Spaln runs one to two orders of magnitude faster than conventional approaches that rely on Blast search followed by dynamic-programming-based spliced alignment. Exon-level and gene-level accuracies of Spaln are significantly higher than those obtained by the best available methods of the same type, particularly when the query and the target are distantly related.
Availability: Spaln is accessible online for a few species at http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user. The source code is available for free for academic users from the same site.
Contact: o.gotoh{at}i.kyoto-u.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The rapidly expanding worlds of genomes, transcriptomes and proteomes require constantly improved tools to compare huge amounts of sequence data produced daily worldwide. One category of software handles the alignment of a DNA sequence and a protein sequence, where the DNA sequence is conceptually translated in three or six frames. One purpose of such software is to detect possible frame shift errors in the DNA sequence given an established homologous protein sequence (Hein, 1994; Peltola et al., 1986). The second, more imperative application is to find protein-coding exons in a genomic sequence. In this article, we are interested in only the coding parts of exons or genes. Hence, the term coding will be omitted hereafter for simplicity. As protein sequences are generally better conserved than nucleotide sequences, higher sensitivity is expected in detecting homologous exons by a search with protein queries rather than a corresponding search with cDNA queries. Tfasta (Pearson et al., 1997), Blat (Kent, 2002) and Tblastn (Altschul et al., 1997; Gertz et al., 2006) are widely used for the rapid search of genomic sequences with protein queries. Although, these programs provide useful information about the gene structure embedded in genomic sequence, they do not attempt to identify the exact exon boundaries, nor do they generate the complete gene organization. The third class of programs is designed to infer the complete exon–intron organization of the relevant gene in terms of the so-called spliced alignment algorithm (Gelfand et al., 1996). If a closely related amino acid sequence is available, these programs will offer highly reliable prediction of the gene structure in a newly sequenced genome, even if no support is obtained from the direct transcripts, i.e. full-length cDNAs or ESTs derived from the same species. Thus, GeneWise (Birney et al., 2004), Nap-AAT (Huang and Zhang, 1996; Huang et al., 1997), GeneSeqer (Usuka and Brendel, 2000) or Aln (Gotoh, 2000) has been used as a principal tool in the annotation of various genomes, such as human (International Human Genome Sequencing Consortium, 2004), Caenorhabiditis elegans (The C. elegans Sequencing Consortium, 1998), Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000) and Aspergillus oryzae (Machida et al., 2005). Moreover, the results of these programs constitute a major component in recently developed comprehensive annotation tools (Allen and Salzberg, 2005; Brejova et al., 2005; Haas et al., 2008; Stanke et al., 2004) that utilize various lines of information for automatic gene prediction.
Although all the spliced alignment programs mentioned above take frame shift errors into account in their dynamic programming (DP) framework, none of them has an internal procedure to rapidly locate the genomic region(s) corresponding to the template protein sequence. To our knowledge, only Exonerate (Slater and Birney, 2005) can perform both genome search and spliced alignment in a single job. However, Exonerate is still prohibitively slow when a mammalian-sized genome is targeted. This difficulty could be circumvented as a genome-wide search and alignment can now be done efficiently by such programs as Gmap (Wu and Watanabe, 2005) or Spaln (Gotoh, 2008), running on a nominal computer.
Here, we report an update of Spaln so that it accepts not only cDNA sequences but also protein sequences as queries. The new version of Spaln (Version 1.3) combines the efficient mapping strategy of the original Spaln with our DP-based spliced alignment algorithm implemented into Aln (Gotoh, 2000) enhanced for better performance. The proposed method makes it possible to accelerate computational speed by 10- to 100-fold compared with the conventional strategy that incorporates a Blast (Tblastn) search followed by a DP-based spliced alignment, provided that the template and the target sequences are reasonably similar to each other, e.g. between mammalian orthologs.
As already mentioned, protein versus genome alignment programs have been widely applied to genome annotation. However, only a few small-scale attempts have been made to assess their relative performances in terms of accuracy and computational efficiency (Ko et al., 2004; Usuka and Brendel, 2000), whereas large-scale assessments have been conducted for ab initio or comprehensive gene finding methods (Burset and Guigo, 1996; Guigo et al., 2000, 2006; Pavy et al., 1999; Rogic et al., 2001). Hence, we examined Spaln and three other leading programs of the same type on a total of more than 1000 human genes and their relatives in mouse and chicken. The results showed that Spaln significantly outperforms other programs in both speed and accuracy in predicting correct exon boundaries. Under our experimental conditions, knowledge of exon–intron organization of a reference genomic sequence is virtually dispensable for accurate prediction of exon boundaries in the target genome in contrast to some recent observations (Chatterji and Pachter, 2006; Hsieh et al., 2006; Meyer and Durbin, 2004).
| 2 METHODS |
|---|
|
|
|---|
Spaln adopts a multiphase heuristic strategy, the overall architecture of which is preserved from that of the original version (Gotoh, 2008). We describe here only the major points that differ when protein sequences rather than cDNA sequences are used as queries.
2.1 Words and blocks
A word, often called a k-mer, is a string of amino acids of fixed length k. Thus, there are 20k kinds of such words. We find M–k+1 words in an amino acid sequence a=a1, a2,...,aM of length M. Words may also be generated by conceptual translation of the genomic sequence g=g1, g2,...,gN (see below). Occasionally, we use a reduced set of amino acids with r<20 alphabets, and also use a discontinuous matching pattern. As these techniques have been well documented (Brown, 2005; Cannata et al., 2002; Edgar, 2004; Ma et al., 2002; Murphy et al., 2000), no further explanation is given here.
The most unique characteristics of Spaln are found in the first phase, in which rough location(s) of the genomic region corresponding to the query is rapidly identified. Unlike most contemporary homology search programs which rely on seed-extension strategies, the phase-1 search of Spaln is based on statistical consideration of k-mer hits within a block. More specifically, we look for a pair of blocks between which the gene corresponding to the query may reside, where a block is a piece of the genomic sequence that is divided into fixed length B in the preparative phase (phase 0). In phase 0, we also construct a block index table with rk entries. Each entry corresponds to a word and stores the list of blocks that contain that word once or more than once. The words are generated from the genomic sequence by conceptual translation in six frames (Fig. 1). We choose words that do not overlap with each other in the same reading frame, and discard any translation that meets one or more in-frame stop codon or ambiguous nucleotide. Words on the complementary strand are generated in the reverse order, i.e. in the C-terminus to N-terminus direction, and mixed with those on the direct strand. This procedure can save memory of O(rk) otherwise allocated to an entry to the block index table for the reverse strand.
|
Each word, wc, is associated with its own matching score, s(c). For DNA sequence, s(c) is simply the negative logarithm of the probability of observing wc in a block (Gotoh, 2008). For protein sequence, we modify the evolutionary model proposed by Dayhoff et al. (1978) to estimate s(c). Let Pk(wa | wb; t) be the probability of conversion of word wb into wa after an evolutionary process of distance t in the PAM (accepted point mutation) unit. Assuming independence among sites and also a stationary and reversible Markov process, we obtain the probability of observing a matching pair of words wb and wa in an alignment of homologous protein sequences,
|
| (1) |
The background probability of a match of words drawn from an arbitrary amino acid sequence and a translated genomic sequence can be expressed as pk(wa) · qk(wb), where qk(wb) is estimated from the real counts of k-mers in the translation of the whole genome. Then, we define the word score s(c) of wc as:
|
| (2) |
When a set of words w(c)={w(c1),w(c2),...,w(ch)} extracted from the query amino acid sequence are given, the block score, S(w(c), b), for a particular block b is defined by
|
| (3) |
(w(cj), b)=1 or 0 depending on whether w(cj) is present in b or not. A block is significant if S(w(c), b) exceeds a threshold value calculated as a function of the number of test words |w(c)| (Gotoh, 2008). If one or more significant blocks or a block pair is found, we proceed to the next phase. Otherwise, there are two choices: (i) give up further search or (ii) examine all series of contiguous blocks that have positive scores. The latter procedure will be called salvage. We adopt the second choice in all the tests described in this article on the assumption that the target genome should have at least one gene homologous to the given query.
2.2 Construction of chained high-scoring segment pair lists
The second phase is optional; if skipped, a full DP-based spliced alignment algorithm is applied as described in the next subsection. The procedure of the second phase is essentially the same as that with cDNA queries to find colinearly ordered high-scoring segment pairs (HSPs) within the genomic segment identified in the first phase. By default, we adopt three-level recursions with decreasing word sizes. At the first, second and third levels, we use reduced amino acid alphabets (Edgar, 2004) with 16, 14 and 12 elements, respectively. The discrete seed patterns used at the three levels of recursion are 1110101, 11101 and 1101, respectively. Overlapping or closely separated word hits on the same diagonal are combined to form an HSP. The matching score for each gapless HSP is then reevaluated with an amino acid substitution matrix, which may differ from that used in the first phase. Finally, we apply a sparse DP algorithm to obtain colinearly ordered HSP lists, whose cumulative scores are optimized with the HSP scores and gap penalties assigned to neighboring pairs of HSPs.
2.3 Spliced alignment algorithm
If the phase-2 procedure reports a set(s) of chained HSPs, the HSPs in the list serve as anchor points to generate the entire spliced alignment with the attack by both sides or Sandwich algorithm (Gotoh, 2008; Slater and Birney, 2005; Wu and Watanabe, 2005). This heuristic is applicable only when a single intron is expected between neighboring HSPs. Otherwise, the standard spliced alignment algorithm based on full DP is applied as described previously (Gotoh, 2000). Here, we briefly review our algorithm and a few extensions made after the original publication.
Although our algorithm is not straightforwardly based on a formal probabilistic model, such as the hidden Markov model (HMM) (Birney et al., 2004; van Nimwegen et al., 2006) or the conditional random field (CRF) (DeCaprio et al., 2007; Gross et al., 2007), the overall framework is essentially the same as Viterbi algorithm that maximizes the log odds score derived from several sources of information. For computational efficiency, the genomic sequence is first converted into 23-letter tron (translated codon) codes (Gotoh, 2000) that possess information on both nucleotide and translated amino acid sequences at a time. The objective function to be optimized, H(g, a), is written as:
|
| (4) |
The second term comes from the so-called coding potential. The potential is assigned to each codon c in exonic regions {C} within the alignment. We obtain
with the standard phase-sensitive fifth-order Markov model (Borodovsky et al., 1995) if sufficiently large amounts of CDS data are available. Otherwise, we estimate
based on di-tron frequencies as described previously (Gotoh, 2000).
The third term corresponds to the intron penalty. In earlier versions of Aln,
is a simple well-shaped function of intron length x:
|
| (5) |
|
| (6) |
|
| (7) |
(x–µi)/
i, ki, µi and
i are shape parameters, and Ai is the amplitude of the i-th component (A1 +A2=1). We fit D(x) to the observed distribution by a maximum likelihood method with optim function in the R package (http://cran.r-project.org/).
The fourth term consists of exon-boundary signals, of which the first and the last ones correspond to translational initiation (
ini) and termination (
term), respectively, and the others are donor (
don) and acceptor (
acc) signals of introns. We obtain these signal values by means of a first- or second-order Markov model (Salzberg, 1997; Zhang and Marr, 1993) depending on the size of the learning set. In the present implementation, we assume that wj are equal for all splicing donors and acceptors, whereas wj for translational initiation and termination signals may differ from that for splicing signals.
Optionally, H(g, a) may incorporate yet another term, wB
i
{I}
i, where
i=1 if the i-th intron-insertion site along the predicted amino acid sequence matches the corresponding site of the template with respect to both position and phase in the alignment; otherwise,
i=0. This term simulates information transfer methods proposed in the literature (Chatterji and Pachter, 2006; Hsieh et al., 2006; Meyer and Durbin, 2004).
The coding potential and the boundary signals are calculated and stored in memory before performing the major DP procedure. Calculation of intron penalties and reconstruction of split codons are performed within the DP procedure in which the information is temporarily stored and retrieved with a hash memory. As intron penalty
(x) is now nonlinear as a function of intron length, a locally suboptimal combination of acceptor and donor may contribute to the globally optimal solution. To cope with such situations, we adopt the candidate list paradigm (Miller and Myers, 1988) in our algorithm with plural (two by default) candidates for potential introns for each of the three phases of intron insertion site in a codon.
2.4 Parameter selection
Besides the coding potential and the boundary signals that are learned from known CDS and exon-boundary sequences, respectively, several parameters of Spaln need to be adjusted.
In phase 1, we try various combinations of block size B, word size k and reduced amino acid alphabets. After several tens of trials, we find that the most basic one is the best, i.e.
, and k=5 or 6 depending on genome size N so that B·q(wc)<<1 should be satisfied. In this study, B and k are fixed to 46 000 and 6, respectively. We use the same threshold function as that for DNA queries (Gotoh, 2008), and hence the only parameter to be adjusted is PAM level t1 of the MDMt (subsection 2.1). We examined a series of MDMs calculated from the data of Jones et al. (1992), and find that phase-1 search is relatively insensitive to t1 in the range of 10
t1
80 (data not shown). We choose t1=20 throughout the examinations described in this study.
The word sizes, bit patterns and reduced alphabets mentioned in subsection 2.2 are chosen rather intuitively, and no particular optimization is attempted. The only adjusted parameter is PAM level t2 for the second-phase amino acid substitution matrix. Our examination on G178 dataset (next subsection) with templates having various degrees of sequence divergence shows only weak dependence of alignment accuracy on t2 (Fig. S1 in Supplementary Material); thus, we set t2=50 as the default.
In a similar way, we also investigate phase-3 amino acid substitution matrix. The results of the examination (Fig. S1 in Supplementary Material) show a broad optimum around t3
150, and hence we use t3=150 in this study. The affine or double affine gap penalty functions are the same as those reported previously (Gotoh, 2000), except that the incline of the latter penalty function for long gaps has a small negative value (–0.6 by default) instead of zero used in the earlier version.
The last set of parameters to be adjusted is the weights, w's in Equation (4), assigned to different lines of evidence. As only the relative strengths of w's are meaningful, we fix wA=1. Because D(x) for mammalian genomes is broadly distributed as a function of intron length x, the alignment accuracy is not sensitive to the value of wI itself, and we somewhat arbitrarily choose wI=8. Meanwhile, we previously showed (Gotoh, 2000) that the terms of
(x) and
j are tightly coupled, and the accuracy of alignment most significantly depends on a linear combination of the two terms. In other words, IP can be derived from the linear relation wI·<
(x)>+wJ(<
don>+<
acc>)=g(3)=constant if wJ is given, where the angular brackets indicate the mean of the corresponding quantity in the learning set, and g(3) is the penalty for an indel of length 1 codon/amino acid. Then, two parameters, wC and wJ, remain independent. Our investigations indicate that the alignment accuracy does not vary much in the range of 1
wC
2.5, whereas wJ has an optimum around eight (Fig. S2 in Supplementary Material). We choose wC=2 and wJ=8 as the default values. The weight, wB, for the optional term is examined after all the other weights are fixed to the default values.
2.5 Preparation of test data
Human (Build #36.1), mouse(Build #37.1) and chicken (Build #2.1) genomic sequences are downloaded from NCBI (ftp://ftp.ncbi.nih.gov/), and several releases of SwissProt/Uniprot database sequences are downloaded from EBI (ftp://ftp.ebi.ac.uk/pub/databases/uniprot/). We use three sets of test sequences, tentatively named G178 (Guigo et al., 2000), P491 (Meyer and Durbin, 2004) and C756 (Cui et al., 2007), with the specified numbers of human genes. P491 also contains 478 mouse genes, whereas C756 includes mouse and chicken genes of the same number as that of human. G178 is used for adjustment of parameter values except wB, whereas P491 and C756 are used for assessment.
Except for C756, the gene structures given in the datasets are regarded as the correct answers. For C756, we obtain the gene structures by running Spaln with cognate cDNA sequences. The fact that several chicken genes in C756 contain duplicated exons within each genomic locus makes unambiguous assignment of exon boundaries difficult. However, we do not take special regard to internal duplications because the number of such genes is small (
10) and would not influence the general conclusions.
For each gene in G178 and P491, we search for a set of template amino acid sequences with varying degrees of sequence divergence from the target gene. For this purpose, we run Blastp (Altschul et al., 1997) against SwissProt/Uniprot database, retain only retrieved sequences with lengths within the range of 80–120% of the query, and then run Aln in the ordinary (unspliced) alignment mode to calculate the sequence divergence between the query and the retained amino acid sequences. We randomly select one template in a pool of retained sequences that have been sorted into bins with 10% interval according to the sequence divergence. This procedure produces 725 and 5009 pairs of template and target for G178 and P491, respectively. For C756, the homologs listed in the dataset, which are originally derived from NCBI HomoloGene (Wheeler et al., 2006), are directly used as templates.
2.6 Tested programs and evaluation
Options used for the tested programs are as follows. Blastall (Version 2.2.14): -p tblastn -e 1.0; Blat (Version 34): -t=dnax -q=prot –fine; Exonerate (Version 1.4.0): –showalignment n –showtargetgff y -S n –model protein2genome supplemented with or without –exhaustive; GeneWise (Version 2-2-0): -quiet -genes supplemented with or without -init global; and Spaln: -Qn [-M], where n
[0,7] specifies phase 1 and phase 2 settings and -M option indicates multiple outputs. All calculations are performed on the same computer with a 3.0 GHz Intel® Pentium® D CPU running under Linux.
For alignment quality evaluation, sensitivity at the gene level implies the percentage of perfectly identified genes of the test set. Sensitivity (specificity) at the exon level is defined as the percentage of correctly identified exons of the total exons in the target (predicted) genes. For genome search, locus level sensitivity is defined as the percentage of runs in which the search result overlaps with some exons of the target gene on the genome. Exon and nucleotide level sensitivities are analogously defined as the fraction of target exons or exonic nucleotides that overlap with search results. Exactly identified genes and exons are documented separately from these sensitivity measures.
| 3 RESULTS |
|---|
|
|
|---|
3.1 Accuracy of gene structure prediction
We first examined the basic performance of Spaln as an aligner in comparison with those of other widely used programs. Table 1 shows the results of four programs (Exonerate, GeneWise, Nap-AAT and Spaln) examined on 491 human genomic segment and mouse homologous protein pairs and the same number of reverse combinations in P491 (Projector) dataset. The results of GeneAlign and Projector (Meyer and Durbin, 2004) programs are derived from Hsieh et al. (2006). The results of Spaln with the corresponding CDS queries (Gotoh, 2008) are also presented for comparison. For all the measures except CPU time, the full-DP version of Spaln with the double affine gap penalty function produced the best scores (indicated by double underlines in Table 1). In our experiments, GeneWise with the global option performed equivalently well with GeneAlign and was the best among the tested programs other than Spaln. However, even the most economical settings of Spaln with HSP anchoring and an affine gap penalty function significantly outperformed GeneWise in spite of the <300-fold shorter CPU time. Exonerate runs as fast as Spaln in this mode, but its alignment accuracy is considerably inferior to that of Spaln. Nap-AAT shows intermediate properties between heuristic and full-DP versions of Exonerate in both speed and prediction accuracy. It is remarkable that the heuristics used in Spaln works very well; both gene-level and exon-level accuracies decrease by only 0.1–1.2% when the results of the heuristic versions (with suffix Q in Table 1) are compared with those of the corresponding full-DP versions (with suffix F). This is in clear contrast with the results of Exonerate, with which the decreases in accuracy amount to 6.4–9.6%.
|
One serious issue plaguing this type of examination is that the tested pairs of human and mouse CDSs may be derived from different splicing isoforms. We estimate that 20–25 (4–5%) out of 491 pairs in P491 are likely to be derived from splicing isoforms rather than truly orthologous mature forms. This fraction is much greater in another dataset, C756, amounting to 20% or even more of all the pairs. Thus, examinations on C756 resulted in greatly reduced apparent accuracy, particularly at the gene level (Table S1 in Supplementary Material), compared with that shown in Table 1. Nevertheless, the relative performances of the tested programs exhibited essentially the same trends as those observed on P491. As expected, the accuracies at any level are considerably lower for mammalian versus bird pairs than those between mammalian homologs. However, the large fraction of nonorthologous isoforms in C756 hampers quantitative evaluation of the effects of sequence divergence on gene structure prediction accuracy.
We examined the effects of sequence divergence on prediction accuracy by using established protein sequences having various degrees of divergence from P491 genes as templates. Because the problem due to isoforms cannot be totally avoided, the results thus obtained indicate the lower limit of estimation of the prediction accuracy. The results shown in Figure 2 clearly indicate that Spaln keeps a high level of accuracy for remotely related templates much better than other programs.
|
Hsieh et al. (2006) and Chatterji and Pachter (2006) argued that knowledge of the structure of a homologous gene helps improve prediction accuracy of the target gene structure. Hence, we examined the effects of such knowledge by incorporating an additional term into the objective function (subsection 2.3). Lacking theoretical background, we tried various degrees of contribution of this term to the total score. As shown in Figure S3 in Supplementary Material, the effects of knowledge of homologous gene structure are marginal, implying that other terms in our objective function possess enough information for accurate prediction of the target gene structure. Dispensability of information of homologous gene structure is desirable because such information is not always available, and because loss and gain of introns occur frequently in genes of nematodes (Coghlan and Wolfe, 2004; Gotoh, 1998), fungi (Nielsen et al., 2004) and possibly many organisms other than vertebrates (Carmel et al., 2007; Zhaxybayeva and Gogarten, 2003).
3.2 Genome mapping and alignment
Our original motivation for developing Spaln was to construct an economical tool that can map and align a number of cDNA/EST sequences onto the cognate or a closely related genomic sequence of mammalian size (Gotoh, 2008). As the same strategy is adopted when protein sequences are given as queries, it was suspicious whether Spaln would be sufficiently sensitive to detect remotely related homologs in the genome. To test the performance of Spaln as a mapping tool, we performed cross-species comparison between the entire human genomic sequence and 487 mouse protein sequences and also the entire mouse genomic sequence and 491 human protein sequences in P491. As no preceding program is available to seamlessly perform both mapping and alignment of protein sequences onto a mammalian genome, we can meaningfully compare the sensitivity of Spaln with those of Blast and Blat without regard to precise exon boundaries. As shown in Table 2, Spaln is in fact slightly less sensitive than Blast and Blat in locating homologous genes on the target genome. However, Spaln is more sensitive than the other two programs when measured at exon or nucleotide levels. Moreover, the computational speed and gene or exon level alignment accuracies of Spaln are one order of magnitude higher than those of the two programs. Of a total of 60 unmatched queries (27 human and 33 mouse protein sequences), 30 showed hits to loci other than the references, of which 11 are better (or equally good) candidates for orthologs of the queries than those listed in P491. If we count these hits as correct answers, the mapping sensitivity of Spaln for P491 exceeds 95%.
|
Similar examinations were performed on C756 dataset consisting of homologous genes and proteins of human, mouse and chicken. Numerical results are listed in Table S2 in the Supplementary Material. Figure 3 summarizes the results, in which mapping (locus-level) and nucleotide-level sensitivities observed in the six-way comparisons are combined and plotted as a function of the degree of divergence between the query and the target protein sequences. Up to
30% sequence divergence, Spaln shows reasonably high mapping sensitivities, consistent with the observations on P491. Beyond that range, however, the mapping sensitivity of Spaln gradually declines, suggesting the limitation of its practical applicability. The affirmative range largely overlaps with the range of sequence divergence among orthologous mammalian proteins (Ouyang et al., 2003). Considering the high computation speed together with the high recognition accuracy at the gene and exon levels, Spaln would be a sensible choice for intra-mammalian comparisons. It should also be noted that the mapping sensitivity of Spaln would be higher for genomes with smaller sizes than those of mammals.
|
| 4 DISCUSSION |
|---|
|
|
|---|
Although not widely recognized, our Aln program (Gotoh, 2000) is one of the first to identify the structure of eukaryotic genes based on combined information of amino acid sequence similarity and statistical properties of the genomic sequence. Spaln inherits these features of Aln but achieves the same task with drastically improved computational efficiency. In particular, Spaln has made it possible to search the genomic region for similarity to query protein sequences and construct an evidence-based gene model in a seamless manner.
As a basic alignment tool, Spaln significantly outperforms other programs of the same type, particularly for distantly related pairs of query and target (Fig. 2). Yeh et al. (2001) reported a similar advantage of their program GenomeScan over some spliced alignment programs, such as GeneWise (Birney and Durbin, 1997) and Procrustes (Gelfand et al., 1996). GenomeScan incorporates homology information derived from BlastX search like a kind of coding potential in the framework of HMM-based gene recognition algorithm. In contrast, Aln/Spaln adopts a standard spliced alignment algorithm supplemented with statistical information. Despite differences in stand points, the success of both approaches suggests the usefulness of blending similarity and statistical properties. However, as GenomeScan is accessible only through web service, we could not directly compare the performance of the two methods in the present study.
Given similar architectures, what makes Spaln more robust than other aligners? We consider several points that may be responsible for the difference. First, Spaln adopts well-established methods for the calculation of coding potentials and splicing signals, whereas other programs often use less informative means. Second, Spaln allows freedom for adjustable degrees of contribution of different lines of evidence, whereas an HMM-based approach assumes uniform contribution of various probability measures. This freedom is argued to be the major source of advantage of CRF over HMM (DeCaprio et al., 2007). Third, the high computational efficiency of Spaln facilitates tuning of parameters, such as individual PAM matrices used in different phases. In a similar way, it is also shown that knowledge of template gene structure is virtually dispensable for optimal achievements of our method. Fourth, it is confirmed that the double affine gap penalty is slightly but significantly better than affine gap penalty exclusively used in other programs. Finally, the three-level recursive HSP anchoring works unexpectedly well, comparable to costly full-DP procedures with respect to alignment accuracy.
Compared with the standard tools for homology search, such as Tblastn and Blat, Spaln is 15–40-fold faster with a roughly equivalent level of memory consumption (Table 2). This speed is attained at the expense of reduced sensitivity for remotely related query and target sequences; beyond
30% amino acid sequence divergence, a significant fraction of homologs tend to be missed by Spaln. However, the limitation of 30% sequence divergence is acceptable as most mammalian orthologs have diverged <30% (Ouyang et al., 2003). In this connection, the ENSEMBL pipeline has set a threshold of 70% sequence identities for accepting a similarity based gene prediction (Curwen et al., 2004). As the alignment phase of Spaln can be reliably applied to more remote homologs than the 30% divergence (Fig. 2), better sensitivity in phase 1 to balance the sensitivities in the later phases would be desirable. The improvement of phase 1 sensitivity is one of the major issues to be resolved in the future development of Spaln. Another direction to be considered is the addition of a mode of homology search against a protein sequence database with a query of a genomic segment, in a similar way to BlastX and Nap-AAT. Homology search against established protein sequences and seamless gene structure prediction will be useful in the annotation of newly determined genomic fragments before extensive assembly.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
I would like to thank Drs T. Yada and N. Ichinose for valuable discussions and Mr M. Morita for preparation of the web page.
Funding: Ministry of Education, Culture, Sports, Science and Technology of Japan (No.18017017, No. 20017018).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on June 8, 2008; revised on August 21, 2008; accepted on August 22, 2008
| REFERENCES |
|---|
|
|
|---|
Allen JE, Salzberg SL. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics (2005) 21:3596–3603.
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
Birney E, Durbin R. Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB (1997) 5:56–64.[Medline]
Birney E, et al. GeneWise and genomewise. Genome Res. (2004) 14:988–995.
Borodovsky M, et al. Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Res. (1995) 23:3554–3562.
Brejova B, et al. ExonHunter: a comprehensive approach to gene finding. Bioinformatics (2005) 21(Suppl. 1):i57–i65.[Abstract]
Brown DG. Optimizing multiple seeds for protein homology search. IEEE/ACM Trans. Comput. Biol. Bioinform. (2005) 2:29–38.[CrossRef]
Burset M, Guigo R. Evaluation of gene structure prediction programs. Genomics (1996) 34:353–367.[CrossRef][Web of Science][Medline]
Cannata N, et al. Simplifying amino acid alphabets by means of a branch and bound algorithm and substitution matrices. Bioinformatics (2002) 18:1102–1108.
Carmel L, et al. Three distinct modes of intron dynamics in the evolution of eukaryotes. Genome Res. (2007) 17:1034–1044.
Chatterji S, Pachter L. Reference based annotation with GeneMapper. Genome Biol (2006) 7:R29.[CrossRef][Medline]
Coghlan A, Wolfe KH. Origins of recently gained introns in Caenorhabditis. Proc. Natl Acad. Sci. USA (2004) 101:11362–11367.
Cui X, et al. Homology search for genes. Bioinformatics (2007) 23:i97–i103.
Curwen V, et al. The Ensembl automatic gene annotation system. Genome Res. (2004) 14:942–950.
Dayhoff MO, et al. A model of evolutionary change in proteins. In: Atlas of Protein Sequence and Structure.—Dayhoff MO, ed. (1978) ML: National Biomedical Research Foundation, Silver Spring. 345–352.
DeCaprio D, et al. Conrad: gene prediction using conditional random fields. Genome Res. (2007) 17:1389–1398.
Edgar RC. Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res. (2004) 32:380–385.
Gelfand MS, et al. Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA (1996) 93:9061–9066.
Gertz EM, et al. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol (2006) 4:41.[CrossRef][Medline]
Gotoh O. Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes. Mol. Biol. Evol. (1998) 15:1447–1459.
Gotoh O. Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps. Bioinformatics (2000) 16:190–202.
Gotoh O. A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence. Nucleic Acids Res (2008) 36:2630–2638.
Gross SS, et al. CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol (2007) 8:R269.[CrossRef][Medline]
Guigo R, et al. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. (2000) 10:1631–1642.
Guigo R, et al. EGASP: the human ENCODE genome annotation assessment project. Genome Biol (2006) 7(Suppl. 1):1–31. S2.
Haas BJ, et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol (2008) 9:R7.[CrossRef][Medline]
Hein J. An algorithm combining DNA and protein alignment. J. Theor. Biol. (1994) 167:169–174.[CrossRef][Web of Science][Medline]
Hsieh SJ, et al. GeneAlign: a coding exon prediction tool based on phylogenetical comparisons. Nucleic Acids Res (2006) 34:W280–W284.
Huang X, Zhang J. Methods for comparing a DNA sequence with a protein sequence. Comput. Appl. Biosci. (1996) 12:497–506.
Huang X, et al. A tool for analyzing and annotating genomic sequences. Genomics (1997) 46:37–45.[CrossRef][Web of Science][Medline]
International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature (2004) 431:931–945.[CrossRef][Web of Science][Medline]
Jones DT, et al. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. (1992) 8:275–282.
Kent WJ. BLAT–the BLAST-like alignment tool. Genome Res (2002) 12:656–664.
Ko P, et al. Space-conserving optimal DNA-protein alignment. Proc. IEEE Comput. Syst. Bioinform. Conf. 2004 (2004) 80–88.
Ma B, et al. PatternHunter: faster and more sensitive homology search. Bioinformatics (2002) 18:440–445.
Machida M, et al. Genome sequencing and analysis of Aspergillus oryzae. Nature (2005) 438:1157–1161.[CrossRef][Web of Science][Medline]
Meyer IM, Durbin R. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Res. (2004) 32:776–783.
Miller W, Myers EW. Sequence comparison with concave weighting functions. Bull. Math. Biol. (1988) 50:97–120.[Web of Science][Medline]
Murphy LR, et al. Simplified amino acid alphabets for protein fold recognition and implications for folding. Protein Eng. (2000) 13:149–152.
Nielsen CB, et al. Patterns of intron gain and loss in fungi. PLoS Biol (2004) 2:e422.[CrossRef][Medline]
Ouyang M, et al. Five hundred sixty-five triples of chicken, human, and mouse candidate orthologs. J. Mol. Evol. (2003) 57:271–281.[CrossRef][Web of Science][Medline]
Pavy N, et al. Evaluation of gene prediction software using a genomic data set: application to Arabidopsis thaliana sequences. Bioinformatics (1999) 15:887–899.
Pearson WR, et al. Comparison of DNA sequences with protein sequences. Genomics (1997) 46:24–36.[CrossRef][Web of Science][Medline]
Peltola H, et al. Algorithms for the search of amino acid patterns in nucleic acid sequences. Nucleic Acids Res. (1986) 14:99–107.
Rogic S, et al. Evaluation of gene-finding programs on mammalian sequences. Genome Res. (2001) 11:817–832.
Salzberg SL. A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci. (1997) 13:365–376.
Slater GS, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics (2005) 6:31.[CrossRef][Medline]
Stanke M, et al. AUGUSTUS: a web server for gene finding in eukaryotes. Nucleic Acids Res (2004) 32:W309–W312.
The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature (2000) 408:796–815.[CrossRef][Web of Science][Medline]
The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science (1998) 282:2012–2018.
Usuka J, Brendel V. Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J. Mol. Biol. (2000) 297:1075–1085.[CrossRef][Web of Science][Medline]
van Nimwegen E, et al. SPA: a probabilistic algorithm for spliced alignment. PLoS Genet (2006) 2:e24.[CrossRef][Medline]
Wheeler DL, et al. Database resources of the national center for biotechnology information. Nucleic Acids Res (2006) 34:D173–D180.
Wu TD, Watanabe CK. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics (2005) 21:1859–1875.
Yeh RF, et al. Computational inference of homologous gene structures in the human genome. Genome Res. (2001) 11:803–816.
Zhang MQ, Marr TG. a weight array method for splicing signal analysis. Comput. Appl. Biosci. (1993) 9:499–509.
Zhaxybayeva O, Gogarten JP. Spliceosomal introns: new insights into their evolution. Curr. Biol (2003) 13:R764–R766.[CrossRef][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



