Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(4):471-482; doi:10.1093/bioinformatics/bti025
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Bioinformatics vol. 21 issue 4 © Oxford University Press 2005; all rights reserved.
Prediction of splice sites with dependency graphs and their expanded bayesian networks
1 Department of Electrical Engineering, National Tsing Hua University Hsinchu 30013, Taiwan
2 Department of Ecology and Evolution, University of Chicago Chicago, IL 60637, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Owing to the complete sequencing of human and many other genomes, huge amounts of DNA sequence data have been accumulated. In bioinformatics, an important issue is how to predict the complete structure of genes from the genomic DNA sequence, especially the human genome. A crucial part in the gene structure prediction is to determine the precise exonintron boundaries, i.e. the splice sites, in the coding region.
Results: We have developed a dependency graph model to fully capture the intrinsic interdependency between base positions in a splice site. The establishment of dependency between two position is based on a
2-test from known sample data. To facilitate statistical inference, we have expanded the dependency graph (which is usually a graph with cycles that make probabilistic reasoning very difficult, if not impossible) into a Bayesian network (which is a directed acyclic graph that facilitates statistical reasoning).
When compared with the existing models such as weight matrix model, weight array model, maximal dependence decomposition, Cai et al.'s tree model as well as the less-studied second-order and third-order Markov chain models, the expanded Bayesian networks from our dependency graph models perform the best in nearly all the cases studied.
Availability: Software (a program called DGSplicer) and datasets used are available at http://csrl.ee.nthu.edu.tw/bioinf/
Contact: cclu{at}ee.nthu.edu.tw
| INTRODUCTION |
|---|
|
|
|---|
Automated DNA sequencing has led to the rapid accumulation of huge amounts of DNA sequence data. This demands mathematical modeling, statistical methods and information technology to analyze the data.
Gene identification refers to the prediction of the complete gene structure, especially the precise exonintron structure of a gene in an eukaryotic genomic DNA sequence. Genomic sequences with lengths in the order of many millions of base pairs are now being produced. Such a sequence consists of a collection of genes separated from each other by long stretches of intergenic regions. Currently,
30 000 genes have been estimated in the three billion base pairs of the human genome. That is, only 1.1% of the human genome seems to contain useful coding information (Lander et al., 2001). However, there might be a fairly large number of human genes that remains to be identified. In response to this challenge, computational gene-finding prediction approaches have proliferated in recent years. However, their performance is still far from satisfactory (Mathe et al., 2002; Zhang, 2002).
Gene identification can be regarded as an attempt to define precisely the sequential dependency on the basic biochemical processes of transcription, RNA processing and translation. The sequence properties of known genes may offer us clues about the intrinsic mechanisms of these processes (Burge and Karlin, 1997). How to model the biological signals, such as promoter elements, transcriptional and translational signals and splice sites is undoubtedly the key issue in the prediction of the complete gene structure. In this paper, we focus on the signals related to pre-mRNA splicing, i.e. the splice sites that include donor and acceptor sites are the most important elements for the prediction of precise exonintron boundaries.
Splice signal detection
The cell recognizes a gene by utilizing different proteins to bind to different signals. Typically, there are several DNA segments required for a particular signal. We call these segments the members of the signal. However, not every member of a signal has a consensus sequence. We may and will assume that the differences between sequences for the members of a signal arose from a common ancestor via a stochastic process (Ewens and Grant, 2001), which suggests that the construction of statistical models for signals and genes is reasonable.
Several statistical models of donor and acceptor splice sites have been constructed in the past 20 years (Staden, 1984; Zhang and Marr, 1993; Burge and Karlin, 1997; Cai et al., 2000; Arita et al., 2002; Yeo and Burge, 2004). One of the earliest and most influential models is the weight matrix model (WMM) (Staden, 1984) that uses the position-specific compositional biases in splice sites. The WMM weights can be optimized using a neural network method (Brunak et al., 1991) developed for NetPlantGene (Hebsgaard et al., 1996) and NetGene2 (Tolstrup et al., 1997) and also adopted in NNSplice (Reese et al., 1997). Another method, called the weight array model (WAM) (Zhang and Marr, 1993), was developed to describe the dependencies between adjacent base positions by the inhomogeneous first-order Markov chain (1MC) model, and was later applied using the VEIL (Henderson et al., 1997) and MORGAN (Salzberg et al., 1998) software program.
Statistically significant dependencies between base positions in the donor and acceptor splice sites have been studied more recently (Burge and Karlin, 1997; Cai et al., 2000). Certain observed dependencies between splice site positions can be interpreted in terms of the spliceosome cycle between the structure of small nuclear RNPs (snRNPs) and the splice site region of the pre-mRNA (Mathews et al., 2000). Thus more complex splice signal models that are capable of capturing such dependencies, for instance, the maximal dependence decomposition (MDD) model in Genscan (Burge and Karlin, 1997) and Bayesian networks (Cai et al., 2000), have been developed. However, these more complex models do not achieve significant improvement in splice site discrimination over simpler models that assume only dependencies between adjacent positions. A possible reason is that these models do not fully capture the intrinsic interdependency between base positions either in the donor site or in the acceptor site. Significant improvement can possibly be achieved by combining one of the basic statistical models, such as WMM, WAM and MDD, of splice sites with other signal/content sensors and/or with rule-based filtering such as in GeneSplicer (Pertea et al., 2001), where the MDD model is combined with two second-order Markov chain (2MC) models to characterize coding/non-coding regions around splice sites in addition to a local maximal score filter. In this paper, we will develop a dependency graph model and its derivatives, and make an attempt to fully capture the intrinsic interdependency between base positions in a splice site.
| METHODS |
|---|
|
|
|---|
Test of dependency and the contingency tables
We used the
2-test as employed in MDD (Burge and Karlin, 1997) to establish the interdependency between positions in a splice signal.
To perform the null hypothesis test of independence on a pair of nucleotides at the i-th and the j-th positions of a splice site, we formed a 4 x 4 contingency table (Ewens and Grant, 2001), as shown in Table 1, by counting the observed number Y mn of DNA sequences where the i-th nucleotide X i was m and the j-th nucleotide X j was n (for simplicity, we have encoded A, T, C, G as 1, 2, 3, 4, respectively) from a sample of Y DNA sequences. The numbers Y mc and Y rn in Table 1 are row sums and column sums, respectively. It is clear that
.
|
The test statistic used is as follows:
![]() |
where
![]() |
is the expected number of DNA sequences in which the i-th nucleotide X i is m and the j-th nucleotide X j is n from a sample of Y DNA sequences when the null hypothesis of independence was true. To determine the rejection region for the null hypothesis, we have specified a numerical value
for the Type I error of the test, according to a
2-distribution with degrees of freedom (4 1) · (4 1) = 9, and then the critical point, K, was computed as follows:
![]() |
Bayesian networks
Bayesian methods provide a formalism for reasoning about partial beliefs under conditions of uncertainty (Pearl, 1988). The basic expressions in Bayesian formalism are statements about conditional probabilities. We say two random variables X and Y are independent if P(x | y) = P(x). The variables X and Y are conditionally independent given the random variable Z if P(x | y,z) = P(x | z). From the Bayesian rule, the global joint distribution function P(x 1, x 2, ..., x n ) of variables X 1, X 2, ..., X n can be represented as a product of local conditional distribution functions. That is,
![]() |
A Bayesian network for a collection {X 1, X 2, ..., X n } of random variables represents the joint probability distribution of these variables. The joint probability distribution, which is associated with a set of assertions of conditional independence among the variables, can be written as:
![]() |
where E x i is a subset of variables x 1, ..., x i1 on which x i is dependent. Hence, a Bayesian network can be described as a directed acyclic graph consisting of a set of n nodes and a set of directed edges between nodes. Each node in the graph corresponds to a variable x i and each directed edge is constructed from a variable in E x i to the variable x i . If each variable has a finite set of values, to each variable x i with parents in E x i , there is an attached table of conditional probabilities P(x i | E x i ).
| MODEL ARCHITECTURE AND ALGORITHMS |
|---|
|
|
|---|
Dependency graphs
The most outstanding observation made by P. Chambon is that almost all introns in pre-mRNA begin and end in the same way: the first two bases in an intron are GU and the last two are AG (Weaver, 1999). Although GU and AG are conserved at the donor site (the region surrounding the exon/intron boundary) and at the acceptor site (the region surrounding the intron/exon boundary), respectively, the bases at other positions are uncertain and require a statistical model.
We have chose a window of 18 base positions for the donor site, where nine consecutive bases are upstream from the exon/intron boundary and nine consecutive bases are downstream to the exon/intron boundary (see Results section). Further, a window of 36 base positions was chosen for the acceptor site, where 27 consecutive bases are upstream from the intron/exon boundary and nine consecutive bases are downstream to the intron/exon boundary. To be more precise, we denoted the two conserved bases of the donor site as D +1 and D +2, i.e. D +1 = G and D +2 = U. The bases in the downstream of the donor site were named as D +3, D +4, ..., D +9 and to the upstream as D 1, D 2, ..., D 9. On the other hand, we denoted the two conserved bases of the acceptor site as A 2 and A 1 i.e. A 2 = A and A 1 = G. The bases in the downstream of the acceptor site were named as A +1, A +2, ..., A +9 and to the upstream as A 3, A 4, ..., A 27.
By constructing a contingency table from a sample of donor sites for each pair (D i , D j ) of bases at distinct positions of the donor site, we tested the null hypothesis of independence for D i and D j with the
2-statistic
2(D i , D j ) as in (1). We have set the Type I error of the test to
D = 108 for the donor site (see Results section). According to the
2-distribution with nine degrees of freedom, the critical point for the rejection region of the test is K D = 55.4491 for the donor site.
By rejecting the null hypothesis, we infer that the two base positions D i and D j are dependent if the
2-statistic
2(D i , D j )
K D = 55.4491. It is clear that each of the two conserved base positions D +1 and D +2 must be statistically independent of any other position.
The dependency graph of the donor site was constructed as follows. There were 18 nodes in the undirected graph, each corresponding to a base position of the donor site. An edge was established between two nodes in the graph if the two corresponding base positions of the donor sites were dependent, as inferred from the
2-test procedure. The dependency graph so obtained is shown in Figure 1, where D +1 and D +2 are isolated nodes. For future use, adjacent base positions D j to each base position D i in the dependency graph of the donor site were sorted from left to right according to the
2-values
2(D i , D j ) varying from high to low (Table 2).
|
|
Using a similar procedure, we constructed the dependency graph of the acceptor site, as can be inferred from Table 3, where adjacent base positions A j to each base position A i in the dependency graph of the acceptor site were sorted from left to right in accordance with the
2-values
2(A i , A j ) from high to low. The Type I error of the test for the acceptor site was chosen to be
A = 103 with the critical point K A = 27.8772 (see Results section).
|
Expanded Bayesian networks
Although the dependency graph can fully capture the intrinsic interdependency between base positions in the donor site or in the acceptor site, it is difficult, if not impossible, to perform statistical inference based on the dependency graph. This is because there are cycles in the dependency graph of the donor site as shown in Figure 1 and also in the dependency graph of the acceptor site, which can be inferred from Table 3.
In contrast, as a directed acyclic graph, a Bayesian network is suitable for statistical reasoning as performed in the Cai et al. (2000) tree model. Whereas the Bayesian networks constructed by Cai et al. (2000) cannot capture the cyclic dependency among base positions as described in the dependency graph.
To resolve the dilemma, we expanded the dependency graph to form a Bayesian network by allowing a base position in the dependency graph to appear more than once in the Bayesian network as nominally distinct nodes. The basic procedure to build such a Bayesian network for the donor site is as follows:
- Calculate the sum S i =
j
N(i)
2(D i , D j ) of
2-statistics for each base position D i in the donor site, where N(i) is the set of indices of all base positions adjacent to D i in the dependency graph of the donor site (here and after, two base positions of a splice site are called adjacent if there is an edge connecting them in the dependency graph of the splice site).
- Assign a base position D i with the largest sum S i = max j S j to be the root of the Bayesian network.
- Expand the dependency graph from the rooted base position to the adjacent base positions as the first layer of the Bayesian network. The root itself forms the zeroth layer of the Bayesian network.
- Further expand the dependency graph from each base position in the first layer of the Bayesian network to their adjacent base positions to form the second layer of the Bayesian network.
- Repeatedly expand the dependency graph as in Steps 3 and 4 until all base positions and the two directions in any edges in the dependency graph have been reached at least once.
As can be seen from Figure 1 (Table 3), a node in the expanded Bayesian network may have at most five (22) parent nodes. Since each node represents a variable of four possible bases, there will be up to 46 = 4096 (423
7.0 x 1013) parameters to be estimated in establishing the conditional probability table for such a node in learning the expanded Bayesian network donor site (acceptor site) model for inference.
Considering the size of the training datasets used and to prevent overfitting the parameters of the statistical inference models, we have modified Step 4 of the basic procedure to limit each node in the expanded Bayesian network to have a maximum number p of parent nodes (we will consider p = 1, 2, 3) so that there are at most 4 p+1 = 16, 64, 256 for p = 1, 2, 3, respectively, instead of 46 or 423, parameters needed to be estimated for a conditional probability table.
To introduce the modification, we tag an adjacent index j in the adjacent index set N(i) of each base position D i in the dependency graph with an ordered pair [n j ,
2(D i , D j )], where n j is the number of times dynamically recorded that D j has been used as a parent node of D i during the expansion process and
2(D i , D j ) is the
2-statistic between D i and D j . We gave a total order to the tags as follows: [n j ,
2(D i , D j )] > [n k ,
2(D i , D k )] if either n j < n k or n j = n k and
2(D i , D j ) >
2(D i , D k ). The n j s were set to zeroes as initial values. Now, we state the modification of Steps 3 and 4 given in the above method as follows:
- (3') As in Step 3 given above. For each node D i in the first layer, increase the first entry of the tag to the rooted base position [which must be in the list N(i)] by one since the rooted position has been used as a parent node of D i .
- (4') As in Step 4 given above. If there are more than p parent nodes for a node in the second layer, keep p of the links to the parent nodes with the largest p tags and delete the rest. For each node D i in the second layer, update the tag to each parent base position D j in N(i) by incrementing n j by one.
- (4') As in Step 4 given above. If there are more than p parent nodes for a node in the second layer, keep p of the links to the parent nodes with the largest p tags and delete the rest. For each node D i in the second layer, update the tag to each parent base position D j in N(i) by incrementing n j by one.
The construct of the ordered tags was to ensure that the potential parent base positions for a base position in the dependency graph would be utilized uniformly in the expansion process with a little emphasis on those with high interdependency. Table 2 has been used to facilitate the dynamic ordering of tags used in Steps (3') and (4'). The expanded Bayesian network for the donor site as a result of the modified procedure with p = 2 is shown in Figure 2. Let random variable
be associated with the base position D i in the l-th layer of the expanded Bayesian network of the donor site. Let
be the set of parent random variables of the variable
as shown in Figure 2. For a specific DNA sequence S = (d 9, ..., d 1, d 1, ..., d 9) of a tested potential donor site, we let
for all l and for all i. Then the probability P(S|M) of having S based on the expanded Bayesian network model for a donor site is defined as:
|
![]() |
where the denominator is the sum of all possible base configurations
of the random variables
induced from all possible donor site DNA sequences and used as a normalization factor.
A similar procedure can be used to build an expanded Bayesian network for the acceptor site from the dependency graph, but not shown here because of the high complexity. This procedure can be accomplished with the sorted lists of adjacent base positions to each base position in the dependency graph of the acceptor site as given in Table 3.
| RESULTS |
|---|
|
|
|---|
Splice site datasets
To build reliable expanded Bayesian networks for the detection of human splice sites, high-quality datasets must be used. We extracted a collection of 2381 real donor sites and 2381 real acceptor sites from a set of 462 annotated multiple-exon human genes at http://www.fruitfly.org/sequence/human-datasets.html. We excluded the splice sites that contained base positions not labeled with A, T, C, G but with other symbols. Finally, there were 2379 real donor sites and 2380 real acceptor sites which were used as the true dataset. We also extracted a large collection of 283 062 pseudo donor sites and 400 314 pseudo acceptor sites from the 462 annotated genes and used it as the false dataset. Each of these pseudo donor/acceptor sites has D +1 = G, D +2 = U/A 2 = A, A 1 = G but is not a real donor/acceptor site according to the annotation.
Model learning
To prepare a machinery to determine whether a tested splice site is real or pseudo, we used the true training data to train a true expanded Bayesian network splice site model M T and the pseudo training data to train a false expanded Bayesian network model M F. Each node in an expanded Bayesian network is associated with a conditional probability table of at most 4 p+1 parameter entries to be estimated. The maximum-likelihood estimation procedure is used, which amounts to calculate the relative frequency.
Test score
The score, Score M (S), of a tested potential splice site S under the two contrast models M T and M F is the log-odds ratio defined as follows:
![]() |
where P(S | M T) and P(S | M F) are the probability of having the tested potential splice site S based on the true splice site model M T and the probability based on the false splice site model M F, respectively. With an empirically determined threshold score T, the tested potential splice site S will be claimed real if the log-odds score is no less than T; otherwise, it will be claimed pseudo.
Measures of predictive accuracy
A tested potential splice site was called a true positive (TP) if it was predicted true and actually true; a false negative (FN) if predicted pseudo but actually true; a true negative (TN) if predicted pseudo and actually pseudo; and a false positive (FP) if predicted true but actually pseudo.
It is common to use the two measures of false negative (FN) rate and false positive (FP) rate defined as
![]() |
to report the predictive accuracy of a splice site inference model (Cai et al., 2000; Pertea et al., 2001). Note that the sensitivity and the specificity of the inference model are equal to one minus the FN rate and one minus the FP rate, respectively (Khodarev et al., 2003).
Cross-validation
We used a 5-fold cross-validation in our dataset to estimate the splice site detection accuracy of all the models studied (Pertea et al., 2001). Each model was cross-validated by randomly partitioning the data into five subsets. Then we tested each subset (called the testing data) with the parameters trained by the other four subsets (called the training data) under the splice site model, and took the average of the five predictive accuracy measures corresponding to the five testing/training data pair. We also verified the training data with the model trained by themselves in the same manner.
Type I error selection
We considered five different values of the Type I error
, 108, 106, 103, 102 and 101, in the
2-test for the construction of the dependency graphs for the donor site and the acceptor site. According to the
2-distribution with nine degrees of freedom, the critical point K for the rejection of the null hypothesis is 55.4491, 44.8109, 27.8772, 21.6660 and 14.6837, respectively.
We compared the predictive accuracy of the corresponding five expanded Bayesian network predictive models with at most p parents for each window as will be described in the next section and for each p = 1, 2, 3. Then, we chose a Type I error for each window and for each p with the best performance for the donor site and for the acceptor site, respectively.
Window selection
To determine a proper window for the donor site and a proper window for the acceptor site for the purpose of computational prediction, we gathered statistics of 50 bases upstream of the exon/intron boundary and 50 bases downstream of the intron/exon boundary, respectively. We found that there was a consensus region between 3 bases upstream and 7 bases downstream of the exon/intron boundary and another between 27 bases upstream and 1 base downstream of the intron/exon boundary, respectively, as shown in Tables 4 and 5 where each of the base positions had 1 or 2 nt with the total compositional percentage not <60%.
|
|
Then keeping the reasonable complexity in mind, we examined 10 extensions of the consensus region of the donor site to select a proper window for the donor site. We compared the predictive accuracy of the corresponding 10 expanded Bayesian network predictive models with p = 2 for the training data and the testing data of the donor site, as shown in Figures 3 and 4, respectively. In this comparison, we used the best choice of Type I error for each window in the construction of the dependency graph for the donor site. Although the window 9 bases upstream to 12 bases downstream of the exon/intron boundary had the best predictive performance for the training data of the donor site, the window 9 bases upstream to 9 bases downstream of the exon/intron boundary had the best predictive performance for the testing data of the donor site. Considering the computational complexity, we selected the window nine bases upstream to nine bases downstream of the exon/intron boundary as the right choice for the donor site (in this case, the best Type I error
is 108). We also examined the same 10 candidates of window under WMM, WAM, MDD, Cai et al.'s tree model, the 2MC and third-order Markov chain (3MC) models, and the expanded Bayesian network models with p = 1 and p = 3 and the best window for each model was determined.
|
|
We also examined eight extensions of the consensus region of the acceptor site and compared the predictive accuracy of the corresponding eight expanded Bayesian network predictive models with p = 2 for the training data and the testing data of the acceptor site, as shown in Figures 5 and 6, respectively. In this comparison, we also used the best choice of Type I error for each window in the construction of the dependency graph for the acceptor site. It is apparent that the window 27 bases upstream to 9 bases downstream of the intron/exon boundary is the most suitable one for the acceptor site (in this case, the best Type I error
is 103). Similarly, we examined the same eight candidates of window with WMM, WAM, MDD, Cai et al.'s tree model, the 2MC and 3MC models, and the expanded Bayesian network models with p = 1 and p = 3 and selected the best window for each model.
|
|
Laplace's rule
When the training dataset is not large enough, some probability parameters in a probabilistic predictive model will be estimated as zeroes due to the non-existence of the corresponding base configurations in the training dataset and the predictive accuracy of the probabilistic model may be diminished. This often occurs when a higher-order Markov chain model or an expanded Bayesian network with higher p is built. One well-known approach to cope with this problem is to derive the relative frequencies by adding some fake extra counts to the true counts observed for each base configuration (Durbin et al., 1998). An extra count for each base configuration is called a pseudocount. The simplest pseudocount method is Laplace's rule: to add one pseudocount for each base configuration. In Figures 7 and 8, it is observed that the predictive accuracy of the 3MC model for splice sites is not acceptable without using the Laplace's rule and improves markedly with the use of Laplace's rule. The predictive accuracy of the expanded Bayesian network model with p = 3 is much better with the Laplace's rule than without it for the donor site, but only slightly better for the acceptor site. The predictive accuracy of the 2MC model, and the expanded Bayesian network model with p = 2 remains almost the same with or without the Laplace's rule while both models performed slightly better with the Laplace's rule, except that the expanded Bayesian network model with p = 2 performs better without the Laplace's rule for the acceptor site. For determining the best Type I error and/or the best window for each model in the previous subsections, we have compared the predictive accuracy with or without the Laplace's rule.
|
|
Predictive accuracy comparison
The two predictive accuracy measures, FN rate and FP rate, are reported for WMM, WAM, MDD, Cai et al.'s tree model, the 2MC and 3MC models, and the expanded Bayesian network models with p = 1, 2, 3 for the donor site and the acceptor site are shown in Figures 9 and 10 and Figures 11 and 12, respectively.
|
|
|
|
For the testing splice site data shown in Figures 9 and 11, the predictive accuracy of the expanded Bayesian network model with p = 2 (EBN2) was superior to that of all the other predictive models in all the cases examined, except for false positive rates
12% for the donor site and
17% for the acceptor site, respectively, where EBN2 was just among the best ones. For the training splice site data shown in Figures 10 and 12, the predictive accuracy of the expanded Bayesian network model with p = 3 (EBN3) is superior to that of all the other predictive models in all the cases examined. Note that while the predictive accuracy of EBN3 was superior to that of EBN2 for the training data, it was inferior for the testing data, which shows that EBN3 may overfit the splice site datasets. Also note that EBN2 outperforms the expanded Bayesian network model with p = 1 (EBN1) for almost all the cases examined. In particular, the sensitivity/specificity of EBN2 can reach up to 94%/94% for the testing data and 96%/94% for the training data of the donor site, and 93%/92% for the testing data and 95%/95% for the training data of the acceptor site. We next compared the predictive accuracy of the Markov chain models considered. First, we observed their predictive accuracy for the donor site. For the testing data, the 2MC model has the best predictive accuracy among all the Markov chain models while WMM (which can be regarded as the 0MC model) has the worst. WAM (which is the 1MC model) and 3MC model are slightly inferior to 2MC with WAM approaching 2MC when specificity decreases and 3MC approaching 2MC when specificity increases. For the training data, 2MC and 3MC have almost the same predictive accuracy and are superior to WAM, which is in turn superior to WMM. Apparently, 3MC overfits the donor site dataset, and 2MC had the best predictive accuracy for the donor site among all the Markov chain models.
Second, we observed the predictive accuracy of Markov chain models for the acceptor site. For the testing data, WAM has the best predictive accuracy among all the Markov chain models. WMM and 3MC have comparable predictive accuracy and were much inferior to WAM, while 2MC was slightly inferior to WAM. For the training data, the order of predictive accuracy of the Markov chain models was apparent with 3MC being the best, followed by 2MC and WAM, and WMM being the worst. Apparently, both 2MC and 3MC overfit the acceptor site dataset and WAM had the best predictive accuracy for the acceptor site among all the Markov chain models.
We compared the predictive performance between the Markov chain models and the expanded Bayesian network models. For the testing data of the donor site, the predictive accuracy of all the Markov chain models was inferior to that of all the expanded Bayesian network models with p = 1, 2, 3. For the testing data of the acceptor site, the predictive accuracy of WAM was second only to that of EBN2 (and better than all the other models, including MDD and Cai et al.'s tree model). The EBN3 has slightly inferior predictive accuracy compared to WAM, only when the FN rate
7%. The 2MC was slightly inferior to EBN3 but slightly superior to EBN1. The WMM and 3MC have much inferior predictive accuracy. Now for the training data of the donor site, the order of predictive accuracy is EBN3 > EBN2 > 2MC
3MC
EBN1 > WAM > WMM from the best to the worst. And for the training data of the acceptor site, the order of predictive accuracy is EBN3 > 3MC > EBN2 > 2MC > WAM > EBN1 > WMM.
For the prediction of the testing donor site data, MDD and WMM are the worst two models, whereas MDD is better than WMM when specificity is high and worse when specificity is low. For the prediction of the testing acceptor site data, MDD is superior to the worst two models WMM and 3MC and inferior to all other models. For the prediction of the training donor site data, MDD approaches 2MC as specificity increases but becomes the worst as specificity decreases. For the prediction of the training acceptor site data, MDD is superior to WAM but inferior to 2MC.
For the testing data of the donor site, the predictive accuracy of Cai et al.'s tree model is slightly inferior to that of 2MC, almost the same as that of 3MC, and slightly better than that of WAM when the false positive rates are <9% and slightly worse when the false positive rates are >9%. For the testing data of the acceptor site, the predictive accuracy of Cai et al.'s tree model is slightly inferior to that of WAM but becomes almost the same when the false positive rates are <7%. For the training data of the donor site and the acceptor site, Cai et al.'s tree model has about the same predictive accuracy with WAM. These results match and validate the observations made by Cai et al. (2000) between WAM and Cai et al.'s tree model. However, we observed that the splice site predictive accuracy of Cai et al.'s tree model is inferior to EBN2.
| DISCUSSION |
|---|
|
|
|---|
In this study, we developed a dependency graph model to fully capture the intrinsic cyclic interdependency between base positions in a splice site. Each dependency graph is expanded into a Bayesian network to facilitate the learning of a machinery for determining whether a tested potential splice site is real or pseudo. Compared with the previously published splice site models, such as WMM, WAM, MDD, Cai et al.'s tree model and the less-studied 2MC and 3MC models, this approach for the modeling of splice sites achieves the best results for all interesting cases, under the two predictive accuracy measures of FN rate and FP rate as shown in Figures 9 12.
The representation of the donor (acceptor) site by a window around the exon/intron (intron/exon) boundary has been studied extensively. We found that the window from 9 bases upstream to 9 bases downstream of the exon/intron boundary best represents the donor site and the window from 27 bases upstream to 9 bases downstream of the intron/exon boundary best represents the acceptor site.
The interdependency between base positions in the representative window of a splice site can be seen from the dependency graphs of the donor and the acceptor splice sites. As shown in Figure 1, strong interdependency among bases D 3, D 2, D 1, D +3, D +4, D +5, D +6, D +7 of the donor site was observed and conforms to the consensus region of the donor site as indicated in Table 4. This implies that the spliceosome binds the donor site mainly on the bases downstream of the exon/intron boundary which conforms to our biological knowledge derived from experiments. Similarly, as inferred from Table 3, strong interdependency among bases from A 27 to A 3 is observed and conforms to the consensus region of the acceptor site as indicated in Table 5. This implies that the spliceosome binds the acceptor site mainly on the bases upstream of the intron/exon boundary, which too conforms to our biological knowledge derived from experiments.
| Acknowledgments |
|---|
We thank Chao-Chung Chang and Chen-Wei Hsu for their computer programming skills in the automatic construction of dependency graphs and their expanded Bayesian networks and in the automatic execution of the predictive models studied. We thank the editor and the reviewers for their valuable suggestions to improve the presentation of this paper. This work was supported by a grant (NSC91-3112-B-007-003) from the National Research Program of Genomic Medicine, National Science Council, Taiwan.
Received on October 27, 2003; revised on August 11, 2004; accepted on September 3, 2004
| REFERENCES |
|---|
|
|
|---|
Arita, M., Tsuda, K., Asai, K. (2002) Modeling splicing sites with pairwise correlations. Bioinformatics, 18, (Suppl. 2), S27S34[Abstract].
Brunak, S., Engelbrecht, J., Knudsen, S. (1991) Prediction of human mRNA donor and acceptor sites from the DNA sequence. J. Mol. Biol., 220, 4965[CrossRef][ISI][Medline].
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 7894[CrossRef][ISI][Medline].
Cai, D., Delcher, A., Kao, B., Kasif, S. (2000) Modeling splice sites with Bayes networks. Bioinformatics, 16, 152158
Durbin, R., Eddy, S.R., Krogh, A., Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Protein and Nucleic Acids, (1998) , Cambridge, MA Cambridge University Press.
Ewens, W.J. and Grant, G.R. Statistical Methods in Bioinformatics: An Introduction, (2001) , NY Springer-Verlag.
Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouzé, P., Brunak, S. (1996) Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res., 24, , pp. 34393452
Henderson, J., Salzberg, S., Fasman, K. (1997) Finding genes in human DNA with a hidden Markov model. J. Comput. Biol., 4, 127141[ISI][Medline].
Khodarev, N.N., Park, J., Kataoka, Y., Nodzenski, E., Khorasani, L., Hellman, S., Roizman, B., Weichselbaum, R.R., Pelizzari, C.A. (2003) Receiver operating characteristic analysis: a general tool for DNA array data filtration and performance estimation. Genomics, 81, 202209[CrossRef][ISI][Medline].
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860921[CrossRef][Medline].
Mathe, C., Sagot, M., Schiex, T., Rouzé, P. (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res., 30, 41034117
Mathews, C.K., van Holde, K.E., Ahern, K.G. Biochemistry, (2000) 3rd edn. , San Francisco, CA Addison Wesley Longman.
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, (1988) , San Mateo, CA Morgan Kaufmann.
Pertea, M., Lin, X., Salzberg, S.L. (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res., 29, , pp. 11851190
Reese, M.G., Eeckman, F.H., Kulp, D., Haussler, D. (1997) Improved splice site recognition in Genie. J. Comput. Biol., 4, 311324[ISI][Medline].
Salzberg, S., Delcher, A., Fasman, K., Henderson, J. (1998) A decision tree system for finding genes in DNA. J. Comput. Biol., 5, 667680[ISI][Medline].
Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res., 12, 505519[ISI][Medline].
Tolstrup, N., Rouzé, P., Brunak, S. (1997) A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res., 25, 31593163
Weaver, R.F. Molecular Biology, (1999) , NY WCB McGraw-Hill.
Yeo, G. and Burge, C.B. (2004) Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals. J. Comput. Biol., 11, , pp. 377394[CrossRef][ISI][Medline].
Zhang, M.Q. (2002) Computational prediction of eukaryotic protein-coding genes. Nat. Rev. Genet., 3, 698709[CrossRef][ISI][Medline].
Zhang, M.Q. and Marr, T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9, 499509
This article has been cited by other articles:
![]() |
S. Sonnenburg, A. Zien, P. Philips, and G. Ratsch POIMs: positional oligomer importance matrices--understanding support vector machine-based signal detectors Bioinformatics, July 1, 2008; 24(13): i6 - i14. [Abstract] [PDF] |
||||
![]() |
S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data Nucleic Acids Res., July 13, 2007; 35(suppl_2): W688 - W693. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||





















