Bioinformatics Advance Access originally published online on July 26, 2006
Bioinformatics 2006 22(19):2326-2332; doi:10.1093/bioinformatics/btl398
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ARCS: an aggregated related column scoring scheme for aligned sequences

Bin Song 1,†, Jeong-Hyeon Choi 2,†, Guangyu Chen 1, Jacek Szymanski 1, Guo-Qiang Zhang 1, Anthony K. H. Tung 3, Jaewoo Kang 4, Sun Kim 2,* and Jiong Yang 1,*

1 Electrical Engineering and Computer Science Department, Case Western Reserve University, Cleveland, OH, USA
2 School of Informatics, Indiana University, Bloomington, IN, USA
3 Department of Computer Science, National University of Singapore, Singapore
4 Department of Computer Science and Engineering, Korea University, Seoul, Korea

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Biologists frequently align multiple biological sequences to determine consensus sequences and/or to search for predominant residues and conserved regions. Determining conserved regions in an alignment is one of the most important of these activities. Since protein sequences are often several hundred residues or longer, it is difficult to distinguish biologically important conserved regions (motifs or domains) from others. Widely used tools such as Logos, AL2CO and ConFind, as well as the entropy-based method, often fail to highlight such regions. A computational tool that can accurately highlight biologically important regions is therefore highly desirable.

Results: This paper presents a new scoring scheme, ARCS (Aggregated Related Column Score), for aligned biological sequences. The ARCS method considers not only the traditional character similarity measure but also column correlation. In an extensive experimental evaluation using 533 PROSITE patterns, ARCS highlights the motif regions with up to 77.7% accuracy when the top three peaks are considered.

Availability: The source code is available on http://bio.informatics.indiana.edu/projects/arcs and http://goldengate.case.edu/projects/arcs

Contacts: jiong.yang@case.edu, sunkim2@indiana.edu

Supplementary Material: http://bio.informatics.indiana.edu/projects/arcs and http://goldengate.case.edu/projects/arcs


    1 INTRODUCTION
One of the most important and challenging problems in biological sequence analysis is to find the predominant residues or conserved regions in a set of biological sequences. Analysis of positional conservation in an amino acid sequence alignment can aid in the detection of motifs and of functionally and/or structurally important residues, e.g. at binding sites (Pei and Grishin, 2001; Villar and Kauvar, 1994; Ouzounis et al., 1998). Mapping the conservation information onto a protein 3D structure helps to visualize spatial conservation patterns and to deduce potential functional surfaces of a protein molecule (Sander and Schneider, 1991; Lichtarge et al., 1996; Landgraf et al., 1999; Makarova and Grishin, 1999; Zhang et al., 2000). Several methods of conservation analysis are used, such as the vectorial method (Casari et al., 1995), evolutionary tracing (Lichtarge et al., 1996) and entropy-based conservation analysis (Sander and Schneider, 1991; Shenkin et al., 1991). A typical approach to conservation analysis is to align the sequences using a multiple sequence alignment tool and then determine the conserved regions of the aligned sequences.

There is a significant body of literature on the multiple sequence alignment (MSA) problem. It includes MSA algorithms such as the dynamic programming method, the central-star approach (Gusfield, 1993, 1997), the l-star algorithm (Bafna et al., 1997) and the Partial Order Alignment (POA) algorithms (Lee et al., 2002), as well as multiple sequence alignment tools such as Clustal W (Higgins et al., 1994), T-Coffee (Notredame et al., 2000) and MuSiC (Tsai et al., 2004). However, determining conserved regions in the aligned sequences remains a challenging problem. Computational tools that effectively highlight potential conserved regions can help biologists to determine conserved regions quickly and accurately. To the best of our knowledge, only a few such tools exist, e.g. Logos (Schneider and Stephens, 1990), AL2CO (Pei and Grishin, 2001), COMPASS (Sadreyev and Grishin, 2003) and ConFind (Smagala et al., 2005). In this paper, we present a novel algorithm that can highlight potential conserved regions effectively.

1.1 Motivation
Several methods are known for discovering conserved regions in aligned sequences. The main idea of Logos is to compute the frequency of each letter at each position in the aligned sequences. Logos can present the consensus sequences and display the patterns in the aligned sequences. (In Section 3.2 we show the disadvantages of Logos empirically.) AL2CO calculates a conservation index at each position in a multiple sequence alignment using several methods. Amino acid frequencies at each position are estimated and the conservation index is calculated from these frequencies. Two different strategies (unweighted frequencies and weighted frequencies) and three conceptually different approaches (entropy-based, variance-based and matrix score-based) are used in the AL2CO algorithm. COMPASS is a method for the comparison of multiple protein alignments. It derives numerical profiles from alignments, constructs optimal local profile–profile alignments and analytically estimates E-values for the detected similarities. The scoring system and E-value calculation are based on a generalization of the PSI-BLAST approach to profile–sequence comparison, adapted for the profile–profile case. However, COMPASS focuses on the comparison of different alignments rather than on highlighting the conserved regions of aligned sequences. ConFind is designed to work with a large number of closely related, highly variable sequences. Conserved regions are defined in terms of minimum region length, maximum informational entropy (variability) per position, the number of exceptions allowed to the maximum entropy criterion and the minimum number of sequences that must contain a non-ambiguous character at a position for it to be considered for inclusion in a conserved region. Although ConFind provides robust handling of alignments containing partial sequences and ambiguous characters, the method cannot deal well with general alignments.
Thus more effective methods for highlighting the true conserved regions of an alignment are still needed. Moreover, although the above methods take into account the similarity within each (aligned) column, they do not consider the correlation among columns in the aligned sequences.

In biological sequences, the columns or positions in biologically important domains are usually highly correlated and the sequences in rows are similar (Cline et al., 2002; Martin et al., 2005). To the best of our knowledge, existing approaches to discovering conserved regions in alignments consider only the similarity within each aligned column. None takes into account the correlation among columns, which is significant in biological domains. If column correlation is incorporated into the discovery of conserved regions, the results could be improved greatly. Therefore, we introduce a new aggregated related column scoring (ARCS) scheme for aligned sequences. ARCS consists of two factors. The first factor is the similarity of residues in an aligned column, which the LOGOS value (Schneider and Stephens, 1990) can measure. If the aligned sequences are similar, the ARCS score will be high. The second factor reflects the correlation among positions. If the columns of a domain are more correlated, the domain also receives a higher score. The functional dependency (Giannella and Robertson, 2004) can be used for this purpose. We apply the ARCS scheme to highlight the conserved regions in alignments. PROSITE (Nicolas et al., 2004) is a database of motif signatures in proteins that is compiled by human experts. In an extensive experiment with 533 randomly chosen PROSITE patterns in correctly aligned sequences, ARCS successfully highlights the true motif region in up to 77.7% of cases when the three highest peaks are considered. Neither Logos nor AL2CO is as effective as ARCS.

Multiple sequence alignment is a difficult problem and, in practice, an alignment may be incorrect. We therefore also compute the ARCS score for 47 randomly chosen PROSITE families that cannot be aligned correctly. ARCS can still detect part of the conserved regions in up to 40.4% of these cases.

The remainder of this paper is organized as follows. In the next section, we introduce formal definitions for ARCS and present a method to compute it. An extensive empirical study is presented in Section 3. Conclusions are drawn in Section 4.


    2 ARCS METHOD
In this section, we present the ARCS model in detail. The main idea is to make use of the biological knowledge that the elements in different columns of a domain are usually highly correlated and that the rows are highly similar. The functional dependency, $FD_{i \to j}$ in Definition 2, represents the correlation between columns. LOGOS, the function LOGOS() in Definition 1, reflects the similarity of residues within a column.

The notations in this section are similar to those in Giannella and Robertson (2004) and Schneider and Stephens (1990).

DEFINITION 1
Given a set of $n$ ($n \ge 2$) aligned sequences $\{S_1, S_2, \ldots, S_n\}$ with the same length $m$, the LOGOS score is defined as

$$\mathrm{LOGOS}(i) = H_{Max} - H(i) \tag{1}$$

where $\mathrm{LOGOS}(i)$ denotes the LOGOS value of the $i$-th column of the aligned sequences. It tries to quantify the useful, ordered information that is available in the $i$-th column. $H(i)$, the $H$ value of the $i$-th column, represents the degree of disorder of the $i$-th column. It is defined as

$$H(i) = -\sum_{e} F_{ie} \log_2 F_{ie} \tag{2}$$

where $F_{ie}$ is the frequency of letter $e$ in column $i$, that is

$$F_{ie} = \frac{c_{ie}}{n} \tag{3}$$

Moreover, $c_{ie}$ is the observed count for letter $e$ in column $i$: $c_{ie} = \sum_{j} \delta(S_j(i) = e)$, where $\delta(S_j(i) = e)$ is 1 if $S_j(i) = e$ and 0 otherwise. $S_j(i)$ denotes the $i$-th letter of the aligned sequence $S_j$. $H_{Max}$ is defined as

$$H_{Max} = \log_2(\min(N_L, n)) \tag{4}$$

$N_L$ denotes the number of distinct letters that appear in the aligned sequence set $\{S_1, S_2, \ldots, S_n\}$.

EXAMPLE 1
Consider the aligned sequence set in Fig. 1. There are 4 sequences (i.e. $n = 4$) with 5 distinct letters (M, L, Q, W and _) (i.e. $N_L = 5$). $H_{Max} = \log_2(\min(N_L, n)) = \log_2(\min(5, 4)) = 2$; $F_{1M} = 2/4 = 1/2$; $F_{1W} = 2/4 = 1/2$; $F_{2L} = 2/4 = 1/2$; $F_{2Q} = 1/4$; $F_{2\_} = 1/4$; $F_{3Q} = 3/4$; $F_{3L} = 1/4$; $F_{4L} = 3/4$; $F_{4W} = 1/4$. $H(1) = -(F_{1M} \log_2 F_{1M} + F_{1W} \log_2 F_{1W}) = 1$; $H(2) = -(F_{2L} \log_2 F_{2L} + F_{2Q} \log_2 F_{2Q} + F_{2\_} \log_2 F_{2\_}) = 1.5$; $H(3) = -(F_{3Q} \log_2 F_{3Q} + F_{3L} \log_2 F_{3L}) = 0.8113$; $H(4) = -(F_{4W} \log_2 F_{4W} + F_{4L} \log_2 F_{4L}) = 0.8113$. $\mathrm{LOGOS}(1) = H_{Max} - H(1) = 1$; $\mathrm{LOGOS}(2) = 0.5$; $\mathrm{LOGOS}(3) = 1.1887$; $\mathrm{LOGOS}(4) = 1.1887$.


Fig. 1 Aligned Sequences Set.
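
The arithmetic of Example 1 can be checked with a short script. This is only a sketch: the rows of Fig. 1 are not reproduced in this text, so the `ALIGNMENT` below is a hypothetical set of four sequences chosen to match the column frequencies quoted in the example, and the function names are illustrative rather than taken from the authors' code.

```python
from collections import Counter
from math import log2

# Hypothetical alignment: Fig. 1 is not reproduced here, so these rows were
# chosen only to match the column frequencies quoted in Example 1.
ALIGNMENT = ["MLQL", "MLQL", "WQQL", "W_LW"]


def column_entropy(alignment, i):
    """H(i): Shannon entropy (bits) of the letter frequencies in column i (0-based)."""
    n = len(alignment)
    counts = Counter(seq[i] for seq in alignment)  # c_ie for every letter e
    return -sum((c / n) * log2(c / n) for c in counts.values())


def h_max(alignment):
    """HMax = log2(min(NL, n)); NL counts every distinct symbol, gaps included."""
    n_letters = len(set("".join(alignment)))
    return log2(min(n_letters, len(alignment)))


def logos_scores(alignment):
    """LOGOS(i) = HMax - H(i) for every column of the alignment."""
    hmax = h_max(alignment)
    return [hmax - column_entropy(alignment, i) for i in range(len(alignment[0]))]


if __name__ == "__main__":
    print([round(s, 4) for s in logos_scores(ALIGNMENT)])
    # -> [1.0, 0.5, 1.1887, 1.1887]
```

The gap symbol is counted as a letter here because Example 1 lists it among the five distinct letters.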

 
DEFINITION 2
A functional dependency from A to B is defined as the existence of a map from A to B. Giannella and Robertson (2004) presented an approximation measure for functional dependency, which is applied in the ARCS method.

Given a set of $n$ ($n \ge 2$) aligned sequences $\{S_1, S_2, \ldots, S_n\}$ with the same length $m$, for the $i$-th column and the $j$-th column of the aligned sequences, $c_{ip,jq}$ is the observed count for letter $p$ in column $i$ and letter $q$ in column $j$, i.e. $c_{ip,jq} = \sum_{k} \delta(S_k(i) = p, S_k(j) = q)$. The functional dependency from column $i$ to column $j$ is defined as

[Formula 5]    (5)
$H_{i \to j}$ is the information dependency measure of column $j$ given column $i$,

[Formula 6]    (6)

EXAMPLE 2
Consider the aligned sequence set in Fig. 1. The functional dependency from column 4 to column 1 is

[Formula]

The functional dependency from column 1 to column 4 is

[Formula]

We can see that the definition of functional dependency is not symmetrical, i.e. $FD_{i \to j}$ is not necessarily equal to $FD_{j \to i}$.
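
The asymmetry can be illustrated numerically. The sketch below does not reproduce formulas (5) and (6), which are not preserved in this text. It assumes that the information dependency measure is the conditional entropy of column j given column i, which is the usual reading of that term, and it reuses a hypothetical alignment consistent with the frequencies of Example 1. The normalization that turns this quantity into the functional dependency score is not attempted.

```python
from collections import Counter
from math import log2

# Hypothetical alignment consistent with the column frequencies of Example 1;
# the actual rows of Fig. 1 are not reproduced in this text.
ALIGNMENT = ["MLQL", "MLQL", "WQQL", "W_LW"]


def conditional_entropy(alignment, i, j):
    """Assumed form of H_{i->j}: entropy of column j given column i (0-based).

    H = -sum over (p, q) of (c_{ip,jq} / n) * log2(c_{ip,jq} / c_{ip})
    """
    n = len(alignment)
    pair_counts = Counter((seq[i], seq[j]) for seq in alignment)  # c_{ip,jq}
    left_counts = Counter(seq[i] for seq in alignment)            # c_{ip}
    return -sum(
        (c / n) * log2(c / left_counts[p]) for (p, _q), c in pair_counts.items()
    )


H_4_TO_1 = conditional_entropy(ALIGNMENT, 3, 0)  # column 4 -> column 1
H_1_TO_4 = conditional_entropy(ALIGNMENT, 0, 3)  # column 1 -> column 4

if __name__ == "__main__":
    print(round(H_4_TO_1, 4), round(H_1_TO_4, 4))  # -> 0.6887 0.5
```

Under this assumption the two directions give different values for the same pair of columns, so any score derived from them is direction-dependent as well.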

DEFINITION 3
Given a set of $n$ ($n \ge 2$) aligned sequences $\{S_1, S_2, \ldots, S_n\}$ with the same length $m$, the Aggregated Related Column Score (ARCS) is defined as

[Formula 7]    (7)
where $N(i)$ is the set of neighboring columns of column $i$. In this paper, we define the neighborhood size $N = |N(i)|$. Column $j$ belongs to the set $N(i)$ if $|j - i| \le (N - 1)/2$.

EXAMPLE 3
Consider the aligned sequence set in Fig. 1. Let $N = 3$.

[Formula]

ARCS can be used to obtain information about conserved regions among aligned sequences. As an example, we use the aligned protein sequence set PS00702. Figure 2 shows the ARCS score of each column with neighborhood size 9, that is, N(i) = [i – 4, i + 4].
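
The neighborhood rule from Definition 3 is simple enough to state in code. The following is a minimal sketch: the function name is illustrative, the neighborhood size is assumed to be odd, and clipping the window at the ends of the alignment is an assumption, since the text does not say how boundary columns are treated.

```python
def neighborhood(i, size, m):
    """Columns j (1-based) with |j - i| <= (size - 1) / 2, clipped to 1..m.

    Clipping at the alignment ends is an assumption of this sketch; the text
    does not say how columns near the boundaries are handled.
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("neighborhood size is assumed to be a positive odd number")
    half = (size - 1) // 2
    return list(range(max(1, i - half), min(m, i + half) + 1))


if __name__ == "__main__":
    print(neighborhood(10, 9, 100))  # -> [6, 7, 8, 9, 10, 11, 12, 13, 14]
    print(neighborhood(2, 9, 100))   # -> [1, 2, 3, 4, 5, 6]
```

With neighborhood size 9 and a column away from the ends, this yields the interval [i – 4, i + 4] quoted above, with column i itself included.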


Fig. 2 ARCS score of PS00702 with neighborhood size 9. The motif region is between two lines.

 
From Figure 2, we can see that ARCS shows the conservation information for each sequence position. To make the curve highlight the conserved regions more clearly, we smooth the ARCS result. That is,

[Formula 8]    (8)
We let $w$ denote the smoothing window size. Figure 3 shows the ARCS score curve smoothed with window size 3.
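
Formula (8) is not preserved in this text, so the exact smoothing rule cannot be restated here. One common choice that fits the description is a centered moving average over w columns. The sketch below assumes that choice, assumes w is odd, and truncates the window at the ends of the score series; all three are assumptions, and the function name is illustrative.

```python
def smooth(scores, w=3):
    """Centered moving average with window size w (assumed odd).

    Near the ends the window is truncated to the available columns; this edge
    handling is an assumption of the sketch, not taken from the paper.
    """
    if w < 1 or w % 2 == 0:
        raise ValueError("window size is assumed to be a positive odd number")
    half = (w - 1) // 2
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


if __name__ == "__main__":
    print(smooth([0.0, 3.0, 0.0, 6.0, 0.0], w=3))
    # -> [1.5, 1.0, 3.0, 2.0, 3.0]
```

Averaging flattens isolated single-column spikes while preserving runs of consistently high scores, which is the effect the smoothing step is meant to achieve.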


Fig. 3 The ARCS score smoothed by window size 3 with neighborhood size 9.

 

    3 EXPERIMENTS
The performance of ARCS was extensively evaluated using the PROSITE database (Release 17.01 of January 2002). For each PROSITE pattern, we extracted the set of sequences containing the pattern and aligned the sequence set with Clustal W. Column scores were then calculated using the ARCS method, which was implemented in MATLAB and Octave. Among 1320 patterns, we randomly chose 709 patterns for which the number of sequences did not exceed 50. For 176 of these patterns, the multiple sequence alignment failed to align the motif regions correctly; 47 of these 176 patterns were randomly chosen. Thus we used the remaining 533 multiple sequence alignments (709 − 176) to evaluate our method for the case where the alignment is correct (details in Section 3.2). A total of 47 alignments were tested for the case where Clustal W aligned part or none of the motifs (details in Section 3.3).

The ARCS method transforms the multiple sequence alignment into a series of real numbers, one for each column, and we can define peaks in this number series. For each correct alignment and the corresponding PROSITE pattern, performance was measured in terms of the rank of the peak to which the true motif region corresponds. The highest peak is assigned rank 1, the second highest peak rank 2, and so on. Since 533 randomly selected patterns were used to test correct alignments, we were not able to verify the results manually; manual verification would also be subjective. We therefore implemented a peak-finding program, which is described below. For the 47 alignments that Clustal W aligns ‘incorrectly’, we manually determined whether the highest peaks indicate part of the motif. We also measured the performance of our method in terms of the complexity of the PROSITE patterns (Section 3.4).

Automatic peak detection method. It is not trivial to define peaks in a series of numbers. One challenge is to handle adjacent peaks. For example, if two high peaks are nearby, should we define two separate peaks or a single peak merging these two peaks? A widely used technique is to smooth the values within a window of a fixed size. As a result of smoothing, we have a new series of numbers where peaks will be defined. To define peaks we need to define local minima and local maxima. We define a peak as a data position of a local maximum where the difference between the local maximum and any of two nearby local minima is greater than a parameter. The parameters for the peak finding program are Tmh, the minimum height of the local maximum from the local minimum (Li and Fenimore, 1996), and Tew, the half window size for evaluation.
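
The peak definition above can be sketched as follows. This is one possible reading, not the authors' program: a position counts as a peak when it is the maximum of the window of half-width `t_ew` around it and exceeds the lowest value on each side within that window by at least `t_mh`. Requiring the margin on both sides, the treatment of boundary positions, and the helper `motif_peak_rank` are assumptions made for illustration.

```python
def find_peaks(scores, t_mh=0.05, t_ew=10):
    """Return (position, height) pairs for detected peaks, highest first.

    Position i (0-based) is a peak when it is the maximum of the window
    [i - t_ew, i + t_ew] and rises at least t_mh above the lowest value on
    each side of it within that window (assumed reading of the definition).
    """
    peaks = []
    for i, value in enumerate(scores):
        lo, hi = max(0, i - t_ew), min(len(scores), i + t_ew + 1)
        if value < max(scores[lo:hi]):
            continue  # not the local maximum of its evaluation window
        left, right = scores[lo:i], scores[i + 1:hi]
        if left and value - min(left) < t_mh:
            continue  # does not rise enough above the left-hand minimum
        if right and value - min(right) < t_mh:
            continue  # does not rise enough above the right-hand minimum
        peaks.append((i, value))
    return sorted(peaks, key=lambda p: p[1], reverse=True)


def motif_peak_rank(peaks, motif_start, motif_end):
    """Rank (1 = highest) of the first peak inside [motif_start, motif_end], or None."""
    for rank, (pos, _height) in enumerate(peaks, start=1):
        if motif_start <= pos <= motif_end:
            return rank
    return None


if __name__ == "__main__":
    demo = [0.1, 0.5, 0.2, 0.15, 0.9, 0.3, 0.1]
    found = find_peaks(demo, t_mh=0.05, t_ew=2)
    print(found)                         # -> [(4, 0.9), (1, 0.5)]
    print(motif_peak_rank(found, 0, 2))  # -> 2
```

A larger `t_ew` merges nearby peaks into the single highest one, which is one way to address the adjacent-peak question raised in the paragraph.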

In Figures 4–8, the x-axes represent the column position in the aligned sequences and the y-axes indicate the score.


Fig. 4 ARCS score of PS00568 with the same smoothing window size 3 and different neighborhood sizes: (a) neighborhood size 3, (b) 5 and (c) 7.

 
3.1 Effects of the neighborhood size for ARCS
We explore the peaks of ARCS with various neighborhood sizes for PS00568, as illustrated in Figure 4. In Figure 4a, the positions of the highest peaks do not correspond to the known domains. However, with neighborhood sizes 5 and 7 (Fig. 4b and c), the highest peak corresponds to the true motif region. Table 1 shows experiments with neighborhood sizes varying from 3 to 11 on 533 different PROSITE patterns. From neighborhood size 7 to neighborhood size 11, the performance of ARCS in highlighting the conserved regions does not change much.


Table 1 Evaluation of ARCS performance with the same smoothing window size 3 and varying neighborhood sizes of 3, 5, 7, 9 and 11 on 533 datasets.

 
We evaluated ARCS performance with different neighborhood sizes on the 533 datasets; Table 1 shows the results. When the neighborhood size is 3, 40.2% of motifs correspond to the first peak, 60.0% to the top two peaks and 71.3% to the top three peaks. When the neighborhood size is 5, 45.8% correspond to the first peak, 65.3% to the top two peaks and 74.5% to the top three peaks. When the neighborhood size is 7, 46.7% correspond to the first peak, 67.0% to the top two peaks and 77.7% to the top three peaks. When the neighborhood size is 9, 46.7% correspond to the first peak, 65.7% to the top two peaks and 77.3% to the top three peaks. When the neighborhood size is 11, 48.4% correspond to the first peak, 65.3% to the top two peaks and 76.0% to the top three peaks. Therefore, with neighborhood size 7, ARCS highlights the motif regions with up to 77.7% accuracy when the top three peaks are considered. In addition, for the three highest peaks of the ARCS score for the sequences in each family, the precision is ~35%, which means that 35% of the top three peaks correspond to a true motif or part of a true motif. In some cases, multiple peaks may correspond to different portions of the same motif.

3.2 Performance of LOGOS, AL2CO and ARCS
The aligned protein sequence set PS00702 is used for the comparison. The multiple sequence alignment algorithm correctly aligns the known motif region. Figure 5 gives the LOGOS score of each column of PS00702.


Fig. 5 LOGOS score of PS00702. The motif region is between two dotted lines.

 
As with the smoothed ARCS score, we smooth the LOGOS score to highlight the conserved regions. That is,

[Formula 9]    (9)
The AL2CO paper recommends a window size of 3 for smoothing the AL2CO score. To be consistent, we also choose a smoothing window size of 3 for both ARCS and LOGOS. Figure 6 shows the smoothed LOGOS score of each column of PS00702. Figure 7 illustrates the AL2CO method with window size 3.


Fig. 6 Smoothed LOGOS score of PS00702 by window size 3. The motif region is between two dotted lines.

 


Fig. 7 Smoothed AL2CO score of PS00702 by window size 3.

 
For PS00702, Logos and AL2CO were not able to highlight the motif region clearly; there are a few peaks whose heights are comparable to or higher than that of the motif region. In contrast, with the ARCS method, the highest peak (among a small number of distinct peaks) corresponds to the true motif region (Fig. 3).

Performance on a large number of datasets. Table 2 shows the performance of ARCS compared with LOGOS and AL2CO, in terms of the rank of the peak that corresponds to the motif, on 533 randomly chosen datasets. To be consistent, we choose a smoothing window size of 3 for the ARCS, LOGOS and AL2CO methods. With neighborhood size 7, ARCS highlights 46.7% of motifs with the first peak, 67.0% with the top two peaks and 77.7% with the top three peaks. In contrast, the LOGOS method highlights 35.6% of motifs with the first peak, 52.3% with the top two peaks and 67.2% with the top three peaks. AL2CO highlights 40.7% with the first peak, 60.8% with the top two peaks and 73.2% with the top three peaks.


Table 2 Evaluation of ARCS with LOGOS and AL2CO in terms of the peak rank

 
3.3 Performance of ARCS on incorrectly aligned sequences
For some datasets, existing multiple sequence alignment tools cannot align the sequences ‘correctly’: only part or none of the motif is aligned. For example, the pattern of dataset PS01220 is ‘[FYL]-x-[LVM]-[LIVF]-x-[TIV]-[DC]-P-D-x-P-[SNG]-x(10)-H’. With Clustal W, ‘[LVM]-[LIVF]-x-[TIV]-[DC]-P-D-x-P-[SNG]-x(10)-H’ is aligned, but the first element of the motif, ‘[FYL]’, is not. In this case, ARCS can still find the parts of the motif that are aligned. Figure 8 presents the curve of the smoothed ARCS score.


Fig. 8 The ARCS score of PS01220 smoothed by window size 3. The partial motif region is between two lines.

 
A total of 47 patterns were randomly chosen from the 176 patterns whose multiple sequence alignments are incorrect. On these 47 protein families, the first peak of ARCS corresponds to part of the motif in up to 40.4% of test cases.

3.4 Performance in terms of pattern complexity
What we have shown in the previous sections is the accuracy of ARCS. It is also important to investigate its sensitivity to certain characteristics of the motif. We measured the sensitivity of ARCS with respect to motif complexity, which is defined as 1 minus the ratio of the number of exact characters in the pattern to the length of the pattern. Higher complexity means that there are more ambiguous characters in the pattern, so highlighting true motif regions for the pattern is more difficult. As Figure 9 shows, our method is not sensitive to motif complexity and works equally well for very high-complexity cases.
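
As an illustration of the complexity measure, the sketch below scores a PROSITE-style pattern string. The text does not say how repeats are counted toward the pattern length, so the choices here are assumptions: a repeat such as x(10) counts as 10 positions, a range (n,m) counts as its minimum n, and only single-residue elements count as exact characters. The function name and the regular expression are illustrative.

```python
import re

# One PROSITE element: [ambiguous set], {excluded set}, x (any residue) or one
# exact residue, optionally followed by a repeat count such as (10) or (2,4).
ELEMENT = re.compile(r"^(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?$")


def motif_complexity(pattern):
    """1 - (exact positions / total positions) for a PROSITE-style pattern.

    Assumptions of this sketch: a repeat x(n) counts as n positions, a range
    (n,m) counts as its minimum n, and only single-letter elements are exact.
    """
    exact = total = 0
    for raw in pattern.strip(".").split("-"):
        match = ELEMENT.match(raw.strip("<>"))
        if match is None:
            raise ValueError(f"unsupported pattern element: {raw!r}")
        body, repeat = match.group(1), int(match.group(2) or 1)
        total += repeat
        if len(body) == 1 and body != "x":
            exact += repeat
    return 1 - exact / total


PS01220 = "[FYL]-x-[LVM]-[LIVF]-x-[TIV]-[DC]-P-D-x-P-[SNG]-x(10)-H"

if __name__ == "__main__":
    # 4 exact residues (P, D, P, H) out of 23 positions under these assumptions.
    print(round(motif_complexity(PS01220), 3))  # -> 0.826
```

Under these counting rules the PS01220 pattern quoted in Section 3.3 falls in the high-complexity range, since most of its positions are ambiguous or unconstrained.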


Fig. 9 Motif complexity versus accuracy of our method for Tmh = 0.05, Tew = 10, smoothing window size 3 and neighborhood size 11.

 

    4 CONCLUSION
In this paper, we defined a new scoring scheme, ARCS, that considers column correlation as well as the traditional character similarity measure. We measured the performance of the ARCS method using 533 PROSITE patterns whose sequences were aligned correctly and 47 PROSITE patterns whose sequences were aligned incorrectly. For the correctly aligned sequences, ARCS successfully highlights the true motif region in up to 77.7% of cases when the three highest peaks are considered. Neither Logos nor AL2CO is as effective as ARCS. For the incorrectly aligned families, ARCS can still detect part of the conserved regions with the highest peak in up to 40.4% of cases. We believe that ARCS can help biologists to use multiple sequence alignments more effectively, e.g. for extracting conserved regions and modeling a set of proteins in terms of alignments.

Our work can be extended in many directions. The alignment scoring scheme can be further developed into a de novo motif discovery algorithm based on the alignment. It would also be interesting to develop an algorithm that finds the boundaries of conserved regions for given alignment scores.


    Acknowledgments
 
This work was partially supported by National Science Foundation Career DBI-0237901 and INGEN (Indiana Genomics Initiatives) to S.K.

Conflict of Interest: none declared.


    FOOTNOTES
 
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Associate Editor: Christos Ouzounis

Received on November 14, 2005; revised on June 6, 2006; accepted on July 18, 2006

    REFERENCES

    Bafna, V., et al. (1997) Approximation algorithms for multiple sequence alignment. Theor. Comput. Sci., 182, 233–244.

    Casari, G., et al. (1995) A method to predict functional residues in proteins. Nat. Struct. Biol., 2, 171–178.

    Cline, M.S., et al. (2002) Information-theoretic dissection of pairwise contact potentials. Proteins, 49, 7–14.

    Giannella, C. and Robertson, E. (2004) On approximation measures for functional dependencies. Information Systems, 483–507.

    Gusfield, D. (1993) Efficient methods for multiple sequence alignment with guaranteed error bounds. Bull. Math. Biol., 55, 141–154.

    Gusfield, D. (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, New York.

    Higgins, D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

    Landgraf, R., et al. (1999) Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng., 12, 943–951.

    Lee, C., et al. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–462.

    Li, H. and Fenimore, F.E. (1996) Log-normal distributions in gamma-ray burst time histories. Astrophys. J., 469, L115–L118.

    Lichtarge, O., et al. (1996) An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol., 257, 342–358.

    Makarova, K.S. and Grishin, N.V. (1999) The Zn-peptidase super-family: functional convergence after evolutionary divergence. J. Mol. Biol., 292, 11–17.

    Martin, L.C., et al. (2005) Using information theory to search for co-evolving residues in proteins. Bioinformatics, 21, 4116–4124.

    Nicolas, H., et al. (2004) Recent improvements to the PROSITE database. Nucleic Acids Res., 32, 134–137.

    Notredame, C., et al. (2000) T-Coffee: a novel method for multiple sequence alignments. J. Mol. Biol., 302, 205–217.

    Ouzounis, C., et al. (1998) Are binding residues conserved? Pac. Symp. Biocomput., 401–412.

    Pei, J. and Grishin, N.V. (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics, 17, 700–712.

    Sadreyev, R. and Grishin, N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.

    Sander, C. and Schneider, R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.

    Schneider, T. and Stephens, R. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res., 6097–6100.

    Shenkin, P.S., et al. (1991) Information-theoretical entropy as a measure of sequence variability. Proteins, 11, 297–313.

    Smagala, J.A., et al. (2005) ConFind: a robust tool for conserved sequence identification. Bioinformatics, 21, 4420–4422.

    Tsai, Y.T., et al. (2004) MuSiC: a tool for multiple sequence alignment with constraints. Bioinformatics, 20, 2309–2311.

    Villar, H.O. and Kauvar, L.M. (1994) Amino acid preferences at protein binding sites. FEBS Lett., 349, 125–130.

    Zhang, H., et al. (2000) Crystal structure of YbaK protein from Haemophilus influenzae (HI1434) at 1.8 Å resolution: functional implications. Proteins, 40, 86–97.



