Bioinformatics Advance Access originally published online on October 20, 2005
Bioinformatics 2005 21(24):4420-4422; doi:10.1093/bioinformatics/bti719
ConFind: a robust tool for conserved sequence identification
Department of Chemistry and Biochemistry, The University of Colorado at Boulder UCB #215, Boulder, CO 80309, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: ConFind (conserved region finder) identifies regions of conservation in multiple sequence alignments that can serve as diagnostic targets. Designed to work with a large number of closely related, highly variable sequences, ConFind provides robust handling of alignments containing partial sequences and ambiguous characters. Conserved regions are defined in terms of the minimum region length, the maximum informational entropy (variability) per position, the number of exceptions allowed to the maximum entropy criterion and the minimum number of sequences that must contain a non-ambiguous character at a position for that position to be considered for inclusion in a conserved region. For an alignment of 95 influenza A hemagglutinin sequences with random deletions, ConFind reduces the average error in the calculated entropy by 98% relative to the Find Conserved Regions option in BioEdit.
Requirements: ConFind requires Python 2.3, but Python 2.4 or an upgrade of the optparse module to Optik 1.5 is suggested. The program is known to run under Linux and DOS.
Availability: ConFind is licensed under the GNU General Public License (GPL). Source code, documentation, and a precompiled DOS executable are available for download at http://www.colorado.edu/chemistry/RGHP/software/
Contact: rowlen{at}colorado.edu
| INTRODUCTION |
|---|
With the recent and continued growth of publicly available sequence databases, diagnostic applications of sequence-based techniques such as PCR and DNA microarrays are currently of widespread interest. Applications include identification of risk factors for genetic diseases such as cancer, detection of drug resistance and identification and subtyping of viral pathogens (Clewley, 2004; Gibbs et al., 1998; Striebel et al., 2003; Zammatteo et al., 2002). In contrast to traditional differential gene expression microarrays that require a specific capture sequence for each gene, diagnostic arrays mandate capture sequences that can detect a range of related sequences.
Designing diagnostics for highly mutable target sequences such as viral genomes requires particular care in selecting regions of sufficient conservation to allow detection of many strains, while ensuring differentiation between subtypes of interest (Clewley, 2004; Gibbs et al., 1998; Ivshina et al., 2004; Rodriguez et al., 1992; Ruest et al., 2003). Current research in our laboratory is focused on development of an oligonucleotide microarray for the rapid identification and subtyping of influenza virus (M.Mehlmann, E.D.Dawson, M.B.Townsend, J.A.Smagala, C.L.Moore, C.B.Smith, N.J.Cox, R.D.Kuchta and K.L.Rowlen, manuscript in preparation; Hilleman, 2002). Initial probe design for the array was hampered by the lack of a robust, widely available tool for identifying sequence conservation. In particular, the Find Conserved Regions option in BioEdit provides the functionality required, but has a number of drawbacks that make it difficult to use in alignments containing missing or ambiguous sequence data. ConFind emulates the basic functionality of the Find Conserved Regions option in BioEdit, but is significantly more robust in its ability to handle alignments containing incomplete sequence data.
| PROGRAM FEATURE SUMMARY |
|---|
ConFind is modeled after the conserved region finder in the BioEdit package and includes a number of similar features and options.
- ConFind detects conserved regions in multiple sequence alignments, though neither ConFind nor BioEdit will detect conserved regions that are not aligned.
- User-definable parameters include the minimum acceptable conserved region length (Lmin), the maximum entropy allowed for all positions (Hmax) and the number of exceptions (Nex) allowed to the maximum entropy.
- The summary output file includes the input parameter values used to generate the result, the total number of conserved regions found and a position-by-position listing of the entropy values for each conserved region.
- Additional FASTA files may be generated for each region found. These files contain the sequence information over the conserved region for each sequence in the alignment.
- ConFind calculates sequence variability using the traditional Shannon informational entropy formula shown in Equation (1), where $p_i$ is the probability of character $i$ of the alphabet occurring in an aligned column. BioEdit uses a similar formula based on the natural logarithm. Scores found by BioEdit can be divided by ln(2) to allow direct comparison with scores found by ConFind.
$$ H = -\sum_{i} p_i \log_2 p_i \qquad (1) $$
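The division by ln(2) follows from the change-of-base identity for logarithms. As a short worked step, writing $H_{\text{bits}}$ for a base-2 score and $H_{\text{nats}}$ for a natural-logarithm score (both labels are introduced here for illustration and do not appear in either program):

$$ H_{\text{bits}} = -\sum_{i} p_i \log_2 p_i = -\sum_{i} p_i \frac{\ln p_i}{\ln 2} = \frac{1}{\ln 2}\left(-\sum_{i} p_i \ln p_i\right) = \frac{H_{\text{nats}}}{\ln 2} $$

For example, a column in which four characters are equally likely scores $\ln 4 \approx 1.386$ on the natural-logarithm scale, which corresponds to $1.386 / \ln 2 = 2$ bits.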
The BioEdit Find Conserved Regions option has several limitations that prevent it from detecting conservation in regions where not all records contain a full-length sequence. ConFind contains two features to avoid this issue:
- Gaps and ambiguous characters are ignored when calculating the entropy score for a column of an alignment.
- The minimum number of non-ambiguous characters required for a position to be considered can be specified.
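The two features above can be illustrated with a minimal sketch of a per-column entropy calculation. This is not ConFind's source code: it is written for Python 3 rather than the Python 2.3 the program targets, the function name `column_entropy` and the parameter name `min_chars` are hypothetical, and it assumes a DNA alignment in which only A, C, G and T count as non-ambiguous.

```python
import math
from collections import Counter

# Assumption: only A, C, G and T are treated as non-ambiguous. Gaps ('-')
# and ambiguity codes such as N or R are skipped entirely.
UNAMBIGUOUS = frozenset("ACGT")


def column_entropy(column, min_chars=1):
    """Return the Shannon entropy (bits) of one alignment column, or None.

    column    -- iterable of single characters, one per sequence
    min_chars -- hypothetical name for the minimum number of non-ambiguous
                 characters a column needs before it is scored
    """
    counts = Counter(ch.upper() for ch in column if ch.upper() in UNAMBIGUOUS)
    total = sum(counts.values())
    if total == 0 or total < min_chars:
        return None  # too little usable data to score this position
    # p * log2(1/p) is the same quantity as -p * log2(p) in Equation (1).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(column_entropy("ACGT"))               # four equally likely characters
print(column_entropy("AA--NN"))             # gaps and N are ignored
print(column_entropy("A---", min_chars=2))  # too few characters to score
```

Returning `None` rather than a number keeps unscored positions distinguishable from perfectly conserved ones, which matters when regions are later assembled from the per-column scores.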
| RESULTS AND DISCUSSION |
|---|
Figure 1 shows a plot of the error when calculating the entropy for an alignment with missing sequence information. First, the true entropy was calculated for a complete dataset. Then, sequence information was deleted, the entropy recalculated and the true score subtracted from the apparent score. All BioEdit scores have been divided by ln(2) to normalize the entropy scales. The range of entropies for a four-character alphabet is [0,2]. BioEdit may calculate entropies as high as 2.32, as the gap character is included in the alphabet.
Figure 1A shows the error in the calculated entropy for 31 positions of 10 influenza A neuraminidase genes. The sequences are shown with deletions in gray. Note the large error at position 21. The deletion of the single occurrence of a character results in a significant error in the calculated entropy.
Figure 1B shows the error in the calculated entropy for 101 positions of 95 influenza A hemagglutinin genes with random deletion of up to 25% of the characters in a column. For positions with deleted sequence data in Figure 1B, the average error using ConFind is 0.01 ± 0.03, while the average error using BioEdit is 0.4 ± 0.2. Because BioEdit is guaranteed to generate artificially high entropy scores when an alignment includes gaps or ambiguous characters, ConFind is better suited to finding regions of possible conservation in alignments with missing sequence information.
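The error measure described here (apparent score minus true score) can be reproduced on a toy column. The sketch below is illustrative only: the eight-character column is invented, `entropy_bits` is a hypothetical helper, and the "gap counted as a symbol" calculation stands in for the behavior the text attributes to BioEdit rather than reproducing that program.

```python
import math
from collections import Counter


def entropy_bits(chars):
    """Shannon entropy in bits over exactly the characters passed in."""
    counts = Counter(chars)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


full_column = "AAAAAAAA"     # perfectly conserved column, true entropy 0
with_deletions = "AAAAAA--"  # the same column after deleting 2 of 8 characters

true_score = entropy_bits(full_column)
gap_as_symbol = entropy_bits(with_deletions)  # gap treated as a fifth character
gaps_ignored = entropy_bits(c for c in with_deletions if c != "-")

# Error = apparent score - true score, as in the text.
error_gap_as_symbol = gap_as_symbol - true_score
error_gaps_ignored = gaps_ignored - true_score

# Upper bounds quoted in the text: log2(4) = 2 bits for four characters,
# log2(5), about 2.32 bits, once the gap joins the alphabet.
max_four = math.log2(4)
max_five = math.log2(5)

print(error_gap_as_symbol, error_gaps_ignored, max_four, max_five)
```

In this toy case the gap-inclusive score reports variability in a column that is in fact fully conserved, while skipping the gaps recovers the true score. Real alignments will not always behave this cleanly, since a deletion can also remove the only copy of a character.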
| ALGORITHM DETAILS |
|---|
To identify regions of conservation in a multiple sequence alignment, ConFind attempts to calculate the entropy score for each column, assigning either a bit score or None for columns with an insufficient number of non-ambiguous characters. A list of start sites is then generated. This list may include the first position, any position following a position with a score greater than the maximum allowed entropy (Hmax), and any position following a position with a value of None. Start sites must have an entropy score lower than Hmax. A greedy algorithm is then applied to extend each start site over as many following positions as possible, stopping when the next position is None or when the next position would cause the exception count to exceed the number of allowed exceptions (Nex) to the maximum entropy. If the end of the alignment is reached, no further start positions are examined. Finally, the regions are examined to ensure they are longer than the minimum length (Lmin).
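The procedure described above can be sketched as follows. This is one reading of the prose, not ConFind's source: the function and parameter names are hypothetical, regions are returned as half-open index pairs, overlapping regions from different start sites are all kept, a region is allowed to end on an exception position, and a region is accepted when its length is at least `l_min`. Each of those choices is an assumption where the text is silent or ambiguous.

```python
def find_conserved_regions(scores, h_max, n_ex, l_min):
    """Return (start, end) pairs, end exclusive, for candidate regions.

    scores -- per-column entropy in bits, or None for unscored columns
    h_max  -- maximum allowed entropy per position (Hmax)
    n_ex   -- number of positions allowed to exceed h_max (Nex)
    l_min  -- minimum region length (Lmin); assumed inclusive here
    """
    n = len(scores)

    # Start sites: the first position, or any position that follows a None
    # or a score above h_max. The start itself must score below h_max.
    starts = [
        i for i in range(n)
        if (i == 0 or scores[i - 1] is None or scores[i - 1] > h_max)
        and scores[i] is not None and scores[i] < h_max
    ]

    regions = []
    for start in starts:
        end = start + 1
        exceptions = 0
        # Greedy extension: stop at a None, or at the position that would
        # push the exception count past n_ex.
        while end < n and scores[end] is not None:
            if scores[end] > h_max:
                if exceptions == n_ex:
                    break
                exceptions += 1
            end += 1
        regions.append((start, end))
        if end == n:
            break  # end of alignment reached; skip remaining start sites

    return [(s, e) for s, e in regions if e - s >= l_min]


example = [0.0, 0.1, 1.5, 0.0, 0.0, None, 0.0, 0.2, 0.0]
print(find_conserved_regions(example, h_max=0.5, n_ex=0, l_min=2))
print(find_conserved_regions(example, h_max=0.5, n_ex=1, l_min=3))
```

Because start sites are fixed before extension begins, allowing exceptions can produce a long region that contains a shorter one found from a later start site; whether such overlaps should be merged or reported separately is left open here.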
| Acknowledgments |
|---|
The authors thank Dr Catherine Smith and Dr Rebecca Garten of the Influenza Branch at the Centers for Disease Control and Prevention for their feedback concerning this software. The authors also acknowledge funding from the National Institute of Allergy and Infectious Diseases (U01 AI056528).
Conflict of Interest: none declared.
Received on August 10, 2005; revised on September 28, 2005; accepted on October 14, 2005
| REFERENCES |
|---|
Clewley, J.P. (2004) A role for arrays in clinical virology: fact or fiction? J. Clin. Virol., 29, 2-12.
Gibbs, A., et al. (1998) The GPRIME package: computer programs for identifying the best regions of aligned genes to target in nucleic acid hybridization-based diagnostic tests, and their use with plant viruses. J. Virol. Methods, 74, 67-76.
Hilleman, M.R. (2002) Realities and enigmas of human viral influenza: pathogenesis, epidemiology and control. Vaccine, 20, 3068-3087.
Ivshina, A.V., et al. (2004) Mapping of genomic segments of influenza B virus strains by an oligonucleotide microarray method. J. Clin. Microbiol., 42, 5793-5801.
Mehlmann, M., Dawson, E.D., Townsend, M.B., Smagala, J.A., Moore, C.L., Smith, C.B., Cox, N.J., Kuchta, R.D., Rowlen, K.L. (2005) Manuscript in preparation.
Rodriguez, A., et al. (1992) Primer design for specific diagnosis by PCR of highly variable RNA viruses: typing of foot-and-mouth disease virus. Virology, 189, 363-367.
Ruest, A., et al. (2003) Comparison of the directigen flu A+B test, the QuickVue influenza test, and clinical case definition to viral culture and reverse transcription-PCR for rapid diagnosis of influenza virus infection. J. Clin. Microbiol., 41, 3487-3493.
Striebel, H.-M., et al. (2003) Virus diagnostics on microarrays. Curr. Pharm. Biotech., 4, 401-415.
Zammatteo, N., et al. (2002) New chips for molecular biology and diagnostics. Biotech. Ann. Rev., 8, 85-101.