Bioinformatics Advance Access originally published online on October 20, 2005
Bioinformatics 2005 21(24):4420-4422; doi:10.1093/bioinformatics/bti719
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ConFind: a robust tool for conserved sequence identification

James A. Smagala, Erica D. Dawson, Martin Mehlmann, Michael B. Townsend, Robert D. Kuchta and Kathy L. Rowlen*

Department of Chemistry and Biochemistry, The University of Colorado at Boulder UCB #215, Boulder, CO 80309, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: ConFind (conserved region finder) identifies regions of conservation in multiple sequence alignments that can serve as diagnostic targets. Designed to work with a large number of closely related, highly variable sequences, ConFind provides robust handling of alignments containing partial sequences and ambiguous characters. Conserved regions are defined in terms of minimum region length, maximum informational entropy (variability) per position, number of exceptions allowed to the maximum entropy criterion and the minimum number of sequences that must contain a non-ambiguous character at a position to be considered for inclusion in a conserved region. Comparison of the calculated entropy for an alignment of 95 influenza A hemagglutinin sequences with random deletions results in a 98% reduction in the average error in ConFind relative to the ‘Find Conserved Regions’ option in BioEdit.

Requirements: ConFind requires Python 2.3, but Python 2.4 or an upgrade of the optparse module to Optik 1.5 is suggested. The program is known to run under Linux and DOS.

Availability: ConFind is licensed under the GNU General Public License (GPL). Source code, documentation, and a precompiled DOS executable are available for download at http://www.colorado.edu/chemistry/RGHP/software/

Contact: rowlen@colorado.edu


    INTRODUCTION
 
With the recent and continued growth of publicly available sequence databases, diagnostic applications of sequence-based techniques such as PCR and DNA microarrays are currently of widespread interest. Applications include identification of risk factors for genetic diseases such as cancer, detection of drug resistance and identification and subtyping of viral pathogens (Clewley, 2004; Gibbs et al., 1998; Striebel et al., 2003; Zammatteo et al., 2002). In contrast to traditional differential gene expression microarrays that require a specific capture sequence for each gene, diagnostic arrays mandate capture sequences that can detect a range of related sequences.

Designing diagnostics for highly mutable target sequences such as viral genomes requires particular care in selecting regions of sufficient conservation to allow detection of many strains, while ensuring differentiation between subtypes of interest (Clewley, 2004; Gibbs et al., 1998; Ivshina et al., 2004; Rodriguez et al., 1992; Ruest et al., 2003). Current research in our laboratory is focused on development of an oligonucleotide microarray for the rapid identification and subtyping of influenza virus (M.Mehlmann, E.D.Dawson, M.B.Townsend, J.A.Smagala, C.L.Moore, C.B.Smith, N.J.Cox, R.D.Kuchta and K.L.Rowlen, manuscript in preparation; Hilleman, 2002). Initial probe design for the array was hampered by the lack of a robust, widely available tool for identifying sequence conservation. In particular, the ‘Find Conserved Regions’ option in BioEdit provides the functionality required, but has a number of drawbacks that make it difficult to use in alignments containing missing or ambiguous sequence data. ConFind emulates the basic functionality of the ‘Find Conserved Regions’ option in BioEdit, but is significantly more robust in its ability to handle alignments containing incomplete sequence data.


    PROGRAM FEATURE SUMMARY
 
ConFind is modeled after the conserved region finder in the BioEdit package and includes a number of similar features and options.

  • ConFind detects conserved regions in multiple sequence alignments, though neither ConFind nor BioEdit will detect conserved regions that are not aligned.
  • User-definable parameters include the minimum acceptable conserved region length (Lmin), the maximum entropy allowed for all positions (Hmax) and the number of exceptions (Nex) allowed to the maximum entropy.
  • The summary output file includes the input parameter values used to generate the result, the total number of conserved regions found and a position-by-position listing of the entropy values for each conserved region.
  • Additional FASTA files may be generated for each region found. These files contain the sequence information over the conserved region for each sequence in the alignment.
  • ConFind calculates sequence variability using the traditional Shannon informational entropy formula shown in Equation (1), where p_i is the probability of character i of the alphabet occurring in an aligned column and the logarithm is taken to base 2, so that scores are in bits. BioEdit uses a similar formula based on the natural logarithm. Scores found by BioEdit can be divided by ln(2) to allow direct comparison with scores found by ConFind.

H = -\sum_{i} p_i \log_2 p_i    (1)
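The entropy calculation and the ln(2) conversion described above can be sketched in a few lines of Python. This is an illustration only, not code taken from ConFind; the function names `entropy_bits`, `entropy_nats` and `nats_to_bits` are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(column):
    """Shannon entropy of one alignment column, base-2 logarithm (bits)."""
    counts = Counter(column)
    total = sum(counts.values())
    # p * log2(1/p) is the same term as -p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_nats(column):
    """Same quantity with a natural logarithm (nats)."""
    counts = Counter(column)
    total = sum(counts.values())
    return sum((c / total) * math.log(total / c) for c in counts.values())


def nats_to_bits(h_nats):
    """Rescale a natural-log entropy to bits by dividing by ln(2)."""
    return h_nats / math.log(2)


column = "AACG"
print(entropy_bits(column), nats_to_bits(entropy_nats(column)))
```

The two printed values should agree to within floating-point rounding, which is the sense in which the two scales become directly comparable.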

The BioEdit ‘Find Conserved Regions’ option has several limitations that prevent it from detecting conservation in regions where not all records contain a full-length sequence. ConFind contains two features to avoid this issue:

  • Gaps and ambiguous characters are ignored when calculating the entropy score for a column of an alignment.
  • The minimum number of non-ambiguous characters required for a position to be considered can be specified.
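A minimal sketch of these two features follows. It assumes a DNA alphabet of A, C, G and T and treats every other symbol (gaps, N, IUPAC ambiguity codes) as missing data; the function name `column_score` and the parameter `min_chars` are hypothetical and are not taken from the ConFind source.

```python
import math
from collections import Counter

# Assumption: only these four characters count as non-ambiguous.
UNAMBIGUOUS = frozenset("ACGT")


def column_score(column, min_chars=1):
    """Entropy in bits over non-ambiguous characters, or None.

    None is returned when fewer than `min_chars` non-ambiguous
    characters are present, so the position is not considered.
    """
    counts = Counter(ch for ch in column.upper() if ch in UNAMBIGUOUS)
    total = sum(counts.values())
    if total == 0 or total < min_chars:
        return None
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(column_score("AAAA--NN"))             # gaps and N are ignored
print(column_score("AC------", min_chars=3))  # too few characters
```

Returning `None` rather than a number keeps a sparsely covered column from being mistaken for a perfectly conserved one.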


    RESULTS AND DISCUSSION
 
Figure 1 shows a plot of the error when calculating the entropy for an alignment with missing sequence information. First, the true entropy was calculated for a complete dataset. Then, sequence information was deleted, the entropy recalculated and the true score subtracted from the apparent score. All BioEdit scores have been divided by ln(2) to normalize the entropy scales. The range of entropies for a four-character alphabet is [0,2]. BioEdit may calculate entropies as high as 2.32, as the gap character is included in the alphabet. Figure 1A shows the error in the calculated entropy for 31 positions of 10 influenza A neuraminidase genes. The sequences are shown with deletions in gray. Note the large error at position 21. The deletion of the single occurrence of a character results in a significant error in the calculated entropy. Figure 1B shows the error in the calculated entropy for 101 positions of 95 influenza A hemagglutinin genes with random deletion of up to 25% of the characters in a column. For positions with deleted sequence data in Figure 1B, the average error using ConFind is –0.01 ± 0.03, while the average error using BioEdit is 0.4 ± 0.2. Because BioEdit is guaranteed to generate artificially high entropy scores when an alignment includes gaps or ambiguous characters, ConFind is better suited to finding regions of possible conservation in alignments with missing sequence information.
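The effect of the two gap conventions can be illustrated with a toy column. The sketch below uses invented data, not the influenza alignments from Figure 1, and the helper names `entropy_bits` and `apparent_error` are hypothetical. It computes the apparent score minus the true score, once with gaps ignored and once with the gap counted as a fifth character.

```python
import math
from collections import Counter


def entropy_bits(chars):
    """Shannon entropy in bits over whatever characters are passed in."""
    counts = Counter(chars)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def apparent_error(true_column, observed_column, count_gaps):
    """Apparent entropy minus true entropy for one column."""
    if count_gaps:
        observed = observed_column            # '-' counted as a character
    else:
        observed = observed_column.replace("-", "")  # '-' ignored
    return entropy_bits(observed) - entropy_bits(true_column)


true_col = "AAAAAAAA"   # fully conserved, true entropy is 0 bits
observed = "AAAAAA--"   # two characters deleted

err_ignore = apparent_error(true_col, observed, count_gaps=False)
err_count = apparent_error(true_col, observed, count_gaps=True)
max_five = math.log2(5)  # upper bound for a five-character alphabet

print(err_ignore, round(err_count, 3), round(max_five, 2))
```

For this conserved column, ignoring gaps leaves the score unchanged, while counting the gap as a character raises it by about 0.81 bits. The last value, log2(5), is the source of the 2.32 upper bound quoted above.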


Fig. 1 Comparison of error in calculated Shannon entropy due to missing sequence data using ConFind and BioEdit. Data were randomly deleted from the beginning or end of several sequences. (A) Ten influenza A neuraminidase sequences, 31 positions. Deleted characters are shown in gray. Note the single ‘G’ in the boxed column has been deleted. (B) Ninety-five influenza A hemagglutinin sequences, 101 positions.

 

    ALGORITHM DETAILS
 
In order to identify regions of conservation in a multiple sequence alignment, ConFind attempts to calculate the entropy score for each column, assigning either a bit score or ‘None’ for columns with an insufficient number of non-ambiguous characters. A list of start sites is then generated. This list may include the first position, any position following a position with a score greater than the maximum allowed entropy (Hmax), and any position following a position with a value of ‘None’. Start sites must have an entropy score lower than Hmax. A greedy algorithm is applied to obtain the maximum number of possible positions following each start site, stopping when the next position is ‘None’ or when the next position causes the exception count to exceed the number of allowed exceptions (Nex) to the maximum entropy. If the end of the alignment is reached, no further start positions are examined. Finally, the regions are examined to ensure they are longer than the minimum length (Lmin).
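The search described above can be sketched as follows. This is a reading of the prose, not ConFind's actual source, and it makes several labeled assumptions: the function name `find_conserved_regions` and its parameter names are hypothetical, a score equal to Hmax is treated as conserved, regions are reported as 0-based half-open (start, end) ranges, and a region is kept when its length is at least Lmin.

```python
from typing import List, Optional, Tuple


def find_conserved_regions(
    scores: List[Optional[float]], h_max: float, n_ex: int, l_min: int
) -> List[Tuple[int, int]]:
    """Greedy search over per-column scores (None = too little data)."""
    n = len(scores)
    regions = []
    for start in range(n):
        score = scores[start]
        # A start site needs a usable score within the entropy limit.
        if score is None or score > h_max:
            continue
        # It must be the first position or follow a 'None' or an
        # over-limit position.
        if start > 0:
            prev = scores[start - 1]
            if prev is not None and prev <= h_max:
                continue
        exceptions = 0
        end = start + 1  # exclusive end of the growing region
        while end < n:
            nxt = scores[end]
            if nxt is None:
                break  # never extend across an unscored column
            if nxt > h_max:
                if exceptions + 1 > n_ex:
                    break  # would exceed the allowed exceptions
                exceptions += 1
            end += 1
        if end - start >= l_min:
            regions.append((start, end))
        if end == n:
            break  # end of alignment reached: stop examining starts
    return regions


scores = [0.0, 0.2, None, 0.1, 0.0, 1.5, 0.0, 0.3, 1.8, 0.0]
print(find_conserved_regions(scores, h_max=0.5, n_ex=1, l_min=3))
```

Because every position after an over-limit column is a start site, regions found this way may overlap when exceptions are allowed, as the two regions in the example do.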


    Acknowledgments
 
The authors thank Dr Catherine Smith and Dr Rebecca Garten of the Influenza Branch at the Centers for Disease Control and Prevention for their feedback concerning this software. The authors also acknowledge funding from the National Institute of Allergy and Infectious Diseases (U01 AI056528).

Conflict of Interest: none declared.

Received on August 10, 2005; revised on September 28, 2005; accepted on October 14, 2005

    REFERENCES
 

    Clewley, J.P. (2004) A role for arrays in clinical virology: fact or fiction? J. Clin. Virol., 29, 2–12.

    Gibbs, A., et al. (1998) The GPRIME package: computer programs for identifying the best regions of aligned genes to target in nucleic acid hybridization-based diagnostic tests, and their use with plant viruses. J. Virol. Methods, 74, 67–76.

    Hilleman, M.R. (2002) Realities and enigmas of human viral influenza: pathogenesis, epidemiology and control. Vaccine, 20, 3068–3087.

    Ivshina, A.V., et al. (2004) Mapping of genomic segments of influenza B virus strains by an oligonucleotide microarray method. J. Clin. Microbiol., 42, 5793–5801.

    Mehlmann, M., Dawson, E.D., Townsend, M.B., Smagala, J.A., Moore, C.L., Smith, C.B., Cox, N.J., Kuchta, R.D., Rowlen, K.L. (2005) Manuscript in preparation.

    Rodriguez, A., et al. (1992) Primer design for specific diagnosis by PCR of highly variable RNA viruses: typing of foot-and-mouth disease virus. Virology, 189, 363–367.

    Ruest, A., et al. (2003) Comparison of the directigen flu A+B test, the QuickVue influenza test, and clinical case definition to viral culture and reverse transcription-PCR for rapid diagnosis of influenza virus infection. J. Clin. Microbiol., 41, 3487–3493.

    Striebel, H.-M., et al. (2003) Virus diagnostics on microarrays. Curr. Pharm. Biotech., 4, 401–415.

    Zammatteo, N., et al. (2002) New chips for molecular biology and diagnostics. Biotech. Ann. Rev., 8, 85–101.

