Bioinformatics Advance Access originally published online on February 2, 2005
Bioinformatics 2005 21(9):2133-2135; doi:10.1093/bioinformatics/bti298

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

DNA polymorphism detector: an automated tool that searches for allelic matches in public databases for discrepancies found in clone or cDNA sequences

Chih-Yu (Carol) Chang 1 and Joshua LaBaer 2,*

1 Chemical Biology Platform, Broad Institute of Harvard and MIT, Cambridge, MA 02141–2023, USA
2 Harvard Institute of Proteomics, Harvard Medical School, Boston, MA 02141, USA

*To whom correspondence should be addressed.


    Abstract

Summary: DNA polymorphism detector (DPD) is a new web application developed to help automate the process of cDNA clone validation. DPD identifies and highlights discrepancies between any cDNA clone sequence and its expected reference sequence. To determine if these differences correspond to natural genetic polymorphisms (versus artifacts introduced during clone production or evaluation), DPD uses the discrepancies, along with flanking sequences, to search GenBank for identical matching strings. If matching DNA sequences are found, DPD verifies that they are from the same gene. The application then reports the discrepancy as a polymorphism along with the corresponding GenBank reference information.

Availability: DPD is currently hosted by the Harvard Institute of Proteomics at http://www.hip.harvard.edu

Contact: carol_chang{at}hms.harvard.edu

The current trend toward high-throughput protein expression applications has created a demand for large repositories of cloned cDNAs (Pearlberg and LaBaer, 2004). The DNA sequence validation of these clones has lagged behind their production, in part because of the lack of software available to automate the validation process. One challenge, particularly for organisms where the cloning source material has come from multiple genetically diverse individuals, is determining the source of any discrepancies that are found between the clone sequence and its expected reference sequence. Discrepancies between these two can arise because the template used for producing the clone contained a natural sequence variant, or polymorphism, or because of mutations incurred during the manufacture of the clone. Discrepancies arising from natural variation are generally considered acceptable for functional experiments, whereas those introduced by the manufacturing process render the clone invalid for use, making it important to distinguish the two.

The manual process of identifying the differences between the sequences and determining their origin is tedious and time consuming. The steps include: (1) aligning the clone and reference sequences (William, 1997, http://fasta.bioch.virginia.edu/fasta_www/align.htm), (2) identifying any discrepancies, (3) searching GenBank for sequences that match any discrepant sequences and (4) verifying that the matching sequences are from the same gene. Finding an independent example of the same gene that precisely matches the clone sequence provides compelling evidence that a polymorphism has been identified. The DNA polymorphism detector (DPD) was developed to automate this manual process, mainly using Java 2 technology, including JavaServer Pages and Servlets (Ivar, 1992). DPD automatically aligns both sequences using the Needleman–Wunsch global alignment algorithm (Needleman and Wunsch, 1970) and presents the user with a check box list of all discrepancies. Once the user selects the discrepancies of interest, the software appends a user-defined number of flanking bases to both sides of each one and then compares the resulting subsequence to selected GenBank databases using BLAST (Altschul et al., 1990). Any matches found are then compared against the full-length starting sequence using Pairwise BLAST to confirm that the genes are the same. The final output is a list of all searched discrepancies, each shown with its search string and annotated with the matching GenBank ID numbers, if there are any. On the results page, users can select any match in order to view its sequence aligned with either the clone or the reference sequence.
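The first two steps (global alignment, then listing the discrepancies) can be illustrated with a minimal Python sketch. It is not DPD's code, which is Java based and not reproduced in this article. The function names and the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for illustration; the scoring scheme DPD actually uses is not stated here.

```python
def needleman_wunsch(ref, clone, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; returns (aligned_ref, aligned_clone).

    Scoring values are illustrative assumptions, not DPD's settings.
    """
    def sub(a, b):
        return match if a == b else mismatch

    n, m = len(ref), len(clone)
    # score[i][j] = best score for aligning ref[:i] with clone[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(ref[i - 1], clone[j - 1]),
                score[i - 1][j] + gap,   # gap in the clone
                score[i][j - 1] + gap,   # gap in the reference
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_ref, out_clone = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + sub(ref[i - 1], clone[j - 1])):
            out_ref.append(ref[i - 1])
            out_clone.append(clone[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_ref.append(ref[i - 1])
            out_clone.append("-")
            i -= 1
        else:
            out_ref.append("-")
            out_clone.append(clone[j - 1])
            j -= 1
    return "".join(reversed(out_ref)), "".join(reversed(out_clone))


def find_discrepancies(aligned_ref, aligned_clone):
    """Return (alignment_column, ref_base, clone_base) for every differing column."""
    return [(k, r, c)
            for k, (r, c) in enumerate(zip(aligned_ref, aligned_clone))
            if r != c]
```

For example, aligning the reference ACGTACGT with the clone ACGTTCGT under these assumed scores yields an ungapped alignment, and find_discrepancies reports a single substitution at column 4 (A in the reference, T in the clone).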

The user interface for displaying the clone and reference sequence alignment results is illustrated in Figure 1A. To the right of the alignment is a list of all the discrepancies, with check boxes for selecting which ones should be interrogated. Users can assign the number of flanking base pairs to be appended to each side of the discrepancy to create the search string for the GenBank search. We have found empirically that 20 bases (the default) usually provides enough specificity to avoid finding unrelated genes without being so long that the search strings frequently overlap with other nearby discrepancies. Users can also specify the criteria used during the Pairwise BLAST step to confirm that matches correspond to the same gene, in order to filter out matching sequences from other genes (such as paralogs or members of related gene families). The comparison requires a minimum sequence identity (default is 95%) over a minimum alignment length (default is 100 bases). If these two criteria are met, the genes are considered the same. Users may select up to three GenBank databases for the search, which also makes it possible to use this tool to compare closely related sequences between species. If users only wish to confirm the existence of other database entries that match their clone's sequence, but do not need to determine how many there are or examine them all in detail, they can elect to limit the number of hits returned per discrepancy, thereby speeding up the application.
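The two user-adjustable settings described above can be sketched in Python as follows. This is an illustration under stated assumptions, not DPD's implementation: the function names are hypothetical, the defaults (20 flanking bases, 95% identity, 100 aligned bases) come from the text, and both the clamping of flanks at the sequence ends and the use of inclusive thresholds are assumptions.

```python
def build_search_string(clone_seq, start, end, flank=20):
    """Return the discrepant bases clone_seq[start:end] plus flanking bases.

    Up to `flank` bases are added on each side; near the ends of the
    sequence the flank is simply truncated (an assumption of this sketch).
    """
    left = max(0, start - flank)
    right = min(len(clone_seq), end + flank)
    return clone_seq[left:right]


def is_same_gene(percent_identity, alignment_length,
                 min_identity=95.0, min_length=100):
    """Apply the two same-gene criteria to one pairwise comparison.

    Both criteria must hold: identity at or above the minimum, over an
    alignment at least as long as the minimum length.
    """
    return percent_identity >= min_identity and alignment_length >= min_length
```

With the default flank, a single-base discrepancy in the interior of a clone produces a 41-base search string (20 + 1 + 20). A hit with 97.5% identity over only 80 aligned bases would be rejected by is_same_gene, because the alignment is shorter than the 100-base minimum.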



Fig. 1 (A) The results of aligning the clone and reference sequences. On the left the two sequences are aligned and each discrepancy is highlighted in red. On the right is a list of discrepancies which are hyperlinked to their location on the alignment. Checking the box will include that discrepancy in the search for polymorphisms. (B) The polymorphism report. Each searched discrepancy is listed along with the actual search string containing the discrepancy indicated in red. Below each discrepancy is a list of GI numbers corresponding to cognate genes that contain exact sequence matches to the search string.

 
The results of the search are illustrated in Figure 1B. General gene information is listed at the top. Each searched discrepancy follows, including the search string and a list of matching GI numbers. If the same GI number appears in the match lists of all the discrepancies, this suggests that these discrepancies travel together as a single genetic allele.
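The check for a GI number shared by every discrepancy amounts to a set intersection. The short Python sketch below is illustrative only: the function name, the discrepancy labels and the GI numbers in the usage note are made up, and DPD itself presents the lists for the user to inspect rather than necessarily computing this.

```python
def shared_gi_numbers(matches_by_discrepancy):
    """Return the GI numbers present in the match list of every discrepancy.

    `matches_by_discrepancy` maps a discrepancy label to the list of GI
    numbers whose sequences matched that discrepancy's search string.
    """
    gi_sets = [set(gis) for gis in matches_by_discrepancy.values()]
    if not gi_sets:
        return set()
    return set.intersection(*gi_sets)
```

For instance, with the hypothetical input {"A123G": [111, 222], "C456T": [222, 333]}, only GI 222 matches both discrepancies, so it is the one candidate for an entry carrying both variants on a single allele.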

In conclusion, to validate the sequences of cloned cDNAs, the clone sequences must be compared with reference sequences. Any discrepancies could arise from: (1) sequencing errors, (2) mutations introduced during the cloning process and (3) natural polymorphisms found in the population. Natural polymorphisms are important to identify because they represent real differences between the clone and the reference that are still considered valid for use in functional studies. To identify natural polymorphisms, DPD implements several analysis steps automatically that search for independent examples of the same gene in GenBank that match the clone sequence, and organizes the results for display in a user-friendly manner.


    Acknowledgments
 
We wish to thank the members of the Harvard Institute of Proteomics and Steve Wong for their ideas and input. We also want to thank Stuart Schreiber and Scott Eliasof for reviewing the manuscript.

Received on June 7, 2004; revised on January 18, 2005; accepted on January 26, 2005

    REFERENCES

    Pearlberg, J. and LaBaer, J. (2004) Protein expression clone repositories for functional proteomics. Curr. Opin. Chem. Biol., 8, 98–102.

    William, R.P. (1997) University of Virginia, Virginia, USA. http://fasta.bioch.virginia.edu/fasta_www/align.html.

    Ivar, J. (1992) Object-Oriented Software Engineering: A Use Case Driven Approach. Addison-Wesley Pub. Co.

    Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.

    Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.

