Bioinformatics Advance Access originally published online on December 15, 2005
Bioinformatics 2006 22(4):495-496; doi:10.1093/bioinformatics/btk006
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SEAN: SNP prediction and display program utilizing EST sequence clusters

Derek Huntley 1,*, Angela Baldo 4, Saurabh Johri 2 and Marek Sergot 3

1Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College London SW7 2AZ, UK
2Centre for Molecular Microbiology and Infection, Division of Investigative Sciences, Imperial College London SW7 2AZ, UK
3Department of Computing, Imperial College London SW7 2AZ, UK
4USDA-ARS Plant Genetic Resources Unit, New York State Agricultural Experiment Station Geneva, NY 14456, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: SEAN is an application that predicts single nucleotide polymorphisms (SNPs) using multiple sequence alignments produced from expressed sequence tag (EST) clusters. The algorithm uses rules of sequence identity and SNP abundance to determine the quality of the prediction. A Java viewer is provided to display the EST alignments and predicted SNPs.

Availability: SEAN is freely available from http://zebrafish.doc.ic.ac.uk/Sean

Contact: d.huntley{at}imperial.ac.uk


    INTRODUCTION
 
Expressed sequence tags (ESTs) are an important resource for identifying polymorphisms in transcribed regions. In humans, for example, estimates of polymorphism are in the range of 1 every 1.3 kb (Sachidanandam et al., 2001) and in cultivated tomatoes 1 every 7 kb (Nesbitt and Tanksley, 2002). SEAN provides a method to predict and visualize single nucleotide polymorphisms (SNPs) using EST sequence clusters. EST data have previously been used for SNP prediction by programs such as AutoSNP (Barker et al., 2003), PolyPhred (Nickerson et al., 1997), PolyBayes (Marth et al., 1999), TRACE_DIFF (Bonfield et al., 1998) and HarvEST (HarvEST Home Page available at http://harvest.ucr.edu). Whereas HarvEST provides pre-built SNP prediction libraries, AutoSNP, PolyPhred and PolyBayes, like SEAN, enable the prediction of SNPs from a user's own EST dataset. SEAN, as with AutoSNP, uses the redundancy of the SNP in an alignment as a measure of confidence but reinforces this with a measure of sequence identity in the surrounding aligned sequences. Unlike the other tools listed, SEAN also allows for the inclusion of library data to further support SNP predictions. A Java viewer is included that enables the visualization of the alignments and SNP predictions for user inspection.

The search strategy for SEAN is based on the work of Picoult-Newberg et al. (1999). The sequence assembly program Phrap (available at http://www.phrap.org) is used to build a consensus from the clustered sequences. Using the output file produced by the Phrap ‘ace’ flag, SEAN builds the sequence alignment, including the consensus, and parses it to find potential SNPs.
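As a rough illustration of this step, the sketch below reads a simplified Phrap ‘ace’ file and lays each read out against the padded consensus. It is not SEAN's own parser (the analysis component of SEAN is written in Perl). The function names, the use of '.' for columns a read does not cover and the conversion of the ace pad character '*' to '-' are assumptions made for the example. Only the CO, AF and RD records are interpreted; read orientation and quality clipping are ignored.

```python
def _read_sequence(lines, i):
    """Collect wrapped sequence lines starting at index i, up to a blank line."""
    chunks = []
    while i < len(lines) and lines[i].strip():
        chunks.append(lines[i].strip())
        i += 1
    return "".join(chunks), i


def read_ace_alignment(ace_text, uncovered=".", gap="-"):
    """Return {contig: (consensus, {read: row})} from ace-format text.

    Hypothetical helper: each row has the same length as the padded
    consensus, so rows can be compared with it column by column.
    """
    contigs = {}
    current = None
    lines = ace_text.splitlines()
    i = 0
    while i < len(lines):
        fields = lines[i].split()
        i += 1
        if not fields:
            continue
        tag = fields[0]
        if tag == "CO":
            # CO <contig> <padded length> <reads> <segments> <U|C>
            seq, i = _read_sequence(lines, i)
            current = {"consensus": seq.upper().replace("*", gap),
                       "starts": {}, "rows": {}}
            contigs[fields[1]] = current
        elif tag == "AF" and current is not None:
            # AF <read> <U|C> <1-based padded start in the consensus>
            current["starts"][fields[1]] = int(fields[3])
        elif tag == "RD" and current is not None:
            # RD <read> <padded bases> <info items> <tags>, then the bases
            seq, i = _read_sequence(lines, i)
            seq = seq.upper().replace("*", gap)
            start = current["starts"].get(fields[1], 1)
            row = [uncovered] * len(current["consensus"])
            for offset, base in enumerate(seq):
                col = start - 1 + offset
                if 0 <= col < len(row):
                    row[col] = base
            current["rows"][fields[1]] = "".join(row)
    return {name: (c["consensus"], c["rows"]) for name, c in contigs.items()}
```

Because every row is padded to the length of the consensus, a later pass can compare read and consensus one column at a time without tracking offsets.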

Five output files are produced by SEAN: three reference files and two Java configuration files. The first two reference files contain the sequence alignments (only those regions that align with the consensus are in the first file, the full alignments are in the second) together with a list of the potential SNPs and their locations and the consensus sequence in FASTA format. The third reference file contains a listing of the contigs produced by Phrap and their details—sequences, average sequence length and number of predicted SNPs. There is an option to include cultivar and library information for an improved SNP prediction. If this is used an additional output file details the predicted SNP position within the consensus and the number of occurrences of each base within each library at that position. This is provided to give additional evidence of the quality of the predicted SNP.

Two Java configuration files are also produced, one for the alignments only and one for the complete sequences. These are for a Java viewer that has been developed to enable visual inspection of the alignments and predicted SNPs. The viewer has been developed using the Neomorphic Genomic Software Development Kit (NGSDK) (available at http://www.affymetrix.com). The viewer displays the sequences as solid bars, with the position of any potential SNPs shown by red points at the top of the display and the positions in the relevant sequences highlighted in red. If the SNP predictions have been generated using the library and cultivar data then the SNPs predicted with the lower confidence are coloured green to distinguish them. The display can be zoomed, and at full horizontal zoom the bars are overlaid with their nucleotide sequences.

SEAN requires Perl and Phrap for the analysis component and Java (1.3+) for the viewer.


    IMPLEMENTATION
 
The SEAN-generated sequence alignment is parsed one base position at a time to find potential SNPs by comparing the base at each position with the corresponding consensus base. When a base difference is found, the surrounding sequence is compared with the consensus over a defined window (by default 15 bp either side of the base, but configurable when running SEAN) in order to eliminate poor-quality sequence. If the sequences in the windows are identical, the base and its position are flagged, and stored as a predicted SNP only if another identical base change is found at the same position in at least one other sequence. A further check is made that the consensus base is present in at least two sequences, as the consensus produced by Phrap does not always contain the dominant base at a particular position.
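The rules in this paragraph can be sketched as follows. This is a minimal illustration rather than the SEAN source. The function name `predict_snps` is hypothetical, and the sketch assumes that every aligned row has the same length as the consensus. Two further choices are assumptions where the text is silent: positions without a full window on both sides are skipped, and the second supporting sequence must also pass the window check.

```python
from collections import Counter


def predict_snps(consensus, rows, window=15, min_support=2):
    """Return (position, consensus_base, variant_base, count) tuples.

    consensus: consensus string; rows: aligned reads, each the same length
    as the consensus. Positions are 0-based.
    """
    snps = []
    n = len(consensus)
    for pos in range(n):
        ref = consensus[pos]
        lo, hi = pos - window, pos + window + 1
        if ref not in "ACGT" or lo < 0 or hi > n:
            continue  # assumption: skip positions without a full window
        flagged = Counter()  # variant base -> reads passing the window check
        ref_count = 0        # reads carrying the consensus base
        for row in rows:
            base = row[pos]
            if base == ref:
                ref_count += 1
            elif base in "ACGT":
                left_ok = row[lo:pos] == consensus[lo:pos]
                right_ok = row[pos + 1:hi] == consensus[pos + 1:hi]
                if left_ok and right_ok:
                    flagged[base] += 1
        for base, count in sorted(flagged.items()):
            # the same change in at least two reads, and the consensus
            # base itself present in at least two reads
            if count >= min_support and ref_count >= min_support:
                snps.append((pos, ref, base, count))
    return snps
```

With `window=2` and the consensus `ACGTACGTACG`, two reads carrying `T` at position 5 alongside two reads matching the consensus yield one prediction. A single read with a change elsewhere, or a read with a second mismatch inside the window, contributes nothing.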

The prediction requirements mean that at least four overlapping sequences are needed for a potential SNP to be found: two carrying the variant base and two carrying the consensus base. Clusters containing large numbers of sequences are also unusable owing to the memory requirements of Phrap. The resources of the host computer determine the limit; on a standard PC with 1 GB RAM up to 500 sequences can be handled satisfactorily, depending upon their compositional similarity. Pre-clustering sequences using CAP3 (Huang and Madan, 1999) or Gap (Bonfield et al., 1995) can reduce the number of sequences handled simultaneously.

If the sequences within the window either side of the potential SNP contain gaps introduced to facilitate alignment, the window size is increased accordingly so that the required number of actual nucleotides is checked. Gaps are also sometimes included in the consensus produced by Phrap when they are not in the majority of the aligned sequences, occasionally when present in only one aligned sequence. If such gaps are present within the window region they could prejudice against the selection of a potential SNP, so to compensate for this the window sequences are screened to ensure they are identical at the associated position.
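The window extension described in the first sentence of this paragraph might look like the sketch below. It assumes '-' as the gap character and uses the hypothetical helpers `window_columns` and `flanks_match`. The additional screening for gaps that Phrap places in the consensus is not modelled here.

```python
GAP = "-"  # assumed gap character in the aligned rows


def window_columns(row, pos, window, step):
    """Columns visited walking from pos in direction step (+1 or -1)
    until `window` non-gap bases of `row` have been covered.

    Returns None if the row ends before enough bases are found.
    """
    cols = []
    seen = 0
    i = pos + step
    while 0 <= i < len(row) and seen < window:
        cols.append(i)
        if row[i] != GAP:
            seen += 1
        i += step
    return cols if seen == window else None


def flanks_match(consensus, row, pos, window=15):
    """True if row agrees with the consensus over a gap-extended window
    on both sides of pos."""
    for step in (-1, 1):
        cols = window_columns(row, pos, window, step)
        if cols is None:
            return False
        if any(row[c] != consensus[c] for c in cols):
            return False
    return True
```

For the row `ACG-TTCGTA` with a candidate at position 5 and `window=3`, the left-hand walk covers columns 4, 3, 2 and 1. The alignment gap in column 3 widens the window by one column so that three real nucleotides are still compared.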

The quality of the SNP predictions can be strengthened by the inclusion of cultivar and library data. SEAN reads an optional file containing these data for each sequence and separately labels predicted SNPs that are present in at least two libraries from the same cultivar. These SNPs are also coloured differently in the Java viewer so that they can be readily identified.
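One way to picture the library check is the sketch below. The format of the optional file is not described here, so the cultivar and library of each sequence are passed in as a plain dictionary. That dictionary and the names `library_support` and `supported_by_libraries` are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def library_support(pos, rows, meta):
    """Tally bases at one alignment column by library and by cultivar.

    rows: {seq_id: aligned row}; meta: {seq_id: (cultivar, library)}.
    Returns (counts, libs) where counts maps library -> {base: n} and
    libs maps (cultivar, base) -> set of libraries showing that base.
    """
    counts = defaultdict(Counter)
    libs = defaultdict(set)
    for seq_id, row in rows.items():
        base = row[pos]
        if base not in "ACGT" or seq_id not in meta:
            continue  # skip gaps, uncovered columns and unlabelled reads
        cultivar, library = meta[seq_id]
        counts[library][base] += 1
        libs[(cultivar, base)].add(library)
    return {lib: dict(c) for lib, c in counts.items()}, dict(libs)


def supported_by_libraries(variant, libs, min_libraries=2):
    """True if the variant base occurs in at least `min_libraries`
    libraries belonging to a single cultivar."""
    return any(base == variant and len(found) >= min_libraries
               for (_cultivar, base), found in libs.items())
```

The per-library tallies correspond to the kind of information reported in the additional output file (occurrences of each base within each library at the SNP position), while `supported_by_libraries` applies the two-library rule.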


    VALIDATION
 
In silico validation of SEAN has been carried out by searching mouse and human UniGene (Boguski and Schuler, 1995) clusters and confirming the predicted SNPs using the relevant dbSNP databases (Sherry et al., 2001). UniGene clusters were selected with the minimum number of four sequences required for SNP prediction and a maximum number of 500. This provided 27 169 human and 29 360 mouse clusters from which 128 408 human and 328 714 mouse SNPs were predicted. dbSNP contained 9 123 517 human and 506 198 mouse SNPs and confirmed 32 150 human predicted SNPs (25%) and 8528 mouse (24%).

SEAN has been used to successfully identify SNPs among public ESTs from tomato cultivars. Among 53 re-sequenced contigs in two or three cultivars, 21 confirmed the SNPs predicted by SEAN (Labate and Baldo, 2005). Five additional SNPs were visible in the SEAN viewer but not predicted because they fell within 15 bp of each other. Overall efficiency of SNP discovery/confirmation was increased 10-fold using SEAN to target SNP-containing regions relative to sequencing arbitrary regions of the genome (Labate and Baldo, 2005). Further validation results are documented on the website (SEAN SNP prediction and display programs available at http://zebrafish.doc.ic.ac.uk/Sean/).


    Acknowledgments
 
The authors gratefully acknowledge Joanne Labate for the confirmation data of SNPs in cultivated tomato and constructive suggestions for improvements in the SEAN prediction package and viewer. The authors also thank Elizabeth Fisher for the original idea for the development of SEAN.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

Received on June 29, 2005; revised on November 23, 2005; accepted on December 11, 2005

    REFERENCES

    Barker, G., et al. (2003) Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics, 19, 421–422.

    Boguski, M.S. and Schuler, G.D. (1995) ESTablishing a human transcript map. Nat. Genet., 10, 369–371.

    Bonfield, J.K., et al. (1995) A new DNA sequence assembly program. Nucleic Acids Res., 24, 4992–4999.

    Bonfield, J.K., et al. (1998) Automated detection of point mutations using fluorescent sequence trace subtraction. Nucleic Acids Res., 26, 3404–3409.

    Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868–877.

    Labate, J. and Baldo, A. (2005) Targeted discovery of highly polymorphic genes in tomato cultivars. Molecular Breeding, 16, 343–349.

    Marth, G.T., et al. (1999) A general approach to single-nucleotide polymorphism discovery. Nat. Genet., 23, 452–456.

    Nesbitt, T.C. and Tanksley, S.D. (2002) Comparative sequencing in the genus Lycopersicon. Implications for the evolution of fruit size in the domestication of cultivated tomatoes. Genetics, 162, 365–379.

    Nickerson, D.A., et al. (1997) PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res., 25, 2745–2751.

    Picoult-Newberg, L., et al. (1999) Mining SNPs from EST databases. Genome Res., 9, 167–174.

    Sachidanandam, R., et al. (2001) A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409, 928–933.

    Sherry, S.T., et al. (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29, 308–311.

