Skip Navigation


Bioinformatics Advance Access originally published online on March 3, 2005
Bioinformatics 2005 21(10):2528-2530; doi:10.1093/bioinformatics/bti354
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2528    most recent
bti354v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Unneberg, P.
Right arrow Articles by Sterky, F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Unneberg, P.
Right arrow Articles by Sterky, F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

SNP discovery using advanced algorithms and neural networks

Per Unneberg {dagger}, Michael Strömberg {dagger} and Fredrik Sterky *

Royal Institute of Technology, AlbaNova University Center, Department of Biotechnology S-106 91 Stockholm, Sweden

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 

Summary: Forage is an application which uses two neural networks for detecting single nucleotide polymorphisms (SNPs). Potential SNP candidates are identified in multiple alignments. Each candidate is then represented by a vector of features, which is classified as SNP or monomorphic by the networks. A validated dataset of SNPs was constructed from experimentally verified SNP data and used for network training and method evalutation.

Availability: The package is available at biobase.biotech.kth.se/forage/

Contact: fredrik{at}biotech.kth.se

Supplementary information: Additional results and method description can be found at biobase.biotech.kth.se/forage/


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 
To benefit from studies of human variation (The International HapMap Consortium, 2003), the need for reliable methods for discovery of single nucleotide polymorphisms (SNPs) remains high. Public databases based on transcript sequences (Strausberg et al., 2000; Boguski et al., 1993) are still rich resources for detection of SNP candidates. Several solutions for SNP finding have been proposed (Barker et al., 2003; Buetow et al., 1999; Marth et al., 1999; Picoult-Newberg et al., 1999; Dantec et al., 2004), generally employing scoring schemes that identify potential SNPs once the score exceeds a threshold. To date, the best performing methods are based on Bayesian statistics (Marth et al., 1999; Dantec et al., 2004). Here, we introduce a new methodology for SNP candidate evaluation that exploits the ideas of the Bayesian approach, as well as the classification properties of neural networks.

In its most simple form, SNP discovery consists of scanning multiple alignments for columns (‘slices’) where base discrepancies occur. However, not all such slices correspond to true SNPs because of sequencing errors. Our rationale to SNP discovery has been to train neural networks with a large number of slices that represent true SNPs as well as slices with sequencing errors. Since the accuracy of such an approach relies strongly on the quality of the input data, we took great care in selecting a validated SNP set from the Cancer Genome Anatomy Project (gai.nci.nih.gov). In addition, a set of synthetic monomorphic slices was added, in which sequencing errors were introduced according to the sequence error distribution of the validated data. Consequently, this provided us with a set of slices where each slice is known to be monomorphic or a true SNP. The final verification dataset consisted of 22 222 alignment slices, with 7810 slices representing validated SNP sites. A total of 1154 monomorphic slices contained sequencing errors. The verification dataset was further partitioned into a training set (13 334 slices) for network training, a test set (6666 slices) for optimization of network performance and a validation set (2222 slices) for subsequent comparison with other methods. As the validated SNPs correspond to experimental data and the sample is large, the corresponding quality values and alignment depths provided a sound representation of typical datasets generated by EST sequencing projects.

In contrast to previous methods that use preset thresholds to separate SNP candidates from sequencing errors, we introduce a dynamic threshold for identifying candidate SNPs and we take advantage of the non-linear classification abilities of neural networks (Ripley, 1994). We have implemented these principles in our application Forage, which uses a dual network design. The risk of false classification is reduced since an SNP is scored only if both networks unanimously classify a slice as an SNP.


    DESIGN PRINCIPLES
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 
The basic idea of Forage is to convert information in an alignment slice into a vector x, consisting of n features defined on some feature space F. In the current implementation, we have chosen the largest base quality value for the major allele and the two most frequent minor alleles, the major allele frequency and a Bayesian probability (Marth et al., 1999) that a slice is polymorphic, giving a vector with five features. However, other features may be defined. For example, by defining a feature set that excludes quality values, SNP finding could be performed on data that lack quality values, a common case for EST data (Jongeneel, 2000).

The networks are variants of self-organising maps that have been optimized for classification tasks. The first network is based on Learning Vector Quantization (LVQ) algorithms (Kangas et al., 1990) and the second uses an applied Optimal Brain Damage (OBD) algorithm (LeCun et al., 1990). Basically, each network consists of a set of k codebook vectors {mi; mi F, i = 1,2,...,k}.

For any classification task, it is essential to select an optimal number of codebook vectors for each class, and to condition the input data in order to minimize the risk of misclassification (Kangas et al., 1990). The vectors of LVQ neural networks are fine-tuned during training to specific properties of the input data (Kangas et al., 1990; Kohonen, 1995). By assigning class labels to the codebook vectors, the map can be used for classification. Also, OBD optimizes the codebook set, but by reducing redundancy. The initial codebook presented to the OBD algorithm is based on all elements in the training set, with training proceeding until the codebook contains only non-redundant vectors that still ensure maximum classification accuracy. Once the networks have been fine-tuned, classification of unknown slices can be performed.


    SNP CLASSIFICATION
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 
Forage reads Paracel Transcript Assembler (PTA) (www.paracel.com) or ace assembly files. Bayesian paralog filtration (Marth et al., 1999) is performed before alignment positions with discrepancies are located. From the slice, a vector of SNP features is constructed which is subsequently used for classification. A number of filtering options can be set by the user, such as the requirement that the alignment depth be above a certain cutoff. After passing through the filtering module, the slice is classified by both networks. Only if both networks classify the slice as being an SNP, is the slice scored as an SNP candidate (Fig. 1).



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 1 Overview of Forage system design, including a representation of the feature vector components used in this work. allele3 corresponds to the major allele, allele2 to the most frequent minor allele and allele1 to the second most frequent minor allele.

 
During the analysis phase, statistics relating to the SNP discovery phase are evaluated. Totals showing the number of sequences and contigs evaluated, paralogs filtered, slices rejected by the neural network and SNP candidates are presented. The observed rate of polymorphism, the genetic variation distribution, transition and transversion statistics are calculated. SNP candidates are further categorized as being rare (minor allele frequency ≤5%), common (minor allele frequency >5% and <20%) or frequent (minor allele frequency ≥20%). Although biased towards high-frequency alleles, these numbers can be used to estimate the true allele frequency distributions (Nielsen et al., 2004).


    PERFORMANCE
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 
The performance of Forage was compared with that of two other SNP finding methods, PolyBayes (Marth et al., 1999) and a method based on the neighbourhood quality standard (NQS) (Altshuler et al., 2000). Evaluation was done on the validation set. Forage was performed with slightly fewer erroneous classifications than PolyBayes, whereas NQS had a significantly higher number of classification errors. Table 1 shows how the classification errors depend on read coverage. Additional results can be found at the Supplementary web page.


View this table:
[in this window]
[in a new window]
 
Table 1 The total number of classification errors (false positives and false negatives) for NQS, PolyBayes and Forage

 
Further benchmarks on larger datasets would improve performance. In particular, selection of SNP features could be optimized to compensate for cases where the Bayesian probability leads to erroneous conclusions. However, experimentally verified SNP datasets are difficult to acquire.

In this version, only sequencing errors have been included as sources of false SNP sites. Errors introduced by amplification of DNA or reverse transcription often result in base calls with high-quality values, and are more difficult to distinguish from true SNPs. By modelling these errors into the synthetic monomorphic dataset, Forage could be taught to take such errors into account.


    Acknowledgments
 
We thank Jacob Odeberg and Gabor Marth for the valuable input and the comments to the text. This work was conducted under grant from The Knut and Alice Wallenberg Foundation.


    Footnotes
 
{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. Back

Received on December 9, 2004; revised on January 13, 2005; accepted on February 23, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 DESIGN PRINCIPLES
 SNP CLASSIFICATION
 PERFORMANCE
 REFERENCES
 

    Altshuler, D., et al. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature, 407, 513–516[CrossRef][Medline].

    Barker, G., et al. (2003) Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics, 19, 421–422[Abstract/Free Full Text].

    Boguski, M.S., et al. (1993) dbEST-database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333[CrossRef][ISI][Medline].

    Buetow, K.H., et al. (1999) Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet., 21, 323–325[CrossRef][ISI][Medline].

    Dantec, L., et al. (2004) Automated SNP detection in expressed sequence tags: statistical considerations and application to maritime pine sequences. Plant Mol. Biol., 54, 461–470[CrossRef][ISI][Medline].

    Jongeneel, C.V. (2000) Searching the expressed sequence tag (EST) databases: panning for genes. Brief. Bioinformatics, 1, 76–92[Abstract/Free Full Text].

    Kangas, J.A., et al. (1990) Variants of self-organizing maps. IEEE Transactions on Neural Networks, 1, 93–99.

    Kohonen, T. Self-Organizing Maps, (1995) 3rd edn , Heidelberg Springer-Verlag 30, .

    LeCun, Y., et al. (1990) Optimal brain damage. In Touretzky, D.S. (Ed.). Advances in Neural Information Processing Systems II, , San Mateo, CA Morgan Kauffman.

    Marth, G.T., et al. (1999) A general approach to single-nucleotide polymorphism discovery. Nat. Genet., 23, 452–456[CrossRef][ISI][Medline].

    Nielsen, R., et al. (2004) Reconstituting the frequency spectrum of ascertained SNP data. Genetics, 168, 2373–2382[Abstract/Free Full Text].

    Picoult-Newberg, L., et al. (1999) Mining SNPs from EST databases. Genome Res., 9, 167–174[Abstract/Free Full Text].

    Ripley, B.D. (1994) Neural networks and related methods for classification. J. R. Stat. Soc. B, 56, 409–456.

    Strausberg, R.L., et al. (2000) The cancer genome anatomy project: building an annotated gene index. Trends Genet., 16, 103–106[CrossRef][ISI][Medline].

    The International HapMap Consortium. (2003) The international HapMap project. Nature, 426, 789–796[CrossRef][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/10/2528    most recent
bti354v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Unneberg, P.
Right arrow Articles by Sterky, F.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Unneberg, P.
Right arrow Articles by Sterky, F.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?