Bioinformatics Advance Access originally published online on April 26, 2007
Bioinformatics 2007 23(13):1692-1693; doi:10.1093/bioinformatics/btm154
SNP detection exploiting multiple sources of redundancy in large EST collections improves validation rates
1Animal Genetics and Genomics, Primary Industries Research Victoria, 475 Mickleham Rd, Attwood, Victoria, Australia 3049, 2Department of Animal and Aquacultural Sciences, Norwegian University of Life Sciences, Box 5003, N-1432 Aas, Norway, 3Centre for Integrative Genetics, Norwegian University of Life Sciences, Box 5003, N-1432 Aas, Norway and 4The Norwegian Pig Breeders Association (NORSVIN), Hamar, Norway
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Single nucleotide polymorphism (SNP) detection exploiting redundancy in expressed sequence tag (EST) collections that arises from the presence of transcripts of the same gene from different individuals has been used to generate large collections of SNPs for many species. A second source of redundancy, namely that EST collections can contain multiple transcripts of the same gene from the same individual, can be exploited to distinguish true SNPs from sequencing error. In this article, we demonstrate with Atlantic salmon and pig EST collections that splitting the EST collection in two, detecting SNPs in both subsets, then accepting only cross-validated SNPs increases validation rates.
Results: In the pig data set, 676 cross-validated putative SNPs were detected in a collection of 160 689 ESTs. When a subset of these was validated by genotyping on MassARRAY, 85.1% of SNPs were polymorphic in successful assays. In the salmon data set, 856 cross-validated putative SNPs were detected in a collection of 243 674 ESTs. Validation by genotyping showed that 81.0% of the cross-validated putative SNPs were polymorphic in successful assays.
Availability: Cross-validated SNPs are available at dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/), ss69371838-ss69372575 for the salmon SNPs and ss69372587-ss69373226 for the pig SNPs.
Contact: ben.hayes{at}dpi.vic.gov.au
Redundant expressed sequence tag (EST) libraries are a valuable source of single nucleotide polymorphisms (SNPs) (Buetow et al., 1999; Taillon-Miller et al., 1998). Such libraries have been mined for SNPs across a wide range of species by exploiting the redundancy that arises when libraries are created from multiple individuals, so that multiple transcripts of the same gene from different individuals can be compared for polymorphism (e.g. Guryev et al., 2004; Irizarry et al., 2000). One difficulty with detecting SNPs in this way is that failure to distinguish sequencing error from true polymorphism leads to low validation rates for putative SNPs. A number of procedures have been proposed to increase the probability of detecting only true SNPs, including the use of sequence quality information (Marth et al., 1999). Here we describe a simple method that takes advantage of a second source of redundancy in EST collections, namely the presence of multiple transcripts of the same gene from the same individual, to further increase the probability of detecting only true SNPs from EST data. This second source of redundancy can be exploited by creating two independent data sets (splitting the EST collection in two), running the SNP detection pipeline on each, and retaining only cross-validated SNPs (Figure 1).
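The splitting step can be illustrated with a minimal sketch. The function name `split_est_collection`, the seed parameter and the choice of two near-equal halves are assumptions made for illustration; the text states only that the collection is split in two.

```python
import random


def split_est_collection(est_ids, seed=0):
    """Randomly partition EST identifiers into two disjoint subsets.

    Splitting into near-equal halves is an assumption of this sketch.
    """
    shuffled = list(est_ids)
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical identifiers, used only to show the call.
subset_a, subset_b = split_est_collection([f"EST{i:04d}" for i in range(10)], seed=1)
```

Each subset would then be passed through the same SNP detection pipeline independently, so that a sequencing error in one subset is unlikely to be reproduced in the other.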
The procedure was tested using EST collections for two species, Atlantic salmon (Salmo salar) and pigs (Sus scrofa). In pigs, all available EST chromatogram files were retrieved from the NCBI trace archive on 28 September 2006. ESTs derived from Meishan pigs were excluded from the analysis. There were 160 689 sequences following removal of poor quality sequences. For salmon, the creation and sequencing of the Canadian GRASP EST libraries have been described by Rise et al. (2004), and the libraries are available at http://web.uvic.ca/cbr/grasp. With the exception of one (RGB) library, the libraries were derived from several individuals of the McConnell strain of S.salar. The chromatogram files for the Norwegian EST sequences were extracted from the Salmon Genome project database (http://www.salmongenome.no). These libraries were derived from several individuals from the Aquagen breeding company. GenBank accession numbers for the ESTs are BU965588–BU965906, CA036414–CA039704, CA039711–CA064598, CA767613–CA770910, CB498694–CB518126, DW543182–584513, DW531365–DW543181 and DY694261–DY741432 (Koop and Davidson, 2007). There were 243 674 sequences in total. For both the salmon and pig data sets, the respective EST collections were split at random into two data sets. For each data set, basecalling was performed with the Phred program (Ewing et al., 1998) to derive the most likely base at each position within a sequence from each of the chromatogram files. To remove vector sequence incorporated into the ESTs during EST library creation, the sequence of the pCMV-PCR vector, the vector used to create the ESTs, was masked in each sequence with cross_match (P.Green, unpublished, http://www.genome.washington.edu/UWGC). PolyA and polyT sequences were also masked using a custom-made script to avoid false clustering on these motifs. Clustering and contig assembly were performed with the Phrap program (Gordon et al., 1998).
Phrap was run with the options -trim_start 50 -minmatch 50. The average number of sequences per contig was 5.2 for salmon and 4.1 for pigs. There were 10 192 singletons for salmon and 20 542 for pigs, averaged across the two data sets. The PolyBayes program (Marth et al., 1999) was used to detect putative SNPs in the sequence alignments and to assign each base substitution a probability of being a true SNP. Only SNPs with >95% probability of being a true SNP were retained. The contig consensus sequences were used as the anchor sequences. When there was more than one putative SNP within a 50 bp window, these putative SNPs were removed from further consideration. The SNPs in each data set were then compared using both the flanking 50 bp of sequence and the SNP alleles; if these were a perfect match, the putative SNPs were considered to be cross-validated. For pigs, 676 putative SNPs were cross-validated, while for salmon, 856 putative SNPs were cross-validated (Table 1).
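The filtering and cross-validation rules described above can be sketched as follows. This is an illustrative sketch, not the pipeline used in the study: the `PutativeSNP` record and all function names are hypothetical, "within a 50 bp window" is read here as two SNPs on the same contig fewer than 50 bp apart, and contigs assembled in opposite orientations in the two subsets (which would require reverse-complement matching) are not handled.

```python
from collections import namedtuple

# Hypothetical record for one putative SNP reported on a contig.
PutativeSNP = namedtuple(
    "PutativeSNP", "contig position alleles probability left_flank right_flank"
)


def filter_snps(snps, min_prob=0.95, window=50):
    """Keep SNPs above the probability threshold with no neighbour within `window` bp."""
    by_contig = {}
    for snp in snps:
        if snp.probability > min_prob:
            by_contig.setdefault(snp.contig, []).append(snp)
    kept = []
    for group in by_contig.values():
        group.sort(key=lambda s: s.position)
        for i, snp in enumerate(group):
            near_prev = i > 0 and snp.position - group[i - 1].position < window
            near_next = (
                i + 1 < len(group) and group[i + 1].position - snp.position < window
            )
            # All members of a cluster are dropped, not just the later ones.
            if not (near_prev or near_next):
                kept.append(snp)
    return kept


def snp_key(snp):
    """Identity of a SNP: both flanks plus the unordered pair of alleles."""
    alleles = frozenset(a.upper() for a in snp.alleles)
    return (snp.left_flank.upper(), alleles, snp.right_flank.upper())


def cross_validate(snps_a, snps_b):
    """Return SNPs from subset A whose flanks and alleles match exactly in subset B."""
    keys_b = {snp_key(s) for s in snps_b}
    return [s for s in snps_a if snp_key(s) in keys_b]
```

In use, `filter_snps` would be applied to the PolyBayes output of each subset separately, and `cross_validate` to the two filtered lists, with `left_flank` and `right_flank` holding the flanking sequence taken from the contig consensus.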
Eighty-seven cross-validated SNPs were chosen at random from the set of 676 cross-validated pig SNPs for validation by genotyping in a panel of 47 individuals from two commercial pig populations (Duroc and Norwegian Landrace). The SNPs were validated using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) assays. Assays for PCR and extension reactions were designed with the MassARRAY Assay Design 2.0 software (Sequenom). SNPs were genotyped by the IPLEX protocol as described by the manufacturer (Sequenom, San Diego, USA). Eighty-five percent of putative, cross-validated SNPs were polymorphic in successful assays (Table 2). One hundred and fifty-six putative SNPs were chosen at random from the set of 856 cross-validated salmon SNPs for validation by SNP genotyping in a panel of salmon from geographically diverse locations. DNA samples were collected and DNA was extracted from 65 fish from 13 different locations, distributed across Canada, Iceland, Ireland and Russia. The putative SNPs were genotyped as described above. Seventy-two percent of putative, cross-validated SNPs were classified as real in successful assays (Table 2). The absence of homozygotes for some of the SNPs (results not shown) indicates that some of the putative SNPs were actually paralogous sequence variants (PSVs) rather than true SNPs, likely a result of the extensive duplication of the salmonid genome.
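One way to express the validation-rate calculation is sketched below. The classification rules are assumptions for illustration only: an assay with no genotype calls is treated as failed, an assay in which every called individual is heterozygous is flagged as a possible PSV, and only assays labelled polymorphic count as validated. The criteria actually applied in the study are not spelled out in the text, and the function names are hypothetical.

```python
from collections import Counter


def classify_assay(genotypes):
    """Label one SNP assay from per-individual calls such as "AG" (None = no call)."""
    calls = ["".join(sorted(g)) for g in genotypes if g]
    if not calls:
        return "failed"
    counts = Counter(calls)
    heterozygotes = sum(n for g, n in counts.items() if g[0] != g[1])
    homozygotes = len(calls) - heterozygotes
    if heterozygotes == 0 and len(counts) == 1:
        return "monomorphic"
    if homozygotes == 0:
        # No homozygotes at all: consistent with a paralogous sequence variant.
        return "possible_psv"
    return "polymorphic"


def validation_rate(assays):
    """Fraction of successful (non-failed) assays that are classified as polymorphic."""
    labels = [classify_assay(calls) for calls in assays.values()]
    successful = [label for label in labels if label != "failed"]
    if not successful:
        return None
    return sum(label == "polymorphic" for label in successful) / len(successful)
```

Here `assays` is assumed to be a mapping from a SNP identifier to the list of genotype calls across the genotyping panel.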
The validation rates of the putative SNPs here were considerably higher than for non-cross-validated SNPs in other studies. For example, Hawken et al. (2004) reported a validation rate of 50% for putative SNPs detected in alignments of cattle ESTs, and Lee et al. (2006) reported a validation rate of 29% for putative non-synonymous SNPs in alignments of Bos taurus sequences. Our cross-validation strategy would be ideal for choosing 1000–2000 putative SNPs for genome-wide SNP chips in species without a whole genome sequence. The high validation rates of putative SNPs selected in this way will considerably reduce the costs of assembling these resources.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Received on February 11, 2007; revised on April 10, 2007; accepted on April 16, 2007
| REFERENCES |
|---|
Buetow KH, et al. Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet (1999) 21:323–325.
Ewing B, et al. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res (1998) 8:175–185.
Gordon D, et al. Consed: a graphical tool for sequence finishing. Genome Res (1998) 8:195–202.
Guryev V, et al. Single nucleotide polymorphisms associated with rat expressed sequences. Genome Res (2004) 14:1438–1443.
Hawken RJ, et al. An interactive bovine in silico SNP database (IBISS). Mamm. Genome (2004) 819–827.
Irizarry K, et al. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet (2000) 26:233–236.
Koop BF, Davidson WS. cGRASP. (2007) (http://web.uvic.ca/cbr/grasp/).
Lee MA, et al. Establishment of a pipeline to analyse non-synonymous SNPs in Bos taurus. BMC Genomics (2006) 26:298.
Marth GT, et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet (1999) 23:452–456.
Rise ML, et al. Development and application of a salmonid EST database and cDNA microarray: data mining and interspecific hybridization characteristics. Genome Res (2004) 14:478–490.
Taillon-Miller P, et al. Overlapping genomic sequences: a treasure trove of single-nucleotide polymorphisms. Genome Res (1998) 8:748–754.