Bioinformatics Advance Access originally published online on June 1, 2006
Bioinformatics 2006 22(19):2437-2438; doi:10.1093/bioinformatics/btl273
© 2006 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
PROBER: oligonucleotide FISH probe design software
1 Watson School of Biological Sciences, Cold Spring Harbor NY 11724, USA
2 Cold Spring Harbor Laboratory, Cold Spring Harbor NY 11724, USA
3 Karolinska Institutet, Cancer Center Karolinska Stockholm, Sweden
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
PROBER is an oligonucleotide primer design software application that designs multiple primer pairs for generating PCR probes useful for fluorescence in situ hybridization (FISH). PROBER generates Tiling Oligonucleotide Probes (TOPs) by masking repetitive genomic sequences and delineating essentially unique regions that can be amplified to yield small (100–2000 bp) DNA probes that in aggregate will generate a single, strong fluorescent signal for regions as small as a single gene. TOPs are an alternative to bacterial artificial chromosomes (BACs) that are commonly used for FISH but may be unstable, unavailable, chimeric, or non-specific to small (10–100 kb) genomic regions. PROBER can be applied to any genomic locus, with the limitation that the locus must contain at least 10 kb of essentially unique blocks. To test the software, we designed a number of probes for genomic amplifications and hemizygous deletions that were initially detected by Representational Oligonucleotide Microarray Analysis of breast cancer tumors.
Availability: http://prober.cshl.edu
Contact: navin{at}cshl.edu
| 1 INTRODUCTION |
|---|
Identification of submicroscopic chromosome abnormalities is useful in the clinical diagnosis of diseases, including mental retardation, autism and cancer. The detection of heritable copy number polymorphisms (CNPs) in the normal population (Sebat et al., 2004) and of amplifications and deletions in cancer (Lucito et al., 2003) may be important for studying human disease and genome evolution. Whole genome microarray analysis using Comparative Genomic Hybridization (CGH) or Representational Oligonucleotide Microarray Analysis (ROMA) provides a method for initial discovery of these variations, and creates a corresponding need for validation and more accurate quantification by interphase or metaphase FISH. In order to target very specific locations of the genome that are separated by as little as 50 kb, we have developed a method for designing Tiling Oligonucleotide Probes for any specified genomic region. Coverage of as little as 20% of a 100 kb region with essentially unique short sequences provides hybridization probes sufficient for robust FISH analysis.
Design overview. Genomic DNA sequences are retrieved from a server, masked for repetitive exact string matches in the human genome, and analyzed for contiguously amplifiable, nearly repeat-free regions of sufficient aggregate length. These regions are searched for optimized forward and reverse PCR primers, resulting in a collection of oligonucleotide probes. Individual tiling probes are then amplified by PCR and combined into a cocktail for FISH analysis.
MerMatch. PROBER initiates probe designs by requesting a target genomic sequence 10–100 kb in length from DAS.DNA, a Distributed Annotation Server specific to a human genome freeze from UCSC (Dowell et al., 2001). Substrings of a specified length (mer.match.length) in the target DNA sequence that have multiple exact matches elsewhere in the genome are masked using the MerMatch algorithm. This algorithm is based on the MerEngine (Healy et al., 2003), which marks every substring of length mer.match.length in the target sequence with the number of its exact matches in the human genome. To operate this algorithm, and other algorithms that we use routinely for probe design, a database of the human genome is compressed using a Burrows-Wheeler transformation into a suffix array that is stored in an external file. The database is loaded into 1 GB of RAM to minimize execution time. MerMatch masks the frequent mers in the human genome, where a mer is frequent if its number of exact matches is greater than a user-specified parameter (mer.count.cutoff).
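The masking rule can be illustrated with a minimal Python sketch. It is not the MerEngine: a plain in-memory counter stands in for the compressed genome index, and the function names `count_mers` and `mer_match_mask` are hypothetical. Only the rule itself (a mer is frequent when its exact-match count exceeds mer.count.cutoff) comes from the text.

```python
from collections import Counter


def count_mers(genome: str, k: int) -> Counter:
    """Count every k-mer in the reference sequence.

    Assumption: a simple Counter replaces the compressed
    genome index described in the text; it only works for toy inputs.
    """
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def mer_match_mask(target: str, genome_counts: Counter, k: int, cutoff: int) -> list[bool]:
    """Return one flag per k-mer start position in the target.

    True means the k-mer is frequent (more than `cutoff` exact matches
    in the reference) and would be masked.
    """
    return [
        genome_counts[target[i:i + k]] > cutoff
        for i in range(len(target) - k + 1)
    ]


# Toy example: the 4-mer "ACGT" occurs twice in the reference.
genome = "ACGTACGTTTGACCA"
counts = count_mers(genome, k=4)
mask = mer_match_mask("ACGTTTGA", counts, k=4, cutoff=1)
```

With a cutoff of 1, a mer that occurs only at its own location in the genome is kept, and any mer with a second exact match is flagged.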
Tolerance. Tolerance is a program that finds regions suitable to be hybridization probes. We first convert the masked sequence output of MerMatch into a binary string, with 0s indicating the frequent mers. Positions within the string with consec.freq consecutive frequent mers are then marked as condemned zones by setting them to a large negative number, and no region overlapping a condemned zone is ever considered suitable as a hybridization probe. Using successive cumulative sums, we mark a region as suitable if it has at least a specified minimal length (min.length) and less than a specified proportion (repeat.tolerance) of frequent mers. Our default values are 0.8 for repeat.tolerance, 100 for min.length, 18 for mer.match.length, 1 for mer.count.cutoff and 10 for consec.freq. By setting repeat.tolerance lower, min.length higher, mer.match.length longer or mer.count.cutoff higher, we increase the tolerance for repeats in the regions considered suitable as a hybridization probe.
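A minimal Python sketch of this scan is given below, under stated assumptions. It scores unique mers as 1, frequent mers as 0 and condemned positions as a large negative number, then uses one cumulative sum to test fixed-length windows. The function name `suitable_windows`, the fixed window length and the reading of repeat.tolerance as an upper bound on the fraction of frequent mers are assumptions made for illustration, not details confirmed by the text.

```python
from itertools import accumulate


def suitable_windows(frequent, min_length, repeat_tolerance, consec_freq):
    """Return start positions of windows of `min_length` mers that avoid
    condemned zones and contain a frequent-mer fraction below
    `repeat_tolerance` (assumed interpretation of the parameter).
    """
    n = len(frequent)
    negative = -(n + 1)  # large enough that any window touching it sums below 0

    # Score each position: 1 = unique mer, 0 = frequent mer.
    scores = [0 if f else 1 for f in frequent]

    # Condemn every run of at least `consec_freq` consecutive frequent mers.
    run = 0
    for i, f in enumerate(frequent):
        run = run + 1 if f else 0
        if run == consec_freq:
            for j in range(i - run + 1, i + 1):
                scores[j] = negative
        elif run > consec_freq:
            scores[i] = negative

    # Prefix sums let each window total be read in constant time.
    prefix = [0] + list(accumulate(scores))

    starts = []
    for s in range(n - min_length + 1):
        total = prefix[s + min_length] - prefix[s]
        if total < 0:
            continue  # window overlaps a condemned zone
        frequent_fraction = (min_length - total) / min_length
        if frequent_fraction < repeat_tolerance:
            starts.append(s)
    return starts


# Toy input: a run of three frequent mers at positions 5-7 is condemned.
flags = [False, False, True, False, False, True, True, True, False, False]
windows = suitable_windows(flags, min_length=4, repeat_tolerance=0.5, consec_freq=3)
```

Adjacent accepted windows could then be merged into longer candidate regions; that step is omitted here.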
Probe design. The desired probe size ranges (100–2000 bp) for Tier 1 and Tier 2 probe selection are specified along with the primer Tm range (55–80°C), mer.match.length (15-, 18- or 21-mer), maximum number of nucleotide repeats (n < 4) and base pair spacer (if a distance between probes is desired). Every possible primer sequence within a size range of 15–30 bp is extracted from the masked DNA sequence and placed in a 3D matrix. Primer melting temperature (Tm) is calculated using the Rychlik method (Rychlik et al., 1990), which is based on the nearest neighbor Borer method (Borer et al., 1974): Tm = 81.5 + 16.6(log[Na+]) + 0.41(%GC) − 675/probe length. Primer pairs are matched according to minimal Tm deviation, and primers outside of a specified Tm range or GC percentage are eliminated. The remaining primers are subjected to the G/C clamp rule (they must end in G/C at the 3' end to control mispriming) and must contain no polypyrimidines or polypurines that could promote non-specific annealing (maximum repeat nucleotides <4 by default) (Dieffenbach et al., 1995). In addition, the three nucleotides at the 3' end of each primer are scored for the presence of a GC clamp and the absence of any GC dinucleotides, which may facilitate primer dimerization.
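The Tm formula and the basic primer filters can be sketched in Python as follows. This is an illustration only: the function names, the default sodium concentration of 0.05 M, the use of a base-10 logarithm and the reading of the repeat rule as a limit on the longest run of one nucleotide are assumptions, not details taken from PROBER.

```python
import math


def melting_temp(primer: str, na_molar: float = 0.05) -> float:
    """Tm = 81.5 + 16.6*log10[Na+] + 0.41*(%GC) - 675/length.

    Assumptions: log is base 10 and [Na+] defaults to 0.05 M.
    """
    gc_percent = 100.0 * sum(base in "GC" for base in primer) / len(primer)
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / len(primer)


def has_gc_clamp(primer: str) -> bool:
    """G/C clamp rule: the 3' terminal base must be G or C."""
    return primer[-1] in "GC"


def longest_run(seq: str) -> int:
    """Length of the longest run of a single repeated nucleotide."""
    best = run = 0
    prev = None
    for base in seq:
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


def passes_primer_rules(primer, tm_range=(55.0, 80.0), max_repeat=4, na_molar=0.05):
    """Apply the Tm window, the G/C clamp and the repeat limit (run < max_repeat)."""
    tm = melting_temp(primer, na_molar)
    return (
        tm_range[0] <= tm <= tm_range[1]
        and has_gc_clamp(primer)
        and longest_run(primer) < max_repeat
    )


# Hypothetical 20-mer with 55% GC content.
primer = "ACGTACGTACGTACGTACGC"
tm_at_1m = melting_temp(primer, na_molar=1.0)  # log term is zero at 1 M
```

At 1 M sodium the salt term vanishes, so the example primer evaluates to 81.5 + 0.41 × 55 − 675/20 = 70.3°C.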
Probe selection proceeds by selecting the forward set of primers for a single base pair position and then jumping ahead by the probe length distance (100–2000 bp) in the matrix until the highest scoring set of the reverse primers are located. If the primers at either base pair position do not meet the primer rules, then the next forward or reverse primer set is considered (n + 1). The two columns of forward and reverse primers are compared and the primers with the closest Tm match are selected, resulting in a final probe sequence. Probe sequences that have been utilized are marked in the DNA sequence, so that they will not be reused during Tier 2 probe selection, where more relaxed parameters are used to identify additional probes.
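The final comparison of the two primer columns can be sketched as below. The sketch assumes each candidate is a (sequence, Tm) tuple and that every forward candidate may be paired with every reverse candidate; the function name `closest_tm_pair` and the example primer labels are hypothetical.

```python
from itertools import product


def closest_tm_pair(forward, reverse):
    """Pick the forward/reverse pair with the smallest Tm difference.

    `forward` and `reverse` are lists of (sequence, tm) tuples.
    Raises ValueError if either list is empty.
    """
    return min(
        product(forward, reverse),
        key=lambda pair: abs(pair[0][1] - pair[1][1]),
    )


# Hypothetical candidates at one forward and one reverse position.
forward = [("F1", 60.0), ("F2", 64.5)]
reverse = [("R1", 62.0), ("R2", 65.0)]
best_forward, best_reverse = closest_tm_pair(forward, reverse)
```

Here the pair F2/R2 is chosen because its Tm values differ by only 0.5°C.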
Finally, the Percent Genome Coverage (PGC) [(Bp Sequence covered with probes/Total Bp) * 100] is calculated and the probe distribution is visualized in a graphical plot. We have determined that a PGC > 20% of a 100 kb sequence will not compromise the fluorescent probe signal in FISH. The output can be saved as a formatted text file, either as a full report or as a short report (forward/reverse primer sequences).
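A short Python sketch of the PGC calculation follows. It assumes probes are given as half-open (start, end) coordinates with non-negative positions and that overlapping probes are counted once; both are assumptions, as is the function name.

```python
def percent_genome_coverage(probes, total_bp):
    """PGC = (bp covered by probes / total bp) * 100.

    `probes` is a list of (start, end) half-open intervals.
    Assumption: overlapping probes are merged so no base is counted twice.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(probes):
        start = max(start, last_end)  # skip bases already counted
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / total_bp


# Three hypothetical probes in a 100 kb region; the last two overlap by 1 kb.
pgc = percent_genome_coverage([(0, 1500), (10000, 12000), (11000, 13000)], 100000)
```

The three probes cover 4500 distinct bases, giving a PGC of 4.5%, which is below the 20% threshold stated above.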
| 2 RESULTS |
|---|
Simulations. http://prober.cshl.edu/simulations.html
Application. http://prober.cshl.edu/applications.html
| 3 IMPLEMENTATION |
|---|
PROBER was written in C# 2.0 for Microsoft Windows and requires the .NET Framework 2.0 to be installed at runtime.
| Acknowledgments |
|---|
The authors thank John Healy, Lakshmi Muthuswamy, Joan Alexander, Eirny Tholsdorf and Kristin Anna for their help. This work was supported by a fellowship from the Lindsey-Goldberg Foundation to N.N. and grants to A.Z. from the Swedish Cancer Society (0046-B04-38XAC) and from the Stockholm Cancer Society (03:171 and 02:144). This work was also supported by grants to M.W. from the National Institutes of Health (5R01-CA078544-07), Dept. of the Army (W81XWH04-1-0477), (W81XWH-05-1-0068), (W81XWH-04-0905), The Simons Foundation, Miracle Foundation, Breast Cancer Research Foundation, Long Islanders Against Breast Cancer, West Islip Breast Cancer Foundation, Long Island Breast Cancer (1 in 9), Elizabeth McFarland Breast Cancer Research Grant, Breast Cancer Help Inc. M.W. is an American Cancer Society Research Professor. Funding to pay the Open Access publication charges was provided by The Simons Foundation.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on February 22, 2006; revised on May 6, 2006; accepted on May 24, 2006
| REFERENCES |
|---|
Borer, P.N., et al. (1974) Stability of ribonucleic acid double-stranded helices. J. Mol. Biol., 86, 843–853.
Dieffenbach, C.W., et al. (1995) General concepts for PCR primer design. In PCR Primer: A Laboratory Manual. Cold Spring Harbor Laboratory Press, New York, pp. 133–155.
Dowell, R.D., et al. (2001) The distributed annotation system. BMC Bioinformatics, 2, 7.
Healy, J., et al. (2003) Annotating large genomes with exact word matches. Genome Res., 10, 2306–2315.
Lucito, R., et al. (2003) Representational oligonucleotide microarray analysis: a high-resolution method to detect genome copy number variation. Genome Res., 10, 2291–2305.
Rychlik, W., et al. (1990) Optimization of the annealing temperature for DNA amplification in vitro. Nucleic Acids Res., 18, 6409–6412.
Sebat, J., et al. (2004) Large-scale copy number polymorphism in the human genome. Science, 305, 525–528.