Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1917-1926; doi:10.1093/bioinformatics/bti248
Design of a DNA chip for detection of unknown genetically modified organisms (GMOs)
1Department of Informatics, University of Oslo PO Box 1080 Blindern, 0316 Oslo, Norway
2Section of Food and Feed Microbiology, National Veterinary Institute PO Box 8156 Dep., 0033 Oslo, Norway
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: Unknown genetically modified organisms (GMOs) have not undergone a risk evaluation, and hence might pose a danger to health and environment. There are, today, no methods for detecting unknown GMOs. In this paper we propose a novel method intended as a first step in an approach for detecting unknown genetically modified (GM) material in a single plant.
Results: A model is designed where biological and combinatorial reduction rules are applied to a set of DNA chip probes containing all possible sequences of uniform length n, creating probes capable of detecting unknown GMOs. The model is theoretically tested for Arabidopsis thaliana Columbia, and the probabilities for detecting inserts and receiving false positives are assessed for various parameters for this organism. From a theoretical standpoint, the model looks very promising but should be tested further in the laboratory.
Availability: The model and algorithms will be available upon request to the corresponding author.
Contact: knut.berdal{at}vetinst.no
| INTRODUCTION |
|---|
|
|
|---|
It is commonly accepted throughout the world that prior to the release of a genetically modified organism (GMO) it is necessary to conduct a thorough risk evaluation of the GMO and its potential effects on health and environment. An unknown GMO, as per definition, has not undergone such evaluation, and hence poses a considerably higher risk than known GMOs. As the ability to create new GMOs is becoming more and more widespread, and the resources required are easily available, the likelihood of intentional or unintentional release of GMOs will increase. Unintentional release could typically be escapes from experiments without public notice, while intentional release could be the result of the fear that a successful high-yield crop may not be sold if labelled GMO, or of more hostile intentions. The release of GMOs in the environment and marketing of GMO-derived food products are strictly regulated in the European Union and other regions (e.g. European Commission, 1997, 2001, 2003; FMBJ, 2000; MAFK, 2000). Moreover, the Cartagena Biosafety Protocol agreement governs the trade and transfer of living GMOs across national borders, and allows governments to prohibit the import of genetically modified food when there is concern over its safety (Gupta, 2000). Pressure from green groups and consumer organizations has also raised the general public's awareness of the GMO issue. As a result, it is important for food exporters, importers and retailers, as well as for competent authorities responsible for food safety, to know the extent and the nature of GMO ingredients in the products they handle.
The alteration, which renders an organism a GMO, usually consists of the insertion of a recombinant piece of DNA into the genome of the organism. Here, the inserted DNA is called the insert. Detection of the insert can be done using methods that target modified genes, i.e. DNA, or gene products, i.e. RNA or protein. Presently, the majority of the methods applied in diagnostics target the modified DNA (Bonfini et al., 2002 http://biotech.jrc.it/doc/EUR20384Review.pdf; Holst-Jensen et al., 2003) as DNA is a rather stable molecule and the most common detection method, polymerase chain reaction (PCR), is very sensitive (Anklam et al., 2001; Holst-Jensen et al., 2003). Any PCR-based detection strategy depends on a detailed knowledge of the DNA sequence of the insert in order to select the appropriate oligonucleotide primers (for a comprehensive review of PCR-based detection methods, see Holst-Jensen et al., 2003). Hence, PCR is inappropriate for use in the direct detection of unknown GMOs.
A challenge in GMO screening in the near future is the rapid pace of development of GM plants that feature new or multiple genes and genetic control elements. New technologies and instruments will be needed that offer high throughput detection of multiple or unknown inserts (Holst-Jensen, 2003; Miraglia et al., 2004). To ensure that competent authorities and others responsible for food and environmental safety and compliance with legislations can do their job, it is urgent to provide suitable analytical tools for discovering unknown GMOs. The DNA chip design described herein is meant to create a theoretical basis for detection of unknown GMOs, i.e. GMOs that are not described in current literature. Given that the approach is deemed likely to work from a theoretical point of view, the experimental part of the design can start up. However, the extensive costs in terms of reagents and manpower may not justify the setting up of the experiments unless the theoretical basis is sound. This paper provides this basis, and limited experimental work has been initiated.
| SYSTEMS AND METHODS |
|---|
|
|
|---|
To detect unknown GMOs, an oligonucleotide array (Lockhart et al., 1996), i.e. a DNA chip, will be designed such that if a sample hybridizes to the chip, the sample is most likely a GMO. The sequences hybridizing to the chip have to be tested further. In the following section we will discuss how to design such a chip. An assumption in the model is that the probes on the chip are of a uniform length n, and a procedure is proposed to determine the optimal value of n.
Model
The strategy to design the probes on the chip is as follows: starting with a set of probes containing every possible sequence of length n, a set of biological and combinatorial reduction rules is applied, reducing the probes to a number which can be fitted onto a single DNA chip.
At the current state of the art, a DNA chip can contain $10^6$ distinct probes (M. Lundberg, Affymetrix, personal communication). Although this number seems to be steadily increasing (Lander, 1999), $10^6$ will be used as the number of probes on a chip in all calculations throughout this paper.
There are three major groups of reduction rules that will be examined:
- Subtraction of probes corresponding to both strands of the reference genome. The reference genome is defined as the genetic material that is expected to be present in the test sample. This reduction is performed to reduce false positive hybridization signals.
- Subtraction of sequences that are unlikely to be genetically functional (e.g. hypervariable microsatellite motifs, long stretches of single or short oligonucleotide repeats). These probes would by definition correspond to targets unlikely to represent intended genetic modifications.
- Reduction based on predicted hybridization behaviour. Probes assumed to hybridize strongly and specifically are preferred to poor or promiscuous hybridizers.
The probes selected for use on the DNA chip are then given by the set $U \setminus (A \cup B \cup C)$, where the universe $U$ is the set of all possible sequences of length n, and A, B and C are the sets of sequences defined by reduction rules A, B and C, respectively. Notably, a considerable number of probes would be removed by more than one of the three rules.
Reduction A
Reduction A involves removing probes corresponding to subsequences of both strands of the reference genome. The reduction eliminates probes hybridizing to the wild type of the target species. This is essential in order to reduce false positive hybridization signals.
In addition to removing the probes for the perfect subsequences in the reference genome, additional elimination of probes with a certain number of mismatches relative to the perfect subsequences may prove useful for three reasons. First, because hybridization between two non-complementary strands can occur (particularly if the mismatch is at the end of the probe), removing probes with mismatches will further reduce false positive signals. Notably, 3' mismatches will result in failure to isolate and characterize sequences by application of PCR, if the probe motif is to be used as PCR primer. Second, all specimens belonging to the same species do not share the exact DNA sequence. Removing probes with mismatches will reduce the probability for receiving false positives due to natural variation among specimens. Last, extended elimination will compensate for errors made during sequencing of the reference genome.
A large number of sequences are excluded by reduction A. In fact, in some cases the number of probes eliminated may be very high, leaving few or no useful probes for synthesis on the DNA chip. To obtain an estimate of the number of distinct subsequences in a reference genome, random sequences of length n can be generated; an experiment generating overlapping and non-overlapping sequences showed that the number of distinct sequences statistically can be regarded as the same in both cases (Nesvold,H., unpublished data). A random sequence of length n consists of $n_A$ As, $n_C$ Cs, $n_G$ Gs and $n_T$ Ts, where $0 \le n_i \le n$ and $\sum_i n_i = n$ for $i \in \{A, C, G, T\}$. The number of distinct ways to select the amounts of A, C, G and T nucleotides making up a sequence of length n under the above restrictions is equivalent to the number of ways to arrange n symbols of one type (e.g. Xs) and three symbols of another type (e.g. vertical lines):

$$\binom{n+3}{3} = \frac{(n+3)!}{n!\,3!}$$
Here the probability for a specific nucleotide to occur in any position of a sequence is assumed to be constant, i.e. the same for all positions. Call these probabilities p(A), p(C), p(G) and p(T). Consequently, all $n!/(n_A!\,n_C!\,n_G!\,n_T!)$ arrangements of nucleotides of one of the $(n+3)!/(n!\,3!)$ nucleotide distributions are equally likely. The expected number of any one of these arrangements is then given by the total number of subsequences of length n in a reference genome of length r, $(r - n + 1) \times 2$, multiplied by the probability of the specific arrangement:

$$(r - n + 1) \times 2 \times p(A)^{n_A} \times p(C)^{n_C} \times p(G)^{n_G} \times p(T)^{n_T}$$
When estimating the expected number of distinct subsequences, the expected number of sequences is summed over all possible sequences of length n. However, if the expected number of a sequence is >1, this is still counted as 1. Furthermore, because the order of the nucleotides in a sequence does not affect the above formula, the expected number of a specific distribution of nucleotides, $\min[(r - n + 1) \times 2 \times p(A)^{n_A} \times p(C)^{n_C} \times p(G)^{n_G} \times p(T)^{n_T},\ 1]$, can be multiplied by the number of sequences that have this particular distribution, $n!/(n_A!\,n_C!\,n_G!\,n_T!)$. This is summed over all $(n+3)!/(n!\,3!)$ different distributions of nucleotides. The resulting formula for the expected number of distinct sequences of length n in a reference genome of length r is

$$\sum_{i=1}^{(n+3)!/(n!\,3!)} \frac{n!}{n_{i,A}!\,n_{i,C}!\,n_{i,G}!\,n_{i,T}!}\ \min\left[(r - n + 1) \times 2 \times p(A)^{n_{i,A}} \times p(C)^{n_{i,C}} \times p(G)^{n_{i,G}} \times p(T)^{n_{i,T}},\ 1\right]$$

where $n_{i,j}$ is the amount of j nucleotides in distribution i, for $j \in \{A, C, G, T\}$.
Dividing the above expression by the number of possible sequences of length n, $4^n$, yields a formula for the fraction of probes removed from a full combinatorial set. This formula was applied with two different probabilities for the bases: equal probability for each base, i.e. p(A) = p(C) = p(G) = p(T) = 1/4, and 75% GC bias, i.e. p(A) = p(T) = 1/8, p(C) = p(G) = 3/8. Because the formula is symmetric in the four bases, the latter scenario gives the same result as a 75% AT bias (25% GC). This was computed for different values of n and r, and the results are presented in Figure 1. The 25-75% range provides reasonable limits for the GC content of the majority of hitherto sequenced material: Campbell et al. (1999) reported that out of 34 complete genomes and large DNA sequence samples (>$5 \times 10^5$ bp), 33 had a GC content within this interval, and all 54 complete genomes examined by Tekaia (2002, http://www-alt.pasteur.fr/~tekaia/ntfreq.html) had this property. As can be observed in Figure 1, 10 seems to be the minimum probe length that can be applied if any probes are to be retained after the reduction. Moreover, shorter probe lengths can be used for reference genomes with a strong GC bias.
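The estimate described above can be evaluated directly by looping over all nucleotide distributions. The following is a minimal sketch, not the authors' implementation; the function names `expected_distinct` and `fraction_removed` and the dictionary layout of the base probabilities are assumptions made for illustration.

```python
from math import factorial

def expected_distinct(n, r, p):
    """Expected number of distinct length-n subsequences on both strands of a
    random reference genome of length r. `p` maps 'A', 'C', 'G', 'T' to the
    per-position base probabilities (hypothetical input format)."""
    total_positions = (r - n + 1) * 2
    expected = 0.0
    # Enumerate every nucleotide distribution (n_a, n_c, n_g, n_t) summing to n.
    for n_a in range(n + 1):
        for n_c in range(n + 1 - n_a):
            for n_g in range(n + 1 - n_a - n_c):
                n_t = n - n_a - n_c - n_g
                # Number of sequences sharing this distribution.
                arrangements = factorial(n) // (
                    factorial(n_a) * factorial(n_c)
                    * factorial(n_g) * factorial(n_t))
                prob = (p["A"] ** n_a * p["C"] ** n_c
                        * p["G"] ** n_g * p["T"] ** n_t)
                # An expected count above 1 is still counted as 1.
                expected += arrangements * min(total_positions * prob, 1.0)
    return expected

def fraction_removed(n, r, p):
    """Fraction of the full combinatorial probe set removed by reduction A."""
    return expected_distinct(n, r, p) / 4 ** n

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
gc_bias = {"A": 0.125, "C": 0.375, "G": 0.375, "T": 0.125}
at_bias = {"A": 0.375, "C": 0.125, "G": 0.125, "T": 0.375}
```

Calling `fraction_removed(n, r, uniform)` for a range of n and r would give values comparable in spirit to those plotted in Figure 1, under the random-genome assumption stated in the text.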
Additional removal of sequences with a certain number of mismatches from the subsequences in the reference genome further reduces the number of false positive hybridizations as described initially. There are three possible substitutions for a nucleotide at a specific position which result in a sequence that differs by one from the original. Consequently, there exist 3n different sequences with one mismatch from a sequence of length n, and additionally removing the probes for these sequences results in a great increase in the number of removed probes.
Devising a formula for the number of sequences removed when in addition eliminating sequences with a number of mismatches in the reduction is not trivial and has not been attempted herein. However, an efficient algorithm for calculating the set of sequences with a certain maximum number of mismatches from the sequences in the reference genome has been devised, and is presented in the Algorithm section of this paper.
The reduction rule can be made more sophisticated by adding sequences of typical parasitic, endo- and epiphytic organisms such as viruses, fungi and bacteria as additional reference genomes. This would, however, require careful consideration on a case-by-case basis, to optimize the probability of a high ratio of true to false positive signals.
Reduction B
The nature of the insert is unknown. However, certain groups of sequences are less likely to have been used in the insert, simply on the basis of their sequence composition. Removal of such sequences (referred to here as unlikely sequences) may be useful for selection of probes for detection of unknown GMOs.
To remove probes for unlikely sequences, probes will be tested against the following rules:
- (B1) Total number of As, Cs, Gs or Ts less than n/2.
- (B2) No more than three identical dinucleotides in a row.
- (B3) No more than n/3 identical dinucleotides in total.
The aim of these rules is to eliminate probes for sequences containing repeated mononucleotides and dinucleotides. Repetitive DNA occurs in large quantities in eukaryotic cells and in smaller amounts in prokaryotic cells (Van Belkum et al., 1998). Analysis of the genomic distribution of repetitive DNA has shown that it is virtually excluded from genetically functional sequences (Cox and Mirkin, 1997).
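Rules B1 to B3 can be sketched as a single probe filter. This is an illustration only, and it rests on assumptions that the text does not settle: B2 is read as "at most three tandem copies of the same dinucleotide", B3 counts dinucleotide occurrences in overlapping windows, and the thresholds n/2 and n/3 are used without rounding. The function name `passes_reduction_b` is hypothetical.

```python
from collections import Counter

def passes_reduction_b(seq):
    """Return True if `seq` survives rules B1-B3 as interpreted here."""
    n = len(seq)
    # B1: each nucleotide occurs fewer than n/2 times.
    if any(seq.count(base) >= n / 2 for base in "ACGT"):
        return False
    # B2 (assumed reading): no dinucleotide repeated in tandem more than
    # three times, e.g. ACACACAC is rejected.
    for i in range(n - 1):
        unit = seq[i:i + 2]
        copies = 1
        j = i + 2
        while seq[j:j + 2] == unit:
            copies += 1
            j += 2
        if copies > 3:
            return False
    # B3 (assumed reading): no dinucleotide occurs more than n/3 times,
    # counting overlapping windows.
    counts = Counter(seq[i:i + 2] for i in range(n - 1))
    if max(counts.values()) > n / 3:
        return False
    return True
```

A different reading of B2 or B3 (for example non-overlapping dinucleotide counts) would change which probes are removed, so the thresholds should be checked against the intended definition before use.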
Reduction C
Based on hybridization experiments, Affymetrix has published a set of heuristic rules for selecting high quality probes (Lockhart et al., 1996). These heuristics were empirically derived and based on the selection of probes of length 20, and are as follows:
- Total number of As or Ts <10.
- Total number of Cs or Gs <9.
- Number of As or Ts in any window of 8 bases <7.
- Number of Cs or Gs in any window of 8 bases <6.
- No more than 5 Cs or Gs in a row.
- No more than 6 As or Ts in a row.
- A palindrome score <7.
The palindrome score of a DNA sequence is the maximum number of Watson-Crick base pairing matches that could occur between the 5'-3' sequence and the 3'-5' sequence, divided by two (M.S.Mittmann, Affymetrix, personal communication).
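The window and run rules in the list above (the third to sixth rules) lend themselves to a short filter. The sketch below is an interpretation, not the published Affymetrix procedure: "As or Ts" is read as applying to each base separately, because a combined reading of the first two rules could not be satisfied by any probe of length 20. The palindrome rule is omitted since its exact computation is not specified here. The names `max_run` and `passes_window_and_run_rules` are hypothetical.

```python
def max_run(seq, base):
    """Length of the longest uninterrupted run of `base` in `seq`."""
    best = run = 0
    for ch in seq:
        run = run + 1 if ch == base else 0
        best = max(best, run)
    return best

def passes_window_and_run_rules(seq, window=8):
    """Apply the window rules (<7 As or Ts, <6 Cs or Gs per 8 bases) and the
    run rules (at most 5 Cs or Gs, at most 6 As or Ts in a row), each base
    counted separately (assumption)."""
    for start in range(max(1, len(seq) - window + 1)):
        chunk = seq[start:start + window]
        if chunk.count("A") >= 7 or chunk.count("T") >= 7:
            return False
        if chunk.count("C") >= 6 or chunk.count("G") >= 6:
            return False
    if max_run(seq, "C") > 5 or max_run(seq, "G") > 5:
        return False
    if max_run(seq, "A") > 6 or max_run(seq, "T") > 6:
        return False
    return True
```

Under this per-base reading the run rules are already implied by the window rules for a window of 8; they are kept as separate checks so that a different reading can be substituted easily.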
Affymetrix rules 3-7 (the last five rules in the list above) will be used in reduction C to eliminate probes that for reasons concerning hybridization are unsuitable for use on DNA chips. Additionally, a reduction based on the melting temperature (TM) of the probes will be applied. A criterion for successful hybridization on a DNA chip is that all probes have melting temperatures which are approximately the same; the smaller the variation the better. The TM of a sequence can be determined exactly only by empirical means. However, theoretical methods can be applied to estimate the melting temperature. The Wallace Rule (Suggs et al., 1981), Marmur formula (Marmur and Doty, 1962) and the nearest-neighbour method (Breslauer et al., 1986) are three popular methods of varying complexity for estimating the TM of a sequence. In this study, the nearest-neighbour method will be used as this is currently the most accurate method (Le Novère, 2003 http://www.ebi.ac.uk/~lenov/SOFTWARES/melting/melting.html) for the sequence lengths studied herein (10-30 bp).
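To make the notion of TM variation across a probe set concrete, the simplest of the three methods, the Wallace Rule, can be sketched in a few lines: 2 °C for each A or T and 4 °C for each G or C. Note that this is not the nearest-neighbour method used in this study; it is shown only because it needs no parameter tables, and it is generally regarded as a rough estimate for short oligonucleotides. The function names are hypothetical.

```python
def wallace_tm(seq):
    """Wallace Rule estimate of the melting temperature in degrees Celsius."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc

def tm_spread(probes):
    """Difference between the highest and lowest estimated TM in a probe set."""
    temps = [wallace_tm(p) for p in probes]
    return max(temps) - min(temps)
```

For probes of uniform length the Wallace estimate depends only on GC content, which is one way to see why a GC-based pre-selection narrows the TM range before a more accurate method is applied.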
The nearest-neighbour method assumes that the stability of a given base pair depends on the identity and orientation of neighbouring base pairs. A set of thermodynamic parameters must be defined in order to use the nearest-neighbour method, and several such sets have previously been developed (e.g. Breslauer et al., 1986; SantaLucia et al., 1996; Sugimoto et al., 1996; Allawi and SantaLucia, 1997). At the time of writing, the most commonly used and accurate (SantaLucia, 1998) set of parameters is that described by Allawi and SantaLucia (1997), and this is used herein. It must be noted that these parameters are based on measures obtained in solution and not on solid support where hybridization behaviour is different. Despite this fact, it is suspected that thermodynamic parameters measured in solution can be used to predict the TM of probes mounted on chips approximately (Li and Stormo, 2001).
It is suggested that the melting temperature reduction is applied last in order to bring the number of probes remaining from the other reductions down to a number which can be fitted on the DNA chip (currently $10^6$). The reduction should calculate the TM of the probes using the nearest-neighbour method and select a subset with as equal TMs as possible.
| IMPLEMENTATION |
|---|
|
|
|---|
Data structure
To store the large amounts of data needed to create the probes, a lookup table was used. Lookup tables were used as these enable insertions and retrievals of time complexity O(1). In a lookup table (a realization of an unordered dictionary abstract data type, see e.g. Goodrich and Tamassia, 1998, p. 247) each object, here a sequence, can be retrieved quickly using keys. Because all sequences that are handled simultaneously are of uniform length, the following one-to-one function can be used to map a sequence S to a unique key I(S)
![]() |
Because the lookup table is an unordered dictionary and the key-object relationship is one-to-one, storing the object at the position given by its key is unnecessary. The object can be derived by the key itself, by reversing the one-to-one mapping function. This associative property enables the lookup table to be implemented by an array of bits with one bit for every possible key. A 1 in a specific position indicates that the sequence with the key corresponding to that position is present in the table, a 0 that it is not.
The only requirement for using the described lookup table is that n is the same for all sequences so that the mapping function remains one-to-one.
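The bit-array lookup table can be illustrated as follows. The mapping shown is an assumption: it treats a sequence as a base-4 number with A = 0, C = 1, G = 2, T = 3 and the first base as the most significant digit. Any one-to-one mapping of this kind would serve, and the original function I(S) may order the digits differently. The names `encode`, `decode` and `BitTable` are hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode(seq):
    """Map a sequence to its integer key (base-4 value, assumed digit order)."""
    key = 0
    for base in seq:
        key = key * 4 + CODE[base]
    return key

def decode(key, n):
    """Recover the length-n sequence from its key by reversing the mapping."""
    bases = []
    for _ in range(n):
        bases.append(BASES[key % 4])
        key //= 4
    return "".join(reversed(bases))

class BitTable:
    """One bit per possible sequence of length n: 1 = present, 0 = absent."""

    def __init__(self, n):
        self.n = n
        self.bits = bytearray((4 ** n + 7) // 8)

    def add(self, seq):
        key = encode(seq)
        self.bits[key >> 3] |= 1 << (key & 7)

    def __contains__(self, seq):
        key = encode(seq)
        return bool(self.bits[key >> 3] & (1 << (key & 7)))
```

With this layout a table for n = 15 occupies 4^15 / 8 bytes, i.e. 128 MiB, which indicates why the text later discusses externally stored tables and cache behaviour.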
| ALGORITHM |
|---|
|
|
|---|
Sequences with mismatches
Removal of sequences with mismatches is performed by application of reduction rule A to reduce false positive hybridization signals. An algorithm (utilizing lookup tables) for calculating the set of sequences with a certain maximum number of mismatches from the sequences in the reference genome has been devised.
The algorithm proposed here works as follows. Call the table containing the perfect subsequences of the reference genome the source table. A position in the new table is set to 1 if one or both of the following criteria are satisfied: (1) The same position in the source table contains a 1. (2) One of the 3n sequences that has one mismatch compared to the sequence mapped to the position is contained in the source table.
The strength of this algorithm comes into play when the source table contains a large number of sequences, i.e. the reference genome contains many distinct subsequences. Then, criterion 2 of the above criteria is more likely to be satisfied at an early stage, enabling the algorithm to perform fewer lookups for each sequence and thus run faster. As an example, if exactly half of the positions in the source table were marked off, the theoretical average number of lookups that would have to be performed when creating a position in the table with mismatches would be 3n/2.
Based on the newly created table, efficient computation of sets with even more mismatches is possible. Using the new set as a source table, all sequences with a maximum of two mismatches are created by another run of the algorithm. This process can be repeated to create the set with the desired number of mismatches, denoted m, and due to the property discussed above runs faster for higher values of m.
The approach of generating the sequences for m in increments of 1 is useful in the overall probe design for the method proposed in this paper. Because it is impossible to exactly determine the number of sequences removed by the reduction beforehand, calculating this number for several values for m will help determine the best suitable parameter for a specific reference genome and probe length.
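The iterative construction described above can be sketched as follows. This is a simplified illustration, not the authors' code: it uses one byte per key instead of one bit, keeps the whole table in memory, and assumes keys are base-4 encodings of the sequences. The names `neighbours`, `add_one_mismatch` and `with_mismatches` are hypothetical.

```python
def neighbours(key, n):
    """Yield the 3n keys that differ from `key` in exactly one base-4 digit."""
    for pos in range(n):
        weight = 4 ** pos
        digit = (key // weight) % 4
        for other in range(4):
            if other != digit:
                yield key + (other - digit) * weight

def add_one_mismatch(source, n):
    """Build a new table marking every key that is in `source` or has at
    least one single-mismatch neighbour in `source`."""
    target = bytearray(len(source))
    for key in range(4 ** n):
        # any() stops at the first hit, so dense source tables need
        # fewer lookups per key.
        if source[key] or any(source[k] for k in neighbours(key, n)):
            target[key] = 1
    return target

def with_mismatches(source, n, m):
    """Apply the one-mismatch step m times, giving all keys within m mismatches."""
    table = source
    for _ in range(m):
        table = add_one_mismatch(table, n)
    return table
```

Counting the marked positions after each round, e.g. `sum(with_mismatches(source, n, m))` for m = 0, 1, 2, gives the number of probes removed for each candidate value of m.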
Optimizations of the mismatch algorithm
A factor that decreases the performance of the presented algorithm is constantly accessing different sections of the source table. Numerous reading operations at distinct locations in a large externally stored lookup table result in poor utilization of the system cache. To prevent this, consecutive lookups should be performed at positions as spatially close to each other as possible. Additionally, if two or more lookups at positions located next to each other are required within a relatively small time frame, one longer block of data should be read (a block that would contain data for more than just the next lookup) instead of several shorter ones.
An observation was made concerning the integer mappings of sequences and those of their corresponding sequences with one mismatch. It was found that the integer mappings of the sequences with one mismatch from the sequence mapped by integer X are closely linked to the integer mappings of the sequences with one mismatch from the sequence mapped by integer Y, if X and Y are of relatively the same order. The smaller the difference between X and Y, the greater the relationship between the two sets of mapped integers seems to be. This property allows for an optimization that exploits the principle of spatial locality and larger data blocks: instead of only reading the exact positions in the source table required to determine position X, larger data blocks are read at each of these locations. Because of the observed clustering of the positions of the sequences with mismatches, the data required to fill in positions X + 1, X + 2, ..., X + l will then also be read. The value of l depends on the size of the read blocks, where larger blocks lead to a higher value for l, and vice versa (the maximum value for l is given by the amount of available RAM). Moreover, by reading a sorted set of clusters, sequences that are varying less in terms of location in the source table will be read within a shorter temporal interval, leading to better utilization of the cache.
The melting temperature reduction
As stated, the melting temperature reduction is suggested to be performed last. However, the number of probes remaining before application of this reduction may be very large, and selecting a subset of probes with similar TMs from a large set of sequences is not computationally trivial as this involves extensive calculations using the nearest-neighbour method, and requires a very efficient selection method. Here follows an algorithm for selecting appropriate probes based on melting temperature.
The proposed algorithm for selecting probes based on TM starts by removing probes based on GC content. The TM of a sequence is closely linked to its GC content as the increased number of hydrogen bonds in a GC-rich sequence makes for a more stable molecule. This property is reflected in the nearest-neighbour parameters. The TM of the remaining probes can then be accurately estimated using the nearest-neighbour method, and the best subset selected. For example, if $10^8$ sequences are to be reduced to $10^6$ based on TM, initial retention of only $2 \times 10^6$ sequences based on their GC content and removal of all other sequences leaves a more manageable number of sequences on which the nearest-neighbour method can be applied to select an appropriate subset. Leaving $2 \times 10^6$ probes from which $10^6$ should be selected for use on the DNA chip seems sufficient for probe lengths up to at least 20; a larger number of probes does not enable a significantly smaller melting temperature spread to be obtained (Nesvold,H., unpublished data).
The last selection is performed by sorting the remaining sequences based on TM using an efficient sorting algorithm such as quicksort (Hoare, 1962), and selecting an appropriate subset. It is suggested that the GC-based selection is performed as follows, where X denotes the probes on which the selection should be performed. First, the GC content of each probe in X is found and stored in an array of length n + 1 where position c in the array contains the number of probes with GC content c in X. From these data the number of probes of each GC content that should be removed to reduce the number of probes to the desired amount is found, either by an algorithm or by hand. The probes in X are finally iterated once more and the calculated number of probes for each GC content removed.
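The two-stage selection can be sketched as below. Several parts are assumptions made for illustration: the rule for deciding how many probes to remove per GC class (here, classes closest to a target GC count are kept first) is only one possible choice, the TM estimator is passed in by the caller rather than being the nearest-neighbour method, and Python's built-in sort is used in place of quicksort. All function names are hypothetical.

```python
def gc_count(seq):
    return seq.count("G") + seq.count("C")

def gc_histogram(probes, n):
    """Array of length n + 1; position c holds the number of probes with GC count c."""
    hist = [0] * (n + 1)
    for p in probes:
        hist[gc_count(p)] += 1
    return hist

def removal_quota(hist, keep, target_gc):
    """Number of probes to remove per GC class so that about `keep` remain,
    keeping classes nearest to `target_gc` first (assumed rule)."""
    remove = list(hist)
    remaining = keep
    for c in sorted(range(len(hist)), key=lambda c: (abs(c - target_gc), c)):
        kept = min(hist[c], remaining)
        remove[c] -= kept
        remaining -= kept
    return remove

def gc_prefilter(probes, n, keep, target_gc):
    """Second pass over the probes: drop the computed number from each class."""
    quota = removal_quota(gc_histogram(probes, n), keep, target_gc)
    kept = []
    for p in probes:
        c = gc_count(p)
        if quota[c] > 0:
            quota[c] -= 1      # this probe is removed
        else:
            kept.append(p)
    return kept

def tightest_tm_subset(probes, tm, size):
    """Sort by estimated TM and return the `size` consecutive probes with the
    smallest TM spread. Assumes size <= len(probes)."""
    ranked = sorted(probes, key=tm)
    start = min(range(len(ranked) - size + 1),
                key=lambda i: tm(ranked[i + size - 1]) - tm(ranked[i]))
    return ranked[start:start + size]
```

Because the sorted list is scanned with a fixed-size window, the final selection costs one sort plus a linear pass, which keeps the expensive TM estimate to one evaluation per probe if the values are cached by the caller.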
| RESULTS |
|---|
|
|
|---|
To verify the proposed method, probe sets were created for Arabidopsis thaliana using ecotype Columbia (Col-0) as the reference genome. Arabidopsis was selected because it is one of the few thoroughly studied and fully sequenced higher eukaryotes available today (The Arabidopsis Genome Initiative, 2000). All subsequences from both strands of the Col-0 ecotype were used in reduction A (NCBI GenBank, accession numbers NC003070.3, NC003071.2, NC003074.3, NC003075.1 and NC003076.1). Probe sets were created for different probe lengths n and number of mismatches m. The probe sets were created as previously described, with an average GC content as close to 50% as possible serving as the initial TM reduction. A 50% GC content was chosen as this seemed like a reasonable mean of what potential inserts may contain, thus minimizing the average error in GC content, and thus TM, between probes and insert. The second stage of the TM reduction created probe sets with 1°C spread in TM according to the nearest-neighbour method. The percentage of all removed probes that is attributable to each reduction rule is reported in Figure 2.
Artificial inserts of length 5000 bp were randomly generated and tested against the probe sets. If one or more of the subsequences of length n in the artificial insert matched a probe on the chip the insert was said to be detected. A total of 3000 inserts were created: 1000 with 33% GC bias, 1000 with 50% GC bias and 1000 with 67% GC bias. The results are presented in Table 1. The number of probes in the sets for lengths <12 was very low as most probes were removed in reduction A (e.g. 4750 probes were left for length 11). These probe sets are not studied further. Observe that the percentage of inserts found drastically drops when the length of the probes is >16. Note also that when the GC content of the probes and the insert is more similar the insert is more likely to be found.
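The simulated detection test can be sketched as follows. This is a minimal illustration under stated assumptions: the probe set is held as a Python set of strings, GC bias is modelled by independent draws per position, and only the given strand of the insert is scanned, as in the description above. The function names and the `seed` parameter are hypothetical.

```python
import random

def random_insert(length, gc, rng):
    """Random sequence in which G and C together have probability `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))

def is_detected(insert, probes, n):
    """True if at least one length-n subsequence of the insert matches a probe."""
    return any(insert[i:i + n] in probes
               for i in range(len(insert) - n + 1))

def detection_rate(probes, n, gc, trials=1000, length=5000, seed=0):
    """Fraction of randomly generated inserts detected by the probe set."""
    rng = random.Random(seed)
    hits = sum(is_detected(random_insert(length, gc, rng), probes, n)
               for _ in range(trials))
    return hits / trials
```

Running `detection_rate` with `gc` set to 0.33, 0.50 and 0.67 would correspond to the three groups of 1000 artificial inserts described in the text.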
In the last two columns of Table 1 results from tests utilizing real sequences are reported. In the first of these columns, the number of theoretical hybridizations found between each probe set and a known T-DNA insert in A.thaliana (The Salk Institute Genomic Analysis Laboratory, N560156 B5; http://signal.salk.edu/pBIN-pROK2.txt-new) of length 4480 bp (50.6% GC) is given. In the very last column 85 transgenic plant sequences of lengths between 5000 and 10 000 bp, and not originating from A.thaliana, were used. These sequences were found by using the search "transgenic AND plant AND NOT Arabidopsis thaliana" on the EMBL SRS (http://srs.ebi.ac.uk) search page in May 2004. For each of the sequences a simulated insertion of the sequence into the A.thaliana genome was performed, followed by simulated hybridization to the previously described Col-0 ecotype probe sets. The percentage of the sequences detected by at least one probe is reported for each probe set. These results are highly consistent with the results from using the artificially generated inserts.
One of the 85 sequences obtained from the EMBL SRS was used in an additional feasibility study, as an example to further simulate whether an unknown insert could be detected and characterized with the described approach. A probe length of 15 yields a high probability of positive hybridization (Table 1), hence this length was chosen for the further simulation, with m = 1. The chosen example (accession number AY560325, Cloning vector pC1300intB-35SnosEX, complete sequence) yielded 18 positive hybridizations (presented in Table 2), of which two were identical (D and L, Table 2), i.e. altogether 17 oligonucleotides of length 15. To see if some of these oligonucleotides could be joined with perfect overlaps between the 3' end of one probe and the 5' end of another probe (hereafter referred to as terminal overlaps) and consequently be used to establish longer oligonucleotide motifs putatively present in the insert, an alignment was attempted of all 17 probes together with their reverse complements (altogether 34 motifs of 15 bases). This resulted in the generation of one terminal overlap motif of 24 bases containing a 6-base overlap between two of the probes (ctttataccGGCTGTccgtcattt, A and I, Table 2, overlap shown with capital letters), and one terminal overlap motif of 16 bases containing a 14-base overlap between two of the probes (aTACCTGTCCGCCTTt, G and H, Table 2), as well as several motifs with terminal overlaps of four or fewer bases (data not shown). The 24 (AI-F) and 16 (GH-F) base motifs were used as primers in a simulated PCR reaction to try to detect the unknown insert (Fig. 3). Because the orientation of the probe motifs in the unknown insert was not known, reverse complement primers (AI-R and GH-R) had to be included in the simulated experiment, and AI-F had to be combined with GH-F, AI-F with GH-R, AI-R with GH-F and AI-R with GH-R in four separate reactions.
The result when accession number AY560325 was used as template sequence was amplification of a 1190 bp long fragment with the combination of the primers GH-F and AI-R, and no amplification using the other three primer combinations. Sequencing of the 1190 bp long fragment revealed that an additional 3 of the 17 original probe motifs were present in the sequence (P revcom, Q revcom and R revcom, Table 2). Consequently, 7 of the 17 original probe motifs were confirmed present and a 1190 bp long sequence contig was identified and could be subject to further analysis, e.g. to look for open reading frames or other functional elements, to perform similarity searches and to design the setup for further characterization of the insert on the basis of sequences flanking the 1190 bp fragment on both sides. This simulation demonstrates that the described approach is feasible.
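The joining of probes by terminal overlaps can be sketched as a suffix-prefix comparison. This is an illustrative sketch, not the procedure actually used; the function names are hypothetical, and the two probe pairs in the usage note are derived from the 24 and 16 base motifs quoted above rather than copied from Table 2.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def terminal_overlap(left, right):
    """Length of the longest proper suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)) - 1, 0, -1):
        if left[-k:] == right[:k]:
            return k
    return 0

def join(left, right, min_overlap):
    """Merge two probes if their terminal overlap is at least `min_overlap`
    bases long; otherwise return None."""
    k = terminal_overlap(left, right)
    return left + right[k:] if k >= min_overlap else None

# 15-mers implied by the two motifs reported in the text.
probe_a, probe_i = "CTTTATACCGGCTGT", "GGCTGTCCGTCATTT"
probe_g, probe_h = "ATACCTGTCCGCCTT", "TACCTGTCCGCCTTT"
```

Here `join(probe_a, probe_i, 5)` reproduces the 24 base motif and `join(probe_g, probe_h, 5)` the 16 base motif. In a full run, each of the 17 probes and its `reverse_complement` would be compared against all others, and the four forward and reverse primer combinations would then be tested separately.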
The probability for receiving false positive hybridization signals was also examined for the probe sets created for Arabidopsis. However, as there currently exists only one large-scale sequencing of a Columbia accession (Col-0), examining the extent of false positives by theoretical application of another Columbia sequence was not possible. Instead, the number of false positives was estimated based on a previous analysis of genetic diversity in Arabidopsis. Bergelson et al. (1998) have estimated the nucleotide diversity within populations of A.thaliana. Based on a study of three populations containing a total of 18 ecotypes, the average intrapopulation nucleotide diversity (average pairwise number of differences/effective number of sites) was found to be 0.0004. The number of bases differing between two arbitrary A.thaliana Col-0 specimens is then found by multiplying the number of bases in both strands of A.thaliana Col-0 by this factor. Furthermore, based on the number of mismatched bases between two samples, the maximum number of subsequences in the sample that may yield false positive hybridization signals to a certain probe set can be determined by finding the maximum number of subsequences containing at least m + 1 mismatched bases, where m is the mismatch parameter used in the design of the specific probe set as before. This number is given by
![]() |
|
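The first step of the estimate above (nucleotide diversity multiplied by the number of bases in both strands) can be sketched as follows. This is a minimal illustration only: the function name is hypothetical, and the genome length of 10^8 bp is a round order-of-magnitude assumption, not the exact sequence length used by the authors. The subsequent bound on false positive subsequences is not reproduced here.

```python
def expected_differing_bases(genome_length_bp: int, nucleotide_diversity: float) -> int:
    """Estimate how many bases differ between two specimens.

    Both strands are counted, so the single-strand genome length is doubled
    before it is multiplied by the nucleotide diversity.
    """
    return round(2 * genome_length_bp * nucleotide_diversity)


# Assumption: round order-of-magnitude genome length, not the exact Col-0 length.
ASSUMED_GENOME_LENGTH_BP = 100_000_000
# Average intrapopulation nucleotide diversity from Bergelson et al. (1998).
NUCLEOTIDE_DIVERSITY = 0.0004

print(expected_differing_bases(ASSUMED_GENOME_LENGTH_BP, NUCLEOTIDE_DIVERSITY))
```

Under these assumed inputs the estimate is on the order of 10^4 to 10^5 differing bases, which is the quantity the false positive bound is then derived from.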
| DISCUSSION |
|---|
|
|
|---|
The proposed model is universal in that it can be applied to all organisms, but the length of the reference genome sets a theoretical limit to the minimum value for n (Fig. 1). Furthermore, the limited number of probes on the chip sets a maximum probe length (Table 1) for which detection of the insert is sufficiently probable, given its presence. However, this last limit will increase with the number of probes on a chip, or with the number of chips. The method could easily be extended to utilize more chips. Only one chip has been chosen, for two reasons. First, the cost of synthesizing the chips is very high. Second, if a single chip suffices to detect an unknown GMO, there is no reason to include more sequences and hence more false positive hybridization signals. When designing a probe set for a new reference genome, the optimal values for n and m and the required number of probes must be found by conducting studies similar to those performed for Arabidopsis in this paper.
As shown, it is possible to create a chip for unknown GMO detection which theoretically has a high probability of detecting inserts introduced into an A.thaliana specimen (Table 1). The direct use of hybridization data to set up experiments to isolate and characterize putatively unknown sequences is demonstrated in simulated hybridization, PCR and sequencing experiments (Table 2, Fig. 3). Estimates for the number of false positives have also been computed (Table 3). The ability to detect and identify inserts and distinguish between true and false positive signals will depend on a combination of many factors. These include design of the optimal chip, the inherent genetic variation in the taxon examined, the physicochemical reaction conditions in hybridization experiments, data interpretation and acquired experience. From the theoretical calculations with artificial inserts and A.thaliana, a probe length of 15 or 16 seems preferable, with m = 1. The estimated number of false hybridization signals obtained for these sets ranged from 147 to 548 (Table 3), but this is most likely a large overestimate, as discussed earlier.
The impact of false positive signals can be significantly reduced by alignment analysis. The genome size of A.thaliana is on the order of 10^8 bp. The size of an insert is normally <10^4 bp. In the simulated hybridization and PCR experiment starting with probe length n = 15 and using accession number AY560325 as the simulated insert, the number of true positives was 17 (Table 2, Fig. 3), and terminal overlaps were identified (Fig. 3). The corresponding number of estimated false positives with n = 15 was 1175 with m = 0 and 548 with m = 1, respectively (Table 3). Thus, the probability of finding terminal overlaps between false positives is practically negligible in comparison, provided they are randomly distributed throughout the genome. Reducing n will increase the probability of finding true positive motifs with terminal overlaps and result in longer sequence motifs suitable as PCR primers for isolation and characterization of the insert. Contrary to what would be expected, it may therefore be more desirable to use short probes to produce probe sets yielding a sufficiently high probability of finding terminal overlaps between probes to identify the true positive signals.
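The search for terminal overlaps between hybridized probes can be illustrated with a short sketch. It is not the authors' implementation: the function names, the minimum overlap threshold and the three 15-mer probe sequences are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def terminal_overlap(left: str, right: str, min_overlap: int) -> int:
    """Return the longest k >= min_overlap for which the last k bases of
    `left` equal the first k bases of `right`, or 0 if there is none."""
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    longest_possible = min(len(left), len(right)) - 1
    for k in range(longest_possible, min_overlap - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def overlapping_pairs(probes, min_overlap):
    """List (left, right, k) for every ordered probe pair with a terminal overlap."""
    hits = []
    for left, right in permutations(probes, 2):
        k = terminal_overlap(left, right, min_overlap)
        if k:
            hits.append((left, right, k))
    return hits


def merge(left: str, right: str, k: int) -> str:
    """Join two probes that overlap by k bases into one longer motif."""
    return left + right[k:]


# Hypothetical probe sequences (n = 15); the third one overlaps with neither.
probes = ["ACGTACGTTGCAATG", "TTGCAATGCCGTAAC", "GGGGGCCCCCAAAAA"]
for left, right, k in overlapping_pairs(probes, min_overlap=6):
    print(left, right, k, merge(left, right, k))
```

In this toy example only the first two probes share a terminal overlap (8 bases), giving a merged 22-base motif. Isolated probes without overlap partners are the ones more likely to be false positives under the random-distribution assumption stated above.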
Verification of hybridization signals can be achieved by repeated BLAST searches using the sequences of the probes that hybridized. Preferably, these sequences should be aligned where possible to obtain longer search sequences. Results from such analyses may indicate that the biological sample should be studied further to verify the presence of a potential insert, or may dismiss all signals as false positives (i.e. the result of the similarity search is either DNA from the plant itself or known DNA from organisms that could naturally occur in the field). Moreover, it is possible to further reduce the number of false positives for a specific chip, either by reduction against additional sequenced specimens as these become available (e.g. different ecotypes of A.thaliana), or by removal of probes which have yielded false positives in analogous experiments in the past. In this way, the probes on a chip are improved with each use in that they most likely yield fewer false positive signals and hence are also more likely to correspond to interesting subsequences in the applied sample.
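The two refinement steps described above (reduction against an additional sequenced specimen, and removal of probes with a history of false positives) can be sketched as a simple set filter. This is an illustration under simplifying assumptions: only exact matches are removed (corresponding to m = 0), the function names are hypothetical, and the toy sequences use n = 4 for readability.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def subsequences_both_strands(sequence: str, n: int) -> set:
    """Collect every length-n subsequence of the sequence and of its reverse complement."""
    strands = (sequence, reverse_complement(sequence))
    return {s[i:i + n] for s in strands for i in range(len(s) - n + 1)}


def refine_probe_set(probes, extra_reference: str, n: int, known_false_positives=frozenset()):
    """Drop probes found (exactly) in the extra reference or flagged in earlier experiments."""
    present = subsequences_both_strands(extra_reference, n)
    return {p for p in probes if p not in present and p not in known_false_positives}


# Hypothetical toy data: a 6-base "additional specimen" and four 4-mer probes.
probes = {"AACC", "GGTT", "TTTT", "ACAC"}
remaining = refine_probe_set(probes, "AACCGG", n=4, known_false_positives={"ACAC"})
print(sorted(remaining))
```

Extending the filter to tolerate up to m mismatches would require an approximate matching step, which is omitted here.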
The chip is foreseen to be used for testing of single plants only, but normally any plant material from fields will include infecting, endophytic and epiphytic organisms. Infections or material from other organisms in the field might be another source of unwanted hybridizations. These unwanted hybridizations could be identified by BLAST searches, and probably even before that step. The relative quantities of the plant and other genetic material most likely vary significantly over a whole plant. While some subsamples of the plant such as single cells or leaves may contain many copies of the other genetic material, e.g. in case of viral infection, other subsamples may contain few or no copies of that genetic material. Thus, given that the material subject to testing will be derived from single plant specimens and should include several tissues, possibly tested separately, it is probable that a combination of low sensitivity (detecting mainly high copy number genetic material) and subtraction of signals that are not reproduced in all the tissue samples would provide a good basis for identification of likely true positives.
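The subtraction of signals that are not reproduced in all tissue samples amounts to a set intersection over the per-tissue results. The sketch below assumes that each tissue has already been reduced to a set of hybridizing probe identifiers; the function name, tissue names and probe identifiers are hypothetical.

```python
def reproducible_signals(signals_by_tissue: dict) -> set:
    """Keep only the probe signals observed in every tissue sample."""
    tissue_sets = [set(signals) for signals in signals_by_tissue.values()]
    if not tissue_sets:
        return set()
    return set.intersection(*tissue_sets)


# Hypothetical hybridization results for three tissues of one plant specimen.
signals_by_tissue = {
    "leaf": {"P1", "P2", "P7"},   # P7 might stem from a local infection
    "root": {"P1", "P2"},
    "stem": {"P1", "P2", "P9"},
}
print(sorted(reproducible_signals(signals_by_tissue)))
```

Signals such as P7 and P9, which appear in a single tissue only, are discarded as candidates for material from other organisms, while P1 and P2 remain as likely true positives.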
The estimates of false positives in Table 3 are worst case and based on an estimate of the A.thaliana nucleotide substitution rate. Although testing of multiple tissue samples for each individual plant specimen and compilation of data on previous false positive signals may aid in the identification of probes unfit for detection of unknown GMOs, the end-user may have to further examine a fairly large number of signals. Sequence similarity searches, and comparison of the sequences of the probes yielding positive signals may be applied to obtain some knowledge about the possible nature of the sequences responsible for the positive signals. If the multiple positive signals are derived from a genetic modification, several of the probe motifs should be found in close proximity in a string of DNA that may be amplified from the test sample, e.g. by a PCR applying probe motifs as PCR primers (Table 2).
It should be noted that reduction B could most certainly be improved. Removing more probes corresponding to unlikely sequences increases the probability of including a useful probe, and hence the probability of detecting the insert.
So far, the method has been tested only theoretically, and a laboratory test is the natural next step. Experimental hybridization conditions must be adapted to ensure that probes hybridize with the desired degree of specificity independent of the chosen probe length. These adjustments would primarily focus on the temperature and buffer conditions. Short probes generally yield lower specificity than long probes, but long probes may be applied under conditions that allow for more mismatches, in which case the specificity is reduced correspondingly. Short probes applied under very stringent conditions may be very efficient, as shown in the present study. Chemical and structural modifications of the probes, e.g. PNA or MGB probes, may also increase their specificity without necessarily increasing their size.
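As a rough illustration of how hybridization temperature depends on probe length and base composition, the sketch below applies the Wallace rule, a widely used rule of thumb for short oligonucleotides (2 °C per A or T, 4 °C per G or C). It is an assumption that this rule is a suitable first approximation here; it is not presented as the model used by the authors, and the probe sequence is hypothetical.

```python
def wallace_tm(probe: str) -> int:
    """Approximate melting temperature (degrees Celsius) of a short
    oligonucleotide using the Wallace rule: 2*(A+T) + 4*(G+C)."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    if at + gc != len(probe):
        raise ValueError("probe must contain only A, C, G and T")
    return 2 * at + 4 * gc


# Hypothetical 15-mer probe: 8 A/T bases and 7 G/C bases.
print(wallace_tm("ACGTACGTTGCAATG"))
```

Nearest-neighbour thermodynamic models give more accurate estimates and would be preferable when tuning stringency for a real probe set.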
The power of the presented model obviously lies in the high number of probes. Consequently, as the number of probes that can be fitted onto a single DNA chip continues to increase, so will the potential for detecting unknown GMOs using the described GMO chip. This study presents a theoretical basis for detection of unknown GMOs and the described approach will be further explored in molecular experiments at the National Veterinary Institute, Norway.
| Acknowledgments |
|---|
The authors would like to thank two anonymous reviewers for valuable suggestions and comments.
Received on June 29, 2004; revised on November 15, 2004; accepted on December 20, 2004
| REFERENCES |
|---|
|
|
|---|
Allawi, H.T. and SantaLucia, J., Jr. (1997) Thermodynamics and NMR of internal G·T mismatches in DNA. Biochemistry, 36, 10581–10594.
Anklam, E., Gadani, F., Heinze, P., Pijnenburg, H., Van den Eede, G. (2001) Analytical methods for detection and determination of genetically modified organisms in agricultural crops and plant-derived food products. Eur. Food Res. Technol., 214, 3–26.
Bergelson, J., Stahl, E., Dudek, S., Kreitman, M. (1998) Genetic variation within and among populations of Arabidopsis thaliana. Genetics, 148, 1311–1323.
Bonfini, L., Heinze, P., Kay, S., Van den Eede, G. (2002) Review of GMO detection and quantification techniques. EUR 20384/EN.
Breslauer, K.J., Frank, R., Blöcker, H., Marky, L.A. (1986) Predicting DNA duplex stability from the base sequence. Proc. Natl Acad. Sci. USA, 83, 3746–3750.
Campbell, A., Mrázek, J., Karlin, S. (1999) Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA. Proc. Natl Acad. Sci. USA, 96, 9184–9189.
Cox, R. and Mirkin, S.M. (1997) Characteristic enrichment of DNA repeats in different genomes. Proc. Natl Acad. Sci. USA, 94, 5237–5242.
European Commission. (1997) Council Regulation (EC) No 258/97 of 27 January 1997 concerning novel foods and novel food ingredients. Official J., L 043, 1–5.
European Commission. (2001) Council Directive 2001/18/EEC of 12 March 2001 on the deliberate release into the environment of genetically modified organisms and repealing Council Directive 90/220/EEC. Commission Declaration. Official J., L 106, 1–39.
European Commission. (2003) Commission Regulation (EC) No. 1829/2003 of 22 September 2003 on genetically modified food and feed. Official J., L 268, 1–23.
FMBJ. (2000) Notification No. 1775 (June 10) Food and Marketing Bureau, Ministry of Agriculture, Forestry and Fisheries of Japan, Tokyo.
Goodrich, M.T. and Tamassia, R. (1998) Data Structures and Algorithms in Java. John Wiley & Sons, NY.
Grimaldi, R.P. (1999) Discrete and Combinatorial Mathematics: An Applied Introduction, 4th edn. Addison-Wesley, Reading, MA.
Gupta, A. (2000) Governing trade in genetically modified organisms. The Cartagena Protocol on Biosafety. Environment, 42, 22–23.
Hoare, C.A.R. (1962) Quicksort. Comput. J., 5, 10–15.
Holst-Jensen, A. (2003) Advanced DNA-based detection techniques for genetically modified food. In Lees, M. (Ed.). Food Authenticity and Traceability. Woodhead Publishing, Cambridge, pp. 575–594.
Holst-Jensen, A., Rønning, S.B., Løvseth, A., Berdal, K.G. (2003) PCR technology for screening and quantification of genetically modified organisms (GMOs). Anal. Bioanal. Chem., 375, 985–993.
Lander, E.S. (1999) Array of hope. Nat. Genet., 21, 3–4.
Li, F. and Stormo, G.D. (2001) Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17, 1067–1076.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., Brown, E.L. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol., 14, 1675–1680.
MAFK. (2000) Notification No. 2000-31 (April 22) Ministry of Agriculture and Forestry of Korea, Seoul.
Marmur, J. and Doty, P. (1962) Determination of the base composition of deoxyribonucleic acid from its thermal denaturation temperature. J. Mol. Biol., 5, 109–118.
Miraglia, M., Berdal, K.G., Brera, C., Corbisier, P., Holst-Jensen, A., Kok, E., Marvin, H.J.P., Schimmel, H., Rentsch, J., van Rie, J.P.P.E., Zagon, J. (2004) Detection and traceability of genetically modified organisms in the food production chain. Food Chem. Toxicol., 42, 1157–1180.
SantaLucia, J., Jr. (1998) A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbour thermodynamics. Proc. Natl Acad. Sci. USA, 95, 1460–1465.
SantaLucia, J., Jr, Allawi, H.T., Seneviratne, P.A. (1996) Improved nearest-neighbour parameters for predicting DNA duplex stability. Biochemistry, 35, 3555–3562.
Suggs, S.V., Hirose, T., Miyake, T., Kawashima, E.H., Johnson, M.J., Itakura, K., Wallace, R.B. (1981) Use of synthetic oligodeoxyribonucleotides for the isolation of specific cloned DNA sequences. In Brown, D.D. and Fox, D.F. (Eds.). Developmental Biology Using Purified Genes. Academic Press, New York, pp. 683–693.
Sugimoto, N., Nakano, S., Yoneyama, M., Honda, K. (1996) Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res., 24, 4501–4505.
Tekaia, F. (2002) Base composition of species complete sequences.
The Arabidopsis Genome Initiative. (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature, 408, 796–815.
Van Belkum, A., Scherer, S., Van Alphen, L., Verbrugh, H. (1998) Short-sequence DNA repeats in prokaryotic genomes. Microbiol. Mol. Biol. Rev., 62, 275–293.