Skip Navigation


Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(6):694-700; doi:10.1093/bioinformatics/btl674
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/6/694    most recent
btl674v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Brodzik, A. K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Brodzik, A. K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Quaternionic periodicity transform: an algebraic solution to the tandem repeat detection problem

Andrzej K. Brodzik

The MITRE Corporation, Bedford MA 01730


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 

Motivation: One of the main tasks of DNA sequence analysis is identification of repetitive patterns. DNA symbol repetitions play a key role in a number of applications, including prediction of gene and exon locations, identification of diseases, reconstruction of human evolutionary history and DNA forensics.

Results: A new approach towards identification of tandem repeats in DNA sequences is proposed. The approach is a refinement of previously considered method, based on the complex periodicity transform. The refinement is obtained, among others, by mapping of DNA symbols to pure quaternions. This mapping results in an enhanced, symbol-balanced sensitivity of the transform to DNA patterns, and an unambiguous threshold selection criterion. Computational efficiency of the transform is further improved, and coupling of the computation with the period value is removed, thereby facilitating parallel implementation of the algorithm. Additionally, a post-processing stage is inserted into the algorithm, enabling unambiguous display of results in a convenient graphical format. Comparison of the quaternionic periodicity transform with two well-known pattern detection techniques shows that the new approach is competitive with these two techniques in detection of exact and approximate repeats.

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 
DNA data contains symbol sequences that do not exhibit an obvious order and sequences made up of symbol patterns that repeat periodically. The later ones arouse interest because they are unexpected and because they provide a convenient visual and numerical reference. DNA repeats can also, in general, be classified, studied and endowed with biological significance easier than random assemblies of symbols.

Many different types of repetitions occur in the DNA data. At the most general level, repetitive patterns can be divided into tandem repeats, dispersed repeats and structural repeats. A tandem repeat consists of two or more adjacent copies of an arbitrary sequence of DNA symbols. The length of the sequence typically varies from a few to a few hundred of bases. Dispersed repeats consist of two or more non-adjacent copies of an arbitrary sequence. Such sequences are often of a greater length than tandem repeat patterns. Both tandem and dispersed repeats occur mainly in non-conserved regions of genomes. Dispersed repeats alone are estimated to comprise about one-half of the human genome (Lander et al., 2001). Structural repeats are defined by an over-representation of subsets of DNA sequences of a certain length. They are indicative, for example, of encoding for amino acids in the transcribable sections of DNA and of the helical structure of the DNA double strand.

The focus of this article is on the best known of those repeats; the tandem repeats. Tandem repeats encode information about the structure and function of DNA and thereby play a key role in a number of applications. One of the most important of these applications is the diagnosis of genetic disorders. Occurrence of tandem repeats and tandem repeat changes in the human genome has been associated in the past with Huntington's disease (HDCRG, 1993), myotonic dystrophy (Fu et al., 1992), Friedreich's ataxia (Campuzano et al., 1996), multiple sclerosis (Guerini et al., 2003), Alzheimer's disease (Licastro et al., 2003), schizophrenia (Brzustowicz et al., 2000) and cancer (Sidransky, 1997). In those cases occurrence of repetitions in specific parts of the genome indicates pathology. In other key applications, such as DNA forensics (Butler, 2003) and reconstruction of human evolutionary history (Tishkoff et al., 2000), tandem repeats allows the differentiation between individuals, or between geographically and temporarily separated populations. Furthermore, as genetic markers can also be used in microbial forensics, tandem repeat analysis provides a powerful tool for investigation of infectious disease outbreaks (Cummings, 2002). Since the availability of the genomic data is relatively recent, and research that is able to fully take advantage of this data is just beginning, the number of tandem repeat applications is likely to grow even further. For these reasons, the design of ever more sensitive and efficient tools for DNA repeat analysis is a task of a considerable importance.

The methods used to detect DNA repeats can be classified as either stochastic or deterministic. A review of some of the more popular techniques was recently given in (Krishnan, 2004). In general, stochastic methods are preferred over deterministic ones, in part because they are thought to be better able to differentiate between significant and insignificant repeats. In practice, this advantage might not be fully realized, as the concepts of statistical and biological significance often diverge (Stolovitzky and Califano, 1998).

The focus of this article is on the deterministic approaches. In contrast to stochastic techniques, deterministic (and, in particular, algebraic) methods have the advantage of being able (at least in principle) to detect all repeats, and of having computational complexity that is independent of the data. Among the deterministic approaches, many rely on spectral analysis of the data. The analysis is typically based either on Fourier (Anastassiou, 2001), Walsh (Tavare and Giddings, 1989) or wavelet transform (Arneodo et al., 1998) space processing. These methods target mainly structural and dispersed repeats. More recently, a time-domain method, called the periodicity transform has been introduced (Buchner and Janjarasjitt, 2003). Unlike the spectral methods, periodicity transform is well suited for tandem repeat detection, and it is also more computationally efficient. Despite these advantages, a wider use of periodicity transform has been limited, however, by several deficiencies that were not resolved in prior formulations. These include: (i) symbol bias that is inherent in the mapping of DNA symbols to complex numbers and which results in missed detections of some repetitive structures; (ii) lack of an appropriate post-processing stage that would remove redundant and insignificant repeats and (iii) absence of a strategy for identification of indels.

All of these deficiencies are addressed in this article. First, replacement of the complex number set in the DNA symbol mapping with the set of quaternions, is proposed. This replacement results in a periodicity transform that detects all repetitive structures. We anticipate that the quaternionic approach can be utilized (via the use of the quaternionic Fourier transform) to improve spectral methods, and (via the use of higher dimensional hypercomplex number systems, such as octonions and sedenions) to facilitate symbolic signal processing in applications utilizing larger than the genomic alphabet sizes. Moreover, the computational complexity of the approach is further reduced by a factor equal to the pattern length, a post-processing stage that reduces repeat redundancy is inserted into the algorithm, and a strategy for identification of indels is implemented and discussed. Finally, it is shown that the approach identifies repeats that are missed by two well-known tandem repeat techniques, TRF (Benson, 1999) and SRF (Sharma et al., 2004).

The article is organized as follows: in Section 2 the periodicity transform is introduced, in Section 3 the quaternionic approach is described, in Section 4 a comparison of the quaternionic approach with representatives of statistical and Fourier space approaches is made, in Section 5 a summary of results is given and in Supplementary Material computational complexity of the QPT and detection of indels is discussed.


    2 PERIODICITY TRANSFORM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 
Periodicity transform was first introduced in the context of speech processing (Sethares and Staley, 1999). It has been subsequently adapted to DNA sequence analysis by Buchner and Janjarasjitt (2003). A brief mathematical description of the formalism is given below.

2.1 Detection of P-periodic repeats
Factor the data size1 N = PR, where Formula are the period and the repetition factor, respectively, and let x be an arbitrary N-point sequence of real numbers, x=x0,x1, ... ,x N-1 . Define the periodicity transform (PT) by


Formula 1

(1)
The periodicity transform identifies all P-periodic subsequences of x. When P is allowed to vary, the periodicity transform identifies all repeats within a given range of P; the positions of those repeats, however, remain unknown.2 In practice, one is often interested in both: detecting occurrence of repetitions and identifying their locations.

Suppose R is not a prime, and choose a factor of R, Formula . In analogy to the short-time Fourier transform, define the short-time periodicity transform (STPT) by


Formula 2

(2)
where Formula and 0≤ s < P . In contrast to PT, which provides a global characterization of repeats, periodic components identified by the STPT are localized within a segment of length Formula .

Set n=s+rP . A more robust repeat indicator is obtained, when the STPT is normalized by the Formula -point segment's ‘power’, and averaged over P shifts, i.e.


Formula 3

(3)
We call this indicator the periodicity detector (PD). Since the sequences in the numerator and the denominator of (3), Formula and x, are evaluated over P and Formula points, respectively, hence the normalization factor Formula .

In the next subsection we will investigate properties of PD of a symbolic sequence, evaluated at a single shift. In preparation, we define the single shift PD,


Formula 4

(4)
Note, that for convenience, contrary to (3), normalization in (4) is performed over a subset of values of x, so that 0≤<x>(n)≤ 1 for all n. Moreover, if |x(n|2=K for all n for some Formula , then


Formula 5

(5)

2.2 An abstract periodic symbol detector
Take an arbitrary Formula -point DNA sequence, x=x0,x1, ... , Formula , and choose any stride by P Formula -point decimation of x, e.g. x0=x0,xP, ... , Formula . Denote by integers Formula and Formula , the count of symbols 'A', 'C', 'G' and 'T' in x0. Consider an assignment of DNA symbols to arbitrary numbers, e.g.


Formula 6

(6)
such that o0!= o1!= o2!= o3 and Formula . It follows from (4), that the single shift periodic symbol detector (PSD) of x can then be expressed as


Formula 7

(7)
where Formula . We call the detector in (7) the abstract periodic symbol detector. The abstract periodic symbol detector needs to satisfy the following conditions:

  1. <x>(0) reaches a minimum when DNA symbols are equally represented in x0. In particular, when Formula is a multiple of four, then <x>(0) reaches a minimum for {alpha} ={gamma} ={delta} =1/4 .
  2. <x>(0) reaches a maximum when x0 is a string of identical symbols.
  3. If {alpha} ≥ β, {gamma}, and {delta} , then replacement of C, G or T with A in x0 increases the value of <x>(0) .
  4. <x>(0) is invariant under permutation of any two symbols.

2.3 Complex assignment of DNA symbols
Consider an assignment of DNA symbols to complex numbers, e.g.


Formula 8

(8)
The complex PSD can then be expressed by the following formula


Formula 9

(9)
Since for {alpha} =β ={gamma} ={delta} =1/4


Formula 10

(10)
and for {alpha} = 1


Formula 11

(11)
the complex PSD satisfies conditions 1 and 2. Condition 3, however, is not met since, for example, for {alpha} =β ={delta} =1/3 and {gamma} = 0


Formula 12

(12)
and for {alpha} =1/2, β =1/6, {delta} =1/3 and {gamma} = 0


Formula 13

(13)
Moreover, since, e.g. for {alpha} ={gamma} =1/2 and β ={delta} =0


Formula 14

(14)
and for {alpha} ={delta} =1/2 and β ={gamma} =0


Formula 15

(15)
the condition 4 is violated as well.

In general, the complex PSD given by equations (8–9Go) is variant under exchange of any two parameters, except for the pairs β and {gamma}, and {alpha} and {delta}. In fact, no assignment of type (8) leads to a detector meeting all four conditions of subsection 2.2. In the next section an alternative DNA symbol mapping satisfying all conditions of the abstract PSD will be given.


    3 THE QUATERNIONIC APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 
Real and complex numbers can be viewed as one- and two-dimensional instances of N-dimensional hypercomplex numbers of the form


Formula 16

(16)
where Formula , i0 = 1 and ij, 0<j≤ N , are symbols called imaginary units. If N = 0 then h is real and if N = 1 and Formula then h is complex. The best known hypercomplex numbers, apart from the real and the complex numbers, are the four-dimensional quaternions and the eight-dimensional octonions.

The concept of quaternions was introduced by William Hamilton in 1843 (Hamilton, 1866), who defined a quaternion as a number of the form


Formula 17

(17)
where Formula , and i, j, k are symbols defined by the following set of rules


Formula 18

(18)
a is called the real, or scalar, part of q and q-a is called the imaginary, or vector, part of q. q-a is also known as the pure quaternion.

Applications of quaternions in signal processing include computer vision, robotics (Sommer, 2001), and color and hyperspectral image processing (Sangwine, 1996). For an elementary treatment of quaternions, see (Kantor and Solodovnikov, 1989).

3.1 Quaternionic assignment of DNA symbols
Consider the assignment of DNA symbols to pure quaternions, e.g.


Formula 19

(19)
The quaternionic PSD can then be expressed by the following formula


Formula 20

(20)

It is straightforward to verify that the quaternionic PSD satisfies all four conditions of the abstract PSD.

One of reviewers raised the issue of unequal mutation probabilities of DNA symbols and suggested that it might be useful to appropriately weight different symbol repeats according to these probabilities. Our view is that an evaluation of such probabilities is not fixed but evolves over sequence length and type, and that an insertion of an appropriate bias accurately reflecting these probabilities is not possible within the context of a complex or quaternionic symbol detector without violating the measure property. While we agree that symbol bias is an important factor in evaluating repeats, we think an appropriate weighting or filtering of repeats, reflecting this bias, should be deferred to a later stage that is not directly coupled with the main algorithm.

3.2 Performance comparison of CPT and QPT
A detailed performance comparison of complex and quaternionic detectors was given in (Brodzik and Peters, 2005). It was shown that use of the CPT might result in missed detection of some nested tandem repeats, while the QPT guaranties detection of all patterns. In an example given in (Brodzik and Peters, 2005), the CPT detected the repeat (AGG)4(TCC)4 , but missed the repeat (AGG)4(CTT)4 . Here, we remark on the impact of numerical assignment of DNA symbols on threshold selection in the analysis of approximate repeats.

Consider the case of Formula , and denote by x{leftrightarrow} y a symbol error resulting from replacement of x with y, or y with x. It follows directly from formulas (5) and (9) that for {varepsilon} symbol errors of the type c{leftrightarrow} g or a{leftrightarrow} t , the CPT is


Formula 21

(21)
Conversely, for {varepsilon} symbol errors of the type c{leftrightarrow} t or a{leftrightarrow} g , the CPT is


Formula 22

(22)
For example, for P = 10, the CPT thresholds guaranteeing detection of repeats with at most one and two errors are


Formula

respectively. Since Tc2(10,1)=Tc1(10,2) , there is no threshold that unambiguously detects all 10mers with one error or less.

In contrast to the CPT threshold, it follows from formula (20) that the QPT threshold,


Formula 23

(23)
is identical for all symbol error types.

As an illustration consider two simple VNTR (variable number tandem repeats) of period P = 3, containing a single error in the last symbol: ATTATA... and ACCACA... . Use of (9) and (5) leads to Formula for the first repeat, and Formula for the second repeat. Use of (20) and (5) leads to Formula for either of the two repeats.


    4 EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 
Two repetition-rich well-known sequences, M65145 [GenBank] and U43748 [GenBank] , have been investigated. The performance of the QPT has been compared with the performance of the Tandem Repeat Finder (TRF, Benson, 1999) and the Spectral Repeat Finder (SRF, Sharma et al., 2004). The SRF is a recently proposed approach that is a representative of the Fourier space based methods. These methods search for the most frequent repetitions by identifying the dominant frequencies in the short-time Fourier transform spectrum of the sequence. The chief advantage of the SRF over other methods is its ability to detect dispersed repeats. The TRF is a pattern detection method that relies on statistically based recognition criteria. It is the most widely used method for identification of exact and approximate tandem repeats.

The parameters of the QPT are the period P, the repeat number Formula and the threshold T. The period was set to 1≤ P ≤ 12 in the first experiment and to 1≤ P ≤ 48 in the second experiment, to conform with previous analyses. The remaining two parameters, Formula and T, are co-dependent and determine sensitivity of the detector to approximate and non-adjacent repeats. In general, large values of Formula and small values of T increase detector sensitivity to moderately dispersed and approximate tandem repeats. Since the focus of the analysis was on exact and almost exact tandem repeats, Formula has been used in both experiments.

Figures 1 and 2 provide graphical display of results. The figures show the raw QPT [computed via equation (3)] and the post-processed QPT. The goal of post-processing is to remove ambiguous and insignificant repeats. These include: (i) repeats with symbol errors exceeding the threshold; (ii) short duration repeats of less than a pre-determined number of DNA symbols;3 (iii) alias repeats occurring at periods that are multiples of the fundamental period; (iv) repeats that are contained in larger repeats.


Figure 1
View larger version (69K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Analysis of approximate tandem repeats in human microsatellite sequence M65145: raw periodicity transform (top) and post-processed periodicity transform (bottom). QPT parameters: N=1085, 1≤ P≤ 12, T=0.85 , Figure 1, minimum repeat length = 12.

 

Figure 2
View larger version (63K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Analysis of approximate tandem repeats in the human frataxin gene, sequence U43748. Plots: raw periodicity transform (top) and post-processed periodicity transform (bottom). QPT parameters: N=2520, 1≤ P≤ 48, T=0.9 , Figure 2, minimum repeat length = 12.

 
4.1 M65145
In the first experiment an analysis of repeats in the human microsatellite sequence M65145 [GenBank] , considered previously in Sharma et al. (2004), was performed. Figure 1 and Table 1 summarize results of the analysis.


View this table:
[in this window]
[in a new window]

 
Table 1. Exact and approximate tandem repeat patterns of sequence M65145 detected by QPT and/or TRF. QPT threshold T = 0.85. TRF version 4.0, parameters: (match, mismatch, indels) = (2,7,7), minimum alignment score = 20. Symbol substitutions are denoted by bold face letters. Exact repeats: 2–3, 13. Patterns undetected by TRF: 1, 4, 10–12, 14–15

 
The QPT threshold was set to T = 0.85, and the minimum repeat length (P x copy number) was set to 12. The value of the threshold permits no errors in patterns one to four symbols long, one error in patterns five to eight symbols long, and two errors in patterns nine to 12 symbols long (from (23), {varepsilon} = {lfloor} 0.225P{rfloor} for T = 0.85).

The QPT identified 15 tandem repeats, 3 exact and 12 approximate. The periods of the detected repetitions included all values in the range tested, except for P = 3. Among the approximate repeats, repeats #4, #5, #6 and #7 had double letter substitutions. The remaining approximate repeats had a single letter substitution. None of the detected repeats contained indels.

Two of the exact repeats, the singleton A at position 51 : 69, and the doublet GT at position 860 : 895, contained a large number of copies, and produced clearly visible in the raw transform plot aliases at multiples of the fundamental period (P ≥ 2 in case of the singleton, and P=2, 4, 6, 8, 10 and 12 in the case of the doublet; Fig. 2). These aliases have been removed in the post-processing stage. Moreover, shorter repeats contained within longer repeats have been removed. While the last step is justified by the need to remove ambiguous and/or insignificant repeats, it may also have the undesirable effect of removing exact repeats masked by approximate repeats. An analysis of the M65145 [GenBank] sequence at threshold T = 1.0 revealed presence of exact repeats TGGCCG at position 522 : 533 and GGGGTATCTG at position 433 : 452. These repeats have been removed in the post-processing stage of the approximate repeat analysis, as they were contained by repeats #9 and #7, respectively.

The TRF algorithm (Benson's on-line code; version 4.00) was tested with the following parameter settings: (match, mismatch, indels) = (2,7,7), minimum alignment score = 20. TRF identified nine repeats. Three of these repeats were identical to the QPT repeats #2, #3 and #13. Out of the remaining six repeats identified by TRF, three were nearly identical to the QPT repeats #6, #8 and #9 (they was shifted with respect to the QPT repeats by a single base (#6, #9) or by a double base (#8) and shorter than the QPT repeats by two bases), two were overlapping repeats of the pattern GGGGTATCTG (433 : 456 and 433 : 468), essentially a permutation of the QPT repeat #7, and one was a repeat of the pattern TGACCTTTGGGG, coinciding with and similar to the QPT repeat #5 (131 : 168). The difference in the pattern content and in the length of repetition in the last two cases resulted from lower threshold used by the Benson code, which allowed detection of patterns having a higher number of errors. The TRF missed the QPT repeats #1, #4, #10–12, and #14–15 (one of the reviewers was able to obtain one more QPT repeat, #11, running the TRF with a minimum alignment score = 15). The SRF of Sharma identified only 2 out of the 15 QPT tandem repeats, #5 and #15.

4.2 U43748
In the second experiment an analysis of exact and almost exact repeats in the human frataxin gene sequence U43748 [GenBank] , analyzed previously by Benson, was performed. Figure 2 and Table 2 summarize results of the analysis.


View this table:
[in this window]
[in a new window]

 
Table 2. Exact and approximate tandem repeat patterns of sequence U43748 detected by QPT and/or TRF. QPT threshold T = 0.9. TRF version 4.0, parameters: (match, mismatch, indels) = (2,7,7), minimum alignment score = 30. Symbol substitutions are denoted by bold face letters. Exact repeats: 4, 13–15. Overlapping pattern groups: (5,6), (8,9), (10,11), (12,13), (15,16,17,18). Patterns undetected by TRF: 1–4, 6, 8, 10, 12

 
The QPT threshold was set to T = 0.9, the period range was set to 1≤ P≤ 48 , and the minimum repeat length was set to 12. The QPT threshold allowed for an occurrence of one symbol error for 6≤ P≤ 13 , two symbol errors for 14≤ P≤ 19 , three symbol errors for 20≤ P≤ 26 , four symbol errors for 27≤ P≤ 33 , five symbol errors for 34≤ P≤ 39 and a six symbol errors for 40≤ P≤ 46 (from (23), {varepsilon} ={lfloor} 0.15 P{rfloor} for T = 0.9). The QPT found 18 patterns, 4 exact ones and 14 approximate. The pattern periods were: 1, 2, 3, 6, 7, 8, 9, 10, 13, 14 and 44. Twelve of the 18 patterns belonged to one of five groups of overlapping repeats (see caption of Table 2). All approximate repeats had a single letter substitution, except for pattern #9, which had a single deletion, and pattern #11 (a 44mer), which had six letter substitutions.

The TRF code (match, mismatch, indels) = (2,7,7) was tested in two settings of the TRF minimum alignment score: 50 and 30. The first setting produced four patterns: 3mer (#14), 13mer (#17), 14mer (#5) and 44mer (#11). Except for the 13mer, these patterns were described previously in (Benson, 1999). The second setting produced 12 patterns (which included the previous 4 patterns). Out of these 12 patterns, 6 met the error criterion of the QPT, and were identical to the patterns detected by the QPT. Out of the remaining six repeats, five repeats (#15-18) were similar to repeats detected by the QPT. Among those were the 10mers AAAAATAATA (TRF) and AATAATAATA (QPT), occurring at overlapping regions 2372:2399 and 2378:2399, respectively. Since the TRF allowed for more errors than the QPT, it started counting repeats six bases earlier than the QPT, which resulted in a different pattern. One pattern, the 12mer AAGGCACGGGCG, identified by TRF, was undetected by QPT, due to the presence of two deletions.

Out of the 18 patterns identified by the QPT, 8 were undetected by the TRF. These included the 6mer GGCCCA, the 7mers GCGGCCA, GGCCGCA and AGGAAGG, the 8mers CAACCAAT and GTTTAGAA, the 9mer CGTGTGTGT and the 10mer TACTAAAAAA. Four more repeats (two of them overlapping with the reported repeats) have been found testing the TRF code with the lowest minimum alignment score allowed, 20, all of them, however, had a higher percentage of errors than allowed in the experiment. No dispersed repeats were found in the sequence using either the SRF or the QPT.


    5 SUMMARY
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 
A new approach to computing the periodicity transform has been proposed. The approach incorporates several features that significantly improve performance and extend utility of the transform. These improvements include: invariance to symbol permutation, a convenient for multiple queries two-stage data processing scheme, a factor of P computational efficiency improvement, removal of aliases and an efficient mechanism for dealing with isolated indels.

The standard mapping of DNA symbols to complex numbers has been replaced with a mapping of DNA symbols to quaternionic numbers. This mapping removes preferential treatment of DNA symbols occurring in the complex mapping, and guaranties detection of all patterns, including complex, multi-period patterns, described by Hauth and Joseph (2002). In approximate repeat analysis, the quaternionic mapping leads to a threshold that is unambiguously coupled with the pattern length and the number of errors, and that is independent of pattern and error type. Numerical experiments have shown that the QPT is competitive with the industry standard, TRF, in that it detects patterns that are missed by TRF.

The QPT approach consists of two stages of processing. In the first stage all repeats are detected. In the second stage repeats that have high number of errors, are short, or ambiguous, are removed. The choice of two stage processing was dictated by computational convenience, as the number of relatively expensive computations of the second stage can be reduced by operating on a much smaller subset; such approach, however, has a number of other advantages as well. First, graphical display of results of both stages of the computation is produced, allowing visual evaluation. Second, since results of the first stage of the computation are stored, the second stage can be conveniently repeated for a different choice of pattern selection parameters, and perhaps for a subset of data, at a later time, without repeating the entire computation. Apart from graphical display, the algorithm produces a numerical listing of results.

One of the chief inconveniences of the original approach was the appearance of aliases in the periodicity transform plot. Here, repeats that occur at identical locations for multiples of the fundamental period, and repeats that are entirely contained in larger repeats, are removed. Repeats that partly overlap are kept, as such repeats cannot be uniquely ordered in importance strictly based on mathematical criteria (particularly in the presence of symbol errors), and as decision about which of these repeats is biologically relevant is application dependent.

The main computational burden is carried by the first stage of the algorithm. The complexity of the computation is reduced by a factor of P in comparison with the direct approach. In contrast to most reports on DNA repeat detection methods, which either provide only rough estimates of complexity, or give timing results of selected experiments, complexity of the QPT can be evaluated exactly. It was shown that computational complexity of the QPT is strictly linear with the data size, for a fixed range of period sizes, and that it requires only a single evaluation of a symbol match at each point (which consists of five additions and four multiplications in the current implementation of the algorithm). Moreover, since the evaluation is performed independently and identically for all period values, the algorithm can be easily implemented in a multi-processor environment.

An important discovery of this work was the finding that the first stage of processing produces intermediate results that can be used to detect indels. A scheme for identification of single indels has been implemented. Further work in this area is needed to equip the QPT with an unrestricted indel sensitivity.


    FOOTNOTES
 
Associate Editor: Thomas Lengauer

1Existence of such factorization is assumed for notational convenience, but is not required otherwise. Back

2The periodicity transform, as stated here, is identical to the 0th frequency slice through the Zak transform. Zak transform is a time-frequency representation that is intimately related to the Fourier and the short-time Fourier transforms (Brodzik and Tolimieri, 2000). In effect, periodicity transform analysis can be viewed as a special case of time-frequency analysis. Back

3This step can be removed, if Formula is allowed to vary in the main stage of the algorithm, so that the product Formula is not less than the minimum repeat length allowed. Back

Received on August 25, 2006; revised on December 10, 2006; accepted on January 3, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 PERIODICITY TRANSFORM
 3 THE QUATERNIONIC APPROACH
 4 EXPERIMENTS
 5 SUMMARY
 REFERENCES
 

    Anastassiou D. Genomic signal processing, In: IEEE Trans. SP (2001) 18:8–20.

    Arneodo A, et al. What can we learn with wavelets about DNA sequences? In: Physica A (1998) 249:439–448.[CrossRef][Web of Science]

    Benson G. Tandem repeat finder: a program to analyze DNA sequences. In: Nucleic Acid Res. (1999) 27:573–580.[Abstract/Free Full Text]

    Brodzik AK, Peters O. Symbol-Balanced Quaternionic Periodicity Transform for Latent Pattern Detection in DNA Sequences. (2005) Philadelphia, PA: ICASSP.

    Brodzik AK, Tolimieri R. Extrapolation of band-limited signals and the finite Zak transform. In: Sgnal Processing (2000) 80:413–423.[CrossRef]

    Brzustowicz LM, et al. Location of a major susceptibility locus for familial schizophrenia on chromosome 1q21–q22, In: Science (2000) 288:678–682.[Abstract/Free Full Text]

    Buchner M, Janjarasjitt S. Detection and visualization of tandem repeats in DNA sequences. In: IEEE Trans. SP (2003) 51:2280–2287.[CrossRef]

    Butler J. Forensic DNA Typing: Biology and Technology Behind STR Markers (2003) London: Academic Press.

    Campuzano V, et al. Friedreich s Ataxia: autosomal recessive disease caused by an intronic GAA triplet repeat expansion. In: Science (1996) 271:1423–1427.[Abstract]

    Cummings CA, Relman DA. Microbial Forensics—cross-examining pathogens, In: Science (2002) 296:1976–1979.[Abstract/Free Full Text]

    Fu YH, et al. An unstable triplet repeat in a gene related to myotonic muscular dystrophy, In: Science (1992) 255:1256–1258.[Abstract/Free Full Text]

    Guerini FR, et al. Myelin basis protein gene is associated with ms in DR4- and DR5-positive Italians and Russians. In: Neurology (2003) 61:520–526.[Abstract/Free Full Text]

    Hamilton WR. Elements of Quaternions (1866) London: Longman.

    Huntington s Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington s disease chromosomes. In: Cell (1993) 72:971–983.[CrossRef][Web of Science][Medline]

    Hauth A, Joseph DA. Beyond tandem repeats: complex structures and distant regions of similarity, Bioinformatics (2002) 18:S31–S37.[Abstract]

    Kantor IL, Solodovnikov AS. Hypercomplex Numbers: An Elementary Introduction to Algebras (1989) New York: Springer-Verlag.

    Krishnan A, Tang F. Exhaustive whole-genome tandem repeats search, Bioinformatics (2004) 20:2702–2710.[Abstract/Free Full Text]

    Lander ES, et al. Initial sequencing and analysis of the human genome. In: Nature (2001) 409:860–921.[CrossRef][Medline]

    Licastro F, et al. Interleukin-6 gene alleles affect the risk of Alzheimer's disease and levels of the cytokine in blood and brain. In: Neurobiol. Aging (2003) 24:921–926.[CrossRef][Web of Science][Medline]

    Sangwine SJ. Fourier transforms of color images using quaternion or hypercomplex numbers. In: Electron. Lett. (1996) 32:1979–1980.[CrossRef]

    Sethares WA, Staley TW. Periodicity transforms. In: IEEE Trans. SP (1999) 47:2953–2964.[CrossRef]

    Sharma D, et al. Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics (2004) 20:1405–1412.[Abstract/Free Full Text]

    Sidransky D. Nucleic acid-based methods for the detection of cancer, In: Science (1997) 278:1054–1058.[Abstract/Free Full Text]

    Stolovitzky G, Califano A. Statistical significance of patterns in biosequences. available at http:/:www.research.ibm.com/splash/Papers/Pattern%20Statistics.pdf.

    Sommer G, ed. Geometric Computing with Clifford Algebras (2001) New York: Springer-Verlag.

    Tavare S, Giddings BW. Some statistical aspects of the primary structure of nucleotide sequences. In: Mathematical Methods for DNA Sequences—Waterman MS, ed. (1989) Boca Raton: CRC Press. 117–131.

    Tishkoff SA, et al. Short tandem-repeat polymorphism/alu haplotype variation at the PLAT locus: implications for modern human origins. Am. J. Hum. Genet. (2000) 67:901–925.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/6/694    most recent
btl674v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Brodzik, A. K.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Brodzik, A. K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?