Skip Navigation


Bioinformatics Advance Access originally published online on September 30, 2004
Bioinformatics 2005 21(5):608-616; doi:10.1093/bioinformatics/bti050
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/5/608    most recent
bti050v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by da Fontoura Costa, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by da Fontoura Costa, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Biological sequence analysis through the one-dimensional percolation transform and its enhanced version

Luciano da Fontoura Costa

Cybernetic Vision Research Group, GII-IFSC, Universidade de São Paulo São Carlos, SP, Caixa Postal 369, 13560-970, Brazil


    Abstract
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 

Motivation: The necessity to characterize the spatial uniformity (or lack of it) of symbols in biological sequences, given its implications for identification of the properties of the structures associated with the sequences.

Methods: A one-dimensional version of a recently introduced percolation-based approach is presented, which allows the accurate quantification of symbol distributions even in the presence of co-existing densities. An enhanced version of this methodology, which uses an agglomerative process to organize hierarchically the sequence into subsequences, is also proposed and illustrated.

3. Results: The potential of the proposed methodology is illustrated with respect to synthetic and real data (1881 zebrafish and 1200 Xenopus proteins) and compared to two alternative multiscale methodologies, with encouraging results including the possibility to identify particularly remarkable amino acid arrangements in proteins.

4. Contact: luciano{at}if.sc.usp.br


    1 MOTIVATION
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 
Although many approaches in mathematics and physics consider continuous spaces, which are required to represent natural entities such as time and distances, biology is to a great extent a discrete realm. Not only living beings are composed of a finite (although very large) number of cells, but a great part of the information controlling animal dynamics—be it for reproduction, metabolism or development—is underlain by discrete sequences of symbols, of which DNA/RNA and protein instances are of paramount importance. Therefore, it is hardly surprising that much effort has been focused on the acquisition, analysis and classification of such sequences, motivating big initiatives such as the genome project and the new field of bioinformatics. The current approaches to biological sequence analyses can be divided into two main groups: (1) comparative methodologies based on sequence and subsequence matches, which are aimed at identifying known patterns inside the sequences (Baldi and Brunak, 2001); and (2) intrinsic methodologies, which try to identify patterns characterized by some property (e.g. spatial uniformity or symmetry) inside the analyzed sequences. While in the former type of analysis one relies on previous knowledge—such as mapped proteins or motifs—the latter type of investigation aims at identifying and quantifying special patterns inside the sequences, such as symmetry or repetitions or regularly spaced occurrences of symbols, as well as sequences marked by the lack of such uniformities. Although the latter type of investigation has received considerably less attention than the former, it constitutes an important endeavor because of the several insights it can provide about the basic structural and functional properties of the respectively coded structures, especially proteins and RNAs. More precisely, the spatial distribution of specific amino acids is ultimately one of the key factors defining the higher order structural properties of the protein. Of special interest is the uniformity of the linear spacing between specific amino acids, as well as their periodicity and symmetry (Clote and Backofen, 2000 Irback et al., 1996; Borstnik and Pumpernik, 2002 Castelo et al., 2002). This uniformity is often associated with the presence of frequent motifs such as alpha helices and beta sheets, which are characterized by uniform spacing of specific amino acids implied by the periodicity of such structures.

The problem of intrinsic analysis of biological discrete sequences has been approached from several perspectives, including correlations (Weiss and Herzel, 1998) information theory (Gross et al., 2002) chaos (Almeida et al., 2001) linguistics (Troyanskaya et al., 2002) and spectral methods (Anastassiou, 2001 Nagai et al., 2001). One of the main problems is the non-stationarity of biological data (i.e. the fact that different statistics are obtained when starting at different reference positions), which undermines approaches such as the Fourier transform and even wavelets (for the larger spatial scales, i.e. longer windows). Moreover, spectral methods such as the Fourier and Gabor/wavelet transforms often suffer from accuracy problems in the sense that the frequencies imposed by the spectral functions often do not perfectly match those in the original signal. Indeed, periodicities in the signal that are not multiple integer values of those of the spectral kernel will produce dispersion of the transformed signal around the represented frequency (Brigham, 1988). On the other hand, techniques based on the nearest-neighbor distances provide an inherently local representation of the symbol’s density, but fail to take into account larger spatial scales corresponding to the distances between groups of symbols. For instance, the signal in Figure 1 will produce a single peak in the nearest-neighbor histogram, completely missing the distribution of the groups of symbols at larger spatial scales.



View larger version (9K):
[in this window]
[in a new window]
 
Fig. 1 A simple sequence containing groups of uniformly spaced symbols. The nearest-neighbor characterization of such a signal completely misses the fact that the regularly spaced sequences are distributed with different distances between themselves.

 
The alternative approach recently introduced by (Costa, 2004a) provides an interesting balance regarding the representation of both small and large spatial scales of the signal, paving the way for the analysis of different co-existing symbol densities or periodicities. This approach is based on the concept of phase transition through percolation (Binney et al., 1992 Stauffer and Aharony, 1994). Given a series of spatially distributed sites, the progressive incorporation of links between such points tend to reach a point where a giant cluster is obtained. The abrupt transformation of the connectivity properties of such a structure as more links are added provides an example of percolation. The first step in the percolation transform (PT) applied for the characterization of the spatial distribution of sets of points involves connecting pairs of points in ascending order of distance, so that the closest points are connected first. Each such connection can be understood as an abrupt phenomenon taking place at a very small spatial scale. Connected sets are treated as a single entity (namely a cluster), which are further connected to other clusters at larger distances. The second step in the PT involves obtaining normalized histograms of the distances at which the clusters merged during the enforced percolation procedure. By considering a logarithm coordinate change as well as a normalizing product, which is essential to compensate for the reduced number of intervals implied by sparser symbol densities, the PT of the original sequence into the normalized distance histogram allows the observation of co-existing symbol distributions with different densities in the original sequence. An enhanced version of the density transform, which uses an agglomerative clustering algorithm [namely a modification of the single-linkage hierarchical clustering (Costa and Cesar, 2001)] in order to obtain multiscale analysis of the sequence, proposed here allows the identification of hidden regularities in the original sequence. The potential of these two methods, the simple and enhanced percolation transform (EPT) for the characterization of spatial uniformity is corroborated with respect to the analysis of amino acid distribution in 1881 zebrafish (Brachydanio rerio) and 1200 Xenopus proteins. The proposed methodology is also compared with two alternative multiscale approaches, namely continuous wavelets and detrend analysis.


    2 METHODOLOGY
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 
This section describes the local density analysis of discrete sequences as well as its enhanced version, which are both illustrated with respect to real amino acid sequence data.

2.1 Percolation transform analysis
Let the sequence of interest be represented as the list x = (x 1, x 2, ..., x i , ..., x N ), with x i S, where S = s 1, s 2, ..., s j , ..., s M is the set of the possible symbols. For instance, in the henceforth assumed case of proteins, we have M = 20 symbols and S = {A, R, D, N, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}. The total number of elements given as s i , i = 1, 2, ..., M in sequence x is henceforth represented as N(s i ). A new sequence x(s j ) = [x 1(s j ), x 2(s j ), ..., x i (s j ), ..., x N (s j )] can be obtained from the original sequence x for each possible symbol s j by making x i (s j ) = 1 whenever x i = s j and assigning x i (s j ) = 0 otherwise. For example, if the original sequence is:


we have




One immediate information about the distribution of the symbols along the sequence is provided by the relative frequency f(s j ) = N(s j )/N. While a regularly spaced sequence would be completely represented by any of these parameters, typical biological sequences are characterized by rich and variable distribution of specific symbols. A prototypical statistical symbol distribution (often considered as the ‘null-hypothesis’) is that described by the Poisson density with parameter {gamma}, which expresses the mean number of symbols per unit sequence length. A uniformly spaced sequence will have similar relative frequency, i.e. f(s i ) {approx} N(s j )/N {approx} {gamma}. Figure 2 shows two Poisson sequences with different Poisson rates. Observe that the Poisson sequences can, loosely speaking, be thought of as ‘noisy’ versions of the uniformly spaced sequences with the same density.



View larger version (28K):
[in this window]
[in a new window]
 
Fig. 2 Two Poisson sequences with different symbol densities (a and b) and its respective distance histograms (c and d), logarithmic version of the distance histograms (e and f), height-normalized logarithm of the distance histograms (g and h) and Parzen-interpolated and smoothed (for {sigma} = 0.5) respective versions (i and j) of the latter.

 
The recently introduced (Costa, 2004a) PT allows the quantification of the densities of symbols present in the original sequence in terms of a simple density function even in the presence of co-existing different densities. The PT involves measuring the distances d i (s j ) between pairs of successive symbols x i (s j ) and x i+1(s j ) along the sequence x(s j ), a parameter also called inter-pulse interval (IPI) in areas such as animal sound analysis (Surlykke and Moss, 2000 Popov et al., 2003). The list d(s j ) = [d 1(s j ), d 2(s j ), ..., d i (s j ), ..., d D (s j )] of successive distances d i (s j ) {1, 2, ..., D} along x(s j ) provides a comprehensive characterization of that sequence. Observe that the possible distance values are the set of non-negative integer numbers, i.e. i {1, 2, ..., N–1}. The histogram h(s j ) = [h 1(s j ), h 2(s j ), ..., h i (s j ), ..., h D (s j )] of the successive distances of x(s j ) can be immediately obtained from d(s j ). The list of distances of the sequence in the previous example [Equations (1) and (2)] and the respective distance histograms are given below:



Figure 2 shows two Poisson sequences (a and b) and the respective distance histograms (c and d). To enhance the spatial uniformity of the histogram for different densities (Costa, 2004a) it is interesting to map this function onto a logarithmic space by introducing the new indexing variable w = log(i) so that h i (s j ) becomes h w (s j ), which is shown in Figure 2 (e, f). The effect of such a transformation in producing more uniform dispersions along the x-axis can be readily appreciated by comparing the histograms in (c and d) with those in Figure 2 (e, f), which now have similar dispersion. In addition, in order to equalize the histogram heights for distinct densities, the logarithm histogram is multiplied by exp(w), yielding H w (s j ) = h w (s j ) exp(w), as shown in Figure 2 (g, h). Such an operation is defined by imposing that any Poisson symbol distribution observed through a fixed-size window will produce the same intensity of height variation in the respective histograms (Costa, 2004a). Therefore, the histogram height variation produced by a sparse density of points—characterized by a larger average distance between points (and hence larger w)—will be emphasized by the product of the histogram at that spatial scale with the value \exp(w). It is such a normalizing scheme that allows the identification of sparser densities in symbol distributions involving multiple spatial scales, as will be further illustrated below. Observe that alternative equalization schemes, such as multiplying the sequence by exp(2s) can also be considered to further enhance the method sensitivity to low-density symbol distributions. Because such histograms are often sparse, it is interesting to apply some interpolation scheme. To interpolate and smooth the obtained histogram, the Parzen windows approach (Costa and Cesar, 2001) which consists of convolving the previous function with the Gaussian SD {sigma}, is applied over the histogram h(s j ), resulting in the smoothed density function g(s j ). The Parzen interpolations of the histograms g(s j ) in Figure 2 (e, f) are given in Figure 2 (g, h), respectively.

The two sequences in Figure 2 (a, b) were synthetically obtained by uniform random choice of the position of the sequence symbols, i.e. the random signals obeyed the Poisson distribution with rates {gamma}1 = 0.2 and {gamma}2 = 0.07. It has been verified that the distance indicated by the peak value of the density transform for a Poisson sequence with parameter {gamma} is given by Equation (8). Therefore, D 1 = 2/{gamma}1 = 10 and D 2 = 50. The densities in (g and h), Parzen-interpolated with the Gaussian SD {sigma} = 0.5, provide a clear and normalized characterization of the original signals. The respective peaks w1(peak) = 2.09 and w2(peak) = 3.23 imply, through Equation (7), D 1(peak) = exp(2.09) = 9.08 and D 2(peak) = exp(3.23) = 25.28, yielding, by applying Equation (8)), estimated respective rates {gamma}1(peak) = 2/9.08 = 0.22 and {gamma}2(peak)=2/25.28=0.079, which are good approximations of the expected values.



The simple example above illustrates the three important points contributing to the effectiveness of the density transform, namely: (1) the logarithmic scale accounting for spatial uniformity of Poisson sequences with varying parameters; (2) the height normalization allowing peaks corresponding to Poisson sequences with small values of {gamma} to become visible; and (3) the Parzen smoothing/interpolation allowing the accurate characterization of the distribution of symbols.

The procedure described above is further illustrated in the following with respect to the sequence in Figure 3a, which was obtained by distributing, according to the Poisson model with parameter {gamma}2 = 0.033, denser Poisson sequences with parameter {gamma}1=0.33. Figure 3b shows the distance histogram, whose logarithmic version is given in (c), being further height-normalized into (d). The Parzen-interpolated version of the latter is shown in (e). Interestingly, the two peaks in this figure identify the two co-existing densities in Figure 3a. Indeed, the values of s at each of these two peaks are w 1 = 1.52 and w 2 = 4.47, which imply D 1 = exp(w 1) = 4.55 and D 2 = exp(w 2) = 87.36, which correspond to the two typical intersymbol distances in the original sequence. Actually, considering that the denser sequence obeys the Poisson distribution (by construction), we have that the estimated {gamma} parameters are {gamma}1(peak) = 2/D 1(peak) = 0.43 and {gamma}2(peak) = 2/D 2(peak) = 0.023, which are good estimations of the original values. Observe that the second peak would go completely unnoticed were it not by the adopted exp(w) product used to transform the histogram in (c) into that in (d). Figure 4a shows the alternative characterization of the original sequence in terms of the nearest-neighbor histogram. By considering only the nearest distance for each sequence symbol, this approach completely misses the secondary density, implied by larger gaps between sequence symbols. Figure 4b illustrates the power spectra, namely the square magnitude of the Fourier transform, of the original sequence, which also provides little insight about the co-existing densities in the original signal. The analysis of the original sequence by using the continuous wavelet transform (Starck et al., 1998 Costa and Cesar, 2001) represented here using the wavelet approach considering first derivative of the Gaussian, is shown in Figure 4c and 4d. The obtained w-representation—where the signal is represented at increasing spatial scales defined by the vertical axis—is shown in (c), with the respective crests, corresponding to the extremes of the surface in (c), is presented in (d). Although it would be, in principle, possible to identify the presence of symbol spatial regularities by detecting the extinction of specific crests along the vertical axis (spatial scales), such a scheme is undermined by the irregular shape of the crest lines and the fact that crests corresponding to short spacing between symbols tend to be rather noisy.



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 3 A sequence characterized by two co-existing symbol densities (a) and its respective distance histogram (b), logarithm version of the distance histogram (c), height-normalized logarithm of the distance histogram (d) and Parzen-interpolated and smoothed version (e) of the latter.

 


View larger version (38K):
[in this window]
[in a new window]
 
Fig. 4 The nearest-neighbor histogram (a), power spectra (b), continuous wavelet scalegram (c) and its respective maxima crests obtained for the original sequence in Figure 3a.

 
We have so far considered sequences obeying the Poisson model, which are characterized by little uniformity of intersymbolic distances. The examples in Figure 5 show how the PT can also be used to characterize the uniformity of sequences. Figure 5a and 5b show two regularly spaced sequences (constant distance between symbols), whose intersymbol distances are 3 and 6, respectively. The obtained histograms, shown in Figure 5c and 5d, are characterized by peaks at distances D 1 = exp(1.15) = 3.18 and D 2 = exp(1.85) = 6.36, which are good estimations of the original values. The relatively high dispersions for such regularly spaced sequences are an immediate consequence of the Parzen interpolation required for proper treatment of less regular sequences. Still, the dispersion for regular sequences is substantially smaller than those for more random ones. The choice of the smoothing parameter, namely the SD {sigma} of the Gaussian involved in the Parzen interpolation, should consider the degree in detail expected from the analysis. Small values of {sigma} will tend to produce more informative results, but at the expense of noise implied by the discrete nature of the distances. On the other hand, large values of {sigma} lead to smoother functions, but at the expense of joining peaks that are too close to one another. We have verified experimentally that, for amino acid analysis in proteins, good results are normally obtained for 0.1 ≤ {sigma} ≤ 0.5.



View larger version (12K):
[in this window]
[in a new window]
 
Fig. 5 Two regularly spaced sequences (a and b) and respective normalized histograms h w (c and d).

 
Whereas the percolation histogram h w (s j ) provides a rich information about the spatial distribution of a type of symbol in the analyzed sequence, providing effective visual information about such distribution, the many points involved by this histogram constrain its application in pattern recognition and classification problems. It is therefore interesting to obtain global measurements which, although reduced to a single value, retain valuable information about specific properties of interest for the analysis. Here, we propose three parameters that can be used to quantify typical distance between symbols, the overall uniformity of the symbol distribution and how much the histogram resembles a two-modal distribution (i.e. the presence of two clear peaks).

2.1.1 Typical distance between symbols
This measurement is defined as the distance value B corresponding to the highest peak in the normalized histogram.

2.1.2 Symbol spatial dispersion
This parameter, henceforth abbreviated as M(s j ), is obtained by normalizing the histogram h w (s j ), i.e. making h w = h w (s j )/N(s j ), and taking its SD.

2.1.3 Multimodality
Given the potential of PT for the identification of co-existing symbol densities, it becomes particularly interesting to define an objective measurement of the multimodality of such histograms. This task is substantially easier because the Parzen interpolation guarantees clearly defined smooth peaks, without spurious oscillations. Therefore, each peak can be identified by looking for changes of sign of the differences h w h w+{Delta} w , implying zero derivative, combined with positive difference at the left-hand side of the difference (i.e. a maximum instead of a minimum peak). The distance W between the leftmost and rightmost peaks of the normalized histogram, taken along the horizontal axis, has been verified to be an appropriate and stable measurement of the multimodality of the analyzed histograms. Table 1 gives the typical distance between symbols, the symbol spatial dispersion and the multimodality for each of the sequence in this section. Further examples of applications of these three measurements are supplied and discussed in the following section.


View this table:
[in this window]
[in a new window]
 
Table 1 Typical distances, symbol dispersion, and multimodality for the sequences in Figures 2, 3 and 5

 
Despite the fact that the above measurements are just three, while the normalized histogram involves a substantially higher number of points, they have been found to provide a clear and reasonably complete characterization of the uniformity (or lack of it) of the symbol distribution in sequences, as illustrated in the following section.

Enhanced percolation transform analysis
Whereas the one-dimensional PT described in the previous section provides a robust means for characterizing symbol distribution in sequences even in the presence of co-existing densities, it is possible to extend the potential of that transform even further in order to cope with hidden regularities. Consider, for instance, the situation illustrated in Figure 6a, which includes a series of regularly spaced bursts of different characteristic distances and lengths. In addition to the spacing regularity of the sequence clusters at small spatial scale, the sequence also exhibits regularity at higher spatial scales, which can be easily identified by substituting each of the bursts by a single symbol at the respective burst center, yielding the subsumed sequence in Figure 6b.



View larger version (14K):
[in this window]
[in a new window]
 
Fig. 6 A sequence (a) exhibiting regularity at high spatial scales (b).

 
The EPT is presented for the first time in this section. Its basic principle is to have the sequence to undergo some pre-processing in order to subsume it along hierarchical levels of detail, allowing eventual hidden regularities to manifest themselves. This has been accomplished through a procedure similar to the single-linkage agglomerative clustering (Costa and Cesar, 2001) with the difference that the obtained clusters are represented in terms of their center of mass along the sequence axis. More specifically, the original sequence is progressively subsumed by replacing groups of regularly spaced symbols with a single symbol at the respective center. Such a merging is performed by considering ascending values of intersymbol distances. Figure 7 illustrates the application of such a pre-processing to the sequence in (a), leading to a hierarchy of subsumed versions (b) of the original sequence. The ‘V’-like regions, in dark-gray, correspond to the intersymbol spaces, which are progressively reduced as the subsuming level d increases. Figures 8a, 8c, 8e and 8g show some of the subsumed sequences from Figure 6b, with respective EPTs shown in (b, d, f and h). The effect of the progressive merging of symbols in enhancing the higher spatial scale structure of the sequence is clearly perceived not only in the subsumed sequences, but also in the respective PTs. The leftmost peak present in (b) was duly suppressed by the subsumed sequence in (c), whose high regularity is characterized by the low-dispersion peak in (d). Further subsumption of the sequence (c) led to the more irregular sequence in (e), whose respective PT almost produced two peaks. The sequence in (g), characterized by just three symbols, also led to a low-dispersion PT in (h). Observe that the standard PT is naturally incorporated into its enhanced version (i.e. as its first transform). The basic routines for PT and hierarchical subsumption can be found (Costa, 2004b).



View larger version (25K):
[in this window]
[in a new window]
 
Fig. 7 A sequence (a) and its subsumed version (b).

 


View larger version (24K):
[in this window]
[in a new window]
 
Fig. 8 The hierarchical subsumption of the sequence in Figure 7a.

 

    3 PROTEIN ANALYSIS RESULTS
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 
The potential of PT and its enhanced version is explored in this section in order to obtain a systematic characterization of the uniformity of spatial distribution of amino acids in zebrafish. The original data was obtained from the NIH Zebrafish Gene Collection repository (http://zgc.nci.nih.gov/) (file \verb+drmgccdsaa.fasta+). For each of a total of 1880 proteins, 20 sequences corresponding to the basic amino acids were obtained and processed through the EPT. The typical distance, symbol spatial dispersion and multimodality measurements were estimated for each of the subsumed sequences in the EPTs. Figure 9 presents the scatterplots obtained for the typical distances (B) against the multimodality (M) of the original sequences for each amino acid. It is clear from these graphs that the Cys, His, Met, Phe, Trp and Tyr amino acids are characterized by higher intersymbol distances (y-axes) than the other amino acids, while Leu, Ser and Val present the smallest typical distance between symbols. Strikingly similar symbol spatial dispersions (x-axes) were obtained for all amino acids. Figure 10 shows the scatterplots of the multimodalities W1 and W2 calculated for the original and amino acid sequences subsumed at distance d = 7, respectively. Several interesting features can be identified in these results. First, let us identify the following four special regions in the graphs: (1) the cases with null W2, which formed the straight clusters at the lower left corner of the x-axis in each plot; (2) the situations characterized by the same multimodality values for both original and subsumed sequence, corresponding to the cases distributed along the main diagonal; (3) the sequences with null W1, which produced the straight clusters along the y-axis of the graphs; and (4) the remaining cases. Most sequences for the amino acids Cys, His, Met, Trp and Tyr were characterized in case (2) above, which is a direct consequence of their respective larger typical distances already identified in Figure 9. The original sequences for the amino acids Leu, Ser and Val were characterized by particularly low multimodality, suggesting that they tend to present higher spatial uniformity. On the other hand, the amino acids Arg, Cys, Glu, Lys, Pro and Tyr were characterized by several high multimodality cases. As far as the multimodality of the subsumed sequences is concerned, the amino acid Leu was characterized by a particular low value of such a measurement, with most sequences belonging to the above category (1). Figure 11 shows some particularly interesting sequences identified from Figure 10e. The following three extreme points were identified in Figure 10 (marked with diamonds): (1) upper right side; (2) lower left side; and (3) low intermediate region. The original sequence and its subsumption for d = 7 are shown in Figure 11a and 11c, with the respective PTs shown in Figure 11(b, d). This sequence is characterized by high multimodality for both spatial scales (i.e. original sequence and subsumed version at d=7), even though the smallest peak in (b) was removed in (d). The original and subsumed (d=7) versions of sequence (2) are shown in Figure 11e and 11g, while the respective PTs are shown in (f and h). This case illustrates the situation where a new peak—leftmost peak in (h)—was created as a consequence of the elimination of smaller scale detail from the sequence. The case (3), whose original and subsumed (d = 7) sequences are shown in Figure 11i and 11k and respective PTs are depicted in (j and l), illustrates the situation where a small peak—the leftmost peak in (j)—was removed as a consequence of the elimination of the pair of nearly spaced symbols in case (1). Several other interesting sequences can be identified by inspecting the scatterplots in Figure 11 therefore paving the way for data mining possibilities.



View larger version (52K):
[in this window]
[in a new window]
 
Fig. 9 Scatterplots of the typical distance against symbol dispersion of the 20 basic amino acids in zebrafish proteins.

 


View larger version (58K):
[in this window]
[in a new window]
 
Fig. 10 Scatterplots of the multimodality for the original (W1) and subsumed (W2), for distance equal to 7, sequences for the 20 basic amino acids in zebrafish proteins.

 


View larger version (28K):
[in this window]
[in a new window]
 
Fig. 11 The three extreme cases identified from Figure 10e. See text for explanation.

 
The characterization of the zebrafish proteins by using the wavelets approach discussed in Section 2.1 and detrend analysis (Peng et al., 1994) was also considered. As is clear from the results available (Costa, 2004b) the wavelet approach led to poor characterization because the noisy crests (corresponding to short symbol separations) tended to collapse near the graph origin. The detrend analysis, performed for window sizes 20 and 40 implied cluttered distributions. The PT, the wavelet and detrend analysis methodologies were also applied to Xenopus data from the NIH database (1200 proteins). The achieved results, available from (Costa, 2004b) clearly indicate that the distributions obtained for the PT (and also the other two methods) are almost indistinguishable from those of zebrafish, suggesting that the obtained patterns of amino acid distribution have a more general nature.

To investigate the potential of the proposed measurements for the characterization of protein families, we applied the reported methodology to several families obtained from the Pfam database (Sanger Institute, 2004 http://www.sanger.ac.uk/Software/Pfam/), with encouraging results. Figure 12 illustrates the distributions obtained by using the PT for each amino acid considering three protein families, namely Interferon alpha/beta domain (17 proteins), Envelope glycoprotein GP120 (9 proteins) and AIR carboxylase (12 proteins). Varying degrees of discrimination between the families were obtained with respect to each amino acid. Protein distributions obtained by using wavelets and detrend analysis can be found from (Costa, 2004b). Although the results obtained for the latter technique were characterized by the same problem as discussed in Section 2.1 the protein scattering resulting from detrend analysis with windows of 20 and 40 positions allowed separation of the families to a degree similar to that obtained with the PT. At the same time, the results obtained by the latter method and detrend analysis should not be understood as being superior to one another, but as providing complementary information about the similarities and differences between the analyzed protein families as far as the regularity and long range correlations are concerned.



View larger version (43K):
[in this window]
[in a new window]
 
Fig. 12 The distribution of proteins belonging to three families—namely Interferon alpha/beta domain (+), Envelope glycoprotein GP120 (x) and AIR carboxylase (closed diamonds)—obtained by the PT applied to each amino acid.

 

    4 CONCLUSIONS
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 
This paper presents how the spatial regularity (or lack of it) of symbol distribution in biological sequences can be effective and comprehensively characterized in terms of the one-dimensional version of PT, a recently introduced physics-based approach. By incorporating two normalizations, namely the logarithm of the distance axis (required for scale uniformity) and equalizing product (which compensates for the fixed-window size), such an approach allows the characterization of symbol uniformity even in the presence of co-existing different symbol densities. The application and interpretation of the one-dimensional PT was presented and discussed in didactical fashion, including several illustrative examples and comparison with two alternative multiscale methodologies.

To endow the PT with the ability to detect hidden symbol regularities, a subsumption processing based on the single-linkage agglomerative clustering was incorporated, leading to EPT reported for the first time in this paper. The potential of this methodology was illustrated with respect to amino acid distribution in zebrafish proteins, with interesting results, including different patterns of spatial uniformities identified for different amino acids, as well as the possibility to datamine the obtained results in order to identify especially interesting sequences. Similar results were obtained for Xenopus, suggesting that the observed amino acid patterns have a more general nature.

Although we restricted our attention to amino acid sequences, the proposed concepts and methods can be immediately applied to other important biological signals, such as DNA sequences and neuronal spikes. Two- and three-dimensional versions of the PT are also being applied by the author to space-time gene expression characterization. Further developments includes the estimation of PT inside each of the clusters defined for each distance value along the agglomerative merging stage, in order to allow improved spatial localization.


    Acknowledgments
 
The author is grateful to Francisco A. Rodrigues for help with obtaining the protein families, and to FAPESP (proc. 99/12765–2), CNPq (proc. 308231/03–1) and the Human Frontier Science Program for the financial support. This text benefitted from comments from Professor Dr. Gonzalo Travieso.

Received on May 16, 2004; revised on September 11, 2004; accepted on September 12, 2004

    REFERENCES
 TOP
 Abstract
 1 MOTIVATION
 2 METHODOLOGY
 3 PROTEIN ANALYSIS RESULTS
 4 CONCLUSIONS
 REFERENCES
 

    Almeida, J.S., Carrico, J.A., Maretzek, A., Noble, P.A., Fletcher, M. (2001) Analysis of genomic sequences by Chaos Game Representation. Bioinformatics, 17, 429–437[Abstract/Free Full Text].

    Anastassiou, D. (2001) Genomic signal processing. IEEE Signal Process. Mag., 18, 8–20[CrossRef].

    Baldi, P. and Brunak, S. Bioinformatics, (2001) , Cambridge, MA The MIT Press.

    Binney, J.J., Fisher, A.J., Dowrick, N.J., Newman, M.E.J. The Theory of Critical Phenomena, (1992) , London Clarendon Press.

    Borstnik, B. and Pumpernik, D. (2002) Tandem repeats in protein coding regions of primate genes. Genome Res., 12, , pp. 909–915[Abstract/Free Full Text].

    Brigham, E.O. The Fast Fourier Transform and Its Applications, (1988) , NJ Prentice-Hall.

    Castelo, A.T., Martins, W., Gao, G.R. (2002) TROLL—tandem repeat occurrence locator. Bioinformatics, 18, , pp. 634–636[Abstract/Free Full Text].

    Clote, P. and Backofen, R. Computational Molecular Biology: An Introduction, (2000) , Sons John Wiley and Sons.

    Costa, L.da F. (2004a) Actively-induced percolation: an effective approach to multiple-object systems characterization. eprint arXiv, cond-mat/0404310 .

    Costa, L.da F. (2004b) Complementary material for the present article.

    Costa, L.da F. and Cesar, R.M., Jr. Shape Analysis and Classification: Theory and Practice, (2001) , Boca Raton, FL CRC Press.

    Gross, I., Bernaola-Galvan, P., Carpena, P., Roman-Roldan, R., Oliver, J., Stanley, H.E. (2002) Analysis of symbolic sequences using the Jense–Shannon divergence. Phys. Rev. E, 65, , pp. 041905.

    Irback, A., Peterson, C., Potthast, F. (1996) Evidence for nonrandom hydrophobicity structures in protein chains. Proc. Natl Acad. Sci. USA, 93, 9533–9538[Abstract/Free Full Text].

    Nagai, N., Kuwata, K., Hayashi, T., Kuwata, H., Era, S. (2001) Evolution of the periodicity and the self-similarity in DNA sequence: a fourier transform analysis. Jpn. J. Physiol., 51, 159–168[CrossRef][ISI][Medline].

    Peng, C.K., Buldyrev, S.V., Havlin, S., Simmons, M., Stanley, H.E., Goldeberger, A.L. (1994) Mosaic organization of DNA nucleotides. Phys. Rev. E, 49, 1685[CrossRef].

    Popov, A.V., Sitnik, N.A., Savvateeva-Popova, E.V., Wolf, R., Heisenberg, M. (2003) The role of central parts of the brain in the control of sound production during courtship in Drosophila melanogaster . Neurosci. Behav. Physiol., 33, 53–65[CrossRef][Medline].

    Sanger Institute. (2004) Protein families database of alignment and HMM.

    Starck, J.L., Murtagh, F., Bijaoui, A. Image Processing and Data Analysis, (1998) , Cambridge, MA Cambridge University Press.

    Stauffer, D. and Aharony, A. Introduction to Percolation Theory, (1994) Taylor and Francis.

    Surlykke, A. and Moss, C.F. (2000) Echolocation behavior of big brown bats, Eptesius fuscus, in the field and the laboratory. J. Acoust. Soc. Am., 108, , pp. 2419–2429.

    Troyanskaya, O.G., Arbell, O., Koren, Y., Landau, G.M., Bolshoy, A. (2002) Sequence complexity profiles of prokaryotic genomic sequences: a fast algorithm for calculating linguistic complexity. Bioinformatics, 18, 679–688[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/5/608    most recent
bti050v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by da Fontoura Costa, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by da Fontoura Costa, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?