

Bioinformatics Advance Access originally published online on February 26, 2008
Bioinformatics 2008 24(8):1041-1048; doi:10.1093/bioinformatics/btn077

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

AIMIE: a web-based environment for detection and interpretation of significant sequence motifs in prokaryotic genomes

Jan Mrázek 1,2,*, Shaohua Xie 3, Xiangxue Guo 1 and Anuj Srivastava 2

1Department of Microbiology, University of Georgia, Athens, GA 30602-2605, 2Institute of Bioinformatics, University of Georgia, Athens, GA 30602-7229 and 3Department of Computer Science, University of Georgia, Athens, GA 30602-7404, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 IMPLEMENTATION AND...
 4 RESULTS
 5 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Genomes contain biologically significant information that extends beyond that encoded in genes. Some of this information relates to various short dispersed repeats distributed throughout the genome. The goal of this work was to combine tools for detection of statistically significant dispersed repeats in DNA sequences with tools to aid development of hypotheses regarding their possible physiological functions in an easy-to-use web-based environment.

Results: Ab Initio Motif Identification Environment (AIMIE) was designed to facilitate investigations of dispersed sequence motifs in prokaryotic genomes. We used AIMIE to analyze the Escherichia coli and Haemophilus influenzae genomes in order to demonstrate the utility of the new environment. AIMIE detected repeated extragenic palindrome (REP) elements, CRISPR repeats, uptake signal sequences, intergenic dyad sequences and several other over-represented sequence motifs. Distributional patterns of these motifs were analyzed using the tools included in AIMIE.

Availability: AIMIE and the related software can be accessed at our web site http://www.cmbl.uga.edu/software.html.

Contact: mrazek{at}uga.edu


    1 INTRODUCTION
Hundreds of complete microbial genomes have been sequenced and many more are forthcoming. Utilizing the full potential of this accumulation of sequence data requires development of new computational techniques. The initial and most pressing need consists of an accurate identification of genes and prediction of their function. A number of methods have been developed for this task over the last decade [reviewed in (Overbeek et al., 2007)]. Less attention is devoted to identification and functional characterization of other functionally significant sequence features present in the genome.

The search for new sequence motifs can generally be divided into two types of tasks which require different computational approaches. The first type of motif search relates to situations where approximate locations of the motifs are known. A typical example is a search for cis-regulatory elements associated with a set of co-regulated genes. In this case, one may expect DNA sequences upstream of these genes to contain one or more shared sequence motifs that coincide with the cis-regulatory elements. Several programs are available for this task, generally based on Gibbs sampling (a Markov chain Monte Carlo method) or on expectation maximization (Bailey and Elkan, 1996; Hughes et al., 2000; Lawrence et al., 1993; Thompson et al., 2007). The interpretation of the results is usually straightforward because typical applications involve assigning sequence motifs to known biological functions.

The second type of motif search consists of ab initio discovery of novel sequence motifs and does not require any prior knowledge about the motifs or their location. Many functionally significant sequence motifs occur in the genome at unusual frequencies, either significantly more or significantly less often than expected. Methods for ab initio motif discovery typically aim to find significantly over- or under-represented oligonucleotides (words). These methods differ in the stochastic models used to assess the expected word counts and in the techniques and approximations used to estimate the statistical significance of the differences between the observed and expected counts (Karlin and Cardon, 1994; Karlin et al., 1996; Kirzhner et al., 2003; Leung et al., 1996; Pesole et al., 1992; Reinert et al., 2000; Schbath, 1997; Trifonov and Brendel, 1986). All these methods are subject to two significant caveats. First, statistical significance does not necessarily imply biological significance and vice versa, in part due to a high level of uncertainty in developing an appropriate stochastic model to assess the expected distribution of word counts. Second, these methods provide a list of significantly over- or under-represented motifs but additional information is required to predict the biological function of these motifs. Consequently, these methods have not been routinely used in complete genome analysis. Our goal in this work is to design a user-friendly environment that integrates tools for ab initio discovery of novel sequence motifs in complete prokaryotic genomes with tools for building hypotheses about their possible biological roles.


    2 METHODS
Ab Initio Motif Identification Environment (AIMIE) consists of several user interfaces accessible via a web browser, each designed for a different task in the discovery and interpretation of significant sequence motifs. The first phase involves discovery of over-represented sequence motifs. After the motifs are identified, a suite of tools is provided for their further analysis and interpretation, including r-scan statistics (Dembo and Karlin, 1988; Karlin and Brendel, 1992) to detect anomalous distribution of the motifs, and pattern vicinity analysis to investigate potential relationships between the sequence motifs and annotated genes. The software also provides an option to mask out the dominant sequence motifs and reanalyze the sequence in order to increase sensitivity to weaker motifs.

2.1 AIMIE phase I: motif discovery
The algorithm implemented in AIMIE is based on the frequent word technique designed by Karlin and co-workers (Karlin and Cardon, 1994; Karlin and Leung, 1991; Karlin et al., 1996). This method detects significantly frequent words (oligonucleotides) of a fixed length s. The word length is set such that $4^{s-1} \le L < 4^{s}$, where L is the length of the analyzed DNA sequence. That is, the words of length s are on average expected to occur less than once. Words w that satisfy the inequality


[Formula 1]    (1)

are considered frequent words. $r_w$ denotes the observed number of occurrences of the word w in the sequence and $p_w$ is the expected word frequency assessed from the first order Markov model ($p_w L$ is the expected count). The left side of the inequality (1) represents an approximate probability that the word w occurs $r_w$ times in the sequence and by setting this probability ≤1/L one expects to find not more than one frequent word if the sequence was random. However, the fixed word length represents a significant drawback as many frequent words are either parts of longer sequence motifs or extensions of shorter sequence motifs. The following algorithm is designed to overcome this difficulty.
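
The frequent word criterion can be illustrated with a minimal Python sketch. The formula for inequality (1) is not reproduced in this text, so the sketch assumes that its left side is the Poisson probability $(p_w L)^{r_w} e^{-p_w L} / r_w!$; this matches the verbal description above but is an assumption, not a confirmed transcription. The restriction to words observed more often than expected, and the function names `word_length` and `frequent_words`, are also illustrative choices rather than part of AIMIE.

```python
import math
from collections import Counter

def word_length(L):
    """Return s such that 4**(s-1) <= L < 4**s."""
    s = 1
    while 4 ** s <= L:
        s += 1
    return s

def frequent_words(seq, s=None):
    """List (word, observed, expected, log_probability) for frequent words.

    ASSUMPTION: the criterion is Poisson pmf(r_w; p_w * L) <= 1 / L,
    applied only to words observed more often than expected.
    """
    seq = seq.upper()
    L = len(seq)
    if s is None:
        s = word_length(L)
    # First-order Markov model estimated from the sequence itself.
    base = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(L - 1))
    with_successor = Counter(seq[:-1])
    observed = Counter(seq[i:i + s] for i in range(L - s + 1))
    hits = []
    for w, r in observed.items():
        p = base[w[0]] / L
        for a, b in zip(w, w[1:]):
            p *= pairs[a + b] / with_successor[a]
        lam = p * L  # expected count p_w * L
        # log of the Poisson probability of exactly r occurrences
        log_prob = r * math.log(lam) - lam - math.lgamma(r + 1)
        if r > lam and log_prob <= -math.log(L):
            hits.append((w, r, lam, log_prob))
    hits.sort(key=lambda t: t[3])  # most significant first
    return hits
```

Working with log-probabilities avoids numerical underflow for words with very high counts. For a sequence of about 4.6 Mb, `word_length` returns 12, within the 10 to 12 bp range quoted in step 1 of the algorithm.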

Motif discovery algorithm:

  1. The frequent word technique provides a list of significantly over-represented oligonucleotides of a fixed length s (between 10 and 12 bp for prokaryotic chromosomes, depending on the sequence length). A given number of most significant frequent words (user-defined parameter, default 100) are selected for further analysis.
  2. All copies of the top frequent words in the analyzed sequence are found and combined into Segments Consisting of Overlapping Frequent words (SCOFs), which are extracted from the analyzed sequence using the Pattern Locator program (Mrázek and Xie, 2006). SCOFs are of variable length ≥s but many represent different copies of the same sequence motif.
  3. SCOFs are clustered into groups corresponding to the same (or similar) sequence motifs. A distance between two SCOFs is defined as a minimum number of mismatched nucleotides between the optimally aligned SCOFs (without inserting gaps) divided by the length of the shorter SCOF. The standard Unweighted Pair Group Method with Arithmetic mean (UPGMA) hierarchical clustering method is used to cluster the SCOFs and all SCOFs joined into a single node below a given distance threshold (user-defined parameter, default 0.3) are considered the same sequence motif. John Brzustowski's qclust program (http://www2.biology.ualberta.ca/jbrzusto/dosclust.html) is used for the clustering. The default clustering cutoff was determined after extensive experimentation as a value that yields most interpretable results in most cases. The users can adjust the clustering cutoff for each analyzed sequence. For example, the cutoff value can be increased if too many similar motifs appear in the output that might be combined into a single conserved motif.
  4. Each sequence motif is represented by an alignment of SCOFs which belong to that motif. The alignment is performed by ClustalW (Thompson et al., 1994). A consensus sequence is generated from the alignment using degenerate nucleotide alphabet (standard NCIUB, formerly IUPAC code) (NCIUB, 1986). The consensus-generating algorithm allows ignoring nucleotides that occur in less than a given fraction of SCOFs in the alignment (user-defined parameter, default 10%). For example, if the frequencies of A, C, G and T at a given position in the alignment are, say, 70%, 20%, 5% and 5%, respectively, the consensus will have the letter M (A or C) at that position. Ambiguous codes corresponding to three or four different nucleotides (N, B, D, H and V) at both termini are removed. Consequently, the consensus sequence can be either longer or shorter than the initial word length s.
  5. At this point the user is presented with a list of identified sequence motifs and enters AIMIE phase II. The user can also select motifs to be masked out in the analyzed sequence and repeat AIMIE phase I to find additional sequence motifs.
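
Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AIMIE code: it assumes that the shorter SCOF is slid along the longer one without overhang when computing the distance, that the SCOFs passed to `consensus` are already aligned to equal length, and that alignment columns in which no nucleotide reaches the cutoff are reported as N. The function names are hypothetical.

```python
# Degenerate nucleotide codes keyed by the set of nucleotides they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G",
    frozenset("T"): "T", frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W", frozenset("GT"): "K",
    frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def scof_distance(a, b):
    """Minimum ungapped mismatch count divided by the shorter length.

    ASSUMPTION: the shorter SCOF must lie entirely within the longer one.
    """
    short, long_ = sorted((a, b), key=len)
    best = min(
        sum(x != y for x, y in zip(short, long_[off:off + len(short)]))
        for off in range(len(long_) - len(short) + 1)
    )
    return best / len(short)

def consensus(aligned, cutoff=0.10):
    """Degenerate consensus of equal-length aligned SCOFs."""
    n = len(aligned)
    codes = []
    for column in zip(*aligned):
        # Ignore nucleotides occurring in less than `cutoff` of the SCOFs.
        kept = frozenset(b for b in "ACGT" if column.count(b) / n >= cutoff)
        codes.append(IUPAC.get(kept, "N"))
    # Remove three- and four-fold ambiguity codes at both termini.
    return "".join(codes).strip("NBDHV")
```

With the frequencies from the example in step 4 (70% A, 20% C, 5% G, 5% T) and the default cutoff of 10%, the column reduces to {A, C} and the consensus letter is M.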

2.2 AIMIE phase II: motif interpretation
After completion of AIMIE phase I, the consensus sequences for all identified motifs are displayed as a starting point for AIMIE phase II, which aims to provide additional information about the motifs to facilitate predictions of their possible biological functions. Three actions are available for each sequence motif: (1) display aligned SCOFs that belong to this motif, (2) investigate the distribution of the motif in the analyzed sequence using the motif consensus sequence and (3) investigate the distribution of the motif using a position-specific score matrix (PSSM) derived from the alignment. We describe the available tools below.

2.2.1 Analysis of motif distribution using consensus sequence
The consensus sequence is transferred to a modified Pattern Locator interface (Mrázek and Xie, 2006), which opens in a new window so that the user can return to the list of sequence motifs and choose other motifs for analysis. The interface allows users to manually modify the consensus sequence using any allowable Pattern Locator syntax before continuing [see reference Mrázek and Xie (2006) or http://www.cmbl.uga.edu/software/patloc-user-guide.htm for details]. Typical modifications would include specifying which DNA strands are searched, or allowing a user-defined number of mismatches. Pattern Locator first finds in the analyzed sequence all copies of the motif that match the consensus sequence and passes the coordinates to additional programs including r-scan statistics and pattern vicinity analysis.

r-scan statistics is designed to detect statistically significant anomalies in a distribution of a set of markers in a DNA sequence, such as excessive clumping (clusters), overdispersion (gaps) or unexpectedly regular distribution. The mathematical background for r-scan statistics and formulas to assess statistical significance were developed by Dembo and Karlin (1988). A brief description of the method is available on our web site http://www.cmbl.uga.edu/software/r-scans.htm and examples of practical applications can be found in the literature [e.g. (Karlin and Brendel, 1992; Karlin et al., 1996; Mrázek et al., 2002)]. The user can choose to receive the text output and graphical representation of significant clusters and gaps.
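
The quantity underlying r-scan statistics can be sketched as follows. An r-scan is the distance spanned by r consecutive inter-marker intervals; unusually short r-scans point to clusters and unusually long ones to gaps. The sketch only computes the r-scan lengths for a linear sequence and deliberately omits the significance thresholds, which come from Dembo and Karlin (1988) and are not reproduced here. The function names are hypothetical.

```python
def r_scans(positions, r):
    """Return (length, start, end) for every run of r consecutive intervals."""
    pos = sorted(positions)
    return [(pos[i + r] - pos[i], pos[i], pos[i + r])
            for i in range(len(pos) - r)]

def extreme_r_scans(positions, r):
    """Return the shortest and the longest r-scan.

    The shortest is a candidate cluster and the longest a candidate gap;
    whether either is significant requires the published formulas.
    """
    scans = r_scans(positions, r)
    if not scans:
        raise ValueError("need more than r markers")
    return min(scans), max(scans)
```

For example, `extreme_r_scans([10, 20, 25, 30, 500], 2)` reports the span from 20 to 30 as the tightest group of three markers and the span from 25 to 500 as the widest.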

Pattern vicinity analysis provides a set of tools to analyze the relationship of sequence motifs to adjacent genes. The primary output includes a list of all instances of the motif in the analyzed DNA sequence together with information about the overlapping or adjacent genes obtained from the annotation. A brief summary provides counts of motifs found in genes, in intergenic regions or overlapping with gene starts or ends, and for intergenic motifs counts of those located between convergently transcribed genes, divergently transcribed genes and co-oriented genes. Users can also choose to receive histograms of motif counts found at specific positions with respect to 3' and 5' ends of genes. These data can be used to identify motifs found frequently near the end of a gene (e.g. those involved in transcription termination) or the start of a gene (possibly involved in transcription/translation initiation), as well as motifs associated with a specific functional category of genes.
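
The classification of intergenic motifs by the orientation of the flanking genes can be sketched as below. The gene representation (start, end, strand) with 1-based inclusive coordinates, the category labels and the function name are assumptions made for illustration; the actual output of the pattern vicinity analysis is richer (for example, it distinguishes overlaps with gene starts from overlaps with gene ends).

```python
def classify_motif(m_start, m_end, genes):
    """Classify one motif copy relative to annotated genes.

    genes: list of (start, end, strand) with strand '+' or '-'.
    """
    if any(g[0] <= m_start and m_end <= g[1] for g in genes):
        return "in gene"
    if any(m_start <= g[1] and g[0] <= m_end for g in genes):
        return "overlaps gene boundary"
    # Nearest gene ending before the motif and nearest gene starting after it.
    left = max((g for g in genes if g[1] < m_start),
               key=lambda g: g[1], default=None)
    right = min((g for g in genes if g[0] > m_end),
                key=lambda g: g[0], default=None)
    if left is None or right is None:
        return "intergenic, flanked on one side only"
    if left[2] == "+" and right[2] == "-":
        return "convergent"   # both genes are transcribed toward the motif
    if left[2] == "-" and right[2] == "+":
        return "divergent"    # both genes are transcribed away from the motif
    return "co-oriented"
```

Counting the returned labels over all copies of a motif gives a summary of the kind described above, with a motif located in the 3' flanking regions of both neighbors appearing as convergent.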

2.2.2 Analysis of motif distribution using the PSSM representation
The PSSM is derived from the aligned SCOFs as an n × 4 matrix consisting of log-odds scores assigned to each nucleotide at every position in the alignment, where n is the width of the alignment. Let $s_{i,j} = \log(p_{i,j}/q_i)$ be the score for the nucleotide i (i = A, C, G or T) at the motif position j. $p_{i,j}$ is the probability of finding a nucleotide i at position j of the motif (the target probability), estimated from the alignment as the number of times the nucleotide i occurs at position j divided by the number of sequences in the alignment. Pseudocounts are used in order to avoid the estimated probabilities being equal to zero. The pseudocounts are set to be equivalent to adding sequences consisting of poly-A, poly-C, poly-G and poly-T to the alignment. That is, the effect of pseudocounts becomes less significant when the number of SCOFs in the alignment is high. $q_i$ is the probability of finding the nucleotide i at any given position in the analyzed sequence (background probability). Any nucleotide sequence of length n can now be assigned a score $S = \sum_{j=1}^{n} s_{i_j,j}$, where $i_j$ is the nucleotide at the position j in the sequence at hand. The probabilistic rationale for the PSSM representation can be found in most bioinformatics textbooks, e.g. (Deonier et al., 2005).
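
A minimal Python sketch of this construction follows. Adding one poly-A, one poly-C, one poly-G and one poly-T sequence to an alignment of N SCOFs amounts to adding 1 to every nucleotide count and dividing by N + 4. The use of natural logarithms, the assumption that the aligned SCOFs contain no gap characters, and the function names are choices made for this illustration only.

```python
import math
from collections import Counter

def build_pssm(aligned, background):
    """Build a list of per-position dicts of log-odds scores.

    aligned: equal-length SCOF strings (ASSUMED to contain no gaps).
    background: dict of background probabilities q_i for A, C, G and T.
    """
    n_seqs = len(aligned)
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        pssm.append({
            # The +1 and +4 are the poly-A/C/G/T pseudocount sequences.
            b: math.log((counts[b] + 1) / (n_seqs + 4) / background[b])
            for b in "ACGT"
        })
    return pssm

def score(word, pssm):
    """Sum of the position-specific scores of the nucleotides in word."""
    return sum(col[b] for b, col in zip(word, pssm))
```

For instance, with the four aligned sequences ACG, ACG, ACT and ACG and a uniform background, the target probability of A at the first position is (4 + 1)/(4 + 4) = 0.625 and its score is log(0.625/0.25) = log 2.5.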

After the aligned SCOFs are converted into a PSSM, the analyzed DNA sequence is scanned for all words of length n with a score S of at least a given cutoff S0. The cutoff can be specified in two ways: the user can provide the actual cutoff value or a percentile referring to the distribution of scores among the SCOFs in the alignment. For example, specifying 10% when the alignment contains 50 sequences will set the score cutoff equal to the score of the 6th lowest scoring SCOF in the alignment. By default, the score cutoff is set to 10% but not less than zero. r-scan statistics and analysis of distribution are applied to all copies of the motif with scores ≥S0 as described above for the consensus sequence representation.
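
The percentile-based cutoff and the scan can be sketched as follows. The rounding rule (truncating the product of percentile and alignment size) is an assumption chosen so that 10% of 50 sequences selects the 6th lowest score, as in the example above; how AIMIE rounds in other cases is not stated. The function names and the `score_fn` callable are hypothetical.

```python
def score_cutoff(scof_scores, percentile=10.0, floor=0.0):
    """Score of the SCOF just above the lowest `percentile` percent.

    The result is never lower than `floor` (zero by default).
    """
    ranked = sorted(scof_scores)
    k = int(len(ranked) * percentile / 100)  # SCOFs allowed below the cutoff
    k = min(k, len(ranked) - 1)
    return max(ranked[k], floor)

def scan(seq, n, score_fn, cutoff):
    """Return (offset, word) for every word of length n scoring >= cutoff."""
    return [
        (i, seq[i:i + n])
        for i in range(len(seq) - n + 1)
        if score_fn(seq[i:i + n]) >= cutoff
    ]
```

Here `score_fn` stands for any function that maps a word of length n to its PSSM score S.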


    3 IMPLEMENTATION AND AVAILABILITY
AIMIE is currently running on a Dell Precision 690n workstation with two quad-core Intel Xeon processors under Red Hat Enterprise Linux. It is accessible via a standard Web browser at http://www.cmbl.uga.edu/software/aimie.html. The CGI interface and the software environment consist of a collection of programs and scripts written in Python, C, Java and Perl. The environment is linked to a locally stored database of annotated prokaryotic genomes downloaded from the NCBI ftp server at ftp://ftp.ncbi.nih.gov/genomes/Bacteria/. The local database is periodically synchronized with the NCBI ftp server. AIMIE also allows users to upload their own DNA sequences. The sequences must be in GenBank format and should include the basic annotation (‘gene’ and ‘CDS’ features) in order to keep all AIMIE features functional. All files related to a particular AIMIE session are stored in a temporary directory, which is deleted after a period of inactivity exceeding 6 h. Consequently, a user may encounter a ‘No such file or directory’ exception when attempting to continue an AIMIE session after more than 6 h of inactivity. Programs that comprise AIMIE phase II (r-scan statistics, pattern vicinity analysis) are also accessible independently as extensions to Pattern Locator (Mrázek and Xie, 2006; http://www.cmbl.uga.edu/software/patloc.html) and Motif Locator (http://www.cmbl.uga.edu/software/motloc.html). The scripts and programs that comprise AIMIE are available upon request to users who wish to set up their own AIMIE server. However, an implementation on a different platform or under a different operating system may require appropriate modifications.


    4 RESULTS
We describe below two examples of AIMIE application. These serve to test the capabilities and limitations of the environment and demonstrate its intended use on specific examples. We chose the Escherichia coli K12 and Haemophilus influenzae Rd genomes for analysis by AIMIE. Both genomes were previously manually analyzed by similar methods. We demonstrate that the automated processing of the data in AIMIE detects the known sequence motifs and provides additional information.

4.1 DNA sequence motifs of the E.coli K12 chromosome
The initial analysis by AIMIE phase I with the default parameters (using the top 100 frequent words, clustering cutoff 0.30, and consensus cutoff 0.10) detected seven dominant sequence motifs (Table 1) plus seven additional motifs consisting of not more than seven SCOFs each (data not shown). The latter mostly include fragments of the main motifs that were not assigned into one of the main clusters by the clustering algorithm. The sequence motifs in the output are sorted by the number of SCOFs in the alignment multiplied by the length of the motif. All seven motifs in Table 1 correspond to repeated extragenic palindrome (REP) elements of the consensus sequence GCCKGATGGCGRCGY ... RCGYCTTATCMGGCCTAC and its inverted complement, or their fragments (Higgins et al., 1988). We compared numbers of matches to each motif in the E.coli chromosome with random sequences and most of these motifs did not have a match in random sequences (Table 1). These comparisons confirm a high statistical significance of the detected motifs. Note that the number of matches to the consensus sequence depends on the stringency of the consensus and is not necessarily the same as the number of SCOFs in the alignment. The ‘consensus cutoff’ parameter (default 0.10) can be used to regulate the stringency of the consensus. A higher consensus cutoff will increase the specificity of the search. When analyzing the motif distribution with the PSSM representation the sensitivity is determined by the score cutoff parameter.



 
Table 1. Top frequent motifs found in the E.coli K12 chromosome

 
In order to reduce the number of SCOFs entering the clustering algorithm, by default only 100 top frequent words are used. That leaves other potentially significant motifs undetected. In order to detect additional motifs, AIMIE phase I was repeated with the seven dominant motifs masked out. The second run yielded five additional motifs comprising ≥50 SCOFs each and a subsequent masking of these motifs followed by a third run of the program returned six more motifs comprising ≥20 SCOFs each (Table 2). Some of these additional motifs are also commonly present in random sequences but all have significantly more copies in the analyzed sequence.
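
The masking step can be illustrated with a short Python sketch that replaces every match to a degenerate consensus by a masking character. How AIMIE masks motifs internally is not described here, so the choice of N as the masking character, the search of both strands and the function names are all assumptions made for illustration.

```python
import re

# Degenerate nucleotide codes expressed as regular expression character classes.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def inverted_complement(consensus):
    """Reverse complement of a (possibly degenerate) consensus."""
    return consensus.translate(COMPLEMENT)[::-1]

def mask_motif(seq, consensus, mask_char="N"):
    """Replace all matches to the consensus on either strand by mask_char."""
    masked = list(seq)
    for pattern in {consensus, inverted_complement(consensus)}:
        body = "".join(IUPAC_REGEX[c] for c in pattern)
        # A lookahead is used so that overlapping matches are all found.
        for m in re.finditer("(?=" + body + ")", seq):
            start = m.start()
            masked[start:start + len(pattern)] = mask_char * len(pattern)
    return "".join(masked)
```

After masking, the sequence can be passed through the motif discovery phase again, so that words overlapping the masked motifs no longer dominate the list of top frequent words.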



 
Table 2. Additional frequent motifs found in the E.coli K12 chromosome

 
We use the first motif in Table 1, which represents a subset of REP elements (Higgins et al., 1988) for further analysis in order to demonstrate the tools available in AIMIE phase II. Five significant clusters detected by r-scans are located at positions 338980-339294, 374153-376746, 507805-508042, 4315988-4324308 and 4612279-4612516, whereas the region 898953-1952452 is significantly devoid of this motif. AIMIE also provides graphical (PDF or PostScript) display of significant clusters and areas of over-dispersion.

Pattern vicinity analysis gives information about the genes overlapping or adjacent to all copies of the motif in the analyzed sequence. The gene information is extracted from the GenBank input file. A summary at the end of the output displays the basic statistics on motif occurrences in genes and in intergenic regions (Table 3). AIMIE also generates histograms of counts of matching motifs found at specific distances from annotated starts and ends of genes (Fig. 1). Table 3 shows that virtually all copies of this motif are intergenic, which is expected for REP elements. Interestingly, only one copy is found between divergently transcribed genes, whereas 51 are located between convergently transcribed genes. This is consistent with data in Figure 1, which shows that this motif is more common in the 3' flanking regions than in the 5' flanking regions. Moreover, this sequence motif often features multiple regularly spaced copies in the same intergenic region, specifically between genes b0321 (yahG, hypothetical) and b0323 (yahI, hypothetical) (3 copies), b0337 (codA, cytosine deaminase) and b0338 (cynR, DNA-binding transcriptional dual regulator) (2 copies), b0352 (mhpE, 4-hydroxy-2-oxovalerate/4-hydroxy-2-oxopentanoic acid aldolase class I) and b0353 (mhpT, putative 3-hydroxyphenylpropionic transporter) (4 copies), b0354 (yaiL, nucleoprotein/polynucleotide-associated enzyme) and b0355 (frmB, putative esterase) (2 copies), b0483 (ybaQ, putative DNA-binding transcriptional regulator) and b0484 (copA, copper transporter) (3 copies), b2531 (iscR, DNA-binding transcriptional repressor) and b2532 (trmJ, putative methyltransferase) (2 copies), b3244 (tldD, putative peptidase) and b4472 (yhdP, putative transporter) (2 copies), b3844 (fre, flavin reductase) and b3845 (fadA, thiolase I) (2 copies), b4015 (aceA, isocitrate lyase) and b4016 (aceK, isocitrate dehydrogenase) (2 copies), b4107 (yjdN, hypothetical) and b4108 (yjdM, hypothetical) (5 copies) and b4378 (yjjV, putative DNase) and b4379 (yjjW, putative pyruvate formate lyase activating enzyme) (3 copies). The spacing between adjacent copies is 91–101 bp, and always identical within the same cluster. Other motifs in Table 1 also frequently feature multiple, regularly spaced copies (data not shown).


 
Fig. 1. Histograms show number of times the first motif in Table 1 is found at particular relative positions with respect to starts (top) and ends (bottom) of genes. In the top panel, position zero corresponds to the first nucleotide of an annotated gene, positive coordinates refer to positions downstream of a gene start (i.e. inside the protein-coding region), and negative coordinates refer to positions upstream of a gene start. In the bottom panel, position zero signifies the last nucleotide of an annotated gene, positive coordinates refer to positions downstream of a gene end (i.e. outside of the gene) and negative coordinates refer to positions upstream of a gene end (i.e. inside the gene). The gene starts and ends are determined from the ‘gene’ features in GenBank files.

 


 
Table 3. Pattern vicinity analysis of the first motif in Table 1

 
Sequence motifs in Table 2 were detected after the dominant REP-related motifs were masked out. We further analyzed motifs number 1, 2, 5 and 6, which were absent from the random sequences. The first three of these motifs, RNVNRSBDBDBWYRMSGCRTYCGVSM, TGSCGGATGCGGYKTRRVYRYCHBRYCC and YGYAGGYCKGATAAGVYRY are similar to REP elements although not a perfect match to the standard REP consensus. They also follow the same distributional pattern as REP elements: almost exclusively intergenic, frequently between convergent genes and often occurring in multiple periodically spaced copies. The fourth motif, MGGTTTATCCCCGCTGRYGMRGGGAACWY, occurs exclusively in two clusters of 11 and 7 regularly spaced copies, respectively, and corresponds to previously identified CRISPR sequences (Jansen et al., 2002).

4.2 DNA sequence motifs in the H. influenzae Rd chromosome
The analysis of the H.influenzae chromosome aims to test whether the automated procedure implemented in AIMIE finds all sequence motifs that we previously found by manual analysis of frequent words (Karlin et al., 1996; Mrázek and Karlin, 1996). The first run of AIMIE was dominated by the core uptake signal sequence (USS) motif AAGTGCGGT and the inverted complement ACCGCACTT (Smith et al., 1995). However, AIMIE detected a more extended conserved pattern in the form AAAGTGCGGTNRDWW and WWWHYNACCGCACTTT (Table 4). Motifs number 3 and 5 in Table 4 represent a dyad pairing of USSs and a partial USS, respectively. It was proposed that dyad pairs of USS could act as Rho-independent transcription terminators (Karlin et al., 1996; Kingsford et al., 2007; Kroll et al., 1992). The H.influenzae genome contains multiple extensive tandem tetranucleotide repeats (Karlin et al., 1996) and although AIMIE is primarily designed to detect short dispersed sequence repeats it can detect tandem repeats if they are sufficiently long (Table 4). The second AIMIE run was dominated by additional USS-related motifs whereas the third round yielded new motifs unrelated to USS (Table 4). In particular, the motif number 18 is a part of the previously described intergenic dyad sequence (IDS) (Mrázek and Karlin, 1996).



 
Table 4. Frequent motifs found in the H.influenzae Rd chromosome

 
We used the tools included in AIMIE to obtain additional information about the motifs in Table 4 not related to USS or tetranucleotide-tandem repeats. Listed below are some results of potential interest:
  • The sequence motif RVHGAAGAAAW features 299 copies in the chromosome, of which 267 (89%) are in genes. The fraction in genes is similar to the protein-coding content of the genome (87%), indicating a lack of preference for genes or intergenic regions. Interestingly, 12 copies of this motif are in potential pseudogenes containing frameshifts. The H.influenzae annotation includes 1789 ‘gene’ features, of which 37 include the note ‘contains frameshifts’. If the 267 in-gene copies of the motif were distributed randomly among different genes, one would expect to find on average about 5.52 copies of the motif in genes with the ‘contains frameshifts’ note. Assuming that the number of motifs found in such genes is Poisson distributed, finding 12 or more copies is statistically significant with P-value 0.011. One might speculate that this motif could be related to the frameshifts.
  • The sequence motif TCGCCTTKTT occurs in 99 copies, 89 are located in genes.
  • The sequence motif CYAWTTCTTC is similar to an inverted complement of the RVHGAAGAAAW motif (see above). It has 109 copies, of which 102 are in genes and 5 are in frameshift-containing genes.
  • The sequence motif YATTGATGAA occurs in 98 copies of which 93 are in genes or pseudogenes.
  • The IDS related motif GRCTDWAGCCCACCCTAC is the only motif in this list that does not occur mostly in genes. All 24 copies are intergenic, with 17 between co-oriented genes, 6 between divergent genes and 1 between convergent genes. The motifs occur in several clusters detected by r-scan statistics.
  • The sequence motif CAATGSCATYA involves 50 copies of which 45 are in genes.
  • The sequence motif TTCAATATC has 121 copies in the chromosome and 113 are in genes.
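
The significance estimate quoted for the RVHGAAGAAAW motif in the first item of the list can be checked with a few lines of Python. The sketch takes the counts from the text, treats each gene as equally likely to receive a motif copy (as the text does) and interprets the P-value as the upper tail probability of a Poisson distribution; the variable and function names are illustrative.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(k))
    return 1.0 - below

genes_total = 1789        # 'gene' features in the annotation
genes_frameshift = 37     # genes annotated with 'contains frameshifts'
motifs_in_genes = 267     # in-gene copies of the motif
observed = 12             # copies found in frameshift-containing genes

expected = motifs_in_genes * genes_frameshift / genes_total
p_value = poisson_tail(observed, expected)
print(f"expected = {expected:.2f}, P = {p_value:.3f}")
```

The expected count and tail probability obtained this way should agree with the values of about 5.52 and 0.011 given in the text.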

The results listed above are restricted to information that is directly available from AIMIE. Additional potentially useful data for predicting the biological function of the sequence motifs can sometimes be obtained from further analysis of functional descriptions of the overlapping or adjacent genes.


    5 DISCUSSION
Our goal was to design a user-friendly environment combining tools for ab initio discovery of significant short dispersed motifs in genomic DNA sequences with tools for building hypotheses about possible biological roles of the motifs. AIMIE automates several key tasks in motif discovery and analysis and combines the appropriate tools in a single environment for routine analysis of prokaryotic genomes. We present two model applications involving two of the most thoroughly analyzed genomes, E.coli and H.influenzae. Over-represented sequence motifs in the E.coli K12 chromosome are dominated by REP elements (Higgins et al., 1988). These motifs occur mostly downstream of protein-coding regions. Although REP elements are not involved in transcription termination, they were shown to protect the mRNA from degradation by 3'->5' exonucleases, and the location near gene 3' termini is consistent with this function (Higgins et al., 1988). We found other motifs similar to REP elements, which share the same distributional patterns and probably have the same function. Interestingly, REP-related motifs often occur in clusters comprising several regularly spaced copies. The multiple copies of a REP element could serve to increase the stabilizing effect on the mRNA. Note that the motif discovery algorithm employed in AIMIE phase I is designed to find sequence motifs that are statistically significant in the context of complete genomes, whereas other methods are more suitable for finding motifs associated with a specific set of genes, such as cis-acting regulatory elements (Bailey and Elkan, 1996; Thompson et al., 2007).

Dispersed sequence motifs in the H.influenzae genome are dominated by the uptake signal sequences (Smith et al., 1995). Two additional AIMIE iterations preceded by masking of the previously detected motifs were required to detect the weakly conserved IDS (Mrázek and Karlin, 1996). Additional statistically significant short dispersed repeats are located mostly in genes and most of them could be related to repeated amino acid motifs in proteins.

In the two case studies, we were able to detect the previously characterized sequence motifs plus some additional sequence motifs in H.influenzae. Our tests confirmed that the algorithm employed by AIMIE detected all sequence motifs previously found by time-consuming manual analysis of the sequence data, and comparisons with random sequences verified that the motifs detected by AIMIE are statistically significant. We present AIMIE as an environment suitable for routine analysis of prokaryotic genomes and for building hypotheses about short dispersed repeats they contain.
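
As a rough illustration of what a comparison with random sequences can look like, the following Python sketch estimates an empirical P-value for the count of one word by shuffling the sequence. It is a generic Monte Carlo test under stated assumptions, not the statistical procedure used by AIMIE: the function names, the number of shuffles and the choice of a shuffle that preserves only the base composition are all hypothetical.

```python
import random


def count_overlapping(sequence: str, word: str) -> int:
    """Count occurrences of word in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(word)
    while start != -1:
        count += 1
        start = sequence.find(word, start + 1)
    return count


def empirical_p_value(sequence: str, word: str,
                      n_shuffles: int = 200, seed: int = 0) -> float:
    """Estimate P(count in shuffled sequence >= observed count).

    Assumption: the null model is a random permutation of the bases,
    which preserves mononucleotide composition only.
    """
    rng = random.Random(seed)
    observed = count_overlapping(sequence, word)
    bases = list(sequence)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(bases)
        if count_overlapping("".join(bases), word) >= observed:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least + 1) / (n_shuffles + 1)
```

With this design the smallest attainable estimate is 1/(n_shuffles + 1), so genome-scale significance thresholds would need either many more shuffles or an analytical approximation of the word-count distribution.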


    ACKNOWLEDGEMENTS
 
This work was supported by the startup funds provided by the University of Georgia and the R. E. Powe Award from Oak Ridge Associated Universities to J.M. The authors wish to thank Dr Tim Hoover and Ms. Lyla Lipscomb for their suggestions regarding the software and the interface, and for their critical reading of the manuscript.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on January 14, 2008; revised on February 21, 2008; accepted on February 22, 2008

    REFERENCES
 

    Bailey TL, Elkan C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning (1996) 21:51–80.

    Dembo A, Karlin S. Poisson approximations for r-scan processes. Ann. Appl. Prob. (1988) 2:329–357.

    Deonier RC, et al. Computational Genome Analysis: An Introduction (2005) New York: Springer.

    Higgins CF, et al. Repetitive extragenic palindromic sequences, mRNA stability and gene expression: evolution by gene conversion? A review. Gene (1988) 72:3–14.

    Hughes JD, et al. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. (2000) 296:1205–1214.

    Jansen R, et al. Identification of genes that are associated with DNA repeats in prokaryotes. Mol. Microbiol. (2002) 43:1565–1575.

    Karlin S, Brendel V. Chance and statistical significance in protein and DNA sequence analysis. Science (1992) 257:39–49.

    Karlin S, Cardon LR. Computational DNA sequence analysis. Annu. Rev. Microbiol. (1994) 48:619–654.

    Karlin S, Leung MY. Some limit theorems on distributional patterns of balls in urns. Ann. Appl. Prob. (1991) 1:513–538.

    Karlin S, et al. Frequent oligonucleotides and peptides of the Haemophilus influenzae genome. Nucleic Acids Res. (1996) 24:4263–4272.

    Kingsford CL, et al. Rapid, accurate, computational discovery of Rho-independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. (2007) 8:R22.

    Kirzhner V, et al. A large-scale comparison of genomic sequences: one promising approach. Acta Biotheor. (2003) 51:73–89.

    Kroll JS, et al. Palindromic Haemophilus DNA uptake sequences in presumed transcriptional terminators from H. influenzae and H. parainfluenzae. Gene (1992) 114:151–152.

    Lawrence CE, et al. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science (1993) 262:208–214.

    Leung MY, et al. Over- and under-representation of short DNA words in herpesvirus genomes. J. Comput. Biol. (1996) 3:345–360.

    Mrázek J. Analysis of distribution indicates diverse functions of simple sequence repeats in Mycoplasma genomes. Mol. Biol. Evol. (2006) 23:1370–1385.

    Mrázek J, Karlin S. A new significant recurrent dyad pairing in Haemophilus influenzae. Trends Biochem. Sci. (1996) 21:201–202.

    Mrázek J, Xie S. Pattern locator: a new tool for finding local sequence patterns in genomic DNA sequences. Bioinformatics (2006) 22:3099–3100.

    Mrázek J, et al. Frequent oligonucleotide motifs in genomes of three streptococci. Nucleic Acids Res. (2002) 30:4216–4221.

    NCIUB. Nomenclature for incompletely specified bases in nucleic acid sequences. Recommendations 1984. Proc. Natl Acad. Sci. USA (1986) 83:4–8.

    Overbeek R, et al. Annotation of bacterial and archaeal genomes: improving accuracy and consistency. Chem. Rev. (2007) 107:3431–3447.

    Pesole G, et al. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences. Nucleic Acids Res. (1992) 20:2871–2875.

    Reinert G, et al. Probabilistic and statistical properties of words: an overview. J. Comput. Biol. (2000) 7:1–46.

    Schbath S. An efficient statistic to detect over- and under-represented words in DNA sequences. J. Comput. Biol. (1997) 4:189–192.

    Smith HO, et al. Frequency and distribution of DNA uptake signal sequences in the Haemophilus influenzae Rd genome. Science (1995) 269:538–540.

    Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. (1994) 22:4673–4680.

    Thompson WA, et al. The Gibbs centroid sampler. Nucleic Acids Res. (2007) 35:W232–W237.

    Trifonov EN, Brendel V. Gnomic: A Dictionary of Genetic Codes (1986) Philadelphia: Balaban Publishers, Rehovot.





