Bioinformatics Advance Access originally published online on November 24, 2006
Bioinformatics 2007 23(4):502-503; doi:10.1093/bioinformatics/btl601
Mclip: motif detection based on cliques of gapped local profile-to-profile alignments
ARC Centre of Excellence for Integrative Legume Research and Bioinformatics Laboratory, Genomic Interactions Group, Research School of Biological Sciences, Australian National University, GPO Box 475, Canberra, ACT 2601, Australia
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: A multitude of motif-finding tools have been published, which can generally be assigned to one of three classes: expectation-maximization, Gibbs-sampling or enumeration. Irrespective of this grouping, most motif detection tools only take into account similarities across ungapped sequence regions, so that short motifs located peripherally and at varying distance from a core motif may be missed. We present a new method, adding to the set of expectation-maximization approaches, that permits the use of gapped alignments for motif elucidation.
Availability: The program is available for download from: http://bioinfoserver.rsbs.anu.edu.au/downloads/mclip.jar
Contact: Georg.Weiller{at}anu.edu.au
Supplementary information: http://bioinfoserver.rsbs.anu.edu.au/utils/mclip/info.php
| 1 INTRODUCTION |
|---|
Motif detection methods can generally be classified into one of three groups: enumeration methods [Weeder (Pavesi et al., 2001)], Gibbs-sampling [AGLAM (Tharakaraman et al., 2005)] and expectation-maximization [MEME (Bailey and Elkan, 1994)]. The basic premise of all these methods is that, for a given set of co-expressed sequences, the motifs responsible for this co-expression will be more conserved and present more frequently in those sequences than in other sets of sequences. Finding all possible motifs of any length in a highly variable number of sequences, some of which may contain the motif and some not, is a daunting task and, most likely, the reason why many motif detection tools require the user to specify bounding parameters, such as specific motif lengths or the number of times a motif is to be found in a sequence. Unfortunately, regulatory motifs vary in size [TRANSFAC (Matys et al., 2003)], and it is frequently uncertain whether the observed co-expression of a set of sequences is due to common regulation or chance. Both factors may cause relevant motifs to be missed if inappropriate bounding parameters are used. In addition, many programs simply output one or multiple sequence regions in which a motif was detected without providing an estimate of how likely this motif is to have occurred there by chance. A further disadvantage is that many tools are available only via a web-interface, making large-scale analyses tedious, or require extensive dependencies, making installation of the programs a major stumbling block to their everyday use. Fortunately, there are also many easy-to-use, readily installable programs that provide adequate significance measures for their results, such as AlignACE (Hughes et al., 2000), AGLAM and MEME.
We wish to extend that list by presenting a tool that bases its motif detection on cliques of gapped local profile-profile alignments. Here, cliques refer to sets of alignment traces for which all profiles share co-aligned residues. The use of gapped alignments comes at the cost of increased complexity and longer running time, but may increase sensitivity by enabling the detection of gapped or additional motifs located peripherally and at variable distance from a core motif. An example application of Mclip to sets of co-expressed sequences and a more detailed explanation of the alignment and motif-finding procedure can be found in the supplementary information.
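The notion of a clique used above can be illustrated with a small sketch. It assumes that each alignment trace is stored as a list of co-aligned residue pairs, with every residue written as a hypothetical `(profile, position)` tuple, and it enumerates maximal cliques with a plain Bron-Kerbosch search. Both the data layout and the choice of algorithm are assumptions made for illustration; the text does not state how Mclip itself identifies cliques.

```python
def build_adjacency(traces):
    """Map each residue to the set of residues it is co-aligned with.

    traces: iterable of alignment traces, each a list of
    ((profile, pos), (profile, pos)) pairs (hypothetical layout).
    """
    adjacency = {}
    for trace in traces:
        for a, b in trace:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Toy traces between three profiles A, B and C (invented data).
traces = [
    [(("A", 3), ("B", 5)), (("A", 4), ("B", 6))],  # A-B alignment
    [(("B", 5), ("C", 2)), (("B", 6), ("C", 4))],  # B-C alignment
    [(("A", 3), ("C", 2))],                        # A-C alignment
]
cliques = maximal_cliques(build_adjacency(traces))
# Keep only residue sets that are co-aligned across all three profiles.
shared = [c for c in cliques if len({profile for profile, _ in c}) == 3]
```

In this toy input only residues A3, B5 and C2 are pairwise co-aligned in all three profiles, so `shared` holds a single clique; the pairs A4-B6 and B6-C4 do not extend to a third profile.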
| 2 IMPLEMENTATION |
|---|
The program uses a multi-step approach to finding motifs (Fig. 1). Local alignments are generated for all sequence pairs. Based on these alignments, 5-state profiles [A,C,G,T,gap] are derived for each sequence and provide a numerical representation of the residues contained in the alignments covering that sequence. Local alignments are then generated for all pairs of profiles by maximizing the log-odds ratio of one profile region emitting the residue counts present in a region of the other profile and vice versa (similar to COMPASS; Sadreyev and Grishin, 2003). Motifs can then be inferred from cliques of local profile-profile alignments sharing co-aligned residues. A motif is derived by combining the position-specific residue frequencies of the profile regions covered by the clique and adding gaps as specified by the local alignments.
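A minimal sketch of the profile representation and of a symmetric log-odds column score is given below. It assumes a uniform background over the five states and a pseudocount of 1 per state; these values, and the function names, are illustrative assumptions rather than the settings used by Mclip, whose exact scoring is described in the supplementary information.

```python
import math

STATES = "ACGT-"  # the five profile states: four bases plus gap


def column_counts(aligned_residues):
    """Count the states observed in one alignment column."""
    counts = {state: 0 for state in STATES}
    for residue in aligned_residues:
        counts[residue] += 1
    return counts


def frequencies(counts, pseudocount=1.0):
    """Turn counts into emission frequencies (pseudocount is an assumption)."""
    total = sum(counts.values()) + pseudocount * len(STATES)
    return {s: (counts[s] + pseudocount) / total for s in STATES}


def symmetric_log_odds(counts_a, counts_b, background=None):
    """Score column a emitting the counts of column b, and vice versa."""
    if background is None:
        # Uniform background over the five states (an assumption).
        background = {s: 1.0 / len(STATES) for s in STATES}
    freq_a = frequencies(counts_a)
    freq_b = frequencies(counts_b)
    a_from_b = sum(counts_a[s] * math.log(freq_b[s] / background[s])
                   for s in STATES)
    b_from_a = sum(counts_b[s] * math.log(freq_a[s] / background[s])
                   for s in STATES)
    return a_from_b + b_from_a


col_a = column_counts("AAAA")
col_t = column_counts("TTTT")
same = symmetric_log_odds(col_a, col_a)       # similar columns
different = symmetric_log_odds(col_a, col_t)  # dissimilar columns
```

Columns with similar residue counts yield a positive score and dissimilar columns a negative one, which is the property a local alignment routine needs in order to extend through conserved regions and stop elsewhere.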
This produces a set of possible motifs. Which of these are present in which sequences is determined by aligning the input sequences back to the motifs. The program returns the motifs and sequence regions with high-scoring alignments to the motifs.
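The step of aligning input sequences back to a motif can be sketched in a simplified, ungapped form. The example assumes a motif stored as a list of per-position base frequencies and a uniform background of 0.25; the `scan` and `consensus_column` helpers and the toy motif are hypothetical. Mclip itself uses gapped alignments for this step, so the sketch only conveys the scoring idea.

```python
import math


def consensus_column(base, weight=0.85):
    """Build one motif column favouring a single base (toy helper)."""
    rest = (1.0 - weight) / 3
    return {b: (weight if b == base else rest) for b in "ACGT"}


def scan(sequence, motif, background=0.25):
    """Return (best_score, best_start) of an ungapped log-odds scan.

    motif: list of dicts mapping A, C, G, T to non-zero frequencies.
    """
    width = len(motif)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(math.log(column[base] / background)
                    for column, base in zip(motif, window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start


# Toy motif with consensus TATA, scanned against an invented sequence.
motif = [consensus_column(base) for base in "TATA"]
score, start = scan("GGCTATAGC", motif)
```

Here the highest-scoring window begins at index 3 (0-based), where the sequence matches the consensus exactly.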
| 3 APPLICATION |
|---|
Default input is a set of unaligned FASTA format sequences. Command line parameters as well as a web-interface allow the user to modify the parameters. Mclip automatically determines the size of motifs and aligns them to the input sequences, providing a statistical estimate for the motif-sequence alignment. The output is similar to that of MEME and consists of a list of detected motifs, the sequences with significant similarities to the motifs, their start, motif-match, end, alignment score, Z-score and E-value. The motif-sequence alignment routine is available separately (Mmatch) and can be used to search for motifs found by Mclip in a different set of sequences. Both programs are written in Java and run under MacOS, Windows and Unix/Linux; a Java 1.5 or later runtime environment is required. The programs are available under the GNU General Public License; all source code is included in the jar archives.
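For readers preparing input, the FASTA format mentioned above can be read with a few lines of code. The sketch below is a generic reader written for illustration and is not taken from the Mclip source; the `parse_fasta` name and the fallback naming of records without an identifier are assumptions.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a dict of name -> sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            # Use the first word of the header as the record name.
            name = header[0] if header else "seq%d" % (len(records) + 1)
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


# Invented example input with one sequence wrapped over two lines.
example = [
    ">gene1 upstream region",
    "acgtac",
    "GGTATA",
    ">gene2",
    "TTGACA",
]
sequences = parse_fasta(example)
```

The function accepts any iterable of lines, so an open file handle can be passed directly.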
Mclip is available for download from http://bioinfoserver.rsbs.anu.edu.au/downloads/mclip.jar. Mmatch is available for download from http://bioinfoserver.rsbs.anu.edu.au/downloads/mmatch.jar.
In addition, Mclip can be run via the web-interface at http://bioinfoserver.rsbs.anu.edu.au/utils/mclip/.
| Acknowledgments |
|---|
This research was funded by an Australian Research Council Centre of Excellence grant. Funding to pay the Open Access publication charges for this article was provided by the same grant.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on September 15, 2006; revised on November 5, 2006; accepted on November 20, 2006
| REFERENCES |
|---|
Bailey, T.L. and Elkan, C. (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In Altman, R.B., Brutlag, D.L., Karp, P.D., Lathrop, R.H., Searls, D.B. (Eds.). Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Menlo Park, CA: AAAI Press, pp. 28–36.
Hughes, J.D., et al. (2000) Computational identification of cis-regulatory elements associated with functionally coherent groups of genes in Saccharomyces cerevisiae. J. Mol. Biol., 296, 1205–1214.
Matys, V., et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
Pavesi, G., et al. (2001) An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics, 17, S207–S214.
Sadreyev, R. and Grishin, N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
Tharakaraman, K., et al. (2005) Alignments anchored on genomic landmarks can aid the identification of regulatory elements. Bioinformatics, 21, i440–i448.
