A new method for finding long consensus patterns in nucleic acid sequences
MRC Virology Unit, Institute of Virology, Glasgow G11 5JR; Computing Service, Glasgow University, Glasgow G12 8QQ, UK; Department of Genetics, Leningrad State University, Leningrad 199034, USSR
We describe a fast computer algorithm for identifying consensus patterns in DNA sequences. The method requires no prior assumptions about the consensus pattern other than its length. In particular, no previous knowledge of the frequency or spacing of consensus patterns is required. However, a priori information about the shape of the consensus pattern, the invariability of individual positions, or the overall conservation level can be used to enhance the selectivity and sensitivity of the search. Because the number of possible consensus words increases very rapidly with length, comprehensive searches have usually been restricted to a maximum of 10-12 nucleotides, even when large mainframes are used. Our algorithm enables searches for consensus patterns of this order on current mid-range and powerful microcomputers. Searches may be conducted on a single long sequence or on a set of shorter, possibly aligned, sequences. We give examples of consensus patterns identified in both prokaryotic and eukaryotic DNA sequences, along with some typical program timings.
Received on January 14, 1991; accepted on March 5, 1991
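To make the scale of the problem in the abstract concrete, the following is a minimal Python sketch of a naive baseline search. It is not the algorithm described in this paper. The function names (`search_space`, `hamming`, `naive_consensus`) and the mismatch-count criterion are assumptions chosen for illustration. Only words that actually occur in the sequence are tried as candidates, which sidesteps enumerating all 4^L possible words but can miss a consensus that never appears exactly.

```python
from collections import Counter


def search_space(length):
    """Number of distinct DNA words of the given length (4 bases per position)."""
    return 4 ** length


def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))


def naive_consensus(seq, length, max_mismatches):
    """Illustrative baseline, not the paper's method.

    Every word of the given length that occurs in seq is tried as a
    candidate consensus; its score is the number of windows in seq that
    differ from it in at most max_mismatches positions.
    Returns (best_word, hits); ties keep the earliest candidate.
    """
    windows = [seq[i:i + length] for i in range(len(seq) - length + 1)]
    counts = Counter(windows)  # distinct words, in order of first occurrence
    best_word, best_hits = None, 0
    for candidate in counts:
        hits = sum(n for word, n in counts.items()
                   if hamming(candidate, word) <= max_mismatches)
        if hits > best_hits:
            best_word, best_hits = candidate, hits
    return best_word, best_hits
```

A call such as `naive_consensus(sequence, 6, 1)` scores each six-base word in `sequence` while tolerating one mismatch. The cost grows with the square of the number of distinct windows, and `search_space(12)` already exceeds 16 million words, which is why exhaustive approaches become impractical at the lengths discussed in the abstract.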