Statistical distance between texts and filtration methods in sequence comparison
Department of Mathematics, University of Southern California Los Angeles, CA 90089-1113, USA and Laboratory of Mathematical Methods, Institute of Genetics of Microorganisms Moscow 113545, USSR
Upon searching local similarities in long sequences, the necessity of a rapid similarity search becomes acute. Quadratic complexity of dynamic programming algorithms forces the employment of filtration methods that allow elimination of the sequences with a low similarity level. The paper is devoted to the theoretical substantiations of the filtration method based on the statistical distance between texts. The notion of the filtration efficiency is introduced and the efficiency of several filters is estimated. It is shown that the efficiency of the statistical l-tuple filtration upon DNA database search is associated with a potential extension of the original fourletter alphabet and grows exponentially with increasing l. The formula that allows one to estimate the filtration parameters is presented.
Received on March 15, 1991; accepted on September 13, 1991