A test for the statistical significance of DNA sequence similarities for application in databank searches
Laboratory of Mathematical Biology, National Institute for Medical Research, Mill Hill, London NW7 1AA
1Department of Applied Statistics, University of Reading, Reading RG6 2AN, UK
*To whom reprint requests should be sent
A method is developed, based on word-searching, which provides a rapid test for the statistical significance of DNA sequence similarities for use in databank searching. The method makes allowance for the lengths and dinucleotide compositions of the sequences being compared. A way is also described to calculate the power of the test, i.e. the probability of detecting a given similarity as being statistically significant. The effects of the scoring method, word length, sequence length, and sequence composition on the power of the test are examined. A novel scoring method is shown to be superior to the method currently used in most word-searching algorithms.
Received on August 3, 1988; accepted on December 12, 1988
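
To make the quantities named in the abstract concrete, the following is a minimal illustrative sketch of word-searching between two DNA sequences. It is not the scoring method or the significance test developed in the paper. The function names (`word_counts`, `shared_word_score`, `dinucleotide_frequencies`) and the choice of score (a plain count of matching word pairs) are assumptions made for illustration only.

```python
from collections import Counter


def word_counts(seq, k):
    """Count every overlapping word of length k in seq (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_word_score(a, b, k):
    """Count the pairs (position in a, position in b) that hold the same k-word.

    This simple count is an assumed stand-in for a word-search score;
    it is not the novel scoring method described in the abstract.
    """
    counts_a = word_counts(a, k)
    counts_b = word_counts(b, k)
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide in seq."""
    counts = word_counts(seq, 2)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


if __name__ == "__main__":
    query = "ACGTACGT"
    entry = "TTACGTTT"
    print(shared_word_score(query, entry, 4))
    print(dinucleotide_frequencies(query))
```

A significance test of the kind the abstract describes would compare such a score with its expected distribution for unrelated sequences of the same lengths and dinucleotide compositions; that step is not attempted here.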