

Bioinformatics Advance Access originally published online on April 13, 2006
Bioinformatics 2006 22(14):1690-1701; doi:10.1093/bioinformatics/btl146

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Ribosomal RNA as molecular barcodes: a simple correlation analysis without sequence alignment

K. H. Chu 1,*, C. P. Li 1 and J. Qi 2

1 Department of Biology, The Chinese University of Hong Kong, Hong Kong, China
2 Center for Comparative Genomics and Bioinformatics, Pennsylvania State University, University Park, PA 16802, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: We explored the feasibility of using unaligned rRNA gene sequences as DNA barcodes, based on correlation analysis of composition vectors (CVs) derived from nucleotide strings. We tested this method with seven rRNA (including 12, 16, 18, 26 and 28S) datasets from a wide variety of organisms (from archaea to tetrapods) at taxonomic levels ranging from class to species.

Results: Our results indicate that grouping of taxa based on CV analysis is always in good agreement with the phylogenetic trees generated by traditional approaches, although in some cases the relationships among the higher systematic groups may differ. The effectiveness of our analysis might be related to the length of, and divergence among, the sequences in a dataset. Nevertheless, the correct grouping of sequences and accurate assignment of unknown taxa make our analysis a reliable and convenient approach in analyzing unaligned sequence datasets of various rRNAs for barcoding purposes.

Availability: The newly designed software (CVTree 1.0) is publicly available at the Composition Vector Tree (CVTree) web server http://cvtree.cbi.pku.edu.cn

Contact: kahouchu@cuhk.edu.hk


    INTRODUCTION
It is estimated that there are between 3.6 million and 100 million species on earth (Heywood and Watson, 1995), which are a valuable biological resource for human civilization. In the recent decade, the loss of biodiversity has been recognized as a major global environmental problem and much effort has been targeted at biodiversity conservation. Yet a major obstacle in assessing human impact on the biosphere is what has often been referred to as the ‘taxonomic impediment’: taxonomic expertise is lacking for many groups of living organisms, and information in taxonomy is not always accessible and intelligible to biologists who are not taxonomists (Minelli, 2003). To overcome this problem, genetic information, specifically DNA sequences, has been suggested to serve as a criterion, or at least a complement, in taxonomic identification (e.g. Blaxter, 2003; Mallet and Willmott, 2003; Tautz et al., 2003; Savolainen et al., 2005).

Hebert et al. (2003a, b) proposed the use of mitochondrial cytochrome c oxidase subunit I (COI) as ‘DNA barcodes’ for species identification in the animal kingdom, and its applicability has been demonstrated in a wide variety of organisms (e.g. Hebert et al., 2004a,b; Saunders, 2005; Ward et al., 2005). Yet closely related species may have identical or nearly identical COI sequences (Harrison, 2004; Lorenz et al., 2005; Zhang et al., 2005). It has also been suggested that it is undesirable to rely on a single sequence for taxonomic identification (Sites and Crandall, 1997; Mallet and Willmott, 2003; Matz and Nielsen, 2005). Thus the feasibility of using additional genes, particularly ribosomal RNA (rRNA) genes, as DNA barcodes has also been explored. For instance, different rRNA genes have been proposed to be good DNA barcodes in nematodes (Floyd et al., 2002; Blaxter et al., 2004; Power, 2004), tardigrades (Blaxter et al., 2003) and amphibians (Vences et al., 2005).

An intrinsic problem of using rRNA as barcodes resides in sequence alignment (Lutzoni et al., 2000; Noé and Kucherov, 2004). Since base insertions and deletions (indels) are common in rRNA sequences, every sequence with indels has to be assigned gaps for alignment with the others. Since no universal alignment parameters are defined, assigning gaps to DNA sequences is subjective (Geiger, 2002), and there is no consensus on what defines a ‘good’ or a ‘best’ multiple alignment (Wheeler, 1996). As a result, even when the alignment process is performed carefully by experienced researchers, human errors can be introduced, particularly in rRNA sequences for which no closely related sequences are available for use as reference. Besides the ambiguity in multiple sequence alignment, the process often has to be repeated whenever a new sequence (taxon) is added to a dataset before analysis. It is estimated that 200 000 barcode records will be added to the database each year (Hajibabaei et al., 2005). With such a large dataset, sequence alignment in the barcode project would become tedious and time consuming. While it may be argued that sequences in a DNA barcode database can be divided into subsets (each representing an appropriate taxonomic level such as class, order or family) which can be aligned independently for analysis, each of these subsets would still be made up of hundreds to thousands of sequences, which is massive for the alignment procedure. For instance, the fish family Cichlidae of about 1300 described species (Kullander, 1998) and the insect family Tipulidae (craneflies) of up to 15 000 species (Alexander, 1920) are examples of taxa that are preferably analyzed as a group. Similarly, there are about 22 000 described species of nematodes, and new DNA sequences of this group (for which the corresponding descriptive taxonomy may not be available) possibly have to be analyzed with those from all nematodes as a whole. Another drawback of the alignment procedure is that, ideally, the DNA sequences used for barcoding should be intact, i.e. free of any artifacts, including gaps. Otherwise, the same sequence may be referred to as different barcodes by different laboratories because of differences in alignment. Thus, sequence alignment is a major obstacle that limits the effective use of rRNAs for barcoding purposes.

In the present study, we attempted to analyze rRNA sequences without alignment, using a simple correlation analysis based on composition vectors (CVs) derived from sequence data (Yu and Jiang, 2001; Chu et al., 2004; Qi et al., 2004a,b), with a view to testing the feasibility of using rRNAs as molecular barcodes. In line with the studies that demonstrated the use of COI as DNA barcodes (Hebert et al., 2003a,b), we considered assembling large datasets of rDNA sequences from the GenBank database for our feasibility study. Yet this approach requires the construction of alignment-based trees for comparison, which would involve ambiguity of alignment as well as subjectivity in the choice of tree construction methods. Thus, we took an alternative strategy of comparing our approach with published rRNA trees in the literature, plus an unpublished tree from our own research. We analyzed a total of seven rRNA datasets from a wide variety of organisms and taxonomic levels, from archaea to tetrapods, from class to species. The results demonstrated that unaligned rRNA gene sequences could be used as convenient and reliable DNA barcodes.


    MATERIALS AND METHODS
Sequences from six published rRNA datasets (Arahal et al., 1996; Ro et al., 1997; de Bellocq et al., 2001; Shull et al., 2001; Rickard et al., 2002; Xia et al., 2003) were downloaded from GenBank for analysis (Table 1). An unpublished dataset of partial 12S rRNA sequences (~430 bp) from 19 taxa (18 species of Nephropidae, the clawed lobsters, and one outgroup) (Tshudy et al., 2005) was also included in the analysis. These datasets were chosen because they represented different rRNA genes (12, 16, 18, 26 and 28S) from different groups of living organisms, including archaea (Arahal et al., 1996), bacteria (Rickard et al., 2002), plants (Ro et al., 1997) and animals (de Bellocq et al., 2001; Shull et al., 2001; Xia et al., 2003; Tshudy et al., 2005), with taxonomic levels ranging from class to species. The methods used in analyzing the datasets in the original published papers incorporated the common approaches of phylogenetic reconstruction, including neighbor joining (NJ), maximum parsimony (MP) and maximum likelihood (ML), among others. The number of taxa in each dataset ranged from 19 to 49.


Table 1 List of organisms (with the number of taxa in parentheses), source references, sequence information, K values used in analysis and the tree topology test results of the seven datasets studied

 
As a first step of our analysis, the length of each sequence was checked against the others in the same dataset and any excessive sequences from individual taxa were excluded from analysis. Our approach based on CVs was originally applied to analyze all protein sequences from complete genomes (Chu et al., 2004; Qi et al., 2004a, b) and the vectors were analogous to the peptide frequency vectors used by Stuart et al. (2002a,b). In the present study we adopted the approach to analyze nucleotide sequences of rRNA genes. Briefly, for an rRNA gene sequence of length L, the frequency of appearance of oligonucleotide strings of a fixed length K was calculated. The total number N of possible types of such strings was $4^K$ and the total number of K-strings in the sequence was $(L - K + 1)$. The frequency of each of the N kinds in a given DNA sequence was determined by sliding through the sequence, shifting one nucleotide position at a time. The observed frequency $p(\alpha_1\alpha_2\ldots\alpha_K)$ of a K-string $\alpha_1\alpha_2\ldots\alpha_K$ was $n(\alpha_1\alpha_2\ldots\alpha_K)/(L - K + 1)$, where $n(\alpha_1\alpha_2\ldots\alpha_K)$ was the number of times that $\alpha_1\alpha_2\ldots\alpha_K$ appeared in the sequence. For instance, in the DNA sequence ‘CGCAGTTTGTATACCGTCAT’, $p(\mathrm{A}) = 4/20$, $p(\mathrm{TT}) = 2/(20 - 2 + 1)$ and $p(\mathrm{TTT}) = 1/(20 - 3 + 1)$. For a given K, we put the frequencies of all possible K-strings in a fixed order to obtain a CV of dimension $4^K$ for each sequence. The correlation C(A,B) between two sequences A and B was determined by taking the projection of one vector on another, and the distance between the two was defined as $D = (1 - C)/2$. After constructing a distance matrix for all sequences in a dataset, the NJ (Saitou and Nei, 1987) analysis implemented in Phylip 3.63 (Felsenstein, 1989) was used to construct the phylogenetic tree for the dataset. The details of this method were described in Qi et al. (2004a) and Chu et al. (2004), who analyzed the amino acid sequences from complete genomes of prokaryotes and from chloroplast genomes, respectively. In those previous studies, in order to diminish the influence of random neutral mutations at the molecular level and to highlight the shaping role of selective evolution, the random background was subtracted from the frequencies of oligopeptide strings using a Markov model of order (K – 2) before computation of the CVs. This subtraction procedure was omitted in the present study because of the limited length of the rRNA genes. Preliminary analysis of the rRNA datasets also showed that the procedure did not further enhance the reliability of the method. The analysis was implemented using CVTree alpha 1.0, which can be downloaded from http://cvtree.cbi.pku.edu.cn (Qi et al., 2004b).
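The computation described above can be sketched in a few lines of Python. This is an illustration only, not the CVTree implementation: the function names are hypothetical, the correlation is assumed to be the cosine of the angle between the two composition vectors (one reading of "the projection of one vector on another"), and the vectors are stored sparsely because most of the $4^K$ components are zero for short sequences. The last function writes the distances in the square matrix layout read by the PHYLIP `neighbor` program.

```python
from collections import Counter
from math import sqrt


def kstring_frequencies(seq, k):
    """Observed frequency of every K-string, sliding one base at a time.

    Returns a sparse vector (dict); K-strings that never occur are omitted,
    which is equivalent to a zero entry in the full 4**k dimensional CV.
    """
    total = len(seq) - k + 1  # number of K-strings, L - K + 1
    if total <= 0:
        raise ValueError("sequence is shorter than K")
    counts = Counter(seq[i:i + k] for i in range(total))
    return {s: n / total for s, n in counts.items()}


def correlation(fa, fb):
    """Cosine of the angle between two sparse CVs (assumed form of C)."""
    dot = sum(v * fb.get(s, 0.0) for s, v in fa.items())
    norm = sqrt(sum(v * v for v in fa.values()) *
                sum(v * v for v in fb.values()))
    return dot / norm


def cv_distance(seq_a, seq_b, k):
    """Distance D = (1 - C) / 2 between two unaligned sequences."""
    c = correlation(kstring_frequencies(seq_a, k),
                    kstring_frequencies(seq_b, k))
    return (1.0 - c) / 2.0


def phylip_matrix(names, seqs, k):
    """Square distance matrix as text; names are cut or padded to 10 chars."""
    lines = [f"{len(names):5d}"]
    for name, a in zip(names, seqs):
        row = " ".join(f"{cv_distance(a, b, k):.6f}" for b in seqs)
        lines.append(f"{name[:10]:<10}{row}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = "CGCAGTTTGTATACCGTCAT"  # worked example from the text
    print(kstring_frequencies(example, 2)["TT"])  # 2 / (20 - 2 + 1)
    print(phylip_matrix(["seqA", "seqB"], [example, example[::-1]], 3))
```

Because all frequencies are non-negative, the cosine lies between 0 and 1, so under this assumption D falls between 0 (identical composition) and 0.5 (no K-string in common).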

To determine the length of string (K) used in the CV analysis, we followed Pevzner's (2000) result that the best K value for a sequence of length L is Formula. The K values used for each of the rRNA datasets to generate the distance matrices ranged from 8 to 11 (Table 1). Our preliminary analyses showed that for most datasets, the number of correctly grouped taxa reached a peak when K was above 7–8. The CV trees generated from the distance matrices were then compared with the corresponding trees constructed using traditional alignment-based methodologies, using the Kishino–Hasegawa (KH) test (Kishino and Hasegawa, 1989) and the Shimodaira–Hasegawa (SH) test (Shimodaira and Hasegawa, 1999) in PAUP 4.0 (Swofford, 2000). If the topologies of the two trees from the same dataset were significantly different, we varied K to search for a value that could generate a CV tree that better matched the tree constructed from the sequence alignment. For the published datasets, the trees in the corresponding papers were used for comparison. For comparison with the CV tree based on the unpublished Nephropidae dataset, an NJ tree was constructed. First, the Nephropidae sequences were aligned using the multiple-alignment program Clustal W 1.5c (Thompson et al., 1994) with adjustments made by eye. The NJ analysis with 1000 bootstrap replicates was then implemented using Mega 3 (Kumar et al., 2004) based on the Kimura 2-parameter distance model (Kimura, 1980).
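For contrast with the alignment-free CV distance, the Kimura 2-parameter distance used for the alignment-based NJ tree can be sketched as follows. The function name is hypothetical, and skipping gapped or ambiguous sites pair by pair is an assumption made for this illustration; it is not necessarily the setting used in Mega 3. With P the proportion of transitions and Q the proportion of transversions among compared sites, the distance is $d = -\tfrac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(aligned_a, aligned_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    sites = transitions = transversions = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: ignore gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1  # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument not positive).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Unlike the CV distance, this function needs its two inputs to be aligned to the same length first, which is exactly the step the CV approach avoids.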


    RESULTS
The 12S rRNA (~430 bp) dataset including 18 Nephropidae (Arthropoda, Crustacea, Decapoda) species and Neoglyphea inopinata as outgroup (GenBank accession nos DQ298420–DQ298438) was generated by the first author of this paper and his collaborators in phylogenetic studies of this family (Tshudy et al., 2005). In both the NJ and CV trees (Fig. 1), species in the same genus always grouped together. The topologies and the relationships between taxa were also highly similar in the two trees. The only difference was in the position of Thymopides grobovi and Nephropides caribaeus. In the CV tree, they clustered as a group most closely related to the two Eunephrops species. In the NJ tree, the former two species did not form a clade; instead there was weak bootstrap support (BP < 60) for the grouping of N.caribaeus with Eunephrops spp., with T.grobovi as the most distant taxon among the four species. Nevertheless, there was no significant difference in tree topology between the two trees based on both the KH and the SH tests (P > 0.05).


Fig. 1 Distance trees of Nephropidae based on 12S rRNA constructed with (a) the NJ method and (b) CV analysis (K = 8). Numbers on branches in (a) indicate bootstrap values (1000 replicates) from NJ analysis.

 
Arahal et al. (1996) attempted to identify 22 halophilic archaeal strains of the family Halobacteriaceae collected from the Dead Sea. The strains were divided into five groups based on their phenotypic features. A 16S rRNA (~1400 bp) NJ tree was then constructed from one representative strain of each group (E1, E2, E8, E11 and E12) and 27 well-known Halobacteriaceae strains, along with two outgroup taxa. The five unknown strains were assigned to three genera, Haloferax, Haloarcula and Halobacterium, in the NJ tree. The assignment of the five strains was identical in the tree generated from our CV analysis of the same dataset (Fig. 2). The grouping of the 27 well-known strains at the genus level was also the same in our tree and the published tree. The relationships between different genera were not well resolved in either tree, and no significant difference between the two trees was found in the KH and the SH tests.


Fig. 2 CV tree (K = 10) based on the 16S rRNA dataset of archaea analyzed in Arahal et al. (1996). Unknown strains indicated by arrows.

 
In a study on freshwater biofilm bacteria by Rickard et al. (2002), 15 gram-negative strains were identified to the genus level based on ML analysis of partial 16S rRNA sequences (~650 bp) using data from 31 known bacterial strains from five taxonomic groups, Zymomonas, Methylobacterium, Bradyrhizobium, Rhodobacter and Pseudomonas. In our CV tree (Fig. 3), all bacterial strains, including the 15 unknowns, were assigned correctly to their corresponding taxonomic groups as in the ML tree. In both trees, the Pseudomonas group was the most distant taxon and the relationships between the other four groups were not well resolved, although the topologies were different between the two trees. While Rhodobacter was the sister group of a cluster consisting of Bradyrhizobium, Methylobacterium and Zymomonas in the ML tree, it clustered with the Zymomonas group only in the CV tree, with Bradyrhizobium and Methylobacterium as their sister groups. Unlike the previous two comparisons, the KH and the SH tests (P < 0.05) showed that the topologies of the two trees were significantly different. By varying K values from 4 to 20 in our analysis, we could not generate a topology that gave a better match with the ML tree.


Fig. 3 CV tree (K = 9) based on the 16S rRNA dataset of bacteria analyzed by Rickard et al. (2002). Unknown strains indicated by arrows.

 
Shull et al. (2001) explored the phylogenetic relationships of 36 adephagan beetles (Arthropoda: Insecta: Coleoptera) and 13 outgroup species based on full-length 18S rRNA (~2400 bp) sequences using two tree reconstruction approaches, POY (Gladstein and Wheeler, 1996, ftp.amnh.org/people/wheeler/poy) + parsimony searches and ML with 5:1 alignment weight. Broadly similar tree topologies from the two approaches suggested that the suborder Adephaga was a well-supported group, and monophyly of each of the two groups within Adephaga, Geadephaga (in terrestrial habitats) and Hydradephaga (in aquatic habitats), was also supported. However, monophyly of the families within the two groups was weakly supported. Moreover, the family Trachypachidae, which is terrestrial but possesses some features that characterize Hydradephaga, was grouped with Geadephaga. In the CV tree (Fig. 4), the clustering of monophyletic groups, including Adephaga, Geadephaga and Hydradephaga, was identical to that in the trees of Shull et al. (2001). Similarly, the monophyly of the families was not supported in our CV analysis, although the relationships between specific taxa might be different. As in the ML tree, Trachypachidae was placed within Geadephaga. There was no significant difference in tree topologies between our CV tree and the ML tree according to the KH and SH tests.


Fig. 4 CV tree (K = 11) based on the 18S rRNA dataset of adephagan beetles analyzed by Shull et al. (2001).

 
Studies based on 18S rRNA sequences of tetrapods always supported the grouping of birds and mammals (Hedges et al., 1990; Rzhetsky and Nei, 1992; Huelsenbeck et al., 1996) in contrast to the grouping of birds and reptiles based on morphological, paleontological and other molecular data (Carroll, 1988; Eernisse and Kluge, 1993; Hedges, 1994). Xia et al. (2003) attempted to resolve this issue by analyzing 47 tetrapod 18S rRNA sequences (~2100 bp), with Latimeria as outgroup. By presenting a FastME (Desper and Gascuel, 2002) tree constructed using structurally aligned 18S rRNA sequences, the authors argued that the bird–mammal grouping was because of sequencing errors and misalignment of the sequences in previous analyses. Xia et al.'s (2003) tree showed that, other than those sequences of Hedges et al. (1990), which included three reptiles, three amphibians and one bird (Turdus), birds and reptiles did group together. Sequences from Hedges et al. (1990), however, clustered together as a sister group of the bird–reptile clade in the FastME tree. According to Xia et al. (2003), the sequences from Hedges et al. (1990) were poor in quality for alignment, so that those sequences failed to be assigned to the respective amphibian, reptile or bird clades as a result of analytical errors. In our CV tree (Fig. 5), most of the taxa could be grouped to their corresponding amphibian, reptile, bird or mammal clades, including Turdus that was correctly grouped to the bird clade as the most distant taxon, rather than to the other sequences of Hedges et al. (1990) as in the FastME tree of Xia et al. (2003). However, the sequences of reptiles and amphibians from Hedges et al. (1990) were grouped into a single clade, distinct from the rest. 
In contrast to the FastME tree, the CV tree supported the affinity between birds and mammals as in many previous analyses based on 18S rRNA (Hedges et al., 1990; Rzhetsky and Nei, 1992; Huelsenbeck et al., 1996) but not the bird–reptile relationship. When the sequences from Hedges et al. (1990) were excluded from our analysis, the topology of the CV tree remained the same (tree not shown). A significant difference between the topologies of the FastME tree and the CV tree was found based on the KH and SH tests. Varying K values from 4 to 20 in our analysis did not yield a tree topology that matched the FastME tree better. Birds and mammals always grouped together in our CV trees.


Fig. 5 CV tree (K = 10) based on the 18S rRNA dataset of tetrapods analyzed by Xia et al. (2003).

 
Ro et al. (1997) used partial 26S rRNA sequences (~1100 bp) of 31 Ranunculaceae (Anthophyta: Angiospermae: Ranunculales) taxa and four Berberidaceae outgroup taxa to resolve the phylogenetic relationships in Ranunculaceae at subfamily level. The traditional classification system in Ranunculaceae is based on fruit types, flower parts, and the number and shape of chromosomes. The Ranunculus group (R-chromosome group) has large and long chromosomes with a base number of 8, and the Thalictrum group (T-chromosome group) has short and small chromosomes with a base number of 7 or 9. Based on NJ analysis of 26S rRNA, Ro et al. (1997) re-examined the traditional classification system and proposed four subfamilies of Ranunculaceae: (1) Hydrastidoideae, with genus Hydrastis; (2) Coptidoideae, with Coptis and Xanthorhiza; (3) Thalictroideae consisting of all T-chromosome taxa, except Hydrastis, Coptis and Xanthorhiza and (4) Ranunculoideae including all R-chromosome taxa. Hydrastis, which was placed in family Hydrastidaceae by Hoot (1991, 1995), was treated as a highly autapomorphic lineage and included within family Ranunculaceae by Ro et al. (1997). Similarly, our CV analysis also separated Hydrastis as the basal branch of this family (Fig. 6). Monophyly of Xanthorhiza-Coptis, Trollius-Adonis, Consolida-Delphinium and Actaea-Cimicifuga-Eranthis as supported by NJ analysis was also evident in the CV tree. Yet monophyly of Ranunculus-Trautvetteria that was strongly supported by the NJ analysis was not supported by the CV tree. Moreover, the position of the Consolida-Delphinium group, which belongs to the R-chromosome group (equivalent to Ranunculoideae), was different between the two trees. In the NJ tree, it was the sister group of a clade consisting of the other Ranunculoideae taxa and Thalictroideae. However, in our CV tree, the Consolida-Delphinium group clustered with Thalictroideae instead of the other Ranunculoideae taxa. In any case, the phylogenetic position of this group in Ranunculaceae has always been controversial (Tamura, 1993; Jensen, 1995). Other than these discrepancies, the grouping and topology of the NJ tree and CV tree were identical. No significant difference was found in the KH and SH tests between the NJ and CV trees.


Fig. 6 CV tree (K = 10) based on the 26S rRNA dataset of Ranunculaceae analyzed by Ro et al. (1997).

 
Traditional classifications of nematodes are always problematic because reliable morphological characters are difficult to examine. de Bellocq et al. (2001) attempted to use partial 28S rRNA sequences (~600 bp) to resolve phylogenetic relationships of 19 nematode species from the Trichostrongylina and Strongylina groups, in the order Strongylida. In their MP tree, Trichostrongylina constituted a monophyletic group, and the three-superfamily classification of Trichostrongyloidea, Heligmosomoidea and Molineoidea within this group was well supported, with Heligmosomoidea and Molineoidea as sister taxa. The phylogenetic relationships within Trichostrongyloidea could not be well resolved, with bootstrap support of only ~60. In contrast to Trichostrongylina, Strongylina was found to be paraphyletic in MP analysis as Triodontophorus serratus (family Strongylidae) clustered with Trichostrongylina (~60 bootstrap support) instead of with the other Strongylina. In the CV tree (Fig. 7a), the monophyly of Trichostrongylina and paraphyly of Strongylina were also evident. However, in two of the three superfamilies (Trichostrongyloidea and Molineoidea) in Trichostrongylina, members of the same superfamily did not cluster together, although members of Heligmosomoidea and Molineoidea were found to be closely related as in the MP tree. A significant difference in topology between the CV and MP trees was found in the KH and the SH tests. Interestingly, when the string length K was lowered to four (Fig. 7b), Trichostrongyloidea became a monophyletic group and the monophyly of the three families (Haemonchidae, Cooperiidae and Trichostrongylidae) was supported as in the MP tree. Yet the relationships among members of the other two superfamilies were identical between the CV trees of K = 9 and K = 4, but different from the MP tree. The KH test showed that the CV tree of K = 4 was not significantly different from the MP tree, but a significant difference between the two trees was found based on the SH test (P < 0.05).


Fig. 7 CV trees based on 28S rRNA dataset of nematodes analyzed by de Bellocq et al. (2001) with (a) K = 9 and (b) K = 4.

 

    DISCUSSION
Qi et al. (2004a) demonstrated the applicability of CV analysis without sequence alignment in phylogenetic reconstruction of complete protein sequences from prokaryote genomes. This method has subsequently been applied, in some cases with modifications, in analyzing the chloroplast (Chu et al., 2004; Yu et al., 2005) and mitochondrial genomes (Z. G. Yu et al., unpublished data). A similar approach has previously been applied in analyzing mitochondrial genomes of vertebrates (Stuart et al., 2002a,b). While all previous analyses have been based on protein sequence analysis, using different procedures for subtraction of random background, the present study is a first attempt to apply this approach in analyzing short DNA sequences from single genes. We have used the simplest version of CVs (without subtraction of random background) in analyzing a range of rRNA datasets from different taxonomic groups of organisms available in the literature. The purpose is to test the feasibility of this approach in clustering short DNA sequences. By circumventing the sequence alignment procedure, we hope that this approach would facilitate the use of various rRNAs as molecular barcodes in species identification.

In the datasets analyzed, the groupings of rRNA sequences to taxa as revealed by our analysis, such as grouping of clawed lobsters (Nephropidae) to genera, of buttercup (Ranunculaceae) species to subfamilies, and of tetrapods to classes, are often very similar to those obtained using traditional approaches based on sequence alignment. In the archaea and bacteria datasets, the assignment of unknown taxa to their respective taxonomic groups is identical between the two kinds of analyses. Among the seven datasets examined, the Nephropidae (12S rRNA), archaea (16S rRNA), adephagan beetle (18S rRNA) and Ranunculaceae (26S rRNA) datasets gave very similar results in terms of topology between the published trees and the trees based on analysis of CVs. This indicates that our analysis could also elucidate the phylogenetic relationships among the higher taxa, in addition to grouping sequences to lower taxa. In the other three datasets, i.e. bacteria (16S rRNA), tetrapod (18S rRNA) and nematode (28S rRNA), there are significant differences in topology between the published trees and the CV trees, suggesting that the relationships among the higher taxa revealed by the two kinds of analysis are different in some cases.

In the case of tetrapods, although our CV tree is different from Xia et al.'s (2003) FastME tree, it is very similar to those in previous studies based on 18S rRNA (Hedges et al., 1990; Rzhetsky and Nei, 1992; Huelsenbeck et al., 1996) as well as those based on other algorithms used by Xia et al. (2003), in which birds and mammals are grouped together. Thus, the reason why 18S rRNA always gives a topology distinct from those revealed by other datasets in tetrapods remains an issue to be explored. Yet it is interesting to note that in our tree one of the ‘poor’ sequences of Hedges et al. (1990), that of the bird Turdus, clusters with those of other birds, suggesting that our approach may be useful in analyzing ‘poor’ DNA sequences.

For the tetrapod and bacteria datasets, despite the differences in tree topology, our analysis could accurately cluster the sequences into taxonomic groups (genera and classes, respectively), suggesting that unaligned rRNA genes could serve as DNA barcodes for grouping sequences into taxa and assigning unknown sequences to those taxa. However, the CV analysis of the nematode dataset appears to be problematic in terms of its ability to group sequences into the right taxa. Interestingly, the CV tree based on a DNA string length (K) of 4 yields a topology that matches the published tree better than the CV tree based on K = 9. We note that the nematode dataset is among those with shorter sequences. Among the datasets studied, three have sequences under 600 bp. The Nephropidae dataset has the shortest sequences (~430 bp), and the sequences in both the bacteria and nematode datasets are 550–600 bp long. The other four datasets have sequences >1000 bp long. Following Pevzner's (2000) results, we used small K values (8 or 9) in analyzing the three datasets with shorter sequences, as frequency data for long K-strings (10 or 11) may not provide enough information to reveal the relationships between the sequences. Yet among these three datasets, only for the nematode dataset does our analysis fail to group the sequences into taxa. One parameter worth noting is the mean sequence divergence of the datasets, which is 14.5% for the bacteria dataset, 13.9% for the Nephropidae dataset and only 7.3% for the nematode dataset. The higher level of divergence in the former two datasets would provide more information for analysis and thus may explain why the topology of the corresponding CV trees is comparable with that of the published trees. In the nematode dataset, which has the lowest divergence, a smaller DNA string length (K = 4) would enhance the resolving power of CV analysis.
To elucidate this issue, the relationships between the DNA string length used in the analysis and the sequence length and divergence in a dataset have to be explored. Further, the applicability of different algorithms for subtracting the random background in sequences, including the Markov model (Chu et al., 2004; Qi et al., 2004a,b), the dynamic language model (Yu et al., 2005) and the discrete Fourier transform (Z. G. Yu, personal communication), in enhancing the reliability of analyzing short rRNA gene sequences should also be investigated.
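
One way to see why long K-strings may carry little information for short sequences is to compare the number of overlapping K-strings in a sequence of length L, which is L − K + 1, with the 4^K strings that are possible. The Python sketch below makes that comparison. The function name kmer_occupancy and the three lengths are illustrative assumptions chosen to resemble the dataset sizes discussed above; the code is not part of the analysis reported here.

```python
def kmer_occupancy(seq_len: int, k: int) -> float:
    """Upper bound on the fraction of the 4**k possible K-strings that one
    sequence of seq_len bases can contain (hypothetical helper)."""
    windows = max(seq_len - k + 1, 0)  # number of overlapping K-strings
    return min(windows / 4 ** k, 1.0)


if __name__ == "__main__":
    # Illustrative lengths only: ~430 bp, ~600 bp and a longer 1500 bp sequence.
    for seq_len in (430, 600, 1500):
        for k in (4, 9, 11):
            bound = kmer_occupancy(seq_len, k)
            print(f"L={seq_len:5d}  K={k:2d}  occupancy bound={bound:.4%}")
```

With K = 4 there are only 256 possible strings, so even a 430 bp sequence can in principle cover all of them. With K = 9 a 600 bp sequence can contain at most 592 of the 262 144 possible strings, so most components of the vector are necessarily zero.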

To sum up, we have demonstrated that the analysis of CVs based on unaligned rRNA sequences is a reliable clustering strategy for DNA barcoding purposes in a variety of taxonomic groups and at various systematic levels. While this approach was previously applied to complete genome data, the present study shows that it is also applicable to much shorter DNA sequences from a single gene, which will be the fundamental building block of the massive barcode database. Ultimately the database is estimated to include as many as 65 billion bp (Hajibabaei et al., 2005). It has been estimated that it takes 1 h to align a dataset containing 100 sequences of 1500 bp each on a P4 1.8 GHz computer using ClustalW (Ebedes and Datta, 2004). Yet in a simulation study of CV analysis, we could analyze 10 000 sequences of the same length in the same time. Therefore, the approach would greatly expedite the barcoding analysis of large datasets. The approach may also be applied as a rapid method for cluster analysis of a massive dataset (>10 000 sequences) so that the subsets can be analyzed separately by alternative strategies.
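
The alignment-free comparison at the core of this strategy can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in this study: it builds the vector from raw K-string frequencies and omits the background-subtraction step, and the function names are hypothetical. The distance D = (1 − C)/2, where C is the cosine correlation between two vectors, follows the form used in earlier CV work (Qi et al., 2004a).

```python
from collections import Counter
from math import sqrt
from typing import Dict


def composition_vector(seq: str, k: int) -> Dict[str, float]:
    """Relative frequencies of the overlapping K-strings in seq.
    Simplification: no background subtraction is applied."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set("ACGT")  # skip ambiguous bases
    )
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid K-strings")
    return {s: c / total for s, c in counts.items()}


def cv_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """D = (1 - C) / 2, where C is the cosine correlation of the two vectors."""
    dot = sum(v * b.get(s, 0.0) for s, v in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return (1.0 - dot / (norm_a * norm_b)) / 2.0


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    x = composition_vector("ACGTACGTTGCAACGT", 3)
    y = composition_vector("ACGTACGATGCAACGT", 3)
    print(cv_distance(x, y))
```

Because each sequence is reduced to a vector independently, the pairwise distance matrix can be filled without any alignment step and then passed to a standard tree-building or clustering method.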

It is worth noting that CV analysis could have other applications in DNA barcoding besides cluster analysis. Determining the frequencies of DNA strings would enable easy identification of taxon-specific strings that can be used as taxon-specific probes in DNA chips for species identification (Summerbell et al., 2005). Moreover, the vector based on each sequence is unique and thus could serve as a taxon-specific signature, e.g. in the proposed Barcode of Life Data Systems Identification engine (Hajibabaei et al., 2005). The use of such vector signatures would reduce the size of the entire database from several hundred base pairs per taxon to ~10 digits per taxon. This taxon-specific code would be analogous to the ‘Code 39’ standard widely used in many industry and government barcode specifications. To conclude, we believe that our approach, which requires no sequence alignment, would greatly facilitate the development of various rRNA genes for barcoding of life.
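
The idea of taxon-specific strings can be illustrated with a short Python sketch. It is a minimal example under assumptions of our own choosing, not a procedure from this study: a string is treated as taxon-specific if it occurs in every sequence of one taxon and in no sequence of any other taxon, each taxon is assumed to have at least one sequence, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct overlapping K-strings in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def taxon_specific_strings(taxa: Dict[str, List[str]], k: int) -> Dict[str, Set[str]]:
    """K-strings present in every sequence of a taxon and absent from all
    sequences of the other taxa (assumed definition of 'taxon-specific')."""
    per_taxon = {t: [kmers(s, k) for s in seqs] for t, seqs in taxa.items()}
    shared = {t: set.intersection(*sets) for t, sets in per_taxon.items()}
    anywhere = {t: set().union(*sets) for t, sets in per_taxon.items()}
    result = {}
    for t in taxa:
        others = set().union(*(anywhere[o] for o in taxa if o != t))
        result[t] = shared[t] - others
    return result


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy = {
        "taxon_A": ["ACGTAC", "TACGTA"],
        "taxon_B": ["GGGCCC", "GGCCCG"],
    }
    for taxon, strings in taxon_specific_strings(toy, 3).items():
        print(taxon, sorted(strings))
```

In practice the choice of K and the requirement that a string occur in every sequence of a taxon would need to be tuned to the variability within the taxon.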


    Acknowledgments
 
The authors would like to thank two anonymous reviewers for their constructive comments, which significantly improved the manuscript. The work described in this article was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CUHK4419/04M).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

Received on January 6, 2006; revised on April 11, 2006; accepted on April 11, 2006

    REFERENCES

    Alexander, C.P. (1920) The crane-flies (Tipulidae, Diptera). Ohio J. Sci., 20, 193–203.

    Arahal, D.R., et al. (1996) Phylogenetic analyses of some extremely halophilic archaea isolated from Dead Sea water, determined on the basis of their 16S rRNA sequences. Appl. Environ. Microbiol., 62, 3779–3786.

    Blaxter, M. (2003) Counting angels with DNA. Nature, 421, 122–123.

    Blaxter, M., et al. (2003) DNA taxonomy of a neglected animal phylum: an unexpected diversity of tardigrades. Proc. Biol. Sci., 271, Suppl. 4, S189–S192.

    Blaxter, M., Floyd, R., Dorris, M., Eyualem, A., De Ley, P. (2004) Utilising the new nematode phylogeny for studies of parasitism and diversity. In Cook, R. and Hunt, D.J. (Eds.), Nematology Monographs and Perspectives. E. J. Brill, Leiden, pp. 615–632.

    Camin, J.H. and Sokal, R.R. (1965) A method for deducing branching sequences in phylogeny. Evolution, 19, 311–326.

    Carroll, R.L. (1988) Vertebrate Paleontology and Evolution. W.H. Freeman, New York.

    Chu, K.H., et al. (2004) Origin and phylogeny of chloroplasts: a simple correlation analysis of complete genomes. Mol. Biol. Evol., 21, 200–206.

    de Bellocq, J.G., et al. (2001) Phylogeny of the Trichostrongylina (Nematoda) inferred from 28S rDNA sequences. Mol. Phylogenet. Evol., 19, 430–442.

    Desper, R. and Gascuel, O. (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol., 9, 687–705.

    Ebedes, J. and Datta, A. (2004) Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics, 20, 1193–1195.

    Eernisse, D.J. and Kluge, A.G. (1993) Taxonomic congruence versus total evidence, and amniote phylogeny inferred from fossils, molecules, and morphology. Mol. Biol. Evol., 10, 1170–1195.

    Felsenstein, J. (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17, 368–376.

    Felsenstein, J. (1989) PHYLIP—phylogeny inference package (Version 3.2). Cladistics, 5, 164–166.

    Floyd, R., et al. (2002) Molecular barcodes for soil nematode identification. Mol. Ecol., 11, 839–850.

    Geiger, D.L. (2002) Stretch coding and block coding: two new strategies to represent questionably aligned DNA sequences. J. Mol. Evol., 54, 191–199.

    Gladstein, D. and Wheeler, W.C. (1996) POY. Program and documentation. American Museum of Natural History.

    Hajibabaei, M., et al. (2005) Critical factors for assembling a high volume of DNA barcodes. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1959–1967.

    Harrison, J.S. (2004) Evolution, biogeography, and the utility of mitochondrial 16S and COI genes in phylogenetic analysis of the crab genus Austinixa (Decapoda: Pinnotheridae). Mol. Phylogenet. Evol., 30, 743–754.

    Hebert, P.D., et al. (2003a) Biological identifications through DNA barcodes. Proc. Biol. Sci., 270, 313–321.

    Hebert, P.D., et al. (2003b) Barcoding animal life: cytochrome c oxidase subunit 1 divergences among closely related species. Proc. Biol. Sci., 270, Suppl. 1, S96–S99.

    Hebert, P.D., et al. (2004a) Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes fulgerator. Proc. Natl. Acad. Sci. USA, 101, 14812–14817.

    Hebert, P.D., et al. (2004b) Identification of birds through DNA barcodes. PLoS Biol., 2, e312.

    Hedges, S.B. (1994) Molecular evidence for the origin of birds. Proc. Natl. Acad. Sci. USA, 91, 2621–2624.

    Hedges, S.B., et al. (1990) Tetrapod phylogeny inferred from 18S and 28S ribosomal RNA sequences and a review of the evidence for amniote relationships. Mol. Biol. Evol., 7, 607–633.

    Heywood, V.H. and Watson, R.T. (1995) Global Biodiversity Assessment. Cambridge University Press, Cambridge.

    Hoot, S.B. (1991) The phylogeny of the Ranunculaceae based on epidermal microcharacters and macromorphology. Syst. Bot., 16, 741–755.

    Hoot, S.B. (1995) Phylogeny of the Ranunculaceae based on preliminary atpB, rbcL and 18S nuclear ribosomal DNA sequence data. Plant Syst. Evol., 9, Suppl., 241–251.

    Huelsenbeck, J.P., et al. (1996) Combining data in phylogenetic analysis. Trends Ecol. Evol., 11, 152–158.

    Jensen, U. (1995) Secondary compounds of the Ranunculiflorae. Plant Syst. Evol., 9, Suppl., 85–97.

    Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111–120.

    Kishino, H. and Hasegawa, M. (1989) Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J. Mol. Evol., 29, 170–179.

    Kullander, S.O. (1998) A phylogeny and classification of the South American Cichlidae (Teleostei: Perciformes). In Malabarba, L.R., Reis, R.E., Vari, R.P., Lucena, Z.M., Lucena, C.A.S. (Eds.), Phylogeny and Classification of Neotropical Fishes. Edipucrs, Porto Alegre, pp. 461–498.

    Kumar, S., et al. (2004) MEGA 3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinform., 5, 150–163.

    Lorenz, J.G., et al. (2005) The problems and promise of DNA barcodes for species diagnosis of primate biomaterials. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1869–1877.

    Lutzoni, F., et al. (2000) Integrating ambiguously aligned regions of DNA sequences in phylogenetic analyses without violating positional homology. Syst. Biol., 49, 628–651.

    Mallet, J. and Willmott, K. (2003) Taxonomy: renaissance or Tower of Babel? Trends Ecol. Evol., 18, 57–59.

    Matz, M.V. and Nielsen, R. (2005) A likelihood ratio test for species membership based on DNA sequence data. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1969–1974.

    Minelli, A. (2003) The status of taxonomic literature. Trends Ecol. Evol., 18, 75–76.

    Noé, L. and Kucherov, G. (2004) Improved hit criteria for DNA local alignment. BMC Bioinformatics, 5, 149–157.

    Qi, J., et al. (2004a) Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach. J. Mol. Evol., 58, 1–11.

    Qi, J., et al. (2004b) CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res., 32, W1–W3.

    Powers, T. (2004) Nematode molecular diagnostics: from bands to barcodes. Annu. Rev. Phytopathol., 42, 367–383.

    Pevzner, P.A. (2000) Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge, MA, p. 75.

    Rickard, A.H., et al. (2002) Phylogenetic relationships and coaggregation ability of freshwater biofilm bacteria. Appl. Environ. Microbiol., 68, 3644–3650.

    Ro, K.E., et al. (1997) Molecular phylogenetic study of the Ranunculaceae: utility of the nuclear 26S ribosomal DNA in inferring intrafamilial relationships. Mol. Phylogenet. Evol., 8, 117–127.

    Rzhetsky, A. and Nei, M. (1992) A simple method for estimating and testing minimum-evolution trees. Mol. Biol. Evol., 9, 945–967.

    Saunders, G.W. (2005) Applying DNA barcoding to red macroalgae: a preliminary appraisal holds promise for future applications. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1879–1888.

    Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406–425.

    Savolainen, V., et al. (2005) Towards writing the encyclopaedia of life: an introduction to DNA barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1805–1811.

    Shimodaira, H. and Hasegawa, M. (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol. Biol. Evol., 16, 1114–1116.

    Shull, V.L., et al. (2001) Sequence alignment of 18S ribosomal RNA and the basal relationships of Adephagan beetles: evidence for monophyly of aquatic families and the placement of Trachypachidae. Syst. Biol., 50, 945–969.

    Sites, J.W. and Crandall, K.A. (1997) Testing species boundaries in biodiversity studies. Conserv. Biol., 11, 1289–1297.

    Stuart, G.W., et al. (2002a) Integrated gene species phylogenies from unaligned whole genome protein sequences. Bioinformatics, 18, 100–108.

    Stuart, G.W., et al. (2002b) A comprehensive vertebrate phylogeny using vector representations of protein sequences from whole genomes. Mol. Biol. Evol., 19, 554–562.

    Summerbell, R.C., et al. (2005) Microcoding: the second step in DNA barcoding. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1897–1903.

    Swofford, D.L. (2000) PAUP*: Phylogenetic Analysis Using Parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, MA.

    Tamura, M. (1993) Ranunculaceae. In Kubitzki, K., Rohwer, J.G., Bittrich, V. (Eds.), The Families and Genera of Vascular Plants: Flowering Plants-Dicotyledons, vol. 2. Springer-Verlag, Berlin, pp. 563–583.

    Tautz, D., et al. (2003) A plea for DNA taxonomy. Trends Ecol. Evol., 18, 70–74.

    Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.

    Tshudy, D., Chu, K.H., Robles, R., Ho, K.C., Chan, T.Y., Felder, D., Ahyong, S. (2005) Phylogeny of the marine clawed lobsters based on mitochondrial rDNA. Paper presented at the Sixth International Crustacean Congress, 17th–22nd July 2005, University of Glasgow, Scotland, UK.

    Vences, M., et al. (2005) Comparative performance of the 16S rRNA gene in DNA barcoding of amphibians. Front. Zool., 2, 5.

    Ward, R.D., et al. (2005) DNA barcoding Australia's fish species. Philos. Trans. R. Soc. Lond. B Biol. Sci., 360, 1847–1857.

    Wheeler, W. (1996) Optimization alignment: the end of multiple sequence alignment in phylogenetics. Cladistics, 12, 1–9.

    Xia, X., et al. (2003) 18S ribosomal RNA and tetrapod phylogeny. Syst. Biol., 52, 283–295.

    Yu, Z.G. and Jiang, P. (2001) Distance, correlation and mutual information among portraits of organisms based on complete genomes. Phys. Lett. A, 286, 34–46.

    Yu, Z.G., et al. (2005) Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from whole genome without sequence alignment. J. Mol. Evol., 60, 538–545.

    Zhang, A.B., et al. (2005) Species status and phylogeography of two closely related Coptolabrus species (Coleoptera: Carabidae) in South Korea inferred from mitochondrial and nuclear gene sequences. Mol. Ecol., 14, 3823–3841.


