
Bioinformatics 2008 24(16):i7-i13; doi:10.1093/bioinformatics/btn276

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Annotation of metagenome short reads using proxygenes

Daniel Dalevi 1, Natalia N. Ivanova 2, Konstantinos Mavromatis 2, Sean D. Hooper 2, Ernest Szeto 1, Philip Hugenholtz 3, Nikos C. Kyrpides 2 and Victor M. Markowitz 1,*

1Biological Data Management and Technology Center, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, 2Genome Biology Program, DOE Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek and 3Microbial Ecology Program, DOE Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA 94598, USA

*To whom correspondence should be addressed.


    ABSTRACT
 

Motivation: A typical metagenome dataset generated using a 454 pyrosequencing platform consists of short reads sampled from the collective genome of a microbial community. The amount of sequence in such datasets is usually insufficient for assembly, and traditional gene prediction cannot be applied to unassembled short reads. As a result, analysis of such datasets usually involves comparisons in terms of relative abundances of various protein families. The latter requires assignment of individual reads to protein families, which is hindered by the fact that short reads contain only a fragment, usually small, of a protein.

Results: We have considered the assignment of pyrosequencing reads to protein families directly using RPS-BLAST against COG and Pfam databases and indirectly via proxygenes that are identified using BLASTx searches against protein sequence databases. Using simulated metagenome datasets as benchmarks, we show that the proxygene method is more accurate than the direct assignment. We introduce a clustering method which significantly reduces the size of a metagenome dataset while maintaining a faithful representation of its functional and taxonomic content.

Contact: vmmarkowitz{at}lbl.gov


    1 INTRODUCTION
 
The ultimate goal of metagenomic studies of a microbial community (microbiome) is to determine the systemic properties including genetics, metabolism, physiology and behavioral aspects of all community members, their interactions with various biotic and abiotic factors, transfer of energy and nutrients, and ecosystem dynamics. In practice, such comprehensive studies are seldom feasible and the scope of metagenomic analysis of most microbial communities is limited to genomic and metabolic reconstruction of the dominant population(s), including identification of key metabolic pathways likely to be present or absent in these populations. For most metagenome projects the amount of sequence data is insufficient for assembly and classification of sequences into different populations, thus preventing even limited population-specific genomic and metabolic reconstruction. In these cases a gene-centric analysis using environmental gene tags (EGTs) is employed (Tringe et al., 2005). In this approach, protein coding sequences (CDSs) are identified in unassembled or partially assembled metagenomic sequences using an ab initio or evidence-based gene finder. These CDSs are further assigned to protein families, such as COGs (Tatusov et al., 1997), Pfams (Bateman et al., 2004) and TIGRfams (Selengut et al., 2007) and comparison of the relative abundance of protein families is performed. Proteins are assigned to families using reverse position-specific BLAST (RPS-BLAST) against position-specific scoring matrices (PSSMs) of COGs in the CDD database (Marchler-Bauer et al., 2002) and enzyme-specific PSSMs in the PRIAM database (Claudel-Renard et al., 2003) and using hmmsearch against hidden Markov models (HMMs) in Pfam and TIGRfam databases. Alternatively, associations of proteins with functional subsystems can be achieved via BLAST searches against databases of annotated proteomes such as SEED (Overbeek et al., 2005).

The quality of annotations for metagenomic sequence data is lower than that of isolate microbial genomes due to a higher rate of sequencing errors and data fragmentation. However, identification of CDSs, their assignment to protein families and enumeration of representatives in metagenomes do not pose a problem even for completely unassembled reads generated by the Sanger sequencing platform, nor do they distort the functional or taxonomic profiles of the datasets. Such profiles may instead be distorted by the biases inherent to Sanger sequencing, which involves cloning of metagenomic DNA into vectors, propagation of the vector within host bacteria and DNA amplification. The extent and the impact of such biases are largely unknown and therefore are difficult to account for in the downstream analysis. These problems and the relatively high cost of Sanger sequencing led to the increasing popularity of another variant of shotgun metagenome sequencing which does not require cloning of environmental DNA and employs the 454 Life Sciences pyrosequencing platform (Edwards et al., 2006). This type of sequencing raises a different challenge for the downstream analysis: the depth of sequence generated by the pyrosequencing platform is usually insufficient for assembly, so the resulting metagenomes consist of individual unassembled reads. Furthermore, unlike Sanger sequencing, which generates individual reads of 600–800 bp, each encoding a full-length protein or a significant portion thereof, pyrosequencing reads are 100–200 bp long and contain only a (usually small) fragment of a protein. As a result, traditional procedures for finding CDSs and assigning them to protein families cannot be applied to such sequences.

For protein family assignment of unassembled and/or short sequences, such as those generated by 454 platforms, two strategies can be envisioned: (1) direct assignment to protein families using translated read sequences for searches against family-specific PSSMs or HMMs or (2) assignment via proxygene which we define as a full-length protein identified by a BLASTx search of read sequences against a protein sequence database and then used as a representative of a read or group of reads.

The perceived disadvantage of direct assignment of 454 reads to protein families is the low sensitivity of assignment in the case of RPS-BLAST, high computational demands in the case of hmmsearch and possible biases introduced by different degrees of sequence conservation within different protein families, which may explain why published metagenome studies followed a proxygene approach (Angly et al., 2006; Edwards et al., 2006; Turnbaugh et al., 2006). These studies provided insufficient details about the methods employed for the selection of proxygenes (e.g. using best BLAST hit or multiple BLAST hits, resolution of functional annotation conflicts if more than one BLAST hit was used, etc.) or the reliability of the protein family assignment based on proxygenes.

In this article we examine the reliability of direct and indirect assignment strategies using simulated metagenomic datasets created from pyrosequencing reads generated for isolate microbial genomes. We show that indirect assignment using proxygenes is more accurate than the direct method using RPS-BLAST. We also introduce a clustering method that reduces significantly the size of the derived datasets while maintaining the accuracy of functional and taxonomic assignments based on proxygenes. The reduction in size allows maintaining a compact yet comprehensive overview of the functional and taxonomic content of a metagenome.


    2 METHODS
 
2.1 Simulated datasets
Reads from 22 genome projects, sequenced at the Joint Genome Institute (JGI) using the 454 GS20 pyrosequencing platform that produces ~100 bp reads, were selected and the genomes were split into three groups based on their phylogeny and the number of reads to ensure similar sizes for the simulated datasets. From each genome project, reads were sampled randomly at four different levels of coverage (0.1X, 1X, 2X and 4X per genome), resulting in a total of 12 simulated datasets (Table 1). The coverage is defined as the average number of times a nucleotide is sampled.
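The relation between coverage and the number of sampled reads can be made concrete with a small sketch. It assumes a single genome of known length, a fixed read length and sampling without replacement; the function names and the seed parameter are illustrative and are not taken from the original pipeline.

```python
import random


def reads_for_coverage(genome_length, read_length, coverage):
    """Number of reads needed so that each nucleotide is sampled
    `coverage` times on average (coverage = reads * read_length / genome_length)."""
    return round(coverage * genome_length / read_length)


def sample_reads(read_ids, genome_length, read_length, coverage, seed=0):
    """Randomly draw reads (without replacement) up to the target coverage.

    If fewer reads are available than required, all reads are returned.
    """
    wanted = reads_for_coverage(genome_length, read_length, coverage)
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    return rng.sample(read_ids, min(wanted, len(read_ids)))
```

For a hypothetical 2 Mb genome and 100 bp reads, 0.1X coverage corresponds to 2000 reads and 4X coverage to 80 000 reads.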


Table 1. Genomes sampled for the simulated metagenome datasets

 
The position of each read on the assembled contigs was identified by BLASTn. Only the best hit of each read, with identity >95%, was kept and used to identify a position of the read with respect to the CDSs predicted on the assembled contigs using the JGI annotation and analysis pipeline. The nucleotide sequences of the genomes, the coordinates of the reference genes and their functional annotation were extracted from version 2.2 of the IMG database (http://img.jgi.doe.gov). At each level of coverage the CDSs overlapping the reads by more than 50 nt comprised the reference gene set; the assignment of a read to a protein family was considered correct if it coincided with the family assignment of the gene from which the read has originated.
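The construction of the reference gene set can be sketched as an interval-overlap test. The sketch assumes that read and CDS positions are closed, 1-based intervals on the same contig; contig identifiers and strand are ignored for brevity, and all names are illustrative.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def reference_genes(read_positions, cds_coords, min_overlap=50):
    """Map each read to the CDSs it overlaps by more than `min_overlap` nt.

    read_positions: read id -> (start, end) on the contig
    cds_coords:     gene id -> (start, end) on the same contig
    Reads without a qualifying CDS are omitted from the result.
    """
    result = {}
    for read_id, (r_start, r_end) in read_positions.items():
        genes = [
            gene
            for gene, (g_start, g_end) in cds_coords.items()
            if overlap_length(r_start, r_end, g_start, g_end) > min_overlap
        ]
        if genes:
            result[read_id] = genes
    return result
```

A read that straddles two genes with only 21 nt and 30 nt of overlap would therefore contribute to neither gene in the reference set.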

2.2 Assignment of 454 reads to protein families
We considered two ways of assigning reads to protein families: (1) direct assignment of the reads using RPS-BLAST against profiles of COGs and Pfams and (2) assignment via a proxygene. For direct assignment, a translated RPS-BLAST search of reads against PSSMs in the CDD database was performed with e-value cutoffs in the range of 10^-1 to 10^-8, retaining the best hit only.
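Retaining only the best hit per read under an e-value cutoff can be sketched as follows. The sketch assumes that hits are available as 12-column tab-separated lines (query, subject, ..., e-value in column 11, bit-score in column 12), which is an assumption about the output format and not something stated in the text; "best" is taken to mean the highest bit-score.

```python
def best_hits(tabular_lines, evalue_cutoff=1e-1):
    """Return read id -> (subject id, e-value, bit-score) for the best hit.

    Hits with an e-value above the cutoff are discarded; among the
    remaining hits of a read, the one with the highest bit-score is kept.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        read_id, subject_id = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > evalue_cutoff:
            continue
        if read_id not in best or bitscore > best[read_id][2]:
            best[read_id] = (subject_id, evalue, bitscore)
    return best
```

The same filter applies whether the subjects are protein family profiles (direct assignment) or full-length proteins (best-hit proxygenes).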

Proxygenes for 454 reads were found by BLASTx of the reads against the protein sequences in the IMG 2.2 database using e-value cutoffs in the range of 10^-1 to 10^-8. Proxygenes were either assigned as the best BLASTx hit of a read (BH) or using a simple clustering method (Fig. 1). For the latter, the set of all reads {x_1,...,x_N} that have at least one hit below the cutoff was clustered using the following algorithm:

  1. Let x = x_1 and i = 1.
  2. Add x to group G_i.
  3. Extract the set of all proteins (A) that x has hits to, and add them to G_i.
  4. For each protein p in A, extract all other reads x_j,...,x_M that have a best hit to p, and add them to G_i.
  5. For each x in {x_j,...,x_M} repeat from Step 2 until no more reads or proteins can be added to G_i.
  6. Let x be the next unassigned read, let i = i+1, and repeat from Step 2.


Fig. 1. Assigning reads to full-length (proxy) genes in a database: (a) each read is assigned to a separate proxygene by best BLASTx hit, so that several reads may be assigned to identical proxygenes; (b) grouping identical proxygenes; (c) proxygene clustering: each read is assigned to a single proxygene.

 
This algorithm results in disjoint clusters (proxy clusters) in which no reads and no genes are members of more than one group. For each protein within the proxy cluster, the cumulative bit-score of its alignment to the reads within the same cluster is calculated. The protein with the highest cumulative bit-score is selected as a representative proxygene of this proxy cluster and is used for all further analyses, such as functional and taxonomic accuracy or determination of the overall functional profiles.
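The grouping and the choice of a representative can be sketched as below. This is not the authors' implementation: for simplicity the sketch links a read and a protein through every retained hit (Step 4 above restricts the expansion to best hits), which amounts to finding connected components with a union-find structure and guarantees disjoint groups. Function and variable names are illustrative.

```python
from collections import defaultdict


def proxy_clusters(hits):
    """Group reads and proteins into disjoint clusters and pick a representative.

    hits: iterable of (read_id, protein_id, bitscore) for hits below the cutoff.
    Returns a list of (representative_protein, reads, proteins) tuples.
    """
    hits = list(hits)
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Assumption: any retained hit links a read and a protein.
    for read_id, protein_id, _ in hits:
        union(("read", read_id), ("protein", protein_id))

    reads = defaultdict(set)
    scores = defaultdict(lambda: defaultdict(float))
    for read_id, protein_id, bitscore in hits:
        root = find(("read", read_id))
        reads[root].add(read_id)
        scores[root][protein_id] += bitscore  # cumulative bit-score per protein

    clusters = []
    for root, protein_scores in scores.items():
        # Highest cumulative bit-score wins; ties go to the first id in sorted order.
        representative = max(sorted(protein_scores), key=protein_scores.get)
        clusters.append((representative, sorted(reads[root]), sorted(protein_scores)))
    return clusters
```

In a cluster where protein pA is hit by one read (bit-score 50) and protein pB by two reads (48 and 60), pB has the higher cumulative bit-score (108) and becomes the representative proxygene.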

Most protein databases seem to be contaminated to some extent with rRNA sequences on which protein-coding genes have been predicted in different frames. Due to the high sequence conservation of rRNA genes, some of these ‘ghost’ proteins are also conserved and even form ‘ghost’ clusters, which may contain proteins with no sequence similarity whatsoever that represent the same parts of rRNA sequences translated in different frames. Therefore, before any protein family assignments of 454 reads were carried out, a filtering step was introduced in which the reads were searched by BLASTn against an RNA database consisting of all rRNAs in IMG 2.2, and reads matching rRNA were removed.


    3 RESULTS
 
454 reads can be associated with protein families by direct assignment or via proxygenes. Direct assignment compares the sequence of the read translated in six frames directly to the sequence profiles of protein families. Assignment via proxygenes is an indirect approach whereby BLASTx against a protein sequence database is used to identify a full-length protein (‘proxygene’) with high sequence similarity to the translated sequence of the 454 read. High sequence similarity between the read and the proxygene is considered as an indication that the read originated from a protein-coding gene which has high overall sequence similarity to the proxygene. Consequently, protein family membership and functional annotation of the read is considered to be the same as that of a full-length protein, which is used as a ‘proxy’ of the 454 read in subsequent metagenome data analysis. Similarly, high sequence similarity between the read and the proxygene implies phylogenetic proximity of the organisms from which the read and the proxygene have originated, so that the full-length protein can be also used as a ‘proxy’ of the read in assessing the taxonomic composition of the metagenome. However, the indirect approach may produce spurious hits to proteins that have little overall sequence similarity to the gene from which the read has originated. Accordingly, we have evaluated the accuracy of protein family assignments of the simulated datasets using direct and proxygene-based methods with several e-value cutoffs and assessed the accuracy of taxonomic assignments using both the BH and proxygene cluster approach.

It should be pointed out that although the simulated datasets used in this study faithfully reproduce some of the features of 454-sequenced metagenomes, such as the frequency and type of sequencing errors or variation (if any) of sequencing coverage, certain problems associated with processing of real metagenomes are hard to reproduce in a simulated environment. The main problem is the absence of a comprehensive collection of reference genomes; as a result only a small fraction of the genes in most metagenomic datasets generated to date are from organisms that have sequenced close relatives, thus limiting the detection of similarities between the short reads and reference genes. However, many of the genomes from which the 454 reads for the simulated datasets were selected belong to such over-sampled taxonomic groups as gamma- and betaproteobacteria (Table 1). In order to account for the potential errors resulting from a biased composition of reference databases and simulate the absence of close relatives of the sampled organisms, we followed the approach of Mavromatis et al. (2007) and excluded all closely related genomes (either the same species or genus as the sampled genomes) from the reference database before carrying out BLASTx searches. The estimated sequence coverage is another unknown variable which may affect the results in the case of real metagenomes sequenced with the 454 platform. For instance, the effect of resampling of a complex microbial community at very low sequence coverage is hard to estimate and it is possible that protein family composition and abundance will vary greatly from sample to sample. Similarly, comparison of completely unrelated microbial communities sampled at different coverage may result in virtually identical protein family abundance profiles. We attempted to address this problem by sampling the genomes included in each dataset at four different levels of coverage (0.1X, 1X, 2X, 4X) as described in Section 2.

3.1 Evaluating accuracy of protein family assignment
In the first step we optimized the settings of RPS-BLAST and BLASTx searches by using e-value cutoffs in the range of 10^-1 to 10^-8 and then estimated the accuracy of read assignments to COGs. The latter was calculated as the ratio of correct COG assignments (i.e. same COG assignment of the proxygene as that of the gene from which the read originated) over the total number of COG assignments. The results of this analysis (Fig. 2) show that the direct assignment has invariably lower accuracy than the proxygene approach, with the exception of very low cutoffs for metagenome dataset M1 where the direct approach performs as well as the assignment via proxygenes (e.g. M1 at 4X in Fig. 2). Most notably, the accuracy of COG assignment via proxygenes varied very little at different e-value cutoffs, with the percent of false assignments never exceeding 10% even at an e-value of 10^-1 (Fig. 2). However, the percentage of reads assigned to COGs depends strongly on the cutoff and increases substantially at higher e-values (Table 2). For example, about 39% of all reads in dataset M3 were assigned to COGs at cutoff 10^-1, while at a more stringent cutoff of 10^-5 used in previous studies (Angly et al., 2006; Edwards et al., 2006; Turnbaugh et al., 2006) only 20% of the reads were assigned to COGs. This result is independent of the coverage, which is expected in the case of random sampling of reads. Since decreasing the e-value cutoff provides little reduction of the rate of false positive assignments while strongly affecting the overall number of reads assigned to COGs, the e-value cutoff 10^-1 was used in further analysis.
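The two quantities reported in Fig. 2, the probability P(A) that a read is assigned to a COG and the probability P(C|A) that an assigned COG is correct, can be computed as in the following sketch. It assumes one COG per read and dictionaries keyed by read id; the names are illustrative.

```python
def assignment_accuracy(true_cog, predicted_cog):
    """Return (P(A), P(C|A)) for a set of reads.

    true_cog:      read id -> COG of the gene the read originated from
    predicted_cog: read id -> COG assigned to the read (unassigned reads absent)
    """
    total = len(true_cog)
    assigned = [read for read in true_cog if read in predicted_cog]
    correct = sum(1 for read in assigned if predicted_cog[read] == true_cog[read])
    p_assigned = len(assigned) / total if total else 0.0
    p_correct_given_assigned = correct / len(assigned) if assigned else 0.0
    return p_assigned, p_correct_given_assigned
```

With four reads of which two are assigned and one of those assignments is correct, the function returns P(A) = 0.5 and P(C|A) = 0.5.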


Fig. 2. Direct annotation of COGs is compared to annotation using BH-proxygenes for three simulated datasets at 4X coverage (Table 1): (a) M1, (b) M2 and (c) M3. We removed all reference genomes that belong to the same species and/or genera as genomes used to create the simulated metagenomes before the BLASTx step. P(A) is the probability that a read is assigned to a COG and P(C|A) is the probability that an assigned COG is correct conditioned on the event that a COG has been assigned.

 

Table 2. Percentage of assigned reads together with the degree of reduction obtained using proxygene clustering as opposed to BH-proxygene

 
In addition to evaluating the e-value cutoffs, the effect of reference database composition was assessed by performing BLASTx searches against the reference database from which either the genomes of the same species or genomes of the same genus as the organisms used in the simulated datasets were removed. The effect of the reference database composition was most pronounced in the case of metagenome dataset M1 where removing all reference genomes belonging to the same genus as the sampled organisms resulted in an error rate twice as high as that observed for the reference database with only same-species genomes removed. Conversely, removing all reference genomes of the same genus from the database had little, if any, effect on the accuracy of assignments for datasets M2 and M3. These results can be explained by the different taxonomic composition of metagenome datasets M2 and M3 as compared to dataset M1 (Table 1): M1 has been sampled mostly from the representatives of Firmicutes, while M2 and M3 are composed almost exclusively of Proteobacteria, a Phylum with more sequenced representatives than all other bacterial phyla combined. Even in the absence of the closest relatives, these genomes provide a comparative context rich enough for highly accurate assignment of reads.

The taxonomic composition of the simulated metagenome datasets also affected the percentage of reads assigned to COGs: at low e-values metagenome datasets M2 and M3 had twice as many reads assigned to COGs as dataset M1, although these differences were less prominent at higher e-values (Table 2). Furthermore, while the accuracy of direct assignments was essentially the same for all datasets at a given e-value, the accuracy of assignment via proxygenes was much higher for metagenome dataset M3 as compared to M1 at the same e-value, sequence coverage, and reference database composition. Note that although reference databases contain significantly fewer genomes of Firmicutes than Proteobacteria, there are many phyla with even fewer sequenced representatives. It is expected that for metagenomes composed of members of such poorly sampled phyla the accuracy of proxygene-based assignment will be even lower. Thus proxygene-based comparisons of metagenomes with vastly different taxonomic composition (e.g. those dominated by Proteobacteria against those composed mostly of planctomycetes or chloroflexi) should be treated with caution, since several-fold differences in the accuracy of protein family assignments may result in gross errors in data interpretation. Our results emphasize the importance of a good reference database for the analysis of 454 data and indicate that although the availability of same-species reference sequences is highly desirable, it may be unnecessary as long as sequences of multiple and diverse representatives of the same Phylum are present in the database.

3.2 Proxygene clustering
While the best BLASTx hit (BH) is the simplest and most direct method of selecting a proxygene, this approach may result in a high level of redundancy: several reads may be associated with the same proxygene (Fig. 1a), therefore there is no need to consider them as separate entities. Moreover, due to the presence of many closely related genomes in the reference databases, the read may have hits of nearly the same strength to several highly similar genes of which only one is chosen as a proxygene. Alternatively, the reads originating from the same gene may become associated with different, but closely related proxygenes (Fig. 1b). In terms of their protein family membership and functional annotation, all such proxygenes are equivalent and should be handled as one entity. Finally, treating each read–proxygene pair separately results in very large datasets, hardly amenable to any manual analysis by biologists and posing serious data management scalability problems.

In order to address these problems, we have developed a simple clustering algorithm for grouping the reads and proxygenes, as illustrated in Figure 1c and described in Section 2. This proxygene clustering provides a significant reduction in the size of the resulting datasets. Figure 3 shows a comparison between the number of proxygenes with and without proxygene clustering: the reduction is about 1.2–1.5 times at 0.1X coverage (BLAST e-value cutoff = 10^-5; removing same-genus genomes), whereas at 4X it is about 7 times for dataset M1, 10 times for M2 and 10 times for M3 (Table 2). This reduction is significant in light of the rapidly increasing number and size of metagenome datasets.


Fig. 3. The number of proxygenes is significantly reduced for all levels of coverage using the clustering approach. The number of proxygenes for the BH approach is shown in black for the three simulated datasets (M1, M2 and M3) at coverage 0.1X, 1X, 2X and 4X. The red lines show the number of proxygenes after clustering.

 
3.3 Taxonomic assignment of reads via proxygenes
In addition to assessing the functional content of various microbial communities, most metagenomic studies attempt to determine and compare the taxonomic composition of the samples. For metagenomic datasets generated with the pyrosequencing platform this question can be addressed by a proxygene-based approach, using the phylogenetic distribution of proxygenes as an estimate of the phylogenetic composition of a sample. Similarly, a proxygene cluster-based approach can be used to estimate the taxonomic composition of a sample, whereby the taxonomic identity of all reads assigned to a proxygene cluster is considered either the same as the representative proxygene (an approach used in this study) or as that of the lowest taxonomic group to which all proxygenes in the proxygene cluster belong.
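The second option mentioned above, assigning a cluster to the lowest taxonomic group shared by all of its proxygenes, can be sketched as a longest-common-prefix computation over lineages. The sketch assumes that each proxygene carries a lineage listed from Domain down to Family; this representation is an assumption made for illustration.

```python
def lowest_common_taxon(lineages):
    """Return the shared prefix of several lineages.

    Each lineage is a sequence of taxon names ordered from the highest
    rank (Domain) to the lowest; the last element of the result is the
    lowest taxonomic group to which all lineages belong.
    """
    if not lineages:
        return []
    common = []
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # lineages diverge at this rank
        common.append(taxa_at_rank[0])
    return common
```

For two proxygenes from Gammaproteobacteria and Betaproteobacteria the result ends at the Phylum (Proteobacteria).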

Using the simulated datasets, we have examined the accuracy of the taxonomic assignment of reads using the proxygene and proxygene cluster approaches. The accuracy of the assignment was measured as the fraction of true positives at different taxonomic levels (Domain, Phylum, Class, Order and Family). As expected, the accuracy of assignment at the Domain and Phylum level is much higher as compared to the level of Order and Family with Domain-level assignments being 100% accurate and the fraction of accurate Family-level assignments varying from 20% to 60% for different metagenomes and different reference databases. The accuracy of taxonomic assignments at the Phylum level reaches more than 90% for datasets M2 and M3, while the accuracy of assignments for M1 is only 60% at the same level. Similar to the accuracy of protein family assignments, this disparity appears to reflect the difference in taxonomic composition of the three simulated datasets, with M1 composed of representatives of less well-sampled phyla than M2 and M3.
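The accuracy measure at each taxonomic level can be sketched as follows, assuming that both the true and the assigned lineage of a read are available as five-element tuples in the order Domain, Phylum, Class, Order, Family. Only reads that received an assignment are counted; names are illustrative.

```python
RANKS = ("Domain", "Phylum", "Class", "Order", "Family")


def rank_accuracy(true_lineage, assigned_lineage):
    """Fraction of assigned reads whose taxon is correct at each rank.

    true_lineage:     read id -> tuple of taxa in RANKS order (source genome)
    assigned_lineage: read id -> tuple of taxa in RANKS order (proxygene)
    """
    shared = [read for read in assigned_lineage if read in true_lineage]
    accuracy = {}
    for index, rank in enumerate(RANKS):
        matches = sum(
            1
            for read in shared
            if assigned_lineage[read][index] == true_lineage[read][index]
        )
        accuracy[rank] = matches / len(shared) if shared else 0.0
    return accuracy
```

A read whose proxygene comes from the right Class but the wrong Order counts as correct at the Domain, Phylum and Class levels and as incorrect at the Order and Family levels.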

At low sequence coverage (0.1X) the proxygenes and proxygene clusters are almost identical since most proxygene clusters contain only one or two reads. However, at higher (4X) coverage the clustering of the reads into proxygene clusters does not decrease the accuracy and in some cases it even improves the assignment, especially at higher taxonomic levels (Phylum, Class), which are most frequently used in the estimation of the taxonomic composition of metagenomic samples. This result indicates that the reads are grouped into essentially consistent taxonomic clusters and the selection of one proxygene as a representative of multiple reads effectively screens out some of the spurious hits that would adversely affect the accuracy of taxonomic assignments.

3.4 Hierarchical clustering against a reference
The relative frequencies of COGs and Pfams are often used to compare metagenomic datasets obtained from different environments and to detect functions that are over- or under-represented (Tringe et al., 2005). Such analyses depend on an unbiased identification of COGs and Pfams and comparable accuracy of their detection in different datasets irrespective of the taxonomic composition, population structure and variation in the sequence coverage of microbial communities. Our results indicate that although the accuracy of protein family assignment is fairly high, it may vary greatly between metagenome datasets depending on their taxonomic composition and sequence coverage. Such variations may influence the results of gene-centric analysis performed on the individual protein families and even on their groupings, such as COG Pathways and Functional Categories. However, it is not clear whether these variations in accuracy could change the overall functional profiles of the metagenomes so severely as to affect the results of their profile-based clustering.

In order to address these questions, we performed hierarchical clustering of the datasets based on the relative frequencies of COGs produced by direct assignment, proxygenes and proxygene clusters (Fig. 4). The placement of absolute references (i.e. COG frequencies of all genes from sampled isolate genomes) and sampled references (i.e. COG frequencies of the sampled genes) for all metagenome datasets show that 454 sequencing indeed has little bias in terms of under- or over-sampling certain genomic regions. Furthermore, there is little difference between the profiles obtained via the proxygene and proxygene cluster approaches indicating that there is little loss or gain of functional information with proxygene clustering. The error introduced by annotation in every case is nevertheless high since none of the metagenome datasets end up in a cluster with the sample and absolute reference. As expected, 4X and 2X sampling references are closer to the absolute references than 1X and especially 0.1X. In addition, there is significant difference between the direct and proxygene-based profiles, since the profiles obtained by direct assignment mostly clustered together.
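A profile-based clustering of this kind can be sketched in plain Python. The distance measure and linkage used for Fig. 4 are not stated in the text; Euclidean distance and average linkage are assumptions chosen here only for illustration, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def relative_frequencies(cog_assignments, cogs):
    """Relative frequency of each COG in `cogs` among the assignments."""
    counts = Counter(cog_assignments)
    total = sum(counts.values())
    return [counts[cog] / total if total else 0.0 for cog in cogs]


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(profiles):
    """Agglomerative clustering of named profiles (assumed: average linkage).

    profiles: name -> frequency vector
    Returns the merges in order, each as a pair of clusters (tuples of names).
    """
    clusters = {(name,): [vector] for name, vector in profiles.items()}

    def distance(a, b):
        pairs = [(u, v) for u in clusters[a] for v in clusters[b]]
        return sum(euclidean(u, v) for u, v in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average pairwise distance.
        a, b = min(combinations(sorted(clusters), 2), key=lambda pair: distance(*pair))
        merges.append((a, b))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges
```

Profiles derived from the same dataset by the BH and cluster approaches would be expected to merge early if, as reported, proxygene clustering loses little functional information.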


Fig. 4. Hierarchical clustering of relative COG frequencies of the three simulated metagenome datasets (M1, M2 and M3) at different levels of coverage. The absolute reference consists of the relative frequencies as they occur in the isolate genomes while the sample reference is the relative frequency of the reads that were sampled. The best BLASTx hit proxygene (BH) and the proxygenes defined by the clustering approach (Cluster) are shown together with the direct annotation using RPS-BLAST (rpsblast).

 
These results indicate that, although metagenome datasets generated by pyrosequencing platforms may indeed represent an unbiased sample of community DNA, significant biases can be introduced by subsequent processing of the data. These biases are mostly due to the skewed composition of reference databases and, depending on the taxonomic composition and population structure of the sample, they may be as difficult to account for as cloning biases of Sanger technology. While the accuracy of protein family assignments was sufficient to separate the three simulated metagenome datasets discussed in this article, a similar separation may not be possible for real environmental samples that may be characterized by large disparity in sequence coverage due to different evenness and abundance of species distribution and considerable variation of the taxonomic composition.


    4 DISCUSSION
 
We have compared methods for the annotation of short reads in metagenome datasets using benchmark datasets that model faithfully the main features of real datasets. While the proxygene-based method is generally more accurate, its efficiency depends on the composition of the reference database. Thus, the metagenome datasets containing representatives of over-sampled phyla were annotated more efficiently. The accuracy of assignments did not increase significantly at lower e-value cutoffs, while selection of less stringent cutoffs (10^-1) allowed assignment of twice as many reads without increasing the rate of false positive assignments. We have also shown that the proxygene clustering has the important advantage of reducing substantially the size of metagenome datasets, while preserving faithfully their functional and taxonomic content.

Despite the increase in the average read length produced by the newer generation of 454 sequencing platforms such as GS FLX (~200 bp reads), it is expected that many metagenome datasets will remain unassembled due to the prohibitively high amount of sequence data necessary to ensure even a modest degree of assembly for all but the simplest microbial communities. Consequently, it is likely that gene-centric analysis will remain the method of choice for the analysis of many metagenomes and therefore the proxygene cluster-based annotation presented in this article has lasting practical significance. We have applied this method to several metagenome datasets from ongoing metagenome studies (such as the PT3 and PT6 datasets listed in Table 2) that have been included in the IMG/M system (Markowitz et al., 2008). As soon as their analysis is completed and published, these datasets will be released as part of IMG/M's public version (http://img.jgi.doe.gov/m).


    ACKNOWLEDGEMENTS
 
Funding: The work presented in this article was supported by the Director, Office of Science, Office of Biological and Environmental Research, Life Sciences Division, US Department of Energy under Contract No. DE-AC02-05CH11231.

Conflict of Interest: none declared.


    REFERENCES
 

    Angly F, et al. The marine viromes of four oceanic regions. PLoS Biol (2006) 4:11.

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.

    Claudel-Renard C, et al. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res (2003) 31:6633–6639.

    Edwards RA, et al. Using pyrosequencing to shed light on deep mine microbial ecology. BMC Genomics (2006) 7:57.

    Marchler-Bauer A, et al. CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Res (2002) 30:281–283.

    Markowitz VM, et al. IMG/M: a data management and analysis system for metagenomes. Nucleic Acids Res (2008) 36:D534–D538.

    Mavromatis K, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods (2007) 4:495–500.

    Overbeek R, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res (2005) 33:5691–5702.

    Selengut JD, et al. TIGRFAMs and genome properties: tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Res (2007) 35:D260–D264.

    Tatusov RL, et al. A genomic perspective on protein families. Science (1997) 278:631–637.

    Tringe SG, et al. Comparative metagenomics of microbial communities. Science (2005) 308:554–557.

    Turnbaugh PJ, et al. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature (2006) 444:1027–1031.

