

Bioinformatics Advance Access originally published online on September 30, 2004
Bioinformatics 2005 21(6):703-710; doi:10.1093/bioinformatics/bti045

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs

Xiangqun H. Zheng †, Fu Lu †,‡, Zhen-Yuan Wang ‡, Fei Zhong §, Jeffrey Hoover and Richard Mural *

Assays and Bioinformatics, Celera Genomics Corporation 45 West Gude Drive, Rockville, MD 20850, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 SYSTEM AND METHODS
 RESULTS
 DISCUSSION
 SUPPLEMENTARY DATA
 REFERENCES
 

Motivation: The identification of orthologous gene pairs is generally based on sequence similarity. Gene pairs that are mutually ‘best hits’ between the genomes being compared are asserted to be orthologs. Although this method identifies most orthologous gene pairs with high confidence, it will miss a fraction of them, especially genes in duplicated gene families. In addition, the approach depends heavily on the completeness and quality of gene annotation. When the gene sequences are not correctly represented the approach is unlikely to find the correct ortholog. To overcome these limitations, we have developed an approach to identify orthologous gene pairs using shared chromosomal synteny and the annotation of protein function.

Results: Assembled mouse and human genomes were used to identify the regions of conserved synteny between these genomes. ‘Syntenic anchors’ are conserved non-repetitive locations between mouse and human genomes. Using these anchors, we identified blocks of sequences that contain consistently ordered anchors between the two genomes (syntenic blocks). The synteny information was then used to help identify orthologous gene pairs between mouse and human genomes. The approach combines the mutual selection of the best tBlastX hits between human and mouse transcripts with the inference of orthologous gene relationships based on shared syntenic anchors, collocation in the same syntenic blocks and shared annotated protein function. Using this approach, we were able to find 19 357 orthologous gene pairs between human and mouse genomes, a 20% increase in the number of orthologs identified by conventional approaches.

Contact: richard.mural@celera.com


    INTRODUCTION
Comparative genomics is the study of evolutionary relationships between the genes and genomes of different organisms. It plays a key role in understanding the functions of genomic elements, including transcribed protein coding and non-protein coding sequences, cis-acting elements that regulate gene expression at either the transcriptional or post-transcriptional levels, and sequence features that affect or control various aspects of chromosome biology, including chromatin structure. Over the past several years, scientists around the world have published sequence maps of a number of important model organisms, including Caenorhabditis elegans (The C. elegans Sequencing Consortium, 1998), Drosophila melanogaster (Adams et al., 2000), Homo sapiens (Lander et al., 2001; Venter et al., 2001), Fugu rubripes (Aparicio et al., 2002), Anopheles gambiae (Holt et al., 2002), Mus musculus (Mural et al., 2002; Waterston et al., 2002), Ciona intestinalis (Dehal et al., 2002) and Rattus norvegicus (Gibbs et al., 2004). These completed genomes have allowed comprehensive genome-wide comparative studies between species (Gibbs et al., 2004; Mural et al., 2002; Zdobnov et al., 2002; Rubin et al., 2000; Wheelan et al., 1999).

We have developed a method to detect evolutionary conservation at the whole genome scale. Genome-level conservation can be viewed at two levels: one is bi-directionally unique homologous sequences between the two genomes (syntenic anchors); the other is conserved synteny, as inferred from large chromosomal segments on which anchors are in a consistently incremental or decremental order (syntenic blocks) (Mural et al., 2002). Syntenic blocks can be considered as large chromosomal segments inherited from a common ancestral chromosome without major chromosomal rearrangements. Syntenic blocks provide a context for identifying appropriate orthologous relationships.

The functions of human disease-related genes are often elucidated by studying their orthologous counterparts in various model organisms (for a review see O'Brien et al., 1999). Orthologs are defined as genes in different species that evolved from a common ancestor (Fitch, 1970). Evolutionary distances are usually inferred from the sequence similarity between two genes, on the assumption that the longer the divergence time, the less sequence similarity the two genes will share. Current methods used to identify orthologs between species are mostly based on sequence similarity (Lee et al., 2002; Remm et al., 2001; Tatusov et al., 1997). While the sequence similarity approach allows us to accurately identify a high confidence set of orthologs, the approach will inevitably miss orthologs within repetitive gene families where gene family members are so similar to each other that none of them satisfies the stringent mutual best hit criteria. The approach will also miss orthologous genes because the mutually best selection imposes a strict one-to-one relationship between orthologous counterparts. Moreover, the sequence similarity-based approach will only work after the entire gene repertoires are available for both species; otherwise it will infer the wrong relationships when incomplete transcriptomes or gene sequences of low quality are used as input data.

To overcome these shortcomings, we have developed a pipeline that uses conserved synteny and protein functions, as well as sequence similarity, to identify orthologous pairs of genes. Although chromosomal segments are rearranged by translocations, inversions and other events during evolution, genes within a conserved syntenic segment in two closely related species are generally in the same order and this information can be used to infer orthologous relationships. To be more specific, the pipeline first identifies a set of orthologs by selecting mutually best tBlastX hits between the mouse and human annotated transcriptomes; additional orthologs are then identified by finding genes which share syntenic anchors in their exons or are at syntenically conserved positions. Since orthologs are derived from a common ancestral gene, orthologous proteins are likely to have similar functions in different organisms. This observation was used as the rationale for assigning and confirming orthologs that might be missed by the previous two approaches, by further checking the assigned function of genes that have conserved positions in syntenic blocks. Recently, gene order conservation around ‘seed’ orthologs has successfully been used to identify additional orthologs in the comparative analysis of C. briggsae and C. elegans (Stein et al., 2003) and is being used as part of an Ensembl pipeline to generate ortholog predictions (Clamp et al., 2003), demonstrating the feasibility of the approach. Our method extends these examples by including syntenic anchors and protein functions to increase the number of orthologs.

Our method has allowed us to identify a more comprehensive set of orthologous gene pairs between the mouse and human genomes than previously published results (Lander et al., 2001). We identified 20% more orthologous gene pairs than by using the mutually best hit approach. Unlike the one-to-one relationship characterized by the mutually best hit selection approach, our method produced gene pairs of a one-to-many or many-to-many relationship. That is to say, one gene in one species may have multiple counterparts in the other species. By definition, gene duplications after speciation would make it possible for one gene in one species to have multiple orthologous genes in another species (Jensen, 2001; Koonin, 2001; Fitch, 2000).


    SYSTEM AND METHODS
Data
We used Celera's assembled human (Celera Genomics, 2002b) and mouse (Celera Genomics, 2002a) genomes, as well as their associated annotation to demonstrate our method of ortholog detection. Each genome was annotated by an automatic gene annotation pipeline, and the resulting putative genes and transcripts were then individually reviewed and modified by expert curators.

Identification of syntenic anchors and syntenic blocks
Syntenic anchors (Mural et al., 2002) are sequences in two genomes that show significant sequence similarity and are bi-directionally unique matches (i.e. they are found in only one location in each genome). They were computed in two steps. The first was an ‘all-against-all’ search of mouse assembly scaffolds against the human assembly scaffolds using BlastN (after masking ubiquitous repeats and low-complexity regions in human sequences). The E-value threshold for this BlastN run was 10⁻⁴. In the second step, this initial set of matches was filtered to eliminate repetitive matches that were missed by the initial repeat masking. A match between two segments, of at least 50 bp, sharing at least 80% identity was retained as a syntenic anchor.
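The filtering step described above can be illustrated with a minimal sketch. It assumes that BlastN matches have already been parsed into records; the `Hit` type, its field names and the `select_anchors` function are hypothetical and are not part of the original pipeline. Uniqueness is approximated here by requiring that a segment identifier occurs only once on each side, and the order of the two filters (cut-offs first, uniqueness second) is an assumption.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    mouse_id: str    # hypothetical identifier of the matched mouse segment
    human_id: str    # hypothetical identifier of the matched human segment
    length: int      # alignment length in bp
    identity: float  # percent identity of the alignment


def select_anchors(hits, min_len=50, min_identity=80.0):
    """Keep matches that pass both cut-offs and are unique in both genomes."""
    passing = [h for h in hits
               if h.length >= min_len and h.identity >= min_identity]
    mouse_counts = Counter(h.mouse_id for h in passing)
    human_counts = Counter(h.human_id for h in passing)
    # A bi-directionally unique match occurs once on each side.
    return [h for h in passing
            if mouse_counts[h.mouse_id] == 1 and human_counts[h.human_id] == 1]
```

In this sketch a segment that matches two different locations is dropped entirely, which mirrors the requirement that an anchor is found in only one location in each genome.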

Syntenic blocks refer to chromosomal segments in two species where syntenic anchors are in consecutive order (Mural et al., 2002). The concept of syntenic block here is similar to the concept of syntenic segment described in the mouse genome sequencing paper by Waterston et al. (2002). Genes within a syntenic block are likely to be orthologous with a preserved gene order (Lane et al., 2001). Syntenic blocks were generated by a two-step process (Fig. 1). In the first step, the syntenic anchors were sorted by their chromosomal position on the reference genome (mouse); the anchors were then grouped by their chromosome assignments on the comparative genome (human). In the second step, individual anchor groups were further divided into more refined groups named syntenic blocks as follows: (1) adjacent anchors were grouped together if the anchors on the human genome were in a consecutive ascending or descending order; a group was discontinued when the order of anchors on the human genome jumped two anchor positions or more; (2) small groups composed of two or fewer anchors or shorter than 100 000 bp on either genome were deleted to remove noise; (3) the remaining anchors were regrouped as in the first step and (4) block size and orientation were determined for both genomes based on the span of the anchors and whether the anchor order was incremental or decremental.
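The block construction steps above can be sketched as follows. This is an illustrative reading of the description, not the original implementation: the `Anchor` type, the function names and the dictionary layout of a block are hypothetical, anchors are reduced to single positions, and each anchor is assumed to occur only once.

```python
from typing import NamedTuple


class Anchor(NamedTuple):
    mouse_chr: str
    mouse_pos: int
    human_chr: str
    human_pos: int


def group_anchors(anchors):
    """Split anchors (walked in mouse order) into runs consecutive on human."""
    ordered = sorted(anchors, key=lambda a: (a.mouse_chr, a.mouse_pos))
    by_human = sorted(ordered, key=lambda a: (a.human_chr, a.human_pos))
    human_rank = {a: i for i, a in enumerate(by_human)}
    groups, current, direction = [], [], 0
    for a in ordered:
        if current:
            prev = current[-1]
            step = human_rank[a] - human_rank[prev]
            same_chroms = (a.mouse_chr == prev.mouse_chr
                           and a.human_chr == prev.human_chr)
            # Extend the run only for a one-position step in a consistent direction.
            if same_chroms and abs(step) == 1 and direction in (0, step):
                current.append(a)
                direction = step
                continue
            groups.append((current, direction))
        current, direction = [a], 0
    if current:
        groups.append((current, direction))
    return groups


def extent(values):
    return min(values), max(values)


def build_blocks(anchors, min_anchors=3, min_span=100_000):
    """Delete small groups, regroup the survivors, report span and orientation."""
    kept = []
    for members, _ in group_anchors(anchors):
        m_lo, m_hi = extent([a.mouse_pos for a in members])
        h_lo, h_hi = extent([a.human_pos for a in members])
        if (len(members) >= min_anchors
                and m_hi - m_lo >= min_span and h_hi - h_lo >= min_span):
            kept.extend(members)
    blocks = []
    for members, direction in group_anchors(kept):
        blocks.append({
            "mouse_chr": members[0].mouse_chr,
            "human_chr": members[0].human_chr,
            "mouse_span": extent([a.mouse_pos for a in members]),
            "human_span": extent([a.human_pos for a in members]),
            "orientation": "-" if direction < 0 else "+",
            "n_anchors": len(members),
        })
    return blocks
```

The defaults `min_anchors=3` and `min_span=100_000` encode the stated rule that groups with two or fewer anchors, or shorter than 100 000 bp on either genome, are removed.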



 
Fig. 1 Construction of syntenic blocks between the human and mouse genomes. The left panel illustrates the grouping of syntenic anchors at the chromosomal level; the right panel illustrates the further grouping of syntenic anchors into syntenic blocks by order within a chromosome group. Groups that have two or fewer anchors or span 100 000 bp or less are excluded.

 
Identification of orthologs
Three complementary methods were used to identify orthologs. The first method used two cycles of tBlastX between the Celera human and mouse transcripts with an E-value cut-off of 10⁻⁴, with the subject and query databases swapped between runs. By comparing E-values, mutually best transcript pairs were selected as orthologs. When E-values were equal, bit score and sequence coverage were used as tie-breakers to select the top hit.
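A minimal sketch of the mutually best selection is given below. It assumes that each tBlastX run has been reduced to tuples of (query, subject, E-value, bit score, coverage); this tuple layout and the function names are hypothetical. The tie-breaking order (E-value, then bit score, then coverage) follows the description above.

```python
def best_hits(hits, max_evalue=1e-4):
    """Map each query to its top subject among hits passing the E-value cut-off."""
    best = {}
    for query, subject, evalue, bits, coverage in hits:
        if evalue > max_evalue:
            continue
        # Lower E-value wins; ties go to higher bit score, then higher coverage.
        key = (evalue, -bits, -coverage)
        if query not in best or key < best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}


def mutual_best_pairs(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's top hit in both runs."""
    h2m = best_hits(human_vs_mouse)
    m2h = best_hits(mouse_vs_human)
    return sorted((h, m) for h, m in h2m.items() if m2h.get(m) == h)
```

Because every query keeps exactly one top subject, the output of this step is necessarily one-to-one, which is the limitation that the synteny-based and function-based methods are meant to relax.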

The second method used conserved synteny to identify additional orthologs missed by the first method. Human and mouse genes were identified as orthologs if they satisfied the following three criteria: (1) they share common syntenic anchor(s) in their exons; (2) they are in the same syntenic block and (3) they share significant sequence similarity. Two sequences were considered to have significant sequence similarity if they were among the mutual top five hits in the tBlastX runs (this criterion also applies to the methods described below). Since syntenic anchors are mutually unique sequences in the two genomes, inevitably, there are regions of a genome where syntenic anchors are under-represented. Genes in such anchor-poor regions cannot be paired up effectively using the shared anchor approach. To overcome this drawback, we took advantage of the conservation of gene order inside syntenic blocks to identify more orthologs. Within a syntenic block, a pair of human and mouse genes flanked by previously identified ortholog pairs are likely to be orthologs. To implement this approach, we first identified a set of most confident or ‘seed’ orthologs. The syntenic blocks were divided into mini-blocks by the ‘seeds’. Within each mini-block, there are usually 4–5 genes on each genome. Each human gene was then paired up with all the mouse genes in the same mini-block individually. Any pair with significant sequence similarity was considered an orthologous pair. The ‘seed’ orthologs were selected from the orthologs identified by the mutually best tBlastX selection using the following criteria: (1) the orthologous counterparts share syntenic anchors in their exons; (2) they are within the same syntenic block and (3) they have consistent order in the two genomes.
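The mini-block pairing can be sketched as follows, under simplifying assumptions that are not stated in the text: the genes of one syntenic block are given as ordered lists, both lists run in the same orientation (a reversed block would be flipped beforehand), and significant similarity is supplied as a precomputed set of (human, mouse) pairs. All names are hypothetical.

```python
def split_by_seeds(genes, seeds):
    """Split an ordered gene list into runs of non-seed genes between seeds."""
    runs, current = [], []
    for gene in genes:
        if gene in seeds:
            runs.append(current)
            current = []
        else:
            current.append(gene)
    runs.append(current)
    return runs


def pair_in_mini_blocks(human_genes, mouse_genes, seed_pairs, similar):
    """Pair each human gene with every similar mouse gene in the same mini-block."""
    human_runs = split_by_seeds(human_genes, {h for h, _ in seed_pairs})
    mouse_runs = split_by_seeds(mouse_genes, {m for _, m in seed_pairs})
    pairs = []
    # Seeds are in consistent order, so the i-th runs delimit the same mini-block.
    for human_run, mouse_run in zip(human_runs, mouse_runs):
        pairs.extend((h, m) for h in human_run for m in mouse_run
                     if (h, m) in similar)
    return pairs
```

Note that a human gene may be paired with several mouse genes in the same mini-block, which is one source of the one-to-many relationships produced by the method.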

Finally, protein functional classifications were used to identify orthologs. Our data suggest that when protein functions are known, over 90% of orthologous genes identified by selecting mutually top hits perform similar biological functions in two organisms. We used the Panther protein classification system (Thomas et al., 2003b, c) to help identify orthologous proteins. If the proteins translated from a human transcript and a mouse transcript belonged to the same Panther subfamily, collocated in the same syntenic block, and shared significant sequence similarity, they were identified as an orthologous pair.

The results generated by the above three methods were consolidated to produce a non-redundant list of transcript pairs and subsequently, transcript pairs were converted into gene pairs. Each pair was annotated with a score that is a measure of the confidence of the assignment. A score is the number of unique evidence types supporting an orthologous pair with all evidence weighted equally.

Pseudogenes were included in the analysis because pseudogenes might have been active in ancestors. For each human–mouse gene pair, there may be multiple transcript pairs associated due to alternative splicing. For the same gene pair, different alternative transcript pairs may have different ortholog scores. We used the best ortholog score of the transcript pairs to represent the gene pair.
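The scoring and the collapse from transcript pairs to gene pairs can be sketched as below. The four evidence labels correspond to the evidence types used in this study (mutually best hit, shared anchors in exons, same syntenic block, same Panther subfamily); the label strings, function names and input dictionaries are hypothetical.

```python
EVIDENCE_TYPES = (
    "mutual_best_hit",
    "shared_anchor",
    "same_block",
    "same_panther_subfamily",
)


def transcript_score(evidence):
    """Count the unique evidence types, all weighted equally."""
    return len(set(evidence) & set(EVIDENCE_TYPES))


def gene_pair_scores(transcript_pairs, transcript_to_gene):
    """Represent each gene pair by the best score among its transcript pairs."""
    scores = {}
    for (human_tx, mouse_tx), evidence in transcript_pairs.items():
        gene_pair = (transcript_to_gene[human_tx], transcript_to_gene[mouse_tx])
        score = transcript_score(evidence)
        scores[gene_pair] = max(scores.get(gene_pair, 0), score)
    return scores
```

With four equally weighted evidence types, a score therefore ranges from 1 to 4 for any pair that was reported by at least one method.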

Evaluate sequence similarity using a global alignment tool NAP
NAP is a program that constructs optimal global and local alignments between a DNA sequence and a protein sequence (Huang and Zhang, 1996). Human proteins and mouse transcripts were used as input to the NAP program, which was run with default parameters. Percentage identity and percentage similarity are the numbers of identical and similar residues, respectively, between the human protein and one of the three open reading frames (ORFs) of the mouse transcript, divided by the human peptide length. They reflect the global identity and similarity of the two sequences. Percentage coverage is the total length of the non-gapped aligned segments divided by the human peptide length.
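The three percentages defined above all share the human peptide length as denominator. A small sketch of the arithmetic follows; it does not call NAP itself, and the function name and argument layout are hypothetical. The counts are assumed to have been extracted from an alignment beforehand.

```python
def alignment_percentages(n_identical, n_similar, segment_lengths,
                          human_peptide_length):
    """Return (identity, similarity, coverage) as percentages of peptide length."""
    if human_peptide_length <= 0:
        raise ValueError("human_peptide_length must be positive")
    identity = 100.0 * n_identical / human_peptide_length
    similarity = 100.0 * n_similar / human_peptide_length
    # Coverage sums the non-gapped aligned segments only.
    coverage = 100.0 * sum(segment_lengths) / human_peptide_length
    return identity, similarity, coverage
```

Because the denominator is the full human peptide length rather than the aligned length, a short but perfect local match yields a low percentage identity, which is why these values reflect global rather than local agreement.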

Mapping of Celera genes to LocusIDs
LocusIDs were used to pull sequences from the ref_seq, GenBank, Swiss-Prot, Unigene and db_EST datasets. These sequences were aligned to Celera's genome assemblies with cut-offs of 97% identity and 20% coverage using either SIM4 or Genewise. The top-scoring alignment for each sequence was collocated with the exon alignments of transcripts from the Celera Annotation Set. Only the best collocation for each sequence was kept. The collocations were then grouped by locusID and Celera gene (hCG or mCG) and outliers were discarded. The remaining collocations were then used to establish the association between locusIDs and Celera genes.


    RESULTS
Conserved synteny at the genomic level
Using the method described above, 444 410 bi-directionally unique, homologous sequences were identified between the human and mouse genome assemblies. These syntenic anchor sequences cover 2.7 and 2.9% of the human and mouse genomes, respectively. The anchor size ranges from 50 to 6778 bp (Table 1), with an average length of 169 bp.


 
Table 1 A summary of syntenic anchors (SA) and syntenic blocks (SB) between Celera human and mouse assemblies

 
It is anticipated that protein coding sequences, regulatory elements and other functionally important sequences are conserved during evolution. Our observations agree well with this expectation (Fig. 2a and b). The density of syntenic anchors in exons is 2249 anchors per megabase of sequence, which is ~12- to 20-fold higher than that in intron and intergenic regions (182 and 111 anchors per megabase, respectively). Although the anchor density is highest in exons, 72% of all anchors are found in regions outside exons. Such conserved elements may represent sequences with biological functions that have not yet been identified, or they may be conserved during evolution merely by chance. Many groups are mining these sequences to identify unknown functional elements in mammalian genomes (Bejerano et al., 2004; Margulies et al., 2003; Thomas et al., 2003a).
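The fold differences quoted above follow directly from the three reported densities, as the short calculation below shows. The helper `anchors_per_megabase` is a hypothetical illustration of how such a density is derived from an anchor count and a region length; the underlying counts and lengths are not given in the text, so only the reported densities are used.

```python
def anchors_per_megabase(n_anchors, region_length_bp):
    """Anchor density expressed per megabase of sequence."""
    return n_anchors / (region_length_bp / 1_000_000)


# Densities reported in the text (anchors per megabase).
exon_density, intron_density, intergenic_density = 2249, 182, 111

fold_vs_intron = exon_density / intron_density          # about 12.4
fold_vs_intergenic = exon_density / intergenic_density  # about 20.3
```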



 
Fig. 2 (a) Distribution of syntenic anchors by genes, exons, introns and intergenic regions. We count syntenic anchors contained by or overlapping by at least 1 bp with a given region (genes, exons, introns and intergenic regions) towards the anchor density in that region. When a syntenic anchor is contained by or overlaps with multiple regions of the same category, we count the anchor only once. For example, we count an anchor only once even if it is contained by overlapping exons of alternatively spliced transcripts. For a given gene, of all overlapping alternative transcripts, only the longest one was used to compute exon length. An anchor was considered as a span rather than a point in our analysis; the same anchor would be counted as anchor-in-intron and anchor-in-exon if the anchor happened to be at an intron/exon boundary. This explains why the anchor density in intron plus exon is greater than that in genes. In this analysis, a gene is from the start of the first exon to the end of the last exon; 5'- and 3'-UTRs are included as well. (b) Syntenic anchor density by genes, exons, introns and intergenic regions.

 
Syntenic anchors can be used to infer chromosomal syntenic relationships between species. A syntenic block is defined as a maximal chromosomal region where anchors are conserved in order and orientation. A heuristic algorithm was implemented as described above to identify syntenic blocks between mouse and human genomes. A total of 638 syntenic blocks were identified with an N50 size of ~10 Mb (Table 1). Over 90% of the human and mouse genomes were included in the syntenic blocks. A majority of syntenic anchors (98%) are in syntenic blocks.
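For readers unfamiliar with the N50 statistic, a minimal sketch of its usual definition is given here: the length L such that blocks of length L or greater together account for at least half of the total block length. The text does not spell out the exact convention used for Table 1, so this definition is an assumption, and the function name is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first block that pushes the running sum to half the total is the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")
```

For example, `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8, because the three longest blocks (10, 9 and 8) already sum to half of the total length of 54.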

Ortholog results
Conserved synteny and protein functions allow us to identify additional orthologs
Currently available techniques for the most part use sequence similarity to identify orthologs. This method generates high-confidence ortholog pairs, but almost certainly underestimates the number of orthologous pairs. By leveraging syntenic information and protein function information, we were able to identify more orthologous gene pairs. Compared to the approach that mutually selects best tBlastX hits as orthologs, our method identified 20% more orthologous gene pairs: 19 357 pairs compared to 16 140 pairs (Table 2). Unlike similarity-based methods, we generated orthologs with a one-to-many or many-to-many relationship, meaning that one mouse gene may have more than one human counterpart and vice versa. Of the 19 357 gene pairs, 14 004 pairs have a one-to-one relationship; the remaining 5353 pairs have a many-to-many or one-to-many relationship. If only unique genes are considered, our method identified 8% more mouse genes and 6% more human genes as having orthologous counterparts than did the mutually best hit selection approach.
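The counts above can be cross-checked with a few lines of arithmetic, and the split into one-to-one and other relationships can be sketched for any list of gene pairs. The function `split_by_relationship` is a hypothetical illustration; only the totals are taken from the text.

```python
from collections import Counter


def split_by_relationship(pairs):
    """Separate one-to-one pairs from one-to-many and many-to-many pairs."""
    human_counts = Counter(h for h, _ in pairs)
    mouse_counts = Counter(m for _, m in pairs)
    one_to_one, multi = [], []
    for h, m in pairs:
        if human_counts[h] == 1 and mouse_counts[m] == 1:
            one_to_one.append((h, m))
        else:
            multi.append((h, m))
    return one_to_one, multi


# Totals reported in the text.
new_pairs, mutual_best_pairs_total = 19_357, 16_140
extra_pairs = new_pairs - mutual_best_pairs_total              # 3217
percent_increase = 100.0 * extra_pairs / mutual_best_pairs_total  # about 19.9
```

The difference of 3217 pairs is the set referred to in the text as missed by the similarity-based approach, and the one-to-one and remaining counts (14 004 and 5353) sum to the reported total.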


 
Table 2 Statistics of human mouse orthologs generated by a mutually best tBlastX selection approach (MB) and our new approach (NEW)

 
Of the 3217 gene pairs that are missed by the similarity-based approach, 419 gene pairs are supported by shared syntenic anchors, shared syntenic blocks and conserved protein functions. We have high confidence in the orthologous relationships of this subset of gene pairs, even though they are missed by similarity-based methods (Fig. 3a).



 
Fig. 3 (a) This view shows that by leveraging conserved synteny and protein function information, an ortholog that was missed by the mutually best selection approach is identified by our new approach. Mouse and human genes are displayed in two adjacent tiers. Genes are labeled with their Celera accessions when space allows (mCG for mouse gene accessions and hCG for human accessions). Orthologs identified by the mutually best selection approach are linked by vertical solid lines; orthologs identified by conserved synteny and/or protein functions are aligned to each other but without a linking line (mCG19309 and hCG29392). Mouse gene mCG19309 and human gene hCG29392 are both thymidine kinase 1; the mouse gene is located on chromosome 11 and the human gene is on chromosome 17. The two genes are located in the same syntenic block, share syntenic anchors in their exons, and the flanking genes are nicely aligned orthologs. However, the mutually best selection did not pair the two genes as an orthologous pair; rather, it paired the 14 067 bp hCG29392 with mCG10091, a much shorter gene (1189 bp, not included in this snapshot) located on mouse chromosome 8. Although mCG10091 is also a thymidine kinase gene, since the human and mouse genes do not belong to the same syntenic block, they are unlikely to be orthologs. The tBlastX E-value between hCG29392 and mCG19309 is 10⁻¹³², slightly worse than the E-value between hCG29392 and mCG10091 (10⁻¹³⁷), which explains why the similarity-based method did not pick hCG29392 and mCG19309 as an ortholog pair. (b) This view shows that some one-to-many ortholog relationships identified by our new approach resulted from imperfect gene annotation (likely a gene merging). In addition to hCG2039732, we identified a smaller human gene, hCG2039731, as an ortholog of mouse gene mCG140082. The smaller human gene is missed by the mutually best selection. Looking at the mouse and human transcript tiers, it seems that the first exon of the longest transcript of mouse mCG140082 probably should have been split from the gene to form an independent gene. We are more confident of human genes since each one of them has been manually curated.

 
In addition, 445 gene pairs that are missed by the mutually best selection are supported by shared anchors in exons and shared syntenic blocks, while the protein functions are either unknown or do not agree with each other. After studying individual cases in depth, we observed that some cases were similar to the case discussed in the legend of Figure 3a, i.e. the mouse and human genes are truly orthologous to each other but mutually best selection missed them because they were not the top hits in both tBlastX runs; others were one-to-many orthologs due to imperfect gene annotation. Figure 3b shows an example. Two mouse genes were merged into one gene by the computational annotation pipeline (mCG140082), while the orthologous genes exist in human as discrete genes (hCG2039731 and hCG2039732). Similarity-only based methods picked the longer human gene as the ortholog, while our method picked up the missing piece and mapped one gene to multiple orthologs in the other species. This shows how gene annotation can be improved using cross-species evidence.

Annotate the gene pairs with evidence
For each potential orthologous gene pair, we provide a confidence score that is composed of four types of evidence: first, whether the human and mouse counterparts are mutually best hits to each other; second, whether they share syntenic anchors in their exons; third, whether they are in the same syntenic block; and finally, whether they perform the same protein function, as inferred by whether or not they belong to the same Panther subfamily. The evidence reflects the features and characteristics of the pairs. It may not directly indicate what proportion of the orthologous pairs was identified by each method. For instance, when identifying orthologs, the concept of mini-blocks was used to pair up genes by shared synteny; however, when it comes to evidence, the syntenic block was used as one of the attributes for each ortholog. The overall scores are summarized in Table 3. Of all gene pairs, 47.4% are supported by all four evidence types; these can be considered the pairs most likely to be true orthologs. A further 30.1% of gene pairs are supported by three types of evidence, 14.4% are supported by two and 8.2% have only one type of associated evidence. Half of the pairs supported by only one type of evidence are the ones that are linked up by shared syntenic chromosomal locations (mini-block). While these gene pairs can be real orthologs, there is a chance that they could be false positives. Manual curation of individual cases will help to resolve these issues.


 
Table 3 A summary of evidence associated with the orthologous gene pairs

 
For 95.3% of the orthologous gene pairs, the human and mouse counterparts are in the same syntenic block. Of the potential ortholog gene pairs, 77.8% shared syntenic anchors in their exons. The remaining 22.2% may not have common anchors in the exons because of the uniqueness criteria: the anchors are bi-directionally unique in both genomes, thus orthologs that are in expanded gene families or duplicated regions would not have anchors associated with them. For 60.6% of orthologs, mouse and human counterpart genes have been assigned the same protein function. The percentage of mouse–human orthologous genes that perform the same function is over 95% when we consider only pairs for which both human and mouse protein functions are known. This observation agrees well with the speculation that during evolution, most orthologous genes perform similar functions in different organisms.

For 56.3% of the 3217 orthologous pairs that were missed by the similarity-based approach, the mouse and human counterparts have the same protein function. The percentage is similar to that of the ortholog set identified purely by the similarity-based approach, indicating that the additional orthologs identified by our approach are of similar quality.

A total of 3999 mouse pseudogenes and 1570 human pseudogenes were included in the analysis. Of the 19 357 orthologs, 321 gene pairs involved either a human pseudogene or a mouse pseudogene; 13 gene pairs involved pseudogenes in both mouse and human. The pseudogenes might have been functional in ancestors and then lost function in the course of evolution in one species, while their orthologs in another species may very well still be transcribed and functional. Of all the pseudogenes, 5% of mouse pseudogenes and 6% of human pseudogenes have orthologs.

Although rare, mutually best selection may pair up the wrong genes as orthologs, especially when the quality of gene annotation is not guaranteed. We found that of the 16 140 gene pairs which are paired up by mutually best selection, 577 gene pairs are supported by neither conserved synteny nor protein functions and thus may or may not be true orthologs. A good fraction of the 577 gene pairs had poor sequence similarity. While for the entire ortholog set, over 58% of the pairs had E-values of zero, only a quarter of the 577 pair set had E-values of zero.

Alignment quality of the ortholog pairs
The sequence similarity of all orthologous gene pairs was evaluated using NAP. While tBlastX identifies the best local alignment between two sequences for all six reading frames, NAP was used to obtain an independent evaluation of the global alignments between the orthologous genes (Fig. 4). On average, the orthologous human and mouse genes share 82% similar residues and 76% identical residues. These numbers are close to previously reported observations (Makalowski and Boguski, 1998; Lander et al., 2001). The average match length is 94% of the human peptide length and 95% of the gene pairs have percentage coverage >50%. For a small fraction of gene pairs, NAP reported poor sequence matches while tBlastX detected reasonable similarity. The discrepancies may be explained as follows: while tBlastX has the flexibility of choosing one ORF out of any of the six, NAP used mouse peptides as input. If a mouse peptide is translated using a different ORF than the one picked by tBlastX, NAP will generate a very different and most probably poor sequence alignment.



 
Fig. 4 Alignment quality of the orthologs as measured by NAP.

 
Human mouse orthologs on mouse chromosome 16
To allow readers to assess the validity of our method, a complete list of human mouse orthologs on mouse chromosome 16 is provided in a Supplementary table with annotated evidence. On mouse chromosome 16, our method identified 606 orthologous gene pairs that involve 565 mouse genes and 540 human genes. Of the 606 pairs, 528 were pairs identified by the mutually best selection approach. Our method added 13% more ortholog pairs than the conventional mutually best approach. In the supplementary table, Celera gene identifiers were associated with LocusIDs using the method described above. A total of 501 mouse genes and 517 human genes were mapped to LocusIDs. Celera transcript sequences were also included in the table.


    DISCUSSION
Comparative genomics helps to determine the function of biologically important genes through the use of cross-species sequence conservation and the identification of orthologous genes. With the availability of whole assembled genomes, we were able to identify cross-species conserved synteny at the whole genome level. Conserved synteny can be described using two features: syntenic anchors and syntenic blocks. In practice, syntenic anchors can be used as landmarks for navigating between different genomes; they can serve as seed sequences to identify longer-range conserved segments between species, which may colocalize with functional elements (Levy et al., 2001); or, as described in this paper, they can help us to infer syntenic relationships (syntenic blocks) between species. Using our method, we identified over 444 000 syntenic anchors, which make up close to 3% of the sequence of each genome. These numbers are lower than those reported in the mouse genome sequencing paper (558 000 anchors and 7.5% of the genome covered) (Waterston et al., 2002). The differences are attributable to the different software used and the different stringencies applied. We used Blastn to identify syntenic anchors; a newer algorithm based on MUMmer (Delcher et al., 1999), which is much faster and requires far fewer computing resources, has been developed at Celera Genomics (C. Mobarry and G. Sutton, personal communication). We intend to use the new algorithm for future whole genome alignments. Other faster and more sensitive algorithms such as Blastz (Schwartz et al., 2003) and BLAT (Kent, 2002) are also suitable for this purpose. Using the anchors, we inferred that over 90% of both the human and mouse genomes are included in syntenic blocks. This agrees well with the published data (Waterston et al., 2002).
The 10% of genome sequences that did not belong to syntenic blocks might be regions with intensive local chromosome rearrangements, or regions of duplication, which are under-represented among syntenic anchors because of the uniqueness criteria.
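To make the relationship between anchors and blocks concrete, the following is a minimal sketch of how anchors might be chained into blocks. It is not the procedure described in SYSTEM AND METHODS: the `Anchor` and `Block` types, the `chain_anchors` function, the `max_gap` and `min_anchors` parameters and all coordinates are hypothetical, and the rule (merge neighbouring anchors that share a chromosome pair and lie within a fixed distance in both genomes) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Anchor:
    """A short, unique human-mouse sequence match (hypothetical representation)."""
    human_chr: str
    human_pos: int
    mouse_chr: str
    mouse_pos: int


@dataclass
class Block:
    """A run of neighbouring anchors that share one chromosome pair."""
    human_chr: str
    mouse_chr: str
    anchors: List[Anchor]


def chain_anchors(anchors, max_gap=1_000_000, min_anchors=2):
    """Greedily merge anchors, walked in human order, into candidate blocks.

    Assumption: an anchor extends the current block if it has the same
    chromosome pair and is within max_gap bases of the previous anchor in
    both genomes. Blocks with fewer than min_anchors anchors are dropped.
    """
    blocks = []
    current = None
    for a in sorted(anchors, key=lambda x: (x.human_chr, x.human_pos)):
        if current is not None:
            last = current.anchors[-1]
            same_chrs = (a.human_chr == current.human_chr
                         and a.mouse_chr == current.mouse_chr)
            close = (a.human_pos - last.human_pos <= max_gap
                     and abs(a.mouse_pos - last.mouse_pos) <= max_gap)
            if same_chrs and close:
                current.anchors.append(a)
                continue
            blocks.append(current)  # close the block that cannot be extended
        current = Block(a.human_chr, a.mouse_chr, [a])
    if current is not None:
        blocks.append(current)
    return [b for b in blocks if len(b.anchors) >= min_anchors]


# Toy coordinates, invented for illustration only.
toy_anchors = [
    Anchor("chr21", 100_000, "chr16", 5_000_000),
    Anchor("chr21", 400_000, "chr16", 5_300_000),
    Anchor("chr21", 900_000, "chr16", 5_800_000),
    Anchor("chr21", 5_000_000, "chr10", 1_000_000),  # isolated anchor
    Anchor("chr3", 200_000, "chr16", 9_000_000),
    Anchor("chr3", 600_000, "chr16", 9_500_000),
]
toy_blocks = chain_anchors(toy_anchors)
for block in toy_blocks:
    print(block.human_chr, block.mouse_chr, len(block.anchors))
```

In this toy input the isolated anchor forms a one-anchor block and is discarded. A greedy pass like this also splits a block whenever a stray anchor interrupts it, so a practical implementation would need to tolerate such interruptions.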

Conserved synteny at the genomic level can also be used to help identify orthologous gene pairs. We have developed an approach to identify orthologous gene pairs that extends the sequence similarity search-based approach. This method leverages syntenic conservation and functional conservation between genomes. It identifies a more comprehensive set of orthologs between two species. Compared to the sequence similarity-based approach, it adds 20% more gene pairs. Overall, the gene pairs identified are strongly supported by evidence: for 95.3% of gene pairs, the orthologous counterparts belong to the same syntenic block; on average, the human and mouse genes share 76% identical residues; and, in cases where both the mouse and human proteins have an annotated function, the protein functions agree between the human and mouse counterparts in 95% of cases.
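The idea of extending mutually best selection with synteny and function evidence can be sketched as follows. This is an illustration under stated assumptions, not the published procedure: the `best_partner` and `call_orthologs` functions, the `block_of` and `function_of` lookups, the gene names and the scores are all hypothetical, and the rescue rule (accept a pair that is not mutually best when both genes lie in the same syntenic block and carry the same function annotation) is a simplification of the criteria given in SYSTEM AND METHODS.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (human_gene, mouse_gene, similarity score)


def best_partner(hits: List[Hit], index: int) -> Dict[str, str]:
    """Top-scoring partner for every gene on one side (0 = human, 1 = mouse)."""
    best: Dict[str, Tuple[str, float]] = {}
    for hit in hits:
        gene, partner, score = hit[index], hit[1 - index], hit[2]
        if gene not in best or score > best[gene][1]:
            best[gene] = (partner, score)
    return {gene: partner for gene, (partner, _) in best.items()}


def call_orthologs(hits, block_of, function_of):
    """Label each accepted pair with the evidence that supports it."""
    human_best = best_partner(hits, 0)
    mouse_best = best_partner(hits, 1)
    pairs = {}
    for h, m, _ in hits:
        if human_best[h] == m and mouse_best[m] == h:
            pairs[(h, m)] = "mutual best hit"
            continue
        # Assumed rescue rule: shared syntenic block AND shared function.
        same_block = (block_of.get(h) is not None
                      and block_of.get(h) == block_of.get(m))
        same_function = (function_of.get(h) is not None
                         and function_of.get(h) == function_of.get(m))
        if same_block and same_function:
            pairs[(h, m)] = "synteny + function"
    return pairs


# Toy data, invented for illustration only.
toy_hits = [
    ("H1", "M1", 900.0),
    ("H2", "M2a", 800.0),
    ("H2", "M2b", 780.0),  # M2b is a duplicate: not H2's best hit
    ("H3", "M9", 300.0),   # M9 prefers H4, and the blocks differ
    ("H4", "M9", 350.0),
]
toy_block_of = {"H2": "B7", "M2a": "B7", "M2b": "B7", "H3": "B1", "M9": "B4"}
toy_function_of = {"H2": "kinase", "M2b": "kinase", "H3": "kinase", "M9": "kinase"}

toy_pairs = call_orthologs(toy_hits, toy_block_of, toy_function_of)
for pair, evidence in sorted(toy_pairs.items()):
    print(pair, evidence)
```

In the toy data, H2 ends up paired with both M2a and M2b, which is the one-to-many outcome that mutually best selection alone cannot produce, while H3 and M9 are rejected because they lie in different blocks.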

While standard sequence similarity-based methods use a mutually best selection criterion that confines the pairings to a one-to-one relationship, our approach allows one-to-many or many-to-many relationships between human and mouse genes. Consider the following: if an ancestral gene A evolved into genes A1 and A2 in two species (by a speciation event), and gene A1 then duplicated after the speciation event to become A1a and A1b, then by definition both A1a and A1b are orthologous to A2 (Jensen, 2001). Therefore, a one-to-many (or many-to-many) relationship does not violate the definition of orthologous genes. In practice, it is difficult to distinguish duplications before speciation from duplications after speciation (with loss of one paralog in one species); while most gene pairs output by our method are true orthologous counterparts, a small fraction could be false positives.
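The A1a/A1b/A2 example can be written out as a small gene tree. The sketch below applies the standard tree-based definition, in which two genes are orthologs if their most recent common ancestor is a speciation node and paralogs if it is a duplication node. The `Node` class and the `path_to` and `relationship` functions are hypothetical helpers written for this illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    event: Optional[str] = None  # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, label: str) -> Optional[List[Node]]:
    """Nodes from the root down to the node named `label`, or None if absent."""
    if root.label == label:
        return [root]
    for child in root.children:
        sub_path = path_to(child, label)
        if sub_path is not None:
            return [root] + sub_path
    return None


def relationship(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two distinct genes by the event at their last common ancestor."""
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    last_common = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        last_common = node_a
    return "orthologs" if last_common.event == "speciation" else "paralogs"


# Ancestral gene A splits by speciation into A1 (species 1) and A2 (species 2);
# A1 then duplicates within species 1 to give A1a and A1b.
tree = Node("A", "speciation", [
    Node("A1", "duplication", [Node("A1a"), Node("A1b")]),
    Node("A2"),
])

print(relationship(tree, "A1a", "A2"))
print(relationship(tree, "A1b", "A2"))
print(relationship(tree, "A1a", "A1b"))
```

Both A1a and A1b trace back to A2 through the speciation node, so both are classified as orthologs of A2, while A1a and A1b are paralogs of each other. The classification depends entirely on the events assigned to the internal nodes, which is where the practical difficulty described above arises.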

Using our method, we found that over 60% of human and mouse genes have recognizable orthologs. It has been speculated that most mouse genes should have a human ortholog as species-specific genes are rare (Mural et al., 2002; Waterston et al., 2002). We manually checked some human and mouse genes that do not have obvious orthologs by our method. When viewed in a syntenic map viewer, none of the genes we checked had obvious counterparts in the syntenic region, indicating that our approach is robust. For genes for which we did not find an ortholog, 22% of the mouse genes were outside the syntenic blocks. We also noticed that 13% of the mouse genes without orthologs were pseudogenes. Of all pseudogenes, we found a small percentage with orthologs (5% mouse and 6% human). We speculate that the pseudogenes for which we could not assign an ortholog may not have good ORFs and hence tBlastX would not find any matches. Because some degree of sequence similarity is the minimal requirement for identifying a pair, such pseudogenes would not be included in our list. Incomplete annotation may be another reason why a fraction of human and mouse genes do not have obvious orthologs.

This paper describes our methods and the results of a comparative analysis of two species. Comparative genomics will be an even more powerful analytical approach when more species, spanning a wide spectrum of evolutionary relationships, are included.


    SUPPLEMENTARY DATA
 
Supplementary data for this paper are available on Bioinformatics online.


    Acknowledgments
 
We would like to thank Dr Graziella Piras for providing a graphic representation of the process to generate syntenic blocks (Fig. 1), and Dr Peter Roberts, Dr Peter Li and Dr Xiaoying Lin for critical readings of the manuscript.


    Footnotes
 
{dagger}The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

{ddagger}Present address: National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.

§Present address: Qiagen, 19300 Germantown Road, Germantown, MD 20874, USA.

Received on May 27, 2004; revised on September 20, 2004; accepted on September 21, 2004

    REFERENCES
 

    Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195.

    Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310.

    Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W.J., Mattick, J.S., Haussler, D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321–1325.

    Celera Genomics. (2002a) Celera Mouse Genome Database flat files release 13, Release Notes.

    Celera Genomics. (2002b) Celera Human Genome Database flat files release 27, Release Notes.

    Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 38–42.

    Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., DiGregorio, A., Gelpke, M., Goodstein, D.M., et al. (2002) The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science, 298, 2157–2167.

    Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzberg, S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 2369–2376.

    Fitch, W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99–113.

    Fitch, W.M. (2000) Homology: a personal view on some of the problems. Trends Genet., 16, 227–231.

    Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493–521.

    Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298, 129–149.

    Huang, X. and Zhang, J. (1996) Methods for comparing a DNA sequence with a protein sequence. Comput. Appl. Biosci., 12, 497–506.

    Jensen, R.A. (2001) Orthologs and paralogs - we need to get it right. Genome Biol., 2, INTERACTIONS1002.

    Kent, W.J. (2002) BLAT - the BLAST-like alignment tool. Genome Res., 12, 656–664.

    Koonin, E.V. (2001) An apology for orthologs - or brave new memes. Genome Biol., 2, COMMENT1005.

    Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921.

    Lane, R.P., Cutforth, T., Young, J., Athanasiou, M., Friedman, C., Rowen, L., Evans, G., Axel, R., Hood, L., Trask, B.J., et al. (2001) Genomic analysis of orthologous mouse and human olfactory receptor loci. Proc. Natl Acad. Sci. USA, 98, 7390–7395.

    Lee, Y., Sultana, R., Pertea, G., Cho, J., Karamycheva, S., Tsai, J., Parvizi, B., Cheung, F., Antonescu, V., White, J., et al. (2002) Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res., 12, 493–502.

    Levy, S., Hannenhalli, S., Workman, C. (2001) Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics, 17, 871–877.

    Makalowski, W. and Boguski, M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl Acad. Sci. USA, 95, 9407–9412.

    Margulies, E.H., Blanchette, M., Haussler, D., Green, E.D. (2003) Identification and characterization of multi-species conserved sequences. Genome Res., 13, 2507–2518.

    Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpern, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science, 296, 1661–1671.

    O'Brien, S.J., Menotti-Raymond, M., Murphy, W.J., Nash, W.G., Wienberg, J., Stanyon, R., Copeland, N.G., Jenkins, N.A., Womack, J.E., Marshall Graves, J.A. (1999) The promise of comparative genomics in mammals. Science, 286, 458–462, 479–481.

    Remm, M., Storm, C.E., Sonnhammer, E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314, 1041–1052.

    Rubin, G.M., Yandell, M.D., Wortman, J.R., Gabor Miklos, G.L., Nelson, C.R., Hariharan, I.K., Fortini, M.E., Li, P.W., Apweiler, R., Fleischmann, W., et al. (2000) Comparative genomics of the eukaryotes. Science, 287, 2204–2215.

    Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., Miller, W. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.

    Stein, L.D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M.R., Chen, N., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., et al. (2003) The Genome Sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol., 1, E45.

    Tatusov, R.L., Koonin, E.V., Lipman, D.J. (1997) A genomic perspective on protein families. Science, 278, 631–637.

    The C. elegans Sequencing Consortium. (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.

    Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. (2003a) Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424, 788–793.

    Thomas, P.D., Campbell, M.J., Kejariwal, A., Mi, H., Karlak, B., Daverman, R., Diemer, K., Muruganujan, A., Narechania, A. (2003b) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res., 13, 2129–2141.

    Thomas, P.D., Kejariwal, A., Campbell, M.J., Mi, H., Diemer, K., Guo, N., Ladunga, I., Ulitsky-Lazareva, B., Muruganujan, A., Rabkin, S., Vandergriff, J.A., Doremieux, O. (2003c) PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res., 31, 334–341.

    Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. (2001) The sequence of the human genome. Science, 291, 1304–1351.

    Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    Wheelan, S.J., Boguski, M.S., Duret, L., Makalowski, W. (1999) Human and nematode orthologs - lessons from the analysis of 1800 human genes and the proteome of Caenorhabditis elegans. Gene, 238, 163–170.

    Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., et al. (2002) Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science, 298, 149–159.




