Bioinformatics Advance Access originally published online on September 30, 2004
Bioinformatics 2005 21(6):703-710; doi:10.1093/bioinformatics/bti045
Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs

Assays and Bioinformatics, Celera Genomics Corporation 45 West Gude Drive, Rockville, MD 20850, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: The identification of orthologous gene pairs is generally based on sequence similarity. Gene pairs that are mutually best hits between the genomes being compared are asserted to be orthologs. Although this method identifies most orthologous gene pairs with high confidence, it will miss a fraction of them, especially genes in duplicated gene families. In addition, the approach depends heavily on the completeness and quality of gene annotation. When the gene sequences are not correctly represented the approach is unlikely to find the correct ortholog. To overcome these limitations, we have developed an approach to identify orthologous gene pairs using shared chromosomal synteny and the annotation of protein function.
Results: Assembled mouse and human genomes were used to identify the regions of conserved synteny between these genomes. Syntenic anchors are conserved non-repetitive locations between mouse and human genomes. Using these anchors, we identified blocks of sequences that contain consistently ordered anchors between the two genomes (syntenic blocks). The synteny information was then used to help identify orthologous gene pairs between mouse and human genomes. The approach combines the mutual selection of the best tBlastX hits between human and mouse transcripts with the inference of orthologous relationships from shared syntenic anchors, collocation in the same syntenic block and shared annotated protein function. Using this approach, we were able to find 19 357 orthologous gene pairs between human and mouse genomes, a 20% increase in the number of orthologs identified by conventional approaches.
Contact: richard.mural{at}celera.com
| INTRODUCTION |
|---|
Comparative genomics is the study of evolutionary relationships between the genes and genomes of different organisms. It plays a key role in understanding the functions of genomic elements, including transcribed protein coding and non-protein coding sequences, cis-acting elements that regulate gene expression at either the transcriptional or post-transcriptional levels, and sequence features that affect or control various aspects of chromosome biology, including chromatin structure. Over the past several years, scientists around the world have published sequence maps of a number of important model organisms, including Caenorhabditis elegans (The C. elegans Sequencing Consortium, 1998), Drosophila melanogaster (Adams et al., 2000), Homo sapiens (Lander et al., 2001; Venter et al., 2001), Fugu rubripes (Aparicio et al., 2002), Anopheles gambiae (Holt et al., 2002), Mus musculus (Mural et al., 2002; Waterston et al., 2002), Ciona intestinalis (Dehal et al., 2002) and Rattus norvegicus (Gibbs et al., 2004). These completed genomes have allowed comprehensive genome-wide comparative studies between species (Gibbs et al., 2004; Mural et al., 2002; Zdobnov et al., 2002; Rubin et al., 2000; Wheelan et al., 1999).
We have developed a method to detect evolutionary conservation at the whole genome scale. Genome-level conservation can be viewed at two levels: the first is bi-directionally unique homologous sequences between the two genomes (syntenic anchors); the second is conserved synteny, inferred from large chromosomal segments on which anchors are in a consistently incremental or decremental order (syntenic blocks) (Mural et al., 2002). Syntenic blocks can be considered as large chromosomal segments inherited from a common ancestral chromosome without major chromosomal rearrangements. Syntenic blocks provide a context for identifying appropriate orthologous relationships.
The functions of human disease-related genes are often elucidated by studying their orthologous counterparts in various model organisms (for a review see O'Brien et al., 1999). Orthologs are defined as genes in different species that evolved from a common ancestor (Fitch, 1970). Evolutionary distances are usually inferred from the sequence similarity between two genes, on the assumption that the longer the divergence time, the less sequence similarity the two genes retain. Current methods used to identify orthologs between species are mostly based on sequence similarity (Lee et al., 2002; Remm et al., 2001; Tatusov et al., 1997). While the sequence similarity approach allows us to accurately identify a high confidence set of orthologs, it will inevitably miss orthologs within repetitive gene families whose members are so similar to each other that none of them satisfies the stringent mutual best hit criterion. The approach will also miss orthologous genes because the mutually best selection imposes a strict one-to-one relationship between orthologous counterparts. Moreover, the sequence similarity-based approach works only when the entire gene repertoires are available for both species; otherwise, it will infer the wrong relationships when incomplete transcriptomes or gene sequences of low quality are used as input data.
To overcome these shortcomings, we have developed a pipeline that uses conserved synteny and protein functions, as well as sequence similarity, to identify orthologous pairs of genes. Although chromosomal segments are rearranged by translocations, inversions and other events during evolution, genes within a conserved syntenic segment in two closely related species are generally in the same order, and this information can be used to infer orthologous relationships. More specifically, the pipeline first identifies a set of orthologs by selecting mutually best tBlastX hits between the mouse and human annotated transcriptomes; additional orthologs are then identified by finding genes that share syntenic anchors in their exons or are at syntenically conserved positions. Since orthologs are derived from a common ancestral gene, orthologous proteins are likely to have similar functions in different organisms. This observation was used as the rationale for assigning and confirming orthologs that might be missed by the previous two approaches, by further checking the assigned function of genes that have conserved positions in syntenic blocks. Recently, gene order conservation around seed orthologs has successfully been used to identify additional orthologs in the comparative analysis of C. briggsae and C. elegans (Stein et al., 2003) and is being used as part of an Ensembl pipeline to generate ortholog predictions (Clamp et al., 2003), demonstrating the feasibility of the approach. Our method extends these examples by including syntenic anchors and protein functions to increase the number of orthologs.
Our method has allowed us to identify a more comprehensive set of orthologous gene pairs between the mouse and human genomes than previously published results (Lander et al., 2001). We identified 20% more orthologous gene pairs than by using the mutually best hit approach. Unlike the one-to-one relationship characterized by the mutually best hit selection approach, our method produced gene pairs of a one-to-many or many-to-many relationship. That is to say, one gene in one species may have multiple counterparts in the other species. By definition, gene duplications after speciation would make it possible for one gene in one species to have multiple orthologous genes in another species (Jensen, 2001; Koonin, 2001; Fitch, 2000).
| SYSTEM AND METHODS |
|---|
Data
We used Celera's assembled human (Celera Genomics, 2002b) and mouse (Celera Genomics, 2002a) genomes, as well as their associated annotation to demonstrate our method of ortholog detection. Each genome was annotated by an automatic gene annotation pipeline, and the resulting putative genes and transcripts were then individually reviewed and modified by expert curators.
Identification of syntenic anchors and syntenic blocks
Syntenic anchors (Mural et al., 2002) are sequences in two genomes that show significant sequence similarity and are bi-directionally unique matches (i.e. they are found in only one location in each genome). They were computed in two steps. The first was an all-against-all search of mouse assembly scaffolds against the human assembly scaffolds using BlastN (after masking ubiquitous repeats and low-complexity regions in human sequences). The E-value threshold for this BlastN run was 10⁻⁴. This initial set of matches was filtered to eliminate repetitive matches that were missed by the initial repeat masking. A match between two segments, of at least 50 bp, sharing at least 80% identity was retained as a syntenic anchor.
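The final filtering step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, its field names and the function `select_anchors` are hypothetical, uniqueness is approximated by requiring that a segment identifier appears in only one passing match, and the order in which the thresholds and the uniqueness test are applied is an assumption. Only the 50 bp and 80% identity thresholds come from the text.

```python
from collections import Counter
from typing import NamedTuple


class Hit(NamedTuple):
    """One BlastN match between a mouse and a human segment (hypothetical record)."""
    mouse_id: str    # identifier of the matched mouse segment
    human_id: str    # identifier of the matched human segment
    length: int      # aligned length in bp
    identity: float  # percent identity, 0-100


def select_anchors(hits, min_length=50, min_identity=80.0):
    """Keep matches that pass the thresholds and are unique in both genomes."""
    # Thresholds from the text: at least 50 bp and at least 80% identity.
    passing = [h for h in hits
               if h.length >= min_length and h.identity >= min_identity]
    # Assumption: a segment is "unique" if it occurs in exactly one passing match.
    mouse_counts = Counter(h.mouse_id for h in passing)
    human_counts = Counter(h.human_id for h in passing)
    return [h for h in passing
            if mouse_counts[h.mouse_id] == 1 and human_counts[h.human_id] == 1]
```

In this sketch a mouse segment that matches two human locations is discarded entirely, which mirrors the bi-directional uniqueness requirement and explains why duplicated regions end up with few anchors.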
Syntenic blocks refer to chromosomal segments in two species where syntenic anchors are in consecutive order (Mural et al., 2002). The concept of syntenic block here is similar to the concept of syntenic segment described in the mouse genome sequencing paper by Waterston et al. (2002). Genes within a syntenic block are likely to be orthologous with a preserved gene order (Lane et al., 2001). Syntenic blocks were generated by a two-step process (Fig. 1). In the first step, the syntenic anchors were sorted by their chromosomal position on the reference genome (mouse); the anchors were then grouped by their chromosome assignments on the comparative genome (human). In the second step, individual anchor groups were further divided into more refined groups named syntenic blocks as follows: (1) adjacent anchors were grouped together if the anchors on the human genome were in a consecutive ascending or descending order; a group was discontinued when the order of anchors on the human genome jumped two anchor positions or more; (2) small groups composed of two or fewer anchors or shorter than 100 000 bp on either genome were deleted to remove noise; (3) the remaining anchors were regrouped as in the first step and (4) block size and orientation were determined for both genomes based on the span of the anchors and whether the anchor order was incremental or decremental.
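The grouping procedure above can be sketched in Python as follows. This is an illustrative reading of the four steps, not the authors' code: the `Anchor` record, the function names and the representation of a block as a dictionary are all hypothetical, and an anchor's "position" on the human genome is taken to be its rank among the anchors on the same human chromosome.

```python
from itertools import groupby
from typing import NamedTuple


class Anchor(NamedTuple):
    """Hypothetical anchor record: one position on each genome."""
    mouse_chr: str
    mouse_pos: int
    human_chr: str
    human_pos: int


def group_once(anchors):
    """Split anchors into runs whose human order is consecutive and monotonic."""
    # Rank every anchor among the anchors on its human chromosome.
    rank = {}
    by_human = sorted(anchors, key=lambda a: (a.human_chr, a.human_pos))
    for _, chrom_anchors in groupby(by_human, key=lambda a: a.human_chr):
        for i, anchor in enumerate(chrom_anchors):
            rank[anchor] = i

    groups = []
    # Walk along the reference (mouse) genome in positional order.
    for a in sorted(anchors, key=lambda a: (a.mouse_chr, a.mouse_pos)):
        if groups:
            current = groups[-1]
            last = current[-1]
            step = rank[a] - rank[last]
            same_chroms = ((a.mouse_chr, a.human_chr)
                           == (last.mouse_chr, last.human_chr))
            # Direction already established by the group, if it has two anchors.
            direction = rank[last] - rank[current[-2]] if len(current) >= 2 else step
            # A jump of two or more rank positions ends the group.
            if same_chroms and abs(step) == 1 and step == direction:
                current.append(a)
                continue
        groups.append([a])
    return groups


def span(group, attr):
    """Distance between the outermost anchors of a group on one genome."""
    values = [getattr(a, attr) for a in group]
    return max(values) - min(values)


def build_blocks(anchors, min_anchors=3, min_span=100_000):
    """Group, drop small groups as noise, regroup, then describe each block."""
    def keep(group):
        return (len(group) >= min_anchors
                and span(group, "mouse_pos") >= min_span
                and span(group, "human_pos") >= min_span)

    survivors = [a for g in group_once(anchors) if keep(g) for a in g]
    blocks = []
    for g in group_once(survivors):
        # "+" when human positions increase along the mouse order, "-" otherwise.
        orientation = "+" if g[-1].human_pos >= g[0].human_pos else "-"
        blocks.append({
            "mouse_chr": g[0].mouse_chr,
            "human_chr": g[0].human_chr,
            "n_anchors": len(g),
            "mouse_span": span(g, "mouse_pos"),
            "human_span": span(g, "human_pos"),
            "orientation": orientation,
        })
    return blocks
```

Regrouping after the noise filter matters in this sketch: a single stray anchor that interrupts an otherwise collinear run splits it on the first pass, and the two halves merge into one block once the stray anchor has been removed.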
Identification of orthologs
Three complementary methods were used to identify orthologs. The first method used two cycles of tBlastX between the Celera human and mouse transcripts with an E-value cut-off of 10⁻⁴, with the subject and query databases swapped between runs. By comparing E-values, mutually best transcript pairs were selected as orthologs. When E-values were equal, bit score and sequence coverage were used as tie-breakers to select the top hit.
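A minimal sketch of mutual best hit selection with the tie-breakers described above is given below. The tuple layout of a hit and the function names are assumptions made for illustration; the text does not say in which order bit score and coverage are applied, so bit score is arbitrarily tried first here.

```python
def best_hits(hits):
    """Map each query to its best subject.

    `hits` is an iterable of (query, subject, evalue, bit_score, coverage).
    The lowest E-value wins; higher bit score, then higher coverage,
    break ties (the order of the two tie-breakers is an assumption).
    """
    best = {}
    for query, subject, evalue, bit_score, coverage in hits:
        key = (evalue, -bit_score, -coverage)
        if query not in best or key < best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}


def mutual_best_pairs(human_vs_mouse, mouse_vs_human):
    """Return (human, mouse) pairs that are each other's best hit."""
    forward = best_hits(human_vs_mouse)   # human query -> best mouse subject
    reverse = best_hits(mouse_vs_human)   # mouse query -> best human subject
    return sorted((h, m) for h, m in forward.items() if reverse.get(m) == h)
```

Because each transcript keeps only a single best partner, the output is necessarily one-to-one; a second family member whose best hit is already claimed by another transcript is dropped, which is the limitation the other two methods are meant to address.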
The second method used conserved synteny to identify additional orthologs missed by the first method. Human and mouse genes were identified as orthologs if they satisfied the following three criteria: (1) they share common syntenic anchor(s) in their exons; (2) they are in the same syntenic block and (3) they share significant sequence similarity. Two sequences were considered to have significant sequence similarity if they were among the mutual top five hits in the tBlastX runs (this criterion also applies to the methods described below). Since syntenic anchors are mutually unique sequences in the two genomes, inevitably, there are regions of a genome where syntenic anchors are under-represented. Genes in such anchor-poor regions cannot be paired up effectively using the shared anchor approach. To overcome this drawback, we took advantage of the conservation of gene order inside syntenic blocks to identify more orthologs. Within a syntenic block, a pair of human and mouse genes flanked by previously identified ortholog pairs are likely to be orthologs. To implement this approach, we first identified a set of most confident, or seed, orthologs. The syntenic blocks were divided into mini-blocks by the seeds. Within each mini-block, there are usually 4–5 genes on each genome. Each human gene was then paired up with all the mouse genes in the same mini-block individually. Any pair with significant sequence similarity was considered an ortholog. The seed orthologs were selected from the orthologs identified by the mutually best tBlastX selection using the following criteria: (1) the orthologous counterparts share syntenic anchors in their exons; (2) they are within the same syntenic block and (3) they have consistent order in the two genomes.
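The mini-block step can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' pipeline: gene lists are assumed to be given in block order with the mouse list already reversed for inverted blocks, seeds are assumed to occur in the same order on both genomes (the third seed criterion), and `similar_pairs` stands in for the set of gene pairs that are among each other's top five tBlastX hits. All function and parameter names are hypothetical.

```python
from itertools import product


def split_by_seeds(genes, seeds):
    """Split an ordered gene list into the intervals between seed genes."""
    intervals = [[]]
    for gene in genes:
        if gene in seeds:
            intervals.append([])   # a seed closes one mini-block and opens the next
        else:
            intervals[-1].append(gene)
    return intervals


def pair_in_miniblocks(human_genes, mouse_genes, seed_pairs, similar_pairs):
    """Pair non-seed genes that lie in the same mini-block and are similar."""
    human_seeds = {h for h, _ in seed_pairs}
    mouse_seeds = {m for _, m in seed_pairs}
    human_minis = split_by_seeds(human_genes, human_seeds)
    mouse_minis = split_by_seeds(mouse_genes, mouse_seeds)
    found = []
    # Consistently ordered seeds give matching interval lists on both genomes.
    for human_mini, mouse_mini in zip(human_minis, mouse_minis):
        for pair in product(human_mini, mouse_mini):
            if pair in similar_pairs:
                found.append(pair)
    return found
```

Restricting the all-against-all pairing to a mini-block keeps the number of candidate pairs small and rejects similar genes that lie elsewhere in the block, at the cost of depending on the quality of the seed set.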
Finally, protein functional classifications were used to identify orthologs. Our data suggest that when protein functions are known, over 90% of orthologous genes identified by selecting mutually top hits perform similar biological functions in two organisms. We used the Panther protein classification system (Thomas et al., 2003b, c) to help identify orthologous proteins. If the proteins translated from a human transcript and a mouse transcript belonged to the same Panther subfamily, collocated in the same syntenic block, and shared significant sequence similarity, they were identified as an orthologous pair.
The results generated by the above three methods were consolidated to produce a non-redundant list of transcript pairs and subsequently, transcript pairs were converted into gene pairs. Each pair was annotated with a score that is a measure of the confidence of the assignment. A score is the number of unique evidence types supporting an orthologous pair with all evidence weighted equally.
Pseudogenes were included in the analysis because pseudogenes might have been active in ancestors. For each human–mouse gene pair, there may be multiple transcript pairs associated due to alternative splicing. For the same gene pair, different alternative transcript pairs may have different ortholog scores. We used the best ortholog score of the transcript pairs to represent the gene pair.
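The scoring rule can be written down in a few lines of Python. The sketch assumes the four evidence types listed later in the Results (mutually best hit, shared anchors in exons, same syntenic block, same Panther subfamily); the string labels, the input layout and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical labels for the four evidence types described in the paper.
EVIDENCE_TYPES = (
    "mutual_best_hit",
    "shared_exon_anchor",
    "same_syntenic_block",
    "same_panther_subfamily",
)


def transcript_score(evidence):
    """Count the distinct, recognized evidence types; all are weighted equally."""
    return len(set(evidence) & set(EVIDENCE_TYPES))


def gene_pair_scores(transcript_pairs):
    """Collapse transcript pairs to gene pairs, keeping the best score.

    `transcript_pairs` is an iterable of (human_gene, mouse_gene, evidence),
    one entry per transcript pair of that gene pair.
    """
    scores = defaultdict(int)
    for human_gene, mouse_gene, evidence in transcript_pairs:
        key = (human_gene, mouse_gene)
        scores[key] = max(scores[key], transcript_score(evidence))
    return dict(scores)
```

Under this rule a gene pair scores between 1 and 4, and alternative transcripts can only raise, never lower, the score reported for their gene pair.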
Evaluate sequence similarity using a global alignment tool NAP
NAP is a program that constructs optimal global and local alignments between a DNA sequence and a protein sequence (Huang and Zhang, 1996). Human proteins and mouse transcripts were used as input to the NAP program, which was run with default parameters. Percentage identity and percentage similarity are the number of identical and similar residues, respectively, between the human protein and one of the three open reading frames (ORFs) of the mouse transcript, divided by the human peptide length. They reflect the global identity and similarity of two sequences. Percentage coverage is the total length of non-gapped aligned segments divided by the human peptide length.
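The three measures reduce to simple ratios over the human peptide length, as the following sketch shows. It does not call or parse NAP; the function name and its inputs (residue counts and a list of ungapped segment lengths taken from an alignment) are assumptions for illustration.

```python
def alignment_metrics(n_identical, n_similar, ungapped_segment_lengths,
                      human_peptide_length):
    """Return percentage identity, similarity and coverage.

    All three values are normalized by the human peptide length, as in the
    definitions above.
    """
    if human_peptide_length <= 0:
        raise ValueError("human_peptide_length must be positive")
    return {
        "percent_identity": 100.0 * n_identical / human_peptide_length,
        "percent_similarity": 100.0 * n_similar / human_peptide_length,
        "percent_coverage": (100.0 * sum(ungapped_segment_lengths)
                             / human_peptide_length),
    }
```

Because the denominator is the full human peptide length rather than the aligned length, a short but perfect local match yields a low percentage identity, which is what makes these values a measure of global rather than local agreement.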
Mapping of Celera genes to LocusIDs
LocusIDs were used to pull sequences from the ref_seq, GenBank, Swiss-Prot, Unigene and db_EST datasets. These sequences were aligned to Celera's genome assemblies with cut-offs of 97% identity and 20% coverage using either SIM4 or Genewise. The top-scoring alignment for each sequence was collocated with the exon alignments of transcripts from the Celera Annotation Set. Only the best collocation for each sequence was kept. The collocations were then grouped by locusID and Celera gene (hCG or mCG) and outliers were discarded. The remaining collocations were then used to establish the association between locusIDs and Celera genes.
| RESULTS |
|---|
Conserved synteny at the genomic level
Using the method described above, 444 410 bi-directionally unique, homologous sequences were identified between the human and mouse genome assemblies. These syntenic anchor sequences cover 2.7 and 2.9% of the human and mouse genomes, respectively. The anchor size ranges from 50 to 6778 bp (Table 1), with an average length of 169 bp.
It is anticipated that protein coding sequences, regulatory elements and other functionally important sequences are conserved during evolution. Our observations agree well with this expectation (Fig. 2a and b). The density of syntenic anchors in exons is 2249 anchors per megabase of sequence, which is 12- to 20-fold higher than that in intron and intergenic regions (182 and 111 anchors per megabase, respectively). Despite the fact that the anchor density is highest in exons, 72% of all anchors are found in regions outside exons. Such conserved elements may represent sequences with biological functions that have not yet been identified, or they may be conserved during evolution merely by chance. Many groups are mining these sequences to identify unknown functional elements in mammalian genomes (Bejerano et al., 2004; Margulies et al., 2003; Thomas et al., 2003a).
Syntenic anchors can be used to infer chromosomal syntenic relationships between species. A syntenic block is defined as a maximal chromosomal region where anchors are conserved in order and orientation. A heuristic algorithm was implemented as described above to identify syntenic blocks between mouse and human genomes. A total of 638 syntenic blocks were identified with an N50 size of 10 Mb (Table 1). Over 90% of the human and mouse genomes were included in the syntenic blocks. A majority of syntenic anchors (98%) are in syntenic blocks.
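For readers unfamiliar with the N50 statistic, the short sketch below computes it in the usual way: the block length L such that blocks of length L or longer together contain at least half of the total block length. The function name is hypothetical, and the text does not state whether block lengths were measured on the mouse or the human genome.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")
```

An N50 of 10 Mb therefore means that at least half of the sequence covered by syntenic blocks lies in blocks of 10 Mb or longer, which is a more robust summary than the mean when a few blocks are very large.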
Ortholog results
Conserved synteny and protein functions allow us to identify additional orthologs
Currently available techniques for the most part use sequence similarity to identify orthologs. This method generates high-confidence ortholog pairs, but almost certainly underestimates the number of orthologous pairs. By leveraging syntenic information and protein function information, we were able to identify more orthologous gene pairs. Compared to the approach that mutually selects best tBlastX hits as orthologs, our method identified 20% more orthologous gene pairs, 19 357 pairs compared to 16 140 pairs (Table 2). Unlike similarity-based methods, we generated orthologs with a one-to-many or many-to-many relationship, meaning that one mouse gene may have more than one human counterpart and vice versa. Of the 19 357 gene pairs, 14 004 pairs have a one-to-one relationship; the remaining 5353 pairs have a many-to-many or one-to-many relationship. If only unique genes are considered, our method identified 8% more mouse genes and 6% more human genes as having orthologous counterparts than did the mutually best hit selection approach.
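One way to sort a list of gene pairs into these relationship classes is to count how many partners each gene has, as in the sketch below. This is an assumed counting scheme for illustration (the text does not describe how the 14 004 and 5353 figures were tallied), and the function name is hypothetical.

```python
from collections import Counter


def classify_pairs(pairs):
    """Label each (human_gene, mouse_gene) pair by relationship type."""
    human_degree = Counter(h for h, _ in pairs)   # partners per human gene
    mouse_degree = Counter(m for _, m in pairs)   # partners per mouse gene
    labels = {}
    for h, m in pairs:
        human_multi = human_degree[h] > 1
        mouse_multi = mouse_degree[m] > 1
        if human_multi and mouse_multi:
            labels[(h, m)] = "many-to-many"
        elif human_multi or mouse_multi:
            labels[(h, m)] = "one-to-many"
        else:
            labels[(h, m)] = "one-to-one"
    return labels
```

The headline figures are consistent with each other: 19 357 − 16 140 = 3217 additional pairs, which is 19.9% of 16 140, and 14 004 + 5353 = 19 357.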
Of the 3217 gene pairs that are missed by the similarity-based approach, 419 gene pairs are supported by shared syntenic anchors, shared syntenic blocks and conserved protein functions. These are gene pairs whose orthologous relationships we regard with high confidence, yet which are missed by similarity-based methods (Fig. 3a).
In addition, 445 gene pairs that are missed by the mutually top selection are supported by shared anchors in exons and shared syntenic blocks, while the protein functions are either unknown or do not agree with each other. After studying individual cases in depth, we observed that some cases were similar to the case discussed in the legend of Figure 3a, i.e. the mouse and human genes are truly orthologous to each other, but mutually best selection missed them because they were not the top hits in both tBlastX runs; others were one-to-many orthologs that arose from imperfect gene annotation. Figure 3b shows an example. Two mouse genes were merged into one gene by the computational annotation pipeline (mCG140082), while the orthologous genes exist in human as discrete genes (hCG2039731 and hCG2039732). Similarity-only methods picked the longer human gene as the ortholog, while our method picked up the missing pieces and mapped one gene to multiple orthologs in the other species. This shows how gene annotation can be improved using cross-species evidence.
Annotate the gene pairs with evidence
For each potential orthologous gene pair, we provide a confidence score that is composed of four types of evidence: first, whether the human and mouse counterparts are mutually best hits to each other; second, whether they share syntenic anchors in their exons; third, whether they are in the same syntenic block; and finally, whether they perform the same protein function, as inferred by whether or not they belong to the same Panther subfamily. The evidence reflects the features and characteristics of the pairs. It may not directly indicate what proportion of the orthologous pairs was identified by each method. For instance, when identifying orthologs, the concept of mini-blocks was used to pair up genes by shared synteny; however, when it comes to evidence, the syntenic block was used as one of the attributes for each ortholog. The overall scores are summarized in Table 3. Of all gene pairs, 47.4% are supported by all four evidence types; these can be considered the pairs most likely to be true orthologs. A further 30.1% of gene pairs are supported by three lines of evidence, 14.4% by two and 8.2% by only one. Half of the pairs supported by only one type of evidence are the ones that are linked up by shared syntenic chromosomal locations (mini-block). While these gene pairs can be real orthologs, there is a chance that they could be false positives. Manual curation of individual cases will help to resolve these issues.
For 95.3% of the orthologous gene pairs, the human and mouse counterparts are in the same syntenic block. Of the potential ortholog gene pairs, 77.8% shared syntenic anchors in their exons. The remaining 22.2% may not have common anchors in the exons because of the uniqueness criterion: the anchors are bi-directionally unique in both genomes, thus orthologs that are in expanded gene families or duplicated regions would not have anchors associated with them. For 60.6% of orthologs, mouse and human counterpart genes have been assigned the same protein function. The percentage of mouse–human orthologous genes that perform the same function is over 95% when we consider only pairs for which both human and mouse protein functions are known. This observation agrees well with the speculation that during evolution, most orthologous genes perform similar functions in different organisms.
Of the 3217 orthologous pairs that were missed by the similarity-based approach, 56.3% have mouse and human counterparts with the same protein function. The percentage is similar to that of the ortholog set identified purely by the similarity-based approach, indicating that the additional orthologs identified by our approach are of similar quality.
A total of 3999 mouse pseudogenes and 1570 human pseudogenes were included in the analysis. Of the 19 357 orthologs, 321 gene pairs involved either a human pseudogene or a mouse pseudogene; 13 gene pairs involved pseudogenes in both mouse and human. A pseudogene might have been functional in an ancestor and lost its function in the course of evolution in one species, while its ortholog in another species may very well still be transcribed and functional. Of all the pseudogenes, 5% of mouse pseudogenes and 6% of human pseudogenes have orthologs.
Although rare, mutually best selection may pair up the wrong genes as orthologs, especially when the quality of gene annotation is not guaranteed. We found that of the 16 140 gene pairs which are paired up by mutually best selection, 577 gene pairs are supported by neither conserved synteny nor protein functions and thus may or may not be true orthologs. A good fraction of the 577 gene pairs had poor sequence similarity. While for the entire ortholog set, over 58% of the pairs had E-values of zero, only a quarter of the 577 pair set had E-values of zero.
Alignment quality of the ortholog pairs
The sequence similarity of all orthologous gene pairs was evaluated using NAP. While tBlastX identifies the best local alignment between two sequences for all six reading frames, NAP was used to obtain an independent evaluation of the global alignments between the orthologous genes (Fig. 4). On average, the orthologous human and mouse genes share 82% of similar residues and 76% of identical residues. These numbers are close to previously reported observations (Makalowski and Boguski, 1998; Lander et al., 2001). The average match length is 94% of the human peptide length and 95% of the gene pairs have percentage coverage >50%. For a small fraction of gene pairs, NAP reported poor sequence matches while tBlastX detected reasonable similarity. The discrepancies may be explained as follows: while tBlastX has the flexibility of choosing one ORF out of any of the six, NAP used mouse peptides as input. If a mouse peptide is translated using a different ORF than the one picked by tBlastX, NAP will generate a very different and most probably a poor sequence alignment.
Human–mouse orthologs on mouse chromosome 16
To allow readers to assess the validity of our method, a complete list of human–mouse orthologs on mouse chromosome 16 is provided in a Supplementary table with annotated evidence. On mouse chromosome 16, our method identified 606 orthologous gene pairs that involve 565 mouse genes and 540 human genes. Of the 606 pairs, 528 were pairs identified by the mutually best selection approach. Our method added 13% more ortholog pairs than the conventional mutually best approach. In the supplementary table, Celera gene identifiers were associated with LocusIDs using the method described above. A total of 501 mouse genes and 517 human genes were mapped to LocusIDs. Celera transcript sequences were also included in the table.
| DISCUSSION |
|---|
Comparative genomics helps to determine the function of biologically important genes through the use of cross-species sequence conservation and the identification of orthologous genes. With the availability of whole assembled genomes, we were able to identify cross-species conserved synteny at the whole genome level. Conserved synteny can be calculated using two features: syntenic anchors and syntenic blocks. Practically, syntenic anchors can be used as landmarks to navigate between different genomes; they can serve as seed sequences to identify longer range conserved segments between species which may collocate with functional elements (Levy et al., 2001); or, as described in this paper, they help us to infer syntenic relationships (syntenic blocks) between species. Using our method, we identified over 444 000 syntenic anchors that make up close to 3% of the sequence of each genome. These numbers are lower than those described in the mouse genome sequencing paper (558 000 anchors and 7.5% of the genome covered) (Waterston et al., 2002). The differences are attributable to the different software we used and the different stringencies we applied. We used BlastN in our process to identify syntenic anchors; a newer algorithm that is much faster and requires much less computer resources has been developed at Celera Genomics based on MUMmer (Delcher et al., 1999) (C. Mobarry and G. Sutton, personal communication). We intend to use the new algorithm for future whole genome alignments. Other faster and more sensitive algorithms such as Blastz (Schwartz et al., 2003) and BLAT (Kent, 2002) are also suitable for this purpose. Using anchors, we inferred that over 90% of both the human and mouse genomes were included in syntenic blocks. This agrees well with the published data (Waterston et al., 2002).
The 10% of genome sequences that did not belong to syntenic blocks might be regions with intensive local chromosome rearrangements or regions of duplication that will be under-represented with respect to syntenic anchors because of the uniqueness criteria.
Conserved synteny at the genomic level can also be used to help identify orthologous gene pairs. We have developed an approach to identify orthologous gene pairs that extends the sequence similarity search-based approach. This method leverages syntenic conservation and functional conservation between genomes. It identifies a more comprehensive set of orthologs between two species. Compared to the sequence similarity-based approach, it adds 20% more gene pairs. Overall, the gene pairs identified are strongly supported by evidence: for 95.3% of gene pairs, the orthologous counterparts belong to the same syntenic block; on average, the human and mouse genes share 76% identical residues; and, in cases where both the mouse and human proteins have an annotated function, the protein functions agree between the human and mouse counterparts in 95% of the cases.
While standard sequence similarity-based methods use mutually best selection criteria that confine the pairings to a one-to-one relationship, our approach allows one-to-many or many-to-many relationships between human and mouse genes. Consider, however, the following: if an ancestral gene A evolved into genes A1 and A2 in two species (by a speciation event), and gene A1 duplicated after the speciation event to become A1a and A1b, then by definition, both genes A1a and A1b are orthologous to A2 (Jensen, 2001). Therefore, a one-to-many (or a many-to-many) relationship does not violate the definition of orthologous genes. In practice, it is difficult to distinguish duplications before speciation from duplications after speciation (with loss of one paralog in one species); while most gene pairs output by our method are true orthologous counterparts, a small fraction could be false positives.
Using our method, we found that over 60% of human and mouse genes have recognizable orthologs. It has been speculated that most mouse genes should have a human ortholog as species-specific genes are rare (Mural et al., 2002; Waterston et al., 2002). We manually checked some human and mouse genes that do not have obvious orthologs by our method. When viewed in a syntenic map viewer, none of the genes we checked had obvious counterparts in the syntenic region, indicating that our approach is robust. For genes for which we did not find an ortholog, 22% of the mouse genes were outside the syntenic blocks. We also noticed that 13% of the mouse genes without orthologs were pseudogenes. Of all pseudogenes, we found a small percentage with orthologs (5% mouse and 6% human). We speculate that the pseudogenes for which we could not assign an ortholog may not have good ORFs and hence tBlastX would not find any matches. Because some degree of sequence similarity is the minimal requirement for identifying a pair, such pseudogenes would not be included in our list. Incomplete annotation may be another reason why a fraction of human and mouse genes do not have obvious orthologs.
This paper describes our methods and the results of a comparative analysis of two species. Comparative genomics will be an even more powerful analytical approach when more species, spanning a wide spectrum of evolutionary relationships, are included.
| SUPPLEMENTARY DATA |
|---|
Supplementary data for this paper are available on Bioinformatics online.
| Acknowledgments |
|---|
We would like to thank Dr Graziella Piras for providing a graphic representation of the process to generate syntenic blocks (Fig. 1); Dr Peter Roberts, Dr Peter Li and Dr Xiaoying Lin for critical readings of the manuscript.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Present address: National Center for Biotechnology Information, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Present address: Qiagen, 19300 Germantown Road, Germantown, MD 20874, USA.
Received on May 27, 2004; revised on September 20, 2004; accepted on September 21, 2004
| REFERENCES |
|---|
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000) The genome sequence of Drosophila melanogaster. Science, 287, 2185–2195
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science, 297, 1301–1310
Bejerano, G., Pheasant, M., Makunin, I., Stephen, S., Kent, W.J., Mattick, J.S., Haussler, D. (2004) Ultraconserved elements in the human genome. Science, 304, 1321–1325
Celera Genomics. (2002a) Celera Mouse Genome Database flat files release 13, Release Notes.
Celera Genomics. (2002b) Celera Human Genome Database flat files release 27, Release Notes.
Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., et al. (2003) Ensembl 2002: accommodating comparative genomics. Nucleic Acids Res., 31, 3842
Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., DiGregorio, A., Gelpke, M., Goodstein, D.M., et al. (2002) The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science, 298, 21572167
Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., Salzberg, S.L. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, 23692376
Fitch, W.M. (1970) Distinguishing homologous from analogous proteins. Syst. Zool., 19, 99113
Fitch, W.M. (2000) Homology a personal view on some of the problems. Trends Genet., 16, 227231[CrossRef][Web of Science][Medline].
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature, 428, 493521[CrossRef][Medline].
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. (2002) The genome sequence of the malaria mosquito Anopheles gambiae. Science, 298, 129149
Huang, X. and Zhang, J. (1996) Methods for comparing a DNA sequence with a protein sequence. Comput. Appl. Biosci., 12, 497506
Jensen, R.A. (2001) Orthologs and paralogswe need to get it right. Genome Biol., 2, INTERACTIONS1002[Medline].
Kent, W.J. (2002) BLATthe BLAST-like alignment tool. Genome Res., 12, 656664
Koonin, E.V. (2001) An apology for orthologsor brave new memes. Genome Biol., 2, COMMENT1005[Medline].
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., Fitz Hugh, W., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860921[CrossRef][Medline].
Lane, R.P., Cutforth, T., Young, J., Athanasiou, M., Friedman, C., Rowen, L., Evans, G., Axel, R., Hood, L., Trask, B.J., et al. (2001) Genomic analysis of orthologous mouse and human olfactory receptor loci. Proc. Natl Acad. Sci. USA, 98, 73907395
Lee, Y., Sultana, R., Pertea, G., Cho, J., Karamycheva, S., Tsai, J., Parvizi, B., Cheung, F., Antonescu, V., White, J., et al. (2002) Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res., 12, 493502
Levy, S., Hannenhalli, S., Workman, C. (2001) Enrichment of regulatory signals in conserved non-coding genomic sequence. Bioinformatics, 17, 871877
Makalowski, W. and Boguski, M.S. (1998) Evolutionary parameters of the transcribed mammalian genome: an analysis of 2,820 orthologous rodent and human sequences. Proc. Natl Acad. Sci. USA, 95, 94079412
Margulies, E.H., Blanchette, M., Haussler, D., Green, E.D. (2003) Identification and characterization of multi-species conserved sequences. Genome Res., 13, 25072518
Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpem, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. (2002) A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science, 296, 16611671
O'Brien, S.J., Menotti-Raymond, M., Murphy, W.J., Nash, W.G., Wienberg, J., Stanyon, R., Copeland, N.G., Jenkins, N.A., Womack, J.E., Marshall Graves, J.A. (1999) The promise of comparative genomics in mammals. Science, 286, 458462 479481
Remm, M., Storm, C.E., Sonnhammer, E.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol., 314, 10411052[CrossRef][Web of Science][Medline].
Rubin, G.M., Yandell, M.D., Wortman, J.R., Gabor Miklos, G.L., Nelson, C.R., Hariharan, I.K., Fortini, M.E., Li, P.W., Apweiler, R., Fleischmann, W., et al. (2000) Comparative genomics of the eukaryotes. Science, 287, 22042215
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., Miller, W. (2003) Humanmouse alignments with BLASTZ. Genome Res., 13, 103107
Stein, L.D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M.R., Chen, N., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., et al. (2003) The Genome Sequence of Caenorhabditis briggsae: a platform for comparative genomics. PLoS Biol., 1, E45[Medline].
Tatusov, R.L., Koonin, E.V., Lipman, D.J. (1997) A genomic perspective on protein families. Science, 278, 631637
The C. elegans Sequencing Consortium. (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 20122018
Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. (2003a) Comparative analyses of multi-species sequences from targeted genomic regions. Nature, 424, 788793[CrossRef][Medline].
Thomas, P.D., Campbell, M.J., Kejariwal, A., Mi, H., Karlak, B., Daverman, R., Diemer, K., Muruganujan, A., Narechania, A. (2003b) PANTHER: a library of protein families and subfamilies indexed by function. Genome Res., 13, 21292141
Thomas, P.D., Kejariwal, A., Campbell, M.J., Mi, H., Diemer, K., Guo, N., Ladunga, I., Ulitsky-Lazareva, B., Muruganujan, A., Rabkin, S., Vandergriff, J.A., Doremieux, O. (2003c) PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification. Nucleic Acids Res., 31, 334341
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. (2001) The sequence of the human genome. Science, 291, 13041351
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520562[CrossRef][Medline].
Wheelan, S.J., Boguski, M.S., Duret, L., Makalowski, W. (1999) Human and nematode orthologslessons from the analysis of 1800 human genes and the proteome of Caenorhabditis elegans. Gene, 238, 163170[CrossRef][Web of Science][Medline].
Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., et al. (2002) Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science, 298, 149159