GOSemSim: an R package for measuring semantic similarity among GO terms and gene products

Abstract

Summary: The semantic comparisons of Gene Ontology (GO) annotations provide quantitative ways to compute similarities between genes and gene groups, and have become an important basis for many bioinformatics analysis approaches. GOSemSim is an R package for semantic similarity computation among GO terms, sets of GO terms, gene products and gene clusters. Four information content (IC)-based methods and one graph-based method are implemented in the GOSemSim package, and multiple species, including human, rat, mouse, fly and yeast, are supported. The functions provided by GOSemSim offer flexibility for applications, and can be easily integrated into high-throughput analysis pipelines.

Availability: GOSemSim is released under the GNU General Public License within the Bioconductor project, and is freely available at http://bioconductor.org/packages/2.6/bioc/html/GOSemSim.html

Contact: boxc@bmi.ac.cn; sqwang@bmi.ac.cn

Supplementary information: Supplementary information is available at Bioinformatics online.

1 INTRODUCTION

The Gene Ontology (GO) is becoming the de facto standard for the annotation of gene products. The GO consortium annotates gene products with terms from three orthogonal ontologies organized as directed acyclic graphs, laying the foundation for quantitative semantic comparisons. Several metrics have been proposed to measure the semantic similarity between GO annotations, and have been verified in terms of their correlations with sequence similarity (Lord et al., 2003), protein–protein interactions (Xu et al., 2008) and gene expression profiles (Sevilla et al., 2005). GO semantic similarity provides the basis for functional comparison of gene products, and has therefore been widely used in bioinformatics applications, such as protein sub-nuclear localization prediction (Lei and Dai, 2006), gene function prediction (Tao et al., 2007) and cluster analysis of genes (Bolshakova et al., 2005; Wolting et al., 2006).

Two typical approaches to measuring the semantic similarity of GO terms are information content (IC)-based and graph-based measures. The IC-based measures depend on the frequencies of the two GO terms involved and that of their closest common ancestor term in a specific corpus of GO annotations, such as the UniProt Knowledgebase. Three IC-based measures, Resnik's (Philip, 1999), Lin's (Lin, 1998) and Jiang and Conrath's (Jiang and Conrath, 1997), were first brought over from natural language taxonomies by Lord et al. (2003) to compare gene products. On the basis of Resnik's and Lin's definitions, a further IC-based measure was presented by Schlicker et al. (2006). Considering that the specificity of a GO term is usually determined by its location in the GO graph, Wang et al. (2007) proposed a graph-based strategy to compute semantic similarity using the topology of the GO graph structure. In Wang's method, the semantics of GO terms are encoded into a numeric format and the different semantic contributions of the distinct relations are considered.
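The four IC-based measures can be illustrated with a minimal Python sketch. It is not GOSemSim code: the function names and the toy probabilities are hypothetical, IC is taken as the negative natural logarithm of a term's annotation frequency, and `ic_anc` stands for the IC of the common ancestor used in the comparison. The conversion of the Jiang and Conrath distance into a similarity shown here (1 minus the distance, capped at 1) is one common choice and is an assumption, since the text above does not specify it.

```python
import math

def resnik(ic_anc):
    """Resnik: the IC of the common ancestor itself (unbounded above)."""
    return ic_anc

def lin(ic1, ic2, ic_anc):
    """Lin: ancestor IC relative to the IC of the two terms."""
    denom = ic1 + ic2
    return 2.0 * ic_anc / denom if denom > 0 else 0.0

def jiang(ic1, ic2, ic_anc):
    """Jiang and Conrath distance, turned into a similarity.

    The transform 1 - min(1, distance) is an assumption of this sketch.
    """
    distance = ic1 + ic2 - 2.0 * ic_anc
    return 1.0 - min(1.0, distance)

def rel(ic1, ic2, ic_anc):
    """Schlicker's relevance: Lin's value weighted by 1 - p(ancestor)."""
    p_anc = math.exp(-ic_anc)
    return lin(ic1, ic2, ic_anc) * (1.0 - p_anc)

# Hypothetical annotation frequencies for two terms and their ancestor.
ic1, ic2, ic_anc = -math.log(0.1), -math.log(0.05), -math.log(0.2)
for name, value in [
    ("Resnik", resnik(ic_anc)),
    ("Lin", lin(ic1, ic2, ic_anc)),
    ("Jiang", jiang(ic1, ic2, ic_anc)),
    ("Rel", rel(ic1, ic2, ic_anc)),
]:
    print(f"{name}: {value:.3f}")
```

The sketch makes one difference between the measures visible: Resnik's value depends only on the ancestor, whereas the other three also take the specificity of the compared terms into account.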

Several online tools for semantic similarity measurement of gene products are available at present, such as G-SESAME (Wang et al., 2007) and FuSSiMeG (http://xldb.fc.ul.pt/rebil/ssm/). To facilitate large-scale analysis, two freely available software packages, GOGraph (Lord et al., 2003) and GOSim (Frohlich et al., 2007), which implement classic IC-based methods for semantic comparison of GO terms, have also been developed. Here, we present an R package named GOSemSim to compute semantic similarity among GO terms, sets of GO terms, gene products and gene clusters, providing both IC-based and graph-based methods.

2 IMPLEMENTATION

GOSemSim is developed as a package for the statistical computing environment R and is released under the GNU General Public License within the Bioconductor project (Gentleman et al., 2004). GOSemSim depends on the annotation data package GO.db provided by Bioconductor to obtain the ancestors of GO terms and their relations. The information content is species specific and is calculated from the Bioconductor annotation packages org.Hs.eg.db, org.Rn.eg.db, org.Mm.eg.db, org.Dm.eg.db and org.Sc.sgd.db for human, rat, mouse, fly and yeast, respectively.
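How information content can be derived from an annotation corpus is sketched below in Python. This is an illustration under stated assumptions, not the package's implementation: the toy ontology, the gene names and the helper functions are hypothetical, each direct annotation is counted once and propagated to all ancestors of the annotated term, and the natural logarithm is used.

```python
import math

# Hypothetical toy ontology: term -> list of parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}

# Hypothetical toy corpus: gene -> directly annotated terms.
ANNOTATIONS = {
    "gene1": ["C"],
    "gene2": ["A"],
    "gene3": ["B"],
    "gene4": ["C"],
}

def ancestors(term, parents):
    """Return the term itself together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def information_content(annotations, parents):
    """IC(t) = -log(p(t)), where p(t) is the share of annotations
    made to t or to any of its descendants."""
    counts = {term: 0 for term in parents}
    total = 0
    for terms in annotations.values():
        for term in terms:
            total += 1
            # An annotation to a term also counts for every ancestor.
            for t in ancestors(term, parents):
                counts[t] += 1
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}

ic = information_content(ANNOTATIONS, PARENTS)
for term in sorted(ic, key=ic.get):
    print(term, round(ic[term], 3))
```

Because the counts come from the corpus, the same GO term can receive a different IC in each species, which is why the values are species specific.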

Considering that existing approaches perform differently under different circumstances (Pesquita et al., 2009), the four IC-based measures (Resnik's, Lin's, Jiang and Conrath's and Schlicker's) and the one graph-based measure (Wang's) mentioned above are integrated in GOSemSim, and can be selected by setting the ‘method’ parameter of the package functions to ‘Resnik’, ‘Lin’, ‘Jiang’, ‘Rel’ and ‘Wang’, respectively. Resnik's, Lin's and Jiang and Conrath's algorithms are the most common semantic similarity measures used with GO (Pesquita et al., 2009). Several assessments have shown that Resnik's measure has consistently high correlation with sequence similarity (Lord et al., 2003; Mistry and Pavlidis, 2008; Pesquita et al., 2008) and gene co-expression (Sevilla et al., 2005). Using a best-match average combination strategy, Pesquita et al. found that Jiang and Conrath's measure had the highest correlation with sequence similarity (Pesquita et al., 2009). Schlicker's measure has been found to perform better than Resnik's measure in distinguishing orthologous gene products from gene products with other levels of sequence similarity (Schlicker et al., 2006). Wang's measure has also been shown to produce more accurate results than Resnik's measure in clustering gene pairs according to their semantic similarity (Wang et al., 2007). Details of the semantic similarity measure algorithms used in GOSemSim can be found in the user's manual (Supplementary Material 1). The ontology used in the measurement can be restricted by setting the corresponding parameter to ‘BP’ (biological process), ‘MF’ (molecular function) or ‘CC’ (cellular component).
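For the graph-based measure, the idea described by Wang et al. (2007) can be sketched in Python as follows. The toy graph and function names are hypothetical, and the relation weights (0.8 for is_a, 0.6 for part_of) are the values reported in that paper, treated here as assumptions rather than as settings confirmed for GOSemSim. Each term in the ancestor graph of a GO term receives an S-value, namely the largest product of relation weights along any path down to that term, and two terms are compared through the S-values of their shared ancestors.

```python
# Hypothetical toy GO fragment: term -> list of (parent, relation).
PARENTS = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("root", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
    "D": [("A", "is_a")],
}

# Semantic contribution of each relation type (assumed values).
WEIGHTS = {"is_a": 0.8, "part_of": 0.6}

def s_values(term, parents, weights):
    """S-value of every term in the ancestor graph of `term`."""
    s = {term: 1.0}
    queue = [term]
    while queue:
        child = queue.pop()
        for parent, relation in parents[child]:
            candidate = weights[relation] * s[child]
            # Keep the best contribution; revisit the parent if it improved.
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                queue.append(parent)
    return s

def wang_similarity(a, b, parents, weights):
    """Shared S-values divided by the total semantic value of both terms."""
    sa = s_values(a, parents, weights)
    sb = s_values(b, parents, weights)
    shared = set(sa) & set(sb)
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))

print(round(wang_similarity("C", "D", PARENTS, WEIGHTS), 3))
```

Unlike the IC-based measures, this calculation needs no annotation corpus: it depends only on the structure of the graph and the relation weights.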

3 FUNCTIONS AND EXAMPLES

Six functions are provided by the GOSemSim package. The functions goSim, mgoSim, geneSim and clusterSim compute the semantic similarity among GO terms, sets of GO terms, GO descriptions of gene products and GO descriptions of gene clusters, respectively. The functions mgeneSim and mclusterSim are designed to calculate the similarity score matrix of a set of genes and of a set of gene clusters, respectively.

The basic function goSim returns a value between 0 and 1; the higher the value, the greater the similarity between the two GO terms.

The function mgoSim is designed to compute the similarity of two lists of GO terms.
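One way to combine term-level scores into a score for two lists of GO terms is a best-match average, the combination strategy mentioned in Section 2. The Python sketch below is an assumption about how such a combination can work, not a description of mgoSim's internals; `best_match_average`, `toy_term_sim` and the toy scores are hypothetical names and values.

```python
def best_match_average(terms1, terms2, term_sim):
    """Average, over all terms of both lists, of each term's best
    match in the other list."""
    if not terms1 or not terms2:
        return 0.0
    row_best = [max(term_sim(a, b) for b in terms2) for a in terms1]
    col_best = [max(term_sim(a, b) for a in terms1) for b in terms2]
    return (sum(row_best) + sum(col_best)) / (len(terms1) + len(terms2))

# Hypothetical pairwise term similarities (symmetric, values in [0, 1]).
TOY_SIM = {
    frozenset(("t1", "t3")): 0.5,
    frozenset(("t2", "t3")): 0.9,
}

def toy_term_sim(a, b):
    """Look up a toy similarity; identical terms score 1, unknown pairs 0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)

print(round(best_match_average(["t1", "t2"], ["t3"], toy_term_sim), 3))
```

Any term-level measure with values in the range 0 to 1 can be passed as `term_sim`, so the combination step is independent of the choice among the five measures.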

By mapping gene products to GO annotations, the functions geneSim, mgeneSim, clusterSim and mclusterSim can be used to measure the semantic similarity among gene products. Gene IDs and species are needed for the measurements. For human, rat, mouse and fly, the Gene IDs refer to Entrez Gene IDs, while for yeast, the Gene IDs refer to ORF identifiers from the Saccharomyces Genome Database (SGD).

The functions mgeneSim, clusterSim and mclusterSim are especially designed for large-scale analysis. Suppose we have a group of genes (here, we use a random sample set of Affymetrix IDs as an example) and want to cluster the genes based on their functions. First, we call the function mgeneSim to compute the pairwise GO semantic similarities of these genes.

Then, we can use the hierarchical clustering function hclust of the stats package to cluster these gene products based on the semantic similarities of their GO annotations. After cutting the cluster tree into discrete cluster groups, we can use the clusterSim and mclusterSim functions to measure the similarities among gene clusters.
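The clustering step of this workflow can be illustrated with a small Python sketch. It does not reproduce hclust: it is a plain average-linkage agglomerative procedure, chosen here for brevity, that merges the two most similar clusters until a requested number of groups remains. The gene labels and the similarity matrix are hypothetical stand-ins for the output of a pairwise similarity computation.

```python
def average_linkage(sim, labels, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(labels))]

    def cluster_sim(c1, c2):
        # Average pairwise similarity between members of two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = cluster_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(labels[i] for i in c) for c in clusters]

# Hypothetical symmetric similarity matrix for four genes.
GENES = ["g1", "g2", "g3", "g4"]
SIM = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]

print(average_linkage(SIM, GENES, 2))
```

Merging by highest similarity is equivalent to merging by smallest distance when the distance is defined as 1 minus the similarity, which is a typical way to pass a similarity matrix to a distance-based clustering routine.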

4 CONCLUSIONS

Measuring the semantic similarity of GO annotations helps users infer relationships among genes, and is therefore becoming an important basis for many bioinformatics analysis methods. The GOSemSim package implements five classic approaches for GO annotation-based semantic similarity measurement and provides functions that offer flexibility for typical applications. The package can be easily integrated into pipelines for high-throughput analysis, such as gene expression data mining, protein interaction validation and interpretation of miRNA-regulated networks.

Funding: National Key Technologies R&D Program for New Drugs (2009ZX09301-002); National Nature Science Foundation of China (30530650); National Science Fund for Distinguished Young Scholars (30625041).

Conflict of Interest: none declared.

REFERENCES

Bolshakova N, et al. A knowledge-driven approach to cluster validity assessment, Bioinformatics, 2005, vol. 21 (pg. 2546-2547)

Frohlich H, et al. GOSim – an R-package for computation of information theoretic GO similarities between terms and gene products, BMC Bioinform., 2007, vol. 8 pg. 166

Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics, Genome Biol., 2004, vol. 5 pg. R80

Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy, Tenth International Conference on Research on Computational Linguistics (ROCLING X), 1997, Taiwan

Lei Z, Dai Y. Assessing protein similarity with Gene Ontology and its use in subnuclear localization prediction, BMC Bioinform., 2006, vol. 7 pg. 491

Lin D. An information-theoretic definition of similarity, Proceedings of the Fifteenth International Conference on Machine Learning, 1998, San Francisco, CA, USA, Morgan Kaufmann Publishers Inc. (pg. 296-304)

Lord PW, et al. Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation, Bioinformatics, 2003, vol. 19 (pg. 1275-1283)

Mistry M, Pavlidis P. Gene Ontology term overlap as a measure of gene functional similarity, BMC Bioinform., 2008, vol. 9 pg. 327

Pesquita C, et al. Metrics for GO based protein semantic similarity: a systematic evaluation, BMC Bioinform., 2008, vol. 9 Suppl. 5 pg. S4

Pesquita C, et al. Semantic similarity in biomedical ontologies, PLoS Comput. Biol., 2009, vol. 5 pg. e1000443

Philip R. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res., 1999, vol. 11 (pg. 95-130)

Schlicker A, et al. A new measure for functional similarity of gene products based on Gene Ontology, BMC Bioinform., 2006, vol. 7 pg. 302

Sevilla JL, et al. Correlation between gene expression and GO semantic similarity, IEEE/ACM Trans. Comput. Biol. Bioinform., 2005, vol. 2 (pg. 330-338)

Tao Y, et al. Information theory applied to the sparse gene ontology annotation network to predict novel gene function, Bioinformatics, 2007, vol. 23 (pg. i529-i538)

Wang JZ, et al. A new method to measure the semantic similarity of GO terms, Bioinformatics, 2007, vol. 23 (pg. 1274-1281)

Wolting C, et al. Cluster analysis of protein array results via similarity of Gene Ontology annotation, BMC Bioinform., 2006, vol. 7 pg. 338

Xu T, et al. Evaluation of GO-based functional similarity measures using S.cerevisiae protein interaction and expression profile data, BMC Bioinform., 2008, vol. 9 pg. 472

Author notes

†The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

Associate Editor: Olga Troyanskaya

© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
