Bioinformatics Advance Access originally published online on September 6, 2007
Bioinformatics 2007 23(18):2477-2484; doi:10.1093/bioinformatics/btm375
Medline search engine for finding genetic markers with biological significance
Molecular and Behavioral Neuroscience Institute and Department of Psychiatry, University of Michigan, Ann Arbor, Michigan 48109, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Genome-wide high-density SNP association studies are expected to identify various SNP alleles associated with different complex disorders. Understanding the biological significance of these SNP alleles in the context of existing literature is a major challenge since existing search engines are not designed to search literature for SNPs or other genetic markers. The literature mining of gene and protein functions has received significant attention and effort while similar work on genetic markers and their related diseases is still in its infancy. Our goal is to develop a web-based tool that facilitates the mining of Medline literature related to genetic studies and gene/protein function studies. Our solution consists of four main function modules for (1) identifying different types of genetic markers or genetic variations in Medline records; (2) distinguishing positive from negative linkage or association between genetic markers and diseases; (3) integrating marker genomic location data from different databases to enable the retrieval of Medline records related to markers in the same linkage disequilibrium region; and (4) searching, displaying, sorting and downloading Medline citation results through a web interface called MarkerInfoFinder. Tests using published data suggest MarkerInfoFinder can significantly increase the efficiency of finding genetic disorders and their underlying molecular mechanisms. The functions we developed will also be used to build a knowledge base for genetic markers and diseases.
Availability: The MarkerInfoFinder is publicly available at: http://brainarray.mbni.med.umich.edu/brainarray/datamining/MarkerInfoFinder
Contact: mengf{at}umich.edu
| 1 INTRODUCTION |
|---|
The development of efficient SNP genotyping methods has enabled genome-wide association (GWA) studies that use high-density SNP scanning for various complex disorders. Many large-scale GWA studies are underway and it is expected that a significant number of candidate disease-predisposing SNP alleles will be identified in the next couple of years. However, the potential biological impact of these SNP alleles will be unclear in most cases due to the scarcity of biological function annotation. On the other hand, free text databases such as Medline and OMIM contain extensive biological function information from published genetic and molecular biology studies. Solutions that enable effective use of these free text databases for understanding the biological role of SNP alleles identified in GWA studies are highly desirable.
Typically, researchers favor using SNP IDs to search Medline directly. Currently, this is not an effective search strategy since only 550 of about 16 million Medline records are annotated with SNP IDs. Even if a Medline record can be retrieved by direct SNP ID query, existing search engines such as PubMed and Google Scholar fail to retrieve relevant genetically related records like abstracts describing other SNPs, genes, STS/microsatellite markers and cytobands in the same linkage disequilibrium (LD) region as a query SNP. Such genetically related abstracts are likely to be the main source of information for understanding the biological role of a SNP. Existing search engines cannot retrieve such records because the genomic location and the LD relationship of different genetic markers are not incorporated in their search algorithms. As a result, in order to perform a thorough investigation of existing literature for a given SNP, a researcher must (1) find the genomic location of the SNP; (2) define the LD region for the SNP; (3) obtain the names of all STS/microsatellite markers, SNPs and genes in this region from the appropriate databases; (4) identify all possible nomenclature of cytobands that overlap with this region and (5) finally perform a Medline search using all the genetic marker names together with the corresponding disease names as a filter. Undoubtedly, this is a very time-consuming, error-prone and repetitive process that few researchers would like to repeat for every SNP whenever the related databases are updated.
The ideal solution is an automated system where a researcher can use the name/ID of genetic markers to obtain all relevant Medline records in one easy step. In order to achieve this goal, we need to (1) automatically recognize disease names and the names of various genetic markers, including SNPs, STS/microsatellite markers, cytobands and the polymorphisms of genes; (2) identify negative reports between genetic markers and diseases, which is important because a significant proportion of the linkage or association reports for complex disorders in the literature are negative; and (3) obtain the genomic location of various genetic markers from their corresponding databases and build a function for retrieving all genetically related markers based on the HapMap data. This function should also allow a researcher to define his/her own criteria for genetic relevance, such as the size of the neighboring genomic region to be considered and the threshold for a particular LD score (e.g. r2). In addition, since such queries are likely to return a large number of records, functions for the efficient exploration of hundreds or even thousands of Medline records are also needed.
Automatic entity recognition is a hot topic in the mining of biomedical literature and extensive efforts have already been devoted to the identification of gene/protein names in free text literature. Previous work in this area can be roughly divided into two categories: those based on statistical or machine learning techniques, and those based on manually developed rules (Collier et al., 2000; Eriksson et al., 2002; Fukuda et al., 1998; Narayanaswamy et al., 2003; Tanabe and Wilbur, 2002; Xuan et al., 2003). Significant attention has also been devoted to mining protein–protein interaction (Blaschke et al., 1999; Koike et al., 2003; Rzhetsky et al., 2004; Toshihide et al., 2001), protein function-related information (Chiang and Yu, 2003; Daraselia et al., 2004; Raychaudhuri et al., 2002) and gene-drug association (Rindfleisch et al., 2000; Srinivasan and Sehgal, 2003).
In contrast, automated information extraction related to genetic markers and polymorphisms is still at an early stage. The MEMA method was developed to extract protein sequence mutations from Medline abstracts and allows users to access the mutation-gene pairs identified in literature (Rebholz-Schuhmann et al., 2004). Similar work focused on the extraction of single point mutations for the GPCR and NR superfamilies (Horn et al., 2004). VTag was developed for identifying acquired genomic aberrations (McDonald et al., 2004). Most recently, the HCAD database was derived from the literature mining of chromosome breakpoints (Hoffmann et al., 2005). These pioneering works demonstrate the promise of applying natural language processing technologies to genetic markers, but the above solutions are targeted at specific areas of research. There is no comprehensive entity recognition solution that covers every type of genetic marker, such as microsatellite markers. None of the existing methods allows researchers to readily identify all genetically related markers, such as SNPs in the same LD region as microsatellite markers, during literature exploration. Consequently, researchers have to rely on time-consuming manual database searches in order to find published literature related to the SNPs or other genetic markers from association or linkage studies.
To bridge the gap between genotyping data from genome-wide scanning and the large volume of literature related to various forms of genetic markers and their functions, we decided to develop a new Medline search engine named MarkerInfoFinder. It integrates information from relevant databases with our genetic marker entity recognition results for efficient retrieval and exploration of Medline records.
The System Design section provides an overview of MarkerInfoFinder and a description of our data integration efforts. We then present algorithms for entity recognition and the identification of negative marker-disease association in the Entity Recognition section. The Web Functions section describes the web interface and functions for the search and exploration of Medline records. The Use Case Example section includes some search examples to better illustrate the performance and usage of MarkerInfoFinder.
| 2 SYSTEM DESIGN |
|---|
The MarkerInfoFinder consists of four main components: an integrated database, a web interface, an entity recognition module and a negative marker-disease relationship recognition module. In this article, we use the phrase genetic marker in its broadest sense to include SNPs, STS/microsatellite markers, cytobands and genes/proteins found to be related to complex genetic disorders.
As shown at the bottom of Figure 1, we integrate genetic marker IDs, names and their genomic location information from dbSNP, UniSTS, NCBI Ideogram and Entrez Gene. Our database also stores linkage disequilibrium scores (R2, D, LOD) between every pair of SNPs included in the HapMap II data release. If a HapMap data set is based on an old genome assembly, we recalculate the LD scores using the HaploView program (Barrett et al., 2005) after mapping HapMap SNPs to the latest version of the human genome assembly. The integration of genetic marker location and SNP LD scores based on the same version of genome assembly permits the mapping of different types of genetic markers to each other, which is critical for identifying all Medline records genetically related to user queries.
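To illustrate how LD scores and genomic positions held on a common assembly could be combined to find genetically related SNPs, the following minimal sketch filters candidate SNPs by distance and r2. The SNP names, positions, r2 values and the function `related_snps` are hypothetical and stand in for the integrated database; they are not the MarkerInfoFinder schema or code.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: SNP positions (bp) on one chromosome and pairwise r2 values.
POSITIONS: Dict[str, int] = {"rsA": 1000, "rsB": 1500, "rsC": 9000, "rsD": 1200}
R2: Dict[Tuple[str, str], float] = {
    ("rsA", "rsB"): 0.92,
    ("rsA", "rsC"): 0.85,
    ("rsA", "rsD"): 0.40,
}


def r2(a: str, b: str) -> float:
    """Look up r2 for an unordered SNP pair; pairs without a score count as 0."""
    return R2.get((a, b), R2.get((b, a), 0.0))


def related_snps(query: str, min_r2: float = 0.8, max_distance: int = 5000) -> List[str]:
    """Return SNPs within max_distance bp of the query whose r2 meets min_r2."""
    q_pos = POSITIONS[query]
    hits = [
        snp
        for snp, pos in POSITIONS.items()
        if snp != query
        and abs(pos - q_pos) <= max_distance
        and r2(query, snp) >= min_r2
    ]
    # Sort by genomic position so results follow chromosome order.
    return sorted(hits, key=lambda s: POSITIONS[s])
```

Both thresholds are parameters because the text lets the user choose the size of the neighboring region and the LD cutoff.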
Appropriate search mechanisms allow our system to accept a wide variety of query inputs related to the exploration of potential biological implications of genetic markers in the Medline database (left side of Fig. 1). For various types of genetic marker names and genomic/chromosome location queries, we use a powerful genetic marker search function developed in our group for the SNP Function Portal, which is designed to provide extensive molecular level function annotation for SNPs (Wang et al., 2006). To support gene/protein accession numbers and GeneChip IDs, we include the ProbeMatchDB functionality in the MarkerInfoFinder (Wang et al., 2002). In addition, Section 3.3 describes the ability to use genetic disease names directly in the search. Furthermore, Section 5 describes functions for sorting and filtering search results (summarized in the right part of Fig. 1) as elements of our web interface.
The core of the MarkerInfoFinder system is the ability to identify the relationship between genetic markers and diseases in Medline records (top of Fig. 1). We first preprocess Medline records for genetic markers and marker-disease relationships using the modules described in Section 3. We then index every Medline record based on each of the four types of markers (cytoband, STS/microsatellite, SNP and gene/protein) as well as the presence of negative marker-disease relationship in the titles of Medline records. To achieve better specificity for searches using gene/protein/sequence ID or disease name as input, we add a filter function to restrict the results to genetic markers and genetic diseases.
The comprehensive integration of data and functions described above in the MarkerInfoFinder allows the Medline database to be utilized for the efficient exploration of the biological implication of genetic markers identified in association and linkage studies.
| 3 ENTITY RECOGNITION |
|---|
In this section, we describe our approaches for identifying various types of genetic markers and disease names in free text literature.
3.1 Cytoband extraction
Cytobands describe the positions of cytogenetic bands within chromosomes. There is a large body of literature associating cytobands to genetic diseases. It is critical to extract cytoband information from literature and associate it with specific species.
We utilize MeSH descriptors (e.g. Human, Homo sapiens) to determine the organism. Currently, we focus on human cytobands. If there is no MeSH descriptor indicating the organism, we search for a list of terms that indicate human subjects, e.g. child, women, etc.
Human cytobands are usually mentioned by patterns of (p|q)\d+(\.\d+)?. We designed a series of regular expressions to match commonly seen cytoband patterns. Table 1 shows some examples of the regular expressions we compiled for capturing various forms of cytobands.
Some cytobands are ambiguous because strings matching the pattern (p|q)\d+(\.\d+)? may carry other meanings, for example "induce P56 in the mutant cell line P2.1" and "linked in transductions by phage P11". In addition, some protein names match the pattern p\d+, e.g. p53.
We pre-compiled a list of gene and protein names that match the pattern p\d+?, such as p27 and p53. Using the ideogram table from NCBI, we obtained the maximum p and q band numbers for each chromosome. Thus, strings such as P53 and P450 are simply ignored because the maximum p and q band numbers for humans are 36 and 44, respectively.
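A minimal sketch of this first filtering pass might look as follows. The regular expression extends the pattern from the text with an optional chromosome prefix, which is an assumption. `KNOWN_PROTEINS` is a tiny illustrative subset, and `find_cytobands` is a hypothetical name. The maxima are the genome-wide values quoted above, not per-chromosome limits.

```python
import re
from typing import List

# Optional chromosome, arm (p or q), band number, optional sub-band.
CYTOBAND = re.compile(r"\b(\d{1,2}|[XY])?([pqPQ])(\d+)(\.\d+)?\b")
MAX_BAND = {"p": 36, "q": 44}    # largest human p/q band numbers quoted in the text
KNOWN_PROTEINS = {"p27", "p53"}  # illustrative subset of the pre-compiled list


def find_cytobands(text: str) -> List[str]:
    """Return cytoband-like strings that survive the band-number and protein filters."""
    hits = []
    for match in CYTOBAND.finditer(text):
        chrom, arm, band, sub = match.groups()
        arm = arm.lower()
        if int(band) > MAX_BAND[arm]:
            continue  # e.g. P53 and P450 exceed the largest band number
        if chrom is None and f"{arm}{band}" in KNOWN_PROTEINS:
            continue  # bare tokens such as p27 are treated as protein names
        hits.append(f"{chrom or ''}{arm}{band}{sub or ''}")
    return hits
```

Strings such as P2.1 still pass this stage, which is why a further disambiguation step is needed.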
To further disambiguate entities, we developed an algorithm to automatically obtain feature terms associated with cytobands based on two manually curated Medline corpora. The first corpus has 400 Medline abstracts that contain unambiguous cytoband patterns. The second corpus consists of 200 abstracts containing strings that match the pattern (p|q)\d+(\.\d+)? but are not cytobands, as well as 200 abstracts without cytoband patterns.
We use a frequency profiling method to extract feature terms that differentiate these two corpora (Rayson and Garside, 2000). The log-likelihood values are calculated by the equation:
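$$E_i = \frac{N_i \sum_j O_j}{\sum_j N_j} \qquad (1)$$

$$LL = 2 \sum_i O_i \ln\left(\frac{O_i}{E_i}\right) \qquad (2)$$

where, following Rayson and Garside (2000), O_i is the observed frequency of a term in corpus i, N_i is the total number of words in corpus i and E_i is the expected frequency of the term in corpus i.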
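The log-likelihood frequency profiling of Rayson and Garside (2000) can be sketched in a few lines. The function name `log_likelihood` and the example counts in the usage note are hypothetical. This illustrates the published statistic and is not the authors' implementation.

```python
import math
from typing import Sequence


def log_likelihood(observed: Sequence[int], totals: Sequence[int]) -> float:
    """Log-likelihood of one term across corpora (Rayson and Garside, 2000).

    observed[i] is the term's count in corpus i; totals[i] is that corpus's size.
    """
    pooled_rate = sum(observed) / sum(totals)
    ll = 0.0
    for count, size in zip(observed, totals):
        expected = size * pooled_rate  # expected count if the term were evenly spread
        if count > 0:  # the 0 * ln(0) term is taken as 0
            ll += count * math.log(count / expected)
    return 2.0 * ll
```

A term seen 30 times in one corpus and 5 times in another of equal size receives a high score, while a term with equal counts in both scores 0 and would not be selected as a feature term.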
Our algorithms can also identify, with high accuracy, chromosome names that are associated with complex cytogenetic regions. We incorporated rules from human cytogenetic nomenclature (Shaffer and Tommerup, 2005). For example, our program extracts 12q14–12q15 and 14q23–14q24 from t(12;14)(q14–15;q23–24), 16p13.1 and 16q22 from inv(16)(p13.1;q22), and 10q22.2–10q22.3 from del(10)(q22.2q22.3). In addition, our method applied heuristics to identify chromosomes that are mentioned in the same sentence where cytoband regions appear.
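The three examples above can be reproduced with a small parser. This sketch handles only the t, inv and del forms exactly as written in the text. The names `ABERRATION`, `BAND`, `expand` and `parse_aberrations` are hypothetical, and the full nomenclature rules are much broader than what is shown here.

```python
import re
from typing import List

# Aberration type, chromosome list, breakpoint list: t(12;14)(q14–15;q23–24)
ABERRATION = re.compile(r"\b(t|inv|del)\(([^)]+)\)\(([^)]+)\)")
# One breakpoint, optionally with a shorthand range end: q14–15
BAND = re.compile(r"([pq])(\d+(?:\.\d+)?)(?:[-–](\d+(?:\.\d+)?))?")


def expand(chrom: str, group: str) -> List[str]:
    """Expand one breakpoint group (e.g. 'q14–15' or 'q22.2q22.3') for a chromosome."""
    bands = []
    for arm, start, end in BAND.findall(group):
        bands.append(f"{chrom}{arm}{start}")
        if end:
            bands.append(f"{chrom}{arm}{end}")
    # Two breakpoints in one group delimit a region; a single one is a band.
    return ["–".join(bands)] if len(bands) > 1 else bands


def parse_aberrations(text: str) -> List[str]:
    """Extract fully qualified cytobands or regions from aberration strings."""
    regions: List[str] = []
    for _kind, chrom_part, band_part in ABERRATION.findall(text):
        chroms = chrom_part.split(";")
        groups = band_part.split(";")
        if len(chroms) == 1:
            chroms = chroms * len(groups)  # one chromosome, several breakpoints
        for chrom, group in zip(chroms, groups):
            regions.extend(expand(chrom.strip(), group.strip()))
    return regions
```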
3.2 Sequence tagged site extraction
Sequence Tagged Site (STS) is another type of frequently used genetic marker in genetics studies. STSs are short sequences that are operationally unique in the genome and used to generate mapping reagents. STSs can be defined by PCR primer pairs and are associated with additional information, such as genomic position, genes and sequences. Since each STS is unique, STSs are helpful for chromosome placement of mapping and sequencing data from different laboratories.
We extracted and merged all STS names and IDs from the NCBI UniSTS database, which is a comprehensive database integrating marker and mapping data from a variety of public resources. We also combined STS aliases from mapview (build 35.1) since it contains additional STS names not present in UniSTS. In total, we compiled an STS name dictionary that consists of 924 302 names and aliases of 454 439 unique STSs.
Common English words (e.g. muscle) and general biological terms (e.g. transcription) were removed from our STS name dictionary to reduce false positives. These terms were identified using word frequency statistics computed from a word dictionary that consists of 556 974 filtered single words compiled from all Medline abstracts. When compared to gene and protein names, the STS markers have fewer variants in free text.
However, many STS patterns that frequently occur in Medline are identical to gene names. For example, CD34 and STAT3 are listed in UniSTS as STS names, even though they usually represent gene names. Therefore, if an STS name also appears in our gene/protein pattern list, our disambiguation module requires STS-related MeSH terms to be present in the Medline citation. Similar to the process of cytoband disambiguation, we compiled 16 MeSH feature terms, e.g. Microsatellites and Microsatellite Repeats, based on unambiguous DNA segment names used to represent STS entities (e.g. D1S80, DYS132, DXS52 and SY161). The weighted sum method described in Section 3.1 is used to identify real STS names. We also apply a filter to eliminate entities that are followed by a negative feature term (e.g. cell).
3.3 Genetic disease extraction
MarkerInfoFinder supports literature search focusing on genetic disease. Our initial genetic disease names came from the OMIM database. As of August 2005, the OMIM database contained 2552 genetic-related disorders.
We transformed these names into matching patterns. For example, "crohn disease, susceptibility to" is transformed to its canonical form "crohn disease" to increase sensitivity. Generic terms were eliminated using corpus-wide frequency statistics and manual curation. For example, "gene expression" will not be generated as a canonical form of "gene expression, variation in, qtl". Some common English words and terms that belong to other semantic types, e.g. "body mass index", were removed to reduce false positives.
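The transformation of OMIM names into canonical forms can be sketched as below. The qualifier list, the generic-term set and the function `canonical_form` are assumptions built only from the examples in the text. The actual rules also relied on frequency statistics and manual curation.

```python
from typing import Optional, Set, Tuple

# Examples of generic terms from the text; the real list came from corpus statistics.
GENERIC_TERMS: Set[str] = {"gene expression", "body mass index"}
# Assumed trailing qualifiers, taken from the two OMIM names quoted in the text.
QUALIFIERS: Tuple[str, ...] = ("susceptibility to", "variation in", "qtl")


def canonical_form(omim_name: str) -> Optional[str]:
    """Strip qualifier segments from an OMIM name; return None for generic terms."""
    parts = [part.strip().lower() for part in omim_name.split(",")]
    kept = [part for part in parts if part and part not in QUALIFIERS]
    name = ", ".join(kept)
    if not name or name in GENERIC_TERMS:
        return None
    return name
```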
The remaining names were sent to our morphology module to generate morphological variations. Some names require special handling. For example, aids is ambiguous and appears frequently in text. Therefore, we only match its uppercase form. In total, we obtained 4457 forms of disease names, including some plural inflections, such as carcinoid tumors, ACTH deficiencies, etc.
In scanning each Medline abstract, we first normalize the text by removing punctuation marks, etc. We then search disease names using a fast string searching algorithm (Bentley and Sedgewick, 1997). This algorithm blends tries and binary search trees and exhibits very good performance for searching multikey data.
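The structure described by Bentley and Sedgewick (1997) is the ternary search tree. The compact sketch below shows how such a tree can find the longest dictionary name starting at a given position in normalized text. The class and function names are hypothetical, and word boundaries are simplified to single spaces.

```python
from typing import List, Optional


class _Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch: str) -> None:
        self.ch = ch
        self.lo: Optional["_Node"] = None  # keys with a smaller character here
        self.eq: Optional["_Node"] = None  # next character of keys sharing this one
        self.hi: Optional["_Node"] = None  # keys with a larger character here
        self.is_end = False                # True if a key ends at this node


class TernarySearchTree:
    def __init__(self) -> None:
        self.root: Optional[_Node] = None

    def insert(self, key: str) -> None:
        if not key:
            return
        if self.root is None:
            self.root = _Node(key[0])
        node, i = self.root, 0
        while True:
            ch = key[i]
            if ch < node.ch:
                if node.lo is None:
                    node.lo = _Node(ch)
                node = node.lo
            elif ch > node.ch:
                if node.hi is None:
                    node.hi = _Node(ch)
                node = node.hi
            elif i == len(key) - 1:
                node.is_end = True
                return
            else:
                i += 1
                if node.eq is None:
                    node.eq = _Node(key[i])
                node = node.eq

    def longest_match(self, text: str, start: int = 0) -> int:
        """Length of the longest stored key beginning at text[start], or 0."""
        node, i, best = self.root, start, 0
        while node is not None and i < len(text):
            ch = text[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if node.is_end:
                    best = i - start
                node = node.eq
        return best


def find_names(text: str, tree: TernarySearchTree) -> List[str]:
    """Scan text left to right, reporting dictionary names aligned to word boundaries."""
    found, i = [], 0
    while i < len(text):
        at_word_start = i == 0 or text[i - 1] == " "
        n = tree.longest_match(text, i) if at_word_start else 0
        if n > 0 and (i + n == len(text) or text[i + n] == " "):
            found.append(text[i:i + n])
            i += n
        else:
            i += 1
    return found
```

Each comparison either moves sideways among alternative characters or advances one character in the text, so a lookup touches only the nodes along one path instead of testing every dictionary entry.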
3.4 Association extraction of marker and disease
We mainly use the co-occurrence of genetic markers and diseases for retrieving Medline records. Since many papers report negative association between genetic markers and diseases, only considering the co-occurrence of genetic markers and disease in individual Medline records can sometimes be erroneous. Ideally, researchers should have the option of performing explorations based on co-occurrence (i.e. all positive and negative associations), positive association or negative association for better understanding of the relationship between genetic markers and diseases.
Based on our experience, published negative results usually present evidence against important studies or hypotheses in existing literature. Authors reporting a negative association between a genetic marker and a disease usually highlight the negative result in the title of their papers. Therefore, we currently focus on detecting negations in citation titles.
Chapman et al. (2001) developed NegEx, which uses regular expressions to identify negated findings and diseases in clinical reports. Certain patterns occur frequently in clinical reports. For example, "denies" accounted for 15% of the negation phrases in their corpus, e.g. "patient denied experiencing chest pain". But in our corpus consisting of 250 521 Medline citation titles containing genetic-related disease names, the phrase "denies" occurs only once. Consequently, we have to identify negated statements between genetic markers and diseases using additional methods.
Since titles are relatively short, if a negation occurs in a title, both the genetic marker and the disease name should be in the same phrase or sentence. We built a list of 130 regular expression-based negation phrases that typically describe negations, e.g. "irrelevant", "does not", etc., through an iterative process combining expert extraction and semi-automatic filtering. We further enriched our list with 162 negation phrases from NegEx. Some negation phrases almost certainly indicate negative relations, for example "absence of correlation", "lack of association", "no evidence", "failure to replicate" and "does not influence". Based on the compiled negation phrases, we developed templates that capture negations such as <genetic_marker> <negation phrase> <genetic_disease>.
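A simplified version of the title-level negation check is sketched below. It uses only the handful of negation phrases quoted in the text and, as an assumption, requires just that a marker, a disease and a negation phrase all occur in the title, without enforcing the order given by the templates. The function `classify_title` and the example titles in the usage note are hypothetical.

```python
import re
from typing import Iterable

# Negation phrases quoted in the text; the full list is far longer.
NEGATION_PHRASES = [
    r"absence of correlation",
    r"lack of association",
    r"no evidence",
    r"failure to replicate",
    r"does not influence",
    r"does not",
    r"irrelevant",
]
NEGATION = re.compile("|".join(NEGATION_PHRASES), re.IGNORECASE)


def classify_title(title: str, markers: Iterable[str], diseases: Iterable[str]) -> str:
    """Return 'negative', 'cooccurrence' or 'unrelated' for one citation title."""
    lowered = title.lower()
    has_marker = any(marker.lower() in lowered for marker in markers)
    has_disease = any(disease.lower() in lowered for disease in diseases)
    if not (has_marker and has_disease):
        return "unrelated"
    # A negation phrase anywhere in the short title flags a negative report.
    return "negative" if NEGATION.search(title) else "cooccurrence"
```

With markers `["D1S80"]` and diseases `["bipolar disorder"]`, the made-up title "Lack of association between D1S80 and bipolar disorder" is labeled negative, while "Linkage of D1S80 to bipolar disorder" is labeled as a plain co-occurrence.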
We scanned all Medline abstracts for marker-disease associations. The identified negative associations between markers and diseases are used as one of the filters for literature search.
| 4 RESULTS |
|---|
In this section, we first describe results for the text mining component and then explain the mechanism behind our web search engine and data integration components. We then show how to retrieve genetic marker related Medline citations.
4.1 Evaluation of extraction methods
To evaluate the performance of our entity extraction algorithms, we compiled testing corpora for each type of entity and manually annotated the corresponding corpora. Our gene and protein entity identification method was evaluated earlier and it is on par with the best performance reported in literature (Xuan et al., 2003). Hereafter, we focus on cytoband, STS, genetic disease and negation recognition.
We measured the performance of the algorithms using recall, precision and F-score where F-score is defined as follows:
$$F = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$
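For reference, the three measures can be computed from raw counts as in the sketch below, which assumes the standard balanced F-score (the harmonic mean of precision and recall). The function name `f_score` and the counts in the usage note are hypothetical.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score from entity counts; 0.0 when nothing was correctly found."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # correct among extracted
    recall = true_pos / (true_pos + false_neg)     # correct among annotated
    return 2 * precision * recall / (precision + recall)
```

For instance, 90 correct extractions with 10 spurious and 30 missed entities give a precision of 0.90, a recall of 0.75 and an F-score of about 0.82.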
The STS extraction algorithm evaluation corpus containing 200 abstracts was generated in a similar way.
For generating the genetic disease name corpus, we randomly sampled 500 diseases from OMIM and arbitrarily picked one Medline abstract associated with each disease in OMIM database. We then asked a biologist to manually confirm if these abstracts contained disease names in the corresponding OMIM entries.
For benchmarking our negated associations identification algorithm, we randomly sampled 1700 titles from Medline citations that contain both diseases and genetic markers in their titles. A human expert annotated these titles for positive and negative associations as well as irrelevant records.
Both testing and annotation sets can be downloaded from MarkerInfoFinder website: http://brainarray.mbni.med.umich.edu/brainarray/datamining/MarkerInfoFinder/Help.
Table 2 shows the evaluation results for our approaches. Our methods achieve satisfactory precision and recall. The F-scores for cytoband, STS and negation are significantly better than those of state-of-the-art gene/protein name recognition algorithms due to the relatively simple nature of these entities. The performance for genetic disease name recognition is lower because there is extensive variation in disease names in free text literature.
4.2 Entity extraction results
As of 24 July 2006, Medline had 15 995 358 citations, with 8 325 901 of them containing abstracts. We stored all Medline citations in our local Oracle 10g database. Table 3 shows our entity extraction results on abstracts and titles of Medline citations.
In addition to good overall entity recognition performance, our text mining components are computationally efficient. For example, our entity identification engine can extract gene and protein names for the whole Medline database in 28 h on a desktop PC (P4 2.8 GHz) running Windows 2000. Other genetic marker extraction processes can be finished in 15 h.
4.3 Filtering matched literature
When a user searches Medline by gene/protein/disease names for related genetic studies, Medline often returns many unrelated records due to the large body of non-genetic studies related to these entities. To restrict the returned citations to those related to genetic studies, we use MeSH terms for additional filtering. The 2005 release of MeSH (Coletti and Bleich, 2001) contains 22 491 descriptors. Over 98% of Medline citations are annotated with MeSH terms, 11 per citation on average, which human annotators judged to be most relevant to the main topics of each citation.
We constructed 29 queries to search the MeSH database for genetic or sequence variation-related descriptors. Each query retrieves descriptors related to certain keywords that are frequently used to describe sequence variations. For example, the query SELECT * FROM mesh_descriptor WHERE lower(MeSH) LIKE '%polymorphism%' returns four descriptors, including "Polymorphism, Genetic" and "Polymorphism, Single Nucleotide". In total, we obtained 470 unique MeSH descriptors. We then manually examined this list to remove terms frequently not associated with genetic studies. The resulting list contains 113 MeSH descriptors, for example DNA Mutational Analysis and Microsatellite Repeats. This list can also be downloaded from our website.
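The keyword query can be tried out with an in-memory SQLite database standing in for the Oracle installation. The table and column names follow the query quoted in the text. The four rows and the helper `descriptors_matching` are illustrative assumptions.

```python
import sqlite3
from typing import List

# In-memory stand-in for the MeSH descriptor table; only four example rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mesh_descriptor (MeSH TEXT)")
conn.executemany(
    "INSERT INTO mesh_descriptor VALUES (?)",
    [
        ("Polymorphism, Genetic",),
        ("Polymorphism, Single Nucleotide",),
        ("DNA Mutational Analysis",),
        ("Microsatellite Repeats",),
    ],
)


def descriptors_matching(connection: sqlite3.Connection, keyword: str) -> List[str]:
    """Return descriptors whose lower-cased name contains the keyword."""
    rows = connection.execute(
        "SELECT MeSH FROM mesh_descriptor WHERE lower(MeSH) LIKE ? ORDER BY MeSH",
        (f"%{keyword.lower()}%",),  # bound parameter instead of string pasting
    )
    return [row[0] for row in rows]
```

Running one such query per keyword and taking the union of the results yields the candidate list that is then curated by hand.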
We scanned the whole Medline dataset for such genetic study related MeSH terms. A total of 1 230 577 MeSH annotations in 816 121 citations met our criteria, i.e. these citations were annotated with one or more such MeSH terms. When a user searches citations by gene/protein names, probe ID lists or disease names, matched citations will be returned if they pass this genetic study MeSH filter. Table 4 shows the genetic study related disease entity extraction results, before and after applying the above filter. The filtering step should significantly increase the relevance of the search results pertaining to genetic studies.
| 5 WEB FUNCTIONS |
|---|
5.1 Searching by genetic markers
When a user tries to retrieve Medline records using a SNP list, MarkerInfoFinder will utilize the SNP Function Portal web service developed in our group to search for genetically related SNPs (Wang et al., 2006). For example, users can select criteria such as genomic neighbors and LD scores to find SNPs, which greatly increases the efficiency of finding all relevant SNPs.
The resulting SNP list will be used to find related Medline citations using pre-indexed tables (which map cytoband, STS, SNP_ID or gene/protein variants to PMID). The literature-matching results will be filtered and presented to the user in genomic location view (Fig. 2a), where citations are grouped by genetic markers (e.g. SNP IDs) and they are also sorted by genomic locations.
MarkerInfoFinder also allows users to find literature by directly searching STS names/IDs, gene/protein names/IDs and human cytoband names that appear in the abstract, either by first mapping these entities to a functionally related SNP list or by finding the corresponding entities and overlapping entities in the same category directly in Medline. For example, if a user chooses to perform a search based only on cytoband without mapping to SNPs, the citations returned by MarkerInfoFinder will include cytoband region(s) in the Medline title/abstract that overlap with the user input. Based on user feedback, we provide an option of mapping non-SNP entities directly to Medline records. Although mapping through SNPs or genes is correct in principle, it may lead to a large number of results not intended by the user.
Naturally, MarkerInfoFinder supports genomic location-based searches since we have the genomic location of every type of genetic markers mentioned here. Users can also search by gene names or keywords. When users query by keywords for gene names, MarkerInfoFinder will call NCBI's web service to search genes of user-specified organisms in the Entrez Gene database and the UniGene database. We merge these results and map them to UniGene IDs. Users can then select genes of their interest in order to query related publications.
5.2 Searching by genetic disease names
With the explosion of genetic marker information, researchers are facing the challenge of associating markers, e.g. SNP markers, with various diseases. Therefore, MarkerInfoFinder includes another module that provides the capability to search marker-related citations by disease names.
We built an inverted index of keywords from normalized OMIM genetic-related disease names thus enabling users to search disease by either keywords or MeSH terms, e.g. bipolar disorder. Users can select one or more related diseases and retrieve Medline citations, which are filtered by a genetic MeSH filter. MarkerInfoFinder also provides users with the option to filter out citations reporting negative relations between markers and genetic diseases.
Our search results will also provide genetic markers related to the disease. This is a useful feature since many researchers are interested in determining which genetic factors are associated with a disease, in addition to what are the potential biological implications related to a list of genetic markers.
5.3 Functions facilitating Medline search
MarkerInfoFinder provides a series of scaffolding tools to facilitate Medline search. For example, when there are no hits from MarkerInfoFinder, we call spell checking web services (ESpell from NCBI and spelling check from Google) to notify the user of potentially misspelled terms. Also, for each citation users can review related journal information to decide which publications deserve further investigation.
Similar to our GeneInfoMiner (Xuan et al., 2005), we use MeSH descriptors to organize search results for more efficient literature exploration (Fig. 2b). We believe this function is critical for exploring a large number of returned Medline records. The MarkerInfoFinder counts the number of Medline records associated with each MeSH term and sorts results in descending order by default according to the product of three values: number of hits, inverse document frequency (IDF) and term frequency (TF) for each MeSH term. A user can also sort by the number of hits or by TF*IDF independently. These sorting methods can help users quickly identify specific biological processes related to the genetic markers of interest.
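The default ordering can be sketched as follows. The text does not define TF and IDF for MeSH terms, so this sketch assumes TF is the fraction of retrieved records carrying the term and IDF is the logarithm of the corpus size divided by the number of corpus records carrying the term. The function `rank_mesh_terms` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def rank_mesh_terms(
    result_terms: Sequence[Sequence[str]],
    corpus_df: Dict[str, int],
    corpus_size: int,
) -> List[Tuple[str, float]]:
    """Rank MeSH terms of a result set by hits * TF * IDF, highest first.

    result_terms holds the MeSH terms of each retrieved record; corpus_df maps
    a term to the number of records in the whole corpus annotated with it.
    """
    hits = Counter(term for terms in result_terms for term in set(terms))
    n_results = len(result_terms)
    scored = []
    for term, n_hits in hits.items():
        tf = n_hits / n_results                        # assumed TF definition
        idf = math.log(corpus_size / corpus_df[term])  # assumed IDF definition
        scored.append((term, n_hits * tf * idf))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions a term that annotates nearly every Medline record, such as a species descriptor, is pushed down the list even if it appears in all retrieved records.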
Medline records associated with each SNP or MeSH topic can be displayed in a browser window with multiple sorting options such as gene symbols, the number of submitted sequence IDs related to each abstract, journal impact factor and publication date. MarkerInfoFinder allows users to examine identified gene information with multiple linkouts. In the MeSH group view page, users can review MeSH terms of interest in a simplified hierarchical MeSH tree. Once a user selects papers of interest, citations of the selected papers can be downloaded directly to citation managers, such as Endnote and ProCite.
| 6 USE CASE EXAMPLE |
|---|
In order to further compare our overall solution with the PubMed and Google Scholar search engines, we tested some typical genetic marker-related search situations.
Table 5 shows comparative search results using a set of cytobands, STSs and SNPs that are related to Parkinson's disease and mentioned in Medline abstracts. These results were generated on 28 February 2007 by searching PubMed, Google Scholar and MarkerInfoFinder (MIF). We used an LD score (r2) threshold of 0.8 for genetic marker mapping in the MarkerInfoFinder search. The results in Table 5 demonstrate that MarkerInfoFinder outperforms searching PubMed directly by a significant margin. Follow-up examinations revealed that all citations returned by the PubMed search engine are covered in the MarkerInfoFinder result sets. Table 5 also shows results from the Google Scholar search engine. Although in some cases the number of citations returned by Google Scholar is comparable to that of MarkerInfoFinder, Google Scholar does not restrict results to those related to genetic studies and it uses full-text indexing rather than relying solely on Medline records. Since the appearance of genetic markers in titles and abstracts usually suggests higher importance than occurrence only in the full-length text, MarkerInfoFinder should provide more relevant records based on authors' summaries. Furthermore, Google Scholar results are unlikely to contain all genetically related markers since LD criteria are not incorporated in Google Scholar.
|
| 7 DISCUSSION |
|---|
|
|
|---|
While there are a number of biological databases containing genetic marker information, such as UniSTS and dbSNP, they are mainly designed for querying individual markers rather than for providing a comprehensive understanding of their potential biological functions. MarkerInfoFinder is the first application that allows researchers to find genetic marker-related Medline records using flexible criteria. It greatly increases the efficiency of finding all Medline records related to a given genetic marker or a list of genetic markers. Our web search engine provides users with an intuitive interface for searching, displaying, sorting and exploring the Medline database.
We will continue to improve entity recognition for various types of genetic markers, as some of them have never been dealt with in the literature before. We plan to incorporate Conditional Random Fields algorithms to help further disambiguate genetic markers in Medline abstracts. We will also try to develop the ability to recognize negative associations between genetic markers and diseases in abstracts rather than just in titles. Keyword-based filtering and the ability to explore clustered subsets of the Medline records retrieved by a genetic marker-based search will be added in the next phase. We expect this effort to lead to a knowledge base for genetic markers and various pathophysiological processes, which can be used to develop knowledge-based analysis of genome-wide scanning results.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
The authors are members of the Pritzker Neuropsychiatric Disorders Research Consortium, which is supported by the Pritzker Neuropsychiatric Disorders Research Fund L.L.C. This work is also partly supported by the National Center for Integrated Biomedical Informatics through NIH grant 1U54DA021519-01A1. Funding to pay the Open Access publication charges was provided by the Pritzker Neuropsychiatric Disorders Research Consortium.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on August 25, 2006; revised on July 10, 2007; accepted on July 15, 2007
| REFERENCES |
|---|
|
|
|---|
Barrett JC, et al. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics (2005) 21: 263–265.
Bentley JL, Sedgewick R. Fast algorithms for sorting and searching strings. (1997) Proceedings of the 8th ACM-SIAM. Louisiana: New Orleans. 360–369.
Blaschke C, et al. Automatic extraction of biological information from scientific text: protein-protein interactions. (1999) Proceedings of the AAAI Conference on ISMB. 60–67.
Chapman WW, et al. A simple algorithm for identifying negated findings and diseases in discharge summaries. J. Biomed. Inform (2001) 34: 301–310.
Chiang J-H, Yu H-C. MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics (2003) 19: 1417–1422.
Coletti MH, Bleich HL. Medical subject headings used to search the biomedical literature. J. Am. Med. Inform. Assoc (2001) 8: 317–323.
Collier N, et al. Extracting the names of genes and gene products with a hidden Markov model. (2000) Proceedings of the 18th International Conference on Computational Linguistics. 201–207.
Daraselia N, et al. Extracting protein function information from Medline using a full-sentence parser. (2004) Proceedings of the Second European Workshop on Data Mining and Text Mining for Bioinformatics. 11–18.
Eriksson G, et al. Exploiting syntax when detecting protein names in text. (2002) Proceedings of Workshop on NLP in Biomedical Applications.
Fukuda K, et al. Toward information extraction: identifying protein names from biological papers. (1998) Proceedings of the Pacific Symposium on Biocomputing. Hawaii: Maui. 707–718.
Hoffmann R, et al. HCAD, closing the gap between breakpoints and genes. Nucleic Acids Res (2005) 33: D511–D513.
Horn F, et al. Automated extraction of mutation data from the literature: application of MuteXt to G protein-coupled receptors and nuclear hormone receptors. Bioinformatics (2004) 20: 557–568.
Koike A, et al. Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res (2003) 13: 1231–1243.
McDonald RT, et al. An entity tagger for recognizing acquired genomic variations in cancer literature. Bioinformatics (2004) 20: 3249–3251.
Narayanaswamy M, et al. A biological named entity recognizer. (2003) Proceedings of the Pacific Symposium on Biocomputing. Hawaii. 427–438.
Raychaudhuri S, et al. Using text analysis to identify functionally coherent gene groups. Genome Res (2002) 1582–1590.
Rayson P, Garside R. Comparing corpora using frequency profiling. (2000) Proceedings of the Workshop on Comparing Corpora, 38th Annual Meeting of the Association for Computational Linguistics. Hong Kong. 1–6.
Rebholz-Schuhmann D, et al. Automatic extraction of mutations from Medline and cross-validation with OMIM. Nucleic Acids Res (2004) 32: 135–142.
Rindfleisch TC, et al. EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac. Symp. Biocomput (2000) 5: 514–525.
Rzhetsky A, et al. GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J. Biomed. Inform (2004) 37: 43–53.
Shaffer LG, Tommerup N, eds. ISCN 2005: an International System for Human Cytogenetic Nomenclature (2005): Recommendations of the International Standing Committee on Human Cytogenetic Nomenclature. (2005) New York: Karger.
Srinivasan P, Sehgal AK. Mining MEDLINE for similar genes and similar drugs. Technical report (2003) Department of Computer Science, University of Iowa.
Tanabe L, Wilbur WJ. Tagging gene and protein names in biomedical text. Bioinformatics (2002) 18: 1124–1132.
Toshihide O, et al. Automated extraction of information on protein–protein interactions from the biological literature. Bioinformatics (2001) 17: 155–161.
Wang P, et al. SNP Function Portal: a web database for exploring the function implication of SNP alleles. Bioinformatics (2006) 22: e523–e529.
Wang P, et al. ProbeMatchDB – a web database for finding equivalent probes across microarray platforms and species. Bioinformatics (2002) 18: 488–489.
Xuan W, et al. Identifying gene and protein names from biological texts. Computer Society Bioinformatics (2003) CA: Stanford. 639–643.
Xuan W, et al. GeneInfoMiner – a web server for exploring biomedical literature using batch sequence ID. Bioinformatics (2005) 21: 3452–3453.

