Bioinformatics Advance Access originally published online on October 25, 2005
Bioinformatics 2006 22(6):651-657; doi:10.1093/bioinformatics/bti733
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Automatic term list generation for entity tagging

Ted Sandler *, Andrew I. Schein and Lyle H. Ungar

Department of Computer and Information Science, University of Pennsylvania 3330 Walnut Street, Philadelphia, PA 19104, USA

*To whom correspondence should be addressed.

ABSTRACT

Motivation: Many entity taggers and information extraction systems make use of lists of terms of entities such as people, places, genes or chemicals. These lists have traditionally been constructed manually. We show that distributional clustering methods which group words based on the contexts that they appear in, including neighboring words and syntactic relations extracted using a shallow parser, can be used to aid in the construction of term lists.

Results: Experiments on learning lists of terms and using them as part of a gene tagger on a corpus of abstracts from the scientific literature show that our automatically generated term lists significantly boost the precision of a state-of-the-art CRF-based gene tagger to a degree that is competitive with using hand-curated lists, and boost recall to a degree that surpasses that of the hand-curated lists. Our results also show that these distributional clustering methods do not generate lists as helpful as those generated by supervised techniques, but that they can be used to complement supervised techniques so as to obtain better performance.

Availability: The code used in this paper is available from http://www.cis.upenn.edu/datamining/software_dist/autoterm/

Contact: tsandler{at}seas.upenn.edu

1 INTRODUCTION

Entity tagging is one of the most basic tasks in information extraction. Consequently, if any sophisticated information extraction is to be done, this task must be done well. Currently, state-of-the-art entity taggers are trained with hand-labeled training data that resemble the text to which they will be applied. However, such training data are rarely abundant enough to contain more than a modest subset of all the members of the target entity class. To remedy this, designers of information extraction systems frequently augment their systems with large lists of known members of the entity class so as to cover most of the prototypical, and perhaps not so prototypical, members of the class that the training data do not contain. Such entity lists, term lists as we refer to them, are useful in ensuring the baseline level of coverage and thus the baseline level of functionality.

Unfortunately, in some domains, term lists may not exist, making the start-up costs of building information extraction systems higher. And even when they do exist, they will most certainly be incomplete. Therefore, finding cheap and efficient means of generating term lists that boost the performance of information extraction systems is of importance to the information extraction community.

Large-corpus methods for bootstrapping domain-specific lexicons have existed for quite some time, as have methods for partitioning words into pseudo-semantic classes based on their distributional properties (Church and Hanks, 1990; Hearst, 1992; Hindle, 1990; Pereira et al., 1993; Riloff and Shepherd, 1997; Roark and Charniak, 1998). While not perfect, these methods do work surprisingly well and are straightforward to implement. Therefore, it is natural to ask whether such methods can be used to create term lists that boost the performance of information extraction systems.

In this paper we investigate this question by showing how unsupervised word clustering techniques can automatically generate lists of biomedical entities and other terms, and that such lists can significantly improve the performance of an entity tagging system that automatically identifies mentions of genes in biomedical text. We compare the performance gains obtained from using the automatically generated term lists against those obtained from manually curated lists, and a list generated through supervised learning. Finally we review previous work.

2 APPROACH

We describe four related methods for generating term lists based on a 2 x 2 layout. The methods vary in their ability to create lists of high quality, lists of large size and lists that capture domain-specific rather than general knowledge. In all methods, traditional distributional clustering techniques are used to partition the vocabulary into clusters, which serve as the desired term lists. The methods differ in how they represent words, the clustering algorithms they use to partition the words and their choice of feature weighting schemes. Table 1 provides a description of the 2 x 2 layout. Along the horizontal dimension lies the representation and clustering algorithm used. And along the vertical dimension lies the feature weighting scheme used.


Table 1 The 2 x 2 Layout

 
In the first representation, an affinity matrix is used to represent all pairwise similarities between terms, while in the second, terms are represented as feature vectors in a vector space. All methods rely on the availability of a large body of unlabeled domain-relevant text and a parser capable of extracting certain shallow syntactic relations.

We first decide on the set of vocabulary items that we wish to partition into separate entity classes. We call this set the ‘base vocabulary.’ Typically the base vocabulary will consist of nouns, but it could just as easily consist of other parts of speech. Next we decide on a specific set of syntactic relations, R, that are particularly useful in distinguishing different entity types. These relations should be frequent enough to co-occur with a large portion of the base vocabulary, should be informative about what entity classes the vocabulary items belong to and should be relatively noise-free. We parse the corpus to extract these relations and collect statistics on which words engage in which relations. Finally, we use a clustering algorithm to partition the base vocabulary based on the collected statistics. The resulting partitions form the automatically generated term lists.

2.1 Extracting relations from the corpus
Our corpus consisted of 15 000 sentences from the Biocreative 2004 gene tagging competition (BioCreAtIvE, 2004, http://www.pdg.cnb.uam.es/BioLink/workshop_BioCreative_04) and an additional 1 800 547 abstracts from the MEDLINE database. The abstracts dated mainly from 1995 to 2000 and contained 341 000 315 words in total. We parsed this corpus and selected as our base vocabulary the set of 7782 single-token nouns that occurred in the Biocreative corpus and contained at least one alphabetical character.

To extract relations from the corpus, we used the minipar parser of Lin (1998b). Minipar can be configured to output a sequence of ‘dependency triples’ that represent shallow syntactic configurations between words. A dependency triple has the form (w1, rel_r, w2) where w1 and w2 are words in a sentence that engage in some syntactic relation ‘rel_r’. We ignored all triples t = (w_t1, rel_t, w_t2) for which rel_t ∉ R.

Table 2 shows the relations that we chose to include in the set R. The ‘Coverage’ column lists the number of words in the base vocabulary that occurred in each relation. ‘Kinds’ refers to the number of distinct words that occurred opposite a base-vocabulary word in a particular relation. For example the triples ‘(organism, noun:mod-by:adj, cellular)’ and ‘(process, noun:mod-by:adj, cellular)’ are both of the same kind since they agree in both relation ‘noun:mod-by:adj,’ and modifying word, ‘cellular.’ In the last column, ‘Extracted’ refers to the total number of relations extracted for each relation type.


Table 2 Extracted syntactic relations

 
We restricted the number of ‘Kinds’ in the ‘noun:mod-by:noun’ and ‘noun:modifies:noun’ relations to the top 1000 most frequent nouns because the total number of kinds for these two relations was greater than the size of the base vocabulary itself. None of the other relations was restricted in this way. Following are some example sentences from the Biocreative 2004 corpus and some of the relations we extracted for them.
  • Protein kinases play pivotal roles in the control of many cellular processes.
    (kinases, noun:subj-of:verb, play)
    (roles, noun:obj-of:verb, play)
    (kinases, noun:mod-by:noun, protein)
    (protein, noun:modifies:noun, kinases)
  • Selenium is an essential element for humans, animals and some species of microorganisms.
    (selenium, noun:subj:noun, element)
    (element, noun:mod-by:adj, essential)
    (human, noun:conj:noun, animal)
    (animal, noun:conj:noun, species)
  • We report here on the molecular nature of an EMS-induced mutant, mn1-89, a leaky semidominant allele of the Miniature1 (Mn1) seed locus that encodes a seed-specific cell wall invertase, INCW2.
    (mutant, noun:appo:noun, mn1-89)
    (mn1-89, noun:appo:noun, allele)
    (invertase, noun:appo:noun, INCW2)
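
The filtering step and the per-relation statistics (‘Coverage’, ‘Kinds’, ‘Extracted’) can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the triples are typed in by hand from the examples above rather than produced by minipar, the relation ‘det’ is included only as an example of a relation outside R, and the names RELATIONS, BASE_VOCAB and relation_stats are hypothetical. It also assumes that only triples whose first word is in the base vocabulary are counted.

```python
from collections import defaultdict

# Relation set R: the relation names that appear in the examples above.
RELATIONS = {
    "noun:subj-of:verb", "noun:obj-of:verb", "noun:mod-by:adj",
    "noun:mod-by:noun", "noun:modifies:noun",
    "noun:subj:noun", "noun:conj:noun", "noun:appo:noun",
}

# Toy base vocabulary (the paper's has 7782 nouns).
BASE_VOCAB = {"kinases", "roles", "protein", "organism", "process"}

def relation_stats(triples, relations=RELATIONS, vocab=BASE_VOCAB):
    """Return {relation: (coverage, kinds, extracted)} for triples (w1, rel, w2)."""
    covered = defaultdict(set)    # base-vocabulary words seen in each relation
    kinds = defaultdict(set)      # distinct words seen opposite them
    extracted = defaultdict(int)  # total number of triples kept
    for w1, rel, w2 in triples:
        if rel not in relations or w1 not in vocab:
            continue  # drop triples outside R or outside the base vocabulary
        covered[rel].add(w1)
        kinds[rel].add(w2)
        extracted[rel] += 1
    return {r: (len(covered[r]), len(kinds[r]), extracted[r]) for r in extracted}

triples = [
    ("kinases", "noun:subj-of:verb", "play"),
    ("roles", "noun:obj-of:verb", "play"),
    ("kinases", "noun:mod-by:noun", "protein"),
    ("organism", "noun:mod-by:adj", "cellular"),
    ("process", "noun:mod-by:adj", "cellular"),
    ("kinases", "det", "the"),  # relation outside R, dropped
]
stats = relation_stats(triples)
```

In this toy input the two ‘noun:mod-by:adj’ triples share the modifier ‘cellular’, so they count as two extracted relations, two covered words, and a single kind.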

2.2 Representations
As stated above, we use two term list generation methods. Each method requires a different representation for the base vocabulary. The first uses a vector space representation in which each vocabulary item is represented by the set of syntactic configurations that it occurs in. These configurations serve as features or attributes of the vocabulary item. The second is an affinity matrix representation wherein each vocabulary item is represented by its similarities to the other items in the base vocabulary.

The first five syntactic relations in Table 2 provide the dimensions of the vector space, and the last three relations are used to determine the affinities between nouns in the affinity matrix. Using examples listed above, ‘(noun:subj-of:verb play)’ would be one of the axes in the vector space representation, whereas items ‘mutant’ and ‘mn1-89’ would have a higher similarity in the affinity matrix representation.
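
As a rough sketch of the two representations, the triples can be split into sparse feature vectors and pairwise noun-noun counts. The assignment of relations to the two groups below is an assumption inferred from the examples in the text (Table 2 itself is not reproduced here), the counts are raw rather than weighted, and the function name build_representations is hypothetical.

```python
from collections import Counter, defaultdict

# Assumed split of the eight relations, inferred from the examples in the text.
FEATURE_RELS = {"noun:subj-of:verb", "noun:obj-of:verb", "noun:mod-by:adj",
                "noun:mod-by:noun", "noun:modifies:noun"}
AFFINITY_RELS = {"noun:subj:noun", "noun:conj:noun", "noun:appo:noun"}

def build_representations(triples):
    """Split triples into sparse feature vectors and pairwise noun-noun counts."""
    vectors = defaultdict(Counter)  # noun -> {(relation, other word): count}
    affinity = Counter()            # frozenset({noun_a, noun_b}) -> count
    for w1, rel, w2 in triples:
        if rel in FEATURE_RELS:
            vectors[w1][(rel, w2)] += 1
        elif rel in AFFINITY_RELS and w1 != w2:
            affinity[frozenset((w1, w2))] += 1  # unordered pair: symmetric
    return vectors, affinity

triples = [
    ("kinases", "noun:subj-of:verb", "play"),
    ("roles", "noun:obj-of:verb", "play"),
    ("mutant", "noun:appo:noun", "mn1-89"),
    ("mn1-89", "noun:appo:noun", "allele"),
]
vectors, affinity = build_representations(triples)
```

Here ("noun:subj-of:verb", "play") becomes an axis of the vector for ‘kinases’, while the apposition triple raises the affinity between ‘mutant’ and ‘mn1-89’.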

2.3 Weighting schemes
χ²-tests of independence between nouns and features are used to weight the feature values in the vector space model, and tests of independence between nouns and other nouns are used to weight the similarity values in the affinity matrix. The χ²-statistic is chosen because it is non-negative, symmetric and its values lie on an interpretable scale.

Two estimation techniques are used to estimate the χ²-statistic: the generalized likelihood ratio (GLR) and Pearson's χ²-test of independence. We use the GLR to test the hypothesis that a single binomial random variable W generates occurrences of a word w against the hypothesis that w is generated by a mixture of two binomial random variables, W|A and W|¬A, with parameters P_{W|A} and P_{W|¬A}, respectively. The null hypothesis tested is

\[ H_0 : P_{W|A} = P_{W|\neg A} \]

where A is a particular feature in the vector space representation or another noun in the affinity matrix representation. In Pearson's χ²-test, a normal approximation to the binomial is used.

As described in Dunning (1993), the GLR is robust in the case of small counts whereas Pearson's χ² tends to overstate their importance, especially when ratios are skewed toward zero or one. On the other hand, this overemphasis can actually provide some benefit when trying to detect associations between infrequently occurring entities such as genes. These observations are apparent in Table 3, where the top ten word associations for each weighting scheme are listed. Clearly, the GLR is placing common, everyday associations at the top of its rankings, whereas seven of the top ten Pearson χ² associations appear to be gene related, judging from their contexts in the Biocreative corpus. Consequently, we use both weighting schemes in order to capture both ‘common sense’ and domain-specific information.
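
A small worked comparison may help make this contrast concrete. The sketch below computes both statistics on 2 x 2 contingency tables, using the standard log-likelihood ratio form G² = 2 Σ O ln(O/E) for the GLR. The two tables are invented for illustration and are not taken from the corpus, and the function names are hypothetical.

```python
import math

def expected_counts(table):
    """Expected cell counts of a 2 x 2 table under independence."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]

def pearson_chi2(table):
    """Pearson's statistic: sum of (O - E)^2 / E over the four cells."""
    exp = expected_counts(table)
    return sum((table[i][j] - exp[i][j]) ** 2 / exp[i][j]
               for i in range(2) for j in range(2))

def glr_statistic(table):
    """G^2 = 2 * sum O * ln(O / E); empty cells contribute zero."""
    exp = expected_counts(table)
    return 2.0 * sum(table[i][j] * math.log(table[i][j] / exp[i][j])
                     for i in range(2) for j in range(2) if table[i][j] > 0)

# Rows: word w present / absent; columns: context A present / absent.
rare = [[1, 0], [0, 9999]]          # one co-occurrence of two rare items
common = [[500, 500], [500, 8500]]  # a frequent, moderately associated pair
```

On these inputs Pearson's statistic ranks the single rare co-occurrence above the frequent pair, while the GLR ranks them the other way round, which mirrors the behavior described above.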


Table 3 Top 10 word associations for GLR and Pearson χ²

 
2.4 Clustering
To cluster the words of the base vocabulary, we use two separate clustering algorithms, one for each of the data representations described above. We cluster the words in the vector space representation using kmeans clustering since it is well suited to data that live in a vector space. Furthermore, its global stability allows us to partition a large portion of the base vocabulary without the ‘collapsing’ effects that plague other greedy clustering methods (see below). However, the clusters generated by kmeans tend to be rather noisy. Therefore a complementary approach is used.

We cluster the affinity matrix data using average-link, agglomerative clustering. This allows us to generate clusters of high purity by tuning a threshold parameter T. T determines how similar two clusters must be in order for a merger between them to be allowed. The higher the value of T, the higher the required similarity, and thus the higher the purity of the resulting clusters. Unfortunately, many words are not clusterable at high values of T because their similarity to other words is only average in comparison with that of the most similar words. Therefore lower values of T are desirable in order to partition a greater portion of the base vocabulary. Yet these lower values result in a collapsing effect wherein all words end up in a small handful of clusters. Consequently, the combined use of kmeans and agglomerative clustering is a necessary and effective way of generating clusters of both high purity and high coverage.
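
The role of the threshold T can be illustrated with a minimal average-link implementation. This is a sketch under stated assumptions rather than the authors' implementation: similarities are stored in a dictionary keyed by unordered word pairs, missing pairs are treated as similarity zero, and the similarity values themselves are invented.

```python
def average_link(c1, c2, sim):
    """Mean pairwise similarity between members of two clusters."""
    total = sum(sim.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(words, sim, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[w] for w in words]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = average_link(clusters[i], clusters[j], sim)
                if s >= best:
                    best, pair = s, (i, j)
        if pair is None:
            break  # no remaining pair is similar enough to merge
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented similarity values, for illustration only.
sim = {
    frozenset(("AG", "NNA")): 40.0,
    frozenset(("spermicide", "nonoxynol-9")): 35.0,
    frozenset(("AG", "spermicide")): 5.0,
}
words = ["AG", "NNA", "spermicide", "nonoxynol-9", "kinase"]
strict = agglomerate(words, sim, threshold=20.0)
loose = agglomerate(words, sim, threshold=0.0)
```

With the high threshold the two tight pairs are found and ‘kinase’ is left unmerged; with the threshold at zero every merger is allowed and all five words collapse into one cluster.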

In our experiments, we chose clustering parameters by testing different values of K and T on a small development test-set. No particular values definitively outperformed the others so we simply chose values that appeared ‘reasonable’ with respect to the clusters they generated and the performance observed on the development test-set (see footnote 1). We set K = 500 and applied kmeans to the GLR and Pearson χ²-weighted data separately, generating 1000 clusters in all. The median cluster size was 8 words for the GLR-weighted data and 11 words for the Pearson χ²-weighted data. For agglomerative clustering, we set the threshold parameter T = 20, and as with the vector space data, we clustered the GLR- and Pearson χ²-weighted data separately. This generated 820 clusters for the GLR-weighted data and 323 clusters for the Pearson χ²-weighted data. In total, 3857 words were clustered for the GLR-weighted data and 5487 words for the Pearson χ²-weighted data. Median cluster sizes were 1 word and 2 words for the GLR- and Pearson χ²-weighted data, respectively.

Some of the resulting clusters are displayed in Table 4. The table shows the 2 x 2 layout with weighting schemes along one axis and clustering algorithms along the other.


Table 4 2 x 2 clustering layout

 
2.5 Cluster analysis
To evaluate the quality of the resulting clusters, we sampled five clusters at random from each of the four quadrants in the 2 x 2 layout and gave each cluster a purity score equal to the number of words related to the cluster's largest topic divided by the total number of words in the cluster. The average purity for each of the quadrants was:


Formula

All five clusters sampled for the Agglom/GLR scheme contained only two words. However, each of the word pairs was ‘well-typed’ in the sense that both words in the pair were of the same semantic category. For example, Agglom/GLR found pairs {‘AG’, ‘NNA’} and {‘spermicide’, ‘nonoxynol-9’}. As it turns out, ‘nonoxynol-9’ is a spermicide and ‘AG’ and ‘NNA’ are nitric oxide synthase inhibitors. However, only 3 of the 15 clusters sampled from the other quadrants had purity scores of at least 75%. One of these was a cluster of 10 words, 8 of which were gene or protein names, e.g. ‘BZP’ and ‘CBF-A.’ Another was a cluster of five words, all of which were names or abbreviations for certain enzymes in a class called the ‘metalloproteinases.’ The last was a cluster of 12 words, 9 of which were proteins or protein-related terms.
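
The purity score defined above is simple to compute once each word in a sampled cluster has been assigned a topic by hand. The following sketch assumes such hand-assigned labels; the label strings and the function name purity are hypothetical.

```python
from collections import Counter

def purity(topic_labels):
    """Fraction of a cluster's words that belong to its largest topic."""
    if not topic_labels:
        raise ValueError("cluster is empty")
    largest = Counter(topic_labels).most_common(1)[0][1]
    return largest / len(topic_labels)

# A cluster of 10 words, 8 of them gene or protein names (labels hand-assigned).
cluster = ["gene/protein"] * 8 + ["other-a", "other-b"]
score = purity(cluster)
```

For the 10-word cluster described in the text this gives 8/10, and for the 12-word cluster with 9 protein-related terms it gives 9/12.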

2.6 Gene tagging
Having automatically generated 2143 overlapping term lists using the procedure described in Section 2, we set out to determine whether the lists could improve the performance of McDonald and Pereira's conditional random fields gene tagger (McDonald and Pereira, 2004; McCallum, 2002, http://mallet.cs.umass.edu). The tagger treats text as a sequence of tokens possessing hidden labels B-GENE, I-GENE and O, where B-GENE denotes that a token is the first token in a gene name, I-GENE denotes the token is a subsequent token in a gene name and O denotes that a token is not part of a gene name. By successfully guessing the hidden labels for the token sequence, the tagger is able to bracket gene mentions in naturally occurring text. To do this accurately, the tagger requires an informative set of features to help it choose the correct labels for the token sequence. We therefore test whether our automatically generated lists can serve as useful features in the tagging task by incorporating them as features of the model.
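
To make the labeling scheme concrete, the sketch below turns a label sequence into bracketed gene mentions. The label sequence is a hypothetical tagger output for a fragment of one of the example sentences, not a gold annotation, and the function name bracket_genes is an assumption.

```python
def bracket_genes(tokens, labels):
    """Turn B-GENE / I-GENE / O labels into a list of gene mention strings."""
    mentions, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-GENE":
            if current:
                mentions.append(" ".join(current))
            current = [token]  # start a new mention
        elif label == "I-GENE" and current:
            current.append(token)  # continue the open mention
        else:  # "O", or a stray I-GENE with no preceding B-GENE
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions

tokens = ["a", "seed-specific", "cell", "wall", "invertase", ",", "INCW2", "."]
labels = ["O", "O", "B-GENE", "I-GENE", "I-GENE", "O", "B-GENE", "O"]
mentions = bracket_genes(tokens, labels)
```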

For each term list l we create a binary feature f_l such that for a given word w (see footnote 2),

\[ f_l(w) = \begin{cases} 1 & \text{if } w \in l \\ 0 & \text{otherwise} \end{cases} \]
Thus a feature is activated whenever the token under consideration is a member of the corresponding term list. Consequently, training the tagger on a fully labeled training sequence allows the model to learn which term lists are correlated with which hidden labels.
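
A minimal sketch of these membership features follows. The list names and their contents are placeholders chosen for illustration (the words are taken from clusters mentioned earlier), and the function name term_list_features is hypothetical. Because the generated lists overlap, a single token may activate several features.

```python
def term_list_features(token, term_lists):
    """Return the names of the term lists whose binary feature fires for token."""
    return {name for name, members in term_lists.items() if token in members}

# Placeholder term lists; real ones would come from the clustering step.
term_lists = {
    "list_a": {"AG", "NNA"},
    "list_b": {"BZP", "CBF-A"},
    "list_c": {"BZP"},
}
active = term_list_features("BZP", term_lists)
```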

2.7 Tagger evaluation
To determine the degree of benefit provided by our term lists, we trained a baseline tagger containing no term lists, a second tagger augmented with hand-compiled lists of genes and a third tagger augmented with a large list of genes compiled through supervised learning. In a second set of experiments, we trained taggers with combinations of these lists to see if combining term lists could produce even better performance.

The hand-compiled lists were extracted from the Human Genome Organization's (HUGO) ‘search-data.txt’ file (Wain et al., 2004). This file contains nearly 60 000 gene names, symbols and aliases. From it we extracted the fields Approved Gene Name, Approved Gene Symbol, Previous Symbol, Aliases and Previous Gene Name, and used the contents of the fields to form separate term lists. Since the gene-name and gene-aliases fields contained multi-token genes, and since the tagger processes one token at a time, we tokenized the entries of the fields so that single tokens from the input sequence could be matched against multi-token gene names. Tokenization was performed by splitting on whitespace and stripping leading and trailing punctuation. Duplicate tokens were removed. This resulted in 57 563 tokens for the hand-curated term lists.
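
The tokenization described above might look like the following sketch. The entries are made up and are not taken from the HUGO file; treating ‘punctuation’ as Python's string.punctuation and leaving case unchanged are both assumptions, since the text does not specify either.

```python
import string

def tokenize_entries(entries):
    """Split on whitespace, strip edge punctuation, drop duplicates and empties."""
    tokens = set()
    for entry in entries:
        for piece in entry.split():
            token = piece.strip(string.punctuation)  # edges only, not interior
            if token:
                tokens.add(token)
    return tokens

# Made-up multi-token entries, for illustration only.
entries = ["cell wall invertase (INCW2)", "protein kinase, cell-cycle related"]
tokens = tokenize_entries(entries)
```

Only leading and trailing punctuation is removed, so an interior hyphen such as the one in ‘cell-cycle’ survives.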

We used the ‘Gene.Lexicon’ of Tanabe and Wilbur (2002, 2004) as our term list compiled through supervised learning methods. Tanabe and Wilbur created the ‘Gene.Lexicon’ by harvesting over two million putative genes from MEDLINE with their ABGene tagger and removing false positives with a system trained to distinguish gene names from non-gene names using morphological cues. The system was learned by running the inductive logic programming algorithm on a training set of 42 166 positive examples and 43 943 negative examples. The positive examples were names taken from LocusLink and the negative examples were mostly non-genes collected by applying the ABGene tagger to a set of documents deemed unlikely to contain genes (see footnote 3). In the end, their system filtered the two million putative genes down to a list of 1 145 913 that became the ‘Gene.Lexicon.’

All taggers were trained and tested on the 394 661 words of the Biocreative 2004 corpus using 5-fold cross validation; one-fifth of the data was used in training with the other four-fifths held out for testing. This differs from the Biocreative competition in which two-thirds of the data are used in training with one-third held out for testing. We chose a smaller-sized training set because larger sets contained too many of the term lists' words, and consequently, the benefits of using term lists were obscured.
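
This inverted split, with the small portion used for training, can be sketched as below. How sentences were actually assigned to folds is not stated, so the interleaved assignment here is an assumption, and the integers merely stand in for sentences.

```python
def inverted_folds(items, k=5):
    """Yield (train, test) pairs that train on 1/k of the data and test on the rest."""
    folds = [items[i::k] for i in range(k)]  # interleaved assignment (assumed)
    for i, train in enumerate(folds):
        test = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

sentences = list(range(15000))  # stand-ins for the 15 000 Biocreative sentences
splits = list(inverted_folds(sentences))
```

Each of the five rounds then trains on 3000 items and tests on the remaining 12 000, and every item serves as training data exactly once.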

Overfitting was a concern because our cluster analysis showed that many of the cluster features were of questionable quality. To prevent overfitting, the taggers were trained using feature induction and a Gaussian prior was used to regularize the feature weights. Training consisted of 30 rounds of feature induction wherein 700 features were added to the models at the start of each round. Feature weights were set using 10 training iterations per round. A variance of 1.0 was used for the Gaussian prior.

3 RESULTS

As shown in Table 5, the tagger augmented with term lists generated by unsupervised methods demonstrates nearly a 1% improvement in precision and a slightly better than 1% improvement in recall. Based on a two-sided, paired t-test, the increase in precision is significant at the 5% level while the increase in recall is significant at the 1% level. The tagger augmented with manually compiled term lists also outperforms the baseline tagger with an increase in precision that is significant at the 5% level. The difference in recall is not significant at 5%. Finally, all improvements between the supervised term list tagger and the baseline tagger are significant at the 1% level.


Table 5 Gene tagger performance

 
In comparisons among the augmented taggers, we find that the –0.3% difference in precision between the unsupervised and manually curated term list taggers is not significant at the 5% level while the +0.3% difference in recall is. The difference in precision between the supervised and the unsupervised term list tagger is significant at the 5% level and the difference in recall is significant at the 1% level. Between the supervised and manually curated term list tagger, the difference in precision is not significant at the 5% level but the difference in recall is.

In the combined experiments, the increases in precision and recall from adding the unsupervised term lists to the manually curated term list tagger are significant at levels 1 and 5%, respectively. The increases in precision and recall from adding the unsupervised term lists to the supervised term list tagger are significant at levels 5 and 1%, respectively. Adding the manually-curated term lists to the unsupervised term list tagger produces increases in precision and recall that are significant at 1%. Adding the manually curated lists to the supervised term list tagger produces an increase in precision that is significant at 5% (the increase in recall is not significant at the 5% level). Finally, adding the supervised term list to either the unsupervised or manually curated term list tagger produces improvements in precision and recall that are significant at 1%.
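
For readers unfamiliar with the test, a two-sided paired t-test of the kind used above can be sketched as follows. The text does not say what the paired units were; this sketch assumes the five cross-validation folds, which gives 4 degrees of freedom. The per-fold scores are invented for illustration and do not come from Table 5.

```python
import math
from statistics import mean, stdev

# Two-sided critical values of Student's t for 4 degrees of freedom (5 folds).
T_CRIT_DF4 = {0.05: 2.776, 0.01: 4.604}

def paired_t(scores_a, scores_b):
    """t statistic for the per-fold differences scores_b - scores_a."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Invented per-fold recall values, for illustration only (not from Table 5).
baseline = [0.700, 0.712, 0.695, 0.708, 0.701]
augmented = [0.712, 0.721, 0.708, 0.717, 0.715]
t = paired_t(baseline, augmented)
significant_5pct = abs(t) > T_CRIT_DF4[0.05]
```

Because the test is paired, a small but consistent per-fold gain can be significant even when the absolute improvement is around 1%.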

4 DISCUSSION

The results of the previous section demonstrate that the unsupervised term lists are able to improve precision to a degree that closely matches that of the manually curated term lists and are able to improve recall to an extent surpassing that of manual curation. Furthermore, the improvement in F-measure is half that of the supervised term lists but is obtained without any custom tailoring to the domain. Additionally, the results in Table 5 show that the benefits provided by term lists are cumulative. This fact justifies the diversity approach we employed in using different data representations, feature weighting schemes and clustering algorithms to generate term lists that vary in coverage, purity and domain specificity. But it also demonstrates that unsupervised learning of term lists is useful even in situations where supervised learning is possible. In such situations, unsupervised term lists can be used to complement the supervised or manually curated term lists and to produce even greater performance. Additionally, unsupervised learning has the benefit that it is available in situations where supervised learning is not. In constructing the ‘Gene.Lexicon’, Tanabe and Wilbur were able to take advantage of their already existing ABGene tagger and the LocusLink database. But pre-existing entity taggers and high quality databases may not exist when porting systems across languages, when working in proprietary domains where access to proprietary databases is limited or when working in new domains for which resources are still being developed.

While statistically significant, the gains in Table 5 are relatively modest and it is therefore natural to ask whether unsupervised (or even supervised) learning techniques can create term lists that produce more substantial improvements. Recent work on entity tagging in the newswire domain suggests that they can. Both Miller et al. (2004) and Freitag (2004) have demonstrated that word clustering can improve entity tagging by as much as 17% (F-score), where the amount of benefit depends on the type of entity being extracted and the amount of training data used. Both authors evaluate on the MUC-6 corpus, a newswire corpus where the task is to extract persons, locations, organizations, dates, times, money and percents. Miller et al. use the class-based n-gram model described in Brown et al. (1992) to partition words into clusters of varying granularity. As in our work, they incorporate cluster information as features in a discriminative tagger. At 50 000 words of training data, their tagger's F-score increases from 83 to 90%. Freitag uses the information theoretic co-clustering algorithm of Dhillon et al. (2003) to cluster words based on their left and right contexts. His boosted wrapper induction tagger receives a 17% increase on organization entities from the cluster features.

The distributional clustering techniques we use are closely related to those of Lin and Pantel (2001), especially since we use Lin's minipar parser to extract dependency triples. However, Lin and Pantel employ a different measure of word similarity and extract other syntactic relations besides the eight we mention in Table 2 (Lin, 1998a). The authors do not demonstrate whether their induced semantic classes provide benefit in a larger information extraction system, though in Lin and Pantel (2001) they do use human judges to evaluate the cohesiveness of their clusters. On a five-point scale with five being the highest, the newswire clusters receive an average rating of 4.26, while the MEDLINE clusters receive an average rating of 3.37 (see footnote 4). No attempt was made to uncover the cause of this discrepancy, but we can speculate that it is caused by poorer parser performance on MEDLINE, intrinsic differences between the text of the two domains, or both. If the parser performance is to blame, then this would certainly impact the quality of our generated term lists since minipar is an integral part of our pipeline.

Regardless of the discrepancy's cause, it is consistent with the general lag in performance of biomedical information extraction systems behind those of newswire. In Hirschman et al. (2002), the authors note that the scores of biomedical entity taggers lag ~10% behind those of newswire. They cite less experience, fewer annotated training resources and lower interannotator agreement as potential contributing factors. On one hand, the results of Miller and Freitag suggest that further work on generating term lists for biomedical taggers should yield better results, especially since both authors have previously built taggers for the MUC-6 dataset (Bikel et al., 1997; Freitag and Kushmerick, 2000). On the other hand, the lower cohesiveness ratings reported by Lin and Pantel, as well as the lower interannotator agreement discussed by Hirschman, suggest that perhaps the biomedical tasks are intrinsically more difficult. However, the across-the-board improvements demonstrated by multiple authors in multiple domains confirm that unsupervised term list generation is a useful tool for enhancing the performance of information extraction systems.

5 CONCLUSION

In this paper we have proposed new unsupervised methods for generating lists of terms and entities. The methods are neither entity nor corpus specific and the lists they generate significantly improve the precision and recall of a named entity tagger. Comparisons against taggers augmented with manually compiled term lists show that the unsupervised term lists produce gains in precision and recall comparable with and surpassing those of hand-compiled lists. Comparisons against taggers augmented with term lists induced by supervised learning show that the unsupervised term lists capture roughly half the gains produced by their supervised counterparts. Experiments in combining term lists show that gains produced by term lists are cumulative, and consequently, that unsupervised lists can be used to complement lists generated through supervised learning or manual curation. While the gains on our biomedical tagging task were more modest than the ones reported for newswire, they are consistent with the current lag of biomedical information extraction behind newswire. Gains reported in multiple domains confirm that unsupervised term list generation is a useful tool to include among the set available to builders of information extraction systems. Directions for future work include extending our methods to clustering other parts of speech so that more words in the corpus can be leveraged in information extraction tasks. With further refinements, the proposed unsupervised techniques should be able to yield even greater gains.

Acknowledgments

The authors wish to thank Ryan T. McDonald for his helpful advice and assistance. This work was supported in part by NIH Training grant T32HG 0000461.

Conflict of Interest: none declared.

FOOTNOTES

Associate Editor: Alfonso Valencia

1 In a later set of experiments, we evaluated a wider range of parameter values on the actual test-set and found that none of these values outperformed our initial parameter choices at a P-value < 0.153. However, these experiments did demonstrate that we overpartitioned the vocabulary and that better parameter choices would have been 50 ≤ K ≤ 150 and 1 ≤ T ≤ 10.

2 Technically, the term list features are of the form Formula, where f_l ignores all its arguments except for y_i and the i-th component of Formula.

3 A naive Bayes classifier was used to classify documents as containing or not containing genes. The strings that ABGene tagged as genes in the documents classified as least likely to contain genes were used as the negative examples.

4 Differences in familiarity with the two domains were taken into account by having medical doctors judge the MEDLINE clusters.

Received on April 29, 2005; revised on October 20, 2005; accepted on October 20, 2005

REFERENCES

    Bikel, D.M., Miller, S., Schwartz, R.L., Weischedel, R.M. (1997) Nymble: a high-performance learning name-finder. Proceedings of the 5th Conference on Applied Natural Language Processing 1997, Washington, DC, pp. 194–201.

    Brown, P.F., et al. (1992) Class-based n-gram models of natural language. Comput. Linguist., 18, 467–479.

    Church, K.W. and Hanks, P. (1990) Word association norms, mutual information, and lexicography. Comput. Linguist., 16, 22–29.

    Dhillon, I.S., Mallela, S., Modha, D.S. (2003) Information-theoretic co-clustering. Proceedings of the KDD 2003, Washington, DC, pp. 89–98.

    Dunning, T. (1993) Accurate methods for the statistics of surprise and coincidence. Comput. Linguist., 19, 61–74.

    Freitag, D. (2004) Trained named entity recognition using distributional clusters. Proceedings of the EMNLP 2004, Barcelona, Spain, pp. 262–269.

    Freitag, D. and Kushmerick, N. (2000) Boosted wrapper induction. Proceedings of the 17th National Conference on Artificial Intelligence and 12th Conference on Innovative Applications of AI, Austin, TX, pp. 577–583.

    Hearst, M.A. (1992) Automatic acquisition of hyponyms from large text corpora. Proceedings of the 14th International Conference on COLING, Nantes, France, pp. 539–545.

    Hindle, D. (1990) Noun classification from predicate-argument structures. Proceedings of the 28th Conference of ACL, Pittsburgh, PA, pp. 268–275.

    Hirschman, L., et al. (2002) Rutabaga by any other name: extracting biological names. J. Biomed. Inform., 35, 247–259.

    Hirschman, L., Yeh, A., Blaschke, C., Valencia, A. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, Suppl 1: S1.

    Lin, D. (1998a) Automatic retrieval and clustering of similar words. Proceedings of the COLING-ACL, Montreal, Canada, pp. 768–774.

    Lin, D. (1998b) Dependency-based evaluation of MINIPAR. Proceedings of the Workshop on the Evaluation of Parsing Systems, at LREC-98, Granada, Spain.

    Lin, D. and Pantel, P. (2001) Induction of semantic classes from natural language text. Proceedings of the KDD 2001, San Francisco, CA, pp. 317–322.

    McCallum, A.K. (2002) Mallet: A machine learning for language toolkit.

    McDonald, R. and Pereira, F. (2004) Identifying gene and protein mentions in text using conditional random fields. A critical assessment of text mining methods in molecular biology, March 28–31, 2004, Granada, Spain. BMC Bioinformatics, 6, Suppl 1: S6.

    Miller, S., Guinness, J., Zamanian, A. (2004) Name tagging with word clusters and discriminative training. Proceedings of the HLT-NAACL 2004, Boston, MA, pp. 337–342.

    Pereira, F.C.N., Tishby, N., Lee, L. (1993) Distributional clustering of English words. Proceedings of the Meeting of the ACL, Columbus, OH, pp. 183–190.

    Riloff, E. and Shepherd, J. (1997) A corpus-based approach for building semantic lexicons. Proceedings of the EMNLP 1997, Providence, RI, pp. 117–124.

    Roark, B. and Charniak, E. (1998) Noun-phrase co-occurrence statistics for semiautomatic semantic lexicon construction. Proceedings of the 17th International Conference on Computational Linguistics, Montreal, Canada, pp. 1110–1116.

    Tanabe, L. and Wilbur, W.J. (2002) Tagging gene and protein names in full text articles. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, PA, pp. 9–13.

    Tanabe, L. and Wilbur, W.J. (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J. Bioinformatics Comput. Biol., 1, 611–626.

    Wain, H.M., et al. (2004) Genew: the human gene nomenclature database, 2004 updates. Nucleic Acids Res., 32, 255–257.





