Comma, OneCap T (2008)
Abstract
In this paper, we propose an ensemble of classifiers for biomedical named entity recognition in which three classifiers (one SVM and two HMMs) are combined effectively using a simple majority voting strategy. In addition, we incorporate an abbreviation resolution module, a protein/gene name refinement module, and a simple dictionary matching module into the system to further improve performance. Evaluation shows that our system achieves the best performance (F-measure 82.58) on the closed test of the BioCreative protein/gene name recognition task (Task 1A).

1 Feature Representation

In the competition, the following five features are applied to capture the special characteristics of protein/gene names:

• Surface Word: If a word occurs in the vocabulary, the dimension of the SVM feature vector that corresponds to the word's position in the vocabulary is set to 1. The vocabulary is constructed from all the words in the training data (filtered with threshold 3).
• Orthographic Feature: This feature captures capitalization, digit, and word-formation information. Table 1 shows a complete list in descending order of priority.
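Two of the mechanisms described above, the surface-word feature and the majority vote over three classifiers, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, "threshold 3" is assumed to mean "keep words that occur at least three times", and the fallback to the first classifier when no label wins a majority is an assumption, since the text does not say how such cases are resolved.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=3):
    """Map each word seen at least `min_count` times to a fixed index.

    Assumption: "filtered with threshold 3" means a minimum frequency of 3.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    kept = sorted(word for word, count in counts.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}


def surface_word_vector(word, vocabulary):
    """Return a one-hot list: 1 at the word's vocabulary position, else all 0."""
    vector = [0] * len(vocabulary)
    index = vocabulary.get(word)
    if index is not None:
        vector[index] = 1
    return vector


def majority_vote(labels, fallback_index=0):
    """Return the label chosen by more than half of the classifiers.

    Assumption: if no label has a strict majority, fall back to the
    prediction of the classifier at `fallback_index`.
    """
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:
        return label
    return labels[fallback_index]


if __name__ == "__main__":
    training = [
        ["IL-2", "gene", "expression"],
        ["IL-2", "gene"],
        ["IL-2", "gene", "binding"],
    ]
    vocabulary = build_vocabulary(training)
    print(surface_word_vector("gene", vocabulary))
    # One predicted tag per classifier (SVM, HMM 1, HMM 2) for a single token.
    print(majority_vote(["B", "B", "O"]))
```

A dense list keeps the sketch readable. With a realistic vocabulary, a sparse representation would be the more practical choice.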