2005
Poster
Title
Biological and chemical entity recognition from text using conditional random fields (CRF)
Title Supplements
Poster at the German Conference on Bioinformatics (GCB '05), Hamburg, October 5th-7th, 2005
Abstract
Most of the information in the biomedical domain is available only as unstructured text. Among other uses, this wealth of information can help to interpret the results of expression experiments or to derive pathways of biological or chemical interactions. Text mining is one way to obtain this information. The first step toward efficient information extraction from text is to accurately assign meaningful tags from a well-defined ontology to the entities mentioned in it. For biological entity recognition (BER) tasks, problems arise because there is no unified nomenclature for protein and gene names that all scientists use. Further problems are the ambiguity of names and the occurrence of multiword terms. Here, we present our work on applying machine learning (ML) techniques with a rich set of features to BER and chemical entity recognition (CER) from scientific text. Our approach follows the conventions and uses the data sets of the shared task of the 'International Joint Workshop on Natural Language Processing in Biomedicine and its Applications 2004' (JNLPBA) [Kim et al. 2004]. The presented work uses the GENIA corpus 3.02 [Kim et al. 2003], which contains 2,000 MEDLINE abstracts with 400,000 words and nearly 100,000 hand-coded annotations for biological terms.
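The abstract does not list the features or the tag encoding used on the poster. As a rough illustration of the general setup only, the following Python sketch shows how multiword entity annotations can be encoded as per-token tags in the IOB scheme used by the JNLPBA shared task, and how simple per-token features of the kind commonly fed to a CRF might be computed. The function names, the feature set, and the example sentence are hypothetical and are not taken from the poster or from the GENIA corpus.

```python
def iob_tags(tokens, spans):
    """Encode entity spans as IOB tags, one tag per token.

    spans: list of (start, end_exclusive, label) over token indices.
    Multiword terms get B- on the first token and I- on the rest.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def token_features(tokens, i):
    """Illustrative (assumed) orthographic and context features for token i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


# Made-up example sentence with two multiword entities.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B", "."]
spans = [(0, 2, "DNA"), (4, 6, "protein")]

tags = iob_tags(tokens, spans)
features = [token_features(tokens, i) for i in range(len(tokens))]

for tok, tag in zip(tokens, tags):
    print(tok, tag)
```

A sequence model such as a CRF would then be trained on pairs of feature sequences and tag sequences, so that the label of one token can depend on the labels of its neighbours, which is what lets it keep multiword terms together.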