Biological and chemical entity recognition from text using conditional random fields (CRF)

Revillion, T.; Friedrich, C.M.; Fluck, J.; Hofmann, M.

2005

Poster

Abstract

Most of the information in the biomedical domain exists as unstructured text. Amongst other uses, this wealth of information can help to interpret the results of expression experiments or to derive pathways of biological or chemical interactions. Text mining offers one way to obtain this information. The first step towards extracting information from text efficiently is to accurately assign meaningful tags from a well-defined ontology to the entities mentioned in it. For biological entity recognition (BER) tasks, problems arise because no unified nomenclature for protein and gene names is used by all scientists. Further problems lie in the ambiguity of names and in the occurrence of multiword terms. Here, we present our work on applying machine learning (ML) techniques with a rich set of features to BER and chemical entity recognition (CER) in scientific text. Our process follows the conventions and uses the data sets provided for the shared task of the 'International Joint Workshop on Natural Language Processing in Biomedicine and its Applications 2004' (JNLPBA) [Kim et al. 2004]. The presented work uses the GENIA corpus 3.02 [Kim et al. 2003], which contains 2,000 MEDLINE abstracts with 400,000 words and nearly 100,000 hand-coded annotations for biological terms.
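As an illustration of the tagging step described in the abstract, the following minimal sketch shows how per-token features might be computed and how token-level tags can be grouped back into multiword entities. It is not the authors' implementation: the function names `token_features` and `bio_to_spans`, the specific features, and the example sentence are all assumptions made for illustration. The tag names assume the B-/I-/O scheme and entity classes (such as `protein` and `DNA`) used in the JNLPBA shared task data.

```python
def token_features(tokens, i):
    """Hypothetical orthographic and context features for the token at index i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_hyphen": "-" in tok,
        "init_cap": tok[:1].isupper(),
        "all_caps": tok.isupper(),
        "suffix3": tok[-3:].lower(),
        # Neighbouring tokens give the sequence model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def bio_to_spans(tokens, tags):
    """Group B-/I- tags into (entity_text, entity_type) pairs so multiword terms stay whole."""
    spans, current, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # Close any open entity before starting a new one.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-"):
            current.append(tok)
        else:  # "O": token is outside any entity
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


# Made-up example sentence with hand-assigned tags (an assumption, not corpus data).
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B", "."]
tags = ["B-DNA", "I-DNA", "O", "O", "B-protein", "I-protein", "O"]

features = [token_features(tokens, i) for i in range(len(tokens))]
entities = bio_to_spans(tokens, tags)
```

In a CRF-based tagger, feature dictionaries of this kind would be the per-token input to the sequence model, and the predicted tags would be decoded into entity spans in the same way as `bio_to_spans` does for the hand-assigned tags here.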

Conference

German Conference on Bioinformatics (GCB) 2005
