Publications Search Results

  • Publication
    Detecting miRNA mentions and relations in biomedical literature
    (2015)
    Bagewadi, S.; Bobic, T.; Hofmann-Apitius, M.; Fluck, J.; Klinger, R.
    Introduction: MicroRNAs (miRNAs) have demonstrated their potential as post-transcriptional gene expression regulators, participating in a wide spectrum of regulatory events such as apoptosis, differentiation, and stress response. Apart from the role of miRNAs in normal physiology, their dysregulation is implicated in a vast array of diseases. Dissection of miRNA-related associations is valuable for understanding their mechanisms in diseases, leading to the discovery of novel miRNAs for disease prognosis, diagnosis, and therapy. Motivation: Apart from databases and prediction tools, miRNA-related information is largely available as unstructured text. Manual retrieval of these associations can be labor-intensive due to the steadily growing number of publications. Additionally, most of the published miRNA entity recognition methods are keyword-based and require further manual inspection to retrieve relations. Although several databases host miRNA associations derived from text, lower sensitivity and a lack of published details for miRNA entity recognition and associated relation identification have motivated the need for comprehensive methods that are freely available to the scientific community. Additionally, the lack of a standard corpus for miRNA relations has made it difficult to evaluate the available systems. We propose methods to automatically extract mentions of miRNAs, species, genes/proteins, diseases, and relations from scientific literature. Our generated corpora, along with dictionaries and a miRNA regular expression, are freely available for academic purposes. To our knowledge, these resources are the most comprehensive developed so far. Results: The identification of specific miRNA mentions reaches a recall of 0.94 and a precision of 0.93. Extraction of miRNA-disease and miRNA-gene relations leads to an F1 score of up to 0.76.
    A comparison of the information extracted by our approach with the databases miR2Disease and miRSel for the extraction of Alzheimer's disease-related relations shows the capability of our proposed methods to identify correct relations with improved sensitivity. The published resources and described methods can help researchers maximize the retrieval of miRNA relations and generate miRNA regulatory networks.
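The abstract above mentions a miRNA regular expression without showing it. As a rough illustration of what regex-based miRNA mention detection can look like, here is a minimal Python sketch. The pattern and the function name `find_mirna_mentions` are assumptions made for this example. They are not the authors' published resource and cover only a few common naming forms.

```python
import re

# Hypothetical pattern for illustration only (not the published regular expression).
# Optional three-letter species prefix (e.g. "hsa-"), then "miR", "miRNA",
# "microRNA" or "let", a numeric identifier, an optional letter suffix,
# an optional arm or copy suffix ("-5p", "-3p", "-1") and an optional "*".
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?(?:mi(?:cro)?RNA|miR|let)-?\d+[a-z]?(?:-(?:[35]p|\d+))?\*?",
    re.IGNORECASE,
)


def find_mirna_mentions(text: str) -> list[str]:
    """Return all substrings of `text` that look like specific miRNA names."""
    return [match.group(0) for match in MIRNA_PATTERN.finditer(text)]


sentence = "Expression of hsa-miR-21 and let-7a was reduced, while miR-146a-5p increased."
print(find_mirna_mentions(sentence))
```

A keyword-style pattern like this only finds specific miRNA names. Generic mentions such as "microRNAs" and the relation extraction step described in the abstract would need additional components.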
  • Publication
    Weakly Labeled Corpora as Silver Standard for Drug-Drug and Protein-Protein Interaction
    (2012)
    Thomas, P.; Bobic, T.; Hofmann-Apitius, M.; Leser, U.; Klinger, R.
  • Publication
    Improving Distantly Supervised Extraction of Drug-Drug and Protein-Protein Interactions
    (2012)
    Bobic, T.; Klinger, R.; Thomas, P.; Hofmann-Apitius, M.
    Relation extraction is frequently and successfully addressed by machine learning methods. The downside of this approach is the need for annotated training data, typically generated through tedious, cost-intensive manual work. Distantly supervised approaches make use of weakly annotated data, such as automatically annotated corpora. Recent work in the biomedical domain has applied distant supervision to protein-protein interaction (PPI) extraction with reasonable results, making use of the IntAct database. Such data is typically noisy, and heuristics to filter the data are commonly applied. We propose a constraint to increase the quality of the data used for training, based on the assumption that no self-interaction of real-world objects is described in sentences. In addition, we make use of the University of Kansas Proteomics Service (KUPS) database. These two steps show an increase of 7 percentage points (pp) for the PPI corpus AIMed. We demonstrate the broad applicability of our approach by using the same workflow for the analysis of drug-drug interactions, utilizing relationships available from the drug database DrugBank. We achieve 37.31% in F1 measure without manually annotated training data on an independent test set.
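The labeling idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function name `weakly_label` and the data layout are hypothetical, and treating same-entity pairs as negative instances is one plausible reading of the self-interaction constraint, not a confirmed detail of the authors' workflow.

```python
from itertools import combinations


def weakly_label(entities, known_pairs):
    """Weakly label every entity pair that co-occurs in one sentence.

    entities: normalized identifiers of the entity mentions in the sentence.
    known_pairs: set of frozensets with interacting identifiers, standing in
        for a knowledge base such as IntAct or DrugBank (hypothetical layout).
    """
    labeled = []
    for a, b in combinations(entities, 2):
        if a == b:
            # Assumption: two mentions of the same object never form a
            # positive instance, even if the knowledge base lists one.
            label = False
        else:
            label = frozenset((a, b)) in known_pairs
        labeled.append((a, b, label))
    return labeled


# The knowledge base lists P1-P2 and a P1 self-interaction.
known = {frozenset(("P1", "P2")), frozenset(("P1",))}
for instance in weakly_label(["P1", "P2", "P1", "P3"], known):
    print(instance)
```

Without the `a == b` branch, the pair of two `P1` mentions would be labeled positive, because `frozenset(("P1", "P1"))` equals the self-interaction entry in the knowledge base.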
  • Publication
    e-Government and Policy Simulation in Intelligent Virtual Environments
    (2012)
    Aisopos, F.; Kardara, M.; Senger, P.; Klinger, R.; Papaoikonomou, A.; Tserpes, K.; Gardner, M.; Varvarigou, T.
  • Publication
    Online communities support policy-making: The need for data analysis
    (2012)
    Klinger, R.; Senger, P.; Madan, S.; Jacovi, M.
    Policy decisions in governmental models are often based on their perception and acceptance by the general public. Traditional methods for harvesting opinions, such as telephone or street surveys, are time-intensive and costly, and direct interaction between a governmental member and the population is limited. Social media offer the chance to easily collect a high number of opinions and proposals in the form of poll participation or interactive debate contributions. Debates about political topics in particular can generate data that are hard to interpret because of their length and complexity. We propose a collection of methods to support a decision maker in gaining an overview of textual debates coming from several social media, saving time and effort in manual analysis. Our approach enables an efficient decision-making process through a combination of automatic topic clustering, sentiment analysis, filtering, and search functionalities aggregated in a graphical user interface. We present an implementation and a use case proving the usefulness of the proposed methodologies.
  • Publication
    Challenges in the association of human single nucleotide polymorphism mentions with unique database identifiers
    (2011)
    Thomas, P.E.; Klinger, R.; Furlong, L.I.; Hofmann-Apitius, M.; Friedrich, C.M.
    Background: Most information on genomic variations and their associations with phenotypes is covered exclusively in scientific publications rather than in structured databases. These texts commonly describe variations using natural language; database identifiers are seldom mentioned. This complicates the retrieval of variations and associated articles, as well as information extraction, e.g. the search for biological implications. To overcome these challenges, procedures to map textual mentions of variations to database identifiers need to be developed. Results: This article describes a workflow for the normalization of variation mentions, i.e. their association with unique database identifiers. Common pitfalls in the interpretation of single nucleotide polymorphism (SNP) mentions are highlighted and discussed. The developed normalization procedure achieves a precision of 98.1% and a recall of 67.5% for unambiguous association of variation mentions with dbSNP identifiers on a text corpus based on 296 MEDLINE abstracts containing 527 mentions of SNPs. The annotated corpus is freely available at http://www.scai.fraunhofer.de/snp-normalization-corpus.html. Conclusions: Comparable approaches usually focus on variations mentioned on the protein sequence and neglect problems for other SNP mentions. The results presented here indicate that normalizing SNPs described on the DNA level is more difficult than normalizing SNPs described on the protein level. The challenges associated with normalization are exemplified by ambiguities and errors that occur in this corpus.
  • Publication
    Conditional random fields for named entity recognition. Feature selection and optimization in biology and chemistry
    (Shaker, 2011)
    Klinger, R.
    Most knowledge is stored and communicated in the form of natural language text. Databases including abstracts of journal articles or proceedings contributions are freely available. To make this knowledge available in a structured form, allowing for deeper analysis and combination with existing databases, technologies from the field of information extraction are necessary. A foundation for most methods, such as relation extraction or semantic search, is named entity recognition. Conditional random fields are an established probabilistic method for labeling sequences. Nevertheless, the adaptation to novel domains or entity classes of interest requires manual effort. This dissertation presents such adaptations for entity classes from the biological and chemical domains. Workflows are presented for the detection of gene and protein names, mentions of mutations of genes, and chemical names following the nomenclature of the International Union of Pure and Applied Chemistry. For these classes, training corpora are discussed and built. Questions addressed include how to use knowledge from multiple annotators, how stable a model is on data from different time ranges, and how to normalize found entities. The presented use cases exemplify the need for feature design and selection. Different methods for choosing a meaningful feature subset, which clearly decrease the run time and the number of features, are developed and evaluated. To extend the applicability of conditional random fields, a training method based on multicriterial optimization is introduced, allowing the user to choose between different precision-recall weightings without an increase in run time. Additionally, it is analysed whether automatically selected structures going beyond the common linear structure of conditional random fields can be beneficial for named entity recognition. These methods and analyses support the generation of workflows to build novel named entity recognition tools with less user intervention.
  • Publication
    Learning Protein Protein Interaction Extraction using Distant Supervision
    (2011)
    Thomas, P.; Solt, I.; Klinger, R.; Leser, U.
    Most relation extraction methods, especially in the domain of biology, rely on machine learning to classify a co-occurring pair of entities in a sentence as related or not. Such an approach requires a training corpus, which involves expert annotation and is tedious, time-consuming, and expensive. We overcome this problem by using existing knowledge in structured databases to automatically generate a training corpus for protein-protein interactions. An extensive evaluation of different instance selection strategies is performed to maximize robustness on this presumably noisy resource. Successful strategies to consistently improve performance include a majority voting ensemble of classifiers trained on subsets of the training corpus and the use of knowledge bases consisting of proven non-interactions. Our best configured model, built without manually annotated data, shows very competitive results on several publicly available benchmark corpora.
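The majority voting ensemble mentioned in this abstract can be illustrated generically. The following Python sketch assumes a user-supplied training function and uses hypothetical names (`train_ensemble`, `majority_vote`). It does not reproduce the classifiers, subset sizes, or sampling scheme used in the publication.

```python
import random
from collections import Counter


def train_ensemble(train_fn, instances, n_models=5, fraction=0.5, seed=0):
    """Train `n_models` classifiers, each on a random subset of `instances`.

    train_fn: hypothetical callable that takes a list of training instances
        and returns a classifier, i.e. a callable mapping an input to a label.
    """
    rng = random.Random(seed)
    subset_size = max(1, int(len(instances) * fraction))
    return [train_fn(rng.sample(instances, subset_size)) for _ in range(n_models)]


def majority_vote(models, x):
    """Return the label predicted by most models for input `x`."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Toy usage with fixed stand-in classifiers instead of trained ones.
toy_models = [lambda x: True, lambda x: False, lambda x: True]
print(majority_vote(toy_models, "Protein A binds protein B."))
```

With binary labels, an odd number of models avoids ties in the vote.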
  • Publication
    Automatically Selected Skip Edges in Conditional Random Fields for Named Entity Recognition
    (2011)
    Klinger, R.
    Incorporating distant information via manually selected skip chain templates has been shown to be beneficial for the performance of conditional random field models compared to a simple linear-chain structure (Sutton and McCallum, 2007; Galley, 2006; Liu et al., 2010). The set of properties to be captured by a template is typically chosen manually with respect to the application domain. In this paper, a search strategy to find meaningful skip chains independent of the application domain is proposed. From a huge set of potentially beneficial templates, some can be shown to have a positive impact on performance. The search for a meaningful graphical structure demonstrates the usefulness of the approach with an increase of nearly 2% in F1 measure on a publicly available data set (Klinger et al., 2008).