Crowdsourced semantic annotation of scientific publications and tabular data in PDF

Takis, Jaana; Saiful Islam, A.Q.M.; Lange, Christoph; Auer, Sören

doi:10.1145/2814864.2814887

2015

Conference Paper

Abstract

Significant amounts of knowledge in science and technology have so far not been published as Linked Open Data but are contained in the text and tables of legacy PDF publications. Making such information available as RDF would, for example, provide direct access to claims and facilitate surveys of related work. A lot of valuable tabular information that till now only existed in PDF documents would also finally become machine understandable. Instead of studying scientific literature or engineering patents for months, it would be possible to collect such input by simple SPARQL queries. The SemAnn approach enables collaborative annotation of text and tables in PDF documents, a format that is still the common denominator of publishing, thus maximising the potential user base. The resulting annotations in RDF format are available for querying through a SPARQL endpoint. To incentivise users with an immediate benefit for making the effort of annotation, SemAnn recommends related papers, taking into account the hierarchical context of annotations in a novel way. We evaluated the usability of SemAnn and the usefulness of its recommendations by analysing annotations resulting from tasks assigned to test users and by interviewing them. While the evaluation shows that even few annotations lead to a good recall, we also observed unexpected, serendipitous recommendations, which confirms the merit of our low-threshold annotation support for the crowd.