Enhancing Digital Libraries with Automated Definition Generation

Zielinski, Andrea; Hirzel, Simon; Arnold-Keifer, Sonja

doi:10.1145/3677389.3702536

2024

Conference Paper

Abstract

Scientific domains encompass many concepts that require a concise term definition to enable a common understanding among researchers, particularly in interdisciplinary fields. In digital libraries, information access and sharing are often facilitated by terminology databases. However, building such resources manually is expensive and requires expert knowledge. Automatic definition generation for scientific terms, which can reduce this manual burden, has recently become an active research topic. However, current methods rely heavily on large language models (LLMs) that store factual knowledge in their parameters, so that knowledge cannot be easily updated for emerging scientific terms. Furthermore, a major shortcoming of these models is that they are prone to hallucination and their output is difficult to control. To bridge these gaps, we propose to address the task of definition generation through guided abstractive summarization, incorporating key information from external resources. At test time, we augment the model with retrieved abstracts from Scopus and use automatically extracted topics and keywords as guidance, both of which are essential for definition generation. To this end, our approach takes into account two relevant sub-tasks: (a) predicting the topic class and (b) generating hypernym candidates for the term. Our proposed pipelined approach for automatic guided definition generation achieves a significant performance improvement over the standard baselines as well as relevant prior work on this problem. We use BLEU, ROUGE and BERTScore to automatically evaluate the quality of the systems on our benchmark and carry out a human evaluation to assess the fluency, relevance, coherence and factuality of the output. Our experiments show that LLMs can provide fluent and coherent definitions that are often on par with human-created definitions. Yet, there is still room for improvement in identifying relevant content and in factual correctness.
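
The guided setup described in the abstract can be illustrated with a minimal sketch of how guidance signals and retrieved abstracts might be combined into a single input for a summarization model. This is not the authors' implementation: the `Guidance` class, the `build_guided_input` function, the field layout of the input string, and the example values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Guidance:
    """Hypothetical container for the guidance signals named in the abstract."""
    term: str
    topic: str             # assumed output of sub-task (a): topic class prediction
    hypernyms: List[str]   # assumed output of sub-task (b): hypernym candidates
    keywords: List[str]    # automatically extracted keywords


def build_guided_input(guidance: Guidance,
                       abstracts: List[str],
                       max_abstracts: int = 3) -> str:
    """Concatenate guidance and retrieved abstracts into one model input.

    The "field: value" header format is an assumption, not taken from the paper.
    """
    header = " | ".join([
        f"term: {guidance.term}",
        f"topic: {guidance.topic}",
        f"hypernyms: {', '.join(guidance.hypernyms)}",
        f"keywords: {', '.join(guidance.keywords)}",
    ])
    # Keep only the top-ranked abstracts and number them for readability.
    context = [
        f"[{i}] {text.strip()}"
        for i, text in enumerate(abstracts[:max_abstracts], start=1)
    ]
    return "\n".join([header, *context])


if __name__ == "__main__":
    # Placeholder values; real abstracts would come from a retrieval step.
    guidance = Guidance(
        term="example term",
        topic="example topic",
        hypernyms=["concept"],
        keywords=["keyword one", "keyword two"],
    )
    print(build_guided_input(guidance, ["Placeholder abstract text."]))
```

Keeping the guidance in a compact header ahead of the retrieved text is one possible design; the resulting string would then be passed to whichever abstractive summarization model is used to produce the definition.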