2025
Conference Paper
Title
Historic to FAIR: Leveraging LLMs for Historic Term Identification and Standardization
Abstract
As the availability of historical biodiversity data continues to grow, ensuring its usability through adherence to the FAIR principles (Findable, Accessible, Interoperable, and Reusable) has become increasingly essential. This study addresses a key challenge in biodiversity data drawn from historical texts: identifying and interpreting common species names and scientific names. We highlight five main issues associated with historical common names: variations in spelling, the creation of new terms, shifts from broad historical names to more specific modern ones, shifts from specific historical names to broader modern ones, and the renaming of historical terms. To tackle these challenges, we explore the application of a large language model (GPT-4) for entity detection and terminology alignment. Our findings show that GPT-4, when provided with a small amount of context, can effectively identify both historical common species names and modern scientific names. On a test dataset, the model detected 92% of historical common names and correctly identified 98% of scientific terms. Additionally, for four of the five identified challenges, the LLM provided meaningful insights, including successfully matching historical common names to their modern counterparts. These results indicate that the model embeds an understanding of how biodiversity terminology has evolved, which underscores its potential to mobilize historical biodiversity data according to FAIR principles.
Author(s)
Open Access
Rights
CC BY-SA 4.0: Creative Commons Attribution-ShareAlike
Language
English