MESD: Metadata Extraction from Scholarly Documents

Authors: Boukhers, Zeyd; Yang, Cong
Date: 2025-08-04
Year: 2025
URL: https://publica.fraunhofer.de/handle/publica/490166
Scopus ID: 2-s2.0-105008497172

Abstract: This paper presents an overview of the Metadata Extraction from Scholarly Documents (MESD) shared task, which was designed to address the challenge of extracting structured metadata (e.g., title, author, abstract) from scientific publications. The task aimed to promote the development of techniques for making scholarly data more Findable, Accessible, Interoperable, and Reusable (FAIR) by improving metadata extraction from PDF documents. We describe the task design and the creation of two complementary datasets: (1) the S2ORC_Exp500v1 dataset, consisting of 500 training samples, 100 validation samples, and 100 test samples with text-based annotations, and (2) the SSOAR Multidisciplinary Vision Dataset (SSOARGMVD), containing more than 8000 documents with bounding box annotations suitable for computer vision approaches. We discuss potential directions for future research in metadata extraction from scholarly documents, highlighting the opportunities presented by these new resources.

Language: en
Keywords: document processing; metadata extraction; natural language processing; scholarly documents
Title: MESD: Metadata Extraction from Scholarly Documents - A Shared Task Overview
Type: conference paper