2025
Journal Article
Title
Fine-tuning large language models with contrastive margin ranking loss for selective entity matching in product data integration
Abstract
Entity Matching (EM) concerns identifying entities from different data sources that correspond to the same real-world object. It is widely used for product data integration in e-commerce, product classification, and inventory management, enabling the matching of duplicate product records with heterogeneous descriptions across platforms and software systems. The standard EM solution consists of two steps: a blocking step to retrieve a subset of candidates and a pairwise matching step to classify whether the query entity matches each candidate. However, a significant challenge arises when pairwise matching fails to account for similar distractors within the candidate subset, often leading to false positive matches. This issue has been largely overlooked in prior work and existing benchmark datasets. In this study, we address this gap through three key contributions. First, we revisit the standard pairwise EM setting by recompiling existing benchmark datasets to include more hard negative (HN) candidates, which are semantically similar to their corresponding query entities. We then evaluate state-of-the-art (SOTA) pairwise matchers on these recompiled datasets, revealing the limitations of the conventional pairwise EM approach under more challenging and realistic conditions. Second, we propose a selective EM approach that formulates EM as a listwise selection task, in which the query entity is compared directly with the entire candidate set rather than evaluated through independent pairwise classifications. Accordingly, we introduce a new evaluation framework, including the recompiled benchmark datasets and a new evaluation metric. Third, we propose a selective EM method, Mistral4SelectEM, which adapts a large language model for selective EM by structuring it as a Siamese network and fine-tuning it with a novel contrastive margin ranking loss (CMRL). The loss is designed to enhance the model's ability to distinguish true positives from semantically similar HNs.
Extensive experiments demonstrate that our method outperforms SOTA pairwise EM approaches in both efficiency and performance across multiple benchmark datasets. The code and the recompiled entity matching benchmark datasets are publicly available at: https://github.com/quickhdsdc/LLM4EntityMatching.
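The abstract does not spell out the exact formulation of CMRL. As a rough illustration of the idea it describes (ranking the true match above semantically similar hard negatives within a candidate list), the following minimal Python sketch implements a generic hinge-style listwise margin ranking loss over cosine similarities from a hypothetical Siamese encoder; the paper's actual loss, margin, and scoring function may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def listwise_margin_ranking_loss(query, positive, negatives, margin=0.2):
    """Illustrative margin ranking loss over a candidate list.

    Pushes the positive candidate's similarity to the query above
    every hard negative's similarity by at least `margin`.
    NOTE: a generic reconstruction, not the paper's exact CMRL.
    """
    s_pos = cosine(query, positive)
    hinges = [max(0.0, margin - s_pos + cosine(query, neg)) for neg in negatives]
    return sum(hinges) / len(hinges)
```

With this shape of loss, a hard negative whose embedding is nearly identical to the positive's incurs a penalty close to the full margin, while clearly dissimilar negatives contribute zero, which is what lets training focus on the distractors that cause false positives in pairwise matching.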
Author(s)