2022
Conference Paper
Title
Scoring-based DOM Content Selection with Discrete Periodicity Analysis
Abstract
The comprehensive analysis of large data volumes is shaping the future. It enables decision-making based on empirical evidence instead of expert experience, and using such data to train machine learning models enables new use cases in image recognition, speech analysis, regression, and classification. One problem with data is that it is often not readily available in aggregated form. Instead, it is necessary to search the web for information and laboriously mine websites for specific data. This is known as web scraping. In this paper we present an interactive, scoring-based approach for scraping specific information from websites. We propose a scoring function that allows threshold values to be adapted in order to select specific sets of data. We combine the scoring of paths in a web page's DOM with periodicity analysis to enable the selection of complex patterns in structured data. This allows non-expert users to train content selection models and to label classification data for supervised learning.
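The abstract does not specify the scoring function or the periodicity analysis, so the following is only a minimal sketch of the general idea and not the paper's method. It assumes that DOM paths are represented as tuples of step names, that a path is scored by its sequence similarity to a user-chosen reference path, and that periodicity is estimated from the autocorrelation of the resulting 0/1 selection mask. The names `path_score`, `select`, and `dominant_period`, as well as the example paths, are hypothetical.

```python
from difflib import SequenceMatcher


def path_score(path, reference):
    """Assumed scoring: similarity in [0, 1] between two DOM paths (tuples of steps)."""
    return SequenceMatcher(None, path, reference).ratio()


def select(paths, reference, threshold):
    """Return a 0/1 mask marking the paths whose score reaches the threshold."""
    return [1 if path_score(p, reference) >= threshold else 0 for p in paths]


def dominant_period(mask):
    """Return the smallest lag with the highest autocorrelation, or None if no lag overlaps."""
    best_lag, best_value = None, 0
    for lag in range(1, len(mask) // 2 + 1):
        # Count positions where the mask and its shifted copy are both selected.
        value = sum(a * b for a, b in zip(mask, mask[lag:]))
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag


# Hypothetical document order of leaf paths: three table rows with three cells each.
row = [
    ("html", "body", "table", "tr", "td.name"),
    ("html", "body", "table", "tr", "td.price"),
    ("html", "body", "table", "tr", "td.link"),
]
paths = row * 3
reference = ("html", "body", "table", "tr", "td.price")

strict_mask = select(paths, reference, threshold=0.9)  # only the price cells
loose_mask = select(paths, reference, threshold=0.5)   # every cell in the table
period = dominant_period(strict_mask)
```

In this sketch, raising the threshold narrows the selection from all table cells to the price cells only, and the detected period of the strict mask corresponds to the number of cells per row, which is the kind of repeating structure a periodicity analysis could exploit.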