Statisfy me: What are my stats?

Sejdiu, Gezim; Ermilov, Ivan; Mami, Mohamed Nadjib; Lehmann, Jens

2018

Conference Paper

Abstract

The increasing adoption of the Linked Data format, RDF, over the last two decades has brought new opportunities. It has also raised new challenges, especially in managing and processing large amounts of RDF data. In particular, assessing the internal structure of a data set is important, since it enables users to understand the data better. One prominent way of assessment is computing statistics about the instances and schema of a data set. However, computing statistics over large RDF data is computationally expensive. To address this, we previously built DistLODStats, a framework based on Apache Spark for the parallel calculation of 32 statistical criteria over large RDF datasets. Running DistLODStats therefore means submitting jobs to a Spark cluster. Often, this is done manually, either by connecting to the cluster machine or via a dedicated resource manager. This approach is inconvenient, as it requires users to acquire new software skills and to interact directly with the cluster. To make DistLODStats easier to use, we propose in this paper an approach for triggering RDF statistics remotely using simple HTTP requests. DistLODStats is built as a plugin into the larger SANSA Framework and makes use of Apache Livy, a novel lightweight solution for interacting with a Spark cluster via a REST interface.
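As a rough illustration of what triggering a Spark job with a plain HTTP request can look like, the following minimal sketch builds a batch submission for Apache Livy's `POST /batches` endpoint. It is not the paper's implementation. The host and port, the jar path, the class name `example.StatsJob`, and the input and output paths are all hypothetical placeholders; only the endpoint and the `file`, `className`, and `args` fields come from Livy's REST API.

```python
import json
import urllib.request

# Assumption: a Livy server reachable at this address (8998 is Livy's default port).
LIVY_URL = "http://localhost:8998"


def build_batch_request(livy_url, jar_path, class_name, args):
    """Build (but do not send) a POST /batches request for Apache Livy."""
    payload = {
        "file": jar_path,         # application jar, readable by the cluster
        "className": class_name,  # main class of the Spark application
        "args": list(args),       # command-line arguments passed to the job
    }
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_batch(request):
    """Send the request and return Livy's JSON description of the new batch."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical job description; every value below is a placeholder.
req = build_batch_request(
    LIVY_URL,
    "hdfs:///jars/stats-job.jar",
    "example.StatsJob",
    ["hdfs:///data/input.nt", "hdfs:///data/stats-out"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# batch = submit_batch(req)  # uncomment only when a Livy server is running
```

The request is built separately from the call that sends it, so the payload can be inspected without a running cluster. Once submitted, the state of the batch can be polled with `GET /batches/{id}/state`.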

Author(s)

Sejdiu, Gezim

Ermilov, Ivan

Mami, Mohamed Nadjib

Lehmann, Jens

Mainwork

ISWC-P&D-Industry-BlueSky 2018. ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks. Online resource

Conference

International Semantic Web Conference (ISWC) 2018
