Efficient computation of comprehensive statistical information of large OWL datasets: A scalable approach

Authors: Mohamed, H.; Fathalla, S.; Lehmann, Jens; Jabeen, H.
Dates: 2022-11-07; 2023
URL: https://publica.fraunhofer.de/handle/publica/428333
DOI: 10.1080/17517575.2022.2062683
Scopus ID: 2-s2.0-85130021274

Abstract: Computing dataset statistics is crucial for exploring the structure of datasets; however, it becomes challenging for large-scale datasets. Such statistics have several key benefits, such as link target identification, vocabulary reuse, quality analysis, big data analytics, and coverage analysis. In this paper, we present the first attempt at developing a distributed approach (OWLStats) for collecting comprehensive statistics over large-scale OWL datasets. OWLStats is a distributed in-memory approach that uses Apache Spark to compute 50 statistical criteria for OWL datasets. We have successfully integrated OWLStats into the SANSA framework. Experimental results show that OWLStats is linearly scalable in terms of both node and data scalability.

Language: en
Keywords: Distributed processing; in-memory approach; SANSA framework; scalable architecture; Semantic Web; statistics computations
Type: journal article