Options
2018
Conference Paper
Title
Query Rewriting for Heterogeneous Data Lakes
Abstract
The increasing popularity of NoSQL systems has lead to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data in its original structure, and by providing the datasets for querying and analysis. Thus, one of the key tasks of data lakes is to provide a unified querying interface, which is able to rewrite queries expressed in a general data model into a union of queries for data sources spanning heterogeneous data stores. To address this challenge, we propose a novel framework for query rewriting that combines logical methods for data integration based on declarative mappings with a scalable big data query processing system (i.e., Apache Spark) to efficiently execute the rewritten queries and to reconcile the query results into an integrated dataset. Because of the diversity of NoSQL systems, our approach i s based on a flexible and extensible architecture that currently supports the major data structures such as relational data, semi-structured data (e.g., JSON, XML), and graphs. We show the applicability of our query rewriting engine with six real world datasets and demonstrate its scalability using an artificial data integration scenario with multiple storage systems.