2024
Conference Paper
Title
Enhancing Machine Learning Capabilities in Data Lakes with AutoML and LLMs
Abstract
The exponential growth of data driven by digitization demands efficient storage and use of large data volumes. Data lakes can store heterogeneous datasets and prepare them for machine learning (ML). However, current data lakes lack mature capabilities to support ML requirements. Automated machine learning (AutoML) automates the end-to-end application of ML to real-world problems. Large Language Models (LLMs) can potentially increase ML pipeline automation by assisting at various stages of the process and democratizing access to advanced analytics. This paper explores the integration of AutoML tools and LLMs and their application in the data lake SEDAR. We present an extended data lake metadata model for capturing data analytics, a Python package for wrapping AutoML libraries, and a module that leverages LLMs for AutoML. Finally, we compare the performance of AutoML and LLMs on four challenging real-world use cases from the domain of chemistry, each presenting a distinct type of ML problem.
Author(s)
Keyword(s)