2025
Master Thesis
Title
Convergence of HPC with Kubernetes to deploy a Scalable On-premise MLOps Platform
Abstract
Managing High-Performance Computing (HPC) and non-HPC tasks within diverse machine learning workflows poses significant challenges, particularly when building scalable and unified MLOps platforms. Kubernetes is widely used for long-running non-HPC tasks such as inference and model serving, but it lacks native support for HPC tasks such as large-scale distributed training or fine-tuning, since its default scheduler does not handle batch jobs. Conversely, HPC workload managers such as Slurm provide native support for batch jobs but offer only limited flexibility and automation. This thesis explores two convergence methods to bridge this gap: the Slurm connector strategy, which facilitates the submission and management of jobs from within a Kubernetes cluster to an external Slurm cluster, and the custom scheduling strategy, which integrates custom batch job schedulers into Kubernetes to handle HPC batch workloads.
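To illustrate the Slurm connector strategy, the following minimal sketch shows one way a pod inside a Kubernetes cluster could submit a batch job to an external Slurm cluster over SSH. The login node address and job script path are illustrative placeholders, not details taken from the thesis.

```python
# Minimal sketch of the Slurm connector idea: a process running inside the
# Kubernetes cluster submits a batch job to an external Slurm login node via
# SSH and sbatch. Host, user, and script path below are hypothetical.
import subprocess

SLURM_LOGIN_NODE = "mlops@slurm-login.example.org"   # hypothetical login node
JOB_SCRIPT = "/home/mlops/jobs/finetune_whisper.sbatch"  # hypothetical script


def submit_slurm_job(login_node: str, script_path: str) -> str:
    """Run sbatch on the remote login node and return the new job ID."""
    result = subprocess.run(
        ["ssh", login_node, "sbatch", "--parsable", script_path],
        capture_output=True, text=True, check=True,
    )
    # With --parsable, sbatch prints "<jobid>" (or "<jobid>;<cluster>").
    return result.stdout.strip().split(";")[0]


if __name__ == "__main__":
    job_id = submit_slurm_job(SLURM_LOGIN_NODE, JOB_SCRIPT)
    print(f"Submitted Slurm job {job_id}")
```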
An on-premise MLOps platform was deployed on a Kubernetes cluster in the Fraunhofer Edge Cloud (FEC). A distributed, GPU-enabled fine-tuning pipeline for the Whisper automatic speech recognition (ASR) model was implemented on the platform to evaluate these convergence methods. A centralized observability stack, consisting of the monitoring tools Prometheus and Grafana and the logging tools Elasticsearch, Fluent Bit, and Kibana, was integrated into the MLOps platform to collect data from the experiments, including logs from various sources and CPU and GPU utilization metrics. The convergence methods were analyzed using the non-parametric Kruskal-Wallis H-test for statistical significance and the Tukey Honestly Significant Difference (HSD) test for pairwise comparison, contrasting the performance of Kubernetes-based custom schedulers with that of the bare-metal Slurm scheduler on the distributed Whisper fine-tuning task. This thesis demonstrates the potential of Kubernetes and its custom scheduling capabilities to support scalable and diverse machine learning workflows in an MLOps platform.
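As a sketch of the statistical analysis described above, the following example applies SciPy's Kruskal-Wallis H-test and Tukey HSD test to per-run timing samples. The sample values and scheduler labels are invented for demonstration and are not results from the thesis.

```python
# Illustrative analysis: test whether fine-tuning run times differ across
# three schedulers, then locate pairwise differences. All numbers below are
# hypothetical, not measurements from the thesis experiments.
from scipy import stats

# Hypothetical per-run fine-tuning times (seconds) for each scheduler.
slurm_bare_metal = [312.4, 305.9, 318.1, 309.7, 314.2]
k8s_scheduler_a = [321.5, 327.0, 319.8, 325.3, 330.1]
k8s_scheduler_b = [316.2, 312.8, 320.4, 318.9, 315.5]

# Kruskal-Wallis H-test: non-parametric test for a significant difference
# among the three groups (no normality assumption).
h_stat, p_value = stats.kruskal(slurm_bare_metal, k8s_scheduler_a, k8s_scheduler_b)
print(f"Kruskal-Wallis: H={h_stat:.3f}, p={p_value:.4f}")

# Tukey HSD: pairwise comparisons to identify which schedulers differ.
result = stats.tukey_hsd(slurm_bare_metal, k8s_scheduler_a, k8s_scheduler_b)
print(result)
```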
Thesis Note
Sankt Augustin, Hochschule, Master Thesis, 2025
Author(s)
File(s)
Rights
Use according to copyright law
Language
English