Distributed online service coordination using deep reinforcement learning

Schneider, S.; Qarawlus, H.; Karl, H.

doi:10.1109/ICDCS51616.2021.00058

2021

Conference Paper

Abstract

Services often consist of multiple chained components such as microservices in a service mesh, or machine learning functions in a pipeline. Providing these services requires online coordination including scaling the service, placing instance of all components in the network, scheduling traffic to these instances, and routing traffic through the network. Optimized service coordination is still a hard problem due to many influencing factors such as rapidly arriving user demands and limited node and link capacity. Existing approaches to solve the problem are often built on rigid models and assumptions, tailored to specific scenarios. If the scenario changes and the assumptions no longer hold, they easily break and require manual adjustments by experts. Novel self-learning approaches using deep reinforcement learning (DRL) are promising but still have limitations as they only address simplified versions of the problem and are typically centralized and thus do not scale to p ractical large-scale networks. To address these issues, we propose a distributed self-learning service coordination approach using DRL. After centralized training, we deploy a distributed DRL agent at each node in the network, making fast coordination decisions locally in parallel with the other nodes. Each agent only observes its direct neighbors and does not need global knowledge. Hence, our approach scales independently from the size of the network. In our extensive evaluation using real-world network topologies and traffic traces, we show that our proposed approach outperforms a state-of-the-art conventional heuristic as well as a centralized DRL approach (60 % higher throughput on average) while requiring less time per online decision (1 ms).

Author(s)

Schneider, S.

Qarawlus, H.

Karl, H.

Mainwork

IEEE 41st International Conference on Distributed Computing Systems, ICDCS 2021

Conference

International Conference on Distributed Computing Systems (ICDCS) 2021

Options

Distributed online service coordination using deep reinforcement learning