Hierarchical Clock Synchronization in MPI

Hunold, S.; Carpen-Amarie, A.

2018

Conference Paper

Abstract

MPI benchmarks are used for analyzing or tuning the performance of MPI libraries. Generally, every MPI library should be adjusted to the given parallel machine, especially on supercomputers. System operators can define which algorithm should be selected for a specific MPI operation, and this decision is usually made after analyzing benchmark results. The problem is that the measured latency of communication operations in MPI is very sensitive to the chosen data acquisition and data processing method. For that reason, depending on how the performance is measured, system operators may end up with a completely different MPI library setup. In the present work, we focus on the problem of precisely measuring the latency of collective operations, in particular for small payloads, where external experimental factors play a significant role. We present a novel clock synchronization algorithm, which exploits the hierarchical architecture of compute clusters, and we show that it outperforms previous approaches in both run-time and precision. We also propose a different scheme to obtain precise MPI run-time measurements (called Round-Time), which is based on given, fixed time slices, as opposed to the traditional way of measuring for a predefined number of repetitions. We also highlight that the use of MPI_Barrier has a significant effect on experimentally determined latency values of MPI collectives. We argue that MPI_Barrier should be avoided if the average run-time of the barrier function is of the same order of magnitude as the run-time of the MPI function to be measured.
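The contrast between measuring for a predefined number of repetitions and measuring within a fixed time slice can be illustrated with a minimal single-process sketch. This is not the Round-Time scheme from the paper, which operates across MPI processes with synchronized clocks; it only shows the difference between the two stopping criteria. The function names, the `max_samples` cap, and `dummy_op` (a stand-in for the MPI call under test) are assumptions made for illustration.

```python
import time


def measure_fixed_repetitions(op, nrep):
    """Traditional scheme: time `op` exactly `nrep` times."""
    samples = []
    for _ in range(nrep):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return samples


def measure_fixed_time_slice(op, time_slice_s, max_samples=1_000_000):
    """Time-slice scheme: keep timing `op` until the time budget is used up.

    The number of samples is not known in advance; it depends on how
    fast `op` runs. `max_samples` is a hypothetical safety cap.
    """
    samples = []
    deadline = time.perf_counter() + time_slice_s
    while time.perf_counter() < deadline and len(samples) < max_samples:
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return samples


def dummy_op():
    # Hypothetical stand-in for the operation being benchmarked.
    sum(range(1000))


if __name__ == "__main__":
    by_count = measure_fixed_repetitions(dummy_op, 100)
    by_time = measure_fixed_time_slice(dummy_op, 0.1)
    print(len(by_count), "samples from the repetition-based loop")
    print(len(by_time), "samples collected within a 0.1 s slice")
```

With a fixed repetition count, the total duration of the experiment varies with the cost of the operation; with a fixed time slice, the duration is bounded and the sample count varies instead.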

Author(s)

Hunold, S.

Carpen-Amarie, A.

Host Publication

IEEE International Conference on Cluster Computing, CLUSTER 2018. Proceedings

Conference

International Conference on Cluster Computing (CLUSTER) 2018
