2017
Conference Paper
Title
Using GPI-2 for distributed memory parallelization of the Caffe toolbox to speed up deep neural network training
Abstract
Deep Neural Networks (DNNs) are currently of great interest in research and application. The training of these networks is a compute-intensive and time-consuming task. To reduce training times to a bearable amount at reasonable cost, we extend the popular Caffe toolbox for DNNs with an efficient distributed memory communication pattern. To achieve good scalability we emphasize the overlap of computation and communication and prefer fine-grained synchronization patterns over global barriers. To implement these communication patterns we rely on the "Global address space Programming Interface" version 2 (GPI-2) communication library. This interface provides a lightweight set of asynchronous one-sided communication primitives, supplemented by non-blocking, fine-grained data synchronization mechanisms. We therefore name our parallel version of Caffe CaffeGPI. First benchmarks demonstrate better scaling behavior compared with other extensions, e.g., Intel™ Caffe. Even within a single symmetric multiprocessing machine with four graphics processing units, CaffeGPI scales better than the standard Caffe toolbox. These first results demonstrate that the use of standard High Performance Computing (HPC) hardware is a valid cost-saving approach to train large DNNs. I/O is another bottleneck when working with DNNs in a standard parallel HPC setting, which we will consider in more detail in a forthcoming paper.
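The communication pattern named in the abstract, asynchronous one-sided writes combined with non-blocking, fine-grained notifications, is the core idiom of the GPI-2 C API. The following is a minimal sketch of that idiom only, not of the CaffeGPI implementation itself; the segment id, offsets, buffer size, and the ring-style neighbor exchange are illustrative assumptions.

/* Minimal GPI-2 sketch: one-sided write fused with a notification, so a
 * process can overlap local computation with the transfer and later
 * synchronize on exactly the data it needs instead of a global barrier.
 * Segment layout and buffer size are hypothetical, not taken from CaffeGPI. */
#include <GASPI.h>
#include <stdlib.h>

#define SEG_ID   0
#define BUF_SIZE (1 << 20)   /* 1 MiB of gradient data, for illustration */

int main(int argc, char *argv[])
{
    gaspi_rank_t rank, nprocs;

    gaspi_proc_init(GASPI_BLOCK);
    gaspi_proc_rank(&rank);
    gaspi_proc_num(&nprocs);

    /* One RDMA-capable segment per process; the outgoing buffer lives at
       offset 0, the incoming buffer at offset BUF_SIZE. */
    gaspi_segment_create(SEG_ID, 2 * BUF_SIZE, GASPI_GROUP_ALL,
                         GASPI_BLOCK, GASPI_MEM_INITIALIZED);

    gaspi_rank_t right = (rank + 1) % nprocs;

    /* Asynchronous one-sided write to the right neighbor, combined with a
       notification that flags the arrival of this particular buffer. */
    gaspi_write_notify(SEG_ID, 0,           /* local segment, local offset  */
                       right,               /* target rank                  */
                       SEG_ID, BUF_SIZE,    /* remote segment, remote offset*/
                       BUF_SIZE,            /* bytes to transfer            */
                       0, 1,                /* notification id and value    */
                       0, GASPI_BLOCK);     /* queue, timeout               */

    /* ... local computation can proceed here, overlapping the transfer ... */

    /* Fine-grained synchronization: wait only for the single notification
       expected from the left neighbor, not for all processes. */
    gaspi_notification_id_t got;
    gaspi_notification_t    val;
    gaspi_notify_waitsome(SEG_ID, 0, 1, &got, GASPI_BLOCK);
    gaspi_notify_reset(SEG_ID, got, &val);

    gaspi_wait(0, GASPI_BLOCK);   /* drain the outgoing queue before exit */
    gaspi_proc_term(GASPI_BLOCK);
    return EXIT_SUCCESS;
}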