GPU GEMM-Kernel Autotuning for scalable machine learners

Authors: Sailer, Johannes; Frey, Christian; Kühnert, Christian

Full text urn:nbn:de:0011-n-5320100 (324 KByte PDF)
MD5 Fingerprint: 8b78dc230fda6ff364b072ec12f3c32c
(CC) by
Created on: 8.2.2019

Beyerer, Jürgen (Ed.); Kühnert, Christian (Ed.); Niggemann, Oliver (Ed.):
Machine Learning for Cyber Physical Systems. Selected papers from the International Conference ML4CPS 2018, Karlsruhe, October 23rd and 24th, 2018
Berlin: Springer Vieweg, 2019 (Technologies for Intelligent Automation 9)
ISBN: 978-3-662-58484-2 (Print)
ISBN: 978-3-662-58485-9 (Online)
Conference on Machine Learning for Cyber-Physical-Systems and Industry 4.0 (ML4CPS) <4, 2018, Karlsruhe>
Conference paper, electronic publication
Fraunhofer IOSB
GPU; Matrix Multiplication; Autotuning; automatic generation; acceleration; CUDA; BLAS

Deep learning (DL) is one of the key technologies in the artificial intelligence (AI) domain. Deep learning neural networks (DLNN) benefit greatly from the overall exponential growth of data, while the computational effort for training and inference increases strongly. Most of the computation time in DLNN is consumed by the convolution step, which is based on a general matrix multiplication (GEMM). To reduce the computation time for DLNN, various highly optimized GEMM implementations for graphics processing units (GPUs) have been presented in recent years [1]. Most of these approaches are GPU-hardware-specific implementations of the GEMM software kernel and do not take into account how performance depends on the layout of the training data. To achieve maximum performance, the parameters of the GEMM algorithm have to be tuned for the particular GPU hardware and the specific data layout of the training task. In this paper we present a two-step autotuning approach for GPU-based GEMM algorithms. In the first step, the kernel parameter search space is pruned by several performance criteria; in the second step, the remaining space is processed by a modified simulated annealing in order to find the best kernel parameter combinations with respect to the GPU hardware and the task-specific data layout. Our experiments were carried out on 160 different input problems. With the proposed approach, an average speedup of around 12 over the state-of-the-art implementation from NVIDIA (cuBLAS) can be achieved on an NVIDIA GTX 1080 Ti accelerator card.
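
The two-step idea described in the abstract (prune the parameter space, then search the remainder with simulated annealing) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the parameter names (`tile_m`, `tile_n`, `tile_k`, `threads`), the pruning rules, the shared memory budget, and the `synthetic_runtime` cost model, which stands in for timing a real kernel on the GPU. The sketch also uses plain simulated annealing, not the modified variant the authors mention.

```python
import itertools
import math
import random

# Hypothetical kernel parameters; the paper's actual search space is not given here.
SEARCH_SPACE = {
    "tile_m": [16, 32, 64, 128],
    "tile_n": [16, 32, 64, 128],
    "tile_k": [8, 16, 32],
    "threads": [64, 128, 256, 512],
}
MAX_SHARED_BYTES = 48 * 1024  # assumed shared memory budget per thread block


def is_feasible(cfg):
    """Step 1: cheap pruning rules (illustrative stand-ins for performance criteria)."""
    # Double-buffered float32 tiles of A and B must fit into shared memory.
    shared = 2 * 4 * cfg["tile_k"] * (cfg["tile_m"] + cfg["tile_n"])
    outputs = cfg["tile_m"] * cfg["tile_n"]
    per_thread = outputs // cfg["threads"]
    return (shared <= MAX_SHARED_BYTES
            and outputs % cfg["threads"] == 0
            and per_thread <= 64)


def synthetic_runtime(cfg, m, n, k):
    """Placeholder cost model; a real tuner would measure the kernel on the GPU."""
    blocks = math.ceil(m / cfg["tile_m"]) * math.ceil(n / cfg["tile_n"])
    k_steps = math.ceil(k / cfg["tile_k"])
    work_per_thread = cfg["tile_m"] * cfg["tile_n"] / cfg["threads"]
    return blocks * k_steps * (work_per_thread * cfg["tile_k"] + 50.0)


def neighbors(cfg, candidates):
    """Feasible configurations that differ from cfg in exactly one parameter."""
    return [c for c in candidates if sum(c[p] != cfg[p] for p in cfg) == 1]


def autotune(m, n, k, steps=500, t_start=1.0, cooling=0.99, seed=0):
    """Step 2: plain simulated annealing over the pruned candidate list."""
    rng = random.Random(seed)
    names = list(SEARCH_SPACE)
    everything = [dict(zip(names, vals))
                  for vals in itertools.product(*SEARCH_SPACE.values())]
    candidates = [c for c in everything if is_feasible(c)]

    current = rng.choice(candidates)
    cur_cost = synthetic_runtime(current, m, n, k)
    best, best_cost = current, cur_cost
    temp = t_start
    for _ in range(steps):
        cand = rng.choice(neighbors(current, candidates) or candidates)
        cost = synthetic_runtime(cand, m, n, k)
        delta = (cost - cur_cost) / cur_cost  # relative change in runtime
        # Always accept improvements; accept regressions with decaying probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        temp *= cooling
    return best, best_cost


if __name__ == "__main__":
    # One input problem: C (m x n) = A (m x k) * B (k x n).
    print(autotune(1024, 1024, 1024))
```

Because the cost depends on `m`, `n` and `k`, calling `autotune` once per input shape mirrors the idea of tuning per data layout. In practice the cost function would be replaced by a measured kernel runtime on the target GPU.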