GPU GEMM-Kernel Autotuning for scalable machine learners
Sailer, Johannes; Frey, Christian; Kühnert, Christian
Conference paper, 2019
DOI: 10.1007/978-3-662-58485-9_8
Handle: https://publica.fraunhofer.de/handle/publica/403674
License: CC BY 4.0

Abstract:
Deep learning (DL) is one of the key technologies in the artificial intelligence (AI) domain. Deep learning neural networks (DLNN) profit greatly from the overall exponential data growth, while on the other hand the computational effort for training and inference increases strongly. Most of the computational time in a DLNN is consumed by the convolution step, which is based on a general matrix multiplication (GEMM). To accelerate DLNN computations, different highly optimized GEMM implementations for graphics processing units (GPUs) have been presented in recent years [1]. Most of these approaches are GPU-hardware-specific implementations of the GEMM software kernel and do not take into account the dependency of performance on the layout of the training data. To achieve maximum performance, the parameters of the GEMM algorithm have to be tuned for the given GPU hardware and the specific data layout of the training task. In this paper we present a two-step autotuning approach for GPU-based GEMM algorithms. In the first step, the kernel parameter search space is pruned by several performance criteria; the remaining candidates are then processed by a modified simulated annealing in order to find the best kernel parameter combinations with respect to the GPU hardware and the task-specific data layout. Our experiments were carried out on 160 different input problems. With the proposed approach, an average speedup of around 12 over the state-of-the-art implementation from NVIDIA (cuBLAS) can be achieved on an NVIDIA GTX 1080 Ti accelerator card.

Keywords: GPU; matrix multiplication; autotuning; automatic generation; acceleration; CUDA; BLAS
Language: English
DDC: 004; 670
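As a rough illustration of the two-step flow described in the abstract, the following Python sketch first prunes a hypothetical GEMM kernel parameter space by simple hardware and problem-shape criteria and then runs a plain simulated annealing over the surviving candidates. The parameter names, pruning rules, cost model, and annealing schedule are assumptions made for this sketch; the paper's actual kernel generator, performance criteria, and simulated-annealing modifications are not reproduced here.

```python
import itertools
import math
import random

# Illustrative GEMM kernel parameter space (assumed names and ranges, not the paper's).
PARAM_SPACE = {
    "tile_m": [16, 32, 64, 128],
    "tile_n": [16, 32, 64, 128],
    "tile_k": [8, 16, 32],
    "threads_per_block": [64, 128, 256],
}

def candidates():
    """Enumerate all parameter combinations in the search space."""
    names = list(PARAM_SPACE)
    for values in itertools.product(*(PARAM_SPACE[n] for n in names)):
        yield dict(zip(names, values))

def passes_pruning(p, m, n, k, max_threads=1024, shared_mem_bytes=48 * 1024):
    """Step 1: discard combinations by simple performance criteria
    (hardware limits and fit to the problem shape)."""
    if p["threads_per_block"] > max_threads:
        return False
    # Shared-memory footprint of one tile pair (float32) must fit the per-block budget.
    if 4 * p["tile_k"] * (p["tile_m"] + p["tile_n"]) > shared_mem_bytes:
        return False
    # Tiles larger than the problem itself only waste resources.
    if p["tile_m"] > m or p["tile_n"] > n or p["tile_k"] > k:
        return False
    return True

def measure_cost(p, m, n, k):
    """Placeholder cost model standing in for an on-GPU timing run.
    A real tuner would compile the GEMM kernel with parameters `p`
    and measure its runtime on the target accelerator."""
    # Penalise padding waste and reward available block-level parallelism.
    waste = (-m % p["tile_m"]) * n + (-n % p["tile_n"]) * m
    blocks = math.ceil(m / p["tile_m"]) * math.ceil(n / p["tile_n"])
    return waste / (m * n) + 1.0 / (blocks * p["threads_per_block"])

def simulated_annealing(pruned, cost, steps=200, t0=1.0, cooling=0.97):
    """Step 2: simulated annealing over the pruned candidate list."""
    current = random.choice(pruned)
    current_cost = cost(current)
    best, best_cost = current, current_cost
    temperature = t0
    for _ in range(steps):
        neighbour = random.choice(pruned)  # simple neighbourhood: any pruned candidate
        neighbour_cost = cost(neighbour)
        delta = neighbour_cost - current_cost
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current, current_cost = neighbour, neighbour_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temperature *= cooling
    return best, best_cost

# Example usage for one task-specific problem shape (m, n, k):
m, n, k = 4096, 1024, 512
pruned = [p for p in candidates() if passes_pruning(p, m, n, k)]
best, best_cost = simulated_annealing(pruned, lambda p: measure_cost(p, m, n, k))
print(best, best_cost)
```

The pruning step keeps the annealing search cheap by removing combinations that can never perform well on the given hardware and data layout; only the remaining candidates are evaluated against the (here synthetic) cost function.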