Authors: Kuijper, Arjan; Westerdorff, Karsten; Freitag, Timo
Date: 2022-12-16 (issued 2022)
URI: https://publica.fraunhofer.de/handle/publica/430107

Abstract: Image compression is essential for minimizing the used downlink capacity of satellites. Using neural networks as autoencoders for image compression can improve the compression rate compared to ordinary compression algorithms such as JPEG. Computing a pass of a neural network can become very computationally expensive, so hardware accelerators are used to improve runtime. These usually implement both the linear matrix-vector multiplications and the non-linear activation functions, which must be quantized in order to be realized in hardware. This thesis proposes an approach in which the activation functions are calculated on the CPU while the linear matrix-vector multiplications are still calculated in the accelerator. This avoids quantizing the activation functions and allows the implementation to be ported to platforms that do not have enough resources for a complex hardware implementation of activation functions. The implementation consists of a neural network compiler for preparing data and generating instructions, an application for executing these instructions, a kernel-space driver that provides buffers and hardware access to the application, and a hardware accelerator for calculating sparse matrix-vector multiplications. The approach is implemented on an FPGA-SoC of the Zynq-7000 family, which provides tight coupling between the CPU and the FPGA on which the accelerator is implemented. The system runs an embedded Linux operating system, and the implementation is integrated into the ScOSA framework, which is designed for a distributed onboard computer of a satellite. With this approach, a throughput of 235 million multiply-accumulate operations per second has been achieved. It is shown that the time utilization of the accelerator is over 85% for matrices larger than 480,000 elements, but that it decreases severely for matrices smaller than 80,000 elements. Calculating the activation functions separately on the CPU is therefore relatively more costly for smaller matrices. Nevertheless, improvements to the approach that reduce this cost even for smaller matrices are proposed.

Language: en
Keywords: Lead Topic: Digitized Work; Research Line: Machine Learning (ML); Field-programmable gate array (FPGA); Deep learning; High performance computing; Distributed processing
Title: Acceleration of an Autoencoder using a FPGA-SoC in a High-Performance Node of a Distributed Onboard Computer
Type: master thesis
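
The following C sketch illustrates the layer-execution split the abstract describes: the sparse matrix-vector multiplication belongs to the accelerator, while the non-linear activation stays on the CPU in full floating point, so it never needs to be quantized for hardware. All names, the CSR layout, and the sigmoid activation are illustrative assumptions, not taken from the thesis; the accelerator call is simulated on the CPU so the sketch stays self-contained.

/*
 * Minimal sketch of the proposed split, under assumed data layouts:
 * linear part on the accelerator, non-linear part on the CPU.
 * Build with: cc split_sketch.c -lm
 */
#include <math.h>
#include <stdio.h>

/* Sparse weight matrix in CSR form (hypothetical layout). */
typedef struct {
    int rows;
    const int   *row_ptr;  /* rows + 1 entries      */
    const int   *col_idx;  /* one entry per nonzero */
    const float *values;   /* one entry per nonzero */
} csr_matrix;

/* Stand-in for the accelerator: in the real system this sparse
 * matrix-vector product would be dispatched to the FPGA through the
 * kernel-space driver's buffers; here it is computed on the CPU so
 * the example runs anywhere. */
static void accel_spmv(const csr_matrix *w, const float *x, float *y)
{
    for (int r = 0; r < w->rows; r++) {
        float acc = 0.0f;
        for (int i = w->row_ptr[r]; i < w->row_ptr[r + 1]; i++)
            acc += w->values[i] * x[w->col_idx[i]];
        y[r] = acc;
    }
}

/* The activation is applied on the CPU in floating point, which is
 * the point of the split: no quantized hardware activation needed. */
static void cpu_sigmoid(float *y, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = 1.0f / (1.0f + expf(-y[i]));
}

int main(void)
{
    /* Toy 2x3 layer with 4 nonzero weights. */
    const int   row_ptr[] = {0, 2, 4};
    const int   col_idx[] = {0, 2, 1, 2};
    const float values[]  = {0.5f, -1.0f, 2.0f, 0.25f};
    const csr_matrix w = {2, row_ptr, col_idx, values};

    const float x[3] = {1.0f, 2.0f, 3.0f};
    float y[2];

    accel_spmv(&w, x, y);   /* linear part: the accelerator's job */
    cpu_sigmoid(y, 2);      /* non-linear part: stays on the CPU  */

    printf("%f %f\n", y[0], y[1]);
    return 0;
}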