  • Publication
    A comparison of deep saliency map generators on multispectral data in object detection
    Deep neural networks, especially convolutional deep neural networks, are state-of-the-art methods to classify, segment or even generate images, movies, or sounds. However, these methods lack a good semantic understanding of what happens internally. The question why a COVID-19 detector has classified a stack of lung CT images as positive is sometimes more interesting than the overall specificity and sensitivity, especially when human domain expert knowledge disagrees with the given output. In this way, human domain experts can also be advised to reconsider their choice in light of the information pointed out by the system. In addition, the deep learning model can be verified, and an existing dataset bias can be uncovered. Currently, most explainable AI methods in the computer vision domain are applied only to image classification, where the images are ordinary images in the visible spectrum. As a result, there is no comparison of how these methods behave on multimodal image data, and most of them have not been investigated for object detection. This work aims to close these gaps by investigating how the maps of three saliency map generator methods differ across spectra, based on accurate and systematic training. Additionally, we examine how they perform when used for object detection. As a practical problem, we chose object detection in the infrared and visual spectrum for autonomous driving. The dataset used in this work is the Multispectral Object Detection Dataset, in which each scene is available in the long-wave (FIR), mid-wave (MIR) and short-wave (NIR) infrared as well as the visual (RGB) spectrum. The results show that there are differences between the infrared and visual activation maps. Further, an advanced training with both the infrared and visual data not only improves the network's output, it also leads to more focused spots in the saliency maps.
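    As a rough illustration (not taken from the publication, which does not name its three generators here), the following sketch shows a generic occlusion-based saliency map for a detector: it assumes a hypothetical callable confidence_fn(image) that returns the detector's confidence for one fixed detection.
      import numpy as np

      def occlusion_saliency(image, confidence_fn, patch=16, stride=8, fill=0.0):
          h, w = image.shape[:2]
          base = confidence_fn(image)
          saliency = np.zeros((h, w), dtype=np.float32)
          counts = np.zeros((h, w), dtype=np.float32)
          for y in range(0, h - patch + 1, stride):
              for x in range(0, w - patch + 1, stride):
                  occluded = image.copy()
                  occluded[y:y + patch, x:x + patch] = fill
                  drop = base - confidence_fn(occluded)   # large drop => important region
                  saliency[y:y + patch, x:x + patch] += drop
                  counts[y:y + patch, x:x + patch] += 1.0
          return saliency / np.maximum(counts, 1.0)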
  • Publication
    Image-Based Out-of-Distribution-Detector Principles on Graph-Based Input Data in Human Action Recognition
    Living in a complex world like ours makes it unacceptable that a practical implementation of a machine learning system assumes a closed world. Therefore, it is necessary for such a learning-based system in a real-world environment to be aware of its own capabilities and limits and to be able to distinguish between confident and unconfident results of the inference, especially if a sample cannot be explained by the underlying distribution. This knowledge is particularly essential in safety-critical environments and tasks, e.g. self-driving cars or medical applications. Towards this end, we transfer image-based Out-of-Distribution (OoD) methods to graph-based data and show their applicability in action recognition. The contribution of this work is (i) the examination of the portability of recent image-based OoD-detectors to graph-based input data, (ii) a Metric Learning-based approach to detect OoD samples, and (iii) the introduction of a novel semi-synthetic action recognition dataset. The evaluation shows that image-based OoD methods can be applied to graph-based data. Additionally, there is a gap between the intraclass and intradataset performance. Simple methods such as the examined baseline or ODIN provide reasonable results. More sophisticated network architectures - in contrast to their image-based application - were surpassed in the intradataset comparison and even led to lower classification accuracy.
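    For orientation, the ODIN score mentioned above combines temperature scaling with a small input perturbation; a minimal sketch is given below, assuming `model` is some classifier returning logits (in the paper's setting the input would be a graph-based sample rather than an image tensor, and the hyperparameters are only common defaults).
      import torch
      import torch.nn.functional as F

      def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
          x = x.detach().clone().requires_grad_(True)
          logits = model(x) / temperature
          # perturb the input against the gradient of the NLL of the predicted class
          loss = F.nll_loss(F.log_softmax(logits, dim=1), logits.argmax(dim=1))
          loss.backward()
          x_perturbed = x - epsilon * x.grad.sign()
          with torch.no_grad():
              probs = F.softmax(model(x_perturbed) / temperature, dim=1)
          return probs.max(dim=1).values   # low score suggests an OoD sample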
  • Publication
    Analysis of Explainers of Black Box Deep Neural Networks for Computer Vision: A Survey
    Deep Learning is a state-of-the-art technique for making inferences on extensive or complex data. As black box models due to their multilayer nonlinear structure, Deep Neural Networks are often criticized as being non-transparent and their predictions not traceable by humans. Furthermore, the models learn from artificially generated datasets, which often do not reflect reality. By basing decision-making algorithms on Deep Neural Networks, prejudice and unfairness may be promoted unknowingly due to a lack of transparency. Hence, several so-called explanators, or explainers, have been developed. Explainers try to give insight into the inner structure of machine learning black boxes by analyzing the connection between the input and output. In this survey, we present the mechanisms and properties of explaining systems for Deep Neural Networks for Computer Vision tasks. We give a comprehensive overview of the taxonomy of related studies and compare several survey papers that deal with explainability in general. We work out the drawbacks and gaps and summarize further research ideas.
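    A toy example of the input-output analysis such explainers perform is a plain gradient sensitivity on an arbitrary differentiable model, sketched below; this is only an illustration and not any specific explainer from the survey.
      import torch

      model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
      x = torch.randn(1, 4, requires_grad=True)
      score = model(x)[0, 1]     # output of interest, e.g. the predicted class score
      score.backward()
      print(x.grad.abs())        # per-feature sensitivity of the output to the input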
  • Publication
    Context Sensitivity of Spatio-Temporal Activity Detection using Hierarchical Deep Neural Networks in Extended Videos
    (2020)
    Hertlein, Felix
    The amount of available surveillance video data is increasing rapidly and therefore makes manual inspection impractical. The goal of activity detection is to automatically localize activities spatially and temporally in a large collection of video data. In this work we answer the question to what extent context plays a role in spatio-temporal activity detection in extended videos. Towards this end we propose a hierarchical pipeline for activity detection which first localizes objects spatially and subsequently generates spatio-temporal action tubes. Additionally, a suitable metric for performance evaluation is enhanced. We evaluate our system using the TRECVID 2019 ActEV challenge dataset and investigate the context sensitivity by detecting activities multiple times with various spatial margins around the performing actor. The results show that our pipeline and metric are suited for detecting activities in extended videos.
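    The margin variation described above can be pictured as enlarging an actor's bounding box by a relative margin and clipping it to the frame, as in the small sketch below (box format and margin values are assumptions, not the exact experimental setup).
      def expand_box(box, margin, frame_w, frame_h):
          x1, y1, x2, y2 = box
          dw, dh = (x2 - x1) * margin, (y2 - y1) * margin
          return (max(0.0, x1 - dw), max(0.0, y1 - dh),
                  min(float(frame_w), x2 + dw), min(float(frame_h), y2 + dh))

      for margin in (0.0, 0.25, 0.5, 1.0):   # increasing amounts of spatial context
          print(margin, expand_box((100, 120, 180, 260), margin, 1920, 1080))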
  • Publication
    Investigation on Combining 3D Convolution of Image Data and Optical Flow to Generate Temporal Action Proposals
    (2019)
    Schlosser, Patrick
    In this paper, several variants of two-stream architectures for temporal action proposal generation in long, untrimmed videos are presented. Inspired by recent advances in the field of human action recognition utilizing 3D convolutions in combination with two-stream networks, and based on the Single-Stream Temporal Action Proposals (SST) architecture [3], four different two-stream architectures utilizing sequences of images on one stream and sequences of optical flow images on the other stream are investigated. The four architectures fuse the two separate streams at different depths in the model; for each of them, a broad range of parameters is investigated systematically and an optimal parametrization is determined empirically. The experiments on the THUMOS'14 dataset [11] - containing untrimmed videos of 20 different sporting activities for temporal action proposals - show that all four two-stream architectures are able to outperform the original single-stream SST and achieve state-of-the-art results. Additional experiments, in which the optical flow method of Brox [1] was exchanged for FlowNet2 [10], revealed that the improvements are not restricted to one method of calculating optical flow.
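    As a minimal sketch of the two-stream idea, the model below runs an RGB clip and an optical-flow clip through separate 3D-convolutional streams and fuses them by concatenation; the layer sizes and the single fusion point are assumptions, whereas the paper compares four different fusion depths.
      import torch
      import torch.nn as nn

      class TwoStream3D(nn.Module):
          def __init__(self, feat_dim=128):
              super().__init__()
              self.rgb = nn.Sequential(nn.Conv3d(3, 16, 3, padding=1), nn.ReLU(),
                                       nn.AdaptiveAvgPool3d(1), nn.Flatten())
              self.flow = nn.Sequential(nn.Conv3d(2, 16, 3, padding=1), nn.ReLU(),
                                        nn.AdaptiveAvgPool3d(1), nn.Flatten())
              self.head = nn.Linear(16 + 16, feat_dim)   # fusion by concatenation

          def forward(self, rgb_clip, flow_clip):
              return self.head(torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1))

      model = TwoStream3D()
      out = model(torch.randn(1, 3, 16, 112, 112), torch.randn(1, 2, 16, 112, 112))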
  • Publication
    Augmentation techniques for video surveillance in the visible and thermal spectral range
    In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors, and to achieve better performance it is not unusual to combine them. We focus on the case that a long-wave infrared camera records continuously, another camera additionally records in the visible spectral range during daytime, and an intelligent algorithm supervises the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in containing color and distinct texture information on the one hand, and in not containing information about the thermal radiation emitted by objects on the other hand. Although color can provide valuable information for classification tasks, effects such as varying illumination and specialties of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network still poses a challenge. That is the reason why training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared imagery. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques. We use the publicly available large-scale multispectral ThermalWorld dataset, consisting of images in the long-wave infrared and visible spectral range showing persons, vehicles, buildings, and pets, and train a Convolutional Neural Network for image classification. The training data are augmented with several modifications based on their different properties to find out which ones have which impact and lead to the best classification performance.
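    Hedged examples in the spirit of such augmentations are shown below: a per-channel color perturbation for visible images and a gain/offset perturbation for thermal images; the concrete augmentations and parameter ranges studied in the publication may differ.
      import numpy as np

      def augment_visible(img_rgb, rng):
          color_shift = rng.uniform(0.8, 1.2, size=3)      # crude per-channel color jitter
          return np.clip(img_rgb * color_shift, 0, 255).astype(np.uint8)

      def augment_thermal(img_ir, rng):
          gain, offset = rng.uniform(0.9, 1.1), rng.uniform(-10, 10)
          return np.clip(img_ir * gain + offset, 0, 255).astype(np.uint8)  # mimics varying apparent radiation

      rng = np.random.default_rng(0)
      vis = augment_visible(np.full((64, 64, 3), 128, np.uint8), rng)
      ir = augment_thermal(np.full((64, 64), 128, np.uint8), rng)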
  • Publication
    Evaluating the Impact of Color Information in Deep Neural Networks
    Color images are omnipresent in everyday life. In particular, they provide the only necessary input for deep neural network pipelines, which are continuously being employed for image classification and object recognition tasks. Although color can provide valuable information, effects like varying illumination and the specialties of different sensors still pose significant problems. However, there is no clear evidence of how strongly variations in color information influence classification performance throughout the rearward layers. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from input images, we investigate in this work the suitability and robustness of different color augmentation techniques. We consider several established benchmark sets as well as custom-made pedestrian and background datasets. While decreasing color or saturation information, we explore the activation differences in the rear layers and the stability of confidence values. We show that luminance is most robust against a change of the color system in test images, irrespective of whether the texture is degraded or not. Finally, we present the coherence between color dependence and the properties of the regarded datasets and classes.
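    A sketch of the kind of color degradation studied here is to blend an RGB image toward its luminance channel to reduce saturation; the 0..1 factor and the BT.601 luminance weights below are assumptions about the exact procedure.
      import numpy as np

      def desaturate(img_rgb, factor):
          """factor=1.0 keeps full color, factor=0.0 yields the pure luminance image."""
          img = img_rgb.astype(np.float32)
          lum = img @ np.array([0.299, 0.587, 0.114], dtype=np.float32)   # BT.601 luminance
          gray = np.repeat(lum[..., None], 3, axis=-1)
          return np.clip(factor * img + (1.0 - factor) * gray, 0, 255).astype(np.uint8)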
  • Publication
    An architecture for automatic multimodal video data anonymization to ensure data protection
    To implement a data protection concept for our mobile sensor platform (MODISSA), we designed and implemented an anonymization pipeline. This pipeline contains plugins for reading, modifying, and writing different image formats, as well as methods to detect the regions that should be anonymized. This includes a method to determine head positions and an object detector for license plates, both based on state-of-the-art deep learning methods. These methods are applied to all image sensors on the platform, no matter whether they are panoramic RGB, thermal IR, or grayscale cameras. In this paper we focus on the whole-face anonymization process. We determine the face region to anonymize on the basis of body pose estimates from OpenPose, which proved to lead to robust results. Our anonymization pipeline achieves nearly human performance with almost no human resources spent. However, to achieve perfect anonymization, a quick additional interactive postprocessing step can be performed. We evaluated our pipeline quantitatively and qualitatively on urban example data recorded with MODISSA.
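    The step from pose estimates to an anonymized face region can be pictured as in the sketch below, which derives a padded bounding box from head keypoints (e.g. nose, eyes, ears) and pixelates it; the padding, block size, and keypoint selection are assumptions, not the platform's actual parameters.
      import numpy as np

      def anonymize_face(image, head_keypoints, pad=0.3, block=8):
          pts = np.asarray(head_keypoints, dtype=np.float32)   # (N, 2) head keypoints in pixels
          x1, y1 = pts.min(axis=0)
          x2, y2 = pts.max(axis=0)
          dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
          x1, y1 = int(max(0, x1 - dx)), int(max(0, y1 - dy))
          x2, y2 = int(min(image.shape[1], x2 + dx)), int(min(image.shape[0], y2 + dy))
          face = image[y1:y2, x1:x2]
          small = face[::block, ::block]                       # coarse subsampling
          upsampled = np.kron(small, np.ones((block, block, 1), face.dtype))
          image[y1:y2, x1:x2] = upsampled[:y2 - y1, :x2 - x1]  # write back the pixelated region
          return image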
  • Publication
    Automated license plate detection for image anonymization
    Images or videos recorded in public areas may contain personal data such as license plates. According to German law, one is not allowed to store such data without the permission of the affected people or an immediate anonymization of personal information in the recordings. As asking for and obtaining permission is practically impossible and manual anonymization is time-consuming, an automated license plate detection and localization system is developed. For the implementation, a two-stage neural net approach is chosen that hierarchically combines a YOLOv3 model for vehicle detection with another YOLOv3 model for license plate detection. The model is trained on a specifically composed dataset that includes synthesized images, low-quality or non-annotated datasets, and data augmentation methods. The license plate detection system is evaluated quantitatively and qualitatively, yielding an average precision (AP) of 98.73% for an intersection-over-union threshold of 0.3 on the openALPR dataset and showing outstanding robustness even for rotated, small-scale or partly covered license plates.
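    The intersection-over-union criterion behind the reported AP can be illustrated as below; boxes are given as (x1, y1, x2, y2), and a detection counts as a hit in this evaluation when its IoU with a ground-truth plate is at least 0.3.
      def iou(a, b):
          ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
          ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
          inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
          area_a = (a[2] - a[0]) * (a[3] - a[1])
          area_b = (b[2] - b[0]) * (b[3] - b[1])
          return inter / (area_a + area_b - inter) if inter > 0 else 0.0

      print(iou((10, 10, 60, 30), (20, 12, 70, 32)) >= 0.3)   # True: counted as a correct detection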
  • Publication
    Searching remotely sensed images for meaningful nested gestalten
    Even non-expert human observers sometimes still outperform automatic extraction of man-made objects from remotely sensed data. We conjecture that some of this remarkable capability can be explained by Gestalt mechanisms. Gestalt algebra provides a mathematical structure capturing such part-aggregate relations and the laws to form an aggregate called a Gestalt. Primitive Gestalten are obtained from an input image, and the space of all possible Gestalt algebra terms is searched for well-assessed instances. This can be a very challenging combinatorial effort. The contribution at hand provides tools and structures that unfold a finite and comparably small subset of the possible combinations. Yet, the intended Gestalten are still contained and found with high probability and moderate effort. Experiments are made with images obtained from a virtual globe system and use the SIFT method for extraction of the primitive Gestalten. Comparison is made with manually extracted ground-truth Gestalten salient to human observers.
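    A toy picture of such a pruned combinatorial search is given below: primitive Gestalten are taken as 2-D positions (e.g. from SIFT keypoints), pairs are assessed by a placeholder proximity score, and only well-assessed aggregates are kept; the real Gestalt algebra laws also involve orientation, scale, and more elaborate assessment functions.
      from itertools import combinations

      def assess(a, b, max_dist=50.0):
          # toy assessment: two Gestalten form a good aggregate if their positions are close
          d = ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
          return max(0.0, 1.0 - d / max_dist)

      def search_pairs(primitives, threshold=0.5):
          return [((a, b), assess(a, b)) for a, b in combinations(primitives, 2)
                  if assess(a, b) >= threshold]   # discard poorly assessed aggregates

      print(search_pairs([(10.0, 10.0), (20.0, 15.0), (400.0, 300.0)]))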