Human Body Part Segmentation using Deep Learning with Plausible Estimation of Occluded Body Parts

Hartmann, Alexander

2024

Master Thesis

Abstract

In current research, human-centric vision tasks have received considerable attention due to their widespread application in various domains. In computer vision, human body part segmentation, also known as human parsing, plays a crucial role in human-centric analysis: it involves the fine-grained segmentation of individual human body parts in images or videos. Motivated by the success of deep learning in diverse domains, this thesis explores the innovative application of human body part segmentation in the domain of Cultural Heritage. Given multi-viewpoint 2D renderings of 3D human representatives generated from artifacts, preliminary sketches or underdrawings (so-called sinopias), it is possible to generate large amounts of images of photo-realistic humans dressed in authentic historical garments. This can be done by leveraging the capabilities of generative image-to-image or text-to-image diffusion models, which have gained significant attention recently. Despite significant advancements, deep learning models struggle with the accurate segmentation of human body parts in the presence of the complexities inherent to the Cultural Heritage use case. Even for the human eye, accurately inferring the shape and position of potentially hidden or highly deformable body parts within a single 2D image remains a demanding task. The lack of relevant training data specifically tailored to the Cultural Heritage domain creates a significant challenge for deep learning models. The label space varies significantly across the various, often domain-specific, human-centric datasets. This inconsistency can introduce noise into models trained on data from multiple sources (cross-dataset training) due to ambiguities or inaccurate labeling in the original datasets. To address the scarcity of ground truth training data and to overcome the domain gap, a common approach is to use synthetic data generated from 3D human models. Body models such as SMPL offer a high degree of parameterization, allowing for a wide variety of poses and body shapes. Additionally, these models can be clothed with various virtual garments and diverse textures in order to simulate the Cultural Heritage use case to a certain extent. Through supervised domain adaptation learning, semantic image segmentation models can be trained to generalize and adapt to other domains, such as Cultural Heritage. Other approaches that rely on underlying body meshes, such as dense pose estimation and regression models, provide rich spatial information about the body pose and shape of humans depicted in images. Therefore, by segmenting the corresponding surface landmarks and their body parts, it is possible to infer and annotate body parts hidden under wide clothing. Despite the large variety of possible historic garments, the introduced deep learning models abstract clothing from body parts to a certain extent by learning features from these synthetic human images.

Thesis Note

Darmstadt, TU, Master Thesis, 2024

Author(s)

Hartmann, Alexander

Fraunhofer-Institut für Graphische Datenverarbeitung IGD

Advisor(s)

Kuijper, Arjan