JEMA: Joint Embedding of Multimodal and multi-view Alignment in human-centric embedding space for manufacturing

Sousa, João; Darabi, Roya; Sousa, Armando; Brückner, Frank; Reis, Luís Paulo; Reis, Ana

doi:10.1016/j.cviu.2026.104771

May 2026

Journal Article

Abstract

This work introduces JEMA (Joint Embedding with Multimodal and multi-view Alignment), a novel co-learning framework and loss function for combining multiple sensors and process parameters in Directed Energy Deposition (DED), a critical process in metal additive manufacturing. As Industry 5.0 gains ground in industrial applications, effective process monitoring becomes increasingly essential. However, the limited availability of data and the black-box nature of AI solutions present significant implementation challenges in industrial settings. JEMA addresses these limitations by leveraging multimodal data, including multi-view images and process parameters, to learn transferable semantic representations. By implementing a supervised regression contrastive loss function, JEMA shapes the embedding space to enable interpretable inference. Furthermore, the framework simplifies hardware requirements and reduces computational overhead during deployment by using only the primary on-axis sensor. We evaluate the JEMA loss in DED process monitoring, with particular focus on its generalization to downstream tasks such as melt pool geometry prediction without extensive fine-tuning. Our empirical results show improvements of 29% and 20% in multimodal and unimodal settings, respectively, compared to models without any regularization loss. Additionally, JEMA outperforms supervised contrastive learning methods by 8% and 2% in the same settings. These improvements are accompanied by a more structured and meaningful representation in the embedding space. Importantly, the learned embedding representation provides direct interpretability of the feature space, which can be used by both human operators and automated systems for process optimization, control, and anomaly detection based on defined thresholds.
This human-centered approach ensures that operators can actively engage with the system, making informed decisions and enhancing their trust in the process. Our framework establishes a foundation for integrating multisensor data with metadata, enabling diverse downstream applications both within manufacturing processes and beyond, while keeping human expertise central to the loop.
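
The abstract does not give the form of the JEMA loss. As a rough illustration of what a supervised contrastive loss for regression targets can look like, the sketch below weights each pair of samples by how close their continuous labels are, so that samples with similar labels are pulled together in the embedding space. This is a generic label-distance-weighted contrastive loss written for illustration only, not the loss defined in the paper; the function names, the `temperature` and `sigma` parameters, and the toy data are all assumptions.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def soft_contrastive_loss(embeddings, labels, temperature=0.1, sigma=1.0):
    """Illustrative label-distance-weighted contrastive loss (NOT the JEMA loss).

    embeddings: list of equal-length, non-zero float vectors, one per sample.
    labels: list of continuous targets (e.g. a process parameter), one per sample.
    temperature, sigma: hypothetical hyperparameters chosen for this sketch.
    """
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Cosine similarity to every other sample, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in others]
        # Numerically stable log-softmax over the other samples.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        log_probs = [l - log_norm for l in logits]
        # Soft targets: samples whose labels are closer get more weight.
        weights = [math.exp(-abs(labels[i] - labels[j]) / sigma) for j in others]
        w_sum = sum(weights)
        # Cross-entropy between the soft targets and the similarity distribution.
        total += -sum(w / w_sum * lp for w, lp in zip(weights, log_probs))
    return total / n


# Toy data (made up): four samples with a continuous label each.
labels = [0.0, 0.1, 5.0, 5.1]
# Embeddings whose geometry follows the labels: close labels, close vectors.
structured = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
# The same vectors assigned so that close labels get distant vectors.
scrambled = [[1.0, 0.0], [0.0, 1.0], [0.99, 0.1], [0.1, 0.99]]

loss_structured = soft_contrastive_loss(structured, labels)
loss_scrambled = soft_contrastive_loss(scrambled, labels)
print(loss_structured < loss_scrambled)
```

In this toy setup the loss is lower when the embedding geometry mirrors the label ordering, which is the kind of structure the abstract describes as making the embedding space interpretable. A training implementation would compute the same quantity on batched tensors in an automatic differentiation framework.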