2022
Conference Paper
Title
A Digitization Pipeline for Mixed-Typed Documents Using Machine Learning and Optical Character Recognition
Abstract
Although digitization is advancing rapidly, a large amount of the data processed by companies is still in printed format. Technologies such as Optical Character Recognition (OCR) support the transformation of printed text into machine-readable content. However, OCR struggles when the data on a document is highly unstructured and includes non-text objects, as is the case for medical prescriptions. Leveraging Design Science Research (DSR), we propose a flexible processing pipeline that handles both character recognition and object detection. To do so, we derive Design Requirements (DR) in cooperation with a practitioner doing prescription billing in the healthcare domain. We then develop a prototype blueprint that is applicable to similar problem formulations. Overall, we contribute to research and practice in three ways. First, we provide evidence for selected OCR methods proposed in previous research. Second, we design a machine-learning-based digitization pipeline for printed documents containing both text and non-text objects in the context of medical prescriptions. Third, we derive nascent design patterns for this type of document digitization. These patterns are a foundation for further research and can support the development of innovative information systems, leading to more efficient decision making and thus to more economical resource usage.
Author(s)