User-centered interactive similarity definition for complex data objects

Sessler, David

2014

Bachelor Thesis

Abstract

The definition of similarity between data objects plays a key role in the applicability of many analytical systems. Similarity measures are used for prominent data analysis tasks such as nearest neighbor search, clustering, and pattern recognition. These tasks are applied in many scientific domains, including Information Retrieval, Data Mining, Machine Learning, Information Visualization, and Visual Analytics. The data used for the calculation of similarity can either be of a uniform attribute type (numerical, ordinal, categorical, or binary) or consist of combinations thereof (mixed data). The process of similarity definition poses several challenges, which I aim to tackle in this work. First, in many applications the developers of the analytical system (data experts) are not necessarily its users (domain experts). A problem arises because data experts implement the functional similarity specification for domain experts, yet this specification should reflect the similarity notion in the minds of the domain experts. Therefore, the domain experts should be involved in the similarity generation process. The second challenge concerns the similarity definition for mixed data. A variety of similarity definitions exist for numerical, categorical, or binary data. However, defining similarity for mixed data is cumbersome because of the complexity of the data. Finally, similarity can be defined at one of two points in time, namely at compile time or at run time. Today, many analytical systems define the similarity at compile time. However, the similarity notion of domain experts, or the data set itself, may vary over time. Such a change would require a new specification of the functional similarity and a recompilation of the system. Defining similarity at run time would solve this problem. I present a visual-interactive system that enables domain experts to define a similarity measure that reflects their similarity notion.
The system is applicable to mixed data sets. Domain experts can arrange objects in a visual interface to generate feedback. Dynamic recalculation of the functional similarity specification allows the system to match the similarity notion of the domain expert at run time, so the specification can be adjusted at any time. Further, I provide a visual-interactive mode that enables the data expert to explore the similarity definition process of the domain expert. In addition, I evaluate the system to assess the quality of the similarity concept as well as of the feedback generation process. The results of the evaluation illustrate both the validity of my solution and possible extensions, depending on the complexity of the given user feedback. In two case studies I show the applicability of the system. Both case studies show that the 'mental' similarity notion of users can be captured by the similarity concept. The results of the evaluation and the observations made in the case studies can be applied to improve the system or used as a baseline for future approaches to user-centered interactive similarity definition for complex data objects.

Thesis Note

Darmstadt, TU, Bachelor Thesis, 2014

Author(s)

Sessler, David