KPCA embeddings: An unsupervised approach to learn vector representations of finite domain sequences
A use case for words and DNA sequences
Most well-known word embedding models of recent years rely on a predefined vocabulary, so out-of-vocabulary words are typically skipped when they need to be processed. This can cause a significant quality drop in document representations built upon them. Additionally, most of these models do not incorporate information about word morphology into the word vectors or, if they do, they require labeled data. We propose an unsupervised method, based on kernel principal component analysis (KPCA), to generate continuous vector representations for any sequence over a finite domain (such as text or DNA sequences). We also show that, beyond their potential value as a preprocessing step within a more complex natural language processing system, our KPCA embeddings can capture valuable linguistic information without any supervision, in particular the morphology of German verbs. When applied to DNA sequences, they also encode enough information to detect splice junctions.
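The core idea can be sketched in a few lines: compute a similarity kernel between sequences and project onto the top principal components in the kernel-induced feature space. The following is a minimal illustration, not the paper's exact method; the Jaccard similarity over character n-grams is an assumed stand-in for whatever string kernel the full paper specifies, and the German verb forms are illustrative examples.

```python
import numpy as np

def ngrams(s, n=2):
    """Set of character n-grams of a sequence."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_kernel(a, b, n=2):
    """Jaccard similarity over character n-grams (illustrative kernel choice)."""
    A, B = ngrams(a, n), ngrams(b, n)
    return len(A & B) / max(len(A | B), 1)

def kpca_embed(K, d=2):
    """Embed items via kernel PCA given a symmetric kernel matrix K."""
    m = K.shape[0]
    one = np.ones((m, m)) / m
    # Center the kernel matrix in feature space
    Kc = K - one @ K - K @ one + one @ K @ one
    # Eigendecompose and keep the top-d components
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:d]
    vals, vecs = vals[idx], vecs[:, idx]
    # Coordinates of the training points along each principal axis
    return vecs * np.sqrt(np.maximum(vals, 0))

# Hypothetical toy vocabulary: inflected forms of two German verbs
words = ["spielen", "spielte", "gespielt", "laufen", "lief", "gelaufen"]
K = np.array([[ngram_kernel(u, v) for v in words] for u in words])
emb = kpca_embed(K, d=2)
print(emb.shape)  # one 2-dimensional vector per word
```

Because the kernel only requires a similarity function between sequences, the same pipeline applies unchanged to DNA strings over the alphabet {A, C, G, T}; no vocabulary is fixed in advance, so unseen sequences can still be compared through the kernel.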