Fraunhofer-Gesellschaft
Year
2017
Document Type
Conference Paper
Title

KPCA embeddings: An unsupervised approach to learn vector representations of finite domain sequences

Title Supplement
A use case for words and DNA sequences
Abstract
Most of the well-known word embeddings from recent years rely on a predefined vocabulary, so out-of-vocabulary words are usually skipped when they need to be processed. This can cause a significant quality drop in the document representations built upon them. Additionally, most of these models do not incorporate information about the morphology of words into the word vectors, or, if they do, they require labeled data. We propose an unsupervised method to generate continuous vector representations that can be applied to any sequence over a finite domain (such as text or DNA sequences) by means of kernel principal component analysis (KPCA). We also show that, beyond their potential value as a preprocessing step within a more complex natural language processing system, our KPCA embeddings can also capture valuable linguistic information without any supervision, in particular the morphology of German verbs. When applied to DNA sequences, they also encode enough information to detect splice junctions.
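The approach described in the abstract can be sketched with scikit-learn's KernelPCA on a precomputed string-kernel matrix. The kernel below (Jaccard similarity over character bigrams) and the toy word list are illustrative assumptions for this sketch, not necessarily the kernel used in the paper:

```python
import numpy as np
from sklearn.decomposition import KernelPCA

def ngrams(s, n=2):
    # Pad with boundary markers so short words still yield n-grams.
    s = f"#{s}#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    # Jaccard similarity between two n-gram sets (a positive definite kernel).
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Toy vocabulary of inflected German verb forms (illustrative only).
words = ["laufen", "lief", "gelaufen", "spielen", "spielte", "gespielt"]
grams = [ngrams(w) for w in words]

# Precompute the kernel (pairwise similarity) matrix.
K = np.array([[jaccard(a, b) for b in grams] for a in grams])

# KPCA on the precomputed kernel yields one embedding vector per word.
kpca = KernelPCA(n_components=3, kernel="precomputed")
X = kpca.fit_transform(K)
print(X.shape)  # (6, 3)
```

Because the embedding of a sequence depends only on its kernel values against the training set, an out-of-vocabulary word can be embedded afterwards via `kpca.transform` on its similarity row, which is how this construction avoids the fixed-vocabulary limitation the abstract points out.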
Author(s)
Brito, Eduardo
Sifa, Rafet  
Bauckhage, Christian  
Mainwork
Lernen, Wissen, Daten, Analysen, LWDA 2017. Conference Proceedings. Online resource  
Conference
Conference "Lernen, Wissen, Daten, Analysen" (LWDA) 2017  
Language
English
Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme IAIS  