HASKER: An efficient algorithm for string kernels. Application to polarity classification in various languages

Popescu, Marius; Grozea, Cristian; Ionescu, Radu Tudor

doi:10.1016/j.procs.2017.08.207

2017

Journal Article

Abstract

String kernels have been used successfully for various NLP tasks, ranging from text categorization by topic to native language identification. In this paper, we present a simple and efficient algorithm for computing various spectrum string kernels. When comparing two strings, we store the p-grams of the first string in a hash table, and then perform a hash table lookup for each p-gram that occurs in the second string. In terms of time, we show that our algorithm can outperform a state-of-the-art tool for computing string similarity. In terms of accuracy, we show that our approach can reach state-of-the-art performance for polarity classification in various languages. Our efficient implementation is freely available online at http://string-kernels.herokuapp.com.
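
The hash table idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `pgram_counts` and `spectrum_kernels` are hypothetical, and the three variants computed here (p-spectrum, presence, and intersection) are an assumption about what "various spectrum string kernels" covers. The sketch builds a hash table of p-gram counts for the first string, then makes a single pass over the p-grams of the second string.

```python
from collections import Counter


def pgram_counts(s, p):
    """Hash table mapping each p-gram of s to its number of occurrences."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))


def spectrum_kernels(s, t, p):
    """Return (spectrum, presence, intersection) kernel values for s and t.

    Hypothetical helper for illustration only:
      spectrum     = sum over p-grams g of count_s(g) * count_t(g)
      presence     = number of distinct p-grams occurring in both strings
      intersection = sum over p-grams g of min(count_s(g), count_t(g))
    """
    table = pgram_counts(s, p)   # p-grams of the first string
    seen = Counter()             # matched p-grams of the second string so far
    spectrum = presence = intersection = 0

    for i in range(len(t) - p + 1):
        g = t[i:i + p]
        c = table.get(g, 0)      # hash table lookup
        if c == 0:
            continue             # p-gram does not occur in s
        spectrum += c            # each occurrence in t pairs with c occurrences in s
        seen[g] += 1
        if seen[g] == 1:
            presence += 1        # first time this shared p-gram is seen
        if seen[g] <= c:
            intersection += 1    # counts up to min(count_s, count_t)

    return spectrum, presence, intersection


result = spectrum_kernels("abab", "abba", 2)
```

For strings of lengths n and m, this sketch performs n - p + 1 insertions and m - p + 1 lookups. Extracting each p-gram by slicing costs O(p), so a tuned implementation would likely avoid the copies, for example with a rolling hash.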