Incorporating code-switching and borrowing in Dutch-English automatic language detection on Twitter

Kent, S.; Claeser, D.

doi:10.1007/978-3-030-02686-8_32

2019

Conference Paper

Abstract

This paper presents a classification system to automatically identify the language of individual tokens in Dutch-English bilingual Tweets. A dictionary-based approach is used as the basis of the system, and additional features are introduced to address the challenges associated with identifying closely related languages. Crucially, a separate system aimed specifically at differentiating between code-switching (CS) and borrowing is designed and then implemented as a classification step within the language identification (LID) system. This separate classification step is based on a linguistic framework for distinguishing between borrowing and CS. To test the effectiveness of the rules in the LID system, they are used to create feature vectors for training and testing machine learning systems. The discussion centres on a Decision Tree Classifier (DTC) and Support Vector Machines (SVM). The results show that there is only a small difference between the rule-based LID system (micro F1 = .95) and the DTC (micro F1 = .96).
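
To make the dictionary-based idea and the reported metric more concrete, the following is a minimal illustrative sketch and not the authors' implementation. The word lists, the label names, and the functions `tag_token`, `tag_tweet`, and `micro_f1` are all hypothetical; the paper's actual dictionaries, additional features, and borrowing/CS rules are not reproduced here.

```python
# Illustrative sketch only: toy dictionary lookup for token-level
# language tagging, plus a micro-averaged F1 score.
# All word lists and names below are assumptions, not from the paper.

DUTCH = {"ik", "heb", "een", "nieuwe", "gekocht", "het", "is", "in"}
ENGLISH = {"i", "have", "a", "new", "laptop", "the", "is", "in"}


def tag_token(token):
    """Label one token by looking it up in both toy dictionaries."""
    word = token.lower()
    in_nl = word in DUTCH
    in_en = word in ENGLISH
    if in_nl and in_en:
        return "ambiguous"  # shared word forms are common in related languages
    if in_nl:
        return "nl"
    if in_en:
        return "en"
    return "unknown"


def tag_tweet(text):
    """Split on whitespace and tag every token."""
    return [(token, tag_token(token)) for token in text.split()]


def micro_f1(gold, predicted):
    """Micro-averaged F1: pool true/false positives over all labels.

    With exactly one label per token, every error counts once as a
    false positive (for the predicted label) and once as a false
    negative (for the gold label).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g == p)
    fp = fn = len(gold) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for token, label in tag_tweet("Ik heb een nieuwe laptop gekocht"):
        print(token, label)
```

In this toy example "laptop" is tagged as English purely because of the lookup. Deciding whether such a token is an established borrowing or a genuine code-switch is the kind of question the abstract assigns to the separate classification step, which this sketch does not attempt.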

Author(s)

Kent, S.

Claeser, D.

Mainwork

Future Technologies Conference, FTC 2018. Proceedings. Vol.1

Conference

Future Technologies Conference (FTC) 2018
