Bot and Gender Identification in Twitter using Word and Character N-Grams

Vogel, Inna; Jiang, Peter

2019

Conference Paper

Abstract

Automated social media accounts, called bots, have gained considerable worldwide importance in recent years. Social bots can have serious implications for our society by swaying political elections or spreading disinformation, which motivates social bot detection as an emerging research area. Hence, tools and techniques to automatically detect and classify manipulative bots are needed. In this notebook, we describe our system for the author profiling task at PAN 2019 on bot and gender identification on Twitter. The submitted system uses word unigrams and bigrams as well as character n-grams as features. Tweets were preprocessed and features constructed to train a linear Support Vector Machine (SVM) classifier. Our model shows that it is possible to differentiate bots from humans with fairly high accuracy. Additionally, the results show that the same SVM architecture can reliably determine the gender of the author (male or female). Our submitted model achieved an accuracy of 0.92 for bot detection on the English dataset and 0.91 on the Spanish dataset. Gender was determined with an accuracy of 0.82 on the English corpus and 0.78 on the Spanish corpus. Our simple model ranked 8th out of 55 competitors.
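The feature types named in the abstract (word unigrams and bigrams plus character n-grams) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, and the character n-gram range of 3 to 5 are all assumptions made for illustration, since the abstract does not specify them. The resulting counts are the kind of sparse features that would then be weighted and passed to a linear SVM.

```python
from collections import Counter


def word_ngrams(text, n_values=(1, 2)):
    """Count word n-grams (unigrams and bigrams by default)."""
    # Assumption: naive lowercasing and whitespace tokenization.
    tokens = text.lower().split()
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            # Prefix "w:" keeps word features apart from character features.
            counts["w:" + " ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n_min=3, n_max=5):
    """Count character n-grams; the 3..5 range is an illustrative assumption."""
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts["c:" + text[i:i + n]] += 1
    return counts


def extract_features(text):
    """Combine both feature families into one sparse count vector."""
    return word_ngrams(text) + char_ngrams(text)


if __name__ == "__main__":
    # Hypothetical tweet text, used only to show the shape of the output.
    features = extract_features("Follow me now")
    for name, count in sorted(features.items()):
        print(name, count)
```

Character n-grams capture sub-word patterns such as hashtags, URLs fragments, and repeated punctuation, while word n-grams capture vocabulary and short phrases, which is one plausible reason for combining the two.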

Author(s)

Vogel, Inna

Jiang, Peter

Mainwork

CLEF 2019, Conference and Labs of the Evaluation Forum. Working Notes. Online resource

Funder

Bundesministerium für Bildung und Forschung BMBF (Deutschland)

Conference

Conference and Labs of the Evaluation Forum (CLEF) 2019
