Bot and Gender Identification in Twitter using Word and Character N-Grams
Notebook for PAN at CLEF 2019
Automated social media accounts, called bots, have gained considerable importance worldwide in recent years. Social bots can have serious implications for society by swaying political elections or spreading disinformation, which motivates social bot detection as an emerging research area. Hence, tools and techniques to automatically detect and classify manipulative bots are needed. In this notebook, we describe our system for the author profiling task at PAN 2019 on bot and gender identification on Twitter. The submitted system uses word unigrams and bigrams as well as character n-grams as features. After tweet preprocessing and feature construction, we train a linear Support Vector Machine (SVM) classifier. Our results show that bots can be differentiated from humans with fairly high accuracy, and that the same SVM architecture can determine the gender of a human author (male or female) with somewhat lower accuracy. Our submitted model achieved an accuracy of 0.92 for bot detection on the English dataset and 0.91 on the Spanish dataset. For gender identification, it achieved accuracies of 0.82 and 0.78 on the English and Spanish corpora, respectively. Our simple model ranked 8th out of 55 competitors.
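To make the feature set concrete, the following is a minimal sketch of how word unigrams and bigrams and character n-grams could be extracted from a single tweet. It is an illustration under stated assumptions, not the submitted system: the function names, the lowercasing and whitespace tokenization, the `w:`/`c:` feature prefixes, and the character n-gram range of 3 to 5 are all hypothetical choices that the text above does not specify.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return all contiguous character n-grams of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_features(tweet, char_range=(3, 5)):
    """Count word uni-/bigrams and character n-grams for one tweet.

    Assumptions (not taken from the paper): lowercasing, whitespace
    tokenization, and a character n-gram range of 3 to 5.
    """
    text = tweet.lower()
    tokens = text.split()
    features = Counter()

    # Word unigrams and bigrams, prefixed so they cannot collide
    # with character n-grams that happen to share the same string.
    for n in (1, 2):
        for gram in word_ngrams(tokens, n):
            features["w:" + gram] += 1

    # Character n-grams over the whole lowercased tweet, spaces included.
    low, high = char_range
    for n in range(low, high + 1):
        for gram in char_ngrams(text, n):
            features["c:" + gram] += 1

    return features


if __name__ == "__main__":
    for name, count in sorted(extract_features("Good morning world").items()):
        print(name, count)
```

In a full pipeline, count dictionaries of this kind would typically be aggregated per author, converted to sparse vectors, and passed to a linear SVM; those steps are omitted here because their configuration is not given in this summary.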