Twitter corpus creation: The case of a Malay Chat-style-text Corpus (MCC)
In recent years, social networks, microblogs, and short message service have deeply penetrated peoples lives, and thus, chat-style text is a common phenomenon. This chat-style text has many unknown features for linguists, which can be discovered by analyzing a chat-style corpus. The process of constructing a corpus conforms to specific corpus criteria, such as representativeness, sampling, variety, and chronology. Up to now, literature does not provide specific corpus criteria for creating a chat-style-text corpus. In contrast to related work, corpus criteria for creating a chat-style corpus are provided. An exhaustive and reliable Malay chat-style text corpus is still lacking. Thus, the provided criteria are used to demonstrate the process of constructing a Twitter corpus known as the Malay Chat-style Corpus (MCC). The MCC, which has 1 million twitter messages, consists of 14,484,384 word instances, 646,807 terms and metadata, such as posting time, used twitter client application, and type of Twitter message (simple Tweet, Retweet, Reply). Furthermore, the results of the analysis of the MCC reveal characteristics of the corpus including the most frequent terms and collocations, Zipf law diagram, Twitter peak hours, and percentages of message types. Finally, representativeness of the corpus is evaluated by employing cartography and automatic language identification methods. This corpus and the process of corpus creating are valuable for researchers working in linguistics, natural language processing, and data mining.
Saloot, Mohammad Arshi
University of Malaya, Malaysia and Institute for Infocomm Research (I2R), A*STAR, Singapore