Text mining in livestock animal science: Introducing the potential of text mining to animal sciences
In biological research, establishing the prior art by searching and collecting information already present in the domain is as important as the experiments themselves. To obtain a complete overview of the relevant knowledge, researchers mainly rely on 2 major information sources: i) various biological databases and ii) scientific publications in the field. The major difference between the 2 sources is that information in databases is readily available and typically well structured and condensed, whereas information in scientific literature is largely unstructured; that is, dispersed among the many different sections of scientific text. Traditionally, information is extracted from scientific literature by generating a list of relevant publications in the field of interest and manually scanning these texts for relevant information, which is very time consuming. With this "classical" approach, the researcher is likely to miss some relevant information mentioned in the literature or has to go through biological databases to extract further information. Text mining and named entity recognition methods, which can process and extract information from large volumes of scientific text, have already been used in human genomics and related fields as a solution to this problem. Text mining is defined as the automatic extraction of previously unknown and potentially useful information from text. Named entity recognition (NER) is defined as the method of identifying named entities (names of real-world objects; for example, gene/protein names, drugs, enzymes) in text. In animal sciences, text mining and related methods have been used to a limited extent in murine genomics and associated fields, whereas other fields of animal sciences, such as livestock genomics, have been largely left behind.
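To make the idea of NER concrete, the following is a minimal sketch of dictionary-based entity tagging in Python. It is an illustration only and does not reproduce ProMiner's algorithm; the toy dictionary `GENE_SYNONYMS` and the function names `build_pattern` and `tag_entities` are hypothetical and introduced here for the example.

```python
import re

# Hypothetical toy dictionary: canonical gene symbol -> known synonyms.
# A real system would load a far larger, curated terminology.
GENE_SYNONYMS = {
    "MSTN": ["myostatin", "GDF8"],
    "LEP": ["leptin"],
}


def build_pattern(dictionary):
    """Compile one regex over all terms and map each term to its symbol."""
    term_to_symbol = {}
    for symbol, synonyms in dictionary.items():
        for term in [symbol, *synonyms]:
            term_to_symbol[term.lower()] = symbol
    # Longest terms first so longer names win over shorter substrings.
    ordered = sorted(term_to_symbol, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(r"\b(" + alternation + r")\b", re.IGNORECASE)
    return pattern, term_to_symbol


def tag_entities(text, dictionary):
    """Return (start, end, matched text, canonical symbol) for each hit."""
    pattern, term_to_symbol = build_pattern(dictionary)
    return [
        (m.start(), m.end(), m.group(0), term_to_symbol[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Mutations in myostatin (GDF8) cause double muscling in cattle."
for start, end, mention, symbol in tag_entities(sentence, GENE_SYNONYMS):
    print(start, end, mention, symbol)
```

Both mentions in the example sentence are normalized to the same symbol, which is the step that lets hits from free text be linked back to database entries.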
The aim of this work was to develop an information retrieval platform for the livestock domain, focusing on livestock publications and the recognition of relevant data from cattle and pigs. For this purpose, the rather incomplete pig and cattle gene and protein terminologies were enriched with orthologue synonyms and integrated into the NER platform ProMiner, which is used successfully in the human genomics domain. In performance tests based on cattle literature, the present system achieved fair performance, with a precision of 0.64, a recall of 0.74, and an F1 measure of 0.69.
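The reported F1 measure can be checked against the reported precision and recall, assuming the standard definition of F1 as the harmonic mean of the two. The short Python sketch below does this; the function name `f1_measure` is hypothetical and chosen only for this example.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the cattle literature test scenario.
f1 = f1_measure(0.64, 0.74)
print(round(f1, 2))  # 0.69
```

Because the harmonic mean is pulled toward the lower of the two values, the result sits slightly below the arithmetic mean of 0.64 and 0.74.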