Predicting Player Churn with LLMs: A Comprehensive Evaluation of World Knowledge and Reasoning

Schneider, Tobias; Sparrenberg, Lorenz; Sifa, Rafet

doi:10.1109/DSAA65442.2025.11248010

2025

Conference Paper

Abstract

While large language models (LLMs) have demonstrated impressive results on public benchmarks, their effectiveness in structured, real-world problems like behavioral analytics remains underexplored. This work assesses the out-of-the-box performance of LLMs for industry-specific downstream tasks, with player churn prediction as a representative task. Evaluating LLMs on public benchmarks risks data leakage and task-specific overfitting, so instead we perform experiments on a novel self-compiled dataset for churn prediction, a task not part of any standard benchmark. We compare the performance of OpenAI's GPT-4.1 with traditional machine learning models, such as XGBoost and MLPs, and analyze the impact of the LLM's extensive internal world knowledge and reasoning capabilities. With few-shot prompting, GPT-4.1 achieves a weighted F1 score of 0.787, matching the performance of XGBoost on the same set of samples. We show that the LLM can compensate for missing information with its internal world knowledge and reasoning capabilities, performing best when it can leverage both. Our results highlight the potential of LLMs for cross-game churn prediction and other structured, industry-specific tasks.
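
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a weighted F1 score is commonly computed: the per-class F1 scores are averaged, each weighted by that class's share of the true labels. The abstract does not state how the authors computed their score, so the support-weighted definition used here is an assumption, and the function name `weighted_f1` and the toy labels are hypothetical and not taken from the paper's data.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores (assumed definition)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0

    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class's F1 by its share of the true labels.
        score += (n / total) * f1

    return score


# Hypothetical toy labels: 1 = churned, 0 = retained.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(weighted_f1(y_true, y_pred))
```

Because churn labels are typically imbalanced, weighting by class support keeps the majority class from being under-represented in the average, although it also means the score is dominated by whichever class is more frequent.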