Understanding Open Source Large Language Models: An Exploratory Study

Sowe, Sulayman K.; Mou, Yongli; Cheng, Du; Kong, Lingxiao; Neumann, Alexander Tobias; Decker, Stefan

doi:10.1109/FLLM63129.2024.10852438

2024

Conference Paper

Abstract

The increasing dominance of proprietary Large Language Models (LLMs), such as OpenAI’s GPT-4 and Google’s Gemini, has raised concerns about data privacy, accessibility, and bias, leading to growing advocacy for Open Source Large Language Models (OSLLMs). This study investigates OSLLMs, exploring their characteristics, openness, and community interactions. Our research aims to define OSLLMs in terms of license, openness, and community engagement. Using data from the Hugging Face platform, we examine the popularity metrics, license distribution, artefact accessibility, and community engagement of LLM projects. Findings reveal a skewed distribution of model usage, with a few models dominating downloads and likes. Apache 2.0 and MIT are the most common licenses among top models, highlighting a preference for flexible usage terms. However, a significant portion of models lack a specified license, posing potential legal challenges. Openness analysis shows that nearly half of the examined models share their training code and datasets, and that standardized evaluation metrics are common across repositories. Community engagement analysis indicates that engineer users are more active than general users and contribute significantly to technical discussions. Sentiment analysis of forum interactions reveals varying user attitudes, with licensed models generally receiving more positive feedback. This study underscores the potential of OSLLMs to democratize AI access and foster innovation, while also highlighting areas for improvement in community engagement and model openness.