The Perils of Generative Model Inbreeding: Evaluating the Consequences of Cross-Model Training in Large Language Models
Authors
Stein, Gabrielle
Publisher
East Carolina University
Abstract
What happens when the output of generative AI models is included in the training data of new models? With the rise of generative AI content online, and considering that most training data for AI models is sourced from the Internet, concerns have arisen about how this generated content might taint future training datasets. Existing research has evaluated the effect of models consuming their own output and has shown that the output of self-consuming models degrades with each successive generation of re-training, a phenomenon termed "model collapse." This degradation takes the form of a loss of diversity in the model's output. Currently, there is limited research on the impact of models consuming other models' output, particularly for large language models. In this study, we aimed to determine the effect of training a model on a different model's output. Additionally, we developed a potential solution to prevent "model collapse." Ensuring that the majority of training data is human-generated (non-synthetic) has been shown to mitigate the loss of diversity caused by "model collapse." Given that AI models are here to stay, the methods for developing new models will need to evolve to address this issue, ensuring that AI development can continue to progress and improve.
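Although the thesis evaluates full language models, the dynamic the abstract describes can be illustrated with a toy distribution-fitting loop. The sketch below is an illustration only, not the study's actual method; the Gaussian setup, the function name run_generations, and the human_fraction parameter are all assumptions. It repeatedly re-fits a one-dimensional Gaussian to samples from the previous generation's fit ("synthetic" data), optionally mixed with fresh draws from the original "human" distribution, and reports how the fitted standard deviation (a proxy for output diversity) evolves.

    # Toy sketch of model collapse and a majority-human-data mitigation.
    # Illustrative assumptions throughout; not the thesis's experimental setup.
    import numpy as np

    rng = np.random.default_rng(0)

    def run_generations(n_generations=200, n_samples=100, human_fraction=0.0):
        """Re-fit a 1-D Gaussian generation after generation.

        Each generation 'trains' (fits mean/std) on a mix of samples from the
        previous generation's model and fresh samples from the original
        'human' distribution N(0, 1). Returns the fitted std per generation.
        """
        mu, sigma = 0.0, 1.0  # generation 0: the human distribution itself
        history = [sigma]
        for _ in range(n_generations):
            n_human = int(n_samples * human_fraction)
            n_synth = n_samples - n_human
            synthetic = rng.normal(mu, sigma, n_synth)  # previous model's output
            human = rng.normal(0.0, 1.0, n_human)       # fresh human data
            data = np.concatenate([synthetic, human])
            mu, sigma = data.mean(), data.std()         # "retrain" on the mix
            history.append(sigma)
        return history

    # Fully self-consuming: the fitted std drifts and, over many generations,
    # typically shrinks toward 0 (loss of diversity).
    collapsed = run_generations(human_fraction=0.0)
    # Majority-human mix: the human data anchors each fit, so the std stays near 1.
    anchored = run_generations(human_fraction=0.6)
    for g in (0, 50, 100, 200):
        print(f"gen {g:3d}  std (0% human) = {collapsed[g]:.3f}  "
              f"std (60% human) = {anchored[g]:.3f}")

In the self-consuming run, each generation's estimation error compounds, so diversity tends to decay; in the anchored run, the majority of fresh human samples repeatedly pulls the fit back toward the original distribution's statistics, which is the intuition behind the mitigation described above.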
