Part of speech distributions for Grimm versus artificially generated fairy tales

ChatGPT is a chatbot that relies on GPT-3 and later OpenAI transformer language models to generate responses to user prompts. Motivated by the dramatic increase in the quality of natural language generation from large language models, popularized by ChatGPT, we investigated the statistical differences between human-written and artificially generated text. To constrain the problem, we considered fairy tales, a genre that has existed for centuries. To explore statistical differences, we focused on the distribution of words according to their parts of speech (POS), categories that characterize words by their grammatical function. We generated a novel corpus of 101 fairy tales “authored” by ChatGPT and compared it against 209 fairy tales written by the Brothers Grimm and freely available online. We hypothesized that the POS distributions of Grimm and ChatGPT fairy tales would differ, and that POS distributions would vary more among Grimm fairy tales than among ChatGPT fairy tales. After appropriate preprocessing, we computed total variation distances between individual fairy tales, both within and between authorship conditions. We found that the distribution of POS in ChatGPT fairy tales is indeed significantly different from the distribution of POS in Grimm fairy tales.
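For readers unfamiliar with the metric, the total variation distance between two POS distributions can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes each tale has already been tagged by some POS tagger (the abstract does not name one), and the function names `pos_distribution` and `total_variation_distance`, the tag labels, and the two toy tag sequences are all hypothetical.

```python
from collections import Counter


def pos_distribution(tags):
    """Turn a sequence of POS tags into relative frequencies (hypothetical helper)."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from an empty tag sequence")
    return {tag: n / total for tag, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two discrete distributions given as dicts."""
    support = set(p) | set(q)  # a tag missing from one tale has probability 0 there
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in support)


# Toy tag sequences, invented for illustration only (not data from the study).
tale_a = ["DET", "NOUN", "VERB", "DET", "NOUN"]
tale_b = ["DET", "ADJ", "NOUN", "VERB", "ADV"]

tvd = total_variation_distance(pos_distribution(tale_a), pos_distribution(tale_b))
print(f"Total variation distance: {tvd:.3f}")
```

The distance ranges from 0 for identical distributions to 1 for distributions with no tags in common. Computing it for every pair of tales within one corpus and for every pair across the two corpora would yield the kind of within-condition and between-condition comparisons described above.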