The End of Language Data
“As of September 2021” is the caveat ChatGPT gives when you ask it for a fact. That is the cutoff of the training data behind the large language model. Adding parameters and retraining the model is an ever larger haul, demanding more and more resources to build and run. Extrapolate far enough and, at some point, it requires all the energy in the universe.
But there’s a bigger problem. Let’s say the doomsayers are right and generative AI becomes the de facto language tool for communication between people and services. There will be very little “genuine” human communication left that is untainted by AI edits or rewrites. The next training set will be corrupted.
So this is it. As of September 2021, we have one of the last corpora of “pure” human English writing. From now on, the next trillion-parameter model will be trained on text that is increasingly generative AI output. Will training on AI output affect the gene pool of language? Will the result be an eventual convergence on a single style of communication? Or will AI be capable of creating enough variance to maintain a stable language gene pool?
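A toy simulation can make the “gene pool” question concrete. The sketch below is only an analogy, and it assumes stylistic variety can be stood in for by the spread of a simple distribution; it says nothing about how any real model is actually trained. Fit a model to a corpus, sample from it to build the next corpus, repeat, and watch what happens to the spread.

```python
import numpy as np

# Toy sketch of the "gene pool" worry, not a description of real model training:
# treat stylistic variety as the spread of a distribution. Each generation, a
# simple model is fit to the previous generation's output and then sampled to
# produce the next training set. Because every fit is made from a finite sample,
# estimation noise compounds, and the spread tends to collapse over generations.

rng = np.random.default_rng(42)

n_samples = 50       # size of each synthetic "training set" (deliberately small)
generations = 500

# Generation 0: stand-in for human-written text, with unit "stylistic variance".
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for g in range(1, generations + 1):
    mu, sigma = data.mean(), data.std()           # "train" on the current corpus
    data = rng.normal(mu, sigma, size=n_samples)  # generate the next corpus
    if g % 100 == 0:
        print(f"generation {g:4d}: spread = {sigma:.4f}")
```

In this toy setup, the spread eventually drifts toward zero: each generation can only echo the one before it, minus a little noise. Whether real language models behave the same way is exactly the open question.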
These aren’t far-out questions. By the end of this decade, these language tools will be so widely used and embedded in so many different products that they will change how we write and communicate with each other. We may want to think about what we can do to preserve human language.