Death by LLM: Stack Overflow's decline, and its plan to survive, shows the future of free online data in an AI world

Alistair Barr, Adam Rogers
Aug 3, 2023, 20:55 IST

Business Insider

Stack Overflow CEO Prashanth Chandrasekar at the WeAreDevelopers World Congress in Berlin in July 2023. (Stack Overflow)

Stack Overflow, an online community for software coders, has seen traffic fall since GPT-4 came out.
Some AI models that compete against Stack Overflow were partly trained on the company's data.

OpenAI released the world's most powerful AI model in March. A few weeks later, Stack Overflow CEO Prashanth Chandrasekar spotted a worrying trend.

Online traffic to the Q&A website for software coders had begun to slip. In April, traffic was down about 13% from 2022, the company's data showed.

For 15 years, Stack Overflow has been the online community where software engineers go to ask questions and get tips from fellow coders. Now, though, they can just ask OpenAI's GPT-4, ChatGPT, Codex, or GitHub Copilot for help. So there's less need to visit Stack Overflow.

What's even more galling is that many of these new models were partly trained on Stack Overflow's information, which is freely available online and has been packaged up into a handy AI training dataset.

"Some of them are very explicit about calling out Stack Overflow as a primary source," Chandrasekar told Insider in a recent interview.

Welcome to the future of the internet in an AI world. Online communities, like Stack Overflow and Wikipedia, thrived as hubs for experts and curious browsers to come together and share information freely. Now, these digital meeting places are being pillaged by big tech companies prowling for human data to train their large language models.

The new products emerging from this generative AI boom are putting the future of these online forums in doubt. The chatbots answer questions clearly, automatically, and often pleasantly — so humans don't need to deal with other humans to get information.

"Death by LLM," Elon Musk called it recently.

How Stack Overflow responds, and tries to survive, has broad ramifications for thousands of other businesses that make money by posting and hosting information online for free. It also reveals a looming problem at the heart of the AI revolution: With less incentive for people to go online and answer questions, the rich, human data that AI needs for training will wither and the quality of these models could degrade.

'This phenomenal community resource'

Stack Overflow CEO Prashanth Chandrasekar on stage. (Stack Overflow)

Years ago, when Chandrasekar was a young developer, he would write and debug code late at night. Early on, there was no one to lean on when he hit a roadblock, an experience that left "a lot of scars," he recalled.

"This is the reason why Stack Overflow in many ways became so popular," Chandrasekar explained. "You don't have to spend an inordinate amount of hours into the night trying to figure things out. You have this phenomenal community resource, and that can only be appreciated by people who have experienced that."

With more than 200 million monthly visits from desktop computers alone, the service thrived. The startup raised over $130 million from marquee Silicon Valley investors like Andreessen Horowitz and Union Square Ventures. In 2021, Prosus, a major backer of Chinese tech giant Tencent, bought Stack Overflow for $1.8 billion.

The rise of GPT models

Sam Altman, the CEO of OpenAI, and an illustration of GPT-4. (Jason Redmond/AFP via Getty Images; Jaap Arriens/NurPhoto via Getty Images)

Right around that time, a new type of AI model was gaining traction. It combined two existing ideas in the AI field: transformers and unsupervised pre-training. The result was a generative pre-trained transformer, or GPT. OpenAI released GPT-3 in 2020 and made it available to everyone in November 2021.

A year later, it also launched ChatGPT, a stunningly popular chatbot built on the GPT-3.5 model. GPT-4, considered the most capable AI model to date, came out in March. These models are surprisingly helpful when answering software coding questions.

There are also specific AI-powered coding services now, including OpenAI's Codex and GitHub Copilot. The latest version of GitHub Copilot, powered by GPT-4, was released in March. This update added powerful chat features, so coders can ask the model to explain software code, show it errors, and get suggested fixes.

This new technology is already helping engineers churn out more software faster. So it's not surprising that coders are relying less on Stack Overflow.

"I used to go to stackoverflow everyday before chatGPT was a thing," developer Nasim Uddin wrote on Twitter recently. "But nowadays I never have to go to stackoverflow."

Stack Overflow wants to be paid for its training data

Stack Overflow is responding in two main ways. One is to start charging the tech companies that have been using its data for free to train competing AI services.

"We are entering this new era," Chandrasekar said. "People who are leveraging our data for LLM purposes, we took a position several months ago that they should engage with us."

"We should be able to be paid for that data," he added. "The large companies have proactively reached out to us, and we're effectively engaged in those conversations at the moment."

Chandrasekar declined to name any of these companies. However, Nat Friedman, the CEO of GitHub through 2021, expects tech companies to pay for training data in the future.

"When StackOverflow is fully dead (due to long congenital illness, self-inflicted wounds, and the finishing blow from AI), where will AI labs get their training data?" he wrote on Twitter recently. "They can just buy it!"

Billions, not millions

Nat Friedman. (GitHub)

Friedman even did some back-of-the-envelope math on potential deals like this: Assuming 10,000 "quality" answers per week, and $250 per answer, that works out to $130 million a year. "Even at multiples of this estimate, quite affordable for large AI labs and big tech companies who are already spending much more than this on data," Friedman added.
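
Friedman's arithmetic can be checked in a few lines. The sketch below uses only the two figures he quoted; the 52-week year and the example multiples (2x, 5x, 10x) are assumptions added here to illustrate what "multiples of this estimate" might look like, not numbers from Friedman.

```python
# Back-of-the-envelope check of the estimate quoted above.
ANSWERS_PER_WEEK = 10_000      # "quality" answers per week (quoted figure)
PRICE_PER_ANSWER_USD = 250     # dollars per answer (quoted figure)
WEEKS_PER_YEAR = 52            # assumption: a plain 52-week year


def annual_cost(answers_per_week: int, price_per_answer: int,
                weeks_per_year: int = WEEKS_PER_YEAR) -> int:
    """Yearly spend in dollars for buying answers at a flat per-answer price."""
    return answers_per_week * price_per_answer * weeks_per_year


baseline = annual_cost(ANSWERS_PER_WEEK, PRICE_PER_ANSWER_USD)
print(f"Baseline: ${baseline:,} per year")

# Hypothetical multiples, chosen only for illustration.
for multiple in (2, 5, 10):
    print(f"{multiple}x: ${baseline * multiple:,} per year")
```

The baseline comes to $130 million a year, matching Friedman's figure, and even the hypothetical 10x case stays at $1.3 billion.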

This has huge implications for a host of other online businesses that have seen their information scraped and used for AI model training. Publishers, for instance, want to be paid billions of dollars for their online content, according to Semafor. (Axel Springer, the owner of Insider, is part of a coalition that's forming to push for payments and legislative action, Semafor also reported.)

If GitHub, owned by Microsoft, is prepared to pay for training data, then this could become standard industry practice, meaning Google, OpenAI, Meta, Amazon and other industry giants hand over large payments to providers of human content. Indeed, OpenAI has already signed a content licensing agreement with The Associated Press.

If you can't beat them, join them

Stack Overflow CEO Prashanth Chandrasekar introduces OverflowAI. (Stack Overflow)

Stack Overflow's second strategy is to develop its own AI models, trained not only on its public data but also on masses of proprietary information.

The company's data is mostly arranged in a Q&A format, from coders asking questions online and getting different answers that are then voted on by other members of the community. It claims to have 58 million questions and answers. This is well-suited to training AI models and chatbots, according to Chandrasekar.

The questions are like prompts, which the models need to make associative connections. The answers provide statistical next-word connections and vector-based ones that let them figure out synonyms and other more magical-seeming capabilities. Upvotes and downvotes tell the models to give certain sets of words higher statistical priority, or to de-prioritize other text.
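
One way to picture this is a sketch that turns a single Q&A thread into weighted training examples. Everything here is hypothetical: the `Answer` and `Question` classes, the `to_training_examples` function, and the weighting rule (each positively scored answer weighted by its share of the positive votes) are illustrative assumptions, not a description of how Stack Overflow or any AI lab actually prepares its data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Answer:
    body: str
    score: int  # upvotes minus downvotes (hypothetical field)


@dataclass
class Question:
    title: str
    answers: List[Answer]


def to_training_examples(question: Question) -> List[Tuple[str, str, float]]:
    """Return (prompt, completion, weight) tuples for one thread.

    Assumption: answers with a net score of zero or below are dropped,
    and the rest are weighted by their share of the positive votes.
    """
    positive = [a for a in question.answers if a.score > 0]
    total = sum(a.score for a in positive)
    return [(question.title, a.body, a.score / total) for a in positive]


thread = Question(
    title="How do I reverse a list in Python?",
    answers=[
        Answer("Use reversed(my_list) or my_list[::-1].", score=30),
        Answer("Call my_list.reverse() to reverse in place.", score=10),
        Answer("Write a loop that swaps every element by hand.", score=-2),
    ],
)

examples = to_training_examples(thread)
weights = [weight for _, _, weight in examples]
for prompt, completion, weight in examples:
    print(f"{weight:.2f}  {prompt!r} -> {completion!r}")
```

In this toy thread the downvoted answer is left out entirely, and the most upvoted answer carries three times the weight of the runner-up.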

The first fruits of this labor appeared on July 27, when Stack Overflow announced OverflowAI, which uses generative AI to automatically answer people's coding questions. The new system, which is in an early test version, taps into the company's existing corpus of 58 million questions and answers to create an instant summarized answer.

Chandrasekar says the new technology will likely be nicer when answering questions. In the past, when less-experienced users asked basic questions that had already been solved by the Stack Overflow community, some experts could be rude when answering and that made people nervous about engaging on the platform. Now, Chandrasekar hopes more people will come to Stack Overflow to ask questions without fear.

Avoiding Model Collapse

A bigger challenge is where Stack Overflow will get answers to future coding questions. Its new AI service is designed to answer questions automatically, so why would human experts keep coming back to this online community to provide their own input and ideas?

This issue goes beyond the survival of Stack Overflow. All AI models need a steady flow of quality human data to train on. Without that, they will be left to rely on machine-generated content, and researchers have already found that this leads to worse performance. There's an ominous name for this now: Model Collapse.

"That's a very real thing, which is why you absolutely need really solid high quality sources of truth like Stack Overflow forever," Chandrasekar said. "If you don't have that, then you're gonna run into that situation."

Incentive mechanics

So how will Stack Overflow keep human software experts coming back to its online community?

Chandrasekar said Stack Overflow is not going to pay experts for their contributions. He argues this misunderstands why developers and other tech experts share their knowledge online for free. It's about showcasing their expertise, getting validation from peers, and improving software for everyone, the CEO explained.

So, Stack Overflow is planning to take some of the money it gets from tech companies paying for its training data, and reinvest that in new "mechanisms" to incentivize human coding experts to continue answering questions.

Those are a work in progress, but the CEO said there are many levers to pull. Firstly, automated answers from the new OverflowAI system include citations and references back to the most accurate and useful human answers.

Stack Overflow is also working on new ways to measure the impact of a human's answer on the platform. As more AI-generated answers show up, experts will get new forms of credit for their contributions, and that should keep them coming back to the site.

"We're working on the incentive mechanics of how to make sure those folks get credit, even when a generative AI answer is delivered to, let's say, a novice programmer who was able to solve their immediate problem," the CEO explained. "We are thinking through the details of that incentive system at the moment as we test this with our users over the next couple of months."

Time is of the essence. As you read this, GPT-4 and its powerful AI brethren are busy answering thousands of coding questions, far away from Stack Overflow.
