Stack Overflow Will Charge AI Giants for Training Data

Developing the AI systems behind tools such as ChatGPT and the image generator Dall-E costs hundreds of millions of dollars—and it’s about to get more expensive.

OpenAI, Google, and other companies building large-scale AI projects have traditionally paid nothing for much of their training data, scraping it from the web. But Stack Overflow, a popular internet forum for computer programming help, plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service, CEO Prashanth Chandrasekar says. The site has more than 20 million registered users.

Stack Overflow’s decision to seek compensation from companies tapping its data, part of a broader generative AI strategy, has not been previously reported. It follows an announcement by Reddit this week that it will begin charging some AI developers to access its own content starting in June.

The two community sites are not alone in wanting a share. The News/Media Alliance, a US trade group of publishers, including Condé Nast, which owns WIRED, today unveiled principles calling on generative AI developers to negotiate any use of their data for training and other purposes and respect their right to fair compensation.

Meta, Google, and OpenAI—maker of ChatGPT—all have developed AI systems using data sets that culled content from thousands of online sources, including Stack Overflow and Reddit, according to outside analyses and their own disclosures. Feeding text from online banter or expert discussions about programming into machine learning algorithms known as large language models, or LLMs, can help AI text generators or chatbots be more fluent and knowledgeable. Using LLMs to generate programming code is viewed as one of the technology's biggest opportunities, with Microsoft charging as much as $19 a month per person for its code generator GitHub Copilot.

“Community platforms that fuel LLMs absolutely should be compensated for their contributions so that companies like us can reinvest back into our communities to continue to make them thrive,” Stack Overflow’s Chandrasekar says. “We're very supportive of Reddit’s approach.”

Chandrasekar described the potential additional revenue as vital to ensuring Stack Overflow can keep attracting users and maintaining high-quality information. He argues that this will also help future chatbots, which need “to be trained on something that's progressing knowledge forward. They need new knowledge to be created.” But fencing off valuable data could also deter some AI training and slow improvement of LLMs, which are a threat to any service that people turn to for information and conversation. Chandrasekar says proper licensing will only help accelerate development of high-quality LLMs.

Every AI developer is seeking to bring down the huge costs of developing large-scale AI systems, which require enormous numbers of expensive computers to run. Having to pay for data they once grabbed for free could extend the already unclear timelines for turning a profit on their emerging technologies. OpenAI did not respond to a request for comment, and Meta and Google did not have immediate comment.
