The raw materials for creating AI

Sep 16, 2023, 06:48 IST
Business Insider
ChatGPT developer Sam Altman has sparked a wave of AI chatter in Corporate America. (picture alliance)
  • The generative AI boom is fueling a 'shadow war for data.'
  • AI companies have used information scraped from the internet for model training.

The generative AI boom started with the stunning success of ChatGPT in late 2022. Now, seemingly every company is trying to use the technology.

The AI models behind this technology are built using high-quality datasets from millions of different sources. These are the raw materials for model "training," in industry parlance.

"This is the secret story just beneath the surface of what's happening," former GitHub CEO Nat Friedman said in a recent interview with tech analyst Ben Thompson.

Nvidia GPUs are the main hardware required for AI model training.

"But the other key input is data," Friedman said. "So there is currently happening beneath the surface, a shadow war for data where the largest AI labs are spending huge amounts of money, like huge amounts of money, to acquire more valuable tokens, either paying experts to generate it, working through labeling companies."


Scraped from the internet

A lot of this training data has been scraped from the internet and used without permission.

Tech companies, hungry for even more training data, are also granting themselves new permissions to use a lot more of your information.

The use of information scraped from the internet has sparked a debate about the future of copyright and licensing in this new AI world.

Online communities based on the sharing of free information are also being upended. Why continue to share online when that data will likely be sucked into an AI model that ends up competing with you later?

Data from Stack Overflow, a popular coding Q&A website, has been used for AI model training. In recent months, it has seen traffic fall as AI models now offer coding answers directly, negating the need to visit the site and ask questions.


There's a backlash brewing

Companies, content creators and other web businesses are waking up to the realization that their work is being secretly used against them.

This is undermining the grand bargain of the web, and sparking a backlash.

"Media companies are starting to wake up and realize a lot of their information has been stolen — probably some of yours, too," said Marc Benioff, CEO of Salesforce and the owner of Time magazine. "As a media owner, it's a major issue, because I do go to the models, and I'll find material from Time magazine in there and go, 'Wait a minute, that's my content,'" he added.

More websites are blocking web crawlers, the technical tools that prowl the web scooping up data for AI model training. GPTBot, from ChatGPT creator OpenAI, was blocked by more than 15% of the 100 most popular websites in just two weeks, including Amazon and Quora, Insider reported in August.
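This blocking typically happens through a site's robots.txt file. A minimal sketch of what such a rule looks like, using the crawler name OpenAI has published for GPTBot (the paths shown are illustrative):

```
# robots.txt — ask OpenAI's GPTBot crawler not to fetch any pages
User-agent: GPTBot
Disallow: /

# Other crawlers remain free to index the site
User-agent: *
Allow: /
```

Note that robots.txt is a voluntary convention: it asks well-behaved crawlers to stay away but cannot technically enforce the block.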


Reddit is demanding to be paid for its data, which is a common source of AI model training.

LexisNexis, a leading provider of legal information, has had to warn customers not to upload or share its data with AI models and related bots.

Sarah Silverman sued OpenAI and Meta, alleging they used her book without compensation or permission to train their AI models.

Over 8,000 authors, including Margaret Atwood and James Patterson, signed an open letter demanding compensation from AI companies for using their works to train AI without permission.

Efforts to avoid legal risk

AI companies are responding, mostly by trying to reduce legal risks.


Meta and other tech companies have stopped disclosing the data they use to train AI models. This is partly for competitive reasons, but observers say it is also to avoid legal exposure.

OpenAI's ChatGPT is trying to hide that it was trained on copyrighted material such as JK Rowling's Harry Potter book series, according to research published in August.

Other researchers have developed an AI model that can remove data to reduce legal risks. In the process, they also created a way to measure how specific data contributes to an AI model's output.

Got a tip or insights about the leading AI companies OpenAI, Google, Microsoft and Meta? Contact Alistair Barr at abarr@insider.com, or through Twitter DM @alistairmbarr.

Reach out to Kali Hays at khays@insider.com, on secure messaging app Signal at 949-280-0267, or through Twitter DM at @hayskali. Reach out using a non-work device.
