The New York Times got its content removed from one of the biggest AI training datasets. Here's how it did it.
- The New York Times discovered that one of the biggest AI training datasets contained millions of links to its copyrighted content.
- The media company also found its content in other AI training datasets, such as WebText.
By now, most major online content creators realize tech companies have been using their copyrighted work for years to train AI models without permission or payment.
Some of these content owners are taking action, and a few are beginning to succeed in stopping the practice.
The New York Times discovered that Common Crawl, one of the largest AI training datasets, contained millions of URLs linking to its paywalled articles and other copyrighted content.
Common Crawl was built by scraping most of the web using crawling software called CCBot. The foundation that runs this operation says it has amassed more than 250 billion pages since 2007, with up to 5 billion new pages added each month.
This provides the training-data backbone for many large language models, including OpenAI's GPT-3. Google's Infiniset gets 12.5% of its data from C4, a cleaned-up version of Common Crawl.
Large language models need high-quality training data like this to perform well. The New York Times, however, doesn't want its work feeding these models, because they deliver answers directly to users instead of sending them to the original source of the information.
In essence, this new technology uses NYT's own copyrighted content to siphon away NYT readers and paying subscribers.
Common Crawl request
So, earlier this year, The New York Times reached out to the Common Crawl Foundation to get its content pulled from the dataset.
"We simply asked that our content be removed, and were pleased that Common Crawl complied with our request and recognized The Times's ownership of our quality journalistic content," Charlie Stadtlander, a spokesman at The New York Times, told Insider.
Common Crawl also agreed not to scrape any more NYT content in the future, according to a recent letter the media company wrote to the US Copyright Office.
CCBot crackdown
Other content creators have tried to stop Common Crawl, too. As of late September, almost 14% of the 1,000 most popular websites were blocking CCBot, according to data from Originality.ai. Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New Yorker, and The Atlantic. Common Crawl did not respond to a request for comment this week.
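Sites generally block CCBot by adding a rule for the "CCBot" user-agent to their robots.txt file, which Common Crawl says its crawler honors. As a rough illustration only (not the Times's or Originality.ai's actual tooling, and the example URL is a placeholder), a short Python sketch using the standard library can check whether a given site currently disallows CCBot:

```python
# Minimal sketch: check whether a site's robots.txt disallows Common Crawl's
# crawler. A site that blocks it typically serves rules like:
#   User-agent: CCBot
#   Disallow: /
from urllib import robotparser

def blocks_ccbot(site: str) -> bool:
    """Return True if the site's robots.txt forbids CCBot from fetching its homepage."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{site.rstrip('/')}/robots.txt")
    parser.read()  # fetch and parse the live robots.txt
    return not parser.can_fetch("CCBot", f"{site.rstrip('/')}/")

if __name__ == "__main__":
    # Placeholder domain for illustration; substitute any publisher's site.
    print(blocks_ccbot("https://example.com"))
```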
The New York Times has found its paywalled articles and other copyrighted content in other popular AI training datasets as well. A recreated version of WebText, the dataset used to train OpenAI's GPT-2, contained NYT content accounting for 1.2% of the entire dataset, the media company noted in its letter to the US Copyright Office.
"Once powered with our content, GAI tools can do a number of things with it, including reciting it verbatim, summarizing it, drafting new content with a similar style of expression, and using it to generate misinformation attributed to The Times that appears to be fact," the NYT added in the letter.
It's unclear if The New York Times has managed to get its content removed from WebText and other AI training datasets.