Major websites like Amazon and the New York Times are increasingly blocking OpenAI's web crawler GPTBot
- OpenAI said this month it was using its own web crawler to collect training data for ChatGPT.
- It promised not to crawl websites that deploy a decades-old web tool, robots.txt.
Dozens of large companies including Amazon and The New York Times have rushed to block GPTBot, a tool that OpenAI recently announced it was using to crawl the web for data that would be fed to its popular chatbot, ChatGPT.
As of this week, 70 of the world's top 1,000 websites have moved to block GPTBot, the web crawler OpenAI revealed two weeks ago it was using to collect massive amounts of information from the internet to train ChatGPT. Originality.ai, a company that checks content to see if it's AI-generated or plagiarized, conducted an analysis that found more than 15% of the 100 most popular websites have decided to block GPTBot in the past two weeks.
The six largest websites now blocking the bot are amazon.com (along with several of its international counterparts), nytimes.com, cnn.com, wikihow.com, shutterstock.com, and quora.com.
The top 100 sites blocking GPTBot include bloomberg.com, scribd.com, and reuters.com, as well as insider.com and businessinsider.com. Among the top 1,000 sites blocking the bot are ikea.com, airbnb.com, nextdoor.com, nymag.com, theatlantic.com, axios.com, usmagazine.com, lonelyplanet.com, and coursera.org.
"GPTBot launched 14 days ago and the percentage of Top 1,000 sites blocking it has been steadily increasing," the analysis said.
How these websites block GPTBot is relatively simple, even crude, depending on your perspective. Each site hosts a file called robots.txt, and GPTBot has been added to the file's "disallow" list.
Robots.txt is a web convention created in the 1990s that tells crawlers, such as Google's or Bing's search bots, which parts of a site they should not access. When revealing the crawler, OpenAI said it would abide by robots.txt and that GPTBot would not crawl websites that deploy it.
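In practice, the block many of these sites added amounts to two lines in their robots.txt file, matching the user-agent string OpenAI published for its crawler:

```
# robots.txt — disallow OpenAI's crawler from the entire site
User-agent: GPTBot
Disallow: /
```

Rules for other crawlers, such as search engines, can sit alongside this entry in the same file, so blocking GPTBot does not affect a site's visibility in search results.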
Much of what is available on the internet, particularly text and images, is technically under copyright. Crawlers like GPTBot do not ask permission, obtain licenses, or pay to use any data or information they extract. The only way to avoid them at this point is through robots.txt, although companies that deploy crawlers are not legally bound to honor robots.txt restrictions.
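That honor-system dynamic can be seen in how a well-behaved crawler works: it parses a site's robots.txt before fetching pages and simply declines disallowed URLs. A minimal sketch using Python's standard-library `urllib.robotparser` (the two-line rule and the example URL are illustrative):

```python
from urllib import robotparser

# Parse the two-line rule many sites have added for GPTBot.
# (In a real crawler you would fetch https://example.com/robots.txt instead.)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# A compliant crawler checks can_fetch() before every request.
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: GPTBot is disallowed
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: other crawlers are unaffected
```

Nothing in this check is enforced by the web server; a crawler that skips the `can_fetch` call can still download every page, which is why robots.txt is an opt-out request rather than a technical barrier.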
There's been an increasing awareness about copyright rules and the ownership of data these crawlers take to train AI projects based on large language models, or LLMs, as tools like ChatGPT have exploded onto the tech scene. Several lawsuits are already in the works. The author Stephen King, after learning his books have been used in AI training sets, said he's looking to the future with a "certain dreadful fascination."
For its part, OpenAI has tried to conceal that ChatGPT was trained on any copyrighted material.
A representative of OpenAI could not be immediately reached for comment.
See below for the full list of the biggest websites that blocked GPTBot between August 8 and August 22:
amazon.com
quora.com
nytimes.com
shutterstock.com
wikihow.com
cnn.com
foursquare.com
healthline.com
scribd.com
businessinsider.com
reuters.com
medicalnewstoday.com
amazon.co.uk
insider.com
yourdictionary.com
slideshare.net
amazon.de
bloomberg.com
amazon.in
studocu.com
ikea.com
uol.com.br
amazon.fr
geeksforgeeks.org
pcmag.com
theverge.com
nextdoor.com
amazon.ca
amazon.co.jp
airbnb.com
vulture.com
polygon.com
prnewswire.com
mashable.com
nymag.com
detik.com
theatlantic.com
trulia.com
amazon.es
eater.com
picclick.com
bustle.com
etymonline.com
teacherspayteachers.com
archiveofourown.org
vox.com
kumparan.com
theathletic.com
amazon.it
alltrails.com
thrillist.com
amazon.com.br
usmagazine.com
pikiran-rakyat.com
city-data.com
hellomagazine.com
stern.de
chicagotribune.com
spanishdict.com
lonelyplanet.com
inverse.com
actu.fr
fool.com
coursera.org
france24.com
myfitnesspal.com
dotesports.com
theglobeandmail.com
axios.com