Major websites like Amazon and the New York Times are increasingly blocking OpenAI's web crawler GPTBot

Aug 25, 2023, 03:04 IST
Business Insider
The New York Times' office and Sam Altman, OpenAI CEO. Lindsey Nicholson/UCG/Universal Images Group via Getty Images; Win McNamee/Getty Images
  • OpenAI said this month it was using its own web crawler to collect training data for ChatGPT.
  • It promised not to crawl websites that deploy a decades-old web tool, robots.txt.

Dozens of large companies including Amazon and The New York Times have rushed to block GPTBot, a tool that OpenAI recently announced it was using to crawl the web for data that would be fed to its popular chatbot, ChatGPT.

As of this week, 70 of the world's top 1,000 websites have moved to block GPTBot, the web crawler OpenAI revealed two weeks ago was being used to collect massive amounts of information from the internet to train ChatGPT. Originality.ai, a company that checks content to see if it's AI-generated or plagiarized, conducted an analysis that found more than 15% of the 100-most-popular websites have decided to block GPTBot in the past two weeks.

The six largest websites now blocking the bot are amazon.com (along with several of its international counterparts), nytimes.com, cnn.com, wikihow.com, shutterstock.com, and quora.com.

The top 100 sites blocking GPTBot include bloomberg.com, scribd.com, and reuters.com, as well as insider.com and businessinsider.com. Among the top 1,000 sites blocking the bot are ikea.com, airbnb.com, nextdoor.com, nymag.com, theatlantic.com, axios.com, usmagazine.com, lonelyplanet.com, and coursera.org.

"GPTBot launched 14 days ago and the percentage of Top 1,000 sites blocking it has been steadily increasing," the analysis said.

Graph from Originality AI showing increase in blocking of GPTBot. Originality AI

How these websites block GPTBot is relatively simple, even crude, depending on your perspective. The sites include a file called robots.txt, and GPTBot has been added to its "disallow" list.

Robots.txt is a tool created in the 1990s meant to stop web crawlers, such as Google's or Bing's search crawlers, from extracting data and information from a website. When revealing the crawler, OpenAI said it would abide by robots.txt and that GPTBot would not crawl websites that use the file to disallow it.
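To illustrate the mechanism described above, here is a minimal sketch in Python of how a robots.txt rule that disallows GPTBot is interpreted. The file contents, the example.com URL, and the helper name is_allowed are hypothetical and chosen for illustration; they are not taken from any of the sites named in this article, and this is not how OpenAI's crawler is implemented. The sketch relies only on Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block GPTBot everywhere,
# allow every other crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL
    print("GPTBot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "GPTBot", page))
    print("Googlebot allowed:", is_allowed(SAMPLE_ROBOTS_TXT, "Googlebot", page))
```

With these sample rules, the check reports that GPTBot is disallowed while a crawler with no dedicated entry, such as Googlebot, falls through to the wildcard entry and is allowed. The file only states a site's preference; a crawler has to run a check like this voluntarily before fetching a page.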

Much of what is available on the internet, particularly text and images, is technically under copyright. Crawlers like GPTBot do not ask for permission, obtain a license, or pay to use any data or information they extract. The only way to avoid them at this point is through robots.txt, although companies that deploy crawlers are not legally bound to recognize robots.txt restrictions.

There's been an increasing awareness about copyright rules and the ownership of data these crawlers take to train AI projects based on large language models, or LLMs, as tools like ChatGPT have exploded onto the tech scene. Several lawsuits are already in the works. The author Stephen King, after learning his books have been used in AI training sets, said he's looking to the future with a "certain dreadful fascination."

For its part, OpenAI has taken to trying to hide that ChatGPT was trained on any copyrighted material.

A representative of OpenAI could not be immediately reached for comment.

Below is the full list of the biggest websites that blocked GPTBot between August 8 and August 22:

amazon.com

quora.com

nytimes.com

shutterstock.com

wikihow.com

cnn.com

foursquare.com

healthline.com

scribd.com

businessinsider.com

reuters.com

medicalnewstoday.com

amazon.co.uk

insider.com

yourdictionary.com

slideshare.net

amazon.de

bloomberg.com

amazon.in

studocu.com

ikea.com

uol.com.br

amazon.fr

geeksforgeeks.org

pcmag.com

theverge.com

nextdoor.com

amazon.ca

amazon.co.jp

airbnb.com

vulture.com

polygon.com

prnewswire.com

mashable.com

nymag.com

detik.com

theatlantic.com

trulia.com

amazon.es

eater.com

picclick.com

bustle.com

etymonline.com

teacherspayteachers.com

archiveofourown.org

vox.com

kumparan.com

theathletic.com

amazon.it

alltrails.com

thrillist.com

amazon.com.br

usmagazine.com

pikiran-rakyat.com

city-data.com

hellomagazine.com

stern.de

chicagotribune.com

spanishdict.com

lonelyplanet.com

inverse.com

actu.fr

fool.com

coursera.org

france24.com

myfitnesspal.com

dotesports.com

theglobeandmail.com

axios.com
