OpenAI's GPTBot and other AI web crawlers are being blocked by even more companies now

Sep 28, 2023, 16:20 IST
Business Insider
Sam Altman, the OpenAI CEO, and an illustration of GPT-4. JASON REDMOND/AFP via Getty Images; Jaap Arriens/NurPhoto via Getty Images
  • Hundreds of major companies and websites are now blocking ChatGPT's web crawler.
  • Dozens more are also now blocking the crawler of Common Crawl, a major source of AI training data.

More and more companies are trying to avoid having their data freely scraped and saved by web crawlers working for the benefit of AI models.

Last month, OpenAI revealed its own crawler, GPTBot, saying it would respect robots.txt, a decades-old method through which a website can tell a web crawler to ignore it. About 70 of the 1,000 most popular sites blocked it, including Amazon and Tumblr.
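
For readers curious how a robots.txt block works in practice, here is a minimal sketch in Python using the standard library's `urllib.robotparser`. The robots.txt content, the `is_blocked` helper, the `SomeOtherBot` name, and the example.com URL are all illustrative assumptions; none of them is taken from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written for illustration only.
# It tells a crawler identifying itself as GPTBot to stay away from
# every path, while leaving the site open to all other crawlers.
EXAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if these robots.txt rules disallow `user_agent` from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(EXAMPLE_ROBOTS_TXT, "GPTBot"))        # True
print(is_blocked(EXAMPLE_ROBOTS_TXT, "SomeOtherBot"))  # False
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works if the crawler chooses to honor it, which is what OpenAI said GPTBot would do.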

This week, Insider got new data on this from Originality.ai. It shows that, over the course of about three weeks, the number of top sites blocking GPTBot has jumped to more than 250.

The list of new GPTBot blockers includes Pinterest, Vimeo, GrubHub, Indeed, Apartments.com, The Guardian, Live Science, USA Today, NPR, CBS News and CBS Sports, NBC News and CNBC, The New Yorker, People, and what appears to be all titles published by Hearst and by Condé Nast. Even weather.com is blocking the bot.

Unique and accurate information is vital to the performance of generative AI models like OpenAI's GPT-4, which has effectively memorized huge amounts of text to respond cleverly to user questions. Most of the information these models are trained on is pulled from the internet, despite most of it being owned or under copyright. A growing awareness of the practice has led to several lawsuits, and new government rules and regulations could be on the way.

Many more companies are now also blocking CCBot, a web crawler used by Common Crawl. Based in Europe, Common Crawl has spent years collecting massive amounts of data from the web, including material under copyright, and organizing the datasets for use as free training data for large language models such as Meta's Llama. As of late September, almost 14% of the 1,000 most popular websites are blocking CCBot, according to data from Originality.ai.

Those blocking CCBot include Amazon, Vimeo, Masterclass, Kelley Blue Book, The New York Times, The New Yorker, and The Atlantic. Many of those blocking CCBot also block GPTBot. It seems, though, that ChatGPT's notoriety has caused more companies to block its crawler, even though CCBot has likely been active for longer.

While online businesses have been deploying robots.txt to try to stop their data from being taken to train AI models, many tech companies have updated their terms of service and user policies to give themselves free and full access to user content and activity for use in AI projects and training.

See below for a full list of the biggest websites now blocking GPTBot and CCBot as of Sept. 22:

Blocking GPTBot

  • amazon.com

  • quora.com

  • nytimes.com

  • theguardian.com

  • shutterstock.com

  • wikihow.com

  • cnn.com

  • sciencedirect.com

  • usatoday.com

  • healthline.com

  • stackexchange.com

  • alamy.com

  • scribd.com

  • webmd.com

  • businessinsider.com

  • dictionary.com

  • reuters.com

  • washingtonpost.com

  • medicalnewstoday.com

  • npr.org

  • cbsnews.com

  • goodhousekeeping.com

  • amazon.co.uk

  • tumblr.com

  • latimes.com

  • insider.com

  • glassdoor.com

  • vocabulary.com

  • investopedia.com

  • slideshare.net

  • amazon.de

  • cosmopolitan.com

  • nbcnews.com

  • indiamart.com

  • stackoverflow.com

  • hindustantimes.com

  • bloomberg.com

  • cnbc.com

  • people.com

  • tvtropes.org

  • amazon.in

  • vimeo.com

  • verywellhealth.com

  • ikea.com

  • espn.com

  • indianexpress.com

  • thesaurus.com

  • pbs.org

  • 123rf.com

  • wattpad.com

  • variety.com

  • today.com

  • popsugar.com

  • thespruce.com

  • uol.com.br

  • amazon.fr

  • geeksforgeeks.org

  • elle.com

  • economictimes.com

  • pcmag.com

  • theverge.com

  • allrecipes.com

  • thoughtco.com

  • rollingstone.com

  • wired.com

  • nextdoor.com

  • hollywoodreporter.com

  • abc.net.au

  • ew.com

  • amazon.ca

  • news18.com

  • womenshealthmag.com

  • rateyourmusic.com

  • amazon.co.jp

  • techradar.com

  • airbnb.com

  • ndtv.com

  • lifewire.com

  • tomsguide.com

  • vulture.com

  • everydayhealth.com

  • polygon.com

  • theconversation.com

  • esquire.com

  • prnewswire.com

  • billboard.com

  • menshealth.com

  • metro.co.uk

  • countryliving.com

  • mashable.com

  • gamesradar.com

  • thehindu.com

  • timesofindia.com

  • deadline.com

  • harpersbazaar.com

  • medscape.com

  • nymag.com

  • refinery29.com

  • radiotimes.com

  • cbssports.com

  • tandfonline.com

  • theatlantic.com

  • trulia.com

  • amazon.es

  • pinterest.es

  • nationalgeographic.com

  • bhg.com

  • eater.com

  • southernliving.com

  • healthgrades.com

  • vice.com

  • picclick.com

  • bustle.com

  • newyorker.com

  • eonline.com

  • digitalspy.com

  • opentable.com

  • pinterest.de

  • thepioneerwoman.com

  • caranddriver.com

  • byrdie.com

  • livemint.com

  • medicinenet.com

  • teacherspayteachers.com

  • cookpad.com

  • thespruceeats.com

  • bizjournals.com

  • pagesjaunes.fr

  • liputan6.com

  • delish.com

  • masterclass.com

  • archiveofourown.org

  • vox.com

  • realsimple.com

  • aarp.org

  • francetvinfo.fr

  • pinterest.fr

  • kumparan.com

  • theathletic.com

  • travelandleisure.com

  • vogue.com

  • livescience.com

  • apartments.com

  • marketwatch.com

  • glamour.com

  • amazon.it

  • cinemablend.com

  • thrillist.com

  • amazon.com.br

  • pinterest.co.uk

  • angi.com

  • alamy.es

  • usmagazine.com

  • distractify.com

  • bbcgoodfood.com

  • jagran.com

  • mercadolibre.com.mx

  • androidauthority.com

  • city-data.com

  • foodandwine.com

  • hellomagazine.com

  • amazon.com.au

  • gq.com

  • ingles.com

  • amarujala.com

  • ieee.org

  • prevention.com

  • stern.de

  • kbb.com

  • edmunds.com

  • marthastewart.com

  • pcgamer.com

  • justanswer.com

  • health.com

  • 20minutes.fr

  • fortune.com

  • homes.com

  • scientificamerican.com

  • popularmechanics.com

  • verywellfit.com

  • vanityfair.com

  • chicagotribune.com

  • verywellmind.com

  • housebeautiful.com

  • cntraveler.com

  • allure.com

  • spanishdict.com

  • neverbounce.com

  • answers.com

  • moneycontrol.com

  • architecturaldigest.com

  • slate.com

  • lonelyplanet.com

  • inverse.com

  • corriere.it

  • actu.fr

  • self.com

  • tripsavvy.com

  • instyle.com

  • eatingwell.com

  • superuser.com

  • welt.de

  • spiegel.de

  • womansday.com

  • seventeen.com

  • hbr.org

  • oprahdaily.com

  • autotrader.com

  • bonappetit.com

  • sueddeutsche.de

  • seriouseats.com

  • liveabout.com

  • seattletimes.com

  • coursera.org

  • livehindustan.com

  • france24.com

  • townandcountrymag.com

  • dotesports.com

  • worldplaces.me

  • faz.net

  • teenvogue.com

  • motor1.com

  • nj.com

  • glamourmagazine.co.uk

  • okdiario.com

  • brides.com

  • stylecaster.com

  • alamyimages.fr

  • jagranjosh.com

  • theglobeandmail.com

  • axios.com

  • francebleu.fr

  • tabelog.com

  • thebalancemoney.com

  • nydailynews.com

  • sheknows.com

  • naomedical.com

  • verywellfamily.com

Blocking CCBot

  • nytimes.com

  • shutterstock.com

  • reuters.com

  • goodhousekeeping.com

  • tumblr.com

  • cosmopolitan.com

  • pixabay.com

  • depositphotos.com

  • pbs.org

  • elle.com

  • glosbe.com

  • patch.com

  • wired.com

  • womenshealthmag.com

  • esquire.com

  • indiatoday.in

  • menshealth.com

  • countryliving.com

  • zippia.com

  • chron.com

  • harpersbazaar.com

  • tr-ex.me

  • detik.com

  • theatlantic.com

  • newyorker.com

  • digitalspy.com

  • etymonline.com

  • thepioneerwoman.com

  • caranddriver.com

  • hinative.com

  • teacherspayteachers.com

  • delish.com

  • masterclass.com

  • archiveofourown.org

  • theathletic.com

  • vogue.com

  • glamour.com

  • alltrails.com

  • gq.com

  • ingles.com

  • prevention.com

  • kbb.com

  • popularmechanics.com

  • vanityfair.com

  • housebeautiful.com

  • cntraveler.com

  • allure.com

  • spanishdict.com

  • architecturaldigest.com

  • self.com

  • sfgate.com

  • womansday.com

  • songkick.com

  • seventeen.com

  • oprahdaily.com

  • autotrader.com

  • bonappetit.com

  • aajtak.in

  • coursera.org

  • townandcountrymag.com

  • faz.net

  • teenvogue.com

  • glamourmagazine.co.uk
