- OpenAI said this month it was using its own web crawler to collect training data for ChatGPT.
- It promised not to crawl websites that deploy a decades-old web tool, robots.txt.
Dozens of large companies including Amazon and The New York Times have rushed to block GPTBot, a tool that OpenAI recently announced it was using to crawl the web for data that would be fed to its popular chatbot, ChatGPT.
As of this week, 70 of the world's top 1,000 websites have moved to block GPTBot, the web crawler OpenAI revealed two weeks ago it was using to collect massive amounts of information from the internet to train ChatGPT. Originality.ai, a company that checks content to see if it's AI-generated or plagiarized, conducted an analysis that found more than 15% of the 100 most popular websites have decided to block GPTBot in the past two weeks.
The six largest websites now blocking the bot are amazon.com (along with several of its international counterparts), nytimes.com, cnn.com, wikihow.com, shutterstock.com, and quora.com.
The top 100 sites blocking GPTBot include bloomberg.com, scribd.com, and reuters.com, as well as insider.com and businessinsider.com. Among the top 1,000 sites blocking the bot are ikea.com, airbnb.com, nextdoor.com, nymag.com, theatlantic.com, axios.com, usmagazine.com, lonelyplanet.com, and coursera.org.
"GPTBot launched 14 days ago and the percentage of Top 1,000 sites blocking it has been steadily increasing," the analysis said.
How these websites block GPTBot is relatively simple, even crude, depending on your perspective. Each site hosts a file called robots.txt, and GPTBot has been added to its "disallow" list.
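A minimal robots.txt entry along the following lines is enough to opt out. "GPTBot" is the user-agent token OpenAI says its crawler identifies itself with; the site-wide "Disallow: /" rule shown here is just one way a site might write the restriction:

```
# Tell OpenAI's crawler it may not fetch any page on this site
User-agent: GPTBot
Disallow: /
```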
Robots.txt is a tool created in the 1990s to stop web crawlers, such as the search crawlers used by Google and Bing, from extracting data and information from a website. When revealing the crawler, OpenAI said it would abide by robots.txt and that GPTBot would not crawl websites that deploy it.
Much of what is available on the internet, particularly text and images, is technically under copyright. Crawlers like GPTBot do not ask permission, obtain licenses, or pay to use the data and information they extract. The only way to avoid them at this point is through robots.txt, although companies that deploy crawlers are not legally bound to recognize robots.txt restrictions.
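For readers curious how a block like this can be verified, a rough sketch follows using Python's standard-library urllib.robotparser. The example.com domain is a placeholder, and a survey of many sites, like the one Originality.ai ran, would also need error handling and rate limiting:

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain; swap in any site whose robots.txt you want to inspect.
site = "https://www.example.com"

parser = RobotFileParser()
parser.set_url(f"{site}/robots.txt")
parser.read()  # fetches and parses the site's robots.txt

# True means the site's rules still allow GPTBot to crawl the homepage;
# False means GPTBot falls under a disallow rule that covers it.
print(parser.can_fetch("GPTBot", f"{site}/"))
```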
Awareness of copyright rules, and of who owns the data these crawlers take to train AI projects built on large language models, or LLMs, has grown as tools like ChatGPT have exploded onto the tech scene. Several lawsuits are already in the works. The author Stephen King, after learning his books had been used in AI training sets, said he's looking to the future with a "certain dreadful fascination."
For its part, OpenAI has tried to conceal that ChatGPT was trained on any copyrighted material.
A representative of OpenAI could not be immediately reached for comment.
See below for a full list of the biggest websites that blocked GPTBot between August 8 and August 22:
amazon.com
quora.com
nytimes.com
shutterstock.com
wikihow.com
cnn.com
foursquare.com
healthline.com
scribd.com
businessinsider.com
reuters.com
medicalnewstoday.com
amazon.co.uk
insider.com
yourdictionary.com
slideshare.net
amazon.de
bloomberg.com
amazon.in
studocu.com
ikea.com
uol.com.br
amazon.fr
geeksforgeeks.org
pcmag.com
theverge.com
nextdoor.com
amazon.ca
amazon.co.jp
airbnb.com
vulture.com
polygon.com
prnewswire.com
mashable.com
nymag.com
detik.com
theatlantic.com
trulia.com
amazon.es
eater.com
picclick.com
bustle.com
etymonline.com
teacherspayteachers.com
archiveofourown.org
vox.com
kumparan.com
theathletic.com
amazon.it
alltrails.com
thrillist.com
amazon.com.br
usmagazine.com
pikiran-rakyat.com
city-data.com
hellomagazine.com
stern.de
chicagotribune.com
spanishdict.com
lonelyplanet.com
inverse.com
actu.fr
fool.com
coursera.org
france24.com
myfitnesspal.com
dotesports.com
theglobeandmail.com
axios.com