‘Too smart to steal’: AI, made to help, faces backlash for data scraping

Divyanshi Sharma
Aug 6, 2024, 14:21 IST

AI companies are being accused of data scraping (Image: Unsplash)

Various AI companies have been accused of data scraping in the past
Reddit CEO recently asked Microsoft to pay for using its website content
Midjourney announced a ban on Stability AI employees in March this year

OpenAI launched ChatGPT in November 2022 and the world hasn’t been the same since then. From writing content and composing poetry to helping with research and writing code, the AI chatbot can do things that were earlier considered to be exclusive to human beings. After ChatGPT’s launch, several other companies including Microsoft and Google also launched their own AI chatbots.

Today, people use chatbots from various companies for all kinds of purposes. But do you know where the chatbots get their information from? One source is the data you enter while talking to the chatbot (companies mention this in their disclaimers, and you can opt out). The other is data from online sources, which is mostly publicly available content created by other individuals and companies. And this has been a source of debate for a long time.

Microsoft AI division CEO Mustafa Suleyman was in the news last month for saying that anyone can use the data available on the internet. However, Reddit CEO Steve Huffman opposes this viewpoint and recently made headlines for asking Microsoft to pay for using the data on Reddit.

In the last few months, there have been various reports of content creators, actors, and companies taking action against AI companies for scraping their data to train AI models. Let us take a look at some of them.

1. Microsoft

Last week, Reddit CEO Steve Huffman made headlines for demanding that companies like Microsoft pay for using the site’s data to train their AI models. A report in The Verge quoted Huffman as saying that without such agreements, Reddit has no control over how its data is used or displayed, leading it to block companies unwilling to negotiate. He specifically called out Microsoft, Anthropic, and Perplexity for refusing to come to terms.

The Verge report added that Huffman also accused Microsoft of using Reddit’s data to train its AI and summarise content in Bing results without informing the platform. He also said that Reddit’s data had been sold through the Bing API to other search engines. Referencing Microsoft AI CEO Mustafa Suleyman’s comment, Huffman said that Microsoft, Anthropic, and Perplexity seem to believe all internet content is free for their use.

2. Stability AI

In March this year, Midjourney announced a ban on Stability AI employees from using its service, accusing them of causing a system outage by scraping Midjourney's data. On March 2nd, Midjourney reported on its Discord server that a prolonged server outage prevented generated images from displaying in user galleries, attributing the issue to "botnet-like activity from paid accounts" linked to Stability AI employees.

According to a report in The Verge, during a business update call on March 6th, Midjourney claimed the outage occurred because "someone at Stability AI was trying to grab all the prompt and image pairs in the middle of the night on Saturday." The company traced multiple paid accounts back to a member of Stability AI's data team.

Consequently, Midjourney indefinitely banned all Stability AI employees from using its service and introduced a new policy to ban employees from any company involved in "aggressive automation" or causing service outages.

3. Runway

Google-backed AI startup Runway was in the limelight last month for accusations of scraping thousands of YouTube videos without authorisation to train its AI video creation model. The allegations came to light through a leaked internal spreadsheet obtained by 404 Media.

The spreadsheet, allegedly shared by a former Runway employee, outlined plans to categorise and tag content from over 3,900 YouTube channels, including big names like Disney, Netflix, and well-known YouTubers such as Casey Neistat and Marques Brownlee (MKBHD). This data was purportedly used to develop Runway's Gen-3 AI video creation model, previously known as "Jupiter."

Runway did not verify the authenticity of the spreadsheet. The company had earlier claimed it used "curated, internal datasets" for training but did not disclose further details.

Prominent YouTubers like MKBHD and MrWhoseTheBoss voiced their concerns about the incident on social media. MKBHD disclosed that over 1,600 of his videos had been scraped, while MrWhoseTheBoss described the practice as "scary," revealing that 1,600 of his videos were also used.

4. OpenAI

In May this year, Hollywood actress Scarlett Johansson accused OpenAI of using a voice resembling hers without her permission and said she had hired legal counsel. OpenAI, on the other hand, claimed that the voice in question belonged to another actor and not Johansson. That same month, it was revealed that ChatGPT’s Sky voice (the voice in question) was being taken down.

Johansson had also said that OpenAI CEO Sam Altman had twice approached her for permission to clone her voice for the AI chatbot, and that she had declined the offer.

Meanwhile Altman, in an interview with NBC News, said that Sky was not based on Johansson's voice. He explained that a professional actor recorded the voice, but that the actor's identity could not be revealed for privacy reasons.

"We cast the voice actor behind Sky’s voice before any outreach to Ms. Johansson. Out of respect for Ms. Johansson, we have paused using Sky’s voice in our products. We are sorry to Ms. Johansson that we didn’t communicate better," he had said.

5. Anthropic

American AI startup Anthropic also found itself in the middle of a controversy last month. The startup faced accusations of aggressively scraping data from various websites to train its systems.

A report in the Financial Times said that Matt Barrie, CEO of Freelancer.com, claimed that Anthropic was "the most aggressive scraper by far" of his freelancing portal, which sees millions of daily visits.

Other web publishers echoed Barrie’s concerns, stating that Anthropic's bots were inundating their sites and ignoring requests to stop collecting content for training purposes. Freelancer.com recorded 3.5 million visits from an Anthropic-linked web crawler within just four hours, according to data shared with the Financial Times. Barrie noted that this volume was "probably about five times the number two" AI crawler.

Despite attempts to block access using standard web protocols, visits from Anthropic's bots continued to rise, prompting Barrie to block traffic from Anthropic’s internet addresses entirely.

Anthropic responded by stating that it was investigating the issue and aimed to respect publishers' requests, striving not to be "intrusive or disruptive."

The Financial Times also quoted Kyle Wiens, CEO of iFixit.com, who reported receiving 1 million hits from Anthropic bots in 24 hours. "We have a load of alarms [for high traffic], people get woken up at 3am. This set off every alarm we have," Wiens said.
