Big Tech needs to get creative as it runs out of data to train its AI models. Here are some of its wildest solutions.

Lakshmi Varanasi   

Big Tech is scouring the internet for new data sources to train its AI models. (Photo: Gilnature/Getty Images)
  • OpenAI, Meta, Google, and other Big Tech firms train their AI models using online data.
  • But AI models learn so fast that all that data could run out by 2026.

More is more when it comes to AI. The more data AI systems are trained on, the more powerful they will be.

But as the AI arms race heats up, tech giants like Meta, Google, and OpenAI face a problem: They're running out of data to train their models.

Many leading AI systems have been trained on the vast supply of online data. But by 2026, all the high-quality data could be exhausted, according to Epoch, an AI research institute.

So major tech companies are searching for new data sources to keep their systems learning. Here's a look at some of the most creative options that tech companies are considering.

Google considered tapping consumer data available in Google Docs, Sheets, and Slides.

Google considered using data from Google Docs, Sheets, and Slides for training its AI systems. (Photo: Shutterstock)

Last summer, Google's legal department began asking employees to broaden the language around using consumer data, The New York Times reported. Some employees were told that the company wanted to use data from the free consumer versions of Google Docs, Google Sheets, and Google Slides, and even restaurant reviews on Google Maps.

While Google updated its privacy policy in July 2023, the company says it didn't expand the types of data it uses to train AI models.

Splurging on the publishing house Simon & Schuster.

Simon & Schuster's New York City headquarters in 2016. (Photo: Robert Alexander/Getty Images)

At Meta, the waning supply of usable data concerned executives so much they met almost daily in March and April last year to brainstorm alternatives, the Times reported.

One idea floated at these meetings was to buy Simon & Schuster. The famed publishing house has worked with authors like Stephen King and Jennifer Weiner and was purchased by private equity firm KKR for $1.62 billion last year.

Other attendees suggested a more budget-friendly option of paying $10 a book to obtain the full licensing rights to new titles.

Generating synthetic data

OpenAI is exploring synthetic data to train its systems. (Photo: Richard Jones/Science Photo Library/Getty Images)

Synthetic data is data generated by AI systems, and OpenAI has considered it an option for its models.

"As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine," OpenAI CEO Sam Altman said at a tech conference last May, according to the Times.

The issue with training AI systems on synthetic data is that it can reinforce some of the mistakes and limitations of AI, the Times reported. OpenAI is working on a process to address this in which one AI system produces data, and another AI system judges it.

Whisper, a speech recognition tool that transcribes YouTube videos

YouTube wants to create AI-generated music. (Photo: Getty Images)

OpenAI has also built Whisper, a speech recognition tool that can transcribe YouTube videos and podcasts. Its latest large language model, GPT-4, has been trained on over one million hours of YouTube videos transcribed by Whisper.

OpenAI's president, Greg Brockman, was a key developer of Whisper and told the Times that OpenAI relies on "numerous sources" of data for its systems.

Photobucket: A treasure trove of photos from Myspace and Friendster

Photobucket, which hosted photos on Myspace, might be licensing its data to tech companies. (Photo: eHowTech/YouTube)

Photobucket was once "the world's top image-hosting site" and accounted for nearly half of the US online photo market, according to Reuters. Part of that was because it hosted photos for early social media sites like Myspace and Friendster.

Its database of pictures might soon be licensed to tech companies for training their AI systems, Reuters reported. Photobucket declined to identify prospective buyers to Reuters.
