Llama copyright drama: Meta stops disclosing what data it uses to train the company's giant AI models

Jul 19, 2023, 03:32 IST
Business Insider
Meta CEO Mark Zuckerberg. Erin Scott/Reuters
  • Meta released a huge new AI model called Llama 2 on Tuesday.
  • The company didn't disclose what training data was used to train Llama 2.

A major battle is brewing over generative AI and copyright. Publishers want to be paid if their work has been used to train large language models. Big tech companies would rather not pay.

One way to avoid the issue is to just not tell anyone what data you used to train your AI model. Meta seems to be trying that tactic.

On Tuesday, the social-media giant released a massive new model called Llama 2. The accompanying research paper shares very little about what data was used.

"A new mix of publicly available online data," Meta researchers wrote in the paper. That's basically it.

This is unusual. Until now, the AI industry has been open about the training data used for models. There's a reason: This powerful technology must be understood, and its outputs must be as explainable and traceable as possible, so that if something goes wrong researchers can go back and fix things. Training data is key to how these models perform.


Take a look at the original Transformer research paper that kicked off the generative-AI boom. Those researchers disclosed granular information about the training data they used. It included about 40,000 sentences from The Wall Street Journal. (Rupert Murdoch, did you know?)

When Meta released the first version of LLaMA in February, that research paper listed all its training data in a table and detailed paragraphs. It included a bunch of books and the Common Crawl data set, which is a humongous copy of the internet, amassed since 2008 and stored on Amazon's cloud, ready to download any time. That last data set made up more than two-thirds of the information Meta used to train LLaMA.

So what changed in the past five months?

Publishers, authors, and other creators have suddenly realized their work is being used to train all these AI models. Were they asked for permission? No. Will Big Tech companies get away with this? Maybe.

A slew of lawsuits are already challenging tech companies' right to use this information for AI model training. Sarah Silverman's complaint is probably the most famous so far.


New risk factors

Big Tech companies know this is a risk. Microsoft, backer of industry leader OpenAI, recently added this risk factor to its quarterly SEC filing. The new language was added by Microsoft's lawyers in April.

"AI algorithms or training methodologies may be flawed," Microsoft wrote. "As a result of these and other challenges associated with innovative technologies, our implementation of AI systems could subject us to competitive harm, regulatory action, legal liability, including under new proposed legislation regulating AI in jurisdictions such as the European Union ("EU"), new applications of existing data protection, privacy, intellectual property, and other laws, and brand or reputational harm." (Copyright is an important part of intellectual property law.)

Google, another AI leader, would rather not pay for online content, since doing so would undermine its highly profitable business model. The company's top lawyer, Halimah DeLaine Prado, has said US law "supports using public information to create new beneficial uses." This argument might prevail in court.

Why Meta doesn't want to reveal the data it used

Meanwhile, Meta seems to have decided that not telling anyone what data it uses is a safe move until this fascinating new legal issue is decided.

To be sure, there are probably other reasons for Meta's reticence here. Sharon Zhou, CEO of the startup Lamini AI, laid out some theories to me, starting with the most controversial:

  • Meta is avoiding legal repercussions
  • The company wants to keep the ability to replicate Llama 2 to itself
  • More realistic, less spicy: It's a lot of work to get all the metadata in order, so Meta will probably release the training data details at some point when it's ready

I asked Meta about this, and a spokesperson shared the following statement.

"We believe developers will have plenty to work with as we release our model weights and starting code for pretrained and conversational fine-tuned versions as well as responsible use resources. While data mixes are intentionally withheld for competitive reasons, all models have gone through Meta's internal Privacy Review process to ensure responsible data usage in building our products. We are dedicated to the responsible and ethical development of our genAI products, ensuring our policies reflect diverse contexts and meet evolving societal expectations."
