The tricky truth about how generative AI uses your data
When the White House revealed its list of voluntary safety and societal commitments signed by seven AI companies, one thing was noticeably missing: anything related to the data these AI systems collect and use to train this powerful technology. Including, very likely, yours.
There are many concerns about the potential harm that sophisticated generative AI systems have unleashed on the public. What they do with our data is one of them. We know very little about where these models get the petabytes of data they need, how that data is being used, and what protections, if any, are in place when it comes to sensitive information. The companies that make these systems aren’t telling us much, and may not even know themselves.
You may be okay with all of this, or think the good that generative AI can do far outweighs whatever bad went into building it. But a lot of other people aren’t.
Two weeks ago, a viral tweet accused Google of scraping Google Docs for data on which to train its AI tools. In a follow-up, its author claimed that Google “used docs and emails to train their AI for years.” The initial tweet has nearly 10 million views, and it’s been retweeted thousands of times. The fact that this may not even be true is almost beside the point. (Google says it doesn’t use data from its free or enterprise Workspace products — that includes Gmail and Docs — to train its generative AI models unless it has user permission, though it does train some Workspace AI features like spellcheck and Smart Compose using anonymized data.)
“Up until this point, tech companies have not done what they’re doing now with generative AI, which is to take everyone’s information and feed it into a product that can then contribute to people’s professional obsolescence and totally decimate their privacy in ways previously unimaginable,” said Ryan Clarkson, whose law firm is behind class action lawsuits against OpenAI, Microsoft, and Google.
Google’s general counsel, Halimah DeLaine Prado, said in a statement that the company has been clear that it uses data from public sources, adding that “American law supports using public information to create new beneficial uses, and we look forward to refuting these baseless claims.”
Exactly what rights we may have over our own information, however, is still being worked out in lawsuits, worker strikes, regulator probes, executive orders, and possibly new laws. Those might take care of your data in the future, but what can you do about what these companies already took, used, and profited from? The answer is probably not a whole lot.
Generative AI companies are hungry for your data. Here’s how they get it.
Simply put, generative AI systems need as much data as possible to train on. The more they get, the better they can generate approximations of how humans sound, look, talk, and write. The internet provides massive amounts of data that’s relatively easy to gobble up through web scraping tools and APIs. But that gobbling process makes no exception for copyrighted works or personal data; if it’s out there, it gets taken.
“In the absence of meaningful privacy regulations, that means that people can scrape really widely all over the internet, take anything that is ‘publicly available’ — that top layer of the internet for lack of a better term — and just use it in their product,” said Ben Winters, who leads the Electronic Privacy Information Center’s AI and Human Rights Project and co-authored its report on generative AI harms.
Which means that, unbeknownst to you and, apparently, several of the companies whose sites were being scraped, some startup may be taking and using your data to power a technology you had no idea was possible. That data may have been posted on the internet years before these companies existed. It may not have been posted by you at all. Or you may have thought you were giving a company your data for one purpose that you were fine with, but now you’re afraid it was used for something else. Many companies’ privacy policies, which are updated and changed all the time, may let them do exactly that. They often say something along the lines of how your data may be used to improve their existing products or develop new ones. Conceivably, that includes generative AI systems.
Not helping matters is how cagey generative AI companies have been about revealing their data sources, often simply saying that they’re “publicly available.” Even Meta’s more detailed list of sources for its first LLaMA model refers to things like “Common Crawl,” an open archive of data crawled from billions of web pages, as well as sites like GitHub, Wikipedia, and Stack Exchange, which are also enormous repositories of information. (Meta hasn’t been as forthcoming about the data used for the just-released Llama 2.) All of these sources may contain personal information. OpenAI admits that it uses personal data to train its models, but says it comes across that data “incidentally” and only uses it to make “our models better,” as opposed to building profiles of people to sell ads to them.
Google and Meta have vast troves of personal user data they say they don’t use to train their language models now, but we have no guarantee they won’t do so in the future, especially if it means gaining a competitive advantage. We know that Google scanned users’ emails for years in order to target ads (the company says it no longer does this). Meta had a major scandal and a $5 billion fine when it shared data with third parties, including Cambridge Analytica, which then misused it. The fact is, these companies have given users plenty of reasons not to take their assurances about data privacy or commitments to produce safe systems at face value.
“The voluntary commitments by big tech require a level of trust that they don’t deserve, and they have not earned,” Clarkson said.
Copyrights, privacy laws, and “publicly available” data
For creators such as writers, musicians, and actors, copyrights and image rights are a major issue, and it’s pretty obvious why. Generative AI models have been trained on their work and could also put them out of work in the future.
That’s why comedian Sarah Silverman is suing OpenAI and Meta as part of a class action lawsuit. She alleges that the two companies trained their models on her written work by using datasets that contained text from her book, The Bedwetter. There are also lawsuits over image rights and the use of open source computer code.
The use of generative AI is also one of the reasons why writers and actors are on strike, with both of their unions, the WGA and SAG-AFTRA, fearing that studios will train AI models on artists’ words and images and simply generate new content without compensating the original human creators.
But you, the average person, might not have intellectual property to protect, or at least your livelihood may not depend on it. So your concerns might be more about how companies like OpenAI are protecting your privacy when their systems scoop it up, remix it, and spit it back out.
Regulators, lawmakers, and lawyers are wondering about this, too. Italy, which has stronger privacy laws than the US, even temporarily banned ChatGPT over privacy issues. Other European countries are looking into doing their own probes of ChatGPT. The Federal Trade Commission has also set its sights on OpenAI, investigating it for possible violations of consumer protection laws. The agency has also made it clear that it will keep a close eye on generative AI tools.
But the FTC can only enforce what the laws allow it to. President Biden has encouraged Congress to pass AI-related bills, and many members of Congress have said they want to do the same. Congress is notoriously slow-moving, however, and has done little to regulate or protect consumers from social media platforms. Lawmakers may learn a lesson from this and act faster when it comes to AI, or they may repeat their mistake. The fact that there is interest in doing something relatively soon after generative AI’s introduction to the general public is promising.
“The pace at which people have introduced legislation and said they want to do something about [AI] is, like, 9 million times faster than it was with any of these other issues,” Winters said.
But it’s also hard to imagine Congress acting on data privacy. The US doesn’t have a federal consumer online privacy law. Children under 13 do get some privacy protections, as do residents of states that passed their own privacy laws. Certain types of data, such as health and financial information, are protected, too. That leaves a lot of adults across the country with very little by way of data privacy rights.
We will likely be looking at the courts to figure out how generative AI fits with the laws we already have, which is where people like Clarkson come in.
“This is a chance for the people to have their voice heard, through these lawsuits,” he said. “And I think that they’re going to demand action on some of these issues that we haven’t made much progress through the other channels thus far. Transparency, the ability to opt out, compensation, ethical sourcing of data — those kinds of things.”
In some instances, Clarkson and Tim Giordano, a partner at Clarkson Law Firm who is also working on these cases, said there’s existing law that doesn’t explicitly cover people’s rights with generative AI but which a judge can interpret to apply there. In others, there are things like California’s privacy law, which requires companies that share or sell people’s data to give them a way to opt out and delete their information.
“There’s currently no way for these models to delete the personal information that they’ve learned about us, so we think that that’s a clear example of a privacy violation,” Giordano said.
ChatGPT’s opt-out and data deletion tools, for example, cover only data collected from people using the ChatGPT service. OpenAI does have a way for people in “certain jurisdictions” to opt out of having their data processed by its models, but it doesn’t guarantee it will honor the request, and it requires that you provide evidence that your data was processed in the first place.
OpenAI recently changed its policy and stopped training models on data provided by its own customers, but another set of privacy concerns crops up around how these models use the data you give them and the information they release into the wild. “Customers clearly want us not to train on their data,” Sam Altman, CEO of OpenAI, told CNBC. That’s an indicator that people aren’t comfortable with their data being used to train AI systems, though only some are given the chance to opt out, and only in limited circumstances. Meanwhile, OpenAI has been sued for defamation over a ChatGPT response that falsely claimed that someone had defrauded and stolen money from a non-profit. And this isn’t the only time a ChatGPT response has leveled false accusations against someone.
So what can you currently do about any of this? That’s what’s so tricky here. A lot of the privacy issues now are the result of a failure to pass real, meaningful privacy laws in the past that could have protected your data before these datasets and technologies even existed. You can always try to minimize the data you put out there now, but you can’t do much about what’s already been scraped and used. You’d need a time machine for that, and not even generative AI has been able to invent one yet.
A version of this story was also published in the Vox technology newsletter.
Source: Vox.com