OpenAI’s hunger for data is coming back to bite it

It seems unlikely that OpenAI will be able to argue that it gained people’s consent when it scraped their data. That leaves it with the argument that it had a “legitimate interest” in doing so. This will likely require the company to convince regulators that ChatGPT is essential enough to justify collecting data without consent, says Edwards.

OpenAI told us it believes it complies with privacy laws, and in a blog post it said it works to remove personal information from the training data upon request “where feasible.”

The company says its models are trained on publicly available content, licensed content, and content generated by human reviewers. But under the GDPR, that’s too low a bar.

“The US has a doctrine that when stuff is in public, it’s no longer private, which is not at all how European law works,” says Edwards. The GDPR gives people rights as “data subjects,” such as the right to be informed about how their data is collected and used and to have their data removed from systems, even if it was public in the first place. 

Finding a needle in a haystack

OpenAI has another problem. The Italian authority says OpenAI is not being transparent about how it collects users’ data during the post-training phase, such as in chat logs of their interactions with ChatGPT. 

“What’s really concerning is how it uses data that you give it in the chat,” says Leautier. People tend to share intimate, private information with the chatbot, telling it about things like their mental state, their health, or their personal opinions. Leautier says it is problematic if there’s a risk that ChatGPT regurgitates this sensitive data to others. And under European law, users need to be able to get their chat log data deleted, he adds. 

OpenAI is going to find it near-impossible to identify individuals’ data and remove it from its models, says Margaret Mitchell, an AI researcher and chief ethics scientist at startup Hugging Face, who was formerly Google’s AI ethics co-lead. 

The company could have saved itself a giant headache by building in robust data record-keeping from the start, she says. Instead, it is common in the AI industry to build data sets for AI models by scraping the web indiscriminately and then outsourcing the work of removing duplicates or irrelevant data points, filtering out unwanted content, and fixing typos. These methods, and the sheer size of the data set, mean tech companies tend to have a very limited understanding of what has gone into training their models.
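To make that point concrete, here is a minimal sketch in Python of the kind of cleanup pass described above. It is illustrative only, not OpenAI’s actual pipeline; the document list, thresholds, and filters are assumptions. The telling detail is what the code never records: where each document came from or whose personal data it contains, which is exactly why honoring a later deletion request becomes so hard.

```python
import hashlib

def clean_corpus(raw_documents):
    """Naive cleaning pass over scraped text: deduplicate and filter.
    Note what it does NOT do: it keeps no record of each document's
    source or of whose data it contains (no provenance tracking)."""
    seen_hashes = set()
    cleaned = []
    for text in raw_documents:
        # Drop exact duplicates by content hash.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Crude quality filters (illustrative thresholds only):
        # skip tiny fragments and pages dominated by consent-banner residue.
        if len(text.split()) < 20:
            continue
        if text.lower().count("cookie") > 10:
            continue
        cleaned.append(text)
    return cleaned  # provenance of each document is already lost

corpus = clean_corpus(["Some scraped page text ..."] * 2)
```

Once a model is trained on such a corpus, the only artifacts left are the cleaned text and the model weights, so tracing a specific person’s information back through the pipeline is, as Mitchell says, near-impossible.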
