In my last post we talked about data protection authorities coordinating their efforts on right of erasure. Today, I want to contextualize the right of erasure in the age of AI. I was curious about how LLMs managed data deletion and went to my favorite source of information - LLMs. Here are some surprising things I learned.
(1) LLMs are generally aware that they cannot remove an inference about an individual, and of the complexity surrounding that limitation
Me: So if a user asks to have their data deleted out of the model, you would tell them it's not possible, is that what you're saying?
Gemini: Yes, that's right. You'd explain that complete removal isn't possible, but you can take steps to minimize the chances of their data appearing in future interactions.
(2) The models are aware of the legal obligations surrounding data deletion
Me: In some areas, isn't it a legal requirement that you delete data about an individual if they request it?
Gemini: You're absolutely right, there are definitely legal requirements about deleting user data, especially with regulations like GDPR and CCPA. Those laws give people the right to request that their personal information be deleted, so even though it's technically challenging with large language models, companies still need to find ways to comply.
(3) LLMs don’t have great answers for how to comply with their legal obligations to honor a person's right to be forgotten.
Gemini: You might need to get creative. Perhaps you could try adding new data that dilutes or contradicts the old information or maybe you could develop some kind of filter that recognizes and blocks any outputs related to that person. It's definitely a complex challenge.
Me: But ultimately deletion is not possible, you would just have to have some post-hoc method of removing that inference?
Gemini: Yep, you got it. Deletion isn't possible, but you could have some kind of post-hoc method to remove that inference. It's a tricky situation, but that's the best way to handle it.
This response is not unique to Gemini; I got similar results from Grok, OpenAI, and Claude. When I pressed further on the legality of only partially removing personal information from the AI system, the answers were uniformly along the lines of: great question, it's complicated, we can't do it fully, but we're trying.
Why it's so hard for AI to forget, and what that means under GDPR
By way of analogy, imagine a traditional data store as a bin of Legos. Let's say the bin also comes with a set of instructions for building a Lego house. You can then build the house by assembling the bricks in the bin according to the instructions, much like traditional software assembles, manipulates, or redisplays data.
An LLM, by contrast, doesn't work from a bin and a set of instructions. It has seen thousands of Lego pieces and Lego houses and has inferred how the pieces fit together to construct a house. The houses it builds may not be the same every time, but it understands the gist of a Lego house and can make one on demand.
In a traditional data store, deletion would be a simple matter of removing the bricks associated with a deletion request; with an LLM, you are asking the AI system to forget how to build a house.
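To make that contrast concrete, here is a minimal sketch of my own (the table and column names are hypothetical) showing what erasure looks like against a traditional data store, and why there is no equivalent operation to point at an LLM's weights:

```python
# Illustrative only: a toy "traditional data store" with one user record.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice Example', 'alice@example.com')")

def erase_user(user_id: int) -> None:
    """Traditional store: one person's 'bricks' live in identifiable rows,
    so honoring an erasure request is a targeted DELETE."""
    conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    conn.commit()

erase_user(1)  # that record, and only that record, is gone

# For an LLM there is no analogous statement to issue: whatever the model has
# inferred about a person is diffused across shared parameters (weights),
# with no row, key, or document to point a DELETE at.
```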
Now take the GDPR, whose data subject rights, like the right to be forgotten, were contextualized in a world of traditional data stores (ironically, AI was a consideration back then, but gaps remain). Under that right, an individual can have their personal data erased in certain circumstances.
The problem is that, unlike traditional databases, LLMs encode information within their model parameters, making it difficult to isolate and delete knowledge related to an individual (and it's questionable whether that knowledge even counts as data) as required by the GDPR's right of erasure.
What does this mean for businesses subject to GDPR that are deploying AI models?
Privacy laws do apply to AI models, and organizations are still required to honor data subject rights. This is a difficult problem, but one that can be solved. Understand your AI system's lifecycle holistically: data sources, foundation models, use cases, output monitoring, and post-hoc filtering are all levers that should be employed.
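As a deliberately simplified illustration of that last lever, here is a sketch of the kind of post-hoc output filter the models themselves describe above. The blocklist entries and matching logic are my own assumptions; a production system would need far more robust entity matching, logging, and review.

```python
# Rough sketch of a post-hoc output filter. It does not remove anything from
# the model's parameters; it only keeps the inference from reaching the user.
import re

# Identifiers covered by honored erasure requests (hypothetical entries).
ERASURE_BLOCKLIST = {"Alice Example", "alice@example.com"}

def filter_output(model_output: str) -> str:
    """Withhold model output that references an erased individual."""
    for term in ERASURE_BLOCKLIST:
        if re.search(re.escape(term), model_output, flags=re.IGNORECASE):
            return "[Response withheld: references a subject of an erasure request]"
    return model_output

print(filter_output("Alice Example lives at 42 Maple Street."))
# -> [Response withheld: references a subject of an erasure request]
```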
Once you have an AI governance program in place, you can make effective decisions and adopt business practices aligned with your objectives.
For example, the EDPB released an opinion last December tackling the question of whether AI models themselves constitute personal data if they were trained on personal information.
The answer, as with many legal frameworks, involves assessing the intended purpose and then performing a three-pronged analysis: the risks of identification, the likely use of the model, and the assessment of risks performed by the Controller.
The silver lining is that the EDPB does acknowledge that models may be anonymous, and therefore not subject to the GDPR. The downside is that making that determination may be cost prohibitive, and retraining a model altogether has estimated costs upwards of seven figures.
What if there were a better way?
With the right AI governance platform in place, you gain visibility into how data flows across your organization, can weigh where AI systems are worth deploying against the regulatory cost, and may even have the evidence to demonstrate that your AI systems are not subject to regulations like the GDPR at all.
This highlights how critical an AI governance platform and visibility into your AI pipeline are: undoing data contamination is a messy affair, made worse by the high investment involved and the inability to walk back an AI system once it has learned something new.