So-called “unlearning” techniques are used to make a generative AI model forget specific, unwanted information it has gleaned from training data, such as sensitive private data or copyrighted material.
But current unlearning techniques are a double-edged sword: They could make a model like OpenAI’s GPT-4o or Meta’s Llama 3.1 405B much less capable of answering basic questions.
That’s according to a new study co-authored by researchers from the University of Washington (UW), Princeton, the University of Chicago, USC and Google, which found that today’s most popular unlearning techniques tend to degrade models, often to the point where they become unusable.
“Our assessment suggests that currently feasible unlearning methods are not yet ready for meaningful use or deployment in real-world scenarios,” Weijia Shi, a researcher on the study and a UW computer science PhD student, told TechCrunch. “Currently, there is no effective method that allows a model to forget specific data without a significant loss of utility.”
How models learn
Generative AI models do not have real intelligence. They are statistical systems that predict words, images, speech, music, videos, and other data. Fed with a huge number of examples (e.g. movies, voice recordings, essays, etc.), AI models learn the probability of data occurring based on patterns, including the context of the surrounding data.
For example, if we take an email ending with the fragment “Looking forward to…”, a model trained to autocomplete messages might suggest “…hearing back from you,” following the pattern of all the emails it has ingested. There is no intentionality here; the model is not looking forward to anything. It is simply making an educated guess.
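To make that “educated guess” concrete, here is a deliberately tiny sketch of the underlying idea: count which word tends to follow another in a handful of example sentences, then predict the most frequent one. The toy corpus and bigram counter are illustrative assumptions, nothing like the scale or architecture of a model such as GPT-4o.

```python
# Toy illustration of next-word prediction from co-occurrence statistics.
# The "corpus" below is a made-up stand-in for the huge datasets real models ingest.
from collections import Counter, defaultdict

corpus = [
    "looking forward to hearing back from you",
    "looking forward to hearing from you soon",
    "looking forward to our meeting next week",
]

# Count how often each word follows the previous one (a simple bigram model).
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(word: str) -> tuple[str, float]:
    """Return the most frequent continuation seen in training and its probability."""
    counts = follows[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

print(predict_next("to"))  # ('hearing', 0.67): roughly two-thirds of the time, so it guesses "hearing"
```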
Most models, including flagship models like GPT-4o, are trained on data from public websites and datasets on the web. Most vendors who develop such models claim that fair use protects their practice of scraping data and using it for training purposes without informing, compensating, or even crediting the owners of the data.
But not all copyright holders agree. And many—authors, publishers, record labels—have filed lawsuits against vendors to force them to change their practices.
The copyright dilemma is one reason why unlearning techniques have attracted a lot of interest recently. Google, in partnership with several academic institutions, launched a competition last year to stimulate the creation of new unlearning approaches.
Unlearning could also remove sensitive information from existing models, such as medical records or compromising photos, in response to a government request or order. (Because of the way they’re trained, models tend to pick up a lot of private information, from phone numbers to more problematic examples.) In recent years, some vendors have rolled out tools that let data owners request that their data be removed from training sets. But those tools only apply to future models, not to models trained before the tools were released; unlearning would be a much more comprehensive approach to data removal.
Either way, unlearning isn’t as simple as hitting “delete.”
The art of forgetting
Current unlearning techniques rely on algorithms designed to “steer” models away from the data to be unlearned. The idea is to influence the model’s predictions so that it never – or very rarely – generates certain data.
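What that “steering” can look like in practice varies by method. One common family combines gradient ascent on the data to be forgotten with ordinary training on data to keep; the sketch below is a generic illustration of that idea in PyTorch with a made-up toy model and random token data, not one of the eight algorithms the study evaluated.

```python
# Minimal sketch of one common unlearning recipe: gradient *ascent* on the forget
# set (push predictions away from it) plus gradient descent on a retain set
# (preserve everything else). Toy model and random data are purely illustrative.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical (input token, next token) pairs for the data to forget and to keep.
forget_x = torch.randint(0, vocab_size, (64,))
forget_y = torch.randint(0, vocab_size, (64,))
retain_x = torch.randint(0, vocab_size, (64,))
retain_y = torch.randint(0, vocab_size, (64,))

for step in range(100):
    optimizer.zero_grad()
    # Negating the loss turns descent into ascent: the model is nudged away
    # from reproducing the forget set...
    forget_loss = -loss_fn(model(forget_x), forget_y)
    # ...while standard training on the retain set tries to limit collateral damage.
    retain_loss = loss_fn(model(retain_x), retain_y)
    (forget_loss + retain_loss).backward()
    optimizer.step()
```

Even in this cartoon version, the tradeoff the study documents is visible: because the same weights encode both kinds of data, pushing hard on the forget term tends to drag down performance on everything else.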
To gauge how effective these unlearning algorithms can be, Shi and her collaborators designed a benchmark and selected eight different open-source algorithms to test. Called MUSE (Machine Unlearning Six-way Evaluation), the benchmark probes an algorithm’s ability not only to prevent a model from spitting out training data verbatim (a phenomenon known as regurgitation), but also to eliminate the model’s knowledge of that data and any evidence that it was originally trained on that data.
To score well on MUSE, a model must be made to forget two sets of data: books from the Harry Potter series and news articles.
For example, given an excerpt from Harry Potter and the Chamber of Secrets (“‘There’s more in the frying pan,’ said Aunt Petunia…”), MUSE tests whether an unlearned model can recite the entire sentence (“‘There’s more in the frying pan,’ said Aunt Petunia, turning her eyes to her enormous son”), answer questions about the scene (e.g., “What did Aunt Petunia say to her son?”, “There’s more in the frying pan”), or otherwise indicate that it was trained on the text of the book.
MUSE also checks whether the model retains related general knowledge after unlearning (for example, that J.K. Rowling is the author of the Harry Potter series), which the researchers call the model’s overall utility. The lower the utility, the more related knowledge the model has lost, making it less able to answer questions correctly.
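A concrete way to picture the regurgitation test (this is an illustrative stand-in, not MUSE’s actual implementation): prompt the model with the opening of a passage and measure how much of the true continuation it reproduces. The generate function below is a hypothetical placeholder for whatever API the model exposes.

```python
# Illustrative regurgitation check: prompt with the start of a passage and measure
# how many words of the real continuation come back. Not MUSE's actual code.
def verbatim_fraction(generated: str, reference: str) -> float:
    """Fraction of the reference continuation's words that appear in the generation."""
    gen_words = set(generated.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(word in gen_words for word in ref_words) / len(ref_words)

prompt = "'There's more in the frying pan,' said Aunt Petunia,"
reference = "turning her eyes to her enormous son"

def generate(prompt: str) -> str:
    # Hypothetical placeholder for a real model call (e.g. a text-generation API).
    return "turning her eyes to her enormous son"

score = verbatim_fraction(generate(prompt), reference)
print(f"regurgitation: {score:.2f}")  # 1.00 would mean the passage came back verbatim
```

A successfully unlearned model should score near zero on checks like this while still answering the general-knowledge questions used to measure utility.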
In their study, the researchers found that the unlearning algorithms they tested did make models forget certain information. But they also hurt the models’ overall ability to answer questions, representing a tradeoff.
“Designing effective unlearning methods for models is challenging because the knowledge is tightly coupled to the model,” Shi explains. “For example, a model can be trained on copyrighted material (Harry Potter books) as well as freely available content on the Harry Potter wiki. When existing unlearning methods attempt to remove copyrighted Harry Potter books, they also have a significant impact on the model’s knowledge about the Harry Potter wiki.”
Are there any solutions to the problem? Not yet, and that underscores the need for further research, Shi said.
For now, vendors banking on unlearning as a solution to their training data problems seem to be at a loss. Perhaps a technical breakthrough will make unlearning possible someday. But for now, vendors will have to find another way to prevent their models from saying things they shouldn’t.