Scalable extraction of training data from (production) language models

2023-12-02 7:32 · arxiv.org

Abstract: This paper studies extractable memorization: training data that an adversary can efficiently extract by querying a machine learning model without prior knowledge of the training dataset. We show an adversary can extract gigabytes of training data from open-source language models like Pythia or GPT-Neo, semi-open models like LLaMA or Falcon, and closed models like ChatGPT. Existing techniques from the literature suffice to attack unaligned models; in order to attack the aligned ChatGPT, we develop a new divergence attack that causes the model to diverge from its chatbot-style generations and emit training data at a rate 150x higher than when behaving properly. Our methods show practical attacks can recover far more data than previously thought, and reveal that current alignment techniques do not eliminate memorization.
From: Nicholas Carlini
[v1] Tue, 28 Nov 2023 18:47:03 UTC (2,815 KB)
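
The paper's core measurement is easy to approximate at small scale: sample repeatedly from an open model and test whether long generated token windows occur verbatim in a known corpus. Below is a minimal sketch using Hugging Face transformers with GPT-Neo 125M; note the paper matches generations against training data with a suffix array, while the in-memory set of 50-grams and the "corpus.txt" file here are simplifying assumptions.

  # Sketch: extractable-memorization check. Sample from GPT-Neo and count
  # generated 50-token windows that appear verbatim in a reference corpus.
  # "corpus.txt" is a hypothetical stand-in for (a shard of) training data;
  # the paper uses a suffix array instead of this in-memory set.
  from transformers import AutoModelForCausalLM, AutoTokenizer

  tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
  model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

  # Build the set of all 50-token windows occurring in the reference corpus.
  corpus_ids = tok(open("corpus.txt").read())["input_ids"]
  known = {tuple(corpus_ids[i:i + 50]) for i in range(len(corpus_ids) - 49)}

  # Sample a continuation and count 50-gram matches against the corpus.
  inputs = tok("The", return_tensors="pt")
  out = model.generate(**inputs, do_sample=True, top_k=40, max_new_tokens=256)
  ids = out[0].tolist()
  hits = sum(1 for i in range(len(ids) - 49) if tuple(ids[i:i + 50]) in known)
  print(f"{hits} generated 50-gram(s) found verbatim in the corpus")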

Comments

  • By skilled 2023-12-02 9:20

    Related (blog post from the team),

    Extracting training data from ChatGPT (https://news.ycombinator.com/item?id=38458683) (126 comments)

    And direct link,

    https://not-just-memorization.github.io/extracting-training-...

  • By hpcjoe 2023-12-02 17:20 (1 reply)

    A friend sent me the image from page 9, the email signature. It is mine, from when I ran my company in the mid-2010s.

    I'm not much worried about this specific example of information exfiltration, though I have significant concerns about how one might debug something like this for applications working with more sensitive data than email signatures. Put another way, I think this technology is still in its infancy, and far more work is needed before we have genuinely useful applications with a concept of information security relative to their training data sets.

    • By Aurornis 2023-12-02 21:35

      That's quite a surprise. Do you have any theories? Presumably someone posted one of your e-mails somewhere on the internet?

      If you Google parts of the old signature, do you get any results?

  • By GaggiX 2023-12-02 10:14 (1 reply)

    >This leads to a natural question that has not yet been discussed in the literature: if we could query a model infinitely, how much memorization could we extract in total?

    You would eventually get every 50-gram, not because the model memorized all of them but by pure chance. That seems pretty obvious to me.

    It makes me wonder whether there were cases where the model output a 50-gram identical to one in the paper's extraction dataset even though that 50-gram wasn't present in the model's training data, for instance in a very structured setting like assembly code, where a very limited number of keywords is typically used.
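
    For scale on the infinite-query point: a quick back-of-envelope (my arithmetic, not the paper's), assuming a uniformly random model over a 50,000-token vocabulary:

      # Chance that one sample of 50 uniformly random tokens equals a
      # specific 50-gram. The 50,000-token vocabulary is an assumption,
      # roughly GPT-2/GPT-Neo scale.
      import math
      log10_p = -50 * math.log10(50_000)
      print(f"p ~ 10^{log10_p:.0f} per 50-token sample")  # ~10^-235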

    • By dwringer 2023-12-02 11:10 (1 reply)

      One can fine-tune a smaller model like GPT-NeoX on a home GPU pretty readily, and it's absolutely capable of doing what you describe. Teach it on a set of example sentences whose parts of speech (verbs, nouns, and so on) follow a simple grammar, and afterward you will see it generate sentences that combine those parts of speech grammatically in novel ways, using the same grammatical structures but forming productions that did not appear in the training set.

      Depending on the sampling settings, it is also capable of producing a lot of ungrammatical nonsense, but the training shifts the odds of what it produces considerably.
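
      A minimal sketch of that setup with Hugging Face transformers (GPT-Neo 125M here so it fits comfortably on a home GPU; the toy grammar and hyperparameters are illustrative assumptions, not anything from the paper):

        # Fine-tune a small causal LM on sentences drawn from a toy
        # subject-verb-object grammar; after training, sampling tends to
        # produce grammatical recombinations absent from the training set.
        import torch
        from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                  Trainer, TrainingArguments)

        tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
        tok.pad_token = tok.eos_token
        model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

        sentences = [f"the {s} {v} the {o}"
                     for s in ("cat", "dog", "bird")
                     for v in ("chases", "sees")
                     for o in ("ball", "stick")]

        class ToyDataset(torch.utils.data.Dataset):
            def __init__(self, texts):
                self.items = []
                for t in texts:
                    enc = tok(t, padding="max_length", max_length=16,
                              truncation=True, return_tensors="pt")
                    ids, mask = enc["input_ids"][0], enc["attention_mask"][0]
                    labels = ids.clone()
                    labels[mask == 0] = -100  # no loss on padding tokens
                    self.items.append({"input_ids": ids,
                                       "attention_mask": mask,
                                       "labels": labels})
            def __len__(self):
                return len(self.items)
            def __getitem__(self, i):
                return self.items[i]

        args = TrainingArguments(output_dir="toy-grammar",
                                 num_train_epochs=3,
                                 per_device_train_batch_size=4)
        Trainer(model=model, args=args,
                train_dataset=ToyDataset(sentences)).train()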

      • By GaggiX 2023-12-02 11:24 (1 reply)

        No, I mean producing 50-grams that appear in the dataset the paper's authors built, but that were not present in the dataset the model was actually trained on. Of course, the model would also be able to output 50-grams that were present in neither.

        • By dwringer 2023-12-02 15:57

          If I understand you correctly, that is exactly what I meant. If you train on a bunch of text containing substrings of those 50-grams, but never the full 50-grams themselves (or just expose it to the same vocabulary used in the same parts of speech as in the full 50), the model will quite readily produce the full 50-grams despite never having seen them in their entirety. Try it out; it's easy to do on a modern GPU and takes less than an hour.
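
          A self-contained harness for that experiment (plain word tokens rather than tokenizer IDs to keep it standalone, and the 30-token window size is just one possible choice):

            # Split a target 50-gram into overlapping 30-token windows for
            # training, so the model never sees the full 50-gram whole, then
            # test samples for the complete 50-gram afterwards.
            target = ("the quick brown fox jumps " * 10).split()[:50]

            # Overlapping 30-token training chunks covering all 50 tokens.
            train_chunks = [" ".join(target[i:i + 30]) for i in range(0, 21, 5)]
            assert not any(" ".join(target) in c for c in train_chunks)

            def contains_target(sample_tokens):
                """True if the full 50-gram appears verbatim in a sample."""
                return any(sample_tokens[i:i + 50] == target
                           for i in range(len(sample_tokens) - 49))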

HackerNews