Summary: DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models (arxiv.org)
One Line
DEPN algorithm identifies and modifies privacy neurons in language models, effectively minimizing data exposure while maintaining model performance.
Key Points
- DEPN is a framework proposed to detect and edit privacy neurons in pretrained language models.
- The framework consists of three components: a privacy neuron detector, a privacy neuron editor, and a privacy neuron aggregator.
- The privacy neuron detector uses gradient integration (integrated gradients) to compute each neuron's contribution to the model's prediction of the tokens in a private sequence, yielding an overall privacy attribution score for that information.
- The privacy neuron editor sets the activations of selected privacy neurons to zero, effectively erasing the model's memorization of private information.
- DEPN can significantly and efficiently reduce the risk of private data leakage without degrading model performance.
- In-depth analyses reveal the relationship between model memorization and privacy neurons, indicating that larger models tend to have a higher risk of privacy breaches.
- The framework has limitations: it is most effective when only a small number of data leaks must be erased, and its performance degrades as the amount of private data to be removed grows.
- DEPN offers a promising framework for privacy protection in pretrained language models, contributing to model dememorization.
Summaries
20 word summary
DEPN detects and edits privacy neurons in pretrained language models using gradient integration, reducing data exposure without impacting model performance.
63 word summary
DEPN is a framework that detects and edits privacy neurons in pretrained language models. It includes a privacy neuron detector, editor, and aggregator. The detector uses gradient integration to identify neurons contributing to privacy leakage. The editor sets activations of selected privacy neurons to zero, reducing data exposure without impacting model performance. DEPN offers a promising approach to privacy preservation and model dememorization.
154 word summary
DEPN is a framework designed to detect and edit privacy neurons in pretrained language models, aiming to mitigate the risk of data leakage. It consists of three components: a privacy neuron detector, editor, and aggregator. The detector uses gradient integration to estimate a privacy attribution score for identifying neurons contributing to privacy leakage. The editor sets activations of selected privacy neurons to zero, erasing the model's memorization of private information. The aggregator facilitates batch editing by calculating privacy attribution scores for each sequence. Experimental results demonstrate DEPN's effectiveness in reducing private data exposure without impacting model performance. The framework introduces novel methods for privacy protection in pretrained language models and reveals the relationship between model memorization and privacy neurons. However, limitations include reduced effectiveness with a large number of data leaks and dependency on the amount of private data to be erased. Overall, DEPN offers a promising approach to privacy preservation and model dememorization.
431 word summary
DEPN is a framework proposed to detect and edit privacy neurons in pretrained language models. The goal is to reduce the risk of data leakage in large language models, which have been shown to memorize and regurgitate a significant portion of the training data. The framework consists of three components: a privacy neuron detector, a privacy neuron editor, and a privacy neuron aggregator.
The privacy neuron detector uses gradient integration (integrated gradients) to compute each neuron's contribution to the model's prediction of the tokens in a private sequence, and aggregates these contributions into an overall privacy attribution score. This score measures the neuron's contribution to the leakage of private information. The privacy neurons with the highest scores are selected for editing.
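As an illustration of the attribution step, the sketch below approximates an integrated-gradients score for a single neuron. The one-dimensional `prob_fn`, the numerical gradients, and the step count are simplifying assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def privacy_attribution(prob_fn, activation, steps=20):
    """Approximate an integrated-gradients attribution for one neuron.

    prob_fn: toy stand-in mapping a neuron's activation to the model's
             probability of emitting a private token.
    The attribution is activation times the average gradient of prob_fn
    along the straight path from 0 to the observed activation,
    approximated by a Riemann sum with numerical gradients.
    """
    eps = 1e-5
    grads = []
    for k in range(1, steps + 1):
        a = activation * k / steps
        # central-difference estimate of d prob_fn / d a at point a
        grads.append((prob_fn(a + eps) - prob_fn(a - eps)) / (2 * eps))
    return activation * float(np.mean(grads))

# Toy probability: rises monotonically with the activation.
toy_prob = lambda a: 1.0 / (1.0 + np.exp(-a))
score = privacy_attribution(toy_prob, activation=2.0)
# In one dimension the attribution approximates prob_fn(x) - prob_fn(0).
```

Neurons whose attribution scores are largest are the ones whose activations most increase the probability of reproducing the private tokens.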
The privacy neuron editor sets the activations of the selected privacy neurons to zero, effectively erasing the model's memorization of private information. This editing strategy minimizes the flow of information through these neurons.
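A minimal sketch of this zeroing edit, assuming one layer's FFN activations are available as a matrix (the function name and shapes are illustrative):

```python
import numpy as np

def edit_privacy_neurons(hidden, neuron_indices):
    """Zero the selected privacy neurons in one layer's activations.

    hidden: (seq_len, hidden_dim) matrix of FFN activations.
    neuron_indices: columns identified as privacy neurons.
    Returns an edited copy so the original activations stay intact.
    """
    edited = hidden.copy()
    # No information flows through the selected neurons after this edit.
    edited[:, neuron_indices] = 0.0
    return edited

hidden = np.arange(12.0).reshape(3, 4)
edited = edit_privacy_neurons(hidden, [1, 3])
```

In a real model this would be applied inside the forward pass (e.g., via a hook on the FFN layer) rather than on a standalone matrix.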
The privacy neuron aggregator facilitates privacy information editing in batches when processing multiple sentences at the same time. It calculates the privacy attribution score matrix for each sequence in the batch and selects the top privacy neurons based on these scores for editing.
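The aggregation step might be sketched as follows; summing scores across sequences and the `top_k` selection rule are assumptions for illustration, not necessarily the paper's exact aggregation:

```python
import numpy as np

def aggregate_privacy_neurons(score_matrices, top_k=3):
    """Aggregate per-sequence privacy attribution scores and pick neurons.

    score_matrices: list of (n_layers, n_neurons) attribution matrices,
                    one per private sequence in the batch.
    Scores are summed across sequences (an assumed aggregation rule),
    and the top_k (layer, neuron) pairs are returned for editing.
    """
    total = np.sum(score_matrices, axis=0)           # (n_layers, n_neurons)
    flat = np.argsort(total, axis=None)[::-1][:top_k]
    return [tuple(int(j) for j in np.unravel_index(i, total.shape))
            for i in flat]

# Two toy sequences over a 2-layer, 4-neuron model.
mats = [np.array([[1., 0., 1., 0.], [0., 1., 0., 0.]]),
        np.array([[2., 0., 0., 0.], [0., 1., 0., 0.]])]
top = aggregate_privacy_neurons(mats, top_k=2)
```

The returned (layer, neuron) pairs would then be passed to the editor for zeroing.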
Experimental results show that DEPN significantly and efficiently reduces the exposure of private data without degrading model performance. It is highly effective at protecting privacy and is well suited to scenarios where the model has memorized data deeply.
The framework has several contributions. It explores model editing for privacy protection in pretrained language models, providing a new way to address privacy risks. It introduces the privacy neuron detector to locate privacy neurons based on gradient attribution and the privacy neuron editor to dememorize private information. The framework is efficient and robust, as demonstrated by experimental results and analyses.
In-depth analyses further reveal the relationship between model memorization and privacy neurons. The distribution of privacy neurons over layers indicates that as training progresses, the aggregation of privacy neurons in deep layers increases. Larger models tend to have a higher risk of privacy breaches, but the DEPN framework offers better protection for these models.
The framework has some limitations. It is most effective when only a small number of leaked items must be erased; with a large batch of private data its performance weakens, so its effectiveness depends on the amount of private data to be erased.
In conclusion, DEPN is a promising framework for privacy protection in pretrained language models. It effectively detects and edits privacy neurons, reducing the risk of data leakage without compromising model performance. The framework offers a new approach to privacy preservation and contributes to model dememorization.