Summary: "Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation" (arxiv.org)
6,263 words - PDF document
One Line
Hashmarks is a protocol that uses cryptographic hashing to evaluate language models on sensitive topics without disclosing the reference answers.
Key Points
- Hashmarking is a privacy-preserving protocol for evaluating language models on sensitive topics without disclosing the correct answers.
- It involves cryptographically hashing the reference solutions of benchmark questions.
- Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results.
- The protocol is resilient against traditional attack vectors such as brute-force and dictionary attacks.
- The protocol introduces challenges such as augmented dictionary attacks and deception, which need to be addressed.
- Diluting resources and obfuscating questions are potential strategies to mitigate attention hazards.
- Zero-knowledge cryptography and other modifications may enhance the protocol in the future.
Summaries
15 word summary
Hashmarks is a privacy-preserving protocol for evaluating language models on sensitive topics using cryptographic hashing.
68 word summary
Hashmarks is a privacy-preserving protocol for evaluating language models on sensitive topics. It involves cryptographically hashing reference solutions to create benchmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. The protocol is resilient against traditional attack vectors but vulnerable to augmented dictionary attacks and deception. Future enhancements may include diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques.
134 word summary
Hashmarks is a privacy-preserving protocol proposed as an alternative to traditional benchmarks for evaluating language models on sensitive topics. It involves cryptographically hashing reference solutions to create benchmarks called hashmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors. Traditional benchmarks disclose correct answers, which is unsuitable for high-stakes AI evaluation. Cryptography offers ideas for proving statements without disclosing sensitive information. The protocol resists attacks through slow hashing, salting, and forcing each question to be attacked from scratch. However, it is vulnerable to augmented dictionary attacks and deception. Diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques may enhance the protocol in the future.
395 word summary
Hashmarks is a privacy-preserving protocol proposed by the authors as an alternative to traditional open-source benchmarks for evaluating language models on sensitive topics. Traditional benchmarks disclose the correct answers, which is not suitable for high-stakes AI evaluation. Hashmarking involves cryptographically hashing reference solutions, creating benchmarks called hashmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors.
While traditional question-answering (QA) benchmarks have been important in AI development, disclosing reference solutions on sensitive topics like bioterrorism could inadvertently provide a publicly available compendium of knowledge. Therefore, there is a need for secure evaluation protocols. Cryptography offers ideas and practices for proving statements without disclosing sensitive information. The authors propose a privacy-preserving evaluation protocol for assessing language models' capabilities in sensitive domains, drawing on concepts like irreversible hashing, federated learning, and differential privacy.
The hashmarking protocol involves experts creating question-answer pairs and hashing the correct answers using a slow hashing algorithm. The questions are used as salt during the hashing process. The experts send the hashed pairs to the auditor. The auditor sends each expert the cleartext questions contributed by other experts, and the experts provide answers, hash them using the questions as salt, and send the results to the auditor. The auditor filters the pairs based on non-empty answers and inter-annotator agreement, and publishes the filtered collection for third parties to verify their knowledge.
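The expert-side hashing step described above can be sketched as follows; the function name, the answer normalization, and the iteration count are illustrative assumptions, not parameters specified in the paper:

```python
import hashlib

def hash_answer(question: str, answer: str, iterations: int = 600_000) -> str:
    """Slow-hash an answer, using its question as the salt.

    The normalization and iteration count are illustrative choices.
    """
    digest = hashlib.pbkdf2_hmac(
        "sha256",
        answer.strip().lower().encode("utf-8"),  # normalize before hashing
        question.encode("utf-8"),                # the question doubles as the salt
        iterations,                              # slow hashing deters brute force
    )
    return digest.hex()

# An expert submits only the question and the hash of the reference answer.
question = "What is the codeword in this toy example?"
hashed_reference = hash_answer(question, "swordfish")
```

Because each question serves as its own salt, precomputed tables are useless: an attacker must redo the slow hashing separately for every question.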
Hashmarks have specific requirements such as obscure yet unambiguous answers and narrow, well-defined questions. The protocol mitigates attacks through slow hashing, salting, and starting each question from scratch. However, it is vulnerable to augmented dictionary attacks and deception from language models. Diluting resources, skewing entry distributions, obfuscating questions, and zero-knowledge cryptography techniques are potential strategies to address these challenges. Future work may focus on modifications to the protocol or other evaluation methods.
In conclusion, hashmarking is a privacy-preserving protocol for evaluating AI models on sensitive topics. It allows knowledge verification without disclosing reference solutions and mitigates traditional attacks. However, it introduces new challenges such as augmented dictionary attacks and deception. Diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques may enhance the protocol in the future. Hashmarks should be seen as one step towards secure high-stakes AI evaluation.
649 word summary
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
Traditional open-source benchmarks are not suitable for evaluating language models on sensitive topics such as bioterrorism or cyberwarfare because they disclose the correct answers. Enforcing closed-quarters evaluations may stifle development and erode trust. To address this, the authors propose hashmarking, a protocol for evaluating language models in the open without revealing the correct answers. A hashmark is a benchmark with reference solutions that have been cryptographically hashed. The protocol involves experts hashing their correct answers and sending them to an auditor, who then publishes the cleartext questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors such as brute-force and dictionary attacks.
Traditional question-answering (QA) benchmarks have been important in AI development, providing standardized metrics for fair comparisons and measuring progress. These benchmarks typically contain a large number of data points with questions, correct answers, and distractor answers. They are sourced from crowd-workers or developers and made public for evaluation. However, there is a need for benchmarks that assess models' capabilities on sensitive topics. Disclosing the reference solutions of benchmark questions on topics like bioterrorism could inadvertently provide a publicly available compendium of knowledge on the subject. Secure evaluation protocols are required.
Cryptography offers ideas and practices for proving statements without disclosing sensitive information. For example, password authentication can determine if a candidate password matches the correct password without knowing the correct password itself. This is achieved through irreversible hashing during user registration and checking if the hashed candidate password matches the hashed correct password during authentication. Federated learning and differential privacy are other techniques that protect privacy while extracting meaningful insights. Drawing on these concepts, the authors propose a privacy-preserving evaluation protocol for assessing language models' capabilities in sensitive domains.
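The password-authentication analogy can be made concrete with a short sketch; the salt size and iteration count are illustrative choices:

```python
import hashlib
import hmac
import os

def register(password: str) -> tuple[bytes, bytes]:
    """Store a random salt and the slow hash of the password, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    return salt, digest

def authenticate(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Re-hash the candidate and compare digests in constant time."""
    candidate_digest = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, 200_000)
    return hmac.compare_digest(candidate_digest, digest)

salt, digest = register("correct horse battery staple")
assert authenticate("correct horse battery staple", salt, digest)
assert not authenticate("wrong guess", salt, digest)
```

The server can thus confirm a match without ever storing the cleartext password, which is exactly the property hashmarking transfers to benchmark answers.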
The hashmarking protocol involves experts creating question-answer pairs and hashing the correct answers using a slow hashing algorithm. The questions are used as salt during the hashing process. The experts send the hashed question-answer pairs to the auditor. The auditor then sends each expert the cleartext questions contributed by other experts. The experts provide answers to these questions, hash them using the questions as salt, and send the results to the auditor. The auditor filters the question-answer pairs based on the number of non-empty answers and inter-annotator agreement. The filtered collection is published for third parties to verify their knowledge.
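The third-party verification step might look like the sketch below; the `slow_hash` parameters and the toy hashmark entry are hypothetical, not taken from the paper:

```python
import hashlib

def slow_hash(question: str, answer: str) -> str:
    """Hash an answer with its question as salt (iteration count is an illustrative choice)."""
    digest = hashlib.pbkdf2_hmac(
        "sha256",
        answer.strip().lower().encode("utf-8"),
        question.encode("utf-8"),
        200_000,
    )
    return digest.hex()

# A published hashmark: cleartext questions, hashed reference answers.
QUESTION = "What is the passphrase in this toy example?"
HASHMARK = {QUESTION: slow_hash(QUESTION, "swordfish")}

def score(model_answers: dict[str, str]) -> float:
    """Fraction of questions where the hashed model answer matches the reference hash."""
    hits = sum(slow_hash(q, a) == HASHMARK[q] for q, a in model_answers.items())
    return hits / len(HASHMARK)
```

A third party hashes its own candidate answers and compares digests, so knowledge is verified without the reference solutions ever appearing in cleartext.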
Hashmarks have certain desiderata. Answers should be obscure yet unambiguous, and questions should have narrow, well-defined answers. The protocol mitigates attacks such as brute-force and dictionary attacks through slow hashing and salting. Rainbow table attacks are hindered because the per-question salt forces attackers to start each question from scratch. However, hashmarks are vulnerable to augmented dictionary attacks that prioritize candidate answers by likelihood. Deception is another challenge, as language models may verbalize answers that contradict their internal knowledge. Reward shaping and misreporting results are also potential failure modes. Attention hazards and the Streisand effect may arise due to the publication of hashmarks.
The authors suggest diluting resources by incorporating false leads and skewing the distribution of entries based on perceived sensitivity. They also consider obfuscating the questions but note the trade-off with evaluating model accuracy. Zero-knowledge cryptography techniques could enable parties to prove their performance without disclosing specific details. However, current hashmarks do not ensure honesty from the evaluated entity. Future work may focus on modifications to the protocol or other evaluation methods to address these challenges.
In conclusion, hashmarking offers a privacy-preserving protocol for evaluating AI models on sensitive topics. It allows knowledge verification without disclosing reference solutions. The protocol mitigates traditional attacks and introduces new challenges such as augmented dictionary attacks and deception. Diluting resources and obfuscating questions are potential strategies to mitigate attention hazards. Zero-knowledge cryptography and other modifications may enhance the protocol in the future. Hashmarks should be seen as one step towards secure high-stakes AI evaluation.