Summary: "Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation" (arxiv.org)
6,263 words - PDF document
One Line
Hashmarks is a protocol that uses cryptographic hashing to evaluate language models on sensitive topics without disclosing the reference answers.
Key Points
- Hashmarking is a privacy-preserving protocol for evaluating language models on sensitive topics without disclosing the correct answers.
- It involves cryptographically hashing the reference solutions of benchmark questions.
- Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results.
- The protocol is resilient against traditional attack vectors such as brute-force and dictionary attacks.
- The protocol introduces challenges such as augmented dictionary attacks and deception, which need to be addressed.
- Diluting resources and obfuscating questions are potential strategies to mitigate attention hazards.
- Zero-knowledge cryptography and other modifications may enhance the protocol in the future.
Summaries
15 word summary
Hashmarks is a privacy-preserving protocol for evaluating language models on sensitive topics using cryptographic hashing.
68 word summary
Hashmarks is a privacy-preserving protocol for evaluating language models on sensitive topics. It involves cryptographically hashing reference solutions to create benchmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. The protocol is resilient against traditional attack vectors but vulnerable to augmented dictionary attacks and deception. Future enhancements may include diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques.
134 word summary
Hashmarks is a privacy-preserving protocol proposed as an alternative to traditional benchmarks for evaluating language models on sensitive topics. It involves cryptographically hashing reference solutions to create benchmarks called hashmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors. Traditional benchmarks disclose correct answers, which is unsuitable for high-stakes AI evaluation. Cryptography offers ideas for proving statements without disclosing sensitive information. The protocol resists attacks through slow hashing, salting, and forcing each question to be attacked from scratch. However, it is vulnerable to augmented dictionary attacks and deception. Diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques may enhance the protocol in the future.
395 word summary
Hashmarks is a privacy-preserving protocol proposed by the authors as an alternative to traditional open-source benchmarks for evaluating language models on sensitive topics. Traditional benchmarks disclose the correct answers, which is not suitable for high-stakes AI evaluation. Hashmarking involves cryptographically hashing reference solutions, creating benchmarks called hashmarks. Experts hash their correct answers and send them to an auditor, who publishes the questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors.
While traditional question-answering (QA) benchmarks have been important in AI development, disclosing reference solutions on sensitive topics like bioterrorism could inadvertently provide a publicly available compendium of knowledge. Therefore, there is a need for secure evaluation protocols. Cryptography offers ideas and practices for proving statements without disclosing sensitive information. The authors propose a privacy-preserving evaluation protocol for assessing language models' capabilities in sensitive domains, drawing on concepts like irreversible hashing, federated learning, and differential privacy.
The hashmarking protocol involves experts creating question-answer pairs and hashing the correct answers using a slow hashing algorithm. The questions are used as salt during the hashing process. The experts send the hashed pairs to the auditor. The auditor sends each expert the cleartext questions contributed by other experts, and the experts provide answers, hash them using the questions as salt, and send the results to the auditor. The auditor filters the pairs based on non-empty answers and inter-annotator agreement, and publishes the filtered collection for third parties to verify their knowledge.
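The expert-side hashing step described above can be sketched as follows; the function name, the answer normalization, and the iteration count are illustrative assumptions, not parameters specified in the paper:

```python
import hashlib

def hash_answer(question: str, answer: str, iterations: int = 600_000) -> str:
    """Slow-hash an answer, using its question as the salt.

    The normalization and iteration count are illustrative choices.
    """
    digest = hashlib.pbkdf2_hmac(
        "sha256",
        answer.strip().lower().encode("utf-8"),  # normalize before hashing
        question.encode("utf-8"),                # the question doubles as the salt
        iterations,                              # slow hashing deters brute force
    )
    return digest.hex()

# An expert submits only the question and the hash of the reference answer.
question = "What is the codeword in this toy example?"
hashed_reference = hash_answer(question, "swordfish")
```

Because each question serves as its own salt, precomputed tables are useless: an attacker must redo the slow hashing separately for every question.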
Hashmarks have specific requirements such as obscure yet unambiguous answers and narrow, well-defined questions. The protocol mitigates attacks through slow hashing, salting, and starting each question from scratch. However, it is vulnerable to augmented dictionary attacks and deception from language models. Diluting resources, skewing entry distributions, obfuscating questions, and zero-knowledge cryptography techniques are potential strategies to address these challenges. Future work may focus on modifications to the protocol or other evaluation methods.
In conclusion, hashmarking is a privacy-preserving protocol for evaluating AI models on sensitive topics. It allows knowledge verification without disclosing reference solutions and mitigates traditional attacks. However, it introduces new challenges such as augmented dictionary attacks and deception. Diluting resources, obfuscating questions, and exploring zero-knowledge cryptography techniques may enhance the protocol in the future. Hashmarks should be seen as one step towards secure high-stakes AI evaluation.
649 word summary
Hashmarks: Privacy-Preserving Benchmarks for High-Stakes AI Evaluation
Traditional open-source benchmarks are not suitable for evaluating language models on sensitive topics such as bioterrorism or cyberwarfare because they disclose the correct answers. Enforcing closed-quarters evaluations may stifle development and erode trust. To address this, the authors propose hashmarking, a protocol for evaluating language models in the open without revealing the correct answers. A hashmark is a benchmark with reference solutions that have been cryptographically hashed. The protocol involves experts hashing their correct answers and sending them to an auditor, who then publishes the cleartext questions alongside their hashed answers. Third parties can verify their knowledge by attempting to answer the questions and comparing the hashed results. The protocol is resilient against traditional attack vectors such as brute-force and dictionary attacks.
Traditional question-answering (QA) benchmarks have been important in AI development, providing standardized metrics for fair comparisons and measuring progress. These benchmarks typically contain a large number of data points with questions, correct answers, and distractor answers. They are sourced from crowd-workers or developers and made public for evaluation. However, there is a need for benchmarks that assess models' capabilities on sensitive topics. Disclosing the reference solutions of benchmark questions on topics like bioterrorism could inadvertently provide a publicly available compendium of knowledge on the subject. Secure evaluation protocols are required.
Cryptography offers ideas and practices for proving statements without disclosing sensitive information. For example, password authentication can determine if a candidate password matches the correct password without knowing the correct password itself. This is achieved through irreversible hashing during user registration and checking if the hashed candidate password matches the hashed correct password during authentication. Federated learning and differential privacy are other techniques that protect privacy while extracting meaningful insights. Drawing on these concepts, the authors propose a privacy-preserving evaluation protocol for assessing language models' capabilities in sensitive domains.
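The password-authentication analogy can be made concrete with a short sketch; the salt size and iteration count are illustrative choices:

```python
import hashlib
import hmac
import os

def register(password: str) -> tuple[bytes, bytes]:
    """Store a random salt and the slow hash of the password, never the password itself."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    return salt, digest

def authenticate(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Re-hash the candidate and compare digests in constant time."""
    candidate_digest = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"), salt, 200_000)
    return hmac.compare_digest(candidate_digest, digest)

salt, digest = register("correct horse battery staple")
assert authenticate("correct horse battery staple", salt, digest)
assert not authenticate("wrong guess", salt, digest)
```

The server can thus confirm a match without ever storing the cleartext password, which is exactly the property hashmarking transfers to benchmark answers.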
The hashmarking protocol involves experts creating question-answer pairs and hashing the correct answers using a slow hashing algorithm. The questions are used as salt during the hashing process. The experts send the hashed question-answer pairs to the auditor. The auditor then sends each expert the cleartext questions contributed by other experts. The experts provide answers to these questions, hash them using the questions as salt, and send the results to the auditor. The auditor filters the question-answer pairs based on the number of non-empty answers and inter-annotator agreement. The filtered collection is published for third parties to verify their knowledge.
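The third-party verification step might look like the sketch below; the `slow_hash` parameters and the toy hashmark entry are hypothetical, not taken from the paper:

```python
import hashlib

def slow_hash(question: str, answer: str) -> str:
    """Hash an answer with its question as salt (iteration count is an illustrative choice)."""
    digest = hashlib.pbkdf2_hmac(
        "sha256",
        answer.strip().lower().encode("utf-8"),
        question.encode("utf-8"),
        200_000,
    )
    return digest.hex()

# A published hashmark: cleartext questions, hashed reference answers.
QUESTION = "What is the passphrase in this toy example?"
HASHMARK = {QUESTION: slow_hash(QUESTION, "swordfish")}

def score(model_answers: dict[str, str]) -> float:
    """Fraction of questions where the hashed model answer matches the reference hash."""
    hits = sum(slow_hash(q, a) == HASHMARK[q] for q, a in model_answers.items())
    return hits / len(HASHMARK)
```

A third party hashes its own candidate answers and compares digests, so knowledge is verified without the reference solutions ever appearing in cleartext.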
Hashmarks have certain desiderata. Answers should be obscure yet unambiguous, and questions should have narrow, well-defined answers. The protocol mitigates attacks such as brute-force and dictionary attacks through slow hashing and salting. Rainbow table attacks are hindered because the per-question salt forces attackers to start each question from scratch. However, hashmarks are vulnerable to augmented dictionary attacks that prioritize candidate answers by likelihood. Deception is another challenge, as language models may verbalize answers that contradict their internal knowledge. Reward shaping and misreporting results are also potential failure modes. Attention hazards and the Streisand effect may arise due to the publication of hashmarks.
The authors suggest diluting resources by incorporating false leads and skewing the distribution of entries based on perceived sensitivity. They also consider obfuscating the questions but note the trade-off with evaluating model accuracy. Zero-knowledge cryptography techniques could enable parties to prove their performance without disclosing specific details. However, current hashmarks do not ensure honesty from the evaluated entity. Future work may focus on modifications to the protocol or other evaluation methods to address these challenges.
In conclusion, hashmarking offers a privacy-preserving protocol for evaluating AI models on sensitive topics. It allows knowledge verification without disclosing reference solutions. The protocol mitigates traditional attacks and introduces new challenges such as augmented dictionary attacks and deception. Diluting resources and obfuscating questions are potential strategies to mitigate attention hazards. Zero-knowledge cryptography and other modifications may enhance the protocol in the future. Hashmarks should be seen as one step towards secure high-stakes AI evaluation.