Summary: Predicting Prompt Refusal in Language Models (arxiv.org)
3,465 words - PDF document
One Line
Michigan State University researchers developed a classifier to predict whether OpenAI's ChatGPT language model will refuse a given prompt, finding that a more sophisticated model like BERT was needed for accurate prediction and that negative generalizations about demographic groups are among the surest predictors of ChatGPT's refusals.
Key Points
- Increasing the sample size of the automatically labeled dataset could improve the performance of prompt classifiers for language models like ChatGPT.
- Negative generalizations of demographic groups are among the strongest predictors of prompt refusal in ChatGPT.
- BERT outperformed classical models for prompt refusal classification in ChatGPT.
- Compliance with or refusal of prompts falls on a continuum of responses, rather than a binary categorization.
- Fair and unbiased AI is important, particularly in language models like ChatGPT that mediate the flow of information to a large proportion of humanity.
Summaries
257 word summary
Researchers at Michigan State University developed a prompt classifier that predicts whether OpenAI's ChatGPT language model will refuse a given prompt, treating prompt refusal as a measurable form of bias. To train the classifier, the researchers compiled a database of prompts, generating or collecting a large number of offensive prompts, and trained refusal classifiers on ChatGPT's responses, identifying a variety of expressions shared across refusal responses. Compliance and refusal turned out to fall on a continuum of responses rather than a clean binary, and certain words and phrases proved to be strong predictors of one or the other; negative generalizations about demographic groups are among the surest predictors of ChatGPT's refusals. For refusal classification, BERT significantly outperformed classical models, reaching over 92% accuracy; for the harder task of prompt classification, logistic regression and random forest achieved accuracies in the low 70%s, with BERT again ahead. The authors suggest that increasing the sample size of the automatically labeled dataset could further improve the prompt classifier's performance. Because OpenAI's API allows access to many ChatGPT snapshots, feature importance can also be compared between model snapshots as a characterization of OpenAI's ongoing alignment work.
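The paper's code is not included in this summary; the following minimal sketch shows what the classical refusal classifiers it describes might look like, assuming a hypothetical responses.csv of hand-labeled ChatGPT responses (the file name and column names are invented for illustration):

```python
# Sketch of a classical refusal classifier: TF-IDF n-gram features feeding
# logistic regression and random forest, as described in the summary above.
# Assumes a hypothetical responses.csv with columns "response" and "refused".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("responses.csv")  # hypothetical hand-labeled ChatGPT responses
X_train, X_test, y_train, y_test = train_test_split(
    df["response"], df["refused"], test_size=0.2, random_state=0
)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=300)),
]:
    # Unigrams and bigrams let refusal markers such as "sorry" and
    # "language model" surface as weighted features.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```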
891 word summary
The article summarizes a study of prompt refusal prediction in language models, using OpenAI's ChatGPT as a test case. Responses were initially classified as either complied-with or refused, and a hand-labeled dataset was created for training the refusal classifier. The researchers investigated several candidate datasets, including OpenAI's moderation safety dataset and a 4chan archive, but found them inadequate; they ultimately used a dataset of insincere questions from Quora, which they manually labeled, as their largest hand-labeled dataset. Refusal classifiers were trained on these manually labeled ChatGPT responses, a variety of shared expressions were identified in refusal responses, and three model types were evaluated both for identifying ChatGPT's refusals and for predicting compliance or refusal from the prompt alone.
Compliance and refusal fall on a smooth continuum of responses rather than a neat binary. In prompts, questions beginning with "what are" are strong predictors of compliance, while controversial figures like "Trump", demographic groups in plural form, and negative adjectives like "stupid" are among the strongest predictors of refusal. In responses, the words "the" and "in" strongly indicate compliance, while expressions like "cannot", "sorry", and "language model" strongly indicate refusal. The specific figure named also matters: prompts asking ChatGPT to praise Joe Biden and Donald Trump might receive opposite responses, while prompts about Joe Biden and Barack Obama might receive the same response, given the similarity between Biden and Obama.
For refusal classification, a logistic regression model reached 82% accuracy and a random forest 76%, while BERT significantly outperformed the classical models at over 92%. Prompt classification is more difficult: logistic regression and random forest achieved accuracies in the low 70%s, and BERT still outperformed the classical models, though to a lesser degree; a more sophisticated model like BERT was needed for accurate prediction. Prompt classifiers were trained on 10,000 samples from the Quora Insincere Questions dataset, with responses automatically labeled by the refusal classifier, enabling automatic bootstrapping of the dataset to a larger size; it is thus possible to predict ChatGPT's inclination to comply with particular prompts at scale.
The authors suggest that increasing the sample size of the automatically labeled dataset could improve the prompt classifier's performance, that the effect of ChatGPT's internal randomness (its sampling temperature) could be characterized by querying each prompt several times, and that employing multiple manual labelers for refusal might improve the quality of the hand-labeled data. Because OpenAI's API allows access to many ChatGPT snapshots, feature importance could also be compared between model snapshots as a characterization of OpenAI's ongoing alignment work.
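As a rough illustration of those last two suggestions, the sketch below queries multiple ChatGPT snapshots through the OpenAI Python client, sampling each prompt several times to expose the effect of temperature; the snapshot ids and the example prompt are assumptions, not taken from the paper:

```python
# Sketch: query several ChatGPT snapshots, sampling each prompt n times,
# so the resulting responses can be fed to a refusal classifier to
# estimate P(refusal | prompt, snapshot) under sampling randomness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Snapshot ids are examples only; consult the API's current model list.
SNAPSHOTS = ["gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"]

def sample_responses(prompt: str, model: str, n: int = 5,
                     temperature: float = 1.0) -> list[str]:
    """Query one snapshot n times to expose the effect of temperature."""
    responses = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

prompt = "Why are all politicians so stupid?"  # illustrative refusal-prone prompt
for snapshot in SNAPSHOTS:
    for reply in sample_responses(prompt, snapshot):
        print(snapshot, "->", reply[:60])
```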
The researchers mapped subcategories of refusal responses to binary labels for the refusal classifier. They found that many text strings in candidate datasets were too fragmentary to be properly understood as prompts, and that insincere questions about sensitive topics such as caste prejudice and Indian political party preferences were prevalent in the Quora dataset. To train the prompt classifier, a database of prompts was compiled and a large number of offensive prompts were generated or found; the prompts were manually labeled as either refused or complied-with, and a refusal classifier was trained that performed well. Two further datasets were used: the Political Figures dataset and the New York Post dataset. The Political Figures dataset consisted of eight template sentences with varying sentiments, designed to elicit political bias with respect to public figures; the New York Post dataset was drawn from an article alleging bias in ChatGPT, which gave several examples. The lack of template diversity in the Political Figures dataset yielded a few disproportionately important and prevalent n-grams, such as "murdering" and "statue." Although initial examples of prompt refusal appeared cleanly binary, ChatGPT's responses turned out to form a smooth continuum from compliance to refusal.
The study situates prompt refusal within broader work on bias in ChatGPT. Related work includes RightWingGPT, a model fine-tuned to hold opposite biases to ChatGPT on the political compass test; a study finding that ChatGPT exhibits low levels of psychological dark traits; and findings that ChatGPT's political beliefs are left-libertarian and strongly environmentalist. The study highlights the importance of fair and unbiased AI, particularly in language models like ChatGPT that mediate the flow of information to a large proportion of humanity.
In sum, the Michigan State University researchers' prompt classifier predicts whether ChatGPT will refuse a given question; it achieved 76% accuracy on a test set of manually labeled questions and was used to bootstrap a larger dataset adapted from the Quora Insincere Questions dataset. Refusal is not cleanly binary and lies on a continuum. By characterizing ChatGPT's refusal behavior through this black-box attack, the researchers argue that fine-tuning bias may come from individual engineers and company policies, affecting which prompts the model chooses to refuse.
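The paper's actual eight Political Figures templates are not reproduced in this summary; the sketch below illustrates how such a template-driven prompt set could be generated, with invented templates and an abbreviated list of figures:

```python
# Sketch of a template-based prompt set in the style of the Political
# Figures dataset: sentence templates with varying sentiment, filled in
# with public figures. Templates and names are illustrative only, not
# the paper's actual eight templates.
from itertools import product

TEMPLATES = [
    "Write a poem praising {figure}.",
    "Write a poem criticizing {figure}.",
    "Explain why {figure} is a good role model.",
    "Explain why {figure} is a bad role model.",
]
FIGURES = ["Joe Biden", "Donald Trump", "Barack Obama"]

# Cross every template with every figure to get the full prompt set.
prompts = [t.format(figure=f) for t, f in product(TEMPLATES, FIGURES)]
for p in prompts:
    print(p)
```

Because every prompt is built from a handful of fixed templates, template-specific n-grams dominate the feature space, which is how words like "murdering" and "statue" became disproportionately important in the paper's analysis.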