Summary: Predicting Prompt Refusal in Language Models (arxiv.org)
3,465 words - PDF document
One Line
Michigan State University researchers developed a classifier to predict whether OpenAI's ChatGPT language model will refuse a given prompt, finding that a more sophisticated model like BERT was needed for accurate prediction and that negative generalizations about demographic groups are among the surest predictors of ChatGPT's refusals.
Key Points
- Increasing the sample size of the automatically labeled dataset could improve the performance of prompt classifiers for language models like ChatGPT.
- Negative generalizations of demographic groups are among the strongest predictors of prompt refusal in ChatGPT.
- BERT outperformed classical models for prompt refusal classification in ChatGPT.
- Compliance with or refusal of prompts falls on a continuum of responses, rather than a binary categorization.
- Fair and unbiased AI is important, particularly in language models like ChatGPT that mediate the flow of information to a large proportion of humanity.
Summaries
257 word summary
Researchers at Michigan State University developed a prompt classifier that predicts whether OpenAI's ChatGPT language model will refuse a given prompt, treating prompt refusal as a measurable form of bias. To train the classifier, the researchers compiled a database of prompts, generating or collecting a large number of offensive prompts, and trained refusal classifiers on ChatGPT's responses, identifying a variety of expressions shared across refusal responses. Compliance and refusal turned out to fall on a continuum of responses rather than a clean binary, and certain words and phrases proved to be strong predictors of one or the other; negative generalizations about demographic groups are among the surest predictors of ChatGPT's refusals. For refusal classification, BERT significantly outperformed classical models, reaching over 92% accuracy; for the harder task of prompt classification, logistic regression and random forest achieved accuracies in the low 70%s, with BERT again ahead. The authors suggest that increasing the sample size of the automatically labeled dataset could further improve the prompt classifier's performance. Because OpenAI's API allows access to many ChatGPT snapshots, feature importance can also be compared between model snapshots as a characterization of OpenAI's ongoing alignment work.
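The paper's code is not included in this summary; the following minimal sketch shows what the classical refusal classifiers it describes might look like, assuming a hypothetical responses.csv of hand-labeled ChatGPT responses (the file name and column names are invented for illustration):

```python
# Sketch of a classical refusal classifier: TF-IDF n-gram features feeding
# logistic regression and random forest, as described in the summary above.
# Assumes a hypothetical responses.csv with columns "response" and "refused".
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("responses.csv")  # hypothetical hand-labeled ChatGPT responses
X_train, X_test, y_train, y_test = train_test_split(
    df["response"], df["refused"], test_size=0.2, random_state=0
)

for name, clf in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=300)),
]:
    # Unigrams and bigrams let refusal markers such as "sorry" and
    # "language model" surface as weighted features.
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```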
891 word summary
The article summarizes a study of prompt refusal prediction in language models, using OpenAI's ChatGPT as a test case. Responses were initially classified as either complied-with or refused, and a hand-labeled dataset was created for training the refusal classifier. The researchers investigated several candidate datasets, including OpenAI's moderation safety dataset and a 4chan archive, but found them inadequate; they ultimately used a dataset of insincere questions from Quora, which they manually labeled, as their largest hand-labeled dataset. Refusal classifiers were trained on these manually labeled ChatGPT responses, a variety of shared expressions were identified in refusal responses, and three model types were evaluated both for identifying ChatGPT's refusals and for predicting compliance or refusal from the prompt alone.
Compliance and refusal fall on a smooth continuum of responses rather than a neat binary. In prompts, questions beginning with "what are" are strong predictors of compliance, while controversial figures like "Trump", demographic groups in plural form, and negative adjectives like "stupid" are among the strongest predictors of refusal. In responses, the words "the" and "in" strongly indicate compliance, while expressions like "cannot", "sorry", and "language model" strongly indicate refusal. The specific figure named also matters: prompts asking ChatGPT to praise Joe Biden and Donald Trump might receive opposite responses, while prompts about Joe Biden and Barack Obama might receive the same response, given the similarity between Biden and Obama.
For refusal classification, a logistic regression model reached 82% accuracy and a random forest 76%, while BERT significantly outperformed the classical models at over 92%. Prompt classification is more difficult: logistic regression and random forest achieved accuracies in the low 70%s, and BERT still outperformed the classical models, though to a lesser degree; a more sophisticated model like BERT was needed for accurate prediction. Prompt classifiers were trained on 10,000 samples from the Quora Insincere Questions dataset, with responses automatically labeled by the refusal classifier, enabling automatic bootstrapping of the dataset to a larger size; it is thus possible to predict ChatGPT's inclination to comply with particular prompts at scale.
The authors suggest that increasing the sample size of the automatically labeled dataset could improve the prompt classifier's performance, that the effect of ChatGPT's internal randomness (its sampling temperature) could be characterized by querying each prompt several times, and that employing multiple manual labelers for refusal might improve the quality of the hand-labeled data. Because OpenAI's API allows access to many ChatGPT snapshots, feature importance could also be compared between model snapshots as a characterization of OpenAI's ongoing alignment work.
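As a rough illustration of those last two suggestions, the sketch below queries multiple ChatGPT snapshots through the OpenAI Python client, sampling each prompt several times to expose the effect of temperature; the snapshot ids and the example prompt are assumptions, not taken from the paper:

```python
# Sketch: query several ChatGPT snapshots, sampling each prompt n times,
# so the resulting responses can be fed to a refusal classifier to
# estimate P(refusal | prompt, snapshot) under sampling randomness.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Snapshot ids are examples only; consult the API's current model list.
SNAPSHOTS = ["gpt-3.5-turbo-0301", "gpt-3.5-turbo-0613"]

def sample_responses(prompt: str, model: str, n: int = 5,
                     temperature: float = 1.0) -> list[str]:
    """Query one snapshot n times to expose the effect of temperature."""
    responses = []
    for _ in range(n):
        completion = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": prompt}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

prompt = "Why are all politicians so stupid?"  # illustrative refusal-prone prompt
for snapshot in SNAPSHOTS:
    for reply in sample_responses(prompt, snapshot):
        print(snapshot, "->", reply[:60])
```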
The researchers mapped subcategories of refusal responses to binary labels for the refusal classifier. They found that many text strings in candidate datasets were too fragmentary to be properly understood as prompts, and that insincere questions about sensitive topics such as caste prejudice and Indian political party preferences were prevalent in the Quora dataset. To train the prompt classifier, a database of prompts was compiled and a large number of offensive prompts were generated or found; the prompts were manually labeled as either refused or complied-with, and a refusal classifier was trained that performed well. Two further datasets were used: the Political Figures dataset and the New York Post dataset. The Political Figures dataset consisted of eight template sentences with varying sentiments, designed to elicit political bias with respect to public figures; the New York Post dataset was drawn from an article alleging bias in ChatGPT, which gave several examples. The lack of template diversity in the Political Figures dataset yielded a few disproportionately important and prevalent n-grams, such as "murdering" and "statue." Although initial examples of prompt refusal appeared cleanly binary, ChatGPT's responses turned out to form a smooth continuum from compliance to refusal.
The study situates prompt refusal within broader work on bias in ChatGPT. Related work includes RightWingGPT, a model fine-tuned to hold opposite biases to ChatGPT on the political compass test; a study finding that ChatGPT exhibits low levels of psychological dark traits; and findings that ChatGPT's political beliefs are left-libertarian and strongly environmentalist. The study highlights the importance of fair and unbiased AI, particularly in language models like ChatGPT that mediate the flow of information to a large proportion of humanity.
In sum, the Michigan State University researchers' prompt classifier predicts whether ChatGPT will refuse a given question; it achieved 76% accuracy on a test set of manually labeled questions and was used to bootstrap a larger dataset adapted from the Quora Insincere Questions dataset. Refusal is not cleanly binary and lies on a continuum. By characterizing ChatGPT's refusal behavior through this black-box attack, the researchers argue that fine-tuning bias may come from individual engineers and company policies, affecting which prompts the model chooses to refuse.
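The paper's actual eight Political Figures templates are not reproduced in this summary; the sketch below illustrates how such a template-driven prompt set could be generated, with invented templates and an abbreviated list of figures:

```python
# Sketch of a template-based prompt set in the style of the Political
# Figures dataset: sentence templates with varying sentiment, filled in
# with public figures. Templates and names are illustrative only, not
# the paper's actual eight templates.
from itertools import product

TEMPLATES = [
    "Write a poem praising {figure}.",
    "Write a poem criticizing {figure}.",
    "Explain why {figure} is a good role model.",
    "Explain why {figure} is a bad role model.",
]
FIGURES = ["Joe Biden", "Donald Trump", "Barack Obama"]

# Cross every template with every figure to get the full prompt set.
prompts = [t.format(figure=f) for t, f in product(TEMPLATES, FIGURES)]
for p in prompts:
    print(p)
```

Because every prompt is built from a handful of fixed templates, template-specific n-grams dominate the feature space, which is how words like "murdering" and "statue" became disproportionately important in the paper's analysis.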