Summary: Opinions Reflected by Language Models (arxiv.org)
13,471 words - PDF document
One Line
A study of language models (LMs) finds that they tend to align with the views of liberals and moderates and show low consistency across topics; the authors propose a methodology for evaluating LM opinions using a dataset called OpinionQA.
Key Points
- Language models (LMs) are substantially misaligned with the opinions of US demographic groups on a wide range of topics.
- The OpinionQA dataset is created to investigate LM opinions, using high-quality public opinion polls and the associated human responses.
- A methodology is proposed to evaluate LM opinions using the OpinionQA dataset and 1-Wasserstein distance as a similarity measure.
- LMs tend to converge towards the modal views of liberals and moderates, with text-davinci-003 having a unique and unrepresentative opinion distribution.
- LMs may misrepresent specific groups, and caution must be taken to avoid replicating human biases.
Summaries
283 word summary
The study "Opinions Reflected by Language Models" evaluates language models from OpenAI and AI21 Labs on topics such as personal health, finance, data privacy, leadership, healthcare, global attitudes, sexuality, and gender. The study analyzes opinions reflected by language models using Pew survey data on topics such as privacy, misinformation, race, science, leadership, community, gender, guns, and automation. The study found that language models are robust to design choices but are sensitive to prompt format and option ordering. The authors caution that language models that perfectly represent human opinions may replicate human biases. The study examines opinions reflected by language models and includes references to related research. The dataset used in the study is derived from the annual Pew American Trends Panel (ATP) survey. A study on language models (LMs) and their alignment with human opinions found that LMs tend to converge towards the modal views of liberals and moderates, and that text-davinci-003 has a unique and unrepresentative opinion distribution. The study also evaluated LMs' ability to replicate results from human experiments and mimic human behaviors and found low consistency scores indicating that they express a patchwork of disparate opinions. The authors propose a methodology to convert public opinion surveys into evaluation metrics for LMs, and a dataset called OpinionQA is curated. The study uses the 1-Wasserstein distance as a similarity measure between distributions and evaluates different LMs on OpinionQA and their opinion agreement with Democrats and Republicans on abortion. The authors find that none of the models are perfectly representative of the overall populace, and there are irreconcilable differences between the opinions of certain groups. The authors also evaluate the group representativeness scores for LMs as a function of political ideology and income.
744 word summary
Language models (LMs) reflect substantial misalignment with the views of 60 US demographic groups on a wide range of topics. To investigate LM opinions, the authors create a new dataset, OpinionQA, built from high-quality public opinion polls and the associated human responses, and develop a framework for analyzing human-LM opinion alignment along three axes: representativeness, steerability, and consistency. Evaluating 9 LMs on this dataset, they find substantial misalignment between the opinions reflected in current LMs and those of the general US populace and various demographic groups. The paper proposes a methodology for converting public opinion surveys into evaluation metrics for LMs: the dataset consists of multiple-choice questions spanning many topics and demographic groups, each with an associated human opinion distribution, and evaluation is carried out at both the individual-question and group level using the 1-Wasserstein distance as a similarity measure between distributions. Three approaches are used to supply demographic information to the LM when steering it towards a group. The study evaluates different LMs on OpinionQA, including their opinion agreement with Democrats and Republicans on abortion, and finds that none of the models is perfectly representative of the overall populace, in part because there are irreconcilable differences between the opinions of certain groups; group representativeness scores are also examined as a function of political ideology and income. LMs tend to converge towards the modal views of liberals and moderates, and text-davinci-003 has a unique and unrepresentative opinion distribution. The paper highlights the challenges of recruiting diverse crowdsourcing workers and the limitations of using human feedback to align LMs with different demographic groups, and an evaluation of LMs' ability to replicate results from human experiments and mimic human behaviors yields low consistency scores, indicating that they express a patchwork of disparate opinions. The authors conclude that more research is needed to improve the representativeness and steerability of LMs towards specific groups. Previous work on bias and fairness in NLP systems has not focused on the subjectivity of the alignment problem, while recent works have examined slants in LM opinions by prompting them with contentious propositions or questions, generated by LMs or drawn from political and word associations. By leveraging public opinion surveys, this study improves understanding of LM steerability in three ways: breadth, a distributional view, and measurability. The paper also discusses the limitations and potential biases of LMs in reflecting human opinions, suggests probing LM behaviors with global equivalents of OpinionQA as future work, and identifies several ways in which LMs may misrepresent specific groups. The authors caution that LMs that perfectly represent human opinions may also replicate human biases.
The dataset used in the study is derived from the annual Pew American Trends Panel (ATP) survey, which recruits panelists over multiple years and offers them a paid incentive to participate. The study adapts Pew ATP surveys to OpinionQA by modifying multiple-choice questions to make them suitable for language models, and analyzes the opinions reflected by LMs using Pew survey data on topics such as privacy, misinformation, race, science, leadership, community, gender, guns, and automation. Among the survey findings, a decline in the share of Americans belonging to an organized religion is generally viewed as bad for society, and attitudes towards limiting Chinese students studying in the U.S. are mixed. The underlying surveys also cover opinions on voice assistants, drones, autonomous vehicles, technology companies, crime and security, guns, the justice system, the military, terrorism, discrimination, workplace experience, economy and inequality, and college education, as well as personal health, finance, data privacy, leadership, healthcare, global attitudes, sexuality, and gender. The study evaluates language models from OpenAI and AI21 Labs, examining their alignment with subgroups defined by demographics such as religion, race, political party, income, ideology, gender, education, and region. It also examines biases in LMs and their sensitivity to prompt formatting and option ordering, finding that the models' opinions are broadly robust to these design choices, though individual representativeness scores fluctuate slightly with prompt format and option ordering.
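As an illustration of the adaptation described above, a Pew-style multiple-choice question might be rendered as an LM prompt along the following lines; the question text, answer choices, and template here are illustrative assumptions rather than the paper's verbatim format.

```python
# Sketch: turning a survey-style multiple-choice question into an LM prompt.
# The question, choices, and template are illustrative; the paper's exact
# prompt format may differ.

def format_question(question: str, choices: list[str]) -> str:
    """Render a multiple-choice question with lettered answer options."""
    letters = "ABCDEFGH"
    lines = [f"Question: {question}"]
    for letter, choice in zip(letters, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_question(
    "How much, if at all, do you worry about the amount of personal "
    "information that companies collect about you?",
    ["A great deal", "A fair amount", "Not too much", "Not at all", "Refused"],
)
print(prompt)
```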
2834 word summary
The study examines the opinions reflected by language models and their sensitivity to prompt formatting and option ordering. The researchers vary the prompts fed into the models, obtain the resulting opinion distributions, and analyze the models' responses to questions posed in different formats. They find that the overall and subgroup-level trends remain largely consistent across prompt formats and option orderings, with only small fluctuations in the actual representativeness scores. The researchers also note that while current language models perform well in the zero-shot setting, they are still known to be sensitive to their prompt format. Overall, the study suggests that the results are robust to these design choices, but that sensitivity to prompt format and option ordering should be taken into account when interpreting them (a sketch of such a check appears later in this summary).
The article explores biases in language models (LMs) and their alignment with subgroups defined by political ideology, education, and income. The representativeness and steerability of LMs are also examined, with comparisons to human opinions and survey respondents. The article includes figures illustrating the alignment of LMs with fine-grained topics, subgroup representativeness scores, refusal rates, and the entropy of per-question response distributions; demographic attributes omitted from the main paper are discussed in the appendix. Appendix Figure 8 visualizes subgroup representativeness, and Appendix Figure 7 shows the distribution of probability mass assigned by different models to one of the answer choices, for which the ideal value would be close to one for all questions.
The experiments use a temperature of 1e-3, though results are fairly robust to the choice of temperature. To compute the Wasserstein distance between human and LM opinion distributions for a question, the answer options are mapped to a metric space, and a random model that picks one of the answer choices for each question serves as a baseline. The model opinion distribution is obtained by evaluating the log probabilities of each answer choice, exponentiating them, and normalizing. The analysis uses a series of models from OpenAI and AI21 Labs, including OpenAI's text-davinci-001, text-davinci-002, and text-davinci-003, as well as AI21 Labs' j1-Grande, j1-Grande v2 beta, and j1-Jumbo. Demographic groups used in the steerability analysis include religion, race, political party, income, ideology, gender, education, and region, and topics covered include political issues such as the two-party system, health, science, and climate change.
The underlying surveys capture the opinions of individuals on topics such as personal health, personal finance, current events, data privacy, leadership, healthcare, global attitudes and foreign policy, sexuality, gender attitudes, and the future. Participants were asked for their opinions on organic foods, debt, made-up news, data privacy, empathy in leadership, job loss, medical treatments, COVID-19 restrictions, abortion, transgender acceptance, and gender attitudes, and opinions varied widely on each topic. Topics surveyed include:
- College education: Opinions on the importance of standardized tests in college education were surveyed.
- Economy and inequality: Opinions on the role of major corporations in contributing to economic inequality in the country were surveyed.
- Workplace experience: Personal experiences with being passed over for a promotion and experiencing sexual harassment at work were surveyed.
- Discrimination: Opinions on how fairly black people are treated compared to white people were surveyed.
- Terrorism: Opinions on the priority of taking measures to protect the US from terrorist attacks were surveyed.
- Military: Confidence in the military to act in the best interests of the public was surveyed.
- Justice system: Opinions on whether people convicted of crimes in the country serve too much or too little time in prison were surveyed.
- Guns: Opinions on the importance of advising visitors with children that there are guns in the house were surveyed.
- Crime and security: Worries about crime and security were surveyed.
- Technology companies: Opinions on the power and influence of technology companies on today's economy were surveyed.
- Autonomous vehicles: Enthusiasm for the development of driverless vehicles was surveyed.
- Drones: Opinions on whether private citizens should be allowed to pilot drones near crime scenes or traffic accidents were surveyed.
- Voice assistants: Accuracy of digital assistants in responding to commands was surveyed, as well as prior knowledge of the idea that computers with advanced capabilities could do most jobs done by humans today.
The study analyzed survey data from Pew Research Center, categorizing questions into topics such as religion, income, politics, and relationships, and uses NQ and NR to denote the number of questions and human respondents, respectively. Among the survey findings, a decline in the share of Americans belonging to an organized religion is generally viewed as bad for society, while attitudes towards limiting Chinese students studying in the U.S. are mixed, with some respondents opposing and some supporting such limits. The surveys also highlight the importance of understanding what companies do with the data they collect. The analysis covers questions about privacy, misinformation, race, science, leadership, community, gender, guns, and automation; the dataset is manually categorized into topics for post-hoc analysis, and a subset of 500 questions is selected for the steerability analysis.
To adapt the American Trends Panel (ATP) surveys to OpinionQA, the researchers extract multiple-choice questions from Pew ATP surveys and modify them to be suitable for language models: they restate questions so that they are self-contained, fix formatting issues, and omit variable-dependent questions. The questions are chosen from surveys that span a broad range of topics and demographic traits. Pew researchers conduct data quality checks to identify issues with the surveys and determine valid answer choices. Questionnaire design is complicated because surveys can ask about topics in varying degrees of detail, so creating questions that accurately measure opinions and experiences is crucial.
The dataset is derived from the annual Pew American Trends Panel (ATP) survey, which recruits panelists over multiple years. The panel comprises roughly 10,000 participants within the US, and individual survey waves invite only a subset of them in order to reduce the burden on individual respondents. Pew relies on a sample of households from USPS's Delivery Sequence File, with concerted efforts to ensure the representativeness of the sample, and also solicits input from households without internet access, either by phone or by providing them with tablets to take the survey. Panelists are offered a paid incentive to participate.
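The robustness checks described at the start of this summary (varying the prompt template and permuting the order of answer choices) could look roughly like the following sketch; the `score_options` helper and the templates are hypothetical placeholders, not the paper's actual code.

```python
# Sketch of the kind of robustness check described earlier: re-scoring a
# question under different prompt templates and answer-choice orderings.
import itertools
import numpy as np

def score_options(prompt: str, choices: list[str]) -> np.ndarray:
    """Placeholder: replace with real model scoring. Returns uniform probs."""
    return np.full(len(choices), 1.0 / len(choices))

# Illustrative prompt templates (not the paper's exact wording).
TEMPLATES = [
    "Question: {q}\n{opts}\nAnswer:",
    "{q}\n{opts}\nThe best answer is:",
]

def opinion_under_variants(question: str, choices: list[str]) -> list[np.ndarray]:
    """Collect the model's opinion distribution under prompt/order variants."""
    distributions = []
    for template in TEMPLATES:
        for order in itertools.permutations(range(len(choices))):
            reordered = [choices[i] for i in order]
            opts = "\n".join(f"{chr(65 + j)}. {c}" for j, c in enumerate(reordered))
            probs = score_options(template.format(q=question, opts=opts), reordered)
            # Undo the permutation so distributions are comparable across variants.
            restored = np.empty(len(choices))
            restored[list(order)] = probs
            distributions.append(restored)
    return distributions

variants = opinion_under_variants(
    "How much do you worry about crime in your community?",
    ["A great deal", "Some", "Not much", "Not at all"],
)
print(np.std(variants, axis=0))  # per-option spread across prompt variants
```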
The study includes references to related research, such as work on the consequences of asking sensitive questions in surveys, bot-adversarial dialogue for safe conversational agents, and toxic language detection, as well as works that explore language model behaviors and ways to red-team language models. It also references studies on public opinion and surveys, including guidance on writing survey questions and on creating populated prototypes for social computing systems. Related studies on language models (LMs) focus on measuring bias, ethical judgments, and political ideology: one study measures stereotypical bias in pretrained language models, while another examines community ethical judgments on 32,000 real-life anecdotes. Other studies probe partisan worldviews, personality estimation, and the impact of rater identity on toxicity annotation, explore ways to integrate dissenting voices into machine learning models, improve the alignment of dialogue agents via targeted human judgments, and propose a framework for few-shot language model evaluation. The referenced resources include studies on measuring biases, reducing harms, and simulating human language, drawn from conferences such as FAccT and NeurIPS as well as arXiv preprints. The work was supported by grants and fellowships from organizations such as Open Philanthropy and SAIL, and the authors express their gratitude to individuals who provided guidance and feedback.
The document also discusses the limitations and potential biases of LMs in reflecting human opinions. The authors acknowledge the need for further investigation into how the results transfer to different settings and whether opinion alignment allows for precise evaluation of LMs, suggest probing LM behaviors using global equivalents of OpinionQA, and identify several ways in which LMs may misrepresent specific groups. They caution that LMs that perfectly represent human opinions may not necessarily be desirable, as such models may also replicate human biases.
The study contributes to the broader discourse around LMs, including questions of steerability and subgroup representativeness, and puts forth a framework for examining the opinions reflected by LMs through the lens of the representativeness of opinions expressed on contentious but important topics such as religion or privacy. There is a long line of work studying the bias and fairness of NLP systems, with recent works focusing on bias, toxicity, and truthfulness; while such works flag undesirable outcomes in settings where the gold-standard behavior is relatively well-defined, this work takes a complementary perspective by evaluating LMs on inherently subjective questions taken from Pew Research. Previous works recognize the subjectivity of the alignment problem but do not focus on it, and there has been a long-standing push within the NLP community to consider the subjective and affective dimensions of language in evaluating models. Recent works have examined slants in the opinions of LMs by prompting them with contentious propositions or questions, generated by LMs or drawn from political and word associations, and case studies have examined whether LMs can be used to simulate personas.
By leveraging public opinion surveys, the authors are able to improve our understanding of LM steerability in three ways: (i) breadth, (ii) a distributional view, and (iii) measurability. The study evaluates language models' (LMs') ability to replicate results from human experiments and mimic human behaviors, and the consistency scores of current LMs are low, indicating that they express a patchwork of disparate opinions. Concretely, the study examines the fraction of topics for which an LM's most aligned group on a given topic matches its most aligned group overall (see the sketch below), finding significant topic-level inconsistencies, especially for base LMs, and stronger consistency along educational attainment for RLHF-trained LMs. The study also visualizes which LMs are most effective at adapting towards a particular group.
The article then discusses the alignment of LMs with the opinions of different demographic groups. While steering LMs towards certain groups may improve representativeness, it does not solve opinion misalignment. Steerability is measured as the ability of an LM to adapt to represent the opinions of various demographic groups. The study also compares the refusal rates of LMs and human respondents and highlights the importance of considering the entire spectrum of human responses rather than just the mode. The article also notes that OpenAI's production systems are not public.
On contentious topics related to politics and demographics, the authors find that LMs tend to converge towards the modal views of liberals and moderates, while text-davinci-003 has a unique and unrepresentative opinion distribution. The paper also discusses the challenges of recruiting diverse crowdsourcing workers and the limitations of using human feedback to align LMs with different demographic groups, concluding that more research is needed to improve the representativeness and steerability of LMs towards specific groups.
To assess how well various LMs reflect the opinions of different demographic groups on contentious topics, the authors construct a scale of alignment values between pairs of demographic groups on questions from specific topics and compare the representativeness scores of LMs against these human baselines. They find that none of the models is perfectly representative of the overall populace, and that there are irreconcilable differences between the opinions of certain groups. They also evaluate group representativeness scores for LMs as a function of political ideology and income. The metric used to measure representativeness is the alignment between the default (unsteered) opinion distribution of the model and that of the overall population or of a particular group. The analysis begins by assessing representativeness under different prompt templates and with permuted answer choices, and uses a metric called opinion alignment to compare the LM's opinion distribution to that of all survey respondents and of specific groups. The study also evaluates different LMs on OpinionQA and their opinion agreement with Democrats and Republicans on abortion. Here, "alignment" refers to one specific aspect of LM-human alignment: agreement between the opinions and preferences of LMs and humans.
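A minimal sketch of the topic-level consistency idea described above follows; the alignment values are invented purely for illustration, and the paper's exact definition may differ.

```python
# Sketch of topic-level consistency: the fraction of topics on which the
# model's most-aligned group matches its overall most-aligned group.
# The alignment numbers here are invented for illustration.
from collections import defaultdict

# alignment[topic][group] = average opinion alignment of the LM with that
# group on questions from that topic (hypothetical values).
alignment = {
    "guns":    {"liberal": 0.82, "moderate": 0.79, "conservative": 0.70},
    "privacy": {"liberal": 0.75, "moderate": 0.78, "conservative": 0.72},
    "science": {"liberal": 0.85, "moderate": 0.80, "conservative": 0.68},
}

# Overall best-aligned group: average alignment across topics, then argmax.
totals = defaultdict(float)
for per_group in alignment.values():
    for group, score in per_group.items():
        totals[group] += score
overall_best = max(totals, key=totals.get)

# Consistency: fraction of topics whose best-aligned group matches overall.
matches = sum(1 for per_group in alignment.values()
              if max(per_group, key=per_group.get) == overall_best)
consistency = matches / len(alignment)
print(f"Overall best-aligned group: {overall_best}, consistency: {consistency:.2f}")
```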
Finally, the study measures opinion alignment by projecting ordinal answer choices onto positive integers, which yields a metric space over which distances between opinion distributions can be computed. The 1-Wasserstein distance is chosen as the similarity measure between distributions in order to avoid misleading estimates of disagreement. To determine whose opinions LMs reflect, opinion distributions are defined for the overall population, for demographic groups, and for each LM's responses to the survey questions. The refusal probability is measured for each question, and the distribution of model opinions is obtained from the log probabilities assigned to each answer choice. Three approaches are used to supply demographic information to the LM: QA, BIO, and PORTRAY; in the steered setting, additional context is added to the prompt to describe the group that the model should emulate (see the sketch below).
The evaluation methodology uses the OpinionQA dataset, which consists of multiple-choice questions on various topics together with human opinion distributions overall and by demographic group. The methodology involves prompting the model with the questions and comparing its output to the human opinions, at both an individual-question and group level. The dataset is US-centric and in English, and the questions are categorized into 23 coarse and 40 fine-grained topic categories. The methodology accounts for potential sampling biases by using the weights assigned by the survey to correct for them, and the analysis is limited to the US populace and the demographic groups within it.
The document proposes a methodology to convert public opinion surveys into evaluation metrics for LMs; such surveys are an ideal testbed for studying LM opinions. The challenges associated with querying LMs with surveys include designing questions that capture nuances and extracting LM opinions. To address these challenges, the OpinionQA dataset is curated and three metrics are proposed: representativeness, consistency, and steerability. The dataset and metrics are viewed as probes that enable developers to better understand model behavior and identify representation failures, rather than as benchmarks. The OpinionQA dataset is obtained by aggregating human responses to the same survey question at the population level and by demographic group.
Overall, the document examines the opinions reflected by LMs and evaluates their alignment with the general US population and various demographic groups. The authors develop a framework to analyze human-LM opinion alignment along three axes: representativeness, steerability, and consistency. They build the OpinionQA dataset from Pew Research's American Trends Panel, with 1498 questions spanning topics such as science, politics, and personal relationships in a multiple-choice format that can easily be adapted to an LM prompt. Evaluating 9 LMs on this dataset, they find substantial misalignment between the opinions reflected in current LMs and those of the general US populace and various demographic groups, identify groups that make up a significant portion of the US population yet are poorly represented by all models, and note that recent models trained with reinforcement learning from human feedback tend to align towards more liberal viewpoints. They conclude that while LMs can reflect human opinions, they fail to model the subtleties of human opinions entirely and tend to express the dominant viewpoint of certain groups.
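To make the steering and opinion-extraction steps above concrete, here is a minimal Python sketch; the steering templates, the `option_logprob` helper, its dummy return value, and the temperature default are illustrative assumptions rather than the paper's actual prompts or API.

```python
# Sketch: steering the model towards a demographic group and extracting its
# opinion distribution from per-choice log probabilities.
import numpy as np

# Illustrative steering contexts in the spirit of the QA / BIO / PORTRAY
# approaches described above (not the paper's exact wording).
STEERING_TEMPLATES = {
    "QA":      "Question: In politics today, how would you describe yourself?\nAnswer: {group}\n\n",
    "BIO":     "Below you will be asked a question. Here is a short bio: I am {group}.\n\n",
    "PORTRAY": "Answer the following question as if you were {group}.\n\n",
}

def option_logprob(prompt: str, option: str) -> float:
    """Placeholder: log probability the model assigns to `option` after `prompt`.
    Replace with a real model call; here we return a dummy value."""
    return -float(len(option))  # dummy heuristic: shorter options score higher

def opinion_distribution(question: str, choices: list[str],
                         group: str | None = None, method: str = "QA",
                         temperature: float = 1e-3) -> np.ndarray:
    """Exponentiate and normalize per-choice log probabilities (softmax)."""
    prefix = STEERING_TEMPLATES[method].format(group=group) if group else ""
    prompt = f"{prefix}Question: {question}\nAnswer:"
    logprobs = np.array([option_logprob(prompt, c) for c in choices])
    scaled = logprobs / temperature          # low temperature sharpens the distribution
    scaled -= scaled.max()                   # numerical stability before exponentiating
    probs = np.exp(scaled)
    return probs / probs.sum()

dist = opinion_distribution(
    "How much of a problem is gun violence in your community?",
    ["A major problem", "A minor problem", "Not a problem", "Refused"],
    group="a conservative", method="PORTRAY",
)
print(dist)
```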
The document closes by framing the broader motivation: a framework for studying the opinions reflected by language models (LMs) and their alignment with different human populations. The authors use public opinion surveys to identify topics of public interest on which to probe models, and develop methods for directly measuring the alignment between LMs' responses on these topics and the tendencies of the corresponding groups. They suggest that a key evaluation for LMs in open-ended tasks will be not only to assess what models believe but also to identify whose opinions they reflect. LMs have been observed to offer opinions on subjective queries, and it is hard to predict how they will respond to such queries; the opinions they express can have a profound impact on user satisfaction and on society at large, especially as LMs are increasingly used in open-ended contexts. The analysis confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs. Overall, LMs reflect substantial misalignment with the views of 60 US demographic groups on topics ranging from abortion to automation, and this misalignment persists even after steering LMs towards specific US demographic groups. To investigate LM opinions, the new OpinionQA dataset was created from high-quality public opinion polls and the associated human responses.