Summary: Modeling Ambiguity in Language Understanding (arxiv.org)
11,682 words - PDF document
One Line
The document studies the modeling of ambiguity in language understanding. It presents AMBIENT, a linguist-annotated benchmark for evaluating whether language models can recognize and disentangle the possible meanings of ambiguous sentences, and evaluates multilabel NLI models, including models fine-tuned on the AmbiNLI dataset, on this challenge.
Key Points
- Ambiguity in language can lead to miscommunication and confusion, making ambiguity-sensitive tools important for natural language processing and human understanding.
- A multilabel natural language inference (NLI) model can be used to flag potentially misleading political claims, illustrating the value of ambiguity recognition.
- Multilabel NLI models, including models fine-tuned on the AmbiNLI dataset, are evaluated as a way to address ambiguity in natural language understanding.
- The study evaluates the ability of language models (LMs) to generate disambiguations and recognize plausible interpretations, finding that ambiguity remains a severe challenge across models and tests.
- The authors encourage future work to collect more data in other languages and to systematically extend the dataset and analyses.
- The importance of recognizing interpretation-specific contexts and disambiguations is highlighted.
Summaries
285 word summary
This document explores the modeling of ambiguity in language understanding. To build their dataset, the authors use InstructGPT to generate candidate examples, sampling 5 outputs for each of 21,273 prompts (a sketch of this sampling step follows below), and they construct a taxonomy of ambiguity categories by reviewing examples from the dataset. They emphasize the importance of recognizing interpretation-specific contexts and disambiguations, and develop a benchmark to evaluate language models' sensitivity to context. As a case study, they evaluate the idea of modeling ambiguity on political claims using the development set of CLAIMDECOMP, illustrating the value of ambiguity-sensitive models in detecting misleading political claims. The document also references a range of studies on natural language processing and language understanding, covering topics such as multi-task learning, semantic evaluation, language modeling, annotation, ambiguity, and bias detection.

Multilabel NLI models, including models fine-tuned on the AmbiNLI dataset, are evaluated as a way to address ambiguity, and the study proposes a method to quantify the likelihood of a continuation given a distractor sentence. Ambiguity is modeled through a three-step annotation process: annotating ambiguous examples, recognizing disambiguations, and selecting a single label. The study evaluates the ability of language models (LMs) to generate disambiguations and recognize plausible interpretations, uses the KL divergence to measure the impact of ambiguous contexts on LMs, and proposes a few-shot template for generating disambiguations.

A dataset called AMBIENT was created to evaluate the ability of language models to recognize and disentangle possible meanings. The authors used a pipeline to annotate and validate examples drawn from a corpus of unlabeled NLI examples that are likely to be ambiguous. Finally, they analyzed the ambiguous examples in their dataset and collected disambiguations labeled by linguists.
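As a rough illustration of that generation step (not the authors' released code), the sampling settings described above map naturally onto the legacy OpenAI completions API; the model checkpoint and helper name here are assumptions.

```python
import openai  # assumes the legacy (pre-1.0) openai-python client

def sample_candidates(prompt: str) -> list[str]:
    """Sample 5 completions per prompt, mirroring the settings in the summary:
    5 outputs, max 120 tokens, stop on a blank line."""
    response = openai.Completion.create(
        model="text-davinci-002",  # an InstructGPT-series model; exact checkpoint is an assumption
        prompt=prompt,
        n=5,
        max_tokens=120,
        stop="\n\n",
    )
    return [choice.text.strip() for choice in response.choices]
```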
829 word summary
Ambiguity in language can lead to miscommunication and confusion, making ambiguity-sensitive tools important for natural language processing and human understanding. A study investigated the use of a multilabel natural language inference (NLI) model to detect misinformation in political claims and the value of ambiguity recognition. A dataset called AMBIENT was created to evaluate the ability of language models to recognize and disentangle possible meanings. The authors used a pipeline to annotate and validate examples drawn from a corpus of unlabeled NLI examples that are likely to be ambiguous. The dataset includes 1,645 examples, each annotated with a set of labels indicating whether a premise entails, contradicts, or is neutral with respect to a hypothesis. The authors also identified groups of premise-hypothesis pairs that share a reasoning pattern, to encourage the creation of new examples with the same pattern. Finally, they analyzed the ambiguous examples in their dataset and collected disambiguations labeled by linguists.

The document models ambiguity in language understanding through a three-step process: annotating ambiguous examples, recognizing disambiguations, and selecting a single label. The study evaluates the ability of language models (LMs) to generate disambiguations and recognize plausible interpretations. The authors also discuss a method for modeling ambiguity in determining whether a claim is true, false, or inconclusive given a premise, with the best model achieving an EDIT-F1 score of 18.0%. Additionally, the authors use the KL divergence to measure the impact of ambiguous contexts on LMs and propose a few-shot template for generating disambiguations, with the best model achieving 63% accuracy.

Multilabel NLI models, including models fine-tuned on the AmbiNLI dataset, are evaluated using various methods and datasets, and the results show that ambiguity remains a severe challenge across models and tests. The study also proposes a method to quantify the likelihood of a continuation given a distractor sentence and evaluates it on various language models (a sketch of such continuation scoring follows below).
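A minimal sketch of that continuation-scoring idea, assuming a HuggingFace causal LM; the model choice (gpt2 as a stand-in) and the helper name are illustrative, not the paper's code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in; the paper evaluates several LMs
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(context: str, continuation: str) -> float:
    """Total log-probability of `continuation` given `context`.
    Assumes tokenizing the context yields a prefix of tokenizing the full string."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..T-1
    # sum log-probabilities of only the continuation's tokens
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(ctx_len - 1, full_ids.shape[1] - 1)
    )
```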
The authors experimentally evaluate the idea of modeling ambiguity on political claims using the development set of CLAIMDECOMP. They paraphrase each claim five times with InstructGPT zero-shot, then use a multilabel NLI model and flag claims for which it assigns at least two labels to a resulting NLI example. They illustrate the value of ambiguity-sensitive models in detecting misleading political claims, as sketched below.
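A sketch of that flagging recipe under stated assumptions: `paraphrase` and `multilabel_nli` are hypothetical stand-ins for the InstructGPT paraphraser and the multilabel NLI model.

```python
def flag_ambiguous_claim(claim: str, evidence: str, paraphrase, multilabel_nli) -> bool:
    """Flag a claim when any of its paraphrases yields an NLI example
    to which the multilabel model assigns two or more labels."""
    for hypothesis in paraphrase(claim, n=5):  # five zero-shot paraphrases per claim
        labels = multilabel_nli(premise=evidence, hypothesis=hypothesis)
        if len(labels) >= 2:  # multiple plausible labels signal ambiguity
            return True
    return False
```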
The authors train three-way classification models on the single-label train sets of MNLI and WANLI, and train a multilabel model on the power set of NLI labels, minus the empty set. They also train a classifier over label sets that performs 7-way classification over the annotations per example (the label mapping is sketched below). The multilabel model trained on WANLI achieves the highest macro F1 score, 37.8%.
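To make the power-set setup concrete, here is a minimal sketch of the 7-way label mapping; the class ordering is an assumption, and the classifier itself (a standard 7-class head) is omitted.

```python
from itertools import combinations

NLI_LABELS = ("entailment", "neutral", "contradiction")

# All non-empty subsets of the three NLI labels: 2**3 - 1 = 7 classes.
LABEL_SETS = [
    frozenset(subset)
    for size in (1, 2, 3)
    for subset in combinations(NLI_LABELS, size)
]
SET_TO_CLASS = {s: i for i, s in enumerate(LABEL_SETS)}

def encode(label_set) -> int:
    """Map an annotated set of NLI labels to one of 7 class indices."""
    return SET_TO_CLASS[frozenset(label_set)]

assert len(LABEL_SETS) == 7
print(encode({"entailment", "neutral"}))  # some index in 0..6
```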
Ambiguity in language understanding is a long-standing issue, and recent work studies whether the confidence of coreference and NLI models is sensitive to ambiguities more broadly. The functional approach to ambiguity is inspired by AMBIGQA, and pretrained LMs have also been evaluated on solving highly ambiguous crossword clues. Political claims flagged as ambiguous by the detection method are shown in Table 6.

The document discusses modeling ambiguity in language understanding and the need for ambiguity-sensitive tools to address systematic biases. The authors develop a benchmark to evaluate language models' sensitivity to context and emphasize the importance of studying the nuances of natural language communication. They also investigate different approaches to studying label variation in natural language inference (NLI) and recognize the growing interest in ambiguity-sensitive tools for various applications. The document references studies on natural language processing, such as coping with syntactic ambiguity and scaling up pretraining, and the authors encourage future work to collect more data in other languages and to systematically extend the dataset and analyses.

The document also examines modeling ambiguity through disambiguation examples. The authors use heuristics to obtain 104,071 unlabeled examples, sampling 5 outputs for each of 21,273 prompts with InstructGPT as the generator, and provide curated examples and dataset-creation details. They construct a taxonomy of ambiguity categories, review examples from their dataset, and annotate 100 randomly sampled examples to categorize possible sources of ambiguity. The study finds that the "closeness" of distractors affects the difficulty of a test (a back-translation sketch for creating such distractors follows below) and highlights the importance of recognizing interpretation-specific contexts and disambiguations.

This document also covers the generation of paraphrases for political claims, using the InstructGPT model, trained on a dataset of political claims over 30 epochs. The model fine-tuned on AmbiNLI was found to require no threshold tuning when evaluating performance with logit thresholds, and the setup of NLI models that predict multiple labels as output is described. However, the document notes that the noun replacement procedure used in some tests may not always produce accurate results.
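The longer summary below notes that back-translation can be used to generate semantically similar distractors. A minimal sketch of round-trip back-translation, assuming MarianMT checkpoints on the HuggingFace hub (the model names are illustrative choices, not the paper's):

```python
from transformers import pipeline

# Round-trip translation (en -> de -> en) as a simple way to produce
# a semantically similar distractor for a given sentence.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def backtranslate(sentence: str) -> str:
    """Return a paraphrase-like distractor via round-trip machine translation."""
    german = to_de(sentence)[0]["translation_text"]
    return to_en(german)[0]["translation_text"]

print(backtranslate("He always orders the same dish at that restaurant."))
```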
2347 word summary
This document discusses the modeling of ambiguity in language understanding, beginning with the generation of paraphrases for political claims. The model used is InstructGPT, trained on a dataset of political claims over 30 epochs. The performance of various models is evaluated using logit thresholds, and the model fine-tuned on AmbiNLI is found to require no threshold tuning. The setup of NLI models that predict multiple labels as output is also described, though the document notes that the noun replacement procedure used in some tests may not always produce accurate results.

The study finds that the "closeness" of distractors affects the difficulty of a test, and that there can be a stylistic mismatch between the original ambiguous sentence and its disambiguation, with the latter being more stilted. The document describes the process of generating continuations and creating distractors; the KL divergence is used to measure the difference between two probability distributions. Implementation details and test results are provided, and the study highlights the importance of recognizing interpretation-specific contexts and disambiguations.

The claims studied often have multiple interpretations that affect their correctness, and both premises and hypotheses are often ambiguous. The accuracy of LMs on four templates is presented in Table 10. The study used crowdworkers to assess the plausibility of three interpretations of an ambiguous sentence; workers were paid $0.40 per NLI example, and only those who passed a qualification test were selected. The study found that the distribution of ambiguity in naturally occurring language is not uniform, and that some sentences contain multiple ambiguities.

The authors construct a taxonomy of ambiguity categories by reviewing examples from their dataset. They annotate 100 randomly sampled examples to categorize the possible sources of ambiguity, then review all 2,020 examples and validate the annotations. Among the ambiguous examples, 74.3% have ambiguity in the premise and 32.6% in the hypothesis, and the disambiguating rewrites are, on average, 2.36 words longer than their ambiguous counterparts. The paper demonstrates shared ambiguity patterns, such as sentences about the past, or about desires for the future, inducing a cancellable implicature about the present. The authors provide the GPT-3 prompt template used to create unlabeled examples for annotation; they ultimately focus on linguistic ambiguity and provide a table of ambiguity categories.

The document includes examples of sentences whose meaning has shifted under different economic systems, and describes the process of generating and annotating examples for a multilabel NLI model, which was used to filter examples according to certain rules. The resulting set of examples was used to test the model's ability to predict the relationship between a premise and a hypothesis, and a table lists examples from the various sources used in the study.

To obtain candidate examples, the authors employ heuristics to discard generations that exhibit observable failure cases, yielding a total of 104,071 unlabeled examples. They sample 5 outputs for each of 21,273 prompts with InstructGPT, discarding any output that is not correctly formatted, with max tokens set to 120 and the stop sequence "\n\n". The article provides curated examples and dataset-creation details, including references to related research.

The reference list covers natural language processing and language understanding, including tools, benchmarks, and models such as BlenderBot 3, GPT-4, GLUE, and SemEval-2021, along with research papers and conference proceedings on multi-task learning, semantic evaluation, language modeling, annotation, ambiguity, and bias detection, by contributors to the field such as Samuel Bowman and Catherine Havasi. It also spans academic work on ambiguity in language understanding, natural language inference, human-AI collaboration, and machine learning, covering subtopics such as exploring language model capabilities, investigating reasons for disagreement in natural language inference, and integrating dissenting voices into machine learning models. Specific papers mentioned include "What Can We Learn from Collective Human Opinions on Natural Language Inference Data?", "No Language Left Behind: Scaling Human-Centered Machine Translation", and "The Curious Case of Neural Text Degeneration".

The document further references the use of HPSG for English grammar, coping with syntactic ambiguity, and scaling up pretraining. The authors note that while larger language models may overfit to more common interpretations, scaling up pretraining and reinforcement learning from human feedback may lead to further gains. They also point out that while LMs struggle with ambiguity in English, the way ambiguity manifests in other languages can vary greatly due to systematic typological factors or idiosyncratic differences, and they encourage future work to collect more data in other languages and to systematically extend the dataset and analyses.

This work aims to collect a broad-coverage dataset of ambiguities in order to model ambiguity in language understanding. The authors acknowledge the limitations of existing models due to their data sources and the need for ambiguity-sensitive tools to address systematic biases. They develop a benchmark to evaluate language models' sensitivity to context and emphasis, and encourage future work on the nuances of natural language communication. They investigate different approaches to studying label variation in natural language inference (NLI) and develop a set of labels for plausible readings, arguing that uncertainty in sentence meaning should be directly characterized, potentially as a function of demographic characteristics. They also recognize the growing interest in ambiguity-sensitive tools for applications such as toxic language detection.
Ambiguity in language understanding is a long-standing and well-studied issue for NLP tasks involving symbolic analyses of sentences, such as syntactic and semantic parsing. Recent work studies whether the confidence of coreference and NLI models is sensitive to ambiguities more broadly, whose resolution is a prerequisite to understanding meaning. Task ambiguity can arise from underspecification of the task, subjectivity in annotation, and ambiguity in the input. The functional approach to ambiguity here, where ambiguity in the task input is disambiguated in natural language to account for variation in possible outputs, is inspired by AMBIGQA; going beyond analysis and evaluation of task modeling, pretrained LMs have also been evaluated on solving highly ambiguous crossword clues. The authors find this approach enables a flexible and explainable way of representing ambiguity.

Table 6 shows political claims flagged as ambiguous by the detection method: a generated paraphrase (shown in the hypothesis column) happens to be disambiguating, leading the multilabel NLI model to predict multiple labels. The authors experimentally evaluate the idea on the development set of CLAIMDECOMP, using political claims as a case study. They paraphrase each claim five times with InstructGPT zero-shot, then use a multilabel NLI model to flag claims for which at least two labels are assigned to a resulting NLI example. They read each flagged instance and mark whether the fact-check describes an issue of ambiguity or factuality, illustrating the value of ambiguity-sensitive models in detecting misleading political claims.

The authors train three-way classification models on the single-label train sets of MNLI and WANLI, train a multilabel model on the power set of NLI labels minus the empty set, and train a classifier over label sets that performs 7-way classification over the annotations per example. The multilabel model trained on WANLI achieves the highest macro F1 score, 37.8%. While this is substantially higher than the random-guessing baseline of 1/7 = 14.3% for EM accuracy, it falls considerably short of the 89.7% human accuracy.

The study explores the challenge of ambiguity in natural language understanding, evaluating multilabel NLI models, including models fine-tuned on the AmbiNLI dataset, using regression, classification, and distributional approaches on datasets such as WANLI, MNLI, and Uncertain NLI. It also experiments with predicting a single set of labels or a probability value for ambiguous examples. The results show that ambiguity remains a severe challenge across models and tests; performance on ambiguity is heavily dependent on performance in other settings, and the inconsistent trends suggest the need for further investigation.

The study further proposes a method to quantify the likelihood of a continuation given a distractor sentence and evaluates it on various language models. The authors use the KL divergence to measure the impact of ambiguous contexts on language models (LMs): they sample continuations for each interpretation and compare their likelihoods under the ambiguous sentence and the corresponding disambiguation (a sketch follows below). They also propose a few-shot template for generating disambiguations and evaluate the performance of pretrained models on this task.
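A minimal sketch of that KL measurement: assuming continuations are sampled from the LM conditioned on the disambiguation, the average gap in log-likelihoods is a Monte Carlo estimate of KL(P(· | disambiguation) || P(· | ambiguous)). The `logprob` argument stands for a scorer like the illustrative `continuation_logprob` above.

```python
import statistics

def kl_estimate(ambiguous: str, disambiguation: str, continuations, logprob) -> float:
    """Monte Carlo estimate of KL(P(. | disambiguation) || P(. | ambiguous)),
    given `continuations` sampled from the LM conditioned on `disambiguation`
    and `logprob(context, continuation)` returning a total log-probability."""
    gaps = [
        logprob(disambiguation, c) - logprob(ambiguous, c)
        for c in continuations
    ]
    return statistics.mean(gaps)  # >= 0 in expectation; larger = context matters more
```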
The best model achieves 63% accuracy on that task. The document then discusses modeling ambiguity in determining whether a claim is true, false, or inconclusive given a premise. The process involves selecting model-predicted NLI labels and considering plausible disambiguations based on majority vote; human evaluation is conducted using the same setup, with an F1 metric over disambiguations. Different interpretations of the context can lead to different judgments about the claim, and recognizing disambiguations can be challenging. One strategy involves restating the premise to clarify the ambiguity. The best model achieves an EDIT-F1 score of 18.0%.

The study evaluates whether language models (LMs) can learn to generate disambiguations and recognize the validity of plausible interpretations. The LMs evaluated include ChatGPT, InstructGPT, FLAN-T5, GPT-3, and LLaMA. The experiments test the ability of LMs to directly generate relevant disambiguations and to recognize the full set of ambiguities in a given input. The study shows that input ambiguity is a source of disagreement among annotators, and that this disagreement is largely resolved on the corresponding disambiguated examples. The results indicate that LMs acquire some ability to generate disambiguations and recognize plausible interpretations through pretraining; back-translation can be used to generate semantically similar distractors, and annotators overwhelmingly recognize the possible interpretations. Overall, however, the evaluations suggest that ambiguity remains difficult for LMs to model.

Annotation involves three steps: annotation of ambiguous examples, recognition of disambiguations, and selection of a single label. Each example is reviewed by 9 workers, and inter-annotator agreement is calculated. The types of ambiguity present include lexical, syntactic, figurative, pragmatic, scopal, coreference, and other. The final dataset combines curated and generated-then-annotated examples; annotators may discard examples themselves, and a validation phase is performed by a subset of linguistics students. The authors review the annotations to select a set of labels for each example, including the singleton set when the example is unambiguous.

The authors use a pipeline to annotate and validate examples acquired from a corpus of unlabeled NLI examples that are likely to be ambiguous. They use a multilabel RoBERTa-large model trained on WANLI and retain all examples where the model assigns probability ≥ 0.05 to more than one label (a sketch of this filter follows below). They further filter for likely-ambiguous instances, such as sentences indicating at least slight uncertainty in the NLI label or containing interpretable ambiguity patterns, such as differing pragmatic and literal readings. Through overgeneration and filtering they automatically create a large corpus of premise-hypothesis pairs, which they directly annotate with label sets and disambiguations. They also identify groups of premise-hypothesis pairs that share a reasoning pattern, to encourage the creation of new examples with the same pattern. Finally, they analyze the ambiguous examples in their dataset and collect disambiguations labeled by linguists.
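A sketch of that filtering rule, assuming the multilabel model emits one independent probability per NLI label via a sigmoid over three logits; the function name and example logits are illustrative.

```python
import torch

def is_likely_ambiguous(logits: torch.Tensor, threshold: float = 0.05) -> bool:
    """Keep an example when the multilabel model assigns probability
    >= threshold to more than one NLI label."""
    probs = torch.sigmoid(logits)  # independent per-label probabilities
    return int((probs >= threshold).sum()) > 1

# Logits ordered as (entailment, neutral, contradiction), e.g. from a
# multilabel RoBERTa-large head trained on WANLI.
print(is_likely_ambiguous(torch.tensor([2.1, -4.0, -4.5])))  # False: one confident label
print(is_likely_ambiguous(torch.tensor([1.2, -1.8, -4.0])))  # True: two labels clear 0.05
```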
The article describes the creation of AMBIENT, a dataset of ambiguous language in natural language inference (NLI). The dataset includes 1,645 examples, each annotated with a set of labels indicating whether a premise entails, contradicts, or is neutral with respect to a hypothesis. The authors used two approaches to collect source examples: manual curation and automatic generation. The inclusion of ambiguous examples makes it possible to evaluate whether models can first detect the presence of relevant ambiguity and then resolve it into distinct interpretations, and the article highlights the promise of such tools for real-world communication involving ambiguous claims.

A study investigated the use of a multilabel NLI model to detect misleading political claims, and explored the value of ambiguity recognition and whether LMs can distinguish between different interpretations of ambiguous sentences. A suite of tests was designed to characterize ambiguity, combining manual curation with a functional approach in which disambiguation serves as a natural-language representation of meaning. The ability to recognize ambiguity in language can lead to clearer communication and more effective writing aids, and pretrained LMs can help identify misleading or deceptive language and aid human communication.

Ambiguity is a common feature of language, with multiple interpretations possible depending on contextual factors; this can lead to unintended miscommunication and confusion, requiring listeners to ask clarifying questions and communicators to anticipate ambiguity. A multilabel natural language inference model can be used to flag potentially misleading political claims. Ambiguity-sensitive tools are important for natural language processing and human language understanding, and managing ambiguity is a key part of language comprehension. Language models need to better model ambiguity, which is extremely challenging given the diverse kinds of ambiguity present in natural language; the linguist-annotated benchmark AMBIENT can be used to evaluate the ability of language models to recognize and disentangle possible meanings.