Summary: "DROP: A Reading Comprehension Benchmark for Discrete Reasoning" (arxiv.org)
7,645 words - PDF document
One Line
Researchers have developed DROP, a new reading comprehension benchmark that tests discrete reasoning over paragraphs; a model combining neural methods with symbolic reasoning shows early promise on it.
Key Points
- Researchers have introduced a new English reading comprehension benchmark called DROP.
- DROP focuses on discrete reasoning over paragraphs and aims to push for a more comprehensive analysis of paragraph understanding.
- The dataset consists of 96,567 questions over passages drawn mainly from sports game summaries and history articles.
- Baseline systems performed poorly on the DROP dataset, with the best performing system achieving only 32.7% F1.
- A new model called NAQANet achieved 47.0% F1 on the dataset, showing promise in combining neural methods with symbolic reasoning.
- Complex types of reasoning, such as arithmetic operations and counting, were particularly difficult for the models.
- The results highlight the need for further research in combining neural methods with symbolic reasoning and improving information extraction for semantic parsing tasks.
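The scores above are F1 values under DROP's token-overlap metric. A simplified sketch of that kind of metric (this version ignores DROP's number-matching and multi-span rules, which the official evaluator handles):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between predicted and gold answer strings
    (simplified: lowercased whitespace tokens, no number-specific rules)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("35 yards", "35"))  # partial overlap: 2/3
```

Partial credit for overlapping tokens is why a system can score, say, 32.7% F1 while getting far fewer answers exactly right.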
Summaries
19 word summary
Researchers have introduced DROP, a reading comprehension benchmark requiring discrete reasoning over paragraphs; a model combining neural and symbolic methods shows early promise on it.
111 word summary
Researchers have introduced a new English reading comprehension benchmark called DROP. It focuses on discrete reasoning over paragraphs and requires systems to resolve references in a question and perform operations such as addition, counting, or sorting. The dataset consists of 96,567 questions over passages drawn mainly from sports game summaries and history articles. The best baseline system achieved 32.7% F1, while a new model called NAQANet achieved 47.0% F1, showing promise in combining neural methods with symbolic reasoning. DROP proved especially challenging on arithmetic and counting questions, underscoring the need for further research on neural-symbolic methods and on information extraction for semantic parsing.
132 word summary
Researchers have introduced a new English reading comprehension benchmark called DROP, which focuses on discrete reasoning over paragraphs. DROP requires systems to resolve references in a question and perform operations such as addition, counting, or sorting. The dataset consists of 96,567 questions over passages drawn mainly from sports game summaries and history articles. Baseline systems were evaluated on DROP, with the best achieving 32.7% F1 against human performance of 96.4%. A new model called NAQANet achieved 47.0% F1, showing promise in combining neural methods with symbolic reasoning. Complex reasoning types, such as arithmetic operations and counting, posed challenges for all models. The results underscore the need for further research on neural-symbolic methods and on information extraction for semantic parsing.
295 word summary
Researchers have introduced a new English reading comprehension benchmark called DROP, which focuses on discrete reasoning over paragraphs. The goal of this benchmark is to push the field towards a more comprehensive analysis of paragraph understanding. Unlike previous datasets, DROP requires systems to resolve references in a question and perform discrete operations over the content of paragraphs, such as addition, counting, or sorting. The dataset was constructed through crowdsourcing: passages were collected from Wikipedia, and crowd workers wrote challenging questions against them. It consists of 96,567 questions over passages drawn mainly from sports game summaries and history articles. Answers are required to be spans of the passage or question, numbers, or dates.
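To illustrate the discrete operations involved, a DROP-style question might ask how many more points one team scored than another, requiring number extraction plus subtraction. A toy example (the passage and question here are invented, not from the dataset):

```python
import re

passage = ("The Broncos scored 24 points in the first half, "
           "while the Chargers managed only 10.")
question = "How many more points did the Broncos score than the Chargers?"

# Extract the numbers mentioned in the passage, then apply the
# discrete operation (subtraction) that the question calls for.
numbers = [int(n) for n in re.findall(r"\d+", passage)]
answer = numbers[0] - numbers[1]
print(answer)  # 14
```

The answer (14) never appears verbatim in the passage, which is what distinguishes DROP from span-extraction datasets like SQuAD.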
Baseline systems were evaluated on the DROP dataset, including semantic parsing models, SQuAD-style reading comprehension models, and heuristic baselines. The best performing system achieved only 32.7% F1 on the dataset, while human performance was 96.4%. A new model called NAQANet was also introduced, which combines neural reading comprehension with limited numerical reasoning. This model achieved 47.0% F1 on the dataset, showing promise in combining neural methods with symbolic reasoning.
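The paper describes NAQANet's numerical reasoning head as assigning a sign (plus, minus, or zero) to each number in the passage and summing the signed values. A toy sketch of that idea (in the real model a network predicts the signs; here they are hard-coded for a made-up passage):

```python
# Sign-based arithmetic over passage numbers, the mechanism behind
# NAQANet's arithmetic answer head. Each extracted number gets one
# of {+1, -1, 0}; the answer is the signed sum.
passage_numbers = [24, 10, 3]   # numbers extracted from a passage
predicted_signs = [+1, -1, 0]   # hard-coded here; predicted in the model
answer = sum(s * n for s, n in zip(predicted_signs, passage_numbers))
print(answer)  # 14
```

Restricting arithmetic to signed sums keeps the output space small enough to train with standard neural methods, which is the "limited" part of NAQANet's numerical reasoning.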
The performance of all tested models on the DROP dataset was significantly lower compared to other reading comprehension datasets, highlighting the challenges posed by this benchmark. Error analysis revealed that complex types of reasoning, such as arithmetic operations and counting, were particularly difficult for the models. Semantic parsing baselines performed poorly due to limitations in information extraction and spuriousness of logical forms used for training.
In conclusion, the DROP dataset presents a challenging benchmark for reading comprehension that requires comprehensive paragraph understanding and discrete reasoning. The results highlight the need for further research in combining neural methods with symbolic reasoning and improving information extraction for semantic parsing tasks.