Summary: Fine-tuned LLMs for Wikidata Semantic Parsing (arxiv.org)

One Line

Fine-tuned LLMs used as semantic parsers over Wikidata, evaluated on the new WikiWebQuestions benchmark, improve answer accuracy and reduce hallucination, as evidenced by strong experimental results.

Slides

Slide Presentation (12 slides)


Fine-tuned LLMs for Wikidata Semantic Parsing

Source: arxiv.org (PDF, 9,201 words)

Introducing WikiWebQuestions


• WikiWebQuestions is a high-quality question answering benchmark for Wikidata.

• It provides a comprehensive evaluation of answer accuracy.

• The dataset consists of real-world questions collected from users.

WikiSP: A Semantic Parser for Wikidata


• WikiSP is a few-shot sequence-to-sequence semantic parser for Wikidata.

• It complements large language models (LLMs) to improve answer accuracy.

• By grounding LLMs in Wikidata, factuality is enhanced.

Modifying SPARQL for Improved Parsing


• SPARQL is modified to use domain and property names instead of unique IDs.

• Property and domain names are easier for a fine-tuned LLM to learn than arbitrary PIDs and QIDs, which cannot all fit in a prompt.

• WikiSP outputs queries in this modified SPARQL, which are then executed against Wikidata to retrieve answers (see the sketch below).
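For illustration, here is a minimal sketch (not the authors' code) of the notation change, applied to the paper's own example query for "What car models does GM make?"; the tiny ID-to-name mapping below is hard-coded for this one example and would normally come from Wikidata property and domain labels.

# Sketch: rewrite ID-based SPARQL into the name-based form WikiSP is trained to emit.
# The mapping is illustrative; in practice names come from Wikidata labels.
ID_TO_NAME = {
    "P31": "instance_of",
    "P279": "subclass_of",
    "P176": "manufacturer",
    "Q3231690": "automobile_model",  # domain entity: replaced by its name
}

def rename_ids(sparql: str) -> str:
    for wikidata_id, name in ID_TO_NAME.items():
        sparql = sparql.replace(wikidata_id, name)
    return sparql

original = ("SELECT DISTINCT ?x WHERE { "
            "?x wdt:P31/wdt:P279* wd:Q3231690. "
            "?x wdt:P176 wd:Q81965. }")
print(rename_ids(original))
# -> SELECT DISTINCT ?x WHERE { ?x wdt:instance_of/wdt:subclass_of* wd:automobile_model.
#    ?x wdt:manufacturer wd:Q81965. }
# Non-domain entity QIDs such as Q81965 (General Motors) keep their IDs, or may be
# written as a mention (e.g. wd:GM) that is resolved to a QID at inference time.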

Experimental Results: Answer Accuracy


• The proposed methodology achieves a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions.

• Combining WikiSP with GPT-3, the system provides useful answers to 96% of the questions in the dev set.

• Outperforms the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Importance of Semantic Parsing


• LLMs can answer questions directly but lack interpretability and may provide incorrect answers.

• Semantic parsing provides interpretable results grounded in Wikidata.

• Users can verify answers and obtain more reliable information.

WikiWebQuestions Dataset


• Migrated WebQuestionsSP benchmark from Freebase to Wikidata.

• Provides up-to-date answers from a larger knowledge base.

• Real-world questions collected from users using the Google Suggest API.

Implementation: Entity Linking and Fine-tuning


• ReFinED is used as the entity linker for WikiSP.

• Fine-tuning of ReFinED with the WikiWebQuestions training set improves performance.

• LLaMA is fine-tuned on the WikiWebQuestions few-shot set together with the Alpaca instruction-following data.

Evaluation Results: WikiSP Performance


• WikiSP achieves a 65.5% exact match accuracy and a 71.9% F1 score on the WikiWebQuestions test set.

• Entity linking and allowing mentions as entities improve answer accuracy.

• Ablation experiments demonstrate the importance of these factors.

Combining GPT-3 with WikiSP


• On the dev set, GPT-3 alone answers 66.4% of the questions correctly, gives incomplete answers for 26.5%, and wrong answers for 7.1%.

• WikiSP provides definitive answers for 75.6% of the questions.

• Combining the two yields definitive, correct, and complete answers for 75% of dev-set questions and roughly halves GPT-3's error rate (sketched below).
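As a rough sketch of this combination (assuming helper callables wikisp_parse and gpt3_answer that stand in for the fine-tuned parser and the GPT-3 call; neither is an API from the paper), the control flow could look like this:

import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

def run_sparql(query: str) -> list:
    # Execute a SPARQL query against Wikidata and return the result bindings.
    resp = requests.get(WIKIDATA_ENDPOINT, params={"query": query, "format": "json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

def answer(question: str, wikisp_parse, gpt3_answer) -> str:
    # wikisp_parse: question -> SPARQL string (placeholder for the fine-tuned WikiSP parser)
    # gpt3_answer:  question -> free-text answer (placeholder for the GPT-3 call)
    sparql = wikisp_parse(question)
    results = run_sparql(sparql) if sparql else []
    if results:
        # Verifiable answer grounded in Wikidata, shown in the context of the query.
        return f"From Wikidata: {results}"
    # Query failed or returned nothing: fall back to GPT-3, clearly labeled as a guess.
    return "GPT-3 guesses that " + gpt3_answer(question)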

Error Analysis and Improvements


• Errors in the WWQ dev set include alternative interpretations, alternative SPARQL queries, and entity linking errors.

• WikiSP outperforms the state-of-the-art WDAqua by 3.6% in terms of F1 score on the QALD-7 dataset.

• Better training datasets are needed to handle complex and less popular questions.

Key Takeaways


• WikiWebQuestions provides a high-quality benchmark for Wikidata question answering.

• WikiSP complements LLMs and improves answer accuracy.

• Semantic parsing offers interpretable results grounded in Wikidata.

• Combining GPT-3 with WikiSP enhances answer accuracy in question answering tasks.

   

Key Points

  • WikiWebQuestions is a high-quality question answering benchmark for Wikidata.
  • WikiSP is a few-shot sequence-to-sequence semantic parser for Wikidata.
  • The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata.
  • Semantic parsing is used to complement LLMs and provide more accurate answers.
  • The authors modify SPARQL to use domain and property names instead of unique IDs.
  • The methodology establishes a strong baseline of 76% and 65% answer accuracy on the WikiWebQuestions dev and test sets, respectively.
  • Semantic parsing provides interpretable results grounded in Wikidata for better verification of answers.
  • The combination of GPT-3 with WikiSP improves answer accuracy in question answering tasks.

Summaries

23 word summary

WikiWebQuestions is a question answering benchmark for Wikidata. Semantic parsing with large language models improves answer accuracy. Experimental results show a strong baseline.

94 word summary

This paper presents WikiWebQuestions, a question answering benchmark for Wikidata, and introduces WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing alongside large language models (LLMs) to improve answer accuracy. They modify SPARQL and train the parser to link entities in user queries to their unique ID in Wikidata. Experimental results show that this methodology achieves a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. The authors highlight the importance of semantic parsing for grounding LLMs and discuss limitations and future work.

139 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing as a complement to large language models (LLMs) to improve answer accuracy. They modify SPARQL by using domain and property names instead of unique IDs and train the parser to link entities in user queries to their unique ID in Wikidata. If the query fails, GPT-3 is used as a fallback. Experimental results show that this methodology improves answer accuracy, achieving a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. Semantic parsing is crucial for grounding LLMs, and combining it with GPT-3 provides more reliable answers. The authors also introduce the WikiWebQuestions dataset, evaluate fine-tuned LLMs for Wikidata semantic parsing, and discuss limitations and future work.

516 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors aim to enhance the accuracy of large language models (LLMs) by grounding them in Wikidata. To address the issue of LLMs providing incorrect answers, the authors propose using semantic parsing as a complement to LLMs for more accurate answers.

The authors modify SPARQL, a query language for semantic parsing, by using domain and property names instead of unique IDs. They train the parser to link entities in user queries to their unique ID in Wikidata using an entity linker or mentions in the query. The modified SPARQL queries are then fed into the WikiSP semantic parser. If the query fails, GPT-3 is used as a fallback and the result is labeled as a GPT-3 guess.

Experimental results demonstrate that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining the semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Semantic parsing is crucial for grounding LLMs as it provides interpretable results grounded in Wikidata. This allows users to verify answers since LLMs may not always be correct. By combining semantic parsing with GPT-3's guesses, the system offers more reliable answers.

The authors introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. It is an updated version of the WebQuestionsSP benchmark, providing real-world questions collected from users via the Google Suggest API.

ReFinED is used as the entity linker for WikiSP. It is fine-tuned with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. Additionally, LLaMA, a large language model, is fine-tuned with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

Evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance, achieving a 65.5% exact match accuracy and a 71.9% F1 score. Answer accuracy is improved by entity linking and allowing mentions as entities.

The authors evaluate fine-tuned LLMs for Wikidata semantic parsing, focusing on using property and domain names instead of IDs and combining GPT-3 with WikiSP for question answering. Using property and domain names improves answer accuracy by 2.0%. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis reveals alternative interpretations, SPARQL queries that don't retrieve answers, and entity linking errors as common errors. WikiSP outperforms WDAqua by 3.6% in F1 score on Task 4 from the QALD-7 dataset. Combining GPT-3 with WikiSP yields additional correct answers for 34% of the questions.

Limitations discussed include the focus on factoid question answering and English datasets, as well as the need for better training datasets to handle more complex questions.

In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and demonstrate the advantages of combining GPT-3 with WikiSP.

570 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The goal is to improve the accuracy of large language models (LLMs) by grounding them in Wikidata. LLMs have the tendency to give incorrect answers, so the authors propose using semantic parsing to complement LLMs and provide more accurate answers.

The authors modify SPARQL, a query language for semantic parsing, to use domain and property names instead of unique IDs. They train the parser to use either an entity linker or mentions in the query to link entities in the user query to their unique ID in Wikidata. The modified SPARQL queries are then fed into the WikiSP semantic parser. If the query fails to return a result, the system defaults to using GPT-3 and labels the result as a GPT-3 guess.

Experimental results show that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Semantic parsing is important in grounding LLMs. While LLMs can answer questions directly, their answers may not always be correct. Semantic parsers provide interpretable results grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides more reliable answers.

The authors also introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. They migrated the WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.

The authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Entity linking and allowing mentions as entities improve answer accuracy.

The authors conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing, focusing on two aspects: using property and domain names instead of PIDs and QIDs, and combining GPT-3 with WikiSP for question answering.

Using property and domain names improves answer accuracy by 2.0%. LLMs can adapt to changes in query notation with fine-tuning. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis shows that errors include alternative interpretations, alternative SPARQL queries that don't retrieve an answer, and entity linking errors.

WikiSP achieves 38% accuracy on Task 4 from the QALD-7 dataset, outperforming the state-of-the-art WDAqua by 3.6% in F1 score. Combining GPT-3 with WikiSP provides additional correct answers for 34% of the questions.

The authors discuss limitations such as the focus on factoid question answering and English datasets, and the need for better training datasets to handle more complex questions.

In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and show the benefits of combining GPT-3 with WikiSP.

965 word summary

This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. It introduces WikiSP, a few-shot sequence-to-sequence semantic parser for Wikidata. The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata, which contains over 12 billion facts. LLMs have the ability to answer questions, but they are prone to hallucinating and giving incorrect answers. The authors propose using semantic parsing to complement LLMs and provide more accurate answers.

The authors modify SPARQL, a query language used for semantic parsing, to use domain and property names instead of their unique IDs. They train the parser to use either the results from an entity linker or mentions in the query. The entity linker is used to link entities in the user query to their unique ID in Wikidata. The modified SPARQL queries are then fed into the WikiSP semantic parser to produce answers. If applying the query to Wikidata fails to return a result, the system defaults to using GPT-3, a large language model, and labels the result as a GPT-3 guess.

Experimental results show that this methodology is effective in improving answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they are able to provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

The authors highlight the importance of semantic parsing in grounding LLMs. While LLMs can answer questions directly, they lack interpretability and their answers may not always be correct. Semantic parsers provide interpretable results that are grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides users with more reliable answers.

The authors also introduce the WikiWebQuestions dataset, which is a high-quality semantic parsing dataset for Wikidata. They migrated the popular WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.

In terms of implementation, the authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with the question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Ablation experiments demonstrate the importance of entity linking and allowing mentions as entities in improving answer accuracy.

Overall, this paper presents a method for fine-tuning LLMs and improving their factuality by grounding them in Wikidata. The proposed WikiSP semantic parser achieves strong results on the WikiWebQuestions dataset and outperforms existing methods. The authors highlight the benefits of combining semantic parsing with large language models to provide more accurate and interpretable answers.

The authors of the paper conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing. They focused on two specific aspects: the effectiveness of using property and domain names instead of PIDs and QIDs, and the combination of GPT-3 with WikiSP for question answering.

In their evaluation, the authors found that using property and domain names instead of PIDs and QIDs improved the answer accuracy by 2.0%. This indicates that LLMs can adapt to changes in query notation with fine-tuning, and it is easier for them to learn names than random IDs. However, the replacement of QIDs with their names would likely be more significant if mentions were not allowed in the predicted logical form.

The authors also evaluated the combination of GPT-3 with WikiSP for question answering using the WWQ dataset. GPT-3 answered 66.4% of the questions correctly, but provided incomplete answers for 26.5% of the questions and wrong answers for 7.1% of the questions. In contrast, WikiSP provided definitive answers for 75.6% of the questions. When combining GPT-3 with WikiSP, they were able to give definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis showed that 18% of the errors in the WWQ dev set were actually deemed to be correct alternative results. These included cases where the model predicted an alternative interpretation to the question that still provided a reasonable answer. Another 6.3% of the errors were due to reasonable alternative SPARQL queries that did not retrieve an answer. The biggest source of errors, accounting for 35.1% of the failed examples, was entity linking errors. The entity linker failed to provide correct entities in these cases.

The authors also conducted an experiment with WikiSP on Task 4 from the QALD-7 dataset. WikiSP achieved 38% accuracy on this dataset, outperforming the state-of-the-art WDAqua by 3.6% in terms of F1 score. They also evaluated the combination of GPT-3 with WikiSP on QALD-7 and found that the combination approach provided additional correct answers for 34% of the questions.

The authors discussed the limitations of their work, including the focus on factoid question answering and the use of English datasets. They also mentioned the need for better training datasets to improve the performance of WikiSP on less popular questions.

In conclusion, the authors created a high-quality benchmark dataset called WikiWebQuestions for large knowledge-base question answering. They established a strong baseline for answer accuracy and F1 score using fine-tuned LLMs with a few-shot training dataset. They also showed that combining GPT-3 with WikiSP can reduce hallucination and provide useful information for a large percentage of questions. However, they acknowledged the need for further improvements and better training datasets to handle more complex and less popular questions.

Raw indexed text (59,242 chars / 9,201 words / 1,390 lines)

Fine-tuned LLMs Know More, Hallucinate Less
with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata
Silei Xu ∗ Shicheng Liu ∗ Theo Culhane Elizaveta Pertseva
Meng-Hsi Wu 1 Sina J. Semnani Monica S. Lam
Computer Science Department, Stanford University
Stanford, CA
{silei, shicheng, tculhane, pertseva, sinaj, lam}@cs.stanford.edu
1
Ailly.ai
[email protected]
Abstract

While large language models (LLMs) can answer many questions correctly, they can also hallucinate and give wrong answers. Wikidata, with its over 12 billion facts, can be used to ground LLMs to improve their factuality. This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. Ported over from WebQuestions for Freebase, it consists of real-world data with SPARQL annotation. This paper presents a few-shot sequence-to-sequence semantic parser for Wikidata. We modify SPARQL to use the unique domain and property names instead of their IDs. We train the parser to use either the results from an entity linker or mentions in the query. We fine-tune LLaMA by adding the few-shot training data to that used to fine-tune Alpaca. Our experimental results demonstrate the effectiveness of this methodology, establishing a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By pairing our semantic parser with GPT-3, we combine verifiable results with qualified GPT-3 guesses to provide useful answers to 96% of the questions in dev. We also show that our method outperforms the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score. 1

* Equal contribution
1 Code, data, and model are available at https://github.com/stanford-oval/wikidata-emnlp23

Figure 1: An Overview of WikiSP. An entity linker is used to link entities in the user query to their unique ID in Wikidata; e.g. “A Bronx Tale” is linked to entity ID “Q1130705”. The query and entity linker outputs are fed to the WikiSP semantic parser to produce a modified version of SPARQL, where property IDs (e.g. “P915”) are replaced by their unique string identifiers (e.g. “filming_location”). If applying the query to Wikidata fails to return a result, we default to GPT-3, labeling the result as a GPT-3 guess. Returned answers are presented in the context of the query, so the user can tell if the answer is acceptable; if not, we also show the guess from GPT-3. Here WikiSP mistakenly uses “filming_location” instead of “narrative_location”; the user detects the mistake, thumbs down the answer, and the GPT-3 answer is provided. (Figure example: for the question of where “A Bronx Tale” took place, the entity linker returns (‘A Bronx Tale’, ‘Q1130705’), Wikidata returns the filming locations New Jersey and New York, and GPT-3 guesses the Bronx, New York.)

Figure 2: Distribution of correct, incomplete, and incorrect answers for the WikiWebQuestions dev set, when GPT-3 is used alone and when combined with WikiSP. (GPT-3 only: 66% correct, 27% incomplete, 7% incorrect. WikiSP + GPT-3: 76% verified from WikiSP, 15% correct GPT guesses, 6% incomplete, 4% incorrect.)

1 Introduction

Large language models (LLMs) such as GPT-3 can answer open-domain questions without access to external knowledge or any task-specific training examples. However, LLMs are prone to hallucinate (Bang et al., 2023), while using a convincing and confident tone. This may cause significant harm as people increasingly accept LLMs as a knowledge source (Goddard, 2023; Weiser, 2023).
In contrast, traditional knowledge base ques-
tion answering (KBQA) is grounded with a given
knowledge base. Semantic parsing (SP) has been
widely used to tackle this challenging task, where
the questions are first parsed into a logical form
and then executed to retrieve answers from the
knowledge base. It has better interpretability than
GPT-3 and other information-retrieval-based ap-
proaches (Dong et al., 2015; Miller et al., 2016;
Sun et al., 2018, 2019) where answers are predicted
directly.
To handle large knowledge bases, previous SP-
based approaches tend to use a multi-stage pipeline
of sub-tasks, starting with extracting the relevant
subgraph based on entities detected in the ques-
tions (Yih et al., 2015; Luo et al., 2018). Such
an approach struggles with questions that have a
large search space and fails to understand questions
that refer to information missing in the knowledge
graph. Having to retrieve the relevant subgraphs to
create the logical form conflates query resolution
with semantic parsing, rendering classical query
optimization inapplicable.
End-to-end seq2seq translation, on the other
hand, has mainly been used on schemas of rela-
tively small relational databases (Yu et al., 2018;
Xu et al., 2020a,b) and web APIs (Campagna et al.,
2017; Su et al., 2017). To handle large knowledge
graphs, recent work proposed retrieving (1) infor-
mation on linked entities, (2) exemplary logical
forms relevant to the query (Gu et al., 2021; Ye
et al., 2022), and (3) schemas as context to seman-
tic parsing (Shu et al., 2022). Others use induction
or iterative methods to generate complex logical
forms (Cao et al., 2022b; Gu and Su, 2022).
1.1 Few-Shot Seq2Seq Semantic Parsing
This paper investigates how we can leverage large
language models (LLMs) to create seq2seq neural
semantic parsers for large knowledge bases such as
Wikidata.
Pretrained with the internet corpora, LLMs are
already familiar with the syntax of formal query
languages such as SQL (Hu et al., 2022; Poesia
et al., 2022; Li et al., 2023; An et al., 2023; Nan
et al., 2023; Arora et al., 2023). When given simple
SQL schemas, they can perform zero-shot semantic
parsing of simple natural language queries into for-
mal queries. Unlike Freebase, the KB used in most
of the KBQA semantic parsing research, Wikidata
does not have a pre-defined schema, making it a
much harder problem. It has 150K domains, 3K
applicable properties, and 107M entities, each of
the properties and entities are uniquely identified
with PIDs and QIDs, respectively. While zero-shot
LLMs can generate SPARQL queries for the easiest
and most common questions, they do not know all
the PIDs and QIDs, nor is it possible to include
them in a prompt.
This paper presents WikiSP, a few-shot
sequence-to-sequence semantic parser for Wikidata
that translates a user query, along with results from
an entity linker, directly into SPARQL queries. To
handle the 100M+ entities in Wikidata, we train
the parser to use either the entity linker results or a
mention in the query; to handle the 150K domains
and 3K applicable properties, we modify SPARQL
to use domain and property names instead of their
unique QIDs and PIDs, respectively. We fine-tune
a LLaMA (Touvron et al., 2023) with a few-shot
training set along with the instructions used to fine-
tune Alpaca (Taori et al., 2023).
1.2 A New Dataset: WikiWebQuestions
Most of the widely-used high-quality benchmarks
for KBQA are based on Freebase (Bollacker et al.,
2008) which has been shut down since 2015. With
outdated knowledge, it is hard to compare the re-
sults with modern LLMs such as GPT-3, since an-
swers have changed over time for most of the ques-
tions. Wikidata, despite being the largest and most
popular knowledge base nowadays, has very few
datasets annotated with SPARQL queries; they are
either extremely small (Usbeck et al., 2017) or syn-
thetic (Saha et al., 2018).
We migrated the popular WebQuestionsSP (Yih
et al., 2016) benchmark from Freebase to Wiki-
data, with updated SPARQL and up-to-date an-
swers from the much larger Wikidata.

1.3 Complementing Large Language Models
Trained on Wikipedia and all of the internet, LLMs
can answer many questions directly. Unfortunately,
the user cannot tell if the answers are correct, thus
requiring them to fact-check every answer.
Unlike humans, GPT-3 always sounds definitive
even when they are wrong by providing specific
and plausible facts. For example, on the question
“what is the biggest country in Europe by popula-
tion?”, GPT-3 answers “Germany”, when the an-
swer is “Russia”. Or, on the question, “where does
the name Melbourne come from?” GPT-3 answers
“Melbourne comes from the Latin word ‘melbur-
num’ meaning ‘blackburn’ or ‘blackbird’.”, but in
reality, Melbourne is named after William Lamb,
2nd Viscount Melbourne. It is not possible to tell
when GPT-3’s answers are wrong, and every an-
swer needs to be fact-checked.
Semantic parsers can be used to complement
LLMs as they are interpretable; their results are
grounded in Wikidata, which we assume to be cor-
rect. It is possible for semantic parsers to misun-
derstand a query, but by providing the answer in
the context of the query, the user can spot the error.
We propose getting the best of both worlds by
answering the question with WikiSP if possible.
Otherwise, we report GPT-3’s guesses by prefac-
ing it with: “GPT-3 guesses that” (Figure 1). In
this way, the user can have full confidence with
the answers from the former, while also benefiting
from the latter. It is easier for users to fact-check
an answer than trying to find the answer.
1.4 Contributions
WikiWebQuestions, a high-quality semantic
parsing dataset for Wikidata, migrated from the
popular WebQuestions dataset for Freebase.
WikiSP, a few-shot Seq2Seq semantic parser
by fine-tuning LLaMA with a few shot training set.
We improve the learnability of SPARQL queries by
replacing the IDs of properties and domains with
their unique names; we tolerate errors in entity
linking by accepting mentions in the queries as
entities. We establish a first, strong baseline of
76% and 65% answer accuracy for the dev set and
test set of our new WikiWebQuestions benchmark,
respectively. We also demonstrate that our method
surpasses the state of the art for QALD-7 wikidata
set by 3.6% in F1 score.
We improve GPT-3’s trustworthiness by first
returning interpretable results from semantic parser
and backing it up with GPT-3 guesses. WikiSP can
provide verifiable results for WikiWebQuestions
76% of the time and improves the guesses by GPT-
3, resulting in errors only 4% of the time (Figure 2).
2 Related Work

2.1 KBQA
The KBQA task aims to make large knowledge
bases accessible by natural language. One com-
mon approach is semantic parsing where a natural
language query is translated into a formal logical
form, which is then executed to retrieve an answer
from the knowledge base. To handle large KBs,
one method is to formulate SP as a multi-staged
search problem by retrieving entities and expanding
the graphs according to the relationships between
their properties and the query (Yih et al., 2015,
2016; Luo et al., 2018). Lan and Jiang (2020) add
constraints to the staged query graph generation
method. Another popular method is to use seq2seq
models obtained by fine-tuning pretrained language
models. Das et al. (2021) first find other queries
that contain semantically similar subparts, and con-
struct a new logical form by combining the similar
subparts of the found queries. Ye et al. (2022)
search over the KB based on predefined rules to de-
rive a set of candidate logical forms, rank them, and
generate the final logical form. Cao et al. (2022b)
first generate a “sketch” program and then fill in
its arguments. Gu and Su (2022) use dynamic pro-
gram induction to generate query structures. Based
on a user query, Shu et al. (2022) retrieve entities,
example logical forms, and related schema. Unlike
FreeBase, Wikidata does not have a fixed schema.
Another approach to KBQA is based on graph
retrieval (Dong et al., 2015; Miller et al., 2016; Sun
et al., 2018, 2019; Mavromatis and Karypis, 2022;
Sen et al., 2021; Vivona and Hassani, 2019; Verga
et al., 2021). It predicts the answers directly within
the subgraph extracted based on the topic entity
in the question. Yu et al. (2023) combine seman-
tic parsing with retrieval and achieve the state-of-
the-art on the WebQuestionsSP dataset (Yih et al.,
2016). However, retrieval-based methods cannot
handle entire categories of questions, such as ques-
tions with no available answer and questions like
“the tallest mountain” where no entities are men-
tioned by name. They have poor interpretability
and do not support query optimization.

2.2 KBQA Benchmarks
Most of the early KBQA benchmarks are based on
Freebase (Berant et al., 2013; Yih et al., 2016; Tal-
mor and Berant, 2018). Recently, new benchmarks
have been created for Wikidata (Cao et al., 2022a;
Saha et al., 2019). However, these benchmarks are
created using rule-based synthesis or paraphrases,
which are easier for semantic parsers. CSQA col-
lects human-written questions for single triples and
constructs complex questions using fixed rules with
very limited natural language variety (Saha et al.,
2019). KQA Pro first synthesizes queries with
canonical natural language and then crowdsources
human paraphrases (Cao et al., 2022a). Campagna
et al. (2019) show that a model can achieve signifi-
cantly higher accuracy over paraphrased data com-
pared to real-world data even for untrained queries.
Thus, we base our WikiWebQuestions dataset on
WebQuestionsSP (Yih et al., 2016), where data are
collected from real-world users using the Google
Suggest API.
2.3 LLMs for Semantic Parsing
Shin et al. (2021) show the promise of few-shot
prompting LLMs for semantic parsing. They use
constrained decoding to enforce the syntax of the
formal language, and achieve comparable results
with a smaller fine-tuned BART model (Lewis et al.,
2020) on datasets with small database schemas.
Rubin et al. (2022) fine-tune a small retriever to
obtain the most relevant few-shot examples to use
for each input. Niu et al. (2023) use a few-shot
prompted Codex model to break down the natural
language input to make the task easier for a smaller
semantic parser. LLMs have also been applied to
semantic parsing on relational databases (Hu et al.,
2022; Poesia et al., 2022; Li et al., 2023; An et al.,
2023; Nan et al., 2023; Arora et al., 2023). The
schemas used in these projects are very small when
compared to Wikidata.
2.4 Entity Linking
Entity linking involves finding the named entities
in a query, and linking them to the corresponding
entities in the knowledge graph so that the query
can be executed using the proper entities as ref-
erence points. The current state-of-the-art entity
linking model on the WebQuestionsSP dataset is
ReFinED (Ayoola et al., 2022). They use a bidirec-
tional transformer on the query to predict the most
likely mentions of named entities within a query,
and then combine that information with embed-
dings computed over every entity in the knowledge
base to predict which entity the mention is most
likely to be referring to. Prior to ReFinED, the
state-of-the-art was ELQ (Li et al., 2020). They
similarly generate embeddings for each entity in
the knowledge base, and then use the predicted
mentions of entities combined with these predicted
embeddings to generate likely entities.
3 Semantic Parsing for Wikidata
Wikidata is the largest public knowledge base
with over 12 billion facts represented by subject-
predicate-object triples using 100+ million entities
and 10,000 properties. 3,000 of the properties are
useful for answering natural language questions,
whereas the rest are used to link data in Wikidata
with external library catalogs and database IDs.
Entities and properties are given unique identi-
fiers, QIDs and PIDs, respectively. For example,
the fact that Joe Biden is the president of the US can
be represented as a triple (Q6279, P39, Q11696),
where P39 is the PID for property position held,
Q6279 and Q11696 are QIDs for Joe Biden and the
president of the United States, respectively.
3.1 Query Format
Unlike relational databases and Freebase, Wikidata
has no predefined domains or types. Any entity
can have an arbitrary set of properties. However,
even though Wikidata is property-based, all named
entities have one or more “instance of” properties to
some domain entity; domain entities are organized
into a hierarchy with the “subclass of” property.
Note that the names of domain entities and prop-
erties are unique. Non-domain entities, on the other
hand, can be ambiguous. For example, “Lincoln”
can refer to the president, a car brand, a sparrow,
an aircraft, and many different cities.
We posit that it is impossible for LLMs to memo-
rize the QIDs and PIDs for domains and properties.
We modify the format of SPARQL queries to use
the more mnemonic property name, instead of its
PID. Similarly, we use entity names for domains.
For example, the original SPARQL for the query
“What car models does GM make?” is
SELECT DISTINCT ?x WHERE {
?x wdt:P31/wdt:P279* wd:Q3231690.
?x wdt:P176 wd:Q81965. }
This says that we are seeking x, where x
is transitively either an instance of (wdt:P31) or a subclass of (wdt:P279) of an automobile
model (wd:Q3231690), and x has General Mo-
tors (wd:Q81965) as the manufacturer (wdt:P176).
Note wdt is the prefix for Wikidata property, and
wd is for Wikidata entity.
With our modification, the query becomes:
SELECT DISTINCT ?x WHERE {
?x wdt:instance_of/wdt:subclass_of*
wd:automobile_model.
?x wdt:manufacturer wd:Q81965. }
For non-domain entity QIDs, we also accept a
string in lieu of a QID in case of entity linking
errors. At inference time, we use simple heuristics
to resolve the string to a QID before applying the
query. For example, “wd:Q81965” in the query
may be replaced with “wd:GM”. See Section 3.2.2
for more details.
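A minimal sketch of such a lookup, here using the Wikidata “wbsearchentities” API that Section 5.2 mentions; treating the first hit as the most popular entity is an assumption, since the paper does not spell out the exact heuristic.

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def mention_to_qid(mention: str):
    # Resolve a mention such as "GM" to a QID before executing the query.
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": "en",
        "format": "json",
    }
    hits = requests.get(WIKIDATA_API, params=params).json().get("search", [])
    # Assumption: the top-ranked hit approximates the "most popular" entity.
    return hits[0]["id"] if hits else None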
Normally, we refrain from changing standard
query notations since LLMs have been pretrained
on them. However, we posit that learning this new
syntax is much easier than learning the PIDs and
QIDs. Our experimentation with few-shot prompt-
ing suggests that LLMs can easily adjust to this
format.
3.2 Entity Linking
Linking entities for WikiWebQuestions is particu-
larly difficult. First, since the dataset is collected
from real-world questions without prompting the
users for more information, users tend to refer to
their entities of interest without using their full
names. Second, the questions are generally short
with very limited context, making it harder to
disambiguate among entities with similar names.
Lastly, many QIDs in Wikidata are used to repre-
sent terms not generally known as “named entities”.
For example, domain entities are often ignored by
entity linker models, as in “What is the biggest
country in Europe by population?”, both “country”
(Q6256) and “Europe” (Q46) are required to con-
struct the correct SPARQL, but entity linkers only
provide “Europe” and ignore “country”.
3.2.1 Semantic Parsing with Entity Linking
To handle ambiguous entities, we use an entity
linker to first find the domain names and QIDs
of the entities mentioned in the text. We train a
semantic parser that accepts users’ input along with
the results produced by the entity linker.
Formally, given a user input T , and a set of entity
linker results ⟨e, q⟩, where e is the name (default
label) Wikidata gives to an entity and q is its QID,
the semantic parser produces the semantic parse of
T in our modified SPARQL format.
For the example above, the SOTA ReFinED en-
tity linker (Ayoola et al., 2022) returns {⟨General
Motors, Q81965⟩}. Unfortunately, it misses the
entity automobile model (Q3231690), a term not
usually considered to be an entity.
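A sketch of how the entity-linker output and the utterance might be serialized into one parser input, following the description above and in Section 5.1 (mention, QID, and domain in plain text, concatenated with the utterance); the exact separator and field order here are assumptions.

def build_parser_input(utterance: str, linked_entities: list) -> str:
    # linked_entities: (mention, QID, domain) triples from the entity linker,
    # e.g. [("GM", "Q81965", "automobile manufacturer")]  (example values illustrative).
    entity_text = "; ".join(f"{m} ({qid}, {domain})" for m, qid, domain in linked_entities)
    # The paper concatenates the resolved entities with the user utterance;
    # the formatting below is illustrative only.
    return f"entities: {entity_text} | question: {utterance}"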
3.2.2 Recovering from Entity Linker Errors
We want our semantic parser to be able to recover
from mistakes by an entity linker. That is, the
semantic parser should use entity linking when it
is helpful, but it should still try to predict the right
logical form when the linker fails.
The semantic parser is trained to accept, along
with the user query, an optional set of potentially
useful QIDs from the entity linker. We include
samples where some of the supplied linked entities
are not used in the gold answer, as well as samples
where there are missing linked entities. For the
latter, we use mentions in the original query in lieu
of the QIDs. At inference time, we use the men-
tions to look up the QIDs in Wikidata. If multiple
matches exist, the most popular entity is returned.
An example is shown in Appendix A.
With the above example where the entity linker
misses “automobile model”, the semantic parser is
likely to predict “car model” by copying from the
user query. We search “automobile model” among
aliases in domains to find the correct QID. This
design allows the model to potentially recover from
entity-linking failures.
4 WikiWebQuestions (WWQ) Dataset
Despite being the most popular large knowledge
base for a long time, existing benchmarks on Wiki-
data with labeled SPARQL queries are unfortu-
nately either small or of low quality. On the
other hand, benchmarks over the deprecated Free-
base still dominate the KBQA research with better-
quality data. For example, the WebQuestions (Yih
et al., 2015) dataset was collected by using Google
Search API instead of human paraphrasing or syn-
thesis. As a result, it is much more natural and
truly reflects the real-world questions users may
ask. This dataset is later annotated with SPARQL
over Freebase, named WebQuestionsSP (Yih et al.,
2016). Examples with no legitimate SPARQL to
retrieve answers from Freebase are dropped. In
total, WebQuestionsSP consists of 3098 examples
in the training set and 1639 in the test set.

We migrated WebQuestionsSP, the best collec-
tion of natural language questions over a general
knowledge graph, from Freebase to Wikidata, with
the help of an automatic tool we developed, based
on Google’s entity mapping 2 and Wikidata’s re-
lation mapping 3 . About 60% of the dataset was
automatically converted. One of the authors of
this paper, who did not participate in model tuning,
manually converted those instances that failed to
convert automatically.
4.1 Migrating WebQuestionsSP to Wikidata
Here are the major decisions we made in migrating
WebQuestionsSP dataset to Wikidata. While much
bigger, Wikidata does not necessarily contain all
the information available in Freebase. For example,
it lacks countries’ trade partners, hence we drop all
such questions from the WebQuestionsSP dataset.
If multiple paths can lead to the correct answer,
we choose the path that provides the most com-
plete answers and has the best availability among
entities in the same domain. For example, when
asking for books written by an author X, we can
either search for books whose author is X or find
notable works of X that are books. While the latter
is more efficient, the property notable works is not
always available for all authors and it often does
not provide a complete list. Thus, we annotate such
examples using the former representation.
We also cleaned up the original dataset. The
dataset contained questions like “who does Ronald-
inho play for now in 2011?”. We drop the appended
year as it conflicts with “now” in the utterance, and
it would refer to the live information in Wikidata.
In total, we dropped 9% of the examples from
WebQuestionsSP and created a training, dev, and
test set of 2431, 454, and 1431 samples, respec-
tively. Given that Wikidata has 100 million entities
and 3,000 useful properties for answering ques-
tions, the training data set is woefully inadequate
and can be considered as a “fewshot” training set
at best.
5 Implementation

This section discusses the implementation details of the entity linker and the WikiSP semantic parser.

5.1 Entity Linking

We use ReFinED (Ayoola et al., 2022) for entity linking, which is the current state of the art for WebQuestionsSP. As discussed before, Wikidata treats many common terms such as “country” as named entities and assigns them QIDs. To fine-tune ReFinED to learn such terms, we add the question and entity pairs from the training set of WikiWebQuestions to the data used to train ReFinED's questions model.

We run 10 epochs of finetuning using the default hyperparameters suggested by Ayoola et al. (2022). For each identified entity, we provide the mention in the original utterance, the QID, as well as its domain in plain text. The information is appended to the utterance before being fed into the neural semantic parsing model.

5.2 The WikiSP Semantic Parser

We prepare the training data with entities provided by fine-tuned ReFinED. Comparing with the gold entities, ReFinED provides extra entities in 215 cases, while missing at least one entity in 137 cases. When ReFinED failed to produce the correct entities, we replace the missing QIDs in the logical form with the corresponding mention of the entity in the question. During evaluation, if a mention of an entity is predicted by the model, we look up the QID using the Wikidata “wbsearchentities” API 4.

We fine-tune LLaMA with 7B parameters because it has been shown to perform well despite its relatively small size (Touvron et al., 2023). We include the Alpaca (Taori et al., 2023) instruction following data, which was derived using the self-instruct (Wang et al., 2023) method, in our training data. The training data samples in WikiWebQuestions start with the following instruction: “Given a Wikidata query with resolved entities, generate the corresponding SPARQL. Use property names instead of PIDs.”. We concatenate the resolved entities and the user utterance together as input. We up-sample the WikiWebQuestions few-shot set 5 times and train for 3 epochs using a 2e-5 learning rate and 0.03 warmup ratio.

5.3 Executing Queries on Wikidata

SPARQL queries are used to retrieve answers from the Wikidata SPARQL endpoint 5. Since Wikidata is actively being updated, the gold SPARQL can be easily re-executed to acquire up-to-date answers, allowing the benchmark to compare with forthcoming iterations of large language models.

2 https://developers.google.com/freebase
3 https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping
4 https://www.wikidata.org/w/api.php?action=wbsearchentities
5 https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service

WikiSP (ours): EM 65.5, F1 71.9
Table 1: Results of WikiSP on the WWQ test set.
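To make the fine-tuning setup in Section 5.2 concrete, here is a rough sketch of how the training mixture might be assembled; the Alpaca-style instruction/input/output record layout and the helper names are assumptions, while the instruction string, the 5x up-sampling, and the mixing with Alpaca data come from the paper.

INSTRUCTION = ("Given a Wikidata query with resolved entities, generate the "
               "corresponding SPARQL. Use property names instead of PIDs.")

def make_record(entities: str, utterance: str, sparql: str) -> dict:
    # One training sample: entity-linker output + utterance in, modified SPARQL out.
    return {"instruction": INSTRUCTION,
            "input": f"{entities} {utterance}",
            "output": sparql}

def build_training_mixture(wwq_fewshot: list, alpaca_data: list) -> list:
    # Up-sample the WikiWebQuestions few-shot set 5 times and mix it with the
    # Alpaca instruction-following data, as described above.
    return alpaca_data + wwq_fewshot * 5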
6 Experiments
In this section, we evaluate WikiSP on WikiWeb-
Questions and demonstrate how it can be used to
complement large language models such as GPT-3.
6.1 Semantic Parser Results
We evaluate our model with two different answer
accuracy metrics: (1) exact match (EM): the per-
centage of examples where the answers of the pre-
dicted SPARQL exactly match the gold answers,
and (2) Macro F1 score (F1): the average F1 score
for answers of each example. The evaluation re-
sults are shown in Table 1. Our approach achieves a
65.5% exact match accuracy and a 71.9% F1 score
on the WWQ dataset.
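A small sketch of the two metrics as defined here, computing exact match over answer sets and the F1 score averaged per example; this is an assumed implementation rather than the authors' evaluation script.

def f1(pred: set, gold: set) -> float:
    # Per-example F1 between predicted and gold answer sets.
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def evaluate(predictions: list, golds: list):
    # Exact match: the answers of the predicted SPARQL exactly match the gold answers.
    em = sum(p == g for p, g in zip(predictions, golds)) / len(golds)
    # Macro F1: average per-example F1 score.
    macro_f1 = sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)
    return em, macro_f1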
As a reference, the current state-of-the-art result
on the original WebQuestionsSP dataset for Free-
base is 78.8% F1 (Yu et al., 2023). The result was
obtained with a combination of semantic parsing
and retrieval. The WikiWebQuestions dataset is
slightly different, as discussed above. More signifi-
cantly, unlike Freebase, Wikidata does not have a
fixed schema and ours is an end-to-end, seq2seq
semantic parser.
6.2 Ablation Experiments
6.2.1 Entity Linking
Our first ablation study evaluates the need for entity
linking with ReFinED, by replacing it with simply
using the LLM to detect entities as mentions. In
this experiment, all entity IDs in the training data
are replaced by their mentions; during inference,
we map the predicted entities to their actual QIDs
according to Section 3.2.2.
The results show that replacing the neural entity
linker with just using mentions reduces the exact
match by 9.1% and the F1 score by 9.3%. This
suggests that entity linking is important.
6.2.2 Allowing Mentions as Entities
Our logical form is designed to recover from entity
linking errors by allowing entities to be specified by a mention, as an alternative to a QID.

WikiSP (ours): EM 75.6, F1 76.9
No Entity Linking: EM 66.5, F1 67.6
No mentions, trained with ReFinED: EM 73.3, F1 75.0
No mentions, trained with Oracle entities: EM 72.2, F1 73.4
PIDs and QIDs for properties & domains: EM 73.6, F1 74.7
Table 2: Ablation results of WikiSP on the WWQ dev set.

Our ablation
study on this feature tested two training strategies:
ReFinED. The entity linker tuples are produced by
fine-tuned ReFinED, which may be missing entities
in the gold target. The data show that generating
unseen QIDs is needed for missing entities.
Oracle. The entity linker tuples are exactly all
the entities used in the gold. The model would
only encounter missing QIDs at test time when
ReFinED fails to generate all the necessary QIDs.
The answer accuracy of the model using en-
tity linked tuples from ReFinED (“No mentions,
trained with ReFinED” in Table 2) lags by 2.3%
when compared against our best model. The model
using Oracle (“No mentions, trained with Oracle
entities” in Table 2) lags by 3.4%. These results
indicate that allowing mentions is useful for recov-
ering from entity linking errors.
6.2.3 Names vs. IDs for Properties & Domains
Our logical form replaces PIDs with property
names, and domain-entity QIDs with the domain
names. Here we evaluate the effectiveness of this
query format. We compare our approach with the
original SPARQL where all properties and entities
are represented with PIDs and QIDs. Our ablation
study shows that our representation with property
names and domain names improves the answer ac-
curacy by 2.0% (Table 2). This shows that LLMs
can adapt to changes in query notation with fine-
tuning, and it is easier to learn names than remem-
bering random IDs. If we did not allow mentions in
the predicted logical form, the replacement of QIDs
with their names is likely to be more significant.
6.3 Complementing GPT-3
LLMs like GPT-3 can answer many questions on
general knowledge correctly; however, they may
also hallucinate. WWQ is representative of popular
questions, so we expect GPT-3 to perform well. We
use text-davinci-002 with the temperature set to 0
to evaluate GPT-3’s performance on WWQ.
On the dev set of WWQ, GPT-3 answers 66.4% of the questions correctly and provides incomplete
answers to 26.5% of the questions. For example,
when asked “What does Obama have a degree in?”,
GPT-3 correctly identifies President Obama’s po-
litical science degree, but fails to mention his law
degree. In total, GPT-3 gives wrong answers to
7.1% of the questions.
For this dev set, we can give definitive answers to
75.6% of the questions with WikiSP (Table 2). For
the rest of the questions (24.4%), accounting for
the overlap between the GPT-3 and our semantic
parser’s results, the percentages of guessing cor-
rectly, incompletely, and incorrectly are at 15.2%,
5.5%, and 3.7%, respectively (Figure 2).
In summary, the combination of GPT-3 and
WikiSP makes it possible to give a definitive, cor-
rect, and complete answer three quarters of the time
for the dev set. Users can also benefit from GPT-
3’s guesses the rest of the time at a 3.7% error rate,
which is about half of the original error rate.
6.4 Error Analysis
We analyzed the 111 examples in the WWQ dev
set where the model failed.
6.4.1 Acceptable Alternative Results (18.0%)
Our analysis shows that 18.0% of the “errors” can
actually be deemed to be correct.
Reasonable alternate answers (11.7%). In
11.7% of the cases, the model predicts an alterna-
tive interpretation to the question and returns a rea-
sonable answer that is different from the gold. For
example, the gold for question “what did Boudicca
do?” uses the position held property, while the
model predicts occupation property. Both are con-
sidered valid answers to the question.
Reasonable alternative SPARQL but no an-
swer was retrieved (6.3%). In another 6.3%
of cases, the model predicts a reasonable alterna-
tive SPARQL, but the SPARQL returns no answer.
Sometimes, since the information for the “correct”
property is missing, the question is represented
with a similar property. For example, since resi-
dence property is missing for Patrick Henry, the
gold SPARQL for “where did Patrick Henry live?”
uses place of birth instead, while our model pre-
dicts residence.
6.4.2 Errors in Entity Linking (35.1%)
The biggest source of errors is entity linking. En-
tity linker failed to provide the correct entities in
35.1% of the failed examples. While WikiSP can
potentially recover from missing entities, it can-
not recover from incorrect entities. This is espe-
cially common for character roles, as some char-
acter roles have different entities for books and
movies or even different series of movies. Some-
times WikiSP located the correct mention from
the question, but the lookup failed. For example,
the model located the mention of the event “allied
invasion of France” in question “where did the al-
lied invasion of France take place?”, but failed to
find the corresponding entity from Wikidata by the
name.
6.4.3 Errors Beyond Entity Linking
Semantic parsing in Wikidata is challenging as
there are no predefined schemas, and there are
150K domains and 3K applicable properties. Some
representative mistakes include the following:
Wrong property (17.1%). 17.1% of the er-
rors are caused by predicting the wrong property.
Some of the examples require background knowl-
edge to parse. For example the answer of the ques-
tion “what did martin luther king jr do in his life?”
should return the value of movement, while the
model predicts occupation. Properties are a chal-
lenge in Wikidata because, as illustrated here, which
property to predict depends on the entity itself.
Missing domain constraint (5.4%). Another
common problem is missing the domain constraint.
For example, the model correctly identifies that
property shares border with should be used for
question “what countries are around Egypt?”. How-
ever, it does not limit the answer to countries only,
thus extra entities are returned.
7 Experiment with QALD-7
For another evaluation of WikiSP, we apply our
model on Task 4 from QALD-7 (Usbeck et al.,
2017) dataset. QALD-7 is part of the QALD (Ques-
tion Answering over Linked Data) which is a series
of challenges started in 2011 known for their com-
plex, manually created questions. It mainly focuses
on DBpedia, but QALD-7’s Task 4 is engineered
for Wikidata. The task includes 100 train examples,
which we use to fine-tune our model and 50 test
examples. There is no dev set.
We choose QALD-7 as it is a manually crafted
dataset with complex questions. We avoid datasets
built on synthetic or human-paraphrased data, such
as CSQA (Saha et al., 2018) and KQA-Pro (Cao
et al., 2022a). As they have limited natural lan-
guage variety between the training and evaluation data, models can get artificially high accuracy. For example, a simple BART based model can achieve over 90% accuracy on KQA-Pro even without an entity linking module (Cao et al., 2022a).

STAGG (Yih et al., 2016): EM -, F1 19.0
GGNN (Sorokin and Gurevych, 2018): EM -, F1 21.3
WDAqua (Diefenbach et al., 2017): EM -, F1 40.0
WikiSP (Ours): EM 38.0, F1 43.6
Table 3: Evaluation results of WikiSP on QALD-7 Task 4 and comparison with prior work.
The QALD-7 test set provides both the SPARQL
queries as well as the answers. To double-check the
correctness of the QALD-7 dataset, we applied the
50 gold queries of the test set to Wikidata and found
that 4 did not return an answer. We hypothesize
that the discrepancy is caused by the change in
Wikidata structure/quantity of information. We
evaluate WikiSP by comparing the answers where
possible, and by comparing the generated SPARQL
syntactically otherwise.
For this experiment, we use the same hyper-
parameters and data format as described in Sec-
tion 5.3. In addition to the training data for WikiSP,
we also include the QALD-7 train samples, upsam-
pled 20 times.
7.1 QALD-7 Results
Our model achieves 38% accuracy on the QALD-7
dataset and outperforms the F1 score of the state-
of-the-art WDAqua (Diefenbach et al., 2017) by
3.6%, as shown in Table 3. Note that WDAqua
is based on retrieval, whereas WikiSP is based on
sequence-to-sequence semantic parsing. QALD-
7 (Usbeck et al., 2017) reports WDAqua as the
winner of the leaderboard with 55.2 F1; however, the authors of WDAqua reported 40.0 F1 in their own paper (Diefenbach et al., 2017).
7.2 Complementing GPT-3 on QALD-7
Similar to WWQ, we also assess the combination
of GPT with WikiSP on QALD-7 as shown in Fig-
ure 3. The GPT model used was "text-davinci-002".
Since there is no validation set and the test set is
already very small, one of the authors who was
not involved in training or finetuning the model
evaluated GPT-3 on the test set.
GPT-3 is fully accurate on 62% of the questions,
20% incomplete, and 18% wrong. With our ap-
proach, we can provide 38% verifiably good an-
swers from WikiSP; the guesses of GPT-3 get an
additional 34% correct, 16% incomplete, and only 12% wrong.

Figure 3: Distribution of correct, incomplete, and incorrect answers for the QALD-7 test set, when GPT-3 is used alone and when combined with WikiSP. (GPT-3 only: 62% correct, 20% incomplete, 18% incorrect. WikiSP + GPT-3: 38% verified from WikiSP, 34% correct GPT guesses, 16% incomplete, 12% incorrect.)
7.3 Discussion
We did not conduct an error analysis on QALD-7
as it has no dev set. The author evaluating GPT-3
noted that the QALD-7 test set is much more
complicated than the training data (of just 100
samples), with most of the queries containing
multiple properties. This explains the lower
accuracy of WikiSP on QALD-7 compared to
WikiWebQuestions, whose few-shot training data
has a distribution similar to its test set.
This result suggests that the performance of
WikiSP depends heavily on good few-shot training
data for fine-tuning the LLM. We hypothesize
that we can improve the performance of WikiSP
on less popular questions with a better, possibly
synthesized, training dataset.
8 Conclusion
We have created a new high-quality benchmark,
WikiWebQuestions, for large knowledge-base ques-
tion answering. The dataset is based on the popular
WebQuestionsSP dataset with natural questions,
annotated with SPARQL for Wikidata.
We establish a first, strong baseline of 65% an-
swer accuracy and 72% F1 score for WikiWeb-
Questions. This is achieved by fine-tuning LLaMA
with a few-shot training data set using a SPARQL
query format modified for semantic parsing.
We show that we can reduce the hallucination
of large language models like GPT-3 by grounding
them with a semantic parser. On the dev set of
WikiWebQuestions, this combined approach provides
useful information for 96% of the questions; more
importantly, it generates verifiable answers for 76%
of them.
Limitations
While applications of large language models seem
to expand every day, this paper mainly focuses on
factoid question answering. Long-form text genera-
tion, for example, is outside the scope of the experi-
ments of this paper, but the methodology described
here may be extended to this setting in the future.
Even though knowledge bases are an important
source of facts, a large portion of the knowledge
available in digital form (e.g. Wikipedia, news arti-
cles, etc.) is not organized into knowledge bases.
As such, the results of this paper can be considered
complementary to the larger body of fact-checking
research based on free text.
Our semantic parser can be used to verify an-
swers from LLMs. However, this additional round
of running the semantic parser and querying Wiki-
data increases the response latency, which may be
noticeable to end-users of such systems.
All of our datasets and experiments are con-
ducted for English. Expanding to other languages,
while possible (Moradshahi et al., 2020), is outside
the scope of this work.
Our experiments were performed using GPT-3
(davinci-002) as that was what we had access to
when we started the project. Undoubtedly, later
LLMs will produce better results. Nonetheless,
the need for verifiable results based on live
database accesses will remain.
Acknowledgements
This work is supported in part by the National Sci-
ence Foundation, the Alfred P. Sloan Foundation,
the Verdant Foundation, Microsoft Azure AI credit,
KDDI, JPMorgan Chase, and the Stanford Human-
Centered Artificial Intelligence (HAI) Institute. We
also thank the reviewers for their valuable com-
ments and suggestions.
Ethical Considerations
LLMs are used by millions of people every day.
We hope that this line of work will help make them
more reliable for everyone, mitigating some of their
potential downsides, and giving users access to
more accurate information. Our use of Wikidata
will enable future researchers and developers to
connect their systems with a large, diverse and live
knowledge graph that is updated every day. We do
not anticipate any harm resulting from the methods
introduced in this work.
We did not crowdsource any datasets for this
paper, as the questions are converted from a previ-
ous dataset and all the re-annotation and analysis
is done by the authors.
To conduct experiments in this paper, we used
an estimated total of 60 NC96ads-A100 GPU hours
on Microsoft Azure. Each finetuning experiment
takes roughly 3 hours, and we conducted roughly
20 experiments to arrive at the results in this paper.
References
Shengnan An, Bo Zhou, Zeqi Lin, Qiang Fu, Bei Chen,
Nanning Zheng, Weizhu Chen, and Jian-Guang Lou.
2023. Skill-based few-shot selection for in-context
learning.
Aseem Arora, Shabbirhussain Bhaisaheb, Harshit
Nigam, Manasi Patwardhan, Lovekesh Vig, and Gau-
tam Shroff. 2023. Adapt and decompose: Efficient
generalization of text-to-sql via domain adapted least-
to-most prompting.
Tom Ayoola, Shubhi Tyagi, Joseph Fisher, Christos
Christodoulopoulos, and Andrea Pierleoni. 2022.
ReFinED: An efficient zero-shot-capable approach
to end-to-end entity linking. In Proceedings of the
2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies: Industry Track, pages
209–220, Hybrid: Seattle, Washington + Online. As-
sociation for Computational Linguistics.
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen-
liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei
Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan
Xu, and Pascale Fung. 2023. A multitask, multilin-
gual, multimodal evaluation of chatgpt on reasoning,
hallucination, and interactivity.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy
Liang. 2013. Semantic parsing on Freebase from
question-answer pairs. In Proceedings of the 2013
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1533–1544, Seattle, Wash-
ington, USA. Association for Computational Linguis-
tics.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim
Sturge, and Jamie Taylor. 2008. Freebase: A col-
laboratively created graph database for structuring
human knowledge. In Proceedings of the 2008 ACM
SIGMOD International Conference on Management
of Data, SIGMOD ’08, page 1247–1250, New York,
NY, USA. Association for Computing Machinery.
Giovanni Campagna, Rakesh Ramesh, Silei Xu,
Michael Fischer, and Monica S. Lam. 2017. Al-
mond: The architecture of an open, crowdsourced,
privacy-preserving, programmable virtual assistant.
In Proceedings of the 26th International Conference
on World Wide Web - WWW ’17, pages 341–350, New
York, New York, USA. ACM Press.
Giovanni Campagna, Silei Xu, Mehrad Moradshahi,
Richard Socher, and Monica S. Lam. 2019. Genie:
A generator of natural language semantic parsers for
virtual assistant commands. In Proceedings of the
40th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI 2019,
page 394–410, New York, NY, USA. Association for
Computing Machinery.
Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie,
Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Han-
wang Zhang. 2022a. KQA pro: A dataset with ex-
plicit compositional programs for complex question
answering over knowledge base. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 6101–6119, Dublin, Ireland. Association for
Computational Linguistics.
Shulin Cao, Jiaxin Shi, Zijun Yao, Xin Lv, Jifan Yu,
Lei Hou, Juanzi Li, Zhiyuan Liu, and Jinghui Xiao.
2022b. Program transfer for answering complex
questions over knowledge bases. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 8128–8140, Dublin, Ireland. Association for
Computational Linguistics.
Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya God-
bole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros
Polymenakos, and Andrew McCallum. 2021. Case-
based reasoning for natural language queries over
knowledge bases. In Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, pages 9594–9611, Online and Punta Cana,
Dominican Republic. Association for Computational
Linguistics.
Dennis Diefenbach, Kamal Singh, and Pierre Maret.
2017. Wdaqua-core0: A question answering compo-
nent for the research community. In Semantic Web
Evaluation Challenge, pages 84–89. Springer.
Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Ques-
tion answering over Freebase with multi-column con-
volutional neural networks. In Proceedings of the
53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 260–269, Beijing, China.
Association for Computational Linguistics.
Jerome Goddard. 2023. Hallucinations in chatgpt: A
cautionary tale for biomedical researchers. The Amer-
ican Journal of Medicine.
Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy
Liang, Xifeng Yan, and Yu Su. 2021. Beyond i.i.d.:
Three levels of generalization for question answering
on knowledge bases. In Proceedings of the Web
Conference 2021. ACM.
Yu Gu and Yu Su. 2022. ArcaneQA: Dynamic program
induction and contextualized encoding for knowl-
edge base question answering. In Proceedings of
the 29th International Conference on Computational
Linguistics, pages 1718–1731, Gyeongju, Republic
of Korea. International Committee on Computational
Linguistics.
Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu,
Noah A. Smith, and Mari Ostendorf. 2022. In-
context learning for few-shot dialogue state tracking.
In Findings of the Association for Computational
Linguistics: EMNLP 2022, pages 2627–2643, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Yunshi Lan and Jing Jiang. 2020. Query graph gen-
eration for answering multi-hop complex questions
from knowledge bases. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 969–974, Online. Association for
Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7871–7880, Online. Association for Computa-
tional Linguistics.
Belinda Z. Li, Sewon Min, Srinivasan Iyer, Yashar
Mehdad, and Wen-tau Yih. 2020. Efficient one-pass
end-to-end entity linking for questions. In Proceed-
ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages
6433–6441, Online. Association for Computational
Linguistics.
Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang,
Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao,
Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma,
Guoliang Li, Kevin C. C. Chang, Fei Huang, Reynold
Cheng, and Yongbin Li. 2023. Can llm already serve
as a database interface? a big bench for large-scale
database grounded text-to-sqls.
Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Zhu.
2018. Knowledge base question answering via encod-
ing of complex query graphs. In Proceedings of the
2018 Conference on Empirical Methods in Natural
Language Processing, pages 2185–2194, Brussels,
Belgium. Association for Computational Linguistics.
Costas Mavromatis and George Karypis. 2022. ReaRev:
Adaptive reasoning for question answering over
knowledge graphs. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages
2447–2458, Abu Dhabi, United Arab Emirates. As-
sociation for Computational Linguistics.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-
Hossein Karimi, Antoine Bordes, and Jason Weston.
2016. Key-value memory networks for directly read-
ing documents. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Associ-
ation for Computational Linguistics.
Mehrad Moradshahi, Giovanni Campagna, Sina Sem-
nani, Silei Xu, and Monica Lam. 2020. Localizing
open-ontology QA semantic parsers in a day using
machine translation. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 5970–5983, Online. As-
sociation for Computational Linguistics.
Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu
Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and
Dragomir Radev. 2023. Enhancing few-shot text-to-
sql capabilities of large language models: A study on
prompt design strategies.
Yilin Niu, Fei Huang, Wei Liu, Jianwei Cui, Bin Wang,
and Minlie Huang. 2023. Bridging the Gap between
Synthetic and Natural Questions via Sentence De-
composition for Semantic Parsing. Transactions
of the Association for Computational Linguistics,
11:367–383.
Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari,
Gustavo Soares, Christopher Meek, and Sumit Gul-
wani. 2022. Synchromesh: Reliable code generation
from pre-trained language models. In The Tenth In-
ternational Conference on Learning Representations,
ICLR 2022, Virtual Event, April 25-29, 2022. Open-
Review.net.
Ohad Rubin, Jonathan Herzig, and Jonathan Berant.
2022. Learning to retrieve prompts for in-context
learning. In Proceedings of the 2022 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 2655–2671, Seattle, United States.
Association for Computational Linguistics.
Amrita Saha, Ghulam Ahmed Ansari, Abhishek Laddha,
Karthik Sankaranarayanan, and Soumen Chakrabarti.
2019. Complex program induction for querying
knowledge bases in the absence of gold programs.
Transactions of the Association for Computational
Linguistics, 7:185–200.
Amrita Saha, Vardaan Pahuja, Mitesh Khapra, Karthik
Sankaranarayanan, and Sarath Chandar. 2018. Com-
plex sequential question answering: Towards learn-
ing to converse over linked question answer pairs
with a knowledge graph. In Proceedings of the AAAI
conference on artificial intelligence, volume 32.
Priyanka Sen, Armin Oliya, and Amir Saffari. 2021.
Expanding end-to-end question answering on differ-
entiable knowledge graphs with intersection. In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 8805–
8812, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Richard Shin, Christopher Lin, Sam Thomson, Charles
Chen, Subhro Roy, Emmanouil Antonios Platanios,
Adam Pauls, Dan Klein, Jason Eisner, and Benjamin
Van Durme. 2021. Constrained language models
yield few-shot semantic parsers. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 7699–7715, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
Yiheng Shu, Zhiwei Yu, Yuhan Li, Börje Karlsson,
Tingting Ma, Yuzhong Qu, and Chin-Yew Lin. 2022.
TIARA: Multi-grained retrieval for robust question
answering over large knowledge base. In Proceed-
ings of the 2022 Conference on Empirical Methods
in Natural Language Processing, pages 8108–8121,
Abu Dhabi, United Arab Emirates. Association for
Computational Linguistics.
Daniil Sorokin and Iryna Gurevych. 2018. Modeling se-
mantics with gated graph neural networks for knowl-
edge base question answering. In Proceedings of the
27th International Conference on Computational Lin-
guistics, pages 3306–3317, Santa Fe, New Mexico,
USA. Association for Computational Linguistics.
Yu Su, Ahmed Hassan Awadallah, Madian Khabsa,
Patrick Pantel, Michael Gamon, and Mark Encar-
nacion. 2017. Building natural language interfaces to
web apis. In Proceedings of the 2017 ACM on Con-
ference on Information and Knowledge Management,
pages 177–186.
Haitian Sun, Tania Bedrax-Weiss, and William Cohen.
2019. PullNet: Open domain question answering
with iterative retrieval on knowledge bases and text.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 2380–
2390, Hong Kong, China. Association for Computa-
tional Linguistics.
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn
Mazaitis, Ruslan Salakhutdinov, and William Cohen.
2018. Open domain question answering using early
fusion of knowledge bases and text. In Proceed-
ings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 4231–4242,
Brussels, Belgium. Association for Computational
Linguistics.
Alon Talmor and Jonathan Berant. 2018. The web as
a knowledge-base for answering complex questions.
In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 641–651, New Or-
leans, Louisiana. Association for Computational Lin-
guistics.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. 2023. Stanford
alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. 2023. Llama: Open and effi-
cient foundation language models. arXiv preprint
arXiv:2302.13971.
Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Bastian
Haarmann, Anastasia Krithara, Michael Röder, and
Giulio Napolitano. 2017. 7th open challenge on ques-
tion answering over linked data (qald-7). In Semantic
web evaluation challenge, pages 59–69. Springer.
Pat Verga, Haitian Sun, Livio Baldini Soares, and
William Cohen. 2021. Adaptable and interpretable
neural memory over symbolic knowledge. In Pro-
ceedings of the 2021 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
3678–3691, Online. Association for Computational
Linguistics.
Salvatore Vivona and Kaveh Hassani. 2019. Relational
graph representation learning for open-domain ques-
tion answering.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa
Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh
Hajishirzi. 2023. Self-instruct: Aligning language
models with self-generated instructions. In Proceed-
ings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 13484–13508, Toronto, Canada. Association
for Computational Linguistics.
Benjamin Weiser. 2023. Here’s what happens when
your lawyer uses chatgpt. The New York Times.
Silei Xu, Giovanni Campagna, Jian Li, and Monica S.
Lam. 2020a. Schema2qa: High-quality and low-cost
q&a agents for the structured web. In Proceedings
of the 29th ACM International Conference on Infor-
mation & Knowledge Management, CIKM ’20, page
1685–1694, New York, NY, USA. Association for
Computing Machinery.
Silei Xu, Sina Semnani, Giovanni Campagna, and Mon-
ica Lam. 2020b. AutoQA: From databases to QA
semantic parsers with only synthetic training data. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 422–434, Online. Association for Computa-
tional Linguistics.
Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou,
and Caiming Xiong. 2022. RNG-KBQA: Generation
augmented iterative ranking for knowledge base ques-
tion answering. In Proceedings of the 60th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 6032–6043,
Dublin, Ireland. Association for Computational Lin-
guistics.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jian-
feng Gao. 2015. Semantic parsing via staged query
graph generation: Question answering with knowl-
edge base. In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational Linguistics
and the 7th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers),
pages 1321–1331, Beijing, China. Association for
Computational Linguistics.
Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-
Wei Chang, and Jina Suh. 2016. The value of se-
mantic parse labeling for knowledge base question
answering. In Proceedings of the 54th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 201–206, Berlin,
Germany. Association for Computational Linguis-
tics.
Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu,
Alexander Hanbo Li, Jun Wang, Yiqun Hu, William
Wang, Zhiguo Wang, and Bing Xiang. 2023. De-
caf: Joint decoding of answers and logical forms for
question answering over knowledge bases. In ICLR
2023.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,
Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingn-
ing Yao, Shanelle Roman, Zilin Zhang, and Dragomir
Radev. 2018. Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic pars-
ing and text-to-SQL task. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 3911–3921, Brussels, Bel-
gium. Association for Computational Linguistics.
A Examples of Recovering from Entity Linking Errors
Here, we illustrate our proposal of using entity
mentions to recover from entity linking errors. In
the training set, we have the following example:
• Query: What year did giants win the world
series?
• Original Gold SPARQL:
SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:Q265538;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }
• Gold Entity linker result:
World Series (QID Q265538),
San Francisco Giants (QID Q308966);
• ReFinED result:
San Francisco Giants (QID Q308966);
Here, the ReFinED entity linker fails to identify
the “World Series” entity. Allowing entity mentions
gives the semantic parser a chance to recover from
such entity-linking failures. To train the parser to
generate mentions, our training includes samples
like this:
• Query: what year did giants win the world
series?
• ReFinED result:
San Francisco Giants (QID Q308966);
• Gold target:
SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:world_series;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }
The gold query uses the mention “world_series”. At
inference time, our heuristics use the predicted men-
tion to look up the actual Wikidata entity. For ex-
ample, if wd:world_series is predicted at inference
time, our heuristics map it back to wd:Q265538,
yielding the resolved query below.
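With the mention resolved, the query executed against Wikidata is equivalent to the original gold SPARQL shown above:

SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:Q265538;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }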
