Summary: Fine-tuned LLMs for Wikidata Semantic Parsing (arxiv.org)

One Line

Fine-tuned LLMs used as semantic parsers over Wikidata, evaluated on the new WikiWebQuestions benchmark, improve answer accuracy and reduce hallucination, as evidenced by strong experimental results.

Slides

Slide Presentation (12 slides)


Fine-tuned LLMs for Wikidata Semantic Parsing

Source: arxiv.org (PDF, 9,201 words)

Introducing WikiWebQuestions


• WikiWebQuestions is a high-quality question answering benchmark for Wikidata.

• It provides a comprehensive evaluation of answer accuracy.

• The dataset consists of real-world questions collected from users.

WikiSP: A Semantic Parser for Wikidata


• WikiSP is a few-shot sequence-to-sequence semantic parser for Wikidata.

• It complements large language models (LLMs) to improve answer accuracy.

• By grounding LLMs in Wikidata, factuality is enhanced.

Modifying SPARQL for Improved Parsing


• SPARQL is modified to use domain and property names instead of unique IDs.

• Property and domain names are easier for a fine-tuned LLM to learn than arbitrary PIDs and QIDs, which cannot all fit in a prompt.

• WikiSP outputs queries in this modified SPARQL, which are then executed against Wikidata to retrieve answers (see the sketch below).
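For illustration, here is a minimal sketch (not the authors' code) of the notation change, applied to the paper's own example query for "What car models does GM make?"; the tiny ID-to-name mapping below is hard-coded for this one example and would normally come from Wikidata property and domain labels.

# Sketch: rewrite ID-based SPARQL into the name-based form WikiSP is trained to emit.
# The mapping is illustrative; in practice names come from Wikidata labels.
ID_TO_NAME = {
    "P31": "instance_of",
    "P279": "subclass_of",
    "P176": "manufacturer",
    "Q3231690": "automobile_model",  # domain entity: replaced by its name
}

def rename_ids(sparql: str) -> str:
    for wikidata_id, name in ID_TO_NAME.items():
        sparql = sparql.replace(wikidata_id, name)
    return sparql

original = ("SELECT DISTINCT ?x WHERE { "
            "?x wdt:P31/wdt:P279* wd:Q3231690. "
            "?x wdt:P176 wd:Q81965. }")
print(rename_ids(original))
# -> SELECT DISTINCT ?x WHERE { ?x wdt:instance_of/wdt:subclass_of* wd:automobile_model.
#    ?x wdt:manufacturer wd:Q81965. }
# Non-domain entity QIDs such as Q81965 (General Motors) keep their IDs, or may be
# written as a mention (e.g. wd:GM) that is resolved to a QID at inference time.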

Experimental Results: Answer Accuracy


• The proposed methodology achieves a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions.

• Combining WikiSP with GPT-3, the system provides useful answers to 96% of the questions in the dev set.

• Outperforms the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Importance of Semantic Parsing


• LLMs can answer questions directly but lack interpretability and may provide incorrect answers.

• Semantic parsing provides interpretable results grounded in Wikidata.

• Users can verify answers and obtain more reliable information.

WikiWebQuestions Dataset


• Migrated WebQuestionsSP benchmark from Freebase to Wikidata.

• Provides up-to-date answers from a larger knowledge base.

• Real-world questions collected from users using the Google Suggest API.

Implementation: Entity Linking and Fine-tuning


• ReFinED is used as the entity linker for WikiSP.

• Fine-tuning of ReFinED with the WikiWebQuestions training set improves performance.

• LLaMA is fine-tuned on the WikiWebQuestions few-shot set together with the Alpaca instruction-following data.

Evaluation Results: WikiSP Performance


• WikiSP achieves a 65.5% exact match accuracy and a 71.9% F1 score on the WikiWebQuestions test set.

• Entity linking and allowing mentions as entities improve answer accuracy.

• Ablation experiments demonstrate the importance of these factors.

Combining GPT-3 with WikiSP


• On the dev set, GPT-3 alone answers 66.4% of the questions correctly, gives incomplete answers for 26.5%, and wrong answers for 7.1%.

• WikiSP provides definitive answers for 75.6% of the questions.

• Combining the two yields definitive, correct, and complete answers for 75% of dev-set questions and roughly halves GPT-3's error rate (sketched below).
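As a rough sketch of this combination (assuming helper callables wikisp_parse and gpt3_answer that stand in for the fine-tuned parser and the GPT-3 call; neither is an API from the paper), the control flow could look like this:

import requests

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"  # public Wikidata Query Service

def run_sparql(query: str) -> list:
    # Execute a SPARQL query against Wikidata and return the result bindings.
    resp = requests.get(WIKIDATA_ENDPOINT, params={"query": query, "format": "json"})
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

def answer(question: str, wikisp_parse, gpt3_answer) -> str:
    # wikisp_parse: question -> SPARQL string (placeholder for the fine-tuned WikiSP parser)
    # gpt3_answer:  question -> free-text answer (placeholder for the GPT-3 call)
    sparql = wikisp_parse(question)
    results = run_sparql(sparql) if sparql else []
    if results:
        # Verifiable answer grounded in Wikidata, shown in the context of the query.
        return f"From Wikidata: {results}"
    # Query failed or returned nothing: fall back to GPT-3, clearly labeled as a guess.
    return "GPT-3 guesses that " + gpt3_answer(question)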

Error Analysis and Improvements


• Errors in the WWQ dev set include alternative interpretations, alternative SPARQL queries, and entity linking errors.

• WikiSP outperforms the state-of-the-art WDAqua by 3.6% in terms of F1 score on the QALD-7 dataset.

• Better training datasets are needed to handle complex and less popular questions.

Key Takeaways


• WikiWebQuestions provides a high-quality benchmark for Wikidata question answering.

• WikiSP complements LLMs and improves answer accuracy.

• Semantic parsing offers interpretable results grounded in Wikidata.

• Combining GPT-3 with WikiSP enhances answer accuracy in question answering tasks.

   

Key Points

  • WikiWebQuestions is a high-quality question answering benchmark for Wikidata.
  • WikiSP is a few-shot sequence-to-sequence semantic parser for Wikidata.
  • The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata.
  • Semantic parsing is used to complement LLMs and provide more accurate answers.
  • The authors modify SPARQL to use domain and property names instead of unique IDs.
  • The methodology establishes a strong baseline of 76% and 65% answer accuracy on the WikiWebQuestions dev and test sets, respectively.
  • Semantic parsing provides interpretable results grounded in Wikidata for better verification of answers.
  • The combination of GPT-3 with WikiSP improves answer accuracy in question answering tasks.

Summaries

23 word summary

WikiWebQuestions is a question answering benchmark for Wikidata. Semantic parsing with large language models improves answer accuracy. Experimental results show a strong baseline.

94 word summary

This paper presents WikiWebQuestions, a question answering benchmark for Wikidata, and introduces WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing alongside large language models (LLMs) to improve answer accuracy. They modify SPARQL and train the parser to link entities in user queries to their unique ID in Wikidata. Experimental results show that this methodology achieves a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. The authors highlight the importance of semantic parsing for grounding LLMs and discuss limitations and future work.

139 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing as a complement to large language models (LLMs) to improve answer accuracy. They modify SPARQL by using domain and property names instead of unique IDs and train the parser to link entities in user queries to their unique ID in Wikidata. If the query fails, GPT-3 is used as a fallback. Experimental results show that this methodology improves answer accuracy, achieving a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. Semantic parsing is crucial for grounding LLMs, and combining it with GPT-3 provides more reliable answers. The authors also introduce the WikiWebQuestions dataset, evaluate fine-tuned LLMs for Wikidata semantic parsing, and discuss limitations and future work.

516 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors aim to enhance the accuracy of large language models (LLMs) by grounding them in Wikidata. To address the issue of LLMs providing incorrect answers, the authors propose using semantic parsing as a complement to LLMs for more accurate answers.

The authors modify SPARQL, a query language for semantic parsing, by using domain and property names instead of unique IDs. They train the parser to link entities in user queries to their unique ID in Wikidata using an entity linker or mentions in the query. The modified SPARQL queries are then fed into the WikiSP semantic parser. If the query fails, GPT-3 is used as a fallback and the result is labeled as a GPT-3 guess.

Experimental results demonstrate that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining the semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Semantic parsing is crucial for grounding LLMs as it provides interpretable results grounded in Wikidata. This allows users to verify answers since LLMs may not always be correct. By combining semantic parsing with GPT-3's guesses, the system offers more reliable answers.

The authors introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. It is an updated version of the WebQuestionsSP benchmark, providing real-world questions collected from users via the Google Suggest API.

ReFinED is used as the entity linker for WikiSP. It is fine-tuned with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. Additionally, LLaMA, a large language model, is fine-tuned with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

Evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance, achieving a 65.5% exact match accuracy and a 71.9% F1 score. Answer accuracy is improved by entity linking and allowing mentions as entities.

The authors evaluate fine-tuned LLMs for Wikidata semantic parsing, focusing on using property and domain names instead of IDs and combining GPT-3 with WikiSP for question answering. Using property and domain names improves answer accuracy by 2.0%. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis reveals alternative interpretations, SPARQL queries that don't retrieve answers, and entity linking errors as common errors. WikiSP outperforms WDAqua by 3.6% in F1 score on Task 4 from the QALD-7 dataset. Combining GPT-3 with WikiSP yields additional correct answers for 34% of the questions.

Limitations discussed include the focus on factoid question answering and English datasets, as well as the need for better training datasets to handle more complex questions.

In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and demonstrate the advantages of combining GPT-3 with WikiSP.

570 word summary

This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The goal is to improve the accuracy of large language models (LLMs) by grounding them in Wikidata. LLMs have the tendency to give incorrect answers, so the authors propose using semantic parsing to complement LLMs and provide more accurate answers.

The authors modify SPARQL, a query language for semantic parsing, to use domain and property names instead of unique IDs. They train the parser to use either an entity linker or mentions in the query to link entities in the user query to their unique ID in Wikidata. The modified SPARQL queries are then fed into the WikiSP semantic parser. If the query fails to return a result, the system defaults to using GPT-3 and labels the result as a GPT-3 guess.

Experimental results show that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

Semantic parsing is important in grounding LLMs. While LLMs can answer questions directly, their answers may not always be correct. Semantic parsers provide interpretable results grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides more reliable answers.

The authors also introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. They migrated the WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.

The authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Entity linking and allowing mentions as entities improve answer accuracy.

The authors conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing, focusing on two aspects: using property and domain names instead of PIDs and QIDs, and combining GPT-3 with WikiSP for question answering.

Using property and domain names improves answer accuracy by 2.0%. LLMs can adapt to changes in query notation with fine-tuning. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis shows that errors include alternative interpretations, alternative SPARQL queries that don't retrieve an answer, and entity linking errors.

WikiSP achieves 38% accuracy on Task 4 from the QALD-7 dataset, outperforming the state-of-the-art WDAqua by 3.6% in F1 score. Combining GPT-3 with WikiSP provides additional correct answers for 34% of the questions.

The authors discuss limitations such as the focus on factoid question answering and English datasets, and the need for better training datasets to handle more complex questions.

In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and show the benefits of combining GPT-3 with WikiSP.

965 word summary

This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. It introduces WikiSP, a few-shot sequence-to-sequence semantic parser for Wikidata. The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata, which contains over 12 billion facts. LLMs have the ability to answer questions, but they are prone to hallucinating and giving incorrect answers. The authors propose using semantic parsing to complement LLMs and provide more accurate answers.

The authors modify SPARQL, a query language used for semantic parsing, to use domain and property names instead of their unique IDs. They train the parser to use either the results from an entity linker or mentions in the query. The entity linker is used to link entities in the user query to their unique ID in Wikidata. The modified SPARQL queries are then fed into the WikiSP semantic parser to produce answers. If applying the query to Wikidata fails to return a result, the system defaults to using GPT-3, a large language model, and labels the result as a GPT-3 guess.

Experimental results show that this methodology is effective in improving answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they are able to provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.

The authors highlight the importance of semantic parsing in grounding LLMs. While LLMs can answer questions directly, they lack interpretability and their answers may not always be correct. Semantic parsers provide interpretable results that are grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides users with more reliable answers.

The authors also introduce the WikiWebQuestions dataset, which is a high-quality semantic parsing dataset for Wikidata. They migrated the popular WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.

In terms of implementation, the authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with the question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.

The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Ablation experiments demonstrate the importance of entity linking and allowing mentions as entities in improving answer accuracy.

Overall, this paper presents a method for fine-tuning LLMs and improving their factuality by grounding them in Wikidata. The proposed WikiSP semantic parser achieves strong results on the WikiWebQuestions dataset and outperforms existing methods. The authors highlight the benefits of combining semantic parsing with large language models to provide more accurate and interpretable answers.

The authors of the paper conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing. They focused on two specific aspects: the effectiveness of using property and domain names instead of PIDs and QIDs, and the combination of GPT-3 with WikiSP for question answering.

In their evaluation, the authors found that using property and domain names instead of PIDs and QIDs improved the answer accuracy by 2.0%. This indicates that LLMs can adapt to changes in query notation with fine-tuning, and it is easier for them to learn names than random IDs. However, the replacement of QIDs with their names would likely be more significant if mentions were not allowed in the predicted logical form.

The authors also evaluated the combination of GPT-3 with WikiSP for question answering using the WWQ dataset. GPT-3 answered 66.4% of the questions correctly, but provided incomplete answers for 26.5% of the questions and wrong answers for 7.1% of the questions. In contrast, WikiSP provided definitive answers for 75.6% of the questions. When combining GPT-3 with WikiSP, they were able to give definitive, correct, and complete answers for 75% of the questions in the dev set.

Error analysis showed that 18% of the errors in the WWQ dev set were actually deemed to be correct alternative results. These included cases where the model predicted an alternative interpretation to the question that still provided a reasonable answer. Another 6.3% of the errors were due to reasonable alternative SPARQL queries that did not retrieve an answer. The biggest source of errors, accounting for 35.1% of the failed examples, was entity linking errors. The entity linker failed to provide correct entities in these cases.

The authors also conducted an experiment with WikiSP on Task 4 from the QALD-7 dataset. WikiSP achieved 38% accuracy on this dataset, outperforming the state-of-the-art WDAqua by 3.6% in terms of F1 score. They also evaluated the combination of GPT-3 with WikiSP on QALD-7 and found that the combination approach provided additional correct answers for 34% of the questions.

The authors discussed the limitations of their work, including the focus on factoid question answering and the use of English datasets. They also mentioned the need for better training datasets to improve the performance of WikiSP on less popular questions.

In conclusion, the authors created a high-quality benchmark dataset called WikiWebQuestions for large knowledge-base question answering. They established a strong baseline for answer accuracy and F1 score using fine-tuned LLMs with a few-shot training dataset. They also showed that combining GPT-3 with WikiSP can reduce hallucination and provide useful information for a large percentage of questions. However, they acknowledged the need for further improvements and better training datasets to handle more complex and less popular questions.

Raw indexed text (59,242 chars / 9,201 words / 1,390 lines)

Fine-tuned LLMs Know More, Hallucinate Less
with Few-Shot Sequence-to-Sequence Semantic Parsing over Wikidata
Silei Xu ∗ Shicheng Liu ∗ Theo Culhane Elizaveta Pertseva
Meng-Hsi Wu 1 Sina J. Semnani Monica S. Lam
Computer Science Department, Stanford University
Stanford, CA
{silei, shicheng, tculhane, pertseva, sinaj, lam}@cs.stanford.edu
1
Ailly.ai
[email protected]
Abstract

While large language models (LLMs) can answer many questions correctly, they can also hallucinate and give wrong answers. Wikidata, with its over 12 billion facts, can be used to ground LLMs to improve their factuality. This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. Ported over from WebQuestions for Freebase, it consists of real-world data with SPARQL annotation. This paper presents a few-shot sequence-to-sequence semantic parser for Wikidata. We modify SPARQL to use the unique domain and property names instead of their IDs. We train the parser to use either the results from an entity linker or mentions in the query. We fine-tune LLaMA by adding the few-shot training data to that used to fine-tune Alpaca. Our experimental results demonstrate the effectiveness of this methodology, establishing a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By pairing our semantic parser with GPT-3, we combine verifiable results with qualified GPT-3 guesses to provide useful answers to 96% of the questions in dev. We also show that our method outperforms the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score. 1

* Equal contribution
1 Code, data, and model are available at https://github.com/stanford-oval/wikidata-emnlp23

Figure 1: An Overview of WikiSP. An entity linker is used to link entities in the user query to their unique ID in Wikidata; e.g. “A Bronx Tale” is linked to entity ID “Q1130705”. The query and entity linker outputs are fed to the WikiSP semantic parser to produce a modified version of SPARQL, where property IDs (e.g. “P915”) are replaced by their unique string identifiers (e.g. “filming_location”). If applying the query to Wikidata fails to return a result, we default to GPT-3, labeling the result as a GPT-3 guess. Returned answers are presented in the context of the query, so the user can tell if the answer is acceptable; if not, we also show the guess from GPT-3. Here WikiSP mistakenly uses “filming_location” instead of “narrative_location”; the user detects the mistake, thumbs down the answer, and the GPT-3 answer is provided. (Figure example: for the question of where “A Bronx Tale” took place, the entity linker returns (‘A Bronx Tale’, ‘Q1130705’), Wikidata returns the filming locations New Jersey and New York, and GPT-3 guesses the Bronx, New York.)

Figure 2: Distribution of correct, incomplete, and incorrect answers for the WikiWebQuestions dev set, when GPT-3 is used alone and when combined with WikiSP. (GPT-3 only: 66% correct, 27% incomplete, 7% incorrect. WikiSP + GPT-3: 76% verified from WikiSP, 15% correct GPT guesses, 6% incomplete, 4% incorrect.)

1 Introduction

Large language models (LLMs) such as GPT-3 can answer open-domain questions without access to external knowledge or any task-specific training examples. However, LLMs are prone to hallucinate (Bang et al., 2023), while using a convincing and confident tone. This may cause significant harm as people increasingly accept LLMs as a knowledge source (Goddard, 2023; Weiser, 2023).
In contrast, traditional knowledge base ques-
tion answering (KBQA) is grounded with a given
knowledge base. Semantic parsing (SP) has been
widely used to tackle this challenging task, where
the questions are first parsed into a logical form
and then executed to retrieve answers from the
knowledge base. It has better interpretability than
GPT-3 and other information-retrieval-based ap-
proaches (Dong et al., 2015; Miller et al., 2016;
Sun et al., 2018, 2019) where answers are predicted
directly.
To handle large knowledge bases, previous SP-
based approaches tend to use a multi-stage pipeline
of sub-tasks, starting with extracting the relevant
subgraph based on entities detected in the ques-
tions (Yih et al., 2015; Luo et al., 2018). Such
an approach struggles with questions that have a
large search space and fails to understand questions
that refer to information missing in the knowledge
graph. Having to retrieve the relevant subgraphs to
create the logical form conflates query resolution
with semantic parsing, rendering classical query
optimization inapplicable.
End-to-end seq2seq translation, on the other
hand, has mainly been used on schemas of rela-
tively small relational databases (Yu et al., 2018;
Xu et al., 2020a,b) and web APIs (Campagna et al.,
2017; Su et al., 2017). To handle large knowledge
graphs, recent work proposed retrieving (1) infor-
mation on linked entities, (2) exemplary logical
forms relevant to the query (Gu et al., 2021; Ye
et al., 2022), and (3) schemas as context to seman-
tic parsing (Shu et al., 2022). Others use induction
or iterative methods to generate complex logical
forms (Cao et al., 2022b; Gu and Su, 2022).
1.1 Few-Shot Seq2Seq Semantic Parsing
This paper investigates how we can leverage large
language models (LLMs) to create seq2seq neural
semantic parsers for large knowledge bases such as
Wikidata.
Pretrained with the internet corpora, LLMs are
already familiar with the syntax of formal query
languages such as SQL (Hu et al., 2022; Poesia
et al., 2022; Li et al., 2023; An et al., 2023; Nan
et al., 2023; Arora et al., 2023). When given simple
SQL schemas, they can perform zero-shot semantic
parsing of simple natural language queries into for-
mal queries. Unlike Freebase, the KB used in most
of the KBQA semantic parsing research, Wikidata
does not have a pre-defined schema, making it a
much harder problem. It has 150K domains, 3K
applicable properties, and 107M entities, each of
the properties and entities are uniquely identified
with PIDs and QIDs, respectively. While zero-shot
LLMs can generate SPARQL queries for the easiest
and most common questions, they do not know all
the PIDs and QIDs, nor is it possible to include
them in a prompt.
This paper presents WikiSP, a few-shot
sequence-to-sequence semantic parser for Wikidata
that translates a user query, along with results from
an entity linker, directly into SPARQL queries. To
handle the 100M+ entities in Wikidata, we train
the parser to use either the entity linker results or a
mention in the query; to handle the 150K domains
and 3K applicable properties, we modify SPARQL
to use domain and property names instead of their
unique QIDs and PIDs, respectively. We fine-tune
a LLaMA (Touvron et al., 2023) with a few-shot
training set along with the instructions used to fine-
tune Alpaca (Taori et al., 2023).
1.2 A New Dataset: WikiWebQuestions
Most of the widely-used high-quality benchmarks
for KBQA are based on Freebase (Bollacker et al.,
2008) which has been shut down since 2015. With
outdated knowledge, it is hard to compare the re-
sults with modern LLMs such as GPT-3, since an-
swers have changed over time for most of the ques-
tions. Wikidata, despite being the largest and most
popular knowledge base nowadays, has very few
datasets annotated with SPARQL queries; they are
either extremely small (Usbeck et al., 2017) or syn-
thetic (Saha et al., 2018).
We migrated the popular WebQuestionsSP (Yih
et al., 2016) benchmark from Freebase to Wiki-
data, with updated SPARQL and up-to-date an-
swers from the much larger Wikidata.

1.3 Complementing Large Language Models
Trained on Wikipedia and all of the internet, LLMs
can answer many questions directly. Unfortunately,
the user cannot tell if the answers are correct, thus
requiring them to fact-check every answer.
Unlike humans, GPT-3 always sounds definitive
even when they are wrong by providing specific
and plausible facts. For example, on the question
“what is the biggest country in Europe by popula-
tion?”, GPT-3 answers “Germany”, when the an-
swer is “Russia”. Or, on the question, “where does
the name Melbourne come from?” GPT-3 answers
“Melbourne comes from the Latin word ‘melbur-
num’ meaning ‘blackburn’ or ‘blackbird’.”, but in
reality, Melbourne is named after William Lamb,
2nd Viscount Melbourne. It is not possible to tell
when GPT-3’s answers are wrong, and every an-
swer needs to be fact-checked.
Semantic parsers can be used to complement
LLMs as they are interpretable; their results are
grounded in Wikidata, which we assume to be cor-
rect. It is possible for semantic parsers to misun-
derstand a query, but by providing the answer in
the context of the query, the user can spot the error.
We propose getting the best of both worlds by
answering the question with WikiSP if possible.
Otherwise, we report GPT-3’s guesses by prefac-
ing it with: “GPT-3 guesses that” (Figure 1). In
this way, the user can have full confidence with
the answers from the former, while also benefiting
from the latter. It is easier for users to fact-check
an answer than trying to find the answer.
1.4 Contributions
WikiWebQuestions, a high-quality semantic
parsing dataset for Wikidata, migrated from the
popular WebQuestions dataset for Freebase.
WikiSP, a few-shot Seq2Seq semantic parser
by fine-tuning LLaMA with a few shot training set.
We improve the learnability of SPARQL queries by
replacing the IDs of properties and domains with
their unique names; we tolerate errors in entity
linking by accepting mentions in the queries as
entities. We establish a first, strong baseline of
76% and 65% answer accuracy for the dev set and
test set of our new WikiWebQuestions benchmark,
respectively. We also demonstrate that our method
surpasses the state of the art for QALD-7 wikidata
set by 3.6% in F1 score.
We improve GPT-3’s trustworthiness by first
returning interpretable results from semantic parser
and backing it up with GPT-3 guesses. WikiSP can
provide verifiable results for WikiWebQuestions
76% of the time and improves the guesses by GPT-
3, resulting in errors only 4% of the time (Figure 2).
2 Related Work

2.1 KBQA
The KBQA task aims to make large knowledge
bases accessible by natural language. One com-
mon approach is semantic parsing where a natural
language query is translated into a formal logical
form, which is then executed to retrieve an answer
from the knowledge base. To handle large KBs,
one method is to formulate SP as a multi-staged
search problem by retrieving entities and expanding
the graphs according to the relationships between
their properties and the query (Yih et al., 2015,
2016; Luo et al., 2018). Lan and Jiang (2020) add
constraints to the staged query graph generation
method. Another popular method is to use seq2seq
models obtained by fine-tuning pretrained language
models. Das et al. (2021) first find other queries
that contain semantically similar subparts, and con-
struct a new logical form by combining the similar
subparts of the found queries. Ye et al. (2022)
search over the KB based on predefined rules to de-
rive a set of candidate logical forms, rank them, and
generate the final logical form. Cao et al. (2022b)
first generate a “sketch” program and then fill in
its arguments. Gu and Su (2022) use dynamic pro-
gram induction to generate query structures. Based
on a user query, Shu et al. (2022) retrieve entities,
example logical forms, and related schema. Unlike
FreeBase, Wikidata does not have a fixed schema.
Another approach to KBQA is based on graph
retrieval (Dong et al., 2015; Miller et al., 2016; Sun
et al., 2018, 2019; Mavromatis and Karypis, 2022;
Sen et al., 2021; Vivona and Hassani, 2019; Verga
et al., 2021). It predicts the answers directly within
the subgraph extracted based on the topic entity
in the question. Yu et al. (2023) combine seman-
tic parsing with retrieval and achieve the state-of-
the-art on the WebQuestionsSP dataset (Yih et al.,
2016). However, retrieval-based methods cannot
handle entire categories of questions, such as ques-
tions with no available answer and questions like
“the tallest mountain” where no entities are men-
tioned by name. They have poor interpretability
and do not support query optimization.

2.2 KBQA Benchmarks
Most of the early KBQA benchmarks are based on
Freebase (Berant et al., 2013; Yih et al., 2016; Tal-
mor and Berant, 2018). Recently, new benchmarks
have been created for Wikidata (Cao et al., 2022a;
Saha et al., 2019). However, these benchmarks are
created using rule-based synthesis or paraphrases,
which are easier for semantic parsers. CSQA col-
lects human-written questions for single triples and
constructs complex questions using fixed rules with
very limited natural language variety (Saha et al.,
2019). KQA Pro first synthesizes queries with
canonical natural language and then crowdsources
human paraphrases (Cao et al., 2022a). Campagna
et al. (2019) show that a model can achieve signifi-
cantly higher accuracy over paraphrased data com-
pared to real-world data even for untrained queries.
Thus, we base our WikiWebQuestions dataset on
WebQuestionsSP (Yih et al., 2016), where data are
collected from real-world users using the Google
Suggest API.
2.3 LLMs for Semantic Parsing
Shin et al. (2021) show the promise of few-shot
prompting LLMs for semantic parsing. They use
constrained decoding to enforce the syntax of the
formal language, and achieve comparable results
with a smaller fine-tuned BART model (Lewis et al.,
2020) on datasets with small database schemas.
Rubin et al. (2022) fine-tune a small retriever to
obtain the most relevant few-shot examples to use
for each input. Niu et al. (2023) use a few-shot
prompted Codex model to break down the natural
language input to make the task easier for a smaller
semantic parser. LLMs have also been applied to
semantic parsing on relational databases (Hu et al.,
2022; Poesia et al., 2022; Li et al., 2023; An et al.,
2023; Nan et al., 2023; Arora et al., 2023). The
schemas used in these projects are very small when
compared to Wikidata.
2.4 Entity Linking
Entity linking involves finding the named entities
in a query, and linking them to the corresponding
entities in the knowledge graph so that the query
can be executed using the proper entities as ref-
erence points. The current state-of-the-art entity
linking model on the WebQuestionsSP dataset is
ReFinED (Ayoola et al., 2022). They use a bidirec-
tional transformer on the query to predict the most
likely mentions of named entities within a query,
and then combine that information with embed-
dings computed over every entity in the knowledge
base to predict which entity the mention is most
likely to be referring to. Prior to ReFinED, the
state-of-the-art was ELQ (Li et al., 2020). They
similarly generate embeddings for each entity in
the knowledge base, and then use the predicted
mentions of entities combined with these predicted
embeddings to generate likely entities.
3 Semantic Parsing for Wikidata
Wikidata is the largest public knowledge base
with over 12 billion facts represented by subject-
predicate-object triples using 100+ million entities
and 10,000 properties. 3,000 of the properties are
useful for answering natural language questions,
whereas the rest are used to link data in Wikidata
with external library catalogs and database IDs.
Entities and properties are given unique identi-
fiers, QIDs and PIDs, respectively. For example,
the fact that Joe Biden is the president of the US can
be represented as a triple (Q6279, P39, Q11696),
where P39 is the PID for property position held,
Q6279 and Q11696 are QIDs for Joe Biden and the
president of the United States, respectively.
3.1 Query Format
Unlike relational databases and Freebase, Wikidata
has no predefined domains or types. Any entity
can have an arbitrary set of properties. However,
even though Wikidata is property-based, all named
entities have one or more “instance of” properties to
some domain entity; domain entities are organized
into a hierarchy with the “subclass of” property.
Note that the names of domain entities and prop-
erties are unique. Non-domain entities, on the other
hand, can be ambiguous. For example, “Lincoln”
can refer to the president, a car brand, a sparrow,
an aircraft, and many different cities.
We posit that it is impossible for LLMs to memo-
rize the QIDs and PIDs for domains and properties.
We modify the format of SPARQL queries to use
the more mnemonic property name, instead of its
PID. Similarly, we use entity names for domains.
For example, the original SPARQL for the query
“What car models does GM make?” is
SELECT DISTINCT ?x WHERE {
?x wdt:P31/wdt:P279* wd:Q3231690.
?x wdt:P176 wd:Q81965. }
This says that we are seeking x, where x
is transitively either an instance of (wdt:P31) or a subclass of (wdt:P279) of an automobile
model (wd:Q3231690), and x has General Mo-
tors (wd:Q81965) as the manufacturer (wdt:P176).
Note wdt is the prefix for Wikidata property, and
wd is for Wikidata entity.
With our modification, the query becomes:
SELECT DISTINCT ?x WHERE {
?x wdt:instance_of/wdt:subclass_of*
wd:automobile_model.
?x wdt:manufacturer wd:Q81965. }
For non-domain entity QIDs, we also accept a
string in lieu of a QID in case of entity linking
errors. At inference time, we use simple heuristics
to resolve the string to a QID before applying the
query. For example, “wd:Q81965” in the query
may be replaced with “wd:GM”. See Section 3.2.2
for more details.
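A minimal sketch of such a lookup, here using the Wikidata “wbsearchentities” API that Section 5.2 mentions; treating the first hit as the most popular entity is an assumption, since the paper does not spell out the exact heuristic.

import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def mention_to_qid(mention: str):
    # Resolve a mention such as "GM" to a QID before executing the query.
    params = {
        "action": "wbsearchentities",
        "search": mention,
        "language": "en",
        "format": "json",
    }
    hits = requests.get(WIKIDATA_API, params=params).json().get("search", [])
    # Assumption: the top-ranked hit approximates the "most popular" entity.
    return hits[0]["id"] if hits else None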
Normally, we refrain from changing standard
query notations since LLMs have been pretrained
on them. However, we posit that learning this new
syntax is much easier than learning the PIDs and
QIDs. Our experimentation with few-shot prompt-
ing suggests that LLMs can easily adjust to this
format.
3.2 Entity Linking
Linking entities for WikiWebQuestions is particu-
larly difficult. First, since the dataset is collected
from real-world questions without prompting the
users for more information, users tend to refer to
their entities of interest without using their full
names. Second, the questions are generally short
with very limited context, making it harder to
disambiguate among entities with similar names.
Lastly, many QIDs in Wikidata are used to repre-
sent terms not generally known as “named entities”.
For example, domain entities are often ignored by
entity linker models, as in “What is the biggest
country in Europe by population?”, both “country”
(Q6256) and “Europe” (Q46) are required to con-
struct the correct SPARQL, but entity linkers only
provide “Europe” and ignore “country”.
3.2.1 Semantic Parsing with Entity Linking
To handle ambiguous entities, we use an entity
linker to first find the domain names and QIDs
of the entities mentioned in the text. We train a
semantic parser that accepts users’ input along with
the results produced by the entity linker.
Formally, given a user input T , and a set of entity
linker results ⟨e, q⟩, where e is the name (default
label) Wikidata gives to an entity and q is its QID,
the semantic parser produces the semantic parse of
T in our modified SPARQL format.
For the example above, the SOTA ReFinED en-
tity linker (Ayoola et al., 2022) returns {⟨General
Motors, Q81965⟩}. Unfortunately, it misses the
entity automobile model (Q3231690), a term not
usually considered to be an entity.
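A sketch of how the entity-linker output and the utterance might be serialized into one parser input, following the description above and in Section 5.1 (mention, QID, and domain in plain text, concatenated with the utterance); the exact separator and field order here are assumptions.

def build_parser_input(utterance: str, linked_entities: list) -> str:
    # linked_entities: (mention, QID, domain) triples from the entity linker,
    # e.g. [("GM", "Q81965", "automobile manufacturer")]  (example values illustrative).
    entity_text = "; ".join(f"{m} ({qid}, {domain})" for m, qid, domain in linked_entities)
    # The paper concatenates the resolved entities with the user utterance;
    # the formatting below is illustrative only.
    return f"entities: {entity_text} | question: {utterance}"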
3.2.2 Recovering from Entity Linker Errors
We want our semantic parser to be able to recover
from mistakes by an entity linker. That is, the
semantic parser should use entity linking when it
is helpful, but it should still try to predict the right
logical form when the linker fails.
The semantic parser is trained to accept, along
with the user query, an optional set of potentially
useful QIDs from the entity linker. We include
samples where some of the supplied linked entities
are not used in the gold answer, as well as samples
where there are missing linked entities. For the
latter, we use mentions in the original query in lieu
of the QIDs. At inference time, we use the men-
tions to look up the QIDs in Wikidata. If multiple
matches exist, the most popular entity is returned.
An example is shown in Appendix A.
With the above example where the entity linker
misses “automobile model”, the semantic parser is
likely to predict “car model” by copying from the
user query. We search “automobile model” among
aliases in domains to find the correct QID. This
design allows the model to potentially recover from
entity-linking failures.
4 WikiWebQuestions (WWQ) Dataset
Despite being the most popular large knowledge
base for a long time, existing benchmarks on Wiki-
data with labeled SPARQL queries are unfortu-
nately either small or of low quality. On the
other hand, benchmarks over the deprecated Free-
base still dominate the KBQA research with better-
quality data. For example, the WebQuestions (Yih
et al., 2015) dataset was collected by using Google
Search API instead of human paraphrasing or syn-
thesis. As a result, it is much more natural and
truly reflects the real-world questions users may
ask. This dataset is later annotated with SPARQL
over Freebase, named WebQuestionsSP (Yih et al.,
2016). Examples with no legitimate SPARQL to
retrieve answers from Freebase are dropped. In
total, WebQuestionsSP consists of 3098 examples
in the training set and 1639 in the test set.

We migrated WebQuestionsSP, the best collec-
tion of natural language questions over a general
knowledge graph, from Freebase to Wikidata, with
the help of an automatic tool we developed, based
on Google’s entity mapping 2 and Wikidata’s re-
lation mapping 3 . About 60% of the dataset was
automatically converted. One of the authors of
this paper, who did not participate in model tuning,
manually converted those instances that failed to
convert automatically.
4.1 Migrating WebQuestionsSP to Wikidata
Here are the major decisions we made in migrating
WebQuestionsSP dataset to Wikidata. While much
bigger, Wikidata does not necessarily contain all
the information available in Freebase. For example,
it lacks countries’ trade partners, hence we drop all
such questions from the WebQuestionsSP dataset.
If multiple paths can lead to the correct answer,
we choose the path that provides the most com-
plete answers and has the best availability among
entities in the same domain. For example, when
asking for books written by an author X, we can
either search for books whose author is X or find
notable works of X that are books. While the latter
is more efficient, the property notable works is not
always available for all authors and it often does
not provide a complete list. Thus, we annotate such
examples using the former representation.
We also cleaned up the original dataset. The
dataset contained questions like “who does Ronald-
inho play for now in 2011?”. We drop the appended
year as it conflicts with “now” in the utterance, and
it would refer to the live information in Wikidata.
In total, we dropped 9% of the examples from
WebQuestionsSP and created a training, dev, and
test set of 2431, 454, and 1431 samples, respec-
tively. Given that Wikidata has 100 million entities
and 3,000 useful properties for answering ques-
tions, the training data set is woefully inadequate
and can be considered as a “fewshot” training set
at best.
5 Implementation

This section discusses the implementation details of the entity linker and the WikiSP semantic parser.

5.1 Entity Linking

We use ReFinED (Ayoola et al., 2022) for entity linking, which is the current state of the art for WebQuestionsSP. As discussed before, Wikidata treats many common terms such as “country” as named entities and assigns them QIDs. To fine-tune ReFinED to learn such terms, we add the question and entity pairs from the training set of WikiWebQuestions to the data used to train ReFinED's questions model.

We run 10 epochs of finetuning using the default hyperparameters suggested by Ayoola et al. (2022). For each identified entity, we provide the mention in the original utterance, the QID, as well as its domain in plain text. The information is appended to the utterance before being fed into the neural semantic parsing model.

5.2 The WikiSP Semantic Parser

We prepare the training data with entities provided by fine-tuned ReFinED. Comparing with the gold entities, ReFinED provides extra entities in 215 cases, while missing at least one entity in 137 cases. When ReFinED failed to produce the correct entities, we replace the missing QIDs in the logical form with the corresponding mention of the entity in the question. During evaluation, if a mention of an entity is predicted by the model, we look up the QID using the Wikidata “wbsearchentities” API 4.

We fine-tune LLaMA with 7B parameters because it has been shown to perform well despite its relatively small size (Touvron et al., 2023). We include the Alpaca (Taori et al., 2023) instruction following data, which was derived using the self-instruct (Wang et al., 2023) method, in our training data. The training data samples in WikiWebQuestions start with the following instruction: “Given a Wikidata query with resolved entities, generate the corresponding SPARQL. Use property names instead of PIDs.”. We concatenate the resolved entities and the user utterance together as input. We up-sample the WikiWebQuestions few-shot set 5 times and train for 3 epochs using a 2e-5 learning rate and 0.03 warmup ratio.

5.3 Executing Queries on Wikidata

SPARQL queries are used to retrieve answers from the Wikidata SPARQL endpoint 5. Since Wikidata is actively being updated, the gold SPARQL can be easily re-executed to acquire up-to-date answers, allowing the benchmark to compare with forthcoming iterations of large language models.

2 https://developers.google.com/freebase
3 https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping
4 https://www.wikidata.org/w/api.php?action=wbsearchentities
5 https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service

WikiSP (ours): EM 65.5, F1 71.9
Table 1: Results of WikiSP on the WWQ test set.
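To make the fine-tuning setup in Section 5.2 concrete, here is a rough sketch of how the training mixture might be assembled; the Alpaca-style instruction/input/output record layout and the helper names are assumptions, while the instruction string, the 5x up-sampling, and the mixing with Alpaca data come from the paper.

INSTRUCTION = ("Given a Wikidata query with resolved entities, generate the "
               "corresponding SPARQL. Use property names instead of PIDs.")

def make_record(entities: str, utterance: str, sparql: str) -> dict:
    # One training sample: entity-linker output + utterance in, modified SPARQL out.
    return {"instruction": INSTRUCTION,
            "input": f"{entities} {utterance}",
            "output": sparql}

def build_training_mixture(wwq_fewshot: list, alpaca_data: list) -> list:
    # Up-sample the WikiWebQuestions few-shot set 5 times and mix it with the
    # Alpaca instruction-following data, as described above.
    return alpaca_data + wwq_fewshot * 5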
6 Experiments
In this section, we evaluate WikiSP on WikiWeb-
Questions and demonstrate how it can be used to
complement large language models such as GPT-3.
6.1 Semantic Parser Results
We evaluate our model with two different answer
accuracy metrics: (1) exact match (EM): the per-
centage of examples where the answers of the pre-
dicted SPARQL exactly match the gold answers,
and (2) Macro F1 score (F1): the average F1 score
for answers of each example. The evaluation re-
sults are shown in Table 1. Our approach achieves a
65.5% exact match accuracy and a 71.9% F1 score
on the WWQ dataset.
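A small sketch of the two metrics as defined here, computing exact match over answer sets and the F1 score averaged per example; this is an assumed implementation rather than the authors' evaluation script.

def f1(pred: set, gold: set) -> float:
    # Per-example F1 between predicted and gold answer sets.
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def evaluate(predictions: list, golds: list):
    # Exact match: the answers of the predicted SPARQL exactly match the gold answers.
    em = sum(p == g for p, g in zip(predictions, golds)) / len(golds)
    # Macro F1: average per-example F1 score.
    macro_f1 = sum(f1(p, g) for p, g in zip(predictions, golds)) / len(golds)
    return em, macro_f1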
As a reference, the current state-of-the-art result
on the original WebQuestionsSP dataset for Free-
base is 78.8% F1 (Yu et al., 2023). The result was
obtained with a combination of semantic parsing
and retrieval. The WikiWebQuestions dataset is
slightly different, as discussed above. More signifi-
cantly, unlike Freebase, Wikidata does not have a
fixed schema and ours is an end-to-end, seq2seq
semantic parser.
6.2 Ablation Experiments
6.2.1 Entity Linking
Our first ablation study evaluates the need for entity
linking with ReFinED, by replacing it with simply
using the LLM to detect entities as mentions. In
this experiment, all entity IDs in the training data
are replaced by their mentions; during inference,
we map the predicted entities to their actual QIDs
according to Section 3.2.2.
The results show that replacing the neural entity
linker with just using mentions reduces the exact
match by 9.1% and the F1 score by 9.3%. This
suggests that entity linking is important.
6.2.2 Allowing Mentions as Entities
Our logical form is designed to recover from entity
linking errors by allowing entities to be specified by a mention, as an alternative to a QID.

WikiSP (ours): EM 75.6, F1 76.9
No Entity Linking: EM 66.5, F1 67.6
No mentions, trained with ReFinED: EM 73.3, F1 75.0
No mentions, trained with Oracle entities: EM 72.2, F1 73.4
PIDs and QIDs for properties & domains: EM 73.6, F1 74.7
Table 2: Ablation results of WikiSP on the WWQ dev set.

Our ablation
study on this feature tested two training strategies:
ReFinED. The entity linker tuples are produced by
fine-tuned ReFinED, which may be missing entities
in the gold target. The data show that generating
unseen QIDs is needed for missing entities.
Oracle. The entity linker tuples are exactly all
the entities used in the gold. The model would
only encounter missing QIDs at test time when
ReFinED fails to generate all the necessary QIDs.
The answer accuracy of the model using en-
tity linked tuples from ReFinED (“No mentions,
trained with ReFinED” in Table 2) lags by 2.3%
when compared against our best model. The model
using Oracle (“No mentions, trained with Oracle
entities” in Table 2) lags by 3.4%. These results
indicate that allowing mentions is useful for recov-
ering from entity linking errors.
6.2.3 Names vs. IDs for Properties & Domains
Our logical form replaces PIDs with property
names, and domain-entity QIDs with the domain
names. Here we evaluate the effectiveness of this
query format. We compare our approach with the
original SPARQL where all properties and entities
are represented with PIDs and QIDs. Our ablation
study shows that our representation with property
names and domain names improves the answer ac-
curacy by 2.0% (Table 2). This shows that LLMs
can adapt to changes in query notation with fine-
tuning, and it is easier to learn names than remem-
bering random IDs. If we did not allow mentions in
the predicted logical form, the replacement of QIDs
with their names is likely to be more significant.
6.3 Complementing GPT-3
LLMs like GPT-3 can answer many questions on
general knowledge correctly; however, they may
also hallucinate. WWQ is representative of popular
questions, so we expect GPT-3 to perform well. We
use text-davinci-002 with the temperature set to 0
to evaluate GPT-3’s performance on WWQ.
On the dev set of WWQ, GPT-3 answers 66.4% of the questions correctly and provides incomplete
answers to 26.5% of the questions. For example,
when asked “What does Obama have a degree in?”,
GPT-3 correctly identifies President Obama’s po-
litical science degree, but fails to mention his law
degree. In total, GPT-3 gives wrong answers to
7.1% of the questions.
For this dev set, we can give definitive answers to
75.6% of the questions with WikiSP (Table 2). For
the rest of the questions (24.4%), accounting for
the overlap between the GPT-3 and our semantic
parser’s results, the percentages of guessing cor-
rectly, incompletely, and incorrectly are at 15.2%,
5.5%, and 3.7%, respectively (Figure 2).
In summary, the combination of GPT-3 and
WikiSP makes it possible to give a definitive, cor-
rect, and complete answer three quarters of the time
for the dev set. Users can also benefit from GPT-
3’s guesses the rest of the time at a 3.7% error rate,
which is about half of the original error rate.
6.4 Error Analysis
We analyzed the 111 examples in the WWQ dev
set where the model failed.
6.4.1 Acceptable Alternative Results (18.0%)
Our analysis shows that 18.0% of the “errors” can
actually be deemed to be correct.
Reasonable alternate answers (11.7%). In
11.7% of the cases, the model predicts an alterna-
tive interpretation to the question and returns a rea-
sonable answer that is different from the gold. For
example, the gold for question “what did Boudicca
do?” uses the position held property, while the
model predicts occupation property. Both are con-
sidered valid answers to the question.
Reasonable alternative SPARQL but no an-
swer was retrieved (6.3%). In another 6.3%
of cases, the model predicts a reasonable alterna-
tive SPARQL, but the SPARQL returns no answer.
Sometimes, since the information for the “correct”
property is missing, the question is represented
with a similar property. For example, since resi-
dence property is missing for Patrick Henry, the
gold SPARQL for “where did Patrick Henry live?”
uses place of birth instead, while our model pre-
dicts residence.
6.4.2 Errors in Entity Linking (35.1%)
The biggest source of errors is entity linking. En-
tity linker failed to provide the correct entities in
35.1% of the failed examples. While WikiSP can
potentially recover from missing entities, it can-
not recover from incorrect entities. This is espe-
cially common for character roles, as some char-
acter roles have different entities for books and
movies or even different series of movies. Some-
times WikiSP located the correct mention from
the question, but the lookup failed. For example,
the model located the mention of the event “allied
invasion of France” in question “where did the al-
lied invasion of France take place?”, but failed to
find the corresponding entity from Wikidata by the
name.
6.4.3 Errors Beyond Entity Linking
Semantic parsing in Wikidata is challenging as
there are no predefined schemas, and there are
150K domains and 3K applicable properties. Some
representative mistakes include the following:
Wrong property (17.1%). 17.1% of the er-
rors are caused by predicting the wrong property.
Some of the examples require background knowl-
edge to parse. For example the answer of the ques-
tion “what did martin luther king jr do in his life?”
should return the value of movement, while the
model predicts occupation. Properties are a chal-
lenge in Wikidata because, as illustrated here, which
property to predict depends on the entity itself.
Missing domain constraint (5.4%). Another
common problem is missing the domain constraint.
For example, the model correctly identifies that
property shares border with should be used for
question “what countries are around Egypt?”. How-
ever, it does not limit the answer to countries only,
thus extra entities are returned.
7 Experiment with QALD-7
For another evaluation of WikiSP, we apply our
model on Task 4 from QALD-7 (Usbeck et al.,
2017) dataset. QALD-7 is part of the QALD (Ques-
tion Answering over Linked Data) which is a series
of challenges started in 2011 known for their com-
plex, manually created questions. It mainly focuses
on DBpedia, but QALD-7’s Task 4 is engineered
for Wikidata. The task includes 100 train examples,
which we use to fine-tune our model and 50 test
examples. There is no dev set.
We choose QALD-7 as it is a manually crafted
dataset with complex questions. We avoid datasets
built on synthetic or human-paraphrased data, such
as CSQA (Saha et al., 2018) and KQA-Pro (Cao
et al., 2022a). As they have limited natural lan-
guage variety between the training and evaluation data, models can get artificially high accuracy. For example, a simple BART based model can achieve over 90% accuracy on KQA-Pro even without an entity linking module (Cao et al., 2022a).

STAGG (Yih et al., 2016): EM -, F1 19.0
GGNN (Sorokin and Gurevych, 2018): EM -, F1 21.3
WDAqua (Diefenbach et al., 2017): EM -, F1 40.0
WikiSP (Ours): EM 38.0, F1 43.6
Table 3: Evaluation results of WikiSP on QALD-7 Task 4 and comparison with prior work.
The QALD-7 test set provides both the SPARQL
queries as well as the answers. To double-check the
correctness of the QALD-7 dataset, we applied the
50 gold queries of the test set to Wikidata and found
that 4 did not return an answer. We hypothesize
that the discrepancy is caused by the change in
Wikidata structure/quantity of information. We
evaluate WikiSP by comparing the answers where
possible, and by comparing the generated SPARQL
syntactically otherwise.
For this experiment, we use the same hyper-
parameters and data format as described in Sec-
tion 5.3. In addition to the training data for WikiSP,
we also include the QALD-7 train samples, upsam-
pled 20 times.
7.1 QALD-7 Results
Our model achieves 38% accuracy on the QALD-7
dataset and outperforms the F1 score of the state-
of-the-art WDAqua (Diefenbach et al., 2017) by
3.6%, as shown in Table 3. Note that WDAqua
is based on retrieval, whereas WikiSP is based on
sequence-to-sequence semantic parsing. QALD-
7 (Usbeck et al., 2017) reports WDAqua as the
winner of the leaderboard with 55.2 F1; however, the authors of WDAqua reported 40.0 F1 in their own paper (Diefenbach et al., 2017).
7.2 Complementing GPT-3 on QALD-7
Similar to WWQ, we also assess the combination
of GPT with WikiSP on QALD-7 as shown in Fig-
ure 3. The GPT model used was "text-davinci-002".
Since there is no validation set and the test set is
already very small, one of the authors who was
not involved in training or finetuning the model
evaluated GPT-3 on the test set.
GPT-3 is fully accurate on 62% of the questions,
20% incomplete, and 18% wrong. With our ap-
proach, we can provide 38% verifiably good an-
swers from WikiSP; the guesses of GPT-3 get an
additional 34% correct, 16% incomplete, and only 12% wrong.

Figure 3: Distribution of correct, incomplete, and incorrect answers for the QALD-7 test set, when GPT-3 is used alone and when combined with WikiSP. (GPT-3 only: 62% correct, 20% incomplete, 18% incorrect. WikiSP + GPT-3: 38% verified from WikiSP, 34% correct GPT guesses, 16% incomplete, 12% incorrect.)
7.3 Discussion
We did not conduct an error analysis on QALD-7
as it has no dev set. The author evaluating GPT-3
noted that the QALD-7 test set is much more
complicated than the training data (of just 100
samples), with most of the queries containing
multiple properties. This explains the lower
accuracy of WikiSP on QALD-7 compared to
WikiWebQuestions, whose few-shot training data
has a distribution similar to its test set.
This result suggests that the performance of
WikiSP depends heavily on good few-shot training
data for fine-tuning the LLM. We hypothesize
that we can improve the performance of WikiSP
on less popular questions with a better, possibly
synthesized, training dataset.
8 Conclusion
We have created a new high-quality benchmark,
WikiWebQuestions, for large knowledge-base ques-
tion answering. The dataset is based on the popular
WebQuestionsSP dataset with natural questions,
annotated with SPARQL for Wikidata.
We establish a first, strong baseline of 65% an-
swer accuracy and 72% F1 score for WikiWeb-
Questions. This is achieved by fine-tuning LLaMA
with a few-shot training data set using a SPARQL
query format modified for semantic parsing.
We show that we can reduce the hallucination
of large language models like GPT-3 by grounding
them with a semantic parser. On the dev set of
WikiWebQuestions, this combined approach provides
useful information for 96% of the questions; more
importantly, it generates verifiable answers for 76%
of them.
Limitations
While applications of large language models seem
to expand every day, this paper mainly focuses on
factoid question answering. Long-form text genera-
tion, for example, is outside the scope of the experi-
ments of this paper, but the methodology described
here may be extended to this setting in the future.
Even though knowledge bases are an important
source of facts, a large portion of the knowledge
available in digital form (e.g. Wikipedia, news arti-
cles, etc.) is not organized into knowledge bases.
As such, the results of this paper can be considered
complementary to the larger body of fact-checking
research based on free text.
Our semantic parser can be used to verify an-
swers from LLMs. However, this additional round
of running the semantic parser and querying Wiki-
data increases the response latency, which may be
noticeable to end-users of such systems.
All of our datasets and experiments are con-
ducted for English. Expanding to other languages,
while possible (Moradshahi et al., 2020), is outside
the scope of this work.
Our experiments were performed using GPT-3
(davinci-002) as that was what we had access to
when we started the project. Undoubtedly, later
LLMs will produce better results. Nonetheless,
the need for verifiable results based on live
database accesses will remain.
Acknowledgements
This work is supported in part by the National Sci-
ence Foundation, the Alfred P. Sloan Foundation,
the Verdant Foundation, Microsoft Azure AI credit,
KDDI, JPMorgan Chase, and the Stanford Human-
Centered Artificial Intelligence (HAI) Institute. We
also thank the reviewers for their valuable com-
ments and suggestions.
Ethical Considerations
LLMs are used by millions of people every day.
We hope that this line of work will help make them
more reliable for everyone, mitigating some of their
potential downsides, and giving users access to
more accurate information. Our use of Wikidata
will enable future researchers and developers to
connect their systems with a large, diverse and live
knowledge graph that is updated every day. We do
not anticipate any harm resulting from the methods
introduced in this work.
We did not crowdsource any datasets for this
paper, as the questions are converted from a previ-
ous dataset and all the re-annotation and analysis
is done by the authors.
To conduct experiments in this paper, we used
an estimated total of 60 NC96ads-A100 GPU hours
on Microsoft Azure. Each finetuning experiment
takes roughly 3 hours, and we conducted roughly
20 experiments to arrive at the results in this paper.
References
Shengnan An, Bo Zhou, Zeqi Lin, Qiang Fu, Bei Chen,
Nanning Zheng, Weizhu Chen, and Jian-Guang Lou.
2023. Skill-based few-shot selection for in-context
learning.
Aseem Arora, Shabbirhussain Bhaisaheb, Harshit
Nigam, Manasi Patwardhan, Lovekesh Vig, and Gau-
tam Shroff. 2023. Adapt and decompose: Efficient
generalization of text-to-sql via domain adapted least-
to-most prompting.
Tom Ayoola, Shubhi Tyagi, Joseph Fisher, Christos
Christodoulopoulos, and Andrea Pierleoni. 2022.
ReFinED: An efficient zero-shot-capable approach
to end-to-end entity linking. In Proceedings of the
2022 Conference of the North American Chapter of
the Association for Computational Linguistics: Hu-
man Language Technologies: Industry Track, pages
209–220, Hybrid: Seattle, Washington + Online. As-
sociation for Computational Linguistics.
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wen-
liang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei
Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan
Xu, and Pascale Fung. 2023. A multitask, multilin-
gual, multimodal evaluation of chatgpt on reasoning,
hallucination, and interactivity.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy
Liang. 2013. Semantic parsing on Freebase from
question-answer pairs. In Proceedings of the 2013
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 1533–1544, Seattle, Wash-
ington, USA. Association for Computational Linguis-
tics.
Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim
Sturge, and Jamie Taylor. 2008. Freebase: A col-
laboratively created graph database for structuring
human knowledge. In Proceedings of the 2008 ACM
SIGMOD International Conference on Management
of Data, SIGMOD ’08, page 1247–1250, New York,
NY, USA. Association for Computing Machinery.
Giovanni Campagna, Rakesh Ramesh, Silei Xu,
Michael Fischer, and Monica S. Lam. 2017. Al-
mond: The architecture of an open, crowdsourced,
privacy-preserving, programmable virtual assistant.
In Proceedings of the 26th International Conference
on World Wide Web - WWW ’17, pages 341–350, New
York, New York, USA. ACM Press.
Giovanni Campagna, Silei Xu, Mehrad Moradshahi,
Richard Socher, and Monica S. Lam. 2019. Genie:
A generator of natural language semantic parsers for
virtual assistant commands. In Proceedings of the
40th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI 2019,
page 394–410, New York, NY, USA. Association for
Computing Machinery.
Shulin Cao, Jiaxin Shi, Liangming Pan, Lunyiu Nie,
Yutong Xiang, Lei Hou, Juanzi Li, Bin He, and Han-
wang Zhang. 2022a. KQA pro: A dataset with ex-
plicit compositional programs for complex question
answering over knowledge base. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 6101–6119, Dublin, Ireland. Association for
Computational Linguistics.
Shulin Cao, Jiaxin Shi, Zijun Yao, Xin Lv, Jifan Yu,
Lei Hou, Juanzi Li, Zhiyuan Liu, and Jinghui Xiao.
2022b. Program transfer for answering complex
questions over knowledge bases. In Proceedings
of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 8128–8140, Dublin, Ireland. Association for
Computational Linguistics.
Rajarshi Das, Manzil Zaheer, Dung Thai, Ameya God-
bole, Ethan Perez, Jay Yoon Lee, Lizhen Tan, Lazaros
Polymenakos, and Andrew McCallum. 2021. Case-
based reasoning for natural language queries over
knowledge bases. In Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Pro-
cessing, pages 9594–9611, Online and Punta Cana,
Dominican Republic. Association for Computational
Linguistics.
Dennis Diefenbach, Kamal Singh, and Pierre Maret.
2017. Wdaqua-core0: A question answering compo-
nent for the research community. In Semantic Web
Evaluation Challenge, pages 84–89. Springer.
Li Dong, Furu Wei, Ming Zhou, and Ke Xu. 2015. Ques-
tion answering over Freebase with multi-column con-
volutional neural networks. In Proceedings of the
53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint
Conference on Natural Language Processing (Vol-
ume 1: Long Papers), pages 260–269, Beijing, China.
Association for Computational Linguistics.
Jerome Goddard. 2023. Hallucinations in chatgpt: A
cautionary tale for biomedical researchers. The Amer-
ican Journal of Medicine.
Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy
Liang, Xifeng Yan, and Yu Su. 2021. Beyond i.i.d.:
Three levels of generalization for question answering
on knowledge bases. In Proceedings of the Web
Conference 2021. ACM.
Yu Gu and Yu Su. 2022. ArcaneQA: Dynamic program
induction and contextualized encoding for knowl-
edge base question answering. In Proceedings of
the 29th International Conference on Computational
Linguistics, pages 1718–1731, Gyeongju, Republic
of Korea. International Committee on Computational
Linguistics.
Yushi Hu, Chia-Hsuan Lee, Tianbao Xie, Tao Yu,
Noah A. Smith, and Mari Ostendorf. 2022. In-
context learning for few-shot dialogue state tracking.
In Findings of the Association for Computational
Linguistics: EMNLP 2022, pages 2627–2643, Abu
Dhabi, United Arab Emirates. Association for Com-
putational Linguistics.
Yunshi Lan and Jing Jiang. 2020. Query graph gen-
eration for answering multi-hop complex questions
from knowledge bases. In Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, pages 969–974, Online. Association for
Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan
Ghazvininejad, Abdelrahman Mohamed, Omer Levy,
Veselin Stoyanov, and Luke Zettlemoyer. 2020.
BART: Denoising sequence-to-sequence pre-training
for natural language generation, translation, and com-
prehension. In Proceedings of the 58th Annual Meet-
ing of the Association for Computational Linguistics,
pages 7871–7880, Online. Association for Computa-
tional Linguistics.
Belinda Z. Li, Sewon Min, Srinivasan Iyer, Yashar
Mehdad, and Wen-tau Yih. 2020. Efficient one-pass
end-to-end entity linking for questions. In Proceed-
ings of the 2020 Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages
6433–6441, Online. Association for Computational
Linguistics.
Jinyang Li, Binyuan Hui, Ge Qu, Binhua Li, Jiaxi Yang,
Bowen Li, Bailin Wang, Bowen Qin, Rongyu Cao,
Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma,
Guoliang Li, Kevin C. C. Chang, Fei Huang, Reynold
Cheng, and Yongbin Li. 2023. Can llm already serve
as a database interface? a big bench for large-scale
database grounded text-to-sqls.
Kangqi Luo, Fengli Lin, Xusheng Luo, and Kenny Zhu.
2018. Knowledge base question answering via encod-
ing of complex query graphs. In Proceedings of the
2018 Conference on Empirical Methods in Natural
Language Processing, pages 2185–2194, Brussels,
Belgium. Association for Computational Linguistics.
Costas Mavromatis and George Karypis. 2022. ReaRev:
Adaptive reasoning for question answering over
knowledge graphs. In Findings of the Association
for Computational Linguistics: EMNLP 2022, pages
2447–2458, Abu Dhabi, United Arab Emirates. As-
sociation for Computational Linguistics.
Alexander Miller, Adam Fisch, Jesse Dodge, Amir-
Hossein Karimi, Antoine Bordes, and Jason Weston.
2016. Key-value memory networks for directly read-
ing documents. In Proceedings of the 2016 Con-
ference on Empirical Methods in Natural Language Processing, pages 1400–1409, Austin, Texas. Associ-
ation for Computational Linguistics.
Mehrad Moradshahi, Giovanni Campagna, Sina Sem-
nani, Silei Xu, and Monica Lam. 2020. Localizing
open-ontology QA semantic parsers in a day using
machine translation. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing (EMNLP), pages 5970–5983, Online. As-
sociation for Computational Linguistics.
Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu
Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, and
Dragomir Radev. 2023. Enhancing few-shot text-to-
sql capabilities of large language models: A study on
prompt design strategies.
Yilin Niu, Fei Huang, Wei Liu, Jianwei Cui, Bin Wang,
and Minlie Huang. 2023. Bridging the Gap between
Synthetic and Natural Questions via Sentence De-
composition for Semantic Parsing. Transactions
of the Association for Computational Linguistics,
11:367–383.
Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari,
Gustavo Soares, Christopher Meek, and Sumit Gul-
wani. 2022. Synchromesh: Reliable code generation
from pre-trained language models. In The Tenth In-
ternational Conference on Learning Representations,
ICLR 2022, Virtual Event, April 25-29, 2022. Open-
Review.net.
Ohad Rubin, Jonathan Herzig, and Jonathan Berant.
2022. Learning to retrieve prompts for in-context
learning. In Proceedings of the 2022 Conference of
the North American Chapter of the Association for
Computational Linguistics: Human Language Tech-
nologies, pages 2655–2671, Seattle, United States.
Association for Computational Linguistics.
Amrita Saha, Ghulam Ahmed Ansari, Abhishek Laddha,
Karthik Sankaranarayanan, and Soumen Chakrabarti.
2019. Complex program induction for querying
knowledge bases in the absence of gold programs.
Transactions of the Association for Computational
Linguistics, 7:185–200.
Amrita Saha, Vardaan Pahuja, Mitesh Khapra, Karthik
Sankaranarayanan, and Sarath Chandar. 2018. Com-
plex sequential question answering: Towards learn-
ing to converse over linked question answer pairs
with a knowledge graph. In Proceedings of the AAAI
conference on artificial intelligence, volume 32.
Priyanka Sen, Armin Oliya, and Amir Saffari. 2021.
Expanding end-to-end question answering on differ-
entiable knowledge graphs with intersection. In Pro-
ceedings of the 2021 Conference on Empirical Meth-
ods in Natural Language Processing, pages 8805–
8812, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Richard Shin, Christopher Lin, Sam Thomson, Charles
Chen, Subhro Roy, Emmanouil Antonios Platanios,
Adam Pauls, Dan Klein, Jason Eisner, and Benjamin
Van Durme. 2021. Constrained language models
yield few-shot semantic parsers. In Proceedings of
the 2021 Conference on Empirical Methods in Natu-
ral Language Processing, pages 7699–7715, Online
and Punta Cana, Dominican Republic. Association
for Computational Linguistics.
Yiheng Shu, Zhiwei Yu, Yuhan Li, Börje Karlsson,
Tingting Ma, Yuzhong Qu, and Chin-Yew Lin. 2022.
TIARA: Multi-grained retrieval for robust question
answering over large knowledge base. In Proceed-
ings of the 2022 Conference on Empirical Methods
in Natural Language Processing, pages 8108–8121,
Abu Dhabi, United Arab Emirates. Association for
Computational Linguistics.
Daniil Sorokin and Iryna Gurevych. 2018. Modeling se-
mantics with gated graph neural networks for knowl-
edge base question answering. In Proceedings of the
27th International Conference on Computational Lin-
guistics, pages 3306–3317, Santa Fe, New Mexico,
USA. Association for Computational Linguistics.
Yu Su, Ahmed Hassan Awadallah, Madian Khabsa,
Patrick Pantel, Michael Gamon, and Mark Encar-
nacion. 2017. Building natural language interfaces to
web apis. In Proceedings of the 2017 ACM on Con-
ference on Information and Knowledge Management,
pages 177–186.
Haitian Sun, Tania Bedrax-Weiss, and William Cohen.
2019. PullNet: Open domain question answering
with iterative retrieval on knowledge bases and text.
In Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the
9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP), pages 2380–
2390, Hong Kong, China. Association for Computa-
tional Linguistics.
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn
Mazaitis, Ruslan Salakhutdinov, and William Cohen.
2018. Open domain question answering using early
fusion of knowledge bases and text. In Proceed-
ings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 4231–4242,
Brussels, Belgium. Association for Computational
Linguistics.
Alon Talmor and Jonathan Berant. 2018. The web as
a knowledge-base for answering complex questions.
In Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies,
Volume 1 (Long Papers), pages 641–651, New Or-
leans, Louisiana. Association for Computational Lin-
guistics.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
Dubois, Xuechen Li, Carlos Guestrin, Percy
Liang, and Tatsunori B. Hashimoto. 2023. Stanford
alpaca: An instruction-following llama model.
https://github.com/tatsu-lab/stanford_alpaca.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. 2023. Llama: Open and effi-
cient foundation language models. arXiv preprint
arXiv:2302.13971.
Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Bastian
Haarmann, Anastasia Krithara, Michael Röder, and
Giulio Napolitano. 2017. 7th open challenge on ques-
tion answering over linked data (qald-7). In Semantic
web evaluation challenge, pages 59–69. Springer.
Pat Verga, Haitian Sun, Livio Baldini Soares, and
William Cohen. 2021. Adaptable and interpretable
neural memory over symbolic knowledge. In Pro-
ceedings of the 2021 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
3678–3691, Online. Association for Computational
Linguistics.
Salvatore Vivona and Kaveh Hassani. 2019. Relational
graph representation learning for open-domain ques-
tion answering.
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa
Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh
Hajishirzi. 2023. Self-instruct: Aligning language
models with self-generated instructions. In Proceed-
ings of the 61st Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
pages 13484–13508, Toronto, Canada. Association
for Computational Linguistics.
Benjamin Weiser. 2023. Here’s what happens when
your lawyer uses chatgpt. The New York Times.
Silei Xu, Giovanni Campagna, Jian Li, and Monica S.
Lam. 2020a. Schema2qa: High-quality and low-cost
q&a agents for the structured web. In Proceedings
of the 29th ACM International Conference on Infor-
mation & Knowledge Management, CIKM ’20, page
1685–1694, New York, NY, USA. Association for
Computing Machinery.
Silei Xu, Sina Semnani, Giovanni Campagna, and Mon-
ica Lam. 2020b. AutoQA: From databases to QA
semantic parsers with only synthetic training data. In
Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing (EMNLP),
pages 422–434, Online. Association for Computa-
tional Linguistics.
Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou,
and Caiming Xiong. 2022. RNG-KBQA: Generation
augmented iterative ranking for knowledge base ques-
tion answering. In Proceedings of the 60th Annual
Meeting of the Association for Computational Lin-
guistics (Volume 1: Long Papers), pages 6032–6043,
Dublin, Ireland. Association for Computational Lin-
guistics.
Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jian-
feng Gao. 2015. Semantic parsing via staged query
graph generation: Question answering with knowl-
edge base. In Proceedings of the 53rd Annual Meet-
ing of the Association for Computational Linguistics
and the 7th International Joint Conference on Natu-
ral Language Processing (Volume 1: Long Papers),
pages 1321–1331, Beijing, China. Association for
Computational Linguistics.
Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-
Wei Chang, and Jina Suh. 2016. The value of se-
mantic parse labeling for knowledge base question
answering. In Proceedings of the 54th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 2: Short Papers), pages 201–206, Berlin,
Germany. Association for Computational Linguis-
tics.
Donghan Yu, Sheng Zhang, Patrick Ng, Henghui Zhu,
Alexander Hanbo Li, Jun Wang, Yiqun Hu, William
Wang, Zhiguo Wang, and Bing Xiang. 2023. De-
caf: Joint decoding of answers and logical forms for
question answering over knowledge bases. In ICLR
2023.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,
Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingn-
ing Yao, Shanelle Roman, Zilin Zhang, and Dragomir
Radev. 2018. Spider: A large-scale human-labeled
dataset for complex and cross-domain semantic pars-
ing and text-to-SQL task. In Proceedings of the 2018
Conference on Empirical Methods in Natural Lan-
guage Processing, pages 3911–3921, Brussels, Bel-
gium. Association for Computational Linguistics.
A Examples of Recovering from Entity Linking Errors
Here, we illustrate our proposal of using entity
mentions to recover from entity linking errors. In
the training set, we have the following example:
• Query: What year did giants win the world
series?
• Original Gold SPARQL:
SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:Q265538;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }
• Gold Entity linker result:
World Series (QID Q265538),
San Francisco Giants (QID Q308966);
• ReFinED result:
San Francisco Giants (QID Q308966);
Here, the ReFinED entity linker fails to identify
the “World Series” entity. Allowing entity mentions
gives the semantic parser a chance to recover from
such entity-linking failures. To train the parser to
generate mentions, our training includes samples
like this:
• Query: what year did giants win the world
series?
• ReFinED result:
San Francisco Giants (QID Q308966);
• Gold target:
SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:world_series;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }
The gold query uses the mention “world_series”. At
inference time, our heuristics use the predicted men-
tion to look up the actual Wikidata entity. For ex-
ample, if wd:world_series is predicted at inference
time, our heuristics map it back to wd:Q265538,
yielding the resolved query below.
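With the mention resolved, the query executed against Wikidata is equivalent to the original gold SPARQL shown above:

SELECT DISTINCT ?x WHERE {
  ?y wdt:sports_season_of_league_or_competition wd:Q265538;
     wdt:winner wd:Q308966;
     wdt:point_in_time ?x. }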
