Summary: Fine-tuned LLMs for Wikidata Semantic Parsing (arxiv.org)
9,201 words - PDF document
One Line
Fine-tuned large language models used as semantic parsers over Wikidata achieve strong answer accuracy on the new WikiWebQuestions benchmark, demonstrating that grounding LLMs in Wikidata improves question answering.
Key Points
- WikiWebQuestions is a high-quality question answering benchmark for Wikidata.
- WikiSP is a few-shot sequence-to-sequence semantic parser for Wikidata.
- The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata.
- Semantic parsing is used to complement LLMs and provide more accurate answers.
- The authors modify SPARQL to use domain and property names instead of unique IDs.
- The authors achieve strong results in answer accuracy using their methodology.
- Semantic parsing provides interpretable results grounded in Wikidata for better verification of answers.
- The combination of GPT-3 with WikiSP improves answer accuracy in question answering tasks.
Summaries
23 word summary
WikiWebQuestions is a question answering benchmark for Wikidata. Semantic parsing with large language models improves answer accuracy. Experimental results show a strong baseline.
94 word summary
This paper presents WikiWebQuestions, a question answering benchmark for Wikidata, and introduces WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing alongside large language models (LLMs) to improve answer accuracy. They modify SPARQL and train the parser to link entities in user queries to their unique ID in Wikidata. Experimental results show that this methodology achieves a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. The authors highlight the importance of semantic parsing for grounding LLMs and discuss limitations and future work.
139 word summary
This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors propose using semantic parsing as a complement to large language models (LLMs) to improve answer accuracy. They modify SPARQL by using domain and property names instead of unique IDs and train the parser to link entities in user queries to their unique ID in Wikidata. If the query fails, GPT-3 is used as a fallback. Experimental results show that this methodology improves answer accuracy, achieving a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. Semantic parsing is crucial for grounding LLMs, and combining it with GPT-3 provides more reliable answers. The authors also introduce the WikiWebQuestions dataset, evaluate fine-tuned LLMs for Wikidata semantic parsing, and discuss limitations and future work.
516 word summary
This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The authors aim to enhance the accuracy of large language models (LLMs) by grounding them in Wikidata. Because LLMs are prone to giving incorrect answers, the authors propose complementing them with semantic parsing to produce more accurate, verifiable answers.
The authors modify SPARQL, the query language used over Wikidata, by using domain and property names instead of unique IDs. They train the parser to link entities in user queries to their unique IDs in Wikidata, using either an entity linker or mentions in the query. WikiSP produces queries in this modified SPARQL, which are then executed against Wikidata. If the query fails to return a result, GPT-3 is used as a fallback and its answer is labeled as a GPT-3 guess.
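The name-for-ID substitution described above can be sketched as a simple post-processing step. A minimal illustration follows; the lookup tables and function name are hypothetical stand-ins, not the paper's actual vocabulary or implementation.

```python
# Hypothetical sketch: map human-readable property/entity names in the
# parser's modified SPARQL back to Wikidata PIDs/QIDs before execution.
# The name tables are illustrative samples, not the paper's real tables.
import re

PROPERTY_IDS = {"country": "P17", "capital": "P36"}   # name -> PID (sample)
ENTITY_IDS = {"united_states": "Q30"}                 # name -> QID (sample)

def to_executable_sparql(query: str) -> str:
    """Replace wdt:<name> / wd:<name> tokens with their Wikidata IDs."""
    query = re.sub(r"wdt:(\w+)",
                   lambda m: "wdt:" + PROPERTY_IDS.get(m.group(1), m.group(1)),
                   query)
    query = re.sub(r"wd:(\w+)",
                   lambda m: "wd:" + ENTITY_IDS.get(m.group(1), m.group(1)),
                   query)
    return query

modified = "SELECT ?x WHERE { wd:united_states wdt:capital ?x . }"
print(to_executable_sparql(modified))
# SELECT ?x WHERE { wd:Q30 wdt:P36 ?x . }
```

Names are easier for a fine-tuned LLM to emit correctly than opaque IDs, and this mapping restores an executable query afterwards.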
Experimental results demonstrate that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining the semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.
Semantic parsing is crucial for grounding LLMs as it provides interpretable results grounded in Wikidata. This allows users to verify answers since LLMs may not always be correct. By combining semantic parsing with GPT-3's guesses, the system offers more reliable answers.
The authors introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. It is a migration of the WebQuestionsSP benchmark from Freebase to Wikidata, with real-world questions originally collected from users via the Google Suggest API.
ReFinED is used as the entity linker for WikiSP. It is fine-tuned with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. Additionally, LLaMA, a large language model, is fine-tuned with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.
Evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance, achieving a 65.5% exact match accuracy and a 71.9% F1 score. Answer accuracy is improved by entity linking and allowing mentions as entities.
The authors evaluate fine-tuned LLMs for Wikidata semantic parsing, focusing on using property and domain names instead of IDs and combining GPT-3 with WikiSP for question answering. Using property and domain names improves answer accuracy by 2.0%. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.
Error analysis reveals alternative interpretations, SPARQL queries that don't retrieve answers, and entity linking errors as common errors. WikiSP outperforms WDAqua by 3.6% in F1 score on Task 4 from the QALD-7 dataset. Combining GPT-3 with WikiSP yields additional correct answers for 34% of the questions.
Limitations discussed include the focus on factoid question answering and English datasets, as well as the need for better training datasets to handle more complex questions.
In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and demonstrate the advantages of combining GPT-3 with WikiSP.
570 word summary
This paper introduces WikiWebQuestions, a question answering benchmark for Wikidata, and presents WikiSP, a semantic parser for Wikidata. The goal is to improve the accuracy of large language models (LLMs) by grounding them in Wikidata. Because LLMs tend to give incorrect answers, the authors propose using semantic parsing to complement LLMs and provide more accurate answers.
The authors modify SPARQL, the query language used over Wikidata, to use domain and property names instead of unique IDs. They train the parser to use either an entity linker or mentions in the query to link entities in the user query to their unique IDs in Wikidata. WikiSP produces queries in this modified SPARQL, which are then executed against Wikidata. If the query fails to return a result, the system defaults to GPT-3 and labels the result as a GPT-3 guess.
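The fallback logic described above can be sketched as follows. This is our own reconstruction of the control flow; the function names and return format are hypothetical, not the paper's implementation.

```python
# Illustrative sketch of the parse-first, GPT-3-fallback pipeline.
# run_wikisp_query and gpt3_guess are hypothetical stand-ins for the
# parser+Wikidata execution step and the LLM, respectively.
from typing import Callable, List, Optional

def answer(question: str,
           run_wikisp_query: Callable[[str], Optional[List[str]]],
           gpt3_guess: Callable[[str], str]) -> dict:
    """Try the semantic parser first; fall back to a labeled GPT-3 guess."""
    results = run_wikisp_query(question)
    if results:                      # query returned at least one answer
        return {"answer": results, "source": "wikidata"}
    # Query failed or returned nothing: fall back to the LLM, but label
    # the output so users know it is an unverified guess.
    return {"answer": gpt3_guess(question), "source": "gpt3_guess"}
```

Labeling the fallback output is what lets users distinguish Wikidata-grounded answers from guesses they should verify.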
Experimental results show that this methodology improves answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.
Semantic parsing is important in grounding LLMs. While LLMs can answer questions directly, their answers may not always be correct. Semantic parsers provide interpretable results grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides more reliable answers.
The authors also introduce the WikiWebQuestions dataset, a high-quality semantic parsing dataset for Wikidata. They migrated the WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.
The authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.
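One way the linker's output might be surfaced to the parser is as an annotation appended to the input question, so the parser can use QIDs when the linker succeeds and surface mentions otherwise. The interface below is our own illustration, not ReFinED's API or the paper's exact input format.

```python
# Hedged sketch: attach entity-linker results (mention -> QID) to the
# question before it reaches the parser. Format is illustrative only.
def annotate(question: str, linked: dict) -> str:
    """Append linker output so the parser can prefer QIDs when available
    and fall back to surface mentions when the linker misses."""
    if not linked:
        return question + " || entities: none"
    pairs = "; ".join(f"{mention} = {qid}" for mention, qid in linked.items())
    return f"{question} || entities: {pairs}"

print(annotate("where was obama born?", {"obama": "Q76"}))
# where was obama born? || entities: obama = Q76
```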
The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Entity linking and allowing mentions as entities improve answer accuracy.
The authors conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing, focusing on two aspects: using property and domain names instead of PIDs and QIDs, and combining GPT-3 with WikiSP for question answering.
Using property and domain names improves answer accuracy by 2.0%. LLMs can adapt to changes in query notation with fine-tuning. The combination of GPT-3 with WikiSP provides definitive, correct, and complete answers for 75% of the questions in the dev set.
Error analysis shows that errors include alternative interpretations, alternative SPARQL queries that don't retrieve an answer, and entity linking errors.
WikiSP achieves 38% accuracy on Task 4 from the QALD-7 dataset, outperforming the state-of-the-art WDAqua by 3.6% in F1 score. Combining GPT-3 with WikiSP provides additional correct answers for 34% of the questions.
The authors discuss limitations such as the focus on factoid question answering and English datasets, and the need for better training datasets to handle more complex questions.
In conclusion, the authors create the WikiWebQuestions benchmark dataset, establish a strong baseline using fine-tuned LLMs, and show the benefits of combining GPT-3 with WikiSP.
965 word summary
This paper presents WikiWebQuestions, a high-quality question answering benchmark for Wikidata. It introduces WikiSP, a few-shot sequence-to-sequence semantic parser for Wikidata. The goal is to improve the factuality of large language models (LLMs) by grounding them in Wikidata, which contains over 12 billion facts. LLMs can answer questions directly, but they are prone to hallucinating and giving incorrect answers. The authors propose using semantic parsing to complement LLMs and provide more accurate answers.
The authors modify SPARQL, the query language used for querying Wikidata, to use domain and property names instead of their unique IDs. They train the parser to use either the results from an entity linker or mentions in the query; the entity linker links entities in the user query to their unique IDs in Wikidata. WikiSP produces queries in this modified SPARQL, which are then executed against Wikidata to produce answers. If applying the query to Wikidata fails to return a result, the system defaults to GPT-3, a large language model, and labels the result as a GPT-3 guess.
Experimental results show that this methodology is effective in improving answer accuracy. The authors achieve a strong baseline of 76% and 65% answer accuracy in the dev and test sets of WikiWebQuestions, respectively. By combining their semantic parser with GPT-3, they are able to provide useful answers to 96% of the questions in the dev set. They also outperform the state-of-the-art for the QALD-7 Wikidata dataset by 3.6% in F1 score.
The authors highlight the importance of semantic parsing in grounding LLMs. While LLMs can answer questions directly, they lack interpretability and their answers may not always be correct. Semantic parsers provide interpretable results that are grounded in Wikidata, allowing users to verify the answers. By combining the results from the semantic parser with GPT-3's guesses, the system provides users with more reliable answers.
The authors also introduce the WikiWebQuestions dataset, which is a high-quality semantic parsing dataset for Wikidata. They migrated the popular WebQuestionsSP benchmark from Freebase to Wikidata, providing up-to-date answers from a larger knowledge base. The dataset consists of real-world questions collected from users using the Google Suggest API.
In terms of implementation, the authors use ReFinED as the entity linker for WikiSP. They fine-tune ReFinED with the question and entity pairs from the WikiWebQuestions training set to learn common terms used in Wikidata. They also fine-tune LLaMA, a large language model, with a few-shot training set along with instructions used to fine-tune Alpaca, another large language model.
The evaluation results of WikiSP on the WikiWebQuestions dataset show promising performance. The model achieves a 65.5% exact match accuracy and a 71.9% F1 score. Ablation experiments demonstrate the importance of entity linking and allowing mentions as entities in improving answer accuracy.
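The two metrics reported above can be computed as in this minimal sketch, assuming exact match compares whole predictions and F1 is taken set-wise over predicted versus gold answer sets (the standard convention for KBQA; the exact evaluation script may differ).

```python
# Minimal sketch of exact-match accuracy and set-based answer F1.
def exact_match(preds: list, golds: list) -> float:
    """Fraction of predictions identical to their gold counterpart."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def answer_f1(pred: set, gold: set) -> float:
    """Harmonic mean of precision and recall over answer sets."""
    if not pred or not gold:
        return float(pred == gold)   # both empty counts as a match
    tp = len(pred & gold)            # answers present in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

F1 rewards partially correct answer sets, which is why it runs a few points above exact match in the results quoted above.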
Overall, this paper presents a method for fine-tuning LLMs and improving their factuality by grounding them in Wikidata. The proposed WikiSP semantic parser achieves strong results on the WikiWebQuestions dataset and outperforms existing methods. The authors highlight the benefits of combining semantic parsing with large language models to provide more accurate and interpretable answers.
The authors of the paper conducted an evaluation of fine-tuned LLMs for Wikidata semantic parsing. They focused on two specific aspects: the effectiveness of using property and domain names instead of PIDs and QIDs, and the combination of GPT-3 with WikiSP for question answering.
In their evaluation, the authors found that using property and domain names instead of PIDs and QIDs improved the answer accuracy by 2.0%. This indicates that LLMs can adapt to changes in query notation with fine-tuning, and it is easier for them to learn names than random IDs. However, the replacement of QIDs with their names would likely be more significant if mentions were not allowed in the predicted logical form.
The authors also evaluated the combination of GPT-3 with WikiSP for question answering using the WWQ dataset. GPT-3 answered 66.4% of the questions correctly, but provided incomplete answers for 26.5% of the questions and wrong answers for 7.1% of the questions. In contrast, WikiSP provided definitive answers for 75.6% of the questions. When combining GPT-3 with WikiSP, they were able to give definitive, correct, and complete answers for 75% of the questions in the dev set.
Error analysis showed that 18% of the errors in the WWQ dev set were actually deemed to be correct alternative results. These included cases where the model predicted an alternative interpretation to the question that still provided a reasonable answer. Another 6.3% of the errors were due to reasonable alternative SPARQL queries that did not retrieve an answer. The biggest source of errors, accounting for 35.1% of the failed examples, was entity linking errors. The entity linker failed to provide correct entities in these cases.
The authors also conducted an experiment with WikiSP on Task 4 from the QALD-7 dataset. WikiSP achieved 38% accuracy on this dataset, outperforming the state-of-the-art WDAqua by 3.6% in terms of F1 score. They also evaluated the combination of GPT-3 with WikiSP on QALD-7 and found that the combination approach provided additional correct answers for 34% of the questions.
The authors discussed the limitations of their work, including the focus on factoid question answering and the use of English datasets. They also mentioned the need for better training datasets to improve the performance of WikiSP on less popular questions.
In conclusion, the authors created a high-quality benchmark dataset called WikiWebQuestions for large knowledge-base question answering. They established a strong baseline for answer accuracy and F1 score using fine-tuned LLMs with a few-shot training dataset. They also showed that combining GPT-3 with WikiSP can reduce hallucination and provide useful information for a large percentage of questions. However, they acknowledged the need for further improvements and better training datasets to handle more complex and less popular questions.