Summary: Tree of Uncertain Thoughts Reasoning for Large Language Models (arxiv.org)
3,715 words - PDF document
One Line
The Tree of Uncertain Thoughts (TouT) is a framework that improves the reasoning abilities of Large Language Models (LLMs).
Key Points
- The Tree of Uncertain Thoughts (TouT) is a reasoning framework tailored for Large Language Models (LLMs).
- TouT leverages Monte Carlo Dropout to quantify uncertainty scores associated with LLMs' diverse local responses (see the sketch after this list).
- By integrating local uncertainty quantification with global search algorithms, TouT enhances the accuracy of model responses.
- TouT outperforms the Tree of Thoughts (ToT) and chain-of-thought prompting methods in rigorous experiments on planning tasks.
- Large Language Models (LLMs) have shown remarkable prowess in tasks that demand reasoning, but their reasoning process primarily relies on autoregressive mechanisms.
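To make the Monte Carlo Dropout idea concrete, here is a minimal Python sketch of local uncertainty quantification: a candidate thought is scored several times by a stochastic evaluator, and the spread of those scores is treated as its uncertainty. The function names and the stubbed evaluator are assumptions for illustration, not the paper's implementation; in Monte Carlo Dropout the stochasticity comes from keeping dropout active at inference time.

```python
import random
import statistics

def llm_value_estimate(thought: str) -> float:
    # Stub standing in for a dropout-enabled LLM call that rates how
    # promising `thought` is on a 0-1 scale; with dropout active at
    # inference, repeated calls on the same input return different values.
    return min(1.0, max(0.0, random.gauss(0.6, 0.1)))

def mc_dropout_score(thought: str, n_samples: int = 8) -> tuple[float, float]:
    """Score one intermediate thought n_samples times and return the
    mean value plus the standard deviation (the local uncertainty)."""
    samples = [llm_value_estimate(thought) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.pstdev(samples)
```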
Summaries
20 word summary
The Tree of Uncertain Thoughts (TouT) is a reasoning framework that uses uncertainty estimates to improve the reasoning of Large Language Models (LLMs).
34 word summary
The Tree of Uncertain Thoughts (TouT) is a reasoning framework for Large Language Models (LLMs) that addresses local uncertainties at intermediate decision points, using Monte Carlo Dropout to score responses and improve reasoning accuracy.
249 word summary
The Tree of Uncertain Thoughts (TouT) is a reasoning framework designed for Large Language Models (LLMs) that addresses the local uncertainties in intermediate decision points. These uncertainties arise from the potential for diverse responses in LLMs and can impact the accuracy of the model's final responses.
Related work such as LLaMA-2 has focused on the intersection of linguistic properties and deep learning capabilities in LLMs. The main focus of this work is to propose an uncertain-thoughts reasoning framework for LLaMA-2 that enhances the reasoning capabilities of LLMs.
This excerpt discusses the use of uncertainty-aware inference in large language models (LLMs). The authors propose a method to explicitly quantify uncertainty for each local response in intermediate steps. They introduce a novel uncertainty evaluator that generates a confidence score for each local intermediate state; a hedged sketch of such an evaluator follows.
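The excerpt does not spell out the exact scoring rule, so the evaluator below is one plausible instantiation: it penalizes a thought's mean value by the spread of its Monte Carlo samples, with a hypothetical weighting parameter `lam`.

```python
def confidence_score(mean_value: float, std_dev: float, lam: float = 0.5) -> float:
    """Hypothetical uncertainty evaluator: a thought whose repeated value
    estimates agree (low std_dev) keeps most of its mean value, while a
    high-variance (uncertain) thought is penalized."""
    return mean_value - lam * std_dev
```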
We conducted experiments using the breadth-first search (BFS) algorithm to test the success rate on Mini Crosswords games. We compared our results to previous methods and used the same LLM weights for a fair comparison. Our experiments were conducted on NVIDIA A100 GPUs.
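As a rough illustration of how local uncertainty can plug into a global breadth-first search, the sketch below expands each frontier thought, scores candidates with `mc_dropout_score` and `confidence_score` from the sketches above, and keeps a fixed number of highest-confidence states per level. The `expand` stub and the beam width are assumptions, not the paper's exact setup.

```python
def expand(thought: str) -> list[str]:
    # Stub standing in for an LLM call that proposes candidate next
    # steps (e.g. the next crossword entry) given a partial solution.
    return [f"{thought} -> step{i}" for i in range(3)]

def uncertainty_aware_bfs(root: str, depth: int = 3, beam: int = 2) -> list[str]:
    """Expand every frontier thought, score candidates with the Monte
    Carlo routine above, and keep the `beam` highest-confidence states
    at each level of the tree."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        scored = [(confidence_score(*mc_dropout_score(c)), c) for c in candidates]
        frontier = [c for _, c in sorted(scored, reverse=True)[:beam]]
    return frontier

print(uncertainty_aware_bfs("start"))
```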
The excerpt also discusses the effectiveness of using Large Language Models on these planning tasks.
This excerpt provides a list of references to related research on large language models. The references include papers and preprints that explore various aspects of language models, such as their ability to multitask, their few-shot learning capabilities, and techniques for improving their reasoning abilities.