Summary: Language Agent Tree Search Unifies Reasoning (arxiv.org)
11,983 words - PDF document
One Line
LATS integrates planning, acting, and reasoning in language models to improve decision-making through external feedback and self-reflection.
Key Points
- LATS is a framework that combines planning, acting, and reasoning capabilities in large language models (LLMs) to enhance decision-making.
- LATS utilizes an external environment for feedback, allowing for more deliberate and adaptive decision-making.
- LATS adapts Monte Carlo tree search (MCTS) to language agents and incorporates external feedback to enhance reasoning.
- LATS achieves strong performance on HumanEval for programming and on HotPotQA for question answering.
- LATS addresses the limitations of existing techniques by incorporating external feedback and self-reflection.
Summaries
19 word summary
LATS combines planning, acting, and reasoning in large language models to enhance decision-making by incorporating external feedback and self-reflection.
58 word summary
LATS (Language Agent Tree Search) combines planning, acting, and reasoning in large language models (LLMs) to enhance decision-making. It addresses limitations by incorporating external feedback and self-reflection. LATS unifies LM planning, acting, and reasoning strategies, adapting Monte Carlo tree search (MCTS) to language agents. Experimental evaluations demonstrate LATS's applicability and effectiveness in diverse domains, improving decision-making in LLMs.
127 word summary
LATS (Language Agent Tree Search) is a framework that combines planning, acting, and reasoning capabilities in large language models (LLMs) to enhance decision-making. LATS addresses the limitations of existing techniques by incorporating external feedback and self-reflection. LLMs offer strong reasoning and general adaptability, making them an alternative paradigm to traditional reinforcement learning. However, LLMs often lack deliberate decision-making characteristics. To overcome this, LATS is introduced as a framework that unifies LM planning, acting, and reasoning strategies. It adapts Monte Carlo tree search (MCTS) from model-based reinforcement learning to language agents and utilizes text as an interface for adaptive planning. Experimental evaluations across diverse domains demonstrate the applicability and effectiveness of LATS for both reasoning and acting tasks. LATS is a versatile framework that improves decision-making in LLMs.
445 word summary
LATS (Language Agent Tree Search) is a framework that combines planning, acting, and reasoning capabilities in large language models (LLMs) to enhance decision-making. By incorporating external feedback and self-reflection, LATS addresses the limitations of existing techniques. It utilizes LLMs as agents, value functions, and optimizers, allowing for more deliberate and adaptive decision-making beyond traditional methods.
While traditional reinforcement learning has focused on developing autonomous agents capable of reasoning and decision-making, LLMs offer an alternative paradigm that provides strong reasoning and general adaptability. LLMs have excelled in natural language processing tasks and have been successful in complex environments involving knowledge and reasoning.
However, LLMs often lack deliberate and thoughtful decision-making characteristics. Existing methods fail to consider multiple reasoning paths or plan ahead. To overcome these limitations, LATS is introduced as a general framework for decision-making and reasoning with language models. It expands on existing methods like Chain-of-Thought (CoT) prompting and ReAct by unifying LM planning, acting, and reasoning strategies. LATS adapts Monte Carlo tree search (MCTS) from model-based reinforcement learning to language agents, using the LM as an agent, value function, and optimizer. It utilizes text as an interface between components for adaptive planning without additional training.
Experimental evaluations across diverse domains demonstrate the applicability of LATS for both reasoning and acting tasks. It achieves high performance on HumanEval for programming with GPT-4 and on HotPotQA with GPT-3.5, showcasing its effectiveness and generality.
LLMs have been used for reasoning tasks by decomposing complex inputs into sequential intermediate steps. Various prompting techniques have been developed to improve LM performance, but error propagation remains an issue. Recent advancements like Tree-of-Thoughts (ToT) prompting and Reasoning via Planning (RAP) aim to mitigate this issue by exploring multiple reasoning paths.
In addition, LLMs have been adapted for decision-making tasks in robotics and complex games. Acting-based prompting techniques, such as ReAct, have shown success in text-based environments but lack planning capabilities.
LATS employs tree-based search, a widely used technique in planning algorithms and reinforcement learning, to fully utilize the potential of LLMs. It selects nodes, expands the tree with sampled actions, evaluates the nodes based on their values, simulates trajectories until a terminal state is reached, and backpropagates the resulting values. Self-reflection is used to refine the decision-making process.
In conclusion, LATS is a framework that unifies planning, acting, and reasoning in large language models for enhanced decision-making. It incorporates external feedback and self-reflection to address existing limitations. Experimental evaluations demonstrate its versatility and effectiveness across various domains. The appendix provides additional details about the LATS algorithm, including pseudocode, further discussion, limitations, future directions, and the broader impact. It also specifies the details of the environments used in the experiments, demonstrating the application of LATS in different scenarios.
565 word summary
LATS (Language Agent Tree Search) is a framework that enhances decision-making by combining planning, acting, and reasoning capabilities in large language models (LLMs). It addresses the limitations of existing techniques by incorporating external feedback and self-reflection. LATS utilizes LLMs as agents, value functions, and optimizers, repurposing their strengths for improved problem-solving. The framework employs an external environment for feedback, allowing for more deliberate and adaptive decision-making beyond the limitations of existing techniques.
Traditional reinforcement learning has focused on developing autonomous agents capable of reasoning and decision-making. However, the rise of large language models (LLMs) offers an alternative paradigm that provides strong reasoning and general adaptability. LLMs have excelled in natural language processing tasks and have been adapted to a wide range of tasks that require advanced common-sense reasoning or quantitative skills. They have also been successful in complex environments that involve knowledge and reasoning.
While LLMs have shown promise in reasoning and acting tasks, they often lack deliberate and thoughtful decision-making characteristics. Existing methods fail to consider multiple reasoning paths or plan ahead. To address these limitations, LATS is introduced as a general framework for decision-making and reasoning with language models. It unifies LM planning, acting, and reasoning strategies by expanding on existing methods like CoT and ReAct. LATS adapts Monte Carlo tree search (MCTS) from model-based reinforcement learning to language agents, using the LM as an agent, value function, and optimizer. Text is used as an interface between components, allowing for adaptive planning without additional training.
Experimental evaluations across diverse domains demonstrate the applicability of LATS for both reasoning and acting tasks. LATS achieves high performance on HumanEval for programming with GPT-4 and on HotPotQA with GPT-3.5. The effectiveness and generality of LATS are demonstrated through these experiments.
LLMs have been used for reasoning tasks by decomposing complex inputs into sequential intermediate steps. Various prompting techniques have been developed to improve LM performance, including Chain-of-Thought (CoT) prompting and its variants. However, these methods often suffer from error propagation as the number of steps increases. Recent advancements, such as Tree-of-Thoughts (ToT) prompting and Reasoning via Planning (RAP), have aimed to mitigate this issue by exploring multiple reasoning paths.
LLMs have also been adapted for decision-making tasks in robotics and complex games. Acting-based prompting techniques, such as ReAct, have seen success in text-based environments. These methods have been enhanced by access to external tools or models but lack planning capabilities.
LATS employs tree-based search, a widely used technique in planning algorithms and reinforcement learning, to fully utilize the potential of LLMs. The search algorithm selects nodes, expands the tree with sampled actions, evaluates the nodes based on their values, simulates trajectories until a terminal state is reached, and backpropagates the resulting values. Self-reflection is used to refine the decision-making process.
In conclusion, LATS is a framework that unifies planning, acting, and reasoning in large language models for enhanced decision-making. It addresses the limitations of existing techniques by incorporating external feedback and self-reflection. Experimental evaluations demonstrate the versatility and effectiveness of LATS across various domains. The appendix provides additional details about the LATS algorithm, including the pseudocode, further discussion on the method and its limitations, future directions, and the broader impact. The appendix also specifies the details of the environments used in the experiments, including the HotPotQA, Programming, and Webshop environments. It lists the prompts used for each environment, demonstrating how LATS can be applied in different scenarios.
1020 word summary
LATS (Language Agent Tree Search) is a framework that combines planning, acting, and reasoning capabilities in large language models (LLMs) to enhance decision-making. It employs LLMs as agents, value functions, and optimizers, repurposing their strengths for improved problem-solving. LATS utilizes an external environment for feedback, allowing for more deliberate and adaptive decision-making beyond the limitations of existing techniques.
In traditional reinforcement learning, general autonomous agents capable of reasoning and decision-making have been of interest. However, with the rise of large language models (LLMs), there is an alternative paradigm that offers strong reasoning and general adaptability. LLMs have excelled in natural language processing tasks and have been adapted to a wide range of tasks that require advanced common-sense reasoning or quantitative skills. They have also been successful in complex environments that involve knowledge and reasoning.
While LLMs have shown promise in reasoning and acting tasks, they often lack the deliberate and thoughtful decision-making characteristics of humans. Existing methods fail to consider multiple reasoning paths or plan ahead. Recent search-guided LLM works have addressed this issue by searching over multiple chains of thought or using Monte Carlo tree search (MCTS). However, these methods do not incorporate external feedback that can enhance reasoning.
To address these limitations, LATS is introduced as a general framework for decision-making and reasoning with language models. It unifies LM planning, acting, and reasoning strategies by expanding on existing methods like CoT and ReAct. LATS adapts MCTS from model-based reinforcement learning to language agents, using the LM as an agent, value function, and optimizer. Text is used as an interface between components, allowing for adaptive planning without additional training.
Experimental evaluations across diverse domains such as programming, HotPotQA, and WebShop demonstrate the applicability of LATS for both reasoning and acting tasks. LATS achieves high performance on HumanEval for programming with GPT-4 and on HotPotQA with GPT-3.5. The effectiveness and generality of LATS are demonstrated through these experiments.
LLMs have been used for reasoning tasks by decomposing complex inputs into sequential intermediate steps. Various prompting techniques have been developed to improve LM performance, including Chain-of-Thought (CoT) prompting and its variants. However, these methods often suffer from error propagation as the number of steps increases. Recent advancements, such as Tree-of-Thoughts (ToT) prompting and Reasoning via Planning (RAP), have aimed to mitigate this issue by exploring multiple reasoning paths.
LLMs have also been adapted for decision-making tasks in robotics and complex games. Acting-based prompting techniques, such as ReAct, have seen success in text-based environments. These methods have been enhanced by access to external tools or models but lack planning capabilities.
Tree-based search, a widely used technique in planning algorithms and reinforcement learning, is employed in LATS to fully utilize the potential of LLMs. The search algorithm selects nodes, expands the tree with sampled actions, evaluates the nodes based on their values, simulates trajectories until a terminal state is reached, and backpropagates the resulting values. Self-reflection is used to refine the decision-making process.
Experiments on HotPotQA, programming, and WebShop demonstrate the effectiveness of LATS: it outperforms existing methods in both reasoning and acting performance, and its use of external feedback and self-reflection yields more sensible, adaptive behavior. LATS achieves state-of-the-art performance on HumanEval and surpasses reinforcement-learning-based approaches on WebShop.
In conclusion, LATS is a framework that unifies planning, acting, and reasoning in large language models for enhanced decision-making. It addresses the limitations of existing techniques by incorporating external feedback and self-reflection. Experimental evaluations demonstrate the versatility and effectiveness of LATS across various domains.
The Language Agent Tree Search (LATS) algorithm is proposed as a framework to improve reasoning and decision-making in large language models (LLMs). LATS combines thought, action, and observation steps to enhance the performance of LLMs in complex tasks. The algorithm uses a tree search approach to explore different actions and their consequences in an environment. It incorporates a value function to evaluate the potential outcomes of each action and a reflection generator to provide feedback on previous actions. LATS aims to generate high-level linguistic reasoning and actions through multiple rounds of decision-making and reflection.
The LATS algorithm is presented in pseudocode in the paper. It involves initializing the action and observation spaces, as well as the state-action value function and visit counter. The algorithm then iterates through a series of steps: expansion and simulation, evaluation, selection, and backpropagation. During expansion and simulation, the algorithm generates multiple actions and observations based on the current state. The evaluation step scores each state-action pair using the value function. The selection step follows the child with the highest upper-confidence (UCT) score, which balances the estimated value against an exploration bonus derived from the visit counts and the exploration weight. Finally, the backpropagation step updates the state-action value function based on the rewards received.
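The search loop described above can be sketched as follows. This is an illustrative toy implementation, not the authors' code: the `Node` structure, `uct_score`, and the numeric `evaluate` function are hypothetical stand-ins for the paper's LM-driven action sampler, LM value function, and reflection steps.

```python
import math
import random

class Node:
    """One node in the search tree; state is a partial trajectory."""
    def __init__(self, state, parent=None):
        self.state = state          # here: a list of ints standing in for actions
        self.parent = parent
        self.children = []
        self.visits = 0             # visit counter N(s)
        self.value = 0.0            # running mean of backpropagated rewards

def uct_score(node, exploration_weight=1.0):
    # Balance exploitation (mean value) against an exploration bonus.
    if node.visits == 0:
        return float("inf")
    bonus = exploration_weight * math.sqrt(
        math.log(node.parent.visits) / node.visits)
    return node.value + bonus

def select(node):
    # Descend to a leaf, always following the child with the highest UCT score.
    while node.children:
        node = max(node.children, key=uct_score)
    return node

def expand(node, n_samples=3):
    # Stand-in for sampling n candidate actions from the LM agent.
    for action in random.sample(range(10), n_samples):
        node.children.append(Node(node.state + [action], parent=node))

def evaluate(node):
    # Stand-in for the LM value function: rewards trajectories with large sums.
    return sum(node.state) / (10 * max(len(node.state), 1))

def backpropagate(node, reward):
    # Update visit counts and running-mean values from the leaf up to the root.
    while node is not None:
        node.visits += 1
        node.value += (reward - node.value) / node.visits
        node = node.parent

def lats_search(n_iterations=50):
    root = Node(state=[])
    for _ in range(n_iterations):
        leaf = select(root)
        expand(leaf)
        for child in leaf.children:
            backpropagate(child, evaluate(child))
    # Return the most-visited first action, as is conventional in MCTS.
    return max(root.children, key=lambda c: c.visits)
```

In the actual framework the expansion and evaluation calls would be LM prompts, and a reflection generator would feed text feedback from failed trajectories back into subsequent prompts; this sketch only shows the tree bookkeeping those components plug into.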
The limitations of LATS are also discussed in the paper. The algorithm requires more computational resources and time compared to simpler techniques, making it less efficient. However, it is recommended for difficult tasks or situations where performance is prioritized over efficiency. The authors hope that future advancements in LLMs will reduce the computational costs associated with LATS.
The broader impact of LATS is also considered. While LATS improves autonomous decision-making, it also raises concerns about potential harmful uses of LLMs. However, LATS enhances interpretability and alignment by generating understandable high-level linguistic reasoning and actions. This is in contrast to relying on implicit low-level token values.
The appendix provides additional details about the LATS algorithm, including the pseudocode, further discussion on the method and its limitations, future directions, and the broader impact. The appendix also specifies the details of the environments used in the experiments, including the HotPotQA, Programming, and Webshop environments. It lists the prompts used for each environment, demonstrating how LATS can be applied in different scenarios.
Overall, the Language Agent Tree Search (LATS) algorithm is proposed as a framework to improve reasoning and decision-making in large language models. The algorithm combines thought, action, and observation steps to enhance performance in complex tasks. While LATS has limitations in terms of computational costs, it has the potential to improve interpretability and alignment in LLMs. The appendix provides additional details and prompts for different environments where LATS can be applied.