Summary: LLM+P: Empowering Large Language Models with Optimal Planning Proficiency (arxiv.org)
7,215 words - PDF document
One Line
The LLM+P framework combines large language models with classical planners to generate optimal plans for planning problems stated in natural language, compensating for LLMs' lack of true understanding while retaining their zero-shot generalization ability.
Key Points
- Large Language Models (LLMs) like GPT-4 lack true understanding despite their impressive zero-shot generalization abilities
- LLM+P methodology incorporates classical planners into LLMs to provide optimal solutions for planning problems
- LLM+P pipeline generates correct solutions to more planning problems than LLMs on their own
- LLM+P combines LLMs with classical planners to give LLMs optimal planning capabilities
- LLM+P improves LLMs' performance on complex planning tasks and uses the Planning Domain Definition Language (PDDL) to formalize planning problems
- The paper proposes ways to extend the framework, including enabling LLMs to auto-detect when and how to apply finetuning and reducing their dependency on human input.
Summaries
302-word summary
LLM+P is a framework that combines Large Language Models (LLMs) with classical planners to give LLMs optimal planning capabilities. It uses the Planning Domain Definition Language (PDDL) to formalize planning problems and leverages symbolic planners to output correct plans. LLM+P relies on in-context learning, which allows LLMs to perform unseen downstream tasks without fine-tuning their parameters. The system assumes that a human expert can provide a domain description in PDDL format, which is used together with generated problem instances. LLMs are pretrained on a large amount of offline text, which gives them zero-shot generalization ability, while classical planning algorithms handle complicated planning tasks. The LLM+P framework therefore focuses the LLM on translating the planning problem from natural language into structured PDDL format, and it produces an optimal plan for the majority of problems. The paper also surveys related planning systems and historical developments in artificial intelligence and robotics, and proposes ways to extend the framework, including enabling LLMs to auto-detect when and how to apply finetuning and reducing their dependency on human input. Large language models lack true understanding despite their impressive zero-shot generalization abilities; the LLM+P methodology incorporates classical planners into LLMs to address this issue. LLM+P takes a natural language description of a planning problem and returns a correct plan for solving it, also in natural language. Through experiments, the researchers found that LLM+P provides optimal solutions for most problems, while LLMs alone fail to provide even feasible plans for most problems. LLM+P connects LLMs to general-purpose planners without altering their training corpus, and this methodology can be applied to any class of problems with a sound and complete solver. The LLM+P pipeline generates correct solutions to more planning problems than LLMs do on their own.
However, in this paper the LLM is not asked to recognize which prompts are suitable for processing with the LLM+P pipeline.
604-word summary
Large language models (LLMs) like GPT-4 and ChatGPT lack true understanding despite their impressive zero-shot generalization abilities. The LLM+P methodology incorporates classical planners into LLMs to address this issue: it takes a natural language description of a planning problem and returns a correct plan for solving it in natural language. Through experiments, researchers found that LLM+P provides optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems. LLMs are proficient in linguistic competence but lack functional competence. LLM+P connects LLMs to general-purpose planners to solve planning problems without altering their training corpus. The pipeline outputs a problem description, solves the problem using a general-purpose planner, and converts the planner's output back to natural language. This methodology can be applied to any class of problems with a sound and complete solver, and the pipeline generates correct solutions to more planning problems than LLMs do on their own. However, in this paper the LLM is not asked to recognize which prompts are suitable for processing with the LLM+P pipeline. LLM+P combines Large Language Models with classical planners to give LLMs optimal planning capabilities. It uses the Planning Domain Definition Language (PDDL) to formalize planning problems and leverages symbolic planners to output correct plans, relying on in-context learning, which allows LLMs to perform unseen downstream tasks without fine-tuning their parameters. The system assumes that a human expert can provide a domain description in PDDL format, which is used together with generated problem instances. LLM+P provides a natural language interface for giving tasks to robot systems and automates the process of generating PDDL descriptions for problem domains. LLMs are pretrained on a large amount of offline text, which gives them zero-shot generalization ability.
Classical planning algorithms are used for complicated planning tasks, and recent classical planning systems designed for robotics often use PDDL or answer set programming (ASP) as the underlying action language for the planners. The LLM+P pipeline requires a chatbot to trigger LLM+P based on its conversation with a human user; the LLM then infers the problem PDDL file, a classical planner produces a PDDL plan, and finally the LLM translates the PDDL plan back into natural language. The study proposes the LLM+P framework to empower Large Language Models (LLMs) with optimal planning capabilities by combining classical planning with LLMs. Seven planning domains were used to evaluate the proposed approach and answer three research questions. The study shows that augmenting LLMs with classical planners improves their performance on complex tasks: LLM+P performs significantly better than LLM-AS-P, and context plays a crucial role in the success of LLM+P. The framework focuses the LLM on translating the planning problem from natural language into structured PDDL format and produces an optimal plan for the majority of problems; the hardest domains are the ones with complex spatial relationships. The paper discusses various planning systems and historical developments in artificial intelligence and robotics, and proposes ways to extend the framework, including enabling LLMs to auto-detect when and how to apply finetuning and reducing their dependency on human input. The paper also discusses the use of language models as zero-shot planners for robot task planning and situation handling in open worlds, and references multiple other academic papers that explore large language models for object rearrangement, task and motion planning, and multi-robot planning with conflicts.
The document explains how LLM+P defines actions and their effects using PDDL, a language for specifying planning problems, and presents three planning domains, Grippers, BlocksWorld, and Barman, along with several problem scenarios and their corresponding goals.
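As an illustration of the action definitions mentioned above, here is the standard `pick-up` action from the BlocksWorld domain, embedded as a Python string for inspection. Every PDDL action declares parameters, a precondition that must hold before execution, and an effect describing how the world state changes.

```python
# The pick-up action from the standard BlocksWorld PDDL domain,
# embedded as a string for illustration.
PICK_UP_ACTION = """
(:action pick-up
  :parameters (?x)
  :precondition (and (clear ?x) (ontable ?x) (handempty))
  :effect (and (not (ontable ?x)) (not (clear ?x))
               (not (handempty)) (holding ?x)))
"""

# Every PDDL action has these three sections; a quick sanity check:
for section in (":parameters", ":precondition", ":effect"):
    assert section in PICK_UP_ACTION
```

Reading the action: block `?x` can be picked up only when it is clear, on the table, and the gripper is empty; afterwards it is held and no longer on the table.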
1704-word summary
The paper presents several problem scenarios and their corresponding goals. In the first problem, the goal is to replace flat tyres with intact tyres on the hubs while ensuring that the nuts are tight and the boot is closed. The second problem involves building with blocks to achieve a height of 3 at a specific position on a grid. The third problem requires moving crates from a load area to a depot using hoists while ensuring that all crates and container store areas are connected. In the fourth problem, two robots must transport two balls to their respective rooms using their grippers. The paper details three planning domains: Grippers, BlocksWorld, and Barman. The Grippers problem involves moving balls between rooms using robots equipped with grippers. The BlocksWorld problem involves arranging blocks into a specific goal configuration. The Barman problem requires making three cocktails using a shaker, three dispensers, four shot glasses, and three ingredients. For each, the paper provides a PDDL file describing the planning problem and an optimal plan that solves it. The paper includes examples of representative failures for both the baseline methods and LLM-AS-P, but no further explanations are provided. The method involves defining actions and their effects using PDDL, a language for specifying planning problems; the resulting plans can be used to solve complex tasks, such as arranging blocks in a specific configuration. The document includes examples of PDDL actions and a sample problem domain, without boilerplate or unrelated information. The paper also discusses the use of language models as zero-shot planners for robot task planning and situation handling in open worlds.
The paper references multiple other academic papers that explore the use of large language models for object rearrangement, task and motion planning, and multi-robot planning with conflicts. It also references integrated task and motion planning in belief space and nonmonotonic reasoning for urban planning, and mentions various language models, including OPT, BERT, and ChatGPT, along with work on code evaluation, transformers for language understanding, and scaling language modeling with Pathways. The paper discusses the LLM+P framework for empowering large language models (LLMs) with optimal planning, referencing various planning systems and historical developments in artificial intelligence and robotics. It highlights the challenges LLMs face in planning and reasoning about change, and proposes ways to extend the framework, including enabling LLMs to auto-detect when and how to apply finetuning and reducing their dependency on human input. The proposed LLM+P framework aims to empower large language models with optimal planning capabilities by making them aware of a simple (problem, PDDL) example pair. The key design choice is to focus the LLM on translating the planning problem from natural language to structured PDDL format. LLM+P produces an optimal plan for the majority of problems, and it is important to make LLMs aware of the context for in-context learning. The failure cases of LLM+P are due to mis-specified problem files, such as a missing initial condition that makes the planning problem unsolvable. The hardest domains are the ones with complex spatial relationships. LLM+P performs significantly better than LLM-AS-P and can solve most complicated planning problems. The success rates of LLM-AS-P, LLM+P without context, and LLM+P are shown in Table 1, and automatic validation software is used to check whether the predicted problem PDDL file is correct for the given problem.
The paper evaluates LLM-AS-P, a baseline in which the LLM itself generates plans for the various domains. Adding context does not significantly increase LLM-AS-P's success rate, and most of the plans it generates are not feasible; of the plans it does find, the results show they are near-optimal but not optimal. The experiment setup includes an automatic, procedural generation script for each domain, and each problem comes with a natural language description and a ground-truth problem PDDL file. The domains are Blocksworld, Barman, Floortile, Grippers, Storage, Termes, and Tyreworld. The study examines how to empower Large Language Models (LLMs) with optimal planning capabilities; to address this, the authors propose combining classical planning with LLMs. They use seven planning domains frequently seen in classical planning competitions and conduct extensive experiments to answer three research questions: (1) how well does LLM-AS-P work? (2) how well does LLM+P work compared to LLM-AS-P? and (3) what role does context play in the success of LLM+P? The authors propose augmenting LLMs with external modules, such as classical planners, to improve their performance on complex tasks. The study shows that LLM+P performs much better than LLM-AS-P and that context plays a crucial role in LLM+P's success, highlighting the importance of incorporating long-horizon reasoning and knowledge from classical planners into existing LLMs. Large language models have also been developed for task planning in household domains using natural language prompts. LLMs are pretrained on a large amount of offline text, which gives them zero-shot generalization ability.
Classical planning algorithms are used for complicated planning tasks, and recent classical planning systems designed for robotics often use the Planning Domain Definition Language (PDDL) or answer set programming (ASP) as the underlying action language for the planners. Related approaches combine classical planning in discrete spaces with robot motion planning in continuous spaces. The LLM+P pipeline requires a chatbot to trigger LLM+P based on its conversation with a human user. The LLM uses in-context learning to infer the problem PDDL file corresponding to the task; once generated, this file is fed into any classical planner, together with the provided domain PDDL file, to produce a PDDL plan. Finally, the LLM translates the PDDL plan back into natural language. LLM+P thus combines Large Language Models (LLMs) with classical planners to give LLMs optimal planning capabilities. The system assumes that a human expert can provide a domain description in PDDL format, which is used to generate problem instances, and it relies on in-context learning, which allows LLMs to perform unseen downstream tasks without fine-tuning their parameters. An example problem involving moving blocks is given, and a PDDL file is generated by GPT-4 from the natural language input and context; the generated file has correct syntax but uses a made-up predicate. Overall, LLM+P provides a natural language interface for giving tasks to robot systems and automates the process of generating PDDL descriptions for problem domains. The paper explains PDDL in detail, shows how it formalizes planning problems, and describes rewriting planning prompts for LLMs in PDDL to improve their long-horizon reasoning abilities. The LLM+P method leverages symbolic planners to output correct plans.
The paper assumes a factored state space with a fixed set of variables and defines the input of a planning problem as a tuple. It includes a running example and provides two PDDL files: a domain file and a problem file. It then presents LLM+P with in-context learning: an LLM produces the PDDL description of a problem, a classical planner finds an optimal plan, and the LLM translates the raw plan back to natural language. The paper recaps the standard representation used by classical planners and how they are combined with LLMs. LLM+P is a methodology that connects Large Language Models (LLMs) to general-purpose planners to solve planning problems; the objective is to enable LLMs to solve planning problems correctly without altering their training corpus. The pipeline consists of three components: 1) the LLM outputs a problem description suitable as input to a general-purpose planner, 2) the general-purpose planner solves the problem, and 3) the LLM converts the planner's output back to natural language. While this paper focuses on planning problems, the general methodology can be applied to any class of problems for which a sound and complete solver exists. The LLM+P pipeline generates correct solutions to many more planning problems than LLMs can on their own. However, in this paper, the LLM is not asked to recognize that it has been posed a prompt suitable for processing with the LLM+P pipeline.
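The in-context learning step described above hinges on how the prompt is assembled from a solved (description, PDDL) example pair plus the new problem description. The sketch below shows one plausible assembly; the exact wording is an assumption, not the paper's verbatim prompt.

```python
# Hypothetical prompt assembly for the NL-to-PDDL translation step.
# The phrasing is illustrative, not taken from the paper.

def build_prompt(example_nl: str, example_pddl: str, new_nl: str) -> str:
    """Combine one in-context example pair with the new problem."""
    return (
        "An example planning problem is:\n"
        f"{example_nl}\n"
        "The problem PDDL file for this problem is:\n"
        f"{example_pddl}\n"
        "Now I have a new planning problem:\n"
        f"{new_nl}\n"
        "Provide the problem PDDL file that describes the new planning "
        "problem, without further explanations."
    )
```

The example pair is what gives the LLM the target format; as the paper's ablation suggests, omitting this context sharply reduces the chance that the generated problem file is valid.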
LLMs have become amazingly proficient at linguistic competence (knowing how to say things) but are not nearly as good at functional competence (knowing what to say). They are not specifically built to demonstrate understanding; rather, they are trained to generate sequences of words that might be plausible to a human given a prior context. Given how LLMs are designed and trained, this phenomenon should come as no surprise. LLMs can be (relatively) easily fooled by asking for the result of a straightforward arithmetic problem that does not appear in their training corpus, or by asking for the solution to a novel planning problem that requires knowledge about how the world works.
Large language models (LLMs) such as GPT-4 and ChatGPT have shown impressive zero-shot generalization abilities but lack true understanding. AI researchers have long sought to create programs that can converse in natural language with the same grace and flexibility as people, and while even relatively simple models can generate responses to some prompts that seem reasonable, it has always been relatively easy to construct prompts that expose their weaknesses compared to people. To address this issue, the paper introduces LLM+P, which incorporates the strengths of classical planners into LLMs: it takes in a natural language description of a planning problem, then returns a correct (or optimal) plan for solving that problem in natural language. Via a comprehensive set of experiments drawn from common planning scenarios, the researchers found that LLM+P is able to provide optimal solutions for most problems, while LLMs fail to provide even feasible plans for most problems. The text also includes an example of a GPT-4 planning failure and a contributions section with a link to code and results on GitHub.