Summary: Investigating Planning Abilities of Large Language Models (arxiv.org)
23,362 words - PDF document
One Line
Large language models (LLMs) have limited ability to generate executable plans autonomously, but their output can serve as heuristic guidance for other agents and planners; LLM-generated plans must be verified for correctness, and safety and potential bias perpetuation should be weighed carefully before using LLMs for planning.
Key Points
- Large language models (LLMs) were investigated for their planning abilities in generating plans autonomously and as heuristic guidance for other agents.
- LLMs had limited success in generating executable plans autonomously, with the best model (GPT-4) achieving an average success rate of around 12%.
- LLM-generated plans showed more promise in the heuristic mode, improving the search process for underlying planners and benefiting from external verifiers' feedback.
- While LLMs performed poorly in autonomous planning, their generated plans could assist AI planners and be refined through backprompting.
- The study highlighted the limitations and potential benefits of using LLMs in planning tasks, emphasizing the importance of verification for correctness and bias.
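The backprompting mentioned above is a generate-verify-reprompt cycle: the LLM proposes a plan, an external verifier checks it, and the verifier's error message is appended to the prompt for another attempt. A minimal sketch, in which `llm`, `verifier`, and the feedback wording are hypothetical stand-ins for the GPT-4 call and the external verifier used in the study, not the paper's actual harness:

```python
def backprompt_plan(llm, verifier, problem_prompt, max_rounds=15):
    """Ask the LLM for a plan, then re-prompt with verifier feedback.

    `llm` maps a prompt string to a candidate plan; `verifier` returns
    (is_valid, feedback_message). Both are illustrative callables.
    """
    prompt = problem_prompt
    for _ in range(max_rounds):
        plan = llm(prompt)
        valid, feedback = verifier(plan)
        if valid:
            return plan
        # Append the verifier's error message and ask for a revision.
        prompt += f"\nThe plan above is invalid: {feedback}\nPlease revise the plan."
    return None  # no valid plan within the round budget
```

The round budget matters in practice: without a cap, an LLM that never satisfies the verifier would loop (and bill) indefinitely.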
Summaries
89 word summary
This study investigated the planning abilities of large language models (LLMs) and found that while they had limited ability to generate executable plans autonomously, they showed promise as heuristic guidance for other agents. LLM-generated plans were statistically similar to those generated by humans in the logistics domain, but caution is needed to verify correctness and bias. The cost of compute for the GPT-4 model was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation should be carefully considered when using LLMs for planning.
198 word summary
This study investigated the planning abilities of large language models (LLMs) and evaluated their performance in generating plans autonomously and as heuristic guidance for other agents. The results showed that LLMs had limited ability to generate executable plans autonomously, with the best model achieving a success rate of around 12%. However, in the heuristic mode, LLM-generated plans showed promise, improving the search process for underlying planners and benefiting from external feedback. The study highlighted the limitations and potential benefits of using LLMs in planning tasks.
The study compared the performance of GPT-4 with human planners in logistics and blocksworld domains. In the logistics domain, LLM-generated plans were statistically similar to those generated by humans, but were not always correct and sometimes led to automation bias. In the blocksworld domain, the group assisted by LLM did not perform significantly better than the group without assistance. It is important to verify the correctness and bias of LLM-generated plans before implementation.
The cost of compute for the GPT-4 model in the study was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, requiring careful verification of plans.
336 word summary
The authors of this study investigated the planning abilities of large language models (LLMs) by evaluating their performance in generating plans autonomously and as heuristic guidance for other agents. They conducted a systematic study using domains similar to those in the International Planning Competition and evaluated LLMs in two modes: autonomous and heuristic. The results showed that LLMs' ability to generate executable plans autonomously was limited, with the best model (GPT-4) having an average success rate of around 12% across the domains. However, in the heuristic mode, LLM-generated plans showed more promise, improving the search process for underlying planners and benefiting from external verifiers' feedback. The findings demonstrate that while LLMs perform poorly in autonomous planning, their generated plans can assist AI planners and be refined through backprompting. The study highlights the limitations and potential benefits of using LLMs in planning tasks.
The study included a series of prompts and responses related to planning abilities in large language models. In some cases, the extracted plans matched the responses given by GPT-4, while in others, they did not. The accuracy of the GPT-4 responses varied, with some matching the ground truth plans and others deviating from them.
The study compared the performance of GPT-4 with human planners in logistics and blocksworld domains. In the logistics domain, the LLM-generated plans were found to be statistically similar to those generated by humans in terms of accuracy, time taken, and cognitive load. However, the LLM-generated plans were not always correct and sometimes led to automation bias. In the blocksworld domain, the group assisted by LLM did not perform significantly better than the group without LLM assistance. Overall, while LLMs can provide plan suggestions, it is important to verify their correctness and bias before implementation.
The cost of compute for the GPT-4 model in the study was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, and careful verification is necessary to ensure correct and unbiased plans.
552 word summary
In this study, the authors investigated the planning abilities of large language models (LLMs) by evaluating their performance in generating plans autonomously and as heuristic guidance for other agents. They conducted a systematic study using domains similar to those in the International Planning Competition and evaluated LLMs in two modes: autonomous and heuristic. The results showed that LLMs' ability to generate executable plans autonomously was limited, with the best model (GPT-4) having an average success rate of around 12% across the domains. However, in the heuristic mode, LLM-generated plans showed more promise, improving the search process for underlying planners and benefiting from external verifiers' feedback. The findings demonstrate that while LLMs perform poorly in autonomous planning, their generated plans can assist AI planners and be refined through backprompting. The study highlights the limitations and potential benefits of using LLMs in planning tasks.
This excerpt provides a series of prompts and responses related to planning abilities in large language models. The prompts are organized into separate paragraphs for readability.
The first set of prompts pertains to a blocksworld scenario. The initial conditions and goal are given, and the plan is requested. The GPT-4 response provides the plan as a sequence of actions. The extracted plan matches the GPT-4 response.
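To illustrate what it means for such an action sequence to be executable, the sketch below simulates a tiny blocksworld and checks each action's preconditions step by step. The four action names follow the classic domain, but the state encoding and code are an illustrative sketch, not the paper's evaluation harness:

```python
# Minimal blocksworld simulator. `on` maps each block to what it rests on
# ("table" or another block); `clear` is the set of blocks with nothing on
# top; the hand holds at most one block. Note: mutates `on` in place.
def execute(plan, on, clear):
    """Apply (action, *args) steps; return True iff every precondition holds."""
    hand = None
    for act, *args in plan:
        if act == "pickup":          # block must be clear and on the table
            (b,) = args
            if hand is not None or b not in clear or on[b] != "table":
                return False
            hand, clear = b, clear - {b}
            del on[b]
        elif act == "putdown":       # held block goes onto the table
            (b,) = args
            if hand != b:
                return False
            hand, on[b], clear = None, "table", clear | {b}
        elif act == "stack":         # put held block onto a clear block
            b, target = args
            if hand != b or target not in clear:
                return False
            hand, on[b] = None, target
            clear = (clear - {target}) | {b}
        elif act == "unstack":       # lift a clear block off another block
            b, under = args
            if hand is not None or b not in clear or on.get(b) != under:
                return False
            hand, clear = b, (clear - {b}) | {under}
            del on[b]
        else:
            return False             # unknown action
    return True
```

An "inexecutable" GPT-4 plan is simply one where some step's precondition check fails, e.g. picking up a block that still has another block on top of it.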
The second set of prompts also involves a blocksworld scenario, but with deceptively disguised names (the mystery variant). Again, the initial conditions, goal, and plan are provided. The GPT-4 response gives a plan, but the extracted plan does not match the ground truth plan.
The third set of prompts pertains to logistics planning. The domain description is given, and a one-shot prompt with the problem and plan is provided. The GPT-4 response matches the provided plan.
The fourth set of prompts is another logistics planning scenario, but with a zero-shot prompt. The GPT-4 response provides the plan, but it does not match the ground truth plan.
The fifth set of prompts involves mystery blocksworld scenarios in PDDL format. The domain description is given, and a one-shot prompt with the problem and plan is provided. The GPT-4 response matches the provided plan.
The sixth set of prompts is another mystery blocksworld scenario in PDDL format, but with a zero-shot prompt. The GPT-4 response provides the plan, but it does not match the ground truth plan.
Overall, the GPT-4 responses vary in their accuracy, with some matching the ground truth plans and others deviating from them.
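The one-shot versus zero-shot distinction running through these prompt sets comes down to whether a solved example precedes the query problem. A minimal sketch of how such prompts are assembled; the `[STATEMENT]`/`[PLAN]` markers and field names are illustrative assumptions about the prompt format, not verbatim from the paper:

```python
def build_prompt(domain_desc, problem, example=None):
    """Assemble a planning prompt; passing an example makes it one-shot."""
    parts = [domain_desc]
    if example is not None:
        # One-shot: show a solved (problem, plan) pair before the query.
        parts += ["[STATEMENT]\n" + example["problem"],
                  "[PLAN]\n" + example["plan"]]
    # The query problem, ending with an open [PLAN] marker for the model to fill.
    parts += ["[STATEMENT]\n" + problem, "[PLAN]"]
    return "\n\n".join(parts)
```

The same builder covers both natural-language and PDDL-formatted problems, since only the `domain_desc` and `problem` strings change between them.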
The authors also compared GPT-4's performance with that of human planners in the logistics and blocksworld domains. In the logistics domain, the LLM-generated plans were not statistically different from human-generated plans in terms of accuracy, time taken, and cognitive load; however, they were not always correct and sometimes induced automation bias. In the blocksworld domain, the LLM-assisted group did not perform significantly better than the group without LLM assistance. While LLMs can provide plan suggestions, those suggestions should be verified for correctness and bias before being implemented.
The cost of compute for the GPT-4 model was $231 for the autonomous-mode experiments and $209 for the backprompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, and careful verification is necessary to ensure correct and unbiased plans.