Summary: Investigating Planning Abilities of Large Language Models (arxiv.org)
23,362 words - PDF document
One Line
Large language models (LLMs) have limited ability to generate executable plans autonomously, but their output can serve as heuristic guidance for other agents and planners; LLM-generated plans must be verified for correctness, and safety and potential bias perpetuation should be weighed carefully before using LLMs for planning.
Key Points
- Large language models (LLMs) were investigated for their planning abilities in generating plans autonomously and as heuristic guidance for other agents.
- LLMs had limited success in generating executable plans autonomously, with the best model (GPT-4) achieving an average success rate of around 12%.
- LLM-generated plans showed more promise in the heuristic mode, improving the search process for underlying planners and benefiting from external verifiers' feedback.
- While LLMs performed poorly in autonomous planning, their generated plans could assist AI planners and be refined through backprompting.
- The study highlighted the limitations and potential benefits of using LLMs in planning tasks, emphasizing the importance of verification for correctness and bias.
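The backprompting mentioned above is a generate-verify-reprompt cycle: the LLM proposes a plan, an external verifier checks it, and the verifier's error message is appended to the prompt for another attempt. A minimal sketch, in which `llm`, `verifier`, and the feedback wording are hypothetical stand-ins for the GPT-4 call and the external verifier used in the study, not the paper's actual harness:

```python
def backprompt_plan(llm, verifier, problem_prompt, max_rounds=15):
    """Ask the LLM for a plan, then re-prompt with verifier feedback.

    `llm` maps a prompt string to a candidate plan; `verifier` returns
    (is_valid, feedback_message). Both are illustrative callables.
    """
    prompt = problem_prompt
    for _ in range(max_rounds):
        plan = llm(prompt)
        valid, feedback = verifier(plan)
        if valid:
            return plan
        # Append the verifier's error message and ask for a revision.
        prompt += f"\nThe plan above is invalid: {feedback}\nPlease revise the plan."
    return None  # no valid plan within the round budget
```

The round budget matters in practice: without a cap, an LLM that never satisfies the verifier would loop (and bill) indefinitely.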
Summaries
89 word summary
This study investigated the planning abilities of large language models (LLMs) and found that while they had limited ability to generate executable plans autonomously, they showed promise as heuristic guidance for other agents. LLM-generated plans were statistically similar to those generated by humans in the logistics domain, but caution is needed to verify correctness and bias. The cost of compute for the GPT-4 model was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation should be carefully considered when using LLMs for planning.
198 word summary
This study investigated the planning abilities of large language models (LLMs) and evaluated their performance in generating plans autonomously and as heuristic guidance for other agents. The results showed that LLMs had limited ability to generate executable plans autonomously, with the best model achieving a success rate of around 12%. However, in the heuristic mode, LLM-generated plans showed promise, improving the search process for underlying planners and benefiting from external feedback. The study highlighted the limitations and potential benefits of using LLMs in planning tasks.
The study compared the performance of GPT-4 with human planners in logistics and blocksworld domains. In the logistics domain, LLM-generated plans were statistically similar to those generated by humans, but were not always correct and sometimes led to automation bias. In the blocksworld domain, the group assisted by LLM did not perform significantly better than the group without assistance. It is important to verify the correctness and bias of LLM-generated plans before implementation.
The cost of compute for the GPT-4 model in the study was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, requiring careful verification of plans.
336 word summary
The authors of this study investigated the planning abilities of large language models (LLMs) by evaluating their performance in generating plans autonomously and as heuristic guidance for other agents. They conducted a systematic study using domains similar to those in the International Planning Competition and evaluated LLMs in two modes: autonomous and heuristic. The results showed that LLMs' ability to generate executable plans autonomously was limited, with the best model (GPT-4) having an average success rate of around 12% across the domains. However, in the heuristic mode, LLM-generated plans showed more promise, improving the search process for underlying planners and benefiting from external verifiers' feedback. The findings demonstrate that while LLMs perform poorly in autonomous planning, their generated plans can assist AI planners and be refined through backprompting. The study highlights the limitations and potential benefits of using LLMs in planning tasks.
The study included a series of prompts and responses related to planning abilities in large language models. In some cases, the extracted plans matched the responses given by GPT-4, while in others, they did not. The accuracy of the GPT-4 responses varied, with some matching the ground truth plans and others deviating from them.
The study compared the performance of GPT-4 with human planners in logistics and blocksworld domains. In the logistics domain, the LLM-generated plans were found to be statistically similar to those generated by humans in terms of accuracy, time taken, and cognitive load. However, the LLM-generated plans were not always correct and sometimes led to automation bias. In the blocksworld domain, the group assisted by LLM did not perform significantly better than the group without LLM assistance. Overall, while LLMs can provide plan suggestions, it is important to verify their correctness and bias before implementation.
The cost of compute for the GPT-4 model in the study was $231 for autonomous mode experiments and $209 for back-prompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, and careful verification is necessary to ensure correct and unbiased plans.
552 word summary
In this study, the authors investigated the planning abilities of large language models (LLMs) by evaluating their performance in generating plans autonomously and as heuristic guidance for other agents. They conducted a systematic study using domains similar to those in the International Planning Competition and evaluated LLMs in two modes: autonomous and heuristic. The results showed that LLMs' ability to generate executable plans autonomously was limited, with the best model (GPT-4) having an average success rate of around 12% across the domains. However, in the heuristic mode, LLM-generated plans showed more promise, improving the search process for underlying planners and benefiting from external verifiers' feedback. The findings demonstrate that while LLMs perform poorly in autonomous planning, their generated plans can assist AI planners and be refined through backprompting. The study highlights the limitations and potential benefits of using LLMs in planning tasks.
This excerpt provides a series of prompts and responses related to planning abilities in large language models. The prompts are organized into separate paragraphs for readability.
The first set of prompts pertains to a blocksworld scenario. The initial conditions and goal are given, and the plan is requested. The GPT-4 response provides the plan as a sequence of actions. The extracted plan matches the GPT-4 response.
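To illustrate what it means for such an action sequence to be executable, the sketch below simulates a tiny blocksworld and checks each action's preconditions step by step. The four action names follow the classic domain, but the state encoding and code are an illustrative sketch, not the paper's evaluation harness:

```python
# Minimal blocksworld simulator. `on` maps each block to what it rests on
# ("table" or another block); `clear` is the set of blocks with nothing on
# top; the hand holds at most one block. Note: mutates `on` in place.
def execute(plan, on, clear):
    """Apply (action, *args) steps; return True iff every precondition holds."""
    hand = None
    for act, *args in plan:
        if act == "pickup":          # block must be clear and on the table
            (b,) = args
            if hand is not None or b not in clear or on[b] != "table":
                return False
            hand, clear = b, clear - {b}
            del on[b]
        elif act == "putdown":       # held block goes onto the table
            (b,) = args
            if hand != b:
                return False
            hand, on[b], clear = None, "table", clear | {b}
        elif act == "stack":         # put held block onto a clear block
            b, target = args
            if hand != b or target not in clear:
                return False
            hand, on[b] = None, target
            clear = (clear - {target}) | {b}
        elif act == "unstack":       # lift a clear block off another block
            b, under = args
            if hand is not None or b not in clear or on.get(b) != under:
                return False
            hand, clear = b, (clear - {b}) | {under}
            del on[b]
        else:
            return False             # unknown action
    return True
```

An "inexecutable" GPT-4 plan is simply one where some step's precondition check fails, e.g. picking up a block that still has another block on top of it.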
The second set of prompts also involves a blocksworld scenario, but with deceptively disguised names (the mystery variant). Again, the initial conditions, goal, and plan are provided. The GPT-4 response gives a plan, but the extracted plan does not match the ground truth plan.
The third set of prompts pertains to logistics planning. The domain description is given, and a one-shot prompt with the problem and plan is provided. The GPT-4 response matches the provided plan.
The fourth set of prompts is another logistics planning scenario, but with a zero-shot prompt. The GPT-4 response provides the plan, but it does not match the ground truth plan.
The fifth set of prompts involves mystery blocksworld scenarios in PDDL format. The domain description is given, and a one-shot prompt with the problem and plan is provided. The GPT-4 response matches the provided plan.
The sixth set of prompts is another mystery blocksworld scenario in PDDL format, but with a zero-shot prompt. The GPT-4 response provides the plan, but it does not match the ground truth plan.
Overall, the GPT-4 responses vary in their accuracy, with some matching the ground truth plans and others deviating from them.
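The one-shot versus zero-shot distinction running through these prompt sets comes down to whether a solved example precedes the query problem. A minimal sketch of how such prompts are assembled; the `[STATEMENT]`/`[PLAN]` markers and field names are illustrative assumptions about the prompt format, not verbatim from the paper:

```python
def build_prompt(domain_desc, problem, example=None):
    """Assemble a planning prompt; passing an example makes it one-shot."""
    parts = [domain_desc]
    if example is not None:
        # One-shot: show a solved (problem, plan) pair before the query.
        parts += ["[STATEMENT]\n" + example["problem"],
                  "[PLAN]\n" + example["plan"]]
    # The query problem, ending with an open [PLAN] marker for the model to fill.
    parts += ["[STATEMENT]\n" + problem, "[PLAN]"]
    return "\n\n".join(parts)
```

The same builder covers both natural-language and PDDL-formatted problems, since only the `domain_desc` and `problem` strings change between them.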
The authors also compared GPT-4's performance with that of human planners in the logistics and blocksworld domains. In the logistics domain, the LLM-generated plans were not statistically different from human-generated plans in terms of accuracy, time taken, and cognitive load; however, they were not always correct and sometimes induced automation bias. In the blocksworld domain, the LLM-assisted group did not perform significantly better than the group without LLM assistance. While LLMs can provide plan suggestions, those suggestions should be verified for correctness and bias before being implemented.
The cost of compute for the GPT-4 model was $231 for the autonomous-mode experiments and $209 for the backprompting experiments. Safety and potential bias perpetuation are important considerations when using LLMs for planning, and careful verification is necessary to ensure correct and unbiased plans.