Summary: Comparing Humans and GPT-4 on Abstraction Tasks (arxiv.org)
3,776 words - PDF document
One Line
On the ConceptARC benchmark, GPT-4's abstract reasoning abilities fall short of those of both humans and special-purpose algorithms.
Key Points
- GPT-4 and GPT-4V were evaluated on the ConceptARC benchmark to assess their abstract reasoning abilities.
- GPT-4's performance improved with more detailed prompts, but it still fell short of human performance.
- GPT-4V, the multimodal version, performed worse than the text-only version on visual tasks.
- The results suggest that GPT-4 and GPT-4V lack robust abstraction abilities compared to humans.
- Further research is needed to explore alternative prompting methods and task representations.
Summaries
19 word summary
GPT-4's abstract reasoning abilities were evaluated using the ConceptARC benchmark, showing it falls short of humans and special-purpose algorithms.
73 word summary
This paper evaluates GPT-4's abstract reasoning abilities using the ConceptARC benchmark. The results show that neither the text nor image version of GPT-4 has achieved robust abstraction abilities comparable to humans. Providing more detailed instructions and examples improves GPT-4's performance on text tasks, but it still falls short of human and special-purpose algorithm performance. GPT-4V, the multimodal version, performs even worse, highlighting the significant gap in abstract reasoning between humans and AI systems.
124 word summary
This paper evaluates the abstract reasoning abilities of GPT-4, a large pre-trained language model, using the ConceptARC benchmark. The authors assess GPT-4's performance under detailed, one-shot prompting with text and image versions of the tasks. The results indicate that neither version of GPT-4 has achieved robust abstraction abilities comparable to humans. While some argue that large pre-trained language models can develop emergent reasoning abilities, doubts remain about whether these models truly form humanlike abstractions. The experiments show that providing more detailed instructions and examples improves GPT-4's performance on text tasks, but it still falls short of human and special-purpose algorithm performance. The multimodal version, GPT-4V, performs even worse, highlighting the significant gap in abstract reasoning between humans and AI systems like GPT-4.
405 word summary
This paper examines the abstract reasoning abilities of GPT-4, a large pre-trained language model, using the ConceptARC benchmark. The authors evaluate GPT-4 under detailed, one-shot prompting with text versions of ConceptARC tasks, as well as the multimodal version of GPT-4 (GPT-4V) on image versions of simpler tasks. The results indicate that neither version of GPT-4 has achieved robust abstraction abilities at humanlike levels.
Abstract reasoning is a crucial aspect of human intelligence, involving the ability to induce rules or patterns from limited data and apply them to new situations. While some researchers argue that large pre-trained language models can develop emergent reasoning abilities, the underlying mechanisms are not well understood. There are doubts about whether these models truly form humanlike abstractions, as they may rely on learning complex patterns of associations rather than generalizable abstract reasoning.
To assess GPT-4's abstract reasoning capabilities, the authors evaluate its performance on ConceptARC benchmark tasks. They find that providing more detailed instructions and a solved example improves GPT-4's performance on text versions of the tasks. However, it still falls significantly short of human and special-purpose algorithm performance. This suggests a lack of robust abstraction abilities in GPT-4.
To address limitations in previous evaluations, the authors conduct two sets of experiments. The first set focuses on the text-only version of GPT-4, using a more expressive prompt with instructions and a solved example. This improves GPT-4's performance compared to previous work but remains below human performance. In the second set, they evaluate GPT-4V using visual versions of simpler ConceptARC tasks. GPT-4V performs substantially worse than the text-only version, reinforcing the conclusion that there is a significant gap in abstract reasoning between humans and state-of-the-art AI systems.
The authors also discuss the Abstraction and Reasoning Corpus (ARC), a benchmark for evaluating abstract reasoning in humans and machines, which consists of analogy puzzles testing general abstract reasoning capabilities. In previous evaluations, GPT-4's accuracy on ConceptARC tasks was around 10-12%, much lower than human accuracy (84%). The authors address two limitations of those evaluations: the simplicity of the prompt format, and the potentially unfair comparison between humans solving tasks visually and language models receiving text-only prompts.
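As a concrete illustration of this puzzle format, the snippet below encodes a toy ARC-style task as "train" demonstration pairs plus a "test" input (grids are small 2-D arrays of integers 0-9 denoting colours) and checks a hand-written candidate rule against the demonstrations. The specific grids and rule are hypothetical stand-ins for illustration, not an actual ConceptARC task.

```python
# Toy ARC-style task: a few train input->output demonstrations plus a test
# input whose output must be produced by inducing the hidden rule.
task = {
    "train": [
        {"input":  [[0, 0, 3],
                    [0, 3, 0]],
         "output": [[3, 0, 0],
                    [0, 3, 0]]},
        {"input":  [[5, 0, 0],
                    [0, 0, 5]],
         "output": [[0, 0, 5],
                    [5, 0, 0]]},
    ],
    "test": [
        {"input": [[0, 7, 0],
                   [7, 0, 0]]}
    ],
}

def apply_rule(grid):
    """The abstract rule a solver must induce here: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Verify the candidate rule against the demonstrations, then apply it to the test input.
assert all(apply_rule(pair["input"]) == pair["output"] for pair in task["train"])
print(apply_rule(task["test"][0]["input"]))  # -> [[0, 7, 0], [0, 0, 7]]
```

Solving a task amounts to inducing such a rule from the demonstrations alone and applying it flexibly to the test input.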
In summary, the experiments demonstrate that GPT-4's performance improves with more detailed instructions and examples, but it still falls short of human performance. The multimodal version, GPT-4V, performs even worse. These findings highlight the significant gap in abstract reasoning between humans and AI systems like GPT-4.
639 word summary
This paper explores the abstract reasoning abilities of GPT-4, a large pre-trained language model, using the ConceptARC benchmark. The benchmark is designed to evaluate robust understanding and reasoning with core-knowledge concepts. The authors extend previous work by evaluating GPT-4 on more detailed, one-shot prompting with text versions of ConceptARC tasks. They also evaluate GPT-4V, the multimodal version of GPT-4, using image versions of the simplest tasks. The experimental results show that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.
Abstract reasoning is the ability to induce a rule or pattern from limited data or experience and apply it to new situations. It is a key aspect of human intelligence. Some researchers have claimed that large pre-trained language models can develop emergent abilities for reasoning, pattern recognition, and analogy-making. However, the internal mechanisms underlying these abilities are not well understood, and there are doubts about whether these systems actually form humanlike abstractions. It has been suggested that these models rely on learning complex patterns of associations in their training data rather than generalizable abstract reasoning.
Abilities for creating and reasoning with abstract representations are fundamental to robust generalization. Therefore, it is important to understand the extent to which language models have achieved such abilities. The authors of this paper evaluate GPT-4 on tasks in the ConceptARC benchmark to assess its abstract reasoning capabilities. They find that GPT-4's performance on the text versions of the tasks improves when more detailed instructions and a simple solved example are provided. However, its performance remains significantly below that of humans and special-purpose algorithms for solving these tasks. The authors argue that this indicates a lack of robust abstraction abilities in GPT-4.
To address limitations in previous evaluations, the authors conduct two sets of experiments. In the first set, they evaluate the text-only version of GPT-4 using a more expressive prompt that includes instructions and a solved example. This one-shot prompting method improves GPT-4's performance on the text versions of the ConceptARC tasks compared to previous work. However, its performance still falls short of human performance. In the second set of experiments, the authors evaluate GPT-4V, the multimodal version of GPT-4, using visual versions of the simplest ConceptARC tasks. They find that GPT-4V performs substantially worse than the text-only version. These results reinforce the conclusion that there is a large gap in basic abstract reasoning between humans and state-of-the-art AI systems.
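To make this one-shot setup concrete, the sketch below assembles such a prompt from an instruction string, one fully solved example task, and the target task, with grids serialized as text. The instruction wording, the grid-to-text encoding, and the helper names (grid_to_text, format_task, build_one_shot_prompt) are illustrative assumptions, not the authors' exact prompt.

```python
# Hedged sketch of one-shot text prompting for ConceptARC-style tasks.
# Tasks are assumed to be dicts with "train" (input/output grid pairs) and
# "test" entries (the solved example also carries a test output).

def grid_to_text(grid):
    """Serialize a grid as rows of space-separated digits (an assumed text encoding)."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def format_task(task, include_test_output=False):
    """Render a task's demonstrations and test input as plain text."""
    parts = []
    for i, pair in enumerate(task["train"], start=1):
        parts.append(f"Demonstration {i} input:\n{grid_to_text(pair['input'])}")
        parts.append(f"Demonstration {i} output:\n{grid_to_text(pair['output'])}")
    parts.append(f"Test input:\n{grid_to_text(task['test'][0]['input'])}")
    if include_test_output:
        parts.append(f"Test output:\n{grid_to_text(task['test'][0]['output'])}")
    return "\n\n".join(parts)

INSTRUCTIONS = (
    "In each puzzle, input grids are transformed into output grids by a hidden "
    "rule. Infer the rule from the demonstrations and give the output grid for "
    "the test input."
)

def build_one_shot_prompt(solved_example, target_task):
    """Combine instructions, one fully solved example, and the target task."""
    return "\n\n".join([
        INSTRUCTIONS,
        "Here is a solved example puzzle:",
        format_task(solved_example, include_test_output=True),
        "Now solve this new puzzle:",
        format_task(target_task),
    ])
```

The resulting string would then be sent to the model as a single message, and its text output parsed back into a grid for scoring against the correct answer.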
The authors also provide insights into the Abstraction and Reasoning Corpus (ARC), which is a benchmark for evaluating abstract reasoning abilities in humans and machines. They explain that ARC consists of analogy puzzles that test general abstract reasoning capabilities. The tasks require the solver to induce the abstract rule underlying the demonstrations and apply it to the test input to generate a transformed grid. The authors note that the prior knowledge needed for solving these tasks is a subset of the core knowledge systems hypothesized to be innate in humans. ARC aims to capture the crux of abstract reasoning by testing the ability to induce general rules or patterns from limited examples and apply them flexibly to new situations.
Previous evaluations of language models on the ConceptARC benchmark have reported accuracies of around 10-12% on straightforward text versions of the tasks, whereas limited studies of human performance on subsets of ARC tasks have reported much higher accuracies, such as 84%. The authors note that GPT-4's performance on ConceptARC tasks, as evaluated by Moskvichev et al., was substantially worse than human performance. However, they identify two important limitations in that evaluation: the prompt format used was overly simple, and it may not be fair to compare humans, who solve the tasks visually, with LLMs given only text-based prompts. The authors address these limitations in their experiments.
In conclusion, the authors' experiments show that the text-only version of GPT-4 performs better when given more detailed instructions and a solved example, but its performance still falls well short of that of humans and special-purpose algorithms, and GPT-4V performs substantially worse on image versions of the tasks. These results underscore the large gap in basic abstract reasoning between humans and state-of-the-art AI systems.