Summary: An In-Depth Look at Gemini's Language Abilities (arxiv.org)
8,902 words - PDF document
One Line
Gemini Pro handles long, complex reasoning chains well and is competitive in code generation and translation, but struggles with multiple-choice answer bias, large-digit math, premature task termination, aggressive content filtering, and web navigation; the authors recommend further evaluation of the forthcoming Gemini Ultra.
Key Points
- Gemini's language abilities were compared with those of OpenAI's GPT models in a recent study.
- Gemini Pro achieved accuracy close to, but slightly below, GPT 3.5 Turbo on the benchmarked tasks.
- Gemini showed strengths in generating text in non-English languages and in handling longer, more complex reasoning chains.
- Gemini struggled with large-digit mathematical reasoning, sensitivity to multiple-choice answer ordering, and code generation.
- Overall, Gemini Pro's language abilities were comparable to but slightly weaker than GPT 3.5 Turbo's across tasks.
Summaries
54 word summary
Gemini Pro is compared with GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on language tasks. It excels at long, complex reasoning but struggles with answer-order bias, large-digit math, premature task termination, and aggressive content filtering. It performs well in code generation and translation but lags in web navigation. Further examination of Gemini Ultra is recommended.
74 word summary
The document compares Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral on language tasks. Gemini Pro performs similarly to GPT 3.5 Turbo in text understanding but falls behind GPT 4 Turbo. It excels at long, complex reasoning but struggles with answer-order bias, large-digit math, premature task termination, and aggressive content filtering. It shows strengths in code generation and translation but lags in web navigation. The study recommends examining Gemini Ultra further.
212 word summary
The document "An In-Depth Look at Gemini's Language Abilities" compares the language abilities of Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral models across various language tasks. Gemini Pro performs similarly to GPT 3.5 Turbo in text understanding but falls behind GPT 4 Turbo. It excels in long and complex reasoning tasks but struggles with bias in multiple-choice questions, large digit mathematical reasoning, premature termination of agentive tasks, and aggressive content filtering. Gemini Pro performs competitively with other models in mathematical reasoning tasks, but struggles with complex geometry and calculus problems. It achieves comparable accuracy to GPT 3.5 Turbo in knowledge-based question answering but lags behind GPT 4 Turbo. Gemini Pro performs well on shorter code generation tasks and drawing visualization but underperforms on longer solutions and tasks requiring specific libraries. It achieves competitive performance in machine translation but falls behind Google Translate and NLLB-MoE. In web navigation tasks, Gemini Pro performs slightly worse than GPT 3.5 Turbo, predicting tasks as unachievable and responding with shorter phrases. The study acknowledges limitations and recommends further examination of the upcoming Gemini Ultra edition. Overall, the study provides insights into Gemini's language abilities, highlighting its strengths and weaknesses across different tasks and contributing to the understanding of large language models in various domains.
388 word summary
The document "An In-Depth Look at Gemini's Language Abilities" compares the language abilities of Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral models across various language tasks. Gemini Pro performs similarly to GPT 3.5 Turbo in text understanding but falls behind GPT 4 Turbo. It struggles with bias in multiple-choice questions, large digit mathematical reasoning, premature termination of agentive tasks, and aggressive content filtering. However, Gemini Pro excels in long and complex reasoning tasks.
For mathematical reasoning tasks, Gemini Pro performs competitively with other models. It performs well on arithmetic and algebraic word problems but struggles with more complex problems involving geometry and calculus. Its performance is on par with GPT 3.5 Turbo but inferior to GPT 4 Turbo.
In the knowledge-based question answering task, Gemini Pro achieves comparable accuracy to GPT 3.5 Turbo but lags behind GPT 4 Turbo. The model struggles with questions that require deeper understanding and reasoning but performs better on factual questions that can be answered directly from the provided context.
In code generation tasks, Gemini Pro performs well on shorter solutions but falls behind GPT 3.5 Turbo and GPT 4 Turbo as the solution length increases. It also underperforms on tasks that require specific libraries such as mock, pandas, numpy, and datetime. However, it outperforms the other models on tasks involving drawing visualizations with matplotlib.
In the machine translation task, Gemini Pro achieves competitive performance but generally falls behind Google Translate and NLLB-MoE, a leading open-source machine translation model. GPT 4 Turbo outperforms other models on various language pairs, particularly those using the Devanagari script.
In the web navigation task, Gemini Pro performs slightly worse than GPT 3.5 Turbo. The model shows a tendency to predict tasks as unachievable, especially when given an "unachievable" hint. It also responds with shorter phrases and takes fewer steps compared to other models.
The study acknowledges several limitations, including the snapshot nature of the evaluation, dependence on specific prompts and generation parameters, and potential data leakage. The authors recommend considering Gemini Pro as a tool comparable to GPT 3.5 Turbo and await further examination of the upcoming Gemini Ultra edition.
Overall, the study provides insights into Gemini's language abilities and highlights its strengths and weaknesses across different tasks. The findings contribute to the understanding of large language models and their capabilities in various domains.
892 word summary
Gemini's language abilities were explored in depth in a recent study comparing it to OpenAI's GPT models. The study aimed to provide an objective comparison of Gemini and GPT models and identify areas where one model excelled over the other. The analysis covered 10 datasets testing various language abilities such as reasoning, answering knowledge-based questions, math problem solving, language translation, code generation, and acting as instruction-following agents.
The results showed that Gemini Pro achieved accuracy close to, but slightly below, GPT 3.5 Turbo on all benchmarked tasks. The study offered explanations for some of Gemini's underperformance, including difficulty with mathematical reasoning involving many digits, sensitivity to multiple-choice answer ordering, and aggressive content filtering, among others. However, Gemini performed strongly in certain areas, such as generating text in non-English languages and handling longer, more complex reasoning chains.
The study compared Gemini and GPT models on knowledge-based question answering tasks from the MMLU dataset. Gemini Pro performed slightly worse than GPT 3.5 Turbo overall, with GPT 4 Turbo performing better still. Gemini showed a bias towards selecting the final choice in multiple-choice questions, suggesting a lack of instruction tuning for the multiple-choice format. However, Gemini Pro performed well on tasks requiring generation in non-English languages and longer reasoning chains.
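The multiple-choice bias noted above is straightforward to probe: present the same question with the choices rotated through every position and check whether the model's pick follows the content or the slot. A minimal sketch, assuming a generic `model` callable that returns the model's text completion (a hypothetical stub, not an API from the study):

```python
def format_mmlu_prompt(question, choices):
    """Format an MMLU-style multiple-choice question."""
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def position_bias_probe(model, question, choices):
    """Rotate the choices through every slot and record the letter the
    model picks each time. A content-driven model's picks track the
    rotation; a model biased toward the final slot keeps answering
    'D' regardless of where the correct choice lands."""
    picks = []
    for shift in range(len(choices)):
        rotated = choices[shift:] + choices[:shift]
        prompt = format_mmlu_prompt(question, rotated)
        picks.append(model(prompt).strip()[:1].upper())
    return picks
```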
In the general-purpose reasoning tasks from the BIG-Bench Hard dataset, Gemini Pro achieved slightly lower accuracy than GPT 3.5 Turbo and much lower accuracy than GPT 4 Turbo. Gemini struggled with tasks involving tracking shuffled objects, often losing track of their order. However, Gemini Pro outperformed GPT 3.5 Turbo on tasks requiring world knowledge, word rearrangement, symbol manipulation, sorting words alphabetically, and parsing tables.
The mathematical reasoning abilities of Gemini Pro were evaluated on four math word problem benchmarks. Gemini Pro achieved slightly lower accuracy than GPT 3.5 Turbo on tasks with diverse language patterns but performed similarly on the MAWPS task. GPT 4 Turbo outperformed both Gemini Pro and GPT 3.5 Turbo on all tasks. Gemini Pro showed some sensitivity to question length, underperforming on longer questions compared to GPT models. However, it outperformed GPT 3.5 Turbo on more complex examples requiring longer chains of thought.
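The reasoning-chain comparisons above rely on chain-of-thought prompting. As a rough illustration (not the paper's actual prompt), a zero-shot chain-of-thought harness for a math word problem can be as small as the sketch below, where `ask_model` is a stand-in for whichever completion API is being evaluated:

```python
import re

COT_SUFFIX = ("\nLet's think step by step, then state the result "
              "on a final line as 'Answer: <number>'.")

def solve_word_problem(ask_model, question):
    """Run one zero-shot chain-of-thought query and extract the number."""
    response = ask_model(question + COT_SUFFIX)
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", response)
    if match:
        return float(match.group(1))
    # Fall back to the last number in the response, if any.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return float(numbers[-1]) if numbers else None
```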
In code generation tasks, Gemini Pro achieved a lower Pass@1 score than the GPT models on both the HumanEval and ODEX datasets, indicating that its code generation capabilities still have room for improvement.
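For reference, Pass@1 is the fraction of problems for which a generated solution passes all unit tests. The standard unbiased estimator from the HumanEval paper (Chen et al., 2021) generalizes this to pass@k when n samples are drawn per problem and c of them pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations of which c are
    correct, passes the unit tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem, pass@1 reduces to the raw pass rate:
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```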
Overall, the study found that Gemini Pro's language abilities were comparable to, but slightly weaker than, GPT 3.5 Turbo's across various tasks. Gemini demonstrated strengths in generating text in non-English languages and in handling longer reasoning chains, but faced challenges in mathematical reasoning, multiple-choice answer ordering, and code generation. The study provided valuable insight into the strengths and weaknesses of Gemini's language abilities, allowing a more objective comparison with the GPT models.
The remainder of this summary reviews the key findings task by task: the study compares Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral across text understanding, mathematical reasoning, knowledge-based question answering, code generation, machine translation, and web navigation.
In the text understanding task, Gemini Pro performs comparably to GPT 3.5 Turbo but falls behind GPT 4 Turbo. It struggles with bias in multiple-choice questions, large-digit mathematical reasoning, premature termination of agentive tasks, and aggressive content filtering. However, Gemini Pro excels in long and complex reasoning tasks.
For mathematical reasoning tasks, Gemini Pro demonstrates competitive performance with other models. It performs well on arithmetic and algebraic word problems but struggles with more complex problems involving geometry and calculus. The model's performance is on par with GPT 3.5 Turbo but inferior to GPT 4 Turbo.
In the knowledge-based question answering task, Gemini Pro achieves comparable accuracy to GPT 3.5 Turbo but lags behind GPT 4 Turbo. The model struggles with questions that require deeper understanding and reasoning. It performs better on factual questions that can be answered directly from the provided context.
In code generation tasks, Gemini Pro performs well on shorter solutions but falls behind GPT 3.5 Turbo and GPT 4 Turbo as the solution length increases. It also underperforms on tasks that require specific libraries such as mock, pandas, numpy, and datetime. However, it outperforms the other models on tasks involving drawing visualizations with matplotlib.
The machine translation task evaluates the models' multilingual ability using the FLORES-200 benchmark. Gemini Pro achieves competitive performance but generally falls behind Google Translate and NLLB-MoE, a leading open-source machine translation model. GPT 4 Turbo outperforms other models on various language pairs, particularly those using the Devanagari script.
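For context, FLORES-200 supplies aligned reference translations, so scoring reduces to comparing model output against references with an automatic metric. The summary does not say which metric the study used; the sketch below assumes chrF and BLEU as computed by the sacrebleu package, with illustrative one-sentence data:

```python
import sacrebleu

hypotheses = ["The cat sits on the mat."]          # model translations
references = [["The cat is sitting on the mat."]]  # one list per reference set

chrf = sacrebleu.corpus_chrf(hypotheses, references)
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"chrF = {chrf.score:.1f}  BLEU = {bleu.score:.1f}")
```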
In the web navigation task, Gemini Pro performs slightly worse than GPT 3.5 Turbo. The model shows a tendency to predict tasks as unachievable, especially when given an "unachievable" hint. It also responds with shorter phrases and takes fewer steps compared to other models.
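The "unachievable" behavior depends on how that hint is phrased in the agent's prompt. Below is a minimal sketch of a single step of such an agent loop; the instruction wording, action vocabulary, and `call_model` are all illustrative assumptions rather than the study's actual setup:

```python
SYSTEM = (
    "You are a web navigation agent. At each step output exactly one "
    "action: click(id), type(id, text), or stop(answer). If the task "
    'cannot be completed on this site, output stop("N/A").'
)

def next_action(call_model, task, observation, history):
    """Build the per-step prompt and return the model's chosen action."""
    prompt = (
        f"{SYSTEM}\n\n"
        f"Task: {task}\n"
        f"Previous actions: {history}\n"
        f"Current page (accessibility tree):\n{observation}\n"
        "Next action:"
    )
    # A model that over-predicts unachievability returns stop("N/A")
    # here even for feasible tasks, ending the episode early.
    return call_model(prompt).strip()
```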
The study acknowledges several limitations, including the snapshot nature of the evaluation, dependence on specific prompts and generation parameters, and potential data leakage. The authors recommend researchers and practitioners consider Gemini Pro as a tool comparable to GPT 3.5 Turbo and await further examination of the upcoming Gemini Ultra edition.
Overall, the study provides insights into Gemini's language abilities and highlights its strengths and weaknesses across different tasks. The findings contribute to the understanding of large language models and their capabilities in various domains.