Summary of "Unnatural Error Correction: GPT-4 Can Handle Scrambled Text" (arxiv.org)
9,288 words - PDF document
One Line
GPT-4 deciphers heavily scrambled text almost flawlessly, reducing edit distance by 95% and outperforming all other models tested, with the first and last letters of words proving especially important.
Key Points
- Large Language Models (LLMs), specifically GPT-4, can handle scrambled text to some extent.
- GPT-4 stands out as the only model that can almost flawlessly process inputs with unnatural errors, even under extreme conditions.
- GPT-4 can successfully reconstruct original sentences from scrambled ones, reducing the edit distance by 95%.
- GPT-4 maintains a consistently high level of performance across different scramble rates and outperforms other models.
- GPT-4 shows remarkable resilience and outperforms other models in processing inputs with unnatural errors.
- Finetuning Llama-2 models on scrambled data can significantly improve performance on tasks related to scrambled text.
- GPT-4 performed strongly on ScrRec (scrambled sentence recovery) in both few-shot and zero-shot scenarios.
- GPT-4 demonstrated strong performance on ScrQA (scrambled question answering) across different datasets.
Summaries
19 word summary
GPT-4 excels at processing scrambled text, reducing edit distance by 95% and outperforming other models in recognizing the first and last letters of words.
65 word summary
Cao et al. studied GPT-4's ability to handle scrambled text using the Scrambled Bench test suite. Results show that GPT-4 almost flawlessly processes inputs with unnatural errors, reducing edit distance by 95%. It outperforms other models, and recognizing the first and last letters of words proves especially important. Increasing layers and parameters improves comprehension of scrambled text. Finetuning on scrambled data enhances performance. This study offers insights for future research.
122 word summary
A study by Cao et al. investigates the ability of Large Language Models (LLMs), specifically GPT-4, to handle scrambled text. They introduce the Scrambled Bench, a test suite consisting of Scrambled Sentence Recovery and Scrambled Question Answering tasks. Experimental results show that GPT-4 is the only model capable of almost flawlessly processing inputs with unnatural errors, reducing the edit distance by 95%. GPT-4 maintains high performance across different scramble rates and outperforms other models, with the first and last letters of words proving especially important for recognition. Probe tasks reveal that increasing layers and parameters improves LLMs' comprehension of scrambled text. Finetuning on scrambled data enhances performance. This study provides insights into the capabilities and limitations of LLMs when faced with scrambled text, with potential for future research.
527 word summary
The study conducted by Cao et al. examines the resilience of Large Language Models (LLMs), specifically GPT-4, when confronted with scrambled text. They introduce the Scrambled Bench, a test suite consisting of two tasks: Scrambled Sentence Recovery and Scrambled Question Answering. The researchers evaluate the performance of various LLMs on these tasks using datasets such as RealtimeQA, DREAM, and AQuA-RAT.
The experimental results reveal that while most powerful LLMs have some ability to handle scrambled text, GPT-4 stands out as the only model that can almost flawlessly process inputs with unnatural errors, even under extreme conditions. It can successfully reconstruct original sentences from scrambled ones, reducing the edit distance by 95%, even when all letters within each word are scrambled. This is surprising considering the disruptive effect of scrambled text on tokenization.
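For concreteness, the "reduction in edit distance" can be read as a recovery rate: the fraction of the scrambling-induced edit distance that the model's reconstruction removes. The sketch below assumes Levenshtein distance and a recovery_rate helper of our own naming; the paper's exact metric definition may differ.

```python
# Illustrative sketch: recovery measured as the relative reduction in
# Levenshtein edit distance (an assumption, not necessarily the paper's exact metric).

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def recovery_rate(original: str, scrambled: str, recovered: str) -> float:
    """Fraction of the scrambling-induced edit distance removed by the model.
    1.0 means perfect recovery; 0.0 means no improvement over the scrambled input."""
    base = levenshtein(scrambled, original)
    if base == 0:
        return 1.0
    return (base - levenshtein(recovered, original)) / base
```

Under this reading, "reducing the edit distance by 95%" corresponds to a recovery rate of roughly 0.95.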
The authors also compare the performance of LLMs on different scramble types and rates. They find that as the scramble rate increases, the performance of most models decreases. However, GPT-4 maintains consistently high performance across different scramble rates. Similarly, when comparing different scramble types, GPT-4 outperforms other models, indicating the importance of recognizing and understanding the first and last letters of words for LLMs.
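The scramble conditions can be reproduced with a small helper that shuffles letters within each word while optionally pinning the first and/or last letter, applied to a given fraction of words (the scramble rate). The function names and settings below are illustrative assumptions, not the paper's exact procedure.

```python
import random

def scramble_word(word: str, keep_first: bool = False, keep_last: bool = False) -> str:
    """Shuffle the letters of one word, optionally keeping the first/last letter in place."""
    start = 1 if keep_first else 0
    end = len(word) - 1 if keep_last else len(word)
    if end - start < 2:
        return word  # nothing meaningful to shuffle
    inner = list(word[start:end])
    random.shuffle(inner)
    return word[:start] + "".join(inner) + word[end:]

def scramble_sentence(sentence: str, rate: float = 1.0,
                      keep_first: bool = False, keep_last: bool = False) -> str:
    """Scramble a fraction `rate` of the words in a sentence."""
    return " ".join(
        scramble_word(w, keep_first, keep_last) if random.random() < rate else w
        for w in sentence.split()
    )

# Fully scrambled words vs. words with first and last letters kept intact:
print(scramble_sentence("large language models handle scrambled text", rate=1.0))
print(scramble_sentence("large language models handle scrambled text",
                        rate=1.0, keep_first=True, keep_last=True))
```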
To gain insights into the inner workings of LLMs, the researchers use probe tasks to analyze the hidden states of Llama-2 models. They find that as the number of layers and parameters increase, LLMs tend to better comprehend scrambled text, especially when aided by unchanged surrounding context or unchanged first and last letters.
Furthermore, experiments are conducted to investigate the impact of training on scrambled text. Finetuning Llama-2 models on scrambled data significantly improves performance on tasks related to scrambled text. Models finetuned on scrambled data show better performance than the original Llama-2 model on tasks such as Scrambled Question Answering and Scrambled Sentence Recovery.
In conclusion, the study demonstrates that LLMs, particularly GPT-4, have varying degrees of capacity to handle scrambled text. GPT-4 shows remarkable resilience and outperforms other models in processing inputs with unnatural errors. The findings provide valuable insights into the capabilities and limitations of LLMs when faced with scrambled text, contributing to a better understanding of their inner workings. Future research could explore other types of disruptions to tokenization and investigate the reasons behind GPT-4's exceptional performance.
The paper also reports detailed experimental results on GPT-4's handling of scrambled text in few-shot and zero-shot scenarios, evaluated on the RealtimeQA and DREAM datasets.
In the few-shot scenario, GPT-4 performed strongly on ScrRec (scrambled sentence recovery) using the RealtimeQA dataset, outperforming other models such as GPT-3.5-turbo, text-davinci-003, Falcon-180b, Falcon-180b-chat, Falcon-40b, Llama-2-70b, Llama-2-70b-chat, Llama-2-13b, Llama-2-13b-chat, MPT-30b, UL2, Flan-UL2, Flan-T5-xxl, and ByT5-xxl.
In the zero-shot scenario, GPT-4 also showed strong ScrRec results on the RealtimeQA dataset, scoring highly relative to other models across metrics.
The study further evaluated GPT-4's performance in zero-shot ScrQA (scrambled question answering) using the RealtimeQA and DREAM datasets. GPT-4 demonstrated strong performance on both datasets, outperforming other models.
The study also analyzed the performance of different categories of questions in zero-shot ScrQA using the DREAM dataset, with GPT-4 scoring highly across all categories.
697 word summary
In their study, Cao et al. investigate the resilience of Large Language Models (LLMs), specifically GPT-4, when faced with extensively scrambled text. They propose the Scrambled Bench, a test suite consisting of two tasks: Scrambled Sentence Recovery and Scrambled Question Answering. The researchers evaluate the performance of various LLMs on these tasks using datasets such as RealtimeQA, DREAM, and AQuA-RAT.
The experimental results reveal that the most powerful LLMs can handle scrambled text to some extent. However, GPT-4 stands out as the only model that can almost flawlessly process inputs with unnatural errors, even under extreme conditions. It can successfully reconstruct original sentences from scrambled ones, reducing the edit distance by 95%, even when all letters within each word are scrambled. This resilience is counter-intuitive, considering the disruptive effect of scrambled text on tokenization.
The authors also compare the performance of LLMs on different scramble types and rates. They find that as the scramble rate increases, the performance of most models decreases. However, GPT-4 maintains a consistently high level of performance across different scramble rates. Similarly, when comparing different scramble types, GPT-4 outperforms other models, showing that the first and last letters of words are particularly important for LLMs to recognize and understand text.
To gain insights into the inner workings of LLMs, the researchers use probe tasks to analyze the hidden states of Llama-2 models. They find that as the number of layers and parameters increase, LLMs tend to better comprehend scrambled text, especially when aided by unchanged surrounding context or unchanged first and last letters.
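A common way to run such probe tasks is to train a lightweight classifier on frozen hidden states from each layer and compare held-out accuracy across layers and model sizes. The sketch below assumes the activations have already been extracted; it is illustrative and not the paper's exact probing setup.

```python
# Layer-wise linear probe sketch: fit a simple classifier on frozen hidden states.
# Assumes hidden_states[layer] is an (n_examples, hidden_dim) array extracted beforehand.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(features: np.ndarray, labels: np.ndarray) -> float:
    """Train a linear probe on one layer's activations and return held-out accuracy."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)

# accuracies = {layer: probe_layer(feats, labels)
#               for layer, feats in hidden_states.items()}
```

Higher probe accuracy at deeper layers or in larger models would indicate that those representations encode more of the unscrambled content.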
Furthermore, the authors conduct experiments to investigate the impact of training on scrambled text. They find that finetuning Llama-2 models on scrambled data can significantly improve performance on tasks related to scrambled text. Specifically, models finetuned on scrambled data show better performance than the original Llama-2 model on tasks such as Scrambled Question Answering and Scrambled Sentence Recovery.
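Building such finetuning data amounts to pairing scrambled sentences with their originals as input/target examples. The sketch below reuses the scramble_sentence helper sketched earlier and writes an instruction-style JSONL file; the record format is an assumption, not the paper's.

```python
import json

def build_recovery_examples(sentences, rate=1.0, keep_first=False, keep_last=False):
    """Turn plain sentences into (scrambled input -> original output) recovery examples."""
    return [
        {
            "instruction": "Recover the original sentence from the scrambled text.",
            "input": scramble_sentence(s, rate, keep_first, keep_last),
            "output": s,
        }
        for s in sentences
    ]

with open("scrambled_recovery.jsonl", "w") as f:
    for rec in build_recovery_examples(["the quick brown fox jumps over the lazy dog"]):
        f.write(json.dumps(rec) + "\n")
```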
In conclusion, the study demonstrates that LLMs, particularly GPT-4, have the capacity to handle scrambled text to varying degrees. GPT-4 shows remarkable resilience and outperforms other models in processing inputs with unnatural errors. The findings provide valuable insights into the capabilities and limitations of LLMs when faced with scrambled text, contributing to a better understanding of their inner workings. Future research could explore other types of disruptions to tokenization and investigate the reasons behind GPT-4's exceptional performance.
The paper then presents detailed experimental results on GPT-4's handling of scrambled text in few-shot and zero-shot scenarios, evaluated on the RealtimeQA and DREAM datasets.
In the few-shot scenario, GPT-4 performed strongly on ScrRec (scrambled sentence recovery) using the RealtimeQA dataset, outperforming other models such as GPT-3.5-turbo, text-davinci-003, Falcon-180b, Falcon-180b-chat, Falcon-40b, Llama-2-70b, Llama-2-70b-chat, Llama-2-13b, Llama-2-13b-chat, MPT-30b, UL2, Flan-UL2, Flan-T5-xxl, and ByT5-xxl. GPT-4 scored 73.04% on the ED (edit distance) metric and 79.83% on the RR (recovery rate) metric.
In the zero-shot scenario, GPT-4 also showed strong ScrRec results on the RealtimeQA dataset, scoring highly relative to other models across metrics: 97.69% on the ED metric and 100% on the RR metric.
The study further evaluated GPT-4's performance in zero-shot ScrQA (scrambled question answering) using the RealtimeQA and DREAM datasets. GPT-4 demonstrated strong performance on both datasets, outperforming other models. On the RealtimeQA dataset, GPT-4 scored 96.68% on the Acc (accuracy) metric and 76.09% on the RPG (relative performance gain) metric. On the DREAM dataset, GPT-4 scored 100% on both the Acc and RPG metrics.
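Reading RPG as a relative performance gain, i.e. the share of above-chance accuracy a model retains when its input is scrambled, it can be sketched as below; the helper name and the exact formula are assumptions and may not match the paper's definition.

```python
def relative_performance_gain(acc_scrambled: float,
                              acc_original: float,
                              acc_random: float) -> float:
    """Share of above-chance accuracy retained on scrambled input.
    1.0 = no degradation versus unscrambled input; 0.0 = reduced to chance."""
    denom = acc_original - acc_random
    if denom <= 0:
        raise ValueError("original accuracy must exceed the random baseline")
    return (acc_scrambled - acc_random) / denom

# Example with 4-option multiple choice (chance accuracy = 0.25):
print(relative_performance_gain(acc_scrambled=0.70, acc_original=0.80, acc_random=0.25))
```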
The study also analyzed performance on different categories of questions in zero-shot ScrQA using the DREAM dataset. GPT-4 showed high accuracy across all categories, including arithmetic, commonsense, logic, matching, and summary, reaching 100% in every category.
Overall, the experimental results demonstrated that GPT-4 has the ability to handle scrambled text in both few-shot and zero-shot scenarios. It outperformed other models in accuracy and demonstrated strong performance across different metrics and question categories. These findings highlight the potential of GPT-4 for unnatural error correction tasks.