Summary: Magicoder, Open-source Large Language Models for Code (arxiv.org)
8,169 words - PDF document
One Line
Magicoder models are trained on coding challenges generated from open-source code, a method that produces diverse, low-bias data and lets them outperform larger models on code generation tasks.
Key Points
- Magicoder is a series of fully open-source large language models (LLMs) for code generation that significantly close the gap with top code models while having no more than 7B parameters
- OSS-INSTRUCT is a novel approach to generating high-quality instruction-response pairs for training Magicoder models by prompting LLMs to create coding problems and solutions based on open-source code snippets
- The Magicoder models, including Magicoder-CL and MagicoderS-CL, are trained using the OSS-INSTRUCT data and achieve state-of-the-art performance on a wide range of code generation benchmarks, including HumanEval, MBPP, MultiPL-E, and DS-1000
- Ablation studies show that instruction tuning on diverse programming languages can boost the overall coding ability, and the OSS-INSTRUCT approach is superior to direct finetuning on comment-function pairs
- The authors fully open-source the Magicoder model weights, training data, and source code to facilitate future research in LLMs for code
Summaries
25 word summary
OSS-INSTRUCT generates coding challenges from open-source code to train large language models, producing diverse, low-bias data. The resulting Magicoder models outperform larger models on code generation.
46 word summary
Magicoder models are trained on high-quality coding challenges generated from open-source code. The OSS-INSTRUCT method produces diverse, low-bias data. The resulting models, including Magicoder-CL and MagicoderS-CL, outperform larger models on code generation tasks. MagicoderS-CL achieves comparable performance to a 34B model using only 7B parameters.
116 word summary
Magicoder is a series of open-source large language models (LLMs) for code, instruction-tuned on high-quality coding challenges generated from open-source code snippets. The key contributions are the OSS-INSTRUCT method, which produces diverse, low-bias, and high-quality training data, and the Magicoder models, including Magicoder-CL and MagicoderS-CL, which outperform larger models on code generation tasks. The Magicoder models are evaluated on various benchmarks, demonstrating superior performance compared to state-of-the-art models. MagicoderS-CL, which combines OSS-INSTRUCT with the Evol-Instruct technique, achieves comparable performance to the 34B WizardCoder model while using only 7B parameters. The authors' open-sourcing of the model weights, training data, and source code aims to facilitate further research and development in the field of LLMs for code.
310 word summary
Magicoder: Open-source Large Language Models for Code
Magicoder is a series of open-source large language models (LLMs) for code, instruction-tuned on high-quality coding challenges generated from open-source code snippets. The data generation method, called OSS-INSTRUCT, enables Magicoder to outperform existing LLMs on various code generation benchmarks.
The key contributions of this work are:
1. OSS-INSTRUCT: Code snippets extracted from open-source repositories are used to prompt an LLM, which generates coding problems together with their solutions. This approach produces diverse, low-bias, and high-quality training data for instruction tuning.
2. Magicoder Models: The Magicoder models, including Magicoder-CL and MagicoderS-CL, are instruction-tuned on the OSS-INSTRUCT data. Despite having no more than 7B parameters, they outperform larger models like the 15B WizardCoder on code generation tasks.
The Magicoder models are evaluated on a wide range of benchmarks, including multilingual code generation (MultiPL-E), data science coding (DS-1000), and program synthesis (HumanEval, MBPP). The results demonstrate the superior performance of Magicoder compared to state-of-the-art models.
On the MultiPL-E benchmark, Magicoder-CL outperforms the base CodeLLAMA-Python-7B model across all studied programming languages. MagicoderS-CL, which combines OSS-INSTRUCT with the Evol-Instruct technique, achieves comparable performance to the 34B WizardCoder model while using only 7B parameters.
On the DS-1000 dataset, Magicoder-CL-7B outperforms all baselines, including the state-of-the-art WizardCoder models. MagicoderS-CL-7B further improves upon this, achieving an 8.3 percentage point absolute improvement over the 15B WizardCoder-SC.
The authors also compare Magicoder with the recently released DeepSeek-Coder models. Despite DeepSeek-Coder's impressive performance, the Magicoder variants, particularly MagicoderS-DS, are able to surpass DeepSeek-Coder-Instruct-6.7B on the HumanEval and MBPP benchmarks while using only a fraction of the training tokens.
In conclusion, the Magicoder models, enabled by the novel OSS-INSTRUCT data generation method, demonstrate state-of-the-art performance on a wide range of code generation tasks. The authors' open-sourcing of the model weights, training data, and source code aims to facilitate further research and development in the field of LLMs for code.
438 word summary
Magicoder: Open-source Large Language Models for Code
Magicoder is a series of open-source large language models (LLMs) for code, instruction-tuned on high-quality coding challenges generated from open-source code snippets. The data generation method, called OSS-INSTRUCT, enables Magicoder to significantly outperform existing LLMs on a range of code generation benchmarks.
The key contributions of this work are:
1. OSS-INSTRUCT: Code snippets extracted from open-source repositories are used to prompt an LLM, which generates coding problems together with their solutions. This approach produces diverse, low-bias, and high-quality training data for instruction tuning.
2. Magicoder Models: The Magicoder models, including Magicoder-CL and MagicoderS-CL, are instruction-tuned on the OSS-INSTRUCT data. Despite having no more than 7B parameters, they outperform larger models like the 15B WizardCoder on various code generation tasks.
3. Comprehensive Evaluation: Magicoder is evaluated on a wide range of benchmarks, including multilingual code generation (MultiPL-E), data science coding (DS-1000), and program synthesis (HumanEval, MBPP). The results demonstrate the superior performance of Magicoder compared to state-of-the-art models.
The Magicoder models are trained using the OSS-INSTRUCT approach, which generates instruction-response pairs from open-source code snippets. This method allows the models to learn from real-world coding examples, leading to significant performance improvements compared to existing LLMs.
On the MultiPL-E benchmark, Magicoder-CL outperforms the base CodeLLAMA-Python-7B model by a large margin across all studied programming languages. Moreover, MagicoderS-CL, which combines OSS-INSTRUCT with the Evol-Instruct technique, achieves comparable performance to the 34B WizardCoder model while using only 7B parameters.
The authors also evaluate Magicoder on the DS-1000 dataset, which assesses code generation for data science tasks. The results show that Magicoder-CL-7B outperforms all the baselines, including the state-of-the-art WizardCoder models. MagicoderS-CL-7B further improves upon this, achieving an 8.3 percentage point absolute improvement over the 15B WizardCoder-SC.
The authors conduct ablation studies to understand the impact of the training data distribution on the model's performance. They find that instruction tuning on different programming languages can boost the overall coding ability, even for out-of-distribution languages. Additionally, the authors compare OSS-INSTRUCT with direct finetuning on comment-function pairs, demonstrating the superiority of the OSS-INSTRUCT approach in terms of data quality and model performance.
The authors also compare Magicoder with the recently released DeepSeek-Coder models. Despite DeepSeek-Coder's impressive performance, the Magicoder variants, particularly MagicoderS-DS, are able to surpass DeepSeek-Coder-Instruct-6.7B on the HumanEval and MBPP benchmarks while using only a fraction of the training tokens.
In conclusion, the Magicoder models, enabled by the novel OSS-INSTRUCT data generation method, demonstrate state-of-the-art performance on a wide range of code generation tasks. The authors' open-sourcing of the model weights, training data, and source code aims to facilitate further research and development in the field of LLMs for code.
1166 word summary
Magicoder: Open-source Large Language Models for Code
Introduction
Code generation is a long-standing challenge in computer science. Recently, Large Language Models (LLMs) trained on code have shown outstanding breakthroughs in generating code that accurately satisfies user intents. However, these models are often closed-source, limiting their accessibility and potential for further research and development.
To address this, we introduce Magicoder, a series of fully open-source (code, weights, and data) LLMs for code that significantly close the gap with top code models while having no more than 7B parameters. The key innovation is OSS-INSTRUCT, a novel approach to enlightening LLMs with open-source code snippets to generate high-quality instruction data for code.
OSS-INSTRUCT: Instruction Tuning from Open Source
OSS-INSTRUCT works by prompting an LLM (e.g., ChatGPT) to generate a coding problem and its solution from a seed code snippet collected from open-source repositories. The seed snippet offers controllability over the generation and encourages the LLM to create diverse coding problems that reflect real-world programming scenarios.
We collect 80K initial seed snippets from various programming languages and insert them into a prompt template; the LLM takes this prompt as input and outputs both a coding problem and its solution. We perform data cleaning and decontamination to ensure the quality of the generated data.
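For illustration, the core generation step can be sketched in a few lines of Python; the prompt wording, model name, and sampling settings below are illustrative assumptions rather than the paper's exact pipeline.

```python
# Illustrative sketch of an OSS-INSTRUCT-style generation loop; the prompt
# template, model choice, and filtering are placeholders, not the authors'
# exact implementation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """\
Please gain inspiration from the following code snippet to create a
high-quality programming problem, then write a correct solution.

Code snippet for inspiration:
{seed_snippet}

Present a [Problem Description] section and a [Solution] section.
"""

def oss_instruct(seed_snippets: list[str], model: str = "gpt-3.5-turbo") -> list[str]:
    """Turn raw open-source seed snippets into problem+solution pairs."""
    samples = []
    for seed in seed_snippets:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user",
                       "content": PROMPT_TEMPLATE.format(seed_snippet=seed)}],
            temperature=1.0,  # higher temperature encourages diverse problems
        )
        samples.append(response.choices[0].message.content)
    # A real pipeline would also deduplicate and decontaminate the outputs
    # against benchmarks (HumanEval, MBPP, ...) before training.
    return samples
```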
Qualitative examples demonstrate how OSS-INSTRUCT can inspire an LLM to create diverse coding tasks, including algorithmic challenges, realistic issues, single-function code generation, library-based program completion, whole-program development, and even whole-application construction. Analysis of the generated data shows that it exhibits diversity and balance across different categories.
Compared to other data generation methods like Self-Instruct and Evol-Instruct, OSS-INSTRUCT exhibits the lowest average similarity with HumanEval, indicating that the improvements from OSS-INSTRUCT are not merely due to including data from the same distribution.
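As a rough sketch, dataset-to-benchmark similarity of this kind can be estimated with TF-IDF cosine similarity; the exact embedding and aggregation used in the paper may differ, so treat the following as one plausible protocol rather than the reported one.

```python
# Hedged sketch: estimate how close a generated dataset sits to a benchmark.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def avg_max_similarity(generated: list[str], benchmark: list[str]) -> float:
    """For each generated sample, find its most similar benchmark task,
    then average those maxima over the whole dataset."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(generated + benchmark)
    gen_vecs = matrix[: len(generated)]
    bench_vecs = matrix[len(generated):]
    sims = cosine_similarity(gen_vecs, bench_vecs)  # shape: (gen, bench)
    return float(sims.max(axis=1).mean())
```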
Magicoder and MagicoderS
We build the Magicoder series by finetuning the base models (CodeLLAMA-Python-7B and DeepSeek-Coder-Base-6.7B) on the 75K synthetic data generated through OSS-INSTRUCT. To further enhance the coding abilities, we continue to finetune the Magicoder models with the open-source Evol-Instruct dataset, resulting in the MagicoderS series.
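Instruction tuning of this kind can be set up with standard libraries; the sketch below is a minimal, hedged example, where the dataset file name, field names, sequence length, and hyperparameters are placeholders rather than the paper's exact configuration.

```python
# Minimal instruction-tuning sketch with Hugging Face Transformers.
# Dataset path, field names, and hyperparameters are placeholder assumptions;
# see the authors' repository for the actual training setup.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "codellama/CodeLlama-7b-Python-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # CodeLlama ships without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

def tokenize(example):
    # Concatenate each problem with its solution into one training sequence.
    text = example["problem"] + "\n" + example["solution"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

train_set = load_dataset("json", data_files="oss_instruct_75k.jsonl")["train"].map(tokenize)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="magicoder-cl", num_train_epochs=2,
                           learning_rate=5e-5, per_device_train_batch_size=2),
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```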
Evaluation
We evaluate the Magicoder and MagicoderS models on a wide range of coding tasks, including HumanEval and MBPP for Python text-to-code generation, MultiPL-E for multilingual code completion, and DS-1000 for solving data science problems. We also use the more rigorous EvalPlus framework, which includes the augmented HumanEval+ and MBPP+ datasets.
The results show that both Magicoder-CL and MagicoderS-CL substantially outperform the base CodeLLAMA-Python-7B. Notably, Magicoder-CL even outperforms WizardCoder-CL-7B, WizardCoder-SC-15B, and all studied SOTA LLMs with less than or equal to 16B parameters on all the benchmarks we tested.
Furthermore, the pass@1 result of the enhanced MagicoderS-CL is on par with ChatGPT on HumanEval (70.7 vs. 72.6) and surpasses it on the more rigorous HumanEval+ (66.5 vs. 65.9), indicating that MagicoderS-CL can generate more robust code. It also achieves SOTA results among all code models at the same scale.
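For reference, pass@k metrics like these are commonly computed with the unbiased estimator of Chen et al. (2021): generate n samples per task, count the c correct ones, and estimate the chance that at least one of k drawn samples passes. A minimal sketch:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021) for one task.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k estimate from n generated samples of which c pass the tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 1 of 20 samples passes -> pass@1 = 1 - 19/20 = 0.05
assert abs(pass_at_k(20, 1, 1) - 0.05) < 1e-9
```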
We also applied OSS-INSTRUCT to DeepSeek-Coder-Base-6.7B, yielding Magicoder-DS and MagicoderS-DS. MagicoderS-DS achieves a remarkable 76.8 pass@1 on HumanEval and outperforms DeepSeek-Coder-Instruct-6.7B on HumanEval, HumanEval+, MBPP, and MBPP+ with 8x fewer finetuning tokens.
Contributions
In summary, we make the following contributions:
1. We introduce OSS-INSTRUCT, a pioneering approach to enlightening LLMs with open-source code snippets to generate more diverse, realistic, and controllable coding instruction data, which can be leveraged to substantially boost the performance of various LLMs via instruction tuning.
2. We build the Magicoder series trained with OSS-INSTRUCT and MagicoderS series trained on a combination of OSS-INSTRUCT and Evol-Instruct. Our evaluation across 6 benchmarks shows that all Magicoders significantly improve the base LLMs, with MagicoderS-CL and MagicoderS-DS outperforming ChatGPT on HumanEval+ with only 7B parameters.
3. We fully open-source the model weights, training data, and source code at https://github.com/ise-uiuc/magicoder to facilitate future research.
Overall, OSS-INSTRUCT opens a new direction for creating low-bias and high-quality instruction-tuning data from the abundance of open-source references, enabling the development of powerful open-source code generation models.