Summary: SoTaNa: The Open-Source Software Development Assistant (arxiv.org)
8,341 words - PDF document
One Line
SoTaNa is an open-source software development assistant that uses ChatGPT-generated instruction data and parameter-efficient fine-tuning of LLaMA to help developers with tasks such as code summarization and generation.
Key Points
- SoTaNa is an open-source software development assistant that uses ChatGPT-generated data to enhance the LLaMA model.
- SoTaNa demonstrates effectiveness in assisting developers through human evaluation and has capabilities in code summarization and generation.
- OpenAI has curated instruction-based datasets to address the challenge of understanding human-written instructions.
- The approach focuses on parameter-efficient tuning of large language models (LLMs) using the LoRA method.
- SoTaNa leverages LLMs to generate high-quality instruction-based data for software engineering tasks and fine-tunes the LLaMA model with software engineering-related data.
Summaries
21 word summary
SoTaNa is an open-source software development assistant that uses ChatGPT and fine-tuning to assist developers with instruction-based data generation and code summarization.
38 word summary
SoTaNa is an open-source software development assistant that utilizes ChatGPT to generate instruction-based data for software engineering tasks and enhances the LLaMA model through fine-tuning. It demonstrates effectiveness in assisting developers and highlights capabilities in code summarization and generation.
390 word summary
SoTaNa is an open-source software development assistant that utilizes ChatGPT to generate high-quality instruction-based data for software engineering tasks and enhances the open-source foundation model LLaMA through parameter-efficient fine-tuning. The objective of SoTaNa is to assist developers with software engineering work.
The document discusses the development of SoTaNa, an open-source software development assistant based on a large language model. The model's effectiveness in assisting developers is demonstrated through human evaluation. The paper also highlights the model's capabilities in code summarization and generation.
Prior work converts NLP tasks into a unified format and uses multi-task learning to achieve good results on new tasks. However, understanding human-written instructions remains challenging for these models, and OpenAI has curated instruction-based datasets to address this challenge.
The approach focuses on parameter-efficient tuning of large language models (LLMs) using the LoRA method. LoRA freezes the pre-trained model parameters and introduces trainable low-rank decomposition matrices into each Transformer layer. The model's ability to understand code is then evaluated.
SoTaNa initializes the LoRA matrix A with a Gaussian distribution and sets matrix B to zero. The statistics of SoTaNa, including training times, are shown in Table 1. SoTaNa is assessed through human evaluation.
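For reference, the update described in the two passages above can be written compactly; this is the standard LoRA formulation, not an equation quoted from the paper:

```latex
h = W_0 x + \Delta W\, x = W_0 x + B A\, x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here W_0 is a frozen pre-trained weight matrix and only A and B are trained. With A drawn from a Gaussian and B set to zero, ΔW = BA = 0 at initialization, so fine-tuning starts exactly from the pre-trained model.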
The best way to get a file extension in PHP is to use the pathinfo() function. This function returns an array containing the file name, extension, path, and other information about the file. Another option is to use the explode() function, which splits the file name on a delimiter so that the last element is the extension.
To find a file extension in PHP, several methods are suggested. One method is to split the file name with a delimiter and retrieve the last part. Another method is to use the pathinfo() function, which returns an array that includes the extension.
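A minimal sketch of the two approaches described above (plain PHP; the example path is illustrative, not taken from the paper's output):

```php
<?php
$path = 'docs/report.final.pdf';

// Method 1: pathinfo() parses the path; with the PATHINFO_EXTENSION
// flag it returns just the extension.
$ext = pathinfo($path, PATHINFO_EXTENSION);   // "pdf"

// The array form also works: it includes dirname, basename,
// filename, and extension keys.
$info = pathinfo($path);
$ext  = $info['extension'];                   // "pdf"

// Method 2: split on "." and take the last part. Every dot is a
// delimiter, so only the text after the final dot is returned.
$parts = explode('.', $path);
$ext   = end($parts);                         // "pdf"
```

Note that pathinfo() is generally the safer choice: the explode() approach returns the whole name when the file has no dot at all.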
SoTaNa is an open-source software development assistant that leverages Large Language Models (LLMs) to generate high-quality instruction-based data for software engineering tasks. It fine-tunes the LLaMA model with software engineering-related data to enhance its capabilities.
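As an illustration of that first step, the sketch below requests a single instruction/input/output sample from the OpenAI chat completions REST API. The model name, prompt wording, and output keys are assumptions made for illustration; the paper only states that ChatGPT was used to generate the data:

```php
<?php
// Hypothetical sketch of generating one instruction-based training
// sample. Endpoint and request shape follow the public OpenAI chat
// completions API; prompt and model are illustrative assumptions.
$apiKey = getenv('OPENAI_API_KEY');

$payload = json_encode([
    'model' => 'gpt-3.5-turbo',   // assumed; the paper just says ChatGPT
    'messages' => [[
        'role'    => 'user',
        'content' => 'Write one software-engineering task as JSON with '
                   . 'keys "instruction", "input", and "output".',
    ]],
]);

$ch = curl_init('https://api.openai.com/v1/chat/completions');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_HTTPHEADER     => [
        'Content-Type: application/json',
        'Authorization: Bearer ' . $apiKey,
    ],
    CURLOPT_POSTFIELDS     => $payload,
]);

$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// One generated sample, ready to append to the fine-tuning dataset.
echo $response['choices'][0]['message']['content'], PHP_EOL;
```

Repeating such a call with varied seed prompts and filtering the results yields the kind of instruction-based dataset used to fine-tune LLaMA.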
This excerpt includes references to various research papers and projects related to open-source software development and language models. It mentions studies on the impact of instruction data scaling on large language models, the development of instruction-following language models for code generation, and related investigations.
This excerpt contains a list of references to various papers and models related to open-source software development and language models. It includes references to models such as Alpaca, ChatGPT, LLaMA, and WizardLM, as well as papers on model evaluation.