Summary: "CodeTF: One-Stop Transformer Library for Code Intelligence" (arxiv.org)
8,250 words - PDF document
One Line
CodeTF is an open-source transformer library for code intelligence that supports multiple programming languages, bundles pre-trained models and tools for code understanding and generation, and provides a unified interface for performance metrics, data preprocessing, and model fine-tuning, with the stated aim of augmenting human capabilities rather than replacing them.
Key Points
- CodeTF is an open-source transformer library designed for code intelligence, bridging the gap between machine learning and software engineering.
- The library includes pre-trained models, standardized interfaces, and key modules for extracting code attributes, language-specific parsers, and utility functions.
- CodeTF is modular and extensible, allowing integration of additional programming languages, models, and utilities, and can be used for code completion, code translation, defect prediction, and code refinement.
- The library addresses issues with reproducibility and scalability by leveraging scalable infrastructure and optimizing resource allocation, while promoting responsible AI practices.
- CodeTF has been evaluated on HumanEval-X (2023) and includes pre-trained models such as GraphCodeBERT, CodeTrans, CodeGeeX, NatGen, and SPT-Code, with multilingual support (see the loading sketch after this list).
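As an illustration of what loading a model looks like, here is a minimal sketch in the style of the library's documented pipeline interface; the import path, function name, and parameters below are assumptions for illustration, not a verified API reference.

```python
# Hypothetical sketch: loading a pretrained Code LLM with CodeTF.
# The import path and parameter names are assumed for illustration.
from codetf.models import load_model_pipeline  # assumed import path

model = load_model_pipeline(
    model_name="codet5",   # model family from the Model Zoo
    model_type="base",     # checkpoint size/variant
    task="pretrained",     # or a fine-tuned task such as summarization
    is_eval=True,          # load in inference mode
    load_in_8bit=False,    # optional quantized loading
)

# Task-dependent inference on a raw code snippet.
print(model.predict(["def add(a, b): return a + b"]))
```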
Summaries
243 word summary
CodeTF is an open-source transformer library for code intelligence that provides pre-trained models and tools for code understanding and generation. It supports multiple programming languages and can be fine-tuned for specific tasks. Pre-trained models include GraphCodeBERT, CodeTrans, CodeGeeX, NatGen, and SPT-Code, with multilingual support. CodeTF is committed to responsible AI practices and aims to enhance human capabilities and work collaboratively with humans rather than replacing them. The library has a modular design with a unified data loader interface, a unified metric interface, and a unified code utility interface for multiple programming languages, enabling users to easily perform a variety of code-related tasks, such as code summarization, completion, generation, and refinement. It supports encoder-only, decoder-only, and encoder-decoder models, and incorporates quantization techniques to minimize model size while maintaining performance. CodeTF provides a unified interface for performance metrics, data preprocessing, and model fine-tuning methods, and includes a Code Utility module for manipulating source code data and extracting important code attributes, using tree-sitter as the parser for 15 programming languages. The library is modular and extensible, allowing the integration of additional programming languages, models, and utilities, and the project plans to expand its capabilities, support more advanced use cases, and improve model reproducibility. CodeTF aims to become a useful tool for both software developers and researchers, fostering more innovation in code intelligence research.
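As a concrete reference for one of the metrics behind that unified interface, Edit Similarity is commonly defined as a normalized Levenshtein similarity between generated and reference code. The sketch below implements that standard definition and is independent of CodeTF's actual metric API.

```python
# Minimal reference implementation of Edit Similarity as normalized
# Levenshtein similarity. A standalone sketch of the standard definition,
# not CodeTF's metric API.
def edit_similarity(prediction: str, reference: str) -> float:
    m, n = len(prediction), len(reference)
    if max(m, n) == 0:
        return 1.0  # two empty strings are identical
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if prediction[i - 1] == reference[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return 1.0 - prev[n] / max(m, n)

print(edit_similarity("return a+b", "return a + b"))  # ~0.83
```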
704 word summary
CodeTF is an open-source Transformer-based library designed to improve code intelligence and bridge the gap between machine learning and software engineering. It supports a collection of pretrained Code LLMs and popular code benchmarks, with a standardized interface for state-of-the-art Code LLMs and code intelligence tasks. The library includes key modules and components for extracting code attributes, language-specific parsers, and utility functions, and it is modular and extensible, allowing the integration of additional programming languages, models, and utilities. It aims to become a useful tool for both software developers and researchers, fostering more innovation in code intelligence research.

As a one-stop solution, CodeTF covers the major aspects of code intelligence work: loading and serving state-of-the-art models in different styles, pretraining and fine-tuning, evaluation, and source code processing. It consists of six main modules: Model Zoo, Model Serving, Model Training, Evaluator, Data Utility, and Code Utility. The components can be tailored to specific requirements and used for code completion, translation, defect prediction, and refinement. The design adheres to several important principles, such as being user-centric and comprehensive, and the library addresses reproducibility and scalability issues by leveraging scalable infrastructure and optimizing resource allocation.

CodeTF offers access to pre-trained large language models (LLMs) and the ability to fine-tune them for specific computation budgets and applications. A training module and a serving module cover code summarization, completion, generation, and refinement. The library supports encoder-only, decoder-only, and encoder-decoder models, incorporates quantization techniques to minimize model size while maintaining performance, and provides a unified interface for performance metrics, data preprocessing, and parameter-efficient fine-tuning methods such as LoRA, Prefix-Tuning, P-Tuning, Prompt Tuning, and AdaLoRA.

CodeTF aims to streamline the evaluation process, promote collaboration and innovation within the research community, and facilitate reproducibility of results on popular benchmarks. It includes a Code Utility module for manipulating source code data and extracting important code attributes, using tree-sitter as the parser for 15 programming languages; the unified interface for code-specific metrics is intended as a valuable tool for researchers, improving model generalizability and applications and ultimately driving innovation in the field of code intelligence. The library offers a wide range of models for code retrieval and program synthesis, with a modular design built around a unified data loader interface, a unified metric interface, and a unified code utility interface for multiple programming languages, plus unified parameter-efficient fine-tuning for code intelligence tasks. Users can easily perform a variety of code-related tasks, and the project plans to expand its capabilities to support more advanced use cases and improve model reproducibility.
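To make the parameter-efficient fine-tuning point concrete, the sketch below attaches LoRA adapters to a seq2seq code model using HuggingFace's peft library, the kind of method such a Trainer exposes; the hyperparameters are illustrative, not CodeTF defaults.

```python
# Illustrative LoRA setup with HuggingFace's peft library, one of the
# parameter-efficient methods mentioned above. Hyperparameters are
# illustrative, not CodeTF defaults.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,              # rank of the low-rank update matrices
    lora_alpha=32,    # scaling factor for the update
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```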
CodeTF is committed to responsible AI practices, including human control and oversight, inclusive language in coding, and consideration of job loss and automation. The library also considers energy efficiency and potential biases, such as solution bias and coding style bias. CodeTF aims to create AI systems that enhance human capabilities and work collaboratively with humans, rather than replacing them.
CodeTF provides pre-trained models for code understanding and generation, including CodeBERT, CodeT5, CodeGen, CodeRL, and UniXcoder, and is designed to be a comprehensive resource for researchers and developers working on code intelligence. The supported models are based on the transformer architecture and use unsupervised and multitask learning to improve performance; they cover multiple programming languages and can be fine-tuned for specific tasks, and the library has been evaluated on HumanEval-X (2023). Additional pre-trained models include GraphCodeBERT, CodeTrans, CodeGeeX, NatGen, and SPT-Code, with multilingual support. Related papers explore semantic code search, program synthesis, generative models for code infilling and synthesis, large language models for code understanding and generation, evaluating the state of semantic code search, measuring coding challenge competence, multilingual training for software engineering, and contextual embedding of source code.
1912 word summary
CodeTF is an open-source transformer library for code intelligence that provides pre-trained models for code understanding and generation with multilingual support, and it has been evaluated on HumanEval-X (2023). Among the related systems it cites, CodeGeeX is a pre-trained model for multilingual code generation; NaturalCC is an open-source toolkit for code intelligence; NatGen performs generative pre-training for learning source code representations; and SPT-Code applies sequence-to-sequence pre-training for code understanding and generation. Multilingual training for software engineering and learning and evaluating contextual embeddings of source code have also been explored.

The library provides pre-trained models, including GraphCodeBERT and CodeTrans, is based on the transformer architecture, and uses unsupervised and multitask learning to improve performance. It includes methods for program repair, code completion, and code synthesis evaluation, supports multiple programming languages, and can be fine-tuned for specific tasks. Related papers include Prefix-Tuning, low-rank adaptation of large language models (LoRA), GPTQ for accurate post-training quantization, and LLM.int8() for 8-bit matrix multiplication, as well as work on semantic code search, program synthesis with large language models, generative models for code infilling and synthesis, evaluating the state of semantic code search, measuring coding challenge competence, and exploring the limits of language modeling.

The library includes models such as CodeBERT, CodeT5, and CodeGen, evaluated on large datasets for code understanding and generation such as CodeXGLUE and CodeSearchNet; CodeRL and UniXcoder are also included. It is designed to be a comprehensive resource for researchers and developers working on code intelligence.

CodeTF is also committed to responsible AI practices: maintaining human control and oversight, ensuring inclusive language in coding, and considering job loss and automation. Energy efficiency is a significant concern in AI, and optimized models generating more efficient code could reduce energy consumption. Potential biases, such as solution bias and coding style bias, can affect the generated code, so the aim is to create AI systems that enhance human capabilities and work collaboratively with humans rather than replacing them.

As a one-stop open-source Transformer-based library, CodeTF offers a powerful and versatile toolset to develop and deploy LLMs for code-related tasks, enabling users to easily perform code summarization, completion, generation, and refinement, while consolidating resources in the field and fostering collaboration.
However, several biases could lead to misinterpretations, incorrect results, or undesired behaviors, and the library does not provide absolute guarantees about its models' code intelligence capabilities. To expand its capabilities, support more advanced use cases, and improve model reproducibility, the project plans to implement 4-bit quantization, add support for other programming languages, integrate a broader selection of recent state-of-the-art pretrained language models of code, and conduct comprehensive evaluations of well-known code intelligence tasks on established benchmarks.

CodeTF offers a wide range of models for code retrieval and program synthesis, bundling state-of-the-art Code LLMs with additional utilities for traditional software engineering analysis and formal methods to tackle complex software engineering tasks effectively. Its modular design provides a unified data loader interface, a unified metric interface, and a unified code utility interface for multiple programming languages, along with unified parameter-efficient fine-tuning for code intelligence tasks. Table 1 of the paper compares CodeTF's key features with those of HuggingFace Transformers.

Code LLMs, inspired by NLP models like BERT and GPT, have gained significant attention for their ability to support a wide range of code understanding and generation tasks such as code generation, completion, repair, and translation. These models adopt pretraining strategies from the NLP domain, like span corruption and causal language modeling, and treat code as natural language text.

CodeTF supports the development of LLMs for code and related tools. It includes a unified interface for evaluating models on well-known benchmarks, a trainer for the preprocessed CodeXGLUE datasets, a unified interface for fine-tuning models based on supported checkpoints, and a unified interface for loading supported models and performing inference for each supported programming language. The Code Utility module helps manipulate source code data while respecting the syntactical rules of each programming language, and offers supporting functions such as comment removal and extraction of code properties; built-in functions extract important code attributes using tree-sitter as the parser for 15 programming languages. The unified interface for code-specific metrics is intended as a valuable tool for researchers, improving model generalizability and applications and ultimately driving innovation in the field of code intelligence.

For evaluation, CodeTF provides a unified interface for performance metrics, including pass@k, Edit Similarity, and CodeBLEU. It also offers a Data Utility module for data preprocessing and a Trainer module for model fine-tuning, with options for parameter-efficient fine-tuning methods such as LoRA, Prefix-Tuning, P-Tuning, Prompt Tuning, and AdaLoRA. The Trainer module includes three major Trainer classes, CausalLMTrainer, Seq2SeqTrainer, and BERTTrainer, which are compatible with different families of LLMs for code.
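For reference, pass@k is typically computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021): pass@k = 1 - C(n-c, k)/C(n, k) for n generated samples per problem, of which c pass the unit tests. The standalone sketch below implements that estimator; it is not CodeTF's Evaluator API.

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021):
# pass@k = 1 - C(n - c, k) / C(n, k), where n samples were drawn per
# problem and c of them passed the unit tests. Standalone sketch, not
# CodeTF's Evaluator API.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample draw with failures
    # Numerically stable product form of 1 - C(n - c, k) / C(n, k).
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# 100 samples per problem, 25 passed: estimated pass@10.
print(pass_at_k(n=100, c=25, k=10))  # ~0.95
```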
CodeTF aims to streamline the evaluation process, promote collaboration and innovation within the research community, and facilitate the reproducibility of results on popular benchmarks. Proper evaluation of Code LLMs is crucial for integrating them into real-world applications, and code-specific metrics such as CodeBLEU and Edit Similarity are used to assess the quality and readability of generated code.

The library consists of six main modules: Model Zoo, Model Serving, Model Training, Evaluator, Data Utility, and Code Utility. The Model Zoo contains configurations for well-known pretrained or fine-tuned models for specific tasks. The Model Serving module loads models through an interface that specifies the model type (GPT, Seq2Seq, BERT), model size, and the tasks for which the models are intended (pretraining, summarization, generation, etc.). The Model Training module provides utilities for pretraining or fine-tuning models, managing GPUs, and handling neural network configurations. The Data Utility module assists the Model Training module in loading well-known datasets, the Code Utility module provides tools for easy manipulation of source code, and the Evaluator module validates the results of trained models on well-known benchmarks.

CodeTF addresses reproducibility and scalability issues by leveraging scalable infrastructure and optimizing resource allocation. It is designed following software engineering principles such as object-oriented programming, ensuring extensibility and flexibility, and it prioritizes user-friendliness and usability, reducing the need for complex configurations or dependencies. In designing CodeTF, the team adhered to several important principles, including being user-centric and comprehensive, addressing the specific needs of code intelligence tasks that are not fully catered to by existing libraries such as HuggingFace Transformers. The library thus serves as a one-stop solution, covering loading and serving state-of-the-art models in different styles, pretraining and fine-tuning, evaluation, and source code processing.

Users can access and fine-tune pre-trained large language models (LLMs) to match their specific computation budgets and applications. The training module lets users tailor models to existing datasets or tasks, while the serving module simplifies deploying models for an array of code intelligence tasks, including code summarization, code completion, text-to-code generation, and code refinement. CodeTF supports a wide range of LLMs, including encoder-only, decoder-only, and encoder-decoder models, and users can easily access both pretrained and fine-tuned models in their applications. The library incorporates quantization techniques to minimize model size while maintaining satisfactory performance, and offers an interface to the Hugging Face repository so users can effortlessly stay up to date with the latest advancements in the field.
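To ground the quantization point, the sketch below shows 8-bit (LLM.int8-style) loading with HuggingFace Transformers and bitsandbytes, the stack that libraries in this space typically build on; the checkpoint name is illustrative, and this is not CodeTF's own serving call.

```python
# Illustrative 8-bit (LLM.int8) model loading via HuggingFace Transformers
# and bitsandbytes; the checkpoint name is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_8bit=True)  # store weights in int8

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-mono")
model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-350M-mono",
    quantization_config=quant_config,
    device_map="auto",  # place layers on available devices
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```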
CodeTF's modular system design improves extensibility and allows for customization and integration of additional models, data, and programming languages. The library includes model zoo, model serving, model training, data utility, code utility, and evaluation components. Fine-tuning is necessary to adapt models to specific tasks and improve their performance on the target domain, while quantization reduces model size and improves inference time without sacrificing much accuracy. The components can be tailored to specific requirements and used for code completion, code translation, defect prediction, and code refinement.

CodeTF includes a collection of popular code corpora, data preprocessing and feature extraction modules, and an interface for serving and training both pretrained and custom models. It provides tools for extracting code attributes such as method names, identifiers, variable names, and code comments, and includes Abstract Syntax Tree (AST) parsers for multiple programming languages built on tree-sitter. These utilities enable efficient processing and manipulation of code data during model training and evaluation, for example when locating identifiers for identifier-aware multi-objective learning. CodeTF aims to become a useful tool for both software developers and researchers, fostering more innovation in code intelligence research and facilitating wider deployment and application of Code LLMs.

As an open-source library for Transformer-based LLMs in software engineering, CodeTF supports the development and deployment of Code LLMs. It contains a collection of popular datasets and supports a wide range of code LLMs, including encoder-decoder and decoder-only architectures, along with popular research benchmarks. Its design principle allows standardized integration of, and rapid development from, any off-the-shelf models and datasets. Key components include model training, utilities to process and manipulate code data, and popular research benchmarks. CodeTF facilitates parameter-efficient model fine-tuning (such as Prefix-Tuning and Prompt Tuning), model serving, and model quantization for efficient inference, and supports multilingual AST parsers for over 15 programming languages. Initial success in applying these models in practice demonstrates their great potential benefit to society and, more specifically, to software development professionals, improving the productivity and quality of their work.

CodeTF is thus an open-source Transformer-based library designed to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive solution for developers, researchers, and practitioners. It supports a collection of pretrained Code LLMs and popular code benchmarks, with a standardized interface for state-of-the-art Code LLMs and code intelligence, is designed with a unified interface enabling rapid access and development across different types of models, datasets, and tasks, and includes key modules for extracting code attributes, language-specific parsers, and utility functions.
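To illustrate the kind of attribute extraction described above, the sketch below uses the tree-sitter Python bindings to collect function names from Python source. Grammar loading differs across tree-sitter versions; this assumes the prebuilt tree_sitter_languages wheels and is a minimal stand-in for CodeTF's richer Code Utility, not its actual interface.

```python
# Minimal sketch of tree-sitter-based attribute extraction, the approach
# described for CodeTF's Code Utility. Assumes the tree_sitter_languages
# package for prebuilt grammars; API details vary by tree-sitter version.
from tree_sitter_languages import get_parser

parser = get_parser("python")
source = b"def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"
tree = parser.parse(source)

def function_names(node):
    """Recursively collect the names of all function definitions."""
    if node.type == "function_definition":
        name = node.child_by_field_name("name")
        yield source[name.start_byte:name.end_byte].decode()
    for child in node.children:
        yield from function_names(child)

print(list(function_names(tree.root_node)))  # ['add', 'sub']
```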
Code intelligence plays a key role in transforming modern software engineering, and deep learning models, particularly Transformer-based LLMs, have demonstrated remarkable potential in tackling source code analysis tasks.