Summary: CodeCompose AI-assisted Code Authoring Deployment (arxiv.org)
8,982 words - PDF document
One Line
CodeCompose is an AI-assisted code authoring tool that suggests code based on contextual information, is fine-tuned for Meta-specific languages such as Hack and Flow, and offers features such as auto-completion, API discovery, and standard library suggestions.
Key Points
- CodeCompose is an AI-assisted code authoring system that suggests entire statements or blocks of code during development.
- It utilizes large language models (LLMs) to offer coding suggestions based on the organization's code repository.
- CodeCompose has been deployed on various code authoring surfaces across the company, offering quantitative metrics and qualitative feedback to measure its impact.
- The system has been trained on various corpora to assimilate vast amounts of knowledge and assist developers in working more efficiently.
- CodeCompose aims to improve developer productivity throughout the software development life cycle.
Summaries
345 word summary
CodeCompose is an AI-assisted code authoring tool that suggests entire statements or blocks of code during development based on contextual information. It is fine-tuned for Meta-specific languages such as Hack and Flow and utilizes large language models (LLMs) to offer coding suggestions based on the organization's code repository. CodeCompose aids in code generation, documentation, and suggestion accuracy, and offers features such as auto-completion, API discovery, and standard library suggestions. A reduction in coding iteration time has been observed for developers exposed to single-line ML completion, and CodeCompose generates code across 10+ languages using first-party training data from Meta's code repositories and notebooks. The system design allows CodeCompose to be plugged into any code editing surface, and the architecture makes it straightforward to integrate with in-house developer tools. The model is fine-tuned with a proposed Language Causal Masking (LCM) objective, and the system employs a suite of optimizations to bring the end-to-end latency below an acceptable threshold. CodeCompose is designed to improve developer productivity; its three primary components are the server, the Language Server Protocol (LSP) layer, and the client. The system generates a sequence of tokens to auto-complete statements, having been fine-tuned on Meta's internal code. The tool received a 91.5% favorable response from users: in a 15-day period, 4.5 million suggestions were shown to developers, 22% of which were accepted. CodeCompose also helps developers discover unfamiliar APIs and write documentation. The authors discuss building trust, designing the user experience, and measuring feedback to understand the system's impact on code authoring. Various AI companies offer commercial versions of code completion tools, including Google, Microsoft, and GitHub.
Amazon's CodeWhisperer is a fully functional code completion tool that can generate correct syntax and pass unit tests in programming languages it is not trained for. GitHub Copilot is a code completion tool that provides a useful starting point for programming tasks but has some shortcomings. IntelliCode Compose and Pythia are also code completion tools. CodeCompose is an entirely internal tool that has only been deployed at Meta.
592 word summary
CodeCompose is an AI-assisted code authoring deployment that aims to improve developer productivity. It has three primary components: the server, the Language Server Protocol (LSP) layer, and the client. CodeCompose generates a sequence of tokens to auto-complete the statement(s), having been fine-tuned on Meta's internal code. The system has a positive effect on the coding experience of developers, as shown by a 91.5% favorable feedback rating. In a 15-day time period, 4.5 million suggestions were shown to developers, 22% of which were accepted. The system also helps developers discover unfamiliar APIs and write documentation. The authors discuss building trust, designing the user experience, and measuring feedback to understand the impact of the system on code authoring. Various AI companies offer commercial versions of code completion tools, including Google, Microsoft, and GitHub. Amazon's CodeWhisperer is a fully functional code completion tool that can generate correct syntax and pass unit tests in programming languages it is not trained for. GitHub Copilot is a code completion tool that provides a useful starting point for programming tasks, but it has some shortcomings. IntelliCode Compose and Pythia are also code completion tools. CodeCompose is an entirely internal tool that has only been deployed at Meta. CodeCompose is an AI-assisted code authoring tool that aids in code generation, documentation, and suggestion accuracy. It predicts relevant suggestions and offers features such as auto-completion, API discovery, and standard library suggestions. It is well-received by developers who work on building pipelines and common infrastructure and those who follow typical coding patterns. However, it struggles with specialized APIs and libraries. A reduction in coding iteration time has been observed for developers exposed to single-line ML completion. CodeCompose generates code across 10+ languages using first-party training data from Meta's code repositories and notebooks.
The system design allows CodeCompose to be plugged into any code editing surface, and the architecture makes it straightforward to integrate with in-house developer tools. The system collects data by mining user feedback posts in the support group and manually analyzes suggestions. The tool received a 91.5% favorable response from users and made 4.5 million suggestions during the deployment period. CodeCompose suggests code and generates comments, messages, and documentation using LLMs. It is deeply integrated with Meta's version of VSCode and can suggest docstrings for functions. Suggestions need to appear within a certain timeframe, and the system has to match the developer's typing speed. The model is fine-tuned with a proposed Language Causal Masking (LCM) objective, and the tool employs a suite of optimizations to bring the end-to-end latency below an acceptable threshold. CodeCompose is an AI-assisted code authoring tool developed and deployed at Meta for serving tens of thousands of developers. It offers inline code suggestions based on contextual information and is fine-tuned for Meta-specific languages such as Hack and Flow. CodeCompose suggests entire statements or blocks of code during development, utilizing large language models (LLMs) to offer coding suggestions based on the organization's code repository. It has been deployed on various code authoring surfaces across the company, offering quantitative metrics and qualitative feedback to measure its impact. CodeCompose addresses the challenge of knowledge discovery in a dynamic environment by surfacing internal knowledge during code authoring, encouraging developers to produce better quality code. Trust-building was important for CodeCompose's productization, which was achieved by working with language partner teams at Meta and obtaining feedback from early adopters. The authors adopted a rollout strategy to incrementally build trust in CodeCompose by rolling it out to the company in waves of languages.
At every step, they were able to gather developer feedback and iterate on the product before rolling out further.
1080 word summary
CodeCompose is an AI-assisted code authoring system that suggests entire statements or blocks of code during development. It utilizes large language models (LLMs) to offer coding suggestions based on the organization's code repository. CodeCompose has been deployed on various code authoring surfaces across the company, offering quantitative metrics and qualitative feedback to measure its impact. The system has been trained on various corpora to assimilate vast amounts of knowledge and assist developers in working more efficiently.
CodeCompose addresses the challenge of knowledge discovery in a dynamic environment by surfacing internal knowledge during code authoring. Its impact on Meta's internal code authoring experience over a 15-day time window was significant, and it encourages developers to produce better quality code. Trust-building was important for CodeCompose's productization, which was achieved by working with language partner teams at Meta and obtaining feedback from early adopters.
CodeCompose is an AI-assisted code authoring tool developed and deployed at Meta for serving tens of thousands of developers. It offers inline code suggestions based on contextual information and is fine-tuned for Meta-specific languages such as Hack and Flow. It is based on the InCoder LLM that merges generative capabilities with bi-directionality and is multi-lingual.
The paper presents details about the CodeCompose model, system architecture, challenges for coding assistants, the industrial setting and how code is authored at Meta, developer feedback on the impact of CodeCompose at Meta, results from an extensive large-scale deployment, threats to validity, and related work. The authors adopted a rollout strategy to incrementally build trust in CodeCompose by rolling it out to the company in waves of languages. At every step, they were able to gather developer feedback and iterate on the product before rolling out further. CodeCompose suggests code and generates comments, messages, and documentation using LLMs. It is deeply integrated with Meta's version of VSCode and can suggest docstrings for functions. Code generation can be done at multiple levels, but requiring too much rework may reduce trust in the model. Suggestions need to appear within 300ms-500ms and not beyond 1s, and the system has to match the developer's typing speed. Balancing the rollout of suggestions is a unique challenge. The underlying LLM architecture involves a generative model trained with the Causal Masking (CM) objective, which CodeCompose adapts into a proposed Language Causal Masking (LCM) fine-tuning objective. Developers can request multi-line suggestions by pressing Tab multiple times. CodeCompose also employs a suite of optimizations to bring the end-to-end latency below an acceptable threshold. The deployment generates code across 10+ languages using first-party training data from Meta's code repositories and notebooks. With LCM, any task contains only one mask, and the tokenizer separately encodes the metadata, the code before the cursor, the code after the cursor, and the target. The system design allows CodeCompose to be plugged into any code editing surface, and the architecture makes it straightforward to integrate with in-house developer tools.
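As a rough illustration of how an LCM-style training example might be assembled (the trigger characters, sentinel strings, and splitting logic below are illustrative assumptions, not the paper's actual implementation):

```python
# Hypothetical trigger characters after which a mask may begin; the
# paper's actual trigger set is not reproduced here.
TRIGGERS = {"(", ".", "=", ":", ",", " ", "\n"}

def make_lcm_example(metadata: str, source: str, start: int, end: int) -> str:
    """Build a single-mask training example: source[start:end] becomes
    the target, and everything else is bidirectional context.

    Unlike generic causal masking, the mask is only allowed to begin
    right after a trigger character, so examples mirror the positions
    where real completion requests fire.
    """
    if start > 0 and source[start - 1] not in TRIGGERS:
        raise ValueError("mask must begin after a trigger character")
    before, target, after = source[:start], source[start:end], source[end:]
    # The real tokenizer encodes these segments separately; plain
    # sentinel strings stand in for special tokens here.
    return f"<meta>{metadata}<before>{before}<after>{after}<target>{target}"

example = make_lcm_example(
    "lang=python", "total = sum(values)\nprint(total)\n", 12, 20)
```

Placing the target after both context segments mirrors the described arrangement in which the model conditions on code before and after the cursor before generating the masked span.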
The system collects data by mining user feedback posts in the support group and manually analyzes suggestions. The tool received a 91.5% favorable response from users and made 4.5 million suggestions during the deployment period. CodeCompose is an AI-assisted code authoring deployment that accelerates coding and aids in discovery, documentation, and suggestion accuracy. It helps developers with coding tasks such as generating in-code documentation, writing boilerplate, and discovering new APIs. The tool successfully predicts relevant suggestions without being too noisy and offers features such as auto-completing lines of code, API discovery, generating in-code documentation, and suggesting standard libraries. CodeCompose struggles to suggest correct code in scenarios where developers use specialized APIs and libraries. However, it is well received by developers who work on building pipelines and common infrastructure and those who follow typical coding patterns. The closest work to CodeCompose is the deployment of hybrid semantic ML code completion at Google, where a reduction in coding iteration time was observed for developers exposed to single-line ML completion. Qualitative feedback from developers was grouped into categories, with the majority finding CodeCompose to be a net positive experience. However, there were some UX problems, such as overloaded keyboard shortcuts and disruptions caused by competing suggestion systems. Some developers found traditional auto-complete to be more useful in certain cases. Various AI companies offer commercial versions of code completion tools, including Google, Microsoft, and GitHub. Amazon's CodeWhisperer is a fully functional code completion tool that can generate correct syntax and pass unit tests in programming languages it is not trained for. GitHub Copilot is a code completion tool that provides a useful starting point for programming tasks, but it has some shortcomings.
IntelliCode Compose and Pythia are also code completion tools. CodeCompose is an entirely internal tool that has only been deployed at Meta.
The system architecture of CodeCompose is presented, along with the challenges faced in building an AI-assisted code authoring deployment. The authors discuss building trust, designing the user experience, and measuring feedback to understand the impact of the system on code authoring. A systematic literature survey of related work in this field is also summarized.
CodeCompose has a positive effect on the coding experience of developers, as shown by a 91.5% favorable feedback rating. In a 15-day time period, 4.5 million suggestions were shown to developers, 22% of which were accepted. The system also helps developers discover unfamiliar APIs and write documentation.
CodeCompose is an AI-assisted code authoring deployment that aims to improve developer productivity throughout the software development life cycle. The system is built using three primary components: the server, the Language Server Protocol (LSP) layer, and the client. The underlying InCoder-based LLM, fine-tuned on Meta's internal code, generates a sequence of tokens to auto-complete the statement(s).
In this paper, the authors introduced an AI-based coding assistant system named CodeCompose, which was scaled to 16k developers. The authors also referenced several related studies on user interactions with query auto completion, participant response bias in HCI, and efficient training of language models. They concluded by highlighting the potential productivity benefits of ML-enhanced code completion systems from Google, Microsoft, and Amazon. This article discusses improving code autocompletion with transfer learning and cites Attention Is All You Need and Pythia as relevant resources. It also examines the impact of AI on developer productivity and presents a method for automatic evaluation of machine translation. An empirical evaluation of GitHub Copilot's code suggestions is discussed, as well as the legal and ethical challenges of ChatGPT. The article also explores code prediction by feeding trees to transformers and natural language generation.
2348 word summary
Improving Code Autocompletion with Transfer Learning is discussed in an article from the International Conference on Software Engineering: Software Engineering in Practice. Attention Is All You Need and Pythia are cited as relevant resources. The impact of AI on developer productivity is examined in a paper from arXiv. A method for automatic evaluation of machine translation is presented in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. An empirical evaluation of GitHub Copilot's code suggestions is discussed in a paper by Nhan Nguyen and Sarah Nadi. The Dark Side of ChatGPT: Legal and Ethical Challenges from Stochastic Parrots and Hallucination is addressed in a paper by Zihao Li. Code Prediction by Feeding Trees to Transformers and Natural Language Generation are explored in an article from ACM Comput. Surv. In this paper, the authors introduced an AI-based coding assistant system named CodeCompose, which was scaled to 16k developers. The system uses generative models for code infilling and synthesis, pre-training of deep bidirectional transformers for language understanding, and learning from examples to improve code completion systems. The authors also referenced several related studies on user interactions with query auto completion, participant response bias in HCI, and efficient training of language models. They concluded by highlighting the potential productivity benefits of ML-enhanced code completion systems from Google, Microsoft, and Amazon. CodeCompose is an AI-assisted code authoring deployment that aims to improve developer productivity throughout the software development life cycle. The system is built using three primary components: the server, the Language Server Protocol (LSP) layer, and the client. The underlying InCoder-based LLM, fine-tuned on Meta's internal code, generates a sequence of tokens to auto-complete the statement(s).
CodeCompose offers suggestions for individual lines of code and leverages semantic information to perform pre-processing and post-processing to improve suggestion accuracy.
CodeCompose has a positive effect on the coding experience of developers, as shown by a 91.5% favorable feedback rating. In a 15-day time period, 4.5 million suggestions were shown to developers, 22% of which were accepted. The system also helps developers discover unfamiliar APIs and write documentation.
The system architecture of CodeCompose is presented, along with the challenges faced in building an AI-assisted code authoring deployment. The authors discuss building trust, designing the user experience, and measuring feedback to understand the impact of the system on code authoring. A systematic literature survey of related work in this field is also summarized. There are several generative AI companies that offer commercial versions of code completion tools, including Google, Microsoft, and GitHub. Amazon's CodeWhisperer is a fully functional code completion tool that can generate correct syntax and pass unit tests in programming languages it is not trained for. GitHub Copilot is a code completion tool that provides a useful starting point for programming tasks, but it has some shortcomings, such as generating code that can be further simplified and code that relies on undefined helper methods. There have been several empirical evaluations of GitHub Copilot, and the results have been positive overall. IntelliCode Compose is a general-purpose multilingual code completion tool that is built on a state-of-the-art generative transformer model trained on 1.2 billion lines of source code in Python, C#, JavaScript, and TypeScript programming languages. Pythia is a code completion tool deployed with IntelliCode by Microsoft that is built using DL models trained on code contexts extracted from abstract syntax trees. CodeCompose is an entirely internal tool that has only been deployed at Meta. The generalizability of CodeCompose has not been measured separately with statistical significance. CodeCompose is an AI-assisted code authoring deployment that aims to improve developer productivity. The scope of the paper is to present the results from building and deploying CodeCompose at scale and its usage. A reduction in coding iteration time was observed for developers exposed to single-line ML completion.
The closest work to CodeCompose is the deployment of hybrid semantic ML code completion at Google. Qualitative feedback from developers was grouped into categories, with the majority finding CodeCompose to be a net positive experience. However, there were some UX problems, such as overloaded keyboard shortcuts and disruptions caused by competing suggestion systems. Some developers found traditional auto-complete to be more useful in certain cases. CodeCompose struggles to suggest correct code in scenarios where developers use specialized APIs and libraries, and is not helpful for tasks that involve writing heavily templatized code. However, it is well received by developers who work on building pipelines and common infrastructure and those who follow typical coding patterns. CodeCompose offers features such as auto-completing lines of code, API discovery, generating in-code documentation, and suggesting standard libraries. The value-add of fine-tuning the LLM on Meta's internal code may not translate externally, and investing in UX research is important to make CodeCompose a productive experience for all developers. CodeCompose sometimes hallucinates suggestions and struggles with function signatures, but it has helped some developers reduce coding time and improve their workflow. CodeCompose is an AI-assisted code authoring deployment that helps developers with coding tasks such as generating in-code documentation, writing boilerplate, and discovering new APIs. It highlights the value of naming and quality of documentation, providing immediate effectiveness boosts. CodeCompose successfully predicts relevant suggestions without being too noisy. It helps speed up the coding process and is generally accurate. Developers have reported positive feedback about their experience using CodeCompose. However, some developers found it intrusive and requested to disable it.
Overall, CodeCompose is a useful tool for developers looking to improve their coding workflow. CodeCompose is an AI-assisted code authoring tool that accelerates coding, aids in discovery, documentation, and suggestion accuracy. The tool received positive feedback from users, who found it helpful in accelerating their coding activity, discovering new APIs, and automating tedious tasks such as boilerplate code. The tool was found to be accurate and able to auto-complete lines of code. The labeling process of user feedback involved three steps: independent labeling, team collaboration, and refinement of categories. The qualitative feedback was collected through a forum and analyzed manually. The tool also generates in-code and API documentation. CodeCompose is an AI-assisted code authoring tool that received a 91.5% favorable response from users, with Python being the most commonly used language. The tool uses a randomized rollout strategy for each language to avoid anomalies and collects data on the acceptance rate of suggestions, the number of characters typed, and the display time of suggestions. CodeCompose made 4.5 million suggestions during the deployment period and was integrated into internal tools to limit demands on the client. The Language Server Protocol (LSP) is responsible for recording telemetry and ensuring consistency across various surfaces where CodeCompose is used. An observational study was conducted to evaluate the usefulness of CodeCompose in developer workflows. CodeCompose is an AI-assisted code authoring deployment system that utilizes user feedback to improve code suggestions. The system collects data by mining user feedback posts in the support group and manually analyzes suggestions. To evaluate performance, the system tracks various events in the IDE, such as displaying a suggestion inline, accepting or rejecting a suggestion, and the length of accepted suggestions. 
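A toy aggregation over such IDE events might look like the following sketch; the event names and payloads are hypothetical, not CodeCompose's actual telemetry schema:

```python
from collections import Counter

def acceptance_metrics(events):
    """Aggregate hypothetical suggestion telemetry.

    Each event is a (kind, suggestion_text) tuple, where kind is one of
    'shown', 'accepted', or 'rejected' (illustrative names only).
    Returns counts, the acceptance rate, and total accepted characters,
    mirroring the kinds of metrics the deployment tracks.
    """
    counts = Counter(kind for kind, _ in events)
    accepted_chars = sum(len(text) for kind, text in events
                         if kind == "accepted")
    shown = counts["shown"] or 1  # guard against division by zero
    return {
        "shown": counts["shown"],
        "accepted": counts["accepted"],
        "acceptance_rate": counts["accepted"] / shown,
        "accepted_chars": accepted_chars,
    }

events = [
    ("shown", "x = 1"), ("accepted", "x = 1"),
    ("shown", "y = 2"), ("rejected", "y = 2"),
]
metrics = acceptance_metrics(events)
```

Counting accepted characters alongside the acceptance rate matters because, as noted above, acceptance rate alone may underrepresent the actual benefit of longer suggestions.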
The system is equipped with A100 GPUs and is fine-tuned with a customized LCM objective to improve performance for languages like Hack, C++, Python, and Flow (JavaScript). The system processes requests via Thrift and is optimized for latency rather than throughput. CodeCompose is an AI-assisted code authoring deployment that employs a client-server architecture with a language server that is reused across multiple editor integrations. The system design allows CodeCompose to be plugged into any code editing surface, and the architecture makes it straightforward to integrate with in-house developer tools. Clients are extensions that run locally on developers' machines and are responsible for ultimately displaying inline suggestions in the editor. To mediate requests between the client and server, a Language Server Protocol (LSP) conformant language server was implemented; it provides core request logic such as debouncing and caching, and supports only one meaningful request type: textDocument/inlineCompletions. InCoder-1.3B, the public model with 1.3 billion parameters, was fine-tuned further on first-party data with the LCM objective. CodeCompose can generate code across 10+ languages. The authors collected first-party training data from Meta's code repositories and notebooks, applying several filters to avoid bugs and exclude old/obsolete patterns. To keep the training data fresh, they only looked at diffs going back up to 2 years and only kept the latest versions of files, which avoids training on code that may have been added a long time ago but is never modified. They used a specialized Language Causal Masking (LCM) objective, in which any task has only one mask; LCM only masks at certain trigger characters and lifts the masking step to the language level. The tokenizer separately encodes the metadata, the code before the cursor, the code after the cursor, and the target.
They found an optimal 70-30 split of the model's input length between code before and code after the cursor. The model generates only one sequence of tokens for the user experience and stops the generation early once a newline token has been generated. They provide telemetry that is required to evaluate or monitor the product. CodeCompose is an AI-assisted code authoring deployment that aims to address issues related to compliance and proprietary data. Evaluating the usefulness of AI-generated suggestions is a major challenge, and metrics like acceptance rate may underrepresent the actual benefits. The underlying LLM architecture for CodeCompose involves a generative model trained with the Causal Masking (CM) objective, which CodeCompose adapts into a proposed Language Causal Masking (LCM) fine-tuning objective. Developers can request multi-line suggestions by pressing Tab multiple times. CodeCompose also employs a suite of optimizations to bring the end-to-end latency below an acceptable threshold. Developers require suggestions to appear within 300ms-500ms, and not beyond 1s. Suggestions are not shown if there is any code to the right of the cursor, except certain tokens such as parentheses, brackets, etc. The system has to match the developer's typing speed. Performance and latency are important considerations. Balancing the rollout of CodeCompose's suggestions is a unique challenge. Suggestions are generated at different granularities depending on factors such as suggestion confidence, user context, and task completion. Code generation can be done at multiple levels due to the generality of LLMs, but requiring a lot of rework can erode trust and confidence in the model. Studies show that developers are fine with reworking a suggestion as long as the model provides a useful starting point or structure with relevant libraries and function calls.
However, if the model is not generating enough suggestions or generates only a few accurate suggestions, developers might eventually stop caring about the system as it becomes sparse. Other factors like training data, security issues, vulnerabilities, constant data drift, biases in the source corpus used for training and its validity, and complexities of parsing source code or developing semantic understanding can also impact trust.
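The 70-30 context split and the rule that suggestions are suppressed when non-trivial code sits to the right of the cursor can be sketched as below. The character-based budgeting and the exact set of allowed trailing tokens are simplifying assumptions (the paper's split is over the model's token-level input length):

```python
def build_prompt(before: str, after: str, budget: int = 2000):
    """Split a character budget roughly 70/30 between the code before
    and after the cursor, truncating the furthest-away context first."""
    before_budget = int(budget * 0.7)
    after_budget = budget - before_budget
    return before[-before_budget:], after[:after_budget]

def should_suggest(line: str, col: int) -> bool:
    """Suppress suggestions when there is code to the right of the
    cursor, allowing only closing tokens such as parentheses and
    brackets (the allowed set here is illustrative)."""
    rest = line[col:].strip()
    return all(ch in ")]}>,;" for ch in rest)
```

Keeping the larger share of the budget for the code before the cursor reflects the intuition that the immediately preceding context is usually the strongest signal for what comes next.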
CodeCompose is a coding assistant that uses LLMs to suggest code and generate comments, messages, and documentation. Developers can write inline comments in natural language, and CodeCompose will generate code that adheres to the comment. CodeCompose is deeply integrated with Meta's version of VSCode and can suggest the docstring for a function from the code that conventionally appears after the docstring. CodeCompose can look beyond the cursor to suggest code at the current position, fluently generate comments, messages, and documentation, and handle natural language proficiently.
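Because the underlying model supports infilling, suggesting a docstring from the code that follows it can be framed as a fill-in-the-middle query. A minimal sketch, assuming an illustrative `<infill>` sentinel rather than the model's real mask token:

```python
def docstring_infill_query(signature: str, body: str) -> str:
    """Frame docstring generation as infilling: the model sees the
    function signature before the mask and the function body after it,
    and is asked to produce the missing docstring in between.
    The <infill> sentinel is a stand-in for the actual mask token."""
    return f'{signature}\n    """<infill>"""\n{body}'

query = docstring_infill_query("def add(a, b):", "    return a + b")
```

This is the same bidirectional trick that lets the system look beyond the cursor: the code after the mask constrains what the generated docstring should describe.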
The paper presents details about the CodeCompose model, system architecture, challenges for coding assistants, the industrial setting and how code is authored at Meta, developer feedback on the impact of CodeCompose at Meta, results from an extensive large-scale deployment, threats to validity, and related work. The authors adopted a rollout strategy to incrementally build trust in CodeCompose by rolling it out to the company in waves of languages: (i) only Python, (ii) Hack, Flow (JavaScript), and C++, (iii) others, and within each wave, they rolled it out to increments of 25% of the developer population. At every step, they were able to gather developer feedback and iterate on the product before rolling out further. CodeCompose is an AI-assisted code authoring tool developed and deployed at Meta for serving tens of thousands of developers. It offers inline code suggestions based on contextual information, such as the code after the cursor, the file being edited, or the kernel being used. CodeCompose is fine-tuned for Meta-specific languages such as Hack and Flow and has been trained on 10+ programming languages. It is based on the InCoder LLM that merges generative capabilities with bi-directionality and is multi-lingual.
CodeCompose has an overwhelming 91.5% positive reception and an acceptance rate of 22% across several languages, with 8% of the code typed by users coming from accepted CodeCompose suggestions. It helps generate more in-code documentation and identifies obsolete patterns. CodeCompose is part of Meta's Software Development Life Cycle (SDLC) and is embedded in the organization's internal code repository that hosts source code written in 20+ programming languages.
Deploying CodeCompose raises unique challenges in terms of user experience and metrics that arise when such tools are rolled out in large-scale industrial settings. CodeCompose's impact on Meta's internal code authoring experience over a 15-day time window was significant, and it encourages developers to produce better quality code. Trust-building was important for CodeCompose's productization, which was achieved by working with language partner teams at Meta and obtaining feedback from early adopters. CodeCompose is an AI-assisted code authoring system that suggests entire statements or blocks of code during development. It utilizes large language models (LLMs) to offer coding suggestions based on the organization's code repository. CodeCompose has been deployed on various code authoring surfaces across the company, offering quantitative metrics and qualitative feedback to measure its impact. The system has been trained on various corpora to assimilate vast amounts of knowledge and assist developers in working more efficiently. However, the dynamic environment at a large software company poses challenges with respect to knowledge discovery, as tools get added and deprecated, and developers move across teams. CodeCompose addresses this challenge by surfacing internal knowledge during code authoring.