Summary: Robustness and Reliability of Large Language Model Code Generation (arxiv.org)
6,974 words - PDF document
One Line
The text discusses the reliability and robustness of code generated by large language models using a benchmark of coding questions and an abstract syntax tree evaluator.
Key Points
- Large language models (LLMs) are popular for coding help but their reliability and robustness have not been thoroughly studied.
- A benchmark has been proposed to evaluate the reliability and robustness of code generated by LLMs, using a dataset from Stack Overflow and an evaluator based on abstract syntax trees (AST).
- Previous studies have highlighted issues in code quality from online forums, such as compilation errors, deprecated APIs, and security risks.
- The evaluation of LLM-generated code focuses on its robustness and reliability, with experiments conducted to answer research questions about API misuse.
- Static analysis is used to analyze code misuse by its structure, providing full coverage beyond semantic correctness.
- The performance and misuse rate of LLMs in generating code are reported, with lower misuse rates on RobustAPI indicating better reliability.
- Several studies have evaluated the robustness and reliability of LLMs for code generation, including assessments of correctness and benchmark datasets.
- RobustAPI checks API usage patterns drawn from existing research on API misuse; each pattern is expressed as control structures and method calls.
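To make the notion of "API misuse" concrete, here is a hypothetical Java example of the kind of pattern such a checker targets: `File.createNewFile()` declares a checked `IOException`, and the expected robust usage wraps the call in try-catch. The class and method names below are illustrative sketches, not taken from the paper's dataset.

```java
import java.io.File;
import java.io.IOException;

public class ApiMisuseExample {
    // Misuse: calling createNewFile() and letting the IOException propagate
    // can crash the program on unexpected input in production (for example,
    // when the parent directory does not exist or is not writable).
    // Robust variant: the API call is wrapped in the expected try-catch
    // control structure, which is what an AST-based checker looks for.
    public static boolean createSafely(String path) {
        try {
            return new File(path).createNewFile();
        } catch (IOException e) {
            // Handle the failure instead of crashing on unexpected input.
            return false;
        }
    }

    public static void main(String[] args) {
        String tmp = System.getProperty("java.io.tmpdir")
                + "/robustapi_demo_" + System.nanoTime();
        System.out.println(createSafely(tmp));                      // writable path: file created
        System.out.println(createSafely("/no_such_dir/demo.txt"));  // IOException caught, returns false
    }
}
```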
Summaries
38 word summary
This summary discusses the reliability and robustness of code generated by large language models (LLMs) using a benchmark that includes a dataset of coding questions from Stack Overflow and an evaluator that uses an abstract syntax tree (AST).
44 word summary
This summary discusses the reliability and robustness of code generated by large language models (LLMs) and presents a benchmark for evaluating their performance. The benchmark includes a dataset of coding questions from Stack Overflow and an evaluator that uses an abstract syntax tree (AST) to analyze the generated code.
368 word summary
Large language models (LLMs) have become a popular resource for software engineers seeking coding help. However, the reliability and robustness of code generated by LLMs have not been thoroughly studied. The misuse of APIs in the generated code can lead to unexpected runtime failures, such as crashes and resource leaks.
This article presents a benchmark for evaluating the reliability and robustness of code generated by large language models. The benchmark includes a dataset of coding questions from Stack Overflow and an evaluator that uses an abstract syntax tree (AST) to analyze the generated code snippets.
The paper discusses the evaluation of the reliability and robustness of large language model (LLM)-generated code. It mentions previous studies on code quality from online forums, highlighting issues such as compilation errors, deprecated APIs, and security risks. The authors introduce RobustAPI, a benchmark for assessing API misuse in code generated by LLMs.
The document discusses the robustness and reliability of large language model (LLM) code generation. It begins with a prompt format for answering code questions using a given API, then introduces two few-shot settings: one-shot-irrelevant and one-shot-relevant.
Testing the reliability and robustness of code is challenging because even high-coverage test cases only cover semantic correctness, not the unexpected inputs that may occur in production. To address this, static analysis is used to detect code misuse from its structure, providing full coverage beyond semantic correctness.
The excerpt discusses a series of experiments conducted on state-of-the-art large language models (LLMs) to evaluate their ability to answer real-world coding questions and their reliability regarding API misuse. The experiments aimed to answer several research questions, including the API misuse rate of each model.
The text discusses the performance and misuse rate of large language models (LLMs) in generating code. Table 2 shows the performance of each LLM on RobustAPI, with lower values indicating better performance. Some LLMs can effectively follow the API usage demonstrated in relevant examples, lowering their misuse rate.
Several studies have been conducted to evaluate the robustness and reliability of large language models (LLMs) for code generation. One such study examined code generated by ChatGPT and assessed its correctness. Another study introduced CodeXGLUE, a benchmark dataset for program understanding and generation.
The document describes the API usage patterns checked in RobustAPI. These patterns are based on existing research on API misuse. Each pattern consists of control structures and method calls separated by commas, and is checked against the AST of each code snippet.
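The comma-separated pattern format can be sketched as a small checker. Assuming (hypothetically) that each snippet has already been reduced, via its AST, to a flat in-order sequence of control-structure keywords and method-call names, a pattern such as "try, createNewFile, catch" matches when its tokens appear as a subsequence of that sequence. This is a simplified sketch under those assumptions, not the paper's actual evaluator.

```java
import java.util.Arrays;
import java.util.List;

public class PatternChecker {
    // A pattern lists control structures and method calls separated by
    // commas. A snippet follows the pattern if its extracted token
    // sequence contains the pattern's tokens in order (subsequence match);
    // otherwise it is flagged as a potential API misuse.
    public static boolean followsPattern(String pattern, List<String> snippetTokens) {
        String[] expected = pattern.split("\\s*,\\s*");
        int i = 0;
        for (String token : snippetTokens) {
            if (i < expected.length && token.equals(expected[i])) {
                i++;
            }
        }
        return i == expected.length;
    }

    public static void main(String[] args) {
        String pattern = "try, createNewFile, catch";
        // Hypothetical token sequence extracted from a robust snippet's AST.
        List<String> robust = Arrays.asList("try", "new File", "createNewFile", "catch");
        // A snippet that calls the API with no surrounding try-catch.
        List<String> misuse = Arrays.asList("new File", "createNewFile");
        System.out.println(followsPattern(pattern, robust));  // true
        System.out.println(followsPattern(pattern, misuse));  // false
    }
}
```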