Ref:

[APA]
Fan, Z., Gao, X., Mirchev, M., Roychoudhury, A., & Tan, S. H. (2023). Automated Repair of Programs from Large Language Models. ICSE.
[GB/T 7714]
Fan Z, Gao X, Mirchev M, et al. Automated Repair of Programs from Large Language Models[C]. ICSE, 2023.


Abstract:

Large language models such as Codex, have shown the capability to produce code for many programming tasks. However, the success rate of existing models is low, especially for complex programming tasks. One of the reasons is that language models lack awareness of program semantics, resulting in incorrect programs, or even programs which do not compile. In this paper, we systematically study whether automated program repair (APR) techniques can fix the incorrect solutions produced by language models in LeetCode contests. The goal is to study whether APR techniques can enhance reliability in the code produced by large language models. Our study revealed that: (1) automatically generated code shares common programming mistakes with human-crafted solutions, indicating APR techniques may have potential to fix auto-generated code; (2) given bug location information provided by a statistical fault localization approach, the newly released Codex edit mode, which supports editing code, is similar to or better than existing Java repair tools TBar and Recoder in fixing incorrect solutions. By analyzing the experimental results generated by these tools, we provide several suggestions: (1) enhancing APR tools to surpass limitations in patch space (e.g., introducing more flexible fault localization) is desirable; (2) as large language models can derive more fix patterns by training on more data, future APR tools could shift focus from adding more fix patterns to synthesis/semantics based approaches, (3) combination of language models with APR to curate patch ingredients, is worth studying.


Ten Questions about the Paper
Q1. What problem does the paper try to solve?
It investigates whether automated program repair (APR) can improve the reliability of code generated by (large) language models.
—can automated program repair improve the code produced by language models?
(“Codex edit mode,” 2022. [Online]. Available: https://openai.com/blog/gpt-3-edit-insert)
This breaks down into three sub-questions:

  1. What mistakes are common in auto-generated code?
  2. Can APR tools effectively fix code from Codex?
  3. Can Codex edit mode fix program bugs?

Q2. Is this a new problem?
This is a comprehensive empirical study.
It examines whether APR tools can help guide Codex in fixing program bugs:
“study whether the side effect of APR tools, such as fault localization results, can be used to guide Codex-e, and how effective Codex-e is in fixing program bugs”

Q3. What scientific hypothesis does the paper set out to test?
As a comprehensive study, it mainly investigates:

  1. the characteristics of the incorrect code generated by language models;
  2. how effectively APR techniques fix the incorrect code auto-generated by language models such as Codex;
  3. whether a language model can itself act as an APR technique for fixing incorrect code;
  4. if (3) is feasible, whether APR techniques can improve the language model's ability to fix buggy programs, in what way (e.g., fault localization), and with which concrete APR techniques (e.g., symbolic-execution-based, search-based, and learning-based repair);
  5. the feasibility and effectiveness of combining APR with language models.

Q4. What related work exists, and how is it categorized? Which researchers in this field are worth following?

  1. Automated Program Repair(APR)
    [1] Claire Le Goues, Michael Pradel, & Abhik Roychoudhury (2019). Automated program repair. Communications of the ACM.
    [2019 survey]

    Search-based APR tools take a buggy program and a correctness criterion as inputs, and generate patches in two steps: (1) producing patches using predefined code transformation operators; and (2) searching the patch space for a patch that satisfies the correctness criterion (e.g., passes the given tests). Search-based repair can scale to large programs, but often not to large search spaces.
    [2] Shin Hwei Tan, & Abhik Roychoudhury (2015). relifix: Automated repair of software regressions. International Conference on Software Engineering.
    [ReliFix] [Search-based]
    [3] Shin Hwei Tan, Zhen Dong, Xiang Gao, & Abhik Roychoudhury (2018). Repairing crashes in Android apps. International Conference on Software Engineering.
    [Search-based]
    [4] Ming Wen, Junjie Chen, Rongxin Wu, Dan Hao, & Shing-Chi Cheung (2018). Context-aware patch generation for better automated program repair. International Conference on Software Engineering.
    [Search-based]
    [5] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, & Xiangqun Chen (2018). Shaping program repair space with existing patches and similar code. International Symposium on Software Testing and Analysis.
    [Search-based]
    [6] Sergey Mechtaev, Xiang Gao, Shin Hwei Tan, & Abhik Roychoudhury (2018). Test-Equivalence Analysis for Automatic Patch Generation. ACM Transactions on Software Engineering and Methodology.
    [Search-based]
    [7] Kui Liu, Anil Koyuncu, Dongsun Kim, & Tegawendé F. Bissyandé (2019). TBar: Revisiting Template-based Automated Program Repair. International Symposium on Software Testing and Analysis.
    [TBar] [Search-based]
    [8] Kui Liu, Anil Koyuncu, Dongsun Kim, & Tegawendé F. Bissyandé (2019). AVATAR: Fixing Semantic Bugs with Fix Patterns of Static Analysis Violations. International Conference on Software Analysis, Evolution and Reengineering.
    [Search-based]
    [9] Westley Weimer, ThanhVu Nguyen, Claire Le Goues, & Stephanie Forrest (2009). Automatically finding patches using genetic programming. International Conference on Software Engineering.
    [GenProg] [Search-based]

    Semantics-based APR techniques (e.g., SemFix [10], Nopol [11], and Angelix [12]) generate patches by (1) formulating a repair constraint that needs to be satisfied by a program passing a given test-suite; and (2) solving the repair constraint to generate patches.
    [10] Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, & Satish Chandra (2013). SemFix: program repair via semantic analysis. International Conference on Software Engineering.
    [SemFix] [Semantics-based]
    [11] Jifeng Xuan, Matias Martinez, Favio DeMarco, Maxime Clement, Sebastian R. Lamelas Marcote, Thomas Durieux, Daniel Le Berre, & Martin Monperrus (2017). Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs. IEEE Transactions on Software Engineering.
    [Nopol] [Semantics-based]
    [12] Sergey Mechtaev, Jooyong Yi, & Abhik Roychoudhury (2016). Angelix: scalable multiline program patch synthesis via symbolic analysis. International Conference on Software Engineering.
    [Angelix] [Semantics-based]

    The application of deep learning techniques in program repair has been explored in the past few years.

    DeepRepair [13] and DeepFix [14] are early attempts to fix bugs by learning fixes from similar code.
    [13] Martin White, Michele Tufano, Christopher Vendome, & Denys Poshyvanyk (2016). Deep learning code fragments for code clone detection. Automated Software Engineering.
    [14] Rahul Gupta, Soham Pal, Aditya Kanade, & Shirish Shevade (2017). DeepFix: Fixing Common C Language Errors by Deep Learning. AAAI Conference on Artificial Intelligence.

    SequenceR [15] adapts neural machine translation (NMT) to generate patches, whereas CoCoNuT [16] and CURE [17] further improve the results by encoding program context or using a programming language model, respectively.
    [15] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, & Martin Monperrus (2021). SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering.
    [16] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, & Lin Tan (2020). CoCoNuT: combining context-aware neural translation models using ensemble for program repair. International Symposium on Software Testing and Analysis.
    [17] Nan Jiang, Thibaud Lutellier, & Lin Tan (2021). CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. International Conference on Software Engineering.

    DLFix [18] uses a two-layer tree-based RNN to learn code transformations, and Recoder [19] designs a syntax-guided learning approach to improve the decoder of a DL model.
    [18] Yi Li, Shaohua Wang, & Tien N. Nguyen (2020). DLFix: context-based code transformation learning for automated program repair. International Conference on Software Engineering.
    [19] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, & Lu Zhang (2021). A Syntax-Guided Edit Decoder for Neural Program Repair. ESEC/FSE.

  2. Database
    [20] René Just, Darioush Jalali, & Michael D. Ernst (2014). Defects4J: a database of existing faults to enable controlled testing studies for Java programs. International Symposium on Software Testing and Analysis.

  3. Large Language Model for Code Generation
    [21] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Thomas Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Samuel McCandlish, Alec Radford, Ilya Sutskever, & Dario Amodei (2020). Language Models are Few-Shot Learners. NeurIPS. [NLP]
    [22] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, & Jacob Steinhardt (2021). Measuring Coding Challenge Competence With APPS. NeurIPS Datasets and Benchmarks. [benchmark]

    Later, Codex [23] (the back-end model that powers GitHub Copilot), AlphaCode [24], CodeWhisperer [25], and the models of [26] emerged as language-model-based automatic code generation platforms.
    [23] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth A. Barnes, Ariel Herbert-Voss, William H. Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew M. Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Samuel McCandlish, Ilya Sutskever, & Wojciech Zaremba (2021). Evaluating Large Language Models Trained on Code. arXiv preprint. [Codex]
    [24] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien De Masson D’autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel Mankowitz, Esme Robson, Pushmeet Kohli, Nando De Freitas, Koray Kavukcuoglu, & Oriol Vinyals (2022). Competition-Level Code Generation with AlphaCode. Science. [AlphaCode]
    [25] “Amazon codewhisperer,” 2022. [Online]. Available: https://aws.amazon.com/codewhisperer/ [Codewhisperer]
    [26] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, & Charles Sutton (2021). Program Synthesis with Large Language Models. arXiv preprint.

    There are emerging approaches that combine program synthesis with large language models for fixing API usage [27] and synthesizing regular expressions [28].
    [27] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, & Rahul Sharma (2022). Jigsaw: Large Language Models meet Program Synthesis. International Conference on Software Engineering.
    [28] Kia Rahmani, Mohammad Shahid Raza, Sumit Gulwani, Vu Le, Dan Morris, Arjun Radhakrishna, Gustavo Soares, & Ashish Tiwari (2021). Multi-modal program inference: a marriage of pre-trained language models and component-based synthesis. OOPSLA.

    [29] evaluated the quality of code generated by Copilot on a small set of randomly selected LeetCode programming tasks (33 tasks with 132 solutions).
    [29] Nhan Nguyen, & Sarah Nadi (2022). An empirical evaluation of GitHub Copilot’s code suggestions. International Conference on Mining Software Repositories.

    The most closely related papers study whether language models can fix bugs:
    [30] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, & Brendan Dolan-Gavitt (2023). Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs? IEEE Symposium on Security and Privacy.
    [31] Julian Aron Prenner, Hlib Babii, & Romain Robbes (2022). Can OpenAI’s Codex Fix Bugs? An evaluation on QuixBugs. International Workshop on Automated Program Repair.

Q5. What is the key to the solution proposed in the paper?

  1. Built LMDefects, a dataset of 113 LeetCode programming tasks (the tasks are recent, chosen so that Codex has not been trained on them).
    P.S.: roughly speaking, it suffices that the tasks were released after the training cutoff of the Codex version used.
  2. Parameter settings for Codex, and the modeling/representation of the tasks and test suites. Public tests are provided inside the natural-language description to guide the APR tools, rather than embedded in the prompt (see the sampling sketch after this list).
    The authors manually converted the public test cases into JUnit tests.
    Concretely, given the prompt (the problem description), Codex is run to generate 50 candidate solutions, and the top 5 with the highest estimated probability of correctness are selected via Codex's best-of parameter.
    The solutions (the generated algorithm code) are then submitted to LeetCode for its internal (private) tests.
  3. Evaluate whether repair tools can fix the incorrect solutions produced by Codex: run the test-driven repair tools TBar and Recoder against the public test cases, then verify success by submitting the patched code to LeetCode.
  4. Investigate whether Codex-e, which modifies existing code by generating program edits, can serve as an APR tool, and compare it against TBar and Recoder.
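
A minimal sketch, not the authors' harness, of how step 2 could be scripted. It assumes the legacy pre-1.0 `openai` Python client and the now-retired `code-davinci-002` Codex model that were current when the paper was written; the prompt layout, sampling parameters, and mean-log-probability ranking are illustrative assumptions.

```python
import openai  # legacy pre-1.0 client; reads OPENAI_API_KEY from the env

def generate_candidates(problem_description: str, public_tests: str,
                        n_samples: int = 50, keep: int = 5) -> list[str]:
    # Public tests stay inside the natural-language description,
    # not embedded as code in the prompt.
    prompt = (f"// {problem_description}\n"
              f"// Examples: {public_tests}\n"
              "class Solution {\n")
    resp = openai.Completion.create(
        model="code-davinci-002",  # Codex completion model (since retired)
        prompt=prompt,
        n=n_samples,               # 50 candidate solutions
        max_tokens=1024,           # generous bound; RQ1 notes Codex can
                                   # still emit incomplete code regardless
        temperature=0.8,           # diversity across candidates
        logprobs=1,                # needed to rank by model confidence
    )
    # Rank candidates by mean token log-probability, a stand-in for the
    # "highest probability of correctness" / best-of selection in the notes.
    def mean_logprob(choice):
        lps = [lp for lp in choice["logprobs"]["token_logprobs"] if lp is not None]
        return sum(lps) / max(len(lps), 1)
    ranked = sorted(resp["choices"], key=mean_logprob, reverse=True)
    return [prompt + c["text"] for c in ranked[:keep]]
```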

Q6. How are the experiments in the paper designed?

RQ1. Analyze the typical mistakes in Codex-generated solutions to assess the feasibility of APR-based bug fixing.
(1) Manually fix the solutions Codex generates on LMDefects and categorize the mistakes, then compare them with Codeflaws (a benchmark of buggy submissions from programming-contest participants).
Syntax errors: (a) incomplete code; (b) references to undefined variables/functions/classes.
The authors reduce syntax errors by increasing the number of generated solutions and the length of the generated snippets (the maximum number of tokens).
“Despite providing the maximum length as the bound for code generation, Codex still generates incomplete code where the average token length is 628. It is worthwhile to study the feasibility of applying code completion techniques for fixing the auto-generated incomplete code by Codex.”
"For programs with undefined functions, the function body needs to be synthesized to resolve the compilation errors."
“Future research can work on using program synthesis techniques to resolve the undefined functions or invoking Codex on a function-by-function basis to synthesize the function body.”
Since mismatched brackets are easy to fix ("using a regular-expression matching mechanism"), the authors repaired them manually and reassigned those programs to the other categories, "Misaligned Algorithm" and "Syntax Error" (a bracket-closing sketch follows below).

P.S.: The authors found that even after manually fixing the mismatched brackets, these programs still fail LeetCode's hold-out tests.
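
The bracket repair mentioned above can be sketched as follows; the paper says regular-expression matching suffices, while this illustrative version keeps a simple bracket stack and appends the missing closers to truncated code (a real fixer would also have to skip brackets inside string literals and comments).

```python
# Maps each opening bracket to its closing counterpart.
OPEN_TO_CLOSE = {"{": "}", "(": ")", "[": "]"}

def close_unbalanced_brackets(java_source: str) -> str:
    stack = []
    for ch in java_source:
        if ch in OPEN_TO_CLOSE:
            stack.append(ch)
        elif ch in OPEN_TO_CLOSE.values():
            if stack and OPEN_TO_CLOSE[stack[-1]] == ch:
                stack.pop()
    # Append a closer for every bracket still open, innermost first.
    return java_source + "".join(OPEN_TO_CLOSE[ch] for ch in reversed(stack))

# e.g. a truncated solution "class Solution { int f() { return 1;"
# becomes "...return 1;}}": it now compiles, but, per the P.S. above,
# may still fail LeetCode's hold-out tests.
```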

RQ2. Patches generated by APR tools are divided into plausible patches and correct patches: the former only pass the public test suite, while the latter also pass LeetCode's private tests upon submission (a triage sketch follows below).
For each defect category, the repair results of the APR tools are analyzed statistically. "The results show that existing APR tools are still limited in generating complex patches that require edits of multiple lines."
The two APR tools used in the paper, TBar and Recoder, are compared against each other.

“Existing pattern based and learning based APR are ineffective at fixing auto-generated code, challenges include: (1) limited search space; (2) unable to generate multi-edit patches; (3) lack of awareness of program dependencies.” //why?//how?//
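
A minimal sketch of the plausible/correct triage used in RQ2, assuming hypothetical `run_public_tests` and `passes_leetcode_private_tests` callables that stand in for the JUnit harness and the LeetCode submission step.

```python
from enum import Enum

class Verdict(Enum):
    IMPLAUSIBLE = "fails public tests"
    PLAUSIBLE = "passes public tests only"
    CORRECT = "passes public and private tests"

def triage(patched_program: str, run_public_tests,
           passes_leetcode_private_tests) -> Verdict:
    if not run_public_tests(patched_program):
        return Verdict.IMPLAUSIBLE       # not even a candidate patch
    if passes_leetcode_private_tests(patched_program):
        return Verdict.CORRECT           # passes the hidden tests too
    return Verdict.PLAUSIBLE             # overfits the public tests
```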

RQ3.
OpenAI released a new Codex edit mode (Codex-e) that can modify the contents of an existing program. The experiment asks: "can suitable instructions make Codex-e repair a buggy program?"

  1. Codex-e-bug: tell Codex-e that the given program contains a bug and ask it to fix it; the instruction is simply "Fix bug in the program".
  2. Codex-e-line: the instruction names a line number, formulated as "Fix line N".
  3. Codex-e-stm: fix a specific statement; for the text s1 of a suspicious line (e.g., s1: i-=2;), the instruction is "Fix s1".

The ten most suspicious statements are selected, and Codex-e is asked to generate five candidate edits for each (i.e., Codex-e tries to fix an incorrect solution within 50 attempts). The Codex-e-stm variant performs best overall; a minimal call sketch follows.
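
A minimal call sketch, not the authors' harness, of the Codex-e-stm loop. It assumes the legacy pre-1.0 `openai` client and the now-retired `code-davinci-edit-001` edit model; `suspicious_lines` is an assumed input of (line_number, statement_text) pairs ranked by a statistical fault localizer (e.g., Ochiai-style suspiciousness), most suspicious first.

```python
import openai  # legacy pre-1.0 client; reads OPENAI_API_KEY from the env

def codex_e_stm(buggy_source: str, suspicious_lines, per_stmt: int = 5):
    # The other instruction styles from the list above would be:
    #   Codex-e-bug:  instruction = "Fix bug in the program"
    #   Codex-e-line: instruction = f"Fix line {n}"
    patches = []
    for _, stmt in suspicious_lines[:10]:       # ten most suspicious
        resp = openai.Edit.create(
            model="code-davinci-edit-001",      # Codex edit mode (retired)
            input=buggy_source,
            instruction=f"Fix {stmt.strip()}",  # Codex-e-stm: "Fix s1"
            n=per_stmt,                         # five edits per statement
            temperature=0.7,
        )
        patches += [c["text"] for c in resp["choices"]]
    return patches                              # up to 10 * 5 = 50 attempts
```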
Next:

  1. Compare the effectiveness of Codex-e-stm, TBar, and Recoder.
  2. Combine the patch spaces of different tools (see the set-coverage sketch after this list)
    a) Combine patch space of Codex-e and APR
    Definition 1 (patch ingredient): the set of operators/operands (e.g., variables, literals, operators) used to construct the corresponding patch.
    Explore whether the patch search spaces of APR and Codex-e can complement each other by examining the patch ingredients produced by the different tools. # Recoder is a deep-learning-based model, so the combination of Recoder and Codex-e is not considered.
    <For each LMDefects task, two of the authors agreed on a minimal patch, called the "ground truth".>
    (Not entirely clear to me:) for each incorrect solution, the required patch ingredients $I_1$ are derived from the "ground truth" patch.
    Two questions are then explored:
    Can a single tool (TBar/Codex-e) generate all the required patch ingredients for each incorrect solution?
    Can combining TBar and Codex-e (running TBar and then Codex-e in sequence) produce all the required patch ingredients?
    b) Combine APR with Multiple Solutions of Codex
    Codex generates a set of candidate programs, each of which may interpret the problem description slightly differently and therefore contain slightly different code. The authors study the feasibility of combining patch ingredients across these candidates (the "TBar+Codex" setting).
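
A minimal set-based sketch of the ingredient-coverage questions in (a) and (b). The ingredient sets below are invented for illustration; in the study, the required set $I_1$ comes from the ground-truth patch and the per-tool sets from the patches each tool actually generates.

```python
def covers(required: set[str], *ingredient_sets: set[str]) -> bool:
    """Do the combined ingredients of one or more tools cover the ground truth?"""
    available = set().union(*ingredient_sets) if ingredient_sets else set()
    return required <= available

# Required patch ingredients I1, derived from the ground-truth patch:
I1 = {"left", "right", "-", "1", "<="}         # illustrative only
tbar_ing = {"left", "-", "1"}                  # from TBar's patches
codex_e_ing = {"right", "<=", "=="}            # from Codex-e's edits

covers(I1, tbar_ing)               # False: one tool alone misses ingredients
covers(I1, tbar_ing, codex_e_ing)  # True: TBar followed by Codex-e covers I1
```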

Q7. What dataset is used for quantitative evaluation? Is the code open source?
LMDefects, a new dataset that contains 113 Java programming tasks.
(The link is not given in the paper.) https://github.com/apr4codex/icse2023
Codex: omitted.
TBar: omitted.
Recoder: omitted.

Q8. Do the experiments and results in the paper adequately support the hypotheses being tested?
RQ1. Auto-generated programs share common mistakes with human-written programs, and contain certain negative symptoms including: (1) names indicate wrong algorithms; (2) similar code blocks; (3) irrelevant helper functions.
RQ2. Existing pattern based and learning based APR are ineffective at fixing auto-generated code, challenges include: (1) limited search space; (2) unable to generate multi-edit patches; (3) lack of awareness of program dependencies.
RQ3_1. The effectiveness of Codex-e with a given specific fault location (Codex-e-stm) is nearly comparable to its effectiveness without any location guidance (Codex-e-bug).
RQ3_2. By using patch ingredients extracted from (1) TBar’s and Codex-e’s patches, and (2) TBar’s patches and multiple generated solutions by Codex — we successfully identify the required patch ingredients of more incorrect solutions.

Q9. What are the paper's contributions?

  1. Characterized the buggy code that Codex auto-generates, and released the public dataset LMDefects;
  2. The first study of Codex's effectiveness as an APR tool;
  3. Studied how well search-based and deep-learning-based APR tools repair the buggy code Codex auto-generates;
  4. Studied possible ways of combining APR tools with Codex when repairing Codex's auto-generated buggy code: a) combining the patches of the two, and b) repairing multiple Codex solutions with TBar's patches.

Q10. What next? What follow-up work is worth pursuing?
“As Codex generated code share common mistakes with human-written code, using APR techniques to enhance reliability in auto-generated code by automatically fixing the bugs in the auto-generated code is worth studying.”

  1. A systematic investigation of the buggy programs generated by language models (the authors' approach: building the LMDefects dataset).
  2. Codex seems to rely heavily on function names to solve programming tasks; future language models designed for code generation should focus on summarizing the useful information in the problem description to reduce this reliance.
  3. Search-based APR needs additional fix patterns and a larger search space; how to provide these is an open problem.
  4. Semantics-based (symbolic) repair was not evaluated in this paper. Such tools usually target specific classes of semantic bugs (e.g., Nopol focuses on repairing buggy conditional statements); their overall effectiveness may trail Recoder and TBar, but they may do better on the specific bugs they target. In addition, how to integrate multiple semantics-based repair tools, and how to combine different repair tools (semantics-based repair typically depends on precise fault localization, which may be a promising angle), are worth studying.
  5. Learning-based APR cannot guarantee that the learned model always applies correctly to repairing every program (it lacks generalizability); the authors suggest (1) incorporating domain-specific knowledge when learning new patterns, and (2) improving the generalizability of the learned patterns.
  6. A test-driven repair framework: future code generation may not produce a correct program from scratch, but instead first generate an incorrect program and then refine it iteratively against tests.
  7. Prioritization of correct programs: combine recent advances in APR research on patch correctness assessment [32], [33] and patch prioritization [34], [35] to guide language models such as Codex toward generating better programs.
  8. Obtaining patch ingredients: future research can work on automatically searching for and merging patch ingredients to generate more complex programs/blocks.
  9. How to construct edit instructions that guide Codex-e toward more correct fixes (cf. Codex-e-bug, Codex-e-line, and Codex-e-stm).

[32] Shin Hwei Tan, Hiroaki Yoshida, Mukul R. Prasad, & Abhik Roychoudhury (2016). Anti-patterns in search-based program repair. Foundations of Software Engineering.
[33] Shangwen Wang, Ming Wen, Bo Lin, Hongjun Wu, Yihao Qin, Deqing Zou, Xiaoguang Mao, & Hai Jin (2020). Automated patch correctness assessment: how far are we? Automated Software Engineering.
[34] He Ye, Jian Gu, Matias Martinez, Thomas Durieux, & Martin Monperrus (2021). Automated Classification of Overfitting Patches with Statically Extracted Code Features. IEEE Transactions on Software Engineering.
[35] Ali Ghanbari (2020). ObjSim: Lightweight Automatic Patch Prioritization via Object Similarity. arXiv preprint.


P.S.: The reference entries whose venue reads "arXiv" are not in a fully standard format; I have kept them as exported for convenience (laziness on my part).