Ref:

Yasunaga M, Liang P. Break-it-fix-it: Unsupervised learning for program repair[C]//International Conference on Machine Learning. PMLR, 2021: 11941-11952.

Code and data are available at https://github.com/michiyasunaga/bifi.
Experiments are available at https://worksheets.codalab.org/worksheets/0xfddb2ef01a9f4dc0b5d974a5a97174be.


Abstract:
We consider repair tasks: given a critic (e.g., compiler) that assesses the quality of an input, the goal is to train a fixer that converts a bad example (e.g., code with syntax errors) into a good one (e.g., code with no syntax errors). Existing works create training data consisting of (bad, good) pairs by corrupting good examples using heuristics (e.g., dropping tokens). However, fixers trained on this synthetically-generated data do not extrapolate well to the real distribution of bad inputs. To bridge this gap, we propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas: (i) we use the critic to check a fixer’s output on real bad inputs and add good (fixed) outputs to the training data, and (ii) we train a breaker to generate realistic bad code from good code. Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data. We evaluate BIFI on two code repair datasets: GitHub-Python, a new dataset we introduce where the goal is to repair Python code with AST parse errors; and DeepFix, where the goal is to repair C code with compiler errors. BIFI outperforms state-of-the-art methods, obtaining 90.5% repair accuracy on GitHub-Python (+28.5%) and 71.7% on DeepFix (+5.6%). Notably, BIFI does not require any labeled data; we hope it will be a strong starting point for unsupervised learning of various repair tasks.


Ten Questions about the Paper

Q1. What problem does the paper try to solve?
Program repair: fixing compiler errors / syntax errors in code.

Q2. Is this a new problem?
In short: it addresses the distribution of real bad examples.

The performance of a deep-learning model depends on its training data, which here consists of (bad, good) example pairs.
Existing work corrupts good examples with heuristics (e.g., dropping tokens) to produce bad ones, then adds the resulting (bad, good) pairs to the training set. However, a model/fixer trained on data generated this way does not generalize well to the real distribution of errors found in actual projects.

Q3. What scientific hypotheses does the paper test?

  1. Bad examples generated from good examples (examples with no errors) by a trained breaker model are closer to the bad examples found in real-world projects.
  2. Training data closer to real bad examples improves the fixer's performance.

Q4. What related work exists? How can it be categorized? Which researchers in this area are worth following?

DeepFix
Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: fixing common C language errors by deep learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17). AAAI Press, 1345–1351.

[heuristic noising]
DrRepair
Yasunaga, M. and Liang, P. Graph-based, self-supervised program repair from diagnostic feedback. In International Conference on Machine Learning (ICML), 2020.

[unsupervised machine translation]
Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. [Phrase-Based & Neural Unsupervised Machine Translation](https://aclanthology.org/D18-1549). In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5039–5049, Brussels, Belgium. Association for Computational Linguistics.

error resolution records:
Defects4J
René Just, Darioush Jalali, and Michael D. Ernst. 2014. Defects4J: a database of existing faults to enable controlled testing studies for Java programs. In Proceedings of the 2014 International Symposium on Software Testing and Analysis (ISSTA 2014). Association for Computing Machinery, New York, NY, USA, 437–440. https://doi.org/10.1145/2610384.2628055

SequenceR
Z. Chen, S. Kommrusch, M. Tufano, L. -N. Pouchet, D. Poshyvanyk and M. Monperrus, “SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair,” in IEEE Transactions on Software Engineering, vol. 47, no. 9, pp. 1943-1959, 1 Sept. 2021, doi: 10.1109/TSE.2019.2940179.

DeepDelta
Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian. 2019. DeepDelta: learning to repair compilation errors. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). Association for Computing Machinery, New York, NY, USA, 925–936. https://doi.org/10.1145/3338906.3340455

Graph2Diff
Daniel Tarlow, Subhodeep Moitra, Andrew Rice, Zimin Chen, Pierre-Antoine Manzagol, Charles Sutton, and Edward Aftandilian. 2020. Learning to Fix Build Errors with Graph2Diff Neural Networks. In Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops (ICSEW’20). Association for Computing Machinery, New York, NY, USA, 19–20. https://doi.org/10.1145/3387940.3392181

Getafix
Bader, J., Scott, A., Pradel, M., & Chandra, S. (2019). Getafix: Learning to fix bugs automatically. Proceedings of the ACM on Programming Languages, 3(OOPSLA), 1-27.

Y. Ding, B. Ray, P. Devanbu and V. J. Hellendoorn, “Patching as Translation: the Data and the Metaphor,” 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2020, pp. 275-286.

learning code fixers from unlabeled or synthetic data:
sk_p
Pu Y, Narasimhan K, Solar-Lezama A, et al. sk_p: a neural program corrector for MOOCs[C]//Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity. 2016: 39-40.

GradeIT
Parihar S, Dadachanji Z, Singh P K, et al. Automatic grading and feedback using program repair for introductory programming courses[C]//Proceedings of the 2017 ACM Conference on Innovation and Technology in Computer Science Education. 2017: 92-97.

TRACER
Ahmed U Z, Kumar P, Karkare A, et al. Compilation error repair: for the student programs, from the student programs[C]//Proceedings of the 40th International Conference on Software Engineering: Software Engineering Education and Training. 2018: 78-87.

DeepBugs
Pradel M, Sen K. Deepbugs: A learning approach to name-based bug detection[J]. Proceedings of the ACM on Programming Languages, 2018, 2(OOPSLA): 1-25.

Dynamic Neural Program Embeddings
Wang K, Singh R, Su Z. Dynamic Neural Program Embeddings for Program Repair[C]//International Conference on Learning Representations. 2018.

Neural Program Repair by Jointly Learning to Localize and Repair
Vasic M, Kanade A, Maniatis P, et al. Neural Program Repair by Jointly Learning to Localize and Repair[C]//International Conference on Learning Representations. 2018.

Deep Reinforcement Learning for Syntactic Error Repair in Student Programs
Gupta, R., Kanade, A. and Shevade, S. 2019. Deep Reinforcement Learning for Syntactic Error Repair in Student Programs. Proceedings of the AAAI Conference on Artificial Intelligence. 33, 01 (Jul. 2019), 930-937. DOI:https://doi.org/10.1609/aaai.v33i01.3301930.

SampleFix
Hajipour H, Bhattacharyya A, Staicu C A, et al. SampleFix: Learning to Generate Functionally Diverse Fixes[C]//Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2021: 119-133.

Denoising autoencoding

Domain adaptation

Data augmentation

Self-training

Q5. What is the key to the solution proposed in the paper?
Improve the fixer and the breaker jointly through multiple rounds of data generation and training (somewhat reminiscent of a generative-adversarial setup).
Core ideas:
(1) Use the compiler as the critic: for a real-world bad example, check whether the fixer's output is correct; if so, pair the (fixed) output with the bad example and add the pair to the training set.
(2) Train a breaker that generates bad examples from good ones, aiming to match real bad examples as closely as possible.
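For the GitHub-Python setting, the critic in (1) is simply "does the code parse?"; a minimal version could be written as below (for DeepFix's C code the compiler plays this role instead):

```python
import ast

def critic(code: str) -> bool:
    """Return True iff the code parses, i.e., counts as a 'good' example."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

print(critic("x = (1 + 2)"))  # True
print(critic("x = (1 + 2"))   # False: unbalanced parenthesis
```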

Concretely, given an initial fixer trained on synthetically paired data, BIFI improves the fixer and the breaker through multiple rounds of data generation and training:
(1) apply the fixer to real bad examples and use the critic (does the code compile?) to keep only correct outputs, adding the resulting paired data to the training set,
(2) use the resulting data to train the breaker,
(3) apply the trained breaker to good examples to generate bad ones, yielding more paired data,
(4) train the fixer on the paired data newly generated in (1) and (3).
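Put together, steps (1)-(4) form the loop below. Here `train` is a hypothetical stand-in (a pair-memorizing lookup) for actual seq2seq training, and the toy data at the end only demonstrates the data flow, not real repair:

```python
import ast

def critic(code):
    """Good iff the code parses (the compiler check in the C setting)."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def train(pairs):
    """Hypothetical stand-in for model training: memorize (src, tgt) pairs."""
    lookup = dict(pairs)
    return lambda x: lookup.get(x, x)

def bifi(fixer, real_bad, good, rounds=2):
    paired = []
    for _ in range(rounds):
        # (1) run the fixer on real bad code; keep critic-approved outputs
        fixed = [(b, fixer(b)) for b in real_bad]
        fixed = [(b, g) for b, g in fixed if critic(g)]
        # (2) train the breaker on the verified pairs, in the good -> bad direction
        breaker = train([(g, b) for b, g in fixed])
        # (3) break good code to obtain more realistic paired data
        broken = [(breaker(g), g) for g in good if not critic(breaker(g))]
        # (4) retrain the fixer on the data generated in (1) and (3)
        paired = fixed + broken
        fixer = train(paired)
    return fixer, paired

f0 = train([("x = (1", "x = (1)")])    # initial fixer f_0 from synthetic pairs
fixer, pairs = bifi(f0, real_bad=["x = (1"], good=["x = (1)"])
print(fixer("x = (1"))                 # "x = (1)"
```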

Q6. How are the experiments designed?

  1. The initial fixer f_0 is a trained DrRepair model, trained with the same settings as the original paper.
  2. Starting from f_0, run 2 rounds of BIFI (K=2).
  3. Testing: for a piece of bad code, apply the trained fixer repeatedly; stop when the error is fixed (the code compiles) or after a maximum of 5 iterations (same setting as DrRepair).
  4. Ablation (GitHub-Python & DeepFix datasets): a. Initial fixer: results of f_0; b. FixerOnly: 2 rounds of retraining f_0 without breaker-generated bad examples; c. f_0 + 2 rounds of BIFI.
  5. Comparison (DeepFix dataset): on top of 4, compare against the following methods:
    DeepFix, RLAssist, SampleFix, DrRepair (i.e., the initial fixer f_0)
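The test-time procedure in step 3 can be sketched as follows, using parseability as the critic for the Python setting; `toy_fixer` is a made-up stand-in for the trained model:

```python
import ast

def repair(code, fixer, max_iters=5):
    """Apply the fixer repeatedly until the code parses or 5 tries are used
    (the stopping rule described above, following DrRepair)."""
    for _ in range(max_iters):
        try:
            ast.parse(code)
            return code            # critic accepts: error is fixed
        except SyntaxError:
            code = fixer(code)
    return code                    # unfixed after max_iters attempts

toy_fixer = lambda c: c + ")"      # closes one parenthesis per call
print(repair("f((x", toy_fixer))   # "f((x))" after two fixer calls
```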

Q7. What datasets are used for quantitative evaluation? Is the code open source?
DeepFix dataset [open source]
GitHub-Python dataset collected for this paper [https://nlp.stanford.edu/projects/myasu/BIFI/data.zip][https://nlp.stanford.edu/projects/myasu/BIFI/data_minimal.zip]
Code [https://github.com/michiyasunaga/BIFI]

Q8. Do the experiments and results support the hypotheses?
The comparison and ablation experiments show that the proposed breaker and fixer generate code samples closer to real errors and thereby improve the model's performance.
GitHub-Python (Total: +1.9% over FixerOnly; Unbalanced Parentheses: +1.8%; Indentation Error: +2.2%; Invalid Syntax: +1.5%)

DeepFix test set (BIFI: 71.7%, FixerOnly: 70.5%, DrRepair: 66.1%, SampleFix: 45.3%, RLAssist: 26.6%, DeepFix: 33.4%)

Q9. What are the paper's contributions?
On one hand, it proposes a break-and-fix method that makes the generated bad examples closer to real-world erroneous code.
On the other hand, it verifies that training data closer to real error examples yields better model performance.

Q10. What comes next? What work is worth pursuing further?

  1. Does BIFI improve other models? Why is only DrRepair augmented with BIFI's training data (possibly because the data structure used by this framework is similar, or essentially identical, to DrRepair's)? In other words, what happens if another model is chosen as the initial fixer f_0? [A comprehensive empirical study]
  2. Applying the BIFI idea to other problems.
  3. The critic in the paper is compilability. For other problems, i.e., localizing and repairing defects in code that already compiles, is this feasible? Could the method be applied, and if so, how?

P.S. The citation formats of some references are inconsistent (mainly ACM, APA, GB/T 7714, etc.); I simply didn't bother to unify them.