Large Language Models Generate Synthetic Data To Automate Code Review Classification


Automated code review is essential for ensuring software quality, but a lack of labelled data hinders the development of effective systems, particularly for newer programming languages. Yogev Cohen, Dudi Ohayon, Romy Somkin, and Alexander Apartsin, all from Holon Institute of Technology, together with Yehudit Aperstein from the Afeka Academic College of Engineering in Tel Aviv, Israel, tackle this problem by exploring the potential of large language models to create synthetic training data. The team investigates whether these models can translate code changes between languages, effectively generating labelled examples where real data is limited. Their experiments, conducted across numerous GitHub repositories, reveal that models trained on this synthetically generated data perform surprisingly well, significantly improving code review recommendation systems even when labelled data is scarce and offering a scalable solution for rapidly evolving software development landscapes.

The study investigates whether large language models can translate code changes from well-resourced languages into equivalent changes in underrepresented or emerging languages, generating synthetic training data where labelled examples are limited. The research recognizes that large language models have learned the syntax and semantics of new languages from unlabelled code, but have not fully grasped which code changes are significant or warrant review. This limitation hinders their effectiveness in code review for languages with limited training data; the work addresses it by transferring code review knowledge from languages with abundant labelled data to low-resource languages. The central objective is a method for automatically generating synthetic, labelled code review data, thereby improving the performance of code review tools and assisting developers working with these languages.

Synthetic Data Generation via Language Translation

The team engineered a novel approach to address the scarcity of labelled data for automated code review, particularly in emerging programming languages. This method involves translating code changes from a well-resourced language, Java, into equivalent changes in a target language, C++, using the LLM GPT-4. The core idea is that while LLMs understand the syntax and semantics of various languages, they lack specific knowledge about which code modifications require manual review.

To overcome this limitation, the study built a transfer learning pipeline. The researchers used GPT-4 to create synthetic examples of code changes in C++, carrying over the review labels from the original Java changes. These synthetic examples then serve as training data for a review classification model based on CodeBERT, a transformer model pre-trained on source code. Importantly, the classifier was fine-tuned without any real labelled C++ data, so that the synthetic data alone drove the learning process. The resulting classifier predicts whether a given code change warrants manual review.
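The label-preserving translation step described above can be sketched in a few lines. The prompt wording, helper names, and record format below are illustrative assumptions, not the authors' actual pipeline; in the real system the prompt would be sent to an LLM via an API client and the response would supply the C++ diff.

```python
# Sketch of the Java -> C++ synthetic-data step described above.
# Prompt wording, helper names, and record format are assumptions
# for illustration; they are not taken from the paper.

def build_translation_prompt(java_diff: str, label: int) -> str:
    """Ask an LLM to translate a labelled Java change into C++,
    preserving both the intent of the change and its review label."""
    verdict = "requires manual review" if label == 1 else "does not require manual review"
    return (
        "Translate the following Java code change into an equivalent C++ "
        f"code change. The original change {verdict}; the translation must "
        "preserve that property.\n\n"
        f"Java diff:\n{java_diff}\n\nC++ diff:"
    )

def make_synthetic_example(cpp_diff: str, label: int) -> dict:
    """Package the translated change with the carried-over label,
    ready to serve as one training example for the classifier."""
    return {"code_change": cpp_diff, "needs_review": label}

# Example with a hypothetical one-line diff. In the real pipeline the
# prompt goes to the LLM and the response text becomes the C++ diff.
prompt = build_translation_prompt("- int x = foo();\n+ int x = bar();", 1)
example = make_synthetic_example("- int x = foo();\n+ int x = bar();", 1)
```

The key design point is that the label travels with the translated change, so no human ever annotates the C++ side.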

Experiments employed multiple GitHub repositories to evaluate the approach, systematically comparing the classifier trained on LLM-generated synthetic C++ data against a baseline model trained on real, labelled C++ data. This comparison assesses how well the synthetic data approximates real-world review patterns. The results demonstrate that LLM-generated synthetic data can effectively bootstrap review recommendation systems when annotated data is unavailable, potentially eliminating costly manual labelling and narrowing the performance gap between synthetic and real data even in low-resource settings.
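The comparison above boils down to scoring both classifiers on the same held-out set of real C++ changes. A minimal sketch in plain Python, where every label and prediction is an invented placeholder rather than a result from the paper:

```python
# Sketch of the evaluation described above: score a classifier trained on
# real labelled C++ data and one trained only on synthetic data against
# the same held-out set. All values are invented placeholders.

def accuracy(y_true, y_pred):
    """Fraction of code changes whose review label was predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Ground-truth labels for held-out real C++ changes (1 = needs review).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical predictions from the real-data baseline...
y_pred_real = [1, 0, 1, 1, 0, 1, 1, 0]
# ...and from the classifier trained only on LLM-translated synthetic data.
y_pred_synthetic = [1, 0, 1, 0, 0, 1, 1, 0]

gap = accuracy(y_true, y_pred_real) - accuracy(y_true, y_pred_synthetic)
print(f"real: {accuracy(y_true, y_pred_real):.3f}, "
      f"synthetic: {accuracy(y_true, y_pred_synthetic):.3f}, gap: {gap:.3f}")
```

The quantity of interest is the gap: the closer it is to zero, the better the synthetic data approximates real review patterns.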

LLMs Automate Cross-Language Code Review

Scientists have developed a groundbreaking approach to automate code review, addressing a critical bottleneck in modern software development where new programming languages and frameworks rapidly emerge. The team discovered that large language models (LLMs) can effectively translate code changes between languages, generating synthetic training data for scenarios where labelled examples are scarce. This innovative method overcomes the limitations of relying solely on real labelled data, which is often insufficient for emerging technologies. Experiments reveal that LLMs, after learning the syntax and semantics of new languages, can be leveraged to create artificial code examples and annotations.

Researchers demonstrated this by translating labelled Java code changes into equivalent C++ code using GPT-4o, preserving both the original intent and the review label. This process effectively bootstraps review recommendation systems, narrowing the performance gap even in low-resource settings. The results show that this synthetic data generation significantly enhances the ability to classify code changes requiring manual review, even in novel programming languages. The approach also builds on existing transfer learning techniques and cross-lingual LLMs: previous work demonstrated that models trained on one language, such as Python, can yield an absolute accuracy gain of over 5% when transferred to another, such as JavaScript.

Building on this, researchers achieved up to a 43% improvement in C code translation using synthetic data, despite having fewer than 150 real examples. This illustrates the power of LLMs to learn shared representations across languages and generate high-quality training data where it is lacking. The breakthrough delivers a scalable pathway to extend automated code review capabilities to rapidly evolving stacks, even without extensive manually labelled data. By combining cross-lingual LLMs with synthetic data generation, scientists have created a system that can effectively determine whether a code change requires human review, regardless of the programming language or ecosystem. This advancement promises to significantly improve software quality and accelerate the development process.

Synthetic Data Bridges C++ Code Review Gap

This research demonstrates the feasibility of using large language models to generate synthetic datasets for training code review systems, particularly for programming languages where labelled data is scarce. The team successfully translated labelled code changes from Java into C++, enabling the training of a classifier for C++ review recommendations without requiring extensive manually labelled C++ data. Results indicate that models trained on this synthetic data achieve performance comparable to those trained on real data, narrowing the gap in accuracy and suggesting a practical solution for low-resource languages. While models trained on real data still outperform those using synthetic data, the narrow performance difference highlights the potential of this approach for extending automated code review capabilities to rapidly evolving programming languages and frameworks. The authors acknowledge this limitation and plan future work to improve the quality of the synthetic data through advanced prompting techniques and reinforcement learning, to apply the methodology to additional languages, including those used in mobile development and data science, and to investigate multilingual code models as a way to further enhance transferability and accuracy.

