
Commit

Added SANER'25
areyde committed Dec 20, 2024
1 parent bc626a7 commit 495c32f
Showing 2 changed files with 2 additions and 2 deletions.
2 changes: 1 addition & 1 deletion _publications/2025-05-02-cmg-evaluation.md
@@ -10,6 +10,6 @@ level: "A*"
pdf: 'https://arxiv.org/abs/2410.12046'
data: 'https://huggingface.co/collections/JetBrains-Research/commit-message-generation-evaluation-664a96940e5395fb52c202c5'
tool: 'https://huggingface.co/spaces/JetBrains-Research/commit-message-editing'
-counter_id: 'C30'
+counter_id: 'C31'
abstract: "<p><b>Abstract</b>. Commit message generation (CMG) is a crucial task in software engineering that is challenging to evaluate correctly. When a CMG system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.</p><p>To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience.</p><p>Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers operating within controlled research environments. We release all the code and the dataset for researchers: https://jb.gg/cmg-evaluation.</p>"
---
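The counter bumped above belongs to the CMG evaluation paper, whose abstract reports that edit distance correlates best with the online metric (how much users edit a generated message before committing it). As a rough illustration only, and not the paper's evaluation code, the sketch below computes a normalized character-level Levenshtein distance between a generated commit message and its edited counterpart; the function names and the normalization by the longer string's length are assumptions.

```python
# Illustrative sketch only: normalized character-level edit distance between a
# generated commit message and its human-edited version. This is NOT the
# paper's evaluation pipeline; the normalization choice is an assumption.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

def normalized_edit_distance(generated: str, edited: str) -> float:
    """0.0 means the user kept the message as-is; 1.0 means a full rewrite."""
    if not generated and not edited:
        return 0.0
    return levenshtein(generated, edited) / max(len(generated), len(edited))

if __name__ == "__main__":
    gen = "Fix NPE in message generation"
    edited = "Fix NullPointerException in commit message generation"
    print(f"normalized edit distance: {normalized_edit_distance(gen, edited):.2f}")
```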
2 changes: 1 addition & 1 deletion _publications/2025-05-02-full-line-code-completion.md
@@ -9,6 +9,6 @@ venue: "<b>ICSE'25</b>"
level: "A*"
pdf: 'https://arxiv.org/abs/2405.08704'
tool: 'https://plugins.jetbrains.com/plugin/14823-full-line-code-completion'
-counter_id: 'C31'
+counter_id: 'C32'
abstract: "<p><b>Abstract</b>. In recent years, several industrial solutions for the problem of multi-token code completion have appeared, each making a great advance in the area but mostly focusing on cloud-based runtime and avoiding working on the end user's device.</p><p>In this work, we describe our approach for building a multi-token code completion feature for the JetBrains' IntelliJ Platform, which we call Full Line Code Completion. The feature suggests only syntactically correct code and works fully locally, i.e., data querying and the generation of suggestions happens on the end user's machine. We share important time and memory-consumption restrictions, as well as design principles that a code completion engine should satisfy. Working entirely on the end user's device, our code completion engine enriches user experience while being not only fast and compact but also secure. We share a number of useful techniques to meet the stated development constraints and also describe offline and online evaluation pipelines that allowed us to make better decisions.</p><p>Our online evaluation shows that the usage of the tool leads to 1.5 times more code in the IDE being produced by code completion. The described solution was initially started with the help of researchers and was bundled into two JetBrains' IDEs - PyCharm Pro and DataSpell - at the end of 2023, so we believe that this work is useful for bridging academia and industry, providing researchers with the knowledge of what happens when complex research-based solutions are integrated into real products.</p>"
---
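The second file tracks the Full Line Code Completion paper, whose abstract cites an online result of 1.5 times more code in the IDE being produced by code completion. The sketch below shows one hypothetical way such a "share of code produced by completion" metric could be computed from editing sessions; the EditingSession schema and its field names are invented for illustration and are not JetBrains' telemetry format.

```python
# Hypothetical sketch of the kind of online metric mentioned in the abstract:
# the share of code characters that came from accepted completions. The event
# schema (typed_chars / completed_chars) is made up for this example.

from dataclasses import dataclass

@dataclass
class EditingSession:
    typed_chars: int       # characters the user typed manually
    completed_chars: int   # characters inserted by accepted completions

def completion_ratio(sessions: list[EditingSession]) -> float:
    """Fraction of code produced via code completion across sessions."""
    typed = sum(s.typed_chars for s in sessions)
    completed = sum(s.completed_chars for s in sessions)
    total = typed + completed
    return completed / total if total else 0.0

if __name__ == "__main__":
    baseline = [EditingSession(typed_chars=900, completed_chars=100)]
    with_flcc = [EditingSession(typed_chars=850, completed_chars=150)]
    print(f"baseline:  {completion_ratio(baseline):.2f}")   # 0.10
    print(f"with FLCC: {completion_ratio(with_flcc):.2f}")  # 0.15, i.e. 1.5x the baseline share
```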
