CodeLLM Paper

This repository provides a curated list of research papers focused on Large Language Models (LLMs) for code. It aims to facilitate researchers and practitioners in exploring the rapidly growing body of literature on this topic. The papers are systematically collected from various top-tier venues, categorized, and labeled for easier navigation.

A. Venues

We systematically select papers from the following venues, which are top-tier conferences and journals in the SE, PL, security, and NLP communities.

The papers accepted by USENIXSec2024 and CCS2024 have not yet been published in the proceedings. Because of the large volume, we do not systematically collect papers published in top-tier ML conferences (ICML, NeurIPS, and ICLR) or on arXiv. However, we continue to manually add important works from these venues. We plan to expand the collection over time, and contributions are welcome. For details, see the section How to Contribute.

B. Selection Strategy

  1. Abstract Extraction: Extract the abstracts from bib files or HTML files. The bib and HTML files for the venues listed above are stored in the directory data/rawdata.

  2. Keyword Matching: Keep only the abstracts that meet both of the following conditions:

    • Contains at least one keyword from: {"pretrain", "LLM", "large language model", "transformer", "code model"}.

    • Contains the keyword "code" or "program".

  3. Relevance Check Using LLMs: Use LLMs to verify if the papers obtained in Step 2 are related to LLMs for code.

  4. Manual Labeling: Manually assign labels to the papers based on domain knowledge.

All selected papers, along with their labels, are maintained in the JSON file data/labeldata/labeldata.json. src/process.py is the Python script used for selecting and labeling papers.
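The keyword filter in Step 2 can be sketched as follows. This is a minimal illustration, not the code in src/process.py: the function name `matches_keywords` is hypothetical, and the case-insensitive substring test is an assumption, since the matching rule is not specified above.

```python
# Keyword sets copied from Step 2 of the selection strategy.
MODEL_KEYWORDS = ("pretrain", "LLM", "large language model", "transformer", "code model")
CODE_KEYWORDS = ("code", "program")


def matches_keywords(abstract: str) -> bool:
    """Return True if the abstract satisfies both Step 2 conditions.

    Assumption: matching is a case-insensitive substring test, so
    "pretrain" also matches "pretrained" and "program" matches "programs".
    """
    text = abstract.lower()
    # Condition 1: at least one model-related keyword appears.
    has_model = any(keyword.lower() in text for keyword in MODEL_KEYWORDS)
    # Condition 2: "code" or "program" appears.
    has_code = any(keyword in text for keyword in CODE_KEYWORDS)
    return has_model and has_code
```

Substring matching over-selects on purpose in this sketch (for example, any abstract containing "code model" satisfies both conditions at once); the LLM relevance check in Step 3 is what removes the false positives.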

C. Taxonomy

The papers in this repository are categorized along three dimensions: Application, Principle, and Research Paradigm. Each paper is assigned multiple labels based on these categories. Note that categories are not necessarily disjoint.

C.1. Application

This category focuses on typical tasks in Software Engineering (SE) and Programming Languages (PL).

C.2. Principle

This category concentrates on LLMs' ability to understand different forms of code and on the non-functional properties of LLMs (e.g., security and robustness). We also consider how to use LLMs for general reasoning problems, such as typical agent-centric designs and specific PL designs for LLMs.

C.3. Research Paradigm

This category includes studies on benchmarks, empirical evaluations, and surveys. Papers that do not fall into any of these three categories are purely technical papers.

D. How to Contribute

D.1. PR Submission

We welcome contributions to expand this repository. If you want to add new papers to the list, please follow these steps:

  1. Prepare a JSON File: Format the file like data/labeldata/patch/example.json. Each paper should include:

    • title, authors, abstract, url, venue, and labels (aligned with the taxonomy in data/labeldata/patch).
  2. Upload the File: Place the JSON file in the data/labeldata/patch directory.

  3. Update Markdown Files: Run the following command to update the repository:

    cd src && python patch.py
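Before running the command, it can help to check that every entry carries the fields listed in step 1. The sketch below is not part of the repository: `REQUIRED_FIELDS`, `missing_fields`, and `check_patch` are hypothetical names, and it assumes the patch file is a JSON array of paper objects. Consult data/labeldata/patch/example.json for the actual layout.

```python
import json

# Field names taken from step 1 above.
REQUIRED_FIELDS = ("title", "authors", "abstract", "url", "venue", "labels")


def missing_fields(paper: dict) -> list:
    """Return the required fields that are absent or empty in one entry."""
    return [field for field in REQUIRED_FIELDS if not paper.get(field)]


def check_patch(text: str) -> dict:
    """Map each incomplete entry's title (or index) to its missing fields.

    Assumption: the patch file content is a JSON array of paper objects.
    """
    papers = json.loads(text)
    problems = {}
    for index, paper in enumerate(papers):
        missing = missing_fields(paper)
        if missing:
            problems[paper.get("title") or f"entry {index}"] = missing
    return problems
```

An empty result from `check_patch` means every entry has all six fields; it does not check that the labels exist in the taxonomy.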

If you want to add new labels or change the current taxonomy, please open an issue first and propose your taxonomy (see D.2. Issue Submission).

D.2. Issue Submission

Another option is to post the papers you wish to add in an issue. Please include a permanent link to the paper and specify the venue. If you like, you can also categorize the paper by attaching appropriate labels from the existing options in data/category.json or by proposing new ones. We will add the paper to the repository as soon as we can.

D.3. Request for Batch Updates

To keep batch updates timely, we prefer to work from the complete proceedings of conferences and journals. Examples include ASE2024, OOPSLA2023, S&P2023, and ACL2024. By parsing the bib and HTML files (see data/rawdata) and extracting information such as abstracts, we can semi-automatically classify papers using the selection strategy described in Section B. If a conference or journal you follow has recently released its complete proceedings, please notify us by submitting an issue. We will prioritize the batch update and add the corresponding conference or journal name to the venue list.

E. Disclaimer and Contact

This paper repository is intended solely for research purposes. All raw data is sourced from publicly available information on ACM, IEEE, and corresponding conference websites. Any content involving additional copyright information, including full PDF versions of the papers, is not disclosed in this repository.

For any questions or suggestions, please contact [email protected] or [email protected]