Update README.md
tiginamaria authored Jun 5, 2024
1 parent 69d0708 commit 891e950
Showing 1 changed file with 7 additions and 14 deletions.
21 changes: 7 additions & 14 deletions bug_localization/README.md
@@ -1,18 +1,17 @@
# Bug Localization

This folder contains code for **Bug Localization** benchmark. Challenge:
given an issue with bug description, identify the files within the project that need to be modified
to address the reported bug.
given an issue with bug description and the repository code in the state where issue is reproducible, identify the files within the project that need to be modified to address the reported bug.

We provide scripts for [data collection and processing](./src/data), [data exploratory analysis](./src/notebooks) as well as several [baselines implementations](./src/baselines) for the task solution.
We provide scripts for [data collection and processing](./src/data), [data exploratory analysis](./src/notebooks) as well as several [baselines implementations](./src/baselines) for the task solution with [evaluation metrics calculation](./src/notebooks).
## 💾 Install dependencies
We provide dependencies for pip dependency manager, so please run the following command to install all required packages:
```shell
pip install -r requirements.txt
```
Bug Localization task: given an issue with bug description, identify the files within the project that need to be modified to address the reported bug

## 🤗 Load data
## 🤗 Dataset

All data is stored in [HuggingFace 🤗](JetBrains-Research/lca-bug-localization). It contains:

* Dataset with bug localization data (with issue description, sha of repo with initial state and to the state after issue fixation).
@@ -39,21 +38,15 @@ You can access data using [datasets](https://huggingface.co/docs/datasets/en/ind


* Archived repos (from which we can extract repo content on different stages and get diffs which contains bugs fixations).\
They are stored in `.tar.gz` so you need to run script to load them and unzip:
1. Set `repos_path` in [config](configs/data/hf_data.yaml) to directory where you want to store repos
2. Run [load_data_from_hf.py](./src/load_data_from_hf.py) which will load all repos from HF and unzip them

## ⚙️ Run Baseline
## ⚙️ Baselines

* Embedding-based
* [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
* [GTE](https://huggingface.co/thenlper/gte-large)
* [CodeT5](https://huggingface.co/Salesforce/codet5p-110m-embedding)
* [BM25](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [BM25]()

* Name-based
* [GPT3.5](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [GPT4](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [Cloud 2](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [CodeLLama](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [Mistral](https://platform.openai.com/docs/models/gpt-3-5-turbo)
* [GPT4](https://platform.openai.com/docs/models/gpt-4)
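
To illustrate what a retrieval-style baseline for this task does, here is a minimal BM25 sketch in plain Python. It is not the repository's implementation: the function names `tokenize` and `bm25_rank`, the parameters `k1` and `b`, the simple regex tokenizer, and the toy file contents are all assumptions made for illustration. The idea is to score every file in the repository against the issue text and return the file paths sorted from most to least relevant.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (a deliberately simple tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(issue_text, files, k1=1.5, b=0.75):
    """Rank file paths by BM25 score against the issue text, highest first.

    `files` maps a file path to its text content (hypothetical input format).
    """
    docs = {path: tokenize(content) for path, content in files.items()}
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / max(n_docs, 1)

    # Document frequency: number of files that contain each term.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(issue_text))
    scores = {}
    for path, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF that stays non-negative for very common terms.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores[path] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy repository snapshot and issue text, invented for this example.
files = {
    "src/parser.py": "def parse_config(path): raise ValueError('invalid config')",
    "src/utils.py": "def slugify(text): return text.lower()",
    "README.md": "Project documentation and usage notes",
}
ranking = bm25_rank("ValueError when parsing invalid config file", files)
```

The ranked list can then be compared with the files actually changed in the fixing commit. Embedding-based baselines follow the same pattern, replacing the BM25 score with a similarity between the issue embedding and each file embedding.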
