From 891e9504bf3cd48b3c3b85d0f4fdcf55d4de92d0 Mon Sep 17 00:00:00 2001
From: Maria Tigina <31625351+tiginamaria@users.noreply.github.com>
Date: Wed, 5 Jun 2024 18:14:36 +0200
Subject: [PATCH] Update README.md

---
 bug_localization/README.md | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/bug_localization/README.md b/bug_localization/README.md
index 5beefac..614ec39 100644
--- a/bug_localization/README.md
+++ b/bug_localization/README.md
@@ -1,18 +1,17 @@
 # Bug Localization
 
 This folder contains code for **Bug Localization** benchmark. Challenge:
-given an issue with bug description, identify the files within the project that need to be modified
-to address the reported bug.
+given an issue with a bug description and the repository code in the state where the issue is reproducible, identify the files within the project that need to be modified to address the reported bug.
 
-We provide scripts for [data collection and processing](./src/data), [data exploratory analysis](./src/notebooks) as well as several [baselines implementations](./src/baselines) for the task solution.
+We provide scripts for [data collection and processing](./src/data) and [exploratory data analysis](./src/notebooks), as well as several [baseline implementations](./src/baselines) with [evaluation metrics calculation](./src/notebooks).
 
 ## 💾 Install dependencies
 
 We provide dependencies for pip dependency manager, so please run the following command to install all required packages:
 ```shell
 pip install -r requirements.txt
 ```
-Bug Localization task: given an issue with bug description, identify the files within the project that need to be modified to address the reported bug
-## 🤗 Load data
+## 🤗 Dataset
+
 All data is stored in [HuggingFace 🤗](JetBrains-Research/lca-bug-localization). It contains:
 * Dataset with bug localization data (with issue description, sha of repo with initial state and to the state after issue fixation).
@@ -39,21 +38,15 @@ You can access data using [datasets](https://huggingface.co/docs/datasets/en/ind
 
 * Archived repos (from which we can extract repo content on different stages and get diffs which contains bugs fixations).\
-They are stored in `.tar.gz` so you need to run script to load them and unzip:
- 1. Set `repos_path` in [config](configs/data/hf_data.yaml) to directory where you want to store repos
- 2. Run [load_data_from_hf.py](./src/load_data_from_hf.py) which will load all repos from HF and unzip them
 
-## ⚙️ Run Baseline
+## ⚙️ Baselines
 
 * Embedding-based
   * [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
   * [GTE](https://huggingface.co/thenlper/gte-large)
   * [CodeT5](https://huggingface.co/Salesforce/codet5p-110m-embedding)
-  * [BM25](https://platform.openai.com/docs/models/gpt-3-5-turbo)
+  * [BM25]()
 * Name-based
   * [GPT3.5](https://platform.openai.com/docs/models/gpt-3-5-turbo)
-  * [GPT4](https://platform.openai.com/docs/models/gpt-3-5-turbo)
-  * [Cloud 2](https://platform.openai.com/docs/models/gpt-3-5-turbo)
-  * [CodeLLama](https://platform.openai.com/docs/models/gpt-3-5-turbo)
-  * [Mistral](https://platform.openai.com/docs/models/gpt-3-5-turbo)
+  * [GPT4](https://platform.openai.com/docs/models/gpt-4)
 
 
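For context on the embedding-based baselines this patch touches: they score each repository file by the similarity between the issue text and the file contents, then rank files by that score. The following is a rough, self-contained sketch of the TF-IDF variant in pure Python — a hypothetical illustration, not the benchmark's actual implementation in `./src/baselines` (function and file names here are made up):

```python
import math
from collections import Counter


def tf_idf_rank(issue_text: str, files: dict[str, str]) -> list[tuple[str, float]]:
    """Rank repository files by cosine similarity of TF-IDF vectors
    between the issue description and each file's contents.

    Hypothetical sketch: the real baseline uses scikit-learn's
    TfidfVectorizer and a proper tokenizer.
    """
    def tokenize(text: str) -> list[str]:
        # Lowercase and split on any non-alphanumeric character.
        return "".join(c.lower() if c.isalnum() else " " for c in text).split()

    docs = {name: Counter(tokenize(content)) for name, content in files.items()}
    query = Counter(tokenize(issue_text))

    # Document frequency over all files plus the query itself,
    # so every query term has a defined (non-zero) frequency.
    n_docs = len(docs) + 1
    df: Counter = Counter()
    for counts in list(docs.values()) + [query]:
        for term in counts:
            df[term] += 1

    def tf_idf(counts: Counter) -> dict[str, float]:
        return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query_vec = tf_idf(query)
    scores = {name: cosine(query_vec, tf_idf(counts)) for name, counts in docs.items()}
    # Highest similarity first: top-ranked files are the localization candidates.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Files sharing vocabulary with the issue (e.g. "login", "password", "token") rank above unrelated ones; the GTE and CodeT5 baselines follow the same rank-by-similarity scheme but replace the TF-IDF vectors with learned embeddings.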