Merge branch 'main' into code-completion
evgngl authored Jun 5, 2024
2 parents 8ee4eec + 508551c commit 4eb0f89
Showing 1,958 changed files with 130,105 additions and 1,779 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -3,3 +3,5 @@
.idea/
tree-sitter-python
wandb
=======
/venv/
1 change: 1 addition & 0 deletions bug_localization/.gitignore
@@ -0,0 +1 @@
.env
59 changes: 59 additions & 0 deletions bug_localization/README.md
@@ -0,0 +1,59 @@
# Bug Localization

This folder contains the code for the **Bug Localization** benchmark. The challenge:
given an issue with a bug description, identify the files within the project that need to be modified
to address the reported bug.

We provide scripts for [data collection and processing](./src/data) and [exploratory data analysis](./src/notebooks), as well as several [baseline implementations](./src/baselines) for the task.
## 💾 Install dependencies
We provide dependencies for the pip dependency manager, so please run the following command to install all required packages:
```shell
pip install -r requirements.txt
```

## 🤗 Load data
All data is stored in [HuggingFace 🤗](JetBrains-Research/lca-bug-localization). It contains:

* A dataset with bug localization data (issue description, plus the SHAs of the repository state before and after the fix).
You can access the data using the [datasets](https://huggingface.co/docs/datasets/en/index) library:
```python3
from datasets import load_dataset

# Select a configuration from ["py", "java", "kt", "mixed"]
configuration = "py"
# Select a split from ["dev", "train", "test"]
split = "dev"
# Load data
dataset = load_dataset("JetBrains-Research/lca-bug-localization", configuration, split=split)
```
where the splits are:\
`dev` - all collected data\
`test` - manually selected data ([labeling artifacts](https://docs.google.com/spreadsheets/d/1cEyFHjse-iUYQlUO7GO5KpqkvJ3wu6vheou4W61TMOg/edit?usp=sharing))\
`train` - all collected data that is not in `test`\
and the configurations are:\
`py` -- only `.py` files in the diff\
`java` -- only `.java` files in the diff\
`kt` -- only `.kt` files in the diff\
`mixed` -- at least one `.py`, `.java`, or `.kt` file, possibly together with files with other extensions, in the diff


* Archived repos (from which we can extract the repo content at different stages and get the diffs that contain the bug fixes).\
They are stored as `.tar.gz` archives, so you need to run a script to download and unpack them:
1. Set `repos_path` in [config](configs/data/hf_data.yaml) to the directory where you want to store the repos
2. Run [load_data_from_hf.py](./src/load_data_from_hf.py), which will download all repos from HF and unpack them
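
The unpacking part of step 2 can be sketched as follows. This is only an illustration that assumes the `.tar.gz` archives have already been downloaded to a local directory; the helper name `extract_repo_archives` and its arguments are hypothetical, and this is not the code of `load_data_from_hf.py`.

```python
import tarfile
from pathlib import Path


def extract_repo_archives(archives_dir: str, repos_path: str) -> list[str]:
    """Unpack every *.tar.gz found in archives_dir into repos_path (hypothetical helper)."""
    target = Path(repos_path)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archives_dir).glob("*.tar.gz")):
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(archive.name)
    return extracted
```

Only unpack archives from a source you trust, since `extractall` writes whatever paths the archive contains.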

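The configuration rules from the dataset description above (`py`, `java`, `kt`, `mixed`) can be read as a small classification over the extensions of the files changed by a fix. The sketch below is one interpretation of those rules, not the benchmark's own code; the function name `diff_configuration` is hypothetical, and it assumes that any diff mixing a supported extension with anything else counts as `mixed`.

```python
from pathlib import PurePosixPath
from typing import List, Optional

# Supported extensions and the configuration each maps to.
CODE_EXTENSIONS = {".py": "py", ".java": "java", ".kt": "kt"}


def diff_configuration(changed_files: List[str]) -> Optional[str]:
    """Map the files touched by a fix to "py", "java", "kt", "mixed", or None."""
    extensions = {PurePosixPath(f).suffix for f in changed_files}
    known = extensions & set(CODE_EXTENSIONS)
    if not known:
        return None  # no .py/.java/.kt file in the diff
    if len(extensions) == 1:
        # Every changed file shares one supported extension.
        return CODE_EXTENSIONS[next(iter(known))]
    return "mixed"
```

For example, `diff_configuration(["a.kt", "README.md"])` yields `"mixed"`, while a diff touching only `.py` files yields `"py"`.
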
## ⚙️ Run Baseline

* Embedding-based
  * [TF-IDF](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer)
  * [GTE](https://huggingface.co/thenlper/gte-large)
  * [CodeT5](https://huggingface.co/Salesforce/codet5p-110m-embedding)
  * [BM25](https://platform.openai.com/docs/models/gpt-3-5-turbo)

* Name-based
  * [GPT3.5](https://platform.openai.com/docs/models/gpt-3-5-turbo)
  * [GPT4](https://platform.openai.com/docs/models/gpt-3-5-turbo)
  * [Claude 2](https://platform.openai.com/docs/models/gpt-3-5-turbo)
  * [CodeLLama](https://platform.openai.com/docs/models/gpt-3-5-turbo)
  * [Mistral](https://platform.openai.com/docs/models/gpt-3-5-turbo)
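
To illustrate the idea behind the embedding-based baselines listed above, here is a minimal sketch that ranks repository files by cosine similarity between a TF-IDF vector of the issue text and a TF-IDF vector of each file. It is a hand-rolled, standard-library stand-in, not scikit-learn's `TfidfVectorizer` and not the repository's backbone code; the names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_files` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    # Lowercase and split into alphanumeric/underscore runs.
    return re.findall(r"[a-z0-9_]+", text.lower())


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_files(issue_text: str, repo_content: Dict[str, str]) -> List[str]:
    """Return file paths sorted from most to least similar to the issue text."""
    paths = list(repo_content)
    docs = [tokenize(issue_text)] + [tokenize(repo_content[p]) for p in paths]
    issue_vec, *file_vecs = tfidf_vectors(docs)
    scores = {p: cosine(issue_vec, v) for p, v in zip(paths, file_vecs)}
    return sorted(paths, key=lambda p: scores[p], reverse=True)


repo = {
    "auth/login.py": "def login(user, password): check password hash",
    "ui/button.py": "class Button: def render(self): draw pixels",
}
print(rank_files("Login fails when password contains unicode", repo))
```

The file sharing the most distinctive terms with the issue is ranked first; a learned embedding model replaces the TF-IDF vectors with dense ones but keeps the same ranking step.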
26 changes: 26 additions & 0 deletions bug_localization/configs/baselines/codet5.yaml
@@ -0,0 +1,26 @@
hydra:
  job:
    name: ${backbone.name}_emb
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.emb.hf_emb_backbone.HfEmbBackbone
  name: codet5
  pretrained_path: null
  parameters:
    model_name: Salesforce/codet5p-110m-embedding
  ranker:
    _target_: src.baselines.backbones.emb.rankers.cosine_distance_ranker.CosineDistanceRanker
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
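
The `_target_` keys in these configs name the class to construct; in Hydra this is resolved by `hydra.utils.instantiate`. The sketch below is a simplified standard-library stand-in for that mechanism, meant only to show the idea: it ignores interpolation such as `${backbone.name}` and every other Hydra feature, and the example config uses standard-library targets rather than the repository's classes.

```python
import importlib
from typing import Any


def instantiate(cfg: Any) -> Any:
    """Recursively build objects from dicts carrying a dotted `_target_` path."""
    if isinstance(cfg, list):
        return [instantiate(item) for item in cfg]
    if not isinstance(cfg, dict):
        return cfg  # plain value: leave as-is
    # Build nested objects first, so they can be passed as keyword arguments.
    kwargs = {k: instantiate(v) for k, v in cfg.items() if k != "_target_"}
    if "_target_" not in cfg:
        return kwargs
    module_name, _, attr = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr)
    return target(**kwargs)


# Illustrative config with standard-library targets (not the project's classes).
cfg = {
    "_target_": "types.SimpleNamespace",
    "name": "codet5",
    "ranker": {"_target_": "datetime.timedelta", "seconds": 5},
}
backbone = instantiate(cfg)
print(backbone.name, backbone.ranker)
```

Nesting a `_target_` block inside another, as with `ranker` here, yields an already-constructed object passed to the outer constructor.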
26 changes: 26 additions & 0 deletions bug_localization/configs/baselines/gte.yaml
@@ -0,0 +1,26 @@
hydra:
  job:
    name: ${backbone.name}_emb
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.emb.hf_emb_backbone.HfEmbBackbone
  name: gte
  pretrained_path: null
  parameters:
    model_name: thenlper/gte-large
  ranker:
    _target_: src.baselines.backbones.emb.rankers.cosine_distance_ranker.CosineDistanceRanker
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
26 changes: 26 additions & 0 deletions bug_localization/configs/baselines/mistral.yaml
@@ -0,0 +1,26 @@
hydra:
  job:
    name: ${backbone.name}_emb
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.emb.hf_emb_backbone.HfEmbBackbone
  name: mistral
  pretrained_path: null
  parameters:
    model_name: Salesforce/SFR-Embedding-Mistral
  ranker:
    _target_: src.baselines.backbones.emb.rankers.cosine_distance_ranker.CosineDistanceRanker
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
25 changes: 25 additions & 0 deletions bug_localization/configs/baselines/openai_agent.yaml
@@ -0,0 +1,25 @@
hydra:
  job:
    name: ${backbone.name}_${backbone.model_name}
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.agent.openai_agent_backbone.OpenAIAgentBackbone
  name: openai_agent
  model_name: gpt-4-1106-preview
  api_key: null
  prompt:
    _target_: src.baselines.backbones.agent.prompts.agent_simple_prompt.AgentSimplePrompt
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
29 changes: 29 additions & 0 deletions bug_localization/configs/baselines/openai_chat.yaml
@@ -0,0 +1,29 @@
hydra:
  job:
    name: ${backbone.name}_${backbone.model_name}
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [ console, file ]
backbone:
  _target_: src.baselines.backbones.chat.openai_chat_backbone.OpenAIChatBackbone
  name: openai_chat
  model_name: gpt-3.5-turbo-1106
  api_key: null
  parameters:
    seed: 76097149
    temperature: 0
  prompt:
    _target_: src.baselines.backbones.chat.prompts.chat_file_list_prompt.ChatFileListPrompt
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test

29 changes: 29 additions & 0 deletions bug_localization/configs/baselines/tfidf-bpe.yaml
@@ -0,0 +1,29 @@
hydra:
  job:
    name: ${backbone.name}_bpe_emb
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.emb.tfidf_emb_backbone.TfIdfEmbBackbone
  name: tfidf
  pretrained_path: null
  tokenizer:
    _target_: src.baselines.backbones.emb.tokenizers.bpe_tokenizer.BPETokenizer
    vocab_size: 10000
    min_frequency: 2
    pretrained_path: null
  ranker:
    _target_: src.baselines.backbones.emb.rankers.cosine_distance_ranker.CosineDistanceRanker
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  cache_dir: null
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
26 changes: 26 additions & 0 deletions bug_localization/configs/baselines/tfidf-nltk.yaml
@@ -0,0 +1,26 @@
hydra:
  job:
    name: ${backbone.name}_nltk_emb
  run:
    dir: /home/tigina/lca-baselines/bug_localization/output/${hydra:job.name}
  job_logging:
    root:
      handlers: [console, file]
backbone:
  _target_: src.baselines.backbones.emb.tfidf_emb_backbone.TfIdfEmbBackbone
  name: tfidf
  pretrained_path: null
  tokenizer:
    _target_: src.baselines.backbones.emb.tokenizers.nltk_tokenizer.NltkTokenizer
  ranker:
    _target_: src.baselines.backbones.emb.rankers.cosine_distance_ranker.CosineDistanceRanker
data_source:
  _target_: src.baselines.data_sources.hf_data_source.HFDataSource
  cache_dir: null
  repos_dir: /mnt/data/shared-data/lca/repos_updated
  hub_name: tiginamaria/bug-localization
  configs:
    - py
    - java
    - kt
  split: test
2 changes: 2 additions & 0 deletions bug_localization/configs/data/local.yaml
@@ -0,0 +1,2 @@
data_path: /Users/Maria.Tigina/PycharmProjects/lca-baselines/data/lca-bug-localization
repos_path: /Users/Maria.Tigina/PycharmProjects/lca-baselines/data/lca-bug-localization/repos
21 changes: 21 additions & 0 deletions bug_localization/configs/data/server.yaml
@@ -0,0 +1,21 @@
issues_path: /mnt/data/shared-data/lca/issues_prs_updated_dedup
pulls_path: /mnt/data/shared-data/lca/pulls_updated_dedup
repos_path: /mnt/data/shared-data/lca/repos_updated
repos_list_path: /mnt/data/shared-data/lca/updated_repos_list.json
repos_archive_path: /mnt/data/shared-data/lca/repos_archive_updated
issues_links_path: /mnt/data/shared-data/lca/issues_links_updated
issues_links_filtered_path: /mnt/data/shared-data/lca/issues_links_filtered_updated
issues_prs_path: /mnt/data/shared-data/lca/issues_prs_updated_dedup
issues_comments_path: /mnt/data/shared-data/lca/comments_updated_dedup
pull_requests_comments_path: /mnt/data/shared-data/lca/pulls_comments_updated
bug_localization_data_path: /mnt/data/shared-data/lca/bug_localization_data
test_data_ids: [
# py
"thealgorithms/python/295/289","keras-team/keras/15943/15942","serverless/serverless/5500/5499","textualize/rich/2212/2197","mitmproxy/mitmproxy/4298/3994","hpcaitech/colossalai/3621/3620","docker/compose/8158/8136","ccxt/ccxt/18301/18192","microsoft/deepspeed/4084/4083","streamlit/streamlit/6459/6440","huggingface/pytorch-image-models/1351/1348","dokku/dokku/2452/2445","google/jax/15796/15782","google/jax/1039/1033","pypa/pipenv/1767/1750","stevenblack/hosts/112/107","openbb-finance/openbbterminal/410/409","textualize/textual/1653/1634","jerryjliu/llama_index/6630/6629","wekan/wekan/3889/3884","sanic-org/sanic/1327/1323","huggingface/datasets/4828/4796","rasahq/rasa/4035/4015","pulumi/pulumi/11541/11542","archivebox/archivebox/822/821","microsoft/recommenders/804/746","joke2k/faker/671/667","manimcommunity/manim/229/228","pydantic/pydantic/6541/6525","mlflow/mlflow/5720/5713","plotly/plotly.py/2208/2167","rapptz/discord.py/1967/1862","netbox-community/netbox/10187/9895","nltk/nltk/3042/3041","spotdl/spotify-downloader/380/370","iterative/dvc/1870/1869","beetbox/beets/3238/2790","mikubill/sd-webui-controlnet/321/307","microsoft/qlib/357/356","edgedb/edgedb/5785/5725","chia-network/chia-blockchain/16110/16039","chia-network/chia-blockchain/11764/11628","pre-commit/pre-commit/1668/1658","aws/chalice/504/501","tweepy/tweepy/1262/1261","openmined/pysyft/1908/1896","pypa/pip/12140/12138","modin-project/modin/1657/1656","dagger/dagger/3814/3813","paddlepaddle/paddlespeech/1432/1426",
# java
"square/okhttp/1254/1158","dbeaver/dbeaver/12766/12765","dbeaver/dbeaver/20881/20529","square/leakcanary/458/449","alibaba/spring-cloud-alibaba/46/42","google/gson/2364/904","libgdx/libgdx/856/798","apache/shardingsphere/2352/2343","apache/shardingsphere/1987/1985","ibotpeaches/apktool/1570/1564","appium/appium/1145/1140","appium/appium/1104/1100","williamfiset/algorithms/98/59","prestodb/presto/6427/6379","prestodb/presto/6208/6196","deeplearning4j/deeplearning4j/4664/4635","pinpoint-apm/pinpoint/2078/2077","quarkusio/quarkus/33586/33305","codecentric/spring-boot-admin/1589/1586","zaproxy/zaproxy/7494/7484","apache/dolphinscheduler/3957/3956","grpc/grpc-java/667/583","tootallnate/java-websocket/570/564","tootallnate/java-websocket/329/259","trinodb/trino/17804/17803","thundernest/k-9/3219/632","gocd/gocd/10676/10036","axonframework/axonframework/2756/2751","webbukkit/dynmap/3990/3982","jdbi/jdbi/1339/1338","spring-projects/spring-session/450/445","synthetichealth/synthea/410/395","naver/ngrinder/189/184","gchq/gaffer/2884/2881","embulk/embulk/1054/1031","jooby-project/jooby/2220/2210","j-easy/easy-random/240/237","apache/eventmesh/3432/3431","clickhouse/clickhouse-java/540/462","google/conscrypt/785/781","appium/java-client/568/567","vert-x3/vertx-web/1779/1778","googlecloudplatform/dataflowtemplates/469/464","spring-cloud/spring-cloud-stream/402/403","zalando/nakadi/677/391","apicurio/apicurio-studio/1604/1533","objectionary/eo/1263/1256","spring-cloud/spring-cloud-consul/230/229","opengamma/strata/1313/1312","citrusframework/citrus/598/588",
# kotlin
"airbnb/lottie-android/2078/2077","square/leakcanary/2144/2137","android/compose-samples/1045/1023","android/nowinandroid/713/611","android/nowinandroid/858/853","kotlin/kotlinx.coroutines/3801/3789","kotlin/kotlinx.coroutines/3584/3578","quarkusio/quarkus/21328/21304","ktorio/ktor/1359/1358","square/moshi/791/775","square/moshi/604/602","thundernest/k-9/5687/5661","thundernest/k-9/5905/5873","thundernest/k-9/5250/5249","pinterest/ktlint/389/367","pinterest/ktlint/1015/997","detekt/detekt/483/466","cashapp/sqldelight/2516/2382","kotlin/kotlinx.serialization/1257/1251","intellij-rust/intellij-rust/9545/9543","intellij-rust/intellij-rust/9695/9414","square/wire/1472/1446","square/wire/1232/1230","square/kotlinpoet/1519/1518","square/kotlinpoet/1174/1076","netflix/dgs-framework/268/262","netflix/dgs-framework/168/167","cashapp/paparazzi/645/610","carlosesco/neko/104/67","autonomousapps/dependency-analysis-android-gradle-plugin/301/295","fwcd/kotlin-language-server/194/174","fwcd/kotlin-language-server/249/248","cashapp/redwood/137/40","square/anvil/577/574","square/anvil/467/459","fasterxml/jackson-module-kotlin/631/558","fasterxml/jackson-module-kotlin/641/340","square/workflow-kotlin/644/642","bumble-tech/appyx/33/28","bumble-tech/appyx/17/16","hannah-sten/texify-idea/2532/2530","hannah-sten/texify-idea/1603/1602","hannah-sten/texify-idea/3151/3038","kordlib/kord/96/94","horizontalsystems/unstoppable-wallet-android/560/538","jwstegemann/fritz2/430/425","jwstegemann/fritz2/679/671","theopenconversationkit/tock/1040/1039","google/android-fhir/301/296","splendo/kaluga/469/468",
# mixed
"electron/electron/11103/11101",
]
23 changes: 23 additions & 0 deletions bug_localization/requirements.txt
@@ -0,0 +1,23 @@
datasets==2.15.0
huggingface_hub==0.19.4
hydra-core==1.3.2
numpy==1.26.2
tokenizers==0.13.0
transformers==4.39.3
omegaconf==2.3.0
openai==1.3.9
nltk==3.8.1
scikit-learn==1.3.2
gitpython==3.1.40
matplotlib==3.8.2
pandas~=2.2.0
pytest~=8.0.1
langdetect~=1.0.9
tenacity~=8.2.3
python-dotenv~=1.0.1
anthropic~=0.21.3
tiktoken~=0.6.0
unidiff~=0.7.5
torch~=2.2.2
backoff~=2.2.1
python-dotenv~=1.0.1
Empty file.
5 changes: 5 additions & 0 deletions bug_localization/src/baselines/README.md
@@ -0,0 +1,5 @@
# Baselines

```shell
python +data_src=hf data_src.hub_name=tiginamaria/bug-localization +backbone=openai +backbone/prompt=detailed backbone.model_name=gpt-3.5-turbo-16k ++backbone.parameters.temperature=0.8 ++backbone.parameters.seed=2687987020 logger.name=gpt_3.5_16k-detailed
```
Empty file.
Empty file.
Empty file.
Empty file.
47 changes: 47 additions & 0 deletions bug_localization/src/baselines/backbones/agent/env/fs_env.py
@@ -0,0 +1,47 @@
from typing import Dict, List

from src.baselines.backbones.agent.env.fs_tools import read_fs_tools


class FileSystemEnv:
    """Read-only, in-memory view of a repository that the agent queries through tool calls."""

    def __init__(self, repo_content: Dict[str, str]):
        # Maps repository-relative file paths to file contents.
        self.repo_content = repo_content

    def _read_file(self, path: str) -> str:
        if path in self.repo_content:
            return self.repo_content[path]
        else:
            return f"Error occurred while reading file. Repository does not contain file {path}."

    def _list_directory(self, path: str) -> List[str]:
        if path == "" or path == ".":
            files_by_path = list(self.repo_content.keys())
            return list(set(f.split('/', 1)[0] for f in files_by_path))
        dir_path = path + '/'
        files_by_path = [f for f in self.repo_content.keys() if f.startswith(dir_path)]
        # Strip only the leading prefix; str.replace would also remove later occurrences of dir_path.
        return list(set(dir_path + f[len(dir_path):].split('/', 1)[0] for f in files_by_path))

    def _assert_args(self, command_name: str, command_params: dict, expected_args: List[str]):
        # Raise explicitly instead of using assert, which is skipped when Python runs with -O.
        for arg in expected_args:
            if command_params.get(arg) is None:
                raise ValueError(f"Argument {arg} is not provided for tool call {command_name}")

    def run_command(self, command_name: str, command_params: dict) -> str:
        # Errors are returned as text so the agent can read them and retry.
        try:
            message = ""
            if command_name == 'read_file':
                self._assert_args(command_name, command_params, ['path'])
                message = self._read_file(
                    path=command_params.get("path"),
                )
            elif command_name == 'list_directory':
                self._assert_args(command_name, command_params, ['path'])
                message = str(self._list_directory(
                    path=command_params.get("path"),
                ))
            return message
        except Exception as e:
            return str(e)

    def get_tools(self) -> List[dict]:
        return read_fs_tools
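
The directory-listing logic of `FileSystemEnv` can be tried in isolation. The sketch below is a standalone function rather than the class itself, because `read_fs_tools` is not shown in this diff; the name `list_directory` and the sample `repo` dictionary are hypothetical, and unlike the class method it sorts its result and tolerates a trailing slash.

```python
from typing import Dict, List


def list_directory(repo_content: Dict[str, str], path: str) -> List[str]:
    """Return the immediate children of `path` in a flat path-to-content mapping."""
    # The repository root has no prefix; any other directory ends with "/".
    prefix = "" if path in ("", ".") else path.rstrip("/") + "/"
    children = {
        # Keep only the first path component below the prefix.
        prefix + f[len(prefix):].split("/", 1)[0]
        for f in repo_content
        if f.startswith(prefix)
    }
    return sorted(children)


repo = {"README.md": "...", "src/main.py": "...", "src/utils/io.py": "..."}
print(list_directory(repo, "."))    # ['README.md', 'src']
print(list_directory(repo, "src"))  # ['src/main.py', 'src/utils']
```

Directories are not stored explicitly; they are inferred from the prefixes of the file paths, so an empty directory cannot be represented.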