Update README.md
evgngl authored Jun 5, 2024
1 parent 0eb060e commit 58967b3
Showing 1 changed file with 26 additions and 12 deletions.
38 changes: 26 additions & 12 deletions code_completion/README.md
@@ -5,6 +5,14 @@ This folder contains code for running baselines for Project Level Code Completio

We provide the implementation for the following baseline: a language model that is fed with differently composed context from a repository snapshot.

The evaluation steps are the following:
* Choose context composer
  * Run next token prediction for different context composers
  * Choose the best one based on the lowest perplexity on the completion file
* Evaluate code completion with the composer
  * Run one line code completion with zero project context
  * Run one line code completion with a project context composed by the chosen context composer
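
The composer choice step above can be sketched as follows. This is a minimal illustration, not the pipeline's code: the function names, the composer names, and the log-probability values are all hypothetical. It assumes each composer yields per-token natural-log probabilities for the completion file.

```python
import math


def perplexity(token_log_probs):
    """Perplexity is exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def choose_best_composer(log_probs_by_composer):
    """Return (name, perplexity) for the composer with the lowest perplexity."""
    scores = {
        name: perplexity(log_probs)
        for name, log_probs in log_probs_by_composer.items()
    }
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical per-token log-probabilities on the completion file.
example = {
    "composer_a": [-2.0, -1.5, -2.5],
    "composer_b": [-1.0, -0.5, -1.5],
}
best_name, best_ppl = choose_best_composer(example)
```

Here `composer_b` is selected because its mean negative log-likelihood is lower, which gives a lower perplexity.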

# How-to

## 💾 Install dependencies
@@ -26,33 +34,39 @@ We use [Hydra](https://hydra.cc/docs/intro/) for configuration. Main config used
* To evaluate your model you need to add it to [registry](model_hub/model_registry.py).

### Add your context composer
* Base class for a composer is [`OneCompletonFileComposer`](eval/composers.py).
* To evaluate your composer you need to add it to `COMPOSERS` dictionary that is located in [eval/composers.py](eval/composers.py).
* Base class for a composer is [`OneCompletionFileComposer`](composers/one_completion_file_composer.py).
* To evaluate your composer you need to add it to `COMPOSERS` dictionary that is located in [composers/composer_registry.py](composers/composer_registry.py).
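
A rough sketch of what registering a composer might look like. The real interface of `OneCompletionFileComposer` is not shown in this README, so the base class, the `compose` signature, and the `same_directory` composer below are hypothetical stand-ins; only the `COMPOSERS` dictionary name comes from the text above, and whether it maps names to classes or to instances is an assumption.

```python
import posixpath
from abc import ABC, abstractmethod


class BaseComposerStandIn(ABC):
    """Hypothetical stand-in for the repository's base composer class."""

    @abstractmethod
    def compose(self, repo_files, completion_path):
        """Build a context string from {path: content} for one completion file."""


class SameDirectoryComposer(BaseComposerStandIn):
    """Hypothetical composer: use only files next to the completion file."""

    def compose(self, repo_files, completion_path):
        target_dir = posixpath.dirname(completion_path)
        parts = []
        for path in sorted(repo_files):
            if path == completion_path:
                continue  # never leak the completion file into its own context
            if posixpath.dirname(path) == target_dir:
                parts.append(f"# {path}\n{repo_files[path]}")
        return "\n".join(parts)


# Assumed registry shape: composer name -> composer class.
COMPOSERS = {"same_directory": SameDirectoryComposer}

files = {"pkg/a.py": "A = 1", "pkg/b.py": "B = 2", "other/c.py": "C = 3"}
context = COMPOSERS["same_directory"]().compose(files, "pkg/b.py")
```

Check the actual base class before copying this shape, since its method names and arguments may differ.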

### Supported datasets:
* [`JetBrains-Research/lca-codegen-small`](https://huggingface.co/datasets/JetBrains-Research/lca-codegen-small)
* [`JetBrains-Research/lca-codegen-medium`](https://huggingface.co/datasets/JetBrains-Research/lca-codegen-medium)
* All configurations of [`JetBrains-Research/lca-project-level-code-completion`](https://huggingface.co/datasets/JetBrains-Research/lca-project-level-code-completion):
  * `small_context`
  * `medium_context`
  * `large_context`
  * `huge_context`

## 🚀 Run

The main running script is [`eval/eval_pipeline.py`](eval/eval_pipeline.py).

* To start evaluation with Poetry, run: `poetry run python -m eval.eval_pipeline wandb_project_name=%project_name_1% wandb_project_name_generation=%project_name_2%`
  * `%project_name_1%` is a name for [wandb](https://wandb.ai/) project with the results of the next token prediction task to compare composers. Target metric is perplexity on completion file.
  * `%project_name_2%` is a name for [wandb](https://wandb.ai/) project with the results of the one line code completion task with best

* To start evaluation with Poetry, run: `poetry run python -m eval.eval_pipeline params=codellama7b`
You can also add command-line arguments using [Hydra's override feature](https://hydra.cc/docs/advanced/override_grammar/basic/).

### Hydra Config Main Parameters
* `params` – to choose a model to evaluate, possible values are filenames from [the directory](eval/config/params)
* `dataset` – to choose a dataset, possible values are filenames from [the directory](eval/config/dataset)
* `artifacts_dir` – where to put all the artifacts of evaluation
  * results are stored in `os.path.join(config.artifacts_dir, config.language, model_name, dataset_name)`
* `wandb_project_name` – WandB project name for the composer choice step
* `wandb_project_name_generation` – WandB project name for the line generation step
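
To make the results location concrete, the `os.path.join` expression from the `artifacts_dir` entry can be evaluated with sample values. The values below (`artifacts`, `python`, `starcoderbase7b`, `small`) are hypothetical, and `SimpleNamespace` merely stands in for the Hydra config object.

```python
import os
from types import SimpleNamespace

# Hypothetical config values; the real ones come from the Hydra config.
config = SimpleNamespace(artifacts_dir="artifacts", language="python")
model_name = "starcoderbase7b"  # assumed to match the chosen params file
dataset_name = "small"          # assumed to match the chosen dataset file

results_dir = os.path.join(
    config.artifacts_dir, config.language, model_name, dataset_name
)
# Path components, independent of the platform's separator.
components = results_dir.split(os.sep)
```

Each model and dataset pair therefore gets its own subdirectory under the language directory.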

### Examples
* [Starcoder Base 7B](https://huggingface.co/bigcode/starcoderbase-7b) on small dataset
* Command:
```
poetry run python -m eval.eval_pipeline wandb_project_name=%project_name_1% wandb_project_name_generation=%project_name_2% dataset='JetBrains-Research/lca-codegen-small'
+params=starcoderbase7b
poetry run python -m eval.eval_pipeline dataset=small params=starcoderbase7b
```
* [CodeLlama 7B](https://huggingface.co/codellama/CodeLlama-7b-hf) in 4bit quantization with context window 8K on medium dataset
* Command:
```
poetry run python -m eval.eval_pipeline wandb_project_name=%project_name_1% wandb_project_name_generation=%project_name_2% dataset='JetBrains-Research/lca-codegen-medium'
+params=codellama7b_4bit params.inference_params.seq_max_len=8000
poetry run python -m eval.eval_pipeline dataset=medium params=codellama7b_4bit params.inference_params.seq_max_len=8000
```
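
The dotted override in the last command, `params.inference_params.seq_max_len=8000`, addresses a nested config key. The sketch below illustrates that idea in plain Python. It is not Hydra's implementation, the default value `4096` is hypothetical, and selecting a config file with `params=...` or `dataset=...` is a separate mechanism that is not modelled here.

```python
import copy


def apply_override(config, override):
    """Return a copy of `config` with one `dotted.key=value` override applied."""
    key, sep, raw = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    result = copy.deepcopy(config)  # leave the caller's config untouched
    node = result
    *parents, leaf = key.split(".")
    for name in parents:
        node = node.setdefault(name, {})  # walk down, creating levels as needed
    # Simplification: only integers are converted, everything else stays a string.
    node[leaf] = int(raw) if raw.lstrip("-").isdigit() else raw
    return result


# Hypothetical default config.
base = {"params": {"inference_params": {"seq_max_len": 4096}}}
updated = apply_override(base, "params.inference_params.seq_max_len=8000")
```

Hydra's real override grammar is richer (typed values, appending and deleting keys), so treat this only as a picture of how the dotted path maps onto nesting.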
