Releases: symflower/eval-dev-quality
v0.6.2
Ruby Support
Fully added Ruby as a new language.
- Mistakes repository for Ruby by @ruiAzevedo19 in #316
- Transpile repository for Ruby by @ruiAzevedo19 in #318
- Infer the language to transpile from the file extension, so that other languages can be supported for transpilation by @ruiAzevedo19 in #317 (see the sketch after this list)
- Define the mistakes logic for Ruby by @ruiAzevedo19 in #326
- fix, Use another example for the missing import package, since the current one does not work because Ruby auto-loads the JSON module by @ruiAzevedo19 in #332
- Finalize Ruby support by @ruiAzevedo19 & @bauersimon in #327
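To illustrate the extension-based language inference from #317, here is a minimal Go sketch; the map contents and function name are illustrative assumptions, not the repository's actual code:

```go
package transpile

import (
	"path/filepath"
	"strings"
)

// languageByExtension maps a source file extension to its language.
// Illustrative subset only.
var languageByExtension = map[string]string{
	".go":   "golang",
	".java": "java",
	".rb":   "ruby",
}

// InferLanguage returns the language of a source file based on its file
// extension, so supporting another language for transpilation only needs
// a new map entry.
func InferLanguage(filePath string) (language string, ok bool) {
	language, ok = languageByExtension[strings.ToLower(filepath.Ext(filePath))]

	return language, ok
}
```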
Further merge requests
- Throw an error when trying to use a configuration file with containerized runtimes by @Munsio in #331
- Set up Ruby inside the GitHub Actions workflow by @Munsio in #333
- v0.6 results by @Munsio & @bauersimon in #324
Full Changelog: v0.6.1...v0.6.2
v0.6.1
Scoring bugfix
Assess code repair tasks by counting passing tests instead of accumulating coverage objects of executed tests, since coverage metrics say nothing about the implemented behavior, by @bauersimon in #321, #320
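As a minimal, illustrative Go sketch of the change (types and function names are assumptions, not the benchmark's real scoring API): a code repair task now scores by the number of passing tests, whereas accumulating coverage objects rewards repairs whose tests merely execute a lot of code, regardless of whether the behavior is correct.

```go
package scoring

// TestRun summarizes one execution of a repaired repository's test suite.
// Illustrative type, not the benchmark's real assessment structure.
type TestRun struct {
	PassingTests int // Tests that pass after the model's repair.
	Coverage     int // Coverage objects reached by the executed tests.
}

// scoreCodeRepairByCoverage is the old behavior: a repair whose tests merely
// execute a lot of code scores high even if the behavior is still wrong.
func scoreCodeRepairByCoverage(run TestRun) uint64 {
	return uint64(run.Coverage)
}

// ScoreCodeRepair counts passing tests instead, so only repairs that restore
// the expected behavior earn points.
func ScoreCodeRepair(run TestRun) uint64 {
	return uint64(run.PassingTests)
}
```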
Further merge requests (working towards v0.7.0)
- Language Support
- Ruby
- Introduce the Ruby language by @ruiAzevedo19 in #311
- Do not register Ruby language yet, since the test execution feature is still in progress by @ruiAzevedo19 in #313
- “light” repository for Ruby by @ruiAzevedo19 in #315
- Add Ruby as dependency to the Docker container by @Munsio in #309
- Development
- Reporting
- Store model meta information in a CSV file, so it can be further used in data visualization by @ruiAzevedo19 in #298
- fix, Prefix OpenRouter model with provider ID only once by @bauersimon in #322
- Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323
v0.6.0
Highlights 🌟
- Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers.
- Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
- New Task Types 📚
- Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
- Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
- Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixable using static analysis - this benchmark task showcases the potential of this technique.
- Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
- Improved Reporting 📑 Results are now written alongside the benchmark, meaning nothing is lost in case of an error. Plus a new tool `eval-dev-quality report` for combining multiple evaluation results into one (see the sketch below).
See the full release notes below. 🤗
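Conceptually, combining multiple evaluation results boils down to merging per-model scores from several runs. The following Go sketch only illustrates that idea; the record layout and field names are assumptions, not the actual report format:

```go
package report

// Result holds the score of one model in one evaluation run.
// Illustrative shape only, not the tool's real record format.
type Result struct {
	Model string
	Score uint64
}

// Combine merges the results of multiple evaluations into a single
// per-model total, which is roughly what a combined report needs.
func Combine(evaluations ...[]Result) map[string]uint64 {
	combined := map[string]uint64{}
	for _, evaluation := range evaluations {
		for _, result := range evaluation {
			combined[result.Model] += result.Score
		}
	}

	return combined
}
```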
Merge Requests
- Development & Management 🛠️
- Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
- Documentation 📚
- Document roadmaps and release schedule by @bauersimon #196
- Evaluation ⏱️
- Isolated Execution
- Docker Support
- Build Docker image for every release by @Munsio #199
- Docker evaluation runtime by @Munsio #211, #238, #234, #252
- Parallel execution of containerized evaluations by @Munsio #221
- Run docker image generation on each push by @Munsio #247
- fix, Use `main` revision docker tag by default by @Munsio #249
- fix, Add commit revision to docker and reports by @Munsio #255
- fix, IO error when multiple containers use the same result path by @Munsio #274
- Test docker in GitHub Actions by @Munsio #260
- fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
- fix, Pass environment tokens into container by @Munsio #250
- fix, Use a pinned Java 11 version by @Munsio #279
- Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
- Kubernetes Support
- Timeouts for test execution and `symflower` test generation by @ruiAzevedo19 #277, #267, #188
- Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
- fix, Use backoff for retrying LLMs because some LLMs need more time to recover by @zimmski #172
- Models 🤖
- Pull Ollama models if they are selected for evaluation by @Munsio #284
- Model Selection
- Exclude certain models (e.g. "openrouter/auto"), because it just forwards to another model automatically by @bauersimon #288
- Exclude the `perplexity` online models because they have a "per request" cost #288 (automatically excluded as online models)
- fix, Retry the OpenRouter models query because it sometimes just errors by @bauersimon #191
- fix, Default to all repositories if none are explicitly selected by @bauersimon #182
- fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
- fix, Always use forward slashes in prompts so paths are unified by @ruiAzevedo19 #268
- Reports & Metrics 🗒️
- Logging
- refactor, Structural logging by @ahumenberger #245
- Store model responses in separate files for easier lookup by @ahumenberger #278
- Store coverage objects by @ruiAzevedo19 #223
- Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 #243
- refactor, Abstract the storage of assessments by @ahumenberger #178
- fix, Do not overwrite results but create a separate result directory by @bauersimon #179
- New `report` subcommand for postprocessing report data
- `report` subcommand to combine multiple evaluations into one by @ruiAzevedo19 #271
- Let `report` command also combine markdown reports by @ruiAzevedo19 #258
- Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
- Store models for the evaluation in JSON configuration report by @bauersimon #285
- Store repositories for the evaluation in JSON configuration report by @bauersimon #287
- Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
- Report maximum of executable files by @ruiAzevedo19 #261
- Experiment with human-readable model names and costs to prepare for data visualization
- Generate the summed model files from the evaluation.csv by @ruiAzevedo19 #241
- Extract human-readable names of models by @ruiAzevedo19 #217
- Extract model costs by @ruiAzevedo19 #216
- Remove summed CSVs, human-readable names to handle them later during visualization by @ruiAzevedo19 #256
- Operating Systems 🖥️
- More tests for Windows
- Explicitly test Java test path logic on Windows by @bauersimon #184
- Extend temporary repository tests to Windows by @bauersimon
- Tools 🧰
- `symflower fix` auto-repair of common LLM mistakes
- Integrate `symflower fix` into evaluation by @ruiAzevedo19, @bauersimon #229
- Do not run `symflower fix` when there is a timeout of the LLM by @ruiAzevedo19 #236
- Update `symflower` to latest version to benefit from improved Go test package repairs by @bauersimon, @Munsio #294, #303
- Tasks 🔢
- Infrastructure for different Task types
- Introduce the interface for doing "evaluation tasks" so we can easily add new ones by @ahumenberger #197, #166 (see the interface sketch after this list)
- fix, CSV header missing the task identifier by @bauersimon #190
- Compile Go and Java so compilation errors can be used for the code repair task by @ruiAzevedo19 #162
- refactor, Share logging setup between multiple tasks by @bauersimon #202
- fix, Missing return statements when checking model capabilities by @bauersimon #239
- Validate task repositories before evaluation by @ruiAzevedo19 #265, #306
- New task types
- Evaluation task for code repair by @ruiAzevedo19 #170, #192
- fix, Ignore git and Maven repositories when validating code-repair repositories by @ahumenberger, @ruiAzevedo19 #281
- fix, Correct test value for "variable unknown" code...
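To give an idea of what the "evaluation tasks" interface from #197/#166 enables, here is a deliberately simplified Go sketch; the interface and method names are assumptions for illustration, not the project's actual types:

```go
package task

// Repository is whatever a task operates on; kept abstract in this sketch.
type Repository interface {
	Path() string
}

// Task is one benchmark activity such as "write tests", "code repair" or
// "transpile". Adding a new task type only requires implementing this
// interface and registering it.
type Task interface {
	// Identifier names the task, e.g. for the CSV header.
	Identifier() string
	// Run executes the task against a repository and returns a score.
	Run(repository Repository) (score uint64, err error)
}
```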
v0.5.0
Highlights 🌟
- Ollama 🦙 support
- Mac and Windows 🖥️ support
- Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API (see the sketch below)
- More complex "write test" task cases 🔢 for both Java and Go
- Evaluation now measures the processing time ⏱️ that it takes a model to compute a response
- Evaluation now counts the number of characters 💬 in model responses to give an idea of which models give brief, efficient responses
- Multiple runs 🏃 built right into the evaluation tool
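The generic endpoint support relies on servers speaking the OpenAI chat-completions protocol. The sketch below shows roughly what such a request looks like; the URL, model name, and environment variable are placeholders, and this is not the evaluation's own provider code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Minimal OpenAI-compatible chat-completions request. Any server that
	// implements this endpoint can in principle be benchmarked; the URL and
	// model name below are placeholders.
	payload, err := json.Marshal(map[string]any{
		"model": "my-local-model",
		"messages": []map[string]string{
			{"role": "user", "content": "Write tests for this function."},
		},
	})
	if err != nil {
		panic(err)
	}

	request, err := http.NewRequest(http.MethodPost, "http://localhost:8080/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	request.Header.Set("Content-Type", "application/json")
	request.Header.Set("Authorization", "Bearer "+os.Getenv("API_KEY"))

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	fmt.Println("status:", response.Status)
}
```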
Pull Requests
Merged
- Development 🛠️
- Documentation 📚
- Readme
- Reduce intro fluff to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
- New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
- Update according to newest blog post release by @bauersimon in #88
- Use the most cost-effective model that is still good for usage showcase because Claude Opus is super expensive by @zimmski in #84
- Evaluation ⏱️
- Multiple Runs
- Support multiple runs in a single evaluation by @bauersimon in #109
- Option to execute multiple runs non-interleaved by @bauersimon in #120
- fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
- Testdata Repository
- Use Git to avoid copying the repository on each model run by @Munsio in #114
- fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
- fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
- Tests
- Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
- Move the error used in the evaluation tests to a variable, to avoid copying it across the test suites by @ruiAzevedo19 in #138
- Language Support
- fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
- Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
- fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
- fix, Empty model responses should be handled as errors by @Munsio in #97
- refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
- Models 🤖
- New Models
- Ollama Support
- Installation and Update
- Ollama tool automated installation by @bauersimon in #95
- Ollama tool version check and update if version is outdated by @bauersimon in #118
- Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
- Provider Integration
- Prepare evaluation for Ollama provider by @zimmski in #115
- Support Ollama provider by @bauersimon in #96
- Preload Ollama models before inference and unload afterwards by @bauersimon in #121
- Generic OpenAI API provider by @bauersimon in #112
- Allow to retry a model when it errors by @ruiAzevedo19 in #125
- Clean up query attempt code by @zimmski in #132
- Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
- Reports 🗒️
- CSV
- Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
- Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
- fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
- Metrics
- Measure processing time of model responses by @bauersimon in #106
- Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
- Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
- Operating Systems 🖥️
- Tasks 🔢
- More “write test” tasks for Go and Java
- More Java task cases for test generation by @ahumenberger and @zimmski in #134
- More Go and Java tasks by @ahumenberger and @zimmski #124
- fix, Add the testify package dependency to the Golang light repository, so symflower test can execute the generated tests by @ruiAzevedo19 in #150
- fix, Download Go dependencies when executing tests by @bauersimon in #153
Closed
Issues
Closed
- #83 Add additional CSV files that sum up: overall, per-language
- #91 Integrate Ollama
- #92 Empty responses should not be tested but should fail
- #98 Non deterministic test output leads to flaky CI Jobs
- #101 Unable to run benchmark tasks on windows due to incorrect directory creation syntax
- #105 Measure Model response time
- #108 Multiple Runs
- #111 Generic OpenAI API provider
- #113 Optimize repository handling in multiple runs per model
- #116 Preload/Unload Ollama models before prompting
- #117 Fixed Ollama version
- #119 Multiple runs without interleaving
- #123 Give models a retry on error
- #128 Track how many characters were present in code part / complete response
- #131 Follow-up: Allow to retry a model when it errors
- #145 git repository change requires the GPG password
- #147 Repository not reset for multiple tasks
- #158 Deal with failing tests
v0.4.0
Deep dive into evaluation with this version: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
This release's major additions are
- Java as a new language,
- automatic Markdown report with an SVG chart,
- and lots of automation and testing to make the evaluation benchmark super reliable.
Features
- Java language adapter with “java/plain” repository #62
- Scoring through metric points and ranking of models #42
- Automatic categorization of models depending on their worst result #36 #39 #48
- Fully log per model and repository as results #25 #53
- Migrate to `symflower test` instead of redoing test execution logic #62
- Automatic installation of Symflower for RAG and general source code analytics to not reinvent the wheel #50
- Generate test file paths through language adapters #60
- Generate import / package paths through language adapters #63
- Generate test framework name through language adapters #63 (see the adapter sketch after this list)
- Human readable categories with description #57
- Summary report as Markdown file with links to results #57 #77
- Summary bar chart for overall results of categories as SVG in Markdown file #57
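The language adapters mentioned above bundle everything that is language-specific, so adding a language means implementing one interface. Here is a simplified Go sketch of what such an adapter could look like; the interface and method names are illustrative assumptions, not the project's actual interface:

```go
package language

// Language bundles the language-specific knowledge the evaluation needs.
// Illustrative only; not the project's actual interface.
type Language interface {
	// Name returns the identifier of the language, e.g. "java".
	Name() string
	// TestFilePath returns the path where the test file for a given source
	// file should be written.
	TestFilePath(repositoryPath string, sourceFilePath string) string
	// ImportPath returns the import or package path to use for a source file.
	ImportPath(repositoryPath string, sourceFilePath string) string
	// TestFramework returns the test framework to mention in the prompt,
	// e.g. "JUnit 5" for Java.
	TestFramework() string
}
```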
Bug fixes
v0.3.0
The first README of "DevQualityEval" (our final name for the benchmark) is online at https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the first benchmark and writing a report.
v0.2.0
This release makes the following tasks possible:
- Add model providers, models, and languages easily by implementing a common interface
- Evaluate with any model that openrouter.ai offers and with Symflower's symbolic execution.
- Add repositories that should be evaluated using Go as language
- Run tests of Go repositories and query their coverage as the first evaluation benchmark task
More to come. If you want to contribute, let us know.
v0.1.0
This release includes all basic components to move forward with creating an evaluation benchmark for LLMs and friends to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks so that people who want to contribute can help. These will follow soon.