
Releases: symflower/eval-dev-quality

v0.6.2

09 Sep 13:08
bbf6ab9

Ruby Support

Added full support for Ruby as a new language.

Further merge requests

Full Changelog: v0.6.1...v0.6.2

v0.6.1

20 Aug 08:22
f2178cb

Scoring bugfix

Assess code repair tasks by counting passing tests instead of accumulating coverage objects of executed tests, since coverage metrics say nothing about the implemented behavior, by @bauersimon in #321, #320
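As a rough illustration of what counting passing tests means, here is a minimal sketch that counts passing tests from go test -json output; this is an assumption for illustration, not the evaluation's actual scoring code:

```go
package scoring

import (
	"bufio"
	"encoding/json"
	"io"
)

// testEvent mirrors the relevant fields of a `go test -json` event.
type testEvent struct {
	Action string `json:"Action"`
	Test   string `json:"Test"`
}

// CountPassingTests counts "pass" events of individual tests in a
// `go test -json` stream. Using this count as the score rewards implemented
// behavior directly, which coverage objects alone do not capture.
func CountPassingTests(r io.Reader) (uint64, error) {
	var passing uint64
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		var event testEvent
		if err := json.Unmarshal(scanner.Bytes(), &event); err != nil {
			continue // Skip lines that are not JSON events, e.g. build output.
		}
		if event.Action == "pass" && event.Test != "" {
			passing++
		}
	}

	return passing, scanner.Err()
}
```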

Further merge requests (working towards v0.7.0)

  • Language Support
  • Development
    • Free GitHub runner disk space by removing unnecessary pre-packaged libraries by @Munsio in #314
  • Reporting
    • Store models meta information in a CSV file, so it can be further used in data visualization by @ruiAzevedo19 in #298
    • fix, Prefix OpenRouter model with provider ID only once by @bauersimon in #322
    • Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323

v0.6.0

02 Aug 11:30
d3ba2cb

Highlights 🌟

  • Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers (a rough sketch of the idea follows below).
  • Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
  • New Task Types 📚
    • Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
    • Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
  • Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixed using static analysis: this benchmark task showcases the potential of that technique.
  • Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
  • Improved Reporting 📑 Results are now written alongside the benchmark, meaning nothing is lost in case of an error, plus there is a new eval-dev-quality report subcommand for combining multiple evaluation results into one.

See the full release notes below. 🤗
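To make the sandboxing highlight concrete, here is a minimal sketch of running one evaluation per model in its own Docker container, in parallel. The image name "evaluation-image" and the inner evaluate --model command are placeholders, not the tool's actual interface:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// Placeholder model names; the real evaluation receives models as arguments.
	models := []string{"model-a", "model-b", "model-c"}

	var wg sync.WaitGroup
	for _, model := range models {
		wg.Add(1)
		go func(model string) {
			defer wg.Done()

			// Each evaluation is isolated in its own container. The image name
			// and the inner command are placeholders for illustration only.
			cmd := exec.Command("docker", "run", "--rm", "evaluation-image", "evaluate", "--model", model)
			if output, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("model %s failed: %v\n%s", model, err, output)
			}
		}(model)
	}
	wg.Wait()
}
```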

Merge Requests

  • Development & Management 🛠️
    • Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
  • Documentation 📚
  • Evaluation ⏱️
    • Isolated Execution
      • Docker Support
        • Build Docker image for every release by @Munsio #199
        • Docker evaluation runtime by @Munsio #211, #238, #234, #252
        • Parallel execution of containerized evaluations by @Munsio #221
        • Run docker image generation on each push by @Munsio #247
        • fix, Use main revision Docker tag by default by @Munsio #249
        • fix, Add commit revision to docker and reports by @Munsio #255
        • fix, IO error when multiple Containers use the same result path by @Munsio #274
        • Test docker in GitHub Actions by @Munsio #260
        • fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
        • fix, Pass environment tokens into container by @Munsio #250
        • fix, Use a pinned Java 11 version by @Munsio #279
        • Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
      • Kubernetes Support
        • Kubernetes evaluation runtime by @Munsio #231
        • Copy back results from the cluster to the initial host by @Munsio #272
        • fix, Only use valid characters in Kubernetes job names by @Munsio #292
    • Timeouts for test execution and symflower test generation by @ruiAzevedo19 #277, #267, #188
    • Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
    • fix, Use backoff when retrying LLMs because some LLMs need more time to recover by @zimmski #172 (see the sketch at the end of this list)
  • Models 🤖
    • Pull Ollama models if they are selected for evaluation by @Munsio #284
    • Model Selection
      • Exclude certain models (e.g. "openrouter/auto"), because they just forward to another model automatically by @bauersimon #288
      • Exclude the Perplexity online models because they have a "per request" cost #288 (automatically excluded as online models)
    • fix, Retry the OpenRouter models query because it sometimes just errors by @bauersimon #191
    • fix, Default to all repositories if none are explicitly selected by @bauersimon #182
    • fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
    • fix, Always use forward slashes in prompts so paths are unified by @ruiAzevedo19 #268
  • Reports & Metrics 🗒️
    • Logging
    • Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 #243
    • refactor, Abstract the storage of assessments by @ahumenberger #178
    • fix, Do not overwrite results but create a separate result directory by @bauersimon #179
    • New report subcommand for postprocessing report data
    • Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
      • Store models for the evaluation in JSON configuration report by @bauersimon #285
      • Store repositories for the evaluation in JSON configuration report by @bauersimon #287
      • Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
    • Report maximum of executable files by @ruiAzevedo19 #261
    • Experiment with human-readable model names and costs to prepare for data visualization
  • Operating Systems 🖥️
    • More tests for Windows
  • Tools 🧰
  • Tasks 🔢
    • Infrastructure for different Task types
    • New task types
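Regarding the backoff fix referenced above, the following is a minimal sketch of retrying a model query with exponential backoff; the query callback and its signature are placeholders, not the evaluation's actual model interface:

```go
package model

import "time"

// queryWithBackoff retries a model query with exponential backoff, since some
// LLMs need more time to recover from transient errors. The "query" callback
// is a placeholder for the actual model call.
func queryWithBackoff(query func() (string, error), attempts int) (response string, err error) {
	delay := time.Second
	for attempt := 0; attempt < attempts; attempt++ {
		if response, err = query(); err == nil {
			return response, nil
		}

		// Wait before the next attempt and double the delay each time.
		time.Sleep(delay)
		delay *= 2
	}

	return "", err
}
```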

v0.5.0

06 Jun 07:18
efe1ea3

Highlights 🌟

  • Ollama 🦙 support
  • Mac and Windows 🖥️ support
  • Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API
  • More complex "write test" task cases 🔢 for both Java and Go
  • Evaluation now measures the processing time ⏱️ that it takes a model to compute a response
  • Evaluation now counts the number of characters 💬 in model responses to give an idea of which models give brief, efficient responses (a rough sketch of these metrics follows below)
  • Multiple runs 🏃 built right into the evaluation tool
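To make the endpoint support and the two new metrics concrete, here is a minimal sketch that queries an OpenAI-compatible chat completions endpoint and records processing time and response length; the endpoint URL, model name, and prompt are placeholders, not the evaluation's actual code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
	"unicode/utf8"
)

func main() {
	// Any endpoint implementing the OpenAI chat completions API works; the
	// base URL and model name below are placeholders (here: a local Ollama).
	url := "http://localhost:11434/v1/chat/completions"
	request, err := json.Marshal(map[string]any{
		"model": "llama3",
		"messages": []map[string]string{
			{"role": "user", "content": "Write a unit test for ..."},
		},
	})
	if err != nil {
		panic(err)
	}

	start := time.Now()
	response, err := http.Post(url, "application/json", bytes.NewReader(request))
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	var result struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(response.Body).Decode(&result); err != nil {
		panic(err)
	}
	if len(result.Choices) == 0 {
		panic("no response choices")
	}
	content := result.Choices[0].Message.Content

	// The two new metrics: processing time and response length in characters.
	fmt.Println("processing time:", time.Since(start))
	fmt.Println("response characters:", utf8.RuneCountInString(content))
}
```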

Pull Requests

Merged

  • Development 🛠️
    • CI
      • fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
      • Cancel previous runs of the CI when a new push happened to a PR by @Munsio in #140
    • Tooling
  • Documentation 📚
    • Readme
      • Reduce the fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
      • New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
      • Update according to newest blog post release by @bauersimon in #88
      • Use the most cost-effective model that is still good for the usage showcase, because Claude Opus is super expensive by @zimmski in #84
  • Evaluation ⏱️
    • Multiple Runs
      • Support multiple runs in a single evaluation by @bauersimon in #109
      • Option to execute multiple runs non-interleaved by @bauersimon in #120
      • fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
    • Testdata Repository
      • Use Git to avoid copying the repository on each model run by @Munsio in #114
      • fix, Use an empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
      • fix, Reset the repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
    • Tests
      • Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
      • Move the error used in the evaluation tests to a variable, to avoid copying it across the test suites by @ruiAzevedo19 in #138
    • Language Support
      • fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
      • Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
      • fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
    • fix, Empty model responses should be handled as errors by @Munsio in #97
    • refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
  • Models 🤖
    • New Models
      • Ollama Support
        • Installation and Update
          • Ollama tool automated installation by @bauersimon in #95
          • Ollama tool version check and update if version is outdated by @bauersimon in #118
          • Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
        • Provider Integration
      • Generic OpenAI API provider by @bauersimon in #112
    • Allow to retry a model when it errors by @ruiAzevedo19 in #125
    • Clean up query attempt code by @zimmski in #132
    • Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
  • Reports 🗒️
    • CSV
      • Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
      • Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
      • fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
    • Metrics
      • Measure processing time of model responses by @bauersimon in #106
      • Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
      • Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
  • Operating Systems 🖥️
  • Tasks 🔢

Closed

Issues

Closed

  • #83 Add additional CSV files that sum up: overall, per-language
  • #91 Integrate Ollama
  • #92 Empty responses should not be tested but should fail
  • #98 Non deterministic test output leads to flaky CI Jobs
  • #101 Unable to run benchmark tasks on windows due to incorrect directory creation syntax
  • #105 Measure Model response time
  • #108 Multiple Runs
  • #111 Generic OpenAI API provider
  • #113 Optimize repository handling in multiple runs per model
  • #116 Preload/Unload Ollama models before prompting
  • #117 Fixed Ollama version
  • #119 Multiple runs without interleaving
  • #123 Give models a retry on error
  • #128 Track how many characters were present in code part / complete response
  • #131 Follow-up: Allow to retry a model when it errors
  • #145 git repository change requires the GPG password
  • #147 Repository not reset for multiple tasks
  • #158 Deal with failing tests

v0.4.0

26 Apr 12:42
8a38762

Deep dive into evaluation with this version: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Full logging per model and repository as results #25 #53
  • Migrate to symflower test instead of redoing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics to not reinvent the wheel #50
  • Generate test file paths through language adapters #60
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63
  • Human-readable categories with descriptions #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57

Bug fixes

  • More reliable parsing of code fences #70 #69 (a small sketch follows this list)
  • Do not exit process but instead panic for reliable testing and traces #69
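As a rough illustration of the kind of code fence parsing involved (not the repository's actual, more robust parser; the helper below is purely hypothetical):

```go
package report

import (
	"regexp"
	"strings"
)

// fence is a Markdown code fence of three backticks.
var fence = strings.Repeat("`", 3)

// codeFence matches a fenced Markdown code block with an optional language tag.
var codeFence = regexp.MustCompile("(?s)" + fence + "[a-zA-Z]*\n(.*?)" + fence)

// ExtractCode returns the content of the first fenced code block in a model
// response, falling back to the whole response if no fence is present.
func ExtractCode(response string) string {
	if match := codeFence.FindStringSubmatch(response); match != nil {
		return strings.TrimSpace(match[1])
	}

	return strings.TrimSpace(response)
}
```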

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online at https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the first benchmark results and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface
  • Evaluate with any model that openrouter.ai offers and with Symflower's symbolic execution.
  • Add repositories that should be evaluated using Go as language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components to move forward with creating an evaluation benchmark for LLMs and friends, to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks for people who want to contribute. These will follow soon.