
Releases: symflower/eval-dev-quality

v0.6.2

09 Sep 13:08
bbf6ab9

Ruby Support

Added full support for Ruby as a new language.

Further merge requests

Full Changelog: v0.6.1...v0.6.2

v0.6.1

20 Aug 08:22
f2178cb

Scoring bugfix

Assess code repair tasks by counting passing tests instead of accumulating coverage objects of executed tests, since coverage metrics say nothing about the implemented behavior, by @bauersimon in #321, #320
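As a rough illustration of what counting passing tests means, here is a minimal sketch that counts passing tests from go test -json output; this is an assumption for illustration, not the evaluation's actual scoring code:

```go
package scoring

import (
	"bufio"
	"encoding/json"
	"io"
)

// testEvent mirrors the relevant fields of a `go test -json` event.
type testEvent struct {
	Action string `json:"Action"`
	Test   string `json:"Test"`
}

// CountPassingTests counts "pass" events of individual tests in a
// `go test -json` stream. Using this count as the score rewards implemented
// behavior directly, which coverage objects alone do not capture.
func CountPassingTests(r io.Reader) (uint64, error) {
	var passing uint64
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		var event testEvent
		if err := json.Unmarshal(scanner.Bytes(), &event); err != nil {
			continue // Skip lines that are not JSON events, e.g. build output.
		}
		if event.Action == "pass" && event.Test != "" {
			passing++
		}
	}

	return passing, scanner.Err()
}
```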

Further merge requests (working towards v0.7.0)

  • Language Support
  • Development
    • Free GitHub runner disk space by removing unnecessary pre-packaged libraries by @Munsio in #314
  • Reporting
    • Store models meta information in a CSV file, so it can be further used in data visualization by @ruiAzevedo19 in #298
    • fix, Prefix OpenRouter model with provider ID only once by @bauersimon in #322
    • Use new Symflower version which reduces error output of the "fix" command by @bauersimon in #323

v0.6.0

02 Aug 11:30
d3ba2cb

Highlights 🌟

  • Sandboxed Execution with Docker 🐳 LLM-generated code can now be executed within a safe Docker sandbox, including parallel evaluation of multiple models across multiple containers (a rough sketch of the idea follows below).
  • Scaling Benchmarks with Kubernetes 📈 Docker evaluations can be scaled across Kubernetes clusters to support benchmarking many models in parallel on distributed hardware.
  • New Task Types 📚
    • Code Repair 🛠️ Prompts an LLM with compilation errors and asks it to fix them.
    • Code Transpilation 🔀 Has an LLM transpile source code from one programming language into another.
  • Static Code Repair Benchmark 🚑 LLMs commonly make small mistakes that are easily fixed using static analysis: this benchmark task showcases the potential of that technique.
  • Automatically pull Ollama Models 🦙 Ollama models are now automatically pulled when specified for the evaluation.
  • Improved Reporting 📑 Results are now written alongside the benchmark, meaning nothing is lost in case of an error, plus there is a new eval-dev-quality report subcommand for combining multiple evaluation results into one.

See the full release notes below. 🤗
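To make the sandboxing highlight concrete, here is a minimal sketch of running one evaluation per model in its own Docker container, in parallel. The image name "evaluation-image" and the inner evaluate --model command are placeholders, not the tool's actual interface:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

func main() {
	// Placeholder model names; the real evaluation receives models as arguments.
	models := []string{"model-a", "model-b", "model-c"}

	var wg sync.WaitGroup
	for _, model := range models {
		wg.Add(1)
		go func(model string) {
			defer wg.Done()

			// Each evaluation is isolated in its own container. The image name
			// and the inner command are placeholders for illustration only.
			cmd := exec.Command("docker", "run", "--rm", "evaluation-image", "evaluate", "--model", model)
			if output, err := cmd.CombinedOutput(); err != nil {
				fmt.Printf("model %s failed: %v\n%s", model, err, output)
			}
		}(model)
	}
	wg.Wait()
}
```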

Merge Requests

  • Development & Management 🛠️
    • Demo script to run models sequentially in separate evaluations on the "light" repository by @ahumenberger #189
  • Documentation 📚
  • Evaluation ⏱️
    • Isolated Execution
      • Docker Support
        • Build Docker image for every release by @Munsio #199
        • Docker evaluation runtime by @Munsio #211, #238, #234, #252
        • Parallel execution of containerized evaluations by @Munsio #221
        • Run docker image generation on each push by @Munsio #247
        • fix, Use main revision Docker tag by default by @Munsio #249
        • fix, Add commit revision to docker and reports by @Munsio #255
        • fix, IO error when multiple Containers use the same result path by @Munsio #274
        • Test docker in GitHub Actions by @Munsio #260
        • fix, Ignore CLI argument model, provider and testdata checks on host when using containerization by @Munsio #290
        • fix, Pass environment tokens into container by @Munsio #250
        • fix, Use a pinned Java 11 version by @Munsio #279
        • Make paths absolute when copying Docker results because Docker gets confused by paths containing colons by @Munsio #308
      • Kubernetes Support
        • Kubernetes evaluation runtime by @Munsio #231
        • Copy back results from the cluster to the initial host by @Munsio #272
        • fix, Only use valid characters in Kubernetes job names by @Munsio #292
    • Timeouts for test execution and symflower test generation by @ruiAzevedo19 #277, #267, #188
    • Clarify prompt that code responses must be in code fences by @ruiAzevedo19 #259
    • fix, Use backoff when retrying LLMs because some LLMs need more time to recover by @zimmski #172 (see the sketch at the end of this list)
  • Models 🤖
    • Pull Ollama models if they are selected for evaluation by @Munsio #284
    • Model Selection
      • Exclude certain models (e.g. "openrouter/auto"), because they just forward to another model automatically by @bauersimon #288
      • Exclude the Perplexity online models because they have a "per request" cost #288 (automatically excluded as online models)
    • fix, Retry the OpenRouter models query because it sometimes just errors by @bauersimon #191
    • fix, Default to all repositories if none are explicitly selected by @bauersimon #182
    • fix, Do not start Ollama server if no Ollama model is selected by @ruiAzevedo19 #269
    • fix, Always use forward slashes in prompts so paths are unified by @ruiAzevedo19 #268
  • Reports & Metrics 🗒️
    • Logging
    • Write out results right away so we don't lose anything if the evaluation crashes by @ruiAzevedo19 #243
    • refactor, Abstract the storage of assessments by @ahumenberger #178
    • fix, Do not overwrite results but create a separate result directory by @bauersimon #179
    • New report subcommand for postprocessing report data
    • Report evaluation configuration (used models + repositories) as a JSON artifact for reproducibility
      • Store models for the evaluation in JSON configuration report by @bauersimon #285
      • Store repositories for the evaluation in JSON configuration report by @bauersimon #287
      • Load models and repositories that were used from JSON configuration by @ruiAzevedo19 #291
    • Report maximum of executable files by @ruiAzevedo19 #261
    • Experiment with human-readable model names and costs to prepare for data visualization
  • Operating Systems 🖥️
    • More tests for Windows
  • Tools 🧰
  • Tasks 🔢
    • Infrastructure for different Task types
    • New task types
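Regarding the backoff fix referenced above, the following is a minimal sketch of retrying a model query with exponential backoff; the query callback and its signature are placeholders, not the evaluation's actual model interface:

```go
package model

import "time"

// queryWithBackoff retries a model query with exponential backoff, since some
// LLMs need more time to recover from transient errors. The "query" callback
// is a placeholder for the actual model call.
func queryWithBackoff(query func() (string, error), attempts int) (response string, err error) {
	delay := time.Second
	for attempt := 0; attempt < attempts; attempt++ {
		if response, err = query(); err == nil {
			return response, nil
		}

		// Wait before the next attempt and double the delay each time.
		time.Sleep(delay)
		delay *= 2
	}

	return "", err
}
```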

v0.5.0

06 Jun 07:18
efe1ea3

Highlights 🌟

  • Ollama 🦙 support
  • Mac and Windows 🖥️ support
  • Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API
  • More complex "write test" task cases 🔢 for both Java and Go
  • Evaluation now measures the processing time ⏱️ that it takes a model to compute a response
  • Evaluation now counts the number of characters 💬 in model responses to give an idea of which models give brief, efficient responses (a rough sketch of these metrics follows below)
  • Multiple runs 🏃 built right into the evaluation tool
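To make the endpoint support and the two new metrics concrete, here is a minimal sketch that queries an OpenAI-compatible chat completions endpoint and records processing time and response length; the endpoint URL, model name, and prompt are placeholders, not the evaluation's actual code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
	"unicode/utf8"
)

func main() {
	// Any endpoint implementing the OpenAI chat completions API works; the
	// base URL and model name below are placeholders (here: a local Ollama).
	url := "http://localhost:11434/v1/chat/completions"
	request, err := json.Marshal(map[string]any{
		"model": "llama3",
		"messages": []map[string]string{
			{"role": "user", "content": "Write a unit test for ..."},
		},
	})
	if err != nil {
		panic(err)
	}

	start := time.Now()
	response, err := http.Post(url, "application/json", bytes.NewReader(request))
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	var result struct {
		Choices []struct {
			Message struct {
				Content string `json:"content"`
			} `json:"message"`
		} `json:"choices"`
	}
	if err := json.NewDecoder(response.Body).Decode(&result); err != nil {
		panic(err)
	}
	if len(result.Choices) == 0 {
		panic("no response choices")
	}
	content := result.Choices[0].Message.Content

	// The two new metrics: processing time and response length in characters.
	fmt.Println("processing time:", time.Since(start))
	fmt.Println("response characters:", utf8.RuneCountInString(content))
}
```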

Pull Requests

Merged

  • Development 🛠️
    • CI
      • fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
      • Cancel previous runs of the CI when a new push happened to a PR by @Munsio in #140
    • Tooling
  • Documentation 📚
    • Readme
      • Reduce the fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
      • New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
      • Update according to newest blog post release by @bauersimon in #88
      • Use the most cost-effective model that is still good for the usage showcase, because Claude Opus is super expensive by @zimmski in #84
  • Evaluation ⏱️
    • Multiple Runs
      • Support multiple runs in a single evaluation by @bauersimon in #109
      • Option to execute multiple runs non-interleaved by @bauersimon in #120
      • fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
    • Testdata Repository
      • Use Git to avoid copying the repository on each model run by @Munsio in #114
      • fix, Use an empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
      • fix, Reset the repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
    • Tests
      • Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
      • Move the error used in the evaluation tests to a variable, to avoid copying it across the test suites by @ruiAzevedo19 in #138
    • Language Support
      • fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
      • Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
      • fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
    • fix, Empty model responses should be handled as errors by @Munsio in #97
    • refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
  • Models 🤖
    • New Models
      • Ollama Support
        • Installation and Update
          • Ollama tool automated installation by @bauersimon in #95
          • Ollama tool version check and update if version is outdated by @bauersimon in #118
          • Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
        • Provider Integration
      • Generic OpenAI API provider by @bauersimon in #112
    • Allow to retry a model when it errors by @ruiAzevedo19 in #125
    • Clean up query attempt code by @zimmski in #132
    • Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
  • Reports 🗒️
    • CSV
      • Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
      • Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
      • fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
    • Metrics
      • Measure processing time of model responses by @bauersimon in #106
      • Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
      • Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
  • Operating Systems 🖥️
  • Tasks 🔢

Closed

Issues

Closed

  • #83 Add additional CSV files that sum up: overall, per-language
  • #91 Integrate Ollama
  • #92 Empty responses should not be tested but should fail
  • #98 Non deterministic test output leads to flaky CI Jobs
  • #101 Unable to run benchmark tasks on windows due to incorrect directory creation syntax
  • #105 Measure Model response time
  • #108 Multiple Runs
  • #111 Generic OpenAI API provider
  • #113 Optimize repository handling in multiple runs per model
  • #116 Preload/Unload Ollama models before prompting
  • #117 Fixed Ollama version
  • #119 Multiple runs without interleaving
  • #123 Give models a retry on error
  • #128 Track how many characters were present in code part / complete response
  • #131 Follow-up: Allow to retry a model when it errors
  • #145 git repository change requires the GPG password
  • #147 Repository not reset for multiple tasks
  • #158 Deal with failing tests

v0.4.0

26 Apr 12:42
8a38762

Deep dive into evaluation with this version: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Full logging per model and repository as results #25 #53
  • Migrate to symflower test instead of redoing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics to not reinvent the wheel #50
  • Generate test file paths through language adapters #60
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63
  • Human-readable categories with descriptions #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57

Bug fixes

  • More reliable parsing of code fences #70 #69 (a small sketch follows this list)
  • Do not exit process but instead panic for reliable testing and traces #69
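As a rough illustration of the kind of code fence parsing involved (not the repository's actual, more robust parser; the helper below is purely hypothetical):

```go
package report

import (
	"regexp"
	"strings"
)

// fence is a Markdown code fence of three backticks.
var fence = strings.Repeat("`", 3)

// codeFence matches a fenced Markdown code block with an optional language tag.
var codeFence = regexp.MustCompile("(?s)" + fence + "[a-zA-Z]*\n(.*?)" + fence)

// ExtractCode returns the content of the first fenced code block in a model
// response, falling back to the whole response if no fence is present.
func ExtractCode(response string) string {
	if match := codeFence.FindStringSubmatch(response); match != nil {
		return strings.TrimSpace(match[1])
	}

	return strings.TrimSpace(response)
}
```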

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online at https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the first benchmark results and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface
  • Evaluate with any model that openrouter.ai offers and with Symflower's symbolic execution.
  • Add repositories that should be evaluated using Go as language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components to move forward with creating an evaluation benchmark for LLMs and friends, to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks for people who want to contribute. These will follow soon.