Releases
v0.5.0
Highlights
Ollama support
Mac and Windows support
Support for any inference endpoint, as long as it implements the OpenAI inference API
More complex "write test" task cases for both Java and Go
Evaluation now measures the processing time it takes a model to compute a response
Evaluation now counts the number of characters in model responses to indicate which models give brief, efficient responses
Multiple runs built right into the evaluation tool
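Any endpoint that speaks the OpenAI inference API can now be evaluated, which includes a local Ollama instance. A minimal sketch of what such a request looks like, assuming an Ollama-style base URL and model name (both placeholders, not the evaluation tool's actual configuration):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatRequest mirrors the minimal OpenAI chat completions payload.
type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// newChatCompletionRequest builds a request against any endpoint that
// implements the OpenAI inference API. The base URL and model name are
// assumptions for illustration; an Ollama instance, for example, serves
// this API locally.
func newChatCompletionRequest(baseURL, model, prompt string) (*http.Request, error) {
	payload, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "user", Content: prompt},
		},
	})
	if err != nil {
		return nil, err
	}
	request, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	request.Header.Set("Content-Type", "application/json")

	return request, nil
}

func main() {
	request, err := newChatCompletionRequest("http://localhost:11434", "llama3", "Write a test for this function.")
	if err != nil {
		panic(err)
	}
	fmt.Println(request.Method, request.URL)
}
```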
Pull Requests
Merged
Development
CI
fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
Cancel previous runs of the CI when a new push happens to a PR by @Munsio in #140
Tooling
Documentation
Readme
Reduce fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
Update according to newest blog post release by @bauersimon in #88
Use the most cost-effective model that is still good for usage showcase because Claude Opus is super expensive by @zimmski in #84
Evaluation
Multiple Runs
Support multiple runs in a single evaluation by @bauersimon in #109
Option to execute multiple runs non-interleaved by @bauersimon in #120
fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
Testdata Repository
Use Git to avoid copying the repository on each model run by @Munsio in #114
fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
Tests
Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
Move the error used in the evaluation tests to a variable, to avoid copying it in the test suites by @ruiAzevedo19 in #138
Language Support
fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
fix, Empty model responses should be handled as errors by @Munsio in #97
refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
Models
New Models
Ollama Support
Installation and Update
Ollama tool automated installation by @bauersimon in #95
Ollama tool version check and update if version is outdated by @bauersimon in #118
Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
Provider Integration
Generic OpenAI API provider by @bauersimon in #112
Allow to retry a model when it errors by @ruiAzevedo19 in #125
Clean up query attempt code by @zimmski in #132
Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
Reports
CSV
Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
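The determinism fix in #99 exists because Go randomizes map iteration order, so writing CSV rows straight from a map produces different output on every run. A minimal sketch of sorting the keys first; the column layout is illustrative, not the tool's actual report schema:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// csvRows renders one row per model in a deterministic order. Go randomizes
// map iteration, so the model names are sorted before writing. The columns
// are illustrative only.
func csvRows(scores map[string]uint64) string {
	models := make([]string, 0, len(scores))
	for model := range scores {
		models = append(models, model)
	}
	sort.Strings(models)

	var builder strings.Builder
	builder.WriteString("model,score\n")
	for _, model := range models {
		fmt.Fprintf(&builder, "%s,%d\n", model, scores[model])
	}

	return builder.String()
}

func main() {
	fmt.Print(csvRows(map[string]uint64{"openrouter/gpt-4": 42, "ollama/llama3": 40}))
}
```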
Metrics
Measure processing time of model responses by @bauersimon in #106
Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
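The two new measurements, processing time (#106) and response length in characters (#142), can be captured by wrapping the query itself. A minimal sketch; the struct and field names are assumptions for illustration, not the tool's actual report schema:

```go
package main

import (
	"fmt"
	"time"
	"unicode/utf8"
)

// responseMetrics captures the two new measurements: how long a model took
// to compute a response and how many characters the response contains. The
// names are illustrative, not the tool's actual report schema.
type responseMetrics struct {
	ProcessingTime time.Duration
	Characters     uint64
}

// measure wraps a model query and records both metrics around it.
func measure(query func() string) (string, responseMetrics) {
	start := time.Now()
	response := query()

	return response, responseMetrics{
		ProcessingTime: time.Since(start),
		Characters:     uint64(utf8.RuneCountInString(response)),
	}
}

func main() {
	response, metrics := measure(func() string { return "func TestAdd(t *testing.T) {}" })
	fmt.Println(response, metrics.Characters)
}
```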
Operating Systems
Tasks
More "write test" tasks for Go and Java
Closed
Issues
Closed
#83 Add additional CSV files that sum up: overall, per-language
#91 Integrate Ollama
#92 Empty responses should not be tested but should fail
#98 Non deterministic test output leads to flaky CI Jobs
#101 Unable to run benchmark tasks on Windows due to incorrect directory creation syntax
#105 Measure Model response time
#108 Multiple Runs
#111 Generic OpenAI API provider
#113 Optimize repository handling in multiple runs per model
#116 Preload/Unload Ollama models before prompting
#117 Fixed Ollama version
#119 Multiple runs without interleaving
#123 Give models a retry on error
#128 Track how many characters were present in code part / complete response
#131 Follow-up: Allow to retry a model when it errors
#145 git repository change requires the GPG password
#147 Repository not reset for multiple tasks
#158 Deal with failing tests