Releases
v0.5.0
Highlights
Ollama support
Mac and Windows support
Support for any inference endpoint, as long as it implements the OpenAI inference API
More complex "write test" task cases for both Java and Go
Evaluation now measures the processing time it takes a model to compute a response
Evaluation now counts the number of characters in model responses to indicate which models give brief, efficient responses
Multiple runs built right into the evaluation tool
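Any endpoint that speaks the OpenAI inference API can now be evaluated, which includes a local Ollama instance. A minimal sketch of what such a request looks like, assuming an Ollama-style base URL and model name (both placeholders, not the evaluation tool's actual configuration):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// chatRequest mirrors the minimal OpenAI chat completions payload.
type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// newChatCompletionRequest builds a request against any endpoint that
// implements the OpenAI inference API. The base URL and model name are
// assumptions for illustration; an Ollama instance, for example, serves
// this API locally.
func newChatCompletionRequest(baseURL, model, prompt string) (*http.Request, error) {
	payload, err := json.Marshal(chatRequest{
		Model: model,
		Messages: []chatMessage{
			{Role: "user", Content: prompt},
		},
	})
	if err != nil {
		return nil, err
	}
	request, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	request.Header.Set("Content-Type", "application/json")

	return request, nil
}

func main() {
	request, err := newChatCompletionRequest("http://localhost:11434", "llama3", "Write a test for this function.")
	if err != nil {
		panic(err)
	}
	fmt.Println(request.Method, request.URL)
}
```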
Pull Requests
Merged
Development
CI
fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
Cancel previous runs of the CI when a new push happens to a PR by @Munsio in #140
Tooling
Documentation
Readme
Reduce fluff of the intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
Update according to newest blog post release by @bauersimon in #88
Use the most cost-effective model that is still good for usage showcase because Claude Opus is super expensive by @zimmski in #84
Evaluation
Multiple Runs
Support multiple runs in a single evaluation by @bauersimon in #109
Option to execute multiple runs non-interleaved by @bauersimon in #120
fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
Testdata Repository
Use Git to avoid copying the repository on each model run by @Munsio in #114
fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
Tests
Remove the need to change the provider registry in tests to make test code concurrency safe by @ruiAzevedo19 in #137
Move the error used in the evaluation tests to a variable, to avoid copying it in the test suites by @ruiAzevedo19 in #138
Language Support
fix, Java test file path needs to be OS aware by @Munsio and @ruiAzevedo19 in #155
Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
fix, Empty model responses should be handled as errors by @Munsio in #97
refactor, Move evaluation logic into evaluation package for isolation of concern by @zimmski in #136
Models
New Models
Ollama Support
Installation and Update
Ollama tool automated installation by @bauersimon in #95
Ollama tool version check and update if version is outdated by @bauersimon in #118
Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
Provider Integration
Generic OpenAI API provider by @bauersimon in #112
Allow to retry a model when it errors by @ruiAzevedo19 in #125
Clean up query attempt code by @zimmski in #132
Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
Reports
CSV
Replace model dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99
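The determinism fix in #99 exists because Go randomizes map iteration order, so writing CSV rows straight from a map produces different output on every run. A minimal sketch of sorting the keys first; the column layout is illustrative, not the tool's actual report schema:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// csvRows renders one row per model in a deterministic order. Go randomizes
// map iteration, so the model names are sorted before writing. The columns
// are illustrative only.
func csvRows(scores map[string]uint64) string {
	models := make([]string, 0, len(scores))
	for model := range scores {
		models = append(models, model)
	}
	sort.Strings(models)

	var builder strings.Builder
	builder.WriteString("model,score\n")
	for _, model := range models {
		fmt.Fprintf(&builder, "%s,%d\n", model, scores[model])
	}

	return builder.String()
}

func main() {
	fmt.Print(csvRows(map[string]uint64{"openrouter/gpt-4": 42, "ollama/llama3": 40}))
}
```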
Metrics
Measure processing time of model responses by @bauersimon in #106
Measure how many characters were present in a model response and generated test files by @ruiAzevedo19 in #142
Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
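The two new measurements, processing time (#106) and response length in characters (#142), can be captured by wrapping the query itself. A minimal sketch; the struct and field names are assumptions for illustration, not the tool's actual report schema:

```go
package main

import (
	"fmt"
	"time"
	"unicode/utf8"
)

// responseMetrics captures the two new measurements: how long a model took
// to compute a response and how many characters the response contains. The
// names are illustrative, not the tool's actual report schema.
type responseMetrics struct {
	ProcessingTime time.Duration
	Characters     uint64
}

// measure wraps a model query and records both metrics around it.
func measure(query func() string) (string, responseMetrics) {
	start := time.Now()
	response := query()

	return response, responseMetrics{
		ProcessingTime: time.Since(start),
		Characters:     uint64(utf8.RuneCountInString(response)),
	}
}

func main() {
	response, metrics := measure(func() string { return "func TestAdd(t *testing.T) {}" })
	fmt.Println(response, metrics.Characters)
}
```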
Operating Systems
Tasks
More "write test" tasks for Go and Java
Closed
Issues
Closed
#83 Add additional CSV files that sum up: overall, per-language
#91 Integrate Ollama
#92 Empty responses should not be tested but should fail
#98 Non deterministic test output leads to flaky CI Jobs
#101 Unable to run benchmark tasks on Windows due to incorrect directory creation syntax
#105 Measure Model response time
#108 Multiple Runs
#111 Generic OpenAI API provider
#113 Optimize repository handling in multiple runs per model
#116 Preload/Unload Ollama models before prompting
#117 Fixed Ollama version
#119 Multiple runs without interleaving
#123 Give models a retry on error
#128 Track how many characters were present in code part / complete response
#131 Follow-up: Allow to retry a model when it errors
#145 git repository change requires the GPG password
#147 Repository not reset for multiple tasks
#158 Deal with failing tests