
v0.5.0

Released by @bauersimon on 06 Jun 07:18 · efe1ea3

Highlights 🌟

  • Ollama πŸ¦™ support
  • Mac and Windows πŸ–₯️ support
  • Support for any inference endpoint 🧠 as long as it implements the OpenAI inference API (see the sketch after this list)
  • More complex "write test" task cases 🔒 for both Java and Go
  • Evaluation now measures the processing time ⏱️ a model takes to compute a response
  • Evaluation now counts the number of characters 💬 in model responses, to give an idea of which models give brief, efficient responses
  • Multiple runs πŸƒ built right into the evaluation tool
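
Since Ollama and many other inference servers expose an OpenAI-compatible chat-completions endpoint, one generic HTTP client can talk to all of them. Below is a minimal sketch of such a query in Go, assuming a local Ollama server on its default port; the model name and prompt are placeholders, and this is not the evaluation's actual provider code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Minimal request/response shapes of the OpenAI chat-completions API.
type chatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model    string        `json:"model"`
	Messages []chatMessage `json:"messages"`
}

type chatResponse struct {
	Choices []struct {
		Message chatMessage `json:"message"`
	} `json:"choices"`
}

func main() {
	// Any endpoint that implements the OpenAI inference API works here;
	// a local Ollama server is just one example (assumed default port).
	baseURL := "http://localhost:11434/v1"

	payload, err := json.Marshal(chatRequest{
		Model:    "llama3", // placeholder model name
		Messages: []chatMessage{{Role: "user", Content: "Write a unit test for this function: ..."}},
	})
	if err != nil {
		panic(err)
	}

	response, err := http.Post(baseURL+"/chat/completions", "application/json", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	var result chatResponse
	if err := json.NewDecoder(response.Body).Decode(&result); err != nil {
		panic(err)
	}
	if len(result.Choices) > 0 {
		fmt.Println(result.Choices[0].Message.Content)
	}
}
```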

Pull Requests

Merged

  • Development πŸ› οΈ
    • CI
      • fix, Let Windows CI error on failures, and inject API tokens only once per provider by @zimmski in #104
      • Cancel previous runs of the CI when a new push happened to a PR by @Munsio in #140
  • Documentation πŸ“š
    • Readme
      • Reduce fluff of intro to a minimum, and add summary CSVs for the v0.4.0 results by @zimmski in #86
      • New section explaining the evaluation, and its tasks and cases by @bauersimon in #87
      • Update according to newest blog post release by @bauersimon in #88
      • Use the most cost-effective model that is still good for the usage showcase, because Claude Opus is super expensive by @zimmski in #84
  • Evaluation ⏱️
    • Multiple Runs
      • Support multiple runs in a single evaluation by @bauersimon in #109
      • Option to execute multiple runs non-interleaved by @bauersimon in #120 (the two orderings are sketched after this list)
      • fix, Do not cancel successive runs if previous runs had problems by @bauersimon in #129
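
Whether runs are interleaved is a question of loop nesting. A sketch of the two orderings, with hypothetical evaluate and models names rather than the evaluation tool's real interface:

```go
package main

import "fmt"

// evaluate is a stand-in for performing one evaluation run of one model.
func evaluate(model string, run int) { fmt.Printf("%s: run %d\n", model, run) }

func main() {
	models := []string{"model-a", "model-b"}
	const runs = 2

	// Interleaved: every model finishes run 1 before any model starts run 2,
	// so intermediate results already compare all models.
	for run := 1; run <= runs; run++ {
		for _, model := range models {
			evaluate(model, run)
		}
	}

	// Non-interleaved: each model performs all of its runs back to back,
	// e.g. so a locally hosted model needs to be loaded only once.
	for _, model := range models {
		for run := 1; run <= runs; run++ {
			evaluate(model, run)
		}
	}
}
```
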
    • Testdata Repository
      • Use Git to avoid copying the repository on each model run by @Munsio in #114 (see the sketch after this list)
      • fix, Use empty Git config in temporary repositories to not inherit any user configuration by @bauersimon in #146
      • fix, Reset repository per task to not bleed task results into subsequent tasks by @bauersimon in #148
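
Resetting a Git working tree between tasks is much cheaper than copying the repository for every model run, and pointing Git at an empty configuration keeps user settings such as GPG signing (#145) from leaking in. A sketch of the underlying mechanics, assuming Git ≥ 2.32 for the GIT_CONFIG_GLOBAL variable; the helper names and the repository path are made up:

```go
package main

import (
	"os"
	"os/exec"
)

// runGit executes a Git command with an empty configuration so that no
// user or system settings (e.g. commit signing) are inherited.
func runGit(repositoryPath string, args ...string) error {
	cmd := exec.Command("git", args...)
	cmd.Dir = repositoryPath
	cmd.Env = append(os.Environ(),
		"GIT_CONFIG_GLOBAL="+os.DevNull, // ignore the user's ~/.gitconfig
		"GIT_CONFIG_SYSTEM="+os.DevNull, // ignore the system-wide gitconfig
	)
	return cmd.Run()
}

// resetRepository discards everything a task changed in the working tree,
// so results cannot bleed into subsequent tasks.
func resetRepository(repositoryPath string) error {
	if err := runGit(repositoryPath, "reset", "--hard"); err != nil {
		return err
	}
	// Also remove untracked files, e.g. generated test files.
	return runGit(repositoryPath, "clean", "-df")
}

func main() {
	if err := resetRepository("testdata/java/plain"); err != nil { // hypothetical path
		panic(err)
	}
}
```
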
    • Tests
      • Remove the need to change the provider registry in tests to make test code concurrency-safe by @ruiAzevedo19 in #137
      • Move the error used in the evaluation tests to a variable, to avoid copying it in the test suites by @ruiAzevedo19 in #138
    • Language Support
      • fix, Java test file path needs to be OS-aware by @Munsio and @ruiAzevedo19 in #155 (see the sketch after this list)
      • Require at least symflower v36800 as it fixes Java coverage extraction in examples with exceptions by @bauersimon in #14
      • fix, Do not ignore Go coverage count if there are failing tests by @ahumenberger in #161
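
The pitfall behind #155 is the classic hard-coded path separator. Building paths with filepath.Join keeps them OS-aware; a minimal sketch with a made-up Java test file path:

```go
package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Hard-coding "/" breaks on Windows; filepath.Join uses the operating
	// system's separator ("\" on Windows, "/" everywhere else).
	testFilePath := filepath.Join("src", "test", "java", "com", "example", "ExampleTest.java")
	fmt.Println(testFilePath)
}
```
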
    • fix, Empty model responses should be handled as errors by @Munsio in #97
    • refactor, Move evaluation logic into the evaluation package for separation of concerns by @zimmski in #136
  • Models πŸ€–
    • New Models
      • Ollama Support
        • Installation and Update
          • Ollama tool automated installation by @bauersimon in #95
          • Ollama tool version check and update if version is outdated by @bauersimon in #118
          • Update Ollama to 0.1.41 to have all the latest Windows fixes by @bauersimon in #154
      • Generic OpenAI API provider by @bauersimon in #112
    • Allow to retry a model when it errors by @ruiAzevedo19 in #125 (see the retry sketch after this list)
    • Clean up query attempt code by @zimmski in #132
    • Explicitly check the interface that is setting the query attempts, to ensure the model implements all its methods by @ruiAzevedo19 in #139
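
Conceptually, the retry behavior is a small wrapper around the model query. The sketch below uses placeholder names rather than the evaluation's real model interface:

```go
package main

import (
	"errors"
	"fmt"
)

// queryWithRetries re-queries a model up to "attempts" times and returns the
// first successful response, or the last error if every attempt failed.
func queryWithRetries(query func() (string, error), attempts int) (string, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		response, err := query()
		if err == nil {
			return response, nil
		}
		lastErr = err
	}
	return "", lastErr
}

func main() {
	flaky := func() (string, error) { return "", errors.New("empty response") }
	if _, err := queryWithRetries(flaky, 3); err != nil {
		fmt.Println("model failed after retries:", err)
	}
}
```
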
  • Reports πŸ—’οΈ
    • CSV
      • Replace model-dependent evaluation result with report file since that contains all the evaluation information by @bauersimon in #85
      • Additional CSVs to sum up metrics for all models overall and per language by @Munsio in #94
      • fix, Sort map by model before creating the CSV output to be deterministic by @Munsio in #99 (see the sketch below)
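
Go randomizes map iteration order, so writing a CSV straight from a map yields non-deterministic files and flaky CI jobs (#98). Sorting the keys first makes the output stable. A minimal sketch with made-up columns:

```go
package main

import (
	"encoding/csv"
	"os"
	"sort"
	"strconv"
)

func main() {
	scores := map[string]uint64{
		"openrouter/model-b": 10,
		"ollama/model-a":     12,
	}

	// Collect and sort the model names so the rows always come out in the
	// same order, no matter how the map happens to iterate.
	models := make([]string, 0, len(scores))
	for model := range scores {
		models = append(models, model)
	}
	sort.Strings(models)

	writer := csv.NewWriter(os.Stdout)
	defer writer.Flush()
	_ = writer.Write([]string{"model", "score"})
	for _, model := range models {
		_ = writer.Write([]string{model, strconv.FormatUint(scores[model], 10)})
	}
}
```
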
    • Metrics
      • Measure processing time of model responses by @bauersimon in #106
      • Measure how many characters were present in a model response and in generated test files by @ruiAzevedo19 in #142 (both metrics are sketched after this list)
      • Make sure to use uint64 consistently for metrics and scoring, and allow more task cases by always working on a clean repository by @zimmski in #133
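
Both new metrics can be collected right at the query boundary: time the call, then count the characters of the response. A sketch, not the evaluation's actual instrumentation:

```go
package main

import (
	"fmt"
	"time"
	"unicode/utf8"
)

// measure wraps a model query and records the two new metrics:
// processing time and response length in characters.
func measure(query func() string) (response string, took time.Duration, characters int) {
	start := time.Now()
	response = query()
	took = time.Since(start)
	characters = utf8.RuneCountInString(response) // characters, not bytes
	return response, took, characters
}

func main() {
	response, took, characters := measure(func() string {
		time.Sleep(10 * time.Millisecond) // stand-in for an actual model call
		return "func TestExample(t *testing.T) {}"
	})
	fmt.Printf("%q took %v and contains %d characters\n", response, took, characters)
}
```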

Issues

Closed

  • #83 Add additional CSV files that sum up: overall, per-language
  • #91 Integrate Ollama
  • #92 Empty responses should not be tested but should fail
  • #98 Non-deterministic test output leads to flaky CI jobs
  • #101 Unable to run benchmark tasks on Windows due to incorrect directory creation syntax
  • #105 Measure Model response time
  • #108 Multiple Runs
  • #111 Generic OpenAI API provider
  • #113 Optimize repository handling in multiple runs per model
  • #116 Preload/Unload Ollama models before prompting
  • #117 Fixed Ollama version
  • #119 Multiple runs without interleaving
  • #123 Give models a retry on error
  • #128 Track how many characters were present in code part / complete response
  • #131 Follow-up: Allow to retry a model when it errors
  • #145 Git repository change requires the GPG password
  • #147 Repository not reset for multiple tasks
  • #158 Deal with failing tests