Language-Model Evaluation #21
Labels:
- downstream: changes code wrapping the core model
- engineering: software-engineering problems that don't require ML expertise
- ML: requires machine-learning knowledge (can be built up on the fly)
At the moment, language-modelling loss is the only signal we have when experimenting with different architectures. Unfortunately, many changes, such as extra-gradient methods, different loss functions, different tokenisers, or even different datasets, shift these loss values dramatically, making direct comparison almost impossible. Integrating a dedicated evaluation pipeline such as EleutherAI's eval-harness would give us confidence that one model is genuinely better than another and would let us compare our models against existing ones such as GPT-J and GPT-3.
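A minimal sketch of what such an integration could look like, assuming the harness exposes a `simple_evaluate` entry point; the backend name, model arguments, and task names below are illustrative and may differ between harness versions and our own checkpoints:

```python
# Sketch: run a fixed suite of downstream tasks through lm-evaluation-harness
# so architecture variants can be compared on task metrics rather than raw loss.
# Model/task arguments here are assumptions, not a finalised configuration.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=EleutherAI/gpt-j-6b",  # swap in one of our checkpoints here
    tasks=["lambada_openai", "hellaswag", "piqa"],
    num_fewshot=0,
    batch_size=8,
)

# Each task reports standard metrics (accuracy, perplexity, ...), which are
# directly comparable across tokenisers, loss functions, and to published
# GPT-J / GPT-3 numbers.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running the same task list after every architectural change would give a stable yardstick that is independent of the training loss scale.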