v0.4.0

@guipenedo released this 06 Dec 18:43
· 3 commits to main since this release
842b241

What's Changed

  • Readme nits by @hynky1999 in #280
  • Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files. by @lyuwen in #286
  • Fix languages listify bug by @BramVanroy in #294
  • [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
  • [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
  • FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
    • upgrades to support numpy 2.0
    • added additional word tokenizers and revamped word tokenizer assignment mechanism
    • MinHash optimizations + new rust tool to speed up step3
    • MinHash cluster sizes feature
    • fixed memory leaks from some word tokenizers
    • updated url blocklists
    • added caching to some word tokenization calls
    • GlotLID support
    • general bugfixes

Full Changelog: v0.3.0...v0.4.0