
Which code generates the tokenizer-processed pretraining corpus? #84

Open
livevivaer opened this issue Sep 9, 2024 · 0 comments

@livevivaer

To save everyone the time of data preprocessing, this project open-sources the pretraining corpus already processed by the ChatGLM2-6B tokenizer, totaling 63.4 billion tokens. Link: Baby-llama2-chinese Corpus, extraction code: 6unr. Just place the downloaded data in the ./data directory.

How was the tokenizer-processed pretraining corpus mentioned above generated? Which script produces it?
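Presumably the pipeline looks roughly like the sketch below, i.e. encoding every document with the ChatGLM2-6B tokenizer and writing the token IDs out as one flat binary file, but I'd like to confirm which script actually produced the released corpus. (The jsonl input format, the "text" field, and the file names here are my assumptions, not necessarily the project's actual layout.)

```python
import json
import numpy as np
from transformers import AutoTokenizer

# ChatGLM2-6B ships a custom tokenizer, so trust_remote_code is required.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

def tokenize_corpus(in_path: str, out_path: str) -> None:
    """Tokenize a jsonl file (one {"text": ...} object per line)
    into a flat binary file of token IDs."""
    ids = []
    eos_id = tokenizer.eos_token_id  # used as a document separator, if the tokenizer defines one
    with open(in_path, "r", encoding="utf-8") as f:
        for line in f:
            text = json.loads(line)["text"]
            ids.extend(tokenizer.encode(text, add_special_tokens=False))
            if eos_id is not None:
                ids.append(eos_id)
    # ChatGLM2-6B's vocab (~65k entries) fits in uint16, halving file size vs. int32.
    np.array(ids, dtype=np.uint16).tofile(out_path)

tokenize_corpus("corpus.jsonl", "corpus.bin")  # hypothetical file names
```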
