To save everyone the time spent on data preprocessing, this project open-sources the pretraining corpus that has already been processed with the ChatGLM2-6B tokenizer, totaling 63.4 billion tokens. Link: Baby-llama2-chinese Corpus, extraction code: 6unr. Simply place the downloaded data in the ./data directory.

How was the tokenizer-processed pretraining corpus mentioned above generated?
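For context, below is a minimal sketch of how such a corpus is typically produced (not the project's actual preprocessing script): raw text is encoded with the ChatGLM2-6B tokenizer and the resulting token IDs are concatenated and written out as a flat binary file of uint16 values that the training data loader can read directly. The file names `raw_corpus.jsonl` and `pretrain_data.bin` and the `"text"` field are assumptions for illustration.

```python
# Sketch: tokenize a raw JSONL corpus with the ChatGLM2-6B tokenizer and
# dump the token IDs as a flat uint16 binary file. Paths and field names
# are illustrative, not the repository's actual script.
import json
import numpy as np
from transformers import AutoTokenizer

RAW_FILE = "./data/raw_corpus.jsonl"    # one JSON object per line with a "text" field (assumed)
OUT_FILE = "./data/pretrain_data.bin"   # flat binary of token IDs (assumed name)

# Load the ChatGLM2-6B tokenizer; trust_remote_code is required for its custom tokenizer class.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
eos_id = tokenizer.eos_token_id  # assumes the tokenizer exposes an EOS id; adjust if it does not

all_ids = []
with open(RAW_FILE, "r", encoding="utf-8") as f:
    for line in f:
        text = json.loads(line)["text"]
        # Encode without special tokens, then append EOS as a document separator.
        ids = tokenizer.encode(text, add_special_tokens=False)
        if eos_id is not None:
            ids.append(eos_id)
        all_ids.extend(ids)

# ChatGLM2's vocabulary (~65k) fits in uint16, which keeps the file compact.
np.array(all_ids, dtype=np.uint16).tofile(OUT_FILE)
print(f"wrote {len(all_ids)} tokens to {OUT_FILE}")
```

During training, the .bin file can then be memory-mapped and sliced into fixed-length chunks without re-running tokenization.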