Before Asking

I have read the README carefully.
I have pulled the latest code of the main branch and run it again, and the problem still exists.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/stopwords_filter.py#L85

The tokenization parameter looks meaningless, because only one way of counting stopwords is actually implemented, namely sentencepiece tokenization (the other path just reuses words that were already tokenized and cached). Moreover, when the condition at #L83 does not hold and tokenization is set to False, an error may be raised at #L86.
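As a rough illustration, the snippet below sketches the branching being discussed in plain Python. It is a minimal, self-contained sketch, not the actual Data-Juicer implementation; apart from the tokenization flag, every name (the toy stopword list, the fake tokenizer, the helper function) is made up for the example.

```python
# Sketch of the described control flow around stopwords_filter.py#L80-#L86.
# NOT the real Data-Juicer code; names other than `tokenization` are illustrative.

STOPWORDS = {"the", "a", "is", "的", "了"}  # toy stopword list for illustration


def fake_sentencepiece_tokenize(text: str):
    # Stand-in for a real sentencepiece model, so the example stays runnable
    # without extra dependencies.
    return text.lower().split()


def stopwords_ratio(text: str, tokenization: bool, context_words=None) -> float:
    # Roughly #L80 as described: a tokenizer is only used when
    # tokenization=True; otherwise whitespace splitting is the fallback.
    tokenizer = fake_sentencepiece_tokenize if tokenization else None

    # Roughly #L83 as described: reuse words already tokenized and cached
    # in the sample's context, if available.
    if context_words is not None:
        words = context_words
    else:
        # Roughly #L85-#L86 as described: tokenize now; with
        # tokenization=False the tokenizer is None and plain split() is used.
        words = tokenizer(text) if tokenizer else text.split()

    if not words:
        return 0.0
    return sum(1 for w in words if w in STOPWORDS) / len(words)


print(stopwords_ratio("The cat is on the mat", tokenization=False))  # ≈ 0.33
```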
Additional

No response
Thank you for following and using Data-Juicer, and for your valuable suggestion. By setting tokenization, the stopwords filter chooses one of two tokenization modes: whitespace or sentencepiece. When tokenization is False, #L80 returns None and whitespace splitting is used. For Chinese-like text, whitespace splitting is not very suitable, so tokenization=True is needed to tokenize with sentencepiece. However, when tokenization=False, a redundant prepare_model call does indeed still happen.
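Based on the reply above, a hedged usage sketch: enable sentencepiece tokenization for Chinese-like text, and leave it off for space-delimited text. The lang parameter and the exact constructor signature are assumptions here and should be checked against the current stopwords_filter.py.

```python
# Usage sketch, assuming StopWordsFilter accepts `lang` and `tokenization`
# keyword arguments; verify against the current stopwords_filter.py signature.
from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

# tokenization=True -> a sentencepiece tokenizer is prepared and used,
# which is the suitable mode for Chinese-like text.
op_zh = StopWordsFilter(lang='zh', tokenization=True)

# tokenization=False -> whitespace splitting; fine for space-delimited
# languages such as English, though per the reply it currently still
# triggers one redundant prepare_model call.
op_en = StopWordsFilter(lang='en', tokenization=False)
```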