tokenization parameter in StopWordsFilter ops #93

Closed
simplew2011 opened this issue Nov 22, 2023 · 1 comment · Fixed by #99

@simplew2011

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code from the main branch and rerun it, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/stopwords_filter.py#L85

  • The tokenization parameter looks meaningless, because only one way of counting stopwords is actually implemented, namely sentencepiece tokenization (the other path simply takes words from an already-tokenized state). See the usage sketch after this list for context.

  • Also, when the condition at line #L83 does not hold and tokenization is set to False, this may cause an error at line #L86.
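
For context, here is a hedged usage sketch of the op under discussion. The class name comes from the issue title and the import path from the file linked above; the lang argument is an assumption and is not verified against the actual API.

```python
# Hypothetical usage sketch; argument names other than `tokenization`
# are assumptions, not verified against the actual API.
from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

# tokenization=False: stopword counting falls back to a whitespace split.
ws_filter = StopWordsFilter(lang='en', tokenization=False)

# tokenization=True: a sentencepiece tokenizer is prepared and used instead,
# which matters for languages (such as Chinese) where whitespace splitting
# does not produce meaningful words.
sp_filter = StopWordsFilter(lang='zh', tokenization=True)
```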

Additional

No response

simplew2011 added the question (Further information is requested) label on Nov 22, 2023
@zhijianma (Collaborator)

Thank you for your interest in and use of Data-Juicer, and for the valuable suggestion.
Through the tokenization setting, the stopwords filter chooses one of two tokenization methods: whitespace or sentencepiece.
When tokenization is False, None is returned at line #L80 and whitespace is used for tokenization.
For Chinese-like text, whitespace tokenization is not very suitable, so tokenization=True is needed in order to tokenize with sentencepiece.
However, when tokenization=False, a redundant prepare_model call does indeed still occur.
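
A minimal sketch of the branching behavior described in this reply (an illustration only, not the actual stopwords_filter.py code; the function count_stopwords and the model file zh.model are made-up names):

```python
def count_stopwords(text, stopwords, tokenizer=None):
    """Sketch of the two tokenization paths described above.

    If a sentencepiece tokenizer was prepared (tokenization=True), it is
    used; otherwise (tokenization=False, tokenizer is None) the text is
    split on whitespace, which works poorly for Chinese-like text.
    """
    if tokenizer is not None:
        words = tokenizer.encode(text, out_type=str)
    else:
        words = text.split()
    return sum(1 for word in words if word in stopwords)


# tokenization=False: no tokenizer is prepared, whitespace split is used.
print(count_stopwords("this is a test", {"is", "a"}))  # -> 2

# tokenization=True would first load a sentencepiece model, e.g.:
#   import sentencepiece as spm
#   tokenizer = spm.SentencePieceProcessor(model_file="zh.model")
#   count_stopwords("这是一个测试", zh_stopwords, tokenizer)
```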
