tokenization parameter in StopWordsFilter ops #93

Closed
simplew2011 opened this issue Nov 22, 2023 · 1 comment · Fixed by #99

@simplew2011

Before Asking

  • I have read the README carefully.

  • I have pulled the latest code from the main branch and rerun it, and the problem still exists.

Search before asking

  • I have searched the Data-Juicer issues and found no similar questions.

Question

https://github.com/alibaba/data-juicer/blob/main/data_juicer/ops/filter/stopwords_filter.py#L85

  • The tokenization parameter looks meaningless, because only one way of counting stopwords is actually implemented, namely sentencepiece tokenization (the other path simply takes words from an already-tokenized state). See the usage sketch after this list for context.

  • Also, when the condition at line #L83 does not hold and tokenization is set to False, this may cause an error at line #L86.
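
For context, here is a hedged usage sketch of the op under discussion. The class name comes from the issue title and the import path from the file linked above; the lang argument is an assumption and is not verified against the actual API.

```python
# Hypothetical usage sketch; argument names other than `tokenization`
# are assumptions, not verified against the actual API.
from data_juicer.ops.filter.stopwords_filter import StopWordsFilter

# tokenization=False: stopword counting falls back to a whitespace split.
ws_filter = StopWordsFilter(lang='en', tokenization=False)

# tokenization=True: a sentencepiece tokenizer is prepared and used instead,
# which matters for languages (such as Chinese) where whitespace splitting
# does not produce meaningful words.
sp_filter = StopWordsFilter(lang='zh', tokenization=True)
```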

Additional

No response

simplew2011 added the question (Further information is requested) label on Nov 22, 2023
@zhijianma (Collaborator)

Thank you for your interest in and use of Data-Juicer, and for the valuable suggestion.
Through the tokenization setting, the stopwords filter chooses one of two tokenization methods: whitespace or sentencepiece.
When tokenization is False, None is returned at line #L80 and whitespace is used for tokenization.
For Chinese-like text, whitespace tokenization is not very suitable, so tokenization=True is needed in order to tokenize with sentencepiece.
However, when tokenization=False, a redundant prepare_model call does indeed still occur.
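
A minimal sketch of the branching behavior described in this reply (an illustration only, not the actual stopwords_filter.py code; the function count_stopwords and the model file zh.model are made-up names):

```python
def count_stopwords(text, stopwords, tokenizer=None):
    """Sketch of the two tokenization paths described above.

    If a sentencepiece tokenizer was prepared (tokenization=True), it is
    used; otherwise (tokenization=False, tokenizer is None) the text is
    split on whitespace, which works poorly for Chinese-like text.
    """
    if tokenizer is not None:
        words = tokenizer.encode(text, out_type=str)
    else:
        words = text.split()
    return sum(1 for word in words if word in stopwords)


# tokenization=False: no tokenizer is prepared, whitespace split is used.
print(count_stopwords("this is a test", {"is", "a"}))  # -> 2

# tokenization=True would first load a sentencepiece model, e.g.:
#   import sentencepiece as spm
#   tokenizer = spm.SentencePieceProcessor(model_file="zh.model")
#   count_stopwords("这是一个测试", zh_stopwords, tokenizer)
```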
