Extracted keywords tend to contain English letters #21

Open

hummingg opened this issue Oct 14, 2021 · 4 comments

hummingg commented Oct 14, 2021

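Hi!
I ran this code on the full text of the book 《大话数据结构》 and found that most of the extracted keywords contain Latin letters and do not really look like words, as shown in the screenshot below.
How can I improve this?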

Owner

sunyilgdx commented Oct 15, 2021

@hummingg Try modifying the regular expression here.
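As one possible illustration of a regex-based fix, the sketch below filters out candidate phrases that contain ASCII letters after extraction. It is only an assumption about what such a change could look like: `drop_latin_candidates` and the sample candidates are hypothetical and are not taken from this project's code, and the regular expression the maintainer points to may be a different one.

```python
import re

# Hypothetical post-filter, not part of the project: matches any ASCII letter.
HAS_LATIN = re.compile(r"[A-Za-z]")


def drop_latin_candidates(candidates):
    """Keep only candidate phrases that contain no ASCII letters."""
    return [c for c in candidates if not HAS_LATIN.search(c)]


# Made-up candidates for illustration only.
candidates = ["数据结构", "链表L", "int型", "二叉树"]
print(drop_latin_candidates(candidates))  # ['数据结构', '二叉树']
```

A filter like this hides the symptom rather than fixing the segmentation, so it would also discard legitimate mixed terms that happen to include letters.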

Author

hummingg commented Oct 17, 2021 (edited)

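The problem seems to come from THULAC segmentation errors: it breaks down whenever it runs into English text. Tsinghua's segmentation model also does not seem to support custom user dictionaries very well.
I'm planning to replace THULAC with jieba. Would that work?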
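A minimal sketch of what segmenting and tagging with jieba could look like, assuming jieba is installed. The helper names `tag_with_jieba` and `keep_noun_words` are hypothetical and not part of this project; only `jieba.add_word` and `jieba.posseg.cut` are jieba's own calls.

```python
def tag_with_jieba(text, user_words=()):
    """Segment and POS-tag text with jieba; returns a list of (word, tag) pairs."""
    # Imported lazily so keep_noun_words can be used without jieba installed.
    import jieba
    import jieba.posseg as pseg

    for word in user_words:
        # Register domain terms as nouns so they are not split apart.
        jieba.add_word(word, tag="n")
    return [(pair.word, pair.flag) for pair in pseg.cut(text)]


def keep_noun_words(tagged):
    """Keep words whose tag starts with 'n' (jieba's noun family)."""
    return [word for word, tag in tagged if tag.startswith("n")]
```

Usage would be along the lines of `keep_noun_words(tag_with_jieba(text, user_words=["二叉树"]))`. jieba's tag set differs from THULAC's, so any tag-based matching rules in the extraction pipeline would probably need to be adjusted to the new tags.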

Owner

sunyilgdx commented Oct 17, 2021

The results depend heavily on both the word segmenter and the regex matching rules.

1sebsgithub1 commented Nov 14, 2022

Hi, how can I extract keywords from an entire book?
