Extracted keywords tend to contain English letters #21

Open
hummingg opened this issue Oct 14, 2021 · 4 comments


@hummingg

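Hello!
I used this code to extract keywords from the entire book 《大话数据结构》 and found that most of the resulting keywords contain Latin letters and don't really look like words, as shown in the screenshot below.
How can I improve this?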

[Screenshot: SIFRank keywords]

@sunyilgdx
Owner

Modify the regular expression here, @hummingg
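For readers following along, one way to approximate this outside the repository's own rule is to filter the candidate phrases after extraction. The sketch below is only an illustration: `drop_latin_candidates` and `CJK_ONLY` are hypothetical names, not part of SIFRank_zh, and the real fix may belong in the project's candidate-matching pattern instead.

```python
import re

# Hypothetical post-filter (not taken from the SIFRank_zh source): a candidate
# survives only if every character is a CJK unified ideograph.
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")


def drop_latin_candidates(candidates, min_len=2):
    """Keep candidates that are pure Chinese and at least `min_len` characters."""
    return [
        phrase
        for phrase in candidates
        if len(phrase) >= min_len and CJK_ONLY.fullmatch(phrase)
    ]


print(drop_latin_candidates(["无向图", "p指针", "a", "有向图"]))
```

A filter this strict also discards legitimate mixed terms such as "B树", so it may need loosening depending on the book.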

@hummingg
Author

hummingg commented Oct 17, 2021

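The problem seems to be caused by THULAC segmentation errors: it breaks down whenever it runs into English. The Tsinghua segmentation model also doesn't seem to support custom user dictionaries very well.
I'm planning to replace THULAC with jieba. Would that work?
无向图有向图 (undirected graph, directed graph)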

@sunyilgdx
Owner

It depends heavily on the word segmentation system and on the regex matching rules.
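How the two interact can be sketched with a small tag-pattern chunker. This is a minimal illustration, not the project's implementation: `chunk_by_tags`, the `NOUN_PHRASE` pattern, and the tag names are all assumptions. It takes `(word, tag)` pairs from any segmenter and keeps runs whose tag sequence matches the pattern, so a Latin-letter token stays out of the phrase only if the segmenter splits it off and gives it its own tag.

```python
import re

# Illustrative pattern over POS tags: optional nouns/adjectives, ending in a noun.
NOUN_PHRASE = r"(?:<(?:n\w*|a)>)*<n\w*>"


def chunk_by_tags(tagged, pattern=NOUN_PHRASE):
    """Return phrases whose POS-tag sequence matches `pattern`.

    `tagged` is a list of (word, tag) pairs from any segmenter.
    """
    offsets = {0: 0}  # character offset in the tag string -> token index
    pieces = []
    pos = 0
    for i, (_, tag) in enumerate(tagged, start=1):
        pieces.append(f"<{tag}>")
        pos += len(tag) + 2
        offsets[pos] = i
    tag_string = "".join(pieces)

    phrases = []
    for m in re.finditer(pattern, tag_string):
        first, last = offsets[m.start()], offsets[m.end()]
        phrases.append("".join(word for word, _ in tagged[first:last]))
    return phrases


# Hand-written segmenter output for illustration; the tags are assumed.
# With jieba installed, pairs could come from:
#   [(p.word, p.flag) for p in jieba.posseg.cut(text)]
tagged = [("线性", "n"), ("表", "n"), ("的", "u"), ("p", "eng"), ("指针", "n")]
print(chunk_by_tags(tagged))
```

If the segmenter instead glued "p" onto "指针" as a single noun token, no tag pattern could separate them, which is why the segmenter choice matters as much as the rule.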

@1sebsgithub1

Hello, how can I extract keywords from an entire book?
