Why do I get different results when I use CoreNLP in Python and Java? #1364

xy1137030414 · 2024-03-07T13:59:20Z

xy1137030414
Mar 7, 2024

The tokenization and POS tagging results for Chinese are very different. Java is much better than Python. Why?

AngledLuffa · 2024-03-07T18:19:05Z

AngledLuffa
Mar 7, 2024
Maintainer

Would you be more specific please? Examples would help

3 replies

xy1137030414 Mar 8, 2024
Author

When I use Java:
金湖/NR 荷花荡/NN ——/PU 国家/NN AAAA/NR 级/NN 旅游/NN 景区/NN 。/PU
地处/VV 淮安市/NR 金湖县/NR 东南/NN ，/PU 美丽/JJ 高邮/NN 湖畔/NN ，/PU 三/CD 面/M 环湖/NR 相拥/VV ，/PU 生态/NN 环境/NN 优美/VA ，/PU 总/JJ 面积/NN 1.2万/CD 亩/M ，/PU 是/VC 集/VV 农业/NN 观光/NN 、/PU 休闲/NN 度假/NN 、/PU 科普/NN 教育/NN 、/PU 健康/NN 疗养/VV 为/VC 一体/NN 的/DEG 综合性/JJ 生态/NN 旅游/NN 景点/NN 。/PU
But when I use Stanza:
金湖/PROPN 荷花/PROPN 荡/PART —/PUNCT —/PUNCT 国家/PROPN AAAA/X 级/PART 旅游/NOUN 景区/NOUN 。/PUNCT /n地/NOUN 处/VERB 淮安/PROPN 市/PART 金湖/PROPN 县/PART 东南/NOUN ，/PUNCT 美丽/PROPN 高邮/PROPN 湖畔/NOUN ，/PUNCT 三/NUM 面/NOUN 环湖/PROPN 相拥/VERB ，/PUNCT 生态/NOUN 环境/NOUN 优美/ADJ ，/PUNCT 总/PART 面积/NOUN 1.2万/NUM 亩/NOUN ，/PUNCT 是/AUX 集/VERB 农业/NOUN 观光/NOUN 、/PUNCT 休闲/NOUN 度假/NOUN 、/PUNCT 科普/NOUN 教育/NOUN 、/PUNCT 健康/NOUN 疗养/NOUN 为/AUX 一/NUM 体/NOUN 的/PART 综合/NOUN 性/PART 生态/NOUN 旅游/NOUN 景点/NOUN

Thank you!

xy1137030414 Mar 8, 2024
Author

I think Java recognizes the named entity correctly. For example, Java gives 荷花荡/NN, but Python gives 荷花/PROPN 荡/PART.

AngledLuffa Mar 8, 2024
Maintainer

Would you write out the input text? Trying to figure it out from this is just making my life more difficult, and I'm really getting tired of chasing down all these little issues

AngledLuffa · 2024-03-08T04:27:57Z

AngledLuffa
Mar 8, 2024
Maintainer

Well, the simple answer is, they are different models trained on different data. CoreNLP's segmenter is trained on a much larger treebank, 50K sentences, whereas the Stanza segmenter is only trained on 4K sentences. Ideally they'd be mixed together, but unfortunately the tokenization standards between the UD treebank (Stanza) and Chinese Treebank (CoreNLP) are different enough that it would actually hurt performance.
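To see where the two segmentation standards disagree, the tagged outputs posted earlier in the thread can be compared by character span. This is only a rough sketch: `parse_tagged` and `spans` are hypothetical helper names, and it assumes both outputs cover exactly the same characters once the tokens are joined.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def spans(pairs):
    """Return the (start, end) character offsets of each token."""
    out, pos = [], 0
    for word, _ in pairs:
        out.append((pos, pos + len(word)))
        pos += len(word)
    return out


# First sentence of each output, copied from the thread.
corenlp = "金湖/NR 荷花荡/NN ——/PU 国家/NN AAAA/NR 级/NN 旅游/NN 景区/NN 。/PU"
stanza_out = ("金湖/PROPN 荷花/PROPN 荡/PART —/PUNCT —/PUNCT 国家/PROPN "
              "AAAA/X 级/PART 旅游/NOUN 景区/NOUN 。/PUNCT")

a, b = parse_tagged(corenlp), parse_tagged(stanza_out)
text = "".join(word for word, _ in a)

# Tokens that CoreNLP keeps whole but Stanza splits differently.
only_corenlp = sorted(set(spans(a)) - set(spans(b)))
diff = [text[start:end] for start, end in only_corenlp]
print(diff)
```

The tags themselves are not compared here, because the two tools use different tag sets (Chinese Treebank tags versus UD UPOS tags).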

On Thu, Mar 7, 2024 at 7:45 PM Xu Yue wrote:

> Hey! I am Yue. Sorry, I forgot to give you the original text. Here is the input text:
>
> 金湖荷花荡――国家AAAA级旅游景区。地处淮安市金湖县东南，美丽高邮湖畔，三面环湖相拥，生态环境优美，总面积1.2万亩，是集农业观光、休闲度假、科普教育、健康疗养为一体的综合性生态旅游景点。拥有“国家水利风景区”、“全国农业旅游示范点”、“江苏省对台交流基地”、“全国科普教育基地”、“淮安高层次人才疗养基地”、“江苏省放心消费示范街区”等众多荣誉称号。盛夏时节，景区内碧叶连天、荷花万顷，完美诠释了宋代大诗人杨万里“接天莲叶无穷碧、映日荷花别样红”的胜景。为致敬奋斗在“抗疫”一线的医务工作者，景区自恢复运营之日起至2020年12月31日对全国医护工作者门票免费。

0 replies

AngledLuffa · 2024-03-08T05:55:29Z

AngledLuffa
Mar 8, 2024
Maintainer

Hang on, the tokenization & tagging you were getting earlier, was that from using the Stanza models or from accessing the CoreNLP models through python in some manner? There is a CoreNLP client for Stanza which connects to the Java version of the software already. You don't need to write your own: https://stanfordnlp.github.io/stanza/corenlp_client.html
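For reference, a minimal sketch of what using that client might look like for Chinese tokenization and POS tagging. It assumes Java is installed, `CORENLP_HOME` points at a CoreNLP download that includes the Chinese models jar, and that passing `properties="chinese"` selects the Chinese defaults. The function names, timeout, and memory values are illustrative choices, not something stated in this thread.

```python
def format_tokens(pairs):
    """Render (word, pos) pairs in the same 'word/POS' style used above."""
    return " ".join(f"{word}/{pos}" for word, pos in pairs)


def tag_with_corenlp(text):
    """Tokenize and tag `text` with the Java CoreNLP server via Stanza's client."""
    # Imported lazily so format_tokens can be used without stanza installed.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos"],
        properties="chinese",   # assumption: Chinese models are available
        timeout=30000,          # milliseconds, illustrative value
        memory="4G",            # illustrative value
    ) as client:
        ann = client.annotate(text)
        return [(tok.word, tok.pos)
                for sent in ann.sentence
                for tok in sent.token]


if __name__ == "__main__":
    print(format_tokens(tag_with_corenlp("金湖荷花荡――国家AAAA级旅游景区。")))
```

Because the client starts the Java server in the background, the output should follow the CoreNLP (Chinese Treebank) conventions rather than the Stanza UD ones.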

0 replies

AngledLuffa · 2024-03-08T06:19:19Z

AngledLuffa
Mar 8, 2024
Maintainer

When did that error happen? There should be a stack trace telling you. It might have happened either downloading the model descriptions from github or the models themselves from HuggingFace. Since you are clearly able to post to github, my first guess would be HuggingFace, but we'd need the whole traceback to be sure
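If the console only shows the last line of the error, one way to capture the whole traceback is to wrap the failing call. This is a generic sketch: `run_with_traceback` is a hypothetical helper name, not part of Stanza.

```python
import traceback


def run_with_traceback(fn, *args, **kwargs):
    """Call fn and return (result, None), or (None, full traceback text) on failure."""
    try:
        return fn(*args, **kwargs), None
    except Exception:
        return None, traceback.format_exc()


# Example usage (requires stanza to be installed):
#   import stanza
#   nlp, tb = run_with_traceback(stanza.Pipeline, "zh")
#   if tb is not None:
#       print(tb)  # paste this whole text into the discussion
```

The full text shows which file and function raised the error, which is what distinguishes a failed download of the resources file from a failed download of the models.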

On Thu, Mar 7, 2024 at 10:03 PM Xu Yue wrote:

> ```python
> import os
> import stanza
>
> # Initialize the Chinese pipeline
> nlp = stanza.Pipeline('zh')
>
> # Input and output directories
> input_dir = 'F:/path'
> output_dir = 'F:path'
>
> # Create the output directory if it does not exist
> if not os.path.exists(output_dir):
>     os.makedirs(output_dir)
>
> # Walk over every file in the input directory
> for filename in os.listdir(input_dir):
>     # Only process text files
>     if filename.endswith('.txt'):
>         # Read the file contents
>         with open(os.path.join(input_dir, filename), 'r', encoding='utf-8') as f:
>             text = f.read()
>
>         # Tokenize and POS-tag with Stanza
>         doc = nlp(text)
>
>         # Format the result as word/UPOS pairs
>         result = ' '.join([f'{word.text}/{word.upos}' for sent in doc.sentences for word in sent.words])
>
>         # Write the result to a new file
>         with open(os.path.join(output_dir, filename), 'w', encoding='utf-8') as f:
>             f.write(result)
> ```
>
> Hi, I use the Stanza models in Python, and here is my code. But now my Python code also fails to run, and I do not know why. The error is: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or the connected host failed to respond.'))
>
> Thank you, I will use the CoreNLP client.

0 replies

AngledLuffa · 2024-03-08T06:36:15Z

AngledLuffa
Mar 8, 2024
Maintainer

Ah, I see. Do you have java in your path? Do you have the CoreNLP distribution in your java classpath? There's also a system variable expected with the CoreNLP download path, $CORENLP_HOME. It's explained here: https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation
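A quick way to check the first and last of those points from Python before starting the client. This is a sketch only: `check_corenlp_setup` is a hypothetical helper, and it does not verify that the CoreNLP jars inside the directory are actually on the classpath.

```python
import os
import shutil


def check_corenlp_setup(env=None, which=shutil.which):
    """Return a list of human-readable problems with the local CoreNLP setup."""
    env = os.environ if env is None else env
    problems = []

    # The client launches the server with the `java` executable.
    if which("java") is None:
        problems.append("java was not found on PATH")

    # CORENLP_HOME should point at the unpacked CoreNLP distribution.
    home = env.get("CORENLP_HOME")
    if not home:
        problems.append("CORENLP_HOME is not set")
    elif not os.path.isdir(home):
        problems.append(f"CORENLP_HOME does not point to a directory: {home}")

    return problems


if __name__ == "__main__":
    for problem in check_corenlp_setup() or ["no obvious problems found"]:
        print(problem)
```

An empty list only means the basics are in place; a "could not find or load main class" error can still occur if the directory does not contain the CoreNLP jar files.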

On Thu, Mar 7, 2024 at 10:31 PM Xu Yue wrote:

> I think there is no stack trace. There is just one error: "Error: Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP". The error happens when I run this code file, and it is the only error.

0 replies

xy1137030414 · 2024-03-08T06:44:22Z

xy1137030414
Mar 8, 2024
Author

Sorry, do you mean the classpath? I set the classpath and its value. [attached image]


0 replies

AngledLuffa · 2024-03-08T06:51:37Z

AngledLuffa
Mar 8, 2024
Maintainer

> File "d:\miniconda\envs\envname\lib\site-packages\stanza\resources\common.py", line 454, in download_resources_json

Got it. That's a bit different. You can try downloading this file: https://github.com/stanfordnlp/stanza-resources/blob/main/resources_1.8.0.json

Download it, rename it resources.json, and put it in the directory where you want to save the models. By default on Windows that will be %HOMEPATH%\stanza_resources\resources.json, but you can set STANZA_RESOURCES_DIR to put the models somewhere else.

Once that's done, you can add the following to the Pipeline: download_method="reuse_resources"

This will stop it from trying to download the resources file from GitHub.
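The steps above could be sketched roughly as follows. The helper names are hypothetical, and the default location is assumed to be a `stanza_resources` folder under the user's home directory unless `STANZA_RESOURCES_DIR` is set. Any models that are not already on disk may still need to be downloaded.

```python
import os


def resources_json_path(env=None):
    """Where the manually downloaded resources.json is expected to live."""
    env = os.environ if env is None else env
    default_dir = os.path.join(os.path.expanduser("~"), "stanza_resources")
    model_dir = env.get("STANZA_RESOURCES_DIR", default_dir)
    return os.path.join(model_dir, "resources.json")


def build_pipeline(lang="zh"):
    """Build a pipeline that reuses the local resources.json instead of fetching it."""
    # Imported lazily so resources_json_path works without stanza installed.
    import stanza

    path = resources_json_path()
    if not os.path.exists(path):
        raise FileNotFoundError(f"Put the renamed resources file at {path} first")
    return stanza.Pipeline(lang, download_method="reuse_resources")


if __name__ == "__main__":
    print(resources_json_path())
```

Printing the path first is a cheap way to confirm that the file was saved where the pipeline will look for it.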

0 replies

AngledLuffa · 2024-03-08T07:00:49Z

AngledLuffa
Mar 8, 2024
Maintainer

Thanks, but what I need is to be writing software instead of doing customer service @manning

0 replies
