Why do I get different results when I use CoreNLP in Python and Java? #1364
Replies: 8 comments 3 replies
-
Would you be more specific please? Examples would help
-
Well, the simple answer is, they are different models trained on different
data. CoreNLP's segmenter is trained on a much larger treebank, 50K
sentences, whereas the Stanza segmenter is only trained on 4K sentences.
Ideally they'd be mixed together, but unfortunately the tokenization
standards between the UD treebank (Stanza) and Chinese Treebank (CoreNLP)
are different enough that it would actually hurt performance.
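To see how far two segmenters disagree on the same text, one option is to compare token boundaries by character offset. The following is only a minimal sketch: the helper names `token_spans` and `segmentation_diff` are hypothetical, and the two token lists are made up for illustration rather than taken from Stanza or CoreNLP output.

```python
def token_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for tok in tokens:
        end = start + len(tok)
        spans.add((start, end))
        start = end
    return spans


def segmentation_diff(tokens_a, tokens_b):
    """Return the tokens that appear in only one of two segmentations."""
    text_a, text_b = "".join(tokens_a), "".join(tokens_b)
    if text_a != text_b:
        raise ValueError("both segmentations must cover the same text")
    spans_a, spans_b = token_spans(tokens_a), token_spans(tokens_b)
    only_a = [text_a[s:e] for s, e in sorted(spans_a - spans_b)]
    only_b = [text_a[s:e] for s, e in sorted(spans_b - spans_a)]
    return only_a, only_b


# Hypothetical segmentations, made up for illustration only;
# they are NOT real output from Stanza or CoreNLP.
seg_one = ["金湖", "荷花荡", "景区"]
seg_two = ["金湖", "荷花", "荡", "景区"]
only_one, only_two = segmentation_diff(seg_one, seg_two)
print(only_one)
print(only_two)
```

Tokens that both segmenters agree on drop out, so what remains is the set of places where the two tokenization standards diverge.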
On Thu, Mar 7, 2024 at 7:45 PM Xu Yue ***@***.***> wrote:
Hi! I am Yue. This is the input text. So sorry, I forgot to give you the original text!
金湖荷花荡――国家AAAA级旅游景区。地处淮安市金湖县东南,美丽高邮湖畔,三面环湖相拥,生态环境优美,总面积1.2万亩,是集农业观光、休闲度假、科普教育、健康疗养为一体的综合性生态旅游景点。拥有“国家水利风景区”、“全国农业旅游示范点”、“江苏省对台交流基地”、“全国科普教育基地”、“淮安高层次人才疗养基地”、“江苏省放心消费示范街区”等众多荣誉称号。盛夏时节,景区内碧叶连天、荷花万顷,完美诠释了宋代大诗人杨万里“接天莲叶无穷碧、映日荷花别样红”的胜景。为致敬奋斗在“抗疫”一线的医务工作者,景区自恢复运营之日起至2020年12月31日对全国医护工作者门票免费。
On Friday, March 8, 2024 at 11:38 AM, John Bauer ***@***.***> wrote:
Would you write out the input text? Trying to figure it out from this is
just making my life more difficult, and I'm really getting tired of chasing
down all these little issues
-
Hang on, the tokenization & tagging you were getting earlier, was that from
using the Stanza models or from accessing the CoreNLP models through python
in some manner?
There is a CoreNLP client for Stanza which connects to the Java version of
the software already. You don't need to write your own:
https://stanfordnlp.github.io/stanza/corenlp_client.html
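A rough sketch of what using that client for Chinese tokenization and POS tagging might look like. It assumes Java is installed, `CORENLP_HOME` points at a CoreNLP distribution, and the Chinese models jar sits in that directory. The function names `format_tagged` and `tag_with_corenlp` are made up for this example, and the timeout and memory values are arbitrary choices.

```python
def format_tagged(pairs):
    """Join (word, tag) pairs as 'word/TAG' separated by spaces."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def tag_with_corenlp(text):
    """Tokenize and POS-tag text with the Java CoreNLP server via Stanza's client.

    Assumes Java, CORENLP_HOME, and the Chinese models jar are all set up.
    """
    # Imported lazily so this module still loads without stanza installed.
    from stanza.server import CoreNLPClient

    with CoreNLPClient(
        annotators=["tokenize", "ssplit", "pos"],
        properties="chinese",  # CoreNLP's default Chinese settings
        timeout=30000,         # milliseconds; arbitrary value for this sketch
        memory="4G",           # arbitrary value for this sketch
        be_quiet=True,
    ) as client:
        ann = client.annotate(text)
        return [(tok.word, tok.pos) for sent in ann.sentence for tok in sent.token]
```

Calling `format_tagged(tag_with_corenlp(text))` would then give the same `word/TAG` layout as the Stanza script later in this thread, except that CoreNLP's tags follow the Chinese Treebank tag set rather than UPOS.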
-
When did that error happen? There should be a stack trace telling you. It
might have happened either downloading the model descriptions from github
or the models themselves from HuggingFace. Since you are clearly able to
post to github, my first guess would be HuggingFace, but we'd need the
whole traceback to be sure
On Thu, Mar 7, 2024 at 10:03 PM Xu Yue ***@***.***> wrote:
import os
import stanza

# Initialize the Chinese pipeline
nlp = stanza.Pipeline('zh')

# Input and output directories
input_dir = 'F:/path'
output_dir = 'F:path'

# Create the output directory if it does not exist
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# Iterate over all files in the input directory
for filename in os.listdir(input_dir):
    # Only process text files
    if filename.endswith('.txt'):
        # Read the file contents
        with open(os.path.join(input_dir, filename), 'r', encoding='utf-8') as f:
            text = f.read()

        # Tokenize and POS-tag with Stanza
        doc = nlp(text)

        # Format the result as space-separated word/UPOS pairs
        result = ' '.join([f'{word.text}/{word.upos}'
                           for sent in doc.sentences
                           for word in sent.words])

        # Write the result to a new file with the same name
        with open(os.path.join(output_dir, filename), 'w', encoding='utf-8') as f:
            f.write(result)
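Hi, I use the Stanza models in Python, and here is my code. But now my
Python code also fails to run, and I do not know why. The error is [WinError 10060]
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Thank you, I will use the CoreNLP client.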
-
Ah, I see. Do you have java in your path? Do you have the CoreNLP
distribution in your java classpath?
There's also a system variable expected with the CoreNLP download path,
$CORENLP_HOME
It's explained here:
https://stanfordnlp.github.io/stanza/client_setup.html#manual-installation
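Those questions can be checked quickly from Python before starting the client. This is only a sketch: `check_corenlp_setup` is a hypothetical helper, and looking for `.jar` files directly inside `$CORENLP_HOME` is an assumption about how the distribution was unpacked.

```python
import os
import shutil
from pathlib import Path


def check_corenlp_setup(env=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []
    # Is a java executable reachable through PATH?
    if shutil.which("java", path=env.get("PATH")) is None:
        problems.append("java was not found on PATH")
    # Is CORENLP_HOME set, and does it contain the CoreNLP jars?
    home = env.get("CORENLP_HOME")
    if not home:
        problems.append("CORENLP_HOME is not set")
    elif not any(Path(home).glob("*.jar")):
        problems.append(f"no .jar files found in CORENLP_HOME ({home})")
    return problems


if __name__ == "__main__":
    for problem in check_corenlp_setup():
        print(problem)
```

An error such as "Could not find or load main class edu.stanford.nlp.pipeline.StanfordCoreNLP" would be consistent with the last check failing, though other causes are possible.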
On Thu, Mar 7, 2024 at 10:31 PM Xu Yue ***@***.***> wrote:
I think there is no stack trace. There is just one error: Error: Could not find or load main class
edu.stanford.nlp.pipeline.StanfordCoreNLP. The error happens when I run
this code file, and it is the only error.
-
Sorry, do you mean the classpath? I set the classpath variable and its value. [attached image]
-
File
"d:\miniconda\envs\envname\lib\site-packages\stanza\resources\common.py",
line 454, in download_resources_json
Got it. That's a bit different. You can try downloading this file:
https://github.com/stanfordnlp/stanza-resources/blob/main/resources_1.8.0.json
Download it, rename it resources.json, and put it in the directory where
you want to save the models - by default on Windows that will be
%HOMEPATH%\stanza_resources\resources.json but you can
set STANZA_RESOURCES_DIR to put the models somewhere else.
Once that's done, you can add to the Pipeline the following:
download_method="reuse_resources"
This will stop it from trying to download the resources file from github
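Put together, the steps above might look like the sketch below. The helper names `resources_json_path` and `build_offline_pipeline` are invented for this example. It assumes `resources.json` has already been placed by hand; any model files that are still missing would still be fetched, so this only avoids the GitHub request for the resources file.

```python
import os
from pathlib import Path


def resources_json_path():
    """Where Stanza is expected to look for a manually downloaded resources.json."""
    base = os.environ.get("STANZA_RESOURCES_DIR")
    root = Path(base) if base else Path.home() / "stanza_resources"
    return root / "resources.json"


def build_offline_pipeline(lang="zh"):
    """Build a pipeline without re-downloading resources.json from GitHub."""
    target = resources_json_path()
    if not target.is_file():
        raise FileNotFoundError(f"put resources.json at {target}")
    # Imported lazily; requires stanza to be installed.
    import stanza

    return stanza.Pipeline(lang, download_method="reuse_resources")
```

Checking for the file first gives a clearer error than a network timeout if the manual download step was skipped.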
-
Thanks, but what I need is to be writing software instead of doing
customer service
@manning
-
The tokenization and POS tagging results for Chinese are very different: Java is much better than Python. Why?