NER error after loading a CONLL-U document: doc.text is None #1428

Open
zbeloki opened this issue Oct 19, 2024 · 3 comments

zbeloki commented Oct 19, 2024

I get the following error when running NER: TypeError: 'NoneType' object is not subscriptable

After debugging, I found that the failure comes from accessing the document's text attribute, which is None. I'm loading the document from a CoNLL-U file created with Stanza, using stanza.utils.conll.CoNLL.conll2doc, so it seems loaded documents don't get their text attribute set. Each sentence has its own text, but the document itself does not, and Stanza reads the document text when it creates the entity spans.

Is it possible to build the document's text from the sentences? That would fix the problem, I guess.
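To make the suggestion concrete, here is a minimal sketch of rebuilding a document string from per-sentence text. It deliberately does not call Stanza: `rebuild_text` and the `(start_char, text)` pairs are hypothetical stand-ins, and whether a loaded Document exposes sentence start offsets this way is an assumption.

```python
from typing import Iterable, Tuple


def rebuild_text(sentences: Iterable[Tuple[int, str]]) -> str:
    """Join sentence texts into one string, padding gaps with spaces.

    Each item is (start_char, text), where start_char is the sentence's
    character offset in the original document (hypothetical input shape).
    """
    pieces = []
    cursor = 0  # length of the text assembled so far
    for start_char, text in sentences:
        if start_char > cursor:
            # Fill the gap so later character offsets stay aligned.
            pieces.append(" " * (start_char - cursor))
            cursor = start_char
        pieces.append(text)
        cursor += len(text)
    return "".join(pieces)


rebuilt = rebuild_text([(0, "Hello world."), (13, "Bye.")])
```

The original whitespace between sentences is not recoverable from the sentence texts alone, so the sketch pads with plain spaces; only the offsets are preserved, not the exact characters.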

This is the entire stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "stanza/prepare_eval_data.py", line 59, in <module>
    main(args)
  File "stanza/prepare_eval_data.py", line 30, in main
    doc = nlp(doc_tokenized)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/ner_processor.py", line 123, in process
    total = len(batch.doc.build_ents())
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 433, in build_ents
    s_ents = s.build_ents()
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 752, in build_ents
    self.ents.append(Span(tokens=ent_tokens, type=e['type'], doc=self.doc, sent=self))
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1601, in __init__
    self.init_from_tokens(tokens, type)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1618, in init_from_tokens
    self.text = self.doc.text[self.start_char:self.end_char]
TypeError: 'NoneType' object is not subscriptable

zbeloki added the bug label Oct 19, 2024

AngledLuffa (Collaborator) commented

Do you have a code sample which shows this? I ran a small example and it works fine:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner")
>>> pipe("Dr. Pritchett gave me a new hip")
etc etc


zbeloki commented Oct 28, 2024

Thanks for your response. NER doesn't work with documents loaded from CoNLL-U files. This snippet reproduces the error:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner")
>>> doc = pipe("Dr. Pritchett gave me a new hip")
# At this point NER is correct. But...
>>> from stanza.utils.conll import CoNLL
>>> CoNLL.write_doc2conll(doc, "output.conllu")
>>> loaded_doc = CoNLL.conll2doc("output.conllu")
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner", tokenize_pretokenized=True)
>>> pipe(loaded_doc) 
# TypeError: 'NoneType' object is not subscriptable

AngledLuffa added a commit that referenced this issue Oct 28, 2024
…(which can happen when a Document is made via conllu), try to extract the entity text from the sentence text instead. Addresses #1428

AngledLuffa (Collaborator) commented

Thank you, that was a perfect example of the bug in action. It should now be fixed on the dev branch.
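For readers following along, the idea in the commit message (fall back to the sentence text when the document text is missing) could look roughly like the sketch below. This is not the code from the commit: `span_text` and its parameters are hypothetical, and it assumes character offsets are document-level while the sentence's own start offset is known.

```python
from typing import Optional


def span_text(doc_text: Optional[str],
              sent_text: Optional[str],
              sent_start: int,
              start_char: int,
              end_char: int) -> Optional[str]:
    """Return the entity text, preferring the document text.

    start_char and end_char are offsets into the full document;
    sent_start is the document offset where the sentence begins
    (an assumption made for this sketch).
    """
    if doc_text is not None:
        return doc_text[start_char:end_char]
    if sent_text is not None:
        # Shift document-level offsets into sentence-local offsets.
        return sent_text[start_char - sent_start:end_char - sent_start]
    # Neither text is available, so the span text stays unset.
    return None


from_doc = span_text("Dr. Pritchett gave me a new hip", None, 0, 4, 13)
from_sent = span_text(None, "gave me", 14, 14, 18)
```

Returning None when neither text is available avoids the TypeError from subscripting None, at the cost of leaving the span text empty.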
