NER error after loading a CONLL-U document: doc.text is None #1428

zbeloki · 2024-10-19T07:16:38Z

I get the following error when running NER: TypeError: 'NoneType' object is not subscriptable

After debugging the error, I found that it tries to access the document's text attribute, which is None. I'm loading the document from a CoNLL-U file created with Stanza, using the function stanza.utils.conll.CoNLL.conll2doc. It seems loaded documents don't get their text attribute set. Each sentence has its own text, but the document as a whole does not, and Stanza reads the document text in order to create the entity spans.

Is it possible to build the document's text from the sentences? That would fix the problem, I guess.
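A minimal sketch of what "building the document's text from the sentences" could look like. It does not use Stanza itself: `Sent` and `rebuild_text` are hypothetical stand-ins, and the sketch assumes each sentence carries its text plus, optionally, its starting character offset in the original document. Where offsets are known, the gaps are padded with spaces so that existing character offsets still point at the right characters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sent:
    """Hypothetical stand-in for a sentence object (not Stanza's class)."""
    text: str
    start_char: Optional[int] = None  # offset in the original document, if known


def rebuild_text(sentences: List[Sent]) -> str:
    """Join sentence texts into one document string.

    When a sentence has a known start offset, pad with spaces up to that
    offset; otherwise separate sentences with a single space.
    """
    parts = []
    length = 0
    for sent in sentences:
        if sent.start_char is not None:
            # Pad up to the known offset (no padding if we are already past it).
            gap = " " * max(sent.start_char - length, 0)
        else:
            gap = " " if parts else ""
        parts.append(gap + sent.text)
        length += len(gap) + len(sent.text)
    return "".join(parts)


text = rebuild_text([
    Sent("Dr. Pritchett gave me a new hip.", 0),
    Sent("It works.", 34),
])
```

The rebuilt string is only an approximation of the original input: whitespace that was not a plain space (newlines, tabs) cannot be recovered this way.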

This is the entire stack trace:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 71, in <module>
    cli.main()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 501, in main
    run()
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 351, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 310, in run_path
    return _run_module_code(code, init_globals, run_name, pkg_name=pkg_name, script_name=fname)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 127, in _run_module_code
    _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_name, script_name)
  File "/home/zbeloki/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 118, in _run_code
    exec(code, run_globals)
  File "stanza/prepare_eval_data.py", line 59, in <module>
    main(args)
  File "stanza/prepare_eval_data.py", line 30, in main
    doc = nlp(doc_tokenized)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 480, in __call__
    return self.process(doc, processors)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/core.py", line 431, in process
    doc = process(doc)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/pipeline/ner_processor.py", line 123, in process
    total = len(batch.doc.build_ents())
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 433, in build_ents
    s_ents = s.build_ents()
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 752, in build_ents
    self.ents.append(Span(tokens=ent_tokens, type=e['type'], doc=self.doc, sent=self))
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1601, in __init__
    self.init_from_tokens(tokens, type)
  File "/home/zbeloki/workspace/nlp_processors_evaluation/venv/lib/python3.10/site-packages/stanza/models/common/doc.py", line 1618, in init_from_tokens
    self.text = self.doc.text[self.start_char:self.end_char]
TypeError: 'NoneType' object is not subscriptable


AngledLuffa · 2024-10-19T23:23:19Z

Do you have a code sample which shows this? I ran a small example and it works fine:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner")
>>> pipe("Dr. Pritchett gave me a new hip")
etc etc

zbeloki · 2024-10-28T11:47:22Z

Thanks for your response. NER doesn't work with documents loaded from CoNLL-U files. This snippet reproduces the error:

>>> import stanza
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner")
>>> doc = pipe("Dr. Pritchett gave me a new hip")
# At this point NER is correct. But...
>>> from stanza.utils.conll import CoNLL
>>> CoNLL.write_doc2conll(doc, "output.conllu")
>>> loaded_doc = CoNLL.conll2doc("output.conllu")
>>> pipe = stanza.Pipeline("en", processors="tokenize,ner", tokenize_pretokenized=True)
>>> pipe(loaded_doc) 
# TypeError: 'NoneType' object is not subscriptable


AngledLuffa · 2024-10-28T20:13:57Z

Thank you, that was a perfect example of the bug in action. It should now be fixed on the dev branch.

zbeloki added the bug label Oct 19, 2024

AngledLuffa added a commit that referenced this issue Oct 28, 2024

When extracting text for NER entities, if the doc text doesn't exist (which can happen when a Document is made via conllu), try to extract the entity text from the sentence text instead. Addresses #1428

0732628
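The fallback described in the commit message can be illustrated roughly as follows. This is not the code from the commit: `entity_text` and its parameters are hypothetical names, and the sketch assumes that entity offsets are document-level character offsets and that the sentence's own starting offset is known, so the entity can be sliced out of the sentence text by shifting the offsets.

```python
from typing import Optional


def entity_text(doc_text: Optional[str],
                sent_text: Optional[str],
                sent_start_char: int,
                start_char: int,
                end_char: int) -> Optional[str]:
    """Return the text of an entity spanning [start_char, end_char).

    Offsets are relative to the whole document. If the document text is
    missing, fall back to the sentence text with shifted offsets.
    """
    # Preferred path: slice the document text directly.
    if doc_text is not None:
        return doc_text[start_char:end_char]
    # Fallback: make the offsets relative to the start of the sentence.
    if sent_text is not None:
        return sent_text[start_char - sent_start_char:end_char - sent_start_char]
    # Neither text is available, so the entity text cannot be recovered.
    return None


doc = "Hi. Dr. Pritchett gave me a new hip"
sent = "Dr. Pritchett gave me a new hip"  # starts at offset 4 in doc
with_doc = entity_text(doc, sent, 4, 8, 17)
without_doc = entity_text(None, sent, 4, 8, 17)
```

Both calls yield the same entity text, which is the point of the fallback: a document loaded without its full text can still produce entity spans as long as the sentence text and offsets are present.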
