Random 'Segmentation fault (core dumped)' error when training for long spancat #13026
Comments
A segmentation fault shouldn't be happening under any circumstances. Could you post the output of `pip list`? Furthermore, I'd appreciate it if you could try the following for me:
Thanks for the reply. Noticed one thing: attached the `debug data` output FYR, along with the `pip list` output:
Thanks for the info - we'll investigate.
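As an aside to the `pip list` exchange above: the same environment details can be collected in one short script. This is just a convenience sketch using spaCy's documented `info()` helper and the standard library; nothing in it comes from the thread itself.

```python
# Sketch: gather the environment details requested above in one place.
# spacy.info() is spaCy's built-in environment report; importlib.metadata
# (stdlib, Python 3.8+) lists installed packages much like `pip list`.
import spacy
from importlib.metadata import distributions

print(spacy.info())  # spaCy version, platform, installed pipelines

for dist in sorted(distributions(), key=lambda d: d.metadata["Name"].lower()):
    print(f"{dist.metadata['Name']}=={dist.version}")
```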
To anyone facing this issue: I've used NER instead of SpanCat and had no issues. Regards.
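For anyone wanting to try that workaround, here is a minimal sketch of converting SpanCat-style annotations into NER entities. The `"sc"` spans key is spancat's default and the file paths are copied from the report below; both are assumptions about the actual project.

```python
# Sketch of the NER workaround: copy span annotations into doc.ents so the
# data can train an "ner" component instead of "spancat". doc.ents must be
# non-overlapping, so filter_spans() keeps only the longest span from each
# overlapping group - some annotations may be dropped in the process.
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./spacy_models_v3/train_data.spacy")

ner_bin = DocBin()
for doc in doc_bin.get_docs(nlp.vocab):
    doc.ents = filter_spans(list(doc.spans.get("sc", [])))
    ner_bin.add(doc)

ner_bin.to_disk("./spacy_models_v3/train_data_ner.spacy")
```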
Hi, can you share the training/dev data and the custom code you were using to train the SpanCat model? We'd need that to reproduce the crash and debug the issue.
Hi, unfortunately I'm not able to share the training data or custom code. Regards.
That's understandable. The issue is likely a bug in the SpanCat component's code, but we still need to reproduce the crash consistently in order to identify the cause and fix it. If you run into this issue again with data you can share, please let us know.
Hi,
I am getting 'Segmentation fault (core dumped)' when trying to train a SpanCat model with long spans. I know this error could be related to OOM issues, but that does not seem to be the case here: I tried reducing [nlp] batch_size and [training.batcher.size] as shown in the attached config file, and used a VM with very large RAM to make sure we are not running out of memory.
During training the VM memory usage never goes above 40%, and even when reducing the [components.spancat.suggester] min_size and max_size the memory usage does not exceed 20%, yet training still exits with 'Segmentation fault (core dumped)'.
Note: when training with low [components.spancat.suggester] values, the training completes, but with all zeroes for F, P, and R.
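An aside on that last note, not from the original report: with spaCy's default ngram suggester, a gold span longer than max_size can never be proposed as a candidate, so precision, recall, and F stay at zero no matter how long you train. A quick diagnostic sketch, assuming the default "sc" spans key and the paths from the command below:

```python
# Diagnostic sketch: check whether the gold span lengths fall inside the
# [components.spancat.suggester] min_size/max_size range. Spans outside that
# range are unreachable for the ngram suggester, which would explain the
# all-zero scores. The "sc" spans key is spaCy's default for spancat.
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin().from_disk("./spacy_models_v3/train_data.spacy")

lengths = [
    len(span)
    for doc in doc_bin.get_docs(nlp.vocab)
    for span in doc.spans.get("sc", [])
]
print(f"gold span lengths: min={min(lengths)}, max={max(lengths)}")
```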
This is the command I am using for training:
python -m spacy train config_spn.cfg --output ./output_v3_lg_1.3 --paths.train ./spacy_models_v3/train_data.spacy --paths.dev ./spacy_models_v3/test_data.spacy --code functions.py -V
This is the training output:
[2023-09-28 09:25:08,461] [DEBUG] Config overrides from CLI: ['paths.train', 'paths.dev']
ℹ Saving to output directory: output_v3_lg_1.3
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2023-09-28 09:25:08,610] [INFO] Set up nlp object from config
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:08,618] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
[2023-09-28 09:25:08,619] [INFO] Pipeline: ['tok2vec', 'spancat']
[2023-09-28 09:25:08,621] [INFO] Created vocabulary
[2023-09-28 09:25:09,450] [INFO] Added vectors: en_core_web_lg
[2023-09-28 09:25:09,450] [INFO] Finished initializing nlp object
[2023-09-28 09:25:16,150] [INFO] Initialized pipeline components: ['tok2vec', 'spancat']
✔ Initialized pipeline
============================= Training pipeline =============================
[2023-09-28 09:25:16,158] [DEBUG] Loading corpus from path: spacy_models_v3/test_data.spacy
[2023-09-28 09:25:16,159] [DEBUG] Loading corpus from path: spacy_models_v3/train_data.spacy
ℹ Pipeline: ['tok2vec', 'spancat']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS SPANCAT  SPANS_SC_F  SPANS_SC_P  SPANS_SC_R  SCORE
---  ------  ------------  ------------  ----------  ----------  ----------  ------
  0       0      98109.47      19535.08        0.00        0.00        4.58    0.00
  0     200        528.73        781.51        0.00        0.00        3.75    0.00
Segmentation fault (core dumped)
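A general debugging suggestion, not from the thread: Python's stdlib faulthandler prints the interpreter's traceback when a C extension triggers SIGSEGV, which helps narrow down where the crash happens. The simplest route is setting PYTHONFAULTHANDLER=1 in the environment (or running with `python -X faulthandler`); below is a sketch of the equivalent via spaCy's documented Python entry point for training.

```python
# Sketch: enable the stdlib faulthandler so a SIGSEGV inside a C extension
# dumps a Python-level traceback before the process dies.
import faulthandler
faulthandler.enable()

# Stands in for `--code functions.py`: importing the module runs the
# registrations it contains (assumes functions.py is on the path).
import functions

from spacy.cli.train import train  # documented Python entry point for `spacy train`

train(
    "config_spn.cfg",
    output_path="./output_v3_lg_1.3",
    overrides={
        "paths.train": "./spacy_models_v3/train_data.spacy",
        "paths.dev": "./spacy_models_v3/test_data.spacy",
    },
)
```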
Environment:
Operating System: Ubuntu 20.04.6 LTS
Python Version Used: 3.8.10
spaCy Version Used: 3.6.0
config_spn.cfg.txt
Thanks in advance!