
deal with wikipedia bug #29

Open
shaigue opened this issue Jun 25, 2023 · 1 comment

shaigue commented Jun 25, 2023

When trying to use the wikipedia dataset I get this error:

Traceback (most recent call last):
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 188, in <module>
    main()
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 184, in main
    run_pipeline(**args.__dict__)
  File "C:\repos\pmi_masking\src\db_implementation\run_pipeline.py", line 66, in run_pipeline
    dataset = load_and_tokenize_dataset(dataset_name=dataset_name, tokenizer_name=tokenizer_name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 81, in load_and_tokenize_dataset
    dataset = dataset_name_to_load_function[dataset_name]()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 29, in load_bookcorpus_and_wikipedia_dataset
    wiki = load_wikipedia_dataset()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 23, in load_wikipedia_dataset
    return load_dataset(dataset_path, configuration_name, split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 902, in download_and_prepare
    self._save_info()
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 2039, in _save_info
    import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'

However, installing apache_beam could be problematic. Maybe I can use this dataset without it, use a different version, or not use it at all? Try to figure this out.
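One hedged option worth sketching: fall back to a source that doesn't go through the beam-based builder when apache_beam isn't installed. The dataset path "wikimedia/wikipedia" and the config names below are assumptions (a parquet-based mirror on the Hugging Face Hub), not something confirmed in this issue:

```python
from importlib import util


def pick_wikipedia_source():
    """Choose a wikipedia dataset path based on whether apache_beam is available.

    Assumption: "wikimedia/wikipedia" is a parquet-based mirror that loads
    without apache_beam, while the classic "wikipedia" builder needs it.
    The config names are illustrative dump dates, not verified here.
    """
    if util.find_spec("apache_beam") is None:
        # No beam installed: use the pre-parsed mirror instead.
        return "wikimedia/wikipedia", "20231101.en"
    return "wikipedia", "20220301.en"
```

The returned pair would then be fed to `load_dataset(path, config, split=...)` in `load_wikipedia_dataset`.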


Before that, when I ran .map with multiple processes, I would get an error because apache_beam and multiprocess require conflicting versions of dill.

**Solution to try:** use Python 3.9 instead of Python 3.10 / 3.11.

UPDATE: OK, so downgrading to Python 3.9 seems to work, but I still got a memory allocation error when tokenizing the dataset, so I reduced the tokenizer batch size to 10_000 to see if it works better that way.

I also changed the datasets line in requirements.txt to datasets[apache-beam].
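The batch-size change above can be illustrated with a small stand-in for `Dataset.map(batched=True, batch_size=...)`. This is a sketch, not the project's actual tokenization code; `tokenize_fn` in the comment is a hypothetical name:

```python
def iter_batches(examples, batch_size=10_000):
    """Yield fixed-size slices of a list of examples.

    Mirrors the effect of datasets' .map(batched=True, batch_size=10_000):
    smaller batches lower the peak memory each worker needs to hold while
    tokenizing, at the cost of more map-function calls.
    """
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# With the real library the call would look roughly like (names assumed):
# dataset = dataset.map(tokenize_fn, batched=True, batch_size=10_000)
```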


Reopened 02.07.2023

I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'multiprocess'

And I think it is related to this issue:
uqfoundation/multiprocess#61

So I'm kind of giving up on wikipedia, and I think this could be a Windows-specific bug. I hope running this on Linux will work all right.


shaigue commented Jul 3, 2023

I see two possible solutions for that:

  1. Maybe this bug only happens on Windows, so using Linux should be OK.
  2. Simply use a dataset of a similar size to test this out, without having to deal with apache-beam.

In addition, I want to add a test that uses multiprocessing (n_workers >= 2) to catch this.
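A minimal sketch of such a test, using the stdlib multiprocessing module to stand in for a `num_proc >= 2` call to `.map` (the real test would go through `load_and_tokenize_dataset`; `tokenize_stub` here is a hypothetical placeholder). If worker spawning or function pickling is broken on the platform, this fails loudly:

```python
import multiprocessing as mp


def tokenize_stub(text):
    # Placeholder for real tokenization: just count whitespace-split tokens.
    return len(text.split())


def run_parallel_map(texts, n_workers=2):
    """Map tokenize_stub over texts with n_workers >= 2 processes.

    Mimics the failure mode of a multi-process .map: if dill/pickle or
    process startup is broken (as seen on Windows), this raises.
    """
    with mp.Pool(n_workers) as pool:
        return pool.map(tokenize_stub, texts)


if __name__ == "__main__":
    assert run_parallel_map(["a b", "c d e"]) == [2, 3]
```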
