
deal with wikipedia bug #29

Open
shaigue opened this issue Jun 25, 2023 · 1 comment

shaigue commented Jun 25, 2023

When trying to use the wikipedia dataset I get this error:

Traceback (most recent call last):
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 188, in <module>
    main()
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 184, in main
    run_pipeline(**args.__dict__)
  File "C:\repos\pmi_masking\src\db_implementation\run_pipeline.py", line 66, in run_pipeline
    dataset = load_and_tokenize_dataset(dataset_name=dataset_name, tokenizer_name=tokenizer_name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 81, in load_and_tokenize_dataset
    dataset = dataset_name_to_load_function[dataset_name]()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 29, in load_bookcorpus_and_wikipedia_dataset
    wiki = load_wikipedia_dataset()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 23, in load_wikipedia_dataset
    return load_dataset(dataset_path, configuration_name, split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 902, in download_and_prepare
    self._save_info()
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 2039, in _save_info
    import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'

However, installing apache_beam could be problematic. Maybe I can use this dataset without it, use a different version, or not use it at all? Try to figure this out.
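One hedged option worth sketching: fall back to a source that doesn't go through the beam-based builder when apache_beam isn't installed. The dataset path "wikimedia/wikipedia" and the config names below are assumptions (a parquet-based mirror on the Hugging Face Hub), not something confirmed in this issue:

```python
from importlib import util


def pick_wikipedia_source():
    """Choose a wikipedia dataset path based on whether apache_beam is available.

    Assumption: "wikimedia/wikipedia" is a parquet-based mirror that loads
    without apache_beam, while the classic "wikipedia" builder needs it.
    The config names are illustrative dump dates, not verified here.
    """
    if util.find_spec("apache_beam") is None:
        # No beam installed: use the pre-parsed mirror instead.
        return "wikimedia/wikipedia", "20231101.en"
    return "wikipedia", "20220301.en"
```

The returned pair would then be fed to `load_dataset(path, config, split=...)` in `load_wikipedia_dataset`.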


Before that, when I ran .map with multiple processes, I would get an error because apache_beam and multiprocess require conflicting versions of dill.

**Solution to try:** use Python 3.9 instead of Python 3.10 / 3.11.

UPDATE: OK, so downgrading to Python 3.9 seems to work, but I still got a memory allocation error when tokenizing the dataset, so I reduced the tokenizer batch size to 10_000 to see if it works better that way.

I also changed the datasets line in requirements.txt to datasets[apache-beam].
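The batch-size change above can be illustrated with a small stand-in for `Dataset.map(batched=True, batch_size=...)`. This is a sketch, not the project's actual tokenization code; `tokenize_fn` in the comment is a hypothetical name:

```python
def iter_batches(examples, batch_size=10_000):
    """Yield fixed-size slices of a list of examples.

    Mirrors the effect of datasets' .map(batched=True, batch_size=10_000):
    smaller batches lower the peak memory each worker needs to hold while
    tokenizing, at the cost of more map-function calls.
    """
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]


# With the real library the call would look roughly like (names assumed):
# dataset = dataset.map(tokenize_fn, batched=True, batch_size=10_000)
```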


Reopened 02.07.2023

I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'multiprocess'

And I think it is related to this issue:
uqfoundation/multiprocess#61

So I'm kind of giving up on wikipedia, and I think this could be a Windows-specific bug. I hope running this on Linux will work all right.


shaigue commented Jul 3, 2023

I see two possible solutions for that:

  1. Maybe this bug only happens on Windows, so using Linux should be OK.
  2. Simply use a dataset of a similar size to test this out, without having to deal with apache-beam.

In addition, I want to add a test that uses multiprocessing (n_workers >= 2) to catch this.
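A minimal sketch of such a test, using the stdlib multiprocessing module to stand in for a `num_proc >= 2` call to `.map` (the real test would go through `load_and_tokenize_dataset`; `tokenize_stub` here is a hypothetical placeholder). If worker spawning or function pickling is broken on the platform, this fails loudly:

```python
import multiprocessing as mp


def tokenize_stub(text):
    # Placeholder for real tokenization: just count whitespace-split tokens.
    return len(text.split())


def run_parallel_map(texts, n_workers=2):
    """Map tokenize_stub over texts with n_workers >= 2 processes.

    Mimics the failure mode of a multi-process .map: if dill/pickle or
    process startup is broken (as seen on Windows), this raises.
    """
    with mp.Pool(n_workers) as pool:
        return pool.map(tokenize_stub, texts)


if __name__ == "__main__":
    assert run_parallel_map(["a b", "c d e"]) == [2, 3]
```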
