When trying to use the wikipedia dataset, I get this error:
Traceback (most recent call last):
File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 188, in <module>
main()
File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 184, in main
run_pipeline(**args.__dict__)
File "C:\repos\pmi_masking\src\db_implementation\run_pipeline.py", line 66, in run_pipeline
dataset = load_and_tokenize_dataset(dataset_name=dataset_name, tokenizer_name=tokenizer_name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\pmi_masking\src\load_dataset.py", line 81, in load_and_tokenize_dataset
dataset = dataset_name_to_load_function[dataset_name]()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\pmi_masking\src\load_dataset.py", line 29, in load_bookcorpus_and_wikipedia_dataset
wiki = load_wikipedia_dataset()
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\pmi_masking\src\load_dataset.py", line 23, in load_wikipedia_dataset
return load_dataset(dataset_path, configuration_name, split=split)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\load.py", line 1791, in load_dataset
builder_instance.download_and_prepare(
File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 902, in download_and_prepare
self._save_info()
File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 2039, in _save_info
import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'
BUT installing apache_beam could be problematic. Maybe I can use this dataset without it, use a different version, or not use the dataset at all? Try to figure this out.
Before that, when I ran .map with multiple processes, I got an error because apache_beam and multiprocess require conflicting versions of dill.
Solution to try: use Python 3.9 instead of Python 3.10 / 3.11.
UPDATE: OK, so downgrading to Python 3.9 seems to work, but I still got a memory allocation error when tokenizing the dataset, so I reduced the tokenizer batch size to 10_000 to see if that works better.
I also changed the datasets line in the requirement.txt to datasets[apache-beam].
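The batch-size reduction mentioned in the update could look roughly like this. It is a sketch, not the code from this repo: `tokenize_in_batches` and the `text_column` parameter are hypothetical names, `dataset` is assumed to be a Hugging Face `datasets.Dataset`, and `tokenizer` is assumed to be a callable that accepts a list of strings.

```python
def tokenize_in_batches(dataset, tokenizer, text_column="text",
                        batch_size=10_000, num_proc=None):
    """Tokenize a dataset with an explicit batch size.

    Smaller batches mean fewer rows are held in memory per call,
    at the cost of more calls to the tokenizer.
    """
    def tokenize(batch):
        # With batched=True, `batch` maps column names to lists of values.
        return tokenizer(batch[text_column])

    return dataset.map(
        tokenize,
        batched=True,
        batch_size=batch_size,
        num_proc=num_proc,
    )
```

If memory errors persist, lowering `batch_size` further is the obvious next knob to try.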
Reopened 02.07.2023
I get this error:
Traceback (most recent call last):
File "", line 1, in
ModuleNotFoundError: No module named 'multiprocess'
Note on the dill conflict: apache-beam==2.43.0 with multiprocess>=0.70.12 has an incompatibility for dill (apache/beam#24458).
I think the "No module named 'multiprocess'" error is related to this issue:
uqfoundation/multiprocess#61
So I'm kinda giving up on wikipedia. I think this could be a Windows bug, so I hope running this on Linux will work all right.
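If the failure really is specific to worker processes on Windows (an assumption, not something confirmed here), one possible stopgap is to fall back to single-process mapping on that platform. The helper name `effective_num_proc` is hypothetical; the result is meant to be passed as the `num_proc` argument of `.map`.

```python
import sys


def effective_num_proc(requested, platform=sys.platform):
    """Return the num_proc value to pass to .map, or None for single-process.

    Assumption: multi-process mapping is what fails on Windows, so it is
    disabled there. On other platforms the requested value is kept.
    """
    if requested is None or requested <= 1:
        return None
    if platform.startswith("win"):  # sys.platform is "win32" on Windows
        return None
    return requested
```

This trades speed for robustness on Windows and leaves behavior on Linux unchanged.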