Thank you for your diligence and hard work on this project.
I am working on replicating the training procedure of CEPE with minimal training data. During data preparation, I encountered some questions regarding:
downloading the training data (C4), and
preparing the downloaded training data.
What I have done:
I followed the C4 download guide and successfully downloaded the data, obtaining .jsonl files as listed below, with their content shown in the screenshot below:
I placed all processed .jsonl files at ./data/redpajama/c4-rp/*.jsonl in the cloned CEPE repository. I suspect this may not be the correct path, but I couldn't find any clear guidance in the README.md under ./data on how to organize the downloaded and preprocessed datasets from the different domains.
I successfully ran get_all_jsonl.py and obtained the .txt file shown below:
I then tried to run bash run_tokenize.sh directly, where I encountered several issues:
i. The code only processes one .jsonl file. In tokenize_files.py, which the script calls, line 74 suggests the code is set up for SLURM clusters only.
ii. After modifying line 78 so that the variable file_names includes all .jsonl files for preprocessing (roughly as sketched after this list), I found that the preprocessed files were saved as .npy files, but they cannot be read correctly by sanity_check.py:
iii. I looked into the code responsible for saving and reading the .npy files. It suggests that tokenize_files.py saves them with np.save() on line 70, while sanity_check.py reads them back with pickle on line 37 (see the second sketch below).
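For context, here is roughly the change I made for item ii. This is only a sketch of my own edit, not the code from the repository: the variable name file_names and the data path come from what I describe above, and the commented-out SLURM indexing is just my assumption of what the original does.

```python
import glob
import os

# Original behaviour, as I understand it: tokenize_files.py picks a single
# file from the list using the SLURM array task index, roughly:
#   idx = int(os.environ["SLURM_ARRAY_TASK_ID"])
#   file_names = [all_files[idx]]
#
# My modification: collect every .jsonl file under the data directory instead.
data_dir = "./data/redpajama/c4-rp"  # where I placed the downloaded files
file_names = sorted(glob.glob(os.path.join(data_dir, "*.jsonl")))
print(file_names)
```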
This error prevents me from processing the dataset further for training. I am wondering what makes the .npy files appear corrupted and what the best practice is for preprocessing the dataset.
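Here is a minimal reproduction of the mismatch I describe in item iii, as I understand it. The file name and the array are placeholders I made up; only the np.save() write in tokenize_files.py and the pickle-based read in sanity_check.py are taken from the repository:

```python
import pickle

import numpy as np

# tokenize_files.py (line 70) writes the token ids with np.save:
tokens = np.arange(10, dtype=np.int32)  # placeholder for real token ids
np.save("example_tokens.npy", tokens)

# Reading the same file back through pickle (as sanity_check.py appears to do
# on line 37) fails, because a .npy file is not a pickle stream:
try:
    with open("example_tokens.npy", "rb") as f:
        pickle.load(f)
except Exception as e:
    print("pickle-based read fails:", e)

# np.load reads it back without problems, so the file itself seems fine:
print(np.load("example_tokens.npy"))
```

So my guess is that the saved files are not actually corrupted and the read path simply expects a different format, but please correct me if I am missing a step.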
Many thanks!