
Dataset Sanity Check bug report and seeking help with training data preparation #6

Open
EigenTom opened this issue Dec 22, 2024 · 0 comments

Thank you for your diligence and hard work on this project.

I am working on replicating the training procedure of CEPE with minimal training data. During data preparation, I ran into questions about:

  1. downloading the training data (C4), and
  2. preparing the downloaded training data.

What I have done:

  1. I followed the guidance in the C4 download guide and successfully downloaded the .jsonl files shown below:
[screenshot: directory listing of the downloaded .jsonl files]

with the content shown in the screenshot below:
[screenshot: contents of one downloaded .jsonl file]

  2. I placed all processed .jsonl files at ./data/redpajama/c4-rp/*.jsonl in the cloned CEPE repository. I suspect this may not be the correct filepath, but I couldn't find any clear guidance in the README.md under ./data on how to organize the downloaded and preprocessed datasets from the different domains.

  3. I successfully ran get_all_jsonl.py and obtained the txt file shown below:

[screenshot: the txt file produced by get_all_jsonl.py]
  4. I tried to run bash run_tokenize.sh directly and encountered several issues (sketches of what I mean follow this list):
    i. The code only processes one .jsonl file. In tokenize_files.py, which the script calls, line 74 suggests the code is written for Slurm clusters only.

    ii. After modifying line 78 so that the variable file_names includes all .jsonl files for preprocessing, I found that the preprocessed files were saved as .npy files, but they cannot be read correctly by sanity_check.py:
[screenshot: error traceback from sanity_check.py]

    iii. I looked into the code responsible for saving and reading the .npy files: tokenize_files.py calls np.save() on line 70, while sanity_check.py reads the file back with pickle.load() on line 37.
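To illustrate point i, here is a sketch of the single-file pattern I believe tokenize_files.py uses, along with the change I made around line 78. This is not the repository's exact code; the glob path and the SLURM_ARRAY_TASK_ID indexing are my assumptions about how the Slurm-only behavior arises:

```python
import glob
import os

# All shards I want to tokenize (this path is where I placed the data).
all_files = sorted(glob.glob("data/redpajama/c4-rp/*.jsonl"))

# My reading of the original behavior: each Slurm array task is assigned
# exactly one file via SLURM_ARRAY_TASK_ID, so running the script outside
# a Slurm job array only ever tokenizes a single file.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
file_names = [all_files[task_id]]

# My local change around line 78: keep the full list so one process
# tokenizes every .jsonl file sequentially.
file_names = all_files
```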

This error stops me from processing the dataset any further for training. I am wondering what causes the .npy files to appear corrupted, and what the best practice is for preprocessing the dataset. A minimal reproduction of the failure is below.
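For completeness, here is a minimal reproduction of the save/read mismatch described in point iii. The array contents and file name are made up; only the np.save() and pickle.load() calls mirror the two scripts:

```python
import pickle

import numpy as np

# Mimic tokenize_files.py (line 70): np.save() writes the binary .npy
# format, which is not a pickle stream.
tokens = np.arange(10, dtype=np.int32)  # stand-in for real token ids
np.save("sample.npy", tokens)

# Mimic sanity_check.py (line 37): unpickling a .npy file fails because
# the file begins with NumPy's magic header, not pickle opcodes.
try:
    with open("sample.npy", "rb") as f:
        pickle.load(f)
except Exception as e:
    print("pickle.load failed:", e)

# Reading the same file with np.load() works, which suggests the .npy
# files themselves are fine and only the reader side disagrees.
print("np.load succeeded:", np.load("sample.npy"))
```

If the intended fix is simply to read with np.load() in sanity_check.py, I am happy to adjust that locally; otherwise, please let me know how the files are meant to be written and read.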

Many thanks!
