
Dataset Sanity Check bug report and seeking help with training data preparation #6

Open
EigenTom opened this issue Dec 22, 2024 · 0 comments

Thank you for your diligence and hard work on this project.

I am working on replicating the training procedure of CEPE with minimal training data. During data preparation, I ran into questions about:

  1. downloading the training data (C4), and
  2. preparing the downloaded training data.

What I have done:

  1. I followed the guidance in the C4 download guide and successfully downloaded the .jsonl files shown below:
[screenshot: directory listing of the downloaded .jsonl files]

with the content shown in the screenshot below:
[screenshot: contents of one downloaded .jsonl file]

  2. I placed all processed .jsonl files at ./data/redpajama/c4-rp/*.jsonl in the cloned CEPE repository. I suspect this may not be the correct filepath, but I couldn't find any clear guidance in the README.md under ./data on how to organize the downloaded and preprocessed datasets from the different domains.

  3. I successfully ran get_all_jsonl.py and obtained the txt file shown below:

[screenshot: the txt file produced by get_all_jsonl.py]
  4. I tried to run bash run_tokenize.sh directly and encountered several issues (sketches of what I mean follow this list):
    i. The code only processes one .jsonl file. In tokenize_files.py, which the script calls, line 74 suggests the code is written for Slurm clusters only.

    ii. After modifying line 78 so that the variable file_names includes all .jsonl files for preprocessing, I found that the preprocessed files were saved as .npy files, but they cannot be read correctly by sanity_check.py:
[screenshot: error traceback from sanity_check.py]

    iii. I looked into the code responsible for saving and reading the .npy files: tokenize_files.py calls np.save() on line 70, while sanity_check.py reads the file back with pickle.load() on line 37.
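To illustrate point i, here is a sketch of the single-file pattern I believe tokenize_files.py uses, along with the change I made around line 78. This is not the repository's exact code; the glob path and the SLURM_ARRAY_TASK_ID indexing are my assumptions about how the Slurm-only behavior arises:

```python
import glob
import os

# All shards I want to tokenize (this path is where I placed the data).
all_files = sorted(glob.glob("data/redpajama/c4-rp/*.jsonl"))

# My reading of the original behavior: each Slurm array task is assigned
# exactly one file via SLURM_ARRAY_TASK_ID, so running the script outside
# a Slurm job array only ever tokenizes a single file.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
file_names = [all_files[task_id]]

# My local change around line 78: keep the full list so one process
# tokenizes every .jsonl file sequentially.
file_names = all_files
```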

This error stops me from processing the dataset any further for training. I am wondering what causes the .npy files to appear corrupted, and what the best practice is for preprocessing the dataset. A minimal reproduction of the failure is below.
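For completeness, here is a minimal reproduction of the save/read mismatch described in point iii. The array contents and file name are made up; only the np.save() and pickle.load() calls mirror the two scripts:

```python
import pickle

import numpy as np

# Mimic tokenize_files.py (line 70): np.save() writes the binary .npy
# format, which is not a pickle stream.
tokens = np.arange(10, dtype=np.int32)  # stand-in for real token ids
np.save("sample.npy", tokens)

# Mimic sanity_check.py (line 37): unpickling a .npy file fails because
# the file begins with NumPy's magic header, not pickle opcodes.
try:
    with open("sample.npy", "rb") as f:
        pickle.load(f)
except Exception as e:
    print("pickle.load failed:", e)

# Reading the same file with np.load() works, which suggests the .npy
# files themselves are fine and only the reader side disagrees.
print("np.load succeeded:", np.load("sample.npy"))
```

If the intended fix is simply to read with np.load() in sanity_check.py, I am happy to adjust that locally; otherwise, please let me know how the files are meant to be written and read.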

Many thanks!
