Stop trying to guess the length of webdataset datasets #339

Open
wants to merge 3 commits into
base: main
from

Conversation

gabrielilharco
Collaborator

No description provided.

@gabrielilharco
Collaborator Author

@rom1504 here's the PR we were discussing, in case we want to stop trying to guess the length of wds datasets

@usuyama

usuyama commented Jan 6, 2023

I like this change! In some experiments I wanted to override train_num_samples to shorten the epoch, and this change solves that too.

With webdataset, when train_num_samples is set smaller than the whole dataset, do we get a different set of samples each epoch?

@gabrielilharco
Collaborator Author

@usuyama with the current code, you can set --train-num-samples to a smaller number, and if you use --dataset-resampled, you'll always get random samples from the entire pool on each "epoch".
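
To illustrate the behaviour described above, here is a minimal sketch in plain Python. It is not open_clip's or webdataset's actual code: it assumes that resampling means drawing shards with replacement, and that an "epoch" is simply the first `num_samples` items of that endless stream. The names `resampled_shards` and `epoch_samples`, and the shard layout, are hypothetical.

```python
import itertools
import random


def resampled_shards(shards, rng):
    """Yield shard names forever, drawn uniformly with replacement."""
    while True:
        yield rng.choice(shards)


def epoch_samples(shards, samples_per_shard, num_samples, rng):
    """Return exactly num_samples sample keys for one 'epoch'.

    The stream never ends, so the epoch length is set by num_samples
    rather than by the (unknown) size of the dataset.
    """
    def stream():
        for shard in resampled_shards(shards, rng):
            for i in range(samples_per_shard):
                yield f"{shard}/{i:03d}"

    return list(itertools.islice(stream(), num_samples))


rng = random.Random(0)
shards = [f"shard-{k:04d}" for k in range(100)]  # hypothetical pool of 1000 samples

# Two short "epochs" of 50 samples each, both drawn from the entire pool.
epoch_1 = epoch_samples(shards, samples_per_shard=10, num_samples=50, rng=rng)
epoch_2 = epoch_samples(shards, samples_per_shard=10, num_samples=50, rng=rng)
```

Because the random generator carries over between calls, each "epoch" draws fresh shards from the full pool instead of replaying a fixed prefix of the dataset.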
