Method to perform finer-grained selection of ARCs and WARCs #247
Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
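A minimal sketch of what record-level size limiting could look like, assuming an AUT-style `RecordLoader.loadArchives` entry point run from a Spark shell (where `sc` is available); the `getContentBytes` accessor, the example path, and the 10 MB cutoff are illustrative assumptions, not confirmed API:

```scala
import io.archivesunleashed._

// Drop records larger than ~10 MB before any downstream processing
// (threshold is an arbitrary example value).
val maxRecordBytes = 10 * 1024 * 1024

val records = RecordLoader.loadArchives("/path/to/collection/*.warc.gz", sc)
  .filter(r => r.getContentBytes.length <= maxRecordBytes)
```

The idea is simply to filter oversized records out at load time rather than relying on whole archive files being small enough.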
Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.
I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post. But I'll leave it up to him to close it or keep it open.
Please leave open.
We currently have only one method to load ARCs or WARCs.
It'd be nice to have some finer-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may have errors, corruption, etc. A sketch of what that selection could look like follows.
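A minimal sketch of prefix-based selection with exclusions, assuming the existing loader accepts a comma-separated list of Hadoop paths; the helper name `selectArchives`, its parameters, and the example file names are hypothetical:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Hypothetical helper: list the (W)ARC files under `dir`, keep those whose
// names start with `prefix`, and drop any whose names appear in `exclude`.
def selectArchives(sc: SparkContext, dir: String, prefix: String,
                   exclude: Set[String] = Set.empty): Seq[String] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  fs.listStatus(new Path(dir))
    .map(_.getPath)
    .filter(p => p.getName.startsWith(prefix) && !exclude.contains(p.getName))
    .map(_.toString)
    .toSeq
}

// Usage sketch: take everything starting with a prefix except two known-bad
// files, then hand the joined list to the existing loader.
// val paths = selectArchives(sc, "/path/to/collection", "ARCHIVEIT-",
//   Set("bad-1.warc.gz", "bad-2.warc.gz"))
// val records = RecordLoader.loadArchives(paths.mkString(","), sc)
```

Keeping the selection logic outside the loader itself would make it easy to iterate when debugging which files in a large collection are causing failures.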