Method to perform finer-grained selection of ARCs and WARCs #247

Closed
lintool opened this issue Jul 30, 2018 · 6 comments

@lintool
Member

lintool commented Jul 30, 2018

We currently only have one method to load ARCs or WARCs:

RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)

It'd be nice to have some finer-grained control. For example, I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.
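To make the request concrete, here is a minimal sketch of the kind of selection described above: keep archive paths whose file name starts with a prefix, minus an explicit exclusion list. `ArchiveSelection` and `selectArchives` are hypothetical names used only for illustration. They are not part of the toolkit, and the sketch works on plain path strings rather than on `RecordLoader` itself.

```scala
// Hypothetical helper, not part of the toolkit: picks archive paths by
// file-name prefix and drops an explicit exclusion list.
object ArchiveSelection {
  // Returns the file-name component of a path ("/a/b/c.warc.gz" -> "c.warc.gz").
  private def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  // Keeps paths whose file name starts with `prefix` and is not in `exclude`.
  def selectArchives(paths: Seq[String], prefix: String, exclude: Set[String]): Seq[String] =
    paths.filter { p =>
      val name = fileName(p)
      name.startsWith(prefix) && !exclude.contains(name)
    }
}
```

Each selected path could then be passed to `RecordLoader.loadArchives(path, sc)` on its own, which would also make it easier to see which file triggers a failure.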

@borislin
Collaborator

@lintool @ianmilligan1

Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
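As a rough illustration of limiting by record size rather than by archive file, the sketch below drops records whose payload exceeds a byte threshold. `SimpleRecord`, `RecordSizeFilter`, and `withinLimit` are hypothetical names. The toolkit's actual record class and its field names are not shown in this thread, so a stand-in type is assumed here.

```scala
// Hypothetical record type for illustration only; the toolkit's own record
// class and field names are not shown in this thread.
final case class SimpleRecord(url: String, content: Array[Byte])

object RecordSizeFilter {
  // Keeps only records whose payload is at most `maxBytes` bytes long.
  def withinLimit(records: Seq[SimpleRecord], maxBytes: Int): Seq[SimpleRecord] =
    records.filter(_.content.length <= maxBytes)
}
```

On a Spark RDD the same predicate would go into `.filter`, assuming the record size is available before the full payload is materialized.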

@ianmilligan1
Member

Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.

@ruebot
Member

ruebot commented Jan 11, 2019

I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post:

It'd be nice to have some finer-grained control. For example, I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may contain errors, corruption, etc.

But, I'll leave it up to him to close or keep it open.

@ianmilligan1
Member

Sounds good, thanks @lintool @ruebot !

@lintool
Member Author

lintool commented Aug 21, 2019

Please leave open.

@ruebot
Member

ruebot commented May 20, 2022

The spirit of this issue, identifying problematic W/ARCs, will be addressed by #533, so I'm going to close this once that is merged.

The additions to loadArchives here still might be worth bringing in if there is a demand for it in the future.

ruebot closed this as completed in c8fa256 on May 24, 2022