Method to perform finer-grained selection of ARCs and WARCs #247

lintool · 2018-07-30T13:51:36Z

We currently only have one method to load ARCs or WARCs:

RecordLoader.loadArchives("/path/to/many/warcs/*.gz", sc)

It'd be nice to have some fine-grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C. This would help us debug large collections that may have errors, corruption, etc.
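To make the request concrete, here is a minimal sketch of the kind of selection described above: keep every file whose name starts with a prefix, minus an explicit exclusion list. `ArchiveSelection`, `select`, and `toPathList` are hypothetical names for illustration and are not part of the toolkit.

```scala
// Hypothetical helper for illustration only; not an existing API.
object ArchiveSelection {

  /** Returns the final path component, e.g. "/data/a.warc.gz" -> "a.warc.gz". */
  private def fileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  /** Keeps paths whose file name starts with `prefix`,
    * dropping any whose file name appears in `excluded`.
    */
  def select(paths: Seq[String], prefix: String, excluded: Set[String]): Seq[String] =
    paths.filter { p =>
      val name = fileName(p)
      name.startsWith(prefix) && !excluded.contains(name)
    }

  /** Joins the selected paths into one comma-separated string. */
  def toPathList(paths: Seq[String]): String = paths.mkString(",")
}
```

Whether the joined string could then be handed to `RecordLoader.loadArchives` depends on that method accepting comma-separated paths, which is an assumption here and would need checking.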


borislin · 2018-10-12T00:32:31Z

@lintool @ianmilligan1

Based on our discussion on Slack, we would like to limit the size of individual records instead of archive files. A lot of Spark jobs fail because of records that are too big.
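As a rough sketch of what limiting by record size could look like, the filter below drops any record whose payload exceeds a byte limit. `ArchiveRecord` and `RecordSizeFilter` are hypothetical stand-ins for illustration, not the toolkit's actual record type or API; the same predicate shape would apply to a Spark RDD's `filter`.

```scala
// Hypothetical record type for illustration only.
final case class ArchiveRecord(url: String, content: Array[Byte])

object RecordSizeFilter {

  /** Keeps only records whose payload is at most `maxBytes` bytes long. */
  def withinLimit(records: Seq[ArchiveRecord], maxBytes: Int): Seq[ArchiveRecord] =
    records.filter(_.content.length <= maxBytes)
}
```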

ianmilligan1 · 2019-01-11T16:33:19Z

Should this still be an open issue? I don't think we've been running into any ingestion issues lately, including on some very large collections.

ruebot · 2019-01-11T16:35:24Z

I think the work done here is worth revisiting in the future, since I believe it got at the spirit of @lintool's original post:

It'd be nice to have some fine grained control, e.g., I want all (W)ARCs starting with a prefix, except for A, B, and C - this would help us debug large collections that maybe errors/corruption/etc.

But, I'll leave it up to him to close or keep it open.

ianmilligan1 · 2019-01-11T16:38:05Z

Sounds good, thanks @lintool @ruebot !

lintool · 2019-08-21T08:50:29Z

Please leave open.

ruebot · 2022-05-20T18:10:13Z

The spirit of this issue, identifying problematic W/ARCs, will be addressed by #533, so I'm going to close this once it is merged.

The additions to loadArchives here still might be worth bringing in if there is a demand for it in the future.

lintool assigned borislin Jul 30, 2018

borislin mentioned this issue Aug 12, 2018

Refactor loadArchives() function #257

Closed

ruebot added the enhancement and RA-Task labels Aug 14, 2018

ruebot added the in progress label Aug 20, 2018

borislin mentioned this issue Oct 12, 2018

Refactor loadArchives() function to limit size of individual record #275

Closed

ianmilligan1 unassigned borislin Jan 11, 2019

ruebot mentioned this issue May 19, 2022

Remove Java w/arc processing, and replace it with Sparkling. #533

Merged

ruebot closed this as completed in c8fa256 May 24, 2022
