Remove Java w/arc processing, and replace it with Sparkling. #533
Conversation
* Partially address #494
* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18
Codecov Report
@@             Coverage Diff              @@
##               main     #533      +/-   ##
============================================
+ Coverage     88.83%   92.74%    +3.90%
+ Complexity       57       42       -15
============================================
  Files            43       39        -4
  Lines          1012      813      -199
  Branches         85       52       -33
============================================
- Hits            899      754      -145
+ Misses           74       35       -39
+ Partials         39       24       -15
…ntentBytes, getContentString and getPayloadDigest.
I'm fairly certain this is an apples-to-apples comparison, and if it is, it looks like this PR speeds things up slightly. Another thing to note here: in the previous benchmark test (0.50.1), we were giving Spark 500G of RAM. In this test, I only gave Spark 8G of RAM. So, that's a SIGNIFICANT improvement 😉
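For context, here is a minimal sketch of the kind of benchmark run being described, assuming a spark-shell session with the aut fatjar on the classpath; the 8g driver-memory setting, archive path, and crude timing are placeholders, not the exact configuration or harness used above:

```scala
// Illustrative only: assumes a session launched with something like
//   spark-shell --driver-memory 8g --jars aut-fatjar.jar
import io.archivesunleashed._

val records = RecordLoader.loadArchives("/path/to/warcs/*.warc.gz", sc)

val start = System.nanoTime()
// Touch the three accessors mentioned above so each record's content is actually read.
val processed = records
  .map(r => (r.getContentBytes.length, r.getContentString.length, r.getPayloadDigest))
  .count()
val seconds = (System.nanoTime() - start) / 1e9
println(s"Processed $processed records in $seconds seconds")
```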
I'm going to do a couple more tests, and if they're good to go, I'm going to squash and merge this.
PySpark testing is good: archivesunleashed/notebooks@969fbea. The BAnQ problem collection is the last one left!
Good news, bad news.

Bad news: The BAnQ collection still surfaced the #317 errors. At the end of the day, that's because we're still reading records into memory with ….

Good news: We solved this in ARCH with streaming. It shouldn't be that difficult for me to further modify what we have here with Sparkling to pivot to streaming. In addition, we'll need to update the ….

So, I'm going to propose that we squash and merge what we have here, since it's getting a little out of control with the number of issues we're fixing. I've taken #317 out of the resolved column, and will leave that open. Let me know if you're good with that, @ianmilligan1.
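To illustrate the distinction (this is not the actual ARCH or Sparkling code; every name here is hypothetical), the first helper below materializes a record's full payload in memory before doing any work, while the second consumes it incrementally from an InputStream so memory use stays bounded even for very large records:

```scala
import java.io.InputStream
import java.security.MessageDigest

// Hypothetical in-memory approach: the entire payload must already exist as a
// byte array, which is what causes trouble on very large records.
def digestInMemory(payload: Array[Byte]): String =
  MessageDigest.getInstance("MD5").digest(payload).map("%02x".format(_)).mkString

// Hypothetical streaming approach: the payload is read in fixed-size chunks,
// so only one small buffer is ever held in memory at a time.
def digestStreaming(in: InputStream): String = {
  val md = MessageDigest.getInstance("MD5")
  val buffer = new Array[Byte](8192)
  var read = in.read(buffer)
  while (read != -1) {
    md.update(buffer, 0, read)
    read = in.read(buffer)
  }
  md.digest().map("%02x".format(_)).mkString
}
```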
Forgot to add something; we could keep the ….
Makes a lot of sense to me, @ruebot - happy to test + merge this PR once you're good to go with it.
Agreed, especially given the jobs we have around binary files; it would be misleading to, say, get a dump of video records that excludes the biggest ones.
Yeah, if you want to give it some tests, please do! If you want to merge it as well, let me know and we can coordinate the commit message like we've done in the past.
Noted in Slack, but it builds nicely and, more importantly, it works. I noticed how clean the text extractions are compared to earlier ones; way less noise. Kudos and congratulations! 🎉🎉🎉
GitHub issue(s):
What does this Pull Request do?
Removes legacy Java w/arc processing and replaces it with Sparkling. There is no longer any Java code in the project.
In addition, a couple of other issues have been resolved; see the list of resolved issues above.
How should this be tested?
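One possible smoke test, assuming a spark-shell session with the aut fatjar on the classpath; the archive path below is a placeholder, and the exact API may differ slightly from this sketch:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a sample W/ARC collection through the new Sparkling-backed reader and
// confirm that valid pages come back with sensible crawl dates and domains.
RecordLoader.loadArchives("/path/to/sample/*.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractDomain(r.getUrl)))
  .take(10)
  .foreach(println)
```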
Additional Notes:
This work is co-authored by @helgeho. I'm very grateful for his help on this one, and also very grateful for Sparkling being open sourced now.
Also of note, Sparkling is now being pulled in at the latest commit via JitPack. We might want to pin it to a specific commit hash so things stay stable there.
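For reference, a sketch of what pinning might look like in an sbt build that pulls from JitPack; the group/artifact coordinates and the hash are placeholders, not the project's actual Sparkling coordinates:

```scala
// build.sbt (illustrative): resolve from JitPack and pin the dependency to a
// specific commit hash instead of a moving branch snapshot.
resolvers += "jitpack" at "https://jitpack.io"

// Placeholder coordinates: "com.github.<org>" % "<repo>" % "<commit-hash>"
libraryDependencies += "com.github.example-org" % "sparkling" % "0123456abcdef"
```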