Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Java w/arc processing, and replace it with Sparkling. #533

Merged
merged 12 commits into from
May 24, 2022
Merged

Conversation

ruebot
Copy link
Member

@ruebot ruebot commented May 18, 2022

GitHub issue(s):

What does this Pull Request do?

Removes legacy Java w/arc processing and replaces it with Sparkling. There is no longer any Java code in the project.

In addition, a couple other issues have been resolved:

How should this be tested?

Additional Notes:

This work is co-authored by @helgeho. I'm very grateful for his help on this one, and also very grateful for Sparkling being open sourced now.

Also of note, Sparkling is being pull in by the latest commit now using Jitpack. We might want to pin it to a specific commit hash so things are stable there.

ruebot and others added 7 commits September 30, 2021 13:40
* Partially address #494
* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18
@codecov
Copy link

codecov bot commented May 18, 2022

Codecov Report

Merging #533 (e75d1b1) into main (8104a65) will increase coverage by 3.90%.
The diff coverage is 86.81%.

@@             Coverage Diff              @@
##               main     #533      +/-   ##
============================================
+ Coverage     88.83%   92.74%   +3.90%     
+ Complexity       57       42      -15     
============================================
  Files            43       39       -4     
  Lines          1012      813     -199     
  Branches         85       52      -33     
============================================
- Hits            899      754     -145     
+ Misses           74       35      -39     
+ Partials         39       24      -15     

…ntentBytes, getContentString and getPayloadDigest.
@ruebot
Copy link
Member Author

ruebot commented May 20, 2022

I'm fairly certain this is an apples to apples comparison, and if it is, if looks like this PR speeds things up slightly:

v0 50 1 vs issue-494

Other thing to note here, is that in the previous benchmark test (0.50.1), we were giving Spark 500G of RAM. In this test, I only gave Spark 8G of RAM. So, that's a SIGNIFICANT improvement 😉

@ruebot
Copy link
Member Author

ruebot commented May 20, 2022

I'm going to do a couple more tests, and if they're good to go, I'm going to squash and merge this.

ruebot added a commit to archivesunleashed/notebooks that referenced this pull request May 20, 2022
@ruebot
Copy link
Member Author

ruebot commented May 20, 2022

PySpark testing is good: archivesunleashed/notebooks@969fbea

BAnQ problem collection is the last one left!

* Documentation and formatting updates
* Remove unneeded getContentBytes
* Add missing spreadsheet mimetype
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 22, 2022
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request May 22, 2022
@ruebot
Copy link
Member Author

ruebot commented May 24, 2022

Good news, bad news.

Bad news: The BAnQ collection surfaced the #317 errors still. At the end of the day, that's because we're still reading records into memory with getBinaryBytes. If it's a big record, and malformed record, we're going to overflow RAM, and we can't really get around that.

Good news: We solved this in ARCH with streaming. It shouldn't be that difficult for me to further modify what we have here with Sparkling to pivot to streaming. In addition, we'll need to update the matchbox utilities to compensate for streaming as well. We did this in ARCH, so again, shouldn't be too difficult.

So, I'm going to propose that we squash and merge what we have here since it's getting a little out of control with the number of issues we're fixing. I've taken #317 out of the resolved column, and will leave that open. Let me know if you're good with that @ianmilligan1

@ruebot
Copy link
Member Author

ruebot commented May 24, 2022

Forgot to add something; we could keep the getBinaryBytes method, but we'd need to limit or discard records over a given size. I don't like this. It make the archivist/scientist in me uncomfortable given what aut is supposed to do.

@ianmilligan1
Copy link
Member

So, I'm going to propose that we squash and merge what we have here since it's getting a little out of control with the number of issues we're fixing. I've taken #317 out of the resolved column, and will leave that open. Let me know if you're good with that @ianmilligan1

Makes a lot of sense to me, @ruebot - happy to test + merge this PR once you're good to go with it.

Forgot to add something; we could keep the getBinaryBytes method, but we'd need to limit or discard records over a given size. I don't like this. It make the archivist/scientist in me uncomfortable given what aut is supposed to do.

Agreed - esp. given the jobs we have around binary files - would be misleading to say get a dump of video records that excludes the biggest ones or something.

@ruebot
Copy link
Member Author

ruebot commented May 24, 2022

Yeah, if you want to give it some tests, please do! If you want to merge it as well, let me know and we can coordinate the commit message like we've done in the past.

Copy link
Member

@ianmilligan1 ianmilligan1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted in Slack, but builds nicely and more importantly it works. I noticed how clean the text extractions are compared to earlier ones - way less noise. Kudos and congratulations! 🎉🎉🎉

@ruebot ruebot merged commit c8fa256 into main May 24, 2022
@ruebot ruebot deleted the issue-494 branch May 24, 2022 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants