Remove Java w/arc processing, and replace it with Sparkling. #533
Conversation
* Partially address #494
* fix discardDate issue
* update tests for #494
* add test for #493
* add test for #532
* move issue specific tests to their own directory
* add copyright statement to SparklingArchiveRecord
* move webarchive-commons back to 1.1.9
* resolves #532
* resolves #494
* resolves #493
* resolves #492
* resolves #317
* resolves #260
* resolves #182
* resolves #76
* resolves #74
* resolves #73
* resolves #23
* resolves #18
Codecov Report
@@             Coverage Diff              @@
##               main     #533      +/-   ##
============================================
+ Coverage     88.83%   92.74%    +3.90%
+ Complexity       57       42       -15
============================================
  Files            43       39        -4
  Lines          1012      813      -199
  Branches         85       52       -33
============================================
- Hits            899      754      -145
+ Misses           74       35       -39
+ Partials         39       24       -15
…ntentBytes, getContentString and getPayloadDigest.
I'm fairly certain this is an apples-to-apples comparison, and if it is, it looks like this PR speeds things up slightly. Another thing to note here: in the previous benchmark test (0.50.1), we were giving Spark 500G of RAM. In this test, I only gave Spark 8G of RAM. So, that's a SIGNIFICANT improvement 😉
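For context, here is a minimal sketch of the kind of benchmark run being described, assuming a spark-shell session with the aut fatjar on the classpath; the 8g driver-memory setting, archive path, and crude timing are placeholders, not the exact configuration or harness used above:

```scala
// Illustrative only: assumes a session launched with something like
//   spark-shell --driver-memory 8g --jars aut-fatjar.jar
import io.archivesunleashed._

val records = RecordLoader.loadArchives("/path/to/warcs/*.warc.gz", sc)

val start = System.nanoTime()
// Touch the three accessors mentioned above so each record's content is actually read.
val processed = records
  .map(r => (r.getContentBytes.length, r.getContentString.length, r.getPayloadDigest))
  .count()
val seconds = (System.nanoTime() - start) / 1e9
println(s"Processed $processed records in $seconds seconds")
```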
I'm going to do a couple more tests, and if they're good to go, I'm going to squash and merge this.
PySpark testing is good: archivesunleashed/notebooks@969fbea. The BAnQ problem collection is the last one left!
Good news, bad news.

Bad news: The BAnQ collection still surfaced the #317 errors. At the end of the day, that's because we're still reading records into memory with ….

Good news: We solved this in ARCH with streaming. It shouldn't be that difficult for me to further modify what we have here with Sparkling to pivot to streaming. In addition, we'll need to update the ….

So, I'm going to propose that we squash and merge what we have here, since it's getting a little out of control with the number of issues we're fixing. I've taken #317 out of the resolved column, and will leave that open. Let me know if you're good with that, @ianmilligan1.
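To illustrate the distinction (this is not the actual ARCH or Sparkling code; every name here is hypothetical), the first helper below materializes a record's full payload in memory before doing any work, while the second consumes it incrementally from an InputStream so memory use stays bounded even for very large records:

```scala
import java.io.InputStream
import java.security.MessageDigest

// Hypothetical in-memory approach: the entire payload must already exist as a
// byte array, which is what causes trouble on very large records.
def digestInMemory(payload: Array[Byte]): String =
  MessageDigest.getInstance("MD5").digest(payload).map("%02x".format(_)).mkString

// Hypothetical streaming approach: the payload is read in fixed-size chunks,
// so only one small buffer is ever held in memory at a time.
def digestStreaming(in: InputStream): String = {
  val md = MessageDigest.getInstance("MD5")
  val buffer = new Array[Byte](8192)
  var read = in.read(buffer)
  while (read != -1) {
    md.update(buffer, 0, read)
    read = in.read(buffer)
  }
  md.digest().map("%02x".format(_)).mkString
}
```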
Forgot to add something; we could keep the ….
Makes a lot of sense to me, @ruebot - happy to test + merge this PR once you're good to go with it.
Agreed, especially given the jobs we have around binary files; it would be misleading to, say, get a dump of video records that excludes the biggest ones.
Yeah, if you want to give it some tests, please do! If you want to merge it as well, let me know and we can coordinate the commit message like we've done in the past.
Noted in Slack, but it builds nicely and, more importantly, it works. I noticed how clean the text extractions are compared to earlier ones; way less noise. Kudos and congratulations! 🎉🎉🎉
GitHub issue(s):
What does this Pull Request do?
Removes legacy Java w/arc processing and replaces it with Sparkling. There is no longer any Java code in the project.
In addition, a couple of other issues have been resolved; see the list of resolved issues above.
How should this be tested?
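One possible smoke test, assuming a spark-shell session with the aut fatjar on the classpath; the archive path below is a placeholder, and the exact API may differ slightly from this sketch:

```scala
import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Load a sample W/ARC collection through the new Sparkling-backed reader and
// confirm that valid pages come back with sensible crawl dates and domains.
RecordLoader.loadArchives("/path/to/sample/*.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractDomain(r.getUrl)))
  .take(10)
  .foreach(println)
```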
Additional Notes:
This work is co-authored by @helgeho. I'm very grateful for his help on this one, and also very grateful for Sparkling being open sourced now.
Also of note, Sparkling is now being pulled in at the latest commit via JitPack. We might want to pin it to a specific commit hash so things stay stable there.
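For reference, a sketch of what pinning might look like in an sbt build that pulls from JitPack; the group/artifact coordinates and the hash are placeholders, not the project's actual Sparkling coordinates:

```scala
// build.sbt (illustrative): resolve from JitPack and pin the dependency to a
// specific commit hash instead of a moving branch snapshot.
resolvers += "jitpack" at "https://jitpack.io"

// Placeholder coordinates: "com.github.<org>" % "<repo>" % "<commit-hash>"
libraryDependencies += "com.github.example-org" % "sparkling" % "0123456abcdef"
```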