Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize checksum calculation on verify #861

Merged
merged 1 commit into from
Mar 25, 2024
Merged

Conversation

aryanA101a
Copy link
Contributor

@aryanA101a aryanA101a commented Mar 2, 2024

Related to #614

  • I modified the FileImpl::Verify checksum algo to calculate the checksum in pieces of 1024 bytes to save on cpu cycles and IO requests.
  • Around 2.6x fast (tested on 16 core AMD EPYC 7B13 Gitpod)
Size Previous Optimized
31 MB 0.73s 0.27s
4 GB 96.84s 38.17s

@kelson42
Copy link
Contributor

kelson42 commented Mar 2, 2024

@aryanA101a Thank you for your promising PR. I really appreciate this PR if confirmed to be done the right way, but probably too short to close the ticket.

@kelson42 kelson42 requested a review from mgautierfr March 2, 2024 17:50
Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR.

This is a tricky part here.
I wonder if we don't have to change the logic by having only one loop, reading as long as we have data to read (instead of a while loop inside a for loop). Something like:

auto remaining = checkSumPos;
auto part_iter = zimFile->begin();
std::ifstream current_stream(part->second->filename(), ...);
while (remaining) {
    stream.read(reinterpret_cast<char*>(ch), min(piece_size, remaining));
    zim_MD5Update(&md5ctx, ch, stream.gcount());
    remaining -= stream.gcount();

    if (remaining == 0) {
        // we have read everything
        break;
    }
    if (!stream.good()) {
        part_iter++;
        current_stream = std::ifstream(part->second->filename(), ...);
    }
}

(This is the global idea, error checking and stuff have to be added)


Have you tried with a chunk bigger than 1024 ? How it behaves ?

src/fileimpl.cpp Outdated Show resolved Hide resolved
src/fileimpl.cpp Outdated Show resolved Hide resolved
src/fileimpl.cpp Outdated Show resolved Hide resolved
src/fileimpl.cpp Outdated Show resolved Hide resolved
@aryanA101a
Copy link
Contributor Author

aryanA101a commented Mar 4, 2024

Thanks for the PR.

This is a tricky part here. I wonder if we don't have to change the logic by having only one loop, reading as long as we have data to read (instead of a while loop inside a for loop). Something like:

auto remaining = checkSumPos;
auto part_iter = zimFile->begin();
std::ifstream current_stream(part->second->filename(), ...);
while (remaining) {
    stream.read(reinterpret_cast<char*>(ch), min(piece_size, remaining));
    zim_MD5Update(&md5ctx, ch, stream.gcount());
    remaining -= stream.gcount();

    if (remaining == 0) {
        // we have read everything
        break;
    }
    if (!stream.good()) {
        part_iter++;
        current_stream = std::ifstream(part->second->filename(), ...);
    }
}

From the beginning, there were two loops(nested for loops).
Because according to FileCompound API, we can access the zimfile in parts, and we have to read from each part.

class FileCompound : private std::map<Range, FilePart*, less_range> {
    typedef std::map<Range, FilePart*, less_range> ImplType;

  public: // types
    typedef const_iterator PartIterator;
    typedef std::pair<PartIterator, PartIterator> PartRange;

  public: // functions
    explicit FileCompound(const std::string& filename);
...

(This is the global idea, error checking and stuff have to be added)

Have you tried with a chunk bigger than 1024 ? How it behaves ?

Yes, I have tried more chunk sizes.
On a 4gb zim file.

Chunk Size Time
128 37.77s
1024 36.89s
2048 36.63s

I think 1024 is the sweet spot here.

Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still have few small changes but we are mostly good.
Please squash your commits before the next review round.

src/fileimpl.h Outdated Show resolved Hide resolved
src/fileimpl.cpp Outdated Show resolved Hide resolved
src/fileimpl.cpp Outdated Show resolved Hide resolved
src/fileimpl.cpp Show resolved Hide resolved
@mgautierfr
Copy link
Collaborator

From the beginning, there were two loops(nested for loops).
Because according to FileCompound API, we can access the zimfile in parts, and we have to read from each part.

Yes, but as we refactor, we may remove the nested loops. My solution also loop on the parts, it "simply" not using a while or for.

Anyway, you last proposition is mostly good, so let's continue with it.

@kelson42 kelson42 added this to the 9.2.0 milestone Mar 7, 2024
Copy link

codecov bot commented Mar 9, 2024

Codecov Report

Attention: Patch coverage is 30.00000% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 58.00%. Comparing base (8d978e1) to head (5e5c0b8).

Files Patch % Lines
src/fileimpl.cpp 30.00% 0 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #861      +/-   ##
==========================================
- Coverage   58.04%   58.00%   -0.05%     
==========================================
  Files         101      101              
  Lines        4617     4622       +5     
  Branches     1921     1925       +4     
==========================================
+ Hits         2680     2681       +1     
  Misses        667      667              
- Partials     1270     1274       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@aryanA101a aryanA101a force-pushed the main branch 2 times, most recently from ea2cdbe to 02f7c68 Compare March 10, 2024 07:42
@aryanA101a aryanA101a requested a review from mgautierfr March 13, 2024 06:24
Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two last tinny changes before merging:

  • Wrong indentation
  • Previous comment about CHUNCK_SIZE vs stream.gcount()

src/fileimpl.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the PR !

@mgautierfr mgautierfr merged commit 96afb38 into openzim:main Mar 25, 2024
29 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants