How is data integrity verification handled? #16

Open
kript opened this issue Feb 22, 2019 · 2 comments
@kript

kript commented Feb 22, 2019

More of a question than a bug report but might turn into a feature request...

TL;DR: How is checksumming expected to work in this plugin, both on upload and when using the ichksum command later to verify, given the chunked nature of objects within the resource?

At the moment, as I understand it, a replica of an object is stored in 4MB chunks across the Rados 'bucket'. Therefore, to perform a checksum, the file must be downloaded and reassembled before ichksum can be usefully run against it.

Is that correct? If so, how would ichksum -a be expected to work on a tree with a replication node, meaning that there is more than one copy and one of them is held on the librados back end? For that matter, are tools like iscan and ifsck supported?

I can see that irods/irods#2796 would be useful here, but I'm wondering whether there are any other thoughts on ways to ensure data integrity without having to read every file back from the bucket!

Cheers

John
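
For illustration only (this is not plugin code): a minimal Python sketch of the chunking point raised above. It assumes SHA-256 as the hash and uses in-memory bytes in place of real RADOS reads; `split_into_chunks` and `streaming_sha256` are hypothetical names. It shows that feeding the chunks to the hash in order gives the same digest as hashing the whole object, so the file would not need to be reassembled on disk first, although every byte still has to be read back.

```python
import hashlib
from typing import Iterable, Iterator

# Assumption: 4 MB chunks, as described in the comment above.
CHUNK_SIZE = 4 * 1024 * 1024


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Stand-in for reading an object's chunks back from storage, in order."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def streaming_sha256(chunks: Iterable[bytes]) -> str:
    """Hash chunks incrementally; only one chunk is held in memory at a time."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    payload = b"example object contents" * 1000
    chunked = streaming_sha256(split_into_chunks(payload, chunk_size=4096))
    whole = hashlib.sha256(payload).hexdigest()
    print(chunked == whole)
```

The chunk order matters: hashing the same chunks in a different order would produce a different digest, so any real implementation would need to read them back in object order.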

@trel
Member

trel commented Nov 10, 2020

Saw your link back to here from the SoftIron conversation. The rados plugin itself could 'trust' the storage to provide these types of calculations/values. Another option is to not use the plugin at all, and just use unixfilesystem via CephFS (and perhaps grow a setting that itself... trusts the storage for checksum information).

Otherwise, yes, this is a challenge. And I think you're still ahead of it - we haven't faced this question from others yet, even nearly two years after you posted this.

@jasoncoposky
Member

Within iRODS, every checksum computation for every replica is a full read from storage and a compute. We have discussed moving the checksum operation from an RPC API and delegating it to the underlying storage architecture, which may provide quicker and better assurances (e.g. erasure coding) that the data is correct at rest. Given that, we could rely on assurances from Ceph that the data is correct, given your own configuration of the storage and iRODS.
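
As a rough illustration of the "full read and a compute per replica" cost described above, here is a small Python sketch. It is not iRODS code: the replica names, the `checksum_replicas` and `mismatched` helpers, and the choice of SHA-256 are all assumptions made for the example, and each replica is modelled as an iterable of byte chunks rather than a real storage resource.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def checksum_replicas(replicas: Dict[str, Iterable[bytes]]) -> Tuple[Dict[str, str], int]:
    """Checksum every replica independently and count the bytes read in total."""
    digests: Dict[str, str] = {}
    bytes_read = 0
    for name, chunks in replicas.items():
        digest = hashlib.sha256()
        for chunk in chunks:
            digest.update(chunk)
            bytes_read += len(chunk)  # every byte of every replica is read
        digests[name] = digest.hexdigest()
    return digests, bytes_read


def mismatched(digests: Dict[str, str], reference: str) -> List[str]:
    """Names of replicas whose digest differs from the reference replica's."""
    expected = digests[reference]
    return sorted(name for name, value in digests.items() if value != expected)


if __name__ == "__main__":
    data = b"x" * 1000
    replicas = {
        "unixfs": [data],                 # hypothetical single-file replica
        "rados": [data[:400], data[400:]],  # hypothetical chunked replica
    }
    digests, total = checksum_replicas(replicas)
    print(total, mismatched(digests, "unixfs"))
```

The total bytes read grows with object size times replica count, which is the cost that delegating verification to the storage layer would aim to avoid.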
