Chunking and verification failure #29
Hi, do you have an example before-and-after file with chunking? |
I'm just the messenger 😛 This was observed by another team and I was warned that my transfer may be similarly affected. So far, I've been unable to replicate the problem and, in my independent verification, I've not found any corrupt files. I'm thus not convinced that there is a bug in PCP; it may be that the other team has some exotic setup that I'm not privy to that's causing the problem... I will update if I can actually reproduce it. |
I've not been able to reproduce either. Looking at the code, there is a
theoretical way it could happen: if there was a silent corruption when
reading the original file, then the md5sum will be calculated on the bad
data, which is copied, and then the verified md5sum will match. It isn't
obvious to me what the correct way to catch that behaviour is. Getting a
binary diff of the affected file will definitely be helpful in seeing what
is going on.
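To make the theoretical failure above concrete, here is a minimal Python sketch. It is not pcp's code: `flaky_read` is a hypothetical stand-in for a read that silently returns bad data, and the "copy" is just an in-memory byte string. It only illustrates why a checksum taken from the data that was read cannot notice corruption that happened during that read.

```python
import hashlib


def md5(data: bytes) -> str:
    """Hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# What is really stored at the source (illustrative data only).
true_source = b"payload" * 1000


def flaky_read() -> bytes:
    """Hypothetical read that silently flips one byte on the way in."""
    data = bytearray(true_source)
    data[10] ^= 0xFF
    return bytes(data)


buffer = flaky_read()          # corrupted on read, with no error raised
source_sum = md5(buffer)       # checksum is computed on the bad data
destination = bytes(buffer)    # the bad data is what gets "copied"
dest_sum = md5(destination)

# The copy-time verification agrees with itself...
verification_passes = source_sum == dest_sum
# ...while a checksum of the real source content does not match.
independent_check_passes = md5(true_source) == dest_sum

print(verification_passes, independent_check_passes)
```

The only way to catch this case is a second, independent read of the source, which is why a binary diff of an affected file is so useful.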
--
Dr. Guy Coates
+44(0)7801 710224
|
Hi Guy, The file that I saw was the same size and had a large chunk of zeros at the beginning. I will ask the group if they can provide an example of the two files if you want. I was thinking about the same issue you commented on above, and my only thought was a paranoid mode which re-reads the source at the same time as checksumming the destination. It also occurred to me that the work done to ensure that the reads and verifies occur on different machines would be made less effective if you did a smeared read across all the clients. -- |
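A rough Python sketch of what such a paranoid mode could look like, assuming it simply re-reads each source chunk from disk and compares its MD5 with that of the matching destination chunk. The names `chunk_md5` and `paranoid_verify` are made up for illustration and are not part of pcp; the demo files are tiny temporary files, with a block of nulls planted at the start of the destination to mimic the corrupt file described above.

```python
import hashlib
import os
import tempfile


def chunk_md5(path, offset, length):
    """MD5 of `length` bytes starting at `offset`, read straight from the file."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(1 << 20, remaining))
            if not block:
                break
            digest.update(block)
            remaining -= len(block)
    return digest.hexdigest()


def paranoid_verify(src, dst, chunk_size):
    """Return the indices of chunks whose source and destination MD5s differ."""
    size = os.path.getsize(src)
    if os.path.getsize(dst) != size:
        raise ValueError("source and destination sizes differ")
    bad = []
    for index, offset in enumerate(range(0, size, chunk_size)):
        length = min(chunk_size, size - offset)
        if chunk_md5(src, offset, length) != chunk_md5(dst, offset, length):
            bad.append(index)
    return bad


# Demo: the destination has the right size but nulls where chunk 0 should be.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    data = bytes(range(256)) * 16            # 4096 bytes, four 1024-byte chunks
    with open(src, "wb") as f:
        f.write(data)
    with open(dst, "wb") as f:
        f.write(b"\0" * 1024 + data[1024:])
    bad_chunks = paranoid_verify(src, dst, 1024)

print(bad_chunks)
```

The cost is a second full read of the source, so this would presumably be an opt-in mode rather than the default.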
Hi James,
Was the file full of zeros, or nulls?
Thanks,
Guy
|
On 2 Mar 2018, at 08:37, Guy Coates ***@***.***> wrote:
Hi James,
Was the file full of zeros, or nulls?
Thanks,
Guy
From memory it was nulls.
\0
|
Ok, I think I have a likely explanation. pcp treats "Permission denied" errors on read as non-fatal. With chunking, destination files are created as sparse files before the chunks are copied. If there is a (transient) permission denied error (temporary MDT/ldap overload?), then the chunk being accessed at the time will never be copied or checksummed, and you end up with a file with chunks missing.
I've reproduced it below: the file highlighted in bold (2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar) started off with no read permissions, but was opened up halfway through the copy. You can see that only chunk 1 was copied, but the file ends up as the correct size at the destination.
I'm about to go on holiday, so won't have a chance to fix this for a while; I'm guessing someone at WSTI can knock up a patch now that the suspect code has been identified. (There might be similar issues with the "non existent file" error handling too....)
mpirun -n 4 ~/work/pcp/pcp -c -v -b 1000 1 2
R0: All workers have reported in.
Copy Statisics:
ls -al 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar |
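The mechanism described in this comment can be imitated in a few lines of Python. This is a sketch under assumptions, not pcp's actual code: `copy_chunked` is a made-up function, the permission error is simulated by raising `PermissionError` for chosen chunk indices, and the destination is pre-sized with `truncate`, which yields a sparse file on most filesystems.

```python
import os
import tempfile

CHUNK = 1024  # illustrative chunk size in bytes


def copy_chunked(src, dst, chunk_size, unreadable):
    """Copy src to dst chunk by chunk, skipping chunks that raise PermissionError.

    `unreadable` is a set of chunk indices for which a permission error is
    simulated. Returns the list of skipped chunk indices.
    """
    size = os.path.getsize(src)
    with open(dst, "wb") as out:
        out.truncate(size)               # pre-size the destination (a hole of nulls)
    skipped = []
    for index, offset in enumerate(range(0, size, chunk_size)):
        try:
            if index in unreadable:
                raise PermissionError("simulated transient permission denied")
            with open(src, "rb") as f:
                f.seek(offset)
                data = f.read(chunk_size)
        except PermissionError:
            skipped.append(index)        # treated as non-fatal: warn and carry on
            continue
        with open(dst, "r+b") as out:
            out.seek(offset)
            out.write(data)
    return skipped


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.bin")
    dst = os.path.join(tmp, "dst.bin")
    with open(src, "wb") as f:
        f.write(b"\x01" * (4 * CHUNK))
    # Mirror the reproduction: only chunk 1 is readable.
    skipped = copy_chunked(src, dst, CHUNK, unreadable={0, 2, 3})
    same_size = os.path.getsize(src) == os.path.getsize(dst)
    with open(dst, "rb") as f:
        copied = f.read()

first_chunk_is_nulls = copied[:CHUNK] == b"\0" * CHUNK
second_chunk_intact = copied[CHUNK:2 * CHUNK] == b"\x01" * CHUNK
print(skipped, same_size, first_chunk_is_nulls, second_chunk_intact)
```

The destination ends up the same size as the source, with nulls wherever a chunk was skipped, which matches the corrupt file reported earlier in the thread. Treating a skipped chunk as fatal for the whole file, or removing the pre-sized destination on error, would avoid leaving such a file behind.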
Well done Guy, have a good holiday…
… On 2 Mar 2018, at 13:47, Guy Coates ***@***.***> wrote:
mpirun -n 4 ~/work/pcp/pcp -c -v -b 1000 1 2
R0: All workers have reported in.
Starting 4 processes.
Files larger than 1000 Mbytes will be copied in parallel chunks.
Will md5 verify copies.
Starting phase I: Scanning and copying directory structure...
Phase I done: Scanned 2 files, 1 dirs in 00 hrs 00 mins 00 secs (1930 items/sec).
2 files will be copied.
Starting phase II: Copying files...
R3: 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar Large file : copying in 4 chunks.
R2: 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar Large file : copying in 4 chunks.
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 1000.00 Mbytes (56.58 Mbytes/s)
R1: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 1000.00 Mbytes (55.88 Mbytes/s)
R3: Mar 02 13:27:21 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 1000.00 Mbytes (54.92 Mbytes/s)
R2: Mar 02 13:27:27 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 184.55 Mbytes (28.80 Mbytes/s)
R1: Mar 02 13:27:29 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 1000.00 Mbytes (113.17 Mbytes/s)
R3: Mar 02 13:27:30 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 md5sum verified (17978c81a63b610bccc09c346cb2499a)
R3: Mar 02 13:27:34 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 md5sum verified (6b72a02a769435060fcec17a40808676)
R2: Mar 02 13:27:36 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 md5sum verified (fc1e9b57bee910d322f8616d86bdc1b5)
R1: Mar 02 13:27:38 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 md5sum verified (4426783debafee77a8489c7bf6114232)
R3: Mar 02 13:27:41 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 md5sum verified (e1664cca44e0cdcdcf22e487e470f0bd)
Phase II done.
R0: Sending SHUTDOWN to workers
rank 1 shutdown
rank 2 shutdown
rank 3 shutdown
R0: Gathering results
Copy Statisics:
Rank 1 copied 1.95 Gbytes in 2 files (74.81 Mbytes/s)
Rank 1 checksummed 1000.00 Mbytes in 1 files (108.97 Mbytes/s)
Rank 2 copied 1.16 Gbytes in 2 files (49.17 Mbytes/s)
Rank 2 checksummed 1000.00 Mbytes in 1 files (102.97 Mbytes/s)
Rank 3 copied 1000.00 Mbytes in 1 files (54.90 Mbytes/s)
Rank 3 checksummed 2.13 Gbytes in 3 files (107.88 Mbytes/s)
Total data copied: 4.09 Gbytes in 5 files (108.78 Mbytes/s)
Total Time for copy: 00 hrs 00 mins 38 secs
ls -al 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
-r--r--r-- 1 gcoates gcoates 3341621760 Feb 28 14:13 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
ls -al 2/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
-rw-rw-r-- 1 gcoates gcoates 3341621760 Mar 2 13:27 2/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
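Until the error handling is fixed, a verbose log like the one above can be checked after the fact for chunked files that were never fully copied and verified. The following Python sketch does that by parsing the three line shapes visible in this log ("copying in N chunks", "copied ... chunk N", "chunk N md5sum verified"). The regular expressions are inferred from this one log, so treat them as assumptions rather than a stable pcp output format; `incomplete_files` is a made-up helper name.

```python
import re
from collections import defaultdict

# A subset of the log lines quoted above, copied verbatim.
SAMPLE_LOG = """\
R3: 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar Large file : copying in 4 chunks.
R2: 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar Large file : copying in 4 chunks.
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 1000.00 Mbytes (56.58 Mbytes/s)
R1: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 1000.00 Mbytes (55.88 Mbytes/s)
R3: Mar 02 13:27:21 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 1000.00 Mbytes (54.92 Mbytes/s)
R2: Mar 02 13:27:27 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 184.55 Mbytes (28.80 Mbytes/s)
R1: Mar 02 13:27:29 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 1000.00 Mbytes (113.17 Mbytes/s)
R3: Mar 02 13:27:30 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 md5sum verified (17978c81a63b610bccc09c346cb2499a)
R3: Mar 02 13:27:34 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 md5sum verified (6b72a02a769435060fcec17a40808676)
R2: Mar 02 13:27:36 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 md5sum verified (fc1e9b57bee910d322f8616d86bdc1b5)
R1: Mar 02 13:27:38 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 md5sum verified (4426783debafee77a8489c7bf6114232)
R3: Mar 02 13:27:41 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 md5sum verified (e1664cca44e0cdcdcf22e487e470f0bd)
"""

# Patterns inferred from the log above (assumed, not a documented format).
PLAN = re.compile(r"^R\d+: (\S+) Large file : copying in (\d+) chunks\.")
COPIED = re.compile(r" copied (\S+) chunk (\d+) ")
VERIFIED = re.compile(r" (\S+) chunk (\d+) md5sum verified")


def incomplete_files(log_text):
    """Map each chunked file to the chunk indices not both copied and verified."""
    expected = {}
    copied = defaultdict(set)
    verified = defaultdict(set)
    for line in log_text.splitlines():
        match = PLAN.match(line)
        if match:
            expected[match.group(1)] = int(match.group(2))
            continue
        match = COPIED.search(line)
        if match:
            copied[match.group(1)].add(int(match.group(2)))
            continue
        match = VERIFIED.search(line)
        if match:
            verified[match.group(1)].add(int(match.group(2)))
    report = {}
    for path, count in expected.items():
        done = copied[path] & verified[path]
        missing = sorted(set(range(count)) - done)
        if missing:
            report[path] = missing
    return report


report = incomplete_files(SAMPLE_LOG)
print(report)
```

This only covers files that were announced as chunked ("Large file"); files copied in one piece do not appear in the report.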
|
The PCP available on the farm is a slightly modified version of e4b161c (as of 2018-02-27)
PCP has been observed to incorrectly chunk files and falsely verify said chunks, based on presumably the same underlying problem. That is, chunks are copied incorrectly and the MD5 sum is based on what's copied, rather than the actual source, so it passes verification.
One can work around this problem by specifying the -b0 option (i.e., no chunking).