Chunking and verification failure #29

Open
Xophmeister opened this issue Feb 27, 2018 · 8 comments


@Xophmeister

The PCP available on the farm is a slightly modified version of e4b161c (as of 2018-02-27).

PCP has been observed to chunk files incorrectly and then falsely verify those chunks, presumably because of the same underlying problem. That is, chunks are copied incorrectly and the MD5 sum is computed from what was copied rather than from the actual source, so verification passes.

One can work around this problem by specifying the -b0 option (i.e., no chunking).

@guycoates
Collaborator

Hi,

Do you have an example before-and-after file with chunking?

@Xophmeister
Author

I'm just the messenger 😛 This was observed by another team and I was warned that my transfer may be similarly affected. So far, I've been unable to replicate the problem and, in my independent verification, I've not found any corrupt files. I'm thus not convinced that there is a bug in PCP; otherwise, the other team has some exotic setup that I'm not privy to that's causing the problem... I will update if I can actually reproduce it.

@guycoates
Collaborator

guycoates commented Feb 28, 2018 via email

@jbeal-work

Hi Guy,

The file that I saw was the same size as the source and had a large chunk of zeros at the beginning. I will ask the group if they can provide an example of the two files if you want.

I was thinking about the same issue you commented on above, and my only thought was a paranoid mode that re-reads the source at the same time as checksumming the destination.
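To make the "paranoid mode" idea concrete, here is a minimal sketch of what such a check could look like: hash the same byte range in both the source and the destination and compare. This is not pcp's code; `chunk_md5` and `paranoid_verify` are hypothetical names made up for illustration, and a real implementation would still have to deal with MPI ranks and page-cache effects.

```python
import hashlib


def chunk_md5(path, offset, length, bufsize=1 << 20):
    """Return the MD5 hex digest of `length` bytes of `path` starting at `offset`."""
    digest = hashlib.md5()
    remaining = length
    with open(path, "rb") as fh:
        fh.seek(offset)
        while remaining > 0:
            data = fh.read(min(bufsize, remaining))
            if not data:  # hit end of file early
                break
            digest.update(data)
            remaining -= len(data)
    return digest.hexdigest()


def paranoid_verify(src, dst, offset, length):
    """Hypothetical paranoid check: re-read the source chunk instead of
    trusting a checksum computed while copying."""
    return chunk_md5(src, offset, length) == chunk_md5(dst, offset, length)
```

The cost is a second read of every source chunk, which is the trade-off against the existing scheme.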

It also occurred to me that the work done to ensure that the reads and verifies occur on different machines would be made less effective if you did a smeared read on all the clients.

--
james

@guycoates
Collaborator

guycoates commented Mar 2, 2018 via email

@jbeal-work

jbeal-work commented Mar 2, 2018 via email

@guycoates
Collaborator

Ok, I think I have a likely explanation. pcp treats "Permission denied" errors on read as non-fatal. With chunking, destination files are created as sparse files before the chunks are copied. If there is a (transient) permission denied error (temporary MDT/LDAP overload?), then the chunk being accessed at the time will never be copied or checksummed, and you end up with a file with chunks missing.
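The failure mode described above can be simulated in a few lines. This is only a sketch under stated assumptions, not pcp's actual code: `copy_chunks` is a hypothetical helper, and the `unreadable` parameter fakes the transient permission error for selected chunk indices.

```python
import os


def copy_chunks(src, dst, chunk_size, unreadable=()):
    """Copy src to dst chunk by chunk, skipping chunks that fail to read.

    `unreadable` is a hypothetical knob: chunk indices for which a
    PermissionError is simulated.
    """
    size = os.path.getsize(src)
    # Pre-create the destination at full size; on most filesystems this is sparse.
    with open(dst, "wb") as out:
        out.truncate(size)

    skipped = []
    nchunks = (size + chunk_size - 1) // chunk_size
    for idx in range(nchunks):
        offset = idx * chunk_size
        try:
            if idx in unreadable:
                raise PermissionError("simulated permission denied")
            with open(src, "rb") as fin, open(dst, "r+b") as fout:
                fin.seek(offset)
                fout.seek(offset)
                fout.write(fin.read(chunk_size))
        except PermissionError:
            # Treated as non-fatal: the chunk is neither copied nor checksummed.
            skipped.append(idx)
    return skipped
```

After a run with a skipped chunk, the destination has the same size as the source, but the skipped range reads back as zeros, so a size comparison alone does not reveal the problem.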

I've reproduced it below: the file highlighted in bold (2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar) started off with no read permissions, but was opened up halfway through the copy. You can see only chunk 1 was copied, but the file ends up as the correct size at the destination.

I'm about to go on holiday, so won't have a chance to fix this for a while; I'm guessing someone at WTSI can knock up a patch now that the suspect code has been identified. (There might be similar issues with the "non existent file" error handling too....)

mpirun -n 4 ~/work/pcp/pcp -c -v -b 1000 1 2

R0: All workers have reported in.
Starting 4 processes.
Files larger than 1000 Mbytes will be copied in parallel chunks.
Will md5 verify copies.
Starting phase I: Scanning and copying directory structure...
Phase I done: Scanned 2 files, 1 dirs in 00 hrs 00 mins 00 secs (1930 items/sec).
2 files will be copied.
Starting phase II: Copying files...
R3: 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar Large file : copying in 4 chunks.
R2: 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar Large file : copying in 4 chunks.
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...

R2: Mar 02 13:27:03 WARNING: permission denied on 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar. Skipping...
R2: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 1000.00 Mbytes (56.58 Mbytes/s)
R1: Mar 02 13:27:20 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 1000.00 Mbytes (55.88 Mbytes/s)
R3: Mar 02 13:27:21 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 1000.00 Mbytes (54.92 Mbytes/s)
R2: Mar 02 13:27:27 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 184.55 Mbytes (28.80 Mbytes/s)
R1: Mar 02 13:27:29 copied 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 1000.00 Mbytes (113.17 Mbytes/s)
R3: Mar 02 13:27:30 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 2 md5sum verified (17978c81a63b610bccc09c346cb2499a)
R3: Mar 02 13:27:34 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 3 md5sum verified (6b72a02a769435060fcec17a40808676)
R2: Mar 02 13:27:36 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 1 md5sum verified (fc1e9b57bee910d322f8616d86bdc1b5)
R1: Mar 02 13:27:38 1/2018.AM_SIM_Abaqus_Extend.AllOS.2-4.tar chunk 0 md5sum verified (4426783debafee77a8489c7bf6114232)
R3: Mar 02 13:27:41 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar chunk 1 md5sum verified (e1664cca44e0cdcdcf22e487e470f0bd)
Phase II done.
R0: Sending SHUTDOWN to workers
rank 1 shutdown
rank 2 shutdown
rank 3 shutdown
R0: Gathering results

Copy Statisics:
Rank 1 copied 1.95 Gbytes in 2 files (74.81 Mbytes/s)
Rank 1 checksummed 1000.00 Mbytes in 1 files (108.97 Mbytes/s)
Rank 2 copied 1.16 Gbytes in 2 files (49.17 Mbytes/s)
Rank 2 checksummed 1000.00 Mbytes in 1 files (102.97 Mbytes/s)
Rank 3 copied 1000.00 Mbytes in 1 files (54.90 Mbytes/s)
Rank 3 checksummed 2.13 Gbytes in 3 files (107.88 Mbytes/s)
Total data copied: 4.09 Gbytes in 5 files (108.78 Mbytes/s)
Total Time for copy: 00 hrs 00 mins 38 secs

ls -al 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
-r--r--r-- 1 gcoates gcoates 3341621760 Feb 28 14:13 1/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
ls -al 2/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
-rw-rw-r-- 1 gcoates gcoates 3341621760 Mar 2 13:27 2/2018.AM_SIM_Abaqus_Extend.AllOS.1-4.tar
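Since the sizes match in the listing above, a size check will not catch this. As a rough aid, the sketch below flags chunks of a destination file that consist entirely of zero bytes, which is what a never-written range of a sparse file reads back as. `zero_chunks` is a hypothetical helper, not part of pcp, and the check is only a heuristic: legitimate data can contain all-zero chunks, so a hit is a reason to compare against the source rather than proof of corruption.

```python
import os


def zero_chunks(path, chunk_size, bufsize=1 << 20):
    """Return indices of chunks in `path` that contain only zero bytes."""
    suspects = []
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        for idx, start in enumerate(range(0, size, chunk_size)):
            fh.seek(start)
            remaining = min(chunk_size, size - start)
            all_zero = True
            while remaining > 0:
                data = fh.read(min(bufsize, remaining))
                if not data:
                    break
                remaining -= len(data)
                if data.count(0) != len(data):  # found a non-zero byte
                    all_zero = False
                    break
            if all_zero:
                suspects.append(idx)
    return suspects
```

The `chunk_size` argument is in bytes and would need to match the chunk size used for the copy.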

@jbeal-work

jbeal-work commented Mar 2, 2018 via email

3 participants