
RFE: sync option #24

Open
jeremyenos opened this issue Apr 8, 2016 · 4 comments

@jeremyenos commented Apr 8, 2016

Would it be a massive change to provide a sync option that, on update operations, removes files from the target if they have been removed from the source (like rsync can do)?

@mjwoods (Contributor) commented Apr 9, 2016

Other people may know better, but I think this would be difficult. The current implementation walks through the source tree and deals individually with each file that it finds (in parallel, of course). To remove files from the destination would require a separate walk through the destination tree and some kind of comparison between trees, preferably working on different directories in parallel. I'm sure it could be done, but it would probably involve some big code changes.
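Just to illustrate the extra work involved (this is not something pcp does today, only a serial sketch using standard GNU tools; "source" and "dest" are placeholder directory names), a sync+delete pass would need something along these lines, run in parallel over sub-directories to be useful at scale:

( cd source && find . ) | sort > /tmp/src.list
( cd dest && find . ) | sort > /tmp/dst.list
# Paths that exist only in the destination are the ones a sync option would delete.
comm -13 /tmp/src.list /tmp/dst.list | ( cd dest && xargs -d '\n' rm -rf )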

For now, you could try the "-i" (incremental backup) option. This works with three directory trees - "source", "previous" backup and "new" backup. For your purposes, the "previous" backup would be your current destination directory, and the "new" backup would be a clean directory on the same Lustre filesystem. The "source" directory does not need to be on the same filesystem.

The "-i" option compares each file in the "source" tree with the same file in the "previous" tree, and if they match, a hard-link to the "previous" file is created in the "new" tree. Any "source" files that do not match a "previous" file are copied from the "source" to the "new" tree. In the end, the "new" tree should contain a mirror of the "source" tree, consisting of a mixture of copied files and hard-links to "previous" files. The disk space required for the "new" tree will mainly depend on how many files need to be copied, because hard-links only occupy a tiny amount of space in directory structures.
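As a rough sanity check after an incremental run (assuming the "previous" and "new" directory names used above, and GNU find/du), you can count how many files in the "new" tree ended up as hard links versus fresh copies:

find new -type f -links +1 | wc -l   # files with more than one link (expected: unchanged files shared with "previous")
find new -type f -links 1 | wc -l    # files with a single link (expected: changed or new files copied from "source")
du -sh previous new                  # GNU du counts each hard-linked file only once across both trees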

Once the incremental backup process is complete, the "new" tree could replace the "previous" tree. It is safest to do this when no jobs are using the "previous" tree. After the replacement is done, you could remove the old "previous" directory at your leisure. For example, you could do something like:

mv previous previous.bak && mv new previous && rm -rf previous.bak

I think the "-i" option will do most of what you need, so please give it a try. But also bear in mind that this option is a recent addition to pcp, so please report any problems that you discover.

@jeremyenos (Author)

Great suggestion - I think that could potentially work. Good to know what's going on behind the scenes on incrementals as well. I expect that a sync option could be more efficient, since it wouldn't have to create inodes for files that didn't change, but as long as such an option doesn't exist, the incremental hard-link tree creation is the next best thing. Plus, I imagine that when dealing with 500M inodes in the tree, the removal of the obsolete hard-link trees may take a while.
I'll certainly give it a try.

@mjwoods (Contributor) commented Apr 9, 2016

I'd be interested to know how you go, and perhaps a few statistics from the process. For example, how many parallel processes were used, how many files were copied, how much data was transferred, how long did it take?

While there is definitely overhead in managing a tree of hard links, it may not be as bad as you expect. If you delete your "previous" directory, please let us know how long it takes. It may go faster if you delete sub-trees in parallel.
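For example, a rough way to delete the top-level sub-trees in parallel with standard GNU tools (the process count of 8 is just an illustration, and "previous.bak" matches the rename example above):

find previous.bak -mindepth 1 -maxdepth 1 -print0 | xargs -0 -n 1 -P 8 rm -rf
rm -rf previous.bak   # removes anything left at the top level, then the directory itself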

I do agree that a "sync+delete" option would be more efficient for your application, and it would be worth investigating as a possible new feature.

@jeremyenos (Author)

I'll be happy to provide some stats. It'll be a couple of weeks before I get a large test opportunity, as well as the root-level MPI environment set up that I'll need. I think parallel delete threads will likely be a necessity, but I'll report on that as well if I don't run into an earlier blocker. Thanks.
