Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use os.walk for recursing #48

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

use os.walk for recursing #48

wants to merge 1 commit into from

Conversation

xlotlu
Copy link

@xlotlu xlotlu commented Mar 7, 2018

Use os.walk() for recursing instead of the custom os.listdir() loop, and adapted tests.

This should improve efficiency in python >= 3.5, which uses the new os.scandir() under the hood.

Related to #45.

@coveralls
Copy link

coveralls commented Mar 7, 2018

Coverage Status

Coverage increased (+0.5%) to 85.156% when pulling d512a2c on xlotlu:master into 78670d8 on dsoprea:master.

@xlotlu
Copy link
Author

xlotlu commented Mar 7, 2018

Benchmark results with:

# mkdirs.py
from os import mkdir
from os.path import join

def mkdirs(parent, depth=0):
    if depth == 5:
        for f in range(10):
            with open(join(parent, str(f)), 'w'):
                pass
        return
        
    depth += 1
    
    for d in range(10):
        newdir = join(parent, str(d))
        mkdir(newdir)
        mkdirs(newdir, depth)

if __name__ == '__main__':
    mkdirs('/storage/test')

That's 111111 directories (and 1 million files). Testing with:

# test.py
from inotify.adapters import InotifyTree

if __name__ == '__main__':
    i = InotifyTree('/storage/test')

This is happening on a rotational drive. The benchmark (I did several runs and chose the best result):

$ sudo sysctl fs.inotify.max_user_watches=1000000 # make sure we don't hit file watch limit
fs.inotify.max_user_watches = 1000000
$ sudo sysctl vm.vfs_cache_pressure=0 # never auto-clear vfs cache
vm.vfs_cache_pressure = 0
$ echo 2 | sudo tee /proc/sys/vm/drop_caches # drop dentry cache
2

$ time python3 test.py # python3 test *without* dentry cache

real	0m19.287s
user	0m2.395s
sys	0m6.483s

$ time python3 test.py # python3 test *with* dentry cache

real	0m18.905s
user	0m2.587s
sys	0m6.155s

$ echo 2 | sudo tee /proc/sys/vm/drop_caches
2

$ time python2 test.py # python2 test *without* dentry cache

real	1m17.637s
user	0m4.165s
sys	0m26.825s

$ time python2 test.py # python2 test *with* dentry cache

real	1m16.991s
user	0m4.376s
sys	0m25.833s

So that's about 1/4th of the time.
Interestingly enough dentry cache doesn't make much of a difference (or my test procedure was wrong - I can oly guess it's not hitting the cache despite the 0 vfs_cache_pressure).

Just out of curiosity, I tried old code for comparison (without flushing cache):

$ git checkout HEAD^
[...]
HEAD is now at 78670d8 Version bump.

$ time python3 test.py

real	1m17.761s
user	0m6.275s
sys	0m24.890s

$ time python2 test.py

real	1m16.418s
user	0m4.408s
sys	0m25.065s

So, a negligible difference in the os.listdir() loop vs os.walk(), when os.scandir() is not involved.

As i mentioned, this was on an HDD. On an SSD the results are:

new code, yes cache:
python 3: real	0m25.734s
python 2: real	0m45.174s

old code, yes cache:
python 3: real	0m52.341s
python 2: real	0m45.708s

Not so dramatic, but still a serious improvement.

@@ -11,12 +11,9 @@ def temp_path():
path = tempfile.mkdtemp()

original_wd = os.getcwd()
os.chdir(path)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave these in, please.

Copy link

@yelizariev yelizariev Oct 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still relevant? There is some explanation here: #48 (comment)

@@ -166,7 +166,10 @@ def test__cycle(self):

i = inotify.adapters.InotifyTree(path)

with open('seen_new_file1', 'w'):
watches = i._i._Inotify__watches
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give me an overview on why this file has to change (aside from fully-qualifying the paths)?

@Elias481
Copy link

@dsoprea can't You just merge this one and roll back the unwanted/unneeded change in inotify/test_support.py as @xlotlu doesn't seem to respond..

Then pending #31 and #30 could be properly reimplemented and all pending pull-requests be merged. And I would have one more reason to switch to Python >3.4...

The question regarding the changes in tests/test_inotify.py can be answered shortly. They are needed because of access pattern changes using os.walk.
You can see similiar result in #31 where such adjustment of test file is missing. It's in failing state because the test dow not account for changed access pattern caused by implementing fix..

Every directory is scanned by os.walk after it has been added to inotify because not intermediate list is used. (Which is the proper way. First it's shorter and probably all in all more efficient code, second there will not be missing directories if they are added between listdir and add_watch with old method..)
So in this point proper work.

@xlotlu
Copy link
Author

xlotlu commented Oct 2, 2018

@xlotlu isn't responding because it's been a really long time since my commit got any attention, and the specifics are vague, and I ended up not using this project in production.

But, IIRC, @Elias481's explanation above is precisely why the changes are needed. Also, the old tests made assumptions about the access order which don't hold true with os.walk(), so I remember I had to mess around with that. They could definitely be made "prettier".

I'm not sure though why I decided to remove that os.chdir(). It might have been because I was trying to understand why the old tests were failing and thought that was one of the reasons, or it might have been
because of this note under https://docs.python.org/3/library/os.html#os.walk:

If you pass a relative pathname, don’t change the current working directory between resumptions of walk(). walk() never changes the current directory, and assumes that its caller doesn’t either.

or maybe I thought the code smells and could trigger strange behaviour, or all of the above.

@xlotlu
Copy link
Author

xlotlu commented Oct 2, 2018

By the way, in the meantime I discovered the reason why my benchmarks under #48 (comment) didn't yield different results with / without dentry cache:
I did the tests under ZFS, which has its own (very generous) cache, so I guess all the tests were performed with cache on. Which means the difference might be quite spectacular on a cold start.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants