[opt] use scandir to shorten initialization time of Inotify #45

Closed
NoneGG opened this issue Feb 11, 2018 · 5 comments

Comments

NoneGG commented Feb 11, 2018

I want to use inotify to monitor video files on a CDN server, but initializing the InotifyTrees takes too long when I run the demo script (about 1061070 files).
I noticed that os.listdir is used in the code. Would it be possible to use scandir instead to speed up initialization? (scandir has been part of the standard library as os.scandir since Python 3.5.)

NoneGG commented Feb 11, 2018

If you are open to it, I would be glad to make a pull request~

xlotlu commented Mar 7, 2018

@NoneGG I implemented the tree handling in #48 using os.walk() instead, which is 4 times faster on rotational media, and about twice as fast on an SSD. That is, if you are using python >= 3.5.

It would be interesting to see a comparison with raw os.scandir(), but I think you can't squeeze much more out of it. os.walk() only does a few extra operations that aren't of interest to InotifyTree, and they aren't I/O-bound either. Unless you need support for Python < 3.5, of course; then you could depend on scandir, and do the suggested:

try:
    from os import scandir
except ImportError:
    from scandir import scandir
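
For illustration, the fallback import above could feed a small directory walker along these lines. This is only a sketch: `iter_dirs` is a hypothetical helper name, not something taken from the library or from #48, and it assumes the third-party scandir backport is installed when running on Python < 3.5.

```python
try:
    from os import scandir  # Python >= 3.5
except ImportError:
    from scandir import scandir  # third-party backport for older Pythons


def iter_dirs(root):
    """Yield root and every directory below it, without following symlinks.

    Hypothetical helper: each yielded path is a candidate for one watch.
    """
    stack = [root]
    while stack:
        current = stack.pop()
        yield current
        try:
            entries = list(scandir(current))
        except OSError:
            # Directory vanished or is unreadable; skip its children.
            continue
        for entry in entries:
            # is_dir() can usually answer from the directory read itself,
            # so no extra stat() call is needed per entry.
            if entry.is_dir(follow_symlinks=False):
                stack.append(entry.path)
```

A caller would loop over `iter_dirs("/path/to/tree")` and register a watch for each yielded path; an explicit stack is used instead of recursion so that very deep trees do not hit the recursion limit.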

If I may make a suggestion though, I think your approach is not the best for your situation, architecturally speaking. Given such a huge tree of files, it's preferable to hook into the code creating / modifying the video files, and have it call back some handler on the other side -- maybe some API exposed by the code interested in change events. If there are multiple parties interested in changes, then a message queue / fanout system would simplify things greatly.

NoneGG commented Mar 8, 2018

@xlotlu Thank you for your response~
I read your commit and it seems nice.

As far as I know, os.walk in Python versions below 3.5 still uses stat in its implementation and generates a lot of I/O requests to disk. As I said before, scandir was merged into Python 3.5 and later, so it is worth using the scandir package with Python < 3.5.
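
To make the difference concrete, here is a rough sketch of the two traversal styles side by side, with a tiny timing helper. The function names are made up for this example, it assumes Python >= 3.5 (for os.scandir), and it is not the benchmark from the PR; actual timings depend heavily on the file system and on caching.

```python
import os
import time


def dirs_via_listdir(root):
    """Old style: os.path.isdir() costs an extra stat() call per entry."""
    found = [root]
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path) and not os.path.islink(path):
            found.extend(dirs_via_listdir(path))
    return found


def dirs_via_scandir(root):
    """New style: the entry type usually comes with the directory read."""
    found = [root]
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            found.extend(dirs_via_scandir(entry.path))
    return found


def timed(func, root):
    """Return (elapsed_seconds, result) for a single traversal."""
    start = time.perf_counter()
    result = func(root)
    return time.perf_counter() - start, result
```

Both functions should return the same set of directories, so running `timed(dirs_via_listdir, root)` and `timed(dirs_via_scandir, root)` on the same tree gives a like-for-like comparison. For a fair measurement on rotational media, the page cache would need to be cold for each run.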

Actually, the inotify-based monitor is designed to catch both human operating mistakes and code mistakes, and it is still in development. Your suggestion sounds reasonable, and we do have an API and a subscription mechanism. But if we need to monitor human operations, a hook at the file system level is needed; that's why we chose inotify.

Could you tell me why a huge tree of files is not recommended? From what I have read, inotify improved on the file descriptor limitation, unlike dnotify.

xlotlu commented Mar 8, 2018

@NoneGG yes, on python < 3.5 it is just as slow as before. I made some benchmarks which you can find attached to the PR.

If you need < 3.5 support, then you need to depend on scandir and do the loop just like in the old code. If you create a pull request that does this I'll close mine.

> But if we need to monitor human operations, a hook at the file system level is needed; that's why we chose inotify.

I see. I didn't imagine you'd have arbitrary, human-driven modifications. If so you have no other option, short of making sure all those modifications go through a custom application.

> Could you tell me why a huge tree of files is not recommended? From what I have read, inotify improved on the file descriptor limitation, unlike dnotify.

I didn't say that - a huge tree of files is probably the best way to handle your storage needs. It's inotify that I think is not the right tool for the job, because it's meant to monitor individual files / directories, while what you want is to monitor "everything". Because of its design you first have to visit every existing inode, which will take a lot of time, no matter what. Then you have to set up watchers that will consume memory. And then those watchers will consume CPU cycles at every event. It's true that it's efficient, but you're still dealing with a million-entry hash table.
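
One way to get a feel for the setup cost is to count how many watches a tree would need and compare that with the kernel's per-user limit. The sketch below assumes one watch per directory, which is an assumption about how recursive watching is typically done rather than something stated in this thread. The function names are hypothetical, and the /proc path only exists on Linux.

```python
import os


def count_directories(root):
    """Count root plus all directories below it (one watch each, by assumption)."""
    total = 0
    for _dirpath, _dirnames, _filenames in os.walk(root):
        total += 1
    return total


def read_watch_limit(path="/proc/sys/fs/inotify/max_user_watches"):
    """Return the per-user inotify watch limit, or None if it cannot be read."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def watches_fit(root):
    """Return (needed, limit, fits); fits is None when the limit is unknown."""
    needed = count_directories(root)
    limit = read_watch_limit()
    fits = None if limit is None else needed <= limit
    return needed, limit, fits
```

Calling `watches_fit("/path/to/tree")` before starting the monitor shows whether the limit would have to be raised. Note that the count itself still has to walk the whole tree, which is the unavoidable visiting cost described above.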

Maybe you could approach this from the other direction: monitor for "everything", and filter out the events that you're interested in? The kernel's audit system comes to mind, and it can monitor specific paths. There's also fanotify, but I don't think it fits your requirements.

NoneGG commented Mar 9, 2018

@xlotlu
Thanks for your advice, I will take the audit system into consideration~

According to experiments on our CDN server, it does take a long time to initialize the inotify tree (that's why I opened this issue), but in terms of CPU and memory, it does not use much. (I am not entirely sure; I only used the top and free commands to monitor these two metrics.)

NoneGG closed this as completed Mar 9, 2018