Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add module parameter to disable prefetch in zfs_readdir #16318

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

clhedrick
Copy link

Add paramter zfs_readdir_dnode_prefetch_enabled, defaulting to 1, to control whether zfs_readdir prefetched metadata for all objects it look at when reading a directory.

Setting it to 0 can be important for NFS servers with directories containing many subdirectories.

Motivation and Context

This improves #16254, although a more radical redesign is possible. We have an NFS servers with directories containing 200,000 to 500,000 subdirectories. These arise in AI training scenarios. Our NFS servers can have 20 or more threads running at 100% for hours due to this.

The` problem has been found to be an interaction between the NFS code and zfs_readdir. If information about a directory is cached on the client but not the server, when the server tries to valid the NFS file id, it will add an entry in the dnode cache for the directory. Because he new dnode is not connected to the rest of the directory hierarchy, the NFS (actually exportfs) code goes up the tree trying to connect it. The ends up calling zfs_readdir to find the directory in its parent Because zfs_readdir reads the parent looking for the directory, the amount of time it takes is proportional to the size of the parent. If every directory is processed, the amount of time will be N**2 in the size of the parent. Reading a directory would have only moderate overhead except that zfs_readdir does a prefetch on every item as it looks at it, even if the item is already in the ARC. Of course if it's in the ARC, no disk I/O will be done. But there's still a substantial amount of code.

We've found a factor of 10 to 20 speedup by skipping the prefetches. This is enough to move us from a painful situation to one we can live with.

Of course a linear speedup in an N**2 problem isn't a full solution, but it's enough to survive the size directories we see in practice. A full solution would almost certainly require help from the kernel developers, which is going to be hard to get, since directory searching is fast enough for other Linux file systems that the problem only comes up with ZFS.

Description

This change add a module parameter, zfs_readdir_dnode_prefetch_enabled, normally 1, which can be changed to 0 to disable the prefetch that happens in zfs_readdir.

How Has This Been Tested?

We have done testing with a directory containing 200,000 subdirectories, with the results discussed above. It has been running in production for several weeks.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@stuartthebruce
Copy link

Would it make sense to change this from a boolean to a scalar threshold value on the directory link count to determine whether or not to prefetch?

@clhedrick
Copy link
Author

Yes, it probably does make sense. I'm willing to try. (I didn't actually write the code, but I can probaby figure that out.)

However before doing that it's worth asking whether we can solve the underlying problem. The problem occurs becasue NFS wants to know the name that the directory has in its parennt directory. There's a function to get the name of an inode. It's specific to the file system type. ZFS doesn't have it, so the backup is used. The backup ends up using zfs_readdir to search the directory looking for the inode.

So the obvious queston is: can we create a zfs_getname that does it better. Is there somewhere in the ZFS data structures that we could cause the filename. that way when zfs_readdir (or our replacemennt) looks through the parent directory, it can cache the name of every inode as it goes by. Then when we come to look up that node, the name will already be there, and zfs_getname won't need to do a search at all.

Since I'm not a zfs developer, I don't know the data structures. Is there a place the name can be put, and obviously cleared if there's a rename or delete? With a normal file, it can be in more than one directory, so a single entry wouldn't work. But for a directory, there's only one name that it has in its parent (i.e ".."), so we don't have that ambiguity.

If someone could point me to what needs to be done, I'm willing to code it and test it. I do have experience doing kernel code. (I was one of the original Linux developers when Linus was first writing linux.)

@clhedrick
Copy link
Author

clhedrick commented Jul 4, 2024

For clarify, the actual function is

int get_name(const struct path *path, char *name, struct dentry *child);

  • get_name - default export_operations->get_name function
  • @path: the directory in which to find a name
  • @name: a pointer to a %NAME_MAX+1 char buffer to store the name
  • @child: the dentry for the child directory.

The problem occurs when data is in cache on the client, but the dentry for the child directory isn't in the dentry cache anymolre. So the data can't be stored ni the dentry, since it may not exist for some of the directories whose names we'd like to cache.

have a very large number of subdirectories.
A reasonable value would be 20000.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be pedantic, please consider specifying what the units are.

@clhedrick
Copy link
Author

I've changed it to have a limit rather than a boolean. If non-zero, any directory larger than the limit disables prefetch.

Note that it fails checkstyle. Checkstyle demands that I use a posix type, replaceing ulong with ulong_t, but the macro doesn't support ulong_t. There are examples in other parts of the code using ulong.

I have tested this with 0, a value smaller than my directory size, and a value larger than my directory size. The performance is sufficiently different that I can tell it works as intended.

@clhedrick
Copy link
Author

there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it.

@stuartthebruce
Copy link

stuartthebruce commented Jul 5, 2024

there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it.

Perhaps too pedantic, but in general directories have a size in bytes as well as a size in link count. However, if looks like ZFS sets the size to equal the link count.

@clhedrick
Copy link
Author

I looked a bit further into writing an alternative zfs_get_name. It looks like one could use zap_value_search. However the basic functions for iterating through the zap seem to be the same. I'm sure it would be faster, but it's not clear whether it would be enough to justify the extra code.

@tonyhutter
Copy link
Contributor

I can take a look at this. Would you mind squashing the commits and take a look at the checkstyle errors?

@clhedrick
Copy link
Author

clhedrick commented Jul 12, 2024 via email

@clhedrick
Copy link
Author

clhedrick commented Jul 12, 2024

the complaint about non-posix definition was hard to understand. All lthe POSIX alternatives generate compile errors. And I found plenty of uses of ulong in other files.

Here's my attempt to use ulong_t

param_ops_ulong_t’ undeclared here

Here's another file with the same type of declaration not using POSIX.

in ./module/os/linux/zfs/zvol_os.c
module_param(zvol_max_discard_blocks, ulong, 0444);

I was unable to figure out what to do.

@tonyhutter
Copy link
Contributor

I see what you mean, that's bizarre. It's probably a bug in our cstyle.pl. I guess just make it a ulong_t for now in module_param() as a workaround.

To squash commits run git rebase -i and squash all your commits into your first commit. After you've successfully squashed them, you can do a force push (git push --force) to update this PR with the single commit.

@clhedrick
Copy link
Author

clhedrick commented Jul 12, 2024

ulong is the one that compiles but fails style
ulong_t is the one that passes style but doesn't compile.
I'm leaving ulong.

I haven't been able to figure out why mine fails but it works in other files. I suspect there's an include file I need.

I think I've squashed it correctly.

@clhedrick
Copy link
Author

clhedrick commented Jul 12, 2024 via email

@tonyhutter
Copy link
Contributor

@clhedrick looks like a /* CSTYLED */ will suppress the cstyle error:

diff --git a/module/os/linux/zfs/zfs_vnops_os.c b/module/os/linux/zfs/zfs_vnops_os.c
index 11f665d22..efe2371fa 100644
--- a/module/os/linux/zfs/zfs_vnops_os.c
+++ b/module/os/linux/zfs/zfs_vnops_os.c
@@ -4257,6 +4257,7 @@ EXPORT_SYMBOL(zfs_map);
 module_param(zfs_delete_blocks, ulong, 0644);
 MODULE_PARM_DESC(zfs_delete_blocks, "Delete files larger than N blocks async");
 
+/* CSTYLED */
 module_param(zfs_readdir_dnode_prefetch_limit, ulong, 0644);
 MODULE_PARM_DESC(zfs_readdir_dnode_prefetch_limit,
        "No zfs_readdir prefetch if non-zero and size > this");

Would you mind adding that in? Then we should be pretty much good to go.

@clhedrick
Copy link
Author

OK. I just tested on a 6.8 kernel with the limit 0, 20000 and 400000, with a directory of size 200000. I got the expected results. 4 min with 0 and 400000, 20 sec with 20000.

@clhedrick
Copy link
Author

This is tricky to test. The problem occurs when files are cached on the client, but metadata is not, and the dnoce is not cached on the server. This is realistic for us, because our clients have very large memory. Data can stay cached for days if not weeks, but the attribute cache time out after something like 90 sec.

I believe the problem occurs when the client does a getattr to see if the data is still fresh. The getattr uses what is in effect a inode number. That triggers the reconnect code in the NFS server, which searches the parent directory for the directory, in order to find its name.

So to duplicate the problem, clear cache on the server but not the cllient, and wait at least 120 sec on the client, for the attribute cache to timeout. Clearing cache on the client would cause it to do full lookups on all files, which would be fast.

@tonyhutter
Copy link
Contributor

Just so I'm clear, zfs_readdir_dnode_prefetch_limit means "stop prefetching file metadata after zfs_readdir_dnode_prefetch_limit files have been pre-fetched in a directory"? If so, could you update your man page entry with something like this:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..588bf9339 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1781,9 +1781,11 @@ When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with more entries than that value.  For example, if set to 1000, then stop
+prefetching file metadata after 1000 file's metadata have been prefetched for
+the directory.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
 A reasonable value would be 20000.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int

@clhedrick
Copy link
Author

The documentation is correct. I compare the directory size with the limit. What you describe could be implemented. I would use a counter in the loop. It would take a lot of work on lots of different situations to know whether this is better or not. It's not obvious that it is.

A more complex fix would be to implement a zfs-specific get_name. That would basically disable prefetch only in this specific NFS operation. Whether that's worth doing or not depends partly upon whether the prefetch is a good idea in the first place for very large directories. I'm not so sure it is. Consider "ls -l". If information on all files is already in the ARC, the prefetches are wasted. If not, it will queue up a bunch of prefetches, in the order that they occur in the directory traversal. The best case seems to be"ls -l". But ls will sort the files alphabetically. I think it's just as likely that the first file will have to wait for a bench of prefetches are finished to get its data. Prefetch won't be any faster than a demand fetch. It's useful only if the demand comes later, so that by the time it comes the data is in memory, where it would have had to wait otherwise. But it probably doesn't matter for typical file access. The "ls -l" case seems like the best hope, but I'm skeptical because of the ordering issue.

You'd need some pretty careful testing in quite a bunch of situations. But I'm skeptical about the usefulness of the prefetch. Certainly not for us, where metadata is always in NVMe.

It you can suggest a way to get a directory's name without doing a traversal of the parent at all, that would be worth doing. I'd be willing to code it. But I don't see any obvious way to do that given the data structures.

@tonyhutter
Copy link
Contributor

there's no unit. It's a count. Directory size. Unlike other file systems, in ZFS the size of the directory is a count of the files in it.

I compare the directory size with the limit

Ok, I misunderstood. So it's literally is the 'size' from stat(/directory). (https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/What_is_directory_size)

May I suggest the following change?:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..083ffeaa4 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,18 @@ that ends up not being needed, so it can't hurt performance.
 .
 .It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
 Disable prefetches in readdir for large directories.
-(Normally off)
 When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.  Directory size in this case is the
+size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
 A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
 Disable QAT hardware acceleration for SHA256 checksums.

@clhedrick
Copy link
Author

clhedrick commented Jul 12, 2024

To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself. Obviously I don't actually call stat, although I'm prepared to believe that this is the number that stat uses.

I just checked a few directories, and it seems close, off by one maybe.

I found code incrementing or drecrementing it for actons with a single file.

@tonyhutter
Copy link
Contributor

To my knowledge, that field for a directory is in fact the number of entries in the directory. Is that wrong? It's not, as it is for something like ext4, the amount of space used by the directory structure itself.

I double checked and that appears to be correct. On ZFS it's basically the number of files/subdirs in the directory, but on other filesystems it's like the metadata size needed for the directory. So it's not strictly defined 🤷‍♂️

How about:

diff --git a/man/man4/zfs.4 b/man/man4/zfs.4
index c12f838c5..a6a00e107 100644
--- a/man/man4/zfs.4
+++ b/man/man4/zfs.4
@@ -1776,15 +1776,20 @@ that ends up not being needed, so it can't hurt performance.
 .
 .It Sy zfs_readdir_dnode_prefetch_limit Ns = Ns Sy 0 Pq u64
 Disable prefetches in readdir for large directories.
-(Normally off)
 When readdir searches a directory, it normally prefetches metadata for
 all objects in the directory it checks, even if it's just
 looking for a single object.
 Setting this to a non-zero value disables that prefetching for directories
-with more entries than that value.
-Disabling it for large directories can greatly lower CPU usage on NFS servers where directories
-have a very large number of subdirectories.
+with a greater size than that value.
+Disabling it for large directories can greatly lower CPU usage on NFS servers
+where directories have a very large number of subdirectories.
+Directory size in this case is the size returned from calling
+.Sy stat
+on the directory (stat.st_size).
+On ZFS, this directory size value is approximately the number of files
+and subdirectories in the directory.
 A reasonable value would be 20000.
+A zero value (the default) means no limit on directory metadata prefetching.
 .
 .It Sy zfs_qat_checksum_disable Ns = Ns Sy 0 Ns | Ns 1 Pq int
 Disable QAT hardware acceleration for SHA256 checksums.

@tonyhutter
Copy link
Contributor

Can you edit the commit message to fix the commitcheck error?:

error: commit message body contains line over 72 characters

Other than that, it should be good to go..

Add paramter zfs_readdir_dnode_prefetch_limit, defaulting to 0, to
control whether zfs_readdir prefetched metadata for objects it look at
when reading a directory. If zero, metadata is prefetched for all
directory entries. If non-zero, metadata is prefetched only if
directory has fewer entries than this.

Setting it to non-0 can be important for NFS servers with directories
containing many subdirectories.

Signed-off-by: Charles Hedrick <[email protected]>
Co-authored-by: Chris Siebenmann<[email protected]>
@clhedrick
Copy link
Author

clhedrick commented Jul 15, 2024

ok. I also noted in the documentation that it only applies to Linux. I have no reason to think the BSD needs it. It's needed because of details of the NFS implementation. i would hope that BSD hasn't chosen the same approach.

At any rate, the BSD readdir doesn't prefetch.

@clhedrick
Copy link
Author

what's up with this? Test failures can't be due to this patch, as it obvviously has no effect if not enabled.

Is this going to be merged?

@behlendorf
Copy link
Contributor

@clhedrick as you mentioned originally the issue here is really the lack of a proper .get_name() function so lets start by implementing it. I suspect it was never implemented because it was marked as optional in the documentation. Clearly it really should be added! From Documentation/filesystems/nfs/exporting.rst:


  get_name (optional)
    When given a parent dentry and a child dentry, this should find a name
    in the directory identified by the parent dentry, which leads to the
    object identified by the child dentry.  If no get_name function is
    supplied, a default implementation is provided which uses vfs_readdir
    to find potential names, and matches inode numbers to find the correct
    match.

The implementation for this should be similar to zpl_lookup() which calls zfs_lookup() to find a name in a directory. If the lookup succeeds you'll then want to compare the inode number in the inode which was looked up and the one which was passed in. There won't be any scanning in the lookup case so this should eliminate the issue you're seeing. It still might have to go to disk if the directory isn't cached in the ARC but only the needed blocks will be read in.

If you're game to take a crack at this you'll want to register a new zpl_get_name() function in the export_operations structure at the end of module/os/linux/zfs/zpl_export.c.

@snajpa
Copy link
Contributor

snajpa commented Aug 13, 2024

It would be lovely to have this as a per-dataset property, just saying :)

@behlendorf
Copy link
Contributor

It would be lovely to have this as a per-dataset property, just saying :)

Assuming implementing .get_name() fixes the performance issue what's the use case for making this tunable? I probably missed that in an earlier comment.

@behlendorf behlendorf added the Status: Design Review Needed Architecture or design is under discussion label Aug 13, 2024
@clhedrick
Copy link
Author

maybe.. So get_name is (dentry *parent, char *namebuff, dentry *child)
zpl_lookup returns dentry * and has
struct inode *dir, struct dentry *dentry, unsigned int flags)
So I get the inode for the parent from the entry and pass the child dentry
What is returned? Is it a dentry with the name filled in?

Is any locking needed around it? Is the returned dnode looked or held of anything?

@clhedrick
Copy link
Author

What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup. But I don't have the name. Given the data structures I don't see how you could do a lookup without the name or searching.

@behlendorf
Copy link
Contributor

behlendorf commented Aug 13, 2024

What I find odd about zpl_lookup is that it seems to pass a name to zfs_lookup. But I don't have the name.

Indeed, sorry my initial reading on this was a bit to hasty. You're right, the only way we have to do this lookup is by iterating over the entire directory. That said, I still think it may be worthwhile to add our own get_name() callback. There's a lot of extra work, in addition to the prefetching, that ends up happening in the fallback zfs_readdir case. That could all be eliminated which I imagine would speed things up further.

@clhedrick
Copy link
Author

I think the place to look is zap_value_search. But there's another thing to think about: would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found.

@behlendorf
Copy link
Contributor

I think the place to look is zap_value_search.

This should work. The directory ZAP stores the dataset object number as the first integer, and we use this same object number when setting the inode number. The virtual .zfs snapshot directories doesn't appear the directory ZAPs and use virtual inode number so there might be a minor corner case there we need to handle.

would it make sense to leave the prefetch, but when something is prefetched, if it's a directory, fill in the name and mark it connected? That would eliminate almost all of the searching, since a search happens only if a connected dnode can't be found.

If we've gone through all the expense of prefetching the directory it makes sense to me to populate it and mark it connected.

@clhedrick
Copy link
Author

My feeling is that if we make prefetch connect the dnode, performance of this loop isn't so critical, so we don't need our own get_name. As long as the directory dnode is connected, none of this code will be called.

This is for realistic cases. In principle, I think the limit to prefetch is needed independent of the NFS issue described here. ZFS claims to support huge directories. But if you actually tried to use them, even if you aren't using NFS and thus don't have the issue described in this request, the first time you do an "ls" of a directory with a zillion files the prefetches will thrash the ARC and possibly fill task queues and other things. I can't say whether this would ever happen in reality.

I note that the BSD code for readdir doesn't prefetch.

Anyway, connecting he dnode would be interesting. zfs_readdir triggers an async operation. It ends with a callback to prefetch done, which doesn't know what's being prefetched. So we'd have to use the prefetch finish callback, I assume

This is getting beyond what I think I can do. For the moment we'll stick with this patch. It's hard to know whether anyone else is seeing this issue, since it wouldn't show any symptom other than more CPU usage than you'd like. A lot of our folks do AI. the training datasets seem to have a structure that triggers this. I doubt we're the only place doing this.

@clhedrick
Copy link
Author

It's actually not so clear that we need the prefetch. There may be enough info ni the inode. But I don't understand what is going on in the reconnect code, so I can't tell what would be needed to actually do what I propose, and whether it would be expensive enough to cause other issues.

@amotin
Copy link
Member

amotin commented Sep 6, 2024

I note that the BSD code for readdir doesn't prefetch.

As I can see, the prefetch code is identical for Linux and FreeBSD -- first zfs_readdir() call always prefetches, while later do it only if client is actually trying to access the returned entries via zfs_dirlook(). I guess your workload in some way tries get more than just an entry name and inode number for the entries, that triggers following prefetches, which may be legal in that case (unless those accesses are done by unrelated tasks). The only question is that you may not need them once all the dnodes are already in ARC.

To address the last case of unneeded prefetches when the data are already in ARC to mind comes mechanism I implemented some time ago for speculative prefetcher in dmu_zfetch() (see missed argument) -- do not issue data prefetches (only prefetch indirects) for an I/O stream until we get at least one cache miss on demand read access. It gave significant CPU usage reduction there.

@youzhongyang
Copy link
Contributor

@[
    osq_lock+1
    __mutex_lock_slowpath+27
    mutex_lock+64
    dnode_block_freed+124
    dbuf_prefetch_impl+280
    dbuf_prefetch+43
    dmu_prefetch+208
    zfs_readdir+817
    zpl_iterate+105
    iterate_dir+178
    get_name+348
    exportfs_get_name+116
    reconnect_path+264
    exportfs_decode_fh_raw+289
    nfsd_set_fh_dentry+585
    fh_verify+494
    nfsd3_proc_getattr+134
    nfsd_dispatch+374
    svc_process_common+1205
    svc_process+201
    nfsd+257
    kthread+327
    ret_from_fork+31
]: 84231614

We have this issue on our server. In 60 seconds, there were 84231614 times osq_lock() stack traces. The nfsd processes were 90%+ busy.

@behlendorf behlendorf added the Status: Revision Needed Changes are required for the PR to be accepted label Nov 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Design Review Needed Architecture or design is under discussion Status: Revision Needed Changes are required for the PR to be accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants