Add vdev property to bypass vdev queue #16591

Open
MigeljanImeri wants to merge 1 commit into master

Conversation

MigeljanImeri
Contributor

Added a vdev property to bypass the vdev queue when reading from or writing to the vdev. The intention behind this property is to improve performance when using O_DIRECT.

Motivation and Context

Bypassing the vdev queue yields significant performance improvements when doing O_DIRECT random reads on NVMe SSDs. We noticed a contention point on vq->vq_lock in a flamegraph and wondered whether this contention could be reduced by bypassing the vdev queue.

Description

An additional check is needed in vdev_queue_io_done to see if the ZIO was actually added to the queue, since the property may have been toggled in between vdev_queue_io and vdev_queue_io_done.

Using a gen 5 NVMe SSD, we were able to get roughly 3.5x the IOPS (136k -> 469k) by bypassing the vdev queue when doing O_DIRECT random reads.
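
As a rough usage sketch (pool and device names are placeholders, and the property is shown under the queue_io name it ended up with in the man page change below), the property is toggled per leaf vdev with zpool set:

    # Hypothetical example: "tank" and "nvme0n1" are placeholders.
    zpool set queue_io=off tank nvme0n1   # bypass the vdev queue for this leaf vdev
    zpool get queue_io tank nvme0n1       # confirm the current setting
    zpool set queue_io=on tank nvme0n1    # restore normal queueing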

How Has This Been Tested?

Added a test case to check that toggling the bypassqueue property does not cause a kernel panic. While creating this vdev property, I ran into a situation where, if the property was toggled while a ZIO was going through the pipeline, it would panic because the ZIO was expected to be in a queue it had never been added to. We now check whether the ZIO has been added to the vdev queue before removing it.
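
A minimal sketch of that toggle test, using the test-suite helpers already shown in this PR (log_must, $TESTPOOL); the device variable and loop count are placeholders, not the exact test code:

    # Illustrative only: toggle the property repeatedly while I/O is in flight.
    typeset -i i=0
    while ((i < 10)); do
            log_must zpool set queue_io=off $TESTPOOL $DISK
            log_must zpool set queue_io=on $TESTPOOL $DISK
            ((i += 1))
    done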

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

Member

@amotin left a comment

I guess it might make sense for some very fast NVMe's with "unlimited" queue depth and I/O rate, since others may suffer from the loss of I/O aggregation as a side effect.

Also it seems that you are still obtaining the queue lock on completion, aren't you?

Also I don't see what would initialize io_queue_state you check on completion if the queue is bypassed.

Also the property seems to have negative meaning. It is preferable to use positive ones to avoid double negation.

@MigeljanImeri
Contributor Author

> I guess it might make sense for some very fast NVMe's with "unlimited" queue depth and I/O rate, since others may suffer from the loss of I/O aggregation as a side effect.

> Also it seems that you are still obtaining the queue lock on completion, aren't you?

Changed to avoid obtaining the queue lock if nothing gets removed from the queue.

> Also I don't see what would initialize io_queue_state you check on completion if the queue is bypassed.

Changed to set the io_queue_state if not queueing the zio.

> Also the property seems to have negative meaning. It is preferable to use positive ones to avoid double negation.

Changed the property to use a positive meaning.

@@ -156,6 +156,8 @@ If this device should perform new allocations, used to disable a device
when it is scheduled for later removal.
See
.Xr zpool-remove 8 .
.It Sy queue_io
Add io to the vdev queue when reading or writing to this vdev.
Contributor

Might want to give a hint to the user about when they would want to change this prop:

Add IO to the vdev queue when reading or writing to this vdev.
Disabling this property can sometimes improve performance for direct IOs.

Feel free to re-word this ⬆️

Contributor Author

Added.

Comment on lines 70 to 83
log_must eval "fio --filename=$mntpnt/foobar --name=write-file \
--rw=write --size=$MINVDEVSIZE --bs=128k --numjobs=1 --direct=1 \
--ioengine=sync --runtime=10"

Contributor

Would it make sense to also do the toggle_queue_io test while doing the write?

Contributor Author

Sure, makes sense to me.
Added.

verify_runnable "global"

command -v fio > /dev/null || log_unsupported "fio missing"

Contributor

I think you need to save/set/restore the zfs_dio_enabled module param to use Direct IO

log_must save_tunable DIO_ENABLED
log_must set_tunable32 DIO_ENABLED 1
...
log_must restore_tunable DIO_ENABLED

Contributor Author

Added.

@robn
Member

robn commented Oct 7, 2024

I am uncomfortable with this change.

I agree that the vq_lock runs white-hot when the disks underneath are very very fast. I've spent many months trying to reorganise this lock, with mixed success, mostly because in all my testing it just moves the contention to other locks. That was all pre-direct-IO though, so that may be different now, though my own workloads are not O_DIRECT.

So the main thing I know that this will "break" is that the IO is no longer tracked for the "slow IO" checks to find (vdev_deadman()). That's actually the main reason this lock runs hot, because removing the IO from the list is the only reason we take it on the IO return path.

The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward.

I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workload, though it's a little tough with module parameters being scoped far wider than a single vdev instance. Though maybe that's fine to start for an experimental feature; the other vdev_queue_* parameters have the same issue.

For now though I'd be against having this as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet.

If I were overruled here, I would at least make the scope far more narrow: maybe limit it to Direct IO only, and definitely make the name clearly reflect that narrower scope. I would suggest going with something like your original name and making it "queue bypass" or similar (eg "directio queue bypass" if it were limited to those).

(I appreciate the point about avoiding double-negatives, but I think it's possible to claim that "bypass" is a specific, positive action. At least, for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming).

@MigeljanImeri
Contributor Author

> I am uncomfortable with this change.

> I agree that the vq_lock runs white-hot when the disks underneath are very very fast. I've spent many months trying to reorganise this lock, with mixed success, mostly because in all my testing it just moves the contention to other locks. That was all pre-direct-IO though, so that may be different now, though my own workloads are not O_DIRECT.

I have been working with @bwatkinson on improving read IOPS performance, and we have also noticed that getting around this lock just seems to shift contention elsewhere. I am hoping that with further research and testing we can reduce this contention further; bypassing the vdev queue is just one step.

> So the main thing I know that this will "break" is that the IO is no longer tracked for the "slow IO" checks to find (vdev_deadman()). That's actually the main reason this lock runs hot, because removing the IO from the list is the only reason we take it on the IO return path.

> The other part I'm just nervous about is that I don't know how it will interact with IO that is on the queue. Seems like it could starve other high-priority IO. This sort of stuff is why I've spent so long on it without a conclusive path forward.

> I would weakly support this being done as a module parameter so it's at least available in the field for experimentation on specific customer workload, though it's a little tough with module parameters being scoped far wider than a single vdev instance. Though maybe that's fine to start for an experimental feature; the other vdev_queue_* parameters have the same issue.

As for implementing this as a module parameter, I am not too sure which way we should go. I am okay with it either way; maybe others (@amotin, @tonyhutter) can provide some insight into what should be done here?

> For now though I'd be against having this as a vdev property, because that is both more visible and more permanent, and I don't have the confidence to support it yet.

> If I were overruled here, I would at least make the scope far more narrow: maybe limit it to Direct IO only, and definitely make the name clearly reflect that narrower scope. I would suggest going with something like your original name and making it "queue bypass" or similar (eg "directio queue bypass" if it were limited to those).

I agree that this property should probably be limited to O_DIRECT IOs only, as from the testing that I have done, it doesn't seem to improve performance with non-O_DIRECT workloads.

> (I appreciate the point about avoiding double-negatives, but I think it's possible to claim that "bypass" is a specific, positive action. At least, for me, "queue io" is way too vague, and might cause me to say "why wouldn't I turn that off? Queues sound slow!". But this is only a tiny part of my objection to this whole thing, so if it does stay in its current form, I'm unlikely to worry too much about naming).

@bwatkinson
Contributor

bwatkinson commented Oct 23, 2024

@behlendorf do you have an opinion on whether we should stick with a VDEV property that is limited to Direct I/O or would a module parameter make more sense at this point? I see a good argument for both.

@tonyhutter
Contributor

My preference would be for a vdev property, since you could set it on NVMe vdevs but not on spinning disks. This assumes we're 100% comfortable with this feature though.

@MigeljanImeri do you have any thoughts on @robn's comments regarding breaking the deadman IO checks and the IO priority starvation? For example, if you submit an un-queued IO, and the drive "eats" the IO with no response, how does this codepath handle that?

@MigeljanImeri
Contributor Author

> My preference would be for a vdev property, since you could set it on NVMe vdevs but not on spinning disks. This assumes we're 100% comfortable with this feature though.

> @MigeljanImeri do you have any thoughts on @robn's comments regarding breaking the deadman IO checks and the IO priority starvation? For example, if you submit an un-queued IO, and the drive "eats" the IO with no response, how does this codepath handle that?

If we restrict this to only O_DIRECT reads and writes, the O_DIRECT code path should propagate up any errors that occur. In terms of priority starvation, I don't foresee this being an issue with newer NVMe SSDs that are very fast, although it could be an issue with slower devices. Maybe we could include a warning in the property description that starvation could occur if the underlying disk can't keep up with the requests?

@tonyhutter
Contributor

@robn I was actually able to get this PR to trigger deadman events even with queue_io=off.

I set the ZFS params to:

zfs_deadman_checktime_ms=100
zfs_deadman_ziotime_ms=500
zfs_deadman_enabled=1

I then created a scsi_debug device with these params set:

modprobe scsi_debug max_luns=1 num_tgts=1 add_host=1 dev_size_mb=1000

... then created the pool with the scsi_debug device. I then set these additional scsi_debug params:

delay=100
every_nth=1000

Then I copied the zfs/build directory from the zfs source workspace to the pool and saw deadman events generated. This was NOT using O_DIRECT writes - just normal writes.
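
Pulled together, the repro amounts to roughly the following (the sysfs paths are the usual locations for the ZFS and scsi_debug parameters; the pool and device names are placeholders):

    # Assumed parameter locations and device names; adjust for your system.
    echo 100 > /sys/module/zfs/parameters/zfs_deadman_checktime_ms
    echo 500 > /sys/module/zfs/parameters/zfs_deadman_ziotime_ms
    echo 1   > /sys/module/zfs/parameters/zfs_deadman_enabled

    modprobe scsi_debug max_luns=1 num_tgts=1 add_host=1 dev_size_mb=1000
    zpool create testpool /dev/sdX            # sdX = the scsi_debug device
    zpool set queue_io=off testpool sdX

    echo 100  > /sys/bus/pseudo/drivers/scsi_debug/delay
    echo 1000 > /sys/bus/pseudo/drivers/scsi_debug/every_nth

    cp -r zfs/build /testpool/                # normal (non-O_DIRECT) writes
    zpool events -f | grep -i deadman         # deadman events still appear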

@tonyhutter
Contributor

I'm going to see if I can starve out a scrub and MMP write with queue_io=off.

@robn
Member

robn commented Oct 24, 2024

@tonyhutter thanks for trying to test those cases; I'm mostly going from memory and gut feel, and it's good to have things confirmed, or not!

From memory the deadman path takes the front (oldest) zio from the queue active list and uses it for the slow report, because we know it's the oldest one outstanding. It therefore follows that if the slow IO is not on that list, we won't report on it. The slow IO however will still be holding up the txg sync, and with no reporting, all the knock-on "deadman fired" events (like suspending the pool, zed actions, etc) won't fire.

That might be ok! It might be ok to just document, "you lose all these other things" or something, so long as we're aware of it.

But this is part of why, for now, I'm against a vdev property: because there's knock-on effects that are to some extent tied up in implementation details that I definitely want to change, and I'm not aware of any policy or process around deprecating/removing a property in the future.

To be clear, I'm not against having the option available. If it had started as a module parameter, I would not have flinched. I think it's crucial we have ways to change internal behaviour in the field to gain more experience without too much overhead. My only concern with any of this is that we can support it into the future. Properties feel to me like real "user interface", like buttons, switches or lights on the dashboard, while module parameters feel more like messing around under the hood.

If I'm overblowing it, feel free to tell me to go away: I get it, the "best" way isn't always possible, and I'm not gonna be the person blocking this - it's important we can learn more about these use-cases.

Other possible options:

  • make it a hidden property; a secret switch under the dashboard is perhaps a bit more defensible
  • add some helpers so we can target module parameters to specific objects (vdevs, datasets, pools)
  • add some sort of "transient" property type that can be set but isn't stored
  • gate the property behind a module parameter, so it does nothing unless a parameter is also set

I volunteer to make any/all of these possible if it would help (I'm not gonna request a bunch of unrelated work from someone, that's rude).

@robn
Member

robn commented Oct 24, 2024

(incidentally, I love how many subtle and weird things properly fast IO is throwing up. It's nice to be solving hard problems!)

@tonyhutter
Contributor

@robn - some random thoughts:

One alternative is to pull it in as a vdev property and then keep it in master until 2.4.x. So don't backport it to 2.3.x or 2.2.x. That will give it some soak time and will help expose any unknowns. One thing I didn't think of though is that you can set vdev properties on the root vdev, top-level vdevs, and leaf vdevs, which could make this complicated. Like, what happens if you have queue_io=off on the top-level vdev, queue_io=on on the mirror-0 vdev, and queue_io=off on one of the leaf mirror vdevs?

Another idea would be not have the vdev property at all and just overload zfs_vdev_def_queue_depth=0 to mean "don't queue anything". That makes me curious what setting zfs_vdev_def_queue_depth=0 does without this PR...

@robn
Member

robn commented Oct 24, 2024

There's currently no inheritance on vdev properties, so nothing happens. We've talked a little internally at Klara about what property inheritance should mean, and turns out it's kinda weird no matter how you slice it. But yeah, for now, nothing.

I did think about getting it on master and then not backporting it, but I don't think it gets us much, and possibly makes it worse. "Soak time" assumes that there are people running master and kicking the tyres on new things, and to me this feels too niche for many (any?) people to actually do that. If they don't, then we're just having to make the same decisions when it's time to cut 2.4. If they do, then it's the same problem, in that we have a user-accessible lever tied to an implementation detail, with the bonus that it's already enabled on pools already out there.

If we were doing a module parameter, I would just make a new one, not overload an existing one. Based on precedent, I don't consider module parameters to be stable between release series, and I'd have no problem shipping one in (say) 2.3 and then removing it in 2.4. We already do this with lots of different parameters (most notably ARC related).


(Below follows an extended ramble where I ask a bunch of questions, sort of answer them along the way, and come to the same conclusion. The joys of vacation; got time on my hands).

Maybe there's some clarity to be found in considering when and why an operator might want to enable this option, or maybe better, when not to. That's something for me too; I initially saw a mention of vq_lock and kinda rubbed my hands together because I've been training for this moment 💪 and maybe I haven't thought about it on its merits.

So, I'm an operator, and I've got my shiny new NVMe pool, and I feel like it's going a bit slow. I search around and find people talking about an option that makes it go faster. I turn it on and indeed, it goes faster. I am happy! Hurrah! Or, I turn it on, and it doesn't go faster because my workload (or pool topology or overall system architecture) isn't a fit for it, and I'm now annoyed, but I don't bother turning it off because eh, it doesn't do anything anyway.

Alternatively, I'm not using an NVMe pool, just regular spinners, but I read about the option and wonder, and I turn it on. Everything suddenly goes slower, so I turn it off again, fine. But maybe it doesn't go slower, because I have some funky setup with tons of block cache "under" ZFS. (I dunno, I'm trying to invent a scenario I'm pretty sure isn't real).

So! Is this only a GO FAST / GO SLOW toggle? Or does it affect something deeper?

Let's say I'm right for the moment, and it does have two knock-on effects: no slow IO warnings for bypassed IOs, and priority queued IO may not get serviced if bypassed IOs are hogging all the worker threads. As an operator, am I going to be surprised if I toggle this and other things stop working properly? What if I didn't know that they might beforehand? What if I did know but didn't really understand it?

What if it didn't affect something deeper? What if we "fixed" those? Like, maybe queue bypass only works when we're not scrubbing or other things that "naturally" degrade pool performance, though maybe we want to prioritise ZIL writes still, or MOS updates, or...? Maybe queue bypass only works when the pool is definitely in good health and responding well (ie when we can be more certain that slow IOs wouldn't have happened anyway).

I hope it's obvious that these are examples; these two may not end up being problems at all. There might be different things we haven't thought of. New ones might come up as a result of future changes; how does that affect the property being enabled or not? What if there was a future where vq_lock didn't exist (I'm working on it I swear). What's the user's expectation then?

If it's only limited to direct IO, does that change the answers? Does it add any new problems?

And then, if it was possible for it to be only a GO FAST / GO SLOW toggle, would we even need it? Is there ever a time where we wouldn't bypass the queue for "safe" IO, and so don't even need a toggle?

What if it wasn't an explicit "bypass the queue", but more of a hint: "prioritise direct IO when safe to do so?". That would sidestep most of the questions, because it clearly says "idk, give it a try, I promise I won't mess up your pool". Is there any value in such a thing?

But if we had that, then how is that any different in concept to the existing priority system? Note "in concept"; I know it would suck under the existing priority system because of how it's currently implemented, but that's not the point.

So I don't have complete answers for all of this, and I'll admit that probably my partial answers are informing my questions a bit, so again, I am not assuming that I'm definitely right here. But, all this is why this seems like a module parameter to me, because those are the things we currently have to influence implementation-specific behaviour. Lifting the lid to flip a "bypass queue" switch carries with it an acknowledgement that the operator knows they're doing something weird, and at least, they now know there is a thing called "queue" and presumably it's there for a reason. So they flip it and everything goes wrong and at least maybe they know why, and if they ask for help we can say "ok, turn that off, and thanks for the feedback on your scenario".

The one shortcoming here is about wanting to apply this to a specific vdev. I know that properties are the only comfortable way to do that right now, but that doesn't mean they're necessarily the right way, and the only reason we don't have some more transient way is that we haven't needed one. Is this the change that forces our hand? Ask yourself: if I had said at the start "this should be a vdev-specific module parameter" because that's a facility we already had, would this discussion even have happened? If it's "I don't think so", then I will go and write that facility this weekend (and use the hell out of it myself; I have a patch to lift the vdev queue limit parameters into vdev properties that I finished but shelved for all the reasons above: I didn't want to commit to surfacing those implementation details yet).

Conclusion: don't put a "bypass queue" button on the dashboard without first (a) defining "queue", (b) committing to "queue" being a thing in the long term and (c) making sure it does only that. But feel free to put whatever you want under the hood, because it's already expected that (a) there's loads of things in there you don't understand, (b) the next model could have a totally different engine and (c) you have no expectation that it won't do weird things to other things in there, because it's a whole engine and everything is there for a reason, right?

(I genuinely did not start out with car analogies, but there's a reason they're so prevalent in engineering disciplines!)

@tonyhutter
Contributor

I think you raise some good points on the vdev prop vs module param discussion. I'll have to think it over a little more.

> What if it didn't affect something deeper? What if we "fixed" those? Like, maybe queue bypass only works when we're not scrubbing or other things that "naturally" degrade pool performance, though maybe we want to prioritise ZIL writes still, or MOS updates, or...?

Hmm.. could we do a hybrid approach where we allow O_DIRECT queue bypass if and only if the pending queue is empty (vq->vq_active == 0)?:

zio_t *
vdev_queue_io(zio_t *zio)
...
	if (!zio->io_vd->vdev_queue_io && vq->vq_active == 0) {
		zio->io_queue_state = ZIO_QS_NONE;
		return (zio);
	}

So if you had a workload that was 99% O_DIRECT writes and 1% queued writes, you'd get most (but not all) of your O_DIRECT writes to bypass. That would solve the starvation problem but not the slow IO detection. I don't know if this would actually work, just thinking aloud. You could get stuck in a "pileup" problem where there exists one queued write, so the next 99 O_DIRECT writes get queued up too, and then you never get back into bypass mode since the queue is backed up. Maybe there are enough gaps where the queue goes back to zero that you can get into bypass mode fairly easily. I'd be curious to test this on real hardware.

> Maybe queue bypass only works when the pool is definitely in good health and responding well (ie when we can be more certain that slow IOs wouldn't have happened anyway).

If your disk was unhealthy and not responding to IOs, and you had at least a handful of queued-IOs (even just 1% of the writes), then presumably those would get reported as slow IOs, and you'd at least have somewhat of a canary in the coal mine, right? So maybe this slow IO detection isn't as much of an issue?

@tonyhutter
Contributor

@MigeljanImeri I've been trying to test/benchmark this PR, but in the process I've uncovered some unrelated bugs in ZFS (#16707, #16695) that are slowing down my testing. Can you describe your setup for getting those 3.5x gains you mentioned? Did you get those running fio with O_DIRECT? What did your pool config and record size look like?

@MigeljanImeri
Contributor Author

> @MigeljanImeri I've been trying to test/benchmark this PR, but in the process I've uncovered some unrelated bugs in ZFS (#16707, #16695) that are slowing down my testing. Can you describe your setup for getting those 3.5x gains you mentioned? Did you get those running fio with O_DIRECT? What did your pool config and record size look like?

I was running just a single pool with one vdev, with recordsize set to 4k. I wrote out 256 GB over 256 files (each one gig) with O_DIRECT, then did O_DIRECT random 4k reads with fio, 256 jobs, each job reading from its own individual file (we saw some weird contention with rangelocks and this was how we got around it).
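
As a rough illustration of that read phase (dataset, mount point, and runtime are placeholders; this is not the exact job file used):

    # Illustrative fio invocation; assumes 256 pre-written 1 GiB files under /tank/fs.
    zfs set recordsize=4k tank/fs
    fio --name=randread --directory=/tank/fs --rw=randread --bs=4k \
        --direct=1 --ioengine=psync --numjobs=256 --nrfiles=1 \
        --size=1g --runtime=60 --time_based --group_reporting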

@tonyhutter
Contributor

tonyhutter commented Oct 30, 2024

I just tested with a single NVMe pool. Sure enough, turning off queuing does improve IOPS with O_DIRECT:

-- IOPS --
fio, blocks=4k, seq read, direct=1
queue on: 150k
queue off: 280k

fio, blocks=4k, randread, direct=1
queue on: 136k
queue off: 297k

@tonyhutter
Contributor

I did get over 500k IOPS with direct=0, but I'm assuming that's due to caching.

@tonyhutter
Contributor

I'll try to run my "scrub while FIOing" test again, but this time using the NVMe drive, since there seems to be weird stuff going on with brd + ZFS.

	break;
}
if (vd->vdev_ops->vdev_op_leaf) {
	vd->vdev_queue_io = intval & 1;
Member

Do you plan it to be a bitmask, or why else & 1?

Contributor Author

Ah, no. I removed the & 1.

Comment on lines 986 to 988
if (zio->io_queue_state == ZIO_QS_ACTIVE) {
	mutex_enter(&vq->vq_lock);
	vdev_queue_pending_remove(vq, zio);
}
else
	return;
Member

Suggested change
-	if (zio->io_queue_state == ZIO_QS_ACTIVE) {
-		mutex_enter(&vq->vq_lock);
-		vdev_queue_pending_remove(vq, zio);
-	}
-	else
-		return;
+	if (zio->io_queue_state == ZIO_QS_NONE)
+		return;
+	mutex_enter(&vq->vq_lock);
+	vdev_queue_pending_remove(vq, zio);

Contributor Author

Added.

Added vdev property to toggle queueing io to vdev queue when
reading / writing from / to vdev. The intention behind this
property is to improve performance when using o_direct.

Signed-off-by: MigeljanImeri <[email protected]>
@MigeljanImeri
Contributor Author

I limited the queue bypass to direct IO only. I think I've come to the conclusion that this should probably be a module parameter rather than a vdev property, due to the potential issues that could arise with starvation and slow IO detection. The only concern I have is that the current vdev module parameters aren't specific to a vdev. Ideally, we would be able to toggle this for individual vdevs.

@tonyhutter
Contributor

tonyhutter commented Nov 22, 2024

I'm having trouble seeing the benefits of the "Direct IOs can bypass the queue" special case. What does that get us over just having queue_io control all IOs? If the idea is "my workload runs Direct IO, and I know what I'm doing, so I'm fine with bypassing the queue and accept the risks that go along with it", that's fine, but keep in mind other users on your system may be using Direct IO as well. In fact a user may be running an application that uses Direct IO without their knowledge (like a database), so they could be buying into the relaxed safety assumptions unwittingly. That's why I lean towards queue_io controlling all IOs. It's just simpler, and it allows the user to bypass the queue for normal, non-Direct IOs as well.

There's one other "hybrid" option we could consider:

queue_io=on|off|hybrid

hybrid: If this is a "normal IO" and the queue is empty, then do not queue the IO. That should prevent starvation of the important IOs. Assuming most of the IOs are "normal", then most of the time you're not going to be queuing, which is what you want. Important IOs would be things like ZIO_FLAG_IO_REPAIR, ZIO_FLAG_SELF_HEAL, ZIO_FLAG_RESILVER, ZIO_FLAG_SCRUB, ZIO_FLAG_IO_RETRY... In those cases you would always queue. So if you're running a resilver, a bunch of resilver IOs are going to get queued, which means your "normal IOs" get queued as well, and so you don't starve out the resilver.

Hybrid mode would look a little like this (untested):

vdev_queue_io(zio_t *zio)
...
    zio->io_flags |= ZIO_FLAG_DONT_QUEUE;
    zio->io_timestamp = gethrtime();

    if (!zio->io_vd->vdev_queue_io && ZIO_IS_NORMAL(zio)) {
        int length = vdev_queue_class_length(zio->io_vd, zio->io_priority);
        if (length == 0) {
            /* Queue is empty - bypass queue */
            zio->io_queue_state = ZIO_QS_NONE;
            return (zio);
        }
    }

I don't think we'd ever default queue_io to anything other than "on". You would have to manually switch it to "off" or "hybrid" to signal buy-in for the relaxed protections.

I don't think this should be a module param because you can have big discrepancies in latency and bandwidth between disks in a pool. Imagine you have a pool comprised of spinning disks, and some NVMe drives for a special device. You would want queue_io=on for the spinning disks and queue_io=off|hybrid for the NVMe. This is also true since you'd want slow IO detection for the spinning disks, but probably wouldn't care for NVMe. Now consider the same pool with spinning disks and NVMe, but you only had a single, monolithic zfs_queue_io module parameter. Would you set it to "on" or "off" or "hybrid"?
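
For example (device names are placeholders), per-vdev properties would let you mix policies within a single pool:

    # Hypothetical mixed pool: keep queueing on the spinning disks,
    # bypass the queue only on the fast special device.
    zpool set queue_io=on  tank sda
    zpool set queue_io=on  tank sdb
    zpool set queue_io=off tank nvme0n1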

Labels
Status: Code Review Needed (Ready for review and testing)

6 participants