Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: ublk zoned #28

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

wip: ublk zoned #28

wants to merge 5 commits into from

Conversation

metaspace
Copy link

@metaspace metaspace commented Nov 10, 2022

Hi!

This is a work in progress effort to add zoned storage support to ublk. Kernel patches are here.

Any feedback is appreciated :)

@@ -17,6 +17,8 @@
#define UBLK_CMD_STOP_DEV 0x07
#define UBLK_CMD_SET_PARAMS 0x08
#define UBLK_CMD_GET_PARAMS 0x09
#define UBLK_CMD_REPORT_ZONES_FETCH 10
#define UBLK_CMD_REPORT_ZONES_COMMIT_FETCH 11
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should let SET_PARAS/GET_PARAMS to cover any zone
specific parameters. And you should define one new structure of 'ublk_param_zoned' which includes all zoned specific definitions, meantime reserve a few fields for future extension.

The above is the UAPI(kabi) between userspace and kernel driver,
which is important and should be done first.

Once the kabi definition is finalized, you can send one patch to linux-block for starting the zoned support, and this kind of chang shouldn't be big, just wire ublk/zoned with block layer.

Meantime working on userspace side(ublksrv side), so people
can verify the driver change. IMO, we should have one dedicated zoned target for simulating all zoned behavior, such as write on each zone, ZNS, ...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should let SET_PARAS/GET_PARAMS to cover any zone specific parameters. And you should define one new structure of 'ublk_param_zoned' which includes all zoned specific definitions, meantime reserve a few fields for future extension.

I added zone size via the chunk_sectors field that is already present in the basic parameters.

The above is the UAPI(kabi) between userspace and kernel driver, which is important and should be done first.

Once the kabi definition is finalized, you can send one patch to linux-block for starting the zoned support, and this kind of chang shouldn't be big, just wire ublk/zoned with block layer.

The kernel patch is currently undergoing internal review. I will send it next week when it is done (hopefully). The reason the patch is so big is report_zones. It is very few lines to wire all the zone operations in ublk, just add some conversion of the operation enumerations. But because report_zones is out-of-band in the block layer, it requires special handling. I added a uring based interface on the control device for this, leaning on how the data plane is implemented. report_zones is metadata, so I think it belongs in the control plane rather than the data plane.

If there is a more clean way to implement report_zones I can of course change the patches.

Meantime working on userspace side(ublksrv side), so people can verify the driver change. IMO, we should have one dedicated zoned target for simulating all zoned behavior, such as write on each zone, ZNS, ...

This patch adds support to the null target and to the loopback target. The null target is not much for testing, but I have run real workloads with the loopback target backed by a ZN540 drive. I did not try SMR yet, but I will test on SMR next week.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should let SET_PARAS/GET_PARAMS to cover any zone specific parameters. And you should define one new structure of 'ublk_param_zoned' which includes all zoned specific definitions, meantime reserve a few fields for future extension.

I added zone size via the chunk_sectors field that is already present in the basic parameters.

chunk_sectors can be reused for zoned device without any issue just
like kernel block layer, but other zoned parameters(max_open_zones,
max_active_zones) should be set/get by one standalone zoned parameter.

Adding them into ublk_param_basic will break kabi, and any new
parameter has to be add to the end of ublk_params.

The above is the UAPI(kabi) between userspace and kernel driver, which is important and should be done first.
Once the kabi definition is finalized, you can send one patch to linux-block for starting the zoned support, and this kind of chang shouldn't be big, just wire ublk/zoned with block layer.

The kernel patch is currently undergoing internal review. I will send it next week when it is done (hopefully). The reason the patch is so big is report_zones. It is very few lines to wire all the zone operations in ublk, just add some conversion of the operation enumerations. But because report_zones is out-of-band in the block layer, it requires special handling. I added a uring based interface on the control device for this, leaning on how the data plane is implemented. report_zones is metadata, so I think it belongs in the control plane rather than the data plane.

All control commands are supposed for handling generic control
function only, not for target specific function, that is how ublk's parameter is designed.

I believe the zone info can be abstracted as 'struct ublk_param_zoned', then report_zones can be implemented
inside ublk driver based on 'ublk_param_zoned' as one generic
block feature.

If the approach of adding ublk_param_zoned doesn't work, please just
let us know, and I am happy to figure out better solution.

If there is a more clean way to implement report_zones I can of course change the patches.

Addding 'ublk_param_zoned ' is the clean approach for supporting
zoned, and adding 'ublk_param_foo' is supposed to be the normal
way to add new feature.

Meantime working on userspace side(ublksrv side), so people can verify the driver change. IMO, we should have one dedicated zoned target for simulating all zoned behavior, such as write on each zone, ZNS, ...

This patch adds support to the null target and to the loopback target. The null target is not much for testing, but I have run real workloads with the loopback target backed by a ZN540 drive. I did not try SMR yet, but I will test on SMR next week.

I'd suggest to not modify null/loop targets for zoned, and adding
one totally new zoned target could be better, then you can extend
it to one really zoned implementation step by step, such as, add
ZNS, ....

Thanks,
Ming

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, please don't handle report zone by control command.
Instead, there could be at least two ways:

  1. define REPORT_ZONE as one IO command, just like
    other UBLK_IO_OP_ZONE_* you added, but one thing could be
    that the per-io buffer may not hold all the requested zoned info
    in single disk->fops->report_zones(), but the ublk driver can split
    it into several requests, and each request is done by one blk-mq
    pass-through request, please refer to nvme_ns_report_zones().

Then ublk server can handle them just like one normal IO.

  1. store all zoned info as one PARAMETER, but this way may
    waste memory a lot.

And I prefer to the 1st approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback :)

I think option 2 would not work, since the zone report contains dynamic information. Each time the write pointer changes, we would have to call set_parameters()?

I will try to see if I can do 1) and rewrite to tunnel the zone report request over the IO command path that is already in place.

@metaspace metaspace force-pushed the ubdsrv-zoned branch 2 times, most recently from 1ebdbf5 to a9eb3ff Compare November 14, 2022 12:23
#define UBLK_IO_OP_ZONE_CLOSE 11
#define UBLK_IO_OP_ZONE_FINISH 12
#define UBLK_IO_OP_ZONE_APPEND 13
#define UBLK_IO_OP_ZONE_RESET 15
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above IO_OP addition is good, but looks REQ_OP_ZONE_RESET_ALL is missed.

};

return __ublksrv_ctrl_cmd_with_ring(ctrl_dev, ring, &data, cqe_data);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the above report_zones_commit_and_fetch are not good.

Please just define one extra UBLK_IO_OP_ZONE_REPORT
for handling ->report_zone(). And ublksrv_io_desc is 24byte,
which is enough to hold zone_io_report_desc, which
usually just need start_sector and nr_zones.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand. You would prefer the zone report to go via the ring already in place for the IO requests? I will try to see if I can do this.

The reason I did not do this in the first place is that kernel request to uring request is using the kernel req->tag. So if I use the ring to do something that is not IO, I might grab a uring IO that is going to be used by blk-mq tag. But I guess I can submit a few extra uring requests for the zone report.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand. You would prefer the zone report to go via the ring already in place for the IO requests? I will try to see if I can do this.

The reason I did not do this in the first place is that kernel request to uring request is using the kernel req->tag. So if I use the ring to do something that is not IO, I might grab a uring IO that is going to be used by blk-mq tag. But I guess I can submit a few extra uring requests for the zone report.

No, when you implement ->report_zones() via blk-mq passthrough
request, it is still blk-mq request, and it is still blk-mq tag.

thanks,

Copy link
Author

@metaspace metaspace Nov 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, when you implement ->report_zones() via blk-mq passthrough
request, it is still blk-mq request, and it is still blk-mq tag.

I am not a kernel expert, so I might be wrong, but this is how I understand report_zones: report_zones() does not go through the normal block stack. It does not result in a kernel IO request. This is the call stack for report_zones ioctl when called on ublkbX:

  • blkdev_report_zones_ioctl() in blk-zoned.c
  • blkdev_report_zones() in blk-zoned.c
  • disk->fops->report_zones which we must implement in ublk_drv.c

None of this results in a blk-mq request. This is why I state that the report_zones ioctl is not implemented through the block layer like open/close/reset/finish zone is. It uses a out-of-band path. Checking nvme and null_blk I also see no struct request being submitted for report_zones.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

->report_zones() is one callback of ublk driver, inside this function,
you call allocate blk-mq requests to forward the request to ublksrv for handling them. Finally ublksrv just fills the report data and
complete the io command.

That is exactly what nvme ->report_zones() does, please see
nvme_ns_report_zones().

That is also why I suggested to define the uapi interface and
post driver code to linux-block for review.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

->report_zones() is one callback of ublk driver, inside this function, you call allocate blk-mq requests to forward the request to ublksrv for handling them. Finally ublksrv just fills the report data and complete the io command.

Alright, I can try to build it that way 👍

That is exactly what nvme ->report_zones() does, please see nvme_ns_report_zones().

I see, thanks for clarifying. I completely missed that.

tgt_null.cpp Outdated
.type = UBLKSRV_TGT_TYPE_NULL,
.name = "null",
.report_zone = null_report_zone,
.type = UBLKSRV_TGT_TYPE_NULL,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add one zoned target for handling zoned, and it isn't good
to mess up loop/null, IMO.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I can do that 👍

tgt_loop.cpp Outdated

if (ublk_op == UBLK_IO_OP_DRV_IN) {
struct {
struct blk_zone_report report;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not compile for me:

In file included from tgt_loop.cpp:5: /usr/include/linux/blkzoned.h: In function ‘co_io_job __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)’: /usr/include/linux/blkzoned.h:133:25: error: flexible array member ‘blk_zone_report::zones’ not at end of ‘struct __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::’ 133 | struct blk_zone zones[]; | ^~~~~ tgt_loop.cpp:321:41: note: next member ‘blk_zone __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::::zone’ declared here 321 | struct blk_zone zone; | ^~~~ tgt_loop.cpp:319:24: note: in the definition of ‘struct __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::’ 319 | struct { | ^ In file included from tgt_loop.cpp:5: /usr/include/linux/blkzoned.h:133:25: error: flexible array member ‘blk_zone_report::zones’ not at end of ‘struct __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::_ZL22__loop_handle_io_asyncP13ublksrv_queueP7ublk_ioi.frame’ 133 | struct blk_zone zones[]; | ^~~~~ tgt_loop.cpp:321:41: note: next member ‘blk_zone __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::::zone’ declared here 321 | struct blk_zone zone; | ^~~~ tgt_loop.cpp:384:1: note: in the definition of ‘struct __loop_handle_io_async(ublksrv_queue*, ublk_io*, int)::_ZL22__loop_handle_io_asyncP13ublksrv_queueP7ublk_ioi.frame’ 384 | } | ^

Why not just do this properly: allocate space for the requested number of zones, then get all those reported through the ioctl, just like in https://github.com/westerndigitalcorporation/libzbd/blob/master/lib/zbd.c#L520

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting all the zones (or up to a max page worth maybe) is a good idea. It is however not related to the reason the code is not building for you. Are you using an old compiler?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yhr I pushed a fix for that warning, can you check if it resolves your issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latest push looks better and compiles, thanks!

I think we have discussed this before, but what is the reason for not generating a full report? For a device with a thousand zones the kernel-userspace ping-pong game will be a long one. Packing up a report in a page will cut down the report calls with a factor of 17 , but it would be good to avoid any unneeded roundtrips.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have discussed this before, but what is the reason for not generating a full report? For a device with a thousand zones the kernel-userspace ping-pong game will be a long one. Packing up a report in a page will cut down the report calls with a factor of 17 , but it would be good to avoid any unneeded roundtrips.

I agree there is an opportunity for optimizing the zone report. But to avoid premature optimization, I did not include this in the the current version of the patch. I think the round trip to user space and back is around 1us, so we would waste 1ms for a device with 1k zones. However, I think this is completely dwarfed by the time current devices spend on producing a full zone report anyway? Also this would probably only be invoked during initialization phases, so it does not really matter. We can always change it?

We should encode the number of zones, one or more, in the kernel<->userspace api now though. That might be difficult to change later if we want to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to make sure that the uabi is reasonably future-proof, so reporting more than one zone at a time must be there I believe. As the code is today, the per-zone IOCTL will generate a l lot of overhead for real devices. I'll look over the kernel patch next - the uabi is probably best discussed on the block list .

@ming1
Copy link
Collaborator

ming1 commented Feb 28, 2023

Hi metaspace,

I think we have reached agreement that zoned implementation
should be done in tgt_null.c which is for test purpose, and should
be kept as it is.

So please do not send PR for messing tgt_null.c with zoned.

Thanks,

@ming1
Copy link
Collaborator

ming1 commented Feb 28, 2023

Hi metaspace,

I think we have reached agreement that zoned implementation should be done in tgt_null.c which is for test purpose, and should be kept as it is.

shouldn't be done.

@metaspace
Copy link
Author

Hi Ming,

I think we have reached agreement that zoned implementation should be done in tgt_null.c which is for test purpose, and should be kept as it is.

So please do not send PR for messing tgt_null.c with zoned.

Thank you for your response. Please note that this PR is just a draft. I understand that you would prefer the zoned implementation as a separate target. To comply with that, I plan to move the zoned code out of loop and null targets, into a separate zoned target. I also need to rebase on master I think.

However, you suggested handling the kernel<-->user interface before putting too much effort into patching ubdsrv. So currently this PR is just demonstrating one possible user space implementation, not necessarily the final one.

If you would prefer, we can close this PR until the kernel patches are accepted and then I can submit a new PR with a separate zoned target later? But I think it is nice to have this PR to track the work.

In case you missed them, the kernel patches are here. I tried to CC you :)

Best regards
Andreas

@ming1
Copy link
Collaborator

ming1 commented Mar 1, 2023

Hi Ming,

I think we have reached agreement that zoned implementation should be done in tgt_null.c which is for test purpose, and should be kept as it is.
So please do not send PR for messing tgt_null.c with zoned.

Thank you for your response. Please note that this PR is just a draft. I understand that you would prefer the zoned implementation as a separate target. To comply with that, I plan to move the zoned code out of loop and null targets, into a separate zoned target. I also need to rebase on master I think.

However, you suggested handling the kernel<-->user interface before putting too much effort into patching ubdsrv. So currently this PR is just demonstrating one possible user space implementation, not necessarily the final one.

If you would prefer, we can close this PR until the kernel patches are accepted and then I can submit a new PR with a separate zoned target later? But I think it is nice to have this PR to track the work.

In case you missed them, the kernel patches are here. I tried to CC you :)

I know this kernel patch, and I meant even for test purpose only,
it is still better to write one clean & intact(at least for interface verification) target for verification & reviewing kabi interface, which won't be modified any more after
kernel patch is merged(can only be extended). So here we need
to pay more attention to.

I will review the kernel patch in this week.

Thanks,
Ming

building on my Debian 11 system with gcc I get these errors:

      CXX      ublk-tgt_loop.o
    tgt_loop.cpp: In function ‘int loop_init_tgt(ublksrv_dev*, int, int, char**)’:
    tgt_loop.cpp:123:27: error: ‘UINT_MAX’ was not declared in this scope
      123 |    .max_discard_sectors = UINT_MAX >> 9,
          |                           ^~~~~~~~
    tgt_loop.cpp:8:1: note: ‘UINT_MAX’ is defined in header ‘<climits>’; did you forget to ‘#include <climits>’?
        7 | #include "ublksrv_tgt.h"
      +++ |+#include <climits>
        8 |

Adding the <limits.h> header to ublksrv_tgt.h as suggested by Ming Lei.

Signed-off-by: Bart Trojanowski <[email protected]>
@ming1 ming1 marked this pull request as ready for review July 11, 2023 11:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants