viostor: cause qemu virtio_error when an uncompleted request times out #907
Comments
Thanks for reporting this issue. By any chance, do you know if the problem is reproducible on x64 platforms? Thanks,
I'm sorry, the Windows system log and the disk/file-system-related events cannot be uploaded. Thanks,
Technically, the following scenario is possible:
If that is the case, try increasing the IoTimeOut value to see if it makes any difference.
The original test with the Windows 11 guest did not set TimeoutValue or IoTimeOutValue.
So, TimeoutValue only works for virtio-scsi, but not for virtio-blk? And IoTimeOutValue is for virtio-blk? Also, even if we set IoTimeOutValue, technically it can still cause the virtio_error problem. Thanks,
If I'm not mistaken, TimeOutValue is for SCSI Port miniport drivers (not related to the viostor and vioscsi drivers, which are both Storport miniports). We need to understand that 60 seconds is a huge time period, and we can expect a request to be completed way sooner, even if it is a VM working with network storage. We can also suspect that something is wrong with vCPU thread scheduling or with the storage back end. Regarding the solution, I think the most appropriate way to make it work properly will be to implement the proper reset handling provided by Storport. Best,
Do you have a plan to solve the potential problem thoroughly in the future? Thanks,
Yes. Even though the LUN reset logic implemented in #683 and #684 helps to deal with lost/uncompleted request issues, it is not a complete solution. Best regards,
Thanks, I expect this feature to be implemented soon. I have another question about the "IoTimeOutValue" for our viostor: is the registry entry path "HKLM\System\CurrentControlSet\Services\viostor\Parameters"? Thanks,
Yes, [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\viostor\Parameters] is the right path for IoTimeOutValue. Best,
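For reference, a minimal .reg sketch of the setting discussed above; the 200-second figure is only the example value used later in this thread, and the appropriate value depends on your storage back end:

```reg
Windows Registry Editor Version 5.00

; Storport miniport I/O timeout for viostor; 0xc8 = 200 seconds.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\viostor\Parameters]
"IoTimeOutValue"=dword:000000c8
```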
I set IoTimeOutValue to 200 s in the Windows guest and added a 60 s sleep before popping available I/O from the vring in QEMU. After 60 s, it pops all the available I/O in the vring. Thanks,
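For anyone trying to reproduce this, a minimal sketch of that delay hack; the placement (just before the virtio-blk handler drains the avail ring) and the delay_once() helper are assumptions, not the exact change used above:

```c
/* Hypothetical reproduction aid, not the exact patch from this thread:
 * stall the first batch of virtio-blk requests for longer than the guest's
 * I/O timeout so the miniport declares them timed out while they are still
 * sitting, unpopped, in the avail ring. */
#include <stdbool.h>
#include <unistd.h>

static void delay_once(unsigned int seconds)
{
    static bool done;
    if (!done) {
        done = true;
        sleep(seconds); /* e.g. 60 s: longer than the guest's I/O timeout */
    }
}

/* Call delay_once(60) right before the virtqueue_pop() loop in the
 * virtio-blk request handler (file/function placement is an assumption). */
```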
@vrozenfe and @wangyan0507, I have backported the fix for vioscsi into viostor. I think the removal of unnecessary VQ unlocks will fix this issue. There also appears to be a slight performance increase: I was able to get 4.4 GB/s on the boot+system disk and 6.1-6.3 GB/s on additional disks. Perhaps you could test it for us, Yan Wang? You will need to build against master and enable ...

--- ./a/viostor/virtio_stor_hw_helper.c 2024-08-06 05:42:24.000000000 +1000
+++ ./b/viostor/virtio_stor_hw_helper.c 2024-08-15 21:33:31.283522952 +1000
@@ -160,6 +160,7 @@
     ULONG QueueNumber = 0;
     ULONG MessageId = 1;
+    int res = 0;
     ULONG OldIrql = 0;
     BOOLEAN result = FALSE;
     bool notify = FALSE;
@@ -194,30 +195,29 @@
     vq = adaptExt->vq[QueueNumber];
     RhelDbgPrint(TRACE_LEVEL_VERBOSE, " QueueNumber 0x%x vq = %p\n", QueueNumber, vq);
-    element = &adaptExt->processing_srbs[QueueNumber];
     VioStorVQLock(DeviceExtension, MessageId, &LockHandle, FALSE);
-    if (virtqueue_add_buf(vq,
-                          &srbExt->sg[0],
-                          srbExt->out, srbExt->in,
-                          &srbExt->vbr, va, pa) >= 0) {
+    res = virtqueue_add_buf(vq,
+                            &srbExt->sg[0],
+                            srbExt->out, srbExt->in,
+                            &srbExt->vbr, va, pa);
+
+    if (res >= 0) {
         notify = virtqueue_kick_prepare(vq);
+        element = &adaptExt->processing_srbs[QueueNumber];
         InsertTailList(&element->srb_list, &srbExt->vbr.list_entry);
         element->srb_cnt++;
-        VioStorVQUnlock(DeviceExtension, MessageId, &LockHandle, FALSE);
 #ifdef DBG
         InterlockedIncrement((LONG volatile*)&adaptExt->inqueue_cnt);
 #endif
         result = TRUE;
-    }
-    else {
-        VioStorVQUnlock(DeviceExtension, MessageId, &LockHandle, FALSE);
+    } else {
         RhelDbgPrint(TRACE_LEVEL_ERROR, " Can not add packet to queue %d.\n", QueueNumber);
         StorPortBusy(DeviceExtension, 2);
     }
-    if (notify) {
+    VioStorVQUnlock(DeviceExtension, MessageId, &LockHandle, FALSE);
+    if (notify){
         virtqueue_notify(vq);
     }
-
     if (adaptExt->num_queues > 1) {
         if (CHECKFLAG(adaptExt->perfFlags, STOR_PERF_OPTIMIZE_FOR_COMPLETION_DURING_STARTIO)) {
             VioStorCompleteRequest(DeviceExtension, MessageId, FALSE);
@vrozenfe
Do you want to add an extra case to ACTION_ON_RESET https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/vioscsi/vioscsi.h#L281C3-L281C17 and handle it accordingly? Thanks,
@vrozenfe
Ah, sorry. I don't know why, but I thought you were asking about the virtio-scsi driver. Best,
We will not complete the pending request, so Windows will not recycle it. If the driver completes the pending request without stopping the device or resetting the virtio queues, Windows will recycle the memory associated with that request. However, the device may still access this memory because the request remains in the virtio queues, potentially leading to file system corruption. If the driver does not complete the pending request, Windows will continue to wait for its completion, resulting in an I/O hang, similar to an I/O hang in Linux. Best,
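To make that ordering constraint concrete, a minimal sketch of the only safe sequence when a request has to be abandoned; every identifier here is a hypothetical stand-in, not viostor code:

```c
/* Minimal sketch: the device must be stopped (or its virtqueues reset)
 * BEFORE the request is completed, because completion lets Windows recycle
 * the request memory while the device could still DMA into it. */
typedef struct DeviceContext DeviceContext;
typedef struct Request Request;

void StopDeviceOrResetQueues(DeviceContext *ctx);              /* stop host-side DMA   */
void CompleteRequestToPort(DeviceContext *ctx, Request *req);  /* return SRB to the OS */

static void AbandonTimedOutRequestSafely(DeviceContext *ctx, Request *req)
{
    /* 1. Make the guest buffers unreachable from the device first. */
    StopDeviceOrResetQueues(ctx);

    /* 2. Only then complete the request; Windows may recycle its memory
     *    the moment it is completed. */
    CompleteRequestToPort(ctx, req);
}
```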
Yes, the class driver will not reuse an abandoned SRB and, IIRC, will be able to allocate a new block if needed (balancing between the minimal and maximal working set of packages). In any case, if you really need to complete SRB_FUNCTION_RESET_* requests as SRB_STATUS_INVALID_REQUEST and ... But as a long-term target, the proper implementation of the multitier reset provided by Storport will definitely be the best solution. Best,
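As a rough illustration of the short-term option mentioned above, a sketch under assumptions rather than the actual viostor dispatch: the legacy SCSI_REQUEST_BLOCK path is assumed, and CompletePendingRequestsOnLun() is a hypothetical stand-in for the LUN-reset logic referenced in #683/#684.

```c
#include <storport.h>

/* Hypothetical helper: drain/abort whatever is still queued for this LUN. */
static VOID CompletePendingRequestsOnLun(PVOID DeviceExtension, PSCSI_REQUEST_BLOCK Srb);

static BOOLEAN HandleResetSrb(PVOID DeviceExtension, PSCSI_REQUEST_BLOCK Srb)
{
    switch (Srb->Function) {
    case SRB_FUNCTION_RESET_LOGICAL_UNIT:
    case SRB_FUNCTION_RESET_DEVICE:
    case SRB_FUNCTION_RESET_BUS:
        /* Clean up what is still sitting in the virtqueues first ... */
        CompletePendingRequestsOnLun(DeviceExtension, Srb);
        /* ... then answer the reset SRB itself; SRB_STATUS_INVALID_REQUEST
         * is the other option discussed above. */
        Srb->SrbStatus = SRB_STATUS_SUCCESS;
        StorPortNotification(RequestComplete, DeviceExtension, Srb);
        return TRUE;
    default:
        return FALSE; /* not a reset SRB; the normal StartIo path handles it */
    }
}
```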
Sorry for taking so long to reply. I do not understand why the patch can fix the problem and improve the performance.
My earlier comment you quoted was from a time when my focus was on spinlocks. The issue in ... FYI, the changes in #1174 were mostly refactoring. There is a minor performance increase due to removing the virtqueue struct ... The issue you reported could be due to spinlocks elsewhere, or perhaps out-of-order DPCs, or maybe even CPU affinity... It's also possible that some reset logic implemented in the last year has resolved this issue. I'm happy to help track it down, but do you think you could first try to reproduce the problem again using mm286/b266, i.e. virtio-win-0.1.266, and/or using drivers built from master, on top of a more recent kernel and QEMU...?
Yeah, I can use the new viostor and QEMU to try to reproduce the current problem. It may take some time.
Describe the bug
I found a virtio_error() in QEMU when handling a virtio block request with a Windows 11 guest.
Two uncompleted blk requests (req1 and reqN) have the same indirect descriptor table address. The length of the indirect descriptor table in req1 is 48, but the length in reqN is 160. So it causes an error when handling req1 in QEMU.
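Purely to put the two lengths in perspective (an illustration, not QEMU code): with the 16-byte split-ring descriptor layout, 48 and 160 bytes describe indirect tables of different sizes, so the same guest address cannot consistently back both requests.

```c
#include <stdint.h>
#include <stdio.h>

/* Split-ring descriptor layout from the virtio spec; sizeof == 16 bytes. */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

int main(void)
{
    /* Indirect table lengths reported above for the two colliding requests. */
    printf("req1: %zu indirect descriptors\n", 48  / sizeof(struct vring_desc)); /* 3  */
    printf("reqN: %zu indirect descriptors\n", 160 / sizeof(struct vring_desc)); /* 10 */
    return 0;
}
```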
The stack trace:
To Reproduce
I use gdb to attach to the QEMU process and debug virtqueue_pop() for the virtio block request step by step. Using the gdb commands 'next/continue/print' during debugging may cause the request to time out in the Windows guest driver.
Expected behavior
It should not cause virtio_error() in QEMU.
Host:
VM:
virtio-win-0.1.229.iso, virtio-win-0.1.215.iso, and viostor compiled from the newest kvm-guest-drivers-windows.
Additional context
I used DbgView to capture the debug log and found the following RESET log before the call to virtqueue_get_buf_split.
In the Windows driver documentation "Handling SRB_FUNCTION_RESET_DEVICE", it says "the port driver requests a device reset when an uncompleted request times out".
https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/handling-srb-function-reset-device