viostor Reset to device, \Device\RaidPort3, was issued. VM is frozen. #756
Comments
@mherlitzius It is quite strange that you can reproduce the problem on WS2016 with build 215, since it was tested and verified by our QE: https://bugzilla.redhat.com/show_bug.cgi?id=2013976 and https://bugzilla.redhat.com/show_bug.cgi?id=2028000 Thanks, |
@vrozenfe Thank you for your timely reply. I will ask for the event log from the WS2016 machine running build 215. |
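For anyone else collecting the same data: one way to export the guest's System event log for sharing is wevtutil from an elevated prompt (the output path below is just an example):

wevtutil epl System C:\Temp\system-events.evtx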
@mherlitzius |
@vrozenfe Please see attached the event log. The error just appeared this morning. |
@mherlitzius Best, |
@vrozenfe
Best, |
Did you get any resolution to this? I have the same issue on Windows Server 2019 / SQL Server 2019 using VirtIO 0.1.229 |
I am also interested to know whether there is any solution, or a planned new release of the virtio drivers. Let me know if I can provide any logs or configuration details that would be useful to narrow down the issue. |
@melfacion @thinkingcap Thank you, |
@vrozenfe : different hosts, different datacenters, different storage pools. They all use Ceph for network storage. The hypervisor is Proxmox (7.3-6 in production currently and 7.4-3 in the test datacenter). I exported the events from all sources for some time before the last crash and some time after. Will send it to you as requested. |
When I get the error, I see the following in the host Proxmox syslog. Once I also got this, with the same impact on the guest. The hypervisor is Proxmox
qm config
Workload is SQL Server / Analysis Server 2019 running ETL jobs and processing cubes. I actually get this on 2 hosts, this one more than the other. |
Since it is SQL Server, it might be useful to try reducing the maximum transfer size by specifying the "PhysicalBreaks" registry key. Vadim. |
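For reference, a hedged sketch of how that key is typically set from an elevated command prompt. The Parameters\Device subkey location is an assumption based on the usual virtio-win layout, so verify it against your installed driver (use viostor instead of vioscsi if the disk is on the virtio-blk driver), the 0x3f value is the one discussed further down the thread, and expect a reboot for it to take effect:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v PhysicalBreaks /t REG_DWORD /d 0x3f /f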
It's protected content, unfortunately. |
@thinkingcap |
@vrozenfe I assume it requires a reboot when changing? |
@thinkingcap |
Hello, I'm also haunted by this problem. Running Proxmox 7.2 in production, one Windows Server 2022 VM is affected with MSSQL 2019 CU21 and virtio drivers 0.1.225. Storage backend for all VMs in the cluster is CEPH 16.2.7 (using 10Gbit SPF+, Jumbo Frames 9000 MTU). Sometimes the VM is running for weeks without issues, sometimes multiple lockups in a single day which require a hard reset. Same symptoms:
@vrozenfe According to your suggestions, I should first try setting these registry values, right? As you can see in the screenshot, it happens exactly every 60 seconds, which means my default IO timeout is 60 seconds, right? So I would adjust it to 90 seconds for the vioscsi driver first?
Does 0x3f make sense for CEPH with 9000 MTU? |
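For what it's worth, the 90-second timeout being discussed would usually be an IoTimeoutValue DWORD under the miniport's Parameters key. This is a sketch assuming the standard StorPort location (0x5a = 90 seconds), with a reboot to apply:

reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters" /v IoTimeoutValue /t REG_DWORD /d 0x5a /f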
@MaxXor I don't recall I've ever seen "kvm: virtio: zero sized buffers are not allowed" before. But since you and @thinkingcap both report it, it is worth looking into. 0x3f makes the maximum transfer size equal to 256K. In my understanding, a 30 sec (default timeout) period should be enough for Jumbo-frame network storage to complete such a transfer, even for write operations, I think. Unfortunately I have very limited knowledge of CEPH, even though we use it as one of the backends for our internal storage testing. Best, |
Thanks for your answer; unfortunately both IoTimeoutValue (0x5a) and PhysicalBreaks (0x3f) didn't help yet. I really don't understand it. Large file transfers, e.g. copying a file, work fine, which should definitely put load on CEPH and cause higher IO latencies: it copies at 500 MB/s with no errors. But if I keep MSSQL running, monitoring the disk IO, it just locks up again under low load (max. 30 MB/s) after 1-2 hours. I'm using the Proxmox QEMU package at version 7.2.0-8, which corresponds to this git commit: https://github.com/proxmox/pve-qemu/tree/93d558c1eef8f3ec76983cbe6848b0dc606ea5f1 It's the latest available for Proxmox 7.x. Another thing I noticed, which might be worth mentioning, is that only one of the vioscsi disks locks up. During the reset events in Windows I can still read/write fine to all other disks, except the one that is locked up. The VM I'm using has 4 vioscsi disks. I will post the QEMU command line later. |
Did you reboot after applying those registry changes? |
Yes. It's not a critical machine. I can reboot it at any time if it helps us figure out the root cause. 🙂 |
What does your VM config look like? Can you post a |
Only scsi1 and scsi2 are affected (I already tried without cache=writeback, but it didn't help). |
I would also change cache to |
Back to square 1 here, just got |
To mitigate this problem, we switched to local storage. Unfortunately, we have not found any other solution either. Thanks for your help anyway, Vadim! |
@thinkingcap I just noticed that the automatic trim operations for SSDs were running at the time of the last lockup. I disabled it for now to see if the problem disappears. |
@mherlitzius I'm already using local SSD-backed storage (from the start). Proxmox 8.0.3 ran without issue for 5 days, but I got the less frequent error today: |
I think #735 is worth a revisit... |
Author of #735 / #736 here - thanks for all of the commentary on this thread. All that change does is make sure that, when you have large IO, it is properly aligned, so if your max transfer size config is 1MB, you're actually getting clean 1MB IOs up and down the stack. I suspect that what ends up happening here is something else down the stack is binding up, and perhaps just making sure it has nicely aligned IO makes it angrier. That said, if you've got a very reliable reproducer, you could git bisect this repo against your reproducer, build some demo drivers, and see if you can find a smoking gun for your setup. |
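A minimal sketch of that bisect workflow, assuming you have a last-known-good build to anchor on (the tag below is a placeholder):

git clone https://github.com/virtio-win/kvm-guest-drivers-windows.git
cd kvm-guest-drivers-windows
git bisect start
git bisect bad HEAD                      # current tip reproduces the lockup
git bisect good <last-known-good-tag>    # placeholder: newest build that did not
# build vioscsi at the commit git checks out, install it in a test VM,
# run the reproducer, then report the result and repeat:
git bisect good    # or: git bisect bad
# once git names the first bad commit:
git bisect reset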
Also, to be clear, not saying that it couldn't be the problem, perhaps it contributes, but this type of lockup has been going on before that commit was even in the repo (see #623). Very well may be in the same class of issues. I'll defer to the guru @vrozenfe though. Happy to help if this ends up being part of the issue. As a data point, we've been running that commit across our fleet for a long time and haven't seen a single customer support case with this signature (albeit in a completely different topology RE qemu/backend storage config). |
I think this is part of the problem with performing root cause analysis for this issue. The lack of fault reproduction in Nutanix and RHEL environments points to other root causes. I didn't see this issue myself running virtio-blk dataplane until I went over 1GiBps throughput. If someone is able to run my reproducer above on a VM hosted on RHEL rather than a Debian-based hypervisor, that would go a long way to determining the scope of the problem. I don't know enough about your "data path plumbing" to know if it would be reproducible in your kit.
I'll probably wait for Vadim to have a look before I cast my very fresh eyes over it. 8^d |
Guys, you are all right. In my understanding, the root of the problem is the maximum size of the IO transfer. Neither Nutanix nor we (RH) see any problem if the IO transfer exceeds the virtual queue size; others might. Cheers, |
Thanks Vadim, that would be great. If I was to hazard a guess, it looks like the train really kicked off with #502 --> #596 --> 2f1a58a ...but don't let me lead you up the garden path...! |
Right, I believe the problem is that we create an SG list longer than the virtual queue size. Best, |
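To make the arithmetic behind the last two comments concrete, here is a standalone C illustration (not driver code); the 4 KiB page size is standard, but the virtqueue size of 128 is an assumed example value:

#include <stdio.h>

/*
 * Illustration of the constraint described above: a request's scatter-gather
 * (SG) list must fit into the virtqueue, so the usable transfer size is
 * roughly segments * page size. The queue size below is an assumption.
 */
int main(void)
{
    unsigned page_size       = 4096;  /* bytes per SG element */
    unsigned queue_size      = 128;   /* virtqueue entries (assumed example) */
    unsigned physical_breaks = 0x3f;  /* value suggested earlier in the thread */

    /* PhysicalBreaks caps the SG list at roughly breaks + 1 data segments. */
    unsigned capped_segments = physical_breaks + 1;
    printf("PhysicalBreaks=0x%x -> ~%u segments -> ~%u KiB per request\n",
           physical_breaks, capped_segments, capped_segments * page_size / 1024);

    /* A 1 MiB request, by contrast, needs more SG entries than this queue holds. */
    unsigned big_io_segments = (1024u * 1024u) / page_size;
    printf("1 MiB request -> %u segments, but the virtqueue here only holds %u\n",
           big_io_segments, queue_size);
    return 0;
}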
So I've had a bit of a peek around the vioscsi code and thought I'd drop a quick post here on what I've found, in the hope that this is more a hint than a distraction. The issue starts after 7dc052d, with the rework on SRB. The issue is apparent following fdf56dd. The nature of the problem, throughput graph observations during fault reproduction, what appear to be resets on the bus, and similar behaviour to issue #907 (and others) - when considered together - suggest to me that SRBs are being processed out of order, potentially with stray retries. It seems stray SRBs might be returning quite late and corrupting the queue. Given SRBs and the initial LUN reset logic were included in that commit, perhaps one is adversely affecting the other...?

So focusing on #684, I built each commit and tested for the fault. Then I tore down the code to see what effect the changes had on the fault condition, starting with the [...]. It seems to me that the fault is coincident with the changes in request completion handling and the removal (or maybe a relocation?) of the spin lock in the SRB implementation. So the main two commits would be f1338bb and fdf56dd, but considered together. The problem here is that the SRB implementation has evolved quite a bit in the three years since then... With thanks - it looks great..! I do suspect it might be something in the [...]. However, I also didn't notice any functions reminiscent of [...]. I was sort of curious where the newer SRB scaffold appeared from, and whether there was a missing functional dependency on VirtIO or something else in the larger tree.

Anyway, I hope the above is of some help. I might have another look in a day or so. Regards, |
Also forgot to mention I checked changes to |
Hi Vadim, Thanks for looking into this again.
Your remark about the queue size prompted me to look into virtio-scsi multiqueue. My current understanding of virtio is not very deep, but AFAICT current QEMU allocates 1 virtqueue per vCPU for virtio-scsi (plus 2 more virtqueues) [0] [1]. I've tried setting
e.g.:
With this setting, the reproducer from [2] has not produced a [...]. This could also be a coincidence, but on the other hand -- could there be some unfortunate interaction in the presence of virtio-scsi multiqueue that may trigger the issue?

[0] https://github.com/qemu/qemu/blob/a733f37aef3b7d1d33bfe2716af88cdfd67ba64e/hw/virtio/virtio-pci.c#L2526

Host: Proxmox VE kernel 6.8.12-1-pve based on Ubuntu kernel 6.8
|
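The exact setting tried above did not survive in this thread, so treat the following purely as an assumption: limiting virtio-scsi to a single request queue on the QEMU command line is normally done with the num_queues property of virtio-scsi-pci, e.g. (the id is arbitrary):

-device virtio-scsi-pci,id=scsihw0,num_queues=1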
I've confirmed @frwbr's results per https://forum.proxmox.com/threads/redhat-virtio-developers-would-like-to-coordinate-with-proxmox-devs-re-vioscsi-reset-to-device-system-unresponsive.139160/post-693772 It does prevent bus resets and both kvm errors, but at a cost to performance. I'm thinking I might need to revisit my checking of the legacy |
I found it in /vioscsi/helper.c. What I have so far. RFC:

--- ./a/vioscsi/helper.c 2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c 2024-08-15 08:41:32.366850208 +1000
@@ -51,6 +51,7 @@
PVOID va = NULL;
ULONGLONG pa = 0;
ULONG QueueNumber = VIRTIO_SCSI_REQUEST_QUEUE_0;
+ BOOLEAN notify = FALSE;
STOR_LOCK_HANDLE LockHandle = { 0 };
ULONG status = STOR_STATUS_SUCCESS;
UCHAR ScsiStatus = SCSISTAT_GOOD;
@@ -108,18 +109,14 @@
srbExt->psgl,
srbExt->out, srbExt->in,
&srbExt->cmd, va, pa);
+ element = &adaptExt->processing_srbs[index];
if (res >= 0) {
- element = &adaptExt->processing_srbs[index];
+ notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]) ? TRUE : notify;
InsertTailList(&element->srb_list, &srbExt->list_entry);
element->srb_cnt++;
- }
- VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
- if ( res >= 0){
- if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
- virtqueue_notify(adaptExt->vq[QueueNumber]);
- }
} else {
+ RemoveTailList(&element->srb_list);
virtqueue_notify(adaptExt->vq[QueueNumber]);
ScsiStatus = SCSISTAT_QUEUE_FULL;
SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
@@ -127,6 +124,17 @@
StorPortBusy(DeviceExtension, 10);
CompleteRequest(DeviceExtension, Srb);
RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
+ notify = TRUE;
+ }
+
+ VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
+
+ if (notify){
+ virtqueue_notify(adaptExt->vq[QueueNumber]);
+ /*if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
+ virtqueue_notify(adaptExt->vq[QueueNumber]);
+ }
+ */
}
EXIT_FN_SRB(); |
@vrozenfe, I refactored the above down to the root cause. I should note that performance is odd: I get 3.5GB/s, then 6.4GB/s, alternating. Not consistently, but in what appear to be timed, alternating slow and fast periods. This survives reboot. Is that normal?

--- ./a/vioscsi/helper.c 2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c 2024-08-15 10:26:16.389005304 +1000
@@ -110,15 +110,14 @@
&srbExt->cmd, va, pa);
if (res >= 0) {
+ virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
element = &adaptExt->processing_srbs[index];
InsertTailList(&element->srb_list, &srbExt->list_entry);
element->srb_cnt++;
}
VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
if ( res >= 0){
- if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
- virtqueue_notify(adaptExt->vq[QueueNumber]);
- }
+ virtqueue_notify(adaptExt->vq[QueueNumber]);
} else {
virtqueue_notify(adaptExt->vq[QueueNumber]);
ScsiStatus = SCSISTAT_QUEUE_FULL; |
It looks like this is due to the multiplexing of additional adapters. With one adapter, I get 6.4GB/s consistently. Is that by design? |
@vrozenfe, I refactored again to eliminate what I think are unnecessary VQ unlocks. Performance is improved again, with 6.4GB/s on the boot+system disk and 9.2-9.5GB/s on additional disks. The adapter multiplexing appears only to affect disks on additional adapters. I have not tried booting on SCSI ID > 0.

--- ./a/vioscsi/helper.c 2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c 2024-08-15 22:59:13.605562341 +1000
@@ -56,6 +56,7 @@
UCHAR ScsiStatus = SCSISTAT_GOOD;
ULONG MessageID;
int res = 0;
+ BOOLEAN notify = FALSE;
PREQUEST_LIST element;
ULONG index;
ENTER_FN_SRB();
@@ -110,17 +111,11 @@
&srbExt->cmd, va, pa);
if (res >= 0) {
+ notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
element = &adaptExt->processing_srbs[index];
InsertTailList(&element->srb_list, &srbExt->list_entry);
element->srb_cnt++;
- }
- VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
- if ( res >= 0){
- if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
- virtqueue_notify(adaptExt->vq[QueueNumber]);
- }
} else {
- virtqueue_notify(adaptExt->vq[QueueNumber]);
ScsiStatus = SCSISTAT_QUEUE_FULL;
SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
SRB_SET_SCSI_STATUS(Srb, ScsiStatus);
@@ -128,7 +123,10 @@
CompleteRequest(DeviceExtension, Srb);
RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
}
-
+ VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
+ if (notify){
+ virtqueue_notify(adaptExt->vq[QueueNumber]);
+ }
EXIT_FN_SRB();
}

I have backported this into viostor too. I will drop that at issue #907. EDIT: Removed unnecessary (hopefully) notify in (res < 0)... |
@frwbr In my opinion the simplest solutions to work around the problem at the moment would be:
Best, |
Thanks a lot. Best, |
Agreed @benyamin-codez thanks for diving into the trenches on this one, let's raise it as a pull request when you finish up your batch of testing and we can comment in there. Please tag me and @sb-ntnx in addition to Vadim, I've got a few thoughts but don't want them to get buried in this very long discussion thread 😃 |
Hi all, Thanks @benyamin-codez for jumping into the code! I don't know much about SCSI or Windows drivers so can't comment on the changes themselves. Happy to test a PR with the finished changes though! If I understand the situation correctly, there does seem to be some bug that can be triggered by excessive IO, e.g. by the 8k+16k reads reproducer. While a large portion of the users affected by this issue are Proxmox VE users, it is not limited to Proxmox VE users, e.g. @GreyWolfBW is using Debian+libvirt. Still, there may be a factor that makes the issue more likely to trigger on Proxmox VE than e.g. on RH (if I understand correctly, Nutanix uses vhost-user-scsi, so that's quite a different setup). As I've done my tests with upstream QEMU stable, I'd rule out our downstream QEMU patches as a factor. Our kernels may be a factor -- next week I can see if I manage to reproduce the issue on a mainline kernel. Also, there may be some peculiarity hiding in my QEMU command line (I've posted them e.g. here) -- I don't see anything suspicious there but I'm obviously biased.
Yes, I wouldn't consider [...]. In a similar vein, I built kvm-guest-drivers-windows today at c87ea56 and enabled debug prints (with this patch). I ran the reproducer (4 cores and no restriction on
In
Interestingly, there are no vioscsi debug outputs at all for the ~65 seconds before that, the last one is:
Thanks for the suggestions!
That one I already tried in #756 (comment); it seems to get rid of the
I think I've looked into those already a couple of weeks ago and didn't see changed behavior when running the reproducer, but I'll try to revisit them.
Best, Friedrich
|
All good. Mostly done with IO and SCSI failure testing. Proxmox doesn't seem to work well with HMP-based [...]. Just working out some optimisation vs WPP tracing issues. It must have been "fun" when @vrozenfe implemented it. The PR should be up in the next day. |
This is definitely the case. I'm having a look at it now... I dropped an update here: https://forum.proxmox.com/threads/redhat-virtio-developers-would-like-to-coordinate-with-proxmox-devs-re-vioscsi-reset-to-device-system-unresponsive.139160/post-695547 |
The case about what? I flipped through the last few posts on the forums; can you elaborate? |
There is an underlying problem that existed before the v208 release. I would say well before. As you put it:
Per my comments in the Proxmox post and in the draft PR, I'm focusing on that issue at the moment. |
Also Vadim's comments:
and
...and that may very well be the case, but I'm also reviewing the NOTIFY implementation in [virtio]. |
PR #1150 has been raised to resolve this issue in [vioscsi]. |
PR #1174 has been raised to resolve this issue in [viostor]. |
Unfortunately we are experiencing the same issue as described in #623.
Error message "viostor Reset to device, \Device\RaidPort3, was issued." appears in the log and the VM is frozen / unresponsive. Only cold reboot helps. This happened already several times.
Environment:
Affected VMs:
Windows server version: Windows Server 2016
Windows server version: Windows Server 2016
Windows server version: Windows Server 2012 R2
It seems primarily servers running MS SQL Server are affected.
The error message means that the storage does not respond for 30 seconds. In Ceph we see nothing close to that; the latency of the affected volume is good. Disk performance and throughput on the Ceph SSD storage pool are excellent when measured with CrystalDiskMark or the SQL Server-internal benchmark.
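If it helps correlate timings, the guest-side disk timeout can be checked in the registry. A hedged example follows; the value may simply be absent when the OS default applies:

reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue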
It seems the issue may occur at arbitrary times, i.e. it is not related to the overall load on the storage cluster. The VM often runs for a few minutes after the first "viostor Reset to device" entries appear in the log. The first application-level problems can appear even if the entire VM is not yet frozen.
On a volume on which this error has occurred once, we can reproduce it as follows:
When we cannot reproduce this issue:
This means that apparently "only" this combination of volume and VM causes the problem.
We have been able to resolve this issue, at least for some time, by copying the data from the affected volume to a new volume (while attached to another VM) and then attaching the newly created volume to the initial VM.
It works for a while before the error reoccurs.
It also seems that after the problem has occurred for the very first time on a volume, it then reoccurs more often (it is not always possible to immediately replace the affected volume, as these are production servers).
Do you have any idea what could be causing this problem and how we can work around it? Thank you very much.