
viostor Reset to device, \Device\RaidPort3, was issued. VM is frozen. #756

Closed
mherlitzius opened this issue Apr 5, 2022 · 160 comments
Labels: Bug

@mherlitzius

mherlitzius commented Apr 5, 2022

Unfortunately, we are experiencing the same issue as described in #623.

The error message "viostor Reset to device, \Device\RaidPort3, was issued." appears in the log and the VM is frozen / unresponsive. Only a cold reboot helps. This has already happened several times.
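One way to pull just these entries out of the System event log for sharing is with wevtutil (a sketch only; the provider name - viostor here - and the event ID, typically 129 for storport reset events, should be adjusted to whatever Event Viewer actually shows):

  wevtutil qe System /q:"*[System[Provider[@Name='viostor'] and (EventID=129)]]" /f:text /c:50 /rd:true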

Environment:

  • KVM / libvirt 3.0.0-4+deb9u5
  • Storage: Ceph Nautilus 14.2.21, SSD pool
  • Problem occurs on both Intel (Xeon E5-2643) and AMD (EPYC 7F72) compute nodes

Affected VMs:
Windows server version: Windows Server 2016

  • Virtio driver version: build 215

Windows server version: Windows Server 2016

  • Virtio driver version: 100.77.104.17100

Windows server version: Windows Server 2012 R2

  • Virtio driver version: 62.74.104.14100

It seems primarily servers running MS SQL Server are affected.

The error message means that the storage did not respond for 30 seconds. In Ceph we see nothing close to that; the latency of the affected volume is good. Disk performance and throughput on the Ceph SSD storage pool are excellent when measured with CrystalDiskMark or the SQL Server-internal benchmark.

It seems the issue may occur at arbitrary times, i.e. it is not related to the overall load on the storage cluster. The VM often runs for a few minutes after the first "viostor Reset to device" entries appear in the log. The first application-level problems can appear even if the entire VM is not yet frozen.

On a volume on which this error occurred once, we can reproduce it as follows:

  • run CrystalDiskMark on that volume (or just copy files from that volume), while the volume is attached to the original server. The error appears in the log and the VM freezes within seconds.

When we cannot reproduce this issue:

  • If the affected volume is attached to another VM, we cannot reproduce this problem. Running the same operation (e.g. CrystalDiskMark or copying files) on the same volume while it is attached to another VM does not trigger the error.
  • Running the same operation on a (newly created) volume on the initial VM does not reproduce the problem.

This means that apparently "only" this combination of volume and VM causes the problem.

We have been able to resolve this issue, at least for some time, by copying the data from the affected volume to a new volume (while attached to another VM) and then attaching the newly created volume to the initial VM.

It works for a while before the error reoccurs.

It also seems that after the problem has occurred for the very first time on a volume, it then reoccurs more often (it is not always possible to immediately replace the affected volume, as these are production servers).

Do you have any idea what could be causing this problem and how we can work around it? Thank you very much.

@vrozenfe
Collaborator

vrozenfe commented Apr 6, 2022

@mherlitzius
The problem is supposed to be fixed for both vioscsi and viostor with commits #684 and #683 respectively.

It is quite strange that you can reproduce the problem on WS2016 with build 215, since it was tested and verified by our QE: https://bugzilla.redhat.com/show_bug.cgi?id=2013976 and https://bugzilla.redhat.com/show_bug.cgi?id=2028000
Would it be possible to share the system event log file from the WS2016 system?

Thanks,
Vadim.

@mherlitzius
Author

@vrozenfe Thank you for your timely reply. I will ask for the event log from the WS2016 machine running build 215.
If this is fixed with build 215, is it enough to update the driver or should the drives or the whole VM be rebuilt?

@vrozenfe
Collaborator

vrozenfe commented Apr 6, 2022

@mherlitzius
Update should be enough. However, you might be asked to reboot the VM after that.

@mherlitzius
Author

@vrozenfe Please see attached the event log. The error just appeared this morning.
srv-t-jdbib1.zip

@vrozenfe
Collaborator

vrozenfe commented Apr 6, 2022

@mherlitzius
Thanks a lot.
By any chance, did you change either of the two parameters "TimeoutValue" or "IoTimeoutValue" mentioned in #623?
I'm asking because the reset event is coming every 4 minutes instead of the default 30 seconds. Another interesting thing is the bunch of Event ID 153 entries. Can you please export the system event log file in .evtx format instead of .csv? That will make it possible to check and analyse the SRB status returned by Ceph.
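For reference, the values and the export can be checked/produced from an elevated prompt inside the guest roughly like this (a sketch; the viostor service applies to the virtio-blk driver, use vioscsi for virtio-scsi, and the target folder for the .evtx file must already exist):

  reg query "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeoutValue
  reg query "HKLM\SYSTEM\CurrentControlSet\Services\viostor\Parameters" /v IoTimeoutValue
  wevtutil epl System C:\Temp\system.evtx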

Best,
Vadim.

@mherlitzius
Author

@vrozenfe
There are no entries for the Storport Miniport registry keys.
There is an entry for SCSI miniport:

(screenshot: image003)
Can I provide you the event log via another channel, as it contains customer-specific information? Thanks.

Best,
Matthias

@thinkingcap

Did you get any resolution to this? I have the same issue on Windows Server 2019 / SQL Server 2019 using VirtIO 0.1.229.

@melfacion

I am also interested to know whether there was any solution, or a planned new release of the virtio drivers.
We have approx. 90 Windows VMs, and this issue only happens to 3 of them for some reason (2 of them are Windows Server 2019, and the last one is Windows Server 2022).

Let me know if I can provide any logs or configuration that can be useful to narrow down the issue.
The freeze and error happen sporadically: sometimes several times a week, and sometimes it can go weeks between occurrences. But only on these 3 VMs, for some reason.

@vrozenfe
Collaborator

@melfacion @thinkingcap
build 229 is completely up to date.
If you didn't adjust "IoTimeoutValue" (https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/registry-entries-for-storport-miniport-drivers), please do so, increasing it gradually (60/120/180) and checking whether it makes any difference. Please note that the VM needs to be restarted every time this parameter is changed.
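For example, the first step (60 seconds) for the vioscsi driver would look roughly like this from an elevated prompt inside the guest (a sketch; use the viostor service instead for virtio-blk disks, and reboot the VM afterwards):

  reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters" /v IoTimeoutValue /t REG_DWORD /d 60 /f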
Those three VMs - are they running on the same or different hosts? Do they all use local or network storage? It would be nice if you could share the Windows event log files with me at vrozenfe_at_redhat_dot_com.

Thank you,
Vadim.

@melfacion

Those three VMs - are they running on the same or different hosts? Do they all use local or network storage? It would be nice if you could share the Windows event log files with me at vrozenfe_at_redhat_dot_com.

Thank you, Vadim.

@vrozenfe : different hosts, different datacenters, different storage pools. They all use Ceph for network storage. The hypervisor is Proxmox (ver. 7.3-6 in production currently and 7.4-3 in the test datacenter).
(2 of the VMs are in our production datacenter and are running customer workloads: one is a Qlik Sense application, the other is a WSUS update server. The third one, in a different datacenter, is one of our internal VMs used as a jump server / Windows Terminal Services host for accessing our customers.) So the WSUS update server and our internal server can easily be restarted for troubleshooting whenever needed. The last one is a customer-facing solution and should be up and running at all times.

I exported the events from all sources for some time before the last crash and some time after. Will send it to you as requested.

@thinkingcap

thinkingcap commented Jun 20, 2023

When I get the error "Reset to device, \Device\RaidPort2, was issued.", I get the following in the host Proxmox syslog:

QEMU[71107]: kvm: virtio: zero sized buffers are not allowed

Once I also got the following, with the same impact on the guest:
QEMU[2198]: kvm: Desc next is 3

Hypervisor is Proxmox

proxmox-ve: 7.4-1 (running kernel: 6.2.11-2-pve)
pve-manager: 7.4-13 (running version: 7.4-13/46c37d9c)
pve-kernel-6.2: 7.4-3
pve-kernel-5.15: 7.4-3
pve-kernel-6.1: 7.3-6
pve-kernel-6.2.11-2-pve: 6.2.11-2
pve-kernel-6.1.15-1-pve: 6.1.15-1
pve-kernel-5.15.107-2-pve: 5.15.107-2
pve-kernel-5.15.74-1-pve: 5.15.74-1
ceph-fuse: 15.2.17-pve1
corosync: 3.1.7-pve1
criu: 3.15-1+pve-1
glusterfs-client: 9.2-1
ifupdown2: 3.1.0-1+pmx4
ksm-control-daemon: 1.4-1
libjs-extjs: 7.0.0-1
libknet1: 1.24-pve2
libproxmox-acme-perl: 1.4.4
libproxmox-backup-qemu0: 1.3.1-1
libproxmox-rs-perl: 0.2.1
libpve-access-control: 7.4.1
libpve-apiclient-perl: 3.2-1
libpve-common-perl: 7.4-2
libpve-guest-common-perl: 4.2-4
libpve-http-server-perl: 4.2-3
libpve-rs-perl: 0.7.7
libpve-storage-perl: 7.4-3
libspice-server1: 0.14.3-2.1
lvm2: 2.03.11-2.1
lxc-pve: 5.0.2-2
lxcfs: 5.0.3-pve1
novnc-pve: 1.4.0-1
proxmox-backup-client: 2.4.2-1
proxmox-backup-file-restore: 2.4.2-1
proxmox-kernel-helper: 7.4-1
proxmox-mail-forward: 0.1.1-1
proxmox-mini-journalreader: 1.3-1
proxmox-widget-toolkit: 3.7.3
pve-cluster: 7.3-3
pve-container: 4.4-4
pve-docs: 7.4-2
pve-edk2-firmware: 3.20230228-4~bpo11+1
pve-firewall: 4.3-4
pve-firmware: 3.6-5
pve-ha-manager: 3.6.1
pve-i18n: 2.12-1
pve-qemu-kvm: 7.2.0-8
pve-xtermjs: 4.16.0-2
qemu-server: 7.4-3
smartmontools: 7.2-pve3
spiceterm: 3.2-2
swtpm: 0.8.0~bpo11+3
vncterm: 1.7-1
zfsutils-linux: 2.1.11-pve

qm config

agent: 1
bios: ovmf
boot: order=scsi0;net0;ide0
cores: 12
cpu: Skylake-Server
description: #### Windows Server 2016%0A#### SQL Server 2019
efidisk0: SSD-R10:vm-201-disk-0,efitype=4m,format=raw,pre-enrolled-keys=1,size=528K
ide0: none,media=cdrom
machine: pc-q35-7.1
memory: 61440
meta: creation-qemu=7.1.0,ctime=1676181242
name: IRIS
net0: virtio=86:DB:0D:46:C1:58,bridge=vmbr0,firewall=1,tag=25
numa: 1
onboot: 1
ostype: win10
scsi0: SSD-R10:vm-201-disk-1,discard=on,format=raw,iothread=1,size=96G
scsi1: SSD-R10:vm-201-disk-2,discard=on,iothread=1,size=320G
scsi2: SSD-R10:vm-201-disk-3,discard=on,iothread=1,size=160G
scsi3: SSD-R10:vm-201-disk-4,discard=on,iothread=1,size=64G
scsi4: SSD-R10:vm-201-disk-5,discard=on,iothread=1,size=448G
scsihw: virtio-scsi-single
smbios1: uuid=973a474f-45aa-4caa-ab0e-bace1c0aa76e
sockets: 1
tags: ims;windows
vmgenid: 62178394-12fe-4df7-ae25-839471658f30

Workload is SQL Server / Analysis Server 2019 running ETL jobs and processing cubes.
I'll take a look at IoTimeoutValue - currently the reset message comes through every 60 seconds once I hit the problem.

I actually get this on 2 hosts, this one more than the other.
Both hosts are Dell servers, PERC hardware RAID 10, local storage using Intel DC SATA SSDs.

@vrozenfe
Collaborator

Since it is SQL Server, it might be useful to try reducing the maximum transfer size by specifying the "PhysicalBreaks" key in the Registry:
https://access.redhat.com/solutions/6963779
This key works for the virtio-scsi device only.
(If I'm not mistaken, WSUS can also use the SQL engine.)

Vadim.

@thinkingcap

It's protected content, unfortunately.

@vrozenfe
Collaborator

@thinkingcap
No problem. There is some open information in the driver code itself:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/vioscsi/vioscsi.c#L485
The "PhysicalBreaks" Registry key allows redefining the maximum transfer size supported by the vioscsi miniport driver. Without this key the default value is 512, which makes the maximum transfer size equal to 2MB:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/vioscsi/vioscsi.c#L501
IIRC on SQL Server the maximum transfer size can go up to 4MB, which means that SQL Server can generate 1MB/2MB loads. 2MB block transfers work really well for direct-LUN SSD backends but can be suboptimal for some old or network-connected storage. I would suggest setting "PhysicalBreaks" to 0x1f or 0x3f to see if it helps to fix the problem.
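As an illustration, setting it to 0x3f from an elevated prompt in the guest could look like the following (a sketch; the key path matches what is listed later in this thread and should be double-checked against the linked article, and the disk needs a disable/enable cycle, or the VM a reboot, to pick the value up):

  reg add "HKLM\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device" /v PhysicalBreaks /t REG_DWORD /d 63 /f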

@thinkingcap

@vrozenfe I assume it requires a reboot when changing?
Storage is local SATA SSD (Intel/Solidgm S4620) in RAID10 (Dell PERC)

@vrozenfe
Collaborator

@thinkingcap
The system disk requires a reboot; for a data disk, a disable/enable cycle of the device is enough.

@MaxXor

MaxXor commented Jun 26, 2023

Hello, I'm also haunted by this problem. Running Proxmox 7.2 in production, one Windows Server 2022 VM is affected, with MSSQL 2019 CU21 and virtio drivers 0.1.225. The storage backend for all VMs in the cluster is Ceph 16.2.7 (using 10Gbit SFP+, jumbo frames, 9000 MTU). Sometimes the VM runs for weeks without issues, sometimes there are multiple lockups in a single day, which require a hard reset.

Same symptoms:

  • Windows Event Viewer shows multiple "Reset to device, \Device\RaidPort2, was issued." warnings, one each minute
  • After a few entries of the previous error, the Proxmox host shows: kvm: virtio: zero sized buffers are not allowed
  • Then the VM is locked up and requires a hard reset

(screenshot: virtioscsi)

@vrozenfe According to your suggestions, I should first try setting these registry values, right? As you can see in the screenshot, it happens exactly every 60 secs, which means my default IO timeout is 60 secs, right? So I would adjust it to 90 secs for the vioscsi driver first?

  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters

    • IoTimeoutValue = 0x5a (90)
  • HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\vioscsi\Parameters\Device

    • PhysicalBreaks = 0x3f (63)

Does 0x3f make sense for CEPH with 9000 MTU?

@vrozenfe
Collaborator

@MaxXor
I would try to reduce the transfer size ("PhysicalBreaks") first to see if it helps to solve the problem.
Btw, there is a different option to adjust the transfer size without touching the Windows Registry.
Upstream and RHEL QEMU have a "max_sectors" parameter for that.
"-device virtio-scsi-pci,id=scsi-vioscsi1,max_sectors=63" will have the same effect as "PhysicalBreaks = 0x3f (63)".
Unfortunately, I don't know whether it works for Proxmox or not. You should probably clarify that with the Proxmox support team.

I don't recall ever seeing "kvm: virtio: zero sized buffers are not allowed" before. But since you and @thinkingcap both mentioned this message, can you guys by any chance post the QEMU version and the QEMU command line?

0x3f makes the maximum transfer size equal to 256K. In my understanding, a 30 sec (default timeout) period should be enough for jumbo-frame network storage to complete such a transfer, even for write operations, I think. Unfortunately I have very limited knowledge of Ceph, even though we use it as one of the backends for our internal storage testing.
Ideally, I would run some storage performance tests inside the VM to compare latency and standard deviation for different transfer sizes (4K/16K/64K/256K).
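For example, such a comparison can be run inside the guest with Microsoft's diskspd (a sketch only; the drive letter, test file size, duration and queue depth below are arbitrary - repeat with -b4K, -b16K, -b64K and -b256K and compare the reported AvgLat and LatStdDev columns):

  diskspd.exe -b256K -d60 -t4 -o8 -r -w0 -L -c4G D:\iotest.dat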

Best,
Vadim.

@MaxXor

MaxXor commented Jun 27, 2023

Thanks for your answer. Unfortunately, neither IoTimeoutValue (0x5a) nor PhysicalBreaks (0x3f) has helped yet. I really don't understand it. Large file transfers, e.g. copying a file, work fine, which should definitely put load on Ceph and cause higher IO latencies. It copies at 500 MB/s, no errors. But if I keep MSSQL running, monitoring the disk IO, it just locks up under low load (max. 30 MB/s) again after 1-2 hours.

I'm using the Proxmox QEMU package at version 7.2.0-8, which relates to this git commit: https://github.com/proxmox/pve-qemu/tree/93d558c1eef8f3ec76983cbe6848b0dc606ea5f1 It's the latest available for Proxmox 7.x.

Another thing I noticed, which might be worth mentioning, is that only one of the vioscsi disks locks up. During the reset events in Windows I can still read/write fine to all other disks, except the one that is locked up. The VM I'm using has 4 vioscsi disks.

I will post the QEMU command line later.

@thinkingcap

Did you reboot after applying those registry changes?

@MaxXor

MaxXor commented Jun 27, 2023

Yes. It's not a critical machine. I can reboot it at any time if it helps us figure out the root cause. 🙂

@thinkingcap

What does your VM config look like? Can you post a qm config <VMID>?

@MaxXor

MaxXor commented Jun 27, 2023

agent: 1
balloon: 0
boot: order=scsi0
cores: 6
cpu: host
machine: pc-q35-6.2
memory: 32768
meta: creation-qemu=6.2.0,ctime=1657105335
net0: virtio=76:80:CE:B7:21:69,bridge=vmbr4
numa: 0
onboot: 1
ostype: win11
scsi0: ceph_storage:vm-104-disk-0,iothread=1,discard=on,size=100G
scsi1: ceph_storage:vm-104-disk-1,iothread=1,cache=writeback,discard=on,size=500G
scsi2: ceph_storage:vm-104-disk-2,iothread=1,cache=writeback,discard=on,size=500G
scsi3: ceph_storage:vm-104-disk-3,iothread=1,discard=on,size=500G
scsihw: virtio-scsi-single
smbios1: uuid=5839521f-0c23-431c-9806-c7ee8aab6104
sockets: 1
vmgenid: 1c6d5ca3-862b-49c5-848d-453cbe070164

Only scsi1 and scsi2 are affected (I already tried without cache=writeback, but it didn't help).

@thinkingcap

I would also change the cache mode to Default (no cache).
I'm also not familiar with Ceph, so I'm unsure of the performance impact.

@thinkingcap

Back to square 1 here, just got kvm: virtio: zero sized buffers are not allowed again

@mherlitzius
Author

To mitigate this problem, we switched to local storage. Unfortunately, we have not found any other solution either. Thanks for your help anyway, Vadim!

@MaxXor

MaxXor commented Jun 30, 2023

@thinkingcap I just noticed that the automatic TRIM operations for SSDs were running at the time of the last lockup. I disabled them for now to see if the problem disappears.
(screenshot)

@thinkingcap

@mherlitzius I'm already using local SSD-backed storage (from the start).
@MaxXor Our issues occur during times of load when running SQL / AS jobs.

Proxmox 8.0.3 ran without issue for 5 days, but I got the less frequent error today:
QEMU[4927]: kvm: Desc next is 3
The error in the VM is the same.
So a different error message on the host, but the result is the same.

@benyamin-codez
Contributor

I think #735 is worth a revisit...

@JonKohler
Contributor

Author of #735 / #736 here - thanks for all of the commentary on this thread. All that change does is ensure that when you have large IO, it is properly aligned, so if your max transfer size config is 1MB, you're actually getting 1MB IOs cleanly up and down the stack.

I suspect that what ends up happening here is something else down the stack is binding up, and perhaps just making sure it has nicely aligned IO makes it angrier.

That said, if you've got a very reliable reproducer, you could git bisect this repro, build some demo drivers, and see if you can find a smoking gun for your setup.
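In case anyone attempts that, the bisect loop would look roughly like this (a sketch; the known-good ref and the driver build/install steps depend on your environment):

  git bisect start
  git bisect bad HEAD                # a build known to show the resets
  git bisect good <known-good-ref>   # e.g. the last driver release without the issue
  # at each step: build vioscsi/viostor, install it in the test VM, run the reproducer,
  # then mark the commit with "git bisect good" or "git bisect bad"
  git bisect reset                   # when finished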

@JonKohler
Contributor

Also, to be clear, I'm not saying that it couldn't be the problem (perhaps it contributes), but this type of lockup has been going on since before that commit was even in the repo (see #623). It very well may be in the same class of issues.

I'll defer to the guru @vrozenfe though. Happy to help if this ends up being part of the issue.

As a data point, we've been running that commit across our fleet for a long time and haven't seen a single customer support case with this signature (albeit in a completely different topology RE qemu/backend storage config).

@benyamin-codez
Contributor

As a data point, we've been running that commit across our fleet for a long time and haven't seen a single customer support case with this signature (albeit in a completely different topology RE qemu/backend storage config).

I think this is part of the problem with performing root cause analysis for this issue. The lack of fault reproduction in Nutanix and RHEL environments points to other root causes. I didn't see this issue myself running virtio-blk dataplane until I went over 1GiBps throughput. If someone is able to run my reproducer above on a VM hosted on RHEL rather than a Debian-based hypervisor, that would go a long way to determining the scope of the problem. I don't know enough about your "data path plumbing" to know if it would be reproducible in your kit.

That said, if you've got a very reliable reproducer, you could git bisect this repro, build some demo drivers, and see if you can find a smoking gun for your setup.

I'll probably wait for Vadim to have a look before I cast my very fresh eyes over it. 8^d

@vrozenfe
Collaborator

Guys, you are all right. In my understanding the root of the problem is in the maximum size of the IO transfer. Luckily for Nutanix and us (RH), we don't see any problem if the IO transfer exceeds the virtual queue size; others might have a problem with that. I will try to address this issue very soon. I just need to fix some critical stuff related to the virtio-win installer first.

Cheers,
Vadim.

@benyamin-codez
Contributor

Guys, you are all right. In my understanding the root of the problem is in the maximum size of the IO transfer. Luckily for Nutanix and us (RH), we don't see any problem if the IO transfer exceeds the virtual queue size; others might have a problem with that. I will try to address this issue very soon. I just need to fix some critical stuff related to the virtio-win installer first.

Cheers, Vadim.

Thanks Vadim, that would be great.

If I was to hazard a guess, it looks like the train really kicked off with #502 --> #596 --> 2f1a58a

...but don't let me lead you up the garden path...!

@vrozenfe
Collaborator

@benyamin-codez

Right, I believe that the problem is that we create an SG list longer than the virtual queue size.
Technically, we probably need to reduce the size or split a large DMA transfer into a set of smaller chunks.

Best,
Vadim.

@benyamin-codez
Contributor

@vrozenfe

So I've had a bit of a peek around the vioscsi code and thought I'd drop a quick post here as to what I've found in the hope that this is more a hint than a distraction.

The issue starts after 7dc052d, with the rework on SRB. The issue is apparent following fdf56dd.

The nature of the problem, throughput graph observations during fault reproduction, what appear to be resets on the bus and similar behaviour to issue #907 (and others) - when considered together - suggest to me that SRBs are being processed out of order, potentially with stray retries. It seems stray SRBs might be returning quite late and corrupting the queue. Given SRBs and the initial LUN reset logic were included in that commit, perhaps one is adversely affecting the other...?

So focusing on #684, I built each commit and tested for the fault. Then I tore down the code to see what effect the changes had on the fault condition. I started with the max_physical_breaks / NumberOfPhysicalBreaks / MaximumTransferLength artefacts to little (but observable) effect on the fault. I also reintroduced the USE_CPU_TO_VQ_MAP code in 6090638 to little effect, but note I didn't try reintroducing the USE_WORK_ITEM code in 28f16b0. I'm reasonably confident the root cause isn't in any of these elements that I checked.

It seems to me that the fault is coincident with the changes in request completion handling and the removal (or maybe a relocation?) of the spin lock in the SRB implementation. So the main two commits would be f1338bb and fdf56dd, but considered together.

The problem here is the SRB implementation has evolved quite a bit in the three years since then... With thanks - it looks great..!

I do suspect it might be something in the VioScsiVQLock and VioScsiVQUnlock implementation.

However, I also didn't notice any functions reminiscent of HwScsiDmaStarted or HwScsiStartIo, so I wasn't sure if SRB is adequately implemented. Is it possible there's a mash up of PIO and SRB modes...? As there is no guarantee of sync between DMA transfers and the SRB data buffer while the SRB is being processed, the data buffer shouldn't be touched until the DMA transfer has completed and the channel has been flushed. I didn't really see code for these mechanisms but it's become a bit of a version mish-mash for me looking up and down the tree... 8^d

I was sort of curious where the newer SRB scaffold appeared from, and whether there was a missing functional dependency on VirtIO or something else in the larger tree.

Anyway, I hope the above is of some help. I might have another look in a day or so.

Regards,
Ben

@benyamin-codez
Contributor

Also forgot to mention I checked changes to MaxNumberOfIO, with no effect.

@frwbr

frwbr commented Aug 14, 2024

Hi Vadim,

Thanks for looking into this again.

Right, I believe that the problem is that we create an SG list longer than the virtual queue size. Technically, we probably need to reduce the size or split a large DMA transfer into a set of smaller chunks.

Your remark about the queue size prompted me to look into virtio-scsi multiqueue. My current understanding of virtio is not very deep, but AFAICT current QEMU allocates 1 virtqueue per vCPU for virtio-scsi (plus 2 more virtqueues) [0] [1].

I've tried setting num_queues=1 to force QEMU to only allocate 1+2 virtqueues. On the running VM, I checked the number of virtqueues (num_vqs) via the QEMU monitor with a command like:

info virtio
info virtio-status /machine/peripheral/virtioscsi0/virtio-backend

e.g.:

/machine/peripheral/virtioscsi0/virtio-backend:
[...]
  num_vqs:                 3
  queue_sel:               2
  isr:                     1
  endianness:              little
[...]

With this setting, the reproducer from [2] has not produced a vioscsi Reset to device, \Device\Raidport1, was issued warning so far (see below for QEMU command line).

This could also be a coincidence, but on the other hand -- could there be some unfortunate interaction in presence of virtio-scsi multiqueue that may trigger the issue?

[0] https://github.com/qemu/qemu/blob/a733f37aef3b7d1d33bfe2716af88cdfd67ba64e/hw/virtio/virtio-pci.c#L2526
[1] https://github.com/qemu/qemu/blob/a733f37aef3b7d1d33bfe2716af88cdfd67ba64e/hw/scsi/virtio-scsi.c#L1232
[2] #756 (comment)


Host: Proxmox VE kernel 6.8.12-1-pve based on Ubuntu kernel 6.8
Guest: Windows 10 with virtio-win 0.1.262
Upstream QEMU 9.0.1 command line (the boot disk is the SATA disk, I run the reproducer on the VirtIO SCSI disk) -- mind the num_queues=1:

[...]/qemu-stable-9.0.1/qemu-system-x86_64 \
  -accel kvm \
  -name 'win10-vioscsi-versions,debug-threads=on' \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/151.qmp,server=on,wait=off' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/151.pid \
  -smbios 'type=1,uuid=32b5c31e-ec75-4d62-9d0b-a756f876e943' \
  -smp '4,sockets=1,cores=4,maxcpus=4' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc 'unix:/var/run/qemu-server/151.vnc,password=on' \
  -cpu 'qemu64,+aes,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+pni,+popcnt,+sse4.1,+sse4.2,+ssse3' \
  -m 8192 \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
  -device 'vmgenid,guid=d033521b-0401-46b6-975b-f8b4c6d1a4a6' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2,edid=off' \
  -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,num_queues=1' \
  -drive 'file=/dev/pve/vm-151-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/dev/pve/vm-151-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' \
  -netdev 'type=tap,id=net0,ifname=tap151i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' \
  -device 'e1000,mac=BA:FF:FF:84:10:E7,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'hpet=off,type=pc-i440fx-9.0' \
  -global 'kvm-pit.lost_tick_policy=discard'

@benyamin-codez
Contributor

I've confirmed @frwbr's results per https://forum.proxmox.com/threads/redhat-virtio-developers-would-like-to-coordinate-with-proxmox-devs-re-vioscsi-reset-to-device-system-unresponsive.139160/post-693772

It does prevent bus resets and both kvm errors, but at a cost to performance.

I'm thinking I might need to revisit my checking of the legacy USE_CPU_TO_VQ_MAP...

@benyamin-codez
Contributor

@vrozenfe,

I found it in /vioscsi/helper.c

What I have so far. RFC

--- ./a/vioscsi/helper.c        2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c        2024-08-15 08:41:32.366850208 +1000
@@ -51,6 +51,7 @@
     PVOID               va = NULL;
     ULONGLONG           pa = 0;
     ULONG               QueueNumber = VIRTIO_SCSI_REQUEST_QUEUE_0;
+    BOOLEAN             notify = FALSE;
     STOR_LOCK_HANDLE    LockHandle = { 0 };
     ULONG               status = STOR_STATUS_SUCCESS;
     UCHAR               ScsiStatus = SCSISTAT_GOOD;
@@ -108,18 +109,14 @@
         srbExt->psgl,
         srbExt->out, srbExt->in,
         &srbExt->cmd, va, pa);
+    element = &adaptExt->processing_srbs[index];
 
     if (res >= 0) {
-        element = &adaptExt->processing_srbs[index];
+        notify  = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]) ? TRUE : notify;
         InsertTailList(&element->srb_list, &srbExt->list_entry);
         element->srb_cnt++;
-    }
-    VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
-    if ( res >= 0){
-        if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
-            virtqueue_notify(adaptExt->vq[QueueNumber]);
-        }
     } else {
+        RemoveTailList(&element->srb_list);
         virtqueue_notify(adaptExt->vq[QueueNumber]);
         ScsiStatus = SCSISTAT_QUEUE_FULL;
         SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
@@ -127,6 +124,17 @@
         StorPortBusy(DeviceExtension, 10);
         CompleteRequest(DeviceExtension, Srb);
         RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
+        notify = TRUE;
+    }
+    
+    VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
+    
+    if (notify){
+        virtqueue_notify(adaptExt->vq[QueueNumber]);
+        /*if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
+            virtqueue_notify(adaptExt->vq[QueueNumber]);
+        }
+        */
     }
 
 EXIT_FN_SRB();

@benyamin-codez
Contributor

benyamin-codez commented Aug 15, 2024

@vrozenfe, I refactored the above to the root cause.

I should note that performance is odd. I get performance of 3.5GB/s, then 6.4GB/s alternating. Not consistently, but what appears to be according to a timer. Alternating slow and fast periods. This survives reboot. Is that normal?

--- ./a/vioscsi/helper.c        2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c        2024-08-15 10:26:16.389005304 +1000
@@ -110,15 +110,14 @@
         &srbExt->cmd, va, pa);
 
     if (res >= 0) {
+        virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
         element = &adaptExt->processing_srbs[index];
         InsertTailList(&element->srb_list, &srbExt->list_entry);
         element->srb_cnt++;
     }
     VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
     if ( res >= 0){
-        if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
-            virtqueue_notify(adaptExt->vq[QueueNumber]);
-        }
+        virtqueue_notify(adaptExt->vq[QueueNumber]);
     } else {
         virtqueue_notify(adaptExt->vq[QueueNumber]);
         ScsiStatus = SCSISTAT_QUEUE_FULL;

@benyamin-codez
Contributor

I should note that performance is odd. I get performance of 3.5GB/s, then 6.4GB/s alternating. Not consistently, but what appears to be according to a timer. Alternating slow and fast periods. This survives reboot. Is that normal?

It looks like this is multiplexing of additional adapters. With one adapter, I get 6.4GB/s consistently. Is that by design?

@benyamin-codez
Contributor

benyamin-codez commented Aug 15, 2024

@vrozenfe, I refactored again to eliminate what I think are unnecessary VQ unlocks.

Performance is improved again, with 6.4GB/s on the boot+system disk and 9.2-9.5GB/s on additional disks. The adapter multiplexing appears only to affect disks on additional adapters. I have not tried booting on SCSI ID > 0.

--- ./a/vioscsi/helper.c        2024-08-06 05:42:24.000000000 +1000
+++ ./b/vioscsi/helper.c        2024-08-15 22:59:13.605562341 +1000
@@ -56,6 +56,7 @@
     UCHAR               ScsiStatus = SCSISTAT_GOOD;
     ULONG MessageID;
     int res = 0;
+    BOOLEAN             notify = FALSE;
     PREQUEST_LIST       element;
     ULONG               index;
 ENTER_FN_SRB();
@@ -110,17 +111,11 @@
         &srbExt->cmd, va, pa);
 
     if (res >= 0) {
+        notify = virtqueue_kick_prepare(adaptExt->vq[QueueNumber]);
         element = &adaptExt->processing_srbs[index];
         InsertTailList(&element->srb_list, &srbExt->list_entry);
         element->srb_cnt++;
-    }
-    VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
-    if ( res >= 0){
-        if (virtqueue_kick_prepare(adaptExt->vq[QueueNumber])) {
-            virtqueue_notify(adaptExt->vq[QueueNumber]);
-        }
     } else {
-        virtqueue_notify(adaptExt->vq[QueueNumber]);
         ScsiStatus = SCSISTAT_QUEUE_FULL;
         SRB_SET_SRB_STATUS(Srb, SRB_STATUS_BUSY);
         SRB_SET_SCSI_STATUS(Srb, ScsiStatus);
@@ -128,7 +123,10 @@
         CompleteRequest(DeviceExtension, Srb);
         RhelDbgPrint(TRACE_LEVEL_FATAL, " Could not put an SRB into a VQ, so complete it with SRB_STATUS_BUSY. QueueNumber = %d, SRB = 0x%p, Lun = %d, TimeOut = %d.\n", QueueNumber, srbExt->Srb, SRB_LUN(Srb), Srb->TimeOutValue);
     }
-
+    VioScsiVQUnlock(DeviceExtension, MessageID, &LockHandle, FALSE);
+    if (notify){
+        virtqueue_notify(adaptExt->vq[QueueNumber]);
+    }
 EXIT_FN_SRB();
 }

I have backported this into viostor too. I will drop that at issue #907.

EDIT: Removed unnecessary (hopefully) notify in (res < 0)...

@vrozenfe
Collaborator

@frwbr
num_queues = 1 is not the way to fix the problem. We support this case mostly for testing and troubleshooting purposes. The entire idea behind multiqueue is to let the entire storage IO sequence
(UM thread-initiated file IO request -> vioscsi BuildIO/StartIO -> IRQ -> DPC -> resumption of the UM thread that issued the IO request) be executed on the same (v)CPU. In this case we can eliminate the IPIs and data copies which would be required to complete the request and resume the waiting UM thread if the request initiator and the DPC were executed on different CPUs.

In my opinion the simplest ways to work around the problem at the moment would be (an illustrative command-line fragment follows after the list):

  • try to disable indirect mode (something like "-device virtio-scsi-pci,id=scsi-vioscsi1,indirect_desc=off");
  • or reduce the max_sectors size (-device virtio-scsi-pci,id=scsi-vioscsi1,max_sectors=32 (or 64));
  • or limit the maximum transfer size by defining the "PhysicalBreaks" parameter in the Registry;
  • or try to enlarge the virtual queue size by adding a queue_size parameter to the QEMU command line.
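As a rough illustration only (the device id, bus address and other properties must match your actual VM, and whether these can be passed through the Proxmox configuration is a separate question), the QEMU-side options combined might look like:

  -device virtio-scsi-pci,id=scsi-vioscsi1,indirect_desc=off,max_sectors=64

The exact property name for enlarging the virtqueue varies between QEMU versions, so list what your build supports first:

  qemu-system-x86_64 -device virtio-scsi-pci,help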

Best,
Vadim.

@vrozenfe
Collaborator

@benyamin-codez

Thanks a lot.
If it improves the performance and works for a multi-LUN setup, then please submit your work as a commit.

Best,
Vadim.

@JonKohler
Contributor

Agreed, @benyamin-codez - thanks for diving into the trenches on this one. Let's raise it as a pull request when you finish up your batch of testing and we can comment in there. Please tag me and @sb-ntnx in addition to Vadim; I've got a few thoughts but don't want them to get buried in this very long discussion thread 😃

@frwbr

frwbr commented Aug 16, 2024

Hi all,

Thanks @benyamin-codez for jumping into the code! I don't know much about SCSI or Windows drivers so can't comment on the changes themselves. Happy to test a PR with the finished changes though!

If I understand the situation correctly, there does seem to be some bug that can be triggered by excessive IO, e.g. by the 8k+16k reads reproducer. While a large portion of the users affected by this issue are Proxmox VE users, it is not limited to Proxmox VE users, e.g. @GreyWolfBW is using Debian+libvirt. Still, there may be a factor that makes the issue more likely to trigger on Proxmox VE than e.g. on RH (if I understand correctly, Nutanix uses vhost-user-scsi, so that's quite a different setup). As I've done my tests with upstream QEMU stable, I'd rule out our downstream QEMU patches as a factor. Our kernels may be a factor -- next week I can see if I manage to reproduce the issue on a mainline kernel. Also, there may be some peculiarity hiding in my QEMU command line (I've posted them e.g. here) -- I don't see anything suspicious there but I'm obviously biased.

@frwbr num_queues = 1 is not the way to fix the problem. We support this case mostly for testing and troubleshooting purposes. The entire idea behind multiqueue is to let the entire storage IO sequence (UM thread-initiated file IO request -> vioscsi BuildIO/StartIO -> IRQ -> DPC -> resumption of the UM thread that issued the IO request) be executed on the same (v)CPU. In this case we can eliminate the IPIs and data copies which would be required to complete the request and resume the waiting UM thread if the request initiator and the DPC were executed on different CPUs.

Yes, I wouldn't consider num_queues=1 a fix either, and not even a good workaround. :) I mentioned it mainly in the hope it helps in finding the root cause of the issue.

In a similar vein, I built kvm-guest-drivers-windows today at c87ea56 and enabled debug prints (with this patch). I ran the reproducer (4 cores and no restriction on num_queues, QEMU command line below) again on upstream QEMU 9.0.1, and read out the debug prints.

In during-reproducer.txt, the reset is at 86s:

03811537	86.12216187	 <--> SRB_FUNCTION_RESET_LOGICAL_UNIT Target (0::0::0), SRB 0xFFFFD403766D9300

Interestingly, there are no vioscsi debug outputs at all for the ~65 seconds before that, the last one is:

03608265	22.71421814	HandleResponse <--> Operation SCSIOP_READ (0x28), Target (0::0::0), SRB 0xFFFFD403762285D0
03608266	22.71422005	 --> SendSRB Srb = 0xFFFFD4037693DC80.

In my opinion the simplest solutions to work around the problem at the moment would be:

Thanks for the suggestions!

* try to disable indirect mode (something like "-device virtio-scsi-pci,id=scsi-vioscsi1,indirect_desc=off";

That one I already tried in #756 (comment), it seems to get rid of the Desc next is 3, but the reproducer still produces the vioscsi warnings and lockups.

* or reduce the max_sectors size (-device virtio-scsi-pci,id=scsi-vioscsi1,max_sectors=32(64))

* or limit the maximum transfer by defining "PhysicalBreaks" parameter in the Registry,

I think I've looked into those already a couple of weeks ago and didn't see changed behavior when running the reproducer, but I'll try to revisit them.

* or you can also try enlarge the virtual queues size by adding queue_size parameter to the qemu command line.

I'll see if I can try that next week. (EDIT: Didn't have the time to test this, as I was looking into #1135 instead)

Best,

Friedrich


/[...]/qemu-stable-9.0.1/qemu-system-x86_64 \
  -accel kvm \
  -name 'win10-vioscsi-versions,debug-threads=on' \
  -chardev 'socket,id=qmp,path=/var/run/qemu-server/151.qmp,server=on,wait=off' \
  -mon 'chardev=qmp,mode=control' \
  -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect=5' \
  -mon 'chardev=qmp-event,mode=control' \
  -pidfile /var/run/qemu-server/151.pid \
  -smbios 'type=1,uuid=32b5c31e-ec75-4d62-9d0b-a756f876e943' \
  -smp '4,sockets=1,cores=4,maxcpus=4' \
  -nodefaults \
  -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
  -vnc 'unix:/var/run/qemu-server/151.vnc,password=on' \
  -cpu 'qemu64,+aes,enforce,hv_ipi,hv_relaxed,hv_reset,hv_runtime,hv_spinlocks=0x1fff,hv_stimer,hv_synic,hv_time,hv_vapic,hv_vpindex,+kvm_pv_eoi,+kvm_pv_unhalt,+pni,+popcnt,+sse4.1,+sse4.2,+ssse3' \
  -m 8192 \
  -object 'iothread,id=iothread-virtioscsi0' \
  -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
  -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
  -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
  -device 'vmgenid,guid=dd75a80d-45d3-4ec6-9f87-572bacbf1a7a' \
  -device 'piix3-usb-uhci,id=uhci,bus=pci.0,addr=0x1.0x2' \
  -device 'usb-tablet,id=tablet,bus=uhci.0,port=1' \
  -chardev 'socket,id=serial0,path=/var/run/qemu-server/151.serial0,server=on,wait=off' \
  -device 'isa-serial,chardev=serial0' \
  -device 'VGA,id=vga,bus=pci.0,addr=0x2,edid=off' \
  -drive 'file=/var/lib/vz/template/iso/virtio-drivers-debug.iso,if=none,id=drive-ide2,media=cdrom,aio=io_uring' \
  -device 'ide-cd,bus=ide.1,unit=0,drive=drive-ide2,id=ide2,bootindex=101' \
  -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1,iothread=iothread-virtioscsi0' \
  -drive 'file=/dev/pve/vm-151-disk-1,if=none,id=drive-scsi0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0' \
  -device 'ahci,id=ahci0,multifunction=on,bus=pci.0,addr=0x7' \
  -drive 'file=/dev/pve/vm-151-disk-0,if=none,id=drive-sata0,format=raw,cache=none,aio=io_uring,detect-zeroes=on' \
  -device 'ide-hd,bus=ahci0.0,drive=drive-sata0,id=sata0,bootindex=100' \
  -netdev 'type=tap,id=net0,ifname=tap151i0,script=/var/lib/qemu-server/pve-bridge,downscript=/var/lib/qemu-server/pve-bridgedown' \
  -device 'e1000,mac=BA:FF:FF:84:10:E7,netdev=net0,bus=pci.0,addr=0x12,id=net0,bootindex=102' \
  -rtc 'driftfix=slew,base=localtime' \
  -machine 'hpet=off,type=pc-i440fx-9.0' \
  -global 'kvm-pit.lost_tick_policy=discard'

@benyamin-codez
Contributor

Agreed @benyamin-codez thanks for diving into the trenches on this one, let's raise it as a pull request when you finish up your batch of testing and we can comment in there. Please tag me and @sb-ntnx in addition to Vadim, I've got a few thoughts but don't want them to get buried in this very long discussion thread 😃

All good.

Mostly done with IO and SCSI failure testing.

Proxmox doesn't seem to work well with HMP based blkdebug options, so it has been a bit tedious.

Just working out some optimisation vs WPP tracing issues. It must have been "fun" when @vrozenfe implemented it.

The PR should be up in the next day.

@benyamin-codez
Contributor

Also, to be clear, not saying that it couldn't be the problem, perhaps it contributes, but this type of lockup has been going on before that commit was even in the repo (see #623). Very well may be in the same class of issues.

I suspect that what ends up happening here is something else down the stack is binding up, and perhaps just making sure it has nicely aligned IO makes it angrier.

@JonKohler

This is definitely the case. I'm having a look at it now...

I dropped an update here: https://forum.proxmox.com/threads/redhat-virtio-developers-would-like-to-coordinate-with-proxmox-devs-re-vioscsi-reset-to-device-system-unresponsive.139160/post-695547

@JonKohler
Contributor

The case about what? I flipped through the last few posts on the forums; can you elaborate?

@benyamin-codez
Contributor

The case about what? I flipped a thru the last few posts on the forums, can you elaborate?

There is an underlying problem that existed before the v208 release. I would say well before. As you put it:

...something else down the stack is binding up...
...perhaps it contributes, but this type of lockup has been going on before that commit was even in the repo...

Per my comments in the Proxmox post and in the draft PR, I'm focusing on that issue at the moment.

@benyamin-codez
Contributor

Also Vadim's comments:

In my understanding the root of the problem is in the maximum size of the IO transfer. Luckily for Nutanix and us (RH), we don't see any problem if the IO transfer exceeds the virtual queue size; others might have a problem with that.

and

Right, I believe that the problem is that we create an SG list longer than the virtual queue size.
Technically, we probably need to reduce the size or split a large DMA transfer into a set of smaller chunks.

...and that may very well be the case, but I'm also reviewing the NOTIFY implementation in [virtio].
I'm making some progress but do have some competing priorities the next few days.

@benyamin-codez
Contributor

benyamin-codez commented Sep 19, 2024

PR #1150 has been raised to resolve this issue in [vioscsi].

@benyamin-codez
Contributor

PR #1174 has been raised to resolve this issue in [viostor].
