Intel X710-T4L issue - WARNING: queue <num> appears to be hung!
-
I've been struggling with an Intel X710-T4L adapter for a few months now and I'm out of ideas; any help would be much appreciated. Thanks in advance!
System configuration
- pfSense 24.03
- pfSense is a QEMU guest with 16 cores assigned, running on Proxmox 8.2.4. I've passed through the entire X710-T4L device to the guest.
- X710-T4L firmware v9.5
- ixl driver v1.14.2
- Two ports are in use on the X710-T4L:
ixl0 - WAN
ixl1 - MODEM
I am one of those AT&T customers who bypasses their router gateway. I use the following authbridge setup between ixl0 (ONT) and ixl1 (ISP modem).
I had this exact same authbridge setup on a bare metal machine using an Intel I350 adapter running for years without any issues.
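For completeness, the passthrough on the Proxmox side is nothing exotic; it boils down to something like the following (a sketch: VM ID 100 is a placeholder, 01:00 is the adapter's PCI address on my host, and pcie=1 assumes a q35 machine type):

# Pass the whole X710-T4L (all four functions) through to the pfSense VM
qm set 100 --hostpci0 01:00,pcie=1

# Confirm on the host that vfio-pci has claimed the device
lspci -nnk -s 01:00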
Problem description and logs
Every so often ixl0 randomly stops passing packets, and the system logs fill with messages like "ixl0: WARNING: queue 0 appears to be hung!". No hung queue messages are shown for the other ixl ports (ixl1 is the only other port in use). I cannot find any other error/warning logs besides the gateway alarm for packet loss on WAN (ixl0) shortly before the hung queue messages start. I cannot find any logs at all containing ixl before the "queue appears to be hung!" messages appear.
This issue seems to occur randomly. The system stays stable for anywhere from about 1 to 30 days; the last period of stability lasted 13 days before ixl0 gave out. The system is not under any special load, as far as I can tell, when ixl0 quits, and it happens at random times of the day.
ifconfig down/up has no effect; I've only found that a full reboot returns the system to normal.
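To be precise, the bounce I tried was just the basic cycle from the pfSense shell, something like:

# Bounce the stuck WAN interface; this never recovered it for me
ifconfig ixl0 down
sleep 2
ifconfig ixl0 up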
Relevant boot logs for ixl...

This one appeared after I installed the latest version of the ixl driver; I'm assuming it's nothing to worry about.

Jul 17 13:40:31 kernel Module pci/ixl failed to register: 17
Jul 17 13:40:31 kernel module_register: cannot register pci/ixl from kernel; already loaded from if_ixl.ko
The SR-IOV init failure seems odd, but I don't need that feature so I ignore it. The 8 queues also seem odd, since this VM has 16 cores; odder still, the ixl driver version that comes with pfSense 24.03 assigns 16 queues while the latest Intel driver assigns 8. Regardless, performance seems better on the latest driver, so I'm not too worried about this either.
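If I ever want to force the queue count to match the core count, iflib exposes per-device loader tunables for this. An untested sketch (would go in /boot/loader.conf.local):

# Untested sketch -- not something I'm currently running
# Override the iflib queue counts for ixl0 (0 = let the driver decide)
dev.ixl.0.iflib.override_ntxqs=16
dev.ixl.0.iflib.override_nrxqs=16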
The same logs repeat for ixl0 - ixl3

Jul 17 13:40:31 kernel ixl0: The device is not iWARP enabled
Jul 17 13:40:31 kernel ixl0: Failed to initialize SR-IOV (error=2)
Jul 17 13:40:31 kernel ixl0: PCI Express Bus: Speed 8.0GT/s Width x8
Jul 17 13:40:31 kernel ixl0: Ethernet address: f0:b2:b9:0d:61:64
Jul 17 13:40:31 kernel ixl0: Allocating 8 queues for PF LAN VSI; 8 queues active
Jul 17 13:40:31 kernel ixl0: Using MSI-X interrupts with 9 vectors
Jul 17 13:40:31 kernel ixl0: PF-ID[0]: VFs 32, MSI-X 129, VF MSI-X 5, QPs 384, MDIO shared
Jul 17 13:40:31 kernel ixl0: fw 9.150.77492 api 1.15 nvm 9.50 etid 8000f160 oem 1.270.0
Jul 17 13:40:31 kernel ixl0: using 1024 tx descriptors and 1024 rx descriptors
Jul 17 13:40:31 kernel ixl0: <Intel(R) Ethernet Connection 700 Series PF Driver, Version - 1.14.2> mem 0x380003000000-0x380003ffffff,0x380004018000-0x38000401ffff irq 16 at device 0.0 on pci1
All other mentions of ixl in the boot log appear to be normal operations.
Attempts to fix
- Flashed latest firmware v9.5
- Installed latest ixl driver v1.14.2
- Installed the same latest driver on the hypervisor just for the heck of it
- Under System > Advanced > Networking, I've tested with all hardware offload settings enabled and disabled. No change in behavior.
- Tried with hardware flow control completely disabled and with full flow control enabled. No change in behavior. (Rough shell equivalents of these two tests are sketched at the end of this post.)
- I've been through the following posts that describe similar symptoms, but those issues seem to have already been fixed and I don't see the same logs other than the "queue appears to be hung!" messages.
- Intel X710 troubles
- Issues with an Intel x710 and pfsense 2.4.5-p1
- Bug 221919 - ixl: TX queue hang when using TSO and having a high and mixed network load
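For anyone wanting to repeat the offload and flow-control tests above from the shell, these are roughly the equivalent commands (a sketch; the pfSense GUI applies the offload settings via system tunables, so treat the per-interface commands as illustrative only):

# Hardware offloads off / on for ixl0 (TSO, LRO, checksum offload)
ifconfig ixl0 -tso -lro -txcsum -rxcsum
ifconfig ixl0 tso lro txcsum rxcsum

# Flow control on the ixl driver: 0 = none, 3 = full RX/TX pause
sysctl dev.ixl.0.fc=0
sysctl dev.ixl.0.fc=3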
-
Reboot the VM or reboot the host?
Does it do it if you don't pass the hardware through and just use vmxnet?
Steve
-
Good question. Rebooting the VM alone is not enough; I've been rebooting the Proxmox host to resolve the issue. Thankfully, the other VMs this Proxmox instance is managing are not a high priority. A reboot of the pfSense VM alone results in aq_errors as the adapter initializes during boot, and it never passes packets from the start, though there are no hung queue errors. Unfortunately, I only tried this once and didn't collect adequate logs to share. The tenants on this system are quite annoyed when it fails at inopportune times, so the priority in the moment was to restore service as quickly as possible. Since this happens randomly, it may be a little while before the issue returns and I can try rebooting the pfSense VM alone again and collect the relevant logs to share.
Noticing this behavior is what led me to try installing the latest drivers for the adapter on Proxmox too; even though the whole device is being passed through to the VM, it made it smell like a virtualization issue.

I have not tried vmxnet. I think that is a VMware ESXi only thing; Proxmox uses VirtIO (vtnet). Please correct me if I am wrong. I did test vtnet before putting this system into service and found performance to be noticeably worse: lower throughput, higher latency, and significantly higher host resource usage. I was also concerned the requirements of the authbridge wouldn't play nicely with vtnet, though I didn't test that.
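If I were to go back to vtnet, on the Proxmox side it would look roughly like this (VM ID 100 and bridge vmbr1 are placeholders):

# Remove the passthrough NIC from the VM and give it a VirtIO NIC on a host bridge instead
qm set 100 --delete hostpci0
qm set 100 --net1 virtio,bridge=vmbr1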
Thanks, Louis
-
Yes, sorry, I meant vtnet. Though Proxmox can also present NICs as vmxnet.
I assume the card did not stop responding when using the Proxmox driver though?
Is it linked at 10G?
-
I assume the card did not stop responding when using the Proxmox driver though?
Correct, though it's not confirmation of anything. Since this issue takes days or sometimes weeks to manifest, I may not have worked with vtnet long enough to see any issue.
Is it linked at 10G?
WAN (ixl0) is a 1G link, but while testing the X710 with vtnet I was using a 10G link.
My thoughts on using vtnet... I spent about a month trying various hardware configurations and tuning options to see what was easiest to manage and provided the best performance. I tested through pfSense from a machine on LAN to a machine on WAN, where each machine had a 100G link into the switch that pfSense's X710 was attached to with 10G links (WAN and LAN networks partitioned on the switch with VLANs). With vtnet I couldn't push more than 5G through pfSense, with 2-3 ms latency. From what I've read I think it should have done better than that, but I couldn't figure it out. Passthrough with all hardware offloading enabled gave the best results: the full 10G, <=1 ms latency, and ~25% less CPU usage on the host relative to vtnet. I also like that passthrough is a bit simpler to configure. Unfortunately, testing for long-term stability wasn't practical before putting the system into production.
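For reference, the kind of test behind those numbers is a simple multi-stream run through the firewall, along these lines (iperf3 shown purely as an illustration; the address is a placeholder):

# On the WAN-side test machine
iperf3 -s

# On the LAN-side test machine, 4 parallel streams through pfSense for 60 seconds
iperf3 -c 192.0.2.10 -P 4 -t 60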
I could try switching to vtnet long term, but I would prefer to see passthrough work correctly. I went with the X710 instead of a cheaper 1G adapter for the option to upgrade the current 1G link someday.
-
Mmm, I would certainly expect best performance with hardware pass-through.
However, you might consider adding a 1G NIC for the WAN if that's what it links at. That queue hung error is specific to the ixl driver, so if yours is only 1G anyway you could just use a different one.
I assume you're not seeing the malicious driver event logs? https://redmine.pfsense.org/issues/13003
-
However, you might consider adding a 1G NIC for the WAN if that's what it links at. That queue hung error is specific to the ixl driver, so if yours is only 1G anyway you could just use a different one.
I considered swapping the X710 for an i350 since I had igb working well for years. But there are a few things that complicate this idea for me...
- It feels a bit like kicking the can down the road; it would be nice if I could get ixl working reliably.
- I was previously using an i350 on a bare metal install of pfSense; I've never tested it long term in a Proxmox VM install. I might find some other issue with that setup too.
- In order to set up and test the authbridge, I need to enable Ethernet filtering, which is only available with a pfSense+ license. Swapping the network controller will change the NDI and require the current license to be transferred, but I've already reconfigured the network controllers once before and had to transfer the license. Since the license can only be transferred once, I would have to get another license too. Or I could switch to CE and try manually configuring Ethernet filtering; it sounds like just the GUI components are missing from CE, but that's annoying extra work.
I assume you're not seeing the malicious driver event logs? https://redmine.pfsense.org/issues/13003
Correct. I don't see any relevant logs before or after the hung queue messages start.
I'll wait until this happens again, try rebooting the VM alone, and post the relevant logs here. Hopefully that will shed a little more light on what's going on.
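This is roughly what I plan to grab from the pfSense shell before rebooting (a sketch; patterns and paths may need adjusting):

# Kernel messages and system log entries around the failure
dmesg | grep -i ixl
grep -iE 'ixl|hung|malicious' /var/log/system.log

# Driver/queue state for the WAN port
sysctl dev.ixl.0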
Thanks for your consideration so far.
-
Failed again, but in a new way. This time there were no hung queue messages and no merely unresponsive ixl adapter in pfSense; instead, the entire Proxmox machine crashed. The crash reports from Proxmox and pfSense both indicate it was caused by an issue with the X710 adapter.
This is the first time in 6 months of testing I've seen the problematic X710 adapter cause a crash at the hypervisor level. The crash logs from Proxmox and pfSense follow. There are no logs from either Proxmox or pfSense indicating any sort of issue prior to the crash. pfSense indicates a crash dump time of 11:22:11, and Proxmox indicates a crash dump time of 11:23:07. An indication the issue started at the VM level and spread to the hypervisor?

Proxmox crash report:
pci 0000:01:00
is the X710 adapter I've been having issues with.Jul 22 11:23:07 pve kernel: vfio-pci 0000:01:00.3: can't update enabled VF BAR1 [??? 0x00000000 flags 0x0] Jul 22 11:23:07 pve kernel: WARNING: CPU: 32 PID: 6028 at drivers/pci/iov.c:966 pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls iavf sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ib_uverbs dax_hmem cxl_acpi acpi_ipmi rapl ast ipmi_si pcspkr cxl_core ib_core ipmi_devintf i2c_algo_bit ccp k10temp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid xhci_pci ice(OE) xhci_pci_renesas nvme crc32_pclmul xhci_hcd gnss nvme_core bnxt_en i2c_piix4 i40e(OE) nvme_auth Jul 22 11:23:07 pve kernel: CPU: 32 PID: 6028 Comm: kvm Tainted: P OE 6.8.8-2-pve #1 Jul 22 11:23:07 pve kernel: Hardware name: Supermicro AS -2015SV-WTNRT/H13SVW-NT, BIOS 1.1b 12/20/2023 Jul 22 11:23:07 pve kernel: RIP: 0010:pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: Code: 8b b3 c8 00 00 00 48 8d bb c8 00 00 00 e8 04 8c 1c 00 4d 89 e0 44 89 e9 4c 89 f2 48 89 c6 48 c7 c7 d0 43 40 b9 e8 fc b2 7d ff <0f> 0b e9 4a ff ff ff e8 00 e4 82 00 90 90 90 90 90 90 90 90 90 90 Jul 22 11:23:07 pve kernel: RSP: 0018:ff72c6bbc51a7920 EFLAGS: 00010246 Jul 22 11:23:07 pve kernel: RAX: 0000000000000000 RBX: ff1f06c7c6735000 RCX: 0000000000000000 Jul 22 11:23:07 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 Jul 22 11:23:07 pve kernel: RBP: ff72c6bbc51a7960 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:07 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1f06c7c67355d0 Jul 22 11:23:07 pve kernel: R13: 0000000000000001 R14: ff1f06c7c64ecee0 R15: ff1f06c85b0f8000 Jul 22 11:23:07 pve kernel: FS: 00007f9b48ded480(0000) GS:ff1f07258a800000(0000) knlGS:0000000000000000 Jul 22 11:23:07 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 22 11:23:07 pve kernel: CR2: 00005763393242c8 CR3: 00000002029f8004 CR4: 0000000000f71ef0 Jul 22 11:23:07 pve kernel: PKRU: 55555554 Jul 22 11:23:07 pve kernel: Call Trace: Jul 22 11:23:07 pve kernel: <TASK> Jul 22 11:23:07 pve kernel: ? show_regs+0x6d/0x80 Jul 22 11:23:07 pve kernel: ? __warn+0x89/0x160 Jul 22 11:23:07 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: ? report_bug+0x17e/0x1b0 Jul 22 11:23:07 pve kernel: ? handle_bug+0x46/0x90 Jul 22 11:23:07 pve kernel: ? exc_invalid_op+0x18/0x80 Jul 22 11:23:07 pve kernel: ? asm_exc_invalid_op+0x1b/0x20 Jul 22 11:23:07 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: pci_update_resource+0x27/0x50 Jul 22 11:23:07 pve kernel: pci_restore_iov_state+0xb4/0x150 Jul 22 11:23:07 pve kernel: pci_restore_state.part.0+0x204/0x3a0 Jul 22 11:23:07 pve kernel: pci_dev_restore+0x58/0x80 Jul 22 11:23:07 pve kernel: pci_try_reset_function+0x6a/0xa0 Jul 22 11:23:07 pve kernel: vfio_pci_core_ioctl+0x7bc/0xe80 [vfio_pci_core] Jul 22 11:23:07 pve kernel: ? 
kvm_vm_ioctl_irq_line+0x27/0x60 [kvm] Jul 22 11:23:07 pve kernel: vfio_device_fops_unl_ioctl+0xa8/0x850 [vfio] Jul 22 11:23:07 pve kernel: ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm] Jul 22 11:23:07 pve kernel: __x64_sys_ioctl+0xa0/0xf0 Jul 22 11:23:07 pve kernel: x64_sys_call+0xa68/0x24b0 Jul 22 11:23:07 pve kernel: do_syscall_64+0x81/0x170 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? release_pages+0x152/0x4c0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_lruvec_state+0x36/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __lruvec_stat_mod_folio+0x70/0xc0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? set_ptes.constprop.0+0x2b/0xb0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? do_anonymous_page+0x3a8/0x740 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __pte_offset_map+0x1c/0x1b0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __handle_mm_fault+0xc32/0xee0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __count_memcg_events+0x6f/0xe0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? count_memcg_events.constprop.0+0x2a/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? handle_mm_fault+0xad/0x380 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? do_user_addr_fault+0x343/0x6b0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? irqentry_exit_to_user_mode+0x7e/0x260 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? irqentry_exit+0x43/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? 
exc_page_fault+0x94/0x1b0 Jul 22 11:23:07 pve kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80 Jul 22 11:23:07 pve kernel: RIP: 0033:0x7f9b4bb6fc5b Jul 22 11:23:07 pve kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00 Jul 22 11:23:07 pve kernel: RSP: 002b:00007ffce3c437f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Jul 22 11:23:07 pve kernel: RAX: ffffffffffffffda RBX: 0000562efb229c70 RCX: 00007f9b4bb6fc5b Jul 22 11:23:07 pve kernel: RDX: 0000000000000000 RSI: 0000000000003b6f RDI: 0000000000000051 Jul 22 11:23:07 pve kernel: RBP: 0000562efb229cf4 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:07 pve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ef985a630 Jul 22 11:23:07 pve kernel: R13: 0000562ef74d3e10 R14: 0000562ef96ef790 R15: 00000000000000c8 Jul 22 11:23:07 pve kernel: </TASK> Jul 22 11:23:07 pve kernel: ---[ end trace 0000000000000000 ]--- Jul 22 11:23:07 pve kernel: ------------[ cut here ]------------ Jul 22 11:23:07 pve kernel: vfio-pci 0000:01:00.3: can't update enabled VF BAR2 [??? 0x00000000 flags 0x0] Jul 22 11:23:07 pve kernel: WARNING: CPU: 32 PID: 6028 at drivers/pci/iov.c:966 pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls iavf sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ib_uverbs dax_hmem cxl_acpi acpi_ipmi rapl ast ipmi_si pcspkr cxl_core ib_core ipmi_devintf i2c_algo_bit ccp k10temp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid xhci_pci ice(OE) xhci_pci_renesas nvme crc32_pclmul xhci_hcd gnss nvme_core bnxt_en i2c_piix4 i40e(OE) nvme_auth Jul 22 11:23:07 pve kernel: CPU: 32 PID: 6028 Comm: kvm Tainted: P W OE 6.8.8-2-pve #1 Jul 22 11:23:07 pve kernel: Hardware name: Supermicro AS -2015SV-WTNRT/H13SVW-NT, BIOS 1.1b 12/20/2023 Jul 22 11:23:07 pve kernel: RIP: 0010:pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: Code: 8b b3 c8 00 00 00 48 8d bb c8 00 00 00 e8 04 8c 1c 00 4d 89 e0 44 89 e9 4c 89 f2 48 89 c6 48 c7 c7 d0 43 40 b9 e8 fc b2 7d ff <0f> 0b e9 4a ff ff ff e8 00 e4 82 00 90 90 90 90 90 90 90 90 90 90 Jul 22 11:23:07 pve kernel: RSP: 0018:ff72c6bbc51a7920 EFLAGS: 00010246 Jul 22 11:23:07 pve kernel: RAX: 0000000000000000 RBX: ff1f06c7c6735000 RCX: 0000000000000000 Jul 22 11:23:07 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 Jul 22 11:23:07 pve kernel: RBP: ff72c6bbc51a7960 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:07 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1f06c7c6735610 Jul 22 11:23:07 pve kernel: R13: 0000000000000002 R14: ff1f06c7c64ecee0 R15: ff1f06c85b0f8000 Jul 22 11:23:07 pve kernel: FS: 00007f9b48ded480(0000) GS:ff1f07258a800000(0000) knlGS:0000000000000000 Jul 22 11:23:07 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 22 11:23:07 pve kernel: CR2: 
00005763393242c8 CR3: 00000002029f8004 CR4: 0000000000f71ef0 Jul 22 11:23:07 pve kernel: PKRU: 55555554 Jul 22 11:23:07 pve kernel: Call Trace: Jul 22 11:23:07 pve kernel: <TASK> Jul 22 11:23:07 pve kernel: ? show_regs+0x6d/0x80 Jul 22 11:23:07 pve kernel: ? __warn+0x89/0x160 Jul 22 11:23:07 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: ? report_bug+0x17e/0x1b0 Jul 22 11:23:07 pve kernel: ? handle_bug+0x46/0x90 Jul 22 11:23:07 pve kernel: ? exc_invalid_op+0x18/0x80 Jul 22 11:23:07 pve kernel: ? asm_exc_invalid_op+0x1b/0x20 Jul 22 11:23:07 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:07 pve kernel: pci_update_resource+0x27/0x50 Jul 22 11:23:07 pve kernel: pci_restore_iov_state+0xb4/0x150 Jul 22 11:23:07 pve kernel: pci_restore_state.part.0+0x204/0x3a0 Jul 22 11:23:07 pve kernel: pci_dev_restore+0x58/0x80 Jul 22 11:23:07 pve kernel: pci_try_reset_function+0x6a/0xa0 Jul 22 11:23:07 pve kernel: vfio_pci_core_ioctl+0x7bc/0xe80 [vfio_pci_core] Jul 22 11:23:07 pve kernel: ? kvm_vm_ioctl_irq_line+0x27/0x60 [kvm] Jul 22 11:23:07 pve kernel: vfio_device_fops_unl_ioctl+0xa8/0x850 [vfio] Jul 22 11:23:07 pve kernel: ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm] Jul 22 11:23:07 pve kernel: __x64_sys_ioctl+0xa0/0xf0 Jul 22 11:23:07 pve kernel: x64_sys_call+0xa68/0x24b0 Jul 22 11:23:07 pve kernel: do_syscall_64+0x81/0x170 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? release_pages+0x152/0x4c0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __mod_lruvec_state+0x36/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __lruvec_stat_mod_folio+0x70/0xc0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? set_ptes.constprop.0+0x2b/0xb0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? do_anonymous_page+0x3a8/0x740 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __pte_offset_map+0x1c/0x1b0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __handle_mm_fault+0xc32/0xee0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? __count_memcg_events+0x6f/0xe0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? count_memcg_events.constprop.0+0x2a/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? handle_mm_fault+0xad/0x380 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? do_user_addr_fault+0x343/0x6b0 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? irqentry_exit_to_user_mode+0x7e/0x260 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? irqentry_exit+0x43/0x50 Jul 22 11:23:07 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:07 pve kernel: ? 
exc_page_fault+0x94/0x1b0 Jul 22 11:23:07 pve kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80 Jul 22 11:23:07 pve kernel: RIP: 0033:0x7f9b4bb6fc5b Jul 22 11:23:07 pve kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00 Jul 22 11:23:07 pve kernel: RSP: 002b:00007ffce3c437f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Jul 22 11:23:07 pve kernel: RAX: ffffffffffffffda RBX: 0000562efb229c70 RCX: 00007f9b4bb6fc5b Jul 22 11:23:07 pve kernel: RDX: 0000000000000000 RSI: 0000000000003b6f RDI: 0000000000000051 Jul 22 11:23:07 pve kernel: RBP: 0000562efb229cf4 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:07 pve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ef985a630 Jul 22 11:23:07 pve kernel: R13: 0000562ef74d3e10 R14: 0000562ef96ef790 R15: 00000000000000c8 Jul 22 11:23:07 pve kernel: </TASK> Jul 22 11:23:07 pve kernel: ---[ end trace 0000000000000000 ]--- Jul 22 11:23:08 pve kernel: ------------[ cut here ]------------ Jul 22 11:23:08 pve kernel: vfio-pci 0000:01:00.3: can't update enabled VF BAR4 [??? 0x00000000 flags 0x0] Jul 22 11:23:08 pve kernel: WARNING: CPU: 32 PID: 6028 at drivers/pci/iov.c:966 pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls iavf sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ib_uverbs dax_hmem cxl_acpi acpi_ipmi rapl ast ipmi_si pcspkr cxl_core ib_core ipmi_devintf i2c_algo_bit ccp k10temp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid xhci_pci ice(OE) xhci_pci_renesas nvme crc32_pclmul xhci_hcd gnss nvme_core bnxt_en i2c_piix4 i40e(OE) nvme_auth Jul 22 11:23:08 pve kernel: CPU: 32 PID: 6028 Comm: kvm Tainted: P W OE 6.8.8-2-pve #1 Jul 22 11:23:08 pve kernel: Hardware name: Supermicro AS -2015SV-WTNRT/H13SVW-NT, BIOS 1.1b 12/20/2023 Jul 22 11:23:08 pve kernel: RIP: 0010:pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: Code: 8b b3 c8 00 00 00 48 8d bb c8 00 00 00 e8 04 8c 1c 00 4d 89 e0 44 89 e9 4c 89 f2 48 89 c6 48 c7 c7 d0 43 40 b9 e8 fc b2 7d ff <0f> 0b e9 4a ff ff ff e8 00 e4 82 00 90 90 90 90 90 90 90 90 90 90 Jul 22 11:23:08 pve kernel: RSP: 0018:ff72c6bbc51a7920 EFLAGS: 00010246 Jul 22 11:23:08 pve kernel: RAX: 0000000000000000 RBX: ff1f06c7c6735000 RCX: 0000000000000000 Jul 22 11:23:08 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 Jul 22 11:23:08 pve kernel: RBP: ff72c6bbc51a7960 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:08 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1f06c7c6735690 Jul 22 11:23:08 pve kernel: R13: 0000000000000004 R14: ff1f06c7c64ecee0 R15: ff1f06c85b0f8000 Jul 22 11:23:08 pve kernel: FS: 00007f9b48ded480(0000) GS:ff1f07258a800000(0000) knlGS:0000000000000000 Jul 22 11:23:08 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 22 11:23:08 pve kernel: CR2: 
00005763393242c8 CR3: 00000002029f8004 CR4: 0000000000f71ef0 Jul 22 11:23:08 pve kernel: PKRU: 55555554 Jul 22 11:23:08 pve kernel: Call Trace: Jul 22 11:23:08 pve kernel: <TASK> Jul 22 11:23:08 pve kernel: ? show_regs+0x6d/0x80 Jul 22 11:23:08 pve kernel: ? __warn+0x89/0x160 Jul 22 11:23:08 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: ? report_bug+0x17e/0x1b0 Jul 22 11:23:08 pve kernel: ? handle_bug+0x46/0x90 Jul 22 11:23:08 pve kernel: ? exc_invalid_op+0x18/0x80 Jul 22 11:23:08 pve kernel: ? asm_exc_invalid_op+0x1b/0x20 Jul 22 11:23:08 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: pci_update_resource+0x27/0x50 Jul 22 11:23:08 pve kernel: pci_restore_iov_state+0xb4/0x150 Jul 22 11:23:08 pve kernel: pci_restore_state.part.0+0x204/0x3a0 Jul 22 11:23:08 pve kernel: pci_dev_restore+0x58/0x80 Jul 22 11:23:08 pve kernel: pci_try_reset_function+0x6a/0xa0 Jul 22 11:23:08 pve kernel: vfio_pci_core_ioctl+0x7bc/0xe80 [vfio_pci_core] Jul 22 11:23:08 pve kernel: ? kvm_vm_ioctl_irq_line+0x27/0x60 [kvm] Jul 22 11:23:08 pve kernel: vfio_device_fops_unl_ioctl+0xa8/0x850 [vfio] Jul 22 11:23:08 pve kernel: ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm] Jul 22 11:23:08 pve kernel: __x64_sys_ioctl+0xa0/0xf0 Jul 22 11:23:08 pve kernel: x64_sys_call+0xa68/0x24b0 Jul 22 11:23:08 pve kernel: do_syscall_64+0x81/0x170 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? release_pages+0x152/0x4c0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_lruvec_state+0x36/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __lruvec_stat_mod_folio+0x70/0xc0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? set_ptes.constprop.0+0x2b/0xb0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? do_anonymous_page+0x3a8/0x740 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __pte_offset_map+0x1c/0x1b0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __handle_mm_fault+0xc32/0xee0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __count_memcg_events+0x6f/0xe0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? count_memcg_events.constprop.0+0x2a/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? handle_mm_fault+0xad/0x380 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? do_user_addr_fault+0x343/0x6b0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? irqentry_exit_to_user_mode+0x7e/0x260 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? irqentry_exit+0x43/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? 
exc_page_fault+0x94/0x1b0 Jul 22 11:23:08 pve kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80 Jul 22 11:23:08 pve kernel: RIP: 0033:0x7f9b4bb6fc5b Jul 22 11:23:08 pve kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00 Jul 22 11:23:08 pve kernel: RSP: 002b:00007ffce3c437f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Jul 22 11:23:08 pve kernel: RAX: ffffffffffffffda RBX: 0000562efb229c70 RCX: 00007f9b4bb6fc5b Jul 22 11:23:08 pve kernel: RDX: 0000000000000000 RSI: 0000000000003b6f RDI: 0000000000000051 Jul 22 11:23:08 pve kernel: RBP: 0000562efb229cf4 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:08 pve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ef985a630 Jul 22 11:23:08 pve kernel: R13: 0000562ef74d3e10 R14: 0000562ef96ef790 R15: 00000000000000c8 Jul 22 11:23:08 pve kernel: </TASK> Jul 22 11:23:08 pve kernel: ---[ end trace 0000000000000000 ]--- Jul 22 11:23:08 pve kernel: ------------[ cut here ]------------ Jul 22 11:23:08 pve kernel: vfio-pci 0000:01:00.3: can't update enabled VF BAR5 [??? 0x00000000 flags 0x0] Jul 22 11:23:08 pve kernel: WARNING: CPU: 32 PID: 6028 at drivers/pci/iov.c:966 pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: Modules linked in: vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd ebtable_filter ebtables ip_set ip6table_raw iptable_raw ip6table_filter ip6_tables iptable_filter softdog nf_tables bonding tls iavf sunrpc binfmt_misc nfnetlink_log nfnetlink intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd ib_uverbs dax_hmem cxl_acpi acpi_ipmi rapl ast ipmi_si pcspkr cxl_core ib_core ipmi_devintf i2c_algo_bit ccp k10temp ipmi_msghandler joydev input_leds mac_hid vhost_net vhost vhost_iotlb tap efi_pstore dmi_sysfs ip_tables x_tables autofs4 zfs(PO) spl(O) btrfs blake2b_generic xor raid6_pq libcrc32c rndis_host cdc_ether usbnet mii hid_generic usbmouse usbhid hid xhci_pci ice(OE) xhci_pci_renesas nvme crc32_pclmul xhci_hcd gnss nvme_core bnxt_en i2c_piix4 i40e(OE) nvme_auth Jul 22 11:23:08 pve kernel: CPU: 32 PID: 6028 Comm: kvm Tainted: P W OE 6.8.8-2-pve #1 Jul 22 11:23:08 pve kernel: Hardware name: Supermicro AS -2015SV-WTNRT/H13SVW-NT, BIOS 1.1b 12/20/2023 Jul 22 11:23:08 pve kernel: RIP: 0010:pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: Code: 8b b3 c8 00 00 00 48 8d bb c8 00 00 00 e8 04 8c 1c 00 4d 89 e0 44 89 e9 4c 89 f2 48 89 c6 48 c7 c7 d0 43 40 b9 e8 fc b2 7d ff <0f> 0b e9 4a ff ff ff e8 00 e4 82 00 90 90 90 90 90 90 90 90 90 90 Jul 22 11:23:08 pve kernel: RSP: 0018:ff72c6bbc51a7920 EFLAGS: 00010246 Jul 22 11:23:08 pve kernel: RAX: 0000000000000000 RBX: ff1f06c7c6735000 RCX: 0000000000000000 Jul 22 11:23:08 pve kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 Jul 22 11:23:08 pve kernel: RBP: ff72c6bbc51a7960 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:08 pve kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ff1f06c7c67356d0 Jul 22 11:23:08 pve kernel: R13: 0000000000000005 R14: ff1f06c7c64ecee0 R15: ff1f06c85b0f8000 Jul 22 11:23:08 pve kernel: FS: 00007f9b48ded480(0000) GS:ff1f07258a800000(0000) knlGS:0000000000000000 Jul 22 11:23:08 pve kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Jul 22 11:23:08 pve kernel: CR2: 
00005763393242c8 CR3: 00000002029f8004 CR4: 0000000000f71ef0 Jul 22 11:23:08 pve kernel: PKRU: 55555554 Jul 22 11:23:08 pve kernel: Call Trace: Jul 22 11:23:08 pve kernel: <TASK> Jul 22 11:23:08 pve kernel: ? show_regs+0x6d/0x80 Jul 22 11:23:08 pve kernel: ? __warn+0x89/0x160 Jul 22 11:23:08 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: ? report_bug+0x17e/0x1b0 Jul 22 11:23:08 pve kernel: ? handle_bug+0x46/0x90 Jul 22 11:23:08 pve kernel: ? exc_invalid_op+0x18/0x80 Jul 22 11:23:08 pve kernel: ? asm_exc_invalid_op+0x1b/0x20 Jul 22 11:23:08 pve kernel: ? pci_iov_update_resource+0x144/0x150 Jul 22 11:23:08 pve kernel: pci_update_resource+0x27/0x50 Jul 22 11:23:08 pve kernel: pci_restore_iov_state+0xb4/0x150 Jul 22 11:23:08 pve kernel: pci_restore_state.part.0+0x204/0x3a0 Jul 22 11:23:08 pve kernel: pci_dev_restore+0x58/0x80 Jul 22 11:23:08 pve kernel: pci_try_reset_function+0x6a/0xa0 Jul 22 11:23:08 pve kernel: vfio_pci_core_ioctl+0x7bc/0xe80 [vfio_pci_core] Jul 22 11:23:08 pve kernel: ? kvm_vm_ioctl_irq_line+0x27/0x60 [kvm] Jul 22 11:23:08 pve kernel: vfio_device_fops_unl_ioctl+0xa8/0x850 [vfio] Jul 22 11:23:08 pve kernel: ? __pfx_kvm_set_pic_irq+0x10/0x10 [kvm] Jul 22 11:23:08 pve kernel: __x64_sys_ioctl+0xa0/0xf0 Jul 22 11:23:08 pve kernel: x64_sys_call+0xa68/0x24b0 Jul 22 11:23:08 pve kernel: do_syscall_64+0x81/0x170 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? release_pages+0x152/0x4c0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_memcg_lruvec_state+0x87/0x140 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __mod_lruvec_state+0x36/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __lruvec_stat_mod_folio+0x70/0xc0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? set_ptes.constprop.0+0x2b/0xb0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? do_anonymous_page+0x3a8/0x740 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __pte_offset_map+0x1c/0x1b0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __handle_mm_fault+0xc32/0xee0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? __count_memcg_events+0x6f/0xe0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? count_memcg_events.constprop.0+0x2a/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? handle_mm_fault+0xad/0x380 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? do_user_addr_fault+0x343/0x6b0 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? irqentry_exit_to_user_mode+0x7e/0x260 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? irqentry_exit+0x43/0x50 Jul 22 11:23:08 pve kernel: ? srso_alias_return_thunk+0x5/0xfbef5 Jul 22 11:23:08 pve kernel: ? 
exc_page_fault+0x94/0x1b0 Jul 22 11:23:08 pve kernel: entry_SYSCALL_64_after_hwframe+0x78/0x80 Jul 22 11:23:08 pve kernel: RIP: 0033:0x7f9b4bb6fc5b Jul 22 11:23:08 pve kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1c 48 8b 44 24 18 64 48 2b 04 25 28 00 00 Jul 22 11:23:08 pve kernel: RSP: 002b:00007ffce3c437f0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 Jul 22 11:23:08 pve kernel: RAX: ffffffffffffffda RBX: 0000562efb229c70 RCX: 00007f9b4bb6fc5b Jul 22 11:23:08 pve kernel: RDX: 0000000000000000 RSI: 0000000000003b6f RDI: 0000000000000051 Jul 22 11:23:08 pve kernel: RBP: 0000562efb229cf4 R08: 0000000000000000 R09: 0000000000000000 Jul 22 11:23:08 pve kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000562ef985a630 Jul 22 11:23:08 pve kernel: R13: 0000562ef74d3e10 R14: 0000562ef96ef790 R15: 00000000000000c8 Jul 22 11:23:08 pve kernel: </TASK> Jul 22 11:23:08 pve kernel: ---[ end trace 0000000000000000 ]--- Jul 22 11:23:10 pve pvestatd[5932]: VM 100 qmp command failed - VM 100 qmp command 'query-proxmox-support' failed - got timeout Jul 22 11:23:10 pve pvestatd[5932]: status update time (8.072 seconds) Jul 22 11:23:16 pve kernel: vfio-pci 0000:01:00.2: vfio_bar_restore: reset recovery - restoring BARs
pfSense crash report:
msgbuf.txt
ixl0: Reset Requested! (EMPR)
ixl0: ECC Error detected!
ixl0: HMC Error detected!
ixl0: INFO 0xffffffff
ixl0: DATA 0x00000000
ixl0: Rebuilding driver state...

Fatal trap 12: page fault while in kernel mode
cpuid = 11; apic id = 0b
fault virtual address = 0x458
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ccd060
stack pointer = 0x28:0xfffffe00dbb34d90
frame pointer = 0x28:0xfffffe00dbb34e10
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 0 (ixl0 (que 2))
rdi: fffffe00dc1c0620 rsi: 0000000000000004 rdx: ffffffff835d784b
rcx: fffff800023a5740  r8: fffff800023a5c60  r9: fffffe00dbb35000
rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00dbb34e10
r10: 00000000000001f4 r11: 0000000082859739 r12: fffffe00dbb34da8
r13: 0000000000000000 r14: 0000000000000000 r15: fffffe00dc1c0620
trap number = 12
panic: page fault
cpuid = 11
time = 1721665331
KDB: enter: panic
-
Another x710-induced pfSense kernel panic, but this time the hypervisor did not crash.
In 6 months of testing...
- This is the first time I've observed two failures in less than 24 hours, and the first time there are "Malicious Driver Detection" messages in the logs.
- This is only the second time I've observed the x710 cause a kernel panic. All previous failures only resulted in an unresponsive x710 with "queue <num> appears to be hung!" messages.
I've made no notable changes recently that might explain this new failure pattern. Rebooting only the VM fixed it this time.
There are no notable log messages on the hypervisor.
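(By "no notable log messages" I mean nothing vfio or i40e related shows up around the panic with a check along these lines; the time window is illustrative.)

# On the Proxmox host: kernel messages around the time of the pfSense panic
journalctl -k --since "2 hours ago" | grep -iE 'vfio|i40e|0000:01:00'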
Relevant pfSense logs:ixl0: Malicious Driver Detection event 2 on TX queue 0, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 3, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 2 on TX queue 1, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 0, pf number 0 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 2 on TX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 2 on TX queue 4, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 0, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection 
event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 7, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: TX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl1: TX Malicious Driver Detection event (unknown) ixl1: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-1) ixl1: TX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl1: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-1) ixl1: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-0) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 0, pf number 0 (PF-0) ixl0: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-0) ixl1: RX Malicious Driver Detection event (unknown) ixl1: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-1) ixl1: Malicious Driver Detection event 1 on RX queue 2, pf number 0 (PF-1) ixl1: RX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl1: Malicious Driver Detection event 31 on TX queue 4095, pf number 15 (PF-1) ixl1: Malicious Driver 
Detection event 255 on RX queue 16383, pf number 255 (PF-1) ixl1: RX Malicious Driver Detection event (unknown) ixl1: TX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl0: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-0) ixl0: Reset Requested! (POR) ixl0: ECC Error detected! ixl0: HMC Error detected! ixl0: INFO 0xffffffff ixl0: DATA 0xffffffff ixl0: PCI Exception detected! ixl0: Reset Requested! (POR) ixl0: ECC Error detected! ixl1: RX Malicious Driver Detection event (unknown) ixl1: TX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl0: Rebuilding driver state... ixl1: TX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl1: TX Malicious Driver Detection event (unknown) ixl1: RX Malicious Driver Detection event (unknown) ixl1: Malicious Driver Detection event 255 on RX queue 16383, pf number 255 (PF-1) ixl1: Reset Requested! (POR) ixl1: ECC Error detected! ixl1: HMC Error detected! ixl1: INFO 0xffffffff ixl1: DATA 0xffffffff ixl1: PCI Exception detected! ixl1: TX queue 1 still enabled! ixl1: TX queue 4 still enabled! ixl1: TX queue 7 still enabled! ixl0: capability discovery failed; status I40E_ERR_ADMIN_QUEUE_FULL, error OK ixl0: ixl_get_hw_capabilities failed: 19 ixl0: Reload the driver to recover ixl0: Admin Queue is down; resetting... ixl0: capability discovery failed; status I40E_ERR_ADMIN_QUEUE_CRITICAL_ERROR, error OK ixl0: init: Error retrieving HW capabilities; status code 19 ixl0: i40e_aq_get_vsi_params() failed, error -66 aq_error 0 ixl0: initialize vsi failed!! ixl0: Malicious Driver Detection event 1 on RX queue 385, pf number 0 (PF-0) ixl1: Rebuilding driver state... ixl1: PF-ID[1]: VFs 32, MSI-X 129, VF MSI-X 5, QPs 384, MDIO shared ixl1: Allocating 8 queues for PF LAN VSI; 8 queues active ixl1: Rebuilding driver state done. Fatal trap 12: page fault while in kernel mode cpuid = 4; apic id = 04 fault virtual address = 0x458 fault code = supervisor read data, page not present instruction pointer = 0x20:0xffffffff80ccd060 stack pointer = 0x28:0xfffffe00dbb34d90 frame pointer = 0x28:0xfffffe00dbb34e10 code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1 processor eflags = interrupt enabled, resume, IOPL = 0 current process = 0 (ixl0 (que 2)) rdi: fffffe00dc1c0620 rsi: 0000000000000004 rdx: ffffffff835d784b rcx: fffff800023c0740 r8: fffff800023c0c60 r9: fffffe00dbb35000 rax: 0000000000000000 rbx: 0000000000000000 rbp: fffffe00dbb34e10 r10: 00000000000001f4 r11: 00000000803ad992 r12: fffffe00dbb34da8 r13: 0000000000000000 r14: 0000000000000000 r15: fffffe00dc1c0620 trap number = 12 panic: page fault cpuid = 4 time = 1721704766 KDB: enter: panic
At this point I've switched to a different machine running pfSense until I have time to evaluate alternate adapters. I suspect this X710 is not behaving correctly because either the latest driver/firmware is still bugged, likely in combination with my virtualization and authbridge setup, or I got unlucky with a faulty card. Unfortunately, I don't have multiple X710s to test with.
If no new ideas come up, I'll try to remember to post back here in some months when I've settled on an adapter that seems stable long term with this setup.
-
Hmm, painful!
It's not a setup I can test here. Potentially it could be a bad NIC.
If you have no choice, we can make exceptions for transferring the NDI; for example, if you have to replace a NIC because of a hardware failure.
-
Quick follow-up... I swapped the X710-T4L for an X550-T2 a little over 3 months ago and the system has been rock solid ever since. No problems at all. It seems it was either a bad NIC or a driver problem; unfortunately, I'm not planning to test with a different X710 any time soon.