ahcich0: Timeout on slot.....

glippi

Hello,

I recently migrated my servers from ESXi 6.7 to 8.0u2.
As part of this I also installed a new pfSense VM and restored the old VM's configuration onto it.

It all works as intended except for one thing.
Once in a while it becomes unreachable and then comes back. The console shows this:
Feb 7 01:13:16 kernel ahcich0: Timeout on slot 7 port 0
Feb 7 01:13:16 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000180 tfd 40 serr 00000000 cmd 0004df17
Feb 7 01:13:16 kernel ahcich0: ... waiting for slots 00000100
Feb 7 01:13:16 kernel ahcich0: Timeout on slot 8 port 0
Feb 7 01:13:16 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000180 tfd 40 serr 00000000 cmd 0004df17

One from 2 days earlier:
Feb 5 03:08:10 kernel ahcich0: Timeout on slot 5 port 0
Feb 5 03:08:10 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000060 tfd 40 serr 00000000 cmd 0004df17
Feb 5 03:08:10 kernel ahcich0: ... waiting for slots 00000040
Feb 5 03:08:10 kernel ahcich0: Timeout on slot 6 port 0
Feb 5 03:08:10 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000060 tfd 40 serr 00000000 cmd 0004df17

I am not sure myself, but I think it could be caused by the card I use, an Intel 82599 10 gigabit, which may be failing,
or by something wrong with PCI passthrough on ESXi 8. Any help would be appreciated :)

The version of pfSense I use:
2.7.2-RELEASE (amd64)
FreeBSD 14.0-CURRENT

glippi

@glippi For the VM, there seem to be no events logged on my ESXi host.

stephenw10

Is the VM storage accessed via that NIC?

glippi

@stephenw10 It is a VM disk on storage directly attached to the server, no iSCSI or anything. Other VMs on the same datastore have no issues.
The VM has the dual-port Intel 82599 10 gigabit card attached directly to it; one port is WAN and the other LAN. That is why I suspect the card is actually getting disconnected and reconnected. The ESXi host events show nothing like that, though, so maybe it is a driver issue, even though it is the same card and the same pfSense build.

stephenw10

Ah, OK. Seems unlikely to be NIC related then; those messages are from the SATA controller.
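
As a rough aid for reading those lines, here is a minimal Python sketch that decodes the hex values. It assumes that `rs` and the `waiting for slots` value are bitmasks with one bit per AHCI command slot; that reading is an assumption, though it matches the slot numbers in the logs above. The helper name `slots_from_mask` is made up for illustration.

```python
def slots_from_mask(hex_mask: str) -> list[int]:
    """Return the slot numbers whose bits are set in a 32-bit hex bitmask.

    Assumption: bit N set means command slot N is outstanding.
    """
    mask = int(hex_mask, 16)
    return [bit for bit in range(32) if mask & (1 << bit)]


# Values copied from the Feb 7 and Feb 5 log lines in this thread.
print(slots_from_mask("00000180"))  # [7, 8]
print(slots_from_mask("00000100"))  # [8]
print(slots_from_mask("00000060"))  # [5, 6]
print(slots_from_mask("00000040"))  # [6]
```

Under that assumption, both incidents show two commands outstanding on the virtual SATA port, each timing out in turn.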

Does it actually cause an issue when that is shown?

glippi

@stephenw10 Internet and LAN go down, but the VM and pfSense stay up, so it seems there is a hard disconnect on both network ports assigned to the VM from the Intel NIC.

stephenw10

If it was losing the NIC, or even the link, there would be a lot more logged. Is it really only those ahci errors shown?

glippi

@stephenw10 This is everything logged around the time it occurs:
Feb 5 02:00:01 php 35301 [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Feb 5 02:04:00 sshguard 81577 Exiting on signal.
Feb 5 02:04:00 sshguard 67907 Now monitoring attacks.
Feb 5 02:31:00 sshguard 67907 Exiting on signal.
Feb 5 02:31:00 sshguard 27224 Now monitoring attacks.
Feb 5 02:32:00 sshguard 27224 Exiting on signal.
Feb 5 02:32:00 sshguard 55819 Now monitoring attacks.
Feb 5 02:33:00 sshguard 55819 Exiting on signal.
Feb 5 02:33:00 sshguard 86073 Now monitoring attacks.
Feb 5 03:00:00 php 89853 [pfBlockerNG] Starting cron process.
Feb 5 03:00:01 php 89853 [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
Feb 5 03:08:10 kernel ahcich0: Timeout on slot 5 port 0
Feb 5 03:08:10 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000060 tfd 40 serr 00000000 cmd 0004df17
Feb 5 03:08:10 kernel ahcich0: ... waiting for slots 00000040
Feb 5 03:08:10 kernel ahcich0: Timeout on slot 6 port 0
Feb 5 03:08:10 kernel ahcich0: is 00000003 cs 00000000 ss 00000000 rs 00000060 tfd 40 serr 00000000 cmd 0004df17
Feb 5 03:08:21 login 33166 login on ttyv0 as root
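
For anyone wanting to repeat this kind of check on a longer log, here is a minimal Python sketch that keeps only the lines logged within a window around each AHCI timeout. The function names, the 60-second window, and the year are assumptions for illustration only (syslog lines like these carry no year); the sample lines are copied from the log above.

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2024  # hypothetical: the log lines themselves carry no year


def parse(line):
    """Split a syslog-style line into (timestamp, remaining text)."""
    month, day, clock, rest = line.split(maxsplit=3)
    stamp = datetime.strptime(
        f"{ASSUMED_YEAR} {month} {day} {clock}", "%Y %b %d %H:%M:%S"
    )
    return stamp, rest


def lines_near_timeouts(lines, window_seconds=60):
    """Return (timestamp, text) pairs logged close to any ahcich timeout."""
    parsed = [parse(line) for line in lines if line.strip()]
    hits = [t for t, rest in parsed
            if "ahcich" in rest and "Timeout on slot" in rest]
    window = timedelta(seconds=window_seconds)
    return [(t, rest) for t, rest in parsed
            if any(abs(t - hit) <= window for hit in hits)]


SAMPLE = [
    "Feb 5 03:00:01 php 89853 [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload",
    "Feb 5 03:08:10 kernel ahcich0: Timeout on slot 5 port 0",
    "Feb 5 03:08:21 login 33166 login on ttyv0 as root",
]

for stamp, text in lines_near_timeouts(SAMPLE):
    print(stamp.strftime("%b %d %H:%M:%S"), text)
```

With the sample above, only the timeout line and the console login 11 seconds later fall inside the window; the pfBlockerNG entry from eight minutes earlier is dropped.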

stephenw10

Hmm, nothing shown there then.

When this happens do you have to do anything to restore the connection?

glippi

@stephenw10 No, it all just comes back. pfSense uptime shows that it did not go down, but the WAN and LAN connection uptimes show that WAN and LAN did.

stephenw10

Hmm, so it's just down for those 10s or so and nothing else is logged in pfSense?

I would think it pretty much has to be something in the hypervisor with those symptoms.

glippi

@stephenw10 My thoughts too, but no hardware events are reported there. I will gather some more logs soon and share them here. For now it has not recurred.

glippi

@stephenw10 Hello, just a small update. I found 2 out of 6 SSDs in my RAID with the SMART alert flag set (it is RAID6, and I honestly think a disk problem would not show up only on pfSense and not on the other VMs on the same drive span),
and I am moving the links away from the Intel NIC to the onboard NIC for WAN and a VM NIC for LAN. So far, moving LAN off the Intel NIC has made the connection more stable and I have not seen the error since.

It seems that the Intel NIC (Intel 82599 10 gigabit) is poorly supported in ESXi 8u2 compared with 6.7u2, so that might also be part of the issue,
since the hypervisor did change and the error never occurred on the previous one (same pfSense config, other ESXi version).

I therefore think, at the moment, that the NIC and its support in ESXi 8 vs 6.7 caused this error. Today I am swapping out the drives flagged by SMART.

glippi

@glippi Hello, after swapping out the Intel 82599 10 gigabit dual-port card and moving to the onboard Ethernet with PCI passthrough to the VM, everything seems stable now.
The AHCI errors look consistent with disk problems (as far as I can see), but in this case the cause seems to have been the Ethernet controller.
My guess is that it was just a faulty card, since I have had no issues for 18 days since moving over.

If you would like any logs, please let me know which ones you would like to see for further investigation.