Pfsense 2.4.5 - Bug? "bge0: firmware handshake timed out, found 0x4b657654" dropping WAN interface needing reboot.

Aterfax

Error message in the logs when my WAN interface stops functioning and cannot reconnect until reboot:

bge0: firmware handshake timed out, found 0x4b657654

Hardware is an HP Gen 8 Microserver; pfSense is running in a VM inside Unraid.

Pfsense version 2.4.5 and the network card is a HPE Ethernet 1Gb 2-port NC332i Adapter (BCM5720 Broadcom).

The interface runs PPPoE via a modem, and the card itself is handed to pfSense via hardware pass-through.

Nothing obvious in the VM logs. I didn't have the issue with the previous release, and it seems to happen like clockwork every 24 hrs, in the early morning (1am).

Anyone got any ideas? Having to reboot every 24 hours is sub-optimal.

I am probably going to have to bridge the interface with the host in the meantime.

Possibly a regression of the behaviour reported in https://redmine.pfsense.org/issues/6423 ?

stephenw10

Seems unrelated to that bug; it's not a PPPoE issue.

What happens at 1am? Check the cron entries; installing the Cron package makes that easier.

Steve

noplan

Maybe a renewal of your WAN IP from your ISP?

Aterfax

@stephenw10 - It's not cron related afaik. It's roughly every 24 hrs, but not at an exact time, and sometimes it skips a day. The bug I referenced is possibly related, as the reporter there has an issue with the same Broadcom chipset and it sounds like exactly the same behaviour.

Here's cron anyway:

@noplan Renewing the WAN IP from the ISP isn't really applicable, since I have a static address. If you mean a reconnect:

I can see the PPP daemon trying to reconnect. I have manually tried to make it reconnect, and I have brought the interface down and up and then tried to reconnect manually.

Nada - only rebooting seems to bring it back online which would seem to agree with the error message that the card has dropped from the kernel for some reason or another.

stephenw10

You have the dyndns update running at 01:01. Hard to imagine that could kill the NIC somehow, but it's easy to test by disabling it.

Steve

Aterfax

It's not that, since it did it again just now at 00:37. Log output below:

https://pastebin.com/raw/6KddXiNT

stephenw10

Ok so you are also seeing timeout errors on em1 but that is able to recover:

May 22 00:37:30 oakenshield kernel: em1: Watchdog timeout -- resetting
May 22 00:37:30 oakenshield kernel: em1: 2 link states coalesced
May 22 00:37:30 oakenshield kernel: em1: link state changed to UP
May 22 00:37:30 oakenshield kernel: bge0: watchdog timeout -- resetting

'Link states coalesced' implies it was flapping too fast to show each state.
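
If you want to see how often each NIC hits a watchdog timeout across the full log, here is a rough Python sketch. It is not something pfSense ships; the regex is only an assumption based on the four kernel lines above, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern, derived from lines such as:
#   May 22 00:37:30 oakenshield kernel: em1: Watchdog timeout -- resetting
# Case is ignored because em reports "Watchdog" and bge reports "watchdog".
WATCHDOG_RE = re.compile(
    r"kernel: (?P<nic>[a-z]+\d+): watchdog timeout", re.IGNORECASE
)


def count_watchdog_timeouts(lines):
    """Return a Counter of watchdog timeouts keyed by interface name."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group("nic")] += 1
    return counts


# The four log lines quoted above.
sample = [
    "May 22 00:37:30 oakenshield kernel: em1: Watchdog timeout -- resetting",
    "May 22 00:37:30 oakenshield kernel: em1: 2 link states coalesced",
    "May 22 00:37:30 oakenshield kernel: em1: link state changed to UP",
    "May 22 00:37:30 oakenshield kernel: bge0: watchdog timeout -- resetting",
]

print(count_watchdog_timeouts(sample))
```

Feeding it the whole system log instead of the sample should show whether em1 and bge0 always time out together or whether one of them goes first.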

Having em1 also implies you have em0. The first thing I would try there is to swap the em0 and bge0 interface assignments.

Steve

Aterfax

Can you explain what you mean?

I do have an em0 but swapping them is, in a sense, impossible.

em0 and all emX interfaces are virtual adaptors from KVM which are connected to bridged VLANs on the host.

bge0 is a physical device - a PCI passthrough from the host of one of the ports on the physical card.

Swapping the passthrough to the other port would only change the port on the card (it would still show up as bge0), while the virtual adaptors would still show up as emX. (Not sure that would really change anything at all?)

stephenw10

Ah, OK. Yeah no way to do that then.

Hmm, hard to say if em1 timing out is a symptom or cause there. Can you switch those out to virtio NICs?

Steve

Aterfax

@stephenw10 Will swap those out to virtio now. However, I think when I had them as virtio before, they did not work correctly in some way.

Edit: Seems to be functioning with virtio adaptors well enough in the short term.

stephenw10

Good to hear. I'm not aware of any issues with virtio. I use them here in Proxmox for a number of VMs and have not seen any problems.

Steve

Aterfax

It dropped again this morning around 00:15. I'm not sure what to make of the logs, but this time the drop coincided with pfctl driving the CPU to 100%. Log output below, including some more of the PPP connection logs, though I'm not sure of their relevance:

https://pastebin.com/hAWS3Gzi

stephenw10

Ah, if you were seeing pfctl at 100% you're probably hitting this: https://redmine.pfsense.org/issues/10414

You can test that by pinging the firewall and running Status > Filter reload. If you see ping times spike to ridiculous levels you are hitting it. Try disabling smp as shown in comment 15 on that report.

That is fixed in 2.4.5p1 which should be available soon.

Steve

Aterfax

Doing a filter reload gave me:

Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time=12ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time=3175ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64
Reply from 10.0.10.1: bytes=32 time=1316ms TTL=64
Reply from 10.0.10.1: bytes=32 time=2ms TTL=64
Reply from 10.0.10.1: bytes=32 time<1ms TTL=64

So it might not be that. That said, do I really want to disable SMP? Won't this result in a significant performance hit?

stephenw10

Yeah, that's a huge latency. When it reloads normally it's barely noticeable.
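
If you want to check a longer ping capture for spikes like those, here is a minimal Python sketch. It assumes Windows-style ping output as in your post, and the 100 ms threshold and the function name are arbitrary choices of mine, not anything from the bug report.

```python
import re

# Matches Windows ping replies such as "time=3175ms" or "time<1ms".
TIME_RE = re.compile(r"time(?P<op>[=<])(?P<ms>\d+)ms")


def find_spikes(ping_lines, threshold_ms=100):
    """Return (line_index, latency_ms) for replies at or above threshold_ms."""
    spikes = []
    for index, line in enumerate(ping_lines):
        match = TIME_RE.search(line)
        if not match:
            continue  # not a reply line (e.g. a timeout or a header)
        if match.group("op") == "<":
            continue  # "time<1ms" is sub-millisecond, never a spike
        latency = int(match.group("ms"))
        if latency >= threshold_ms:
            spikes.append((index, latency))
    return spikes


# Latencies taken from the ping output posted above, in order.
times = ["<1ms", "<1ms", "<1ms", "=12ms", "<1ms", "<1ms",
         "=3175ms", "<1ms", "=1316ms", "=2ms", "<1ms"]
sample = [f"Reply from 10.0.10.1: bytes=32 time{t} TTL=64" for t in times]

for index, latency in find_spikes(sample):
    print(f"reply {index}: {latency} ms")
```

Run the ping, trigger Status > Filter reload, and pass the captured lines in. Anything it reports lines up with the reload if the timing matches.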

I would at least test disabling smp to see if it solves the issue. If it does, you are hitting that bug, and since it is fixed in 2.4.5p1, upgrading will be the permanent solution.

Steve