Issues after upgrade 2.4.4p3 to 2.4.5



  • Hi,

So we have a simple setup: 2 VMs with pfSense, pfsync, and 2 CARP interfaces (one of them has 10 Virtual IPs assigned).
Both VMs have the same specs:
4 vCPU
32GB RAM
32GB HDD
3 NICs (WAN, LAN, HA/pfsync)

    Only two extra packages installed:

    • haproxy (0.60_4) 1.8.25
    • acme (0.6.6)

One VM resides on Hyper-V 2016 (Core), the other runs on Proxmox with VirtIO devices. They were running great, but two weeks after the release of 2.4.5 I decided to update them to the newest version, and then the problems began.

At first, one VM (the one on Proxmox) just stopped working - however, all CARP/VIPs were still on it and were pingable, but none of the firewall/NAT rules were working at all. The one exception was haproxy: everything related to haproxy kept working flawlessly (all redirections to destination IPs, etc.). After a reboot, all CARP/VIPs moved to the second VM and it worked fine - and after another reboot, everything moved back to the VM that had failed minutes earlier.

But the issue kept repeating at random times. So now I have only the second machine running (the one on Hyper-V), because when the Proxmox VM is powered up, it boots and starts properly, fetches all CARP/VIPs, works for a random time (5 minutes - 12 hours), and then dies again in the same way.
There is nothing in /var/log/system.log - the only problem that is logged is high latency when pinging the WAN gateway and LAN gateway.

    May 1 23:15:33 router01 dpinger: WAN_GW: Clear latency 332960us stddev 2447163us loss 0%
    May 1 23:15:57 router01 dpinger: WAN_GW: Alarm latency 570852us stddev 3194208us loss 0%
    May 1 23:16:54 router01 dpinger: WAN_GW: Clear latency 328597us stddev 2507297us loss 0%
    May 1 23:16:58 router01 dpinger: LAN_GW1: Clear latency 336645us stddev 2542827us loss 0%
    May 1 23:17:19 router01 dpinger: WAN_GW: Alarm latency 500508us stddev 3096165us loss 0%
    

After that, router01 (the pfSense VM on Proxmox) is just unresponsive, while router02 (the pfSense VM on Hyper-V) has worked for more than one week and nothing has happened (yet).
The same story happens when only router01 is online - it just hangs randomly. There is no loop on the network - we checked 4 times. We tried a clean reinstall of the VM on Proxmox; the problem was the same. We tried migrating that VM between different Proxmox hosts in the cluster - no luck.

Is this some kind of known issue? Should we wait for a fix, or go down the painful road of downgrading to 2.4.4p3?



Known issue. Search the Installation and Upgrades sub-forum and you will find this thread: https://forum.netgate.com/topic/151819/2-4-5-high-latency-and-packet-loss-not-in-a-vm. Read the entire thread and you will see the issue affects both bare-metal and VM installations. It is triggered by large address tables such as the IPv6 bogons table or large pfBlockerNG or DNSBL tables. Those are the main culprits.

Expected to be fixed in 2.4.5-p1 when it is released. You can downgrade to 2.4.4_p3; there is a procedure listed in one of the posts in the thread I linked (if I recall correctly - otherwise, search the forum for downgrading or reinstalling 2.4.4_p3 and you should find it).



  • @bmeeks Thx mate!

Downgrading is not so easy for us due to the haproxy config (which apparently is not part of the pfSense config backup) and due to the acme/LE certificates. But it seems there are not many other options.



  • @beria-pl said in Issues after upgrade 2.4.4p3 to 2.4.5:

    @bmeeks Thx mate!

Downgrading is not so easy for us due to the haproxy config (which apparently is not part of the pfSense config backup) and due to the acme/LE certificates. But it seems there are not many other options.

Many users have had some success by doing two things. First, cut the VM back to a single core if possible. Second, go to INTERFACES in pfSense and make sure the "Block Bogons" checkbox is NOT checked for each interface, especially if you are using IPv6. The problem is triggered by pfctl and pf trying to manipulate large tables of IP addresses in memory. Disabling bogons blocking negates the need to create and manage that big table of IP addresses.

    You don't mention using pfBlockerNG, so I'm assuming its IP lists won't be a problem. But if you use or start using pfBlockerNG or DNSBL, then expect the issue to reappear.
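If you want to confirm from the shell that the change took effect, pf's loaded tables can be inspected directly with pfctl. A minimal sketch - the table names `bogons` and `bogonsv6` are the stock pfSense names and are an assumption here:

```shell
#!/bin/sh
# Check whether pf still has a bogon table loaded after unchecking
# "Block Bogons" on each interface and reloading the filter.
# "pfctl -s Tables" lists the names of all currently loaded tables.
if pfctl -s Tables 2>/dev/null | grep -q 'bogons'; then
    echo "bogon table(s) still loaded - reload the filter or recheck the interfaces"
else
    echo "no bogon tables loaded"
fi
```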



Nope - we’re not using anything more than the haproxy and acme packages.

For sure I’ll take a shot at a single CPU and the bogons setting.

    I’ll check and post output tomorrow!



Unfortunately this bug came in with the update to FreeBSD-11.3-STABLE as the underlying OS for pfSense. The code maintainers for FreeBSD made a fix to the pf firewall engine for a different bug, but that fix unfortunately had a nasty and unanticipated side effect with very large address tables. It triggers a memory allocation inefficiency in the kernel, and that is responsible for the lags and stalls. The OS basically freezes for a while during the memory allocations required to create and/or update large IP address tables used by the pf firewall engine. In this case, "large" means any table over 65,535 entries.
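Given that threshold, a quick way to spot risky tables is to count their entries with pfctl. A rough sketch - the table names are from a stock pfSense install (an assumption) and may differ on yours:

```shell
#!/bin/sh
# Flag pf address tables that exceed the 65,535-entry threshold
# described above. Table names "bogons"/"bogonsv6" are assumptions
# (stock pfSense names).
THRESHOLD=65535

# helper: true if the count in $1 exceeds the threshold
is_large() {
    [ "$1" -gt "$THRESHOLD" ]
}

for table in bogons bogonsv6; do
    # "pfctl -t <name> -T show" prints one address/network per line;
    # strip the leading spaces BSD wc adds before the count
    count=$(pfctl -t "$table" -T show 2>/dev/null | wc -l | tr -d ' ')
    if is_large "$count"; then
        echo "$table: $count entries (over $THRESHOLD, may trigger the stalls)"
    else
        echo "$table: $count entries"
    fi
done
```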



  • @bmeeks

    Thanks - works like a charm, after setting up 1 vCPU on both.

At least it is working now - 30 minutes without any issues.
Is there any timeframe to expect this fix? Or, in the longer term, might it be better to wait for 2.5.0p1 ;) and survive with 2.4.5 for now, or downgrade to 2.4.4p3?



  • @beria-pl said in Issues after upgrade 2.4.4p3 to 2.4.5:

    @bmeeks

    Thanks - works like a charm, after setting up 1 vCPU on both.

At least it is working now - 30 minutes without any issues.
Is there any timeframe to expect this fix? Or, in the longer term, might it be better to wait for 2.5.0p1 ;) and survive with 2.4.5 for now, or downgrade to 2.4.4p3?

    I am not privy to the release dates as I am not affiliated with Netgate. As with pretty much every software company out there, Netgate is usually tight-lipped about release schedules (at least ones with very specific target dates). I suspect companies do this to minimize flak in the event they miss the release date due to unforeseen issues that may crop up.

    I personally don't expect a long delay in the 2.4.5-p1 fix for this issue, but whether that is later this week or several months from now, I have no idea. If one virtual CPU appears to be working for you, then I would suggest staying on the 2.4.5-RELEASE and not moving to 2.5.0-DEVEL as that branch understandably may have issues crop up -- especially if you keep up with the snapshot updates.

The upstream FreeBSD guys merged the fix into FreeBSD-11.3-STABLE on May 11th, and as far as I can tell from the GitHub updates, the pfSense team is keeping up. So maybe the fix release won't be too far away.

