After upgrade to 2.4.5, primary in HA pair stops sending CARP adv. momentarily after firewall rule changes are applied

techstone

Hi,

I have a pair of Netgate XG-1541 firewalls running in HA.

This weekend I upgraded them from 2.4.3-p1 to 2.4.5. Since then, I've noticed that when I make a firewall rule change on the primary and apply it, the secondary suddenly (and temporarily) becomes CARP master on its interfaces and starts some services, such as OpenVPN and IPsec. This doesn't last long, but it does happen every time I apply the change in the primary's GUI.

Running a tcpdump on one of the firewall's CARP VLANs does show that, shortly after I apply a firewall rule change, CARP advertisements do stop momentarily on the primary.

This is what the primary shows as I apply the change:

May 19 17:46:19	php-fpm		            /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
May 19 17:46:19	php-fpm		            /rc.filter_synchronize: New alert found: A communications error occurred while attempting to call XMLRPC method restore_config_section:
May 19 17:46:19	php-fpm		            /rc.filter_synchronize: A communications error occurred while attempting to call XMLRPC method restore_config_section:
May 19 17:45:21	check_reload_status		Reloading filter
May 19 17:45:18	php-fpm		            /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
May 19 17:45:18	php-fpm		            /rc.filter_synchronize: XMLRPC versioncheck: 19.1 -- 19.1
May 19 17:45:18	php-fpm		            /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.host_firmware_version).
May 19 17:45:18	php-fpm		            /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
May 19 17:45:17	check_reload_status		Syncing firewall

And this is what the secondary's logs show:

May 19 17:45:35	kernel		carp: 1@lagg0.4005: BACKUP -> MASTER (master timed out)
May 19 17:45:35	check_reload_status		Carp master event
May 19 17:45:35	kernel		carp: 6@cxl0: BACKUP -> MASTER (master timed out)
May 19 17:45:35	kernel		carp: 1@lagg0.52: BACKUP -> MASTER (master timed out)
May 19 17:45:35	kernel		carp: 1@lagg0.4003: BACKUP -> MASTER (master timed out)
May 19 17:45:35	check_reload_status		Carp master event
May 19 17:45:35	kernel		carp: 1@lagg0.4010: BACKUP -> MASTER (master timed out)
May 19 17:45:22	php-fpm	75652	/rc.carpbackup: HA cluster member "(2001:598:1009:302f:0:0:0:1@lagg0.4002): (TS2_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2
May 19 17:45:22	php-fpm	47080	/rc.carpbackup: HA cluster member "(2001:598:1009:301f:0:0:0:1@lagg0.4001): (TS1_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2
May 19 17:45:22	php-fpm	12044	/xmlrpc.php: waiting for pfsync...
May 19 17:45:21	php-fpm	66486	/rc.carpbackup: HA cluster member "(2001:598:1009:301f:0:0:0:1@lagg0.4001): (TS1_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2
May 19 17:45:21	php-fpm	87173	/rc.carpbackup: HA cluster member "(2001:598:1009:302f:0:0:0:1@lagg0.4002): (TS2_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2
May 19 17:45:21	check_reload_status		Carp backup event
May 19 17:45:21	kernel		ifa_maintain_loopback_route: deletion failed for interface lagg0.4001: 3
May 19 17:45:21	kernel		carp: 2@lagg0.4001: BACKUP -> INIT (hardware interface up)
May 19 17:45:20	check_reload_status		Carp backup event
May 19 17:45:20	kernel		ifa_maintain_loopback_route: deletion failed for interface lagg0.4002: 3
May 19 17:45:20	kernel		carp: 2@lagg0.4002: BACKUP -> INIT (hardware interface up)
May 19 17:45:20	check_reload_status		Carp backup event
May 19 17:45:20	kernel		carp: 2@lagg0.4001: INIT -> BACKUP (initialization complete)
May 19 17:45:20	check_reload_status		Carp backup event
May 19 17:45:20	kernel		carp: 2@lagg0.4002: INIT -> BACKUP (initialization complete)
May 19 17:45:20	check_reload_status		Syncing firewall

Note that on tcpdump the primary stopped sending CARP advertisements at 17:45:32, the secondary took and sent two of its own, and at 17:45:37 the primary started sending them again.

I'm not really sure how to troubleshoot this further. Has anyone seen this behaviour before? Why would CARP on the primary stop sending every time I apply a firewall rule configuration change? Any hints on where I can look to get more info?

Thanks,
-Martin

techstone

As I continue troubleshooting I'm realizing that this is not really a CARP issue; CARP is just a component that's affected by a larger problem.

To give a bit of context, the firewalls in the HA pair are configured so that both 10GBASE-T ports on each device are aggregated into a LAG (interface lagg0) that connects to two different Juniper QFX5100 switches that are configured to operate in a virtual chassis.

It seems that whenever I apply a firewall filter rule configuration change the Juniper logs show that the LAG is going down because LACP messages are no longer being received (momentarily) from the primary firewall. However, the physical layer of the 10GBASE-T interfaces that are members of the LAG stay up, so this is not due to loss of Ethernet connectivity.

I am continuing to investigate.

techstone

I don't know what to make of this anymore.

I ran a tcpdump on the primary firewall (tcpdump -i ix0 -nn -e ether proto 0x8890) and confirmed that once I apply a firewall rule config change, LACP advertisements stop being sent from a firewall for a short time. Normally the advertisements are sent every second, but when they stop they seem to stop for 3-5 seconds. This is enough for the Juniper switch to declare the LAG down in its logs.

I also decided to run this short script:

while true
do
echo -n "#"
sleep 0.1
done

and look at the output on a monitor connected to the VGA port of the XG-1541. Turns out that after I apply a config change, the "#" sequence freezes momentarily at the same time as LACP advertisements stop being sent. When the sequence does continue after a few seconds, it seems to "catch up" instantly on displaying the "#", as if they were still being generated and buffered, even if the display seemed frozen.

I'm not a hardware person, so I don't know what to make of this, but is it reasonable to say that the primary firewall is locking up momentarily when applying changes? I don't recall seeing this behavior before, when I was on 2.4.3; is this a behaviour that anyone else has seen?

Thanks.

pete.s.

You have a misconfiguration because pfsense uses slow LACP while you have your switches set to fast LACP.

Others have had noticed this as well and now there is a setting for slow/fast LACP coming up.
https://forum.netgate.com/topic/152862/lacp-doesn-t-work-reliably-slow-pdu-transmission-rate-suspected

I haven't checked the logs but when you apply changes on the master it always takes a few seconds before it has completed. If you turn off CARP it's much faster. This is on 2.4.4 and lower. We haven't upgraded to 2.4.5 since it doesn't look to be 100% there yet.

BTW, there is some negotiating going on with LACP regarding slow/fast timeout. With one end set to slow and the other to fast, the timeout will be 90 seconds in one direction and 3 seconds in the other direction.

techstone

Thanks for your reply @pete-s.

You're right that the Juniper was set to LACP fast while the firewall is set to LACP slow. As this is the configuration that I've had for the pat couple of years without any issue I did not think it was a problem, but I nonetheless took you advice and set the Juniper to LACP slow as well, to make sure to eliminate that as the source of problem. The Juniper now reports the following:

root@s116r2csw01> show lacp interfaces ae32 
Aggregated interface: ae32
    LACP state:       Role   Exp   Def  Dist  Col  Syn  Aggr  Timeout  Activity
      xe-3/0/32      Actor    No    No   Yes  Yes  Yes   Yes     Slow    Active
      xe-3/0/32    Partner    No    No   Yes  Yes  Yes   Yes     Slow    Active
      xe-2/0/32      Actor    No    No   Yes  Yes  Yes   Yes     Slow    Active
      xe-2/0/32    Partner    No    No   Yes  Yes  Yes   Yes     Slow    Active
    LACP protocol:        Receive State  Transmit State          Mux State 
      xe-3/0/32                 Current   Slow periodic Collecting distributing
      xe-2/0/32                 Current   Slow periodic Collecting distributing

{master:1}

I then tried to make another firewall filter change on the primary pfSense and apply it to see if things would be different this time.

The "good" news: the Juniper switch no longer has events showing the AE interface as down

The bad news: unfortunately the CARP issue still occur. A tcpdump on the secondary firewall shows that CARP advertisements stop being sent long enough from the primary for the secondary to think that the primary is dead and takle over CARP IP momentarily on all VLANs.

So although the LACP timeout is "fixed", the CARP issue still occurs. Moreover, based on the output of the above script printing hash symbols, and the fact that CARP advertisements suddenly stop, the firewall still appears to "seize up" for 4-5 seconds about 10-15 seconds after firewall changes are applied. And XMLRPC sync still reports an error, likely due to this "freezing".

I really don't see how such a software upgrade can suddenly induce such a "stop the world" behaviour in the box, but since this is the only thing that's changed, I don't know what else to blame it on. At the same time, I'm sure others would have reported it if the 2.4.5 software introduced this side effect.

I'm really at a loss to explain what's going on.

SteveITS

Is there a CPU spike when the changes are applied? There were several posts about a month ago when 2.4.5 came out about a bug with the underlying FreeBSD code and more than 64k table entries. Try disabling the IPv6 bogon block and/or pfBlockerNG aliases. It was especially prevalent on VMs with more than one CPU but applied to various hardware as well.

jimp

Most likely this: https://forum.netgate.com/post/908806

techstone

Thank you both @teamits and @jimp for the pointers!

As you both referenced info about IPv6 bogons I checked the table with pfctl -t bogonsv6 -T show and it was indeed quite large on the firewalls, being populated by the contents of /etc/bogonsv6. Since I have almost 80 interfaces (VLANs) defined on each firewall, I did not want to have to go to each one and uncheck Block bogon networks, so I did the lazy thing instead and:

cp /etc/bogonsv6 /etc/bogonsv6.bak
cp /dev/null /etc/bogonsv6

on both firewalls and then applied an arbitrary config change to trigger a reload of the firewall rules.

From that moment on both firewalls performed much better and the lock ups and CARP ping-pongs disappeared. I've got the bogon updates set to Monthly, so I'll need to re-empty /etc/bogonsv6 again in a few days, but doing this once a month as a workaround is fine for me until I can upgrade to a release where the locks up are fixed.

Thanks a million again.