After upgrade to 2.4.5, primary in HA pair stops sending CARP adv. momentarily after firewall rule changes are applied
-
Hi,
I have a pair of Netgate XG-1541 firewalls running in HA.
This weekend I upgraded them from 2.4.3-p1 to 2.4.5. Since then, I've noticed that when I make a firewall rule change on the primary and apply it, the secondary suddenly (and temporarily) becomes CARP master on its interfaces and starts some services, such as OpenVPN and IPsec. This doesn't last long, but it does happen every time I apply the change in the primary's GUI.
Running a tcpdump on one of the firewall's CARP VLANs does show that, shortly after I apply a firewall rule change, CARP advertisements do stop momentarily on the primary.
This is what the primary shows as I apply the change:
May 19 17:46:19 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php. May 19 17:46:19 php-fpm /rc.filter_synchronize: New alert found: A communications error occurred while attempting to call XMLRPC method restore_config_section: May 19 17:46:19 php-fpm /rc.filter_synchronize: A communications error occurred while attempting to call XMLRPC method restore_config_section: May 19 17:45:21 check_reload_status Reloading filter May 19 17:45:18 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php. May 19 17:45:18 php-fpm /rc.filter_synchronize: XMLRPC versioncheck: 19.1 -- 19.1 May 19 17:45:18 php-fpm /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.host_firmware_version). May 19 17:45:18 php-fpm /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php. May 19 17:45:17 check_reload_status Syncing firewall
And this is what the secondary's logs show:
May 19 17:45:35 kernel carp: 1@lagg0.4005: BACKUP -> MASTER (master timed out) May 19 17:45:35 check_reload_status Carp master event May 19 17:45:35 kernel carp: 6@cxl0: BACKUP -> MASTER (master timed out) May 19 17:45:35 kernel carp: 1@lagg0.52: BACKUP -> MASTER (master timed out) May 19 17:45:35 kernel carp: 1@lagg0.4003: BACKUP -> MASTER (master timed out) May 19 17:45:35 check_reload_status Carp master event May 19 17:45:35 kernel carp: 1@lagg0.4010: BACKUP -> MASTER (master timed out) May 19 17:45:22 php-fpm 75652 /rc.carpbackup: HA cluster member "(2001:598:1009:302f:0:0:0:1@lagg0.4002): (TS2_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2 May 19 17:45:22 php-fpm 47080 /rc.carpbackup: HA cluster member "(2001:598:1009:301f:0:0:0:1@lagg0.4001): (TS1_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2 May 19 17:45:22 php-fpm 12044 /xmlrpc.php: waiting for pfsync... May 19 17:45:21 php-fpm 66486 /rc.carpbackup: HA cluster member "(2001:598:1009:301f:0:0:0:1@lagg0.4001): (TS1_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2 May 19 17:45:21 php-fpm 87173 /rc.carpbackup: HA cluster member "(2001:598:1009:302f:0:0:0:1@lagg0.4002): (TS2_FW_P2P_UPLINK)" has resumed CARP state "BACKUP" for vhid 2 May 19 17:45:21 check_reload_status Carp backup event May 19 17:45:21 kernel ifa_maintain_loopback_route: deletion failed for interface lagg0.4001: 3 May 19 17:45:21 kernel carp: 2@lagg0.4001: BACKUP -> INIT (hardware interface up) May 19 17:45:20 check_reload_status Carp backup event May 19 17:45:20 kernel ifa_maintain_loopback_route: deletion failed for interface lagg0.4002: 3 May 19 17:45:20 kernel carp: 2@lagg0.4002: BACKUP -> INIT (hardware interface up) May 19 17:45:20 check_reload_status Carp backup event May 19 17:45:20 kernel carp: 2@lagg0.4001: INIT -> BACKUP (initialization complete) May 19 17:45:20 check_reload_status Carp backup event May 19 17:45:20 kernel carp: 2@lagg0.4002: INIT -> BACKUP (initialization complete) May 19 17:45:20 check_reload_status Syncing firewall
Note that on tcpdump the primary stopped sending CARP advertisements at 17:45:32, the secondary took and sent two of its own, and at 17:45:37 the primary started sending them again.
I'm not really sure how to troubleshoot this further. Has anyone seen this behaviour before? Why would CARP on the primary stop sending every time I apply a firewall rule configuration change? Any hints on where I can look to get more info?
Thanks,
-Martin -
As I continue troubleshooting I'm realizing that this is not really a CARP issue; CARP is just a component that's affected by a larger problem.
To give a bit of context, the firewalls in the HA pair are configured so that both 10GBASE-T ports on each device are aggregated into a LAG (interface lagg0) that connects to two different Juniper QFX5100 switches that are configured to operate in a virtual chassis.
It seems that whenever I apply a firewall filter rule configuration change the Juniper logs show that the LAG is going down because LACP messages are no longer being received (momentarily) from the primary firewall. However, the physical layer of the 10GBASE-T interfaces that are members of the LAG stay up, so this is not due to loss of Ethernet connectivity.
I am continuing to investigate.
-
I don't know what to make of this anymore.
I ran a tcpdump on the primary firewall (
tcpdump -i ix0 -nn -e ether proto 0x8890
) and confirmed that once I apply a firewall rule config change, LACP advertisements stop being sent from a firewall for a short time. Normally the advertisements are sent every second, but when they stop they seem to stop for 3-5 seconds. This is enough for the Juniper switch to declare the LAG down in its logs.I also decided to run this short script:
while true do echo -n "#" sleep 0.1 done
and look at the output on a monitor connected to the VGA port of the XG-1541. Turns out that after I apply a config change, the "#" sequence freezes momentarily at the same time as LACP advertisements stop being sent. When the sequence does continue after a few seconds, it seems to "catch up" instantly on displaying the "#", as if they were still being generated and buffered, even if the display seemed frozen.
I'm not a hardware person, so I don't know what to make of this, but is it reasonable to say that the primary firewall is locking up momentarily when applying changes? I don't recall seeing this behavior before, when I was on 2.4.3; is this a behaviour that anyone else has seen?
Thanks.
-
You have a misconfiguration because pfsense uses slow LACP while you have your switches set to fast LACP.
Others have had noticed this as well and now there is a setting for slow/fast LACP coming up.
https://forum.netgate.com/topic/152862/lacp-doesn-t-work-reliably-slow-pdu-transmission-rate-suspectedI haven't checked the logs but when you apply changes on the master it always takes a few seconds before it has completed. If you turn off CARP it's much faster. This is on 2.4.4 and lower. We haven't upgraded to 2.4.5 since it doesn't look to be 100% there yet.
BTW, there is some negotiating going on with LACP regarding slow/fast timeout. With one end set to slow and the other to fast, the timeout will be 90 seconds in one direction and 3 seconds in the other direction.
-
Thanks for your reply @pete-s.
You're right that the Juniper was set to LACP fast while the firewall is set to LACP slow. As this is the configuration that I've had for the pat couple of years without any issue I did not think it was a problem, but I nonetheless took you advice and set the Juniper to LACP slow as well, to make sure to eliminate that as the source of problem. The Juniper now reports the following:
root@s116r2csw01> show lacp interfaces ae32 Aggregated interface: ae32 LACP state: Role Exp Def Dist Col Syn Aggr Timeout Activity xe-3/0/32 Actor No No Yes Yes Yes Yes Slow Active xe-3/0/32 Partner No No Yes Yes Yes Yes Slow Active xe-2/0/32 Actor No No Yes Yes Yes Yes Slow Active xe-2/0/32 Partner No No Yes Yes Yes Yes Slow Active LACP protocol: Receive State Transmit State Mux State xe-3/0/32 Current Slow periodic Collecting distributing xe-2/0/32 Current Slow periodic Collecting distributing {master:1}
I then tried to make another firewall filter change on the primary pfSense and apply it to see if things would be different this time.
The "good" news: the Juniper switch no longer has events showing the AE interface as down
The bad news: unfortunately the CARP issue still occur. A tcpdump on the secondary firewall shows that CARP advertisements stop being sent long enough from the primary for the secondary to think that the primary is dead and takle over CARP IP momentarily on all VLANs.
So although the LACP timeout is "fixed", the CARP issue still occurs. Moreover, based on the output of the above script printing hash symbols, and the fact that CARP advertisements suddenly stop, the firewall still appears to "seize up" for 4-5 seconds about 10-15 seconds after firewall changes are applied. And XMLRPC sync still reports an error, likely due to this "freezing".
I really don't see how such a software upgrade can suddenly induce such a "stop the world" behaviour in the box, but since this is the only thing that's changed, I don't know what else to blame it on. At the same time, I'm sure others would have reported it if the 2.4.5 software introduced this side effect.
I'm really at a loss to explain what's going on.
-
Is there a CPU spike when the changes are applied? There were several posts about a month ago when 2.4.5 came out about a bug with the underlying FreeBSD code and more than 64k table entries. Try disabling the IPv6 bogon block and/or pfBlockerNG aliases. It was especially prevalent on VMs with more than one CPU but applied to various hardware as well.
-
Most likely this: https://forum.netgate.com/post/908806
-
Thank you both @teamits and @jimp for the pointers!
As you both referenced info about IPv6 bogons I checked the table with
pfctl -t bogonsv6 -T show
and it was indeed quite large on the firewalls, being populated by the contents of/etc/bogonsv6
. Since I have almost 80 interfaces (VLANs) defined on each firewall, I did not want to have to go to each one and uncheck Block bogon networks, so I did the lazy thing instead and:cp /etc/bogonsv6 /etc/bogonsv6.bak cp /dev/null /etc/bogonsv6
on both firewalls and then applied an arbitrary config change to trigger a reload of the firewall rules.
From that moment on both firewalls performed much better and the lock ups and CARP ping-pongs disappeared. I've got the bogon updates set to Monthly, so I'll need to re-empty
/etc/bogonsv6
again in a few days, but doing this once a month as a workaround is fine for me until I can upgrade to a release where the locks up are fixed.Thanks a million again.