2.4.1: pfSense lockup with CARP on bridge interface
-
After successfully upgrading one SG-4860 and one APU2C4 from 2.3.4-p1 to 2.4.1, they both showed the same behavior: after a random number of minutes the routers would become unreachable over the network. Both systems repeatedly showed the same exact behavior 100% of the time over many reboots, so I don't believe the problem is related to the hardware platform.
Logging in over console showed the routers had not crashed and I could run various commands. Weird behaviors observed:
1. Pinging a known reachable IP in the same subnet would time out (i.e. no output). When I tried killing the ping command with CTRL+C I would only get ^C printed on screen, but the program did not end. Since this was run over console and I had no IP reachability, there was no other way to attempt to kill the ping command.
2. Plugging in an external keyboard and pressing CTRL+ALT+DEL resulted in the shutdown sequence starting, but the system would hang and never actually reboot:
pfSense is now shutting down ...
ath0_wlan3: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
ath0: ath_txrx_stop_locked: didn't finish after 100 iterations
3. After a fresh boot the system would appear healthy for a few minutes, then randomly drop from the network.
Running top over console showed this weird output:
last pid: 76962; load averages: 0.51, 0.32, 0.21 up 0+00:17:32 23:24:22
67 processes: 2 running, 62 sleeping, 3 lock <-------- ???
CPU: 0.0% user, 1.6% nice, 3.3% system, 0.0% interrupt, 95.1% idle
Mem: 88M Active, 46M Inact, 339M Wired, 26M Buf, 7415M Free
Swap: 16G Total, 16G Free
Running ps showed several processes in "waiting for lock" state, including an ifconfig process:
[2.4.1-RELEASE][admin@pfSense.localdomain]/root: ps waux
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 11 400.0 0.0 0 64 - RL 23:07 53:10.84 [idle]
unbound 90940 0.4 0.3 70880 24192 - Ss 23:07 0:03.17 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
root 0 0.0 0.0 0 592 - DLs 23:07 0:00.01 [kernel]
root 1 0.0 0.0 5024 872 - ILs 23:07 0:00.02 /sbin/init --
root 2 0.0 0.0 0 16 - DL 23:07 0:00.00 [crypto]
root 3 0.0 0.0 0 16 - DL 23:07 0:00.00 [crypto returns]
root 4 0.0 0.0 0 32 - DL 23:07 0:00.15 [cam]
root 5 0.0 0.0 0 16 - DL 23:07 0:00.00 [soaiod1]
root 6 0.0 0.0 0 16 - DL 23:07 0:00.00 [soaiod2]
root 7 0.0 0.0 0 16 - DL 23:07 0:00.00 [soaiod3]
root 8 0.0 0.0 0 16 - DL 23:07 0:00.00 [soaiod4]
root 9 0.0 0.0 0 16 - DL 23:07 0:00.00 [sctp_iterator]
root 10 0.0 0.0 0 16 - DL 23:07 0:00.00 [audit]
root 12 0.0 0.0 0 720 - WL 23:07 0:02.70 [intr]
root 13 0.0 0.0 0 64 - DL 23:07 0:00.00 [ng_queue]
root 14 0.0 0.0 0 48 - DL 23:07 0:00.08 [geom]
root 15 0.0 0.0 0 80 - DL 23:07 0:00.06 [usb]
root 16 0.0 0.0 0 16 - DL 23:07 0:00.47 [pf purge]
root 17 0.0 0.0 0 16 - DL 23:07 0:00.48 [rand_harvestq]
root 18 0.0 0.0 0 48 - DL 23:07 0:00.05 [pagedaemon]
root 19 0.0 0.0 0 16 - DL 23:07 0:00.00 [vmdaemon]
root 20 0.0 0.0 0 16 - DL 23:07 0:00.00 [pagezero]
root 21 0.0 0.0 0 16 - DL 23:07 0:00.01 [bufspacedaemon]
root 22 0.0 0.0 0 32 - DL 23:07 0:00.06 [bufdaemon]
root 23 0.0 0.0 0 16 - DL 23:07 0:00.01 [vnlru]
root 24 0.0 0.0 0 16 - DL 23:07 0:00.18 [syncer]
root 58 0.0 0.0 0 16 - DL 23:07 0:00.02 [md0]
root 297 0.0 0.4 282680 29232 - Ss 23:07 0:00.09 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
root 368 0.0 0.1 19440 4472 - INs 23:07 0:00.02 /usr/local/sbin/check_reload_status
root 370 0.0 0.1 19440 4276 - IN 23:07 0:00.00 check_reload_status: Monitoring daemon of check_reload_status
root 383 0.0 0.1 9556 4968 - Is 23:07 0:00.02 /sbin/devd -q -f /etc/pfSense-devd.conf
root 5229 0.0 0.1 35660 6900 - Is 23:07 0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-we
root 5512 0.0 0.1 35660 7372 - I 23:07 0:00.00 nginx: worker process (nginx)
root 5550 0.0 0.1 37708 8056 - L 23:07 0:00.34 nginx: worker process (nginx) <-------- ???
root 6080 0.0 0.0 12496 2352 - Is 23:07 0:00.02 /usr/sbin/cron -s
root 6445 0.0 0.1 24612 12448 - Is 23:08 0:00.19 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.p
root 7559 0.0 0.0 10580 2304 - Is 23:08 0:00.00 /usr/local/sbin/sshlockout_pf 15
messagebus 10712 0.0 0.0 21528 3176 - Is 23:08 0:00.00 /usr/local/bin/dbus-daemon --system
dhcpd 11863 0.0 0.1 16652 7896 - Is 23:08 0:00.07 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcp
root 12585 0.0 0.1 53492 6928 - Is 23:07 0:00.00 /usr/sbin/sshd
root 12685 0.0 0.0 12628 2216 - Is 23:07 0:00.01 /usr/local/sbin/sshlockout_pf 15
root 14777 0.0 0.0 8224 2004 - Is 23:08 0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/b
root 14907 0.0 0.0 10528 2296 - Is 23:07 0:00.00 dhclient: igb4 [priv] (dhclient)
root 14937 0.0 0.0 8224 2020 - I 23:08 0:00.00 minicron: helper /usr/local/bin/ping_hosts.sh (minicron)
root 15517 0.0 0.0 8224 2004 - Is 23:08 0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/l
root 15694 0.0 0.0 8224 2004 - Is 23:08 0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid
root 16272 0.0 0.0 8224 2020 - I 23:08 0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccoun
root 16523 0.0 0.0 8224 2020 - I 23:08 0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias
root 18669 0.0 0.0 15076 2504 - Is 23:08 0:00.42 /usr/local/bin/dpinger -S -r 0 -i GW_1_IPv4 -B 10.123
root 19021 0.0 0.0 13028 2472 - Is 23:08 0:00.19 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv4 -B 10.12
root 19307 0.0 0.0 15076 2516 - Is 23:08 0:00.35 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv6 -B fd00:
_dhcp 19506 0.0 0.0 10528 2404 - Is 23:07 0:00.01 dhclient: igb4 (dhclient)
root 19722 0.0 0.0 13028 2472 - Is 23:08 0:00.22 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv4 -B 10.12
root 19875 0.0 0.0 13028 2480 - Is 23:08 0:00.23 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv6 -B fd00:
root 20054 0.0 0.0 13028 2472 - Is 23:08 0:00.20 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4 -B <public ip>
root 20330 0.0 0.0 13028 2472 - Is 23:08 0:00.21 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4_Google_DNS -B 10
messagebus 23664 0.0 0.0 21528 3196 - Is 23:08 0:00.00 /usr/local/bin/dbus-daemon --system
root 41132 0.0 0.1 20352 5712 - Ss 23:07 0:00.73 /usr/local/sbin/openvpn --config /var/etc/openvpn/server3.conf
root 43454 0.0 0.0 7548 3516 - Ss 23:08 0:00.01 /usr/sbin/watchdogd -t 128
root 45654 0.0 0.1 20352 5988 - Ss 23:07 0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server2.conf
root 48389 0.0 0.0 12696 2304 - Ss 23:07 0:00.05 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
root 48511 0.0 0.1 19648 5580 - Ls 23:08 0:00.03 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run <-------- ???
root 48605 0.0 0.1 20352 5984 - Ss 23:07 0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server4.conf
root 48758 0.0 0.0 10368 2092 - Ss 23:08 0:00.28 /usr/sbin/powerd -b hadp -a hadp -n hadp
root 55992 0.0 0.0 13084 2724 - I 23:12 0:00.01 /bin/sh /usr/local/bin/ping_hosts.sh
root 56169 0.0 0.0 13084 2724 - I 23:12 0:00.00 /bin/sh /usr/local/bin/ping_hosts.sh
root 56277 0.0 0.0 12788 2676 - Is 23:08 0:00.01 /usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /v
root 56369 0.0 0.0 16988 2856 - L 23:12 0:00.02 ifconfig <-------- ???
root 56416 0.0 0.0 14728 2460 - I 23:12 0:00.00 grep carp: BACKUP vhid
root 56745 0.0 0.0 12532 2212 - I 23:12 0:00.00 wc -l
root 74116 0.0 0.1 20352 5880 - Ss 23:07 0:00.40 /usr/local/sbin/openvpn --config /var/etc/openvpn/client1.conf
avahi 85807 0.0 0.0 29960 3852 - I 23:08 0:00.13 avahi-daemon: running [pfSense.local] (avahi-daemon)
root 93089 0.0 0.5 284728 38648 - I 23:11 0:00.46 php-fpm: pool nginx (php-fpm)
root 93155 0.0 0.0 10484 2532 - Ss 23:08 0:00.79 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run
root 94590 0.0 0.0 8520 2732 - Is 23:08 0:00.00 bgpd: parent (bgpd)
_bgpd 94858 0.0 0.0 8520 2716 - I 23:08 0:00.00 bgpd: route decision engine (bgpd)
_bgpd 95194 0.0 0.0 8520 2792 - I 23:08 0:00.00 bgpd: session engine (bgpd)
root 99398 0.0 0.0 6172 1928 - IN 23:20 0:00.00 sleep 60
root 7594 0.0 0.0 13084 2556 u1 I 23:08 0:00.00 /bin/sh /etc/rc.initial
root 10266 0.0 0.0 13392 3624 u1 S 23:08 0:00.17 /bin/tcsh
root 36372 0.0 0.0 13084 2640 u1- IN 23:08 0:00.60 /bin/sh /var/db/rrd/updaterrd.sh
root 83521 0.0 0.0 39432 2848 u1 Is 23:08 0:00.01 login [pam] (login)
root 99594 0.0 0.0 21104 2744 u1 R+ 23:20 0:00.03 ps waux
root 82157 0.0 0.0 10388 2132 v0 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv0
root 82332 0.0 0.0 10388 2132 v1 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv1
root 82429 0.0 0.0 10388 2132 v2 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv2
root 82538 0.0 0.0 10388 2132 v3 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv3
root 82728 0.0 0.0 10388 2132 v4 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv4
root 83058 0.0 0.0 10388 2132 v5 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv5
root 83404 0.0 0.0 10388 2132 v6 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv6
root 83409 0.0 0.0 10388 2132 v7 Is+ 23:08 0:00.00 /usr/libexec/getty Pc ttyv7
[2.4.1-RELEASE][admin@pfSense.localdomain]/root:
The main features I use:
- bridges that contain:
  - VLAN tagged interfaces (no PPPoE)
  - wireless interfaces with multiple (3) virtual SSIDs
  - CARP running on bridge interfaces
- NAT in various flavors
- OpenVPN
- OpenBGPd
- no explicit shaping/policing/queueing configured
-
Want to bump this thread as I had another attempt at updating both boxes. I thought this was related to the pfBlockerNG issue that everyone seems to be having, which should have been fixed with pfSense 2.4.1 and the latest pfBlockerNG (that I had installed, but not activated). What I observed: the APU2C4 box was upgraded, but left in a CARP backup state for 24h. It did not show any signs of locking up for the whole duration. The moment it became CARP master it only took ~5 minutes to get it to lock up. Same thing then happened for the SG-4860.
I believe this issue is different from the pfBlockerNG issue, as the processes in this case get stuck in the L state, compared to the D state for the pfBlockerNG issue.
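To tell the two cases apart in `ps` output like the listing earlier in this thread, only the first letter of the STAT column matters. What follows is a minimal Python sketch, not something from the original posts: the helper name `procs_in_state` is made up for illustration, and the sample rows are copied from the `ps waux` output above. It assumes FreeBSD `ps` semantics, where a leading `L` means the process is waiting to acquire a lock and a leading `D` means uninterruptible (disk) wait, while an `L` in a later position (as in `DL` or `ILs`) only flags pages locked in core.

```python
# Sample rows copied from the ps waux listing in this thread.
SAMPLE_PS = """\
USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND
root 16 0.0 0.0 0 16 - DL 23:07 0:00.47 [pf purge]
root 1 0.0 0.0 5024 872 - ILs 23:07 0:00.02 /sbin/init --
root 5550 0.0 0.1 37708 8056 - L 23:07 0:00.34 nginx: worker process (nginx)
root 48511 0.0 0.1 19648 5580 - Ls 23:08 0:00.03 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run
root 56369 0.0 0.0 16988 2856 - L 23:12 0:00.02 ifconfig
"""


def procs_in_state(ps_text, state):
    """Return (pid, stat, command) for rows whose STAT *starts* with `state`.

    Hypothetical helper: only the first STAT letter is the run state,
    so "DL" counts as D (disk wait), not as L (lock wait).
    """
    rows = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        # COMMAND is the 11th column and may itself contain spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        if stat.startswith(state):
            rows.append((pid, stat, command))
    return rows


lock_waiters = procs_in_state(SAMPLE_PS, "L")
disk_waiters = procs_in_state(SAMPLE_PS, "D")

for pid, stat, command in lock_waiters:
    print(pid, stat, command)
```

On the sample rows this picks out the nginx worker, miniupnpd and ifconfig, the same three processes marked with arrows in the listing, which also matches the "3 lock" count reported by top. Kernel threads such as `[pf purge]` normally sit in the D state, so a D entry on its own is not a sign of trouble.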
-
The main features I use:
- bridges that contain:
  - VLAN tagged interfaces (no PPPoE)
  - wireless interfaces with multiple (3) virtual SSIDs
  - CARP running on bridge interfaces
- NAT in various flavors
- OpenVPN
- OpenBGPd
- no explicit shaping/policing/queueing configured

In version 2.4.0 there were some VLAN labeling problems (names that were too long), if I read the forum correctly.
In version 2.4.1 there are some serious problems using VLANs on a PPPoE connection!
In the early 2.4.2 builds these problems are gone, but that does not necessarily mean this version is as stable as the others!

Across the whole forum there are many reports of problems updating or upgrading to a 2.4.x version, but many users did a fresh, full install onto their storage, restored their config.xml file, and everything was fine afterwards. If I were in your situation I would try installing the 2.4.0 ADI image on the SG-4860 and the CE edition on the APU2C4. Before that I would also check the firmware on both boxes and update it if needed. Then restore the config.xml file.
-
Thanks for the tip, but I think this is a basic 2.4 bug. I can easily replicate the issue on a freshly installed pfSense VM by sending traffic via a CARP IP on a bridge interface.
https://redmine.pfsense.org/issues/8056
-
I hate to make a "me too" post, but I'm seeing the same thing on 2.4.1 with a clean install and a config.xml loaded from my old system. The only way I have found to keep the firewall up is to turn all of my CARP VIPs into IP Aliases and turn off my secondary firewall.
When I get the hang, hitting ctrl-t on the console gives me variations on:
load: 7.22 cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k
so it looks like it is spinning in carp_if.
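For anyone collecting several of these ctrl-t lines, here is a small Python sketch that pulls out the command, PID and wait channel. It is only an illustration: `parse_siginfo` is a made-up helper, and the pattern is based on the single sample line quoted above, so other layouts may not match. As far as I know, FreeBSD prefixes the bracketed name with `*` when the thread is blocked on a lock of that name, and the 0.00u/0.00s CPU times would be consistent with ifconfig waiting on the `carp_if` lock rather than burning CPU.

```python
import re

# Pattern assumed from the one ctrl-t (SIGINFO) line quoted in this thread.
SIGINFO_RE = re.compile(
    r"load: (?P<load>[\d.]+)\s+cmd: (?P<cmd>\S+) (?P<pid>\d+) "
    r"\[(?P<wchan>[^\]]*)\] "
    r"(?P<real>[\d.]+)r (?P<user>[\d.]+)u (?P<sys>[\d.]+)s"
)


def parse_siginfo(line):
    """Parse one ctrl-t status line; return None if it does not match."""
    m = SIGINFO_RE.search(line)
    if m is None:
        return None
    wchan = m.group("wchan")
    return {
        "cmd": m.group("cmd"),
        "pid": int(m.group("pid")),
        # Assumption: a leading "*" marks a thread blocked on a lock.
        "blocked_on_lock": wchan.startswith("*"),
        "wait_name": wchan.lstrip("*"),
        "real_s": float(m.group("real")),
        "cpu_s": float(m.group("user")) + float(m.group("sys")),
    }


info = parse_siginfo(
    "load: 7.22 cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k"
)
print(info)
```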
-
Another me too, as well.
All our pfSense firewalls that are using Bridged interfaces and CARP will freeze as soon as traffic starts passing across the Bridge Interface.
Had to reinstall 2.3 to get firewalls working again.
-
Hoping this gets some traction, but so far no activity on the linked bug report…
-
Does anyone know if this bug has been fixed in 2.4.2?
-
I re-tested with 2.4.2 and I still see the same behavior.
Until the pfSense team acknowledges https://redmine.pfsense.org/issues/8056 I don't see how we'll get a fix.
-
Bumping this thread in the hope that someone on the pfSense team acknowledges this issue.
-
Looking through the following bug report, this appears to be a bug in FreeBSD that occurs when an interface used as a bridge member has a CARP IP:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200319
Has anyone tested whether the problem persists if the CARP IPs are removed from the interfaces that are members of the bridge?
-
Hi,
I have a similar problem, though not with a BRIDGE but with a LAGG (with LACP enabled).
The problem started after upgrading one of the two firewalls in a CARP pair from 2.3.4p1 to 2.4.2p1.
Could it be related to this bug?
I will describe my problem in more detail as soon as I can.
-
Hi,
I have this scenario:
2 Dell PowerEdge R310 firewalls with these network adapters:
2 embedded Broadcom NetXtreme II Gigabit Ethernet
1 Intel(R) Gigabit ET Quad Port Server Adapter
Firewall 1: pfSense 2.3.4-RELEASE-p1 (amd64) installed on HDD; the only installed package is FTP_Client_Proxy. This firewall normally has CARP status MASTER on all interfaces; it is now in persistent CARP maintenance mode.
Firewall 2: same hardware, but reinstalled with pfSense-CE-2.4.2-RELEASE-amd64 (ZFS auto, 4GB swap); during the installation I restored the previous config.
Just after installation I upgraded it to 2.4.2 p1.
I have 2 LAGGs (with LACP): igb0,igb1 and igb2,igb3.
One LAGG is assigned to an interface, the other carries some VLANs.
One Broadcom port is directly connected to the other firewall (sync).
On all interfaces except the sync one we have one or more CARP IPs.
Firewall 2, running 2.4.2 p1, works for less than an hour (about 30 minutes); then the other firewall becomes master and this firewall gets stuck:
the console (DRAC) no longer responds but, in at least one case, something still seemed to be working: all 4 NICs in LACP had link up, but only one port channel was up, with a single interface, and the browser showed the certificate warning but then didn't load the login page.
The system.log file contains entries from the time of the hang:
apart from the pre-existing and recurring "pfr_update_stats: assertion failed." errors, new errors also showed up, such as
"sonewconn: pcb 0xfffff8007394e1d0: Listen queue overflow: 2 already in queue awaiting acceptance (12 occurrences)"
Any suggestion would be appreciated.
-
Hi,
I have a similar problem, though not with a BRIDGE but with a LAGG (with LACP enabled).
The problem started after upgrading one of the two firewalls in a CARP pair from 2.3.4p1 to 2.4.2p1.
Could it be related to this bug?
I will describe my problem in more detail as soon as I can.

Sorry, I forgot about a BRIDGE between an OpenVPN TAP and an interface.
I'm now testing with the bridge removed.
Thanks.
EDIT: I can confirm more than 2 hours without problems.
So even a little-used bridge that is not assigned as an interface, but that includes a member interface with a CARP IP, triggers the problem.