    Netgate Discussion Forum

    2.4.1: pfSense lockup with CARP on bridge interface

    Installation and Upgrades
    • B
      bkraptor last edited by

      After successfully upgrading one SG-4860 and one APU2C4 from 2.3.4-p1 to 2.4.1, they both showed the same behavior: after a random number of minutes the routers would become unreachable over the network. Both systems repeatedly showed the same exact behavior 100% of the time over many reboots, so I don't believe the problem is related to the hardware platform.

      Logging in over console showed the routers had not crashed and I could run various commands. Weird behaviors observed:

      1. Pinging a known reachable IP in the same subnet would time out (i.e. no output). When I tried killing the ping command with CTRL+C I would only get ^C printed on screen, but the program did not end. Since this was run over console and I had no IP reachability, there was no other way to attempt to kill the ping command.

      2. Plugging in an external keyboard and pressing CTRL+ALT+DEL resulted in the shutdown sequence starting, but the system would hang and never actually reboot:

      pfSense is now shutting down ...
      
      ath0_wlan3: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
      ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
      ath0_wlan2: ieee80211_new_state_locked: pending RUN -> SCAN transition lost
      ath0: ath_txrx_stop_locked: didn't finish after 100 iterations
      

      3. After a fresh boot the system would appear healthy for a few minutes, then randomly drop from the network.

      Running top over console showed this weird output:

      last pid: 76962;  load averages:  0.51,  0.32,  0.21    up 0+00:17:32  23:24:22
      67 processes:  2 running, 62 sleeping, 3 lock <------------------------------------------ ???
      CPU:  0.0% user,  1.6% nice,  3.3% system,  0.0% interrupt, 95.1% idle
      Mem: 88M Active, 46M Inact, 339M Wired, 26M Buf, 7415M Free
      Swap: 16G Total, 16G Free
      

      Running ps showed several processes in "waiting for lock" state, including an ifconfig process:

      [2.4.1-RELEASE][admin@pfSense.localdomain]/root: ps waux
      USER         PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED     TIME COMMAND
      root          11 400.0  0.0      0    64  -  RL   23:07   53:10.84 [idle]
      unbound    90940   0.4  0.3  70880 24192  -  Ss   23:07    0:03.17 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
      root           0   0.0  0.0      0   592  -  DLs  23:07    0:00.01 [kernel]
      root           1   0.0  0.0   5024   872  -  ILs  23:07    0:00.02 /sbin/init --
      root           2   0.0  0.0      0    16  -  DL   23:07    0:00.00 [crypto]
      root           3   0.0  0.0      0    16  -  DL   23:07    0:00.00 [crypto returns]
      root           4   0.0  0.0      0    32  -  DL   23:07    0:00.15 [cam]
      root           5   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod1]
      root           6   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod2]
      root           7   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod3]
      root           8   0.0  0.0      0    16  -  DL   23:07    0:00.00 [soaiod4]
      root           9   0.0  0.0      0    16  -  DL   23:07    0:00.00 [sctp_iterator]
      root          10   0.0  0.0      0    16  -  DL   23:07    0:00.00 [audit]
      root          12   0.0  0.0      0   720  -  WL   23:07    0:02.70 [intr]
      root          13   0.0  0.0      0    64  -  DL   23:07    0:00.00 [ng_queue]
      root          14   0.0  0.0      0    48  -  DL   23:07    0:00.08 [geom]
      root          15   0.0  0.0      0    80  -  DL   23:07    0:00.06 [usb]
      root          16   0.0  0.0      0    16  -  DL   23:07    0:00.47 [pf purge]
      root          17   0.0  0.0      0    16  -  DL   23:07    0:00.48 [rand_harvestq]
      root          18   0.0  0.0      0    48  -  DL   23:07    0:00.05 [pagedaemon]
      root          19   0.0  0.0      0    16  -  DL   23:07    0:00.00 [vmdaemon]
      root          20   0.0  0.0      0    16  -  DL   23:07    0:00.00 [pagezero]
      root          21   0.0  0.0      0    16  -  DL   23:07    0:00.01 [bufspacedaemon]
      root          22   0.0  0.0      0    32  -  DL   23:07    0:00.06 [bufdaemon]
      root          23   0.0  0.0      0    16  -  DL   23:07    0:00.01 [vnlru]
      root          24   0.0  0.0      0    16  -  DL   23:07    0:00.18 [syncer]
      root          58   0.0  0.0      0    16  -  DL   23:07    0:00.02 [md0]
      root         297   0.0  0.4 282680 29232  -  Ss   23:07    0:00.09 php-fpm: master process (/usr/local/lib/php-fpm.conf) (php-fpm)
      root         368   0.0  0.1  19440  4472  -  INs  23:07    0:00.02 /usr/local/sbin/check_reload_status
      root         370   0.0  0.1  19440  4276  -  IN   23:07    0:00.00 check_reload_status: Monitoring daemon of check_reload_status
      root         383   0.0  0.1   9556  4968  -  Is   23:07    0:00.02 /sbin/devd -q -f /etc/pfSense-devd.conf
      root        5229   0.0  0.1  35660  6900  -  Is   23:07    0:00.00 nginx: master process /usr/local/sbin/nginx -c /var/etc/nginx-we
      root        5512   0.0  0.1  35660  7372  -  I    23:07    0:00.00 nginx: worker process (nginx)
      root        5550   0.0  0.1  37708  8056  -  L    23:07    0:00.34 nginx: worker process (nginx) <------------------------------------------- ???
      root        6080   0.0  0.0  12496  2352  -  Is   23:07    0:00.02 /usr/sbin/cron -s
      root        6445   0.0  0.1  24612 12448  -  Is   23:08    0:00.19 /usr/local/sbin/ntpd -g -c /var/etc/ntpd.conf -p /var/run/ntpd.p
      root        7559   0.0  0.0  10580  2304  -  Is   23:08    0:00.00 /usr/local/sbin/sshlockout_pf 15
      messagebus 10712   0.0  0.0  21528  3176  -  Is   23:08    0:00.00 /usr/local/bin/dbus-daemon --system
      dhcpd      11863   0.0  0.1  16652  7896  -  Is   23:08    0:00.07 /usr/local/sbin/dhcpd -user dhcpd -group _dhcp -chroot /var/dhcp
      root       12585   0.0  0.1  53492  6928  -  Is   23:07    0:00.00 /usr/sbin/sshd
      root       12685   0.0  0.0  12628  2216  -  Is   23:07    0:00.01 /usr/local/sbin/sshlockout_pf 15
      root       14777   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 240 /var/run/ping_hosts.pid /usr/local/b
      root       14907   0.0  0.0  10528  2296  -  Is   23:07    0:00.00 dhclient: igb4 [priv] (dhclient)
      root       14937   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/bin/ping_hosts.sh  (minicron)
      root       15517   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 3600 /var/run/expire_accounts.pid /usr/l
      root       15694   0.0  0.0   8224  2004  -  Is   23:08    0:00.00 /usr/local/bin/minicron 86400 /var/run/update_alias_url_data.pid
      root       16272   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.expireaccoun
      root       16523   0.0  0.0   8224  2020  -  I    23:08    0:00.00 minicron: helper /usr/local/sbin/fcgicli -f /etc/rc.update_alias
      root       18669   0.0  0.0  15076  2504  -  Is   23:08    0:00.42 /usr/local/bin/dpinger -S -r 0 -i GW_1_IPv4 -B 10.123
      root       19021   0.0  0.0  13028  2472  -  Is   23:08    0:00.19 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv4 -B 10.12
      root       19307   0.0  0.0  15076  2516  -  Is   23:08    0:00.35 /usr/local/bin/dpinger -S -r 0 -i GW_2_IPv6 -B fd00:
      _dhcp      19506   0.0  0.0  10528  2404  -  Is   23:07    0:00.01 dhclient: igb4 (dhclient)
      root       19722   0.0  0.0  13028  2472  -  Is   23:08    0:00.22 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv4 -B 10.12
      root       19875   0.0  0.0  13028  2480  -  Is   23:08    0:00.23 /usr/local/bin/dpinger -S -r 0 -i GW_3_IPv6 -B fd00:
      root       20054   0.0  0.0  13028  2472  -  Is   23:08    0:00.20 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4 -B <public ip>
      root       20330   0.0  0.0  13028  2472  -  Is   23:08    0:00.21 /usr/local/bin/dpinger -S -r 0 -i GW_WAN_1_IPv4_Google_DNS -B 10
      messagebus 23664   0.0  0.0  21528  3196  -  Is   23:08    0:00.00 /usr/local/bin/dbus-daemon --system
      root       41132   0.0  0.1  20352  5712  -  Ss   23:07    0:00.73 /usr/local/sbin/openvpn --config /var/etc/openvpn/server3.conf
      root       43454   0.0  0.0   7548  3516  -  Ss   23:08    0:00.01 /usr/sbin/watchdogd -t 128
      root       45654   0.0  0.1  20352  5988  -  Ss   23:07    0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server2.conf
      root       48389   0.0  0.0  12696  2304  -  Ss   23:07    0:00.05 /usr/local/sbin/filterlog -i pflog0 -p /var/run/filterlog.pid
      root       48511   0.0  0.1  19648  5580  -  Ls   23:08    0:00.03 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run <-------- ???
      root       48605   0.0  0.1  20352  5984  -  Ss   23:07    0:00.02 /usr/local/sbin/openvpn --config /var/etc/openvpn/server4.conf
      root       48758   0.0  0.0  10368  2092  -  Ss   23:08    0:00.28 /usr/sbin/powerd -b hadp -a hadp -n hadp
      root       55992   0.0  0.0  13084  2724  -  I    23:12    0:00.01 /bin/sh /usr/local/bin/ping_hosts.sh
      root       56169   0.0  0.0  13084  2724  -  I    23:12    0:00.00 /bin/sh /usr/local/bin/ping_hosts.sh
      root       56277   0.0  0.0  12788  2676  -  Is   23:08    0:00.01 /usr/local/sbin/filterdns -p /var/run/filterdns.pid -i 300 -c /v
      root       56369   0.0  0.0  16988  2856  -  L    23:12    0:00.02 ifconfig <---------------------------------------------------------------- ???
      root       56416   0.0  0.0  14728  2460  -  I    23:12    0:00.00 grep carp: BACKUP vhid
      root       56745   0.0  0.0  12532  2212  -  I    23:12    0:00.00 wc -l
      root       74116   0.0  0.1  20352  5880  -  Ss   23:07    0:00.40 /usr/local/sbin/openvpn --config /var/etc/openvpn/client1.conf
      avahi      85807   0.0  0.0  29960  3852  -  I    23:08    0:00.13 avahi-daemon: running [pfSense.local] (avahi-daemon)
      root       93089   0.0  0.5 284728 38648  -  I    23:11    0:00.46 php-fpm: pool nginx (php-fpm)
      root       93155   0.0  0.0  10484  2532  -  Ss   23:08    0:00.79 /usr/sbin/syslogd -s -c -c -l /var/dhcpd/var/run/log -P /var/run
      root       94590   0.0  0.0   8520  2732  -  Is   23:08    0:00.00 bgpd: parent (bgpd)
      _bgpd      94858   0.0  0.0   8520  2716  -  I    23:08    0:00.00 bgpd: route decision engine (bgpd)
      _bgpd      95194   0.0  0.0   8520  2792  -  I    23:08    0:00.00 bgpd: session engine (bgpd)
      root       99398   0.0  0.0   6172  1928  -  IN   23:20    0:00.00 sleep 60
      root        7594   0.0  0.0  13084  2556 u1  I    23:08    0:00.00 /bin/sh /etc/rc.initial
      root       10266   0.0  0.0  13392  3624 u1  S    23:08    0:00.17 /bin/tcsh
      root       36372   0.0  0.0  13084  2640 u1- IN   23:08    0:00.60 /bin/sh /var/db/rrd/updaterrd.sh
      root       83521   0.0  0.0  39432  2848 u1  Is   23:08    0:00.01 login [pam] (login)
      root       99594   0.0  0.0  21104  2744 u1  R+   23:20    0:00.03 ps waux
      root       82157   0.0  0.0  10388  2132 v0  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv0
      root       82332   0.0  0.0  10388  2132 v1  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv1
      root       82429   0.0  0.0  10388  2132 v2  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv2
      root       82538   0.0  0.0  10388  2132 v3  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv3
      root       82728   0.0  0.0  10388  2132 v4  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv4
      root       83058   0.0  0.0  10388  2132 v5  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv5
      root       83404   0.0  0.0  10388  2132 v6  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv6
      root       83409   0.0  0.0  10388  2132 v7  Is+  23:08    0:00.00 /usr/libexec/getty Pc ttyv7
      [2.4.1-RELEASE][admin@pfSense.localdomain]/root:
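
      To pull the lock-waiting rows out of a long listing like the one above, a minimal Python sketch could filter on the STAT column. It assumes the 11-column `ps waux` layout shown here and is meant to be run against saved output (for example a console capture). The names `lock_waiters` and `SAMPLE` are illustrative only, and `SAMPLE` is a short excerpt of the listing with the arrow annotations dropped. Only the first STAT character is checked: in FreeBSD's ps(1) a leading L means the process is waiting to acquire a lock, while an L later in the field (as in RL or DLs) is a different flag.

```python
def lock_waiters(ps_output):
    """Return (pid, stat, command) for rows whose STAT starts with 'L'."""
    rows = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TT STAT STARTED TIME COMMAND -> 11 fields,
        # the last one keeps the full command line including spaces.
        fields = line.split(None, 10)
        # Skip the header, shell prompts and anything that is not a full row.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        pid, stat, command = int(fields[1]), fields[7], fields[10]
        # First character is the run state; 'L' = waiting to acquire a lock.
        if stat.startswith("L"):
            rows.append((pid, stat, command))
    return rows


# Excerpt from the listing in this post (assumed layout: plain `ps waux`).
SAMPLE = """\
USER         PID  %CPU %MEM    VSZ   RSS TT  STAT STARTED     TIME COMMAND
root          11 400.0  0.0      0    64  -  RL   23:07   53:10.84 [idle]
root        5550   0.0  0.1  37708  8056  -  L    23:07    0:00.34 nginx: worker process (nginx)
root       48511   0.0  0.1  19648  5580  -  Ls   23:08    0:00.03 /usr/local/sbin/miniupnpd -f /var/etc/miniupnpd.conf -P /var/run
root       56369   0.0  0.0  16988  2856  -  L    23:12    0:00.02 ifconfig
root       56416   0.0  0.0  14728  2460  -  I    23:12    0:00.00 grep carp: BACKUP vhid
"""

for pid, stat, command in lock_waiters(SAMPLE):
    print(pid, stat, command)
```

      On this excerpt the sketch keeps the nginx worker, miniupnpd and ifconfig rows, which are the same three processes marked with arrows in the listing.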
      

      The main features I use:

      • bridges that contain:

        • VLAN tagged interfaces (no PPPoE)

        • wireless interfaces with multiple (3) virtual SSIDs

        • CARP running on bridge interfaces

      • NAT in various flavors

      • OpenVPN

      • OpenBGPd

      • no explicit shaping/policing/queueing configured

      • B
        bkraptor last edited by

        Want to bump this thread as I had another attempt at updating both boxes. I thought this was related to the pfBlockerNG issue that everyone seems to be having, which should have been fixed with pfSense 2.4.1 and the latest pfBlockerNG (that I had installed, but not activated). What I observed: the APU2C4 box was upgraded, but left in a CARP backup state for 24h. It did not show any signs of locking up for the whole duration. The moment it became CARP master it only took ~5 minutes to get it to lock up. Same thing then happened for the SG-4860.

        I believe this issue is different from the pfBlockerNG issue, as the processes in this case get stuck in the L state, whereas the pfBlockerNG issue leaves them in the D state.

        • ?
          Guest last edited by

          The main features I use:

          • bridges that contain:
            • VLAN tagged interfaces (no PPPoE)
            • wireless interfaces with multiple (3) virtual SSIDs
            • CARP running on bridge interfaces
          • NAT in various flavors
          • OpenVPN
          • OpenBGPd
          • no explicit shaping/policing/queueing configured

          If I read the forum correctly, version 2.4.0 had problems with VLAN naming (names that were too long).
          Version 2.4.1 has serious problems with VLANs on PPPoE connections.
          In early 2.4.2 builds these problems are gone, but that does not necessarily mean the version is as stable as the others.

          Across the forum there are many reports of problems updating or upgrading to a 2.4.x version, but many users
          who did a full fresh install and then restored their config.xml file found that everything worked. In your
          situation I would try installing the 2.4.0 ADI image on the SG-4860 and the CE edition on the APU2C4. Before
          that I would also check the firmware on both boxes and update it if needed. Then restore the config.xml file.

          • B
            bkraptor last edited by

            Thanks for the tip, but I think this is a basic 2.4 bug. I can easily replicate the issue on a freshly installed pfSense VM by sending traffic via a CARP IP on a bridge interface.

            https://redmine.pfsense.org/issues/8056

            • W
              wwwdrich last edited by

              I hate to make a "me too" post, but I'm seeing the same thing on 2.4.1 with a clean install and a config.xml loaded from my old system. The only way I have found to keep the firewall up is to turn all of my CARP VIPs into IP Aliases and turn off my secondary firewall.

              When I get the hang, hitting ctrl-t on the console gives me variations on:

              load: 7.22  cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k
              

              so it looks like it is spinning in carp_if.
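
              For anyone collecting these status lines from several hangs, a small Python sketch can extract the command, PID and wait state. It assumes the line format shown above. To my understanding, a leading `*` inside the brackets marks a thread blocked on a lock with that name, which would match the L state reported earlier in the thread. The names `STATUS_RE` and `parse_status` are made up for this example.

```python
import re

# Matches the start of a ctrl-t (SIGINFO) status line, e.g.
# "load: 7.22  cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k"
STATUS_RE = re.compile(
    r"load:\s*(?P<load>[\d.]+)\s+"
    r"cmd:\s*(?P<cmd>\S+)\s+(?P<pid>\d+)\s+"
    r"\[(?P<state>[^\]]+)\]"
)


def parse_status(line):
    """Return a dict describing the status line, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    state = match.group("state")
    # Assumption: a "*" prefix means "blocked on the lock named after it".
    on_lock = state.startswith("*")
    return {
        "load": float(match.group("load")),
        "cmd": match.group("cmd"),
        "pid": int(match.group("pid")),
        "state": state,
        "lock_name": state[1:] if on_lock else None,
    }


info = parse_status(
    "load: 7.22  cmd: ifconfig 88749 [*carp_if] 11.76r 0.00u 0.00s 0% 2704k"
)
print(info)
```

              Grouping the parsed lines by `lock_name` would show whether every hang is stuck on the same lock.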

              • W
                webwiz last edited by

                Another me too, as well.

                All our pfSense firewalls that are using Bridged interfaces and CARP will freeze as soon as traffic starts passing across the Bridge Interface.

                Had to reinstall 2.3 to get firewalls working again.

                • B
                  bkraptor last edited by

                  Hoping this gets some traction, but so far no activity on the linked bug report…

                  • W
                    webwiz last edited by

                    Does anyone know if this bug has been fixed in 2.4.2?

                    • B
                      bkraptor last edited by

                      I re-tested with 2.4.2 and I still see the same behavior.

                      Until the pfSense team acknowledges https://redmine.pfsense.org/issues/8056 I don't see how we'll get a fix.

                      • B
                        bkraptor last edited by

                        Bumping this thread in the hope that someone on the pfSense team acknowledges this issue.

                        • W
                          webwiz last edited by

                          Looking through the following bug report, this appears to be a bug in FreeBSD that occurs when an interface used as a bridge member has a CARP IP:

                          https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200319

                          Has anyone tested whether the problem persists if CARP IPs are removed from the interfaces that are members of the bridge?

                          • G
                            gtoso last edited by

                            Hi,
                            I have a similar problem, though not with a bridge but with a LAGG (LACP enabled).
                            The problem started after upgrading one of two firewalls in a CARP pair from 2.3.4-p1 to 2.4.2-p1.

                            Could it be related to this bug?
                            I will describe my problem in more detail as soon as I can.

                            • G
                              gtoso last edited by

                              Hi,
                              I have this scenario:
                              Two Dell PowerEdge R310 firewalls with these network adapters:
                              2 embedded Broadcom NetXtreme II Gigabit Ethernet
                              1 Intel(R) Gigabit ET Quad Port Server Adapter
                              Firewall 1: pfSense 2.3.4-RELEASE-p1 (amd64) installed on HDD; the only installed package is FTP_Client_Proxy. This firewall normally has CARP status MASTER on all interfaces, but it is now in persistent CARP maintenance mode.
                              Firewall 2: same hardware, but reinstalled with pfSense-CE-2.4.2-RELEASE-amd64 (ZFS auto, 4GB swap); during the installation I recovered the previous config.
                              Right after installation I upgraded it to 2.4.2-p1.
                              I have 2 LAGGs (with LACP): igb0,igb1 and igb2,igb3.
                              One LAGG is assigned to an interface, the other carries some VLANs.
                              One Broadcom port is directly connected to the other firewall (sync).

                              On all interfaces except the sync one we have one or more CARP IPs.
                              Firewall 2 on 2.4.2-p1 works for less than an hour (about 30 minutes), then the other firewall becomes master and this firewall gets stuck:
                              the console (DRAC) no longer responds but, in at least one case, something still seemed to be working: all 4 NICs in LACP had link up, but only one port channel was up, with a single interface, and the browser showed the certificate warning but then did not load the login page.

                              The system.log file contains entries from the time the firewall was stuck:
                              apart from the pre-existing, recurring "pfr_update_stats: assertion failed." errors, new errors also showed up, such as
                              "sonewconn: pcb 0xfffff8007394e1d0: Listen queue overflow: 2 already in queue awaiting acceptance (12 occurrences)"

                              Any suggestion would be appreciated.

                              • G
                                gtoso last edited by

                                @gtoso:

                                Hi,
                                I have a similar problem, though not with a bridge but with a LAGG (LACP enabled).
                                The problem started after upgrading one of two firewalls in a CARP pair from 2.3.4-p1 to 2.4.2-p1.

                                Could it be related to this bug?
                                I will describe my problem in more detail as soon as I can.

                                Sorry, I forgot to mention a bridge between an OpenVPN TAP interface and another interface.
                                I'm now testing with the bridge removed.

                                Thanks.

                                EDIT: I can confirm more than 2 hours without problems.
                                So even a little-used bridge that is not assigned as an interface, but includes an interface with a CARP IP, triggers the problem.
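
                                To check a box for this combination before upgrading, a rough Python sketch could scan saved `ifconfig` output for bridges whose members (or the bridge itself) carry a CARP vhid. It assumes the usual FreeBSD layout with `member:` lines under a bridge and `carp:` lines under an interface that has a CARP VIP. The function name `carp_on_bridges` and the `SAMPLE` text are hypothetical and not taken from a real system.

```python
import re


def carp_on_bridges(ifconfig_output):
    """Map each bridge to the interfaces (itself or members) that have CARP."""
    members = {}      # bridge name -> list of member interface names
    has_carp = set()  # interfaces with at least one "carp:" line
    current = None
    for line in ifconfig_output.splitlines():
        # Interface headers start in column 0: "igb1: flags=8943<...> ..."
        header = re.match(r"^(\S+?): flags=", line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        text = line.strip()
        if text.startswith("member: "):
            members.setdefault(current, []).append(text.split()[1])
        elif text.startswith("carp: "):
            has_carp.add(current)
    result = {}
    for bridge, ifaces in members.items():
        hits = [name for name in [bridge] + ifaces if name in has_carp]
        if hits:
            result[bridge] = hits
    return result


# Hypothetical input: an OpenVPN TAP interface bridged with a CARP-enabled NIC.
SAMPLE = """\
igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 192.0.2.2 netmask 0xffffff00 broadcast 192.0.2.255
\tcarp: MASTER vhid 1 advbase 1 advskew 0
ovpns1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tmember: ovpns1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
\tmember: igb1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
"""

print(carp_on_bridges(SAMPLE))
```

                                An empty result would suggest that no bridge on the box involves a CARP interface. It does not cover the LAGG-only case.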
