Netgate Discussion Forum

    2.3 Lockup with no packages

    Problems Installing or Upgrading pfSense Software
    61 Posts 10 Posters 13.3k Views
      afreaken

      @SoloIT:

      I've started rolling all mine back to 2.2.6. Though I assume the team is working on the issue, I've heard nothing from them in 2 days. I've asked for a status update.

      Know a safe source for 2.2.6? I don't think I downloaded it; I just did the auto upgrade. The last version I have is 2.2.3, and I would rather get the last stable release (from a safe source).

        SoloIT

        You can get them direct from pfSense: http://files.pfsense.org/mirror/downloads/old/

          afreaken

          @SoloIT:

          You can get them direct from pfSense: http://files.pfsense.org/mirror/downloads/old/

          Oh nice, thanks. Wish my searches had led me there in the first place…

            pnp

            I considered opening a new topic as I am not sure my problem has the same cause, but finally decided to post here based on the similarities I noticed:

            • had the same "Listen queue overflow" error
            • using intel drivers (igb)
            • using ipsec
            • having high CPU usage

            The one thing I have not had yet is a complete lockup. Packet forwarding and routing kept working, but squid and squidguard (the only two packages I have installed) did not; or rather, they were running but almost always failed to deliver web pages.

            If it was wrong posting here, please tell me and I'll start a new thread.

            My setup is a redundant pair of pfSense boxes with 2 WAN and 2 LAN, on the following hardware:
            SuperMicro SYS-5018A-FTN4 with
            1x AOC-SGP-I4 (Standard 4-port GbE with Intel i350)
            2x SO-4GB-1600E (4GB 1600MHz DDR3, ECC, SO DIMM)
            2x Kingston SSD 60GB (GEOM mirror)

            Both were running 2.2.6 until about a couple of weeks ago. I upgraded the secondary to 2.3 and disabled carp on the master.

            After 2 or 3 days, I received complaints from the users that they were not able to browse the internet.
            I logged in to the web configurator and saw the following in system.log:

            sonewconn: pcb 0xfffff8010cd72dc8: Listen queue overflow: 193 already in queue awaiting acceptance (97 occurrences)
            

            At the time I switched to the master and rebooted the secondary. Then proceeded to do a complete xml backup without the packages, a clean install of 2.3, and a config restore.

            I then again switched carp off on the primary, and kept monitoring the secondary for problems. Yesterday, after 8 days without any problem I was about to upgrade the primary to 2.3. Fortunately I hadn't the time to get it done, because today I again noticed problems with browsing the internet.

            Logging in to the pfSense web page took a while, and this time I did not see any problem reported in system.log. It's probably worth mentioning that after the reinstall I had added

            kern.ipc.soacceptqueue = 1024
            

            to system tunables.
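
A rough sanity check on the numbers in the log line quoted earlier, as a minimal sketch. It rests on two assumptions that are not stated anywhere in this thread: that the FreeBSD 10.x `sonewconn` code reports an overflow once the queue length exceeds 1.5 times the socket's listen backlog, and that the affected socket was using the default backlog of 128. The helper names are made up for illustration.

```python
import re

# Matches the kernel message quoted above from system.log.
OVERFLOW_RE = re.compile(
    r"Listen queue overflow: (\d+) already in queue awaiting acceptance"
    r"(?: \((\d+) occurrences\))?"
)

def parse_overflow(line):
    """Return (queued, occurrences) from a sonewconn overflow line, or None."""
    m = OVERFLOW_RE.search(line)
    if m is None:
        return None
    queued = int(m.group(1))
    occurrences = int(m.group(2)) if m.group(2) else 1
    return queued, occurrences

def overflow_threshold(backlog):
    """ASSUMED kernel rule: overflow is logged when queue length > 3 * backlog / 2."""
    return 3 * backlog // 2

line = ("sonewconn: pcb 0xfffff8010cd72dc8: Listen queue overflow: "
        "193 already in queue awaiting acceptance (97 occurrences)")
queued, occurrences = parse_overflow(line)
for backlog in (128, 1024):
    limit = overflow_threshold(backlog)
    print(f"backlog={backlog}: overflow logged above {limit}, observed {queued}")
```

If those assumptions hold, 193 queued connections is what a default-sized backlog would report when it fills. Raising kern.ipc.soacceptqueue would then only raise the ceiling; it would not explain why the daemon stopped accepting connections.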

            Running top I saw a load average of 5+, when the usual is <1.

            Noticed some ipsec tunnels were down (maybe 10~20). I have about 70 ipsec configured.

            I enabled carp on the primary, but this time I did not reboot the secondary. I have spent the last hours trying to find out what is wrong. The following information is from now, about 8+ hours after I noticed the problem and switched traffic to the other machine:

            top:

            last pid: 28265;  load averages:  5.08,  5.05,  5.01             up 9+06:59:06  17:53:33
            68 processes:  1 running, 67 sleeping
            CPU 0:  0.0% user,  0.0% nice,  0.0% system,  0.7% interrupt, 99.3% idle
            CPU 1:  0.0% user,  0.0% nice,  0.0% system,  100% interrupt,  0.0% idle
            CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
            CPU 3:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
            CPU 4:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
            CPU 5:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
            CPU 6:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
            CPU 7:  0.4% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.6% idle
            Mem: 21M Active, 561M Inact, 653M Wired, 698M Buf, 6634M Free
            Swap: 16G Total, 16G Free
            
              PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
            38979 root        1  20    0 21856K  3084K CPU7    7   0:14   0.20% top
            94760 root        1  20    0 14516K  2320K select  5  50:33   0.00% syslogd
            18106 root        1  20    0 16676K  2276K bpf     0  25:57   0.00% filterlog
             6928 clamav      2  20    0   412M   346M select  3   0:44   0.00% clamd
            15398 root        1  20    0 46196K  8528K kqread  3   0:37   0.00% nginx
            15275 root        1  20    0 46196K  8592K kqread  0   0:31   0.00% nginx
            84115 root        1  52   20 17000K  2596K wait    6   0:10   0.00% sh
            66757 unbound     8  20    0   123M 32388K kqread  2   0:07   0.00% unbound
            83556 dhcpd       1  20    0 24804K 13648K select  2   0:06   0.00% dhcpd
            46922 root        5  20    0 15012K  2292K accept  0   0:03   0.00% dpinger
            60772 squid       1  20    0 37752K  4092K select  3   0:03   0.00% pinger
            65339 squid       1  20    0 37752K  4092K select  5   0:03   0.00% pinger
            48881 root        5  20    0 15012K  2292K accept  3   0:03   0.00% dpinger
            96574 squid       1  20    0 37752K  4092K select  3   0:03   0.00% pinger
            48364 root        5  20    0 19108K  2376K accept  7   0:03   0.00% dpinger
            59818 root        2  20    0 30144K 17988K kqread  0   0:03   0.00% ntpd
            47149 root        5  20    0 15012K  2292K accept  0   0:03   0.00% dpinger
            49585 root        5  20    0 19108K  2372K accept  7   0:02   0.00% dpinger
            49171 root        5  20    0 19108K  2372K accept  4   0:02   0.00% dpinger
            47717 root        5  20    0 19108K  2372K accept  0   0:02   0.00% dpinger
            48172 root        5  20    0 19108K  2372K accept  6   0:02   0.00% dpinger
            94771 squid       1  20    0   199M 51996K kqread  0   0:02   0.00% squid
            72924 root        1  20    0 82268K  7512K select  5   0:02   0.00% sshd
            71417 root        1  20    0 21616K  5496K select  5   0:01   0.00% openvpn
            47670 root        1  20    0 21616K  5596K select  6   0:01   0.00% openvpn
            70996 root        1  23    0 12268K  1884K nanslp  0   0:01   0.00% minicron
            31269 root        1  20    0 16532K  2260K nanslp  0   0:01   0.00% cron
             1111 clamav      1  20    0 25268K  2864K select  0   0:01   0.00% c-icap
            74415 root        1  20    0   262M 26728K kqread  6   0:01   0.00% php-fpm
             1648 clamav     12  20    0 26708K  3192K semwai  3   0:00   0.00% c-icap
             1381 clamav     12  21    0 26708K  3192K select  4   0:00   0.00% c-icap
            40729 root        1  52    0   266M 43300K accept  2   0:00   0.00% php-fpm
              289 root        1  20    0 13624K  4840K select  0   0:00   0.00% devd
            62093 root        1  25    0 17000K  2528K wait    5   0:00   0.00% sh
            69439 root        1  47    0 12268K  1888K nanslp  4   0:00   0.00% minicron
            58909 root       17  20    0   253M 14680K uwait   3   0:00   0.00% charon
              275 root        1  40   20 18888K  2504K kqread  3   0:00   0.00% check_reload_status
            20054 root        1  20    0 18896K  2404K select  7   0:00   0.00% xinetd
            59970 squid       1  28    0 33564K 11700K sbwait  7   0:00   0.00% squidGuard
            59969 squid       1  28    0 33564K 11700K sbwait  0   0:00   0.00% squidGuard
            60434 squid       1  29    0 33564K 11700K sbwait  2   0:00   0.00% squidGuard
            59396 squid       1  25    0 33564K 11700K sbwait  0   0:00   0.00% squidGuard
            60174 squid       1  26    0 33564K 11700K sbwait  5   0:00   0.00% squidGuard
            59604 squid       1  27    0 33564K 11700K sbwait  4   0:00   0.00% squidGuard
            59176 squid       1  27    0 33564K 11700K sbwait  6   0:00   0.00% squidGuard
            59769 squid       1  27    0 33564K 11700K sbwait  6   0:00   0.00% squidGuard
            

            vmstat -i

            interrupt                          total       rate
            irq23: ehci0                     1605871          2
            cpu0:timer                     904539001       1126
            irq257: igb0:que 0               6571130          8
            irq258: igb0:que 1               5848552          7
            irq259: igb0:que 2               5582121          6
            irq260: igb0:que 3               5203123          6
            irq261: igb0:que 4               5512906          6
            irq262: igb0:que 5               7177261          8
            irq263: igb0:que 6               6540870          8
            irq264: igb0:que 7               5954809          7
            irq265: igb0:link                     12          0
            irq266: igb1:que 0             180504559        224
            irq267: igb1:que 1             155556613        193
            irq268: igb1:que 2             135560934        168
            irq269: igb1:que 3             134799683        167
            irq270: igb1:que 4             169856947        211
            irq271: igb1:que 5             114559553        142
            irq272: igb1:que 6             108745891        135
            irq273: igb1:que 7             175595604        218
            irq274: igb1:link                      4          0
            irq293: igb4:que 0              19205978         23
            irq294: igb4:que 1              14175553         17
            irq295: igb4:que 2              13186026         16
            irq296: igb4:que 3              13357795         16
            irq297: igb4:que 4              14144730         17
            irq298: igb4:que 5              15867243         19
            irq299: igb4:que 6              16196010         20
            irq300: igb4:que 7              14305323         17
            irq301: igb4:link                      4          0
            irq302: igb5:que 0             122160662        152
            irq303: igb5:que 1             148252972        184
            irq304: igb5:que 2             136667917        170
            irq305: igb5:que 3             139662860        173
            irq306: igb5:que 4             148586814        185
            irq307: igb5:que 5             201532284        251
            irq308: igb5:que 6             101346091        126
            irq309: igb5:que 7             158692479        197
            irq310: igb5:link                      6          0
            irq328: igb7:link                      2          0
            irq330: ahci1                    3126604          3
            cpu5:timer                      71360664         88
            cpu6:timer                      21788299         27
            cpu2:timer                      24476506         30
            cpu7:timer                      28427574         35
            cpu3:timer                      22974538         28
            cpu1:timer                     115989008        144
            cpu4:timer                      27194256         33
            Total                         3722393642       4636
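
The `rate` column of `vmstat -i` is an average since boot, so a queue that only recently started misbehaving can still look ordinary there (irq303 shows 184/s, in line with its neighbours). A minimal sketch of comparing two samples instead is below. The function names are illustrative, and the second sample's totals are invented for the example, not taken from this system.

```python
def parse_vmstat_i(text):
    """Map each interrupt source in `vmstat -i` output to its cumulative total."""
    totals = {}
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 2)
        # Data rows end in two integer columns: total and rate.
        if len(parts) != 3 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        name = parts[0].strip()
        if name == "Total":
            continue  # skip the summary row
        totals[name] = int(parts[1])
    return totals

def delta_rates(before, after, seconds):
    """Interrupts per second between two samples taken `seconds` apart."""
    return {
        name: (after[name] - before[name]) / seconds
        for name in after
        if name in before
    }

# First sample: two rows copied from the output above.
before = parse_vmstat_i("""
interrupt                          total       rate
irq303: igb5:que 1             148252972        184
irq304: igb5:que 2             136667917        170
""")
# Second sample: HYPOTHETICAL totals, invented purely for illustration.
after = {"irq303: igb5:que 1": 148852972, "irq304: igb5:que 2": 136669617}

for name, rate in sorted(delta_rates(before, after, 10).items()):
    print(f"{name}: {rate:.0f}/s")
```

On a live box the two samples would come from running `vmstat -i` twice a few seconds apart; a queue whose delta rate is far above its siblings is the one to look at.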
            

            from pfsense Diagnostics / System Activity

            last pid: 57816;  load averages:  5.04,  5.03,  5.00  up 9+07:02:44    17:57:11
            375 processes: 14 running, 265 sleeping, 96 waiting
            
            Mem: 21M Active, 558M Inact, 654M Wired, 698M Buf, 6636M Free
            Swap: 16G Total, 16G Free
            
              PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
               11 root     155 ki31     0K   128K CPU3    3 219.5H 100.00% [idle{idle: cpu3}]
               11 root     155 ki31     0K   128K RUN     5 219.4H 100.00% [idle{idle: cpu5}]
               11 root     155 ki31     0K   128K CPU6    6 219.4H 100.00% [idle{idle: cpu6}]
               11 root     155 ki31     0K   128K CPU2    2 219.1H 100.00% [idle{idle: cpu2}]
               11 root     155 ki31     0K   128K CPU4    4 219.1H 100.00% [idle{idle: cpu4}]
               11 root     155 ki31     0K   128K CPU0    0 217.7H 100.00% [idle{idle: cpu0}]
               12 root     -92    -     0K  1600K CPU1    1  35:20 100.00% [intr{irq303: igb5:que}]
               11 root     155 ki31     0K   128K CPU7    7 219.2H  99.46% [idle{idle: cpu7}]
            64580 root      28    0   266M 36836K piperd  6   0:03   2.78% php-fpm: pool nginx (php-fpm)
               11 root     155 ki31     0K   128K RUN     1 197.0H   0.00% [idle{idle: cpu1}]
               12 root     -60    -     0K  1600K WAIT    0  86:48   0.00% [intr{swi4: clock}]
               12 root     -92    -     0K  1600K RUN     1  73:15   0.00% [intr{irq267: igb1:que}]
               12 root     -92    -     0K  1600K WAIT    2  70:13   0.00% [intr{irq268: igb1:que}]
               12 root     -92    -     0K  1600K WAIT    4  69:13   0.00% [intr{irq270: igb1:que}]
               12 root     -92    -     0K  1600K WAIT    7  62:36   0.00% [intr{irq273: igb1:que}]
               12 root     -92    -     0K  1600K WAIT    6  50:53   0.00% [intr{irq272: igb1:que}]
            94760 root      20    0 14516K  2320K select  2  50:34   0.00% /usr/sbin/syslogd -s -c -c -l /var/dhcpd/v
               12 root     -92    -     0K  1600K WAIT    0  48:35   0.00% [intr{irq266: igb1:que}]
            

            ifconfig

            
            igb5: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
            	options=400b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWTSO>
            	ether 0c:c4:7a:--:--:--
            	inet6 fe80::ec4:7aff:fe68:c925%igb5 prefixlen 64 tentative scopeid 0x6
            	inet 172.17.23.242 netmask 0xffffff00 broadcast 172.17.23.255
            	inet 172.17.23.254 netmask 0xffffff00 broadcast 172.17.23.255 vhid 2
            	nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
            	media: Ethernet autoselect (100baseTX <full-duplex>)
            	status: active
            	carp: BACKUP vhid 2 advbase 1 advskew 100
            

            I think what I am seeing is related to a problem with igb5 and irq303 placing CPU1 under 100% interrupt load. What I don't know is why, and whether it is something with the igb drivers or with FreeBSD.
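
The pattern described above, a single interrupt thread pinned at 100%, can be picked out of the System Activity listing mechanically. The following is a minimal sketch, assuming the thread names keep the `[intr{irqNNN: device:que}]` shape shown in the output earlier in this post; the function name and the 90% threshold are arbitrary choices for illustration.

```python
import re

# WCPU value followed by an interrupt thread name such as [intr{irq303: igb5:que}].
INTR_RE = re.compile(r"(\d+\.\d+)%\s+\[intr\{(irq\d+):\s*([^}]+)\}\]")

def hot_interrupt_threads(top_output, threshold=90.0):
    """Return (irq, source, wcpu) for hardware interrupt threads at or above threshold."""
    hot = []
    for line in top_output.splitlines():
        m = INTR_RE.search(line)
        if m and float(m.group(1)) >= threshold:
            hot.append((m.group(2), m.group(3).strip(), float(m.group(1))))
    return hot

# Three rows copied from the System Activity output above.
sample = """\
   12 root     -92    -     0K  1600K CPU1    1  35:20 100.00% [intr{irq303: igb5:que}]
   12 root     -60    -     0K  1600K WAIT    0  86:48   0.00% [intr{swi4: clock}]
   12 root     -92    -     0K  1600K RUN     1  73:15   0.00% [intr{irq267: igb1:que}]
"""
print(hot_interrupt_threads(sample))
```

Software interrupt threads such as `swi4: clock` are deliberately not matched, since the question here is which NIC queue is spinning.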

            I stopped and started all the services this box is running, and removed the cable from the lan port in question, but that CPU1 100% interrupt load continues.

            I also think that this problem could go further back, but maybe it was not that evident. In 2.2.6 I had to reboot pfsense every 5 days or so to get ipsec back running. It would start being unable to connect one of the tunnels, then a few hours or a day later another, and so on. Usually when I noticed this there were about 3 to 6 tunnels down, and restarting the ipsec service didn't solve it.
            I saw reports of this in the forum, but no solution.

            Googling I found https://lists.freebsd.org/pipermail/freebsd-stable/2014-June/079005.html which is very similar, but not sure it is the same problem.

            I haven't rebooted the secondary yet, so I can run more tests if needed, but I have little knowledge of freebsd.

            Any thoughts or suggestions, anyone?

              SoloIT

              Are you seeing the port go down in the system.log file? My guess is the issues are related. I cannot say in every case my hardware is totally locked. Many of mine are in remote sites, and I don't have anyone I trust to do more than reboot once traffic is not flowing. I've had too many crashes with the firewall that I'm physically with and reverted it back to 2.2.6 a few days ago.

              2.3 seems to down the port under IPsec load. Load in this case does not seem to be relative to the capacity of the hardware but to the capacity of the connections coming into the box. For example, I have the same hardware at multiple sites. One has a 20 Mb WAN and another has a 3 Mb WAN. Either way, after running about 45-60 minutes at >75% of the WAN capacity, the pfSense will typically malfunction.

                pnp

                Information in system.log starts about 6 hours ago, which is after the problem.
                I believe it keeps working because I have traffic forwarded to a few internal servers, and I don't have a report of it failing.

                  adam65535

                  It definitely sounds like the same symptoms that I had at a backup site when I upgraded the secondary from 2.1.5 to 2.3.  Once the problem occurred, the interrupt assigned to an igb queue stayed high on the secondary 2.3 box even when very little traffic was going through it.  No signs of problems in the logs.  I did NOT have the "Listen queue overflow" error in the logs.  I had failed back to the primary when the initial problem occurred, so very little traffic was going to the secondary.  When that single interrupt assigned to an igb queue got into the 100% CPU state, it seemed like whatever traffic was directed to that IRQ was lost and not seen by the firewall (not even tcpdump saw it).  With a 4 core system and 2 queues per interface (I had hw.igb.num_queues set to 2) I was probably losing about 1/2 of the traffic to the firewall (just a guess).  Some traffic was still flowing, which I assume was the traffic sent to the other igb queues on the other processors.  Traffic on the other interfaces seemed to be working from what I could tell.

                  One ipsec tunnel on the secondary was working to our main headquarters, but another ipsec tunnel to the primary site was down because it couldn't receive packets from the primary site IP.  It was just random IPs that the firewall couldn't communicate with, which I am sure was because the traffic was getting directed to the igb interrupt taking 100% cpu.  The other queues on the same interface, served by the other igb interrupts, seemed to be working.

                  Unfortunately (or fortunately) it only happened once so far.  That was 3.5 days ago which was about 10 hours after switching carp to the secondary server and disabling it on the old 2.1.5 primary server.

                  Not using pfblocker here btw.  Ipsec to 2 sites, openvpn for administration (not used during the problem), IP Aliases (about 7) on a carp IP, and built in load balancer (about 17 pools).  Only installed packages are nmap and openvpn-client-export.  I did have bash installed from a 2.1.5 manual pkg install which is still on the systems.  Snort was running on 2.1.5 and is still in the config, but it was uninstalled prior to the 2.3 upgrade and never installed on 2.3.

                  Again… I still don't know if this is because of the hw.igb.num_queues setting that I had in place.
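
The "about 1/2 of the traffic" guess earlier in this post follows from simple arithmetic if incoming flows are spread evenly across the receive queues. A minimal sketch is below. The even spread is an assumption, and `toy_queue_for` is a made-up stand-in that uses CRC32, not the hash the NIC actually applies; the peer addresses are documentation-range examples.

```python
import zlib

def expected_loss_fraction(num_queues, stuck_queues=1):
    """Share of flows landing on stuck queues, ASSUMING an even spread across queues."""
    if num_queues <= 0 or not 0 <= stuck_queues <= num_queues:
        raise ValueError("need num_queues > 0 and 0 <= stuck_queues <= num_queues")
    return stuck_queues / num_queues

def toy_queue_for(peer_ip, num_queues):
    """TOY stand-in for the NIC's receive hash; NOT the real RSS algorithm."""
    return zlib.crc32(peer_ip.encode("ascii")) % num_queues

print(expected_loss_fraction(2))   # hw.igb.num_queues=2, one queue stuck
print(expected_loss_fraction(8))   # eight queues per port, one queue stuck

# Which example peers would land on the stuck queue under the toy hash.
stuck_queue = 1
for ip in ["192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.5"]:
    status = "lost" if toy_queue_for(ip, 2) == stuck_queue else "ok"
    print(ip, status)
```

Because the queue is chosen per flow, the affected peers look arbitrary from the outside, which would fit the "random IPs" observation. With more queues per port, a single stuck queue would be expected to take out a smaller share of peers.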

                    afreaken

                    The symptoms I experienced were:

                    1st day after updating - IPsec tunnel went down. Web GUI was functional, Remote side was fine and trying to reestablish connection, while Local side wouldn't do anything. I performed a system reboot on both systems from the web GUI. I believe after several minutes, Local webGUI became unresponsive. Tried pinging the system for a couple minutes with no luck.
                    Went to the device and tried using the serial console. Console was not letting me use standard options, giving me an error along the lines of command not recognized. Enter did not refresh the console UI. Given people were bitching because they couldn't access our services from the remote site, I hard reset the device to get it to reboot.

                    2nd day after updating - IPsec tunnel went down. Local Web GUI was unresponsive. Again people bitching, so I hard reset rather than trying to mess with the console given my previous experience.

                    Today: Remote site was down, had someone hard reset it on the other side.

                      cmb

                      Disabling SMP seems to prevent this issue from happening.
                      https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

                        cmb

                        Looks like we have a fix for this issue in the latest 2.3.1 snapshot.
                        https://redmine.pfsense.org/issues/6296

                        If you're impacted, please upgrade and report back.

                          The Brave Sir Robin

                          Been having this trouble on my setup. Was rock-solid on 2.2.6.  The LAN interface locks up when using heavy traffic on IPSec connection. I get about 6 hours use then LAN interface locks up.
                          I am trying:

                          2.3.1-DEVELOPMENT (amd64)
                          built on Fri May 13 09:31:29 CDT 2016
                          FreeBSD 10.3-RELEASE-p2

                          to see if this fixes the issue. Can't be more unstable than the stable version. If this doesn't work, I might have to go back to 2.2.6.

                            draconpern

                            Want to report that I too have the same lockup issue. I have 3 Netgate SG-2440s.  Also seeing the same issue where interrupts go really high.  Based on when the devices lock up, I think it depends on the total number of packets that have gone through ipsec.  Once it hits some limit, the system becomes unresponsive.

                            All three systems use ipsec.  Two of the systems use ipsec a lot and lock up after about a week.  One rarely uses ipsec and has only locked up once.

                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.