Netgate Discussion Forum

    pfSense 2.3 LAN interface stops routing traffic - stops working after 2 or 3 days

    General pfSense Questions
    88 Posts 31 Posters 44.0k Views
      rlrobs

      Has anyone tried disabling Hyper-Threading in the BIOS?

      Just as a test.

        adam65535

        I always disable Hyper-Threading on firewalls, so it was already disabled on my systems when I had the crash and lost packets.

          cmb

          We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
          https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

          I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

            jswope

            I am having the same issue: it randomly stops routing traffic to all VLANs. If I reboot, it is fine for a day or so, then it happens again.

              Zaphon

              Add me to the list as well. I've got this happening on both a SUPERMICRO SYS-5018A-FTN4 1U rackmount server (C2758, 8-core) and an SG-2440 pfSense appliance (C2358, 2-core). Both have four Intel igb interfaces, and both have IPsec tunnels (required to reach the colo where our VoIP phone system is). What's interesting is that it's NOT happening on my home system, an AMD Athlon box (a Dell I got for $250 from Best Buy 4+ years ago) with dual Intel em interfaces. It has the same IPsec tunnels: I have 4 locations in total (2 offices, my home, and the colo), all running pfSense (the colo is still on 2.2.6) and all connected to each other, so every location has 3 IPsec tunnels. This didn't start occurring until 2.3. I actually thought this might have something to do with AES-NI, since the only systems I have AES-NI on are the ones affected.

              I'm going to try the single-core trick to see if that helps for now, though I'm concerned about throughput (I put the 8-core C2758 in a location that has gigabit because the C2358 maxed out at around 600 Mbit/s). NOTE: I guess it's not the number of cores that lets the C2758 handle gigabit, but rather the faster clock speed. Even with 1 core it can still handle the full gigabit, so that's good.

                byusinger84

                @cmb:

                We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
                https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

                I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

                Disabled all but one core. I will let you know if I continue to have issues. Please let me know when there is a more permanent fix.

                  OLBaID

                  Hello, add me as well (+1). I recently upgraded hardware from an ALIX to a SuperMicro SBE200-9B with 4 igb NICs. I am getting the watchdog timeout error on the new hardware (not on the ALIX), and the LAN interface igb1 drops randomly every few days:

                  https://dl.dropboxusercontent.com/u/42296/SuperMicroPfsense.JPG

                  Doing some research prior to finding this thread I found:

                  https://doc.pfsense.org/index.php/Disable_ACPI

                  Now that I've read this thread, I can try disabling the other cores for now. Hoping there is a solution soon.

                  I've loved pfSense for many years, can't say that enough!

                    breakaway

                    I'm getting this as well. Most of my pfSense installs are virtual machines running on VMware ESXi.

                    I use pfSense for building site-to-site IPSEC tunnels (Blowfish encryption).

                    In my case it's happening when I see heavy loads across the IPSEC tunnel (this is normally at night, for running backups).

                    It appears traffic stops completely on the LAN interface. If I look on the console, I see "em0: Watchdog timeout – resetting" or something to that effect (where em1 is my LAN interface).

                    For encryption, I use Blowfish 256 bit with a SHA512 Hash Algorithm. DH Group Phase 1 - 8192 bit.

                    For phase 2, I use ESP with Blowfish 256 bit with a SHA512 Hash Algorithm. PFS key group 18 - 8192 bit.

                    After reducing the DH key group and PFS key group to 14 (2048 bit), I have noticed an increase in stability (it hasn't locked up in about a week). I've just applied this "workaround" on a few other machines I manage and will report back.

                      j.koopmann

                      I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

                      hint.lapic.1.disabled=1
                      hint.lapic.2.disabled=1
                      hint.lapic.3.disabled=1

                      and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

                      ifconfig igb2 down
                      ifconfig igb2 up

                      and 5-10 seconds later everything else was back online. I noticed tons of

                      ifa_add_loopback_route: insertion failed: 17

                      in dmesg however. dmesg also said

                      cpu0 (BSP): APIC ID:  0
                        cpu (AP): APIC ID:  1 (disabled)
                        cpu (AP): APIC ID:  2 (disabled)
                        cpu (AP): APIC ID:  3 (disabled)

                      So I assume the cores ARE disabled! Something else going on? What do you pfsense gurus want me to do/debug the next time it happens?

                      Regards,
                        JP
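
For anyone with a different core count, the hint lines in this post can be generated rather than typed by hand. This is only a minimal sketch: `gen_lapic_hints` is a made-up helper name, and it assumes the APIC IDs are contiguous (0, 1, 2, ...), which is worth checking against the "APIC ID" lines in dmesg first. The post does not say which file the hints went into; on pfSense that is usually /boot/loader.conf.local, but treat that as an assumption too.

```shell
#!/bin/sh
# Print one "hint.lapic.N.disabled=1" line for each APIC ID from 1 to
# ncpu-1, leaving APIC ID 0 (the boot processor) enabled.
# Assumption: APIC IDs are contiguous; verify with dmesg on the real box.
gen_lapic_hints() {
    ncpu=$1
    id=1
    while [ "$id" -lt "$ncpu" ]; do
        printf 'hint.lapic.%d.disabled=1\n' "$id"
        id=$((id + 1))
    done
}

# Example: a 4-core system like the one in the post above.
gen_lapic_hints 4
```

The output would be reviewed and then added to the loader configuration, followed by a reboot, as described above.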

                        cmb

                        @j.koopmann:

                        I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

                        hint.lapic.1.disabled=1
                        hint.lapic.2.disabled=1
                        hint.lapic.3.disabled=1

                        and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

                        ifconfig igb2 down
                        ifconfig igb2 up

                        and 5-10 seconds later everything else was back online.

                        That all looks correct. The only really solid confirmation I have that it fixes it is with em and re NICs. They're single-queue, where igb is multi-queue, so it's possible there's more to it in the igb case. igb's num_queues could be set to 1, but that has a pretty significant impact on achievable top end throughput.

                        An ifconfig down and up of the interfaces with SMP doesn't do anything that I've seen, so the fact that the network comes back with that suggests it's "better" than before. Not that dead is any better than dead.
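
If someone wants to try the single-queue igb setting mentioned here, the FreeBSD loader tunable is `hw.igb.num_queues`. A rough sketch of setting it without leaving duplicate lines behind could look like this; `set_tunable` is a hypothetical helper, and the file path in the usage comment is an assumption (the usual place for custom loader settings on pfSense), not something stated in this thread.

```shell
#!/bin/sh
# set_tunable FILE KEY VALUE
# Replace an existing KEY=... line in FILE, or append one if it is missing.
set_tunable() {
    file=$1
    key=$2
    value=$3
    touch "$file"
    tmp=$(mktemp) || return 1
    # Keep every line that does not start with "KEY=", then append the new value.
    awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$tmp"
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Possible use on the firewall (as root, then reboot):
#   set_tunable /boot/loader.conf.local hw.igb.num_queues 1
```

As noted above, forcing one queue costs top-end throughput, so this is a diagnostic step rather than a fix.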

                          j.koopmann

                          Except that the ifconfig down/up trick worked even before I disabled the cores… :-)

                          I now have a cronjob that checks the LAN interface every minute and if it cannot ping internal systems restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

                          Regards,
                            JP

                            erdmensch

                            Thank you for this workaround!

                            Disabling the CPUs seems to solve my problems for the moment.

                            On my Supermicro MBD-X7SPA-HF with Quad em pcie card, using multiple vlans and ipsec tunnels:

                            • Traffic on vlan Interfaces dead after some time. From 15min up to a few hours at most.
                            • Wan and pfsense still responding
                            • Reboot solves the problem (multiple times a day)

                            Same behaviour with a replacement asrock Q1900M and Quad em card.

                            Other observations:

                            Old alix 2d13 runs fine with the same vlan/ipsec setup (very slow, but ok as backup).

                            2 boxes with a gigabyte j1900n-d3v also run fine without disabling the cores:

                            • using the onboard re interfaces
                            • they also have some ipsec tunnels
                            • they do not have vlan interfaces

                              covex

                              @j.koopmann:

                              Only that the ifconfig down/up stuff even worked before I disabled the cores… :-)

                              I now have a cronjob that checks the LAN interface every minute and if it cannot ping internal systems restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

                              Regards,
                                JP

                              Hey JP, could you share the script for that cron job?
                              In my case the ALIX APU box stays up with no problems. The SG-2440 locked up once, and the Soekris 6501 locks up every couple of days.

                                j.koopmann

                                Sure..

                                
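                                #!/usr/local/bin/perl
                                # Ping two LAN hosts; if neither answers, bounce igb2 and log the result.

                                use strict;
                                use warnings;
                                use Net::Ping;
                                use Sys::Syslog qw(openlog syslog closelog);

                                my $server_to_ping  = "192.168.1.1";
                                my $server_to_ping2 = "192.168.1.2";

                                # Return 1 if the host answers an ICMP echo, 0 otherwise.
                                # ICMP pings require root privileges.
                                sub check_ping_server
                                {
                                    my ($host) = @_;
                                    my $ping = Net::Ping->new('icmp');
                                    my $host_alive = $ping->ping($host) ? 1 : 0;
                                    $ping->close();
                                    return $host_alive;
                                }

                                # Write one line to syslog under the "checkigb2" tag.
                                sub log_notice
                                {
                                    my ($message) = @_;
                                    openlog("checkigb2", "ndelay", "user");
                                    syslog('notice', '%s', $message);
                                    closelog();
                                }

                                # Restart the interface only when both hosts are unreachable.
                                if (!check_ping_server($server_to_ping) && !check_ping_server($server_to_ping2))
                                {
                                    system("ifconfig igb2 down");
                                    sleep 2;
                                    system("ifconfig igb2 up");
                                    sleep 5;
                                    log_notice('IP check failed, igb2 restarted');
                                }
                                else
                                {
                                    log_notice('IP check ok');
                                }

                                exit;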
                                
                                

                                Not the best piece of coding but it seems to do the trick.

                                Regards,
                                  JP
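
The same check can also be sketched in plain sh for boxes where the Perl modules are not available. This is an untested-on-pfSense sketch that mirrors the script above: the targets, the igb2 interface, and the syslog tag are taken from JP's script, while the function names and the `run` argument are made up for illustration. Note that `-t 5` is the FreeBSD ping timeout; on Linux `-t` means TTL.

```shell
#!/bin/sh
# Targets and interface mirror the Perl script above; adjust for your network.
TARGETS="${TARGETS:-192.168.1.1 192.168.1.2}"
IFACE="${IFACE:-igb2}"

# Return 0 if the host answers one ICMP echo within 5 seconds (FreeBSD flags).
host_alive() {
    ping -c 1 -t 5 "$1" >/dev/null 2>&1
}

# Return 0 as soon as any target answers, 1 if none do.
any_alive() {
    for host in $TARGETS; do
        host_alive "$host" && return 0
    done
    return 1
}

# Bounce the interface, as done by hand on the serial console.
bounce_iface() {
    ifconfig "$IFACE" down
    sleep 2
    ifconfig "$IFACE" up
    sleep 5
}

# One monitoring pass: log success, or restart the interface and log that.
check_once() {
    if any_alive; then
        logger -t checkigb2 "IP check ok"
    else
        bounce_iface
        logger -t checkigb2 "IP check failed, $IFACE restarted"
    fi
}

# Only act when called with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    check_once
fi
```

It would be scheduled once a minute from cron with the `run` argument, matching the interval JP describes.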

                                  OLBaID

                                  Hi, I've been tracking the bug (thanks cmb!):

                                  https://redmine.pfsense.org/issues/6296

                                  A comment on the bug tracker had this link:

                                  https://forum.pfsense.org/index.php?topic=107471.msg602590#msg602590

                                  This user disabled the following under System / Advanced / Networking (I had all of them enabled):

                                  Disable hardware checksum offload
                                  Disable hardware TCP segmentation offload
                                  Disable hardware large receive offload

                                  Can anyone else who is experiencing the issue confirm whether this helps? (I am working to get this hardware back online soon to test it myself.)

                                  Not trying to muddy any waters, but hoping for a fast resolution.

                                  Thanks
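
To compare notes on these settings, it may help to check what the NIC actually has enabled rather than relying on the GUI checkboxes. On FreeBSD the enabled capabilities appear in the `options=...<...>` line of ifconfig output. The filter below is a small sketch (`list_offloads` is a made-up name) that reads that output on stdin and prints only the checksum, TSO and LRO flags.

```shell
#!/bin/sh
# Read "ifconfig <iface>" output on stdin and print, one per line, the
# offload capabilities that are currently enabled on the interface.
list_offloads() {
    sed -n 's/.*options=[0-9a-f]*<\(.*\)>.*/\1/p' |
        tr ',' '\n' |
        grep -E '^(RXCSUM|TXCSUM|TSO4|TSO6|LRO)$'
}

# Possible use on the firewall (igb2 is the LAN NIC from earlier posts):
#   ifconfig igb2 | list_offloads
```

If nothing is printed, none of those offloads are active on that interface.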

                                    erdmensch

                                    I had these options disabled even before the upgrade to 2.3.
                                    Only disabling the CPU cores solves the issue for me.

                                      jeffvfren

                                      I'm having the same issue. I have disabled the extra CPU cores and am still monitoring.
                                      Hopefully the fix is released ASAP.

                                      Update:
                                      It did not work for me; the issue just happened again.

                                        h311m4n

                                        Add us to the list as well.

                                        We have a virtual pfSense cluster on our DRC site with an IPsec tunnel to our prod site. Since the update, we've had constant crashes of the master on the DRC site.

                                        It has been stable for the past 2-3 days now, but with a vCenter session open, I see a constant CPU usage warning on the master on the other side. We have applied the CPU workaround too, but as soon as we launch our Veeam replications, we can be pretty much sure that the IPsec tunnel will fall over. It definitely looks like all the UDP traffic is basically causing a DoS…

                                          cmb

                                          We're still working on tracking this down. Have it narrowed down to something in our IPsec changes (which are just back-ports from FreeBSD -CURRENT), but that's still 80 change sets potentially related.

                                            h311m4n

                                            @cmb:

                                            We're still working on tracking this down. Have it narrowed down to something in our IPsec changes (which are just back-ports from FreeBSD -CURRENT), but that's still 80 change sets potentially related.

                                            Good to know you're working on it!  :)

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.