Netgate Discussion Forum

    pfSense 2.3 LAN interface stops routing traffic - stops working after 2 or 3 days

    General pfSense Questions
    88 Posts 31 Posters 44.0k Views
      j.koopmann

      I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

      hint.lapic.1.disabled=1
      hint.lapic.2.disabled=1
      hint.lapic.3.disabled=1

      and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

      ifconfig igb2 down
      ifconfig igb2 up

      and 5-10 seconds later everything else was back online. I noticed tons of

      ifa_add_loopback_route: insertion failed: 17

      in dmesg however. dmesg also said

      cpu0 (BSP): APIC ID:  0
        cpu (AP): APIC ID:  1 (disabled)
        cpu (AP): APIC ID:  2 (disabled)
        cpu (AP): APIC ID:  3 (disabled)

      So I assume the cores ARE disabled! Something else going on? What do you pfsense gurus want me to do/debug the next time it happens?

      Regards,
        JP
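
One way to double-check that the `hint.lapic.N.disabled` hints took effect is to count the AP lines marked as disabled in the boot messages. This is a minimal sketch, assuming the dmesg lines look like the ones quoted above; the function name `count_disabled_aps` is hypothetical.

```shell
#!/bin/sh
# count_disabled_aps (hypothetical helper): reads boot messages on stdin and
# prints how many application processors (APs) are reported as disabled.
count_disabled_aps() {
    grep -c 'cpu.*(AP): APIC ID:.*(disabled)'
}

# Usage on the firewall (not executed here):
#   dmesg | count_disabled_aps
```

With the three hints from the post in place, the count should match the number of cores that were disabled.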

        cmb

        @j.koopmann:

        I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

        hint.lapic.1.disabled=1
        hint.lapic.2.disabled=1
        hint.lapic.3.disabled=1

        and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

        ifconfig igb2 down
        ifconfig igb2 up

        and 5-10 seconds later everything else was back online.

        That all looks correct. The only really solid confirmation I have that it fixes it is with em and re NICs. They're single-queue, where igb is multi-queue, so it's possible there's more to it in the igb case. igb's num_queues could be set to 1, but that has a pretty significant impact on achievable top end throughput.

        An ifconfig down and up of the interfaces with SMP doesn't do anything that I've seen, so the fact that the network comes back with that suggests it's "better" than before. Not that dead is any better than dead.
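
For anyone who wants to try the single-queue igb setting mentioned above, the sketch below shows one way to set a loader tunable without duplicating lines. It assumes the tunable is named `hw.igb.num_queues` (as in the FreeBSD igb driver) and that custom tunables live in `/boot/loader.conf.local`; the helper `set_tunable` is hypothetical.

```shell
#!/bin/sh
# set_tunable FILE NAME VALUE (hypothetical helper): sets NAME=VALUE in a
# loader.conf-style file, replacing any existing assignment of NAME.
set_tunable() {
    file=$1
    name=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Keep every existing line whose key is not NAME.
    if [ -f "$file" ]; then
        awk -F= -v n="$name" '$1 != n' "$file" > "$tmp"
    fi
    # Append the new assignment, then write the result back in place.
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage on the firewall (assumed path, not executed here):
#   set_tunable /boot/loader.conf.local hw.igb.num_queues 1
```

A reboot is needed before a loader tunable takes effect, and as noted above, a single queue limits top-end throughput.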

          j.koopmann

          Only that the ifconfig down/up stuff even worked before I disabled the cores… :-)

          I now have a cronjob that checks the LAN interface every minute and if it cannot ping internal systems restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

          Regards,
            JP

            erdmensch

            Thank you for this workaround!

            Disabling the CPUs seems to solve my problems for the moment.

            On my Supermicro MBD-X7SPA-HF with Quad em pcie card, using multiple vlans and ipsec tunnels:

            • Traffic on vlan Interfaces dead after some time. From 15min up to a few hours at most.
            • Wan and pfsense still responding
            • Reboot solves the problem (multiple times a day)

            Same behaviour with a replacement asrock Q1900M and Quad em card.

            Other observations:

            Old alix 2d13 runs fine with the same vlan/ipsec setup (very slow, but ok as backup).

            2 boxes with a gigabyte j1900n-d3v also run fine without disabling the cores:

            • using the onboard re interfaces
            • they also have some ipsec tunnels
            • they do not have vlan interfaces
              covex

              @j.koopmann:

              Only that the ifconfig down/up stuff even worked before I disabled the cores… :-)

              I now have a cronjob that checks the LAN interface every minute and if it cannot ping internal systems restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

              Regards,
                JP

              Hey JP, could you share the script for that cron job?
              In my case the alix apu box stays up with no problems; the sg2440 locked up once and the soekris 6501 locks up every couple of days.

                j.koopmann

                Sure..

                
                #!/usr/local/bin/perl
                # LAN watchdog: if neither internal host answers a ping, restart igb2
                # and log the event to syslog. Run as root (ICMP needs a raw socket).

                use strict;
                use warnings;
                use Net::Ping;
                use Sys::Syslog qw(:standard :macros);

                my $server_to_ping  = "192.168.1.1";
                my $server_to_ping2 = "192.168.1.2";

                # Return 1 if the given host answers an ICMP echo request, 0 otherwise.
                sub check_ping_server
                {
                    my ($host) = @_;
                    my $ping = Net::Ping->new('icmp');
                    my $host_alive = $ping->ping($host) ? 1 : 0;
                    $ping->close();
                    return $host_alive;
                }

                openlog("checkigb2", "ndelay", LOG_USER);

                if (!check_ping_server($server_to_ping) && !check_ping_server($server_to_ping2))
                {
                    # Both hosts are unreachable: take the interface down and up again.
                    system("ifconfig igb2 down");
                    sleep(2);
                    system("ifconfig igb2 up");
                    sleep(5);
                    syslog('notice', 'IP check failed, igb2 restarted');
                }
                else
                {
                    syslog('notice', 'IP check ok');
                }

                closelog();

                exit;
                
                

                Not the best piece of coding but it seems to do the trick.

                Regards,
                  JP
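
For boxes where installing Perl modules is inconvenient, the same watchdog logic can be sketched in plain sh. This is an illustration under stated assumptions, not a drop-in replacement: the function names, the `PING_CMD` and `DRY_RUN` variables, and the default ping options are all hypothetical choices.

```shell
#!/bin/sh
# Command used to probe one host; override PING_CMD to change options.
PING_CMD=${PING_CMD:-"ping -c 1"}

# all_hosts_down HOST...: succeeds (exit 0) only if no listed host answers.
all_hosts_down() {
    for host in "$@"; do
        if $PING_CMD "$host" >/dev/null 2>&1; then
            return 1
        fi
    done
    return 0
}

# bounce_iface IFACE: take the interface down, wait, and bring it back up.
# When DRY_RUN is set to a non-empty value, the commands are only printed.
bounce_iface() {
    run=${DRY_RUN:+echo}
    $run ifconfig "$1" down
    $run sleep 2
    $run ifconfig "$1" up
}

# Usage from a cron script (not executed here):
#   if all_hosts_down 192.168.1.1 192.168.1.2; then
#       bounce_iface igb2
#       logger -t checkigb2 "IP check failed, igb2 restarted"
#   fi
```

Keeping the probe and the restart in separate functions makes it easy to test the decision logic with `DRY_RUN=1` before letting it touch a live interface.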

                  OLBaID

                  Hi, I've been tracking the bug (thanks, cmb!):

                  https://redmine.pfsense.org/issues/6296

                  A comment on the bug tracker had this link:

                  https://forum.pfsense.org/index.php?topic=107471.msg602590#msg602590

                  This user disabled the following in System / Advanced / Networking:

                  I had all enabled:

                  Disable hardware checksum offload
                  Disable hardware TCP segmentation offload
                  Disable hardware large receive offload

                  Can anyone else here who is experiencing the issue confirm whether this helps? (I am working to get this hardware back online soon to test it myself.)

                  Not trying to muddy any waters, but hoping for a fast resolution.

                  Thanks
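
To compare notes on the offload settings, it helps to see which offload features an interface actually has enabled. The sketch below filters the `options=` line of `ifconfig` output for checksum, TSO, and LRO flags; the function name `offload_flags` is hypothetical, and it assumes the FreeBSD `options=...<FLAG,FLAG>` output format.

```shell
#!/bin/sh
# offload_flags (hypothetical helper): reads `ifconfig IFACE` output on stdin
# and prints the enabled checksum/TSO/LRO offload flags, one per line.
offload_flags() {
    sed -n 's/.*options=[0-9a-f]*<\(.*\)>.*/\1/p' |
        tr ',' '\n' |
        grep -E '^(RXCSUM|TXCSUM|TSO4|TSO6|LRO)(_IPV6)?$'
}

# Usage on the firewall (not executed here):
#   ifconfig igb2 | offload_flags
```

If the three "Disable hardware ..." boxes are ticked and applied, none of these flags should be listed for the interface.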

                    erdmensch

                    I had these options disabled even before the upgrade to 2.3.
                    Only disabling the CPU cores solves the issue for me.

                      jeffvfren

                      I'm having the same issue. I have disabled the CPU cores and am still monitoring.
                      Hopefully the fix is released ASAP.

                      Update:
                      It does not work for me; the issue just happened again.

                        h311m4n

                        Add us to the list as well.

                        We have a virtual pfSense cluster at our DRC site with an IPsec tunnel to our prod site. Since the update, we've had constant crashes of the master at the DRC site.

                        It has been stable for the past 2-3 days now, but with a vCenter session open, I see a constant CPU usage warning on the master on the other side. We have applied the CPU workaround too, but as soon as we launch our Veeam replications, we can be pretty sure that the IPsec tunnel will go down. It definitely looks like all the UDP traffic is basically causing a DoS…

                          cmb

                          We're still working on tracking this down. Have it narrowed down to something in our IPsec changes (which are just back-ports from FreeBSD -CURRENT), but that's still 80 change sets potentially related.

                            h311m4n

                            @cmb:

                            We're still working on tracking this down. Have it narrowed down to something in our IPsec changes (which are just back-ports from FreeBSD -CURRENT), but that's still 80 change sets potentially related.

                            Good to know you're working on it!  :)

                              marcvb

                              We are also having this problem with 2.3; it happens every week.
                              We use multiple IPsec connections.
                              pfSense runs inside VMware.
                              2 GB RAM
                              4 cores

                              I think the state table size is also bigger: 82% (165500/201000).
                              When I try to open the state table, the page also crashes with: Allowed memory size of 268435456 bytes exhausted.
                              I will put more RAM in the system when I have a downtime window.

                              ---
                              Yes, I got my downtime window and upgraded the RAM to 4 GB; I can now view the states without any problem.
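
The two numbers in this post are easy to sanity-check with shell integer arithmetic. The helper names below are hypothetical; the sketch only reproduces the figures quoted above (state table fill level and the memory limit from the error message).

```shell
#!/bin/sh
# state_pct CURRENT MAX (hypothetical helper): state table usage in whole percent.
state_pct() {
    echo $(( $1 * 100 / $2 ))
}

# bytes_to_mib BYTES (hypothetical helper): convert a byte count to MiB.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

state_pct 165500 201000      # prints 82
bytes_to_mib 268435456       # prints 256
```

So the error corresponds to a 256 MiB memory limit being hit while rendering roughly 165,000 states.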

                                j.koopmann

                                Happened again today twice but with different details.

                                1. LAN went down. For whatever reason, my script doing ifconfig down and up did not help. I tried manually with no luck. I retried with a longer sleep between down and up, and then it worked. This was the first crash in several days.

                                2. Approx. 60 minutes later LAN was down again. This time, however, I could not even get a response on the serial console and had to power-cycle the box. I was not able to find a core dump, but after logging in I saw a crash report and uploaded it. I am not sure whether it is related; I would say it happened around the time of (1).

                                
                                Fatal trap 12: page fault while in kernel mode
                                cpuid = 0; apic id = 00
                                fault virtual address	= 0x0
                                fault code		= supervisor read data, page not present
                                instruction pointer	= 0x20:0xffffffff80d22566
                                stack pointer	        = 0x28:0xfffffe001a38c590
                                frame pointer	        = 0x28:0xfffffe001a38c770
                                code segment		= base 0x0, limit 0xfffff, type 0x1b
                                			= DPL 0, pres 1, long 1, def32 0, gran 1
                                processor eflags	= interrupt enabled, resume, IOPL = 0
                                current process		= 12 (irq260: igb2:que 0)
                                FreeBSD 10.3-RELEASE #31 01118b4(RELENG_2_3): Thu Apr 28 03:57:55 CDT 2016
                                    root@ce23-amd64-builder:/builder/pfsense/tmp/obj/builder/pfsense/tmp/FreeBSD-src/sys/pfSense
                                

                                Regards,
                                  JP
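
When comparing crash reports from several boxes, the "current process" line is a quick way to see which kernel thread was running at the time of the panic. The sketch below pulls that field out of a trap message; the function name `panic_process` is hypothetical, and it assumes the field layout shown in the report above.

```shell
#!/bin/sh
# panic_process (hypothetical helper): reads a "Fatal trap" message on stdin
# and prints the value of its "current process" field.
panic_process() {
    sed -n 's/^[[:space:]]*current process[[:space:]]*=[[:space:]]*//p'
}

# Usage (file name is an assumption, not executed here):
#   panic_process < crash-report.txt
```

For the report above this yields the igb2 queue interrupt thread, which points at the same interface that keeps going dead.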

                                  jswope

                                  I have 5 locations and we use Dell R210 servers at all of them. All sites except 2 have this issue. The only difference between them is the Internet service provider.

                                  Site 1: R210, Charter fiber, no issues

                                  Site 2: AT&T LTE, Charter fiber, no issues

                                  Site 3: Charter fiber, issues every 2-3 days, watchdog error, no LAN routing

                                  Site 4: Charter coax internet, issues every 3 days

                                    marcvb

                                    Another of our firewalls got stuck just now. When I looked at the console, it was showing a different WAN address.
                                    It seems to have reverted to a private address; I will take a screenshot next time.

                                      cmb

                                      We seem to have a fix for this. It's not yet merged into the source tree, but will be soon.

                                      @jswope:

                                      I have 5 locations and we use Dell R210 servers at all of them. All sites except 2 have this issue. The only difference between them is the Internet service provider.

                                      It's not universal or even close to it, or we wouldn't have released with the issue. It's not ISP-specific; it just happens that the traffic profile of some systems will encounter it. Certain UDP streams are what trigger it.

                                        afreaken

                                        @cmb:

                                        We seem to have a fix for this. It's not yet merged into the source tree, but will be soon.

                                        @jswope:

                                        I have 5 locations and we use Dell R210 servers at all of them. All sites except 2 have this issue. The only difference between them is the Internet service provider.

                                        It's not universal or even close to it, or we wouldn't have released with the issue. It's not ISP-specific; it just happens that the traffic profile of some systems will encounter it. Certain UDP streams are what trigger it.

                                        Do you have an ETA?

                                          cmb

                                          Sometime this week.

                                            eeit

                                            Hello,

                                            Any news on this? Is there a special setting or a patch available to fix the issue?

                                            Thx.
