    Netgate Discussion Forum

    pfSense 2.3 LAN interface stops routing traffic - stops working after 2 or 3 days

    General pfSense Questions
    • B
      byusinger84 last edited by

      I've been having a really weird issue ever since we upgraded from pfSense 2.2.x to 2.3. Our company manages networks for approximately 60 sites. Whenever a new version of pfSense comes out, I usually upgrade 2-3 locations first to make sure it's stable in our environment before I upgrade the rest.

      All seemed to be going well until a few days after upgrading.

      For some reason every few days with seemingly no rhyme or reason the LAN interface stops servicing anything. I can still access the firewall via IPSec and the external WAN side, but inside the network I cannot ping the LAN interface nor can I get to the internet. Only a reboot seems to resolve the issue.

      The only package we use is pfBlockerNG, which is currently at version 2.0.10, though I'm going to update it to 2.0.11 right now to see if that helps. Other than that, it's a pretty vanilla install.

      Aside from the LAN interface not responding, there really isn't anything else I can see to indicate why this is happening. If I look at the logs, I see absolutely nothing out of the ordinary. Everything is the same old routine information. The ONLY other symptom that I have seen at all three locations is that the load average is pretty high, usually around 3.01-3.08. All three sites use different hardware (which is why they are my test sites) and yet this issue happens at each. Some are Intel, one is AMD; they all have different RAM sizes, SSDs vs HDs, different brands of interface cards (some Intel, some Realtek), etc.

      I have attached a screenshot of the dashboard so you can see what I'm seeing. Thoughts?
      ![lva frozen pfense - blacked.png](/public/imported_attachments/1/lva frozen pfense - blacked.png)

      • B
        byusinger84 last edited by

        If it's helpful, here is my screenshot after the reboot. You can see things look much more normal with regards to states and CPU utilization/load averages.

        ![lva not frozen pfsense.PNG](/public/imported_attachments/1/lva not frozen pfsense.PNG)

        • ?
          Guest last edited by

          The only package we use is pfBlockerNG, which is currently at version 2.0.10, though I'm going to update it to 2.0.11 right now to see if that helps. Other than that, it's a pretty vanilla install.

          Did you read this before upgrading? "Issues updating from 2.3-RC (or older 2.3 installs) to 2.3-RELEASE"

          Aside from the LAN interface not responding there really isn't anything else I can see to indicate why this is happening. If I look at the logs, I see absolutely nothing out of the ordinary. Everything is the same old routine information.

          The hard way would be to keep searching and find nothing; the easiest way is to go back to 2.2.6, or whatever
          version you were coming from, and reinstall or restore the config.xml and/or backups.

          Some users were reporting really big problems with the upgrade, others that there is a
          problem with pfBlockerNG, and still others that everything was fine and keeps running fine. So
          you see it is a really mixed picture, and I personally would be patient, go with the last known working
          version, and wait a while.

          The ONLY other symptom that I have seen happen at all three locations is that the load average is pretty high.

          Could it be that you are running out of RAM? Perhaps try upgrading the hardware first and
          then go on with the software? Only perhaps, I mean.

          Usually around 3.01-3.08. All three sites utilize different hardware (hence why they are my test sites) and yet this issue happens at each.

          Would it help to go with fully identical hardware, such as what the pfSense store or Netgate are offering?
          Or a custom build of your own, I mean, based on a well-working hardware platform? Only a thought.

          Some are Intel one is AMD, they all have different RAM sizes, SSDs vs HDs, different brands of interface cards (some Intel some Realtek), etc.

          On the other hand, if something is affected by that and all the hardware is the same, you could run into the
          next trap and end up with all devices bricked! So a spare machine or testing device would be nice
          to have too, as I see it.

          My guesses:

          • Sporadically running out of mbufs
          • Sporadically running out of RAM
          • If Squid is in the game and acting as a caching proxy server, could it perhaps be running out of storage space?
          • Too much traffic on the DMZ or LAN side, so that the LAN interface(s) go down or become saturated
            and stop responding?
          • iOS BYOD devices in tethering mode (with their own DHCP) or massive TOS signals from Apple's
            iPhones! Or any device on the guest side perhaps?

          If you are able to reach the WAN side from outside and also over VPN, that side is not the problem;
          in my eyes it must more or less be something on the LAN side.

          Personally, on top of this, I think the new version 2.3 and above should be given some or much more
          power, RAM, stronger hardware, or whatever else you would call it.

          • R
            rlrobs last edited by

            I have the same issue.

            My hardware:

            Dell Power edge 2900
            32GB RAM
            QuadCore
            HD SAS: 512GB

            Packages:
            Suricata
            pfBlockerNG
            Zabbix-agent-LTS
            OpenVPN Client Export

            :(

            • C
              cezarq last edited by

              I'm having a very similar problem too… After a few days pfSense 2.3 stops routing traffic, and after a reboot everything works fine... With the 2.2.6 release I never had this problem...

              • U
                ulicky last edited by

                Same problems in this post: https://forum.pfsense.org/index.php?topic=110716.0

                • B
                  byusinger84 last edited by

                  First of all, thank you for your thought out reply, BlueKobold. With that said, I think you are missing the issue at heart, so please don't be offended if I dismiss some of your points.

                  I have already considered going to 2.2.6. These are my test sites. They will stay on this release until it's stable enough to roll out to the rest, but thanks for the suggestion.

                  As you can see in my firewall screenshots, I am nowhere CLOSE to running out of RAM, so again that is not the issue. Also, this is why I test on multiple hardware platforms. They ALL exhibit the same behavior. The one I am showing you is the oldest but still plenty capable. Again, RAM/hardware "load" is not an issue.

                  Most of our sites use 2-3 different hardware models all fully supported by pfsense and even sold on the store. While they are not "official" hardware builds sold from the pfsense store, they are identical hardware to things the store has sold in the past. Again not trying to be dismissive but asking me to buy hardware is silly given that this is a community supported forum and I'm already using properly supported hardware.

                  I have plenty of spare machines for emergencies. That's not the issue. The issue I'm having is stated above and I would appreciate help on that issue specifically, not on how to run our networks. Again thanks for the suggestion but please stick to helping with the issue at hand.

                  Not running out of mbuf.

                  Not running out of RAM.

                  There is no squid server nor proxy cache server, etc.

                  A rogue DHCP server or iOS devices doing what you suggest would not be causing the behavior I am seeing.

                  Obviously, as I said, the WAN/VPN side is fine; it's just the LAN side. However, the interface itself stops responding, which means it's something in the firewall software. It's not a hardware issue, especially given that the various types of hardware I'm testing on ALL show the same behavior.

                  I understand that 2.3 requires a bit more juice but I promise it's not a load issue.

                  Sorry again if any of this comes across as ungrateful for your post. I wasn't trying to be rude, but more clearly state and reiterate that the problem is definitely software related and I would appreciate help that way. Thanks.

                  • C
                    cmb last edited by

                    byusinger84: I'm sending you a PM with an alternate kernel to try. It's not an issue I'm able to replicate, so trying to get some feedback from those who can. It at least won't make anything worse.

                    • B
                      byusinger84 last edited by

                      An update for anyone following this thread. The new kernel did not work.

                      • B
                        byusinger84 last edited by

                        Firewall froze again this morning. What is interesting is that even though there are no log entries that I can see as to why this is happening, the monitoring tab shows something interesting.

                        As you can see in the attached screen shot, CPU interrupt spikes to 25% and sits there. You can see this happened last night at 1:30 AM and continued straight through until this morning. Because this is a school there should be little to no utilization at that time of night. My guess is something is hanging.

                        ![pfsense 2.3 freeze edit.PNG](/public/imported_attachments/1/pfsense 2.3 freeze edit.PNG)

                        • A
                          adam65535 last edited by

                          The cpu interrupt spike is what happened when mine had the issue too.  It stayed high even when I failed back over to the 2.1.5 primary cluster member and very little traffic was going to the secondary 2.3 member.  I had to reboot to get the high interrupt back to near zero.  I had this happen interestingly right at 5am too when it happened several days ago :).

                          I haven't had another incident yet since 2 days ago so far.  Maybe it is just  a matter of time though.  These are the changes I made that were different than when it stopped passing some percentage of traffic on the WAN last time…

                          • Removed hw.igb.num_queues=2 from the /boot/loader.conf.local file (now defaults to 0 which I think means self tuning)
                          • Removed hw.igb.rx_process_limit=1000 from /boot/loader.conf.local file (now defaults to 100)
                          • Changed my IP Aliases from being assigned to the WAN interface to being assigned to CARP on secondary.  Primary still had IP Aliases on CARP IP as it was not upgraded.  This meant my IP Aliases were up on both members until I switched to the secondary.  It is a backup site so I didn't realize it as production traffic is not going there.  Only transaction logs and nfs copying between sites over ipsec.  I don't think this is related because CARP was disabled on the primary when the traffic stopped flowing 10 hours later so while the IPs were up on both for awhile when I switched to the secondary server no IP Aliases were up on the primary pfsense 2.1.5 server.

                          So far so good but I have only had one time where some percentage of traffic stopped being received from the WAN interface and it was 10 hours after switching to the secondary pfsense 2.3 firewall.

                          My scenario might just be related to the num_queues thing that I had leftover from the previous 2.1.5 version.
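
For anyone wanting to check whether similar leftover igb tunables are still set after an upgrade, here is a minimal sketch. The helper name `find_igb_tunables` is hypothetical, and it assumes the overrides live in /boot/loader.conf.local as described in the post above. It only reports matching lines and changes nothing.

```shell
#!/bin/sh
# Sketch: report hw.igb tunables carried over in a loader config file.
# find_igb_tunables is a hypothetical helper name, not a pfSense command.
find_igb_tunables() {
    # Default to the file named in the post; another path can be passed in.
    file=${1:-/boot/loader.conf.local}
    # A missing file simply means there are no local overrides to report.
    [ -r "$file" ] || return 0
    # Print only the two tunables mentioned above, ignoring everything else.
    grep -E '^[[:space:]]*hw\.igb\.(num_queues|rx_process_limit)=' "$file"
}
```

Calling `find_igb_tunables` with no argument prints any matching lines from /boot/loader.conf.local; no output means neither tunable is set in that file.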

                          • B
                            byusinger84 last edited by

                            @adam65535:

                            The cpu interrupt spike is what happened when mine had the issue too.  It stayed high even when I failed back over to the 2.1.5 primary cluster member and very little traffic was going to the secondary 2.3 member.  I had to reboot to get the high interrupt back to near zero.  I had this happen interestingly right at 5am too when it happened several days ago :).

                            I haven't had another incident yet since 2 days ago so far.  Maybe it is just  a matter of time though.  These are the changes I made that were different than when it stopped passing some percentage of traffic on the WAN last time…

                            • Removed hw.igb.num_queues=2 from the /boot/loader.conf.local file (now defaults to 0 which I think means self tuning)
                            • Removed hw.igb.rx_process_limit=1000 from /boot/loader.conf.local file (now defaults to 100)
                            • Changed my IP Aliases from being assigned to the WAN interface to being assigned to CARP on secondary.  Primary still had IP Aliases on CARP IP as it was not upgraded.  This meant my IP Aliases were up on both members until I switched to the secondary.  It is a backup site so I didn't realize it as production traffic is not going there.  Only transaction logs and nfs copying between sites over ipsec.  I don't think this is related because CARP was disabled on the primary when the traffic stopped flowing 10 hours later so while the IPs were up on both for awhile when I switched to the secondary server no IP Aliases were up on the primary pfsense 2.1.5 server.

                            So far so good but I have only had one time where some percentage of traffic stopped being received from the WAN interface and it was 10 hours after switching to the secondary pfsense 2.3 firewall.

                            My scenario might just be related to the num_queues thing that I had leftover from the previous 2.1.5 version.

                            None of those things apply to me in my case so the underlying issue must be something else.

                            • R
                              Redshift82r last edited by

                              Updated to 2.3 - running in a Parallels VM on OS X.  I had the same issue - LAN would stop responding, while WAN/VPN was responsive. I had pfBlockerNG running hourly updates, so changed that to daily. Also removed DHCP registration and Static DHCP registration from DNS Resolver. I don't know which one fixed it, but have not had a hang requiring reboot now for 5 days.

                              The system seemed to hang just after the top of the hour 3 or 4 times per day (hence the cron change) and there were also DHCP logs which stated that a static IP address had changed its MAC address from its MAC address to the same MAC address (hence the DHCP change in Resolver).

                              Whichever it was, no probs now.

                              Hope that helps

                              • B
                                byusinger84 last edited by

                                @Redshift82r:

                                Updated to 2.3 - running in a Parallels VM on OS X.  I had the same issue - LAN would stop responding, while WAN/VPN was responsive. I had pfBlockerNG running hourly updates, so changed that to daily. Also removed DHCP registration and Static DHCP registration from DNS Resolver. I don't know which one fixed it, but have not had a hang requiring reboot now for 5 days.

                                The system seemed to hang just after the top of the hour 3 or 4 times per day (hence the cron change) and there were also DHCP logs which stated that a static IP address had changed its MAC address from its MAC address to the same MAC address (hence the DHCP change in Resolver).

                                Whichever it was, no probs now.

                                Hope that helps

                                I removed pfblocker completely and it still froze so I don't think that's the culprit.

                                I do not have DHCP running on pfsense.

                                I wish those were my issues but sadly they are not.

                                • A
                                  adam65535 last edited by

                                  I don't use dhcp or pfblocker.  Firewall still running well for me going on 3 days.

                                  • B
                                    byusinger84 last edited by

                                    I set pfblocker to now only update once a day. We shall see.

                                    • D
                                      diablo266 last edited by

                                      I'm not running pfblocker or any services outside of openvpn/ipsec, other than that it's a clean install on this hardware: https://www.supermicro.com/products/motherboard/Atom/X10/A1SRi-2558F.cfm

                                      The problem still persists even with the custom kernel so I'd be surprised if any of these services in particular are the cause.

                                      • M
                                        mer last edited by

                                        If the custom kernel is the one provided by cmb, that disabled the netmap stuff.  What is a bit interesting is IPsec;  I think a lot, if not all, of the folks reporting this problem/symptom have IPsec and em/igb interfaces involved.
                                        Would it be possible to disable the IPsec VPNs temporarily?  That would be an interesting data point.  If the problem goes away, that narrows down the search for root cause.  If it doesn't, then it's not a factor.

                                        Just to make it clear, I'm not part of or associated with pfSense, just a user that likes puzzles.

                                        • B
                                          byusinger84 last edited by

                                          @mer:

                                          If the custom kernel is the one provided by cmb, that disabled the netmap stuff.  What is a bit interesting is IPsec;  I think a lot, if not all, of the folks reporting this problem/symptom have IPsec and em/igb interfaces involved.
                                          Would it be possible to disable the IPsec VPNs temporarily?  That would be an interesting data point.  If the problem goes away, that narrows down the search for root cause.  If it doesn't, then it's not a factor.

                                          Just to make it clear, I'm not part of or associated with pfSense, just a user that likes puzzles.

                                          I am actually thinking you are correct. Unfortunately I can't disable IPsec because it is essential for our sites to function properly. I am however using em/igb at these three test sites so that might explain a few things. Perhaps a bad network driver is causing the issue?

                                          • C
                                            cmb last edited by

                                            We've confirmed it's not specific to any particular NIC. Happens on em, igb, and re at a minimum, and probably anything.

                                            It seems to be related to UDP traffic streams across IPsec. Pipe dd from /dev/urandom into UDP netcat, with a bouncer sending the traffic back on the other side, and it's replicable within a few minutes to a few hours. Faster CPUs seem to be less likely to hit the issue quickly.

                                            It seems like it might be specific to SMP (>1 CPU core). I haven't been able to trigger it on an ALIX even beating it up much harder, relative to CPU speed, than faster systems where it is replicable.

                                            If you're running on a VM and seeing this issue with >1 vCPU, try changing your VM to 1 vCPU.

                                            If you're on a physical system, you can force the OS to disable additional cores. Take care when doing this, try it on a test system first if you're not comfortable with doing things along these lines.

                                            dmesg | grep cpu
                                            
                                            

                                            to find the apic. You'll have something like:

                                             cpu0 (BSP): APIC ID:  0
                                             cpu1 (AP): APIC ID:  2
                                            
                                            

                                            In /boot/loader.conf.local (create file if it doesn't exist), add:

                                            hint.lapic.2.disabled=1
                                            

                                            where 2 is the APIC ID of the cpu1 CPU core to disable. Replace accordingly if yours isn't 2. Add more lines like that for each additional CPU to disable so you only have cpu0 left enabled. Reboot.

                                            Then report back whether it continues to happen. That seems to suffice as a temporary workaround.
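
The hint lines described above can also be generated from the dmesg output rather than typed by hand. This is only a sketch: `gen_lapic_hints` is a hypothetical helper name, and it assumes the dmesg lines look like the `cpuN (AP): APIC ID:  M` sample shown earlier. It skips the boot processor (BSP) so cpu0 stays enabled.

```shell
#!/bin/sh
# Sketch: turn "cpuN (AP): APIC ID: M" lines into loader.conf.local hints.
# gen_lapic_hints is a hypothetical helper name, not part of pfSense.
gen_lapic_hints() {
    # Read dmesg-style text on stdin. Only application processors (AP) match,
    # so the boot processor (BSP, cpu0) is left enabled. The seen[] check
    # drops repeats, since the dmesg buffer can hold more than one boot.
    awk '/^[[:space:]]*cpu[0-9]+ \(AP\): APIC ID:/ && !seen[$NF]++ {
        print "hint.lapic." $NF ".disabled=1"
    }'
}
```

On the firewall, `dmesg | gen_lapic_hints` prints the candidate lines for review; they can then be added to /boot/loader.conf.local by hand before rebooting.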

                                            • J
                                              j.koopmann last edited by

                                              I seem to be running into a very similar problem. 2.3 was running fine for days, and yesterday evening everything was dead. Or so I believed. I restarted and everything was back. This morning: Internet dead again. Restarted, but still no change. Restarted again, everything OK. I then upgraded to 2.3.1, restarted, everything dead. So I attached the serial console, only to find that the system was up, WAN connected, responsive, LAN IP attached… Just no traffic on LAN.

                                              I simply did ifconfig igb2 down and up and all was running.
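
The down/up cycle mentioned above could be wrapped up roughly as follows. This is a sketch, not a fix: `bounce_iface` is a hypothetical helper name, and the `IFCONFIG` override exists only so the commands can be previewed (for example with `IFCONFIG=echo`) before running them for real from the console.

```shell
#!/bin/sh
# Sketch: cycle an interface down and back up, as described above.
# bounce_iface is a hypothetical helper; IFCONFIG=echo gives a dry run.
bounce_iface() {
    iface=${1:?usage: bounce_iface IFACE}
    cmd=${IFCONFIG:-ifconfig}
    # Only bring the interface up again if taking it down succeeded.
    "$cmd" "$iface" down && "$cmd" "$iface" up
}
```

For example, `IFCONFIG=echo bounce_iface igb2` just prints the two ifconfig argument lists, while `bounce_iface igb2` runs them.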

                                              Yes I have IPSEC (and I unfortunately need it). It is on an APU with 4 cores. What is puzzling me is why this is happening right after reboot?

                                              • A
                                                adam65535 last edited by

                                                I do nfs copies over ipsec between sites so we definitely send some udp traffic through the firewall.  It is not a lot of traffic though so maybe that is why it only happened once so far on my system.  Almost 4 days now.  Initially it happened after 10 hours.

                                                • B
                                                  byusinger84 last edited by

                                                  @cmb:

                                                  We've confirmed it's not specific to any particular NIC. Happens on em, igb, and re at a minimum, and probably anything.

                                                  It seems to be related to UDP traffic streams across IPsec. dd /dev/urandom to UDP netcat with a bouncer back on the other side, and it's replicable within a few minutes to a few hours. Faster CPUs seem to be less likely to hit the issue quickly.

                                                  It seems like it might be specific to SMP (>1 CPU core). I haven't been able to trigger it on an ALIX even beating it up much harder, relative to CPU speed, than faster systems where it is replicable.

                                                  If you're running on a VM and seeing this issue with >1 vCPU, try changing your VM to 1 vCPU.

                                                  If you're on a physical system, you can force the OS to disable additional cores. Take care when doing this, try it on a test system first if you're not comfortable with doing things along these lines.

                                                  dmesg | grep cpu
                                                  
                                                  

                                                  to find the apic. You'll have something like:

                                                   cpu0 (BSP): APIC ID:  0
                                                   cpu1 (AP): APIC ID:  2
                                                  
                                                  

                                                  In /boot/loader.conf.local (create file if it doesn't exist), add:

                                                  hint.lapic.2.disabled=1
                                                  

                                                  where 2 is the APIC ID of the cpu1 CPU core to disable. Replace accordingly if yours isn't 2. Add more lines like that for each additional CPU to disable so you only have cpu0 left enabled. Reboot.

                                                  Then report back if it happens again. That might suffice as a temporary workaround, and will give us additional data points in finding the specific root cause.

                                                  AWESOME! Thank you! I will do this and report back. Question, if the system is a dual core with hyper-threading, do the hyper-threads show up as a core as well and do they also need to be disabled?

                                                  • R
                                                    rlrobs last edited by

                                                    Has anyone tried disabling hyper-threading in the BIOS?

                                                    For tests only.

                                                    • A
                                                      adam65535 last edited by

                                                      I always disable hyperthreading on firewalls, so it was already disabled on my systems when I had the lost-packets incident.

                                                      • C
                                                        cmb last edited by

                                                        We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
                                                        https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

                                                        I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

                                                        • J
                                                          jswope last edited by

                                                          I am having the same issue: it randomly stops routing traffic to all VLANs. If I reboot, it will be fine for a day or so, then it does it again.

                                                          • Z
                                                            Zaphon last edited by

                                                            Add me to the list as well.  I've got this happening on both a SUPERMICRO SYS-5018A-FTN4 1U Rackmount Server (C2758 8-core) and an SG-2440 pfSense appliance (C2358 2-core).  Both have 4 Intel igb interfaces, and both have IPsec tunnels (required to reach the colo where our VoIP phone system is).  However, what's interesting is that it's NOT happening on my home system, which is an AMD Athlon system (a Dell I got for $250 from Best Buy 4+ years ago) with dual Intel em interfaces.  I have the same IPsec tunnels on it (so I have 4 total locations: 2 offices, my home, and a colo, all 4 running pfSense (the colo is still on 2.2.6), and they're all connected to each other (so every location has 3 IPsec tunnels)).  This didn't start occurring until 2.3.  I actually thought maybe this had something to do with AES-NI, since the only systems I have AES-NI on are the ones affected.

                                                            I'm going to try the single core trick to see if that helps for now, though I'm concerned with speed issues (as I have the 8-core C2758 in a location that has Gigabit because the C2358 maxed out around 600Mbit)..  NOTE:  I guess it's not the number of cores that cause the C2758 to be able to handle gigabit, but rather the faster clock speed..  Even with 1 core it's still able to handle the full gigabit..  So that's good..

                                                            • B
                                                              byusinger84 last edited by

                                                              @cmb:

                                                              We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
                                                              https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

                                                              I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

                                                              Disabled all but one core. I will let you know if I continue to have issues. Please let me know when there is a more permanent fix.

                                                              • O
                                                                OLBaID last edited by

                                                                Hello, add me as well. I recently upgraded hardware from an ALIX to a SuperMicro SBE200-9B with 4 igb NICs, and I am getting the watchdog timeout error on the new hardware (not on the ALIX): the LAN interface, igb1, drops randomly every few days:

                                                                https://dl.dropboxusercontent.com/u/42296/SuperMicroPfsense.JPG

                                                                Doing some research prior to finding this thread I found:

                                                                https://doc.pfsense.org/index.php/Disable_ACPI

                                                                Now that I've read this thread, I can try disabling the other cores for now. Hoping there is a solution soon.

                                                                I've loved pfSense for many years, can't say that enough!

                                                                • B
                                                                  breakaway last edited by

                                                                  I'm getting this as well. Most of my pfSense instances are virtual machines running on VMware ESXi.

                                                                  I use pfSense for building site-to-site IPSEC tunnels (Blowfish encryption).

                                                                  In my case it's happening when I see heavy loads across the IPSEC tunnel (this is normally at night, for running backups).

                                                                  It appears traffic stops completely on the LAN interface. If I look on the console, I see "em0: Watchdog timeout – resetting" or something to that effect (where em1 is my LAN interface).

                                                                  For encryption, I use Blowfish 256 bit with a SHA512 Hash Algorithm. DH Group Phase 1 - 8192 bit.

                                                                  For phase 2, I use ESP with Blowfish 256 bit with a SHA512 Hash Algorithm. PFS key group 18 - 8192 bit.

                                                                  After reducing the DH Key Group and PFS Key Group to 14 (2048 bit), I have noted an increase in stability (it hasn't locked up in about a week). I've just applied this "workaround" on a few other machines I manage and will report back on this.

                                                                  • J
                                                                    j.koopmann last edited by

                                                                    I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

                                                                    hint.lapic.1.disabled=1
                                                                    hint.lapic.2.disabled=1
                                                                    hint.lapic.3.disabled=1

                                                                    and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

                                                                    ifconfig igb2 down
                                                                    ifconfig igb2 up

                                                                    and 5-10 seconds later everything else was back online. I noticed tons of

                                                                    ifa_add_loopback_route: insertion failed: 17

                                                                    in dmesg however. dmesg also said

                                                                    cpu0 (BSP): APIC ID:  0
                                                                      cpu (AP): APIC ID:  1 (disabled)
                                                                      cpu (AP): APIC ID:  2 (disabled)
                                                                      cpu (AP): APIC ID:  3 (disabled)

                                                                    So I assume the cores ARE disabled! Something else going on? What do you pfsense gurus want me to do/debug the next time it happens?

                                                                    Regards,
                                                                      JP
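
A quick way to double-check that the hint.lapic lines took effect is to count the application processors the kernel reported as disabled at boot. This is only a minimal sh sketch: the function name is made up for illustration, and it assumes the boot messages are still available, either from dmesg or from /var/run/dmesg.boot.

```shell
#!/bin/sh
# count_disabled_aps (hypothetical helper): read boot messages on stdin and
# print how many "cpu (AP): APIC ID: N (disabled)" lines they contain.
count_disabled_aps() {
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    grep -c 'APIC ID:.*(disabled)' || true
}

# Typical use on the firewall (not run here):
#   count_disabled_aps < /var/run/dmesg.boot
#   sysctl -n hw.ncpu    # expected to report 1 when only the boot CPU is active
```

On the 4-core box described above, the count should come out as 3.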

                                                                    • C
                                                                      cmb last edited by

                                                                      @j.koopmann:

                                                                      I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

                                                                      hint.lapic.1.disabled=1
                                                                      hint.lapic.2.disabled=1
                                                                      hint.lapic.3.disabled=1

                                                                      and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

                                                                      ifconfig igb2 down
                                                                      ifconfig igb2 up

                                                                      and 5-10 seconds later everything else was back online.

                                                                      That all looks correct. The only really solid confirmation I have that it fixes it is with em and re NICs. They're single-queue, where igb is multi-queue, so it's possible there's more to it in the igb case. igb's num_queues could be set to 1, but that has a pretty significant impact on achievable top end throughput.

                                                                      An ifconfig down and up of the interfaces with SMP enabled doesn't do anything that I've seen, so the fact that the network comes back with that suggests it's "better" than before. Not that dead is any better than dead.
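
For anyone who wants to try the single-queue igb setting mentioned above, here is a rough sh sketch that writes the loader tunable without duplicating it on repeated runs. The helper name set_tunable is hypothetical, and the use of /boot/loader.conf.local as the place for local tunables is an assumption, so adjust it to wherever your install keeps them. Keep the throughput caveat in mind before applying it.

```shell
#!/bin/sh
# set_tunable FILE NAME VALUE (hypothetical helper)
# Ensure FILE ends up with exactly one "NAME=VALUE" line.
set_tunable() {
    file=$1; name=$2; value=$3
    touch "$file"
    # Keep every line that does not already assign NAME, then append the new value.
    awk -v prefix="${name}=" 'index($0, prefix) != 1' "$file" > "${file}.new"
    printf '%s=%s\n' "$name" "$value" >> "${file}.new"
    mv "${file}.new" "$file"
}

# Typical use on the firewall, followed by a reboot (not run here):
#   set_tunable /boot/loader.conf.local hw.igb.num_queues 1
#   sysctl hw.igb.num_queues    # check the value after the reboot
```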

                                                                      • J
                                                                        j.koopmann last edited by

                                                                        Except that the ifconfig down/up trick worked even before I disabled the cores… :-)

                                                                        I now have a cron job that checks the LAN interface every minute; if it cannot ping internal systems, it restarts the interface and logs that to system.log. If you need me to run additional debugging: go ahead, please! :-)

                                                                        Regards,
                                                                          JP

                                                                        • E
                                                                          erdmensch last edited by

                                                                          Thank you for this workaround!

                                                                          Disabling the CPUs seems to solve my problems for the moment.

                                                                          On my Supermicro MBD-X7SPA-HF with a quad-port em PCIe card, using multiple VLANs and IPsec tunnels:

                                                                          • Traffic on the VLAN interfaces dies after some time, anywhere from 15 minutes to a few hours at most.
                                                                          • WAN and pfSense itself still respond
                                                                          • A reboot solves the problem (needed multiple times a day)

                                                                          Same behaviour with a replacement ASRock Q1900M and quad-port em card.

                                                                          Other observations:

                                                                          An old ALIX 2D13 runs fine with the same VLAN/IPsec setup (very slow, but OK as a backup).

                                                                          Two boxes with a Gigabyte J1900N-D3V also run fine without disabling the cores:

                                                                          • using the onboard re interfaces
                                                                          • they also have some IPsec tunnels
                                                                          • they do not have VLAN interfaces
                                                                          • C
                                                                            covex last edited by

                                                                            @j.koopmann:

                                                                            Except that the ifconfig down/up trick worked even before I disabled the cores… :-)

                                                                            I now have a cron job that checks the LAN interface every minute; if it cannot ping internal systems, it restarts the interface and logs that to system.log. If you need me to run additional debugging: go ahead, please! :-)

                                                                            Regards,
                                                                              JP

                                                                            Hey JP, could you share the script for that cron job?
                                                                            In my case the ALIX APU box stays up with no problems. The SG-2440 locked up once, and the Soekris 6501 locks up every couple of days.

                                                                            • J
                                                                              j.koopmann last edited by

                                                                              Sure..

                                                                              
                                                                              #!/usr/local/bin/perl
                                                                              # Restart the LAN interface when neither internal host answers a ping.
                                                                              # Meant to run from cron as root (ICMP pings need raw sockets).

                                                                              use strict;
                                                                              use warnings;
                                                                              use Net::Ping;
                                                                              use Sys::Syslog;

                                                                              my $iface           = "igb2";
                                                                              my $server_to_ping  = "192.168.1.1";
                                                                              my $server_to_ping2 = "192.168.1.2";

                                                                              # Return 1 if the host answers an ICMP echo request, 0 otherwise.
                                                                              sub check_ping_server
                                                                              {
                                                                                  my ($host) = @_;
                                                                                  my $ping = Net::Ping->new('icmp');
                                                                                  my $host_alive = $ping->ping($host) ? 1 : 0;
                                                                                  $ping->close();
                                                                                  return $host_alive;
                                                                              }

                                                                              openlog("checkigb2", "ndelay", "user");
                                                                              if (!check_ping_server($server_to_ping) && !check_ping_server($server_to_ping2))
                                                                              {
                                                                                  # Both hosts are unreachable: bounce the interface and log it.
                                                                                  system("ifconfig $iface down");
                                                                                  sleep 2;
                                                                                  system("ifconfig $iface up");
                                                                                  sleep 5;
                                                                                  syslog('notice', 'IP check failed, %s restarted', $iface);
                                                                              }
                                                                              else
                                                                              {
                                                                                  syslog('notice', 'IP check ok');
                                                                              }
                                                                              closelog();

                                                                              exit;
                                                                              
                                                                              

                                                                              Not the best piece of coding but it seems to do the trick.

                                                                              Regards,
                                                                                JP
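
For boxes where a plain shell script is preferred over Perl, the same check can be sketched in sh. This is an illustrative rewrite of the logic above, not a tested replacement: the variable names, the checklan tag and the ping flags (-c count, -t timeout as on FreeBSD) are assumptions to adapt to your system.

```shell
#!/bin/sh
# Sketch of the same LAN check in plain sh. IFACE, TARGETS, PING, IFCONFIG and
# LOGGER are illustrative variables that can be overridden from the environment.
IFACE="${IFACE:-igb2}"
TARGETS="${TARGETS:-192.168.1.1 192.168.1.2}"
PING="${PING:-ping}"
IFCONFIG="${IFCONFIG:-ifconfig}"
LOGGER="${LOGGER:-logger}"

# Succeeds if at least one target answers a single ICMP echo within 5 seconds.
any_target_alive() {
    for host in $TARGETS; do
        "$PING" -c 1 -t 5 "$host" >/dev/null 2>&1 && return 0
    done
    return 1
}

# Bounce the interface only when every target is unreachable, and log the result.
check_lan() {
    if any_target_alive; then
        "$LOGGER" -t checklan "IP check ok"
    else
        "$IFCONFIG" "$IFACE" down
        sleep 2
        "$IFCONFIG" "$IFACE" up
        "$LOGGER" -t checklan "IP check failed, $IFACE restarted"
    fi
}
```

To use it from cron, save it as a script with a final check_lan line and schedule it every minute as root.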

                                                                              • O
                                                                                OLBaID last edited by

                                                                                Hi, I've been tracking the bug (thanks cmb!):

                                                                                https://redmine.pfsense.org/issues/6296

                                                                                A comment on the bug tracker had this link:

                                                                                https://forum.pfsense.org/index.php?topic=107471.msg602590#msg602590

                                                                                This user disabled the following in System / Advanced / Networking (I had all of them enabled):

                                                                                Disable hardware checksum offload
                                                                                Disable hardware TCP segmentation offload
                                                                                Disable hardware large receive offload

                                                                                Can anyone else here who is experiencing the issue confirm whether this helps? (I am working to get this hardware back online soon to test it myself.)

                                                                                Not trying to muddy any waters, but hoping for a fast resolution.

                                                                                Thanks
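
To see which of these offloads are actually active on an interface before and after changing the GUI settings, something along these lines may help. It is a small sh sketch with a made-up function name, and it assumes the usual FreeBSD ifconfig layout where capabilities appear in an options=...<...> line.

```shell
#!/bin/sh
# enabled_offloads (hypothetical helper): read `ifconfig IFACE` output on stdin
# and print, one per line, the offload capabilities listed in its options line.
enabled_offloads() {
    grep 'options=' \
        | tr '<>,' '\n\n\n' \
        | grep -E '^((RX|TX)CSUM(_IPV6)?|TSO4|TSO6|LRO)$' || true
}

# Typical use on the firewall (not run here):
#   ifconfig igb1 | enabled_offloads
# A temporary, non-persistent way to switch them off for a test:
#   ifconfig igb1 -rxcsum -txcsum -tso -lro
```

An empty result means none of the checksum, TSO or LRO offloads are currently enabled on that interface.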

                                                                                • E
                                                                                  erdmensch last edited by

                                                                                  I had these options disabled even before the upgrade to 2.3.
                                                                                  Only disabling the CPU cores solves the issue for me.

                                                                                  • J
                                                                                    jeffvfren last edited by

                                                                                    I'm having the same issue. I've disabled the CPU cores and am still monitoring.
                                                                                    Hopefully the fix is released ASAP.

                                                                                    Update:
                                                                                    It does not work for me; the issue just happened again.
