pfSense 2.3 LAN interface stops routing traffic - stops working after 2 or 3 days

diablo266

I'm not running pfBlocker or any services other than OpenVPN/IPsec. Apart from that, it's a clean install on this hardware: https://www.supermicro.com/products/motherboard/Atom/X10/A1SRi-2558F.cfm

The problem still persists even with the custom kernel so I'd be surprised if any of these services in particular are the cause.

mer

If the custom kernel is the one provided by cmb, that one disabled the netmap stuff. What is a bit interesting is IPsec; I think a lot, if not all, of the folks reporting this problem/symptom have IPsec and em/igb interfaces involved.
Would it be possible to disable the IPsec VPNs temporarily? That would be an interesting data point. If the problem goes away, that narrows down the search for root cause. If it doesn't, then it's not a factor.

Just to make it clear, I'm not part of or associated with pfSense, just a user that likes puzzles.

byusinger84

@mer:

If the custom kernel is the one provided by cmb, that one disabled the netmap stuff. What is a bit interesting is IPsec; I think a lot, if not all, of the folks reporting this problem/symptom have IPsec and em/igb interfaces involved.
Would it be possible to disable the IPsec VPNs temporarily? That would be an interesting data point. If the problem goes away, that narrows down the search for root cause. If it doesn't, then it's not a factor.

Just to make it clear, I'm not part of or associated with pfSense, just a user that likes puzzles.

I am actually thinking you are correct. Unfortunately I can't disable IPsec because it is essential for our sites to function properly. I am however using em/igb at these three test sites so that might explain a few things. Perhaps a bad network driver is causing the issue?

cmb

We've confirmed it's not specific to any particular NIC. Happens on em, igb, and re at a minimum, and probably anything.

It seems to be related to UDP traffic streams across IPsec. dd /dev/urandom to UDP netcat with a bouncer back on the other side, and it's replicable within a few minutes to a few hours. Faster CPUs seem to be less likely to hit the issue quickly.

It seems like it might be specific to SMP (>1 CPU core). I haven't been able to trigger it on an ALIX even beating it up much harder, relative to CPU speed, than faster systems where it is replicable.

If you're running on a VM and seeing this issue with >1 vCPU, try changing your VM to 1 vCPU.

If you're on a physical system, you can force the OS to disable additional cores. Take care when doing this; try it on a test system first if you're not comfortable with doing things along these lines.

dmesg | grep cpu

to find the APIC IDs. You'll have something like:

 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  2

In /boot/loader.conf.local (create file if it doesn't exist), add:

hint.lapic.2.disabled=1

where 2 is the APIC ID of the cpu1 CPU core to disable. Replace accordingly if yours isn't 2. Add more lines like that for each additional CPU to disable so you only have cpu0 left enabled. Reboot.

Then report back whether it continues to happen. That seems to suffice as a temporary workaround.
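As a rough aid for the steps above, the hint lines can be generated from the boot messages instead of being typed by hand. This is only a sketch: lapic_hints is a hypothetical helper name (not part of pfSense), and it assumes the CPU lines look like the sample above.

```sh
#!/bin/sh
# lapic_hints: hypothetical helper, not part of pfSense.
# Reads dmesg-style lines on stdin and prints one hint.lapic.N.disabled=1
# line for every application processor (AP). Lines for the boot processor
# (BSP) never match, so cpu0 stays enabled.
lapic_hints() {
    sed -n 's/.*(AP): APIC ID: *\([0-9][0-9]*\).*/hint.lapic.\1.disabled=1/p'
}

# Example: review the output, then add it to /boot/loader.conf.local by hand.
#   dmesg | grep cpu | lapic_hints
```

If the dmesg buffer has already rolled over, FreeBSD keeps the boot messages in /var/run/dmesg.boot, so feeding that file to the function should work as well.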

j.koopmann

I seem to be running into a very similar problem. 2.3 was running fine for days, and yesterday evening everything was dead. Or so I believed. I restarted and everything was back. This morning: Internet dead again. Restarted, but still no change. Restarted again, everything OK. I then upgraded to 2.3.1, restarted, everything dead. So I attached the serial console, only to find that the system was up, WAN connected, responsive, LAN IP attached… Just no traffic on LAN.

I simply did ifconfig igb2 down and up and all was running.

Yes I have IPSEC (and I unfortunately need it). It is on an APU with 4 cores. What is puzzling me is why this is happening right after reboot?

adam65535

I do NFS copies over IPsec between sites, so we definitely send some UDP traffic through the firewall. It is not a lot of traffic though, so maybe that is why it has only happened once so far on my system. It has now been up almost 4 days; the first time, it happened after 10 hours.

byusinger84

@cmb:

We've confirmed it's not specific to any particular NIC. Happens on em, igb, and re at a minimum, and probably anything.

It seems to be related to UDP traffic streams across IPsec. dd /dev/urandom to UDP netcat with a bouncer back on the other side, and it's replicable within a few minutes to a few hours. Faster CPUs seem to be less likely to hit the issue quickly.

It seems like it might be specific to SMP (>1 CPU core). I haven't been able to trigger it on an ALIX even beating it up much harder, relative to CPU speed, than faster systems where it is replicable.

If you're running on a VM and seeing this issue with >1 vCPU, try changing your VM to 1 vCPU.

If you're on a physical system, you can force the OS to disable additional cores. Take care when doing this; try it on a test system first if you're not comfortable with doing things along these lines.

dmesg | grep cpu

to find the APIC IDs. You'll have something like:

 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  2

In /boot/loader.conf.local (create file if it doesn't exist), add:

hint.lapic.2.disabled=1

where 2 is the APIC ID of the cpu1 CPU core to disable. Replace accordingly if yours isn't 2. Add more lines like that for each additional CPU to disable so you only have cpu0 left enabled. Reboot.

Then report back if it happens again. That might suffice as a temporary workaround, and will give us additional data points in finding the specific root cause.

AWESOME! Thank you! I will do this and report back. Question, if the system is a dual core with hyper-threading, do the hyper-threads show up as a core as well and do they also need to be disabled?

rlrobs

Has anyone tried disabling Hyper-Threading in the BIOS?

For tests only.

adam65535

I always disable hyperthreading on firewalls, so it was disabled on my systems when I had the ~~crash~~ lost packets.

cmb

We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

jswope

I am having the same issue: it randomly stops routing traffic to all VLANs. If I reboot, it will be fine for a day or so, then it does it again.

Zaphon

Add me to the list as well. I've got this happening on both a SUPERMICRO SYS-5018A-FTN4 1U rackmount server (C2758, 8-core) and an SG-2440 pfSense appliance (C2358, 2-core). Both have four Intel igb interfaces, and both have IPsec tunnels (required to reach the colo where our VoIP phone system is). What's interesting, however, is that it's NOT happening on my home system, an AMD Athlon box (a Dell I got for $250 from Best Buy 4+ years ago) with dual Intel em interfaces. It has the same IPsec tunnels: I have 4 locations in total (2 offices, my home, and a colo), all running pfSense (the colo is still on 2.2.6) and all connected to each other, so every location has 3 IPsec tunnels. This didn't start occurring until 2.3. I actually thought this might have something to do with AES-NI, since the only systems I have with AES-NI are the ones affected.

I'm going to try the single-core trick to see if that helps for now, though I'm concerned about speed (I put the 8-core C2758 in a location that has gigabit because the C2358 maxed out around 600 Mbit). NOTE: I guess it's not the number of cores that lets the C2758 handle gigabit, but rather the faster clock speed. Even with 1 core it's still able to handle the full gigabit, so that's good.

byusinger84

@cmb:

We've confirmed that the problem no longer occurs after disabling all but one CPU core. So that looks to be a viable immediate workaround for most users. See instructions in my post here.
https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388

I doubt if Hyperthreading is relevant either way. It happens in any SMP system including ones without HT. Any HT cores will also need to be disabled for the workaround, but not because they're HT, just additional cores.

Disabled all but one core. I will let you know if I continue to have issues. Please let me know when there is a more permanent fix.

OLBaID

Hello, add me to the list as well. I recently upgraded hardware from an ALIX to a SuperMicro SBE200-9B with 4 igb NICs. I am getting the watchdog timeout error on the new hardware (not on the ALIX), and the LAN interface (igb1) drops randomly every few days:

https://dl.dropboxusercontent.com/u/42296/SuperMicroPfsense.JPG

Doing some research prior to finding this thread I found:

https://doc.pfsense.org/index.php/Disable_ACPI

Having read this thread, I can try disabling the other cores for now. Hoping there is a solution soon.

I've loved pfSense for many years, can't say that enough!

breakaway

I'm getting this as well. Most of my pfSense installs are virtual machines running on VMware ESXi.

I use pfSense for building site-to-site IPSEC tunnels (Blowfish encryption).

In my case it's happening when I see heavy loads across the IPSEC tunnel (this is normally at night, for running backups).

It appears traffic stops completely on the LAN interface. If I look on the console, I see "em0: Watchdog timeout -- resetting" or something to that effect (where em1 is my LAN interface).

For encryption, I use Blowfish 256 bit with a SHA512 Hash Algorithm. DH Group Phase 1 - 8192 bit.

For phase 2, I use ESP with Blowfish 256 bit with a SHA512 Hash Algorithm. PFS key group 18 - 8192 bit.

After reducing the DH key group and PFS key group to 14 (2048 bit), I have noted an increase in stability (it hasn't locked up in about a week). I've just applied this "workaround" on a few other machines I manage and will report back.

j.koopmann

I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

hint.lapic.1.disabled=1
hint.lapic.2.disabled=1
hint.lapic.3.disabled=1

and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

ifconfig igb2 down
ifconfig igb2 up

and 5-10 seconds later everything else was back online. I noticed tons of

ifa_add_loopback_route: insertion failed: 17

in dmesg however. dmesg also said

cpu0 (BSP): APIC ID: 0
cpu (AP): APIC ID: 1 (disabled)
cpu (AP): APIC ID: 2 (disabled)
cpu (AP): APIC ID: 3 (disabled)

So I assume the cores ARE disabled! Something else going on? What do you pfsense gurus want me to do/debug the next time it happens?

Regards,
JP
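A quick way to double-check a listing like the one above is to count the CPU lines that are not marked as disabled. This is a minimal sketch; enabled_cpus is a hypothetical helper name, and it assumes the "(disabled)" suffix appears exactly as in the dmesg excerpt in this post.

```sh
#!/bin/sh
# enabled_cpus: hypothetical helper, not part of pfSense.
# Counts the CPU lines on stdin that carry an APIC ID and are NOT
# marked "(disabled)". With the workaround applied, this should be 1.
enabled_cpus() {
    grep 'APIC ID:' | grep -vc '(disabled)'
}

# Example:
#   dmesg | enabled_cpus
```

On the running system, sysctl -n hw.ncpu should agree with that count.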

cmb

@j.koopmann:

I am afraid this did not do the trick. It happened again yesterday evening. So I disabled Cores 1,2,3 with

hint.lapic.1.disabled=1
hint.lapic.2.disabled=1
hint.lapic.3.disabled=1

and rebooted. This morning: LAN was dead once again. I logged in on the serial console, did

ifconfig igb2 down
ifconfig igb2 up

and 5-10 seconds later everything else was back online.

That all looks correct. The only really solid confirmation I have that it fixes it is with em and re NICs. They're single-queue, where igb is multi-queue, so it's possible there's more to it in the igb case. igb's num_queues could be set to 1, but that has a pretty significant impact on achievable top end throughput.
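For anyone who wants to test the single-queue idea on a spare box, here is a small sketch that adds a loader tunable without duplicating it on repeated runs. The helper name set_loader_tunable is hypothetical, and the tunable name hw.igb.num_queues is my assumption for what "igb's num_queues" refers to; check it against your driver before relying on it.

```sh
#!/bin/sh
# set_loader_tunable FILE NAME VALUE: hypothetical helper.
# Appends NAME=VALUE to FILE unless that exact line is already present.
# It does not rewrite an existing line that has a different value.
set_loader_tunable() {
    file=$1
    line="$2=$3"
    touch "$file" || return 1
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (test system only, then reboot). The tunable name is an assumption:
#   set_loader_tunable /boot/loader.conf.local hw.igb.num_queues 1
```

Keep the throughput impact mentioned above in mind; this is meant for narrowing down the cause, not as a recommended setting.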

An ifconfig down and up of the interfaces with SMP enabled doesn't do anything that I've seen, so the fact that the network comes back with it suggests it's "better" than before. Not that dead is any better than dead.

j.koopmann

Only that the ifconfig down/up stuff even worked before I disabled the cores… :-)

I now have a cron job that checks the LAN interface every minute and, if it cannot ping internal systems, restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

Regards,
JP
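A watchdog along the lines described above could look roughly like this. It is a hypothetical sketch, not JP's actual script: the function name, the 2-second pause, and the interface/IP in the usage line are all assumptions.

```sh
#!/bin/sh
# Hypothetical LAN watchdog sketch; not the script used in this thread.
# Usage: lan_watchdog.sh IFACE TARGET_IP   (e.g. igb2 and an internal host)

# cron may run with a minimal PATH, so make sure /sbin is searched.
PATH=/sbin:/bin:/usr/sbin:/usr/bin:$PATH
export PATH

# lan_watchdog IFACE TARGET: ping TARGET three times; if nothing answers,
# log the event via syslog and bounce IFACE (down, short pause, up).
lan_watchdog() {
    iface=$1
    target=$2
    if ping -c 3 "$target" >/dev/null 2>&1; then
        return 0
    fi
    logger -t lan_watchdog "no reply from $target, restarting $iface"
    ifconfig "$iface" down
    sleep 2
    ifconfig "$iface" up
}

# Only act when both arguments are given, so the file can be sourced safely.
if [ "$#" -ge 2 ]; then
    lan_watchdog "$1" "$2"
fi
```

Run from cron once a minute with something like "lan_watchdog.sh igb2 192.168.1.10" (interface and address are placeholders). Pick a ping target that is always up, otherwise the interface gets bounced for no reason.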

erdmensch

Thank you for this workaround!

Disabling the CPUs seems to solve my problems for the moment.

On my Supermicro MBD-X7SPA-HF with Quad em pcie card, using multiple vlans and ipsec tunnels:

Traffic on vlan Interfaces dead after some time. From 15min up to a few hours at most.
Wan and pfsense still responding
Reboot solves the problem (multiple times a day)

Same behaviour with a replacement asrock Q1900M and Quad em card.

Other observations:

Old alix 2d13 runs fine with the same vlan/ipsec setup (very slow, but ok as backup).

2 boxes with a gigabyte j1900n-d3v also run fine without disabling the cores:

using the onboard re interfaces
they also have some ipsec tunnels
they do not have vlan interfaces

covex

@j.koopmann:

Only that the ifconfig down/up stuff even worked before I disabled the cores… :-)

I now have a cronjob that checks the LAN interface every minute and if it cannot ping internal systems restarts the interface and logs it to system.log. If you need me to run additional debugs: Go ahead please! :-)

Regards,
JP

Hey JP, could you share the script for that cron job?
In my case the ALIX APU box stays up with no problems. The SG-2440 locked up once, and the Soekris 6501 locks up every couple of days.