pfSense 2.3 LAN interface stops routing traffic - stops working after 2 or 3 days
-
First of all, thank you for your thought-out reply, BlueKobold. That said, I think you are missing the issue at heart, so please don't be offended if I dismiss some of your points.
I have already considered going to 2.2.6. These are my test sites. They will stay on this release until it's stable enough to roll out to the rest, but thanks for the suggestion.
As you can see in my firewall screenshots, I am nowhere CLOSE to running out of RAM, so again, that is not the issue. Also, this is why I test on multiple hardware platforms: they ALL exhibit the same behavior. The one I am showing you is the oldest, but it is still plenty capable. Again, RAM/hardware "load" is not an issue.
Most of our sites use 2-3 different hardware models, all fully supported by pfSense. While they are not "official" hardware builds sold from the pfSense store, they are identical to hardware the store has sold in the past. Again, not trying to be dismissive, but asking me to buy hardware is silly given that this is a community-supported forum and I'm already using properly supported hardware.
I have plenty of spare machines for emergencies. That's not the issue. The issue I'm having is stated above and I would appreciate help on that issue specifically, not on how to run our networks. Again thanks for the suggestion but please stick to helping with the issue at hand.
Not running out of mbuf.
Not running out of RAM.
There is no Squid server, proxy/cache server, etc.
A rogue DHCP server or iOS devices doing what you suggest would not be causing the behavior I am seeing.
Obviously, as I said, the WAN/VPN side is fine; it's just the LAN side. The interface stops responding, which means it's something in the firewall software itself. It's not a hardware issue, especially given that the various types of hardware I'm testing on ALL show the same behavior.
I understand that 2.3 requires a bit more juice but I promise it's not a load issue.
Sorry again if any of this comes across as ungrateful for your post. I wasn't trying to be rude, but rather to clearly state and reiterate that the problem is definitely software related, and I would appreciate help in that direction. Thanks.
-
byusinger84: I'm sending you a PM with an alternate kernel to try. It's not an issue I'm able to replicate, so I'm trying to get some feedback from those who can. At the very least it won't make anything worse.
-
An update for anyone following this thread. The new kernel did not work.
-
The firewall froze again this morning. What is interesting is that even though there are no log entries I can see as to why this is happening, the monitoring tab shows something notable.
As you can see in the attached screenshot, CPU interrupt spikes to 25% and sits there. This happened last night at 1:30 AM and continued straight through until this morning. Because this is a school, there should be little to no utilization at that time of night. My guess is something is hanging.
![pfsense 2.3 freeze edit.PNG](/public/imported_attachments/1/pfsense 2.3 freeze edit.PNG)
-
The CPU interrupt spike is what happened when mine had the issue too. It stayed high even after I failed back over to the 2.1.5 primary cluster member, when very little traffic was going to the secondary 2.3 member. I had to reboot to get the high interrupt load back to near zero. Interestingly, mine also happened right at 5 AM when it hit several days ago :).
I haven't had another incident since 2 days ago. Maybe it is just a matter of time, though. These are the changes I made since the last time it stopped passing some percentage of traffic on the WAN…
- Removed hw.igb.num_queues=2 from /boot/loader.conf.local (it now defaults to 0, which I think means self-tuning; see the sketch after this post)
- Removed hw.igb.rx_process_limit=1000 from /boot/loader.conf.local (it now defaults to 100)
- Changed my IP aliases from being assigned to the WAN interface to being assigned to a CARP VIP on the secondary. The primary still had its IP aliases on the CARP IP, as it was not upgraded. This meant my IP aliases were up on both members until I switched to the secondary. It is a backup site, so I didn't notice, since production traffic doesn't go there; only transaction logs and NFS copying between sites over IPsec. I don't think this is related, because CARP was disabled on the primary when traffic stopped flowing 10 hours later, so while the IPs were up on both for a while, once I switched to the secondary no IP aliases were up on the primary pfSense 2.1.5 server.
So far so good, but I have only had one instance where some percentage of traffic stopped being received on the WAN interface, and that was 10 hours after switching to the secondary pfSense 2.3 firewall.
My scenario might just be related to the num_queues setting left over from the previous 2.1.5 install.
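In case anyone else has similar leftovers from a 2.1.5-era install, this is roughly what I took out of the file (a minimal sketch; your tunable values may differ):
# /boot/loader.conf.local — igb(4) tunables removed after the 2.3 upgrade
#hw.igb.num_queues=2          # removed; the default of 0 lets the driver size queues itself
#hw.igb.rx_process_limit=1000   # removed; the driver default is 100
Reboot afterward so the loader re-reads the file.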
-
None of those things apply in my case, so the underlying issue must be something else.
-
Updated to 2.3, running in a Parallels VM on OS X. I had the same issue: the LAN would stop responding while WAN/VPN stayed responsive. I had pfBlockerNG running hourly updates, so I changed that to daily. I also removed DHCP registration and static DHCP registration from the DNS Resolver. I don't know which one fixed it, but I have not had a hang requiring a reboot for 5 days now.
The system seemed to hang just after the top of the hour 3 or 4 times per day (hence the cron change), and there were also DHCP log entries stating that a static IP address had changed its MAC address from its MAC address to the same MAC address (hence the Resolver change).
Whichever it was, no problems now.
Hope that helps.
-
I removed pfBlocker completely and it still froze, so I don't think that's the culprit.
I do not have DHCP running on pfsense.
I wish those were my issues but sadly they are not.
-
I don't use DHCP or pfBlocker. The firewall is still running well for me, going on 3 days.
-
I set pfBlocker to update only once a day now. We shall see.
-
I'm not running pfBlocker or any services outside of OpenVPN/IPsec; beyond that, it's a clean install on this hardware: https://www.supermicro.com/products/motherboard/Atom/X10/A1SRi-2558F.cfm
The problem persists even with the custom kernel, so I'd be surprised if any of these services in particular were the cause.
-
If the custom kernel is the one provided by cmb, that disabled the netmap stuff. What is a bit interesting is IPsec; I think many or all of the folks reporting this problem/symptom have IPsec and em/igb interfaces involved.
Would it be possible to disable the IPsec VPNs temporarily? That would be an interesting data point. If the problem goes away, that narrows down the search for the root cause. If it doesn't, then it's not a factor. Just to make it clear, I'm not part of or associated with pfSense, just a user who likes puzzles.
-
@mer:
Would it be possible to disable the IPsec VPNs temporarily? That would be an interesting data point.
I actually think you are correct. Unfortunately, I can't disable IPsec because it is essential for our sites to function properly. I am, however, using em/igb at these three test sites, so that might explain a few things. Perhaps a bad network driver is causing the issue?
-
We've confirmed it's not specific to any particular NIC. Happens on em, igb, and re at a minimum, and probably anything.
It seems to be related to UDP traffic streams across IPsec: dd /dev/urandom into UDP netcat with a bouncer back on the other side, and it's replicable within a few minutes to a few hours (rough sketch below). Faster CPUs seem to be less likely to hit the issue quickly.
It seems like it might be specific to SMP (>1 CPU core). I haven't been able to trigger it on an ALIX even beating it up much harder, relative to CPU speed, than faster systems where it is replicable.
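Roughly, the reproduction looks like this (a sketch only; host IPs and port numbers are placeholders, and netcat flag syntax varies between implementations):
# far side of the IPsec tunnel: receive the UDP stream and bounce it back
nc -u -l 5001 | nc -u 10.0.1.10 5002
# near side: absorb the bounced stream in the background
nc -u -l 5002 > /dev/null &
# then blast random data across the tunnel
dd if=/dev/urandom bs=1k | nc -u 10.0.2.10 5001
Let that run and watch for the interface to stop passing traffic.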
If you're running on a VM and seeing this issue with >1 vCPU, try changing your VM to 1 vCPU.
If you're on a physical system, you can force the OS to disable the additional cores. Take care when doing this; try it on a test system first if you're not comfortable doing things along these lines.
dmesg | grep cpu
to find the APIC IDs. You'll see something like:
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 2
In /boot/loader.conf.local (create file if it doesn't exist), add:
hint.lapic.2.disabled=1
where 2 is the APIC ID of the cpu1 core to disable; replace accordingly if yours isn't 2. Add a line like that for each additional CPU to disable, so that only cpu0 is left enabled (consolidated example below). Reboot.
Then report back whether it continues to happen. That seems to suffice as a temporary workaround.
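As a concrete example: on a hypothetical 4-core box whose dmesg shows the APs at APIC IDs 2, 4, and 6, /boot/loader.conf.local would end up looking like this (the IDs are illustrative; use whatever your own dmesg reports):
# /boot/loader.conf.local — leave only cpu0 (the BSP) enabled
hint.lapic.2.disabled=1   # cpu1
hint.lapic.4.disabled=1   # cpu2
hint.lapic.6.disabled=1   # cpu3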
-
I seem to be running into a very similar problem. 2.3 was running fine for days, then yesterday evening everything was dead. Or so I believed. I restarted and everything was back. This morning: Internet dead again. Restarted, but still no change. Restarted again, and everything was OK. I then upgraded to 2.3.1, restarted, and everything was dead. So I attached the serial console, just to find that the system was up: WAN connected, responsive, LAN IP attached… just no traffic on the LAN.
I simply did ifconfig igb2 down and then up, and everything was running again.
Yes, I have IPsec (and I unfortunately need it). It is on an APU with 4 cores. What is puzzling me is why this is happening right after a reboot.
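For anyone else who hits the frozen-LAN state, the recovery from the console was just this one-liner (igb2 is my LAN interface; substitute your own):
ifconfig igb2 down && ifconfig igb2 up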
-
I do NFS copies over IPsec between sites, so we definitely send some UDP traffic through the firewall. It is not a lot of traffic, though, so maybe that is why it has only happened once so far on my system. Almost 4 days now; initially it happened after 10 hours.
-
@cmb:
We've confirmed it's not specific to any particular NIC. […] If you're on a physical system, you can force the OS to disable additional cores. […] Then report back whether it continues to happen.
AWESOME! Thank you! I will do this and report back. Question: if the system is a dual core with Hyperthreading, do the hyper-threads show up as cores as well, and do they also need to be disabled?
-
Has anyone tried disabling Hyperthreading in the BIOS? Just for testing.
-
I always disable Hyperthreading on firewalls, so it was already disabled on my systems when I had the lost packets.
-
We've confirmed that the problem no longer occurs after disabling all but one CPU core, so that looks to be a viable immediate workaround for most users. See the instructions in my post here:
https://forum.pfsense.org/index.php?topic=110710.msg618388#msg618388
I doubt Hyperthreading is relevant either way; it happens on any SMP system, including ones without HT. Any HT cores will also need to be disabled for the workaround, not because they're HT, but because they're additional cores.
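To make the HT case concrete: on a hypothetical dual-core CPU with Hyperthreading, dmesg | grep cpu would list four logical CPUs, and every AP gets a hint line regardless of whether it's a physical core or an HT sibling (the APIC IDs below are illustrative; use the ones from your own dmesg):
cpu0 (BSP): APIC ID: 0
cpu1 (AP/HT): APIC ID: 1
cpu2 (AP): APIC ID: 2
cpu3 (AP/HT): APIC ID: 3
# /boot/loader.conf.local — disable everything except cpu0
hint.lapic.1.disabled=1
hint.lapic.2.disabled=1
hint.lapic.3.disabled=1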