2.3 stops routing traffic every 1 or 2 days.

denmly

Hi.

Like a few others, I'm having a problem where my pfSense FW stops routing traffic.

I can still reach the web interface from the internet and reboot the server, after which it comes back up with no problems.

My initial setup:

One VMware guest running pfSense 2.3 (i386 version) with 2 Xeon X5650 CPUs, 1 GB RAM and a 20 GB disk (primary FW).
One HW box running 2.2.6 with a Geode(TM) Integrated Processor by AMD PCS CPU, 256 MB RAM and a 1 GB CF card (backup FW with CARP enabled).

Then I reinstalled the primary FW with the x64 version of 2.3 and restored its configuration. I also installed a backup FW on VMware with version 2.3 and restored the backup FW configuration on it, so my setup is now:

Two pfSense FWs running 2.3, with CARP
HW: VMware guest with 2 Xeon X5650 CPUs, 1 GB RAM and 20 GB disk.

Internet connection is a 1 Gbit connection.

But the problem still happens every 1 or 2 days.

This weekend I stopped an SNMP test from my Zabbix server, and then it ran one more day before it stopped passing traffic.

This morning the FW stopped again, so I opened the /status.php page and saved a copy of the status_output.tgz file.

I'm unsure of what to try next…

Regards Michael

Guest

I would try installing 2 x 4 GB of RAM in the server. You could then raise the mbuf size, which might prevent the problem if you are running out of mbuf space. And if the low amount of RAM itself is the problem, that would be solved too.

All data passing through the firewall's CPU also hits the memory system, and the current amount of memory may not be enough for everything it has to do. That may not have been a problem in the past, but going forward I really think a few more GB of RAM would be a worthwhile investment.

rlrobs

I have the same problem with pfSense 2.3 on a Dell PowerEdge 2900.

Dell PowerEdge 2900
32 GB RAM
Quad core
HD SAS: 512 GB
4 Intel interfaces.

packages:
Suricata
PFBlocker
Zabbix-agent-LTS
OpenVPN Client Export.

adam65535

The Dell R320 system I updated from 2.1.5 to 2.3 also had a similar symptom using the igb driver. It is the secondary of a pair of Dell R320s in an HA (primary/secondary failover) setup. The primary is still on 2.1.5. The secondary ran fine as master on 2.3 for about 10 hours or so, and then half of the connections to some IPs stopped working.

It was weird because I couldn't ping some systems on the network but I could ping others. Same thing with remote systems: some I could ping and others I couldn't. I could not ping the ISP's router (pfSense's default route), but I could pass traffic through it. Network traces on the other systems and routers showed that packets went out and replies were sent back, but pfSense didn't see them (or dropped them?).

Interrupts went to about 30% when the problem started around 5am several days ago. Even after I moved traffic back to the primary (CARP), the interrupts were still pegged at 30%, although no traffic was going through the secondary (that I could tell). Unfortunately I didn't think to do a netstat -i to see which IRQs were maxed or to look at dropped packets, as it was very early in the morning for me. Keep in mind that this is a backup site that is not active, so not much traffic goes through it except transaction logs, etc.

I noticed that I still had a hw.igb.num_queues set to 2 trying to optimize the drivers on pfsense 2.1.5 to limit nmbclusters from what I remember (my memory is not that good though :)). It seemed like a big coincidence that this is half the number of CPUs on the system and that maybe around half or 1/4 of the connections were failing (hyperthreading is disabled in the BIOS). The driver was creating 4 queues according to netstat -i.

I removed that setting and am hoping it was related. I will be doing tests during business hours over the next week or so to try to determine whether the problem is resolved.

Do you have num_queues set also by chance? It is not needed any more from everything I read.

(I will update this post with the network card model numbers when I get to work in the morning.)
I have kern.ipc.nmbclusters="131072" set and am using about 43000 of them in my setup with 4 cores and 8 interfaces (two 4-port Intel cards).
Running 3 site IPsec tunnels, OpenVPN (though that was not in use at any time), port forwards, CARP (of course), and the built-in load balancer.
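
For anyone who wants to compare their own mbuf cluster usage against the kern.ipc.nmbclusters limit, here is a minimal sketch in Python. It assumes the "mbuf clusters in use (current/cache/total/max)" line that FreeBSD's netstat -m prints; the function name and the sample line are made up for illustration (the numbers only echo the rough figures above) and were not captured from any system in this thread.

```python
import re

# Matches e.g. "40000/3000/43000/131072 mbuf clusters in use (current/cache/total/max)"
CLUSTER_RE = re.compile(r"(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use")


def cluster_usage(netstat_m_output):
    """Return (total_in_use, maximum, fraction_used), or None if the line is absent.

    "total" is current + cache, as reported by netstat -m.
    """
    match = CLUSTER_RE.search(netstat_m_output)
    if match is None:
        return None
    _current, _cache, total, maximum = map(int, match.groups())
    return total, maximum, total / maximum


# Hypothetical sample line, not real output from this thread.
sample = "40000/3000/43000/131072 mbuf clusters in use (current/cache/total/max)"

if __name__ == "__main__":
    total, maximum, fraction = cluster_usage(sample)
    print(f"{total} of {maximum} clusters in use ({fraction:.1%})")
```

With roughly 43000 of 131072 clusters in use, usage is about a third of the limit, so in a setup like this one the cluster limit itself would not seem to be the bottleneck.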

denmly

Yesterday I tried upgrading the FW to 4 GB of RAM…

But it died last night at about 21:00...

I'm almost ready to downgrade to 2.2.6, because this is driving me crazy...

Is there a place to get the old ISO files online?

cmb

@adam65535:

I noticed that I still had a hw.igb.num_queues set to 2 trying to optimize the drivers on pfsense 2.1.5 to limit nmbclusters from what I remember (my memory is not that good though :)).

There were problems in igb multi-queue in the old drivers, that's why people ended up setting num_queues to 1 or a small number. In all FreeBSD 10.x and newer base versions (2.2.0-2.3.1+), you shouldn't specify hw.igb.num_queues at all. Remove that from loader.conf and/or loader.conf.local to let it use the default (1 queue per CPU core).

It's possible setting num_queues to some non-default number causes problems, especially if a low number, as I doubt much testing happens in those circumstances.
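
A quick way to check whether a leftover hw.igb.num_queues line is still present is to scan the loader files. The following is only a minimal sketch in Python: the helper name find_tunable is hypothetical, the file list just reflects the two files mentioned above under /boot, and the comment handling is deliberately naive (anything after a "#" is ignored).

```python
import os

# The two loader files mentioned above; adjust if yours live elsewhere.
LOADER_FILES = ["/boot/loader.conf", "/boot/loader.conf.local"]


def find_tunable(text, name="hw.igb.num_queues"):
    """Return (line_number, setting) pairs for active lines that set `name`."""
    hits = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key = line.split("=", 1)[0].strip()
        if key == name:
            hits.append((lineno, line))
    return hits


if __name__ == "__main__":
    for path in LOADER_FILES:
        if not os.path.exists(path):
            continue
        with open(path) as fh:
            for lineno, line in find_tunable(fh.read()):
                print(f"{path}:{lineno}: {line}")
```

If it prints nothing, neither file sets the tunable and the driver default applies.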

cmb

denmly: I PMed you a link to a kernel to try with instructions.

denmly

@cmb:

denmly: I PMed you a link to a kernel to try with instructions.

I'll try this kernel right away

denmly

The new kernel is installed, and now it's just wait and see… :-)

ulicky

I have the same problem on 2 identical machines; before 2.3 it was OK.

Supermicro board + 4x igb interfaces
Intel(R) Xeon(R) CPU X3430 @ 2.40GHz - 4 CPUs: 1 package(s) x 4 core(s)
Memory usage 2% of 8148 MiB

Any solution for that?

byusinger84

I'm having the same issue, described in this post: https://forum.pfsense.org/index.php?topic=110710

mer

ulicky and byusinger84, can you console in or ssh to the box? Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing.
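
If you save the console or dmesg output to a file, a small script can pull out the relevant lines and count them per interface. This is a rough sketch in Python, assuming the message contains the text "watchdog timeout" after an igb or em interface name. The sample log lines are invented for illustration and do not come from any machine in this thread.

```python
import re
from collections import Counter

# Interface name (igbN or emN), then anything, then "watchdog timeout" (any case).
WATCHDOG_RE = re.compile(r"\b((?:igb|em)\d+):.*watchdog timeout", re.IGNORECASE)


def count_watchdog_timeouts(log_text):
    """Count lines mentioning a watchdog timeout, keyed by interface name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical log excerpt, made up for this example.
sample_log = """\
igb0: Watchdog timeout -- resetting
igb0: link state changed to DOWN
igb0: Watchdog timeout -- resetting
em1: watchdog timeout -- resetting
"""

if __name__ == "__main__":
    for iface, hits in sorted(count_watchdog_timeouts(sample_log).items()):
        print(f"{iface}: {hits} watchdog timeout message(s)")
```

An empty result only means the text was not found in what you fed it; it does not rule out the problem.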

byusinger84

@mer:

ulicky and byusinger84, can you console in or ssh to the box? Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing.

I'll check this out the next time the LAN interface freezes again.

thx2000

I'm pretty sure I'm experiencing the same problem, as mentioned in this post: https://forum.pfsense.org/index.php?topic=110320.0

I've noticed that if I leave the system on long enough the LAN interface will eventually drop offline after 2-3 days even without any SIP traffic through the VPN. I'll try to check for watchdog timeout messages the next time it occurs.

denmly

Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

it came at the same time that a big transfer of data started through a site to site vpn tunnel…

cmb

@denmly:

Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

it came at the same time that a big transfer of data started through a site to site vpn tunnel…

That's not good; it may be something different in your case. Others have had promising results with the no-netmap kernel, though it hasn't been long enough yet to have a lot of confidence. What type of VPN?

denmly

IPsec VPN to another pfSense 2.3…

cmb

Could you get me a status tgz from your system? Browse to status.php and click the link to download the tgz. Email the file or a link to it to cmb at pfsense dot org.

denmly

The email has now been sent to you.

byusinger84

@denmly:

Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

it came at the same time that a big transfer of data started through a site to site vpn tunnel…

I also experienced the same issue even using the new kernel. Also I don't think this is related to SIP traffic because one of the sites that's had the issue doesn't use SIP.