2.3 stops routing traffic every 1 or 2 days.
-
The Dell R320 system I updated from 2.1.5 to 2.3 also had a similar symptom using the igb driver. It is the secondary of a pair of Dell R320s in an HA (primary/secondary failover) setup; the primary is still on 2.1.5. The secondary ran fine as master on 2.3 for about 10 hours, and then half of the connections to some IPs stopped working. It was weird because I couldn't ping some systems on the network but I could ping others. Same thing with remote systems: some I could ping and others I couldn't. I could not ping the ISP's router (pfSense's default route) but I could pass traffic through it. Network traces on the other systems and routers showed that packets went out and were sent back, but pfSense didn't see them (or dropped them?).
Interrupts went to about 30% when the problem started around 5am several days ago. Even after I switched back to the primary (via CARP), the interrupts stayed pegged at 30%, even though no traffic was going through the secondary that I could tell. I didn't think to run vmstat -i to see which IRQs were maxed, or netstat -i to look at dropped packets, as it was very early in the morning for me unfortunately. Keep in mind that this is a backup site that is not active, so not much traffic goes through it except transaction logs, etc.
I noticed that I still had hw.igb.num_queues set to 2, from trying to optimize the drivers on pfSense 2.1.5 to limit nmbclusters, from what I remember (my memory is not that good though :)). It seemed like a big coincidence that 2 is half the CPUs on the system (hyperthreading disabled in the BIOS), and maybe around half or a quarter of the connections were failing. The driver was creating 4 queues according to vmstat -i.
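As an aside, the queue count can be checked from the interrupt table — a small sketch (the helper name is made up; on FreeBSD the per-queue vectors show up in `vmstat -i` output as `igb0:que N` lines):

```shell
# Illustrative helper: count how many queue interrupt vectors the igb
# driver registered for igb0, given `vmstat -i`-style output on stdin.
count_igb_queues() {
    grep -c 'igb0:que'
}

# On the firewall:
#   vmstat -i | count_igb_queues
```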
I removed that setting and am hoping it was related. I will be doing tests during business hours over the next week or so to try to determine whether the problem is resolved.
Do you have num_queues set too, by chance? From everything I've read, it is no longer needed.
(I will update this post with the network card model numbers when I get to work in the morning).
I have kern.ipc.nmbclusters="131072" set and am using about 43,000 of them in my setup with 4 cores and 8 interfaces (two 4-port Intel cards).
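For what it's worth, the headroom in those numbers can be sanity-checked with a quick sketch like this (`cluster_pct` is a made-up helper, and the inputs below are just the figures from this setup):

```shell
# Hypothetical helper: given the in-use count and the configured ceiling,
# print mbuf cluster utilization. On the firewall the inputs would come
# from `sysctl -n kern.ipc.nmbclusters` and the "mbuf clusters in use"
# line of `netstat -m`.
cluster_pct() {
    awk -v used="$1" -v max="$2" 'BEGIN { printf "%.0f%%\n", 100 * used / max }'
}

cluster_pct 43000 131072   # roughly a third of the configured ceiling
```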
Running IPsec tunnels to 3 sites, OpenVPN (though it was not in use at the time), port forwards, CARP (of course), and the built-in load balancer. -
Yesterday I tried upgrading the firewall with 4 GB of RAM…
But it died last night at about 21:00...
I'm almost ready to downgrade to 2.2.6, because this is driving me crazy...
Is there a place to get the old ISO files online?
-
I noticed that I still had hw.igb.num_queues set to 2, from trying to optimize the drivers on pfSense 2.1.5 to limit nmbclusters, from what I remember (my memory is not that good though :)).
There were problems with igb multi-queue in the old drivers; that's why people ended up setting num_queues to 1 or some other small number. On all FreeBSD 10.x and newer base versions (2.2.0 through 2.3.1+), you shouldn't specify hw.igb.num_queues at all. Remove it from loader.conf and/or loader.conf.local to let the driver use the default (1 queue per CPU core).
It's possible setting num_queues to some non-default number causes problems, especially if a low number, as I doubt much testing happens in those circumstances.
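A minimal sketch of that cleanup, assuming the stock pfSense paths (the `strip_num_queues` helper is illustrative, not a pfSense command — keep a backup of the file before editing):

```shell
# Illustrative helper: delete any hw.igb.num_queues override from a
# loader.conf-style file, so igb falls back to its default of one queue
# per CPU core on the next boot.
strip_num_queues() {
    sed -e '/^hw\.igb\.num_queues/d' "$1" > "$1.new" && mv "$1.new" "$1"
}

# On the firewall (stock paths):
#   strip_num_queues /boot/loader.conf.local
#   strip_num_queues /boot/loader.conf
```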
-
denmly: I PMed you a link to a kernel to try with instructions.
-
@cmb:
denmly: I PMed you a link to a kernel to try with instructions.
I'll try this kernel right away
-
New kernel is installed, and now it's just wait and see… :-)
-
I have the same problem on 2 identical machines; before 2.3 it was OK.
Supermicro board + 4x igb interfaces
Intel(R) Xeon(R) CPU X3430 @ 2.40GHz - 4 CPUs: 1 package(s) x 4 core(s)
Memory usage 2% of 8148 MiB
Any solution for that?
-
Having the same issue on this post: https://forum.pfsense.org/index.php?topic=110710
-
ulicky and byusinger84, can you console in or ssh to the box? Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
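Something like this sketch might help for the check (the `check_watchdog` filter is made up; `clog` is what pfSense uses to read its circular logs, and the interface names in the pattern are examples):

```shell
# Illustrative filter: pull igb/em watchdog-reset lines out of whatever
# log stream is piped in (e.g. "igb0: Watchdog timeout -- resetting").
check_watchdog() {
    grep -iE '(igb|em)[0-9]+: .*watchdog timeout'
}

# On the firewall:
#   dmesg | check_watchdog
#   clog /var/log/system.log | check_watchdog
```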
I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing. -
@mer:
ulicky and byusinger84, can you console in or ssh to the box? Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing.
I'll check this out the next time the LAN interface freezes again.
-
I'm pretty sure I'm experiencing the same problem, as mentioned in this post: https://forum.pfsense.org/index.php?topic=110320.0
I've noticed that if I leave the system on long enough the LAN interface will eventually drop offline after 2-3 days even without any SIP traffic through the VPN. I'll try to check for watchdog timeout messages the next time it occurs.
-
Just experienced another of these firewall breakdowns, even with the new kernel from CMB :-(
It came at the same time that a big transfer of data started through a site-to-site VPN tunnel…
-
Just experienced another of these firewall breakdowns, even with the new kernel from CMB :-(
It came at the same time that a big transfer of data started through a site-to-site VPN tunnel…
That's not good, maybe something different in your case. Others have had promising results with the no-netmap kernel, though it hasn't been long enough yet to have a lot of confidence. What type of VPN?
-
IPsec VPN to another pfSense 2.3…
-
Could you get me a status tgz from your system? Browse to status.php and click the link to download the tgz. Email the file or a link to it to cmb at pfsense dot org.
-
Email has now been sent to you.
-
Just experienced another of these firewall breakdowns, even with the new kernel from CMB :-(
It came at the same time that a big transfer of data started through a site-to-site VPN tunnel…
I also experienced the same issue even using the new kernel. Also I don't think this is related to SIP traffic because one of the sites that's had the issue doesn't use SIP.
-
Just experienced another of these firewall breakdowns, even with the new kernel from CMB :-(
It came at the same time that a big transfer of data started through a site-to-site VPN tunnel…
I also experienced the same issue even using the new kernel. Also I don't think this is related to SIP traffic because one of the sites that's had the issue doesn't use SIP.
I don't think it's strictly related to SIP traffic, but there are tons of RTP UDP packets that are sent during a call. So for whatever reason that payload over the VPN is exacerbating the underlying issue.
-
Just had yet another incident. It seems that every time it happens, it is on the hour; e.g., 21:00, 23:00, 01:00, or 05:00 are the times I've noticed this problem starts.
I can see it on my MRTG traffic graphs, when traffic stops coming through the FW.
-
Just had yet another incident. It seems that every time it happens, it is on the hour; e.g., 21:00, 23:00, 01:00, or 05:00 are the times I've noticed this problem starts.
I can see it on my MRTG traffic graphs, when traffic stops coming through the FW.
Do you have any cron jobs that run on the hour? pfBlocker, for instance.
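A quick way to check — a sketch (the `on_the_hour` helper is made up; pfSense keeps its scheduled jobs in /etc/crontab, where packages like pfBlocker add their entries):

```shell
# Illustrative helper: print non-comment crontab entries whose minute
# field is 0, i.e. jobs that fire exactly on the hour.
on_the_hour() {
    awk '!/^[[:space:]]*#/ && $1 == "0"'
}

# On the firewall:
#   on_the_hour < /etc/crontab
```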