2.3 stops routing traffic every 1 or 2 days.



  • Hi.

    I'm having a problem, like a few others, where my pfSense FW stops routing traffic.

    I'm able to access the web interface from the internet and reboot the server, and then it comes up again with no problems.

    My initial setup:

    One VMware guest running pfSense 2.3 (i386 version) with 2 Xeon X5650 CPUs, 1 GB RAM and a 20 GB disk (primary FW).
    One HW box running 2.2.6 with a Geode(TM) Integrated Processor by AMD PCS CPU, 256 MB RAM and a 1 GB CF card (backup FW with CARP enabled).

    Then I reinstalled the primary FW with the x64 version of 2.3 and restored its configuration, and installed a backup FW on VMware with version 2.3 and restored the backup FW configuration on it, so my setup now was:

    Two pfSense FWs running 2.3, with CARP
    HW: VMware guest with 2 Xeon X5650 CPUs, 1 GB RAM and a 20 GB disk.

    Internet connection is a 1 Gbit connection.

    But the problem still happens every 1 or 2 days.

    This weekend I stopped an SNMP test from my Zabbix server, and then it ran one more day before traffic stopped.

    This morning the FW stopped again, so I accessed the /status.php page and saved a copy of the status_output.tgz file.

    I'm unsure of what to try next…

    Regards Michael



  • I would try installing 2 x 4 GB of RAM in the server. Then you could try raising the mbuf size,
    which might prevent the problem if you are running out of mbuf space. And if the small amount
    of RAM is itself the problem, that would be solved too.

    All data passing through the firewall's CPU also hits the memory system, and the current amount of RAM
    may not be enough for everything the box has to do. That may not have been a problem in the past, but going
    forward I really think a few more GB of RAM would be a worthwhile investment.



  • I have the same problem with pfSense 2.3 on a Dell PowerEdge 2900.

    Dell PowerEdge 2900
    32GB RAM
    QuadCore
    HD SAS: 512GB
    4 Intel interfaces.

    Packages:
    Suricata
    pfBlocker
    Zabbix-agent-LTS
    OpenVPN Client Export.



  • The Dell R320 system I updated from 2.1.5 to 2.3 also had a similar symptom using the igb driver.  It is the secondary of a pair of Dell R320s in an HA (primary/secondary failover) setup.  The primary is still on 2.1.5.  The secondary ran fine as master on 2.3 for about 10 hours or so, and then about half of the connections to some IPs stopped working.

    It was weird because I couldn't ping some systems on the network but I could ping others.  Same thing with remote systems: some I could ping and others I couldn't.  I could not ping the ISP's router (pfSense's default route) but I could pass traffic through it.  Network traces on the other systems and routers showed that packets went out and were sent back but pfSense didn't see them (or dropped them?).

    Interrupts went to about 30% when the problem started around 5am several days ago.  Even after I moved traffic back to the primary (CARP), the interrupts were still pegged at 30%, even though no traffic was going through the secondary (that I could tell).  Unfortunately I didn't think to do a netstat -i to see which IRQs were maxed or to look at dropped packets, as it was very early in the morning for me.  Keep in mind that this is a backup site that is not active, so not much traffic goes through it except transaction logs, etc.

    I noticed that I still had a hw.igb.num_queues set to 2 trying to optimize the drivers on pfsense 2.1.5 to limit nmbclusters from what I remember (my memory is not that good though :)).  It seemed like a big coincidence that is half the CPUs that are on the system and also maybe around half or 1/4 of the connections were failing (hyperthreading disabled in bios).  The driver was creating 4 queues according to netstat -i.

    I removed that setting and am hoping it was related.  I will be doing tests during business hours over the next week or so to try to determine whether the problem is resolved or not.

    Do you have num_queues set also by chance?  It is not needed any more from everything I read.

    (I will update this post with the network card model numbers when I get to work in the morning).
    I have kern.ipc.nmbclusters="131072" set and am using about 43000 of them in my setup with 4 cores and 8 interfaces (two 4-port Intel cards).
    Running 3 site IPsec tunnels, OpenVPN (though that was not in use at any time), port forwards, CARP (of course), and the built-in load balancer.
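
The nmbclusters figures in the post above (about 43000 in use out of a limit of 131072) can be checked from saved `netstat -m` output. The following is a minimal sketch, assuming the "mbuf clusters in use" line has the four-field current/cache/total/max layout shown in the comment; the function name and the cache/total values in the sample line are made up for illustration, and only the 43000 and 131072 figures come from the post.

```python
import re

# Assumed layout of the relevant line in FreeBSD `netstat -m` output:
#   <current>/<cache>/<total>/<max> mbuf clusters in use (current/cache/total/max)
CLUSTER_LINE = re.compile(r"(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use")


def cluster_usage(netstat_m_output: str) -> float:
    """Return current mbuf clusters as a fraction of the configured maximum."""
    match = CLUSTER_LINE.search(netstat_m_output)
    if match is None:
        raise ValueError("no 'mbuf clusters in use' line found")
    current, _cache, _total, maximum = (int(g) for g in match.groups())
    return current / maximum


# Illustrative sample: 43000 and 131072 are from the post, the rest is invented.
sample = "43000/2550/45550/131072 mbuf clusters in use (current/cache/total/max)"
print(f"{cluster_usage(sample):.1%}")
```

A value that stays far below 100%, as here, would suggest the cluster limit itself is not what is being exhausted, though it says nothing about other mbuf zones.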



  • Yesterday I tried upgrading the FW to 4 GB of RAM…

    But it died last night at about 21.00...

    I'm almost ready to downgrade to 2.2.6, because this is driving me crazy...

    Is there a place to get the old ISO files online?



  • @adam65535:

    I noticed that I still had a hw.igb.num_queues set to 2 trying to optimize the drivers on pfsense 2.1.5 to limit nmbclusters from what I remember (my memory is not that good though :)).

    There were problems in igb multi-queue in the old drivers, that's why people ended up setting num_queues to 1 or a small number. In all FreeBSD 10.x and newer base versions (2.2.0-2.3.1+), you shouldn't specify hw.igb.num_queues at all. Remove that from loader.conf and/or loader.conf.local to let it use the default (1 queue per CPU core).

    It's possible setting num_queues to some non-default number causes problems, especially if a low number, as I doubt much testing happens in those circumstances.
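
To see whether a leftover hw.igb.num_queues tunable is still present, the contents of loader.conf and loader.conf.local (normally found under /boot) can be scanned for it. This is a rough sketch that works on text copied from those files; the function name and the sample file contents are hypothetical.

```python
def find_num_queues(loader_conf_text: str) -> list[str]:
    """Return the non-comment lines that set hw.igb.num_queues."""
    hits = []
    for raw in loader_conf_text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # commented-out tunables have no effect
        # The tunable name is everything before the first "=".
        if line.split("=", 1)[0].strip() == "hw.igb.num_queues":
            hits.append(line)
    return hits


# Hypothetical loader.conf.local contents, for illustration only.
sample = """
kern.ipc.nmbclusters="131072"
hw.igb.num_queues=2
# hw.igb.num_queues=1
"""
print(find_num_queues(sample))
```

Any line it reports is a candidate for removal, per the advice above to let the driver use its default.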



  • denmly: I PMed you a link to a kernel to try with instructions.



  • @cmb:

    denmly: I PMed you a link to a kernel to try with instructions.

    I'll try this kernel right away



  • New kernel is installed, and now it's just wait and see… :-)



  • I have the same problem on 2 identical machines; before 2.3 everything was OK.

    Supermicro board + 4x igb interfaces
    Intel(R) Xeon(R) CPU X3430 @ 2.40GHz - 4 CPUs: 1 package(s) x 4 core(s)
    Memory usage 2% of 8148 MiB

    Any solution for that?



  • Having the same issue on this post: https://forum.pfsense.org/index.php?topic=110710



  • ulicky and byusinger84, can you console in or ssh to the box?  Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
    I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing.
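
One way to run that check against a saved copy of the system log or dmesg output is a simple case-insensitive filter. This is only a sketch: the function name is hypothetical, and the sample log lines are invented placeholders rather than real driver output.

```python
def watchdog_lines(log_text: str) -> list[str]:
    """Return log lines that mention "watchdog timeout", ignoring case."""
    return [
        line
        for line in log_text.splitlines()
        if "watchdog timeout" in line.lower()
    ]


# Invented sample text for illustration; real messages will look different.
sample = """\
example line: interface igb0 link state changed
example line: igb1 Watchdog timeout reported
example line: unrelated message
"""
for hit in watchdog_lines(sample):
    print(hit)
```

An empty result does not rule out the problem; it only means the saved text contained no such message.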



  • @mer:

    ulicky and byusinger84, can you console in or ssh to the box?  Assuming the interfaces are igb or em, see if there are any messages related to "watchdog timeout".
    I don't have any fixes, but if you have those interfaces, it may be related to something a few other folks are seeing.

    I'll check this out the next time the LAN interface freezes again.



  • I'm pretty sure I'm experiencing the same problem, as mentioned in this post: https://forum.pfsense.org/index.php?topic=110320.0

    I've noticed that if I leave the system on long enough the LAN interface will eventually drop offline after 2-3 days even without any SIP traffic through the VPN.  I'll try to check for watchdog timeout messages the next time it occurs.



  • Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

    it came at the same time that a big transfer of data started through a site to site vpn tunnel…



  • @denmly:

    Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

    it came at the same time that a big transfer of data started through a site to site vpn tunnel…

    That's not good, maybe something different in your case. Others have had promising results with the no-netmap kernel, though it hasn't been long enough yet to have a lot of confidence. What type of VPN?



  • IPsec VPN to another pfSense 2.3…



  • Could you get me a status tgz from your system? Browse to status.php and click the link to download the tgz. Email the file or a link to it to cmb at pfsense dot org.



  • The email has now been sent to you.



  • @denmly:

    Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

    it came at the same time that a big transfer of data started through a site to site vpn tunnel…

    I also experienced the same issue even using the new kernel. Also I don't think this is related to SIP traffic because one of the sites that's had the issue doesn't use SIP.



  • @byusinger84:

    @denmly:

    Just experienced another of these fw breakdowns even with the new kernel from CMB :-(

    it came at the same time that a big transfer of data started through a site to site vpn tunnel…

    I also experienced the same issue even using the new kernel. Also I don't think this is related to SIP traffic because one of the sites that's had the issue doesn't use SIP.

    I don't think it's strictly related to SIP traffic, but there are tons of RTP UDP packets that are sent during a call.  So for whatever reason that payload over the VPN is exacerbating the underlying issue.



  • Just had yet another incident. It seems that every time it happens it is on the hour, e.g. 21.00, 23.00, 01.00 or 05.00 are the times I've noticed this problem starting.

    I can see it on my MRTG traffic graphs, when traffic stops coming through the FW.



  • @denmly:

    Just had yet another incident. It seems that every time it happens it is on the hour, e.g. 21.00, 23.00, 01.00 or 05.00 are the times I've noticed this problem starting.

    I can see it on my MRTG traffic graphs, when traffic stops coming through the FW.

    Do you have any cron jobs that run on the hour? pfblocker for instance.
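
A quick way to answer that from a copy of the system crontab is to list the entries whose minute field is exactly 0. The sketch below assumes the system-crontab layout (minute, hour, day of month, month, weekday, user, command); the function name and the job paths in the sample are hypothetical.

```python
def on_the_hour_jobs(crontab_text: str) -> list[str]:
    """Return crontab entries whose minute field is exactly 0.

    Step expressions such as */5 also fire at minute 0, but they run at
    other minutes too, so they are deliberately not reported here.
    """
    jobs = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if "=" in fields[0]:
            continue  # environment assignment such as SHELL=/bin/sh
        if len(fields) >= 6 and fields[0] in ("0", "00"):
            jobs.append(line)
    return jobs


# Hypothetical crontab contents, for illustration only.
sample = """\
SHELL=/bin/sh
# minute hour mday month wday who command
0 * * * * root /usr/local/bin/example_hourly_job
*/5 * * * * root /usr/local/bin/example_frequent_job
30 3 * * * root /usr/local/bin/example_nightly_job
"""
for job in on_the_hour_jobs(sample):
    print(job)
```

Jobs found this way are only suspects: a match in timing between a cron entry and the stalls is a correlation to test, not proof of the cause.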



  • I used to have pfBlocker installed, but have removed the package.

    I just checked crontab, and there were still two jobs in cron from pfBlocker.

    I have now disabled both of the jobs…

    Yesterday I was very desperate, so I tried upgrading to the latest development version, 2.3.1, and as of this morning I have had no problems yet...
    Not sure if upgrading to 2.3.1-development was a good move?



  • I have now set pfBlocker to update only once a day. Other than that there are no other custom cron jobs. We shall see.



  • pfBlocker is not the issue. I've been running with no packages and still have the issue. (See my thread: https://forum.pfsense.org/index.php?topic=110525)

    I've started rolling mine back to 2.2.6. Too many headaches and sleepless nights with 2.3.



  • It looks like disabling SMP is an immediate workaround for the problem while we track down and fix the root cause.
    https://forum.pfsense.org/index.php?topic=110953.0

