[Solved] High latency and low CPU utilization after 2.1->2.1.3 upgrade


  • Last week, we upgraded from our 2.1 pfSense x64 installation to 2.1.3 x64.

    Everything seemed to be running normally until I noticed pings were sometimes over 1000ms. Upon further investigation, the high ping times occurred whenever the firewall was being accessed, either through the GUI or through regular internet usage.

    There were no hardware changes made. We have about 50 users, but there are usually between 5 and 20 concurrent users at any given time.

    We used to have fairly high CPU utilization on the dual-core Celeron U3300 2.5GHz (2GB RAM). Both the dashboard and sysctl hw.ncpu show 2 cores, but top, Diagnostics->System Activity and the RRD graphs now show nearly 100% idle time. See below for System Activity, and the attachment for the RRD graph. You'll see two outages: the first, short one was for making a Clonezilla image of 2.1, and the second was the 2.1->2.1.3 upgrade.

    
    last pid: 67374;  load averages:  0.02,  0.01,  0.00  up 0+01:11:52    10:34:16
    256 processes: 3 running, 234 sleeping, 19 waiting
    
    Mem: 386M Active, 202M Inact, 236M Wired, 280K Cache, 211M Buf, 1132M Free
    Swap: 4096M Total, 4096M Free
    
      PID USERNAME PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
       11 root     171 ki31     0K    32K RUN     1  69:53 100.00% [idle{idle: cpu1}]
       11 root     171 ki31     0K    32K CPU0    0  67:42 97.07% [idle{idle: cpu0}]
    

    I'm not sure if the ping times are directly related to the CPU utilization. I did reboot the switches and checked that both the interfaces are set to auto. They are both showing that they negotiated at 100baseTX (full duplex).

    Can anyone shed some insight on what I can do to resolve this?

    Thanks.


  • I am a noob…but I recently read somewhere that this might have to do with the untimely expiration of states. There should be a way to change the firewall optimization behavior (I don't know where it is located and I am not able to access my WebGUI at the moment) wherein you need to opt for "conservative" behavior, which retains the states for a longer period and results in higher CPU and memory utilization. The "aggressive" option actually kills connections more frequently.

    The option names "conservative" and "aggressive" are kind of misleading in a way.


  • @golmaal:

    I am a noob…but I recently read somewhere that this might have to do with the untimely expiration of states. There should be a way to change the firewall optimization behavior (I don't know where it is located and I am not able to access my WebGUI at the moment) wherein you need to opt for "conservative" behavior, which retains the states for a longer period and results in higher CPU and memory utilization. The "aggressive" option actually kills connections more frequently.

    The option names "conservative" and "aggressive" are kind of misleading in a way.

    Thanks for the response. I found the option you referred to in the System->Advanced->Firewall/NAT tab, and it was set to normal. I changed it to conservative, and am seeing minimal, if any, change in CPU usage and ping times.

    The pings are all on the LAN side, btw. I tried pinging from both pfSense's shell and the local workstations.


  • Just some more information I've discovered:

    I have several machines pinging the pfsense box at the moment, and it seems the pings only spike on the machines that are pushing traffic…
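
    To put numbers on spikes like these, the ping output from each machine can be summarized. The following is only a minimal sketch and not something from the original posts: the function name ping_spikes and the 100 ms default threshold are assumptions, and it expects Unix-style reply lines containing a time=<ms> field.

```shell
#!/bin/sh
# Summarize ping replies read from stdin: number of replies, how many
# exceeded a threshold in milliseconds, and the worst round-trip time.
# Hypothetical usage: ping -c 60 <firewall LAN address> | ping_spikes 100
ping_spikes() {
    threshold="${1:-100}"   # assumed default of 100 ms
    awk -v limit="$threshold" '
        BEGIN { total = 0; slow = 0; worst = 0 }
        {
            # Reply lines carry a field such as "time=12.3" followed by "ms".
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0
                    total++
                    if (rtt > limit) slow++
                    if (rtt > worst) worst = rtt
                }
            }
        }
        END { printf "replies=%d slow=%d worst_ms=%.1f\n", total, slow, worst }
    '
}
```

    Running it at the same time on an idle machine and on one that is pushing traffic makes the difference easy to compare.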

  • Netgate Administrator

    What NICs do you have on that box?
    New drivers for Intel NICs went in between 2.1 and 2.1.3. If you are using Intel NICs and you have any tuning options in loader.conf.local try removing them.

    Steve


  • @stephenw10:

    What NICs do you have on that box?
    New drivers for Intel NICs went in between 2.1 and 2.1.3. If you are using Intel NICs and you have any tuning options in loader.conf.local try removing them.

    Steve

    I've got a D-Link NIC for LAN side and Elitegroup (onboard) for WAN. I do not have a loader.conf.local, and there are no NIC based tuning arguments in loader.conf…

    Thanks for the response.

  • Netgate Administrator

    Ok, what drivers are those NICs using? D-Link and Elitegroup could be almost anything.
    Copy and paste the output of:

    pciconf -lv | grep 20000
    

    Steve


  • @stephenw10:

    Ok, what drivers are those NICs using? D-Link and Elitegroup could be almost anything.
    Copy and paste the output of:

    pciconf -lv | grep 20000
    

    Steve

    re0@pci0:2:0:0: class=0x020000 card=0x26511019 chip=0x816810ec rev=0x03 hdr=0x00
    vr0@pci0:3:1:0: class=0x020000 card=0x14051186 chip=0x31061106 rev=0x8b hdr=0x00

    Thanks
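
    For reference, the device name before the @ in each line is the driver name plus a unit number, so re0 uses the re(4) Realtek driver and vr0 uses the vr(4) VIA Rhine driver. A small sketch for pulling out just the driver names follows; the helper name nic_drivers is hypothetical, and it assumes input in the pciconf -l format shown above.

```shell
#!/bin/sh
# Print the unique driver names of Ethernet devices (class 0x020000)
# found in pciconf -l style output read from stdin.
# Hypothetical usage: pciconf -l | nic_drivers
nic_drivers() {
    awk '
        /class=0x020000/ {
            dev = $1
            sub(/@.*/, "", dev)       # drop the "@pci0:2:0:0:" selector
            sub(/[0-9]+$/, "", dev)   # drop the unit number, e.g. re0 -> re
            print dev
        }
    ' | sort -u
}
```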

  • Netgate Administrator

    Hmm, no changes to those drivers as far as I'm aware.
    Go to System: Advanced: Networking: and make sure you have all hardware offloading options disabled. Not that it should have changed since 2.1.
    Are you using traffic shaping? Are you using VLANs?

    Steve


  • @stephenw10:

    Hmm, no changes to those drivers as far as I'm aware.
    Go to System: Advanced: Networking: and make sure you have all hardware offloading options disabled. Not that it should have changed since 2.1.
    Are you using traffic shaping? Are you using VLANs?

    Steve

    We aren't using traffic shaping at the moment, and there are no VLANs set on the pfSense box. There is a VLAN set on our VoIP system, but that is only so it bypasses the firewall, as our last system didn't have VLAN capabilities and we haven't gotten around to changing it yet (if we ever bother to), so I'm hoping that is a non-issue.

    Two of the three offloading options are disabled. The only one that wasn't is "Hardware Checksum Offloading". I disabled it, and rebooted the firewall.

    There was no change in the problems, unfortunately.

    I'm considering backing up my Sarg logs and rolling back to my Clonezilla image and attempting the upgrade again, or starting with a fresh install, importing the current config, and reinstalling the Dansguardian and squid packages. I'm just trying to avoid doing another all-nighter. My hard disk is 300GB, and I allocated the whole thing. Even though only 35GB is used, due to the filesystem type, Clonezilla takes 2.5 hours to run…

  • Netgate Administrator

    @murfle:

    Clonezilla takes 2.5 hours to run…

    Ouch!

    A fresh install and config restore is certainly an option. You might want to set up a very simple WAN and LAN config to test the throughput before you restore the config though, just to make sure it's not a problem with the config file.

    I assume your VoIP VLAN completely bypasses pfSense then? One user recently had some latency issues with VLANs. There's no chance of tagged packets ending up at the pfSense NICs?

    When you say you're not currently running traffic shaping do you mean you have done previously? Were you using rates close to the restriction you're seeing? Just speculation but maybe you have some rogue config options somewhere that were interpreted by the upgrade code incorrectly. It might be worth having a manual read through of the config file.

    One thing you could try is downloading some data directly to the pfSense machine to check if the restriction is WAN or LAN side. E.g.

    [2.1.3-RELEASE][root@pfsense.fire.box]/root(1):  fetch -o /dev/null http://download.thinkbroadband.com/10MB.zip
    /dev/null                                     100% of   10 MB 2067 kBps
    

    Thinkbroadband is a good site for me but you might want to use something more local to you. You should get >1Mbps wherever you are though.
    Edit: Not this thread!  ::)

    Steve
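
    The fetch result above can be turned into a line rate for comparison with the speed the ISP advertises. This is a rough sketch only: the helper name kbps_to_mbit is made up, and it assumes the kBps figure reported by fetch counts 1024 bytes per kB (pass 1000 as the second argument if that assumption does not hold).

```shell
#!/bin/sh
# Convert a transfer rate in kilobytes per second to megabits per second.
# $1: rate in kBps as printed by fetch
# $2: bytes per kB, assumed to be 1024 unless given
kbps_to_mbit() {
    awk -v rate="$1" -v unit="${2:-1024}" \
        'BEGIN { printf "%.1f\n", rate * unit * 8 / 1000000 }'
}
```

    With the figure from the example, kbps_to_mbit 2067 prints 16.9, so that download ran at roughly 17 Mbit/s under the 1024-byte assumption.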


  • @stephenw10:

    A fresh install and config restore is certainly an option. You might want to set up a very simple WAN and LAN config to test the throughput before you restore the config though, just to make sure it's not a problem with the config file.

    That was my thinking too, however seeing it in writing gives me the idea to shut down all my packages and do a default config just to see if the symptoms remain on my current install. Thanks :)

    I assume your VoIP VLAN completely bypasses pfSense then? One user recently had some latency issues with VLANs. There's no chance of tagged packets ending up at the pfSense NICs?

    In theory, that's how it was supposed to be working, but I wasn't the one who set up the switches. Only the port on the switch that goes straight to the internet-facing MikroTik is set with that VLAN PVID; however, all ports are tagged with the VLAN ID. I'm still not 100% on all the VLAN technologies. All VoIP traffic comes in over a single port, so maybe I need to just tag the two ports that are needed… I'll dig a bit deeper on that, regardless of whether that's the cause or not. I'd prefer to have it cleaned up, not to mention I need to learn this stuff.

    When you say you're not currently running traffic shaping do you mean you have done previously? Were you using rates close to the restriction you're seeing? Just speculation but maybe you have some rogue config options somewhere that were interpreted by the upgrade code incorrectly. It might be worth having a manual read through of the config file.

    Traffic shaping was done on our last firewall which was smoothwall. It was decommissioned in January. I'll take a closer look through the config, thanks.

    One thing you could try is downloading some data directly to the pfSense machine to check if the restriction is WAN or LAN side. E.g.

    [2.1.3-RELEASE][root@pfsense.fire.box]/root(1):  fetch -o /dev/null http://download.thinkbroadband.com/10MB.zip
    /dev/null                                     100% of   10 MB 2067 kBps
    

    Thinkbroadband is a good site for me but you might want to use something more local to you.

    The WAN side is acting normally. Our line is very poor, but latency is absolutely solid. We haven't had full throughput in several weeks. I tried fetching a couple of files from the shell while watching iftop and pinging external DNS servers, and saw no irregularities.

    Thanks again for all the input. Will keep you posted.


  • I installed a fresh copy of 2.1.3 on identical hardware today, and started restoring config pieces one at a time. The symptoms of high ping times started with the import of the captive portal configuration.

    I deleted our only zone, and made a fresh one, leaving out the "Enable per-user bandwidth restriction" option. On 2.1.0 that option wasn't really working properly, and I had just left it.

    I went back to the upgraded machine that I started this thread about, took the check mark out of "Enable per-user bandwidth restriction" and now our ping times are always as expected!

    I haven't noticed our CPU usage get as high as it used to, but hey, that might actually be a good thing if everything is performing well.

    I'm not going to mark this as solved just yet. Gonna keep an eye on things for a day or two first.

    Thanks to all for their help!

    Edit: I forgot to mention this even affected those with their MAC addresses added to passthru…
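
    The restore-one-piece-and-measure approach described here can be written down as a simple log check. This is a minimal sketch, assuming you note one line per restored config piece together with the average ping in ms measured after that step; the function name first_slow_step, the 100 ms default and the log format are all hypothetical.

```shell
#!/bin/sh
# Read "step_name avg_rtt_ms" pairs from stdin, in the order the config
# pieces were restored, and print the first step whose average ping
# exceeds the threshold. Prints "none" if every step stayed below it.
# Hypothetical usage (values invented for illustration):
#   printf 'interfaces 0.4\ncaptive_portal 850\n' | first_slow_step
first_slow_step() {
    awk -v limit="${1:-100}" '
        $2 + 0 > limit { print $1; found = 1; exit }
        END { if (!found) print "none" }
    '
}
```

    The first step reported is the config piece to look at more closely, which in this thread was the captive portal.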

  • Netgate Administrator

    Well deduced.  Methodical testing FTW.  ;D

    Steve


  • Everything's been running smoothly. Marked the subject as solved.

    Thanks again!