pfSense locks up when transferring 10+ Mbit/s of data



  • I recently built a new pfSense box on a dual-core Atom D2700 with 2GB of DDR3.

    Whenever I push "lots" of data through it (10+ Mbit/s), the transfer works fine for several minutes, then drops to ~7 Mbit/s and stays there, and the pfSense web interface is completely unresponsive until I pause the transfer. Ping times to the pfSense box hover in the 100ms range. The RRD graphs aren't telling me much beyond memory usage being only around 30%, and a dashboard widget shows CPU usage never reaching 10%.

    Several seconds after I pause the transfer, the pfSense web interface is available again and ping times drop back to normal (<1ms). Please note that I can sustain a transfer of 25mb/s for several minutes before pfSense becomes unresponsive.

    Also, please note that I have dual gigabit NICs in the router.

    I don't have CPU temps, but I do have a fan and the heatsink is cool to the touch, so I doubt it's a heat issue.

    I've attached an example of what I see on RRD graphs for packets / sec - the 'dead space' areas are when this issue occurs.

    This is easily replicable - is there anything else I can test?

    Any help would be GREATLY appreciated.
    Thank you




  • What version of pfSense?

    What are the FreeBSD names of the interfaces (e.g. LAN is em0, WAN is msk1)?



  • Thanks so much for the quick reply -
    pfSense 2.0.1
    Interfaces are bge0 (WAN) and bge1 (LAN)

    I saw a forum thread about these interfaces that suggested making the following changes to loader.conf.local:
    kern.ipc.nmbclusters="131072"
    hw.bge.tso_enable=0
    hw.pci.enable_msix=0

    But they seemed to have little effect - I was still able to replicate the issue. These changes are still in place.

    I should let you know that I have a site-to-site OpenVPN link, but this problem presented itself before I set that up.


  • Netgate Administrator

    Are you running 32-bit or 64-bit? Full install or NanoBSD?
    What you are describing could easily be the NIC problem for which you have already tried the fix:
    http://doc.pfsense.org/index.php/Tuning_and_Troubleshooting_Network_Cards#Broadcom_bce.284.29_Cards

    You should check that the settings are being applied correctly.
    If you have plenty of RAM then try doubling nmbclusters again. If you are running 64-bit and have a multiport NIC, it can really eat clusters under some circumstances.

    Steve
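
One way to act on the suggestion above to check that the settings are being applied is to compare the contents of loader.conf.local against the tunables quoted in this thread. The following is a minimal Python sketch, not a pfSense tool: the function names and the sample file text are illustrative assumptions, and it only inspects the file, not the running kernel. On the firewall itself, `sysctl kern.ipc.nmbclusters` shows the value the kernel is actually using.

```python
def parse_loader_conf(text):
    """Parse name=value lines from loader.conf-style text into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip().strip('"')
    return settings


def check_tunables(text, expected):
    """Return {name: (expected, found)} for each tunable that is missing or different."""
    found = parse_loader_conf(text)
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


# Tunables quoted in this thread, kept as strings for comparison.
EXPECTED = {
    "kern.ipc.nmbclusters": "131072",
    "hw.bge.tso_enable": "0",
    "hw.pci.enable_msix": "0",
}

# Illustrative file contents; in practice, read a copy of loader.conf.local.
sample = '''
kern.ipc.nmbclusters="131072"
hw.bge.tso_enable=0
'''

problems = check_tunables(sample, EXPECTED)
for name, (want, got) in problems.items():
    print(f"{name}: expected {want}, found {got}")
```

With the sample text above, the only tunable reported is hw.pci.enable_msix, because that line was deliberately left out of the sample.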



  • I'm running 32-bit NanoBSD+VGA, booting off of a flash drive.

    It's running on this system:

    http://www.newegg.com/Product/Product.aspx?Item=N82E16856205006

    I'll double the nmbclusters value and test again. Could you explain to me exactly what that value does, or how high I can set it with 2GB of RAM?

    Thank you!



  • I increased the nmbclusters value to 262144, but this did not resolve the issue.

    It is easiest for me to test this with NNTP traffic. It seems that if I'm using SSL on NNTP, the pfSense box stops responding much sooner. I just did a 10+GB transfer over unencrypted NNTP and was able to move 8GB or so before the speed dropped to 500KB/s and the web interface stopped responding. Would there be any reason for this? Is SSL traffic more demanding on the router? I'm not doing traffic shaping or anything like that.

    Are there logs that I could provide that would help?

    Thanks,




  • Aug 10 08:49:56 kernel: acpi_throttle3: <ACPI CPU Throttling> on cpu3
    Aug 10 08:49:56 kernel: acpi_throttle3: failed to attach P_CNT
    Aug 10 08:49:56 kernel: device_attach: acpi_throttle3 attach returned 6
    Aug 10 08:49:56 kernel: p4tcc0: <CPU Frequency Thermal Control> on cpu0
    Aug 10 08:49:56 kernel: p4tcc1: <CPU Frequency Thermal Control> on cpu1
    Aug 10 08:49:56 kernel: p4tcc2: <CPU Frequency Thermal Control> on cpu2
    Aug 10 08:49:56 kernel: p4tcc3: <CPU Frequency Thermal Control> on cpu3
    Aug 10 08:49:57 php: : rc.newwanip: Informational is starting ovpns1.
    Aug 10 08:49:57 php: : rc.newwanip: on (IP address: 10.0.9.1) (interface: ) (real interface: ovpns1).
    Aug 10 08:49:57 check_reload_status: Reloading filter
    Aug 10 08:49:57 php: : OpenNTPD is starting up.
    Aug 10 08:49:57 php: : pfSense package system has detected an ip change -> … Restarting packages.
    Aug 10 08:49:58 php: : Restarting/Starting all packages.
    Aug 10 08:49:59 php: : IPSEC: One or more IPsec tunnel endpoints has changed its IP. Refreshing.
    Aug 10 08:49:59 login: login on ttyv0 as root
    Aug 10 08:50:00 sshlockout[24777]: sshlockout/webConfigurator v3.0 starting up
    Aug 10 08:50:01 check_reload_status: Reloading filter
    Aug 10 08:50:03 php: : Restarting/Starting all packages.
    Aug 10 08:50:05 php: /index.php: Successful webConfigurator login for user 'admin' from 10.0.2.15
    Aug 10 08:50:05 php: /index.php: Successful webConfigurator login for user 'admin' from 10.0.2.15
    Aug 10 08:50:44 apinger: Error while feeding rrdtool: Broken pipe
    Aug 10 08:51:44 apinger: /usr/local/bin/rrdtool respawning too fast, waiting 300s.
    Aug 10 08:57:51 apinger: ALARM: WAN(69.243..) *** down ***
    Aug 10 08:57:51 apinger: alarm canceled: WAN(69.243..) *** down ***
    Aug 10 08:58:01 check_reload_status: Reloading filter
    Aug 10 08:58:01 check_reload_status: Reloading filter
    Aug 10 09:03:12 apinger: ALARM: WAN(69.243..) *** delay ***
    Aug 10 09:03:22 check_reload_status: Reloading filter
    Aug 10 09:40:12 apinger: alarm canceled: WAN(69.243..) *** delay ***
    Aug 10 09:40:24 check_reload_status: Reloading filter
    Aug 10 09:40:24 sshd[45735]: Did not receive identification string from 10.0.2.22
    Aug 10 09:40:24 sshd[45670]: Did not receive identification string from 10.0.2.22
    Aug 10 09:40:30 apinger: ALARM: WAN(69.243..) *** delay ***
    Aug 10 09:40:40 check_reload_status: Reloading filter
    Aug 10 09:48:36 apinger: alarm canceled: WAN(69.243..) *** delay ***
    Aug 10 09:48:46 check_reload_status: Reloading filter
    Aug 10 09:57:28 sshd[4691]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:28 sshd[4799]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:28 apinger: ALARM: WAN(69.243..) *** down ***
    Aug 10 09:57:28 sshd[5110]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:28 sshd[5407]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:28 apinger: alarm canceled: WAN(69.243..) *** down ***
    Aug 10 09:57:28 sshd[5157]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:28 sshd[5552]: Did not receive identification string from 10.0.2.22
    Aug 10 09:57:38 check_reload_status: Reloading filter
    Aug 10 09:57:38 check_reload_status: Reloading filter
    Aug 10 10:10:51 apinger: ALARM: WAN(69.243..) *** delay ***
    Aug 10 10:10:59 apinger: alarm canceled: WAN(69.243..) *** delay ***
    Aug 10 10:11:02 check_reload_status: Reloading filter
    Aug 10 10:11:09 check_reload_status: Reloading filter


  • Netgate Administrator

    @5k1ttl3 wrote:

    "I'll double nmbcluster value and test again - could you explain to me exactly what that value does? Or how high I can set it with 2GB of ram?"

    Not in any helpful way!  ;)
    If you search the forum there are a number of detailed threads about this as it has previously been a problem.
    You can check the MBUF usage on the dashboard; it would normally show as very high or completely used if the NIC issue were the cause.

    I'm not sure what you are showing me with your logs. Apart from your WAN connection being a bit flaky (or the alarm settings being too low) I see no problems.

    Steve
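
To judge how flaky the WAN looked to apinger, the alarm lines in the log above can be paired up and timed. This is a rough Python sketch under stated assumptions: the regular expression is fitted only to the apinger lines shown in this thread, the function name `alarm_durations` is made up for illustration, and the `year` argument is a placeholder because syslog timestamps carry no year. The log excerpt is copied from the posted log, with the wrapped "alarm canceled" lines joined.

```python
import re
from datetime import datetime

# Matches apinger alarm lines as they appear in the system log above.
ALARM_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) apinger: "
    r"(?P<kind>ALARM|alarm canceled): .*\*\*\* (?P<name>\w+) \*\*\*"
)


def alarm_durations(lines, year=2000):
    """Pair each ALARM with its 'alarm canceled' line; return (name, seconds) tuples."""
    started = {}
    durations = []
    for line in lines:
        match = ALARM_RE.match(line.strip())
        if not match:
            continue  # not an apinger alarm line
        when = datetime.strptime(f"{year} {match['ts']}", "%Y %b %d %H:%M:%S")
        name = match["name"]
        if match["kind"] == "ALARM":
            started.setdefault(name, when)  # remember when this alarm began
        elif name in started:
            elapsed = when - started.pop(name)
            durations.append((name, int(elapsed.total_seconds())))
    return durations


LOG = [
    "Aug 10 08:57:51 apinger: ALARM: WAN(69.243..) *** down ***",
    "Aug 10 08:57:51 apinger: alarm canceled: WAN(69.243..) *** down ***",
    "Aug 10 09:03:12 apinger: ALARM: WAN(69.243..) *** delay ***",
    "Aug 10 09:03:22 check_reload_status: Reloading filter",
    "Aug 10 09:40:12 apinger: alarm canceled: WAN(69.243..) *** delay ***",
    "Aug 10 09:40:30 apinger: ALARM: WAN(69.243..) *** delay ***",
    "Aug 10 09:48:36 apinger: alarm canceled: WAN(69.243..) *** delay ***",
]

for name, seconds in alarm_durations(LOG):
    print(f"{name}: {seconds} s")
```

For this excerpt the first delay alarm lasts 2220 seconds (37 minutes), which is the kind of long stretch worth lining up against the times of the test transfers.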



  • I wasn't sure whether anything in my system logs was valuable; I didn't know what the apinger alarms were about.
    I also wasn't sure about the 'Reloading filter' entries.

    I just did another transfer and found something (possibly) odd: my MBUF usage stayed static at 2990/262144. It didn't go up at all when I started the transfer, it hadn't moved when the transfer slowed down and the web interface kicked me out (unless it went up too quickly and kicked me out before I could see it), and it never went DOWN once the firewall returned to idle.

    Also, the entry about IPsec tunnel endpoints is odd: I do not have an IPsec tunnel set up (although I used to… maybe I need to completely blow away this configuration and build a new one from scratch).

    Thank you for all of your assistance so far


  • Netgate Administrator

    Well, your MBUF usage looks pretty normal, so it's probably not that. Though you could try running netstat -m to see if anything else looks odd.
    Apinger is the process that checks the status of your WAN by periodically pinging the gateway. In your logs it is showing that the ping time has become excessively high (200ms from memory). This can happen when the connection is pushed close to its limit. You can tune this setting to prevent false alarms in System: Gateways: Edit gateway: Advanced.
    I don't know how much help I've been, to be honest!

    Steve
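
For a sense of scale on the 2990/262144 reading discussed above, here is a small Python sketch of the arithmetic. It assumes the reading is current/maximum mbuf clusters and that each cluster is 2048 bytes, the usual FreeBSD cluster size; both are assumptions rather than something confirmed in this thread, and `mbuf_summary` is an illustrative name.

```python
CLUSTER_BYTES = 2048  # assumption: standard FreeBSD mbuf cluster size


def mbuf_summary(reading, cluster_bytes=CLUSTER_BYTES):
    """Summarize a 'current/max' mbuf cluster reading such as '2990/262144'."""
    current, maximum = (int(part) for part in reading.split("/"))
    return {
        "percent_used": round(100 * current / maximum, 2),
        "in_use_mib": round(current * cluster_bytes / 2**20, 1),
        "ceiling_mib": maximum * cluster_bytes // 2**20,
    }


print(mbuf_summary("2990/262144"))
```

Under those assumptions the box is using a little over 1% of its clusters, and an nmbclusters value of 262144 corresponds to a ceiling of about 512 MiB of cluster memory (131072 would be 256 MiB), which gives a rough way to weigh the setting against 2GB of RAM.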



  • You've been more help than you know -

    This afternoon I completely blew away my config and rebuilt it from scratch.
    I initiated a 15GB transfer and took a nap while logging pings to my pfSense box.
    The transfer completed on time, and I didn't get a SINGLE ping response >1ms.

    I HATE speaking too soon on these matters but this might have resolved my problem -

    When I built this box, I just made a backup of the old system and imported it ALL into the new system. I think there might have been some old, bad configuration left in there (the log entry about IPsec tipped me off).

    One problem I'm seeing now is the following:
    My network connection seems to drop out on my desktop, sort of. My system keeps running my "bandwidth test" out to my NNTP servers, but even after I stop that transfer I still can't reach the web from this machine. I can reach the pfSense web interface just fine now, with ping times of <1ms, but I can't get out to the web until I go in and make some change in pfSense. In this case it was changing my DNS servers, but it's not a DNS issue; I think just changing something causes pfSense to "reset" and allow traffic through.

    Surely it wasn't getting angry that I had a constant ping running against it, was it? I stopped the ping. I guess I'll just keep an eye on it unless you have any ideas.

    Thanks again,
    Jay
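
The ping logging described in the post above can be summarized with a few lines of Python. This is only a sketch: it assumes the log contains the usual `time=0.412 ms` (Unix-style) or `time<1ms` (Windows-style) reply lines, the address 10.0.2.1 in the sample is hypothetical, and `summarize_ping_log` is an illustrative name. Timeouts are skipped rather than counted.

```python
import re

# Round-trip time as printed by common ping tools, e.g. "time=0.412 ms" or "time<1ms".
RTT_RE = re.compile(r"time([=<])\s*(\d+(?:\.\d+)?)\s*ms")


def summarize_ping_log(lines, threshold_ms=1.0):
    """Count ping replies and how many took longer than threshold_ms."""
    replies = slow = 0
    worst = 0.0
    for line in lines:
        match = RTT_RE.search(line)
        if not match:
            continue  # header, timeout, or summary line
        op, rtt = match.group(1), float(match.group(2))
        replies += 1
        worst = max(worst, rtt)
        # "time<1ms" only gives an upper bound, so it is never counted as slow.
        if op == "=" and rtt > threshold_ms:
            slow += 1
    return {"replies": replies, "slow": slow, "worst_ms": worst}


# Illustrative log lines; 10.0.2.1 is a made-up address for the firewall.
sample_log = [
    "64 bytes from 10.0.2.1: icmp_seq=1 ttl=64 time=0.412 ms",
    "64 bytes from 10.0.2.1: icmp_seq=2 ttl=64 time=0.398 ms",
    "64 bytes from 10.0.2.1: icmp_seq=3 ttl=64 time=104.227 ms",
    "Reply from 10.0.2.1: bytes=32 time<1ms TTL=64",
    "Request timed out.",
]

print(summarize_ping_log(sample_log))
```

A run with zero slow replies over the whole transfer would match the result reported above, while replies in the 100ms range would show up in both the slow count and the worst-case value.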


  • Netgate Administrator

    Hard to say. Certainly when you save the config some things are restarted/reloaded so that seems likely.
    Have you tried adjusting the apinger settings? If the WAN connection gets flagged as down at any time, that can cause problems. Though I haven't experienced it myself, others have reported very slow GUI access in that instance.

    Steve



  • In the logs, apinger marks your gateway as down because of packet loss or excessive latency.

    You seem to be saturating both upstream and downstream on your WAN, which makes the WAN look too slow or unreliable from apinger's perspective.

    You can just disable the gateway monitor on that WAN Gateway if you want…  ::)

