Hyper-V 2012R2 mbufs memory leak/exhaustion



  • We recently switched from a hardware-based solution to a clustered pfSense setup on Hyper-V in an attempt to remove proprietary appliances from our racks. pfSense is running well in the virtualized environment, but we are facing a memory leak of some sort in the network buffers. Over a period of days we see a gradual increase in mbuf usage, to the point where the firewall runs out of buffer space and has to be rebooted. We increased our mbuf clusters limit to 1,000,000 to buy more time between reboots, but nonetheless something is happening here. We have other virtualized instances, and it does not happen on all of them. My suspicion is that it has something to do with a certain service, package or feature we are using. This particular firewall runs Snort, IPsec, OpenVPN and CARP/HA on top of normal firewall traffic. It is connected to Hyper-V with a standard NIC (hn0) with trunking enabled at the OS level to allow native VLAN traffic across the virtual NIC.

    With an uptime of 2 days, here is where we are at:
    306732/2688/309420 mbufs in use (current/cache/total)
    4145/1685/5830/1000000 mbuf clusters in use (current/cache/total/max)

    The "mbufs in use" count will continue to rise until we eventually crash and traffic stops passing.
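For anyone wanting to log these counters over time rather than eyeball them, here is a minimal sketch that parses the two lines quoted above. It assumes the text comes from `netstat -m` and only handles the "mbufs in use" and "mbuf clusters in use" lines; the function name `parse_netstat_m` is made up for this example.

```python
import re

# Matches lines such as "306732/2688/309420 mbufs in use (current/cache/total)".
LINE_RE = re.compile(r"^\s*([\d/]+) (mbufs|mbuf clusters) in use \(([\w/]+)\)")


def parse_netstat_m(text):
    """Return a dict of counters keyed by 'mbufs' and 'mbuf clusters'.

    Hypothetical helper: it ignores every line that is not one of the
    two "in use" lines shown in this thread.
    """
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        values = [int(v) for v in m.group(1).split("/")]
        labels = m.group(3).split("/")  # e.g. current, cache, total
        result[m.group(2)] = dict(zip(labels, values))
    return result


sample = """\
306732/2688/309420 mbufs in use (current/cache/total)
4145/1685/5830/1000000 mbuf clusters in use (current/cache/total/max)
"""
stats = parse_netstat_m(sample)
print(stats["mbufs"]["current"])
print(stats["mbuf clusters"]["max"])
```

Run from cron and appended to a file with a timestamp, this would give a growth curve that can be lined up against traffic or service restarts.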

    This is 2.2.4 w/ 4 CPUs and 2GB of statically assigned RAM.

    Has anyone experienced something similar? We would hate to revert to our hardware appliances; we would rather find the memory leak / root cause and work toward a resolution.

    Thanks.

    Phil

    EDIT: 538,000 mbufs in use. At the current rate, we should crash in 8 days.

    EDIT2: Like clockwork we crashed tonight when we hit a staggering 1 million mbufs in use. Took about 10 days.

    No doubt there is a bug here; it's just a matter of figuring out which combination of settings I am using in conjunction with Hyper-V triggers it. Anyone care to help me troubleshoot further to find the root cause and submit a bug report/patch?
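The "days until crash" estimates in the edits above are a straight-line extrapolation from two readings. A minimal sketch of that arithmetic follows, assuming the leak grows linearly; the function name and the sample numbers are hypothetical and are not measurements from this firewall.

```python
def days_until_limit(t0_days, count0, t1_days, count1, limit):
    """Linear extrapolation: days after the second sample until `limit`.

    Returns None when the count is flat or shrinking, since no
    exhaustion date can be projected in that case.
    """
    if t1_days <= t0_days:
        raise ValueError("second sample must be later than the first")
    rate = (count1 - count0) / (t1_days - t0_days)  # mbufs per day
    if rate <= 0:
        return None
    return (limit - count1) / rate


# Hypothetical readings: 100,000 mbufs at day 0 and 300,000 at day 2,
# projected against a 1,000,000 ceiling.
remaining = days_until_limit(0.0, 100_000, 2.0, 300_000, 1_000_000)
print(remaining)  # 7.0
```

If the real growth tracks traffic volume rather than time, the projection will drift, so it is worth recomputing from fresh samples.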



  • Upgraded to 2.2.5 and the problem has amplified significantly. We were able to go 10 days between reboots; now we can only go 4, maybe 5 days. We were hoping this would be solved, but it has gotten much worse.

    EDIT: Even worse today, we are less than 48 hours to 1 million. The upgrade to 2.2.5 has actually made whatever the memory leak is far worse.



  • One thing to try would be to use more virtual interfaces and let Windows do the VLAN tagging.

    We do have IPsec and NAT running on our virtualized pfSense. Our systems have been up on Server 2012 R2 with pfSense 2.2.4 for about 75 days; I will upgrade a few of them to 2.2.5 tomorrow. Not much traffic runs across these boxes.

    For pfSense systems I have been using 2 cores and 4 GB of memory. I would make sure to be using the 64-bit version of pfSense.



  • Unfortunately we cannot do the Windows-side trunking because we have more VLANs than the limit of 8 on Hyper-V. We are on 64-bit (shown as amd64 on the dashboard). We move about 20MB/s of traffic around the clock on the virtual appliance. We have 4 cores assigned (needed due to VPN traffic) and 2GB of memory. We can try increasing to 4GB for the hell of it.



  • @pciccone:

    Upgraded to 2.2.5 and the problem has amplified significantly. We were able to go 10 days between reboots; now we can only go 4, maybe 5 days. We were hoping this would be solved, but it has gotten much worse.

    EDIT: Even worse today, we are less than 48 hours to 1 million. The upgrade to 2.2.5 has actually made whatever the memory leak is far worse.

    The upgrade definitely didn't change anything there. The Hyper-V components in the OS haven't changed in the last 3-4 releases, and the only OS changes were minor security updates that can't impact anything along those lines. Something other than the upgrade changed to make it worse if it's worse. More traffic maybe?

    If you can provide some specifics to replicate, we can pass the issue along to Microsoft with details and get it fixed if it's not already in FreeBSD 10-STABLE.

    If you have the ability to test a replicable circumstance with a 2.3 snapshot, that would help as well. It's possible that's something that's been fixed in its 10-STABLE base.



  • OK, stand by on that. I will set up a 2.3-snapshot appliance and load the config to run for a while as a test, assuming most features are stable enough for a few hours. I will report back on this. If it still shows the same signs, I can also ship a copy of the appliance, with relevant keys and passwords changed, to whoever will try to replicate it. In theory my appliance should behave the same way no matter the physical hardware, which in itself makes for a good test.



  • That'd be helpful, thanks. I don't think it's specific to your configuration as much as the traffic going through the box (the amount or type of it maybe), but that's hard to say at this point.



  • Loaded a 2.3-snapshot (current as of today) into a new VM and then restored the config. We were not able to pass any traffic, so we never got to the point of testing this issue. I am including the multiple problems we hit in this thread, since I am not sure which was the ultimate cause of failure. We have the 2.3-snapshot VM ready to retry after any suggestions on how to proceed.

    This appears on the console over and over again, and I suspect it is the issue (I have read very recent threads about disabling TSO in FreeBSD, but I am not sure how that translates to pfSense):

    
    hn0: exceed max page buffers,75,32
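To put a number on "over and over again", a minimal sketch that counts these messages in captured console or log text. The message format is taken from the line above; what the two trailing numbers mean is not stated in this thread, so the code only records them without interpreting them. The function name is hypothetical.

```python
import re
from collections import Counter

# Format taken from the console line quoted above:
#   "hn0: exceed max page buffers,75,32"
MSG_RE = re.compile(r"^\s*(\w+): exceed max page buffers,(\d+),(\d+)\s*$")


def count_page_buffer_warnings(log_text):
    """Count warnings per (interface, first number, second number).

    The two numbers are kept as-is; their meaning is an open question
    here and is not assumed by this helper.
    """
    counts = Counter()
    for line in log_text.splitlines():
        m = MSG_RE.match(line)
        if m:
            counts[(m.group(1), int(m.group(2)), int(m.group(3)))] += 1
    return counts


sample_log = """\
hn0: exceed max page buffers,75,32
hn0: exceed max page buffers,75,32
some unrelated console line
"""
warnings = count_page_buffer_warnings(sample_log)
print(warnings)
```

Comparing the counts before and after a change (for example, toggling an offload setting) would show whether the change had any effect on the message rate.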
    
    

    Also:

    
    Crash report begins.  Anonymous machine information:
    
    amd64
    10.2-STABLE
    FreeBSD 10.2-STABLE #190 a9f1fcf(devel): Sat Nov 21 05:20:23 CST 2015     root@pfs23-amd64-builder:/usr/home/pfsense/pfsense/tmp/obj/usr/home/pfsense/pfsense/tmp/FreeBSD-src/sys/pfSense
    
    Crash report details:
    
    PHP Errors:
    [21-Nov-2015 15:37:45 America/New_York] PHP Stack trace:
    [21-Nov-2015 15:37:45 America/New_York] PHP   1. {main}() /etc/rc.filter_configure_sync:0
    [21-Nov-2015 15:37:45 America/New_York] PHP   2. filter_configure_sync() /etc/rc.filter_configure_sync:37
    [21-Nov-2015 15:37:45 America/New_York] PHP   3. filter_rules_generate() /etc/inc/filter.inc:273
    [21-Nov-2015 15:37:45 America/New_York] PHP   4. filter_generate_ipsec_rules() /etc/inc/filter.inc:3645
    [21-Nov-2015 15:37:45 America/New_York] PHP Stack trace:
    [21-Nov-2015 15:37:45 America/New_York] PHP   1. {main}() /etc/rc.filter_configure_sync:0
    [21-Nov-2015 15:37:45 America/New_York] PHP   2. filter_configure_sync() /etc/rc.filter_configure_sync:37
    [21-Nov-2015 15:37:45 America/New_York] PHP   3. filter_rules_generate() /etc/inc/filter.inc:273
    [21-Nov-2015 15:37:45 America/New_York] PHP   4. filter_generate_ipsec_rules() /etc/inc/filter.inc:3645
    [21-Nov-2015 15:37:45 America/New_York] PHP Fatal error:  Call to undefined function XML_RPC_encode() in /usr/local/pkg/snort/snort.inc on line 3867
    [21-Nov-2015 15:37:45 America/New_York] PHP Stack trace:
    [21-Nov-2015 15:37:45 America/New_York] PHP   1. {main}() /etc/rc.start_packages:0
    [21-Nov-2015 15:37:45 America/New_York] PHP   2. sync_package() /etc/rc.start_packages:66
    [21-Nov-2015 15:37:45 America/New_York] PHP   3. eval() /etc/inc/pkg-utils.inc:596
    [21-Nov-2015 15:37:45 America/New_York] PHP   4. sync_snort_package_config() /etc/inc/pkg-utils.inc(596) : eval()'d code:1
    [21-Nov-2015 15:37:45 America/New_York] PHP   5. snort_sync_on_changes() /usr/local/pkg/snort/snort.inc:1062
    [21-Nov-2015 15:37:45 America/New_York] PHP   6. snort_do_xmlrpc_sync() /usr/local/pkg/snort/snort.inc:3824
    
    

    Last - snort package install:

    
    Executing custom_php_resync_config_command()...
    PHP ERROR: Type: 1, File: /usr/local/pkg/snort/snort.inc, Line: 3867, Message: Call to undefined function XML_RPC_encode()pkg: POST-INSTALL script failed
    
    

  • @pciccone:

    Last - snort package install:

    
    Executing custom_php_resync_config_command()...
    PHP ERROR: Type: 1, File: /usr/local/pkg/snort/snort.inc, Line: 3867, Message: Call to undefined function XML_RPC_encode()pkg: POST-INSTALL script failed
    
    

    Completely OT here. Someone got the great idea to remove

    
    require_once("xmlrpc.inc");
    
    

    from /etc/inc/pkg-utils.inc, breaking ~30 packages or so. Sigh.

    https://github.com/pfsense/pfsense/pull/2102



  • I saw the commit, but it's not in the snapshot yet. I will try again next weekend during the maintenance window. I do believe the "hn0: exceed max page buffers,75,32" message will be an open issue with Hyper-V and pfSense.

