(SOLVED) AMD 64 6/13 snapshot failing after approx 4 hours



  • NOTE: This has been solved; see the lower section of this post.

    All:

    I have an AMD64 full install on a server, running today's RC2 snapshot (6/13), that is failing after approx 4 hours.  The console is responsive, but the WebGUI is very sluggish and occasionally fails to load a page, and it appears that no traffic is being passed over the WAN interface.

    Here are some other notes:

    The server is an HP DL-360 G6
    When the condition occurs, the users on the LAN subnet lose access to data on the WAN (internet).
    When the condition occurs, all IPsec / OpenVPN connections time out and drop.
    All interfaces (except WAN) are VLANs on a common 'fabric' sharing one physical link attached to a Cisco 3650 series layer 2 switch.
    All interfaces use Static IPs.
    A reboot fixes the issue for another 4 hours.

    Any thoughts?  This server was working fine on an older RC1 image (up for months), and this condition started after updating to an RC2 image (I have updated to several RC2 images since Friday in an attempt to resolve it, with no luck).

    Next time I'm onsite, I can gather logs and the like if anyone wants.

    On a side note, is there anything architecture specific in the config file (x86 vs x64) or can I just rebuild it as a 32 bit host and restore the config to see if it helps?

    Thanks in advance.

    ===========
    ==   SOLVED  ==

    Thanks to: David Handelman, Gloom, CMB & stephenw10 for the help

    BACKUP
    I don't need to tell you to back up your config, prepare to reinstall pfSense and restore, and do this off-hours, do I? ::)

    VERIFY THE ROOT ISSUE:
    The server was depleting all memory buffers for the network interface and in essence stopping all traffic from passing through the NICs.

    If you are experiencing symptoms, you can run:  netstat -m | grep "mbufs denied"  The three numbers at the start of the line should be 0/0/0.   If any of these numbers is non-zero, you will need to increase your buffer.
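
    A minimal sketch of that check, assuming the netstat -m output contains a line of the form quoted later in this thread ("N/N/N requests for mbufs denied ..."). The function name mbufs_denied is hypothetical, and it reads the output on stdin so it can be tried against saved text as well as the live command:

    ```shell
    # mbufs_denied: read `netstat -m` output on stdin and print
    #   "ok"      when the three denied counters are all zero,
    #   "denied"  when any of them is non-zero,
    #   "unknown" when the expected line is not present.
    mbufs_denied() {
        awk '/requests for mbufs denied/ {
                 split($1, n, "/")
                 found = 1
                 if (n[1] + n[2] + n[3] > 0) bad = 1
             }
             END {
                 if (!found) print "unknown"
                 else if (bad) print "denied"
                 else print "ok"
             }'
    }

    # On the firewall itself:
    #   netstat -m | mbufs_denied
    ```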

    SOLUTION
    Tune the kernel to increase memory buffers.

    DETERMINE THE CURRENT VALUE:
    Run 'sysctl kern.ipc.nmbclusters' and save this value so you can revert any changes.

    DETERMINE THE AVAILABLE MEMORY:
    Run: top -d 1 | grep Mem:   (Try to run this at a busy time)
    At the end of the output, you will see an amount followed by "Free"; this is your available RAM.   Note that this is only a point-in-time snapshot, so do not allocate 100% of this amount below.

    You should multiply the current kern.ipc.nmbclusters value (determined above) by 2048 (2KB) to determine the number of BYTES used by the buffer, and add it to the available memory (note that the units may not match: KB/MB/GB/etc., so convert before adding).  This is the available memory you should use when calculating the value in the next step.
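
    As a rough illustration of that arithmetic, here is a small sketch. The function name mbuf_budget_mb and the example inputs (1500 MB free, 25600 clusters) are made up for illustration; substitute the numbers from your own top and sysctl output:

    ```shell
    # mbuf_budget_mb FREE_MB CURRENT_CLUSTERS
    # Prints free memory plus the memory already reserved for the current
    # cluster limit, in whole MB, assuming 2048 bytes per cluster.
    mbuf_budget_mb() {
        free_mb=$1
        clusters=$2
        cluster_mb=$(( clusters * 2048 / 1024 / 1024 ))
        echo $(( free_mb + cluster_mb ))
    }

    # Example inputs only: 1500 MB free, 25600 clusters (50 MB).
    mbuf_budget_mb 1500 25600
    ```

    Converting everything to MB first sidesteps the unit mismatch mentioned above.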

    DETERMINE THE NEW BUFFER VALUE:
    Each memory buffer (mbuf cluster) consumes 2KB of memory as noted above, and a recommended starting point is between 32768 and 65536 buffers.   The result will consume between 64MB and 128MB of system memory; ensure that your system can support the value you choose without using all available RAM.
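
    Going the other way, a memory allowance in MB can be turned into a buffer count. This is only a sketch of the 2KB-per-buffer arithmetic; mb_to_clusters is a hypothetical helper name:

    ```shell
    # mb_to_clusters MB: number of 2 KB buffers that fit in MB megabytes.
    mb_to_clusters() {
        echo $(( $1 * 1024 / 2 ))
    }

    for mb in 64 128; do
        echo "$mb MB holds $(mb_to_clusters "$mb") buffers"
    done
    ```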

    You can monitor your buffer usage by running 'netstat -m' and looking at the second output line which should look similar to the following

    
    16321/3173/19494/131072 mbuf clusters in use (current/cache/total/max)
    
    

    The third value (19494 in this example) is the total number of buffers allocated (current plus cache).  The fourth value (131072 in this example) is the current maximum number of buffers available.   If you see value 3 approach value 4 (say 80% or so), you should increase the buffer.

    APPLY THE CHANGE:

    Obviously replace VALUE with the number determined above

    To update a currently running system run:
        sysctl kern.ipc.nmbclusters="VALUE"

    To make this setting persist across a reboot
       edit /boot/loader.conf and replace
           kern.ipc.nmbclusters="0"
                  with
           kern.ipc.nmbclusters="VALUE"
                  (or add it if not present)

    To make this setting persist across an upgrade
       edit (or create) /boot/loader.conf.local and add the following line
           kern.ipc.nmbclusters="VALUE"
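
    If you would rather script the file edit than do it by hand, something along these lines could work. It is a sketch, not a pfSense-provided tool: set_loader_tunable is a hypothetical helper that replaces an existing NAME="..." line or appends one if it is missing, and it should be tried on a copy of the file first:

    ```shell
    # set_loader_tunable FILE NAME VALUE
    # Ensure FILE contains exactly one line NAME="VALUE".
    set_loader_tunable() {
        file=$1
        name=$2
        value=$3
        touch "$file"
        # Copy every line that does not start with NAME= (literal match),
        # then append the new assignment and swap the file into place.
        awk -v n="$name" 'index($0, n "=") != 1' "$file" > "${file}.tmp"
        printf '%s="%s"\n' "$name" "$value" >> "${file}.tmp"
        mv "${file}.tmp" "$file"
    }

    # Example (run as root on the firewall, VALUE chosen as described above):
    #   set_loader_tunable /boot/loader.conf.local kern.ipc.nmbclusters 65536
    ```

    Running it twice with the same arguments leaves a single line in the file, so it is safe to repeat after an upgrade.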

    MONITOR THE CHANGE:
    You can monitor your buffer usage by running 'netstat -m' and looking at the second line which should look similar to the following

    
    16321/3173/19494/131072 mbuf clusters in use (current/cache/total/max)
    
    

    The third value (19494 in this example) is the total number of buffers allocated (current plus cache).  The fourth value (131072 in this example) is the VALUE you assigned.   If you see value 3 approach value 4 (say 80% or so), you should increase the buffer.
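
    The same comparison can be reduced to a single percentage. This is a sketch that assumes the line format shown above; cluster_usage is a hypothetical name, and it reads netstat -m output on stdin:

    ```shell
    # cluster_usage: print value 3 as a whole-number percentage of value 4
    # from the "mbuf clusters in use" line (current/cache/total/max).
    cluster_usage() {
        awk '/mbuf clusters in use/ {
                 split($1, n, "/")
                 if (n[4] > 0) printf "%d\n", int(n[3] * 100 / n[4])
             }'
    }

    # On the firewall itself (80 is the rough threshold suggested above):
    #   pct=$(netstat -m | cluster_usage)
    #   [ "$pct" -ge 80 ] && echo "consider raising kern.ipc.nmbclusters"
    ```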

    Hope this helps someone.

    Brian



  • You can safely restore onto i386; the config is the same.
    What NICs are you using?



  • The onboard NICS (HP NC382i)
     – these are Broadcom, I believe the Broadcom BCM5709C
    http://h18000.www1.hp.com/products/quickspecs/13235_na/13235_na.html

    They have been working fine up until RC2.



  • You can try adding
    kern.ipc.nmbclusters="65536" to /boot/loader.conf.local

    I'm suffering a lot from those NICs, and I found that the line above helps.

    You can also downgrade to the working snapshot.

    Good Luck..



  • I will try a downgrade tomorrow.  The oldest snapshot I see is May 24; hopefully this is old enough.

    I know it was working on Feb 24 based on:  http://forum.pfsense.org/index.php/topic,33052.msg174381.html#msg174381

    If I was out of memory buffer (nmbclusters), wouldn't I see it in the system log?



  • I'm not sure that you will see it in the log.



  • I found someone onsite with access to the room, who rebooted it for me.  I successfully downgraded to:

    http://snapshots.pfsense.org/FreeBSD_RELENG_8_1/amd64/pfSense_HEAD/updates/pfSense-Full-Update-2.0-RC1-amd64-20110524-0246.tgz

    If this works for an extended period, what should I do to report this?  As it happens, I have a spare piece of hardware with the same specs (HP 360-G6) that I could bring up to test new snapshots with as well as assist in identifying the root cause.

    let me know,

    Brian



  • This appears to be possibly related to: http://redmine.pfsense.org/issues/1425
    – The above bug is in reference to the bge driver and I am using the bce driver; could they be related?

    Quick question: what is the latest snapshot I can use, and when did you switch to FreeBSD 8.1?   Running uname -a on the 5/24 version shows:
    FreeBSD host.name.here 8.1-RELEASE-p3 FreeBSD 8.1-RELEASE-p3 #1: Tue May 24...

    Can anyone provide info on what version I can run (and where to get it)?

    Thanks,



  • Bad news… the 5/27 snapshot just took a dirt nap... I'll have to go onsite to gain console access, and I'll try modifying kern.ipc.nmbclusters...

    Does anyone know where I could get a Feb 24 snapshot?

    Any Ideas?



  • Ah the joys of the HP DL series of servers.
    First run
    netstat -m
    You will probably see a line similar to this

    0/1564137/323086 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

    This is most likely the cause of your problems and IIRC has existed on the DL360 G6/7 since day 1

    You need it to be all zeros so from a command prompt try

    sysctl kern.ipc.nmbclusters=51200

    That is twice the default value but you may need to increase it depending on your requirements.
    I run my G6 and G7 boxes at 4x the default value, but I've got full internet BGP routes on them, and I've had the 13th May image running solidly since its release with no issues.

    NB This setting will not persist through upgrades or reboots.



  • Thanks.. I will check once I'm onsite later today.  (luckily, this is in a lab environment and not prod).

    If I add this via /boot/loader.conf.local, it should persist across a reboot, but not an upgrade, correct?

    Looking at the FreeBSD tuning guide (http://www.freebsd.org/doc/handbook/configtuning-kernel-limits.html), they recommend a max of 32K:

    We recommend values between 4096 and 32768 for machines with greater amounts of memory. Under no circumstances should you specify an arbitrarily high value for this parameter as it could lead to a boot time crash.

    I'm not worried about memory, so I may set it to 64K (65536), which would consume 128MB of RAM.  This should not be an issue as this server has 24GB (it's overkill, but it's what we had lying around that I could use in a lab… One of the advantages of working in a large shop  ;))



  • lol and here was me thinking my 16GB was OTT

    Yes loader.conf.local will survive a reboot.

    The handbook is a little conservative in its values. Assuming you are running the 64-bit build and have the physical memory, it's safe to take it up quite a bit higher with the 8.x versions of FreeBSD.



  • I was able to work on this today and have made the following changes:

    Upgraded to today's snapshot:  (6/14)

    ran sysctl kern.ipc.nmbclusters=65536

    edited /boot/loader.conf and replaced

    kern.ipc.nmbclusters="0"
    with
    kern.ipc.nmbclusters="65536"

    I am watching the netstat -m output and have seen no denied mbuf allocations (although I did not see any before the change either).



  • I don't want to call it fixed yet… but so far, so good.  Will update tomorrow.

    Dumb question... is there any reason I couldn't add this to "System: Advanced: System Tunables" as a manual entry if it proves to be stable similar to the following?

    And Thank you all for the help!



  • @wallacebw:

    Dumb question… is there any reason I couldn't add this to "System: Advanced: System Tunables" as a manual entry if it proves to be stable similar to the following?

    That's only for sysctls, not loader items.



  • Because my mbuf numbers were steadily climbing, I set kern.ipc.nmbclusters="131072" on an amd64 machine with 4GB RAM, with no problem. When your mbuf number hits the set value of that variable, everything stops.



  • @CMB:

    Understood, but if these sysctl commands are executed at startup, wouldn't this effectively have the same impact for this particular entry, given that the buffers don't fill up that quickly and can be modified post-boot?

    I would like to have it in the config so that if I forget to modify the loader after an upgrade, the setting still gets applied.



  • It used to work several months ago when you added it in there, then it stopped for some reason I can't remember. There is a post about it that you might be able to find. So now I just stick it in /boot/loader.conf.local and remember to set it after each update.


  • Netgate Administrator

    If you put it in /boot/loader.conf.local it should get copied across an update automatically.

    Steve



  • @stephenw10:

    If you put it in /boot/loader.conf.local it should get copied across an update automatically.

    Steve

    I just tested and can confirm.  Good info… Thanks!
    This should make it into the next guide book.  I didn't see any mention of it in the 1.2.3 book.



  • I updated the first post with issue details and resolution steps.  Please feel free to let me know if I have any inaccuracies / bad advice listed.

    Thanks again.

