Netgate Discussion Forum

    (SOLVED) AMD 64 6/13 snapshot failing after approx 4 hours

    2.0-RC Snapshot Feedback and Problems - RETIRED
    21 Posts 6 Posters 12.4k Views
    • W
      wallacebw
      last edited by

      NOTE:  This has been solved; see the lower section of this post.

      All:

      I have an AMD64 full install running today's RC2 snapshot (6/13) that fails after approximately 4 hours.  The console is responsive, but the WebGUI is very sluggish and occasionally fails to load a page, and it appears that no traffic is being passed over the WAN interface.

      Here are some other notes:

      The server is an HP DL-360 G6
      When the condition occurs, users on the LAN subnet lose access to data on the WAN (internet).
      When the condition occurs, all IPsec / OpenVPN connections time out and drop.
      All interfaces (except WAN) are VLANS on a common 'fabric' sharing one physical link attached to a Cisco 3650 series layer 2 switch.  
      All interfaces use Static IPs.
      A reboot fixes the issue for another 4 hours.

      Any thoughts?  This server was working fine on an older RC1 image (up for months) and this condition started after updating to an RC2 image (I have updated to several RC2 images since Friday in an attempt to resolve it, with no luck).

      Next time I'm onsite, I can gather logs and the like if anyone wants.

      On a side note, is there anything architecture specific in the config file (x86 vs x64) or can I just rebuild it as a 32 bit host and restore the config to see if it helps?

      Thanks in advance.

      ===========
      ==   SOLVED  ==

      Thanks to: David Handelman, Gloom, CMB & stephenw10 for the help

      BACKUP
      I don't need to tell you to back up your config, to be prepared to reinstall pfSense and restore, or to do this off-hours, do I? ::)

      VERIFY THE ROOT ISSUE:
      The server was depleting all memory buffers for the network interface and in essence stopping all traffic from passing through the NICs.

      If you are experiencing these symptoms, run:  netstat -m | grep "mbufs denied"  The first three numbers should be 0/0/0.   If any of these numbers is non-zero, you will need to increase your buffer.
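The zero check above can be scripted. This is only a minimal sketch, not part of the original fix: `mbufs_denied` is a hypothetical helper name, and it assumes the denied line has the `N/N/N requests for mbufs denied` shape quoted in this thread. It reads `netstat -m` output on stdin, so it can also be tried against saved output.

```shell
#!/bin/sh
# mbufs_denied (hypothetical helper): exits 0 when any of the three
# "requests for mbufs denied" counters is non-zero, 1 otherwise.
mbufs_denied() {
    awk '/requests for mbufs denied/ {
        split($1, n, "/")                  # e.g. "0/1564137/323086"
        if (n[1] + n[2] + n[3] > 0) bad = 1
    }
    END { exit (bad ? 0 : 1) }'
}

# On the firewall itself (assumed usage):
#   netstat -m | mbufs_denied && echo "mbuf requests were denied"
```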

      SOLUTION
      Tune the kernel to increase memory buffers.

      DETERMINE THE CURRENT VALUE:
      run 'sysctl kern.ipc.nmbclusters' and save this value so you can revert any changes.

      DETERMINE THE AVAILABLE MEMORY:
      Run: top -d 1 | grep Mem:   (Try to run this at a busy time)
      At the end of the output you will see a value labeled Free; this is your available RAM.   Note that this is only a point-in-time snapshot, so do not allocate 100% of this amount below.

      Multiply the current kern.ipc.nmbclusters value (determined above) by 2048 (2KB) to determine the number of BYTES used by the buffer, and add it to the available memory (note that the units may not match: KB/MB/GB/etc.).  This is the available memory you should use when calculating the value in the next step.
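As a worked example of that arithmetic, here is a small sketch. `usable_mb` is a hypothetical helper; 25600 is the default cluster count implied elsewhere in this thread (51200 is described as twice the default), and the 3000 MB Free figure is made up for illustration.

```shell
#!/bin/sh
# usable_mb CLUSTERS FREE_MB (hypothetical helper)
# Converts the current cluster count to MB (2048 bytes per cluster) and
# adds it to the Free figure from top, both expressed in MB.
usable_mb() {
    clusters=$1
    free_mb=$2
    buffer_mb=$(( clusters * 2048 / 1048576 ))
    echo $(( buffer_mb + free_mb ))
}

usable_mb 25600 3000    # 50 MB of buffers + 3000 MB free -> prints 3050
```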

      DETERMINE THE NEW BUFFER VALUE:
      Each memory buffer (mbuf cluster) consumes 2KB of memory as noted above, and a recommended starting point is between 32768 and 65536 buffers.   The result will consume between 64MB and 128MB of system memory; ensure that your system can support the value you choose without using all available RAM.
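Going the other way, from a memory budget to a cluster count, is the same 2KB arithmetic in reverse. The helper name `clusters_for_mb` is hypothetical; the sketch only restates the numbers given above.

```shell
#!/bin/sh
# clusters_for_mb MB (hypothetical helper)
# How many 2048-byte clusters fit in a budget of MB megabytes.
clusters_for_mb() {
    echo $(( $1 * 1024 * 1024 / 2048 ))
}

clusters_for_mb 64     # prints 32768
clusters_for_mb 128    # prints 65536
```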

      You can monitor your buffer usage by running 'netstat -m' and looking at the second output line which should look similar to the following

      
      16321/3173/19494/131072 mbuf clusters in use (current/cache/total/max)
      
      

      The third value (19494 in this example) is the total number of buffers allocated (current plus cache).  The fourth value (131072 in this example) is the current maximum number of buffers available.   If you see the third value approach the fourth (say 80% or so), you should increase the buffer.
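The 80% rule of thumb can be checked with a short script. This is a sketch under the assumption that the line looks exactly like the example above; `cluster_usage_pct` is a hypothetical helper name, and it reads `netstat -m` output on stdin.

```shell
#!/bin/sh
# cluster_usage_pct (hypothetical helper): prints the third value (total)
# as a whole-number percentage of the fourth value (max).
cluster_usage_pct() {
    awk '/mbuf clusters in use/ {
        split($1, n, "/")                  # current/cache/total/max
        printf "%d\n", n[3] * 100 / n[4]   # truncated, not rounded
    }'
}

# On the firewall itself (assumed usage):
#   pct=$(netstat -m | cluster_usage_pct)
#   [ "$pct" -ge 80 ] && echo "consider raising kern.ipc.nmbclusters"
```

With the example line above, 19494 of 131072 works out to about 14%, well under the threshold.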

      APPLY THE CHANGE:

      Obviously replace VALUE with the number determined above

      To update a currently running system run:
          sysctl kern.ipc.nmbclusters="VALUE"

      To make this setting persist across a reboot,
         edit /boot/loader.conf and replace
             kern.ipc.nmbclusters="0"
                    with
             kern.ipc.nmbclusters="VALUE"
                    (or add it if not present)

      To make this setting persist across an upgrade,
         edit (or create) /boot/loader.conf.local and add the following line
             kern.ipc.nmbclusters="VALUE"
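The edit to /boot/loader.conf.local can also be done from a script so the line is never duplicated. This is a hedged sketch, not a pfSense-provided tool: `set_loader_tunable` is a hypothetical helper that rewrites the file with exactly one `name="value"` line for the given tunable and leaves the other lines untouched. Back up the file before trying it.

```shell
#!/bin/sh
# set_loader_tunable FILE NAME VALUE (hypothetical helper)
# Replaces any existing NAME= line in FILE, or adds one if FILE or the
# line is missing. Other lines are copied through unchanged.
set_loader_tunable() {
    file=$1
    name=$2
    value=$3
    tmp="${file}.tmp.$$"
    {
        # Copy every line that does not already start with NAME=.
        if [ -f "$file" ]; then
            awk -v key="${name}=" 'index($0, key) != 1' "$file"
        fi
        # Append the tunable in loader.conf syntax: name="value".
        printf '%s="%s"\n' "$name" "$value"
    } > "$tmp" && mv "$tmp" "$file"
}

# Assumed usage on the firewall:
#   set_loader_tunable /boot/loader.conf.local kern.ipc.nmbclusters 65536
```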

      MONITOR THE CHANGE:
      You can monitor your buffer usage by running 'netstat -m' and looking at the second line which should look similar to the following

      
      16321/3173/19494/131072 mbuf clusters in use (current/cache/total/max)
      
      

      The third value (19494 in this example) is the total number of buffers allocated (current plus cache).  The fourth value (131072 in this example) is the VALUE you assigned.   If you see the third value approach the fourth (say 80% or so), you should increase the buffer.

      Hope this helps someone.

      Brian

      • D
        David Handelman
        last edited by

        You can safely restore over I386, the config is the same.
        What nics are you using?

        • W
          wallacebw
          last edited by

          The onboard NICS (HP NC382i)
           – these are Broadcom, I believe the Broadcom BCM5709C
          http://h18000.www1.hp.com/products/quickspecs/13235_na/13235_na.html

          They have been working fine up until RC2.

          • D
            David Handelman
            last edited by

            You can try adding
            kern.ipc.nmbclusters=65536 to /boot/loader.conf.local

            I'm suffering a lot from those NICs, and I found that the line above helps.

            You can also downgrade to the working snapshot.

            Good Luck..

            • W
              wallacebw
              last edited by

              I will try a downgrade tomorrow.  The oldest snapshot I see is May 24,  hopefully this is old enough.

              I know it was working on Feb 24 based on:  http://forum.pfsense.org/index.php/topic,33052.msg174381.html#msg174381

              If I was out of memory buffer (nmbclusters), wouldn't I see it in the system log?

              • D
                David Handelman
                last edited by

                I'm not sure that you will see it in the log.

                • W
                  wallacebw
                  last edited by

                  I found someone onsite with access to the room, who rebooted it for me.  I successfully downgraded to:

                  http://snapshots.pfsense.org/FreeBSD_RELENG_8_1/amd64/pfSense_HEAD/updates/pfSense-Full-Update-2.0-RC1-amd64-20110524-0246.tgz

                  If this works for an extended period, what should I do to report this?  As it happens, I have a spare piece of hardware with the same specs (HP 360-G6) that I could bring up to test new snapshots with as well as assist in identifying the root cause.

                  let me know,

                  Brian

                  • W
                    wallacebw
                    last edited by

                    This appears it could be related to: http://redmine.pfsense.org/issues/1425
                    – The above bug is in reference to BGE drivers and I am using BCE drivers, related?

                    quick question, what is the latest snapshot I can use / when did you swap to FreeBSD 8.1?   Running a uname -a on the 5/24 version shows:
                    FreeBSD host.name.here 8.1-RELEASE-p3 FreeBSD 8.1-RELEASE-p3 #1: Tue May 24...

                    Can anyone provide info on what version I can run (and where to get it)?

                    Thanks,

                    • W
                      wallacebw
                      last edited by

                      Bad news… 5/27 snapshot just took a dirt nap... I'll have to go onsite to gain console access and I'll try modifying kern.ipc.nmbclusters...

                      Does anyone know where I could get a Feb 24 snapshot?

                      Any Ideas?

                      • G
                        Gloom
                        last edited by

                        Ah the joys of the HP DL series of servers.
                        First run
                        netstat -m
                        You will probably see a line similar to this

                        0/1564137/323086 requests for mbufs denied (mbufs/clusters/mbuf+clusters)

                        This is most likely the cause of your problems and IIRC has existed on the DL360 G6/7 since day 1

                        You need it to be all zeros so from a command prompt try

                        sysctl kern.ipc.nmbclusters=51200

                        That is twice the default value but you may need to increase it depending on your requirements.
                        I run my G6 and G7 boxes at 4x the default value, but I've got full internet BGP routes on them, and I've had the 13th May image running solidly since its release with no issues.

                        NB This setting will not persist through upgrades or reboots.

                        Never underestimate the power of human stupidity

                        • W
                          wallacebw
                          last edited by

                          Thanks.. I will check once I'm onsite later today.  (luckily, this is in a lab environment and not prod).

                          If I add this via /boot/loader.conf.local, it should persist a reboot, but not an upgrade, correct?

                          Looking at the FreeBSD tuning guide (http://www.freebsd.org/doc/handbook/configtuning-kernel-limits.html), they recommend a max of 32K:

                          "We recommend values between 4096 and 32768 for machines with greater amounts of memory. Under no circumstances should you specify an arbitrarily high value for this parameter as it could lead to a boot time crash."

                          I'm not worried about memory, so I may set it to 64K (65536), which would consume 128MB of RAM.  This should not be an issue as this server has 24GB (it's overkill, but it's what we had lying around that I could use in a lab… One of the advantages of working in a large shop  ;))

                          • G
                            Gloom
                            last edited by

                            lol and here was me thinking my 16Gb was OTT

                            Yes loader.conf.local will survive a reboot.

                            The handbook is a little conservative in its values. Assuming you are running the 64-bit build and have the physical memory, it's safe to take it up quite a bit higher with the 8.x versions of FreeBSD.


                            • W
                              wallacebw
                              last edited by

                              I was able to work on this today and have made the following changes:

                              Upgraded to today's snapshot:  (6/14)

                              ran sysctl kern.ipc.nmbclusters=65536

                              edited /boot/loader.conf and replaced

                              kern.ipc.nmbclusters="0"
                              with
                              kern.ipc.nmbclusters="65536"

                              I am watching the netstat -m output and have seen no denied mbuf allocations (although I did not see any before the change either).

                              • W
                                wallacebw
                                last edited by

                                I don't want to call it fixed yet… but so far, so good.  Will update tomorrow.

                                Dumb question... is there any reason I couldn't add this to "System: Advanced: System Tunables" as a manual entry if it proves to be stable similar to the following?

                                And Thank you all for the help!

                                • C
                                  cmb
                                  last edited by

                                  @wallacebw:

                                  Dumb question… is there any reason I couldn't add this to "System: Advanced: System Tunables" as a manual entry if it proves to be stable similar to the following?

                                  that's only for sysctls, not loader items.

                                  • C
                                    clarknova
                                    last edited by

                                    Because my mbuf numbers were steadily climbing, I set kern.ipc.nmbclusters="131072" on an amd64 machine with 4GB RAM, with no problems. When your mbuf number hits the set value of that variable, everything stops.

                                    db

                                    • W
                                      wallacebw
                                      last edited by

                                      @CMB:

                                      Understood, but if these sysctl commands are executed at startup, wouldn't this effectively have the same impact for this particular entry, given that the buffers don't fill up that quickly and can be modified post-boot?

                                      I would like to have it in the config so that if I forget to modify the loader after an upgrade, the setting still gets applied.

                                      • G
                                        Gloom
                                        last edited by

                                        It used to work several months ago when you added it in there, then it stopped for some reason I can't remember; there is a post about it you might be able to find. So now I just stick it in /boot/loader.conf.local and remember to set it after each update.


                                        • stephenw10S
                                          stephenw10 Netgate Administrator
                                          last edited by

                                          If you put it in /boot/loader.conf.local it should get copied across an update automatically.

                                          Steve

                                          • W
                                            wallacebw
                                            last edited by

                                            @stephenw10:

                                            If you put it in /boot/loader.conf.local it should get copied across an update automatically.

                                            Steve

                                            I just tested and can confirm.  Good info… Thanks!
                                            This should make it into the next guide book.  I didn't see any mention of it in the 1.2.3 book.

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.