Netgate Discussion Forum

    Buffer errors, packet loss, latency

    2.0-RC Snapshot Feedback and Problems - RETIRED
    21 Posts 6 Posters 7.4k Views
    • clarknova

      2.0-RC1 (amd64)
      built on Sun Feb 13 23:53:14 EST 2011

      Lately I'm seeing daily spikes in latency on my RRD quality graphs. Healthy rtt would be 40-50 ms, but I'm getting ~30% of the samples over 100 ms. Normally I fix this by lowering the bandwidth of my HFSC parent queue, but that doesn't appear to help at all, suggesting it's not a congestion issue this time.
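      (For reference, a quick way to confirm whether the shaper itself is congesting is to look at the per-queue ALTQ counters; a minimal sketch, and the queue names in the output will depend on the shaper configuration:)

      # one-shot ALTQ statistics: packets, bytes, drops and queue length per queue
      pfctl -vsq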

      When I manually ping my WAN gateway, I get 2.2% packet loss, high jitter, and occasionally this:

      
      ping: sendto: No buffer space available
      
      

      Some search results from 2007 suggest it might be a states table exhaustion problem, but the Dashboard is currently reporting 57061/389000 states and 9% memory usage, so it doesn't seem likely.
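      (Those dashboard figures can also be cross-checked from a shell; both commands below are read-only:)

      # state table usage as pf reports it
      pfctl -si | grep -A 3 "State Table"
      # mbuf and mbuf cluster usage versus the kernel maximum
      netstat -m | grep mbuf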

      Could this just be a symptom of a DSL problem? It wouldn't be out of the question, but I don't see any other evidence of that just yet.

      Any ideas?

      db

      • wallabybob

        I've seen a number of reports of what seems like a memory (mbuf) leak.  What are the mbuf stats?

        • clarknova

          4062/4865 currently, but always growing (since August). I've seen them get up to 25,000+ before the whole thing tanks (I don't know if it was a freeze, a panic, or something else at the time). I have a cron job to record the output of netstat -m to a file hourly, if you think that might be informative.
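          (A sketch of such a cron entry, assuming an hourly schedule and an arbitrary log path; an entry placed in /etc/crontab also needs a user column, e.g. root:)

          # append a timestamped netstat -m snapshot to a log once an hour
          0 * * * * (date; /usr/bin/netstat -m) >> /var/log/netstat-m.log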

          db

          • eri--

            Yes please, and also show which services you are running on the box and the type of NICs.

            • clarknova

              Intel 82574L Gigabit Ethernet x2 (em)
              Unbound and Cron packages installed
              Status: Services shows:

              cron - The cron utility is used to manage commands on a schedule.
              dhcpd - DHCP Service
              ntpd - NTP clock sync
              openvpn - OpenVPN client
              unbound - Unbound is a validating, recursive, and caching DNS resolver.

              all running. Does that answer your question about which services I'm running?

              I cleared the attached log when I updated to the RC 1 day 21 hours ago.

              netstat-m.log.txt

              db

              • clarknova

                I'm now thinking this doesn't have anything to do with uptime. Although a reboot appeared to fix the latency problems in the past, I think that may have been coincidence, because I normally do my updates at off-peak times.

                Could it be something in the way the traffic shaper works? That's what it looks like to me, as if high-priority traffic isn't getting prioritized at all.

                The first screenshot shows the Quality graph. When things are working well it stays in the 40-60 ms range. The second graph shows throughput over the same period. Note that quality is poor even when the upload rate is below its maximum.

                You can see in the throughput graph that I lowered the WAN parent queue from 4000 kbit to 2500 kbit in an attempt to restore quality, but it appears to have had no such effect.
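                (For context, the parent queue is the root of the HFSC hierarchy that pfSense generates from the shaper settings; the generated ALTQ rules look roughly like the sketch below, with the interface and queue names here being placeholders only:)

                # hypothetical HFSC hierarchy with the parent capped at 2500 kbit/s
                altq on em1 hfsc bandwidth 2500Kb queue { qACK, qVoIP, qDefault, qBulk }
                queue qACK     bandwidth 20% hfsc ( linkshare 20% )
                queue qVoIP    bandwidth 20% hfsc ( realtime 512Kb )
                queue qDefault bandwidth 40% hfsc ( default )
                queue qBulk    bandwidth 20% hfsc ( upperlimit 80% )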

                latency.png
                throughput.png

                db

                • Kevin

                  Sounds like about the same issue I am having. We should probably merge our threads. My box is also using the em driver; I think that is where the problem lies. I have zero packages and no traffic shaping. It is running mostly defaults.

                  • clarknova

                    @Kevin:

                    Sounds like about the same issue I am having. We should probably merge our threads. My box is also using the em driver; I think that is where the problem lies. I have zero packages and no traffic shaping. It is running mostly defaults.

                    I just looked at your thread and I have had very similar symptoms, reported here: http://forum.pfsense.org/index.php/topic,32897.0.html

                    I suspect my issues are all related, and they appear to be related to yours too. I'm using the SM X7SPA-H, incidentally.

                    On further examination of my packet loss symptoms, it appears that the packet loss and latency problems show up when my uplink is running at max. That would be normal with no QoS in place, but regardless of where I limit my WAN parent queue, as soon as packets start to drop, they drop from queues that shouldn't be full or dropping, such as icmp and voip, which never run up to their full allocation.
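                    (One way to verify which queues are actually shedding packets is to snapshot the ALTQ drop counters, wait a minute under load, and diff; the temp file names are arbitrary:)

                    pfctl -vsq > /tmp/q1; sleep 60; pfctl -vsq > /tmp/q2
                    # changed lines include the per-queue "dropped pkts" counters
                    diff /tmp/q1 /tmp/q2 | grep dropped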

                    db

                    • Kevin

                      Mine are X7SPE-HF boards using dual Intel 82574L NICs. Right now I only have 2 VoIP phones connected, registered to an external server. They lose registration after only a few minutes, with no traffic at all.

                      • clarknova

                        I just realized my traffic shaper is failing to classify a lot of the traffic that it used to, so bulk traffic is squeezing out interactive traffic; it all lands in the default queue for reasons unknown to me.
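                        (Classification happens on the rules that name a queue; anything that matches no such rule falls into whichever queue is flagged as default. A generic pf sketch with example ports, interface and queue names; the rules pfSense actually generated can be inspected in /tmp/rules.debug:)

                        # examples only: tag SIP into a voice queue, web into the default queue
                        pass out on em1 proto udp from any to any port 5060 queue (qVoIP)
                        pass out on em1 proto tcp from any to any port 80 queue (qDefault, qACK)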

                        db

                        • Kevin

                          Still having the same issue on the latest RC1, March 2.

                          Is there any information I can send to try and resolve this?

                          It passes traffic for a few minutes, then quits. Best I can tell, only the NIC stops.

                          • sullrich

                            We think there is an mbuf leak in the Intel NIC code. We used the Yandex drivers in 1.2.3 and now I am starting to wish we had done the same for RC1. It looks like we will be importing the Yandex drivers soon, so please keep an eye on the snapshot server. Hopefully by tomorrow.

                            • sullrich

                              @Kevin:  Since you can reproduce this so quickly please email me at sullrich@gmail.com and I will work with you as soon as we have a new version available.

                              • clarknova

                                @sullrich:

                                We think there is an mbuf leak in the Intel NIC code. We used the Yandex drivers in 1.2.3 and now I am starting to wish we had done the same for RC1. It looks like we will be importing the Yandex drivers soon, so please keep an eye on the snapshot server. Hopefully by tomorrow.

                                Great news. Thanks for the update. Please let me know if I can do any testing or provide any info to help.

                                db

                                • clarknova

                                  2.0-RC1 (amd64)
                                  built on Thu Mar 3 19:27:51 EST 2011

                                  Although I am thoroughly pleased with the new Yandex driver, I'm still seeing what looks like an mbuf leak. This is from a file that records 'netstat -m' every 4 hours, since my last upgrade.

                                  [2.0-RC1]:grep "mbuf clusters" netstat-m.log
                                  8309/657/8966/131072 mbuf clusters in use (current/cache/total/max)
                                  8389/577/8966/131072 mbuf clusters in use (current/cache/total/max)
                                  8484/610/9094/131072 mbuf clusters in use (current/cache/total/max)
                                  8630/720/9350/131072 mbuf clusters in use (current/cache/total/max)
                                  8815/663/9478/131072 mbuf clusters in use (current/cache/total/max)
                                  8958/744/9702/131072 mbuf clusters in use (current/cache/total/max)
                                  9055/775/9830/131072 mbuf clusters in use (current/cache/total/max)
                                  9086/744/9830/131072 mbuf clusters in use (current/cache/total/max)
                                  9192/766/9958/131072 mbuf clusters in use (current/cache/total/max)
                                  9331/771/10102/131072 mbuf clusters in use (current/cache/total/max)
                                  9627/731/10358/131072 mbuf clusters in use (current/cache/total/max)
                                  9873/757/10630/131072 mbuf clusters in use (current/cache/total/max)
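                                   (A quick way to see the trend in that log is to pull out the "total" column, the third "/"-separated field:)

                                   grep "mbuf clusters" netstat-m.log | awk -F/ '{ print $3 }'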
                                  
                                  

                                  db

                                  • Kevin

                                    I am still seeing the same issues as of the March 15 snapshot.  Traffic stops passing after a short time.  I will upgrade again tomorrow.

                                    Any more insight or ideas on what is happening? The box is still connected via serial port to a PC that Ermal has remote access to.

                                    • mdima

                                      Hello,
                                      I am using Intel NICs on the x64 RC1 snapshots (2 x Intel PRO/1000 MT Dual Port Server Adapter + 1 Intel PRO/1000 on the motherboard) and I can confirm that my mbuf count keeps growing. For example, right now the dashboard reads:
                                      mbuf usage: 5153/6657

                                      even though the firewall has only 76 active states, and the number/max grows every hour. That said, I don't have any problems related to this; traffic passes with no problems.

                                      • clarknova

                                        @mdima:

                                        That said, I don't have any problems related to this; traffic passes with no problems.

                                        The problem comes when the mbufs in use (the numbers you see) run into the max (which you don't see, unless you run 'netstat -m' from the shell), at which point everything stops rather precipitously.
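                                        (Both numbers can be read directly from a shell, as sketched below: the first command prints the cluster line with its max, the second the kernel ceiling behind it:)

                                        netstat -m | grep "mbuf clusters"
                                        sysctl kern.ipc.nmbclusters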

                                        db

                                        • mdima

                                          @clarknova:

                                          @mdima:

                                          That said, I don't have any problems related to this; traffic passes with no problems.

                                          The problem comes when the mbufs in use (the numbers you see) run into the max (which you don't see, unless you run 'netstat -m' from the shell), at which point everything stops rather precipitously.

                                          I know; I don't have problems at all. I just see that the mbuf count is growing constantly and that its values are very high compared with another firewall I am using (x86 RC1 with 3Com NICs, which sits at around 200-300 mbufs, with a max of 1200).

                                          On my x64 RC1 my netstat-m reports:

                                          5147/1510/6657 mbufs in use (current/cache/total)
                                          5124/1270/6394/25600 mbuf clusters in use (current/cache/total/max)

                                          I don't know if that's normal or not; I can just confirm that with Intel NICs on x64 these values seem to grow constantly.

                                          • clarknova

                                            @mdima:

                                            5124/1270/6394/25600 mbuf clusters in use (current/cache/total/max)

                                            I think when that 6394 hits 25600 you will see a panic. You can bump up that 25600 value by putting

                                            
                                            kern.ipc.nmbclusters="131072"
                                            
                                            

                                            (or the value of your choice) into /boot/loader.conf.local and reboot. This uses more RAM, but not a lot more, and buys you time between reboots.
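                                            (A sketch of the whole procedure; the file is created if it doesn't already exist:)

                                            # append the tunable, then reboot for it to take effect
                                            echo 'kern.ipc.nmbclusters="131072"' >> /boot/loader.conf.local
                                            # after rebooting, confirm the new ceiling
                                            sysctl kern.ipc.nmbclusters
                                            netstat -m | grep "mbuf clusters"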

                                            db
