Netgate Discussion Forum

    Diagnosing latency spikes

    Hardware
    Tags: xl710, latency spike, ixl
    10 Posts 4 Posters 1.8k Views
    • Z
      zmiguel

      Hi! I've been running pfSense for a few months now, and I've been able to tune my system well enough that I'm no longer getting dropped packets.
      However, I still haven't been able to figure out why I get so many latency spikes.

      My specs are:

      • Intel XL710-BM1 (4x SFP+)
      • Intel Core i3-10100
      • 16GB Ram

      My WAN is connected with an SFP+ 10GbE adapter.

      I've made sure my network card firmware/NVM matches the driver in use by pfSense for the best possible compatibility.

      I've checked a lot of tuning guides and recommendations and ended up with the following changes to my loader.conf.local file:

      hw.intr_storm_threshold=0
      kern.ipc.nmbclusters="8388608"
      kern.ipc.nmbjumbop="1048576"
      kern.ipc.nmbjumbo9="1048576"
      kern.ipc.nmbjumbo16="1048576"
      net.isr.maxthreads="-1"
      net.isr.bindthreads="1"
      hw.ixl.num_queues="1"
      

      My system tunables are the following:

      kern.ipc.maxsockbuf             16777216	 
      net.inet.tcp.sendbuf_max        16777216	 
      net.inet.tcp.recvbuf_max        16777216	 
      net.inet.tcp.sendbuf_inc        262144	 
      net.inet.tcp.recvbuf_inc        262144	 
      net.route.netisr_maxqlen        2048	 
      net.inet.ip.intr_queue_maxlen   2048	 
      net.core.rmem_default           8388608	 
      net.core.rmem_max               16777216	 
      net.core.wmem_default           8388608	 
      net.core.wmem_max               16777216	 
      dev.ixl.0.fw_lldp		0	 
      dev.ixl.1.fw_lldp		0	 
      dev.ixl.2.fw_lldp		0	 
      dev.ixl.3.fw_lldp		0	 
      net.inet.udp.maxdgram           131072	 
      net.inet.udp.recvspace          131072	 
      net.core.netdev_max_backlog     262144
      

      With these changes I no longer have dropped UDP packets, but I still experience latency spikes very often.
      I've ruled out bufferbloat, as the latency spikes happen even when there's little to no network utilization.

      When a spike happens, latency goes from 1-3ms to ~300ms, with some instances where it goes beyond 1 second. This averages out to ~85ms latency on a ping test or similar.
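To put the ~85ms average in perspective, here is a rough back-of-the-envelope sketch. It assumes a simplified two-level model that is not in the original measurements: every ping is either at a 2ms baseline (the middle of the 1-3ms range) or at a 300ms spike, and the occasional 1 second outliers are ignored. The function name `spike_fraction` is made up for illustration.

```python
def spike_fraction(mean_ms, baseline_ms, spike_ms):
    """Share of samples that must be spikes for a two-level mix to average mean_ms."""
    if spike_ms <= baseline_ms:
        raise ValueError("spike_ms must be greater than baseline_ms")
    if not baseline_ms <= mean_ms <= spike_ms:
        raise ValueError("mean_ms must lie between baseline_ms and spike_ms")
    # mean = (1 - p) * baseline + p * spike, solved for p
    return (mean_ms - baseline_ms) / (spike_ms - baseline_ms)


# Assumed values: 2 ms baseline, 300 ms spikes, 85 ms observed mean.
share = spike_fraction(85.0, 2.0, 300.0)
print(f"{share:.1%} of pings would have to be spikes")
```

Under those assumptions, more than a quarter of all pings would have to be spikes, so this is a frequent event rather than a rare one.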

      I'm not sure what else I should be looking for to diagnose the issue. I know the ixl drivers are known to cause all sorts of problems on FreeBSD, and if I were choosing now I would pick a different NIC. But I'm trying to get it to work as well as possible.

      Note: in all the tests I've done, this only happens on the WAN side; pinging any device on my LAN never shows latency spikes.

      Any and all comments and recommendations are very much appreciated.

      • stephenw10S
        stephenw10 Netgate Administrator

        What pfSense version are you running?

        If you're running 2.6, have you applied the patch for this issue: https://redmine.pfsense.org/issues/12827

        Steve

        • Z
          zmiguel @stephenw10

          @stephenw10
          I'm running 22.01. I've applied the patch, and the results are the same with or without it.
          I think the issue is related to hardware tuning. Unfortunately I don't have any other NICs I can test with.

          • johnpozJ
            johnpoz LAYER 8 Global Moderator @zmiguel

            @zmiguel said in Diagnosing latency spikes:

            When a spike happens, latency goes from 1-3ms to ~300ms, with some instances where it goes beyond 1 second. This averages out to ~85ms latency on a ping test or similar.

            To what exactly? Another device on your own network.. or an ICMP response from a device outside of your control, over a network outside of your control? Why would you think that latency is because of your device?

            If you sniff the traffic leaving your device, and the requests leave, say, every 1 second from a typical constant ping, and the responses take longer to return than you expect.. how is it you think that is a problem with your device?

            Are you saying the responses are returned to you and your device (pfSense) is delayed in processing them?

            Now if you had device A -- pfsense -- device B, all on your own network, and pfSense was not processing the traffic correctly (causing a delay), that would be one thing.

            But if you're pinging out your WAN, to your ISP's device or some IP out on the internet.. I do not follow your logic that it's pfSense that is causing the problem.

            Is whatever you're testing to on your WAN side actually still on your network?

            An intelligent man is sometimes forced to be drunk to spend time with his fools
            If you get confused: Listen to the Music Play
            Please don't Chat/PM me for help, unless mod related
            SG-4860 24.11 | Lab VMs 2.8, 24.11

            • Z
              zmiguel @johnpoz

              @johnpoz said in Diagnosing latency spikes:

              To what exactly? Another device on your own network..

              To any device on the WAN side of my network.

              Why would you think that latency is because of your device?

              Because it wasn't here when I was using my previous router (EdgeRouter 12), nor is it when connected directly to my modem.

              Are you saying the responses are returned to you and your device (pfSense) is delayed in processing them?

              I don't know, but I would assume so.

              Now if you had device A -- pfsense -- device B, all on your own network, and pfSense was not processing the traffic correctly (causing a delay), that would be one thing.

              I can try setting this up and see how it does.

              Is whatever you're testing to on your WAN side actually still on your network?

              I've tested straight from pfSense to 1.1.1.1, 8.8.8.8 and a few other remote servers that I own. I've also tested the same targets from a device on my LAN side.

              I can try having a device I control directly on the WAN side and test it like you mentioned above.

              • stephenw10S
                stephenw10 Netgate Administrator @zmiguel

                @zmiguel said in Diagnosing latency spikes:

                I can try having a device I control directly on the WAN side and test it

                Yes, that would be a far better test. It would pretty much confirm the issue.

                You could also try running pcaps on WAN and LAN and looking at the actual latency between query and reply. Of course you are already through the NIC hardware and driver at that point, but if it was something in the packet processing it would show as a clear difference between WAN and LAN.
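As a rough illustration of that pcap comparison, the sketch below pairs ICMP echo requests with their replies and reports the round-trip time seen on one interface. It assumes the capture has already been reduced to `(timestamp_seconds, kind, sequence_number)` tuples by some other tool; that record format, the sample timestamps, and the names `echo_rtts` and `spikes` are all made up for the example.

```python
def echo_rtts(records):
    """Pair each echo request with its reply and return {seq: rtt_seconds}."""
    sent, rtts = {}, {}
    for ts, kind, seq in sorted(records, key=lambda r: r[0]):
        if kind == "request":
            sent.setdefault(seq, ts)  # keep the first request per sequence number
        elif kind == "reply" and seq in sent and seq not in rtts:
            rtts[seq] = ts - sent[seq]
    return rtts


def spikes(rtts, threshold_s=0.1):
    """Return only the round trips slower than threshold_s (an arbitrary cutoff)."""
    return {seq: rtt for seq, rtt in rtts.items() if rtt > threshold_s}


# Hypothetical records from a capture on the WAN interface.
wan_capture = [
    (0.0000, "request", 1), (0.0025, "reply", 1),
    (1.0000, "request", 2), (1.3100, "reply", 2),
]

rtts = echo_rtts(wan_capture)
for seq, rtt in sorted(rtts.items()):
    print(f"seq={seq} rtt={rtt * 1000:.1f} ms")
print("slow sequence numbers:", sorted(spikes(rtts)))
```

Running the same function over a LAN capture of the same pings and comparing the two results per sequence number would show whether the extra delay appears only on the LAN side (packet processing) or is already visible on the WAN side.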

                Steve

                • johnpozJ
                  johnpoz LAYER 8 Global Moderator @zmiguel

                  @zmiguel said in Diagnosing latency spikes:

                  Because it wasn't here when I was using my previous router (EdgeRouter 12), nor is it when connected directly to my modem.

                  I see how those could lead you to your conclusion.. but I take it that when your device is directly connected to your modem, you get a different IP from your ISP. With the other router (the EdgeRouter) it is also common for this IP to be different.

                  Different IPs tend to point to different hardware for their next hop.. which could lead to changes in performance. Or the issue you're seeing could be sporadic..

                  The only real way to actually say it's device X causing the problem is when you control the complete path of the test, both the network gear and the end devices.. Testing out to the internet, or even to your ISP's device, has way too many variables involved to actually point the finger at your hardware.

                  Like saying it only rains when you wash your car ;) You can control when you wash your car, but concluding that it only rains when you do, when the rain is completely out of your control, is not a valid test.

                  It may very well be something wrong with pfSense - not saying it's 100% not that. But the testing parameters leave room for it to be something in your ISP or the internet, or an issue at the place you're testing to, causing the latency differences.. When trying to track down something like this, it is best to limit the variables to stuff you directly control and can monitor, so you can pinpoint the actual cause.

                  When you suspect it's X causing the problem, you need to remove as many other variables as possible and test whatever your issue is with just X.. If it is X, it would present itself with the other variables removed.

                  Like using a different soap brand when washing your car: this doesn't remove the actual real variable (the weather).. but it could lead you to believe hey, it only rains when I wash my car with Acme soap, not when I use SuperSuds brand.

                  edit: Yeah I know the car washing example is horrible and pretty obvious - but attempting to test latency or bandwidth that goes out to your ISP or anything over the internet is just completely out of your control.. like the weather ;)

                  • Z
                    zmiguel

                    To add some more information, here's an MTR run directly from pfSense to a VPS:

                    pfsense.local (37.x.x.x) -> 194.x.x.x                                                2022-05-05T14:34:46+0100
                                                                                         Packets               Pings
                        Host                                                           Loss%   Snt   Last   Avg  Best  Wrst StDev  
                    1.  AS???    10.208.128.1                                          94.5%  1000    2.4 138.3   1.0 1405. 352.1  
                    2.  AS8657   telepac16-hsi.cprm.net                                87.6%  1000    3.1 110.2   2.6 1873. 293.7  
                    3.  AS8657   tva-cr1-bu10-200.cprm.net                              0.9%  1000    4.6 136.0   3.6 2727. 324.7
                    4.  AS8657   lon1-cr1-be2.cprm.net                                  0.8%  1000   36.8 190.3  33.6 2627. 371.3
                    5.  AS1299   ldn-b7-link.ip.twelve99.net                            0.5%  1000   35.6 196.3  35.2 3310. 384.6
                    6.  AS1299   ldn-bb4-link.ip.twelve99.net                          33.7%  1000   35.9 230.6  35.3 3214. 404.2
                    7.  AS1299   adm-bb4-link.ip.twelve99.net                           0.7%  1000   39.8 198.2  38.7 3112. 373.2
                    8.  AS1299   ddf-b3-link.ip.twelve99.net                            1.0%  1000   49.9 197.4  42.2 3059. 375.0
                    9.  AS1299   contabo-svc072466-ic359931.ip.twelve99-cust.net        0.9%  1000   43.1 189.0  42.9 2958. 349.6
                    10. AS51167  x.contaboserver.net                                    0.6%  1000   43.7 191.0  43.0 2859. 352.1 
                    

                    And here's the same test with a desktop connected directly to the modem:

                    
                     Desktop (37.x.x.x) -> 194.x.x.x                                                      2022-05-05T14:51:17+0100
                                                                                          Packets               Pings
                        Host                                                            Loss%   Snt   Last   Avg  Best  Wrst StDev  
                    1.  AS???     10.208.128.1                                          87.6%  1000    2.2   4.1   1.1  53.3   5.7  
                    2.  (waiting for reply)
                    3.  AS8657    tva-cr1-bu10-200.cprm.net                              0.0%  1000    4.3   5.0   3.3  50.8   3.8
                    4.  AS8657    lon1-cr1-be2.cprm.net                                  0.0%  1000   34.3  34.1  33.5  81.0   2.9
                    5.  AS1299    ldn-b7-link.ip.twelve99.net                            0.0%  1000   35.7  36.3  35.2  84.1   4.6
                    6.  AS1299    ldn-bb4-link.ip.twelve99.net                           0.5%  1000   36.1  45.8  35.4 105.2  11.8
                    7.  AS1299    adm-bb4-link.ip.twelve99.net                           0.0%  1000   39.2  41.8  38.6  85.1   3.6
                    8.  AS1299    ddf-b3-link.ip.twelve99.net                            0.0%  1000   43.6  44.3  42.2 103.0   3.8
                    9.  AS1299    contabo-svc072466-ic359931.ip.twelve99-cust.net        0.0%  1000   43.8  44.2  42.8  97.0   3.3
                    10. AS51167   x.contaboserver.net                                    0.0%  1000   44.0  45.2  43.6  96.5   3.0
                    

                    @johnpoz said in Diagnosing latency spikes:

                    I see how those could lead you to your conclusion.. but I take it that when your device is directly connected to your modem, you get a different IP from your ISP. With the other router (the EdgeRouter) it is also common for this IP to be different.

                    They get an IP in the same subnet, and the hops are the same.

                    I understand that it's not ideal to test this against the internet. But as I only have issues when going out through my WAN, and the only variable I changed was my firewall/router, it leads me to believe there's something wrong with my hardware choices.

                    I'll run the tests over the weekend so as not to disrupt my network during the week.

                    • G
                      GeorgeCZ58

                      Hello zmiguel, how did you resolve the issue?

                      • johnpozJ
                        johnpoz LAYER 8 Global Moderator @GeorgeCZ58

                        @GeorgeCZ58 said in Diagnosing latency spikes:

                        how did you resolve the issue?

                        Maybe he changed ISP.. nowhere in his testing did he show pfSense had anything to do with this - the only way to show that pfSense is adding latency would be to sniff on the inbound and outbound interfaces..

                        Here is his test without pfSense..

                        1.  AS???     10.208.128.1                                          87.6%  1000    2.2   4.1   1.1  53.3   5.7  
                        

                        That is 87.6% loss to the first hop, but then the hops after that show zero (or close to it).. so that points to the device just not answering. Now if he showed 87.6% loss or higher on every hop after that, then he could pretty safely say there is an issue with connectivity.

                        If you run a traceroute and all of a sudden somewhere down the line you see loss, and that loss is there on every hop after that, then that points to actual loss. But loss at a specific hop and then zero or much lower loss after it points to the device with the high loss just not answering all of the pings, or not answering them in a timely manner, etc.
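That rule of thumb can be sketched in a few lines. This is only an illustration: the function name `classify_hops` and the 1% tolerance are arbitrary choices, and the loss figures are the Loss% column from the direct-to-modem MTR run quoted above, with hop 2 left out because it never replied.

```python
def classify_hops(loss_pct, tolerance=1.0):
    """Label each hop's loss as carried downstream or local to that hop."""
    labels = []
    for i, loss in enumerate(loss_pct):
        later = loss_pct[i + 1:]
        if loss <= tolerance:
            labels.append("ok")
        elif later and min(later) < loss - tolerance:
            # Some later hop sees clearly less loss, so packets are getting
            # through this hop; it is just not answering every probe.
            labels.append("hop not answering")
        else:
            # No later hop does better (or this is the last hop).
            labels.append("possible real loss")
    return labels


# Loss% per hop, direct-to-modem run (hops 1 and 3-10).
direct = [87.6, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
for hop_loss, label in zip(direct, classify_hops(direct)):
    print(f"{hop_loss:5.1f}%  {label}")
```

Only the first hop is flagged, and as a hop that is not answering rather than as real loss, which matches the reading given above.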

                        Since pfSense is not listed as a hop in his first trace, it would seem he is tracing from pfSense directly - so pfSense isn't even NATing or routing the traffic.. But somehow it still adds latency to the return of something it sends out?? So what, it got the answer but didn't actually process its return for X ms?

                        There was a recent thread where a user thought pfSense was adding latency, and it shows how to test..

                        There, sniffing on WAN and LAN at the same time, from the time traffic hit WAN to the time pfSense sent it out LAN, it added a whole 0.000114 seconds.

                        https://forum.netgate.com/post/1112354
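A minimal sketch of that two-interface comparison follows. It assumes each capture has been reduced to a mapping from a packet key (here just a sequence number) to the capture timestamp in seconds, and that the same key identifies the same packet on both sides. The timestamps are hypothetical, chosen so the first packet reproduces the 0.000114 second figure, and `forwarding_delays` is an illustrative name.

```python
def forwarding_delays(wan_in, lan_out):
    """Time between a packet arriving on WAN and leaving on LAN, per packet key."""
    common = wan_in.keys() & lan_out.keys()
    return {key: lan_out[key] - wan_in[key] for key in common}


# Hypothetical capture timestamps in seconds, keyed by sequence number.
wan_in = {1: 10.000000, 2: 11.000000}
lan_out = {1: 10.000114, 2: 11.000120}

for key, delay in sorted(forwarding_delays(wan_in, lan_out).items()):
    print(f"packet {key}: {delay * 1e6:.0f} microseconds inside the firewall")
```

If the firewall itself were responsible for spikes of hundreds of milliseconds, they would show up in these per-packet differences.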

                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.