Netgate Discussion Forum

    25.03 beta - Bufferbloat / FQ CoDel issues

    Scheduled Pinned Locked Moved Development
    26 Posts 4 Posters 749 Views
    • RobbieTTR
      RobbieTT
      last edited by RobbieTT

      I'm seeing some weird issues with buffer bloat symptoms on 25.03 beta.

      It took me a while to realise that the hiccups I was seeing on VTCs / VoIP etc. were possibly bufferbloat. Firstly, my fq-codel settings have worked well for years and secondly I still get an A+ on waveform.com (no idea why dslreports seemed to die a slow death).

      My issue only appears when I have a high number of download flows, from around 10 up to 16. The online testers seem to be capped at 4 or so, so they do not trigger my performance drop at high bandwidth/high flows.

      Thankfully I have macOS systems so can test using the excellent native tools Apple produced with the IETF. With this tool I can generate the real-world conditions where things get weird:

      rob@Smaug ~ % networkQuality -v -I en6
      ==== Verbose Results ====
      ---
      Capacity:
      ---
         Uplink capacity: 91.026 Mbps
            Accuracy: High
            Uplink bytes transferred: 208.000 MB
            Uplink Flow count: 16
         Downlink capacity: 829.781 Mbps
            Accuracy: High
            Downlink bytes transferred: 1.905 GB
            Downlink Flow count: 16
      ---
      Latency:
      ---
         Idle Latency:
            4126 RPM (14.542 milliseconds)
               Transport: 7384 RPM (8.125 milliseconds)
               Security: 2232 RPM (26.875 milliseconds)
               HTTP: 6956 RPM (8.625 milliseconds)
            Accuracy: High
         Responsiveness: Medium
            483 RPM (124.051 milliseconds)
               Transport: 1311 RPM (45.766 milliseconds)
               Security: 662 RPM (90.553 milliseconds)
               HTTP: 1268 RPM (47.309 milliseconds)
               HTTP loaded: 304 RPM (197.165 milliseconds)
            Accuracy: High
      ---
      Protocols Used:
      ---
          HTTP/2: 100%
      ---
      Transport-layer info:
      ---
          ECN Disabled: 100%, L4S Disabled: 100%
      ---
      Other Info:
      ---
         Test Endpoint: uklon5-edge-bx-007.aaplimg.com
         Interface: en6
         Start: 2025-05-09 13:02:25.624
         End: 2025-05-09 13:02:44.635
         OS Version: Version 15.4.1 (Build 24E263)
      
      ==== SUMMARY ====
      Uplink capacity: 91.026 Mbps
      Downlink capacity: 829.781 Mbps
      Responsiveness: Medium (124.051 milliseconds | 483 RPM)
      Idle Latency: 14.542 milliseconds | 4126 RPM
      rob@Smaug ~ % 
      

      Responsiveness of 'Medium' @ 124.051 ms is not terrible but on previous builds it was always 'High'. If I really tighten my FQ-CoDel bandwidth down by another 100 Mbps (I have a 1 GbE download service over PPPoE, with 110 Mbps up) then I can resolve the issues I am seeing and the Apple tool goes back to 'High' again, but I am sacrificing a lot of download bandwidth and dropping a lot of packets to do so.
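
For anyone reading the output above: the RPM (round-trips per minute) figure and the millisecond figure are two views of the same measurement. Below is a minimal sketch of the conversion, assuming RPM is 60,000 divided by the round-trip time in milliseconds and then truncated to a whole number. The truncation is inferred from the numbers in this output, not taken from Apple's documentation, and the function names are made up for illustration.

```python
def ms_to_rpm(latency_ms: float) -> int:
    """Convert a round-trip time in milliseconds to round-trips per minute."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    # 60,000 ms per minute. Truncating to an integer is an assumption
    # inferred from the figures in the thread, not a documented rule.
    return int(60_000 / latency_ms)


def rpm_to_ms(rpm: float) -> float:
    """Convert round-trips per minute back to milliseconds per round trip."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60_000 / rpm


if __name__ == "__main__":
    # Loaded-latency figures copied from the verbose output above.
    for label, ms in [
        ("Responsiveness", 124.051),
        ("Transport", 45.766),
        ("HTTP loaded", 197.165),
    ]:
        print(f"{label}: {ms} ms -> {ms_to_rpm(ms)} RPM")
```

Lower milliseconds means higher RPM, which is why the loaded figure (483 RPM) is so far below the idle figure (4126 RPM).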

      My first thought was that the new PPPoE backend, and its spread of flows over multiple cores, may be part of it (albeit improving CPU utilisation remarkably along the way).

      I have 8 physical cores and run with hyper-threading disabled and have a fast CPU. However, when I revert to the older PPPoE backend I just get a different set of symptoms when working a single core that hard, so it is a poor oranges to apples comparison.

      Perhaps the hard roll-off of PPPoE performance happens when flows exceed the physical number of cores?
      Perhaps the new if_pppoe is poorly optimised for high numbers of flows at maximum rate?
      Perhaps there is a wider difference with v25.03, outside of the PPPoE changes?

      Reading more widely, I can see Dave Taht has written this year about fq-codel being poorly coded / implemented on FreeBSD (he uses terms like 'broken' but he is always passionate about his fq-codel work). Perhaps this is a FreeBSD issue or a latent issue unmasked by if_pppoe?

      Do others see this issue and does anyone know a way around it by different limiter/scheduler/fq-codel configurations?

      ☕️

      • w0wW
        w0w @RobbieTT
        last edited by

        @RobbieTT
        https://www.waveform.com/tools/bufferbloat
        And what does it show here?

        • RobbieTTR
          RobbieTT @w0w
          last edited by

          @w0w

          As mentioned, it still gives me an A+ but the score does not reflect the issues now seen at higher flows:

           2025-05-09 at 16.58.05.png

          It's one of the aspects that confused me until I worked out the limitations of this site (at least using it from here in the UK).

          ☕️

          • w0wW
            w0w @RobbieTT
            last edited by

            @RobbieTT
            Hmm, interesting, really.
            Have you tested it on 24.11 already? I mean this Apple network quality tool.

            • RobbieTTR
              RobbieTT @w0w
              last edited by RobbieTT

              @w0w

              Not that recently, but all was OK back then, so I didn't appreciate the differing flow-generation capabilities between it and the online tools, as they all gave similar results at the time. I guess you don't look that hard when all is well.

              The Apple / IETF tool came with macOS Monterey, so it's been around for a few years now. I was still rocking an EdgeRouter back then and it did a pretty good job with pppoe and fq_codel, so not much to see.

              Looking into my current issue in a bit more detail I can see that it is only real-world noticeable when there is heavy traffic & flows in both directions (ie simultaneously). Running tests sequentially shows that upload is more impacted than download.

              Running pure download I get full bandwidth, low latency and good responsiveness scores. That gives me something to focus on tomorrow. Of course, simultaneous tests are not really reflected in the online buffer bloat tests. Another reason why my real-world performance is bad and yet I get a reassuring A+ on waveform.com.

              Wish I had more bandwidth to throw around or at least a symmetrical service...

              ☕️

              • w0wW
                w0w @RobbieTT
                last edited by w0w

                @RobbieTT
                I see something similar only on a wireless connection, but it's always been like that. I just tested fast.com with 16 streams, and the jitter didn’t exceed 7 ms on the wired connection. This was without any limiters applied — I’ll test it later with limiters as well.

                But I think that for my 1 Gbps symmetrical connection, even 16 or 30 streams may not be enough to fully saturate it. It probably requires something like 160 streams, and I don’t see any way to achieve that — I don’t have any Apple devices anyway.

                Edit:
                This is what I see with fast.com 30 connections. Drops are only on upload pipe.
                f87ddb02-5373-4a97-b402-f3a6eab843af-image.png

                • RobbieTTR
                  RobbieTT @w0w
                  last edited by RobbieTT

                  @w0w
                  Similar results on fast.com for me, with my normal fq_codel settings. There is a drop in throughput between 8 and 16 streams though. Not that I find fast.com to be particularly trustworthy as it sometimes reports throughput well beyond my max bandwidth:

                  16 streams:

                   2025-05-10 at 12.52.08.png

                  8 streams:

                   2025-05-10 at 12.53.58.png

                  I think the main issue I have is only apparent whilst at (or near) being fully loaded in both directions; fast.com only tests sequentially rather than simultaneously, so it isn't enough of a trigger. My bandwidth is quite asymmetric but it is all I can get.

                  The old pppoe backend seems to cope better when tapping on the upload and download limits at the same time, although it took a fast CPU to cope with the load on a single core; my Netgate 6100 would struggle with this but it was pretty easy for my Xeon system.

                  Perhaps if_pppoe has an issue that only manifests on simultaneous loads as it shares the workload across multiple cores, or perhaps the fq_codel implementation is now running into issues with pppoe on multiple cores/flows/directions?

                  ☕️

                  • w0wW
                    w0w @RobbieTT
                    last edited by

                    @RobbieTT
                    Your fast.com settings are just too weak. Here's how I use it:
                    684ec5a2-c506-4ec3-841c-54b4856e9337-image.png
                    But of course, I admit that it's much easier to run into bufferbloat issues on a 100 Mbps connection. I also assume that it’s enough to overload a 100 Mbps upstream channel for bufferbloat to become noticeable.
                    By the way, what are your shaper settings? What does Diagnostics – Limiter Info show?
                    And what about the power-saving settings, by the way? They were changed for newer hardware in version 23.05, weren't they?

                    • RobbieTTR
                      RobbieTT @w0w
                      last edited by RobbieTT

                      @w0w

                      Working fast.com harder doesn't really change my results, presumably because the download and upload sessions are sequential:

                       2025-05-10 at 16.44.56.png

                      During the fast.com run above, my limiters looked like this for download:

                       2025-05-10 at 16.42.54.png

                      And for upload:

                       2025-05-10 at 16.42.54.png

                      Going through the data I think tweaking the upload bandwidth down on my fq_codel settings may help for simultaneous upload+download sessions. I can only refine that on the Apple / IETF tool though.

                      Yes, the power saving was changed in 23.x and 24.x. 25.03 also had an Intel microcode change but I have not looked into the details. Either way, the sleep settings are not a factor and the CPU isn't working that hard throughout the tests. I could be hitting a NIC limitation but both of the relevant NICs are reasonably competent and should have margin to spare.

                      ☕️

                      • w0wW
                        w0w @RobbieTT
                        last edited by

                        @RobbieTT
                        Yeah, interesting...
                        If possible, I’d repeat the tests on version 24.11 — do you still have an old boot environment? Just in case the issue turns out to be caused by some changes on the provider’s side.

                        • RobbieTTR
                          RobbieTT @w0w
                          last edited by RobbieTT

                          @w0w
                          Ok, switched back to 24.11 and ran the Apple tool again:

                          rob@Smaug ~ % networkQuality             
                          ==== SUMMARY ====
                          Uplink capacity: 90.237 Mbps
                          Downlink capacity: 805.436 Mbps
                          Responsiveness: High (33.661 milliseconds | 1782 RPM)
                          Idle Latency: 12.625 milliseconds | 4752 RPM
                          rob@Smaug ~ % 
                          

                          Responsiveness score returns back to 'High' again.
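
To put the two runs side by side, here is a small sketch that parses the SUMMARY text printed by the tool and reports how much latency is added under load. The two summaries are copied from this thread; the parser, its field names and the "added latency" figure (loaded minus idle) are illustrative assumptions, not part of the tool itself.

```python
import re

# Summaries copied verbatim from the two runs in this thread.
SUMMARY_25_03 = """\
Uplink capacity: 91.026 Mbps
Downlink capacity: 829.781 Mbps
Responsiveness: Medium (124.051 milliseconds | 483 RPM)
Idle Latency: 14.542 milliseconds | 4126 RPM
"""

SUMMARY_24_11 = """\
Uplink capacity: 90.237 Mbps
Downlink capacity: 805.436 Mbps
Responsiveness: High (33.661 milliseconds | 1782 RPM)
Idle Latency: 12.625 milliseconds | 4752 RPM
"""

# Hypothetical field names mapped to patterns matching the summary lines.
PATTERNS = {
    "uplink_mbps": r"Uplink capacity:\s*([\d.]+) Mbps",
    "downlink_mbps": r"Downlink capacity:\s*([\d.]+) Mbps",
    "loaded_ms": r"Responsiveness:\s*\w+\s*\(([\d.]+) milliseconds",
    "idle_ms": r"Idle Latency:\s*([\d.]+) milliseconds",
}


def parse_summary(text: str) -> dict:
    """Extract the numeric fields from a networkQuality SUMMARY block."""
    result = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"missing field: {key}")
        result[key] = float(match.group(1))
    # Latency added by load: loaded round-trip time minus idle round-trip time.
    result["added_ms"] = result["loaded_ms"] - result["idle_ms"]
    return result


if __name__ == "__main__":
    for name, text in [("25.03", SUMMARY_25_03), ("24.11", SUMMARY_24_11)]:
        s = parse_summary(text)
        print(f"{name}: {s['downlink_mbps']} Mbps down, "
              f"{s['added_ms']:.3f} ms added under load")
```

Capacity is within a few percent between the two runs, so the difference that matters is the latency added under load.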

                          I find it perplexing that the older firmware with single-core PPPoE is, in this regard, working better than multiple cores with if_pppoe.

                          It was a valid idea to double check again though.

                          Edit: Scratch the above for now as I think I found a misplaced patch being applied when it should not have been. This may have polluted my real-world experience and the testing....

                          ☕️

                          • w0wW
                            w0w
                            last edited by w0w

                            I'm also starting to recall and analyze a bit what's going on with these traffic limiters. Interestingly, I'm seeing packet drops on the PPPoE upload even though I haven't set any real bandwidth limit; it's configured to the maximum. Under load (below 1 Gbit/s), the drops appear specifically on the upload, on PPPoE using the new backend. I haven't tested it yet on the old backend.

                            I did, however, test it on the second provider (which is behind triple NAT through ROOter using a 5G mobile network). Yes, I have Multi-WAN, but the second provider is only used for failover. Under the same test conditions as before I'm not seeing any drops at all on the second WAN, which is ~200/~50 Mbit/s (unless I just didn't notice them). The same limiters are in place and the bandwidth cap is still 1 Gbit/s, but logically it shouldn't be active in either case, right?
                            Edit: just tested using old PPPoE backend, same drops on the upload pipe.

                            • RobbieTTR
                              RobbieTT @w0w
                              last edited by

                              @w0w
                              Some of your fq_codel settings are really demanding though.

                              Typical latency variance over the internet is around ±1 ms or more (when unloaded) and the usual fq_codel setting is 5 ms, yet you have a setting of 1 µs. That's quite brutal I guess and probably more suited to use inside a data centre than over the net.
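
A rough way to see why 1 µs is so aggressive is to compare it with the time it takes just to put one packet on the wire. The sketch below assumes the 1 µs value is the CoDel target (the thread does not say which parameter it is), a 1500-byte packet and a 1 Gbit/s link; these are illustrative assumptions, and the helper names are made up.

```python
def serialization_delay_us(packet_bytes: int, rate_mbps: float) -> float:
    """Time to transmit one packet, in microseconds.

    bits / (Mbit/s) gives microseconds directly.
    """
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    return packet_bytes * 8 / rate_mbps


def packets_within_target(target_us: float, packet_bytes: int,
                          rate_mbps: float) -> float:
    """How many full-size packets can drain within the target delay."""
    return target_us / serialization_delay_us(packet_bytes, rate_mbps)


if __name__ == "__main__":
    PACKET_BYTES = 1500   # assumption: full-size Ethernet payload
    RATE_MBPS = 1000.0    # assumption: 1 Gbit/s link

    delay = serialization_delay_us(PACKET_BYTES, RATE_MBPS)
    print(f"one packet takes {delay:.1f} us on the wire")

    for label, target_us in [("5 ms", 5000.0), ("1 us", 1.0)]:
        n = packets_within_target(target_us, PACKET_BYTES, RATE_MBPS)
        print(f"target {label}: room for about {n:.2f} packets")
```

Under these assumptions a single full-size packet takes longer to transmit than the 1 µs setting allows, so any packet that waits behind even one other packet is already over the threshold. With 5 ms there is room for a few hundred packets before CoDel considers the queue too slow.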

                              My router crashed in the early hours for no explicable reason, so my testing today was borked. Outside of testing or configuration changes it's my first ever hard crash of pfSense.

                              ☕️

                              • w0wW
                                w0w @RobbieTT
                                last edited by

                                @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                Some of your fq_codel settings are really demanding though

                                Those are new default settings, I think. I have seen something on redmine regarding it, but... Ignored it 😁

                                @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                My router crashed in the early hours for no explicable reason, so my testing today was borked

                                It just happens sometimes, any crash dumps available?

                                • T
                                  tman222 @w0w
                                  last edited by

                                  @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                  @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                  Some of your fq_codel settings are really demanding though

                                  Those are new default settings, I think. I have seen something on redmine regarding it, but... Ignored it 😁

                                  @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                  My router crashed in the early hours for no explicable reason, so my testing today was borked

                                  It just happens sometimes, any crash dumps available?

                                  Hi @w0w - I'm curious about this too. Where did you see that there might be new defaults on FQ CoDel parameters? Unless I missed it and that particular traffic shaping algorithm was changed / improved, 1us seems way too low. Thanks in advance.

                                  • w0wW
                                    w0w @tman222
                                    last edited by w0w

                                    @tman222 said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                    Where did you see that there might be new defaults on FQ CoDel parameters?

                                    https://redmine.pfsense.org/issues/16037

                                    And this is what I see when I select an already created limiter — but you also don’t see any of those parameters when creating one...

                                    dec7c970-e1de-4e27-b1f5-7c0aeb280913-image.png
                                    And when you try to create the new one
                                    1c5b29fd-5adc-4b5c-89f6-e36fdff28a4c-image.png

                                    I don't really think those are new defaults, because all the fq-codel man pages I can find on the web reference the same 5ms value that @RobbieTT mentioned.

                                    • RobbieTTR
                                      RobbieTT @w0w
                                      last edited by

                                      @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                      It just happens sometimes, any crash dumps available?

                                      No crash log or anything of note in the usual logs. It just stopped doing its stuff.

                                      ☕️

                                      • RobbieTTR
                                        RobbieTT @w0w
                                        last edited by

                                        @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                        @tman222 said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                        Where did you see that there might be new defaults on FQ CoDel parameters?

                                        And this is what I see when I select an already created limiter — but you also don’t see any of those parameters when creating one...

                                        I don't really think those are new defaults, because all the fq-codel man pages I can find on the web reference the same 5ms value that @RobbieTT mentioned.

                                        The defaults can be messed up and show zero, according to the redmine. The pfSense manual still has the correct defaults listed.

                                        You do see the parameters when creating a new one, only that they do not appear until you set and save that page. If you look closely on your screenshot, below Scheduler: FQ_CODEL, you will see this note:

                                        Save this limiter to see algorithm parameters.

                                        Caution, coffee may be hot etc.

                                        It catches many of us out when we haven't set a new one in ages. It's a weird UI human factor fail thing and I have no idea why pfSense makes it so complicated compared to other routers.

                                        As Douglas Adams would have it "It's a black panel with a black button that lights-up black when you press it..."*


                                        *Hotblack's ship, when he was spending a year dead, for tax reasons.

                                        • w0wW
                                          w0w @RobbieTT
                                          last edited by

                                          @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                          Caution, coffee may be hot etc.

                                          It catches many of us out when we haven't set a new one in ages.

                                          Absolutely. Of course, that doesn’t change the fact that no one expects the default parameters to have values different from those stated in the documentation — or at the very least, everyone is used to trusting that those parameters actually exist and are being applied. I just didn’t check them myself, of course.

                                          • RobbieTTR
                                            RobbieTT @w0w
                                            last edited by

                                            @w0w
                                            No it doesn't and until your link to the redmine I had no idea it was a thing. It doesn't look like Netgate has addressed the issue, presumably because it is both intermittent and potentially unnoticed when new limiters are set.

                                            ☕️

                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.