Netgate Discussion Forum

25.03 beta - Bufferbloat / FQ CoDel issues

Development
26 Posts 4 Posters 1.8k Views
  • R
    RobbieTT
    last edited by RobbieTT May 9, 2025, 12:31 PM May 9, 2025, 12:28 PM

    I'm seeing some weird issues with buffer bloat symptoms on 25.03 beta.

    It took me a while to realise that the hiccups I was seeing when on VTCs / VOIP etc were possibly buffer bloat. Firstly, my fq-codel settings have worked well for years and secondly I still get an A+ on waveform.com (no idea why dslreports seemed to die a slow death).

    My issue is confined to runs with many download flows, from around 10 up to 16. The online testers seem to be capped at 4 or so, so they do not trigger my performance drop at high bandwidth and high flow counts.

    Thankfully I have macOS systems so can test using the excellent native tools Apple produced with the IETF. With this tool I can generate the real-world conditions where things get weird:

    rob@Smaug ~ % networkQuality -v -I en6
    ==== Verbose Results ====
    ---
    Capacity:
    ---
       Uplink capacity: 91.026 Mbps
          Accuracy: High
          Uplink bytes transferred: 208.000 MB
          Uplink Flow count: 16
       Downlink capacity: 829.781 Mbps
          Accuracy: High
          Downlink bytes transferred: 1.905 GB
          Downlink Flow count: 16
    ---
    Latency:
    ---
       Idle Latency:
          4126 RPM (14.542 milliseconds)
             Transport: 7384 RPM (8.125 milliseconds)
             Security: 2232 RPM (26.875 milliseconds)
             HTTP: 6956 RPM (8.625 milliseconds)
          Accuracy: High
       Responsiveness: Medium
          483 RPM (124.051 milliseconds)
             Transport: 1311 RPM (45.766 milliseconds)
             Security: 662 RPM (90.553 milliseconds)
             HTTP: 1268 RPM (47.309 milliseconds)
             HTTP loaded: 304 RPM (197.165 milliseconds)
          Accuracy: High
    ---
    Protocols Used:
    ---
        HTTP/2: 100%
    ---
    Transport-layer info:
    ---
        ECN Disabled: 100%, L4S Disabled: 100%
    ---
    Other Info:
    ---
       Test Endpoint: uklon5-edge-bx-007.aaplimg.com
       Interface: en6
       Start: 2025-05-09 13:02:25.624
       End: 2025-05-09 13:02:44.635
       OS Version: Version 15.4.1 (Build 24E263)
    
    ==== SUMMARY ====
    Uplink capacity: 91.026 Mbps
    Downlink capacity: 829.781 Mbps
    Responsiveness: Medium (124.051 milliseconds | 483 RPM)
    Idle Latency: 14.542 milliseconds | 4126 RPM
    rob@Smaug ~ % 
    

    A Responsiveness of 'Medium' at 124.051 ms is not terrible, but on previous builds it was always 'High'. If I tighten my FQ-CoDel bandwidth down by another 100 Mbps (I have a 1 GbE download service over PPPoE, with 110 Mbps up) then I can resolve the issues I am seeing and the Apple tool goes back to 'High' again, but I am sacrificing a lot of download bandwidth and dropping a lot of packets to do so.
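For readers unfamiliar with the RPM figures in the output above: RPM is round trips per minute, so it is simply the reciprocal of the latency scaled to one minute. A minimal Python sketch of that conversion follows; the function name is made up for illustration, and the values the tool prints can differ by one because the millisecond figures it displays are already rounded.

```python
def rpm_from_latency_ms(latency_ms: float) -> float:
    """Round trips per minute implied by one round trip taking latency_ms."""
    return 60_000.0 / latency_ms


# Latency figures copied from the networkQuality output in this post.
for label, ms in [("idle", 14.542), ("loaded", 124.051)]:
    print(f"{label}: {ms} ms is roughly {rpm_from_latency_ms(ms):.0f} RPM")
```

Read this way, the drop from 'High' to 'Medium' is a statement about how much the round-trip time grows once the link is loaded.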

    My first thought was that the new PPPoE backend and the spread of flows over multiple cores may be part of it (albeit improving CPU utilisation remarkably along the way).

    I have 8 physical cores and run with hyper-threading disabled and have a fast CPU. However, when I revert to the older PPPoE backend I just get a different set of symptoms when working a single core that hard, so it is a poor oranges to apples comparison.

    Perhaps the hard roll-off of PPPoE performance happens when flows exceed the physical number of cores?
    Perhaps the new if_pppoe is poorly optimised for high numbers of flows at maximum rate?
    Perhaps there is a wider difference with v25.03, outside of the PPPoE changes?
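To make the first hypothesis concrete, here is a purely illustrative Python sketch of hash-based flow distribution. It is not how if_pppoe or the NIC actually assigns work (that is not described in this thread); the CRC32 hash, the addresses and the port numbers are all assumptions chosen for the example. The only point is the pigeonhole effect: once there are more flows than cores, at least one core must carry several flows.

```python
import zlib
from collections import Counter


def spread_flows(flows, n_cores):
    """Map each flow tuple to a core index with a deterministic CRC32 hash."""
    counts = Counter({core: 0 for core in range(n_cores)})
    for flow in flows:
        key = "|".join(str(field) for field in flow).encode()
        counts[zlib.crc32(key) % n_cores] += 1
    return counts


# Hypothetical flows: one client, one server, 16 different source ports.
flows = [("192.0.2.10", 50000 + i, "198.51.100.20", 443) for i in range(16)]
per_core = spread_flows(flows, n_cores=8)

print(sorted(per_core.items()))
print("busiest core carries", max(per_core.values()), "flows")
```

With 16 flows and 8 cores the busiest core always carries at least two flows, and an uneven hash can load it more heavily than that.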

    Reading more widely, I can see Dave Taht has written this year about fq-codel being poorly coded / implemented on FreeBSD (he uses terms like 'broken' but he is always passionate about his fq-codel work). Perhaps this is a FreeBSD issue or a latent issue unmasked by if_pppoe?

    Do others see this issue and does anyone know a way around it by different limiter/scheduler/fq-codel configurations?

    ☕️

    • W
      w0w @RobbieTT
      last edited by May 9, 2025, 1:57 PM

      @RobbieTT
      https://www.waveform.com/tools/bufferbloat
      And what does it show here?

      • R
        RobbieTT @w0w
        last edited by May 9, 2025, 4:05 PM

        @w0w

        As mentioned, it still gives me an A+ but the score does not reflect the issues now seen at higher flows:

         2025-05-09 at 16.58.05.png

        It's one of the aspects that confused me until I worked out the limitations of this site (at least using it from here in the UK).

        ☕️

        • W
          w0w @RobbieTT
          last edited by May 9, 2025, 4:30 PM

          @RobbieTT
          Hmm, interesting, really.
          Have you tested it on 24.11 already? I mean this Apple network quality tool.

          • R
            RobbieTT @w0w
            last edited by RobbieTT May 9, 2025, 9:37 PM May 9, 2025, 8:57 PM

            @w0w

            Not that recently, but all was ok back then, so I didn't appreciate the differing flow-generation capabilities between it and the online tools, as they all gave similar results at the time. I guess you don't look that hard when all is well.

            The Apple / IETF tool came with macOS Monterey, so it's been around for a few years now. I was still rocking an EdgeRouter back then and it did a pretty good job with pppoe and fq_codel, so not much to see.

            Looking into my current issue in a bit more detail I can see that it is only real-world noticeable when there is heavy traffic & flows in both directions (ie simultaneously). Running tests sequentially shows that upload is more impacted than download.

            Running pure download I get full bandwidth, low latency and good responsiveness scores. That gives me something to focus on tomorrow. Of course, simultaneous tests are not really reflected in the online buffer bloat tests. Another reason why my real-world performance is bad and yet I get a reassuring A+ on waveform.com.

            Wish I had more bandwidth to throw around or at least a symmetrical service...

            ☕️

            • W
              w0w @RobbieTT
              last edited by w0w May 10, 2025, 5:29 AM May 10, 2025, 4:07 AM

              @RobbieTT
              I see something similar only on a wireless connection, but it's always been like that. I just tested fast.com with 16 streams, and the jitter didn’t exceed 7 ms on the wired connection. This was without any limiters applied — I’ll test it later with limiters as well.

              But I think that for my 1 Gbps symmetrical connection, even 16 or 30 streams may not be enough to fully saturate it. It probably requires something like 160 streams, and I don’t see any way to achieve that — I don’t have any Apple devices anyway.

              Edit:
              This is what I see with fast.com 30 connections. Drops are only on upload pipe.
              f87ddb02-5373-4a97-b402-f3a6eab843af-image.png

              • R
                RobbieTT @w0w
                last edited by RobbieTT May 10, 2025, 12:17 PM May 10, 2025, 12:15 PM

                @w0w
                Similar results on fast.com for me, with my normal fq_codel settings. There is a drop in throughput between 8 and 16 streams though. Not that I find fast.com to be particularly trustworthy as it sometimes reports throughput well beyond my max bandwidth:

                16 streams:

                 2025-05-10 at 12.52.08.png

                8 streams:

                 2025-05-10 at 12.53.58.png

                I think the main issue I have is only apparent whilst at (or near) being fully loaded in both directions; fast.com only tests sequentially rather than simultaneously, so it isn't enough of a trigger. My bandwidth is quite asymmetric but it is all I can get.

                The old pppoe backend seems to cope better when tapping on the upload and download limits at the same time, albeit it took a fast CPU to cope with the load on a single core; my Netgate 6100 would struggle with this but it was pretty easy for my Xeon system.

                Perhaps if_pppoe has an issue that only manifests on simultaneous loads as it shares the workload across multiple cores, or perhaps the fq_codel implementation is now running into issues with pppoe on multiple cores/flows/directions?

                ☕️

                • W
                  w0w @RobbieTT
                  last edited by May 10, 2025, 2:50 PM

                  @RobbieTT
                  Your fast.com settings are just too weak. Here's how I use it:
                  684ec5a2-c506-4ec3-841c-54b4856e9337-image.png
                  But of course, I admit that it's much easier to run into bufferbloat issues on a 100 Mbps connection. I also assume that it’s enough to overload a 100 Mbps upstream channel for bufferbloat to become noticeable.
                  By the way, what are your shaper settings? What does Diagnostics – Limiter Info show?
                  And what about the power-saving settings, by the way? They were changed for newer hardware in version 23.05, weren't they?

                  • R
                    RobbieTT @w0w
                    last edited by RobbieTT May 10, 2025, 4:05 PM May 10, 2025, 4:04 PM

                    @w0w

                    Working fast.com harder doesn't really change my results. Presumably because the download and upload sessions are sequential:

                     2025-05-10 at 16.44.56.png

                    Doing the fast.com run above my limiters looked like this for download:

                     2025-05-10 at 16.42.54.png

                    And for upload:

                     2025-05-10 at 16.42.54.png

                    Going through the data I think tweaking the upload bandwidth down on my fq_codel settings may help for simultaneous upload+download sessions. I can only refine that on the Apple / IETF tool though.
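One way to organise that tweaking is to generate a short list of candidate upload limits below the measured capacity and retest each one. The Python sketch below is only a helper for producing such a list; the margins (95% down to 80%) are arbitrary assumptions rather than recommended values, and the 91.026 Mbps figure is the uplink capacity reported by networkQuality earlier in the thread.

```python
def candidate_limits(measured_mbps, margins=(0.95, 0.90, 0.85, 0.80)):
    """Return (margin, limit) pairs, with each limit rounded to 0.1 Mbps."""
    return [(m, round(measured_mbps * m, 1)) for m in margins]


# Measured uplink capacity from the earlier networkQuality run.
for margin, limit in candidate_limits(91.026):
    print(f"{margin:.0%} of measured uplink: {limit} Mbps")
```

Each candidate would then be set as the upload limiter bandwidth and checked under simultaneous upload and download load.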

                    Yes, the power saving was changed in 23.x and 24.x. 25.03 also had an Intel microcode change but I have not looked into the details. Either way, the sleep settings are not a factor and the CPU isn't working that hard throughout the tests. I could be hitting a NIC limitation, but both of the relevant NICs are reasonably competent and should have margin to spare.

                    ☕️

                    • W
                      w0w @RobbieTT
                      last edited by May 10, 2025, 4:27 PM

                      @RobbieTT
                      Yeah, interesting...
                      If possible, I’d repeat the tests on version 24.11 — do you still have an old boot environment? Just in case the issue turns out to be caused by some changes on the provider’s side.

                      • R
                        RobbieTT @w0w
                        last edited by RobbieTT 30 days ago 30 days ago

                        @w0w
                        Ok, switched back to 24.11 and ran the Apple tool again:

                        rob@Smaug ~ % networkQuality             
                        ==== SUMMARY ====
                        Uplink capacity: 90.237 Mbps
                        Downlink capacity: 805.436 Mbps
                        Responsiveness: High (33.661 milliseconds | 1782 RPM)
                        Idle Latency: 12.625 milliseconds | 4752 RPM
                        rob@Smaug ~ % 
                        

                        The Responsiveness score returns to 'High' again.
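For anyone repeating this comparison across boot environments, a small Python sketch like the one below can pull the numbers out of the tool's SUMMARY block so runs can be compared side by side. The regular expressions are written against the summary text exactly as it appears in this thread and are an assumption about that format, not a documented interface; the two sample strings are copied from the 25.03 beta and 24.11 runs posted here.

```python
import re

# Patterns matched against the "==== SUMMARY ====" lines shown in this thread.
SUMMARY_RE = {
    "uplink_mbps": re.compile(r"Uplink capacity:\s*([\d.]+)\s*Mbps"),
    "downlink_mbps": re.compile(r"Downlink capacity:\s*([\d.]+)\s*Mbps"),
    "responsiveness_ms": re.compile(
        r"Responsiveness:\s*\w+\s*\(([\d.]+)\s*milliseconds"
    ),
    "idle_latency_ms": re.compile(r"Idle Latency:\s*([\d.]+)\s*milliseconds"),
}


def parse_summary(text):
    """Extract the numeric fields from a networkQuality summary block."""
    result = {}
    for name, pattern in SUMMARY_RE.items():
        match = pattern.search(text)
        if match:
            result[name] = float(match.group(1))
    return result


BETA_25_03 = """Uplink capacity: 91.026 Mbps
Downlink capacity: 829.781 Mbps
Responsiveness: Medium (124.051 milliseconds | 483 RPM)
Idle Latency: 14.542 milliseconds | 4126 RPM"""

RELEASE_24_11 = """Uplink capacity: 90.237 Mbps
Downlink capacity: 805.436 Mbps
Responsiveness: High (33.661 milliseconds | 1782 RPM)
Idle Latency: 12.625 milliseconds | 4752 RPM"""

beta = parse_summary(BETA_25_03)
release = parse_summary(RELEASE_24_11)
ratio = beta["responsiveness_ms"] / release["responsiveness_ms"]
print(f"Loaded latency on 25.03 beta is {ratio:.1f}x the 24.11 figure")
```

The capacity figures in the two runs are close, so the difference that stands out is the loaded latency.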

                        I find it perplexing that the older firmware with single-core PPPoE is, in this regard, working better than multiple cores with if_pppoe.

                        It was a valid idea to double check again though.

                        Edit: Scratch the above for now as I think I found a misplaced patch being applied when it should not have been. This may have polluted my real-world experience and the testing....

                        ☕️

                        • W
                          w0w
                          last edited by w0w 29 days ago 29 days ago

                          I'm also starting to recall and analyze what's going on with these traffic limiters. Interestingly, I'm seeing packet drops on the PPPoE upload even though I haven't set any actual bandwidth limit; it's configured to the maximum. Still, under load (which is actually below 1 Gbit/s) I'm seeing drops specifically on the upload, on PPPoE using the new backend. I haven't tested it yet on the old backend. However, I did test it on the second provider (which is behind triple NAT through ROOter using a 5G mobile network). Yes, I have Multi-WAN, but the second provider is only used for failover. So either I didn't notice before, or under the same test conditions I'm not seeing any drops at all on the second WAN, which is ~200/~50 Mbit/s. The same limiters are in place there and the bandwidth cap is still 1 Gbit/s, but logically it shouldn't be active in either case, right?
                          Edit: just tested using old PPPoE backend, same drops on the upload pipe.

                          • R
                            RobbieTT @w0w
                            last edited by 29 days ago

                            @w0w
                            Some of your fq_codel settings are really demanding though.

                            With a usual latency variance over the internet of around ±1 ms or more (when unloaded), and with a usual fq_codel setting of 5 ms, you have a setting of 1 µs. That's quite brutal, I guess, and probably more suited to use inside a data centre than over the net.
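To illustrate why such a small value is brutal, here is a heavily simplified Python sketch of the CoDel drop logic: a packet becomes eligible for dropping once the queueing (sojourn) time has stayed at or above the target for a full interval, and further drops are then spaced at interval / sqrt(count). It assumes a fixed, made-up trace of 2 ms sojourn times with one packet per millisecond, and it ignores the fact that real drops shorten the queue, so it is not the dummynet implementation. The 5 ms target and 100 ms interval are the usual CoDel defaults.

```python
import math


def codel_drops(sojourn_ms, target_ms, interval_ms=100.0, spacing_ms=1.0):
    """Count drops a simplified CoDel would make on a fixed sojourn-time trace."""
    first_above = None  # time at which a full interval above target has elapsed
    dropping = False
    count = 0
    drop_next = 0.0
    drops = 0
    for i, sojourn in enumerate(sojourn_ms):
        now = i * spacing_ms
        if sojourn < target_ms:
            # Delay is acceptable: leave the dropping state and reset the timer.
            first_above = None
            dropping = False
            continue
        if first_above is None:
            first_above = now + interval_ms
            continue
        if now < first_above:
            continue
        if not dropping:
            # Above target for a whole interval: start dropping.
            dropping = True
            count = 1
            drops += 1
            drop_next = now + interval_ms / math.sqrt(count)
        elif now >= drop_next:
            # Still above target: drop again, at a steadily increasing rate.
            count += 1
            drops += 1
            drop_next += interval_ms / math.sqrt(count)
    return drops


trace = [2.0] * 1000  # hypothetical: one second of steady 2 ms queueing delay
print("target 5 ms :", codel_drops(trace, target_ms=5.0), "drops")
print("target 1 us :", codel_drops(trace, target_ms=0.001), "drops")
```

With a 5 ms target, a steady 2 ms of queueing never triggers a drop, whereas with a 1 µs target essentially any queueing at all counts as being above target, so the algorithm keeps dropping.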

                            My router crashed in the early hours for no explicable reason, so my testing today was borked. Outside of testing or configuration changes it's my first ever hard crash of pfSense.

                            ☕️

                            • W
                              w0w @RobbieTT
                              last edited by 29 days ago

                              @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                              Some of your fq_codel settings are really demanding though

                              Those are new default settings, I think. I have seen something on redmine regarding it, but... Ignored it 😁

                              @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                              My router crashed in the early hours for no explicable reason, so my testing today was borked

                              It just happens sometimes, any crash dumps available?

                              • T
                                tman222 @w0w
                                last edited by 29 days ago

                                @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                Some of your fq_codel settings are really demanding though

                                Those are new default settings, I think. I have seen something on redmine regarding it, but... Ignored it 😁

                                @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                My router crashed in the early hours for no explicable reason, so my testing today was borked

                                It just happens sometimes, any crash dumps available?

                                Hi @w0w - I'm curious about this too. Where did you see that there might be new defaults on FQ CoDel parameters? Unless I missed it and that particular traffic shaping algorithm was changed / improved, 1us seems way too low. Thanks in advance.

                                • W
                                  w0w @tman222
                                  last edited by w0w 29 days ago 29 days ago

                                  @tman222 said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                  Where did you see that there might be new defaults on FQ CoDel parameters?

                                  https://redmine.pfsense.org/issues/16037

                                  And this is what I see when I select an already created limiter — but you also don’t see any of those parameters when creating one...

                                  dec7c970-e1de-4e27-b1f5-7c0aeb280913-image.png
                                  And when you try to create the new one
                                  1c5b29fd-5adc-4b5c-89f6-e36fdff28a4c-image.png

                                  I don't really think those are new defaults, because all the fq-codel man pages I can find on the web reference the same 5ms value that @RobbieTT mentioned.

                                  • R
                                    RobbieTT @w0w
                                    last edited by 29 days ago

                                    @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                    It just happens sometimes, any crash dumps available?

                                    No crash log or anything of note in the usual logs. It just stopped doing its stuff.

                                    ☕️

                                    • R
                                      RobbieTT @w0w
                                      last edited by 29 days ago

                                      @w0w said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                      @tman222 said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                      Where did you see that there might be new defaults on FQ CoDel parameters?

                                      And this is what I see when I select an already created limiter — but you also don’t see any of those parameters when creating one...

                                      I don't really think those are new defaults, because all the fq-codel man pages I can find on the web reference the same 5ms value that @RobbieTT mentioned.

                                      The defaults can be messed up and showing zero, according to the redmine. The pfSense manual still has the correct defaults listed.

                                      You do see the parameters when creating a new one, only that they do not appear until you set and save that page. If you look closely on your screenshot, below Scheduler: FQ_CODEL, you will see this note:

                                      Save this limiter to see algorithm parameters.

                                      Caution, coffee may be hot etc.

                                      It catches many of us out when we haven't set a new one in ages. It's a weird UI human factor fail thing and I have no idea why pfSense makes it so complicated compared to other routers.

                                      As Douglas Adams would have it "It's a black panel with a black button that lights-up black when you press it..."*


                                      *Hotblack's ship, when he was spending a year dead, for tax reasons.

                                      • W
                                        w0w @RobbieTT
                                        last edited by 29 days ago

                                        @RobbieTT said in 25.03 beta - Bufferbloat / FQ CoDel issues:

                                        Caution, coffee may be hot etc.

                                        It catches many of us out when we haven't set a new one in ages.

                                        Absolutely. Of course, that doesn’t change the fact that no one expects the default parameters to have values different from those stated in the documentation — or at the very least, everyone is used to trusting that those parameters actually exist and are being applied. I just didn’t check them myself, of course.

                                        • R
                                          RobbieTT @w0w
                                          last edited by 29 days ago

                                          @w0w
                                          No it doesn't and until your link to the redmine I had no idea it was a thing. It doesn't look like Netgate has addressed the issue, presumably because it is both intermittent and potentially unnoticed when new limiters are set.

                                          ☕️

                                          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.