Playing with fq_codel in 2.4
-
What percentage of your paid-for bandwidth you should shape to is situational. If you always get 100% of what your ISP says they'll give you, then 95% works. If your line dips to 94% of what you subscribe to and you set dummynet to 95%, then dummynet can't do anything for you.
It's as granular as firewall rules can be.
Shaping is nowhere near that simple.
The reason Steam is harder to shape is that it opens so many connections.
I always get 100% from my ISP, but that doesn't mean 95% will always work well for all types of traffic.
Looking at the dummynet configuration, it looks like multiple specific rate limits cannot be set within one pipe; however, weighting can be applied, so I can put Steam downloads on a low weight and things like DNS lookups and email on a high weight. This is what I will look into on my config next. Thanks to the OP for giving me a starting point. :)
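Roughly what I have in mind, purely as a sketch (the bandwidth and the weights are placeholder values, not a tested config): one pipe sized to the line, with weighted queues hanging off it.

ipfw pipe 1 config bw 70Mbit/s        # one rate limit for the whole line
ipfw queue 1 config pipe 1 weight 1   # low weight: Steam / bulk downloads
ipfw queue 2 config pipe 1 weight 50  # high weight: DNS lookups, email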
ALTQ on pfSense is incredibly granular, but of course someone has put the effort into integrating it all into the GUI, and ALTQ itself allows child queues to have their own limits set.
Already got some good results.
When I set up the dummynet config (basic, as in the OP) I had an IPTV stream running to my STB, and I can see from my ping monitoring that the peak latency on my connection has plummeted. It used to be an almost steady increase in peak latency; now it's spikes instead of a constant climb, and the spikes are generally much lower.
I will post back on how my steam downloads testing goes.
-
That's kind of the point of fq_codel: KISS. It is intended by design to be simple.
Now you might be trying to get it to do something it wasn't designed to do, in which case yes, you will have to do some weird stuff - but more likely you should just look elsewhere.
-
I am a fan of simple: if Steam works well with the current config then the current config stays; the weighting is a fallback plan if it doesn't work well.
From what I understand, all the weighting is on the dummynet side; it simply adjusts the throughput allowance of each flow dynamically based on the weight assigned, and by default everything has the same weight.
It's all documented in the dummynet section of the ipfw man page; I don't think it's an unsupported feature.
-
Yeah, the man page is very helpful.
You should just be able to weight Steam as you desire: apply the queue to a rule that catches Steam's ports and protocols, and let fq_codel do the rest.
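Something like this is what I mean, sketched from the man page rather than copied off a working box, so treat the numbers, the scheduler line and the port matches as placeholders:

ipfw pipe 1 config bw 70Mbit/s
ipfw sched 1 config type fq_codel
ipfw queue 1 config pipe 1 weight 10                  # Steam
ipfw queue 2 config pipe 1 weight 50                  # everything else
ipfw add 100 queue 1 tcp from any 80,443 to any in    # placeholder match for Steam content downloads
ipfw add 110 queue 2 ip from any to any in            # everything else inbound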
I think you'll be happy with it. I just set it up and it's working very well for me.
It's awesome for weighting a guest net and primary net!
-
I played some more.
First, I misunderstood the man page: the weight flag does nothing with fq_codel; it only has an effect with other scheduler types.
I tested Steam and the result wasn't good: lots of packet loss during a Steam download. The packet loss only gets close to 0% when the pipe size is below 40% of my line capacity. As I said, Steam is probably the most brutal traffic I have seen on my home connection.
HFSC can manage it at anything below 90%; however, latency during saturation is vastly better on fq_codel. Packet loss is worse, but latency is better.
If I increase the queue slots to 500 (default 50), then packet loss almost stops at pipe sizes below 75%, with a small hit to latency.
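For clarity, the queue depth I mean is the dummynet slot count, so the change was just along these lines (bandwidth value illustrative):

ipfw pipe 1 config bw 70Mbit/s queue 500    # default is 50 slots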
-
It sounds like something else is going wrong on your box.
Weighting definitely applies to fq_codel; I've tested it on my own system and it matches the weighted values every time.
I've also tested with both Steam and flent rrul. No packet loss.
What is your line bandwidth? Are you trying to use dummynet and ALTQ at the same time?
-
ALTQ is off during this testing to ensure there's no conflict.
Thanks for clarifying that weighting has an effect; I will try it on my test config as originally planned. The reason I said it wasn't valid is because of this, from the dummynet section of the ipfw man page:
weight weight
    Specifies the weight to be used for flows matching this queue.
    The weight must be in the range 1..100, and defaults to 1.

The following case-insensitive parameters can be configured for a scheduler:

type {fifo | wf2q+ | rr | qfq}
    specifies the scheduling algorithm to use.
    fifo   is just a FIFO scheduler (which means that all packets are
           stored in the same queue as they arrive to the scheduler).
           FIFO has O(1) per-packet time complexity, with very low
           constants (estimate 60-80ns on a 2GHz desktop machine) but
           gives no service guarantees.
    wf2q+  implements the WF2Q+ algorithm, which is a Weighted Fair
           Queueing algorithm which permits flows to share bandwidth
           according to their weights. Note that weights are not
           priorities; even a flow with a minuscule weight will never
           starve. WF2Q+ has O(log N) per-packet processing cost, where
           N is the number of flows, and is the default algorithm used
           by previous versions dummynet's queues.
    rr     implements the Deficit Round Robin algorithm, which has O(1)
           processing costs (roughly, 100-150ns per packet) and permits
           bandwidth allocation according to weights, but with poor
           service guarantees.
    qfq    implements the QFQ algorithm, which is a very fast variant of
           WF2Q+, with similar service guarantees and O(1) processing
           costs (roughly, 200-250ns per packet).

This made me think weighting is exclusive to wf2q+. However, fq_codel is omitted from the type section, so it didn't actually confirm that fq_codel has no weighting; I just made the assumption.
My downstream throughput is around 71603 kbit/s, calculated after removing VDSL overheads and also confirmed experimentally by rate limiting to see when a rate limit starts having an effect. The bandwidth is very consistent whether it's on peak or off peak; if I remove shaping and download via Steam, it flatlines at the max speed with no dips.
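For anyone doing the same sum, a rough illustration with a made-up sync rate rather than my real figures: VDSL2 uses 64/65 PTM framing, so a hypothetical 80000 kbit/s sync rate leaves about 80000 x 64/65 ≈ 78769 kbit/s of payload, and per-packet Ethernet (plus PPPoE, if used) headers shave a further few percent off the achievable goodput with 1500-byte packets.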
Steam on average opens between 20 and 40 connections when downloading; most of these connections appear to be short lived, making them very difficult to shape. Instead of downloading a large compressed file and uncompressing it, Steam seems to either download individual files on their own sessions or download files in fragments, with the aim of maximising TCP sessions. The problem is significantly reduced if I choose a server with a high RTT, such as one in America (I am in the UK); swamping a connection with low-RTT, high-bandwidth sessions will murder it.
I have been looking at the box configuration itself, the hardware tuning etc. As I understand it, if packets are batched together by things like interrupt moderation, combined with a low kernel hertz timer, then shaping is less effective because it cannot intervene at frequent enough intervals. These are all things I am investigating as an ongoing process, and I haven't given up on this.
I have just reported back here how things went on the configuration suggested in the OP, with the only alternative config tried so far being to increase the queue depth.
So far, on my unit and with my usage test patterns, fq_codel via dummynet is much better for latency but worse for packet loss compared to HFSC on ALTQ.
Also, to add, Steam itself supports throttling its speeds. I have tested with "unlimited", "7 MB/sec" (which seems to be just below what my line can do) and "5 MB/sec". When Steam throttles it's not clean though: it spikes to full speed, then pauses, then goes full speed again, so it evens out that way. If I set my pipe size lower and leave Steam at unlimited, it's a clean reduction in speed, flatlined at the pipe throughput rate. Throttling via Steam versus the pipe size is more effective at higher speeds, but the pipe gets better when set very low. I will report back after trying more stuff, and I welcome suggestions that are reasonable (trying entirely new kit is not reasonable, in case you were about to suggest it).
-
You could try applying a different shaping algorithm to Steam and see if it works better.
-
I have stopped playing for now and moved back to HFSC ingress and fairq egress.
Suddenly that's improved, due to the changes I made to try and help dummynet.
I did some tuning based on the information here (written by the author of the netisr code):
http://alter.org.ua/soft/fbsd/netisr/
I disabled interrupt moderation on my two Intel ports.
I increased the kernel timer rate, which doesn't help interrupt overheads, but I have the spare CPU cycles idle.
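For anyone wanting to try the same, these were loader tunables rather than GUI settings; roughly along these lines (values illustrative, and check igb(4) for what your driver version actually honours):

kern.hz="2000"                       # raise the kernel timer rate
hw.igb.max_interrupt_rate="32000"    # lift the igb interrupt rate cap so moderation backs off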
Now ALTQ+HFSC is handling Steam no problem at 97%, and the latency performance is much closer to fq_codel as well.
I don't know yet if my IPTV issues will improve or if I will still need to temporarily disable ingress ALTQ when relying on the IPTV stuff.
The netisr changes I consider experimental, of course, as they go against what I considered good practice (and obviously what pfSense considers good practice). The author seems to suggest that deferred mode for netisr is superior for routing and that direct is only best when running as a server, especially when you have more CPU cores than network cards.
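If anyone else wants to experiment with that, the dispatch policy is a single loader tunable / sysctl; the value below is just the one the article argues for, not a recommendation:

net.isr.dispatch="deferred"    # FreeBSD's default is direct; hybrid is the middle ground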
Hopefully more people will contribute to this thread with their experience of dummynet fq_codel vs ALTQ shaping (particularly HFSC).
-
Increased the timer rate? It defaulted to 1000 Hz for me, or at least I don't remember changing it. Was it lower for you, or did you increase it even more? FreeBSD 10 added tickless kernel scheduling; I'm not sure kern.hz even applies anymore.
I'm also getting good results from ALTQ+HFSC+Codel
-
Yeah, by default it runs at double the kern.hz on my unit. I had set it back to legacy behaviour, so a 1000 Hz timer, but for this I removed my override so it's 2000 ticks per second on each core.
By the way, I now think this is a PPS issue: my unit is failing to handle it when the packet rate is too high.
It seems HFSC drops the PPS quite aggressively when it is in play, and this is why it is working better for me. From what I can tell, my unit starts dropping random packets when PPS is over about 1500; when it reaches 2500 or so it gets quite bad, although the download is never visibly affected.
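For anyone wanting to watch their own packet rates, the interface counters are enough; something like this (igb0 is just an example interface) prints per-second packet and byte counts:

netstat -w 1 -I igb0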
I do have NAT. No CPU cores are being saturated. pfSense is behind a VDSL modem I have in bridge mode, so there is also a chance that the modem is not handling the higher PPS itself.
So at least for now I will keep my feedback out of this thread, given that the problem for me might be related to hitting a PPS bottleneck.
-
Sounds like something is wrong. My i5 @ 3.2GHz (Haswell) with an Intel i350-T2 is handling 1.44Mpps with HFSC+Codel just fine. 400Kpps is about 15-20% load and 1.44Mpps is about 20-25% load. And I still get about 1ms pings while doing this.
I have some forum posts around about what changes I made, like enabling MSI-X (enable soft interrupts if your NIC properly supports it) and removing the limitation on the number of packets to process per interrupt, which is by default 40 I think.
I used to process 1.44Mpps at only 600 interrupts per second, but over the years, it is now about 1,200 interrupts per second. All I know is CPU is low, interrupts are low, latency and jitter and loss are low. Even with only 300 interrupts per core per second.
-
Agreed, something is wrong; problem is I don't know what at the moment. :)
Will diagnose some more at some point.
MSI-X is enabled by default on my i350. I actually temporarily disabled it already to try and get to the bottom of it, but MSI-X in this case is not the solution. The packets-per-interrupt limit is something I've not heard of; do you know where that is configured?
-
I guess it's not 40; not sure where "40" came from. I definitely remember reading something, though.
https://calomel.org/freebsd_network_tuning.html
Intel igb(4): FreeBSD puts an upper limit on the number of received
packets a network card can process to 100 packets per interrupt cycle. This
limit is in place because of inefficiencies in IRQ sharing when the network
card is using the same IRQ as another device. When the Intel network card is
assigned a unique IRQ (dmesg) and MSI-X is enabled through the driver
(hw.igb.enable_msix=1) then interrupt scheduling is significantly more
efficient and the NIC can be allowed to process packets as fast as they are
received. A value of "-1" means unlimited packet processing and sets the same
value to dev.igb.0.rx_processing_limit and dev.igb.1.rx_processing_limit .
hw.igb.rx_process_limit="-1" # (default 100 pps, packets per second)
I want to say that when I echoed my system settings, MSI-X was disabled and I had to explicitly add a tunable to pfSense to enable it. I'm also under the impression that many NICs that claim to support MSI-X do not do so correctly and have odd bugs when MSI-X is enabled, so it was turned off. I'm going off memory from a long while ago when I was researching this.
-
Yeah, mine has definitely been on; checked via dmesg.
I will turn it back on; I only turned it off to check whether somehow it was not working properly.
-
I've got some great results with fq_codel; it makes significant improvements on wifi, where latency can be a real problem!
https://forum.pfsense.org/index.php?topic=135843.msg745944#msg745944
-
I think it's interrupt related.
I enabled polling on both NICs in use, and the packet loss when downloading from Steam or mega.nz is gone.
Before I enabled polling I was observing the interrupts/sec. First, AIM (adaptive interrupt moderation) seems to not be working on my i350 ports, as it has no effect; secondly, I generate 8000 interrupts/sec for a download of about 70 Mbit/sec. I see people on here reporting similar interrupt rates but for half a gigabit/sec, so there are two things pointing to moderation not working.
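In case anyone wants to check their own box, the interrupt counters are easy to watch, for example:

vmstat -i          # totals and average rates per interrupt source
systat -vmstat 1   # live per-second interrupt rates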
ALTQ reports around 2000 pps; assuming that's accurate, then somehow I have 4 interrupts for every packet.
I have seen threads on here regarding fake i350s; I don't think it's impossible mine is fake, which could explain the broken AIM.
-
You can get the number off your NIC and run it through intel.com to see if it's valid.
https://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000007074.html
-
How did you enable polling?
I am not quite sure this is a good idea anyway, but I can't see this option in the GUI any more; has it been moved somewhere?
You'd be better off trying these "safe" settings for Intel cards, added to /boot/loader.conf.local:
hw.igb.num_queues=2
dev.igb.0.fc=0
dev.igb.1.fc=0
Also I have disabled hardware TCP segmentation offload and hardware large receive offload.
P.S. In addition to Harvy66's post:
https://forums.servethehome.com/index.php?threads/chinese-intel-i350-adapters.3746/#post-58686
-
I compiled it into the kernel and then enabled it on the CLI via ifconfig; I don't think you can enable it with a module.
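In outline it's the standard DEVICE_POLLING recipe (a sketch, not my exact kernel config; igb0 is a placeholder for whichever interface you use):

# custom kernel config
options DEVICE_POLLING
options HZ=2000

# then per interface at runtime
ifconfig igb0 polling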
Polling is seen as pointless nowadays, but only because modern Intel and Broadcom hardware does interrupt moderation; for some reason my i350 isn't moderating the interrupts.
I have already disabled flow control and played with the queues; all of that made no difference.
When I can be bothered I will check it the way Harvy said, although I am already pretty sure I had no such label on my card.