HFSC explained - decoupled bandwidth and delay - Q&A - Ask anything
-
Thanks for bumping this thread up. (it should be a sticky next to ermal's post which is something like 7 years old now)
Anyway, is it a limitation of the GUI that each HFSC queue can only be tied to one interface? Fixing that would solve the issue of splitting bandwidth with UPPERLIMIT in multi-LAN scenarios. Carving bandwidth up per interface really limits the flexibility of link-share.
-
Well, I don't quite agree with you on the theory that realtime would take from root instead of the parent queue.
It's not debatable, it's a fact.
queue root_igb0 on igb0 bandwidth 99Mb priority 0 {qACK, qUnclassified, qClassified}
queue qACK on igb0 bandwidth 19.80Mb qlimit 1024
queue qUnclassified on igb0 bandwidth 29.70Mb {qUDP, qDefault}
queue qUDP on igb0 bandwidth 13.07Mb qlimit 1024 hfsc( codel linkshare(16.34Mb 5 13.07Mb) )
queue qDefault on igb0 bandwidth 13.07Mb qlimit 1024 hfsc( codel default )
queue qClassified on igb0 bandwidth 1Mb {qNormal, qHigh}
queue qNormal on igb0 bandwidth 440Kb qlimit 1024 hfsc( codel )
queue qHigh on igb0 bandwidth 440Kb qlimit 1024 hfsc( codel realtime 40Mb linkshare(550Kb 5 440Kb) )
Notice the parent queue qClassified is assigned 1Mb of bandwidth while the child queue qHigh is assigned 440Kb of linkshare and 40Mb of realtime. This is a perfectly valid configuration. And if qHigh decides to pull down 40Mb/s and qClassified only has 1Mb of bandwidth, how much bandwidth is left over for qNormal? None.
Realtime is ALWAYS guaranteed to be available, assuming the queue has not gone over the provisioned amount of realtime, and all bandwidth the realtime consumes counts towards the parent queue(s).
And of course your realtime didn't starve the parent queue in your simple test, because realtime cannot consume more than 80% of the interface bandwidth. That means there is always at least 20% free for linkshare. As long as some bandwidth remains free and no upperlimit on the parent queue is being exceeded by the realtime child, the sibling queues will still have some bandwidth to work with.
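To make the arithmetic concrete, here is a quick back-of-the-envelope check (a sketch in Python, using the numbers from the config above) of why qNormal gets starved, and why the 80% realtime cap still leaves linkshare something to work with:

```python
# Numbers taken from the pf.conf example above:
PARENT_BW_MB = 1.0   # qClassified's (parent) bandwidth
REALTIME_MB = 40.0   # qHigh's realtime guarantee
IFACE_BW_MB = 99.0   # root queue on igb0

# Realtime is satisfied regardless of the parent's allocation, and the
# bandwidth it consumes counts against the parent, so the sibling
# qNormal is left with whatever linkshare remains:
leftover_for_sibling = max(0.0, PARENT_BW_MB - REALTIME_MB)
print(leftover_for_sibling)  # 0.0 -- qNormal is starved

# Realtime is capped at 80% of the interface, so linkshare always has
# at least 20% of the link left to distribute:
max_realtime = 0.8 * IFACE_BW_MB
print(max_realtime)  # 79.2
```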
-
Whoops I feel like my question above may have also been irrelevant to the specific topic of this thread. (discussing decoupled bw/delay)
Perhaps a more direct question about the thread topic would be on the ping test by Nullity (thanks for that hands on example)…
How would CoDel affect that test? If CoDel's goal is to keep the buffer length around 5ms, the packets in that test are essentially held for 10ms (the d parameter) because of the 0Kb bandwidth of the m1 parameter.
Would CoDel just drop all the packets? And if not, why not? (Unfortunately I don't have access to a pfSense lab environment where I can run interesting tests like these.)
-
CoDel is designed to improve delay while keeping all other aspects (like bandwidth) unchanged. CoDel keeps the queue size low, but at least 1 packet needs to be queued before CoDel will drop other incoming packets.
Artificially adding delay with HFSC (or ipfw, once CoDel & fq_codel are added: http://caia.swin.edu.au/freebsd/aqm/) might affect CoDel's queue-delay calculation, or maybe not; I don't know the implementation details. Regardless, CoDel should not cause a problem, by design, but the easiest answer is to just run a test and see. :)
I really need to set up a (virtual?) lab too…
-
Codel has an extremely large internal buffer, but the dropping algorithm does not do a hard cut-off. When the queueing delay stays above the threshold, it drops one packet, not all packets. It then checks one interval later to see whether the buffer has gotten back below the threshold. If not, it drops another packet.
When playing around, I was able to get Codel to cause multi-second ping times, but you really have to game it.
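The "drop one, wait an interval, check again" behaviour described above can be sketched as a toy control loop. This is a simplified illustration, not the real CoDel implementation (the actual state machine in the RFC differs in detail); `TARGET` and `INTERVAL` are CoDel's usual defaults:

```python
import math
from collections import deque

TARGET = 0.005    # 5 ms acceptable standing delay
INTERVAL = 0.100  # 100 ms observation window

class ToyCoDel:
    """Simplified CoDel sketch: drops one packet per control interval
    while sojourn time stays above TARGET; never flushes the queue."""
    def __init__(self):
        self.q = deque()          # (enqueue_time, packet)
        self.first_above = None   # deadline before the first drop
        self.drop_count = 0

    def enqueue(self, now, pkt):
        self.q.append((now, pkt))

    def dequeue(self, now):
        while self.q:
            t_in, pkt = self.q.popleft()
            sojourn = now - t_in
            if sojourn < TARGET or not self.q:
                # below target (or queue emptied): leave dropping state;
                # the last remaining packet is never dropped
                self.first_above = None
                self.drop_count = 0
                return pkt
            if self.first_above is None:
                # first time above target: wait one interval before dropping
                self.first_above = now + INTERVAL
                return pkt
            if now < self.first_above:
                return pkt
            # above target for a whole interval: drop this ONE packet,
            # shorten the next interval (sqrt control law), try the next
            self.drop_count += 1
            self.first_above = now + INTERVAL / math.sqrt(self.drop_count)
        return None
```

Even with a standing backlog, each interval costs only one packet, which is why gaming the algorithm into multi-second pings takes effort.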
-
Advanced real-time VOIP configuration
Say you want to efficiently improve your VOIP setup. You have 7 VOIP lines and a 5Mbit upload. Each VOIP packet is 160 bytes (G.711), with an average bitrate of 64Kbit/s per line (a 160byte packet sent every 20ms). We want to improve our worst-case delay (which also improves worst-case jitter and overall call quality) from 20ms to 5ms, so we calculate the bitrate needed to achieve that:
160 bytes × 8 = 1280 bits
1280 bits × (1000ms ÷ 5ms) = 256000 bits/s
So, to send a 160-byte packet within 5ms we need to allocate 256Kbit/s. This gives us our m1 (256Kb) and our d (5).
Now we need to calculate our maximum average bitrate:
7 lines × 64Kbit/s = 448Kbit/s
Just to be safe, we allocate bandwidth for an extra line, so
8 lines × 64Kbit/s = 512Kbit/s
This gives us our m2 (512Kb).
To make sure that your m1 (the per packet delay) can always be fulfilled by your connection, make sure that the m1 (256Kb) multiplied by the maximum number of simultaneous sessions (7 or 8) is less than your maximum upload. 2048Kbit (256Kb × 8) is less than 5000Kbit, good.
Our finalized configuration is:
m1=256Kb
d=5
m2=512Kb
This configuration will guarantee a 7.4ms (5ms + MTU transmission delay) worst-case delay for each packet, with a limit of 8 simultaneous calls (512Kbit/s). We get low-delay, low-jitter calls as though you had allocated 2048Kbit/s of bandwidth, but you actually only allocated 512Kbit/s of bandwidth.
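The derivation above can be reproduced in a few lines of Python, using only the figures from the scenario (160-byte G.711 packets, 7 lines plus one spare, 5Mbit upload):

```python
# Reproducing the m1/d/m2 arithmetic for the 7-line G.711 scenario.
PACKET_BYTES = 160
PACKET_BITS = PACKET_BYTES * 8      # 1280 bits per VOIP packet
TARGET_DELAY_MS = 5
LINES = 7
PER_LINE_BPS = 64_000               # one packet every 20 ms
UPLOAD_BPS = 5_000_000

# m1: the rate needed to push one packet out within the target delay
m1_bps = PACKET_BITS * (1000 // TARGET_DELAY_MS)   # 256_000
# m2: the sustained rate, padded with one spare line
m2_bps = (LINES + 1) * PER_LINE_BPS                # 512_000
# sanity check: the burst rate must be affordable even if all
# 8 sessions demand it at once
assert m1_bps * (LINES + 1) <= UPLOAD_BPS          # 2_048_000 <= 5_000_000

print(f"m1={m1_bps // 1000}Kb d={TARGET_DELAY_MS} m2={m2_bps // 1000}Kb")
```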
I may be misunderstanding the m1 value. It would seem to me that with m1 set to 256Kb and packets taking ~7.5ms to send, if I had 8 active lines and by chance each 20ms packet arrived at the same time, I would have a potential 80ms (8 × 7.5ms) delay as the queue becomes backlogged. While not disastrous for VoIP, if this scales up I would begin to see the delays in the call. Perhaps my understanding of VoIP is lacking.
Sorry about the zombie thread…
-
I think it's simpler to just use the single-packet case. If you have one 160-byte (1280-bit) packet every 20ms, that's 64Kbit/s. But at 64Kb/s, HFSC is allowed to schedule that packet in any manner, as long as it completes before the next packet arrives 20ms later. This means your absolute worst case is 20ms.
If you want this packet to be sent out faster, say in 5ms, then you need "5ms" of bandwidth for scheduling reasons. It takes 256Kb of bandwidth to send a 1280-bit packet in 5ms. But you don't need 256Kb average, just 256Kb of burst. So you tell HFSC: 64Kb average, 256Kb burst for 5ms.
Or something along these lines.
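The single-packet case is easy to check directly: the serialization time of one 1280-bit packet at the average rate versus the burst rate is exactly the difference being described.

```python
PACKET_BITS = 160 * 8   # 1280 bits, one G.711 VOIP packet

def tx_time_ms(rate_bps):
    """Serialization time in ms for one packet at the given rate."""
    return PACKET_BITS / rate_bps * 1000

print(tx_time_ms(64_000))    # 20.0 -- at the average rate alone
print(tx_time_ms(256_000))   # 5.0  -- at the m1 burst rate
```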
-
The scenario was for 7 lines, not 8. It could handle 8, but that'd be 100% usage… I don't know what standard operating procedure is, but allowing for no overhead seems dangerous.
There will be no problematic backlog. The scenario above can handle 7 simultaneous lines while guaranteeing a per-packet latency of ~7.5ms for every VOIP packet, because neither the internet connection nor the queue is overused at any point, even during 7 simultaneous VOIP calls.
-
Thanks for the response! It sounds like the m1 is per packet. Is that correct? So if pfSense received 2 packets within 2ms each packet would be allocated m1 and I would see a burst of 256Kb/s x 2 = 512Kb/s?
You also talk about the connection not being overused. If I have a 5Mb/s upload (rated for 6Mb/s), 4Mb/s for linkshare in other queues (4Mb/s upperlimit), and 1Mb/s for qVoIP, I should never have to worry about oversaturation.
Note: Not exact numbers, but the same idea.
-
Yeah, exactly. As I understand it, m1/d are per packet. If m1/d were per queue, then the per-packet latency would be dependent on how many packets were queued, which would make latency unpredictable and practically break HFSC's latency guarantees.
There are a number of ways to do it, but that setup seems fine. You should stress-test it to confirm proper functionality though, regardless.
-
It's kind of both. The scheduling is done at the queue level, but in discrete steps of packets. The service curve is just the priority: whichever curve is "greater" at any given moment is scheduled next. The m1+d modify the curve to make it look greater. If sending the packet would push the queue beyond its limit, HFSC won't send the packet yet. Of course, if there is spare bandwidth and no other packets, it'll just send the packet anyway.
I just think of HFSC as a dynamic priority-based queue where the priority is adjusted in real time per packet processed, such that all queue configurations are respected. Of course my simplification breaks down in extreme situations, like absurdly low bandwidths, absurdly large MTUs, or absurdly low target latencies, but it's a good rule of thumb.
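The two-piece curve behind this intuition is easy to write down. This is a sketch under the usual reading of the parameters (m1 for the first d milliseconds after a queue becomes active, m2 afterwards), not the full HFSC algorithm; inverting the curve gives the deadline by which a backlog must be served:

```python
def service_curve_bits(t_ms, m1_bps, d_ms, m2_bps):
    """Bits of service the two-piece curve promises within t_ms."""
    if t_ms <= d_ms:
        return m1_bps * t_ms / 1000
    return m1_bps * d_ms / 1000 + m2_bps * (t_ms - d_ms) / 1000

def deadline_ms(bits, m1_bps, d_ms, m2_bps):
    """Invert the curve: by when must `bits` of backlog be served?"""
    burst_bits = m1_bps * d_ms / 1000
    if bits <= burst_bits:
        return bits / m1_bps * 1000
    return d_ms + (bits - burst_bits) / m2_bps * 1000

# One 1280-bit VOIP packet under m1=256Kb, d=5, m2=512Kb:
print(deadline_ms(1280, 256_000, 5, 512_000))   # 5.0 ms
# The same packet under a flat 64Kb curve (m1 == m2):
print(deadline_ms(1280, 64_000, 5, 64_000))     # 20.0 ms
```

The decoupling is visible in the numbers: the burst segment buys the packet a 5ms deadline without raising the queue's sustained rate.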
-
(Using the scenario from earlier.)
If 8 packets are instantly queued and m1/d were per queue, the queue would only get 256Kbit for 5ms, which would be only 1 packet transmitted, violating HFSC's latency guarantees for the other 7 packets.
If it were per queue, we'd need to set m1 to 8× the bandwidth to support 8 concurrent sessions, which may be true, but I think m1/d is per packet, since HFSC's m1/d/m2 correspond with umax/dmax/r, which are defined on page 10 of the HFSC paper as follows:
Each session is characterized by three parameters: the largest unit of work, denoted umax, for which the session requires delay guarantee, the guaranteed delay dmax, and the session's average rate r. As an example, if a session requires per packet delay guarantee, then umax represents the maximum size of a packet. Similarly, a video or an audio session can require per frame delay guarantee, by setting umax to the maximum size of a frame.
umax = packet, and umax/dmax directly correspond with m1/d; therefore m1/d are per packet as well.
Edit: Also, I'm not sure what "session" means exactly… maybe flow?
I don't understand the HFSC paper very well, but the few parts I do understand only mention single-session (flow?) queues, which may be the only way to properly use HFSC, but that seems strangely limited.
We've had this per-packet/per-queue argument before and I had a few more references, but I forget them now. Maybe they're in this thread? Reading the source code or even the pseudo-code in the HFSC paper could answer our question, but it seems too advanced for me...
Testing this per queue/per packet theory should be reasonably easy. Someday... :-\
-
Latency guarantees are only guaranteed if you don't go past your bandwidth limit. If you burst 8 1280-bit (160-byte) packets at once, the last packet is going to have 8960 bits of data in front of it, and 10,240 including itself. To send 10,240 bits in 5ms, you need ~2Mb/s of bandwidth. You can't have your cake and eat it too. It is impossible in this universe to guarantee all 8 packets will get sent within 5ms with anything less than ~2Mb/s.
Even in this case, the average bandwidth is 1280 bits × 50/sec (one every 20ms) × 8 packets = 512,000bps, which is about a quarter of the bandwidth you need to make sure none of those 8 packets hangs in the queue for more than 5ms. You need about 2Mb/s, but you only need it for 5ms. Your average is 512Kb/s.
m1 = burst
d = duration of burst (of course you can't send fractional packets, so you need to make sure all the packets you want to burst fit within m1×d)
m2 = overall average
m1 is moot if you attempt to go over m2.
*my interpretation
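The burst-of-8 arithmetic checks out, using the 20ms G.711 packet spacing from the original scenario:

```python
PACKET_BITS = 1280      # one 160-byte G.711 packet
BURST = 8               # worst case: all 8 lines fire at once
D_MS = 5                # target delay

total_bits = PACKET_BITS * BURST              # 10_240 bits queued at once
rate_needed = total_bits / (D_MS / 1000)      # bits/s to clear them in d
print(rate_needed)      # 2_048_000.0 -> ~2Mb/s of burst

# The sustained rate is far lower: one packet per line every 20 ms,
# so 8 lines average 8 x 64Kb/s:
avg_bps = BURST * (PACKET_BITS * 1000 // 20)
print(avg_bps)          # 512_000 -> the m2 from the scenario
```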
-
Any response to the HFSC paper saying that m1/d & umax/dmax are per packet?
It's confusing since m2 is obviously per queue.
-
Well, actually, it doesn't exactly say it's per-packet. It just says "if a session requires per packet delay guarantee, then umax represents the maximum size of a packet"… so, would 7 separate VOIP flows be considered a single session or multiple sessions? Is a queue's traffic considered a single "session", regardless?
Fuck, I really don't want to read the HFSC paper for the 100th time... Though, IIRC, the paper only mentions single-flow scenarios, so the source code might provide the only proper answer.
I'm beginning to think you may be right, though; m1 should be multiplied by 7 to support 7 concurrent flows.
HFSC... :-\
-
Again From: http://linux-tc-notes.sourceforge.net/tc/doc/sch_hfsc.txt
Example 3.
That worked, but not as well as you hoped. You recall TCP can
send multiple packets before waiting for a reply and you would
like them all sent before web gets to send it's first packet.
You also estimate you have 10 concurrent users all doing the same
thing. From a packet capture you decide sending 4 packets before
waiting for a reply is typical.

10 users by 4 packets each means 40 MTU sized packets. Thus you
must adjust the burst speed, so ssh gets 40 times the speed of
web and you must allow for 40 MTU sized packets in the burst
time:

tc class add dev eth0 parent 1:0 classid 1:1 hfsc
ls m1 24kbps d 60ms m2 500kbps # web
tc class add dev eth0 parent 1:0 classid 1:2 hfsc
ls m1 975kbps d 60ms m2 500kbps # ssh

Note that 975:24 is 40:1, and (40 * 1500 / 1mbps = 60).
Seems to imply the same. Though I think that math might be wrong: it seems like it should be (40 * 1500 * 8 / 1mbps ≈ 457ms), since a 1500-byte MTU is 12000 bits, not 1500.
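Checking that correction: 40 MTU-sized packets is 480,000 bits, and the exact millisecond figure depends on whether "1mbps" is read as 10^6 or 2^20 bits/s (the 457ms figure comes from the binary reading):

```python
MTU_BITS = 1500 * 8      # 12_000 bits -- the document's point: bits, not bytes
PACKETS = 40             # 10 users x 4 packets each

burst_bits = PACKETS * MTU_BITS             # 480_000
print(burst_bits / 1_000_000 * 1000)        # 480.0 ms  if 1mbps = 10^6 b/s
print(burst_bits / 2**20 * 1000)            # ~457.8 ms if 1mbps = 2^20 b/s
```

Either way the burst window is roughly 8× the 60ms used in the example.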
-
Yeah, he makes that mistake in example 2 as well. I tried emailing him months ago but the email bounced.
I guess it makes more sense if m1/m2 were both per queue.
-
I really need to add: my interpretation is based entirely on their description of the problem, a high-level abstraction of how they're trying to solve it, the variables exposed to the end user (m1/d/m2) that represent how to address the issue, and the desired result.
I am filling in the implementation blanks with how I would go about "solving" the issue at a lower, but still abstract, level. This gives rise to how I think the variables are meant to be used. Typically the simplest answer is the best answer, and to me, this is the simplest way to think about the problem. But dear lord, the math and implementation details behind it. Nope, I can't handle that. I find it easier to try to understand the problem and infer the details than to read the details directly.
With nothing empirical backing up my opinions, I am very open to corrections. Nullity has already corrected my notion of a "burst". At first I was thinking of it as additive to the average (think Comcast's "PowerBoost"), but it turns out the average must be respected. At that point I realized it was not a way to add more bandwidth for a short amount of time, but a way to compress the available bandwidth into an uneven distribution.
Packets aren't evenly distributed, so the problem being solved is to reduce the latency caused by this uneven distribution. Some of the unevenness is caused by bursts of packets, but another part is that packets are atomic: if a packet is too large for the available bandwidth (this is the case they focus on when describing the problem), your latency may go up. You can hide this latency without increasing the average bandwidth by "bursting" the bandwidth and making it unevenly distributed.
Of course this happens by temporarily increasing the bandwidth at the micro-level, but not the macro-level.
This is what I think they're trying to solve with m1+d.
edit:
Another way to say it is that bandwidth and latency are naturally coupled. For example, you cannot send a 1500-byte packet over a 1Mb/s link in less than 12ms (assuming store-and-forward). For any given bandwidth, there is a minimum latency for a given packet size.
HFSC allows you to "decouple" latency and bandwidth by temporarily giving a queue more bandwidth, making it act like it has the latency of a higher-bandwidth link, while at the same time not letting it have more than its fair share of bandwidth.
HFSC also has a secondary decoupling of bandwidth and latency: it addresses not the natural coupling, but the pseudo-coupling of bandwidth and latency that nearly all traffic-shaping algorithms seem to have. This coupling isn't fundamental, but a biased characteristic of most algorithms. For example, giving one queue 2× more bandwidth than another will often give it half the latency, even when the link is idle and you attempt to send a single small packet. This is because most algorithms use a kind of "back-off", strict-priority-based scheduling. This causes the shaper to "wait" for free bandwidth, and the less bandwidth a queue has, the longer it waits before it decides "yes, this bandwidth really is free".
HFSC uses "curves" to dynamically change the priority of a queue. If a queue is idle, it will have the highest priority. This means a 64Kbit/s queue can have a higher priority than a 1Gb/s queue. But as soon as a packet goes through that 64Kb queue, its priority suddenly drops. How far it drops relative to other queues depends on the relations of their assigned bandwidths. In this way, HFSC can pretty much guarantee that a queue of a given bandwidth will never have worse latency than a dedicated link of that same bandwidth (ignoring kernel scheduling issues). But a queue can have latency as low as the actual physical link rate allows.
The typical observed latency is probably somewhere in between. In most real networks the vast majority of connections are TCP, which responds to congestion, and a connection is rarely capable of maintaining 100% saturation for more than a short burst. This often means there is a bit of breathing room on the link, so your typical observed latency is quite low, closer to the link rate than the provisioned rate, unless your queue is bufferbloated with a large backlog.
Bufferbloat is the enemy of all.
-
Thanks for the great explanation of HFSC.
I am trying to shape my game traffic (Overwatch) on my network.
I have captured the game's UDP packets in Wireshark; they flow from the game server –-> my IP.
I wanted to calculate m1 as you described above, but I'm confused about which packet I'm supposed to base my calculations on.
Do I pick the largest UDP packet I received, or do I take the average size of 100 packets to calculate m1?
-
The largest. It might be easier to just base m1 on a 1500-byte packet, or whatever the MTU is. Unless the bandwidth allocation for your queues is very fine-grained, overallocating isn't a bad choice to ensure that Overwatch never gets starved of latency/bandwidth.
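Sizing m1 from the MTU works the same way as the VOIP derivation earlier in the thread. A hypothetical example, assuming a 1500-byte MTU and a 5ms target delay (pick your own delay target; the packet size and delay are the only inputs):

```python
# Worst-case sizing: treat every game packet as MTU-sized,
# then derive the burst rate that sends one within the target delay.
MTU_BYTES = 1500        # assumed MTU
TARGET_DELAY_MS = 5     # assumed per-packet delay goal

m1_bps = MTU_BYTES * 8 * 1000 // TARGET_DELAY_MS
print(m1_bps)   # 2_400_000 -> 2.4Mb of burst covers any packet in 5 ms
```

Any real Overwatch packet is smaller than the MTU, so it only arrives sooner.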