Playing with fq_codel in 2.4

bafonso · Oct 2, 2018, 3:35 PM

@knowbe4 said in Playing with fq_codel in 2.4:

how to do fq_codel on dual wan (load balancing + failover) setup?

I tried to follow jim-p video on fq_codel, it works on single wan. but on dual it's not working.

I have several gateways (WAN + VPNs) and I did it using limiter rules on LAN. They all get put into the same pipe. The tricky part with your setup is that your overall bandwidth is shared over two connections, so if I were you I would use interface QoS per WAN interface and a round-robin WAN assignment, which I'm assuming you are already doing given that you are load balancing. The only way to properly load balance and share bandwidth evenly between your customers would be to weight the random gateway assignment based on real-time bandwidth usage, effectively load balancing the bandwidth as opposed to connections. I'm not sure such a thing exists off the shelf.
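
To make the "weight the gateway assignment by real-time bandwidth usage" idea concrete, here is a minimal Python sketch. It is not a pfSense feature or API; the gateway names, capacities, and usage figures are made up for illustration, and the policy (weights proportional to spare capacity) is just one plausible choice.

```python
import random

def gateway_weights(capacity_kbps, usage_kbps):
    """Return selection weights proportional to each gateway's spare bandwidth."""
    spare = {gw: max(capacity_kbps[gw] - usage_kbps.get(gw, 0), 0)
             for gw in capacity_kbps}
    total = sum(spare.values())
    if total == 0:
        # Every link is saturated: fall back to an even split.
        return {gw: 1 / len(spare) for gw in spare}
    return {gw: s / total for gw, s in spare.items()}

def pick_gateway(capacity_kbps, usage_kbps, rng=random):
    """Pick a gateway for a new connection, favouring the less loaded link."""
    weights = gateway_weights(capacity_kbps, usage_kbps)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

# Hypothetical numbers: WAN1 is 50 Mbit with 40 Mbit in use, WAN2 is 20 Mbit with 5 Mbit in use.
capacity = {"WAN1": 50000, "WAN2": 20000}
usage = {"WAN1": 40000, "WAN2": 5000}
print(gateway_weights(capacity, usage))
print(pick_gateway(capacity, usage))
```

With these numbers WAN2 has more headroom (15 Mbit vs 10 Mbit), so new connections would land on it about 60% of the time even though it is the smaller link.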

zwck · Oct 3, 2018, 2:33 PM

@dtaht

What exactly would you need: a port open on the pfSense box so you could ssh into it and play with fq_codel? Or are you more interested in pure data acquisition with a standard pfSense config using flent? I have not looked into flent; are there some guides I can follow? Or would it be enough to set up an Ubuntu VM and you'll do the rest?

Pentangle · Oct 3, 2018, 1:52 PM

@thenarc Sorry for the delay, this is my first time back in the office since I wrote my last post. I've tried adding masks to the queues but still get connections dying right at the end of the upload test. I can confirm when disabling the limiters the connections stay up.

Pentangle · Oct 3, 2018, 1:58 PM

@tman222 I'm holding my breath, but I think that Quantum change might have fixed this. It's not dropped this time. I'll try further tests later, but I suspect this might be the issue - thanks!

Pentangle · Oct 3, 2018, 2:20 PM

@tman222 Done a bunch more tests and no drops!! :) thanks.

dtaht · Oct 3, 2018, 2:38 PM

@zwck The easiest thing from my perspective would be for you to compile netperf with

./configure --enable-demo

for this platform, run "netserver -N" on the box (open up the relevant port, 12865, on both IPv6 and IPv4 if available), then I can flent test from anywhere if you give me the IP.

That's all flent needs to target tests at a box from the outside world. If you are extra ambitious you can also install irtt (it's written in Go) and open up its UDP port.

Flent has other good guidelines on flent.org if you wish to play with it yourself (the rrul test is the way to get the mostest fastest), but it involves installing a lot of Python code somewhere.

Having ssh into the box also available would let me tcpdump and monitor other things. (I'm not usually big on asking for access of that level; being able to target some tests at the box would be a start.)
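
Before handing out an IP for remote tests, it can help to confirm from an outside machine that the netserver control port is actually reachable through the firewall. A minimal Python sketch follows; the function name is hypothetical, the address in the usage comment is a documentation placeholder, and this only checks that a TCP connection can be opened, not that netperf itself works.

```python
import socket

def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered, unreachable, or timed out.
        return False

# Usage (placeholder address; pass whichever port netserver listens on):
# print(port_open("203.0.113.10", port))
```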

dtaht · Oct 3, 2018, 2:35 PM

@pentangle You shouldn't have to fiddle with the quantum at all. You are saying that a quantum of 300 works when a quantum of 1514 does not? The quantum 300 thing is an optimization for very slow links; it shouldn't "cause connections to drop at the end of a test".

tman222 · Oct 3, 2018, 3:43 PM

Hi @dtaht - the intuition behind my suggestion was to try to make the algorithm more "fair" since VoIP packets tend to be a bit smaller.

I was going off of what I read here:

https://www.bufferbloat.net/projects/codel/wiki/Best_practices_for_benchmarking_Codel_and_FQ_Codel/

"We have generally settled on a quantum of 300 for usage below 100mbit as this is a good compromise between SFQ and pure DRR behavior that gives smaller packets a boost over larger ones."

@Pentangle also mentioned that his connection speed was 80/20.

Is this not the right way to think about the quantum?
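
To illustrate the quoted point about a smaller quantum giving small packets a boost, here is a toy deficit round robin (DRR) scheduler in Python. It is only a sketch: it is not the fq_codel implementation in pfSense (which also has new/old flow lists and CoDel dropping), and the flow names and packet sizes are invented for the example.

```python
from collections import deque

def drr_order(flows, quantum):
    """Toy deficit round robin: return the order in which packets are sent.

    flows maps a flow name to a list of packet sizes in bytes.
    """
    queues = {name: deque(sizes) for name, sizes in flows.items()}
    deficit = {name: 0 for name in flows}
    sent = []
    while any(queues.values()):
        for name, q in queues.items():
            if not q:
                continue
            # Each visit grants the flow one quantum of byte credit.
            deficit[name] += quantum
            while q and q[0] <= deficit[name]:
                size = q.popleft()
                deficit[name] -= size
                sent.append((name, size))
            if not q:
                deficit[name] = 0  # an emptied queue forfeits leftover credit
    return sent

# Hypothetical traffic: one bulk flow of full-size packets, one VoIP-like flow.
flows = {"bulk": [1500] * 2, "voip": [200] * 6}
for quantum in (1514, 300):
    order = drr_order(flows, quantum)
    first_bulk = next(i for i, (name, _) in enumerate(order) if name == "bulk")
    print(quantum, "-> voip packets sent before the first bulk packet:", first_bulk)
```

In this toy run, a quantum of 1514 lets the 1500-byte packet go out immediately, ahead of every small packet, while a quantum of 300 makes the bulk flow wait five rounds to build up enough credit, so all six small packets leave first.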

zwck · Oct 3, 2018, 3:49 PM

@dtaht said in Playing with fq_codel in 2.4:

@zwck The easiest thing from my perspective would be for you to compile netperf with

./configure --enable-demo

for this platform, run "netserver -N" on the box (open up the relevant port, 12865, on both IPv6 and IPv4 if available), then I can flent test from anywhere if you give me the IP.

That's all flent needs to target tests at a box from the outside world. If you are extra ambitious you can also install irtt (it's written in Go) and open up its UDP port.

Flent has other good guidelines on flent.org if you wish to play with it yourself (the rrul test is the way to get the mostest fastest), but it involves installing a lot of Python code somewhere.

Having ssh into the box also available would let me tcpdump and monitor other things. (I'm not usually big on asking for access of that level; being able to target some tests at the box would be a start.)

I mean, I can set up a netperf server behind the pfSense box; running it on the box itself might not be possible for me.

dtaht · Oct 3, 2018, 4:02 PM

@tman222 It is a right way to think about the problem. :) However, the OP was reporting "all my connections drop at the end of the test". There should be a "burp" at the beginning of the test as all the flows start, but I guess I don't know what he means by what he said.

dtaht · Oct 3, 2018, 4:04 PM

@zwck If you can port forward to, or use IPv6 for, a box behind it, that would work. Otherwise I tend to originate flent tests from the box(es) behind to one or more of our flent servers in the cloud.

zwck · Oct 3, 2018, 4:05 PM

@dtaht said in Playing with fq_codel in 2.4:

@zwck If you can port forward or ipv6 for a box behind, that would work. Otherwise I tend to originate flent tests from the box(es) behind to one or more of our flent servers in the cloud.

For that type of test, should I just turn off all traffic shaping, or should I keep it running as I have it configured?

dtaht · Oct 3, 2018, 4:06 PM

If the Netgate folk could make that netperf (and irtt) available, that would be very helpful overall. I really don't trust web-based tests; most of them peak out at ~400mbit in the browser...

dtaht · Oct 3, 2018, 4:07 PM

@zwck Well, we're trying to determine how well the shaper is working, so we'd want it on for a string of tests and off for another string.

zwck · Oct 3, 2018, 4:10 PM

@dtaht I'll set this up tomorrow morning with the simplest shaper (for my WAN interface) and send you the needed information.

uptownVagrant · Oct 3, 2018, 10:59 PM

I've been following this thread for a couple of weeks now but I'm still running into an issue.

If I have an in and out limiter set on the WAN interface, using the exact steps @jimp lays out in the August 2018 hangout, I get packet loss. If I stress the connection using a client visiting dslreports.com/speedtest to create load and I run a constant ping to my WAN_DHCP gateway, I have seconds of echo response loss while the test runs. DSLreports shows overall A, bufferbloat A, quality A. If I move the limiters to the LAN interface, making the needed interface and in/out queue adjustments to the floating rules, I do not see loss and DSLreports shows the same AAA result.

Can anyone else recreate this? My circuit is rated at 50Mbps down and 10Mbps up and I am limiting at 49000Kbps and 9800Kbps respectively after finding this to consistently work well in the past with ALTQ and CODEL on WAN. I am running 2.4.4 CE.

dtaht · Oct 3, 2018, 11:04 PM

You should see some loss (that's the whole point), but ping should lose very few packets. Are you saying ping is dying for many seconds, or just dropping a couple, not in a bunch? It's helpful to look at the retransmits on your dslreports test, as well as at ping itself dropping packets. What do you mean by a "constant ping"? A ping -f test (flooding) WILL drop a ton of ping packets, but a normal ping should drop, oh, maybe 3%?

Part of why I'd like to run flent is I can measure all that. :)

uptownVagrant · Oct 3, 2018, 11:40 PM

@dtaht Thanks for the quick response. I am pinging every 500 ms during the dslreports test with a timeout of 400 ms for the response. Typical latency between my WAN interface and gateway is less than 3 ms. Over the course of the entire test, download and upload, including the pauses, I see the following.

Packets: sent=86, rcvd=62, error=0, lost=24 (27.9% loss) in 42.513662 sec
RTTs in ms: min/avg/max/dev: 1.042 / 3.546 / 11.348 / 1.575
Bandwidth in kbytes/sec: sent=0.121, rcvd=0.087

I'll take a look at flent and see what I can gather.
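
For comparing runs with the limiters on WAN versus LAN, the loss figure can be recomputed from the counters in a summary line like the one above. This is a small Python sketch; the function name is hypothetical and the regular expression assumes the summary format exactly as pasted in this post.

```python
import re

def parse_ping_summary(line):
    """Extract sent/rcvd/lost counters from a summary line and recompute loss."""
    fields = {k: int(v) for k, v in re.findall(r"(sent|rcvd|lost)=(\d+)", line)}
    fields["loss_pct"] = 100.0 * fields["lost"] / fields["sent"]
    return fields

summary = "Packets: sent=86, rcvd=62, error=0, lost=24 (27.9% loss) in 42.513662 sec"
stats = parse_ping_summary(summary)
print(f"{stats['lost']} of {stats['sent']} lost = {stats['loss_pct']:.1f}%")
```

At one echo request every 500 ms, 86 requests over roughly 42.5 seconds is what you would expect, so the 24 missing replies are genuine losses or timeouts rather than a counting artifact.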

Pentangle · Oct 4, 2018, 12:11 AM

@dtaht I'm happy to expand on it. I understand what you would expect from a "burp" at the start of the test, and I did sometimes see circa 30-40ms of latency at that time, but I never saw (or heard, in the case of my Sonos streaming radio) any drops at the start of the test (or anywhere during the download test), only at the end of the upload test, exactly when the graph dove down from the "hump" it was drawing. I tried it about 6-7 times and it was repeatable: the internet radio would pause for a good 5 seconds and you'd be chucked out of any RDP session to an internet host at that specific time. I changed only the quantum, down to 300 from 1514, and it made the difference: no connections were dropped despite another 7 or so tests today with everyone in the office working at the time and the Sonos playing. It was repeatable, so despite you saying it shouldn't help, it appears that it did.

uptownVagrant · Oct 4, 2018, 4:00 AM

@dtaht Ok, here is what I have set up to test my issue using spare hardware. I've confirmed what I was seeing with other hardware using a different topology and using Flent to produce the traffic. The limiters are 49000Kbps and 9800Kbps.

The test lab is laid out as such - all network connections are copper GigE.

flentuser@netperf2:~/flent$ flent rrul -p all_scaled -l 60 -H 172.16.21.76 -t UptownVagrant -o RRUL_Test001.png

I have included the graph created by Flent as well as attached the pfsense configuration (very close to stock) and the gzip output from Flent. During the RRUL test I was pinging the WAN-DHCP gateway every 500 ms with 60 bytes and a reply timeout of 400 ms for each echo request from the Thinkpad using hrping. I'm seeing huge loss when simulating this tiny traffic while Flent RRUL is running.

Packets: sent=146, rcvd=26, error=0, lost=120 (82.1% loss) in 72.500948 sec
RTTs in ms: min/avg/max/dev: 0.538 / 1.004 / 1.865 / 0.340
Bandwidth in kbytes/sec: sent=0.120, rcvd=0.021

0_1538625522209_config-pfSense.localdomain-20181003200505.xml

0_1538625551026_rrul-2018-10-03T204700.604527.UptownVagrant.flent.gz
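
For anyone who wants to look inside a .flent.gz attachment without installing the full Flent GUI, here is a minimal Python sketch. It assumes the data file is gzip-compressed JSON with a top-level "results" object mapping series names to lists of samples (with null for missing samples); treat that layout as an assumption and check it against the Flent documentation. The function name is hypothetical.

```python
import gzip
import json

def summarize_flent(path):
    """Load a gzip-compressed JSON data file and summarize each result series.

    Returns {series_name: (sample_count, mean_or_None)}.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed layout: {"results": {"<series name>": [values or null, ...]}}
    results = data.get("results", {})
    summary = {}
    for name, values in results.items():
        present = [v for v in values if v is not None]
        mean = sum(present) / len(present) if present else None
        summary[name] = (len(present), mean)
    return summary

# Usage: pass the path to a downloaded data file.
# for name, (count, mean) in summarize_flent("test.flent.gz").items():
#     print(name, count, mean)
```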