PfSense underperforming, high jitter + random packet loss
-
I could use some help troubleshooting an issue I've recently uncovered. Some background on my setup:
The PC is a Windows 8.1 HTPC that is always on and uses a Ceton tuner to record live TV. It has an i5-2500K processor, an Asus P8Z68-Pro mobo, an Intel PRO/1000 PT Dual Port Server Adapter, and 8 GB of RAM. I'm on Cox internet with an Arris SB8200 on the 150/10 tier. pfSense runs as a Hyper-V VM with exclusive access to the two NIC interfaces.
For months I've had pfSense running my LAN with two VLANs: VLAN 1 for default network connectivity and VLAN 5 for devices that connect to the internet via policy-based routing through my privacy VPN. As part of troubleshooting this problem I removed the VLAN configuration, simplified back down to two NIC interfaces, and disabled the privacy VPN. At the same time I've been working with my ISP to fix numerous noise issues in the HFC plant in my area, so I have smokeping running on an AWS instance hitting my CMTS and modem separately with pings every 30 seconds to check packet loss. The plant noise is gone, but the red flag was some lingering low-grade packet loss I was still seeing.
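Smokeping already handles this kind of monitoring well; for anyone who wants the same loss check without it, here's a minimal sketch of the idea in Python (`parse_loss` and `packet_loss_pct` are names I made up; the regex matches the summary line of BSD/Linux ping):

```python
import re
import subprocess

def parse_loss(ping_output: str) -> float:
    """Pull the loss percentage out of ping's summary line."""
    m = re.search(r"([\d.]+)% packet loss", ping_output)
    # Treat a missing summary line as total loss (host unreachable, etc.)
    return float(m.group(1)) if m else 100.0

def packet_loss_pct(host: str, count: int = 5) -> float:
    # Fire `count` ICMP echoes at `host` and report loss as a percentage.
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return parse_loss(out)
```

Run it in a cron job against the CMTS and the modem separately and log the two numbers, and you get roughly the same signal smokeping graphs.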
Having said that: the primary issue I'm having is pfSense routing performance. It introduces unexplained jitter and frequent stalls in packet processing (especially when traffic shaping is turned on) to the point that I notice it quite severely during online gaming and VoIP calls.
I finally realized it was pfSense after getting frustrated and trying the following which failed to net any change whatsoever:
- Starting a new pfSense VM from scratch, to rule out some config change I made
- Booting pfSense directly without a hypervisor, to rule out Hyper-V getting in the way
- Moving to an entirely new x86 machine and running natively, to rule out some Z68 or i5-2500K latency issue
I was finally convinced it was actually pfSense itself to blame when I decided to boot up an old Untangle VM I had lying around from back when I tried it. The jitter and momentary connection stalls were immediately gone, and I haven't had any measurable packet loss while running that VM. To be clear, this is not coming from my ISP or CPE: in the span of about one minute I shut down the pfSense VM, booted up the Untangle VM, and it all disappeared.
Here are some representative tests I ran back to back as quickly as possible to maintain pretty consistent network conditions (click on the download/upload graph label to view phase bufferbloat):
pfSense 2.4.2, no traffic shaper (1040ms spike during download, single 420ms spike during upload)
pfSense 2.4.2, CODELQ traffic shaper (2 small 400ms download spikes, 17 ~300+ms upload spikes)
Untangle 13.1.0, no traffic shaper
Untangle 13.1.0, fq_codel shaper
The fq_codel result on Untangle is, as expected, pretty much perfect. The pfSense results, on the other hand, look ridiculous, and are far from the worst I've actually collected.
I've gone through the pfSense wiki's "Low Throughput Troubleshooting and Tuning" and "Troubleshooting Network Cards" pages. I've searched the forums and the web to no avail. I'm not that familiar with FreeBSD, but nothing immediately stands out as a red flag (system/interrupt load appears low, etc.). I have all NIC offloading features disabled on both the guest and host (no LRO, TSO, etc.). Hell, I even replicated this problem on another machine I had lying around with a fresh default pfSense install.
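For anyone following along, the offloads in question can be checked and toggled from a FreeBSD shell roughly like this (`igb0` is an assumed interface name; on pfSense the supported way is the checkboxes under System > Advanced > Networking, this is just the underlying knob):

```shell
# Disable the common hardware offloads on the NIC (igb0 assumed).
# These correspond to the offload checkboxes in the pfSense GUI.
ifconfig igb0 -tso -lro -txcsum -rxcsum

# Verify the flags no longer appear in the interface's options list
ifconfig igb0 | grep options
```

Note that changes made this way don't survive a reboot, which is why the GUI setting is preferable on pfSense.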
I'm really at a loss and could use some help regarding next steps for troubleshooting to bring pfSense performance back into line. I much prefer pfSense's interface/power to Untangle and would like to get back to it ASAP.
-
My results: pfSense 2.4.2 with no traffic shaping
http://www.dslreports.com/speedtest/26817207
-
Thanks for the results Chris. Can you try with Hi-Res bufferbloat enabled?
I really don't think it is the NIC, but out of sheer "I'm out of other ideas" desperation I ordered an I350T2V2 from Arrow to test.
-
Sure, that changes the results:
1. No shaping
http://www.dslreports.com/speedtest/26818126
2. I enabled fq_codel and limiters in pfSense, but even with those bufferbloat settings the internet still came to a crawl
http://www.dslreports.com/speedtest/26818505
-
Thanks for that. There are maybe some hints of a similar (or the same?) problem in your results, but nothing particularly conclusive or definitive. May I ask what hardware you're running on?
For those playing along, here's how pfSense is comparing to Untangle on the exact same hardware minutes apart:
-
This is on a J1900.
I did another test with the traffic shaper, enabling CoDel on every queue; the internet kept working fine while testing:
http://www.dslreports.com/speedtest/26819500
I also have Untangle and will give it a spin.
-
Dude, I tested UT on the same HW. It errored out the test a few times at first, then I got these; pfSense did better with HFSC and CoDel.
http://www.dslreports.com/speedtest/26821430
-
UT proof
-
Dude, I tested UT on the same HW. It errored out the test a few times at first, then I got these; pfSense did better with HFSC and CoDel.
http://www.dslreports.com/speedtest/26821430
Bizarre. What NIC are you running?
-
The NIC is an Intel dual-port server-grade adapter.
-
Tested using a brand new Intel I350T2V2, exactly the same results.
-
My ping spikes up to 300 ms, sure, but it comes back down and I get an A grade with no interruption to services; same on UT. Can you post your results with the Intel NIC? Was the internet slow while performing the test? Try the traffic shaper with HFSC, enable CoDel on every queue, and post your results.
-
The issue seems to be entirely with ALTQ shaping.
I decided to spend the day booted natively into pfSense (home alone, so nobody to be bothered with intermittent internet and no access to the TV) to troubleshoot this.
Ultimately, after trying different iterations of ALTQ shapers with and without CoDel, I couldn't find a single one that offered even remotely acceptable performance and didn't introduce gigantic latency/buffering spikes.
I decided to try this: https://forum.pfsense.org/index.php?topic=126637.0
Lo and behold, it worked like a charm. Using dummynet and real fq_codel on limiters gives me the results I would expect, without the ALTQ insanity.
https://www.dslreports.com/speedtest/26865693
I don't know if I'm the only one experiencing this, but it honestly seems like ALTQ is currently introducing side effects worse than the problems it is supposed to fix.
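For anyone who doesn't want to dig through that thread, the dummynet fq_codel approach boils down to something like the following (the rates are assumptions based on my 150/10 tier; on pfSense you'd normally build the limiters in the GUI and attach them to firewall rules rather than running `dnctl` by hand):

```shell
# Download limiter: pipe a bit under the provisioned 150 Mbit/s rate,
# with an fq_codel scheduler attached to it
dnctl pipe 1 config bw 140Mbit/s
dnctl sched 1 config pipe 1 type fq_codel
dnctl queue 1 config pipe 1

# Upload limiter: same idea, a bit under the 10 Mbit/s rate
dnctl pipe 2 config bw 9Mbit/s
dnctl sched 2 config pipe 2 type fq_codel
dnctl queue 2 config pipe 2
```

Traffic then gets steered into queue 1 (in on WAN) and queue 2 (out on WAN) via firewall rules. Shaping slightly below the provisioned rate is what keeps the queue in dummynet, where fq_codel can manage it, instead of in the modem's buffer.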
-
pfSense 2.4.3 alpha, built on Sat Dec 16 11:23:26 CST 2017
Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz, E3C226D2I board (2x i210 LAN)
Tunables:
kern.ipc.maxsockbuf=256000000
hw.igb.rxd="4096"
hw.igb.txd="4096"
net.inet.tcp.syncache.hashsize=1024
net.inet.tcp.syncache.bucketlimit=100
net.isr.defaultqlimit=4096
net.link.ifqmaxlen=10240
hw.igb.rx_process_limit="-1"
hw.igb.num_queues=2
dev.igb.0.fc=0
dev.igb.1.fc=0
kern.ipc.nmbjumbo9="20000"
kern.ipc.nmbclusters="1000000"
WAN is PPPoE 300/300 Mbit over gigabit LAN to the ISP router (some Cisco with a 10G fiber optic connection).
FQ_CODEL enabled, Hi-Res bufferbloat and other settings as posted by NaterGator:
http://www.dslreports.com/speedtest/26877901
FQ_CODEL enabled, Hi-Res bufferbloat and 30/30 streams:
http://www.dslreports.com/speedtest/26877933
FQ_CODEL disabled, Hi-Res bufferbloat and other settings as posted by NaterGator:
http://www.dslreports.com/speedtest/26877771
FQ_CODEL disabled, Hi-Res bufferbloat and 30/30 streams:
http://www.dslreports.com/speedtest/26877806
FQ_CODEL disabled, no tunables, Hi-Res bufferbloat and other settings as posted by NaterGator:
http://www.dslreports.com/speedtest/26877572
FQ_CODEL disabled, no tunables, Hi-Res bufferbloat and 30/30 streams:
http://www.dslreports.com/speedtest/26877682
I do not see any huge difference, just some fluctuations that I think are mostly on the ISP side.
If you want me to test the ALTQ shaper, please provide a sample configuration. But really, my experience with ALTQ has not been great; it has twice as much bandwidth overhead compared to the IPFW shaper.
-
Interesting results… I wonder if asymmetric link bandwidth is having a greater influence than I assumed?
This was my "typical" basic altq test with no limiter/fq_codel: https://i.imgur.com/d1vQLFc.png (only the one shaper on the WAN interface)
I also tried the configuration outlined here: http://www.speedtest.net/insights/blog/maximized-speed-non-gigabit-internet-connection/
Also...go bolts?
-
ALTQ CODELQ, NaterGator settings — http://www.dslreports.com/speedtest/27005845 As you can see, dslreports automatically dropped to 18:6 streams.
And with 30/30 streams we have a problem! The triple-test start got stuck on idle latency testing with spikes (failed due to overall timeout, error: 2), and at the end I got this with 24/24: http://www.dslreports.com/speedtest/27006168
And repeat test with FQ_CODEL and 30/30 — http://www.dslreports.com/speedtest/27006586
There is something broken in ALTQ CODELQ…
-
Thanks for the extra effort and offering some level of confirmation that I'm not totally crazy. I'm not sure if this is an issue I should submit to the pfSense tracker or if this belongs upstream on FreeBSD's end.
FWIW: To reduce variables I use the preferences on the dslreports test to set fixed servers that I know are close by and a fixed number of streams.
-
You get great results :) using fq_codel. The minimum ping spike I could get was 150-something ms, and only on download; upload is fine. But I think the ISP matters, and having symmetrical speed also makes a difference.
-
Chrismallia, yes it's the ISP; just a very good ISP network, at least in my location.
NaterGator, it's FreeBSD, but I don't think anybody cares about ALTQ CODELQ; you have an alternative with HFSC and a CoDel-enabled queue. I think in the next 3-5 years we will see some progress on IPFW or ALTQ. Either way, they both need their code rewritten from scratch: because they use 32-bit integers, neither supports modern traffic bandwidth (over 4 Gbit/s).
-
@w0w:
NaterGator, it's FreeBSD, but I don't think anybody cares about ALTQ CODELQ; you have an alternative with HFSC and a CoDel-enabled queue. I think in the next 3-5 years we will see some progress on IPFW or ALTQ. Either way, they both need their code rewritten from scratch: because they use 32-bit integers, neither supports modern traffic bandwidth (over 4 Gbit/s).
Hmm, I do see this issue in HFSC with and without CoDel enabled. What I'm saying is that any ALTQ shaping at all triggers the issue.
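As an aside, the "over 4 Gigs/sec" ceiling w0w mentions falls straight out of a 32-bit counter, assuming the bandwidth is stored as bits per second in an unsigned 32-bit integer:

```python
# A 32-bit unsigned integer holding bandwidth in bits per second
# tops out just under 4.3 Gbit/s, hence the shaper ceiling.
max_bps = 2**32 - 1        # 4294967295 bps
max_gbps = max_bps / 1e9   # ≈ 4.29 Gbit/s
print(max_gbps)
```

So irrelevant for my 150/10 cable connection, but a real limit once multi-gig WANs are involved.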