pfSense underperforming: high jitter + random packet loss



  • I could use some help in troubleshooting this issue which I've recently uncovered. Some background information on my setup:

    The PC is a Windows 8.1 HTPC that is always on and uses a Ceton tuner to record live TV. It has an i5-2500K processor, an Asus P8Z68-Pro mobo, an Intel PRO/1000 PT Dual Port Server Adapter, and 8 GB of RAM. I use Cox internet on an Arris SB8200 with the 150/10 tier. pfSense runs as a Hyper-V VM with exclusive access to the two NIC interfaces.

    For months I've had pfSense running my LAN with 2 VLANs; VLAN 1 for default network connectivity and VLAN 5 for devices that want to connect to the internet via policy based routing through my privacy VPN. As part of troubleshooting this problem I removed the VLAN configuration and simplified back down to 2 NIC interfaces and disabled the privacy VPN. At the same time I've been working with my ISP to fix numerous noise issues in the HFC plant in my area, so I have smokeping running on an AWS instance hitting my CMTS and Modem separately with pings every 30 seconds to check packet loss. The plant noise is gone but the red flag was some lingering low-grade packet loss I was seeing.

    Having said that: the primary issue I'm having is pfSense routing performance. It introduces unexplained jitter and frequent stalls in packet processing (especially when traffic shaping is turned on) to the point that I notice it quite severely during online gaming and VoIP calls.
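
    For anyone who wants to put numbers on "jitter" and "packet loss" from a series of ping probes, here is a minimal sketch. It assumes the round-trip times have already been collected into a list (with `None` for a lost probe) and defines jitter as the mean absolute difference between consecutive replies, which is only one of several common definitions. The function name and the sample values are made up for illustration and are not measurements from this thread.

```python
from statistics import mean


def summarize_pings(rtts_ms):
    """Summarize ping samples in milliseconds; None marks a lost probe."""
    if not rtts_ms:
        raise ValueError("need at least one probe")
    received = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    # Jitter (as assumed here): mean absolute difference between
    # consecutive replies that actually arrived.
    deltas = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "loss_pct": loss_pct,
        "avg_ms": mean(received) if received else None,
        "max_ms": max(received) if received else None,
        "jitter_ms": mean(deltas) if deltas else 0.0,
    }


# Illustrative samples only: one large spike and two lost probes.
samples = [20.0, 22.0, None, 21.0, 420.0, 23.0, 20.0, None, 21.0, 22.0]
print(summarize_pings(samples))
```

    A single large spike dominates both the maximum and the jitter figure, which is why an average latency on its own can look fine while gaming and VoIP feel broken.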

    I finally realized it was pfSense after getting frustrated and trying the following which failed to net any change whatsoever:

    1. Starting a new pfSense VM from scratch to validate it was not some config change I made
    2. Booting pfSense directly without a hypervisor to validate it was not Hyper-V getting in the way
    3. Moving to an entirely new x86 machine and running natively to validate it was not some Z68 or i5-2500K latency issue

    I was finally convinced it was actually pfSense itself that was to blame when I booted up an old Untangle VM I had lying around from back when I tried it out. The jitter and momentary connection stalls were immediately gone and I haven't had any measurable packet loss while running that VM. To be clear, this is not coming from my ISP or CPE; within the span of about 1 minute I can shut down the pfSense VM and boot up the Untangle VM, and it all disappears.

    Here are some representative tests I ran back to back as quickly as possible to maintain pretty consistent network conditions (click on the download/upload graph label to view phase bufferbloat):
    pfSense 2.4.2, no traffic shaper (1040ms spike during download, single 420ms spike during upload)
    pfSense 2.4.2, codelq traffic shaper (2 small 400ms download spikes, 17 ~300+ms upload spikes)
    Untangle 13.1.0, no traffic shaper
    Untangle 13.1.0, fq_codel shaper

    The fq_codel result on Untangle is, as expected, pretty much perfect. The pfSense results on the other hand look ridiculous, and are far from the worst I've actually collected.

    I've tried going through the pfSense wikis for Low Throughput Troubleshooting and Tuning and Troubleshooting Network Cards. I've tried searching the forums and web to no avail. I'm not that familiar with FreeBSD but nothing immediately stands out to me as a red flag (system / interrupt load appears low, etc). I have all NIC offloading features disabled on both the guest and host (no LRO, TSO, etc). Hell, I even replicated this problem on another machine I had lying around with a fresh default pfSense install.

    I'm really at a loss and could use some help regarding next steps for troubleshooting to bring pfSense performance back into line. I much prefer pfSense's interface/power to Untangle and would like to get back to it ASAP.



  • My results on pfSense 2.4.2 with no traffic shaping:

    http://www.dslreports.com/speedtest/26817207



  • Thanks for the results Chris. Can you try with Hi-Res bufferbloat enabled?

    I really don't think it is the NIC, but out of sheer "I'm out of other ideas" desperation I ordered an I350T2V2 from Arrow to test.



  • Sure, that changes the results.

    1. No shaping

    http://www.dslreports.com/speedtest/26818126

    2. I enabled fq_codel and limiters in pfSense, but with those bufferbloat settings the internet still came to a crawl

    http://www.dslreports.com/speedtest/26818505



  • Thanks for that. There are maybe some hints of a similar (or the same?) problem in your results, but nothing particularly conclusive or definitive. May I ask what hardware you're running on?

    For those playing along, here's how pfSense is comparing to Untangle on the exact same hardware minutes apart:



  • This is on a J1900.
    I did another test with the traffic shaper, enabling codel in every queue.
    The internet kept working fine while testing.

    http://www.dslreports.com/speedtest/26819500

    I also have Untangle; I'll give it a spin.



  • Dude, I tested UT on the same HW. The test errored out a few times at first, then I got these results; pfSense did better with HFSC and codel.

    http://www.dslreports.com/speedtest/26821430



  • UT proof




  • @Chrismallia:

    Dude, I tested UT on the same HW. The test errored out a few times at first, then I got these results; pfSense did better with HFSC and codel.

    http://www.dslreports.com/speedtest/26821430

    Bizarre. What NIC are you running?



  • The NIC is a server-grade Intel dual-port adapter.



  • Tested using a brand new Intel I350T2V2, exactly the same results.



  • Sure, my ping spikes up to 300 ms, but it comes back down and I get an A with no interruption to services; same on UT. Can you post your results with the Intel NIC? Was the internet slow while performing the test? Try the traffic shaper with HFSC, enable codel on every queue, and post your results.



  • The issue seems to be entirely with ALTQ shaping.

    I decided to spend the day booted natively into pfSense (home alone, so nobody to be bothered with intermittent internet and no access to the TV) to troubleshoot this.

    Ultimately, after different iterations of ALTQ shapers with and without codel, I couldn't find a single one that offered even remotely acceptable performance and didn't introduce gigantic latency / buffering spikes.

    I decided to try this: https://forum.pfsense.org/index.php?topic=126637.0

    Lo and behold, it worked like a charm. Using dummynet and real fq_codel on limiters gives me the results I would expect, without the ALTQ insanity.
    https://www.dslreports.com/speedtest/26865693

    I don't know if I'm the only one experiencing this, but it honestly seems like ALTQ is currently introducing side effects worse than the problems it is supposed to fix.



  • pfSense 2.4.3 alphabeta built on Sat Dec 16 11:23:26 CST 2017,
    Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz e3c226d2i (2xi210 LAN)
    tunables
    kern.ipc.maxsockbuf 256000000
    hw.igb.rxd="4096"
    hw.igb.txd="4096"
    net.inet.tcp.syncache.hashsize=1024
    net.inet.tcp.syncache.bucketlimit=100
    net.isr.defaultqlimit=4096
    net.link.ifqmaxlen=10240
    hw.igb.rx_process_limit="-1"
    hw.igb.num_queues=2
    dev.igb.0.fc=0
    dev.igb.1.fc=0
    kern.ipc.nmbjumbo9="20000"
    kern.ipc.nmbclusters="1000000"
    WAN is PPPoE 300/300Mbit over gigabit LAN to ISP router (some CISCO with 10G fiber optic connection)

    FQ_CODEL enabled,  Hi-Res bufferbloat and other settings as posted by NaterGator:

    http://www.dslreports.com/speedtest/26877901

    FQ_CODEL enabled,  Hi-Res bufferbloat and 30/30 streams:

    http://www.dslreports.com/speedtest/26877933

    FQ_CODEL disabled, Hi-Res bufferbloat and other settings as posted by NaterGator:

    http://www.dslreports.com/speedtest/26877771

    FQ_CODEL disabled, Hi-Res bufferbloat and 30/30 streams:

    http://www.dslreports.com/speedtest/26877806

    FQ_CODEL disabled, no tunables, Hi-Res bufferbloat and other settings as posted by NaterGator:

    http://www.dslreports.com/speedtest/26877572

    FQ_CODEL disabled, no tunables, Hi-Res bufferbloat and 30/30 streams:

    http://www.dslreports.com/speedtest/26877682

    I do not see any huge difference, just some fluctuations that I think are mostly on the ISP side.

    If you want me to test the ALTQ shaper, please provide a sample configuration. But really, my experience with ALTQ has not been very good; at the very least, it has about twice the bandwidth overhead of the IPFW shaper.



  • Interesting results… I wonder if asymmetric link bandwidth is having a greater influence?

    This was my "typical" basic altq test with no limiter/fq_codel: https://i.imgur.com/d1vQLFc.png (only the one shaper on the WAN interface)

    I also tried the configuration outlined here: http://www.speedtest.net/insights/blog/maximized-speed-non-gigabit-internet-connection/

    Also...go bolts?



  • ALTQ CODELQ, NaterGator settings: http://www.dslreports.com/speedtest/27005845  As you can see, dslreports automatically dropped to 18/6 streams.
    And with 30/30 streams we have a problem! Three attempts to start the test got stuck on idle latency testing with spikes (failed due to overall timeout. error:2), and in the end I got this with 24/24: http://www.dslreports.com/speedtest/27006168
    And a repeat test with FQ_CODEL and 30/30: http://www.dslreports.com/speedtest/27006586
    There is something broken in ALTQ CODELQ…



  • Thanks for the extra effort and offering some level of confirmation that I'm not totally crazy. I'm not sure if this is an issue I should submit to the pfSense tracker or if this belongs upstream on FreeBSD's end.

    FWIW: To reduce variables I use the preferences on the dslreports test to set fixed servers that I know are close by and a fixed number of streams.



  • @w0w

    You get great results :) using fq_codel. The lowest ping spike I could get was 150-something, and only on download; upload is fine. But I think the ISP matters, and having a symmetrical speed also makes a difference.



  • Chrismallia, yes, it's the ISP; it's just a very good ISP network, at least in my location.
    NaterGator, it's FreeBSD, but I don't think anybody cares about ALTQ CODELQ; you have an alternative in HFSC with a codel-enabled queue. I think in the next 3-5 years we will see some progress for IPFW or ALTQ. It doesn't matter which, since both need their code rewritten from scratch: because they use 32-bit integers, neither supports modern traffic bandwidth (over 4 Gbit/s).
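
    The "over 4 Gbit/s" figure follows from simple arithmetic. The sketch below assumes a rate stored as bits per second in an unsigned 32-bit integer; that is an assumption for illustration, and the actual fields used inside ALTQ or dummynet may be defined differently. `as_u32` is a hypothetical helper that mimics the wraparound.

```python
U32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold


def as_u32(value):
    """Wrap a non-negative integer the way an unsigned 32-bit field would."""
    return value & U32_MAX


# Highest rate representable if the field counts bits per second.
print(f"max rate: {U32_MAX / 1e9:.2f} Gbit/s")

# A 10 Gbit/s link does not fit, so the stored value silently wraps.
ten_gbit = 10_000_000_000
print(f"10 Gbit/s stored as: {as_u32(ten_gbit)} bit/s")
```

    Under that assumption the ceiling sits at roughly 4.29 Gbit/s, and anything faster is stored as a much smaller, wrong rate rather than being rejected.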



  • @w0w:

    NaterGator, it's FreeBSD, but I don't think anybody cares about ALTQ CODELQ; you have an alternative in HFSC with a codel-enabled queue. I think in the next 3-5 years we will see some progress for IPFW or ALTQ. It doesn't matter which, since both need their code rewritten from scratch: because they use 32-bit integers, neither supports modern traffic bandwidth (over 4 Gbit/s).

    Hmm, I do see this issue in HFSC with and without codel enabled. What I'm saying is any altq enabled shaping at all triggers the issue.



  • w0w

    What is strange to me is that with no traffic shaping I get low ping spikes on download and high ping spikes on upload, but when I enable any shaping, including fq_codel, I get low ping spikes on upload and high ping spikes on download. See my results in post 3 if you like; I can't understand it.



  • NaterGator, I have not tested HFSC for a long time, but I can test it too, later this week. Can you provide more of the queue settings you used? Did you change any other settings like RED or ECN, priority, or queue limit in packets?



  • @Chrismallia:

    w0w

    What is strange to me is that with no traffic shaping I get low ping spikes on download and high ping spikes on upload, but when I enable any shaping, including fq_codel, I get low ping spikes on upload and high ping spikes on download. See my results in post 3 if you like; I can't understand it.

    I have had some similar results, but I am not sure whether it's a DSLreports bug or something else: their server, or other problems, e.g. the ISP.
    You should repeat your tests at least 10 times, with and without traffic shaping, before drawing any conclusions.
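
    One way to compare repeated runs is to look at the median and the worst case per configuration instead of a single test. This is only a sketch: it assumes you note down the peak latency of each run by hand, and the numbers below are made up for illustration, not results from this thread.

```python
from statistics import median


def compare_runs(runs_by_config):
    """Summarize peak latency (ms) over repeated runs for each configuration."""
    summary = {}
    for name, peaks_ms in runs_by_config.items():
        summary[name] = {
            "runs": len(peaks_ms),
            "median_ms": median(peaks_ms),
            "worst_ms": max(peaks_ms),
        }
    return summary


# Made-up peak latencies from 10 runs per configuration.
runs = {
    "no shaping": [310, 420, 280, 1040, 350, 300, 390, 330, 410, 360],
    "with shaping": [45, 60, 52, 48, 300, 55, 50, 47, 58, 51],
}
for name, stats in compare_runs(runs).items():
    print(name, stats)
```

    The median shows what a typical run looks like, while the worst case keeps the occasional stall visible instead of averaging it away.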



  • http://www.dslreports.com/speedtest/27223186 ALTQ HFSC codel enabled queue + ECN, dslreports 24/6 HD enabled
    https://www.dslreports.com/speedtest/27223235 ALTQ HFSC codel enabled queue + ECN, dslreports 30/30 HD enabled (twice it failed with "failed due to overall timeout. error:2")
    http://www.dslreports.com/speedtest/27223412 FQ_CODEL again

    NaterGator, did you try setting the bandwidth limit for the ALTQ shapers tighter than for FQ_CODEL? I think that could be a reason for the spikes you got. The overhead for ALTQ is slightly bigger, so it's possible that it fails at the bandwidth limit. To make sure, set the bandwidth to 50% of the total.

    I cannot explain why dslreports often fails to start with ALTQ CODEL while FQ_CODEL almost never fails. Maybe it's by design of DSLreports or ALTQ CODEL.  :D
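
    For the headroom test suggested above, the shaper limits are just a fraction of the link speed. A minimal sketch, assuming the 150/10 Mbit/s tier mentioned in the first post; `shaper_limits` is a hypothetical helper, and the 95% and 85% steps are arbitrary example values rather than recommendations.

```python
def shaper_limits(down_mbps, up_mbps, fraction):
    """Return (download, upload) shaper limits as a fraction of link speed."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return down_mbps * fraction, up_mbps * fraction


# Step the limit down until the spikes either disappear or clearly do not.
for fraction in (0.95, 0.85, 0.50):
    down, up = shaper_limits(150, 10, fraction)
    print(f"{fraction:.0%}: {down:.1f} down / {up:.1f} up Mbit/s")
```

    If the spikes are still there at 50%, the bandwidth limit is unlikely to be the cause.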



  • @w0w:

    NaterGator, did you try setting the bandwidth limit for the ALTQ shapers tighter than for FQ_CODEL? I think that could be a reason for the spikes you got. The overhead for ALTQ is slightly bigger, so it's possible that it fails at the bandwidth limit. To make sure, set the bandwidth to 50% of the total.

    Oh yes, I tried some absurdly low limits just to be absolutely positive I had plenty of headroom, for example:
    https://www.dslreports.com/speedtest/25228491

    The insane, spiky buffering effect from ALTQ is definitely still there.



  • Did you try ECN on ALTQ? It is enabled by default on IPFW FQ_CODEL.



  • Yes, I tried with and without ECN.