pfsense latency spikes in ESXi


  • Hello folks,

    I've been running pfsense virtualized in ESXi for a few years now on https://www.supermicro.com/en/products/system/midtower/5028/SYS-5028D-TN4T.cfm with plenty of free resources available (32 GB RAM, a number of 8 TB HDD drives), along with Plex and a media share on separate VMs.

    I've been experiencing networking issues with this setup for a while now and I'm looking for a way to debug it properly and find a permanent solution.

    I use 2 x 1 Gb ports, which ESXi detects as Intel Corporation I350 Gigabit (igbn driver), and 1 x 10 Gb port, which ESXi detects as Intel(R) Ethernet Connection X552/X557-AT 10GBASE-T (ixgben driver). They carry 2 ISP uplinks (500 Mbit and 1 Gbit, bridge mode) and 1 LAN link to a physical 1 Gb switch, with cat6e all over the house.
    These ports are used in two vSwitches and separated into port groups, which are used by the pfsense VM and other VMs.
    Pfsense handles all the networking across my network of VMs and physical devices (NAT, DHCP, etc.) plus an OpenVPN server. Pfsense uses VMXNET3 adapters, and hardware offloading is disabled in the pfsense settings as recommended by the virtualized pfsense documentation. A CoDel limiter is enabled.
    The pfsense VM itself is barely loaded (load average 0.14, 0.14, 0.10) with 2 x vCPU and consumes 18% of its memory (1987 MiB in total).
    VM's HW settings:

    • Hardware virtualization (Expose hardware assisted virtualization to the guest OS) - disabled;
    • Performance Counters (Enable virtualized CPU performance counters) - disabled;
    • I/O MMU - disabled;

    My rounds of debugging:

    1. At first, I thought it was cable-related, so I re-crimped the cables a good few times and checked them with a tester and iperf - didn't help.
    2. Replaced the network switch - didn't help.
    3. Upgraded ESXi from 6.5 to 7.x, the motherboard firmware to the latest available, and pfsense from 2.4 to the latest 2.5-p1 - didn't help.
    4. Replaced the only memory stick in the server - apparently, it had been causing loads of sudden packet drops and ping spikes from <1 ms to 200 ms+. This helped a lot, but ping spikes still remain, although they're not as bad or as frequent as they used to be. Jumps from <1 ms to 10-30 ms still happen.
    5. Tried disabling the Net.Vmxnet3SwLRO, Net.UseHwTSO and Net.UseHwTSO6 parameters on the ESXi host - didn't help.

    Could you please suggest what I might've missed? I'm getting pissed off by the occasional ping spikes that hit at just the wrong moment in online FPS games like COD: Warzone and others, as they introduce lag.
    I'm even considering a separate HW box to run pfsense on, but everywhere I checked, people don't seem to have any issues with virtualized pfsense instances whatsoever. I can't figure out whether it's an issue with the ESXi / pfsense settings or with the motherboard or network adapters.

    Thanks
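
    To put numbers on spikes like these, it can help to log ping output and flag the outliers automatically. The following is a minimal sketch, not a definitive tool: `parse_rtts` and `find_spikes` are hypothetical helpers, the `factor` and `floor_ms` thresholds are arbitrary assumptions, and the sample output is made up for illustration.

```python
import re
import statistics

# Matches the "time=12.3 ms" field printed by common ping implementations.
RTT_PATTERN = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(ping_output):
    """Extract round-trip times (in ms) from raw ping output."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def find_spikes(rtts, factor=5.0, floor_ms=5.0):
    """Return (index, rtt) pairs that sit well above the median RTT.

    A sample counts as a spike if it exceeds both `floor_ms` and
    `factor` times the median. Both thresholds are illustrative.
    """
    if not rtts:
        return []
    baseline = statistics.median(rtts)
    limit = max(floor_ms, factor * baseline)
    return [(i, r) for i, r in enumerate(rtts) if r > limit]


# Made-up sample: three sub-millisecond replies and one spike.
SAMPLE = """\
64 bytes from 192.168.1.1: icmp_seq=1 ttl=64 time=0.412 ms
64 bytes from 192.168.1.1: icmp_seq=2 ttl=64 time=0.388 ms
64 bytes from 192.168.1.1: icmp_seq=3 ttl=64 time=27.9 ms
64 bytes from 192.168.1.1: icmp_seq=4 ttl=64 time=0.401 ms
"""

if __name__ == "__main__":
    print(find_spikes(parse_rtts(SAMPLE)))
```

    Running the same script against the LAN gateway and against a WAN target at the same time shows whether the spikes start on the LAN side or only beyond the firewall.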


  • Damn, I even tried the latest OPNsense - didn't help, same story. It also loses its settings entirely after a hard reset due to power loss, for whatever reason.


  • Your latencies are WAN side?

    Who is your ISP, is your modem in bridged mode?

    My understanding is that if you remove pfSense from the mix and only use your ISP's kit, it will work fine. I also believe this issue is not apparent in the 2.5.x builds, though I've not tested this myself. One other suggestion that seems to work is to put your ISP's modem back in router mode and double-NAT; while not ideal, apparently the issue also goes away.

    The issue seems to affect ESXi more than other hypervisors.

    Here is my post on the same thing (it sounds similar to yours)
    https://forum.netgate.com/topic/155642/troubleshooting-wan-latency

    I have CoDel limiters in place which help a little, but I still get WAN spikes, and no matter how many times people tell me it's my ISP, if I remove PfSense and go back to the junk my ISP provides, I do not face these issues.

    My WAN spikes are not limited to when the WAN is in use. I also see huge spikes when only connected to work via VPN, barely using 1 Mbit/s, yet calls can get cut off because of the WAN spikes. I have found ways to limit this but it's still a PITA.


  • @rod-it puma6 is involved in my case too. Latency spikes happen both on WAN and LAN. I disabled CoDel completely because it made things worse. I don't quite understand why pfsense and superhub don't like each other that much.
    Isolated testing of my network showed that:

    • only pfsense causes LAN latency spikes. If it's excluded from the network everything's fine.
    • pfsense causes packet loss on the last hop to google according to MTR, if connected directly to the router in bridge mode there's no packet loss whatsoever
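
    For comparing the two paths, the text output of `mtr --report` can be parsed so that loss at the final hop is read off directly. This is a rough sketch that assumes the default report column layout (Loss%, Snt, Last, Avg, Best, Wrst, StDev); `parse_mtr_report` and `final_hop` are hypothetical helpers and the sample report is invented for illustration.

```python
import re

# One hop line of a default `mtr --report`, e.g.
#   3.|-- dns.google   20.0%   10   12.1  13.0  11.5  41.2   8.9
HOP_PATTERN = re.compile(
    r"^\s*(?P<hop>\d+)\.\|--\s+(?P<host>\S+)\s+"
    r"(?P<loss>[\d.]+)%\s+(?P<sent>\d+)\s+"
    r"(?P<last>[\d.]+)\s+(?P<avg>[\d.]+)\s+"
    r"(?P<best>[\d.]+)\s+(?P<worst>[\d.]+)\s+(?P<stdev>[\d.]+)",
    re.MULTILINE,
)


def parse_mtr_report(text):
    """Return one dict per hop found in the report text."""
    hops = []
    for m in HOP_PATTERN.finditer(text):
        hops.append({
            "hop": int(m["hop"]),
            "host": m["host"],
            "loss_pct": float(m["loss"]),
            "sent": int(m["sent"]),
            "avg_ms": float(m["avg"]),
            "worst_ms": float(m["worst"]),
        })
    return hops


def final_hop(hops):
    """Return the last hop (the destination), or None if there are no hops."""
    return hops[-1] if hops else None


# Invented sample report, not taken from this thread.
SAMPLE = """\
HOST: laptop                      Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- 192.168.1.1                0.0%    10    0.4   0.4   0.3   0.6   0.1
  2.|-- 10.0.0.1                   0.0%    10    9.8  10.2   8.9  14.0   1.5
  3.|-- dns.google                20.0%    10   12.1  13.0  11.5  41.2   8.9
"""

if __name__ == "__main__":
    last = final_hop(parse_mtr_report(SAMPLE))
    print(last["host"], last["loss_pct"], last["worst_ms"])
```

    Only the destination row is compared here, since intermediate routers often rate-limit ICMP replies and can show loss that does not affect end-to-end traffic.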

    Will try to move pfsense out of ESXi to HW box, hope it'll help. This issue is very annoying because at times even simple web browsing turns into a struggle. OPNsense has the same issues. Especially noticeable with UDP traffic.

    FYI: this issue is still under discussion here https://forum.kitz.co.uk/index.php/topic,24600.195.html and useful tips regarding unbound settings were given here: https://forum.kitz.co.uk/index.php/topic,24600.90.html


  • Well, the guys at kitz basically found out the same thing.
    There are hardware issues with the modems, which are fudged with firmware fixes.
    Most probably those fixes were done (or were only possible) in router mode, not in bridge mode.

    I doubt moving pfsense to dedicated hw will fix the spikes.

    DNS resolution is irrelevant to the problem (unless of course DNS over UDP is one's ONLY traffic).
    And so is MTR packet loss at a certain hop.

    As a side note, a direct comparison with a once-upon-a-time fork can't be conclusive for any pfsense functionality/feature/issue.


  • It does seem to point to it being a Pf issue. However, many discussions still point to the ISP or router hardware, which is fine, but there are many other people who double-NAT, use Pf 2.5.x or remove Pf from the mix and the problems go away. For this reason I am waiting eagerly for 2.5 to be official. I don't want to run beta software at the moment, though I still dabble with the idea of trying it; I just don't seem to find the time.

    That said, it can't be too far away, so it's likely just as easy to tough it out a little longer.


  • @rod-it
    When I say pf, I mean specifically the BSD network stack, not pf directly.

    I note this because 2.4 is built on an older BSD version, and network bugs may be resolved in 2.5 due to the upgraded OS.

    This is just a guess, but there are online videos that show the issue being resolved in 2.5.


  • Are the two modems in bridge mode, having the same LAN subnet?


  • @rod-it But if you do NAT at the ISP router (it doesn't have to be double), then the corrected firmware "covers" the issue.
    When in bridge mode, it does/can not.

    Only if another router is used behind the modem in bridge mode without experiencing the lags can it be attributed to pf.

    I doubt 2.5 can solve this.


  • While I agree with you, there are videos showing that for some people 2.5 does fix the issue, and the forum linked above also shows people using pf with the SH3 and not facing the issues. Perhaps it's related to the NICs in the Pf box and how the SH3 is talking to them.

    While I know Puma is the cause, based on what you say above, if we rule out Pf completely, there is no fix other than downgrading broadband to an SH2 or moving ISP, which in some cases is not an option. The only other option is to remove Pf completely and go back to a physical router (my Nighthawk didn't show these symptoms) or just use the SH3 directly, which is a bad choice for anyone who wants to do more than basic internet. I can also tell you that if I leave the modem in bridged mode and connect a laptop or PC directly, I also do not have these issues.

    I can also tell you that my issues spiked (no pun intended) back in Feb/March of 2020, and you'll see this posted all over the place too. I believe this was the date VM put the fix described above in place for anyone in router mode, but sadly this made things worse for those of us who use bridged mode.

    To be clear, I am not saying this is a Pf or BSD issue directly, but something is not gelling well between the SH3 and Pf/BSD.

    I, like many others, would just love to know a solution. Replacing the modem or changing ISPs is not an option; reverting to a hardware router and not using Pfsense is an option, but not one I'd like to have to choose.

    Appreciate everyone's input and help though.


  • @cool_corona completely different subnets, in bridge mode


  • If we start doubting things like how Ethernet works at the hardware level, then we have a multitude of options we can't really control to blame too.
    For example, if traffic for the bridged modem passes through an L2 switch which gets it from a tagged trunk port and feeds it untagged to the bridged modem, it also adds a store-and-forward buffer and a few microseconds of delay.

    It should be negligible and hard to measure.
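
    For a sense of scale, the store-and-forward delay can be estimated: a switch has to receive a whole frame before forwarding it, so it adds at least the frame's serialization time. A minimal sketch, assuming a 1 Gb link and two illustrative frame sizes (these are assumptions, not measurements from this thread):

```python
def serialization_delay_us(frame_bytes, link_bps):
    """Time needed to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 * 1_000_000 / link_bps


GIGABIT = 1_000_000_000  # link speed in bits per second (assumed)

if __name__ == "__main__":
    for size in (64, 1500):  # illustrative small and large frame sizes
        delay = serialization_delay_us(size, GIGABIT)
        print(f"{size} bytes: {delay:.3f} us")
```

    That works out to about 12 microseconds for a 1500-byte frame, roughly three orders of magnitude below the 10-30 ms spikes reported in this thread.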

    However, a faulty receiving end on the bridged modem could be positively or negatively affected by this buffering.

    I'm wildly guessing here; let's hope someone stumbles upon a combination that will work for the rest of us.


  • Happy new year everyone!

    Now I have a brand new hardware box - a 6-port Qotom Q555G6 with an i5-7200U, 8 GB RAM and a 64 GB SSD.

    Installed pfsense and set it up manually from scratch. In general, latency decreased a bit, but latency spikes and packet drops are the same as they were before (1-30%).
    Installed opnsense and set it up manually again, from scratch. Same latency spikes with packet drops (20-30% at times).
    Both were used in a fail-safe 2 WAN configuration with one gateway group and dpinger enabled.
    A direct connection to the router doesn't have such issues. Ping spikes and packet loss happen on both WAN interfaces even when the ISP modems are connected directly to the pfsense/opnsense HW box (one is puma6-affected with firmware patches installed, the other one is not affected).
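
    With loss in the 20-30% range, the gateway monitoring itself may be marking a WAN as degraded or down and triggering failover. A rough sketch of how a monitor could classify a gateway from averaged latency and loss follows. `classify_gateway` is a hypothetical helper, not dpinger's actual logic, and the thresholds (200/500 ms, 10/20%) are assumed to mirror the usual gateway defaults, so check them against your own settings.

```python
def classify_gateway(avg_latency_ms, loss_pct,
                     latency_low=200, latency_high=500,
                     loss_low=10, loss_high=20):
    """Classify a gateway as 'online', 'warning' or 'down'.

    All threshold values are assumptions for illustration only.
    """
    if avg_latency_ms > latency_high or loss_pct > loss_high:
        return "down"
    if avg_latency_ms > latency_low or loss_pct > loss_low:
        return "warning"
    return "online"


if __name__ == "__main__":
    print(classify_gateway(12, 0.5))  # healthy link
    print(classify_gateway(30, 25))   # low latency but heavy loss
```

    Under these assumed thresholds, a link with a fine average latency would still be treated as down at 25% loss, so a gateway group can flap between WANs even when latency looks normal.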

    No idea what else to do here, it's annoyed the hell out of me already.

    It looks like it's an issue with FreeBSD itself. It seems to be related to https://forum.netgate.com/topic/151819/2-4-5-high-latency-and-packet-loss-not-in-a-vm/
    I also increased Firewall Maximum States to 1632000 and Firewall Maximum Table Entries to 2000000 but it didn't make any difference at all.


  • 2.4.5 had this issue and it was supposed to be fixed in p1 for anyone who used more than a single core. However, it has not fixed it for me; I still suffer the same WAN issues (notably, I didn't have them in 2.4.5).

    I do not see this when directly connected to my ISP's router, but part of the reason for moving away from it was that it's very basic, and Pfsense does a lot for me.

    I've still yet to find the time to install 2.5 and play with it, but with its final release possibly not far off and having waited this long, a little longer won't hurt.

    I still think something else is going on under the hood and I've offered logs; I just wouldn't know where to start to troubleshoot this.


  • @rod-it FYI, on the opnsense roadmap for 21.1, which is due to be released at the end of January (hopefully), the very first bullet is "Fix stability and reliability issues with regard to vmx(4), vtnet(4), ixl(4), ix(4) and em(4) ethernet drivers." Hope they'll manage to eliminate this issue.
    I went through the alternatives and am about to give openwrt a try. The rest of the open-source/free firewalls are far behind pf/opnsense functionality-wise.

    With 2.4.5 I had way too high CPU load.


  • @oiyae

    I have no issues with my Qotom, see signature


  • I'm running ESXi with lots of VMs, similar to OP, and also using a Virgin Media Superhub 3 in bridged mode connected to a virtualised pfSense box with 4 vCPUs and 6 GB of RAM assigned.

    @Rod-It On 2.4.5 I experienced some issues with unbound crashing and pfBlockerNG DNSBL, however 2.4.5-P1 resolved most of these problems. I'm using FQ_Codel, which has helped keep my latencies more in check, but the underlying issues with the SH3 and the ISP Virgin Media as a whole still exist. Only recently my area got upgraded to DOCSIS 3.1; despite the modem not being able to support it, I suspect the upgrades helped with congestion/latency in the area.

    Here's a quick screenshot of the latencies mapped out by Grafana on my monitoring stack:
    [Image: Grafana latency graph]
    Few spikes here and there, but nothing too major.

    I had a quick skim through the thread, but does taking pfSense out of the equation help at all? @oiyae

    PS: I'm on Vivid 350 package HUB firmware: 9.1.1912.302


  • @pfsensation it does. When I had a PC connected directly to the SH3 router, running MTR to 8.8.8.8 for a good few hours, there were no issues whatsoever. At the same time, a laptop connected over the wire to the pfsense box showed packet drops and latency spikes with MTR to 8.8.8.8 and google.com.

    Docsis 3.0
    HW version 5.01
    SW version 6.12.18.26


  • @oiyae

    Call up VM and get the SH3 replaced, yours seems to be a much older revision and it's running software a lot older compared to mine. I would start there as in the newer firmware versions they have improved the latency issues quite a bit by offloading tasks off to the WiFi SoC.

    Here's the info on mine below

    Standard specification compliant : DOCSIS 3.0
    Hardware version : 10
    Software version : 9.1.1912.302
    

  • @pfsensation they'll send me a replacement that supports 1Gb, will see how it goes


  • @oiyae

    As far as i know they'll only send a SH4 if you buy the gig1 package or specifically ask for one and the rep is kind enough to honour it, but even so they still use the Puma chipset.

    They do mask the problem better, but it's still there.

    If they do send you a SH4 though I'm going to try that also


  • @rod-it

    For the time being, yeah. They're only giving it out to gig1 customers or anyone in a severely high utilisation area that makes enough noise.

    I've asked several times, bypassed the Indian customer service and spoke to someone in the UK. The general gist I got is, only their Level 2 and higher technical team can order a SH4 in specific circumstances (low supply?). However, getting through to one of those guys right now is near impossible.

    Do let us know how it goes! I've recently signed up to extend my virgin media contract because no one else in my area supplies more than 60meg. Being a network engineer myself, not a great fan of their network or their hardware as you can imagine...


  • For ESXi deployments, note that FreeBSD, and hence pfsense, disables MSI-X interrupt handling by default, which causes substantially higher CPU load. This could lead to spikes.

    Try inserting the following into your /boot/loader.conf.local:

    hw.pci.honor_msi_blacklist=0
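
    If the config is managed by a script, a small helper can make sure the tunable is present exactly once. This is only a sketch that works on the file contents as a string; `ensure_tunable` is a hypothetical helper, not part of pfsense or FreeBSD.

```python
def ensure_tunable(conf_text, name, value):
    """Return conf_text with `name=value` set exactly once.

    An existing assignment of `name` is replaced in place, duplicates
    are dropped, and the line is appended if it was missing. Commented
    lines are left untouched.
    """
    wanted = f"{name}={value}"
    out = []
    found = False
    for line in conf_text.splitlines():
        key = line.split("=", 1)[0].strip()
        if key == name:
            if not found:
                out.append(wanted)
                found = True
            continue  # skip duplicate assignments
        out.append(line)
    if not found:
        out.append(wanted)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    existing = 'kern.ipc.nmbclusters="1000000"\nhw.pci.honor_msi_blacklist=1\n'
    print(ensure_tunable(existing, "hw.pci.honor_msi_blacklist", 0), end="")
```

    On a real system the string would be read from and written back to /boot/loader.conf.local, and loader tunables only take effect after a reboot.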

    References:


  • @asche

    That link talks about PCI passthrough; I am running the system fully virtual, with no pass-through.

    My issue was present in ESXi 6.7U2, U3 and I've since upgraded to ESXi 7.01 and the issue still persists, I also changed my storage from PCIe SSDs to NVMe SSDs.

    CPU is less than 5% most of the time in my setup.

    My spikes are only on the WAN too, not the LAN, and have never been LAN side.

    Thanks for the article though.


  • @rod-it look at the netgate documentation -> tuning -> vmxnet, they have the same advice there (since a few months I believe).

    https://docs.netgate.com/pfsense/en/latest/hardware/tune.html#vmware-vmx-4-interfaces


  • @asche

    Thank you again, i will look at it more tomorrow, but one has to assume if this was the cause, both the LAN and the WAN would have these issues, no?

    FYI, my server has 4 on-board NICs;

    Broadcom NetXtreme BCM5719

    In case it helps or is relevant.

    I do also have a couple of spare 4 port Intel NICs, I'm half tempted to throw one in the server and see if this makes any difference moving my settings over - specifically the WAN one for now.

    My issues also didn't start until around March time of last year


  • @rod-it

    If you are willing to experiment, try removing the VMX interfaces presented to your pfSense VM and replace them with E1000e.

    Not the recommended solution, I know, but I had a lot of problems with VMX and FreeBSD (both my mail server and pfSense). Changing to E1000e worked for me.

    I'm not sure if it's related to the problems I saw but pass-through is enabled automatically when adding VMX interfaces through the ESXi GUI.


  • @biggsy

    I will try that later if I can; adding, removing and reconfiguring NICs can be done on the fly.

    Note though that you are likely referring to DirectPath I/O, which passes through specific functions of the cards, not the cards themselves.

    This can of course be disabled if needs be.

    If I get the chance later today I will edit the WAN NIC to be E1000e and see how it goes over night.


  • While I will do some testing if time permits, this question was raised back in April, a month after my issues started (March), and the latter post suggests, as many of us suspect, that something changed somewhere. It could be the driver, ISP, package configuration, or PfSense with ESXi and VMXNET3 specifically, since the common factors for people having issues are:

    • VM in ESXi
    • VMXNET3 adapter (in most cases, since this is the default)
    • ISP in the UK is Virgin Media (SH3 and SH4 known to be buggy)
    • WAN latencies and packet loss

    https://forum.netgate.com/topic/152770/is-e1000e-better-supported-than-vmxnet3-in-pfsense/

    In the thread above, VMXNET3 is the recommended adapter.

    One poster is having issues LAN side, not WAN side, however this was also posted for 2.4.5 where other known issues were fixed in P1.

    My issues seemed to only start in P1

    It could be (in my case at least) related to FW in the ISPs modem vs driver support in the FreeBSD OS.


  • @rod-it

    Honestly, I think in your case it's just ISP-related. In 2.4.5 I did have issues with larger pfBlocker lists and unbound. This was patched in P1.

    I'm on the same ISP as you, and I've had major issues myself. It seems they cannot do anything right: crappy modem, crappy network, crappy peering (often loss to certain places like Cloudflare). Now, speaking to you and the OP, it seems the same model of the flawed Puma 6 modem doesn't even run the same firmware (guess they're cocking up firmware updates). I've had to get my modem replaced to improve my situation and use things like FQ_CoDel. Still nowhere near perfect, but probably the best you'll get from flawed Puma chipset modems plus an overall sketchy ISP.

    All in all, it's just a landmine. If possible I recommend you ditch the ISP and move onto something better. With working from home and online learning more prevalent, no one has time to endlessly faff around with Virgin. I've had no choice until recently, I've got community fibre coming to my area which is full fibre to the premises, I'm switching to that right away.

    PS: I assumed everyone had read the pfsense guides; I've got the /boot/loader.conf.local options recommended by Netgate set up.


  • You're probably right, it likely is just the ISP's modem and/or the ISP themselves. The issue I have is that all other ISPs round this way offer a much lesser service, and while I'd prefer stability over speed, if I left I'd also need a TV and telephone package, which then come in at different costs; at present I have a fair bundle.

    It's all swings and roundabouts, but I've been a VM customer for some 20 years and this is by far the most unreliable the service has been to date. What irks me most is their inability to accept responsibility and actually do something about it. Even if it is for a 2% customer base, we're the ones paying for the top tier package, either on broadband only or generally across their services, so we should be shown a little more respect.

    To throw it in the mix though, and I'm sure you've read this yourself, using the router as a router does not suffer these issues, so my guess is that this is more than just the ISP here, likely a driver or setting not helping somewhere.


  • @rod-it

    It's most likely due to the half-assed patch, and customers being on a bunch of different firmware/hardware revisions. If you haven't already, try to push for a Hub 4 and see if you get lucky; if nothing else, get your Hub 3 replaced. Hopefully they give you one with newer firmware, which makes it less of a pain. Do set up some traffic management, it'll help a lot. If you can, I recommend you use an app called WeQ4U to make the call; it'll wait in the phone queue for you. Their customer support wait times are through the roof at the moment...

    The truth is, the SoC in their modems is just weak. On the Hub 3, as soon as I push past 100 Mbps, I often notice issues even though I'm on the 350 package. These include sudden spikes here and there, and frequent packet loss. It's improved for me due to getting my Hub replaced and my general area being upgraded, but it's still an issue. I wouldn't be comfortable with anyone using more than 80 Mbps of bandwidth if I wanted to have a gaming session, for example.

    All that being said, this ISP isn't one for gamers, enthusiasts or power users until they fix these glaring issues. I'm not 100% sure what your requirements are in terms of TV, but most ISPs do provide some kind of phone service, you can just get that and then sort out TV another way. In my household we mostly use Youtube/Netflix/Plex and then we have a sky box just for Freeview channels, this works just fine in my case but in yours maybe look at IPTV or other services?


  • @pfsensation

    I will try for an SH4 when I get time to call again. I was keen to go on their gig1 package, not for any specific reason, but the price and the current issues have put me in a spot where I may even reduce the package and reinstate my SH2. I'll play on wanting Gig1 though, in an attempt to get the SH4 now in preparation.

    I do have FQ_CoDel in place for WAN traffic shaping and it does indeed help, but I don't even have to be using my connection for it to spike on the WAN.

    Weekends, ironically, are affected less than weekdays.

    Yesterday and today I've had less than 1% packet loss with a highest peak of 130 ms, which is about 1/4 of the usual spike.

    I know they have recently done upgrades round here too - 9th December to be exact - but it was not clear what they were upgrading. I know it was planned and took out all services at sporadic times during the day.

    A combination of oversubscription, high overall usage in my area, dodgy boxes (as in not fit for purpose), bad FW and possible driver conflicts in FreeBSD/ESXi all add to the mix.

    I appreciate your replies and sticking with this.


  • OK, I got an SH4 and the situation improved significantly right after I replaced my old SH3 with it. The terrible packet drops of up to 20% are gone now, and from thinkbroadband monitoring I see that overall average latency decreased by about 10%. Latency spikes of 60-140 ms are still present, but they only happen 2-4 times a day, not often enough to disrupt my work/gaming. Packet drops happen 3-7 times a day but they're barely noticeable (<1%). I've been running it for a week now and opnsense's gateway monitoring hasn't reported any latency issues or packet drops on it.


  • @oiyae

    Can you report back again in a week or so, I've found if i reboot my SH3 it will run for about 7-10 days with much less of an issue, after which time it starts again.

    Also, during those first few days i can get to and ping 192.168.100.1 after which it's almost not there and i can no longer see it.


  • @oiyae

    Glad it's working out better for you. As soon as I saw that your hardware revision and software were much lower than mine, I thought it might be a factor here.


  • @rod-it

    Funny you mention that, I had that happen yesterday, couldn't access the management page or ping the router. Restarted it, and all was well and good. Thought it could be just a red herring.