Troubleshooting WAN latency
-
Can I ask for some advice on how to troubleshoot a latency issue, please? Specifically on the WAN side.
I've scoured the forum and found a few articles and a couple of tips, but nothing that solves my issue. I am running 2.4.5P1 and did not experience the issues noted in 2.4.5, but since P1 I am experiencing what was described for 2.4.5, latency-wise, though not with regard to CPU usage or processes.
For reference: https://redmine.pfsense.org/issues/10414
So, a little spec:
ESXi 6.7U3. The VM has 2 x vCPU and 4GB RAM, and it has been rebooted several times without resolving anything.
Several VLANs configured, but no issues on LAN or Wi-Fi, only the WAN.
The VM has a dedicated NIC for WAN and 2 bonded for LAN and all other traffic.
I do not experience high CPU, typically less than 10% usage.
It does seem to tie in with when pfBlockerNG updates. I don't have huge lists, about 200k items in total at the moment.
It's not at a specific time, but within about 10 minutes of the hourly pfBlockerNG update, and not always. That could be because I simply don't notice it, and it could be completely unrelated. pfBlockerNG was the last package I installed and configured.
pfBlockerNG is the latest devel version. I have just shy of 200k entries on the block list at present, so nothing demanding.
My WAN speed is 350 down / 35 up, but the connection can be almost idle and the latency still shoots up, in some cases with 75% packet loss, but more often 30% or so, and I get disconnected for just under a couple of minutes each time.
I do not block bogon networks.
I started to notice this back in March, when Covid-19 started and we all began working from home. For the most part I don't notice it, but if I happen to be on a call at the time (VoIP calls for work), I only get sporadic parts of the conversation. At first I put this down to Wi-Fi, as I was working directly on my work laptop, so I just put up with it. About 2 months in, it was becoming apparent we were not rushing back to the building any time soon, so I set up the laptop more permanently and wired it. The issue still persists.
I've been trying to note the times when I see this happen. The device I am using when I notice it is gigabit wired, and it's only the WAN I have issues with.
Rather than waffle on, can anyone point me in the direction of what to look for, how to troubleshoot, etc.? I can supply more information on request; I think that's easiest.
I have also seen a few people with similar issues saying that moving to 2.5 solved them. I am keen, but would rather help diagnose the cause as well if I can.
In case it's relevant and to avoid people asking, the host is an HPE DL360 G9 with 2 x 10-core CPUs, 512GB RAM and ~6TB PCIe flash storage.
This is only my home lab system, so it's something I can put up with for a little longer.
Oh and in case it's relevant.
My ISP-supplied modem/router is set to modem-only mode. For about 1-2 hours after rebooting the modem I can ping it and connect to its management page (192.168.100.1); after that it responds to pings maybe once every 2-3 minutes and I am unable to get to the management page at all.
There are no rules in pfSense for this network, but I also do not see anything blocked in the logs. I can't honestly say how long this has been happening, as I've never needed the modem's management page; it's just something I noticed during troubleshooting.
FYI, I've had the same external IP for around 2 years now.
As I write this, latency on pings to my gateway (the ISP's gateway) is starting to increase.
Ugh
-
Something I can add.
As I was writing this and latency was increasing, I got 3 pings going:
1 to my pfSense - nothing dropped.
1 to my ISP's gateway - high pings, 300ms or more.
1 to my ISP's modem-mode IP, 192.168.100.1.
What I witnessed was:
The gateway pings increased and the gateway showed latency and packet loss in the pfSense dashboard, but the ISP's modem started to reply to pings, consistently. Once the latency reduced and went back to stable, the modem IP stopped replying again.
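For anyone wanting to repeat the three-ping comparison with numbers rather than by eye, here is a rough Python sketch; it is an illustration, not something from this thread. It assumes each ping's output was saved as lines of text and that round-trip times appear as `time=12ms` / `time=12.3 ms` or the Windows `time<1ms` form. The function names and the sample lines are made up for the example.

```python
import re
from statistics import mean

# Matches "time=12.3 ms", "time=312ms" and the Windows "time<1ms" form.
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_rtts(lines):
    """Return one entry per probe line: RTT in ms, or None for a lost probe."""
    rtts = []
    for line in lines:
        match = RTT_RE.search(line)
        if match:
            rtts.append(float(match.group(1)))
        elif "timed out" in line.lower() or "unreachable" in line.lower():
            rtts.append(None)
        # Anything else (headers, statistics) is ignored.
    return rtts


def summarize(rtts):
    """Summarise probes as loss percentage plus average and worst RTT."""
    answered = [r for r in rtts if r is not None]
    sent = len(rtts)
    return {
        "sent": sent,
        "loss_pct": 100.0 * (sent - len(answered)) / sent if sent else 0.0,
        "avg_ms": mean(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
    }


# Hypothetical captured output for the ISP gateway during a bad spell.
gateway_log = [
    "Pinging 203.0.113.1 with 32 bytes of data:",
    "Reply from 203.0.113.1: bytes=32 time=312ms TTL=254",
    "Request timed out.",
    "Reply from 203.0.113.1: bytes=32 time=288ms TTL=254",
    "Request timed out.",
]
print(summarize(parse_rtts(gateway_log)))
```

Running the same summary over the pfSense, gateway and modem logs for the same time window makes it easier to see whether the modem starts answering exactly when the gateway gets worse.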
This makes me want to rule out pfBlockerNG completely and say it's something going on between the modem and my pfSense, but what, and where do I start?
I did also notice this guide. I'm sure I've done this in the past and didn't see a difference, but I am fine configuring it again if anyone thinks it will help or is related. My connection is not PPPoE though, just DHCP direct from the ISP.
https://docs.netgate.com/pfsense/en/latest/interfaces/accessing-modem-from-inside-firewall.html
FYI, I am not massively bothered about getting to the modem page; it's rare that I need it and I don't think I've used it in years. I'm just putting it out there in case it has relevance here.
-
@Rod-It I have seen reports that this modem IP is reachable only when the network is down.
You can verify this by disconnecting from upstream and seeing whether there is a pattern in the pings.
From what you say, it doesn't seem to be pf related.
And as for pfBlockerNG and lists, well, huge lists might cause slow DNS resolution, but NEVER packet loss.
-
I would likely agree that in modem-only mode the page may not be visible unless it's required, which would typically be for other reasons anyway. However, it does respond to pings very sporadically, so it's not completely offline, so to speak. Also, as noted earlier, if I disconnect the modem and then power it back on, I can both ping and access the modem management page for a few hours afterwards while still retaining internet access.
I went with your suggestion: I removed the coax cable from the back of the modem while running a ping to the modem-only-mode IP. I can confirm that the modem did NOT respond. I did lose internet, and pf showed as offline with packet loss as expected, but the modem page did not return.
Reconnecting the internet as above did grant me access to the modem IP and web page once more. It only lasted about half an hour this time, but I was able to get in and take screenshots; nothing shows as unusual, and all connections look good and strong. While writing this, the ping to my modem IP has also returned and is a solid 1ms - 2ms or so.
I am, however, less concerned about gaining access to the modem management page than I am about troubleshooting why I have latency and why it's almost every hour.
It's not unbearable, but it is frustrating.
I will report this to my ISP for completeness, but every hour +/- 10 minutes seems too regular for it to be an external factor. In the 22 years I've been with them I've only ever had 2, maybe 3 issues, all of which disconnected me completely as they were cabinet faults (outside).
Open to any other suggestions of course. I will also build a clean pf box alongside this one to see if I can repeat the issue. If I cannot, that will confirm something is going on within pf, right?
-
@Rod-It Most probably it's an ISP issue.
You are reporting packet loss, not just latency, which is much more severe.
I highly doubt it has anything to do with pf.
Most probably, as traffic patterns changed massively due to work-from-home loads, there is congestion somewhere.
You could hook up a laptop directly to the modem and run some pings for a few hours.
-
Just an FYI: the ISP is investigating, but this may take some time, as nothing is showing as a fault at present and I've had to log it on their forum because it's not a critical or loss-of-service fault.
Thanks for chiming in.
I'm still fairly new to networking in general and to troubleshooting stuff like this. I deal mostly with servers and virtualization but need to understand networking more generally, so all advice and guidance is well received.
-
@Rod-It Keep in mind that ISPs make money by overselling installed capacity. The Covid situation has invalidated many known, statistically safe patterns, and ISPs struggle to balance traffic with upgrades.
Any decent ISP has monitoring in place and can see where and when congestion occurs. If a report comes in, they will have a look and, in most cases, silently adjust things, even though they will never say more than "fixed".
Now, if you really want to make sure your infrastructure is OK, you need a box (laptop/desktop) that can be moved around and can run iperf3.
iperf3 creates traffic that saturates lines and will reveal lurking issues.
First you run iperf3 on your LAN, with just the Ethernet switch.
That establishes a baseline. You should expect near-gigabit speeds there.
Then you run the same test with the pf LAN in the path, and finally you move the probe to the WAN side of pf.
You can leave it running for some time and observe CPU load, interface errors, and packet loss.
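A rough sketch of how those runs could be scripted so the three positions are compared like for like. This is an illustration only: it assumes an iperf3 server (`iperf3 -s`) is already listening on the far side, uses iperf3's `-c`, `-t` and `-J` (JSON output) options, and reads the `end.sum_received.bits_per_second` field of a TCP result. The helper names, the address 192.0.2.10 and the trimmed sample result are invented for the example.

```python
import json
import subprocess


def iperf3_command(server, seconds=60):
    """Build the client command line for a TCP test with JSON output."""
    return ["iperf3", "-c", server, "-t", str(seconds), "-J"]


def summarize_result(raw_json):
    """Extract receiver throughput (Mbit/s) and sender retransmits, if reported."""
    end = json.loads(raw_json)["end"]
    return {
        "mbps": round(end["sum_received"]["bits_per_second"] / 1e6, 1),
        "retransmits": end.get("sum_sent", {}).get("retransmits"),
    }


def run_test(server, seconds=60):
    """Run iperf3 against `server` and summarise it (needs iperf3 installed)."""
    done = subprocess.run(iperf3_command(server, seconds),
                          capture_output=True, text=True, check=True)
    return summarize_result(done.stdout)


# Hand-trimmed sample in the shape of an iperf3 TCP result, for illustration.
sample = json.dumps({"end": {
    "sum_sent": {"bits_per_second": 941.2e6, "retransmits": 3},
    "sum_received": {"bits_per_second": 938.6e6},
}})
print(summarize_result(sample))
```

Calling `run_test("192.0.2.10")` once per position (switch only, pf LAN, WAN side of pf) would give three figures to set against the baseline.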
Much easier than recreating pf from scratch too.
-
Again, I agree with you about oversaturation and Covid-19 having an impact; that's a given, especially as more and more people work from home or watch Netflix etc. to pass the time.
I am happy to put it down to this too, but the timings are too regular. I've been watching the logs on and off all day and it seems to happen every hour at about 15-20 minutes past, not always disconnecting me, but generating latency.
The fact that it's so regular suggests this is not a saturation issue and that something else is happening.
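One way to test the "same time every hour" hunch, as opposed to random congestion, is to bucket the noted spike times by minute of the hour. A rough Python sketch follows; the timestamps are invented for illustration and the function name is just a placeholder.

```python
from collections import Counter
from datetime import datetime


def spikes_by_minute(timestamps, bucket=5):
    """Count spikes per `bucket`-minute slot of the hour (0, 5, 10, ...)."""
    counts = Counter()
    for stamp in timestamps:
        minute = datetime.strptime(stamp, "%Y-%m-%d %H:%M").minute
        counts[minute - minute % bucket] += 1
    return counts


# Hypothetical spike times noted by hand or copied from the gateway log.
noted = [
    "2020-07-01 09:17", "2020-07-01 10:16", "2020-07-01 11:19",
    "2020-07-01 12:42", "2020-07-01 13:18", "2020-07-01 14:15",
]
for slot, count in sorted(spikes_by_minute(noted).items()):
    print(f"slot starting at minute {slot:02d}: {count}")
```

If the spikes really are tied to something hourly, most of the counts should pile into one or two slots; load from neighbours would be expected to spread out more across the hour.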
https://www.thinkbroadband.com/broadband/monitoring/quality/share/d4f84708f295bf0e6f70b9e8dc9486e001bd6167
Here is what I started capturing earlier today. Ignore the red block; Dumbo here forgot to allow WAN pings.
I am also aware my ISP uses Intel's Puma6 chipset in my router, another reason I put it in modem-only mode. Their next revision uses Puma7, also not great, but it offers gigabit speeds.
I also noticed my modem is only showing 2 of 4 upstream channels. I've posted that on the other forum too, and this may be the cause, but I'll have to wait and see.
The LAN side of things is not affected. I can happily push gigabit over the network all day long without issues, Wi-Fi is fine and stable, and none of the VLANs show issues. It's only the WAN, and seemingly every hour, roughly 15-20 minutes past each time.
As an FYI, the WAN and LAN are pretty quiet today, as there is only me in, and for the most part I only have a couple of pings running to a few devices and YT in the background to keep me from falling asleep.
WAN is showing 2Mbps currently and the biggest other VLAN in use is at 5k, a fairly quiet, typical day.
-
@Rod-It Still, it could be congestion upstream.
From what you say, there is nothing in your system to cause this.
However, traffic in the neighborhood can.
-
Absolutely, but not so regularly at the same times.
All I can do now is wait for the ISP techs to look at my router logs and determine whether I have a problem. From what I read, if the QAM is not 64 (for example if it's 32 or 16), or the channels are not fully populated (in my case 2 upstream instead of 4, and 24 downstream), then this could cause sporadic issues. In a way I hope it's the ISP; at least then I can stop diagnosing a fault I cannot control.
Not having fully connected channels is something to do with SNR and less to do with congestion. Either way, I appreciate you sticking with me.
-
@Rod-It My practical knowledge of DOCSIS is worse than yours, so we have to wait for the ISP.
Apart from that, iperf testing can verify everything else but the WAN link.
-
I guess the other side to this is: had I not been working from home, would I ever have known there was an issue?
I pay them a monthly fee, they can earn it this month.
-
What brand and model modem are you using?
Who is the ISP?
-
SuperHub3 (Arris VMDG505) provided by Virgin Media UK.
It would not be my choice, but we don't get a choice; it's take it or leave it.
I still have my old SH2, but it does not support the speed I am paying for, and the SH4 is only available to gigabit customers, which I cannot be at this time as it's not in my area (even if I wanted to be on it).
-
So, I can confirm this is NOT an ISP issue. If I remove pfSense from the mix and connect directly to the router, I do not have this issue.
This is last night's BQM, directly connected to a laptop:
Ironically, this is this morning's, back on pfSense:
Before the red block is yesterday on pfSense; the red block is when I had the device connected directly to a laptop (screen above); and the content after the red block is back on pfSense. It has yet to have a repeat spike like before. I believe the initial spike was all the devices connecting back online for the first time.
pfSense has not been rebooted; all I did was unplug the WAN cable and plug it into another device. I already did this a week ago, before I started with the BQMs, and this did not happen.
Hopefully it's gone away, but posting to keep all informed.
I'm not sure I understand why this was the case if it has now gone away, so any other ways to troubleshoot the cause in the future would be welcome.
Also note that I am now continually able to ping and connect to my modem, whereas before this I was not. I wonder if this has some bearing on the sporadic spikes.
-
Two things to double check:
MTU
Bufferbloat
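For the MTU check, one common approach is to send pings with the don't-fragment flag set and find the largest payload that still gets through; on IPv4 the packet size is then that payload plus 28 bytes (20 for the IP header, 8 for ICMP). Below is a small Python sketch of the search, assuming a caller-supplied `probe(size)` function that returns True when a don't-fragment ping with that payload size succeeds. The `probe` interface and the fake 1500-byte link are placeholders, not a real ping wrapper.

```python
IPV4_HEADER = 20  # bytes, without options
ICMP_HEADER = 8   # bytes


def largest_payload(probe, low=0, high=1472):
    """Binary-search the biggest payload in [low, high] for which probe() is True.

    Assumes that if one size passes, every smaller size passes too.
    Returns None if even the smallest payload fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def path_mtu(payload):
    """Convert an ICMP payload size to the matching IPv4 packet size."""
    return payload + IPV4_HEADER + ICMP_HEADER


# Stand-in for a real don't-fragment ping: pretends the path MTU is 1500 bytes.
def fake_probe(size):
    return size + IPV4_HEADER + ICMP_HEADER <= 1500


best = largest_payload(fake_probe, high=2000)
print(best, path_mtu(best))
```

A real `probe` would wrap the platform's ping command with its don't-fragment option, which differs between operating systems.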
-
I do appreciate the replies though and things to look for.
I got a spike, as expected, during the test, but since this morning I am not suffering the continual latency I had for the last few weeks/month, and looking at my live graph above, it's been much more settled, so I am lost as to what was causing it.
-
@Rod-It said in Troubleshooting WAN latency:
VMDG505
So this is a Puma6-powered modem, and the spikes you are seeing are detailed here:
http://badmodems.com
There are a few issues with the Intel chip that will never be fixed. Some have been patched, but the patches will never be applied to many modems.
If your ISP says they are unaware of this issue, then they are lying.
UDP traffic will cause all kinds of havoc: video, VoIP calls, gaming, etc.
Try this tool: http://www.dslreports.com/tools/puma6 Run the test in something other than Firefox; that browser seems to have its own issues lately.
-
My ISP is aware of this, but there is nothing to replace it with. We are not allowed to use our own modems, so it's this or drop to a lower tier.
The modem I have is capable of up to 500Mbits; beyond that they offer you an SH4 (SuperHub4) capable of gigabit, but that is a) not in my area yet and b) runs on the Puma7, which also has this latency issue, although a slightly better CPU masks a lot of it. Likewise, I still have no choice in what I use for the modem piece.
Today has been a lot better, and I can't understand why. I'm actually using the internet more today than yesterday, but I've had maybe 5 spikes, none of which have disconnected me, and of course I expect some lag and latency spikes.
I've already run that test as well as many others, and I am currently reading up on SQM.