Diagnosing latency spikes
-
Hi! I've been running pfSense for a few months now, and I've been able to tune my system well enough that I'm not getting dropped packets.
However, I still haven't been able to figure out why I get so many latency spikes. My specs are:
- Intel XL710-BM1 (4x SFP+)
- Intel Core i3-10100
- 16GB RAM
My WAN is connected with an SFP+ 10GbE adapter.
I've made sure my network card firmware/NVM matches the driver in use by pfSense for the best possible compatibility.
I've checked a lot of tuning guides and recommendations and ended up with the following changes to my loader.conf.local file:
hw.intr_storm_threshold=0
kern.ipc.nmbclusters="8388608"
kern.ipc.nmbjumbop="1048576"
kern.ipc.nmbjumbo9="1048576"
kern.ipc.nmbjumbo16="1048576"
net.isr.maxthreads="-1"
net.isr.bindthreads="1"
hw.ixl.num_queues="1"
My system tunables are the following:
kern.ipc.maxsockbuf 16777216
net.inet.tcp.sendbuf_max 16777216
net.inet.tcp.recvbuf_max 16777216
net.inet.tcp.sendbuf_inc 262144
net.inet.tcp.recvbuf_inc 262144
net.route.netisr_maxqlen 2048
net.inet.ip.intr_queue_maxlen 2048
net.core.rmem_default 8388608
net.core.rmem_max 16777216
net.core.wmem_default 8388608
net.core.wmem_max 16777216
dev.ixl.0.fw_lldp 0
dev.ixl.1.fw_lldp 0
dev.ixl.2.fw_lldp 0
dev.ixl.3.fw_lldp 0
net.inet.udp.maxdgram 131072
net.inet.udp.recvspace 131072
net.core.netdev_max_backlog 262144
This has made it so I no longer have dropped UDP packets, but I still experience latency spikes very often.
I've ruled out buffer bloat, as the latency spikes happen even when there's little to no network utilization. The latency goes from 1-3ms to ~300ms when a spike happens, with some instances where it goes beyond 1 second. That averages out to ~85ms latency on a ping test or similar.
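A minimal sketch of how such spikes could be picked out of saved ping output, rather than judged by eye. It assumes the usual `time=… ms` field that ping prints; the function names, the 10x-median threshold, and the four sample lines are made up for illustration and are not measurements from this setup.

```python
import re
import statistics

# Matches the "time=12.3 ms" field printed by ping on FreeBSD and Linux.
RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")

def parse_rtts(ping_output):
    """Return every round-trip time (in ms) found in the ping output."""
    return [float(m.group(1)) for m in RTT_RE.finditer(ping_output)]

def find_spikes(rtts, factor=10.0):
    """Return (index, rtt) pairs where the RTT exceeds factor x the median."""
    if not rtts:
        return []
    baseline = statistics.median(rtts)
    return [(i, r) for i, r in enumerate(rtts) if r > baseline * factor]

# Illustrative sample only (hypothetical values).
sample = """\
64 bytes from 1.1.1.1: icmp_seq=0 ttl=57 time=2.1 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=57 time=1.8 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=57 time=312.4 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=57 time=2.6 ms
"""

rtts = parse_rtts(sample)
spikes = find_spikes(rtts)
print(f"median={statistics.median(rtts):.2f} ms, mean={statistics.mean(rtts):.2f} ms")
print("spikes:", spikes)
```

Comparing against the median rather than the mean keeps the baseline near the normal 1-3ms range, since a handful of 300ms replies is enough to drag the mean up on its own.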
I'm not sure what else I should be looking for to diagnose the issue. I know the ixl drivers are known to cause all sorts of problems on FreeBSD, and if I were buying now I would choose a different NIC. But I'm trying to get it to work as well as possible.
Note: from all the tests I've done, this only happens on the WAN side; pinging any device on my LAN never shows any latency spikes.
Any and all comments and recommendations are very much appreciated.
-
What pfSense version are you running?
If you're running 2.6 have you applied the patch for this: https://redmine.pfsense.org/issues/12827
Steve
-
@stephenw10
I'm running 22.01, I've applied the patch and the results are the same with or without it.
I think the issue is hardware tuning related. Unfortunately I don't have any other NICs I can test with.
-
@zmiguel said in Diagnosing latency spikes:
The latency spikes go from 1-3ms to ~300ms when it happens. With some instances where the latency goes beyond 1 second. Averaging out to ~85ms latency on a ping test or similar.
To what exactly.. Another device on your own network.. An ICMP response from a device outside of your control over a network outside of your control.. Why would you think that latency is because of your device?
If you sniff the traffic leaving your device, and they leave say every 1 second from a typical constant ping - and the responses take longer to return than you think.. How is it you think that is a problem with your device?
Are you saying the response is returned to you and your device (pfsense) is having a delay in processing them?
Now if you had device A -- pfsense --- device B.. That was all over your network.. And pfsense is not processing the traffic correctly (causing a delay) that would be one thing.
But if you're pinging out your wan -- to what, your ISP device, or some IP out on the internet.. I do not follow your logic that it's pfsense that is causing the problem.
Is whatever you're testing to on your wan side actually still on your network?
-
@johnpoz said in Diagnosing latency spikes:
To what exactly.. Another device on your own network..
To any device on the WAN side of my network
Why would you think that latency is because of your device?
Because it wasn't here when I was using my previous router (EdgeRouter 12), nor is it when connected directly to my modem.
Are you saying the response is returned to you and your device (pfsense) is having a delay in processing them?
I don't know, but I would assume so.
Now if you had device A -- pfsense --- device B.. That was all over your network.. And pfsense is not processing the traffic correctly (causing a delay) that would be one thing.
I can try setting this up and see how it goes.
Is whatever you're testing to on your wan side actually still on your network?
I've tested straight from pfsense to 1.1.1.1, 8.8.8.8 and a few other remote servers that I own, I've also tested the same thing from a device on my LAN side.
I can try having a device I control directly on the WAN side and test it like you mentioned above.
-
@zmiguel said in Diagnosing latency spikes:
I can try having a device I control directly on the WAN side and test it
Yes, that would be a far better test. It would pretty much confirm the issue.
You could also try running pcaps on WAN and LAN and looking at the actual latency between query and reply. Of course you are already through the NIC hardware and driver at that point, but if it was something in the packet processing it would show as a clear difference between WAN and LAN.
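As a rough sketch of that comparison, the following pairs each ICMP echo request with its reply in text output from two captures and subtracts the WAN round trip from the LAN one. It assumes the captures were printed with `tcpdump -tt -n` (epoch timestamps, no name resolution) and contain a single ping session; the addresses and timestamps are made-up sample values, and `echo_rtts` is a hypothetical helper name. Packets are matched on the sequence number only, since NAT may rewrite the ICMP id between LAN and WAN.

```python
import re
from decimal import Decimal

# One line of `tcpdump -tt -n` output for an ICMP echo request or reply.
LINE_RE = re.compile(
    r"^(\d+\.\d+) IP \S+ > \S+: ICMP echo (request|reply), id \d+, seq (\d+)"
)

def echo_rtts(lines):
    """Map ICMP seq -> seconds between the echo request and its reply."""
    sent, rtts = {}, {}
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # not an ICMP echo line
        ts, kind, seq = Decimal(m.group(1)), m.group(2), int(m.group(3))
        if kind == "request":
            sent[seq] = ts
        elif seq in sent:
            rtts[seq] = ts - sent[seq]
    return rtts

# Hypothetical sample lines, not taken from a real capture.
lan_capture = [
    "1651757686.100000 IP 192.168.1.10 > 1.1.1.1: ICMP echo request, id 7, seq 1, length 64",
    "1651757686.102500 IP 1.1.1.1 > 192.168.1.10: ICMP echo reply, id 7, seq 1, length 64",
]
wan_capture = [
    "1651757686.100100 IP 203.0.113.5 > 1.1.1.1: ICMP echo request, id 7, seq 1, length 64",
    "1651757686.102400 IP 1.1.1.1 > 203.0.113.5: ICMP echo reply, id 7, seq 1, length 64",
]

lan, wan = echo_rtts(lan_capture), echo_rtts(wan_capture)
# Whatever the LAN side sees on top of the WAN round trip was spent inside the firewall.
overhead = {seq: lan[seq] - wan[seq] for seq in lan if seq in wan}
for seq, extra in sorted(overhead.items()):
    print(f"seq {seq}: LAN {lan[seq]} s, WAN {wan[seq]} s, inside firewall {extra} s")
```

If the spikes show up in the WAN round trip as well, the delay is happening outside the packet processing; if they only show up in the difference, it points at the firewall.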
Steve
-
@zmiguel said in Diagnosing latency spikes:
Because it wasn't here when I was using my previous router (EdgeRouter 12), nor is it when connected directly to my modem.
While I see how those could lead you to your conclusion, I take it that when your device is directly connected to your modem you have a different IP with your isp. And with the other edge router it is also common for this IP to be different.
Different IPs tend to point to different hardware for their next hop.. Which could lead to changes in the performance. Or the issue you're seeing could be sporadic..
The only real way to actually say it's device X causing the problem is when you control the complete path of the test, both the network gear and the end devices.. Testing out to the internet, or even to your isp device, has way too many variables involved to actually point the finger at your hardware.
It's like saying it only rains when you wash your car ;) You can control when you wash your car, but when it rains is completely out of your control, so concluding that it only rains when you wash it is not valid testing.
It may very well be something wrong with pfsense - not saying it's 100% not that. But the testing parameters leave room for it to be something in your isp, the internet, or wherever you're testing to that is causing the latency differences.. When trying to track down something like this it is best to limit the variables to stuff you directly control and can monitor to pinpoint the actual cause.
When you suspect it's X causing the problem - you need to remove as many other variables as possible and test whatever your issue is with just X.. If it is X it would present itself with the other variables removed.
It's like using a different soap brand when washing your car: this doesn't remove the actual real variable (the weather).. But it could lead you to believe hey, it only rains when I wash my car with Acme soap, not when I use SuperSuds brand.
edit: Yeah I know the car washing example is horrible and pretty obvious - but attempting to test latency or bandwidth that goes out to your isp or anything over the internet is just completely out of your control.. Like the weather ;)
-
To add some more information, here's a MTR directly from pfsense to a VPS
pfsense.local (37.x.x.x) -> 194.x.x.x    2022-05-05T14:34:46+0100
                                                               Packets                  Pings
Host                                                          Loss%   Snt   Last    Avg   Best   Wrst  StDev
1.  AS???    10.208.128.1                                     94.5%  1000    2.4  138.3    1.0  1405.  352.1
2.  AS8657   telepac16-hsi.cprm.net                           87.6%  1000    3.1  110.2    2.6  1873.  293.7
3.  AS8657   tva-cr1-bu10-200.cprm.net                         0.9%  1000    4.6  136.0    3.6  2727.  324.7
4.  AS8657   lon1-cr1-be2.cprm.net                             0.8%  1000   36.8  190.3   33.6  2627.  371.3
5.  AS1299   ldn-b7-link.ip.twelve99.net                       0.5%  1000   35.6  196.3   35.2  3310.  384.6
6.  AS1299   ldn-bb4-link.ip.twelve99.net                     33.7%  1000   35.9  230.6   35.3  3214.  404.2
7.  AS1299   adm-bb4-link.ip.twelve99.net                      0.7%  1000   39.8  198.2   38.7  3112.  373.2
8.  AS1299   ddf-b3-link.ip.twelve99.net                       1.0%  1000   49.9  197.4   42.2  3059.  375.0
9.  AS1299   contabo-svc072466-ic359931.ip.twelve99-cust.net   0.9%  1000   43.1  189.0   42.9  2958.  349.6
10. AS51167  x.contaboserver.net                               0.6%  1000   43.7  191.0   43.0  2859.  352.1
And here's the same but connected directly to the modem
Desktop (37.x.x.x) -> 194.x.x.x    2022-05-05T14:51:17+0100
                                                               Packets                  Pings
Host                                                          Loss%   Snt   Last    Avg   Best   Wrst  StDev
1.  AS???    10.208.128.1                                     87.6%  1000    2.2    4.1    1.1   53.3    5.7
2.  (waiting for reply)
3.  AS8657   tva-cr1-bu10-200.cprm.net                         0.0%  1000    4.3    5.0    3.3   50.8    3.8
4.  AS8657   lon1-cr1-be2.cprm.net                             0.0%  1000   34.3   34.1   33.5   81.0    2.9
5.  AS1299   ldn-b7-link.ip.twelve99.net                       0.0%  1000   35.7   36.3   35.2   84.1    4.6
6.  AS1299   ldn-bb4-link.ip.twelve99.net                      0.5%  1000   36.1   45.8   35.4  105.2   11.8
7.  AS1299   adm-bb4-link.ip.twelve99.net                      0.0%  1000   39.2   41.8   38.6   85.1    3.6
8.  AS1299   ddf-b3-link.ip.twelve99.net                       0.0%  1000   43.6   44.3   42.2  103.0    3.8
9.  AS1299   contabo-svc072466-ic359931.ip.twelve99-cust.net   0.0%  1000   43.8   44.2   42.8   97.0    3.3
10. AS51167  x.contaboserver.net                               0.0%  1000   44.0   45.2   43.6   96.5    3.0
@johnpoz said in Diagnosing latency spikes:
While I see how those could lead you to your conclusion, I take it that when your device is directly connected to your modem you have a different IP with your isp. And with the other edge router it is also common for this IP to be different.
They get an IP in the same subnet, and the hops are the same.
I understand that it's not ideal to test this to the internet. But as I only have issues when going out to my WAN and the only variable that I changed was my firewall/router it leads me to believe there's something wrong with my hardware choices.
I'll run the tests during the weekend so as not to affect my network so much during the weekdays.
-
Hello zmiguel, how did you resolve the issue?
-
@GeorgeCZ58 said in Diagnosing latency spikes:
how did you resolve the issue?
Maybe he changed ISP.. Nowhere in his testing did he show pfsense had anything to do with this - the only way to show that pfsense is adding latency would be to sniff on the inbound and outbound interfaces..
Here is his test without pfsense..
1. AS??? 10.208.128.1 87.6% 1000 2.2 4.1 1.1 53.3 5.7
That is 87.6% loss to the first hop, but the hops after that show zero.. So that points to the device just not answering.. Now if he showed 87.6% loss or higher on every hop after that - then he could pretty safely say there is an issue with connectivity.
If you show a traceroute and all of a sudden somewhere down the line you see loss, and that loss is there on every hop after that, then that points to actual loss. But loss at a specific hop and then zero or much lower after it points to the device with the high loss just not answering all of the pings, or not answering them in a timely manner, etc.
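That rule of thumb can be sketched in a few lines: find the hop from which loss stays above some threshold all the way to the destination. The function name and the 1% default threshold are illustrative assumptions, and the second list is an invented example; the first list is the Loss% column from the trace taken directly on the modem.

```python
def first_persistent_loss(hop_loss, threshold=1.0):
    """Return the 1-based hop where loss above `threshold` (percent) starts and
    continues through the final hop, or None if no loss persists to the end.

    A hop given as None did not answer at all and is skipped.
    """
    start = None
    for hop, loss in enumerate(hop_loss, start=1):
        if loss is None:
            continue  # silent hop, tells us nothing either way
        if loss > threshold:
            if start is None:
                start = hop  # possible beginning of real loss
        else:
            start = None  # a later hop is clean, so earlier loss was not real
    return start

# Loss% per hop from the trace run directly on the modem (hop 2 never replied).
direct = [87.6, None, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
# Invented example where loss begins at hop 3 and carries on to the end.
broken = [0.0, 0.0, 12.0, 11.5, 12.3]

print("direct to modem:", first_persistent_loss(direct))
print("invented example:", first_persistent_loss(broken))
```

The 87.6% at hop 1 is discarded because later hops answer cleanly, which is the "device just not answering" case.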
Since pfsense is not listed as a hop in his first trace, it would seem he is tracing from pfsense directly - so pfsense isn't even NATing or routing the traffic.. But somehow it still adds latency to the return of something it sends out?? So what, he got the answer but didn't actually process its return for X ms?
There was a recent thread where a user thought pfsense was adding latency and was shown how to test..
There, sniffing on wan and lan at the same time, from the time the traffic hit wan to when pfsense sent it out lan, it added a whole 0.000114 seconds.
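For reference, that kind of number is just the difference between two capture timestamps for the same packet. A minimal sketch, assuming the timestamps have already been pulled out of each capture and keyed by something that identifies the packet (here the ICMP sequence number); the function name and the timestamps are made up for illustration.

```python
from decimal import Decimal

def forwarding_delays(ingress, egress):
    """Seconds between a packet arriving on one interface and leaving the other.

    Both arguments map a packet key to its capture timestamp (as a Decimal).
    Packets seen on only one side are ignored.
    """
    return {key: egress[key] - ingress[key] for key in ingress if key in egress}

# Hypothetical timestamps: replies arriving on WAN, then leaving on LAN.
wan_in = {1: Decimal("1651757686.102400"), 2: Decimal("1651757687.102900")}
lan_out = {1: Decimal("1651757686.102514"), 2: Decimal("1651757687.103010")}

delays = forwarding_delays(wan_in, lan_out)
worst = max(delays.values())
print(f"worst forwarding delay: {worst} s")
```

Decimal is used instead of float so that microsecond differences between large epoch timestamps are not lost to rounding.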