2.4.5 High latency and packet loss, not in a vm
-
Hello all,
I experienced similar issues on bare metal as well. My conclusion is that it is traffic related; pfBlockerNG also generates traffic with its list, DNSBL, and MaxMind updates.
There was a Netgate patch to pfctl in FreeBSD 11.3 which may have unintended side effects.
Here are some more details beginning from here: https://forum.netgate.com/post/901257
I hit all of the reported problems, starting with the broken mirror: missing PHP files, high latency on both gateways, high system load, and an unresponsive console.
I will restore to 2.4.4-p3 tomorrow.
-
Same problem on a physical box.
When I edit a rule and apply the changes, latency rises to 300 ms.
-
I'm also affected.
HW: SG-4860
If the pfctl process peaks at 100%, ping latency is also very high.
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1125ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1613ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1190ms TTL=55
Reply from 9.9.9.9: bytes=32 time=5ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
-
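A quick way to pull the spike lines out of ping output like the above. This is a sketch: the 100 ms threshold is arbitrary, and the sample is inlined here; in practice you would redirect the real ping output to a file and read that instead.

```shell
# Sample Windows ping output (abbreviated).
sample='Reply from 9.9.9.9: bytes=32 time=2ms TTL=55
Reply from 9.9.9.9: bytes=32 time=1125ms TTL=55
Reply from 9.9.9.9: bytes=32 time=2ms TTL=55'

# Print only replies slower than 100 ms: split on "time=", strip the
# trailing "ms TTL=..." and compare numerically.
printf '%s\n' "$sample" |
awk -F'time=' '/time=/ { ms = $2; sub(/ms.*/, "", ms); if (ms + 0 > 100) print }'
```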
Try running a packet capture on the WAN when you see this. Filter by pings.
Check to see where the latency is happening: ping requests delayed in sending, delayed responses, or somehow delayed within pf before they get back to the ping process.
Steve
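The WAN capture suggested here would look something like the following; the interface name `igb0` and the monitor IP are assumptions, so substitute your own WAN NIC and ping target. The command is only printed here, not executed:

```shell
# Build the capture command: ICMP only, to/from the monitored host,
# written to a pcap file for later inspection in Wireshark.
# igb0 and 9.9.9.9 are placeholders - use your WAN NIC and ping target.
IFACE=igb0
TARGET=9.9.9.9
CMD="tcpdump -ni $IFACE -c 200 -w /tmp/wan-icmp.pcap icmp and host $TARGET"
echo "$CMD"    # run this on the firewall shell while a latency spike occurs
```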
-
Delayed by pf. Pings between VLANs see the latency when tables are reloaded.
From one VLAN to another:
-
That is not a packet capture.
-
I am aware of that. Stand by for a packet capture.
-
If you are not able to test in a way that allows you to post actual pcaps, I don't know how much good it is going to do anyone.
It is past the point of trying to convince people this is a problem (apparently in edge cases). Now it's about compiling information so it can be identified and corrected.
-
That is a pcap, opened in Wireshark with my public IP blanked out. I would be happy to send you the file if you would like, but I'll decline to post it publicly; some knucklehead will just decide to go fishing around at my public IP.
-
I find adding the 'time difference' and 'response time' columns useful here.
That will show whether the request is delayed, and what the actual response time on the wire is. Like:
-
I just don't think this data is very helpful at diagnosing exactly what is happening.
-
@stephenw10 said in 2.4.5 High latency and packet loss, not in a vm:
I see delta time but not response time among the column choices. Maybe it would be more expedient for me to send the pcap. I have used Wireshark exactly once: this time. :)
OK, I see now: Custom column, then icmp.resptime. Does that make any sense if it's not sorted by the ICMP sequence number?
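The same response-time view is also available from the command line with tshark, if that is easier than Wireshark's column setup. The pcap filename is a placeholder, and the command is only printed here, not run:

```shell
# icmp.resptime is the same field the custom Wireshark column uses;
# the icmp.resp_in filter limits output to requests that got a reply.
# wan-icmp.pcap is a placeholder filename.
PCAP=wan-icmp.pcap
echo "tshark -r $PCAP -Y icmp.resp_in -T fields -e icmp.seq -e icmp.resptime"
```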
-
I hope this is more useful. If not I'll try again.
-
I'll add this to the mix. I changed the averaging time in the gateway settings; that's the period dpinger averages over. When I changed the setting, saved, and then applied it, the web interface locked up for an extended time (minutes).
So I ssh'd in, ran top, and did it again:
I can see dpinger using some resources, but why pfctl, ntpd, and sshd? I'm not sure that means anything, but it sure looks odd to me.
-
This looks very much like a problem I had, even before pfSense 2.4.5: the same symptoms, latency spikes, then packet loss, over and over. I had just created my first VLAN and given the VLAN interface a static IPv6 address in one of the /64s I should have had, but got no route and this horrible latency and packet loss. I followed the info HERE, created a 'Configuration Override' on the WAN IPv6, and set my VLAN's static IPv6 there; that was the only way to get darn AT&T to route IPv6 from my VLAN. It has been trouble free since, after I spent almost a week pulling my hair out. So I'm just wondering: can you ping (route) from your LAN or from the VLANs over IPv6? I see IPv4 pings above, but did I miss the IPv6 pings...
I'm on 2.4.5 with no issues, and am using the latest pfBlockerNG. It just looks so familiar...
-
I can ping over IPv6 without issue. I get a /56 from my ISP.
The only thing that has changed in my configuration is the pfSense version.
I have offered to share my config.xml for testing on matching hardware. My Supermicro hardware is the same as a box Netgate sells, just without the Netgate branding.
This is a frustrating problem, more so for Netgate than anyone else, I'm sure.
-
Can you see what is calling pfctl if you run, say:
ps -auxdww | grep pfctl
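One small refinement: on a busy box, grep will also match its own command line in the ps output. Bracketing the first character of the pattern avoids that. A sketch on canned output, since the real effect only shows up against live ps listings:

```shell
# grep's own ps entry contains the literal string "[p]fctl", which the
# regex [p]fctl does not match, so only the real pfctl process is shown.
sample='root 25572 33.5 0.0 /sbin/pfctl -o basic -f /tmp/rules.debug
root 25601  0.0 0.0 grep [p]fctl'
printf '%s\n' "$sample" | grep '[p]fctl'
```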
-
root 25572 33.5 0.0 8828 4888 - R 09:34 0:04.12 | | `-- /sbin/pfctl -o basic -f /tmp/rules.debug
-
I was able to run ps auxdww >> psoutput a few times before the shell locked up.
Here it is: (removed)
-
Thanks, that could be useful.
Interesting that there are things there using far more CPU than I would ever expect. You might want to remove it, though, if those public IPs are static.
Steve