Outbound ping problem to DNS Filter servers
-
This is the first pfSense problem I've encountered in over a decade that I've been unable to solve by researching the forums and experimenting. I'll try to make this post as complete and concise as possible.
Problem:
This is happening at two locations, both Netgate 7100s running 25.07.1 (though the problem existed at least as far back as 24.11). DNSFilter client applications continuously ping 103.247.36.36 and 103.247.37.37 to determine connectivity. The problem is that these pings often do not get through (though sometimes they do and receive a reply). It seems related to the volume of pings being sent (both locations have 100+ clients). Pings to other addresses always work. DNS lookups against those servers also always seem to work. Notably, pinging those servers from pfSense itself (GUI or ssh) always works as well.
At any given time many clients have pings that work, but it is not isolated to any particular clients. If I leave a ping running, eventually it starts working. Once it's working, it continues working until it is stopped and some time elapses (presumably when the state expires).
PFSense Configurations:
All IPv4. No interfaces have IPv6 configured. "Allow IPv6" is UNchecked in Advanced->Networking.
No limiters/schedules/shapers at either location.
- Location A: freeradius, openvpn ('aws-wizard', 'ipsec-profile-wizard', 'Netgate_Firmware_Upgrade', and 'Nexus' are also installed.)
- Location B: No packages (however 'aws-wizard', 'ipsec-profile-wizard', 'Netgate_Firmware_Upgrade', and 'Nexus' are installed; are these 'stock'?)
Troubleshooting:
I've tried a lot, but I'll attempt to list everything I've done:
- Packet captures, both from the GUI and from ssh (tcpdump). I always see the packet on the LAN side, but I do not see it on the WAN side (unless it happens to be working, of course).
- Logging. I set up a rule to pass ICMP/any traffic to those servers and set it to log. Interestingly, I do not see ALL the ping requests logged, only the first one of a series (indicated as a pass).
- I tried an explicit outbound NAT rule.
- Toggled the state policy between Interface Bound and Floating.
- I've adjusted the ICMP timeouts in the Advanced firewall settings.
- I set net.inet.icmp.icmplim up to 2000.
- I set net.route.netisr_maxqlen to 4096.
- I set net.inet.ip.intr_queue_maxlen to 4000.
- I've enabled/disabled NIC hardware checksum offloading.
- I've tried increasing the state table sizes, though I've never seen the state table look anywhere close to full. The search rate might be "high", often around 5000/s.
- pfInfo shows a continuously growing number of "Blocked" packets out. I suspect this may be indicative of my problem, but I'm not sure where else to look.
- It appears to be a NAT issue, but I do not know where to find any outbound NAT logs.
- Of course I've searched and read countless forum posts here and elsewhere.
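One way to look for the suspected NAT problem is to group the firewall's ICMP state entries by their post-NAT (wire) key: if two internal clients end up sharing the same translated ICMP id toward the same server, their states collide. The sketch below parses pfctl -ss style lines; the sample lines, the addresses, and the exact output format are assumptions (pfctl typically shows the translated address with the original in parentheses), so adjust the pattern to what your box actually prints:

```python
import re
from collections import Counter

# Hypothetical pfctl -ss style lines (format and addresses are made up
# for illustration; verify against your actual pfctl output).
sample = """\
all icmp 203.0.113.1:9 (10.100.80.70:1) -> 103.247.37.37:9       0:0
all icmp 203.0.113.1:9 (10.100.80.120:1) -> 103.247.37.37:9       0:0
all icmp 203.0.113.1:4 (10.100.80.129:1) -> 103.247.37.37:4       0:0
all icmp 203.0.113.1:7 (10.52.0.10:1) -> 9.9.9.9:7       0:0
"""

# Wire key after NAT: (translated src:id, dst:id). Two internal hosts
# mapping to the same translated id toward the same server collide.
pat = re.compile(r"icmp (\S+) \((\S+)\) -> (\S+)")

wire_keys = Counter()
for line in sample.splitlines():
    m = pat.search(line)
    if m:
        translated, original, dst = m.groups()
        wire_keys[(translated, dst)] += 1

for key, n in wire_keys.items():
    flag = "  <-- collision" if n > 1 else ""
    print(key, n, flag)
```

With the sample above, the two states toward 103.247.37.37:9 share one wire key, which is the kind of duplication worth hunting for in a real state dump.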
Additional:
We have other locations running 23.01 (also 7100s) that do not appear to have this problem, but they also have fewer clients generating these pings. On those, pfInfo shows far fewer blocked packets out on the WAN (and the count is not growing). All other settings/tunables are stock.
Summary:
I think the packets are being accepted on the local interface and passing the firewall rules, but then not being NAT'd out the WAN (except when it's working). Where to go from here?
Thanks in advance,
Nick -
tcpdump output below.
lagg1.52 is the LAN (actually a separate VLAN from the regular LAN, so it's easier to capture). lagg1.4090 is the WAN side. It's tricky to separate all the traffic, so I set the ping payload length to 60 so it's obvious which pings are from my test vs. the others.
When I ping 9.9.9.9 -l 60, I see it as length 102. When I ping 103.247.37.37 -l 60, I do not see it on the WAN side at all. It's also obvious on the LAN side that there's no reply from 103.247.37.37.
LAN side:
[25.07.1-RELEASE][admin@***]/root: tcpdump -i lagg1.52 -n host 10.52.0.10 -e
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lagg1.52, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:33:38.147038 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 103.247.37.37: ICMP echo request, id 1, seq 13345, length 68
15:33:42.927090 {WAN} > 00:e0:ed:f0:5b:a2, ethertype ARP (0x0806), length 56: Request who-has 10.52.0.1 (00:e0:ed:f0:5b:a2) tell 10.52.0.10, length 42
15:33:42.927098 00:e0:ed:f0:5b:a2 > {WAN}, ethertype ARP (0x0806), length 42: Reply 10.52.0.1 is-at 00:e0:ed:f0:5b:a2, length 28
15:33:42.927183 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 103.247.37.37: ICMP echo request, id 1, seq 13346, length 68
15:33:47.941896 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 103.247.37.37: ICMP echo request, id 1, seq 13347, length 68
15:33:52.753096 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 9.9.9.9: ICMP echo request, id 1, seq 13348, length 68
15:33:52.767568 00:e0:ed:f0:5b:a2 > {WAN}, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > 10.52.0.10: ICMP echo reply, id 1, seq 13348, length 68
15:33:53.777293 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 9.9.9.9: ICMP echo request, id 1, seq 13349, length 68
15:33:53.791564 00:e0:ed:f0:5b:a2 > {WAN}, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > 10.52.0.10: ICMP echo reply, id 1, seq 13349, length 68
15:33:54.793003 {WAN} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 10.52.0.10 > 9.9.9.9: ICMP echo request, id 1, seq 13350, length 68
15:33:54.807614 00:e0:ed:f0:5b:a2 > {WAN}, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > 10.52.0.10: ICMP echo reply, id 1, seq 13350, length 68
WAN side (only relevant traffic and obscured WAN IP/MAC):
9.9.9.9:
[25.07.1-RELEASE][admin@***]/root: tcpdump -i lagg1.4090 -n net 9.9.9.9/32 -e
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lagg1.4090, link-type EN10MB (Ethernet), snapshot length 262144 bytes
15:38:15.102894 00:e0:ed:f0:5b:a2 > {WAN GW}, ethertype IPv4 (0x0800), length 102: {WAN IP} > 9.9.9.9: ICMP echo request, id 1, seq 13351, length 68
15:38:15.117499 {WAN GW} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > {WAN IP}: ICMP echo reply, id 1, seq 13351, length 68
15:38:16.114749 00:e0:ed:f0:5b:a2 > {WAN GW}, ethertype IPv4 (0x0800), length 102: {WAN IP} > 9.9.9.9: ICMP echo request, id 1, seq 13352, length 68
15:38:16.129052 {WAN GW} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > {WAN IP}: ICMP echo reply, id 1, seq 13352, length 68
15:38:17.130690 00:e0:ed:f0:5b:a2 > {WAN GW}, ethertype IPv4 (0x0800), length 102: {WAN IP} > 9.9.9.9: ICMP echo request, id 1, seq 13353, length 68
15:38:17.145017 {WAN GW} > 00:e0:ed:f0:5b:a2, ethertype IPv4 (0x0800), length 102: 9.9.9.9 > {WAN IP}: ICMP echo reply, id 1, seq 13353, length 68
103.247.37.37:
[25.07.1-RELEASE][admin@***]/root: tcpdump -i lagg1.4090 -n net 103.247.37.37/32 -e
// nothing found with length 102
-
OK, new discovery. I ran this command:
pfctl -x loud
...and the firewall promptly locked up (I'm remote, so the VPN dropped). A few minutes later it came back up (thankfully). A review of the system log shows thousands of messages like this:
Nov 20 16:31:53 pfSense kernel: pf: wire key attach failed on lagg1.4090: :3ICMP out wire: 103.247.37.37:8 {WAN IP}:9 0:0 @49, existing: ICMP out wire: 103.247.37.37:8 {WAN IP}:9 stack: 103.247.37.37:8 10.100.80.70:9 0:0 @49
Nov 20 16:31:53 pfSense kernel: {WAN IP}pf: BAD state: :3 stack: 103.247.37.37:8 10.100.80.120ICMP out wire: 103.247.36.36:8 {WAN IP}:1 0:0 @49, existing: ICMP out wire: 103.247.36.36:8 {WAN IP}:1 stack: 103.247.36.36ICMP out wire: 103.247.36.36:8 {WAN IP}:9 0:0 @49, existing: ICMP out wire: 103.247.36.36:8 {WAN IP}:9 stack: 103.247.36.36:8 10.100.80.70:9 0:0 @49
Nov 20 16:31:53 pfSense kernel: pf: wire key attach failed on lagg1.4090: 0:0 @49, existing: ICMP out wire: 103.247.37.37:8 out wire: 103.247.37.37:8 {WAN IP}:4 0:0 @49, existing: ICMP out wire: 103.247.37.37:8 {WAN IP}:4 stack: 103.247.37.37:8 10.100.80.129:4 0:0 @49
Nov 20 16:31:53 pfSense kernel: pf: wire key attach failed on lagg1.4090: pf: wire key attach failed on lagg1.4090: pf: wire key attach failed on lagg1.4090: pf: wire key attach failed on lagg1.4090: ICMPICMP out wire: 103.247.37.37:8 {WAN IP}:3ICMP out wire: 103.247.37.37:8 {WAN IP}:12 0:0 @49, existing: ICMP out wire: 103.247.37.37:8 {WAN IP}:12 stack: 103.247.37.37:8 10.100.80.221:12 0:0 @49
I believe this is clearly related to my problem, and my guess is that when I enabled "loud" debug there were so many messages to be logged that the system got overwhelmed. The debug level is back to "urgent" and these messages are no longer being logged.
So what does that error mean and what can I do about it...?
-
@njc said in Outbound ping problem to DNS Filter servers:
I do not see ALL the ping requests logged, only the first one of a series
After that the state is open so there is not a "new" connection being made.
Overall, are these Windows PCs, and did you make NAT changes? There is an edge-case bug in FreeBSD when multiple devices ping the same host:
@stephenw10 said in Can't ping the same IP from multiple devices:
if you have 1:1 NAT (or static ports outbound NAT) then only one internal system can open a unique state
It is fixed so maybe it's in 25.11? pfSense release notes don't normally call out FreeBSD bugs IIRC.
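That failure mode can be sketched with a toy model: pf keys an outbound ICMP state on the post-NAT (wire) tuple, so if outbound NAT preserves the client's ICMP id and two internal hosts ping the same server with the same id (the captures above show Windows clients all using id 1), the second state insertion fails, much like the "wire key attach failed" messages. This is an illustration only, not pf's actual data structures:

```python
# Toy model of pf's wire-key uniqueness for NATed ICMP states.
# Not pf's real implementation; it only illustrates why NAT that
# keeps the client's ICMP id collides when many hosts ping one server.

state_table = {}  # wire key -> (internal ip, original icmp id)

def add_icmp_state(internal_ip, icmp_id, server_ip, wan_ip, static_id=True):
    # With a static id the wire key reuses the client's ICMP id;
    # otherwise pretend NAT rewrites it to a unique value.
    wire_id = icmp_id if static_id else len(state_table) + 1000
    wire_key = ("icmp", wan_ip, wire_id, server_ip)
    if wire_key in state_table:
        return False  # "wire key attach failed": state already exists
    state_table[wire_key] = (internal_ip, icmp_id)
    return True

# Two clients, both using ICMP id 1, pinging the same server:
print(add_icmp_state("10.100.80.70", 1, "103.247.37.37", "198.51.100.2"))   # True
print(add_icmp_state("10.100.80.120", 1, "103.247.37.37", "198.51.100.2"))  # False
print(add_icmp_state("10.100.80.120", 1, "103.247.37.37", "198.51.100.2",
                     static_id=False))                                      # True
```

The second call failing while the rewritten-id call succeeds mirrors why only one client's ping "works" at a time until its state expires.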
-
Update. I believe this is my issue.
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=283795
Next step is to load up a spare 7100 with 25.11.r.20251118.1708 (FreeBSD-16.0-CURRENT), apply our config, and swap the cables over. That way, if that RC has other issues, we can go back...
-
cross-post. Thank you @SteveITS !
-
@njc
If you have ZFS on this 7100 you can revert to a 25.07 boot environment. But a spare works too.
This one drove me nuts for a while.
-
@SteveITS Thanks. I'd upvote you but I don't have enough street cred yet :)
-
@njc :) here's a couple