ESP sometimes using WAN interface alias IP instead of WAN interface IP
-
I have Netgate 6100s with an IPsec tunnel configured between the two sites.
Site A: 23.09.1-RELEASE
WAN IP: X.X.X.228/24 (routed)
Site B: 23.09.1-RELEASE
WAN IP: X.X.X.78/28 (bridged)
WAN aliases: X.X.X.66 .. X.X.X.77/8
The tunnel's P1 and P2 always negotiate successfully. Most of the time traffic passes successfully as well.
However, I experience random traffic drops. When that happens, I've restored traffic by stopping and restarting the IPsec service on both sides. Sometimes it takes multiple restarts.
This happened again this afternoon, and I spent some time attempting to diagnose the problem.
I collected a packet capture on both sides and found that ESP traffic from site B to site A was using one of the WAN aliases (X.X.X.66) rather than the primary WAN IP (X.X.X.78). I restarted it again and it used different WAN aliases (X.X.X.68, X.X.X.69, etc.). Sometime later, it started using X.X.X.78 again, and traffic began passing again.
This seems very similar to an issue that was reported and closed as non-reproducible (https://forum.netgate.com/topic/132468/).
-
For a brief moment, I thought I had this figured out.
My WAN aliases were configured as "Virtual IPs", so those addresses were known to the interface/kernel. I changed them to "Proxy ARP" (which is all I really needed).
Unfortunately this hasn't resolved the problem, even after a reboot.
03:32:19.362111 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x114), length 120
03:32:20.366429 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x115), length 120
03:32:21.371396 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x116), length 120
03:32:22.376840 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x117), length 120
03:32:23.384406 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x118), length 120
03:32:24.388246 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x119), length 120
03:32:24.600089 IP X.X.X.228 > X.X.X.78: ESP(spi=0xc023fef7,seq=0xb6), length 96
03:32:25.391084 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x11a), length 120
In this packet trace, the Site B source IP is X.X.X.66 instead of the expected X.X.X.78. Traffic from Site A back to Site B is correctly using the destination IP X.X.X.78.
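For anyone wanting to check a longer capture for the same symptom, here is a minimal sketch that scans tcpdump text output and counts ESP packets sent to the peer from any source other than the expected one. The function names (`esp_sources`, `unexpected_sources`) are my own and purely illustrative, and the regex assumes the one-line-per-packet format shown in the trace above.

```python
import re
from collections import Counter

# Matches tcpdump text lines such as:
# 03:32:19.362111 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x114), length 120
ESP_LINE = re.compile(
    r"^\S+ IP (?P<src>\S+) > (?P<dst>\S+): "
    r"ESP\(spi=(?P<spi>0x[0-9a-fA-F]+),seq=(?P<seq>0x[0-9a-fA-F]+)\)"
)


def esp_sources(lines, peer):
    """Count the source address of every ESP packet sent to `peer`."""
    counts = Counter()
    for line in lines:
        match = ESP_LINE.match(line.strip())
        if match and match.group("dst") == peer:
            counts[match.group("src")] += 1
    return counts


def unexpected_sources(lines, peer, expected_src):
    """Return {source: packet_count} for ESP sources other than `expected_src`."""
    counts = esp_sources(lines, peer)
    return {src: n for src, n in counts.items() if src != expected_src}


# Three lines taken from the trace above (addresses masked as in the post).
SAMPLE = [
    "03:32:24.388246 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x119), length 120",
    "03:32:24.600089 IP X.X.X.228 > X.X.X.78: ESP(spi=0xc023fef7,seq=0xb6), length 96",
    "03:32:25.391084 IP X.X.X.66 > X.X.X.228: ESP(spi=0xce9d0104,seq=0x11a), length 120",
]

if __name__ == "__main__":
    print(unexpected_sources(SAMPLE, peer="X.X.X.228", expected_src="X.X.X.78"))
```

An empty result would mean every ESP packet toward the peer left from the expected WAN address; any entry in the result is a source address worth investigating.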
I'm at a loss. Now that the WAN aliases have been changed to Proxy ARP, I can't understand why it wouldn't be using the single WAN address.
-
Ok, I believe I've found the root cause and it was a misconfiguration.
I have "Manual Outbound NAT" configured. This is to use a four-address pool instead of the firewall's WAN address for NAT.
So even though there was a wildcard "Auto Created Rule" for UDP port 500 (ISAKMP) using the WAN address, there wasn't a rule for the ESP protocol. I added a wildcard rule for ESP that used the WAN address last night, and haven't had any issues since.
In retrospect, this makes sense, since each time I lost connection, the source address was one of the addresses in the pool. What doesn't make sense to me is that it ever used the WAN IP address. Like so many things, a connection that always failed would have been a lot easier to diagnose than one that sometimes worked.
I think an argument could be made that if pfSense is going to add an Auto Created Outbound NAT rule for ISAKMP, it should probably create an Auto Created rule for ESP at the same time.
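To illustrate the reasoning above, here is a toy model of first-match outbound NAT rule selection. It is not pfSense code, and it does not explain why the WAN address was sometimes used anyway. The `NatRule` class, the `translate` function, and the rule contents are all hypothetical stand-ins, and the sketch assumes my manual pool rule matched any protocol from the firewall.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NatRule:
    protocol: str             # "udp", "esp", or "any"
    dst_port: Optional[int]   # None means "any port" (ESP has no ports)
    translation: str          # label for the source address or pool applied


def translate(rules, protocol, dst_port=None):
    """Return the translation of the first matching rule, top to bottom."""
    for rule in rules:
        if rule.protocol not in ("any", protocol):
            continue
        if rule.dst_port is not None and rule.dst_port != dst_port:
            continue
        return rule.translation
    return None  # no rule matched, so the packet is left untranslated


WAN = "WAN address (X.X.X.78)"
POOL = "NAT pool (WAN aliases)"

# Hypothetical rule set before the fix: ISAKMP is pinned to the WAN
# address, everything else falls through to the pool rule.
before = [
    NatRule("udp", 500, WAN),     # stands in for the auto-created ISAKMP rule
    NatRule("any", None, POOL),   # stands in for my manual pool rule
]

# After the fix: an ESP rule using the WAN address sits above the pool rule.
after = [NatRule("esp", None, WAN)] + before

if __name__ == "__main__":
    for name, rules in (("before", before), ("after", after)):
        print(name, "| IKE:", translate(rules, "udp", 500),
              "| ESP:", translate(rules, "esp"))
```

In this model IKE negotiation always leaves from the WAN address, which would be consistent with P1 and P2 coming up fine, while ESP falls through to the pool until the dedicated ESP rule is added.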