WireGuard client on LAN fails to recover after WAN failover. How to reproduce a similar problem using policy routing.

arkanoid

This is not a problem with WireGuard running on pfSense, but with a WireGuard client on the LAN that uses pfSense as its gateway.

Situation:

pfSense 2.6.0 with WAN1 and WAN2 as failover. State Killing on Gateway Failure is enabled.
WireGuard client on LAN host C1 with 4 Mbps symmetric traffic (MULTIPLE:MULTIPLE state)
Under normal conditions, C1 connects to the remote WireGuard server S1:50000/udp via WAN1

When WAN1 goes down, C1's UDP packets are routed to S1 via WAN2, the states are updated and the tunnel keeps working.

When WAN1 comes back UP, C1's UDP packets never leave the firewall, neither via WAN1 nor via WAN2 (checked with tcpdump). The states still refer to WAN2, like "WAN2:23169 (C1:48614) -> S1:50000". C1 sends UDP packets to the firewall, S1 sends packets to the firewall, C1 receives packets from S1, but S1 does not receive packets from C1.
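A check like the tcpdump one mentioned above could be sketched as follows. This is only an illustration: the interface names (re3 for LAN, re2 for WAN1, re0 for WAN2) are taken from the rule output later in this thread, and the helper names wg_filter and watch_wg are made up for this example.

```shell
#!/bin/sh
# Build a tcpdump filter for the WireGuard flow to a given server and UDP port.
# wg_filter is a hypothetical helper name, not part of pfSense.
wg_filter() {
    printf 'udp and host %s and port %s\n' "$1" "$2"
}

# Capture a few packets of that flow on one interface (run as root on the firewall).
# $1 = interface, $2 = server address, $3 = server UDP port
watch_wg() {
    tcpdump -ni "$1" -c 20 "$(wg_filter "$2" "$3")"
}
```

Running watch_wg on re3 with the address of S1 and port 50000, then on re2 and re0, shows where the flow stops. If C1's packets appear on re3 but on neither WAN interface, that matches the symptom described here.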

How to reproduce a similar problem using policy routing:

Use policy routing to make C1 use WAN1 as gateway and start WireGuard traffic to/from S1. You will have a working WireGuard tunnel C1 -> S1 via WAN1.
Change the policy routing rule to make WAN2 the new gateway: WireGuard C1 -> S1 keeps working, still via WAN1.
Manually delete the state "WAN1:port (C1:anotherport) -> S1:50000": the VPN goes down forever. C1 sends UDP packets to the firewall, S1 sends packets to the firewall (WAN1), C1 receives packets from S1 still via WAN1, but S1 does not receive any packets from C1.

Basically, S1 never updates its "endpoint" from WAN1 to WAN2 because it never receives any packets from WAN2, so it keeps sending handshake requests to WAN1.

Any workarounds/solutions?

WireGuard starts working again after restarting it on C1. The restart changes the outgoing UDP port, which triggers a state change.
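Since the relevant effect of the restart is the new source port, the same thing might be achievable without a full restart by changing the listen port on C1. The following is an untested sketch. It assumes C1 runs the standard wg(8) tool and that the interface is called wg0 (the name does not appear in this thread). The helper names next_port and bump_wg_port are made up for this example.

```shell
#!/bin/sh
# Pick the next UDP port in the dynamic range 49152-65535, wrapping around.
next_port() {
    p=$(( $1 + 1 ))
    if [ "$p" -gt 65535 ] || [ "$p" -lt 49152 ]; then
        p=49152
    fi
    echo "$p"
}

# Move the WireGuard interface to a new source port without a full restart.
# Assumption: the interface is called wg0 unless another name is passed as $1.
bump_wg_port() {
    iface="${1:-wg0}"
    current=$(wg show "$iface" listen-port) || return 1
    wg set "$iface" listen-port "$(next_port "$current")"
}
```

Packets from the new source port do not match the old state, so the firewall has to evaluate them against the current rules. Whether that is enough to recover in every failover case is not confirmed here.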

arkanoid

Please consider moving this to the Routing/MultiWAN topic if you find it more appropriate.

arkanoid

I've made a clearer example of the behavior described in my first post.

Starting condition

# pfctl -s all | grep 192.168.0.80 | grep 50001
pass in quick on re3 route-to (re2 WAN1_GW) inet proto udp from 192.168.0.80 to WG_SERVER port = 50001 keep state label "USER_RULE: Testing wireguard states" ridentifier 1656796548
all udp WG_SERVER:50001 <- 192.168.0.80:58125       MULTIPLE:MULTIPLE
all udp WAN1:17552 (192.168.0.80:58125) -> WG_SERVER:50001       MULTIPLE:MULTIPLE

^ I can kill any of these states in any order; they are re-created immediately and the WireGuard connection remains stable.

Edit the policy-based routing rule to force 192.168.0.80 to route packets via WAN2 instead of WAN1.

# pfctl -s all | grep 192.168.0.80 | grep 50001
pass in quick on re3 route-to (re0 WAN2_GW) inet proto udp from 192.168.0.80 to WG_SERVER port = 50001 keep state label "USER_RULE: Testing wireguard states" ridentifier 1656796548
all udp WG_SERVER:50001 <- 192.168.0.80:58125       MULTIPLE:MULTIPLE
all udp WAN1:17552 (192.168.0.80:58125) -> WG_SERVER:50001       MULTIPLE:MULTIPLE

WireGuard is still working via WAN1, even though the new rule is in place.

Cases:
A) If I kill the second state, it is re-created as-is and WireGuard keeps working via WAN1.
B) If I kill the first state, WireGuard stops working forever (apparently no timeout). If I then also kill the second state, it starts working again via WAN2.
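The mismatch shown above (the rule points at WAN2 while the NAT state still leaves via WAN1) could be spotted with a small filter over the state table. This is a minimal sketch that assumes the state lines have the format shown in this post. The helper name stale_states is made up, and it handles IPv4 addresses only.

```shell
#!/bin/sh
# Print NAT states of CLIENT whose translated (egress) address differs from EXPECTED.
# Reads `pfctl -ss` style lines on stdin; IPv4 only.
# stale_states is a hypothetical helper name.
stale_states() {
    awk -v c="$1" -v e="$2" '
        # NAT state line: all udp EGRESS:port (CLIENT:port) -> DST:port STATE
        index($4, "(" c ":") == 1 {
            split($3, egress, ":")
            if (egress[1] != e) print
        }'
}
```

On the firewall it would be fed from the state table, for example by piping `pfctl -ss` into `stale_states 192.168.0.80` followed by the WAN2 address. With the states listed above and WAN2 as the expected egress, it prints the "WAN1:17552 (192.168.0.80:58125) -> WG_SERVER:50001" line.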

arkanoid

I've found that the state simulated above occurs when failover events happen in rapid succession.

I mean that if the tier 1 WAN goes DOWN and UP in rapid succession, the rules/state update logic hangs as in the example above. If it happens more slowly, UDP connections keep living on tier 2.