Multi-WAN setup with OpenVPNs flaky
I have pfSense 2.4.5, and I recently added another WAN interface for 5G mobile data and set up the gateway group, and I got everything working. However, it's been very unreliable lately, and I've spent day and night trying to fix it. I'm going mad now, so I've joined the forums for help, please.
I have always had an OpenVPN client (now AirVPN via UDP), but I created a 2nd instance for the 2nd WAN. Each WAN has 1 OpenVPN client associated with it. The two OpenVPN client interfaces are members of a Gateway Group. The firewall rules send most LAN traffic out to the Gateway Group for combined bandwidth.
Because the 5G reception is flaky (separate issue being sorted out), sometimes that OpenVPN client goes down. I've configured the Gateway to recognize this via the "High latency or packet loss" option, and it goes down for short periods a few times an hour.
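To make the trigger behaviour concrete, here is a rough Python sketch of how a "latency or packet loss" style check decides whether a gateway group member stays in use. The threshold numbers, the `Thresholds` class, `member_is_usable` and the sample readings are all made up for illustration; they are not read from my config and this is not pfSense's actual code.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Placeholder limits for illustration only; not taken from a real config.
    high_latency_ms: float = 500.0
    high_loss_pct: float = 20.0


def member_is_usable(latency_ms, loss_pct, online, limits=Thresholds()):
    """Return False when a 'latency or packet loss' style trigger would fire."""
    if not online:
        return False
    if latency_ms >= limits.high_latency_ms:
        return False
    if loss_pct >= limits.high_loss_pct:
        return False
    return True


# Hypothetical monitor readings: (name, latency in ms, loss in %, online).
samples = [
    ("VPN over broadband", 21.0, 0.0, True),
    ("VPN over 5G", 640.0, 35.0, True),
]
usable = [name for name, lat, loss, up in samples
          if member_is_usable(lat, loss, up)]
print(usable)
```

With readings like these, only the broadband VPN would stay in the group, which is the behaviour I expect when the 5G link degrades.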
All good up until now, but what happens is this. When the 5G gateway goes down, all outbound traffic stops, even over my old broadband WAN. My SSH session to the firewall gets killed. The traffic flow graph on the dashboard page shows traffic going outbound, but nothing coming back inbound. When I try to ping any IP from a LAN client, I get timeouts or destination host unreachable. On the pfSense shell, I get no route to host. Everything completely packs in. If I wait a few minutes it comes back up and eventually goes back to normal. But all VoIP/video calls are killed. Sometimes it happens so frequently that it becomes impossible to make stable calls.
When things are working and I print the routes, there is never a default route. Is this normal with a Multi-WAN setup? When I try to install an additional package from the front end, the list of available packages is empty. When I try to do pkg update from the shell, I get an error: no route to host (even with both WANs up). When I do a fresh install and restore my configuration, I always get an error that packages can't be (re)installed because there was no internet access.
I'm convinced these problems all point to routing being screwed somehow. When I run netstat -rWh, I get this:
[2.4.5-RELEASE][root@pfSense.int]/root: netstat -rWh
Destination Gateway Flags Use Mtu Netif Expire
one.one.one.one 10.20.204.90 UGHS 174371 16384 lo0
one.one.one.one 10.24.200.64 UGHS 163068 16384 lo0
dns.google 192.168.8.1 UGHS 77886 1500 vtnet2
10.20.204.0/24 10.20.204.1 UGS 0 1500 ovpnc7
10.20.204.1 link#10 UH 0 1500 ovpnc7
10.20.204.90 link#10 UHS 0 16384 lo0
10.24.200.0/24 10.24.200.1 UGS 0 1500 ovpnc8
10.24.200.1 link#11 UH 0 1500 ovpnc8
10.24.200.64 link#11 UHS 19386 16384 lo0
localhost link#4 UH 2639 16384 lo0
192.168.8.0/24 link#3 U 78 1500 vtnet2
192.168.8.253 link#3 UHS 0 16384 lo0
192.168.21.0/24 link#1 U 0 1500 vtnet0
192.168.21.253 link#1 UHS 0 16384 lo0
192.168.42.0/24 link#2 U 39938647 1500 vtnet1
pfSense link#2 UHS 0 16384 lo0
Destination Gateway Flags Use Mtu Netif Expire
localhost link#4 UH 0 16384 lo0
fe80::%vtnet0/64 link#1 U 0 1500 vtnet0
fe80::20c:29ff:fe4f:8886%vtnet0 link#1 UHS 0 16384 lo0
fe80::%vtnet1/64 link#2 U 0 1500 vtnet1
fe80::20c:29ff:fe4f:8890%vtnet1 link#2 UHS 0 16384 lo0
fe80::%vtnet2/64 link#3 U 0 1500 vtnet2
fe80::24ae:e4ff:fed7:9170%vtnet2 link#3 UHS 0 16384 lo0
fe80::%lo0/64 link#4 U 0 16384 lo0
fe80::1%lo0 link#4 UHS 0 16384 lo0
fe80::%ovpns1/64 link#9 U 0 1500 ovpns1
fe80::2bd:16ff:fe1b:ff01%ovpns1 link#9 UHS 0 16384 lo0
fe80::20c:29ff:fe4f:8886%ovpnc7 link#10 UHS 0 16384 lo0
fe80::20c:29ff:fe4f:8886%ovpnc8 link#11 UHS 0 16384 lo0
(All IPv6 is turned off).
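For anyone who wants to check the same two things in their own table, here is a small Python sketch. On FreeBSD a default route normally shows up with the destination `default`, so the first check just looks for that. The second check flags gateway routes (flag `G`) whose Netif is `lo0`; treating those as suspicious is my assumption, not a confirmed diagnosis. The function names are made up, and the sample rows are copied from the output above.

```python
def parse_routes(text):
    """Parse 'netstat -rWh' style rows into dicts."""
    routes = []
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, shell prompts and the column header row.
        if len(parts) < 6 or parts[0] == "Destination":
            continue
        dest, gateway, flags, _use, _mtu, netif = parts[:6]
        routes.append({"dest": dest, "gateway": gateway,
                       "flags": flags, "netif": netif})
    return routes


def has_default_route(routes):
    return any(r["dest"] == "default" for r in routes)


def gateway_routes_on_loopback(routes):
    # Assumption: a route that uses a gateway but leaves via lo0 is worth a look.
    return [(r["dest"], r["gateway"]) for r in routes
            if "G" in r["flags"] and r["netif"] == "lo0"]


sample = """\
Destination Gateway Flags Use Mtu Netif Expire
one.one.one.one 10.20.204.90 UGHS 174371 16384 lo0
one.one.one.one 10.24.200.64 UGHS 163068 16384 lo0
dns.google 192.168.8.1 UGHS 77886 1500 vtnet2
10.20.204.0/24 10.20.204.1 UGS 0 1500 ovpnc7
192.168.42.0/24 link#2 U 39938647 1500 vtnet1
"""

routes = parse_routes(sample)
print("default route present:", has_default_route(routes))
print("gateway routes on lo0:", gateway_routes_on_loopback(routes))
```

On these rows it reports no default route and picks out the two one.one.one.one entries, whose gateways are the local addresses of ovpnc7 and ovpnc8.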
The Interfaces are:
DMZ (wan) -> vtnet0 -> v4: 192.168.21.253/24 (my telephone broadband WAN)
LAN (lan) -> vtnet1 -> v4: 192.168.42.253/24 (my internal LAN)
OPENVPNCLIENTDMZ (opt1) -> ovpnc7 -> v4: 10.20.204.90/24 (my broadband OpenVPN client)
OPENVPNSERVER (opt2) -> ovpns1 -> A Server I run (not an issue)
OPENVPNLANBRIDGEINTERFACE (opt3) -> bridge0 -> (Bridge for my server)
HUAWEI (opt4) -> vtnet2 -> v4: 192.168.8.253/24 (my 5G mobile WAN)
OPENVPNCLIENTHUAWEI (opt5) -> ovpnc8 -> v4: 10.24.200.64/24 (my 5G mobile OpenVPN client)
I use 188.8.131.52 as a monitor IP on vtnet2. I have already disabled the Monitoring on wan - it's considered always up. The OpenVPN client interfaces use the P2P tunnel end as a monitor IP. opt4 goes down occasionally, but this is not in a GW group. opt5 goes down when opt4 goes down.
I guess my questions are:
How can I approach fixing my routes? Why does all connectivity seize up every time one of the VPNs goes down? How can I fix my packages, given that there's no default route?
I'm happy to post config or anything, but the whole XML config backup is quite big and I'll have to strip out the certs etc. Let me know if it's needed.
Thanks in advance.
The main cause of everything hanging up was this option:
System -> Advanced -> Miscellaneous
"Flush all states when a gateway goes down"
I had checked this option years ago when I only had one WAN interface.
I still have doubts about the missing default route, though.
I'm still having severe problems with routing.
When I ping 184.108.40.206 or 220.127.116.11 from the pfSense shell, it goes into a routing loop and exhausts the TTL.
When I ping 18.104.22.168 or 22.214.171.124, I often get "no route to host". Sometimes it works.
But if I specify the source address, it works well:
[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 126.96.36.199
PING 188.8.131.52 (184.108.40.206) from 10.20.204.90: 56 data bytes
64 bytes from 220.127.116.11: icmp_seq=0 ttl=116 time=21.044 ms
64 bytes from 18.104.22.168: icmp_seq=1 ttl=116 time=20.887 ms
64 bytes from 22.214.171.124: icmp_seq=2 ttl=116 time=21.234 ms
64 bytes from 126.96.36.199: icmp_seq=3 ttl=116 time=21.606 ms
[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 188.8.131.52
PING 184.108.40.206 (220.127.116.11) from 10.20.204.90: 56 data bytes
64 bytes from 18.104.22.168: icmp_seq=0 ttl=116 time=21.235 ms
64 bytes from 22.214.171.124: icmp_seq=1 ttl=116 time=20.973 ms
64 bytes from 126.96.36.199: icmp_seq=2 ttl=116 time=21.790 ms
64 bytes from 188.8.131.52: icmp_seq=3 ttl=116 time=21.884 ms
round-trip min/avg/max/stddev = 20.973/21.486/22.240/0.308 ms
[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 184.108.40.206
PING 220.127.116.11 (18.104.22.168) from 10.20.204.90: 56 data bytes
64 bytes from 22.214.171.124: icmp_seq=0 ttl=58 time=15.984 ms
64 bytes from 126.96.36.199: icmp_seq=1 ttl=58 time=15.907 ms
64 bytes from 188.8.131.52: icmp_seq=2 ttl=58 time=15.715 ms
64 bytes from 184.108.40.206: icmp_seq=3 ttl=58 time=15.637 ms
[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 220.127.116.11
PING 18.104.22.168 (22.214.171.124) from 10.20.204.90: 56 data bytes
64 bytes from 126.96.36.199: icmp_seq=0 ttl=58 time=15.852 ms
64 bytes from 188.8.131.52: icmp_seq=1 ttl=58 time=16.028 ms
64 bytes from 184.108.40.206: icmp_seq=2 ttl=58 time=16.030 ms
64 bytes from 220.127.116.11: icmp_seq=3 ttl=58 time=15.974 ms
Here's the end of the output from pinging without the source address:
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 05 01 0000 127.0.0.1 18.104.22.168
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 04 01 0000 127.0.0.1 22.214.171.124
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 03 01 0000 127.0.0.1 126.96.36.199
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 02 01 0000 127.0.0.1 188.8.131.52
36 bytes from localhost (127.0.0.1): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 01 01 0000 127.0.0.1 184.108.40.206
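The reason I call this a loop: every quoted header carries the same IP ID (77e2) while the TTL counts down 05, 04, 03, 02, 01, so it looks like one packet being bounced around locally until it expires. Here is a rough Python sketch that pulls those two fields out of the dump. FreeBSD ping prints the TTL column in hex. The function names are my own, and the sample is the last three entries from the output above.

```python
def embedded_headers(text):
    """Return (ip_id, ttl) for each quoted IP header in ping's ICMP error output."""
    tokens = text.split()
    found = []
    for i in range(1, len(tokens)):
        # The 12 header values follow the 'Src Dst' end of the column header.
        if tokens[i] == "Dst" and tokens[i - 1] == "Src" and i + 12 < len(tokens):
            fields = tokens[i + 1:i + 13]
            ip_id = fields[4]         # ID column
            ttl = int(fields[7], 16)  # TTL column, printed in hex
            found.append((ip_id, ttl))
    return found


def looks_like_loop(headers):
    """Same packet ID seen repeatedly with a strictly falling TTL."""
    ids = {ip_id for ip_id, _ in headers}
    ttls = [ttl for _, ttl in headers]
    falling = all(a > b for a, b in zip(ttls, ttls[1:]))
    return len(headers) > 1 and len(ids) == 1 and falling


sample = """\
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 03 01 0000 127.0.0.1 126.96.36.199
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 02 01 0000 127.0.0.1 188.8.131.52
36 bytes from localhost (127.0.0.1): Time to live exceeded
Vr HL TOS Len ID Flg off TTL Pro cks Src Dst
4 5 00 0054 77e2 0 0000 01 01 0000 127.0.0.1 184.108.40.206
"""

headers = embedded_headers(sample)
print(headers)
print("loop suspected:", looks_like_loop(headers))
```

It works on whitespace-separated tokens, so it should not matter whether the dump is pasted on one line or several.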
What's going on!?