Multi-WAN setup with OpenVPNs flaky

cmsdloma

Hi,

I have pfSense 2.4.5, and I recently added another WAN interface for 5G mobile data and set up a gateway group, and I got everything working. However, it's been very unreliable lately, and I've spent day and night trying to fix it. I'm going mad now, so I've joined the forums to ask for help.

I have always had an OpenVPN client (now AirVPN via UDP), but I created a 2nd instance for the 2nd WAN. Each WAN has 1 OpenVPN client associated with it. The two OpenVPN client interfaces are members of a Gateway Group. The firewall rules send most LAN traffic out to the Gateway Group for combined bandwidth.

Because the 5G reception is flaky (separate issue being sorted out), sometimes that OpenVPN client goes down. I've configured the Gateway to recognize this via the "High latency or packet loss" option, and it goes down for short periods a few times an hour.

All good up to this point, but here's what happens. When the 5G gateway goes down, all outbound traffic stops, even over my old broadband WAN. My SSH session to the firewall gets killed. The traffic flow graph on the dashboard page shows traffic going outbound, but nothing coming back inbound. When I try to ping any IP from a LAN client, I get timeouts or "destination host unreachable". On the pfSense shell, I get "no route to host". Everything completely packs in. If I wait a few minutes it comes back up and eventually goes back to normal. But all VoIP/video calls are killed. Sometimes it happens so frequently that it becomes impossible to make stable calls.

When things are working, and I print the routes, there is never a default route. Is this normal with a Multi-WAN setup? When I try to install an additional package from the front end, the list of available packages is empty. When I try to do pkg update from the shell, I get an error: no route to host (even with both WANs up). When I do a fresh install and restore my configuration, I always get an error that packages can't be (re)installed because there was no internet access.

I'm convinced all of these problems point to routing being screwed somehow. When I run netstat -rWh, I get this:

[2.4.5-RELEASE][root@pfSense.int]/root: netstat -rWh
Routing tables

Internet:
Destination        Gateway         Flags  Use       Mtu    Netif   Expire
one.one.one.one    10.20.204.90    UGHS   174371    16384  lo0
one.one.one.one    10.24.200.64    UGHS   163068    16384  lo0
dns.google         192.168.8.1     UGHS   77886     1500   vtnet2
10.20.204.0/24     10.20.204.1     UGS    0         1500   ovpnc7
10.20.204.1        link#10         UH     0         1500   ovpnc7
10.20.204.90       link#10         UHS    0         16384  lo0
10.24.200.0/24     10.24.200.1     UGS    0         1500   ovpnc8
10.24.200.1        link#11         UH     0         1500   ovpnc8
10.24.200.64       link#11         UHS    19386     16384  lo0
localhost          link#4          UH     2639      16384  lo0
192.168.8.0/24     link#3          U      78        1500   vtnet2
192.168.8.253      link#3          UHS    0         16384  lo0
192.168.21.0/24    link#1          U      0         1500   vtnet0
192.168.21.253     link#1          UHS    0         16384  lo0
192.168.42.0/24    link#2          U      39938647  1500   vtnet1
pfSense            link#2          UHS    0         16384  lo0

Internet6:
Destination                       Gateway  Flags  Use  Mtu    Netif   Expire
localhost                         link#4   UH     0    16384  lo0
fe80::%vtnet0/64                  link#1   U      0    1500   vtnet0
fe80::20c:29ff:fe4f:8886%vtnet0   link#1   UHS    0    16384  lo0
fe80::%vtnet1/64                  link#2   U      0    1500   vtnet1
fe80::20c:29ff:fe4f:8890%vtnet1   link#2   UHS    0    16384  lo0
fe80::%vtnet2/64                  link#3   U      0    1500   vtnet2
fe80::24ae:e4ff:fed7:9170%vtnet2  link#3   UHS    0    16384  lo0
fe80::%lo0/64                     link#4   U      0    16384  lo0
fe80::1%lo0                       link#4   UHS    0    16384  lo0
fe80::%ovpns1/64                  link#9   U      0    1500   ovpns1
fe80::2bd:16ff:fe1b:ff01%ovpns1   link#9   UHS    0    16384  lo0
fe80::20c:29ff:fe4f:8886%ovpnc7   link#10  UHS    0    16384  lo0
fe80::20c:29ff:fe4f:8886%ovpnc8   link#11  UHS    0    16384  lo0

(All IPv6 is turned off).
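
In case it helps anyone reproduce the check, here is a rough Python sketch that scans pasted netstat -r output for a default route in the IPv4 section. It is not pfSense tooling: the function name has_default_route and the SAMPLE excerpt are illustrative assumptions, and it only looks for a destination literally shown as "default" (or 0.0.0.0/0), which is how I'd expect FreeBSD to print it.

```python
# Hypothetical helper: report whether netstat -r style output lists a default route.
SAMPLE = """\
Routing tables

Internet:
Destination Gateway Flags Use Mtu Netif Expire
one.one.one.one 10.20.204.90 UGHS 174371 16384 lo0
dns.google 192.168.8.1 UGHS 77886 1500 vtnet2
10.20.204.0/24 10.20.204.1 UGS 0 1500 ovpnc7
192.168.42.0/24 link#2 U 39938647 1500 vtnet1

Internet6:
Destination Gateway Flags Use Mtu Netif Expire
localhost link#4 UH 0 16384 lo0
"""


def has_default_route(netstat_output, family="Internet"):
    """Return True if the chosen section ("Internet" or "Internet6") has a default route."""
    in_section = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        # Section headers are single words ending in a colon, e.g. "Internet:".
        if stripped.endswith(":") and " " not in stripped:
            in_section = stripped[:-1] == family
            continue
        if not in_section or not stripped:
            continue
        destination = stripped.split()[0]
        if destination in ("default", "0.0.0.0/0"):
            return True
    return False


if __name__ == "__main__":
    print("default route present:", has_default_route(SAMPLE))
```

Run against the excerpt above it reports that no default route is present, which matches what I see in the full table.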

The Interfaces are:

DMZ (wan) -> vtnet0 -> v4: 192.168.21.253/24 (my telephone broadband WAN)
LAN (lan) -> vtnet1 -> v4: 192.168.42.253/24 (my internal LAN)
OPENVPNCLIENTDMZ (opt1) -> ovpnc7 -> v4: 10.20.204.90/24 (my broadband OpenVPN client)
OPENVPNSERVER (opt2) -> ovpns1 -> A Server I run (not an issue)
OPENVPNLANBRIDGEINTERFACE (opt3) -> bridge0 -> (Bridge for my server)
HUAWEI (opt4) -> vtnet2 -> v4: 192.168.8.253/24 (my 5G mobile WAN)
OPENVPNCLIENTHUAWEI (opt5) -> ovpnc8 -> v4: 10.24.200.64/24 (my 5G mobile OpenVPN client)

I use 8.8.8.8 as a monitor IP on vtnet2. I have already disabled the Monitoring on wan - it's considered always up. The OpenVPN client interfaces use the P2P tunnel end as a monitor IP. opt4 goes down occasionally, but this is not in a GW group. opt5 goes down when opt4 goes down.

I guess my questions are:

How can I approach fixing my routes? Why does all connectivity seize up every time one of the VPNs goes down? How can I fix my packages, given that there's no default route?

I'm happy to post config or anything else, but the whole XML config backup is quite big and I'll have to strip out the certs etc. Let me know if it's needed.

Thanks in advance.

Dave

cmsdloma

The main cause of everything hanging up was this option:

System -> Advanced -> Miscellaneous
"Flush all states when a gateway goes down"

I had checked this option years ago when I only had one WAN interface.

I'm still unsure about the missing default route, though.

cmsdloma

I'm still having severe problems with routing.

When I ping 1.1.1.1 or 1.0.0.1 from the pfSense shell, it goes into a routing loop and exhausts the TTL.

When I ping 8.8.8.8 or 8.8.4.4, I often get "no route to host". Sometimes it works.
But if I specify the source address, it works well:

[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 8.8.4.4
PING 8.8.4.4 (8.8.4.4) from 10.20.204.90: 56 data bytes
64 bytes from 8.8.4.4: icmp_seq=0 ttl=116 time=21.044 ms
64 bytes from 8.8.4.4: icmp_seq=1 ttl=116 time=20.887 ms
64 bytes from 8.8.4.4: icmp_seq=2 ttl=116 time=21.234 ms
64 bytes from 8.8.4.4: icmp_seq=3 ttl=116 time=21.606 ms

[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 8.8.8.8
PING 8.8.8.8 (8.8.8.8) from 10.20.204.90: 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=116 time=21.235 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=116 time=20.973 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=116 time=21.790 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=116 time=21.884 ms

round-trip min/avg/max/stddev = 20.973/21.486/22.240/0.308 ms
[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 1.1.1.1
PING 1.1.1.1 (1.1.1.1) from 10.20.204.90: 56 data bytes
64 bytes from 1.1.1.1: icmp_seq=0 ttl=58 time=15.984 ms
64 bytes from 1.1.1.1: icmp_seq=1 ttl=58 time=15.907 ms
64 bytes from 1.1.1.1: icmp_seq=2 ttl=58 time=15.715 ms
64 bytes from 1.1.1.1: icmp_seq=3 ttl=58 time=15.637 ms

[2.4.5-RELEASE][root@pfSense.int]/root: ping -S 10.20.204.90 1.0.0.1
PING 1.0.0.1 (1.0.0.1) from 10.20.204.90: 56 data bytes
64 bytes from 1.0.0.1: icmp_seq=0 ttl=58 time=15.852 ms
64 bytes from 1.0.0.1: icmp_seq=1 ttl=58 time=16.028 ms
64 bytes from 1.0.0.1: icmp_seq=2 ttl=58 time=16.030 ms
64 bytes from 1.0.0.1: icmp_seq=3 ttl=58 time=15.974 ms

Here's the end of the output from pinging without the source address:

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  05  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  04  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  03  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  02  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  01  01 0000 127.0.0.1  1.1.1.1

What's going on!?
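
For what it's worth, a quick way to confirm from the pasted output that this really is a loop on the box itself: pull the TTL out of each echoed IP header and check that it drops by one per hop until it hits 1. This is only a sketch in Python; the function names are made up for illustration, and it assumes the TTL column is printed in hex like the other header fields.

```python
# Hypothetical helpers: extract TTLs from the IP headers that ping echoes back.
SAMPLE = """\
36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  05  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  04  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  03  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Redirect Host(New addr: 10.20.204.90)
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  02  01 0000 127.0.0.1  1.1.1.1

36 bytes from localhost (127.0.0.1): Time to live exceeded
Vr HL TOS  Len   ID Flg  off TTL Pro  cks      Src      Dst
 4  5  00 0054 77e2   0 0000  01  01 0000 127.0.0.1  1.1.1.1
"""


def echoed_ttls(ping_output):
    """Collect the TTL field from each echoed IPv4 header row."""
    ttls = []
    for line in ping_output.splitlines():
        fields = line.split()
        # Header rows have 12 fields: Vr HL TOS Len ID Flg off TTL Pro cks Src Dst.
        if len(fields) == 12 and fields[0] == "4":
            ttls.append(int(fields[7], 16))  # assumption: TTL is shown in hex
    return ttls


def counts_down_to_one(ttls):
    """True if each TTL is exactly one less than the previous and the last is 1."""
    if len(ttls) < 2 or ttls[-1] != 1:
        return False
    return all(a - b == 1 for a, b in zip(ttls, ttls[1:]))


if __name__ == "__main__":
    ttls = echoed_ttls(SAMPLE)
    print(ttls, "loop-like:", counts_down_to_one(ttls))
```

Every one of those headers has the same ID and a source of 127.0.0.1, so it looks like the same packet is being bounced around locally via lo0 rather than ever leaving through a WAN or VPN interface.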