Gateway error on both WANs at same time
-
Hi, I'm new to pfSense.
With the help of some guides I set it up with two WANs in a failover group (one via a cabled router, the other via a wireless 4G LTE router).
On top of that I have 3 VPN connections in load balancing.
Everything works fine except that after a few days both WANs show a gateway error at the same time, both are removed from the routing table, and both come back after about 10 minutes.
I googled, but I can't believe that both routers drop their DHCP leases at the same time, and I can't believe that the two monitored IPs fail at the same time (one is the Cloudflare DNS 1.1.1.1, the other is Google's 8.8.4.4).
The hardware is a modded WatchGuard XTM 5 (flashed BIOS, upgraded RAM and CPU).
What is happening?
-
@valepe69
Update:
One thing I forgot: the two lines from the WAN routers go into a managed switch, pass through a VLAN into another managed switch, and the pfSense WANs are connected to two untagged ports.
-
@valepe69 Could it be the switch? If you look in the system logs at that time do you see the link states go up and down? You seem to have a bit of complexity between the pfSense box and the modems that could cause it. If they are going down then either the links are being lost or there is 100% packet loss. You have 2 switches in-between. My money is on one of them. Are they L2 or L3? Do you have routes built into them? Any way to bypass them and go straight into the modems?
-
@stewart where can I see the interface log?
In the system log there's no report of links going up and down.
Both switches are advanced L2 types, not full L3.
One is a MikroTik CSS326-24G-2S+RM, the other is a D-Link DGS-1210-28.
Between them there's a two-port LACP link.
Yes, I can bypass them. I set it up this way because the WatchGuard box will become a backup router. My idea is to put the WAN lines into VLANs and pass everything to a new server running virtualized pfSense.
-
@stewart update: I moved the pfSense WAN connections to the first switch, bypassing the trunk to the second one.
Nothing strange happened when I moved WAN2 (the 4G backup connection).
I waited until everything was stable, then I moved WAN1.
Well, it happens:
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: MONITOR: WAN1_DHCP has packet loss, omitting from routing group WAN_FAILOVER
Mar 17 16:30:03 php-fpm 9329 8.8.4.4|192.168.xxx.41|WAN1_DHCP|12.524ms|0.208ms|24%|down|highloss
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: Gateway, switch to: WAN2_DHCP
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: Default gateway setting Interface WAN2_DHCP Gateway as default.
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: Gateway, none 'available' for inet6, use the first one configured. ''
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN1_DHCP.
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: OpenVPN: Resync client1 my_first_vpn
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: MONITOR: WAN2_DHCP has packet loss, omitting from routing group WAN_FAILOVER
Mar 17 16:30:03 kernel arpresolve: can't allocate llinfo for 192.168.xxx.1 on em0
Mar 17 16:30:03 kernel arpresolve: can't allocate llinfo for 192.168.xxx.1 on em0
Mar 17 16:30:03 php-fpm 9329 1.1.1.1|192.168.yyy.91|WAN2_DHCP|45.021ms|10.854ms|21%|down|highloss
Mar 17 16:30:03 php-fpm 9329 /rc.openvpn: Gateway, switch to:
Mar 17 16:30:03 php-fpm 9329 OpenVPN terminate old pid: 42787
So it seems that the switches have nothing to do with my issue
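For anyone comparing the two gateways in a log like this, the pipe-separated status lines can be pulled apart with a few lines of Python. This is only a sketch: the field names (`rtt_ms`, `rtt_stddev_ms`, and so on) are my own guess at what each column means, not something documented in this thread, and the two sample lines are copied from the log above with the masked addresses left as they are.

```python
from dataclasses import dataclass


@dataclass
class GatewayStatus:
    monitor_ip: str
    source_ip: str
    gateway: str
    rtt_ms: float         # assumed meaning: average round-trip time
    rtt_stddev_ms: float  # assumed meaning: round-trip time deviation
    loss_pct: int
    status: str
    substatus: str


def parse_status(line: str) -> GatewayStatus:
    """Parse one pipe-separated gateway status line from the system log."""
    # The status record is the last whitespace-separated token on the line.
    record = line.split()[-1]
    monitor, source, gateway, rtt, stddev, loss, status, substatus = record.split("|")
    return GatewayStatus(
        monitor_ip=monitor,
        source_ip=source,
        gateway=gateway,
        rtt_ms=float(rtt[:-2]),            # strip the trailing "ms"
        rtt_stddev_ms=float(stddev[:-2]),  # strip the trailing "ms"
        loss_pct=int(loss.rstrip("%")),
        status=status,
        substatus=substatus,
    )


log_lines = [
    "Mar 17 16:30:03 php-fpm 9329 8.8.4.4|192.168.xxx.41|WAN1_DHCP|12.524ms|0.208ms|24%|down|highloss",
    "Mar 17 16:30:03 php-fpm 9329 1.1.1.1|192.168.yyy.91|WAN2_DHCP|45.021ms|10.854ms|21%|down|highloss",
]

for line in log_lines:
    s = parse_status(line)
    print(f"{s.gateway}: monitor {s.monitor_ip}, loss {s.loss_pct}%, {s.status}/{s.substatus}")
```

Both gateways are reported with more than 20% loss in the same second, which is what makes a simultaneous failure of two independent lines look unlikely.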
-
Is it predictable enough to do a packet capture? Does WAN1 going down trigger WAN2 going down?
-
@stewart said in Gateway error on both WANs at same time:
Is it predictable enough to do a packet capture?
How can I do that?
Does WAN1 going down trigger WAN2 going down?
It seems so but why?
-
@valepe69 said in Gateway error on both WANs at same time:
@stewart said in Gateway error on both WANs at same time:
Is it predictable enough to do a packet capture?
How can I do that?
Diagnostics->Packet Capture
Then you can view it in Wireshark. It's OK if you don't really understand what you are seeing as long as you understand which direction the traffic is flowing. Maybe ICMP is being blocked and the rest of the traffic is going through. Maybe you see the firewall sending out a ton of data and nothing coming back. The limitation is that you will only be able to capture from one port at a time unless you want to drop to the CLI, run tcpdump, and grab the files over WinSCP.
Does WAN1 going down trigger WAN2 going down?
It seems so but why?
No idea but would be useful for a clearer picture. Also, what if you just leave WAN2 unplugged? Does the problem continue or go away? Again, just for another point of information.
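If you go the CLI route, something like the sketch below builds one tcpdump command per WAN so both captures can run side by side. Treat the details as assumptions: `em0` shows up in your log, but `em1` for WAN2 and the `/tmp/*.pcap` output paths are placeholders I made up, and the filter only keeps ICMP to and from each monitor IP. The script just prints the commands, it doesn't run them.

```python
import shlex
from typing import List


def tcpdump_command(interface: str, monitor_ip: str, outfile: str) -> List[str]:
    """Build a tcpdump command that captures ICMP to/from one monitor IP."""
    return [
        "tcpdump",
        "-n",              # don't resolve addresses to names
        "-i", interface,   # capture on this interface
        "-w", outfile,     # write raw packets to a file for Wireshark
        "icmp", "and", "host", monitor_ip,
    ]


# Interface names and output paths are placeholders, adjust to your box.
captures = [
    ("em0", "8.8.4.4", "/tmp/wan1.pcap"),
    ("em1", "1.1.1.1", "/tmp/wan2.pcap"),
]

for interface, monitor_ip, outfile in captures:
    print(shlex.join(tcpdump_command(interface, monitor_ip, outfile)))
```

Each printed line can be pasted into its own shell session, and the resulting files opened in Wireshark afterwards.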
-
@stewart Another small update: after thinking about the issue I remembered that I had set the DNS servers to the same IPs as the monitor IPs of the two WANs.
So I changed the monitor IPs.
Now I can no longer reproduce the issue: if I unplug the WAN1 cable, WAN2 doesn't go down and instantly becomes the default.
But there is a big "but" now. Once the VPN connections have correctly switched from WAN1 to WAN2, they don't return to WAN1 when it becomes available again.
This is a big problem because WAN2 is only useful as a backup for the main line (it's slow, 30 Mbit/s, and it has a monthly data plan).
If I can't resolve this, I will have to stop handling the two WANs in pfSense and let the 4G router do all the work. It has two Ethernet ports (a WAN port for a broadband modem and a LAN port). It should handle the failover automatically, but I have never tested it.
-
@valepe69 That's because the route is still good as long as the VPNs are connected. If you disconnect and reconnect, they should connect back over the primary WAN link. They fail over to WAN2 because the WAN1 link goes down and they keep attempting to reconnect until a connection is made, in this case over WAN2. Once WAN1 comes back up, all new traffic will go over it, but existing connections aren't moved. Every gateway is usable for as long as it's up. The firewall only sends new traffic out on whichever one is selected, but if data is already going over WAN2 it won't terminate the connection. You would need to do it manually. Think about it: if the WAN2 gateway wasn't good, it would never be able to send pings out to monitor it. If you are in a company with hosted VoIP phones or trunks, would you want everyone's call to just suddenly drop when the primary internet comes back online? No, the transition needs to be graceful. Existing calls continue out their established route but new calls would go over the primary link that's come back up. Eventually everything moves over but it isn't a hard drop.
The only way to force everything to go over WAN1 would be to disconnect WAN2. Otherwise you would need to disconnect/reconnect the VPNs to get them to connect out over the new link. At least that is my understanding of how it works.
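To make that concrete, here is a toy model of the behaviour I'm describing. It is not how pfSense is implemented internally, just a sketch of the idea that a connection stays on the gateway it was established on until it drops or is reconnected; the `Router` class and the connection names are invented for the example.

```python
class Router:
    """Toy failover model: connections stick to the gateway they started on."""

    def __init__(self, preferred_order):
        self.order = list(preferred_order)        # highest priority first
        self.up = {gw: True for gw in self.order}
        self.connections = {}                     # connection name -> gateway

    def active_gateway(self):
        """Return the highest-priority gateway that is currently up."""
        for gw in self.order:
            if self.up[gw]:
                return gw
        return None

    def connect(self, name):
        """(Re)establish a connection over whatever gateway is active now."""
        self.connections[name] = self.active_gateway()
        return self.connections[name]

    def set_link(self, gw, is_up):
        self.up[gw] = is_up
        if not is_up:
            # Connections on a dead gateway drop and reconnect elsewhere.
            for name, used in list(self.connections.items()):
                if used == gw:
                    self.connect(name)
        # When a gateway comes back up, existing connections are left alone.


router = Router(["WAN1", "WAN2"])
router.connect("vpn1")              # established over WAN1
router.set_link("WAN1", False)      # WAN1 fails, vpn1 reconnects over WAN2
router.set_link("WAN1", True)       # WAN1 returns, vpn1 stays where it is
print(router.connections["vpn1"])   # WAN2
print(router.connect("new-flow"))   # WAN1, new traffic prefers the primary
print(router.connect("vpn1"))       # WAN1, only after a manual reconnect
```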
-
@valepe69 Also, what is happening over the VPN? You can check the monitoring logs to see how much data is used. If it's just access to local files and stuff then it may be quite low and not a concern. It could also depend on how you have the VPN configured. Do you force all traffic from your VPN users over it? If so, that will eat up a lot more data. Maybe RDP over VPN could be a good solution since there is very low data usage that way.
-
@stewart you're right, but I forgot to mention that this is a home setup, so if the line goes down for a few seconds it's no big deal. My primary goal is that the main traffic (both clear and VPN) prefers the WAN1 connection.
I should add that before I changed the monitor IPs, this switch happened without a glitch: WAN1 went down and everything moved to WAN2, and as soon as WAN1 reappeared, everything came back to it.
-
@valepe69 What were the monitor IPs before and after? I'm unclear what you were trying to communicate in your previous post?
-
@stewart said in Gateway error on both WANs at same time:
@valepe69 What were the monitor IPs before and after? I'm unclear what you were trying to communicate in your previous post?
Sorry. Before the change, the primary DNS servers were 8.8.4.4 and 1.1.1.1.
The WAN1 monitor IP was 8.8.4.4 and the WAN2 monitor IP was 1.1.1.1.
Then I changed the monitor IPs to 8.8.8.8 for WAN1 and 1.0.0.1 for WAN2.
Then I noticed this behavior. Maybe it's a coincidence, but before the change all the failover tests I made worked as expected.
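To show the overlap I mean, here is a tiny check using the addresses above. It is just an illustration of which monitor IPs were also DNS servers before and after my change; the function name is made up, and it says nothing about why the overlap would matter inside pfSense.

```python
dns_servers = {"8.8.4.4", "1.1.1.1"}

monitors_before = {"WAN1": "8.8.4.4", "WAN2": "1.1.1.1"}
monitors_after = {"WAN1": "8.8.8.8", "WAN2": "1.0.0.1"}


def shared_addresses(dns, monitors):
    """Return the gateways whose monitor IP is also used as a DNS server."""
    return {gw: ip for gw, ip in monitors.items() if ip in dns}


print("before:", shared_addresses(dns_servers, monitors_before))
print("after: ", shared_addresses(dns_servers, monitors_after))
```

Before the change both gateways shared their monitor IP with a DNS server, and after the change neither does.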
Added:
When the primary DNS servers were the same as the monitor IPs, I saw the VPN connections change from WAN1 to WAN2 and from WAN2 back to WAN1 automatically. Only after the change did I see them move from WAN1 to WAN2 and not revert to WAN1. Maybe it was a coincidence during my previous tests.
-
Well, I just ran another test, and this time everything worked fine:
- I unplugged WAN1, and all connections moved to WAN2
- I waited until everything had stabilized, then I replugged WAN1
- This time all connections went back to WAN1 as soon as it was online again
I don't know what to think.
I apologize for my English, it isn't my main language.
-
@valepe69 said in Gateway error on both WANs at same time:
Well, I just ran another test, and this time everything worked fine:
- I unplugged WAN1, and all connections moved to WAN2
- I waited until everything had stabilized, then I replugged WAN1
- This time all connections went back to WAN1 as soon as it was online again
I don't know what to think.
I guess that as long as it's working then you're all set. I don't know why changing those IPs made a difference and I wouldn't expect things to be moved back from WAN2 to WAN1. Maybe you have a reconnect period on the VPNs that's being hit and they reconnect over the new WAN1 connection? I don't know.
I apologize for my English, it isn't my main language.
No need to apologize, your English is spot on!
-
@stewart said in Gateway error on both WANs at same time:
I guess that as long as it's working then you're all set. I don't know why changing those IPs made a difference and I wouldn't expect things to be moved back from WAN2 to WAN1. Maybe you have a reconnect period on the VPNs that's being hit and they reconnect over the new WAN1 connection? I don't know.
Maybe an open connection keeps the VPN on WAN2, and it reverts to WAN1 when there's no activity.
Anyway, I'll look into this because I want to be sure that the VPN connections come back to WAN1.
Otherwise I have two options:
- implement the automatic failover on the 4G router (but I have no control over it)
- implement some scripting to force the return to WAN1
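For the scripting option, the decision part could look roughly like the sketch below. It is only the logic, not a working pfSense script: how to read the gateway status and how to restart the VPN client are left out, the "online" status string is an assumption (only "down" appears in my log), and the three-check hold-down is an arbitrary value meant to avoid flapping.

```python
def should_fail_back(history, vpn_gateway, preferred="WAN1_DHCP", stable_checks=3):
    """Decide whether the VPN should be reconnected over the preferred gateway.

    history      -- recent status samples of the preferred gateway, oldest first
    vpn_gateway  -- name of the gateway the VPN is currently using
    """
    recent = history[-stable_checks:]
    # Require several consecutive good samples before moving back.
    stable = len(recent) == stable_checks and all(s == "online" for s in recent)
    return stable and vpn_gateway != preferred


# WAN1 has been back for three checks and the VPN is still on WAN2.
print(should_fail_back(["down", "online", "online", "online"], "WAN2_DHCP"))  # True
# WAN1 only just came back, so wait.
print(should_fail_back(["online", "down", "online"], "WAN2_DHCP"))            # False
# The VPN is already on WAN1, nothing to do.
print(should_fail_back(["online", "online", "online"], "WAN1_DHCP"))          # False
```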
Thanks a lot