Gateway dropping but not reconnecting - Netgate 6100 with gateway groups
-
Hello
We have an Ethernet fibre leased line (through Openreach), which is presented on the usual (here in the UK) Adva NTE equipment.
Our Netgate 6100 pfsense+ connects to the Adva through SFP LC multimode fibre.
The WAN1 connection settings on the pfsense are set to static IP.
Last night we got an alert to say the pfsense was offline. We checked with the provider and no issues were found. We had someone onsite power-cycle the 6100 and everything came back up. Looking back at the system log we can see the following:
Mar 15 03:56:00 fw sshguard[78020]: Exiting on signal.
Mar 15 03:56:00 fw sshguard[10734]: Now monitoring attacks.
Mar 15 03:57:39 fw rc.gateway_alarm[54586]: >>> Gateway alarm: WAN_1_GW (Addr:1.1.1.1 Alarm:1 RTT:4.603ms RTTsd:.089ms Loss:22%)
Mar 15 03:57:39 fw check_reload_status[439]: updating dyndns WAN_1_GW
Mar 15 03:57:39 fw check_reload_status[439]: Restarting IPsec tunnels
Mar 15 03:57:39 fw check_reload_status[439]: Restarting OpenVPN tunnels/interfaces
Mar 15 03:57:39 fw check_reload_status[439]: Reloading filter
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: MONITOR: WAN_1_GW has packet loss, omitting from routing group WAN1_FIRST
Mar 15 03:57:40 fw php-fpm[86239]: 1.1.1.1|51.155.XX.XX|WAN_1_GW|4.603ms|0.046ms|24%|down|highloss
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: Gateway, switch to:
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_1_GW.
Mar 15 03:57:40 fw php-fpm[398]: /rc.filter_configure_sync: Gateway, switch to:
Mar 15 04:08:00 fw sshguard[10734]: Exiting on signal.
Mar 15 04:08:00 fw sshguard[10753]: Now monitoring attacks.
Mar 15 04:38:00 fw sshguard[10753]: Exiting on signal.
Mar 15 04:38:00 fw sshguard[45267]: Now monitoring attacks.
Mar 15 05:11:00 fw sshguard[45267]: Exiting on signal.
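In case it helps anyone reading the log, here is a rough Python sketch that splits the pipe-delimited status line from the log above into named fields. The field names (monitor address, source address, gateway name, RTT, RTT standard deviation, loss, status, substatus) are my own guess from lining the values up against the gateway alarm line; they are not taken from any pfSense documentation, and `parse_status_line` is just an illustrative helper.

```python
from dataclasses import dataclass


@dataclass
class GatewayStatus:
    # Field names are assumptions inferred from the log, not official names.
    monitor_ip: str
    source_ip: str
    name: str
    rtt_ms: float
    rtt_sd_ms: float
    loss_pct: float
    status: str
    substatus: str


def _strip_unit(value: str, unit: str) -> float:
    """Drop a trailing unit such as 'ms' or '%' and convert to float."""
    if value.endswith(unit):
        value = value[: -len(unit)]
    return float(value)


def parse_status_line(line: str) -> GatewayStatus:
    """Parse one 'a|b|c|...' gateway status line with exactly 8 fields."""
    fields = line.strip().split("|")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    monitor_ip, source_ip, name, rtt, rtt_sd, loss, status, substatus = fields
    return GatewayStatus(
        monitor_ip=monitor_ip,
        source_ip=source_ip,
        name=name,
        rtt_ms=_strip_unit(rtt, "ms"),
        rtt_sd_ms=_strip_unit(rtt_sd, "ms"),
        loss_pct=_strip_unit(loss, "%"),
        status=status,
        substatus=substatus,
    )


if __name__ == "__main__":
    line = "1.1.1.1|51.155.XX.XX|WAN_1_GW|4.603ms|0.046ms|24%|down|highloss"
    print(parse_status_line(line))
```

Read this way, the line says WAN_1_GW was marked down because of high packet loss (24%), even though the latency itself looked normal.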
We do currently have a gateway group set up called WAN1_FIRST. This is set up to fail over to WAN2 in the event of the connection going down on WAN1. However, at this moment in time WAN2 isn't actually active, as we're waiting on building work.
What is odd is that the connection didn't come back up. Something appears to have got stuck. Surely it should have tried to fail over to WAN2, found that WAN2 was also down, and then kept retrying both connections until one came up. As it happened, nothing came back up automatically until we did a router reboot at 8am.
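To make that expectation concrete, here is a minimal Python sketch of the behaviour I would have expected. It is purely illustrative and is not how pfSense actually implements gateway groups; `pick_gateway`, `simulate` and the name WAN_2_GW are all made up for the example.

```python
from typing import Dict, List, Optional


def pick_gateway(tiers: Dict[str, int], is_up: Dict[str, bool]) -> Optional[str]:
    """Return the 'up' member with the lowest tier number, or None if all are down."""
    candidates = [name for name in tiers if is_up.get(name, False)]
    if not candidates:
        return None
    return min(candidates, key=lambda name: tiers[name])


def simulate(tiers: Dict[str, int], polls: List[Dict[str, bool]]) -> List[Optional[str]]:
    """Re-evaluate the group on every poll, so a recovered gateway is picked up again."""
    return [pick_gateway(tiers, status) for status in polls]


if __name__ == "__main__":
    # Hypothetical group: WAN1 preferred (tier 1), WAN2 as backup (tier 2).
    tiers = {"WAN_1_GW": 1, "WAN_2_GW": 2}
    polls = [
        {"WAN_1_GW": True, "WAN_2_GW": False},   # normal operation
        {"WAN_1_GW": False, "WAN_2_GW": False},  # WAN1 has loss, WAN2 not live yet
        {"WAN_1_GW": False, "WAN_2_GW": False},  # still nothing usable
        {"WAN_1_GW": True, "WAN_2_GW": False},   # WAN1 recovers
    ]
    print(simulate(tiers, polls))
```

In other words, while both members are down there is nothing to switch to, but the moment WAN1 recovers I would expect it to be selected again without a reboot.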
Does anyone know what the following lines mean:
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: Gateway, switch to:
Mar 15 03:57:40 fw php-fpm[86239]: /rc.openvpn: The command '/sbin/route -n6 get 'default' 2>/dev/null | /usr/bin/egrep 'flags: <.*PROTO.*>'' returned exit code '1', the output was ''
This is my gateway group setup, which is in preparation for WAN2 actually going live.
My question, I guess, is what causes this error? And should I perhaps not have the gateway group set up until I have the secondary WAN up and running?
Many thanks
James -
@mpcjames I'm really outside of my comfort zone here, but don't those lines indicate some problem with the IPv6 gateway?
So how have you defined your default gateway(s) (IPv4 and IPv6)?
In my setup, which has been working flawlessly both in testing and in at least one actual failover transition, I have (System / Routing / Gateways):
IPv4 - WAN_1_FIRST
IPv6 - Automatic
BTW, I have my Trigger Level set to Member Down, but I believe it worked when I tested with it set to Packet Loss as well.