OpenVPN gateway monitoring shows offline until forced save
-
I have a number of OpenVPN connections with gateways that use external DNS servers (4.2.2.1 > 4.2.2.5) to validate they are up and measure latency. When my pfSense box boots all the my VPN connections start and are accessible but the gateways report down until I open one and apply a "save" which refreshes them all correctly to up.
Any ideas how to troubleshoot / resolve would be appreciated.EDIT: None of the 4.2.2.1 servers used are part of any DNS Resolver / Forwarder settings.
-
Do you see any traffic actually going to those IP addresses? In the states table, or packet captures?
Are there dpinger instances running for those? -
Wen I packet capture against any of the VPN interfaces (ovpnc1 > 5) I see traffic flowing as would be expected from a up and valid VPN connection. All gateways report as down due to 100% loss.
System General logs
Oct 1 21:06:46 php-fpm 90819 /vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:06:43 php-fpm 91797 /vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:06:41 php-fpm 4253 /vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:05:15 php-fpm 342 /vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:05:13 php-fpm 90819 /vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:05:12 php-fpm 91797 /vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group) Oct 1 21:04:18 php-fpm 4253 /index.php: Successful login for user 'admin' from: 192.168.20.101 (Local Database)
dpinger logs
Oct 1 21:02:48 dpinger VPN1_WAN 4.2.2.1: Alarm latency 0us stddev 0us loss 100% Oct 1 21:02:48 dpinger VPN3_WAN 4.2.2.3: Alarm latency 0us stddev 0us loss 100% Oct 1 21:02:48 dpinger VPN2_WAN 4.2.2.2: Alarm latency 0us stddev 0us loss 100% Oct 1 21:02:48 dpinger VPN4_WAN 4.2.2.4: Alarm latency 0us stddev 0us loss 100% Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr 10.18.xxx.xxx identifier "VPN4_WAN " Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 10.9.xxx.xxx identifier "VPN1_WAN " Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN " Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN " Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xx identifier "VPN2_WAN " Oct 1 21:02:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xx.xx.xx.xx identifier "WAN_DHCP "
Not that I think it matters but these are configured into two failover groups as such
VPN states shows connections established
and at the same time gateways all shown as down
and logs from when I edit a gateway, make no changes and save/apply... all gateway status return to up
Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 1000ms loss_alarm 80% dest_addr 8.8.4.4 bind_addr xxx.xxx.xx.xxx identifier "WAN3 " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr xxx.xxx.xxx.xxx identifier "VPN4_WAN " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 1xx.xxx.xxx.xxx identifier "VPN1_WAN " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xxx identifier "VPN2_WAN " Oct 1 21:21:55 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xxx.xxx.xxx.xxx identifier "WAN_DHCP "
Ignore VPN5, my VPN provider allows 5 connections and from the reboot their servers havent processed the reboots disconnection correctly yet.
What else could I help share to track this down?
-
Filter the states and the packet captures on
icmp
and/or the monitor IP address of one of the VPNs. Seeing traffic flowing doesn't mean anything to the gateway status. I'm curious if, for example, there is a state for 4.2.2.2 going out WAN and not a VPN, or going out the wrong VPN. -
thanks Jim
Traffic capture shows no traffic heading to 4.2.2.1 from the VPN1 gateway or any other WAN connection. I can however see all the 4.2.2.1 checks if I capture on lo0.
-
That's a bit odd. It shouldn't be on
localhost
/lo0
. Maybe it doesn't have a default route to start with? -
yeah, I was surprised for sure. Soon as I hit save on any gateway they all revert to the correct state , this is what the states show post applying that "save"
-
heres a pcap post saving, you can see the ICMP correctly....
0_1538495830645_post-save-gateway-traffic.pcapng
(please feel free to nuke this file after youve looked in case it contains any thing I shouldnt share in public. )
-
Hi Jim, wanted to check if theres any actions I can do to produce a resolution? Should I create a bug report for this matter or is this a case of user error I should look further into?
-
I still haven't managed to reproduce anything like it here locally. There must be some difference in the states or dpinger when it works vs when it doesn't. You might have to go to
/status.php
on the firewall and grab an archive from when it's broken and then again when it works, and then compare the contents for any references to the gateways in question. -
thanks Jim - I'll do that. This is 100% reproducible for me on multiple systems so let me do that comparison and see what I find. I'll report back. thanks again.
-
Did you manage to resolve it? I've been having the identical issue with my pfSense install for a while as well. Currently running version 2.4.4-RELEASE-p2.