OpenVPN gateway monitoring shows offline until forced save

q54e3w

I have a number of OpenVPN connections with gateways that use external DNS servers (4.2.2.1 > 4.2.2.5) to validate they are up and measure latency. When my pfSense box boots all the my VPN connections start and are accessible but the gateways report down until I open one and apply a "save" which refreshes them all correctly to up.
Any ideas how to troubleshoot / resolve would be appreciated.

EDIT: None of the 4.2.2.1 servers used are part of any DNS Resolver / Forwarder settings.

jimp

Do you see any traffic actually going to those IP addresses? In the states table, or packet captures?
Are there dpinger instances running for those?

q54e3w

Wen I packet capture against any of the VPN interfaces (ovpnc1 > 5) I see traffic flowing as would be expected from a up and valid VPN connection. All gateways report as down due to 100% loss.

System General logs

Oct 1 21:06:46	php-fpm	90819	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:06:43	php-fpm	91797	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:06:41	php-fpm	4253	/vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:05:15	php-fpm	342	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:05:13	php-fpm	90819	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:05:12	php-fpm	91797	/vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
Oct 1 21:04:18	php-fpm	4253	/index.php: Successful login for user 'admin' from: 192.168.20.101 (Local Database)

dpinger logs

Oct 1 21:02:48	dpinger		VPN1_WAN 4.2.2.1: Alarm latency 0us stddev 0us loss 100%
Oct 1 21:02:48	dpinger		VPN3_WAN 4.2.2.3: Alarm latency 0us stddev 0us loss 100%
Oct 1 21:02:48	dpinger		VPN2_WAN 4.2.2.2: Alarm latency 0us stddev 0us loss 100%
Oct 1 21:02:48	dpinger		VPN4_WAN 4.2.2.4: Alarm latency 0us stddev 0us loss 100%
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr 10.18.xxx.xxx identifier "VPN4_WAN "
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 10.9.xxx.xxx identifier "VPN1_WAN "
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN "
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN "
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xx identifier "VPN2_WAN "
Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xx.xx.xx.xx identifier "WAN_DHCP "

Not that I think it matters but these are configured into two failover groups as such

VPN states shows connections established

and at the same time gateways all shown as down

and logs from when I edit a gateway, make no changes and save/apply... all gateway status return to up

Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 1000ms loss_alarm 80% dest_addr 8.8.4.4 bind_addr xxx.xxx.xx.xxx identifier "WAN3 "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr xxx.xxx.xxx.xxx identifier "VPN4_WAN "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 1xx.xxx.xxx.xxx identifier "VPN1_WAN "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xxx identifier "VPN2_WAN "
Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xxx.xxx.xxx.xxx identifier "WAN_DHCP "

Ignore VPN5, my VPN provider allows 5 connections and from the reboot their servers havent processed the reboots disconnection correctly yet.

What else could I help share to track this down?

jimp

Filter the states and the packet captures on icmp and/or the monitor IP address of one of the VPNs. Seeing traffic flowing doesn't mean anything to the gateway status. I'm curious if, for example, there is a state for 4.2.2.2 going out WAN and not a VPN, or going out the wrong VPN.

q54e3w

thanks Jim

Traffic capture shows no traffic heading to 4.2.2.1 from the VPN1 gateway or any other WAN connection. I can however see all the 4.2.2.1 checks if I capture on lo0.

0_1538494637243_lo0_cap.pcapng

jimp

That's a bit odd. It shouldn't be on localhost/lo0. Maybe it doesn't have a default route to start with?

q54e3w

yeah, I was surprised for sure. Soon as I hit save on any gateway they all revert to the correct state , this is what the states show post applying that "save"

q54e3w

heres a pcap post saving, you can see the ICMP correctly....

0_1538495830645_post-save-gateway-traffic.pcapng

(please feel free to nuke this file after youve looked in case it contains any thing I shouldnt share in public. )

q54e3w

Hi Jim, wanted to check if theres any actions I can do to produce a resolution? Should I create a bug report for this matter or is this a case of user error I should look further into?

jimp

I still haven't managed to reproduce anything like it here locally. There must be some difference in the states or dpinger when it works vs when it doesn't. You might have to go to /status.php on the firewall and grab an archive from when it's broken and then again when it works, and then compare the contents for any references to the gateways in question.

q54e3w

thanks Jim - I'll do that. This is 100% reproducible for me on multiple systems so let me do that comparison and see what I find. I'll report back. thanks again.

kesawi

Did you manage to resolve it? I've been having the identical issue with my pfSense install for a while as well. Currently running version 2.4.4-RELEASE-p2.