OpenVPN gateway monitoring shows offline until forced save



  • I have a number of OpenVPN connections with gateways that use external DNS servers (4.2.2.1 > 4.2.2.5) to validate they are up and measure latency. When my pfSense box boots all the my VPN connections start and are accessible but the gateways report down until I open one and apply a "save" which refreshes them all correctly to up.
    Any ideas how to troubleshoot / resolve would be appreciated.

    EDIT: None of the 4.2.2.1 servers used are part of any DNS Resolver / Forwarder settings.


  • Rebel Alliance Developer Netgate

    Do you see any traffic actually going to those IP addresses? In the states table, or packet captures?
    Are there dpinger instances running for those?



  • Wen I packet capture against any of the VPN interfaces (ovpnc1 > 5) I see traffic flowing as would be expected from a up and valid VPN connection. All gateways report as down due to 100% loss.

    System General logs

    Oct 1 21:06:46	php-fpm	90819	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:06:43	php-fpm	91797	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:06:41	php-fpm	4253	/vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:05:15	php-fpm	342	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:05:13	php-fpm	90819	/vpn_openvpn_client.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:05:12	php-fpm	91797	/vpn_openvpn_server.php: Gateways status could not be determined, considering all as up/active. (Group: VPN_Group)
    Oct 1 21:04:18	php-fpm	4253	/index.php: Successful login for user 'admin' from: 192.168.20.101 (Local Database)
    

    dpinger logs

    Oct 1 21:02:48	dpinger		VPN1_WAN 4.2.2.1: Alarm latency 0us stddev 0us loss 100%
    Oct 1 21:02:48	dpinger		VPN3_WAN 4.2.2.3: Alarm latency 0us stddev 0us loss 100%
    Oct 1 21:02:48	dpinger		VPN2_WAN 4.2.2.2: Alarm latency 0us stddev 0us loss 100%
    Oct 1 21:02:48	dpinger		VPN4_WAN 4.2.2.4: Alarm latency 0us stddev 0us loss 100%
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr 10.18.xxx.xxx identifier "VPN4_WAN "
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 10.9.xxx.xxx identifier "VPN1_WAN "
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN "
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN "
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xx identifier "VPN2_WAN "
    Oct 1 21:02:46	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xx.xx.xx.xx identifier "WAN_DHCP "
    

    Not that I think it matters but these are configured into two failover groups as such

    0_1538453873171_gatewaygroups.jpg

    VPN states shows connections established

    0_1538453964851_vpn1_states.jpg

    and at the same time gateways all shown as down

    0_1538454077180_gateway_status.jpg

    and logs from when I edit a gateway, make no changes and save/apply... all gateway status return to up

    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 1000ms loss_alarm 80% dest_addr 8.8.4.4 bind_addr xxx.xxx.xx.xxx identifier "WAN3 "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.4 bind_addr xxx.xxx.xxx.xxx identifier "VPN4_WAN "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.1 bind_addr 1xx.xxx.xxx.xxx identifier "VPN1_WAN "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.200.1 bind_addr 192.168.200.1 identifier "RW_VPN "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.3 bind_addr 10.28.xxx.xxx identifier "VPN3_WAN "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 75% dest_addr 4.2.2.2 bind_addr 10.26.xxx.xxx identifier "VPN2_WAN "
    Oct 1 21:21:55	dpinger		send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 90% dest_addr 76.91.0.1 bind_addr xxx.xxx.xxx.xxx identifier "WAN_DHCP "
    

    0_1538454262036_gateways_up.jpg

    Ignore VPN5, my VPN provider allows 5 connections and from the reboot their servers havent processed the reboots disconnection correctly yet.

    What else could I help share to track this down?


  • Rebel Alliance Developer Netgate

    Filter the states and the packet captures on icmp and/or the monitor IP address of one of the VPNs. Seeing traffic flowing doesn't mean anything to the gateway status. I'm curious if, for example, there is a state for 4.2.2.2 going out WAN and not a VPN, or going out the wrong VPN.



  • thanks Jim

    0_1538494022192_states-filter-4221.jpg

    0_1538494133049_vpn1.jpg

    Traffic capture shows no traffic heading to 4.2.2.1 from the VPN1 gateway or any other WAN connection. I can however see all the 4.2.2.1 checks if I capture on lo0.

    0_1538494637243_lo0_cap.pcapng


  • Rebel Alliance Developer Netgate

    That's a bit odd. It shouldn't be on localhost/lo0. Maybe it doesn't have a default route to start with?



  • yeah, I was surprised for sure. Soon as I hit save on any gateway they all revert to the correct state , this is what the states show post applying that "save"

    0_1538495646928_post-save-gateways.jpg



  • heres a pcap post saving, you can see the ICMP correctly....

    0_1538495830645_post-save-gateway-traffic.pcapng

    (please feel free to nuke this file after youve looked in case it contains any thing I shouldnt share in public. )



  • Hi Jim, wanted to check if theres any actions I can do to produce a resolution? Should I create a bug report for this matter or is this a case of user error I should look further into?


  • Rebel Alliance Developer Netgate

    I still haven't managed to reproduce anything like it here locally. There must be some difference in the states or dpinger when it works vs when it doesn't. You might have to go to /status.php on the firewall and grab an archive from when it's broken and then again when it works, and then compare the contents for any references to the gateways in question.



  • thanks Jim - I'll do that. This is 100% reproducible for me on multiple systems so let me do that comparison and see what I find. I'll report back. thanks again.