How to prevent OpenVPN tunnels/interfaces from restarting when a gateway monitor goes down
-
Hi guys,
In my environment, I'm using a multi-WAN setup with two internet connections.
Connection 1 (WAN_1), with static IP configuration, is used for high-priority connections, e.g. incoming traffic to the web servers, IPsec traffic, etc.
Connection 2 (WAN_2), with DHCP configuration, is used for low-priority connections, e.g. outgoing traffic to ports 80/443, etc.

I'm using pfSense, current version 2.3.4 (amd64), on a 2-node HA cluster with CARP enabled on all interfaces.
The OpenVPN server is configured to run on the CARP address of connection WAN_1 (static).

Because it's currently not possible to limit the upstream bandwidth, the second WAN connection goes down/up when the upstream bandwidth is 100% utilized for a longer time.
The limiter issue is described here: https://redmine.pfsense.org/issues/4310 (not working in an HA setup).

Gateway groups
For connection failover, I'm using two gateway groups, configured as shown below:
Group Name = Multi_WAN_Failover_P1
WAN_1 = Tier 1
WAN_2 = Tier 2
Trigger level = Member down

Group Name = Multi_WAN_Failover_P2
WAN_2 = Tier 1
WAN_1 = Tier 2
Trigger level = Member down
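For reference, such a group ends up in config.xml roughly like this; the tag names and the item format (gateway|tier|virtual IP) are reproduced from memory, so the exact layout may differ by version (output abbreviated to the first group):

$ grep -A 5 '<gateway_group>' /conf/config.xml
<gateway_group>
	<name>Multi_WAN_Failover_P1</name>
	<item>WAN_1|1|address</item>
	<item>WAN_2|2|address</item>
	<trigger>down</trigger>
</gateway_group>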
These groups are used on the appropriate firewall rules with destinations outside the office networks.
For example, gateway group "Multi_WAN_Failover_P2" is used as the gateway on a rule that allows outgoing connections on ports 80/443 (outside the office networks).
If the WAN_2 gateway monitor goes down because of upstream utilization (on connection WAN_2), the affected connections are switched to WAN_1.
If the WAN_2 gateway monitor comes back up, the connections are switched back.
This is working without any problems.

Gateway configuration
The gateways are configured as shown below:
WAN_1
Monitor IP = 8.8.8.8
Weight = 1
Data Payload = 0
Latency thresholds = 200/500
Packet Loss thresholds = 10/20
Probe Interval = 500
Loss Interval = 2000
Time Period = 10000
Alert interval = 1000

WAN_2
Monitor IP = 8.8.4.4
Weight = 1
Data Payload = 0
Latency thresholds = 200/500
Packet Loss thresholds = 10/20
Probe Interval = 500
Loss Interval = 2000
Time Period = 10000
Alert interval = 1000
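For what it's worth, pfSense hands these values to the dpinger monitor daemon. On my box a ps listing shows something along these lines (paths shortened to placeholders; flag meanings as I understand them: -s probe interval, -l loss interval, -t time period, -A alert interval, -D high latency threshold, -L high loss threshold, -d payload size):

$ ps ax | grep dpinger
/usr/local/bin/dpinger -S -r 0 -i WAN_2 -B <wan2_address> -p <pidfile> -u <socket> -C /etc/rc.gateway_alarm -d 0 -s 500 -l 2000 -t 10000 -A 1000 -D 500 -L 20 8.8.4.4

As far as I can tell, only the high thresholds are passed to dpinger; the low (warning) thresholds seem to be evaluated by the GUI.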
Problem

If the WAN_2 connection goes down because of upstream utilization, the OpenVPN tunnels/interfaces are restarted (as is IPsec).
For example:

Jul 13 14:48:53 php-fpm 21033 /rc.dyndns.update: MONITOR: WAN_2 is down, omitting from routing group Multi_WAN_Failover_P1 8.8.4.4|x.x.x.x|WAN_2|0ms|0ms|100%|down
Jul 13 14:48:54 php-fpm 21033 /rc.dyndns.update: Message sent to xxx@xxx OK
Jul 13 14:48:54 php-fpm 35935 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_2.
Jul 13 14:48:54 check_reload_status updating dyndns WAN_2
Jul 13 14:48:54 check_reload_status Restarting ipsec tunnels
Jul 13 14:48:54 check_reload_status Restarting OpenVPN tunnels/interfaces
Jul 13 14:48:54 check_reload_status Reloading filter
Jul 13 14:48:56 php-fpm 35935 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_2.
Jul 13 14:48:57 php-fpm 35935 /rc.filter_configure_sync: MONITOR: WAN_2 is available now, adding to routing group Multi_WAN_Failover_P1 8.8.4.4|x.x.x.x|WAN_2|114.333ms|205.948ms|0.0%|none
Jul 13 14:48:57 php-fpm 35935 /rc.filter_configure_sync: Message sent to xxx@xxx OK
Jul 13 14:49:03 check_reload_status updating dyndns WAN_2
Jul 13 14:49:03 check_reload_status Restarting ipsec tunnels
Jul 13 14:49:03 check_reload_status Restarting OpenVPN tunnels/interfaces
Jul 13 14:49:03 check_reload_status Reloading filter

As shown above, the OpenVPN tunnels/interfaces (as well as IPsec) are also restarted when the WAN_2 connection becomes available again.
Result
If a user is connected to OpenVPN, he loses some of his connections because of timeouts.
As far as I can see, this only occurs if the OpenVPN connection is established from inside the office networks.
If a user is connected to OpenVPN from outside the office networks (e.g. home office), he doesn't lose any of his connections.

Would it be possible to prevent OpenVPN tunnels/interfaces (and IPsec) from being restarted when the gateway monitor for WAN_2 goes down/up?
-
Just because you cannot use limiters with pfsync does not mean you cannot restrict outbound traffic.
Enable a shaper on that WAN, set it to PRIQ, set the bandwidth to the limit you want to try. The usual 90% is a good place to start.
Create a child queue, set an arbitrary priority, something like 4, and make it the default queue.
Outbound on that interface will now be "limited."
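A minimal sketch of what that produces in the generated ruleset (/tmp/rules.debug), assuming em1 is the WAN in question and a 10Mb cap:

altq on em1 priq bandwidth 10Mb queue { qDefault }
queue qDefault priority 4 priq(default)

Traffic leaving em1 in excess of the cap is then queued/dropped by ALTQ before it can saturate the physical uplink.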
HA without pfsync (state sync) is better than no HA at all. If you really need limiters that is another option.
-
I'm already using the traffic shaper with scheduler type PRIQ on each interface.
To keep the WAN_2 connection from going down frequently, I changed the bandwidth to 4Mbit/s; before, it was set to 6Mbit/s.
Normally this connection should have 200Mbit/s down and 12Mbit/s up, but it seems to be stable only when the upstream is limited to 4Mbit/s.
With this change, the connection currently seems to be stable even under high utilization.

Maybe you can help me with my initial question: why are both OpenVPN and IPsec tunnels restarted when the WAN_2 connection is omitted from and added back to routing group Multi_WAN_Failover_P1, as shown below:
Jul 13 14:48:53 php-fpm 21033 /rc.dyndns.update: MONITOR: WAN_2 is down, omitting from routing group Multi_WAN_Failover_P1 8.8.4.4|x.x.x.x|WAN_2|0ms|0ms|100%|down
Jul 13 14:48:54 php-fpm 21033 /rc.dyndns.update: Message sent to xxx@xxx OK
Jul 13 14:48:54 php-fpm 35935 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_2.
Jul 13 14:48:54 check_reload_status updating dyndns WAN_2
Jul 13 14:48:54 check_reload_status Restarting ipsec tunnels
Jul 13 14:48:54 check_reload_status Restarting OpenVPN tunnels/interfaces
Jul 13 14:48:54 check_reload_status Reloading filter
Jul 13 14:48:56 php-fpm 35935 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed its IP. Reloading endpoints that may use WAN_2.
Jul 13 14:48:57 php-fpm 35935 /rc.filter_configure_sync: MONITOR: WAN_2 is available now, adding to routing group Multi_WAN_Failover_P1 8.8.4.4|x.x.x.x|WAN_2|114.333ms|205.948ms|0.0%|none
Jul 13 14:48:57 php-fpm 35935 /rc.filter_configure_sync: Message sent to xxx@xxx OK
Jul 13 14:49:03 check_reload_status updating dyndns WAN_2
Jul 13 14:49:03 check_reload_status Restarting ipsec tunnels
Jul 13 14:49:03 check_reload_status Restarting OpenVPN tunnels/interfaces
Jul 13 14:49:03 check_reload_status Reloading filter

In my understanding, this may be by design, because the OpenVPN server is running on the WAN_1 CARP address and that interface is a member of routing group Multi_WAN_Failover_P1.
Please correct me if I'm wrong.
-
If you bind the OpenVPN server to localhost instead and port forward CARP:1194 to 127.0.0.1:1194 it might help.
I do not think there is much selectivity in what services are restarted when a gateway goes down. Multi-WAN gateway events are kind of expensive.
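In the GUI that is an ordinary port forward (Firewall > NAT > Port Forward): destination = the CARP VIP on port 1194, redirect target 127.0.0.1 port 1194, with the OpenVPN server's Interface set to Localhost. The generated rule is an rdr along these lines (interface and VIP are placeholders):

rdr pass on em0 inet proto udp from any to <CARP_VIP> port 1194 -> 127.0.0.1 port 1194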
-
If you bind the OpenVPN server to localhost instead and port forward CARP:1194 to 127.0.0.1:1194 it might help.
I use this configuration (binding OpenVPN servers to localhost) daily. It works.
-
If you bind the OpenVPN server to localhost instead and port forward CARP:1194 to 127.0.0.1:1194 it might help.
I use this configuration (binding OpenVPN servers to localhost) daily. It works.
Are you using this configuration because of the same issue (tunnel restarts when the interface is omitted from and added back to the multi-WAN group, as in my case)?
If not, do you have any other pros/cons for this config?
-
I use that configuration in HA clusters so the OpenVPN server does not stop/start based on which node is the master.
The servers just stay running on both and whichever one is master receives the traffic.
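Which node owns the VIP at any moment can be checked with ifconfig (interface and vhid here are examples):

$ ifconfig em0 | grep carp
carp: MASTER vhid 1 advbase 1 advskew 0

The backup node shows BACKUP on the same line, but its OpenVPN daemon keeps running the whole time.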
-
If you bind the OpenVPN server to localhost instead and port forward CARP:1194 to 127.0.0.1:1194 it might help.
I use this configuration (binding OpenVPN servers to localhost) daily. It works.
Are you using this configuration because of the same issue (tunnel restarts when the interface is omitted from and added back to the multi-WAN group, as in my case)?
If not, do you have any other pros/cons for this config?
Somewhat.
It is easier to toss the endpoint around in the case of multiple IPs/multihomed setups (and for RR/FO multihomed endpoints this is a must).
It is easier to control outbound connections (for 'client' VPN) using floating rules.
…
and I have used it like that for so long that I don't even remember all the nuances.
But fewer daemon restarts is one of them.
-
Thank you guys.
I'm now using the OpenVPN server bound to localhost, with the appropriate port forward from the CARP address to localhost.
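A quick check shows the daemon now only listens on localhost (PID/FD values are from my box and will differ):

$ sockstat -4 -l | grep openvpn
root openvpn 48412 6 udp4 127.0.0.1:1194 *:*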
That's working perfectly.
-
Hello!
I use the same kind of solution for OpenVPN servers (binding the OpenVPN server to localhost when using CARP).
But I still have the problem with OpenVPN site-to-site clients. These clients are bound to a gateway group, and when the backup gateway (which doesn't route any traffic while the active gateway is up) goes down, the OpenVPN clients restart and, for example, SSH connections through the VPN get broken.
Did someone find a good solution for this?
Regards,
-
No. Those states are going to break. They will need to reconnect.
-
OK, that's crystal clear, thanks!
-
Actually, it looks like SSH doesn't break anymore after unsetting "State Killing on Gateway Failure / Flush all states when a gateway goes down".
But I have to check if there are negative side effects, since this option was set in order to improve WAN failover with different OpenVPN clients bound to different gateway groups.
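If I understand that option correctly, enabling it makes a gateway down event flush the entire state table, roughly the equivalent of running:

$ pfctl -F states

which would kill every session through the firewall, SSH included, not just the states on the failed gateway. With it unset, existing states are left alone, which would explain why the SSH sessions now survive.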
-
If it doesn't break, that is because it actually reconnects.
I have never seen ssh do that.