[Solved] Site-to-Site and Client-Server OpenVPN randomly reconnecting from client side

cysiacom

This is a solved question, but I am posting it here to save someone else some time.

We run multiple Site-to-Site plus Remote Access VPN servers.
Central Office, Satellite Offices and Road Warriors.
pfSense on every side of the tunnel.
The Central Office runs pfSense in a Hyper-V virtual machine.
The Satellite Offices run pfSense on PC Engines APU2D2 hardware.
All versions are fully updated.
All hardware has AES-NI enabled.

Site-to-Site was deployed with:

Peer to Peer SSL configuration
AES-128-GCM
SHA256 Auth
LZ4-v2
Subnet Topology
UDP Fast I/O

After some weeks running like a charm in test mode, we switched to production mode.
Some weeks later we noticed that all clients (Site-to-Site and Remote Users) were restarting the tunnel once, twice or even more times per hour, always with the same log events:

Nov 15 15:22:20	openvpn	90690	SIGUSR1[soft,ping-restart] received, process restarting
Nov 15 15:22:20	openvpn	90690	[VPN Server] Inactivity timeout (--ping-restart), restarting
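
To see how often this was happening, the restarts can be counted per hour from an exported copy of the client log. This is only a minimal sketch: it assumes the log lines look like the two above (syslog-style timestamp first), and the function name count_ping_restarts is made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp at the start of a line, e.g. "Nov 15 15:22:20".
TIMESTAMP = re.compile(r"^(\w{3}\s+\d{1,2})\s+(\d{2}):\d{2}:\d{2}")


def count_ping_restarts(lines):
    """Return a Counter mapping (date, hour) to the number of ping-restart events."""
    per_hour = Counter()
    for line in lines:
        # Each restart logs two lines; only the SIGUSR1 line is counted
        # so that one restart is counted once.
        if "SIGUSR1[soft,ping-restart]" not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            per_hour[(match.group(1), int(match.group(2)))] += 1
    return per_hour


sample = [
    "Nov 15 15:22:20\topenvpn\t90690\tSIGUSR1[soft,ping-restart] received, process restarting",
    "Nov 15 15:22:20\topenvpn\t90690\t[VPN Server] Inactivity timeout (--ping-restart), restarting",
]

for (day, hour), restarts in sorted(count_ping_restarts(sample).items()):
    print(f"{day} {hour:02d}:00  restarts={restarts}")
```

Feeding it the whole log instead of the two sample lines gives a quick picture of whether the restarts cluster at certain hours or are spread evenly.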

About 10 seconds of packet loss were seen on the tunnel at each restart, and some applications misbehaved because of it.
Running continuous pings from the client-side network to the server-side network did not help keep the tunnel alive.
No problems were found on any other link (internet, intranet, etc.) managed by the firewalls.
tcpdump showed nothing but the loss of ICMP replies during the tunnel restart and the OpenVPN renegotiation triggered by the client restarting.

We tried every solution we could find for this problem, such as those in this similar unsolved question: https://forum.netgate.com/topic/115125/openvpn-tunnel-allways-reconnects

We tried changing compression, encryption algorithm and keepalive settings, but nothing worked.
We tried swapping some hardware, but it was of no use.
Then we changed from SSL/TLS to Shared Key, and the tunnel stayed established without restarts. This was a workaround but not a solution, as we wanted to use AES-GCM as the encryption algorithm, and that is not possible with Shared Key.

So we concluded that something was wrong with SSL/TLS or the keepalives.
A very useful clue was found at https://forum.netgate.com/post/393487

[The 60-second timeout is a generic timeout error, not indicative of any specific problem. The server-side logs are better indications of the problem in these cases.
Most likely explanations:
Server side blocking the traffic in firewall rules (or failing to pass it, as the case may be)
ISP/Uplink blocking the traffic
Time mismatch between client and server
Certificate/CA mismatch between client and server
TLS Key mismatch between client and server
Other setting mismatch between client and server
The exact mismatch or error would be found in the server logs.]

Finally we found that the Central Server, where the VPN server side is running, was out of time sync due to improper virtual machine configuration.
The server was 3 minutes 30 seconds ahead of the NTP time used by the rest of the pfSense boxes.
The central pfSense server had NTP configured properly but also had the Time Sync Integration Service enabled, and the host machine was not properly synced with an NTP server.

So pfSense synced with NTP properly, but the virtual hardware immediately reset the clock to the wrong time running on the host machine.

So finally the solution was:

Disable the Time Sync Integration Service in the Hyper-V configuration
Force ntpdate on the server side in order to sync date and time
Use the same pool of NTP servers as the time reference for all of them

So far the tunnels are working properly without any problems.

Conclusion:

Keep your infrastructure time synced with a reliable source, and double check it when using virtualization services and integrations.