IPSEC suddenly stops working
-
I need a little help or advice if possible. I currently have 4 sites that were all running 2.4.5p1 pfSense with IPSEC connecting all 4 together without any major issues.
Internal IPs in /24s using 172.16.0.x, 172.16.1.x, 172.16.2.x and 172.16.3.x.
With the release of 2.5.0 I ran the upgrade on 172.16.0.x (which is ideally a test-lab location), and it kind of screwed up (I know, I should have clean installed…). That environment was using a Lanner box with an older Atom processor that is pretty much end-of-life, so I have some Watchguard Firebox XTM 5s with C2D processors and 4 GB RAM. These were my short-term upgrade path for greater use of IDS, as the Atom ran too high on utilization when doing a lot…
Built the XTM5, restored a configuration and after a lot of tweaking got it running with all packages and IPSEC tunnels. No biggie, just took longer and a little more complex than I had hoped.
Herein lies the issue… After running for a while, the IPSEC on that location just appears to stop. The VPN is offline, and clicking connect from there or from one of the other sites doesn’t resolve anything. Clicking stop in the GUI doesn’t stop it, and restart also seems to do nothing. I am unable to run ‘swanctl --list-conns’ or ‘swanctl --load-all --file /var/etc/ipsec/swanctl.conf --debug 1’, as neither responds with anything.
If I reboot, all is good for a while until the same happens again.
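For anyone trying to catch the moment it hangs, here is a minimal sketch that probes swanctl with a timeout so the shell session does not block. It assumes a Python 3 interpreter is available on the firewall, which is not confirmed anywhere in this thread. The `probe` helper and the 10-second timeout are hypothetical choices; only the `swanctl --list-conns` command comes from the post above.

```python
import subprocess


def probe(cmd, timeout_s=10.0):
    """Run cmd and classify the outcome without blocking forever.

    Returns "ok" (exit code 0), "failed" (non-zero exit),
    "hung" (no exit within timeout_s) or "missing" (binary not found).
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except FileNotFoundError:
        return "missing"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Same command that stopped responding in the post above.
    print("swanctl --list-conns:", probe(["swanctl", "--list-conns"]))
```

Run from cron or a loop, a "hung" result with a timestamp would at least narrow down when the daemon stops answering, which can then be compared against the IPSEC log.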
Believing the issue is with 2.5.0, I just rebuilt that system to 2.4.5p1 and restored some config to keep my IPSEC tunnels, interfaces, NAT, firewall rules and so on and so forth. The system was up and running from midnight.
Just realized a short while ago that the tunnel is now not responding again. Internet is not dropping, as I have remote access to computers at that location. Logged into the firewall and checked Status, IPSEC, which shows the usual collecting information and then nothing. Ran a shell and cannot issue swanctl commands, just like before. Checking the IPSEC log from the shell shows corruption occurring @ 10:52 -
Apr 15 10:52:04 FCU-Group-FW charon: 11[IKE] <con1000|5> activatCLOG^A^@^@^@\xc2\xf2^A^@\xec\xcd^G^@^@^@^@^@
Can this ACTUALLY be hardware related to the XTM5 or am I missing something absolutely obvious??? I mean, I put it back to 2.4.5p1 so same version as the others etc…
Obviously I can’t change the others to 2.5.0 or 2.5.1 until I know for sure what is the root cause and ensure stability…
Any help would be greatly appreciated…!
-
@paulk201270 Looks like bad RAM to me. Do you have ECC memory?
-
@lst_hoe Nope, regular RAM. I swapped it as a test, but the same thing kept happening. In the interim I have rebuilt to the new 2.5.1 image and, aside from Unbound crashing (I added a watch to restart it), the IPSEC appears to be working better.
-
@paulk201270 Still having the same issue, even having switched RAM.
Another firewall (different hardware) exhibiting the same issue. Both running 2.5.1, both built clean and reconfigured manually to remove any doubt of upgrade issues. Both built on Watchguard hardware XTM5s.
Selecting Stop for IPSEC on the services page never stops it. Rebooting the firewall normalizes things and it works for a day or so, then stops again.
Log shows the following and then nothing for days till rebooted...
May 7 00:16:57 charon 59608 12[ENC] <con100000|63> generating INFORMATIONAL response 716 [ ]
May 7 00:16:57 charon 59608 12[NET] <con100000|63> sending packet: from XXX.XXX.XXX.XXX[500] to XXX.XXX.XXX.XX[500] (57 bytes)
May 7 00:17:00 newsyslog 25803 logfile turned over due to size>500K
May 7 00:17:00 newsyslog 25803 logfile turned over due to size>500K
May 7 00:17:06 charon 59608 15[NET] <con300000|66> received packet: from XXX.XXX.XXX.XX[500] to XXX.XXX.XX.XX[500] (57 bytes)
May 7 00:17:06 charon 59608 15[ENC] <con300000|66> parsed INFORMATIONAL request 344 [ ]
May 7 00:17:06 charon 59608 15[ENC] <con300000|66> generating INFORMATIONAL response 344 [ ]
May 7 00:17:06 charon 59608 15[NET] <con300000|66> sending packet: from XXX.XXX.XX.XXX[500] to XXX.XXX.XXX.XX[500] (57 bytes)
May 7 00:28:45 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc4427143/XXX.XXX.XXX.XXX
May 7 00:29:32 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc3cd1301/XXX.XXX.XXX.XXX
May 7 00:35:33 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc2535822/XXX.XXX.XXX.XXX
May 7 00:37:14 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc6823624/XXX.XXX.XXX.XXX
May 7 00:38:50 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc4427143/XXX.XXX.XXX.XXX
May 7 00:38:50 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc3cd1301/XXX.XXX.XXX.XXX
May 7 00:46:02 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc2535822/XXX.XXX.XXX.XXX
May 7 00:46:02 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc6823624/XXX.XXX.XXX.XXX
May 7 00:51:12 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc12d5134/XXX.XXX.XXX.XXX
May 7 00:54:35 charon 59608 03[KNL] creating rekey job for CHILD_SA ESP/0xc2f81b76/XXX.XXX.XXX.XXX
May 7 01:02:12 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc12d5134/XXX.XXX.XXX.XXX
May 7 01:02:12 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc2f81b76/XXX.XXX.XXX.XXX
May 11 21:56:19 charon 59608 03[KNL] interface pppoe0 activated
May 11 21:56:19 charon 59608 03[KNL] XXX.XXX.XXX.XXX disappeared from pppoe0
May 11 21:56:19 charon 59608 03[KNL] interface pppoe0 deactivated
May 11 21:56:34 charon 59608 03[KNL] XXX.XXX.XXX.XXX appeared on pppoe0
May 12 12:41:43 charon 59608 00[DMN] SIGTERM received, shutting down
Even selecting stop multiple times, nothing else is added or changes in the log.
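The silence in that log (nothing between May 7 and May 11) can be found automatically. Below is a rough sketch that scans syslog-style lines and reports any stretch longer than a threshold with no entries. The `find_gaps` name, the one-day threshold and the year are all assumptions: syslog timestamps carry no year, so one has to be supplied. The sample lines are copied from the log above.

```python
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(hours=1), year=2021):
    """Return (previous_ts, next_ts) pairs for consecutive log entries
    that are more than min_gap apart.

    year is an assumption: the log lines themselves do not include one.
    """
    gaps = []
    prev = None
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        stamp = f"{year} {parts[0]} {parts[1]} {parts[2]}"
        try:
            ts = datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a leading syslog timestamp
        if prev is not None and ts - prev > min_gap:
            gaps.append((prev, ts))
        prev = ts
    return gaps


sample = [
    "May 7 01:02:12 charon 59608 03[KNL] creating delete job for CHILD_SA ESP/0xc2f81b76/XXX.XXX.XXX.XXX",
    "May 11 21:56:19 charon 59608 03[KNL] interface pppoe0 activated",
    "May 12 12:41:43 charon 59608 00[DMN] SIGTERM received, shutting down",
]

for start, end in find_gaps(sample, min_gap=timedelta(days=1)):
    print(f"no log entries from {start} to {end} ({end - start})")
```

The last entry before each gap is a reasonable place to start looking for whatever charon was doing when it went quiet; here that is the pair of CHILD_SA delete jobs at 01:02:12.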
If it was just happening on one machine I could understand something iffy, but it's same style hardware but multiple locations etc.
Would appreciate if anyone has some suggestions as to how to proceed...
Many thanks
Paul.
-
I'm seeing this exact same behaviour. It started after upgrading to 2.5.1.
The tunnels all stop working. The IPSEC status page shows no tunnels. The IPSEC widget on the Dashboard has the spinning cog permanently. The CPU widget also just displays the spinning cog. The stop and restart IPSEC service buttons do nothing and sometimes even kill the web GUI.
A reboot sorts the problem for a while but it always returns.
My logs are pretty much the same as above.
-
@paddy Are you using similar/same hardware???
-
@paulk201270 Yes I'm using the Watchguard xtm5.
I've gone back to the previous 2.4 pfsense and so far so good.
-
@paddy Ah, thanks for that confirmation. In the interim I've just put the first site back on my older Lanner solution to see if it is specific to the hardware, which I think it is in some way. The older hardware is not as powerful, but I'm not sure how to get Netgate to investigate the specifics of the XTM 5 series. Perhaps someone can advise what to request...
-
By mistake posted this to Redmine as a 'potential' bug, but was told that they do not support this particular hardware. Would appreciate it if anyone else could potentially reproduce or add additional info that might make further investigation possible...