IPSec tunnels work for several hours to days but then stop routing traffic
-
Hi,
We have been using pfSense for several years. However, for the past 3-4 months we have had an issue where the tunnels suddenly stop passing traffic despite being established and having SPD and SAD entries, and the firewall does not show any traffic being blocked. Dropping the VPN on either side causes the tunnel to rebuild, but it still doesn't pass traffic. The only solution is to reboot the pfSense server.
The Draytek routers are set to dial out only, using the following Phase 1 settings: IKEv1 Main Mode, Always On, AES128/SHA1/G1, with shared keys. There are then two Phase 2 tunnels using AES128/SHA1/no PFS, and the option to create a separate SA per subnet is enabled.
pfSense is configured with the same settings, but has the "Disable Rekey", "Responder Only" and "Enable DPD" options enabled.
I've intentionally dropped the VPN and then taken a sample of the logs, which I'll post shortly, but they seem to show the tunnel building fine, and the SAD and SPD entries come back.
Would anyone have any suggestions on what could be causing this? There are multiple sites connecting using Draytek routers, and they all show the same symptoms eventually.
-
Here are the logs from the Draytek side when I dropped the VPN tunnel (it's in reverse order).
2020-02-13 15:02:39 [L2L][UP][IPsec][@1:SITENAME]
2020-02-13 15:02:39 Delete exist flowstate of static route 0A150000/FFFF0000 ...
2020-02-13 15:02:39 sent QI2, IPsec SA established with xx.xx.xxx.xx (pfsenseIP). In/Out Index: 0/-1
2020-02-13 15:02:39 IPsec SA #1016 will be replaced after 2996 seconds
2020-02-13 15:02:39 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x6ef620c4
2020-02-13 15:02:39 Accept ESP prorosal ENCR ESP_AES, HASH AUTH_ALGORITHM_HMAC_SHA2_256
2020-02-13 15:02:39 IKE <==, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x6ef620c4
2020-02-13 15:02:39 [IPSEC/IKE][L2L][1:SITENAME][@xx.xx.xxx.xx (pfsenseIP)] quick_outI1: match network
2020-02-13 15:02:39 Client L2L remote network setting is 10.21.0.0/16
2020-02-13 15:02:39 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x6ef620c4
2020-02-13 15:02:38 Start IKE Quick Mode to xx.xx.xxx.xx (pfsenseIP)
2020-02-13 15:02:38 Dialing Node1 (SITENAME) :
2020-02-13 15:02:38 [L2L][UP][IPsec][@1:SITENAME]
2020-02-13 15:02:38 [IPsec] Dial-out for 'More route' of L2L[0]
2020-02-13 15:02:38 Delete exist flowstate of static route AC150000/FFFF0000 ...
2020-02-13 15:02:38 sent QI2, IPsec SA established with xx.xx.xxx.xx (pfsenseIP). In/Out Index: 0/-1
2020-02-13 15:02:38 IPsec SA #1015 will be replaced after 2850 seconds
2020-02-13 15:02:38 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x5b369a20
2020-02-13 15:02:38 Accept ESP prorosal ENCR ESP_AES, HASH AUTH_ALGORITHM_HMAC_SHA2_256
2020-02-13 15:02:38 IKE <==, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x5b369a20
2020-02-13 15:02:38 [IPSEC/IKE][L2L][1:SITENAME][@xx.xx.xxx.xx (pfsenseIP)] quick_outI1: match network
2020-02-13 15:02:38 Client L2L remote network setting is 172.21.0.0/16
2020-02-13 15:02:38 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x20, Message ID = 0x5b369a20
2020-02-13 15:02:38 Start IKE Quick Mode to xx.xx.xxx.xx (pfsenseIP)
2020-02-13 15:02:38 ISAKMP SA established with xx.xx.xxx.xx (pfsenseIP). In/Out Index: 0/-1
2020-02-13 15:02:38 ISAKMP SA #1014 will be replaced after 21375 seconds
2020-02-13 15:02:38 IKE <==, Next Payload=ISAKMP_NEXT_ID, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 IKE ==>, Next Payload=ISAKMP_NEXT_ID, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 NAT-Traversal: Using RFC 3947, no NAT detected
2020-02-13 15:02:38 IKE <==, Next Payload=ISAKMP_NEXT_KE, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 IKE ==>, Next Payload=ISAKMP_NEXT_KE, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 Accept Phase1 prorosals : ENCR OAKLEY_AES_CBC, HASH OAKLEY_SHA2_256
2020-02-13 15:02:38 IKE <==, Next Payload=ISAKMP_NEXT_SA, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 IKE ==>, Next Payload=ISAKMP_NEXT_SA, Exchange Type = 0x2, Message ID = 0x0
2020-02-13 15:02:38 [IPSEC/IKE][L2L][1:SITENAME][@xx.xx.xxx.xx (pfsenseIP)] Initiating IKE Main Mode
2020-02-13 15:02:38 Initiating IKE Main Mode to xx.xx.xxx.xx (pfsenseIP)
2020-02-13 15:02:38 Dialing Node1 (SITENAME) :
2020-02-13 15:02:37 IKE_RELEASE VPN : L2L Dial-out, Profile index = 1, Name = SITENAME, ifno = 11
2020-02-13 15:02:37 Delete exist flowstate of VPN ifno: 11 ....
2020-02-13 15:02:37 [L2L][DOWN][IPsec][@1:SITENAME]
2020-02-13 15:02:37 DropVPN() VPN : L2L Dial-out, Profile index = 1, Name = SITENAME, ifno = 11
2020-02-13 15:02:37 IKE_RELEASE VPN : L2L Dial-out, Profile index = 1, Name = SITENAME, ifno = 10
2020-02-13 15:02:37 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x5, Message ID = 0x61712547
2020-02-13 15:02:37 IKE ==>, Next Payload=ISAKMP_NEXT_HASH, Exchange Type = 0x5, Message ID = 0xf446b1b4
2020-02-13 15:02:37 Delete exist flowstate of VPN ifno: 10 ....
2020-02-13 15:02:37 [L2L][DOWN][IPsec][@1:SITENAME]
2020-02-13 15:02:37 DropVPN() VPN : L2L Dial-out, Profile index = 1, Name = SITENAME, ifno = 10
-
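For anyone comparing lifetimes between the two ends, here is a minimal Python sketch that pulls the "will be replaced after N seconds" lines out of a Draytek log like the one above. The function name `sa_lifetimes` is made up for illustration, and the pattern is only based on the three lifetime lines in this sample, so other firmware versions may word them differently.

```python
import re
from datetime import datetime

# Matches Draytek lines such as:
#   2020-02-13 15:02:39 IPsec SA #1016 will be replaced after 2996 seconds
LIFETIME_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<kind>ISAKMP|IPsec) SA #(?P<sa>\d+) "
    r"will be replaced after (?P<secs>\d+) seconds"
)

def sa_lifetimes(lines):
    """Return (timestamp, kind, sa_number, seconds) for each SA lifetime line."""
    found = []
    for line in lines:
        m = LIFETIME_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
            found.append((ts, m.group("kind"), int(m.group("sa")), int(m.group("secs"))))
    return found

# Lines copied from the log above, plus one unrelated line that is skipped.
sample = [
    "2020-02-13 15:02:39 IPsec SA #1016 will be replaced after 2996 seconds",
    "2020-02-13 15:02:38 IPsec SA #1015 will be replaced after 2850 seconds",
    "2020-02-13 15:02:38 ISAKMP SA #1014 will be replaced after 21375 seconds",
    "2020-02-13 15:02:38 NAT-Traversal: Using RFC 3947, no NAT detected",
]

for ts, kind, sa, secs in sa_lifetimes(sample):
    print(f"{ts} {kind} SA #{sa}: replaced after {secs}s (~{secs / 60:.1f} min)")
```

Running this over a longer capture makes it easier to see whether the Draytek's replacement timers line up with the lifetimes configured on the pfSense side.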
I don't seem to be able to post the PFSense log, as it gets detected as spam when I post, so here's an image of it instead:
-
Hi
It is possible that your problem is described here: https://forum.netgate.com/topic/148857/ipsec-ikev2-error-trap-not-found-unable-to-acquire-reqid
-
@Konstanti thanks for the reply. I've added a cron job as you suggested, since I'm sure I'll run into that limitation at some point.
I think my issue may be slightly different though, since restarting the IPsec service doesn't fix it, and sometimes the issue occurs after only a few hours too.
I'll check the reqid next time it happens though.
-
@nbegley Are you sure you disabled PFS on both sides?
From the log on pfSense side, I don't think so.
-
@dusan The default configuration we are using has PFS on the Phase1 but no PFS on Phase2.
I have tried using PFS on both phases on a few sites to see if that improved things, which it didn't. I've just had a look at the particular site those logs are for and that site is using AES128/SHA256/PFS14 for both phases, so yes it has PFS on both sides for both phases.
-
@nbegley I had a similar problem many years ago with IPv4, and now I'm having a similar problem with IPv6. Both are similar to yours in that a non-pfSense device is trying to connect to pfSense. Both differ from yours in that IKEv1 Phase 1 is shown as "connected" on one side, or on both sides, but there is no traffic, i.e. no ESP packets are delivered to the other end of the tunnel from the very start (not after several hours). Both cases traced back to a mis-routing problem at the ISP on the side of the non-pfSense device (both of my cases were connections between different ISPs).
Until the ISPs resolve their own problem, the temporary solution (successful in my case) was setting a short margin time, disabling Responder Only, enabling Automatically ping host on pfSense and, if Phase 1 is "connected" on only one side, disabling the Internet connection on the pfSense side for a while.
Automatically ping host obviously helps. A short margin time and disabled Responder Only help because otherwise pfSense would time out first (in most cases). Disabling the Internet connection helps because it clears the mis-route from the route cache of the ISP's router (possibly a border router).
-
@dusan I've made the changes to reduce the tunnel lifetime etc. and this does seem to have helped quite a bit. It now seems more consistent about when it stops, e.g. it has taken around 2 days on several occasions now. I can also confirm that I do see the ReqID errors every time the issue occurs. However, I've now got a cron job restarting the IPsec service twice a day, yet I still get ReqID trap errors. If I manually restart the service when I get the ReqID errors, this seems to stop any further errors being logged, but it doesn't get the failing connections to start; I have to reboot for that.
I'm thinking of using cron to restart pfSense daily, as that seems to be the only thing that gets it going again.
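To check whether the ReqID errors always involve the same reqid, a rough Python sketch like the one below could count them from an exported log. The message text is taken from the title of the thread linked earlier; the function name `count_reqid_errors` and the sample lines are hypothetical, and nothing is assumed about the timestamp or daemon prefix that the real log puts in front of the message.

```python
import re
from collections import Counter

# Only the message text is matched; the log prefix format is not assumed.
REQID_RE = re.compile(r"trap not found.*unable to acquire reqid\s+(\d+)")

def count_reqid_errors(lines):
    """Count 'unable to acquire reqid' messages per reqid number."""
    counts = Counter()
    for line in lines:
        m = REQID_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts

# Hypothetical sample lines; the prefixes are placeholders, not real output.
sample = [
    "<timestamp> <daemon> trap not found, unable to acquire reqid 12",
    "<timestamp> <daemon> some unrelated message",
    "<timestamp> <daemon> trap not found, unable to acquire reqid 12",
    "<timestamp> <daemon> trap not found, unable to acquire reqid 15",
]

for reqid, n in sorted(count_reqid_errors(sample).items()):
    print(f"reqid {reqid}: {n} error(s)")
```

If the same one or two reqids account for all the errors, that would point at specific Phase 2 entries rather than the IPsec service as a whole.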
-
@nbegley I'm not sure why you disable PFS, set Disable Rekey and Disable Reauth, or set Responder Only. The more changes you make to pfSense's default settings, the less chance you have of keeping tunnels connected. According to my test (10 years ago), Draytek is compatible with pfSense, but I suggest you do your own interoperability test.
-- Set margin time = 30s.
-- Set short lifetime, like 30m Phase 1 and 15m Phase 2.
-- Do not set Responder Only. Don't Disable Reauth, Disable Rekey or turn off PFS.
-- (Just for the purpose of testing) Use a different cipher suite for Phase 1 and Phase 2 (say, DH group 15 and 14 respectively).
If the tunnel can't be established or stops working after 1h, the problem is yours. If it stops after 2 days, go after your ISP.
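As a rough illustration of what those numbers mean in practice, the Python sketch below works out when rekeying would start for a 30m Phase 1 and a 15m Phase 2 with a 30s margin. It assumes the formula from the strongSwan ipsec.conf documentation, rekey time = lifetime - (margintime + random(0, margintime * rekeyfuzz)), with rekeyfuzz at its documented default of 100%. Whether the pfSense GUI fields map exactly onto those parameters is an assumption here, and `rekey_window` is just an illustrative name.

```python
def rekey_window(lifetime_s, margin_s, rekeyfuzz=1.0):
    """Return (earliest, latest) seconds after SA setup at which rekeying starts.

    Assumes: rekey_time = lifetime - (margin + random(0, margin * rekeyfuzz)).
    """
    if margin_s * (1 + rekeyfuzz) >= lifetime_s:
        raise ValueError("margin (plus fuzz) must be shorter than the lifetime")
    earliest = lifetime_s - margin_s * (1 + rekeyfuzz)  # maximum random fuzz
    latest = lifetime_s - margin_s                      # no random fuzz
    return earliest, latest

# Lifetimes and margin suggested in the list above.
for name, lifetime in (("Phase 1", 30 * 60), ("Phase 2", 15 * 60)):
    lo, hi = rekey_window(lifetime, margin_s=30)
    print(f"{name}: lifetime {lifetime}s, rekey starts between {lo:.0f}s and {hi:.0f}s")
```

Under these assumptions pfSense would start rekeying within the last minute of each lifetime, which only matters if Disable Rekey is left off as suggested.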