pfSense <-> pfSense: IPsec Tunnels Losing Connectivity



  • Hey guys,
    I'm wondering if anyone has experienced anything like this. I have a number of pfSense routers for our organization (about 7), one at each of our physical sites, with inter-site VPNs between all of them. Some run as physical machines and others as virtual machines (under ESXi). On several of them the IPsec tunnels say they are up (green arrow under diagnostics) but won't pass any traffic. This affects only some of the tunnels but is quite annoying. Simply unticking and re-ticking the "Enable IPsec" checkbox solves the problem; the tunnels come back up and pass traffic within 2-3 seconds. Has anyone experienced anything like this?

    On this router I have the following in the log. Please note that connectivity was lost between 07:00 and 08:00 (I restarted IPsec at 08:00).

    Jan 10 07:07:17 racoon: INFO: received Vendor ID: DPD
    Jan 10 07:07:17 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Selected NAT-T version: RFC 3947
    Jan 10 07:07:17 racoon: [Self]: [116.90.136.91] INFO: Hashing 116.90.136.91[500] with algo #2
    Jan 10 07:07:17 racoon: INFO: NAT-D payload #-1 verified
    Jan 10 07:07:17 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Hashing 60.234.X.X[500] with algo #2
    Jan 10 07:07:17 racoon: INFO: NAT-D payload #0 verified
    Jan 10 07:07:17 racoon: INFO: NAT not detected
    Jan 10 07:07:17 racoon: [northcote.example.co.nz]: [60.234.X.X] NOTIFY: couldn't find the proper pskey, try to get one by the peer's address.
    Jan 10 07:07:17 racoon: INFO: Adding remote and local NAT-D payloads.
    Jan 10 07:07:17 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Hashing 60.234.X.X[500] with algo #2
    Jan 10 07:07:17 racoon: [Self]: [116.90.136.91] INFO: Hashing 116.90.136.91[500] with algo #2
    Jan 10 07:07:17 racoon: [northcote.example.co.nz]: INFO: ISAKMP-SA established 116.90.136.91[500]-60.234.X.X[500] spi:966bd13e75469a53:a53165f1591f7419
    Jan 10 08:03:31 racoon: INFO: @(#)ipsec-tools 0.8.0 (http://ipsec-tools.sourceforge.net)
    Jan 10 08:03:31 racoon: INFO: @(#)This product linked OpenSSL 0.9.8n 24 Mar 2010 (http://www.openssl.org/)
    Jan 10 08:03:31 racoon: INFO: Reading configuration from "/var/etc/racoon.conf"
    Jan 10 08:03:31 racoon: [Self]: INFO: 116.90.136.91[4500] used for NAT-T
    Jan 10 08:03:31 racoon: [Self]: INFO: 116.90.136.91[4500] used as isakmp port (fd=14)
    Jan 10 08:03:31 racoon: [Self]: INFO: 116.90.136.91[500] used for NAT-T
    Jan 10 08:03:31 racoon: [Self]: INFO: 116.90.136.91[500] used as isakmp port (fd=15)
    Jan 10 08:03:31 racoon: INFO: unsupported PF_KEY message REGISTER
    Jan 10 08:03:31 racoon: NOTIFY: no in-bound policy found: 60.234.74.32/29[0] 192.168.1.0/24[0] proto=any dir=in
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: INFO: IPsec-SA request for 60.234.X.X queued due to no phase1 found.
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: INFO: initiate new phase 1 negotiation: 116.90.136.91[500]<=>60.234.X.X[500]
    Jan 10 08:03:31 racoon: INFO: begin Aggressive mode.
    Jan 10 08:03:31 racoon: INFO: received Vendor ID: RFC 3947
    Jan 10 08:03:31 racoon: INFO: received broken Microsoft ID: FRAGMENTATION
    Jan 10 08:03:31 racoon: INFO: received Vendor ID: DPD
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Selected NAT-T version: RFC 3947
    Jan 10 08:03:31 racoon: [Self]: [116.90.136.91] INFO: Hashing 116.90.136.91[500] with algo #2
    Jan 10 08:03:31 racoon: INFO: NAT-D payload #-1 verified
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Hashing 60.234.X.X[500] with algo #2
    Jan 10 08:03:31 racoon: INFO: NAT-D payload #0 verified
    Jan 10 08:03:31 racoon: INFO: NAT not detected
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: [60.234.X.X] NOTIFY: couldn't find the proper pskey, try to get one by the peer's address.
    Jan 10 08:03:31 racoon: INFO: Adding remote and local NAT-D payloads.
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: [60.234.X.X] INFO: Hashing 60.234.X.X[500] with algo #2
    Jan 10 08:03:31 racoon: [Self]: [116.90.136.91] INFO: Hashing 116.90.136.91[500] with algo #2
    Jan 10 08:03:31 racoon: [northcote.example.co.nz]: INFO: ISAKMP-SA established 116.90.136.91[500]-60.234.X.X[500] spi:d7aaa4f1bb667250:a25dfd17c6719ac3
    Jan 10 08:03:32 racoon: [northcote.example.co.nz]: INFO: initiate new phase 2 negotiation: 116.90.136.91[500]<=>60.234.X.X[500]
    Jan 10 08:03:32 racoon: [northcote.example.co.nz]: INFO: IPsec-SA established: ESP 116.90.136.91[500]->60.234.X.X[500] spi=161069962(0x999bb8a)
    Jan 10 08:03:32 racoon: [northcote.example.co.nz]: INFO: IPsec-SA established: ESP 116.90.136.91[500]->60.234.X.X[500] spi=131905850(0x7dcb93a)
    Jan 10 08:03:33 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:33 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:38 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:38 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:43 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:43 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:48 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:48 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.
    Jan 10 08:03:53 racoon: [northcote.example.co.nz]: [60.234.X.X] ERROR: unknown Informational exchange received.



  • Oh, and just to clarify: none of the pfSense boxes are behind other firewalls, and all have public IPs on the WAN interface.



  • We are experiencing the same problem. We have one pfSense in our datacenter, one pfSense in our office, and a third-party IPsec VPN device at a customer's site. Both the tunnel from our office to the datacenter and the tunnel from the customer to the datacenter show this problem.
    Sometimes the tunnels stay up for a couple of days; in other cases we have to restart IPsec several times a day.

    We are using version 2.0.1 on both pfSense boxes. We have experimented a bit with pinging a host to keep the tunnel open, and with enabling/disabling DPD. So far we still have the problem.

    This is very annoying, as it makes the VPN unusable.

    Has anyone found a solution?

    Lex



  • Same problem here. It seems all the issues we are having are related to the 2.0 and 2.0.1 versions.

    IPsec tunnels show as OK, but no data traffic passes.

    In our office we run a 1.2.3 box with 30+ IPsec connections. Timeouts so far occur only on connections to endpoints running 2.0 or above. When I look at the log, all the traffic failures happen at exactly the same intervals. There are no schedules or anything similar running on the pfSense boxes.

    Could this be an issue with the racoon service?

    It's a very strange issue that has been occurring lately. If there is a solution I would really like to know.

    edit…

    Here is the interval from our monitoring service:



  • I am having the same issue since the upgrade to v2.0. I tried 2.0.1, but that did not fix the issue. I have to restart racoon every couple of hours.



  • Same issue here. I have 2.0 on the main IPsec endpoint and 1.2.3 on 8 different machines, and they randomly stop sending data across the tunnel. I have to restart racoon to get things working again. This only started happening after I upgraded to 2.0. I hope somebody can isolate this issue.



  • Hey guys,
    Just to let you all know I'm going to try what was suggested in this thread:
    http://forum.pfsense.org/index.php/topic,41617.0.html

    So I'll disable NAT traversal (NAT-T) and dead peer detection (DPD) and see how that goes.
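    For anyone curious what those two options look like at the racoon level (pfSense writes the config to /var/etc/racoon.conf, as the log above shows), here is a hypothetical phase 1 fragment with both disabled. The peer address and proposal values are placeholders, not taken from any real config:

```
# hypothetical racoon.conf phase 1 block with NAT-T and DPD disabled
remote 203.0.113.1 {
        exchange_mode aggressive;
        nat_traversal off;   # NAT-T off: both endpoints have public IPs
        dpd_delay 0;         # 0 disables dead peer detection probes entirely
        proposal {
                encryption_algorithm 3des;
                hash_algorithm sha1;
                authentication_method pre_shared_key;
                dh_group 2;
        }
        lifetime time 28800 sec;
}
```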



  • Hi all.

    Same frustrating problem here with 2 VPNs using pfSense 2.0.1 on all sides.

    I read in some posts that this only happens from version 2.0 up, so I might downgrade to 1.2.3, as this issue makes the VPN connection unusable.

    Hope this is fixed soon.

    Regards,
    Jesus



  • @jmarquez:

    I read in some posts that this only happens from version 2.0 up, so I might downgrade to 1.2.3, as this issue makes the VPN connection unusable.

    That's not true; it happens on occasion with every IPsec implementation on every device in the world. 2.0.x does not have any general IPsec problems. It's almost always related to misconfiguration - most commonly mismatched lifetimes on P1 and/or P2 for the symptoms described here, though at times it can be circumstances where you need DPD enabled.

    There isn't enough info here on any of the reported issues to troubleshoot, and every issue is likely a different cause, so if you're having issues please start your own thread with specifics - IPsec logs from both sides in particular.

    Zeon - this one's your thread, post your IPsec logs from the other end. The bit shown here just shows one end renegotiated successfully.



  • Don't get me wrong cmb.

    I'm really happy using pfSense. I think it is a great piece of code.
    I agree with you that every person's IPsec issue is likely different. My issue is similar to the ones reported in this thread only in that the tunnels drop randomly.

    In my particular case, I followed the steps described in Zeon's post (http://forum.pfsense.org/index.php/topic,41617.0.html) and the tunnel has not dropped so far.

    All the best.



  • @cmb:

    @jmarquez:

    I read in some posts that this only happens from version 2.0 up, so I might downgrade to 1.2.3, as this issue makes the VPN connection unusable.

    That's not true; it happens on occasion with every IPsec implementation on every device in the world. 2.0.x does not have any general IPsec problems. It's almost always related to misconfiguration - most commonly mismatched lifetimes on P1 and/or P2 for the symptoms described here, though at times it can be circumstances where you need DPD enabled.

    There isn't enough info here on any of the reported issues to troubleshoot, and every issue is likely a different cause, so if you're having issues please start your own thread with specifics - IPsec logs from both sides in particular.

    Zeon - this one's your thread, post your IPsec logs from the other end. The bit shown here just shows one end renegotiated successfully.

    Hi CMB,
    Firstly, I can say that after a few days with DPD and NAT-T disabled I have had no further dropouts and couldn't be happier. This is true across 6 separate tunnels, with some having latency of 1 ms and others as high as 30 ms (throughput of the internet connections is anywhere between 30 Mbps and 100 Mbps).

    Unfortunately I don't have the logs of the problem anymore, but I will try to recreate them one weekend for the benefit of the other users on here.

    Out of interest, when is DPD needed? I have had situations where I knocked a cable out for up to 10 seconds and the tunnel still worked fine once I plugged it back in.



  • @Zeon:

    Firstly, I can say that after a few days with DPD and NAT-T disabled I have had no further dropouts and couldn't be happier. This is true across 6 separate tunnels, with some having latency of 1 ms and others as high as 30 ms (throughput of the internet connections is anywhere between 30 Mbps and 100 Mbps).

    Disabling NAT-T where you don't need it is a good thing to do. For DPD, as long as it's enabled on both sides with the same settings, you should be good. That's what we use on all of ours internally.

    @Zeon:

    Unfortunately I don't have the logs of the problem anymore, but I will try to recreate them one weekend for the benefit of the other users on here.

    Out of interest, when is DPD needed? I have had situations where I knocked a cable out for up to 10 seconds and the tunnel still worked fine once I plugged it back in.

    Circumstances where one end drops an SA and the other doesn't recognize that the SA is no longer valid are where DPD saves you from having to force-restart one or both ends. That may be a reboot on one side or the other (primarily an unplanned one like a power outage or a yanked plug; an orderly reboot should tell the other end to clear it), or an IP change on one of the sides where there are dynamic WANs. Those are the two most common cases I can think of offhand. Just knocking a cable out for a few seconds, or even minutes, is no big deal, unless you happen to get a new IP when it's reconnected (with dynamic WANs, the link coming up will force a reconnect to your ISP, which with some ISPs will get you a new IP). If you still have the same IP, the existing SA is still valid and will work fine.
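    To make "enabled on both sides with the same settings" concrete at the racoon level: DPD is controlled by three timers in the phase 1 block, and they would need to match on both peers. A hypothetical fragment (peer address and values are illustrative, not a recommendation):

```
# hypothetical: matching DPD timers, to be mirrored on both peers
remote 203.0.113.1 {
        exchange_mode aggressive;
        dpd_delay 10;     # send a DPD probe after 10 s of silence from the peer
        dpd_retry 5;      # wait 5 s before retransmitting an unanswered probe
        dpd_maxfail 5;    # declare the peer dead after 5 failed probes
        # ... proposal and lifetime as usual ...
}
```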



  • Stumbling across this thread reminds me of the same issue I had a while ago as well, quite annoying, including against Astaro 8.2.
    I wouldn't vouch for this, but cross-checking my config now, one of the configuration leftovers from the performance tests we did quite a while ago (pre-2.0.1) is that we're using Blowfish in Phase 1 now. It never happened again, so I completely forgot about this. I'm now running my 2.0.1 box (dynamic IP) against both a pfSense 2.0.1 (also dynamic IP) and an Astaro V8.3 (fixed IP):

    • All have public IPs (no NAT involved, NAT Traversal disabled)
    • Default Mutual PSK, Main mode (btw, I thought this couldn't work by definition? well done!!! :)), My & Peer IP Address, default Policy Generation and Proposal Checking.
    • Phase1:
      Encryption algorithm:  Blowfish 256
      Hash algorithm: SHA1
      DH key group: 5   and  Lifetime: 86400
      DPD: Enabled, 10 Detection and 5 retries
    • Phase2:
      Encryption algorithms: AES 256  (Only this, no other proposal)
      Hash algorithms: MD5  (Only this, no other proposal)
      PFS key group: 5 and  Lifetime: 86400.  Auto Ping remote Host is set

    Yes, the encryption and hashing are not the same in Phase 1 and Phase 2, but even the tunnel with two Phase 2 entries runs stably now. Sorry, I can't provide more details.

    I'll let you guys know if I encounter a 'stalled' VPN again.

    cheers
    Josh
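    For reference, Josh's Phase 1/Phase 2 settings above would come out roughly like the following in a generated racoon.conf. This is a hypothetical reconstruction: the peer address and subnets are placeholders, and the real pfSense-generated file will differ in detail:

```
# hypothetical reconstruction of the settings listed above
remote 203.0.113.1 {
        exchange_mode main;
        nat_traversal off;      # no NAT involved
        dpd_delay 10;           # DPD enabled, 10 s detection
        dpd_maxfail 5;          # give up after 5 failed probes
        proposal {
                encryption_algorithm blowfish 256;   # Phase 1: Blowfish 256
                hash_algorithm sha1;
                authentication_method pre_shared_key;
                dh_group 5;
        }
        lifetime time 86400 sec;
}

# Phase 2 (one sainfo per local/remote subnet pair; subnets are placeholders)
sainfo address 192.168.1.0/24 any address 192.168.2.0/24 any {
        encryption_algorithm aes 256;
        authentication_algorithm hmac_md5;
        pfs_group 5;
        lifetime time 86400 sec;
}
```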



  • On the pfSense side, try setting the P1 Policy Generation to "unique".

    I was having similar issues with subsequent reconnects from the Shrew Soft client, where restarting the pfSense IPsec process would clear the issue.

    I did NOT need to disable NAT-T or DPD; changing the P1 Policy Generation setting from "default" to "unique" was the only change I made.
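    In the generated racoon.conf, that GUI setting maps to racoon's generate_policy directive, where "unique" makes racoon install a distinct generated SPD policy per connecting client instead of reusing one. A hypothetical mobile-clients fragment:

```
# hypothetical: P1 Policy Generation = "unique" for mobile (e.g. Shrew Soft) clients
remote anonymous {
        exchange_mode aggressive;
        generate_policy unique;   # one generated SPD policy per client connection
        passive on;               # wait for clients to initiate
        # ... proposal, authentication, etc. ...
}
```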



  • It seems that several people are reporting IPsec VPN issues with pfSense 2.x (which includes the recent ipsec-tools 0.8.0). While some problems may be due to misconfiguration (e.g. the racoon/mpd conflict), the pfSense <-> pfSense VPN scenario should be trouble-free.

    As most of the problems posted here seem to be related to rekeying, I've been searching the ipsec-tools-devel mailing lists for clues. Check the following discussions:

    http://old.nabble.com/why-is-SA-lifetime-kilobyte-limit-disabled-in-racoon--td31648198.html

    Even if Node-A thinks the IPsec-SA is expired at this time, Node-B doesn't
    think so, i.e. the states of the IPsec-SA are mismatched.

    Understood – similar things already happen with time-based
    lifetimes if there is a clock skew between the two boxes.
    (This is particularly bad if the oldest available SA is used
    by the kernel.)

    Racoon's strategy for rekeying is "the initiator does it." If Node-B
    is the responder, Node-A doesn't start rekeying even if the IPsec-SA
    is expired.

    That sounds like a bug in racoon. It seems that if either end is
    unsatisfied with the SA, that end should trigger a new one.

    I'd also call this a shortcoming, at least. The standards are
    weak, and one doesn't know how other implementations behave.
    It would be safer if both sides cared about renegotiations.

    But the key question is what the other implementations do, and what the
    standard says.

    I've just tried OpenBSD's isakmpd (the oldish version in pkgsrc).
    It initiates a Phase 2 exchange if the soft timeout on its
    side expires, even if it was the responder initially. (It randomizes
    the soft timeouts to minimize the chance that both sides start
    the exchange simultaneously.)
    RFC 2409 says that both sides can initiate rekeying. "Can" --
    this is not much of a guideline for implementors.

    I can see the argument that especially with a 24h or less
    lifetime, AES doesn't need volume-based rekeying.

    OK, I was more concerned about interoperability. What if
    the other side insists on some volume limit?

    I've just tried OpenBSD's isakmpd (the oldish version in pkgsrc).
    It initiates a Phase 2 exchange if the soft timeout on its
    side expires, even if it was the responder initially. (It randomizes
    the soft timeouts to minimize the chance that both sides start
    the exchange simultaneously.)
    RFC 2409 says that both sides can initiate rekeying. "Can" --
    this is not much of a guideline for implementors.

    True, but it seems the original responder initiating a renegotiation is
    the only reasonable behavior.

    At the very least, it would appear to suggest that if the original
    initiator rejects an attempt on the part of the original responder to
    rekey, that's a bug.

    True, but it seems the original responder initiating a renegotiation is
    the only reasonable behavior.

    If both sides start rekeying at the same time, there is/was a problem of
    SA selection.

    The two rekeying sessions make two pairs of IPsec-SAs. racoon can
    do this, and IPsec implementations (kernel side) do one of the following:

    a. Use the oldest IPsec-SA to send and keep all IPsec-SAs to receive (KAME)
    b. Use the newest IPsec-SA to send and keep all IPsec-SAs to receive (Fast IPsec)
    c. Use the newest IPsec-SA to send/receive and purge older IPsec-SAs

    Of course, c. is bad behavior, but small implementations (kernel side)
    may handle only one session and one key pair at a time.
    The standards don't prohibit this. This problem exists between the IKE
    standards and the IPsec standards. It seems IKEv2 makes this cleaner.

    Today, most implementations select b., or have a configuration option for it.
    And racoon isn't used on anything other than KAME, Fast IPsec, or Linux (a. or b.).
    I think your logic actually works fine. But racoon is an old product,
    so it hasn't caught up with recent trends.

    http://marc.info/?l=ipsec-tools-devel&m=129905181832157&w=2
    http://marc.info/?l=ipsec-tools-devel&m=129916127621017&w=2

    Let me revive the discussion on active negotiation,
    as opposed to a passive daemon. Until recently my use
    of IPsec was tied to isakmpd, ipsecctl, and OpenBSD,
    and my views are conditioned by this fact. There the
    IPsec daemon is normally active in initiating its
    negotiations at startup, unless told to configure
    a passive listener for a particular tunnel/transport.
    At the other extreme there is even a so-called
    active-only setting.

    The implicit and default setting in racoon-0.7.3 is
    "passive off", but this still waits for a demand to be
    detected. Thus the mode is better described as "passive
    until harshly bugged to get going"! The need to ping
    and wait for a ridiculously long delay should not be
    acceptable in most circumstances. Forgive me the
    criticism, but to me this is a design flaw. It is a
    question of dependability and of trust to erect the
    desired IPsec tunnels already at boot time.

    Funny: when we tried to switch from racoon to isakmpd at work, a long,
    long time ago, this is one of the things we noticed on our TODO list:
    patch isakmpd to negotiate SAs only when traffic comes to the tunnel :-)

    And this is how things should (can?) be done according to RFC 2367,
    which provides the SADB_ACQUIRE PF_KEY message….

    Now, doing some comparative browsing in the 0.7.3
    and 0.8 sources, the actual use of the PASSIVE variable in
    "struct remoteconf" has indeed expanded somewhat.
    Is the code progressing or maturing into a state
    that allows an actively negotiating daemon, i.e.
    without waiting for traffic demand before commencing?

    Not AFAIK.
    Feel free to provide a patch for that; it would not be so
    complicated to parse the whole config and start negotiation for the
    needed tunnels, but there are also setups where we want tunnels
    negotiated only when needed (i.e. when traffic comes to the tunnel), so
    a patch would need to provide this feature as optional.
    The best would be a peer-based (or sainfo-based?) token for
    that.

    Please also note that it is quite easy to generate dummy
    traffic for the needed tunnels when you activate the configuration, if
    you want.
    And of course to generate dummy traffic from time to time to ensure the
    tunnel will always stay up.
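    pfSense exposes this idea as the "Automatically ping host" option on a Phase 2 entry (mentioned earlier in the thread as "Auto Ping remote Host"). The same "dummy traffic from time to time" approach can also be done by hand with a cron job that pings something on the far LAN, sourced from an address inside the local Phase 2 network so the packet matches the tunnel policy. The addresses below are placeholders:

```
# hypothetical crontab entry: one ping across the tunnel every 5 minutes,
# sourced from the local LAN address so it matches the phase 2 policy
*/5 * * * * /sbin/ping -c 1 -S 192.168.1.1 192.168.2.1 > /dev/null 2>&1
```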

