iPhone IPSEC working again

daplumber

I hadn't checked in a while, but IPSEC on my 06/02 snapshot wasn't working with my iPhone. I confirmed nothing had changed and restored the IPSEC section from the backup taken when I had last used it. After checking settings and logs, the failure seemed to be that phase 1 completed but phase 2 never started. I didn't save the log, sorry.

On the off chance and mostly out of habit I hit the auto-update. Lo! and Behold! with version

FreeBSD 8.3-RELEASE-p2 #1: Sun Jun 3 09:39:21 EDT 2012 root@FreeBSD_8.3_pfSense_2.1.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 i386

The IPSEC from iPhone is automagically working again! Yay!

(I think this counts as "feedback", right?) ;D

daplumber

OK, so "Sun Jun 3 14:50:59 EDT 2012" broke IPSEC again. That's what full backups are for…

jimp

Update to a current snapshot and try again.

daplumber

So, finally got a chance to test again, this time with ver:

"8.3-RELEASE-p2 FreeBSD 8.3-RELEASE-p2 #1: Tue Jun 5 17:03:37 EDT 2012 root@FreeBSD_8.3_pfSense_2.1.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 i386"

and it's not working again. Bear in mind I haven't changed config on either server or mobile client end. Now I get this in the log:

Jun 5 22:21:50 	racoon: INFO: @(#)ipsec-tools 0.8.0 (http://ipsec-tools.sourceforge.net)
Jun 5 22:21:50 	racoon: INFO: @(#)This product linked OpenSSL 0.9.8q 2 Dec 2010 (http://www.openssl.org/)
Jun 5 22:21:50 	racoon: INFO: Reading configuration from "/var/etc/racoon.conf"
Jun 5 22:21:50 	racoon: INFO: Resize address pool from 0 to 253
Jun 5 22:21:50 	racoon: [Self]: INFO: WAN.WAN.WAN.WAN[4500] used for NAT-T
Jun 5 22:21:50 	racoon: [Self]: INFO: WAN.WAN.WAN.WAN[4500] used as isakmp port (fd=13)
Jun 5 22:21:50 	racoon: [Self]: INFO: WAN.WAN.WAN.WAN[500] used for NAT-T
Jun 5 22:21:50 	racoon: [Self]: INFO: WAN.WAN.WAN.WAN[500] used as isakmp port (fd=14)
Jun 5 22:25:12 	racoon: INFO: unsupported PF_KEY message REGISTER
Jun 5 22:44:15 	racoon: [Self]: INFO: respond new phase 1 negotiation: WAN.WAN.WAN.WAN[500]<=>MOB.MOB.MOB.MOB[500]
Jun 5 22:44:15 	racoon: INFO: begin Aggressive mode.
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: RFC 3947
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-08
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-07
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-06
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-05
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-04
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-03
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-02
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsec-nat-t-ike-02
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: draft-ietf-ipsra-isakmp-xauth-06.txt
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: CISCO-UNITY
Jun 5 22:44:15 	racoon: INFO: received Vendor ID: DPD
Jun 5 22:44:15 	racoon: [MOB.MOB.MOB.MOB] INFO: Selected NAT-T version: RFC 3947
Jun 5 22:44:15 	racoon: INFO: Adding remote and local NAT-D payloads.
Jun 5 22:44:15 	racoon: [MOB.MOB.MOB.MOB] INFO: Hashing MOB.MOB.MOB.MOB[500] with algo #2 (NAT-T forced)
Jun 5 22:44:15 	racoon: [Self]: [WAN.WAN.WAN.WAN] INFO: Hashing WAN.WAN.WAN.WAN[500] with algo #2 (NAT-T forced)
Jun 5 22:44:15 	racoon: INFO: Adding xauth VID payload.

Where "WAN" is the IP of the pfSense WAN NIC, and "MOB" is the IP of the mobile device.

Here's the log from the client end (OS X):

6/5/12 10:44:15.653 PM configd: IPSec connecting to server FQDynDNSN
6/5/12 10:44:15.653 PM configd: SCNC: start, triggered by SystemUIServer, type IPSec, status 0
6/5/12 10:44:15.722 PM configd: IPSec Phase1 starting.
6/5/12 10:44:15.732 PM racoon: IPSec connecting to server WAN.WAN.WAN.WAN
6/5/12 10:44:15.732 PM racoon: Connecting.
6/5/12 10:44:15.732 PM racoon: IPSec Phase1 started (Initiated by me).
6/5/12 10:44:15.737 PM racoon: IKE Packet: transmit success. (Initiator, Aggressive-Mode message 1).
6/5/12 10:44:16.052 PM racoon: IKEv1 Phase1 AUTH: success. (Initiator, Aggressive-Mode Message 2).
6/5/12 10:44:16.052 PM racoon: IKE Packet: receive success. (Initiator, Aggressive-Mode message 2).
6/5/12 10:44:16.053 PM racoon: IKEv1 Phase1 Initiator: success. (Initiator, Aggressive-Mode).
6/5/12 10:44:16.053 PM racoon: IKE Packet: transmit success. (Initiator, Aggressive-Mode message 3).
6/5/12 10:44:16.053 PM racoon: IPSec Phase1 established (Initiated by me).
6/5/12 10:44:26.078 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].
6/5/12 10:44:36.166 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].
6/5/12 10:44:45.973 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].
6/5/12 10:44:46.054 PM configd: IPSec disconnecting from server WAN.WAN.WAN.WAN
6/5/12 10:44:46.055 PM racoon: IPSec disconnecting from server WAN.WAN.WAN.WAN
6/5/12 10:44:46.055 PM racoon: IKE Packet: transmit success. (Information message).
6/5/12 10:44:46.055 PM racoon: IKEv1 Information-Notice: transmit success. (Delete ISAKMP-SA).

It looks like Phase 2 just doesn't seem to get started?
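
A rough way to check this from the client side is to scan the log for the phase markers. The sketch below is only an illustration: the marker strings are copied from the OS X log above, except "Phase2", which is an assumption about how a Phase 2 attempt would show up; `summarize_client_log` is a made-up helper name.

```python
def summarize_client_log(lines):
    """Summarize IKE progress from OS X racoon/configd log lines."""
    summary = {
        "phase1_established": False,
        "phase2_seen": False,
        "retransmits": 0,
        "sa_deleted": False,
    }
    for line in lines:
        if "Phase1 established" in line:
            summary["phase1_established"] = True
        # Assumption: a Phase 2 attempt would mention "Phase2" somewhere.
        if "Phase2" in line:
            summary["phase2_seen"] = True
        if "Received retransmitted packet" in line:
            summary["retransmits"] += 1
        if "Delete ISAKMP-SA" in line:
            summary["sa_deleted"] = True
    return summary


# A few lines taken from the client log above.
sample = [
    "6/5/12 10:44:16.053 PM racoon: IPSec Phase1 established (Initiated by me).",
    "6/5/12 10:44:26.078 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].",
    "6/5/12 10:44:36.166 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].",
    "6/5/12 10:44:45.973 PM racoon: Received retransmitted packet from WAN.WAN.WAN.WAN[500].",
    "6/5/12 10:44:46.055 PM racoon: IKEv1 Information-Notice: transmit success. (Delete ISAKMP-SA).",
]
print(summarize_client_log(sample))
```

On these lines it reports Phase 1 established, three retransmits, no Phase 2, and the SA deleted. That would be consistent with the server never accepting the client's last Phase 1 message, though the log alone doesn't prove it.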

jimp

Not sure what it's not doing for you… I am on the latest snapshot, and I can connect right up with my phone and surf to things on the lan side.

daplumber

@jimp:

Not sure what it's not doing for you… I am on the latest snapshot, and I can connect right up with my phone and surf to things on the lan side.

Would you mind forwarding me a copy of your settings? That way I can test to see if something else in my install is broken.

daplumber

OK, so now with an update to:

8.3-RELEASE-p2 FreeBSD 8.3-RELEASE-p2 #1: Tue Jun 5 23:58:17 EDT 2012 root@FreeBSD_8.3_pfSense_2.1.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 i386

It's working again. No config changes at all, and same end point hardware and networks. What the <bleep> is going on? I didn't see anything to do with IPSEC in the commit or activity for this last update?

rcfa

@daplumber:

It's working again. No config changes at all, and same end point hardware and networks. What the <bleep> is going on? I didn't see anything to do with IPSEC in the commit or activity for this last update?

Not sure if it's related, but I had to disable/enable IPSec after an upgrade to get things working and/or reboot the system a second time. After, things seem to run fairly reliably, but after the system comes up after an upgrade, it usually doesn't work properly until I cycle IPSec and/or reboot.

daplumber

This is getting insane. Anecdotal experience is that my IPSEC stops working every other update and then works again.

Checkpoint: "8.3-RELEASE-p2 FreeBSD 8.3-RELEASE-p2 #1: Fri Jun 8 06:50:37 EDT 2012 root@FreeBSD_8.3_pfSense_2.1.snaps.pfsense.org:/usr/obj./usr/pfSensesrc/src/sys/pfSense_SMP.8 i386"

is working again. The previous update wasn't. Bouncing the service seems to matter not one whit.

rcfa

Sorry to hear that bouncing the service doesn't work for you. I assume a second reboot after the install didn't do anything either?

I have obviously slightly different issues with IPSec, one is, that when I pass massive amounts of traffic through the tunnel (easy, given that all my IPv4 traffic passes through that tunnel), it silently stops working: tunnel shows as up, gateways are up, etc. just traffic stops flowing (still trying to figure out how to debug that one). Things on the dashboard are indistinguishable from the working setup. Bounce the tunnel down and up, back to working condition, until the next time it happens. Tried the prefer older SA setting, too, no difference. ???

Anyway, not trying to hijack your thread, just saying there are still some glitches somewhere in IPSec.

daplumber

Understood. I'm just wondering if there's any difference between the first snapshot build of the day and the second? The second is the one that seems to work for me. If no-one's working on it, it should be the same, right? ;) ::)

If you have a lot of traffic, have you checked to see if racoon is running out of some resource or maybe timing out somewhere? Generically speaking code under load can cause bugs to crawl out of the woodwork that may not otherwise show up, especially timing issues, resource allocation and cleanup, locks, and race conditions. One of the first test activities in a former life of mine as a tester is to ramp up the usage until something breaks. IMHO well-written "defensive" code should degrade gracefully, then refuse to service more requests and/or abort with a meaningful message about what resource was exhausted or error occurred.

Programmers loathe testers, usually because they've finally got something to work after many hours and frustrations, and now: "This <bleep> wants the code to behave under abusive/crazy conditions?!" I was/am a very good tester. I can break any code; the point was that it should break only with a sufficiently high level of effort, and "go down screaming errors about the injustice of the insanity it is being subjected to."

The FreeBSD "fortunes" have always had some of the best quotes IMHO:

Osborn's Law:
Variables won't; constants aren't.

O'Toole's Commentary on Murphy's Law:
Murphy was an optimist.

Our OS who art in CPU, UNIX be thy name.
Thy programs run, thy syscalls done,
In kernel as it is in user!

(All starting with "O" for some reason… ;D )

rcfa

@daplumber:

Understood. I'm just wondering if there's any difference between the first snapshot build of the day and the second? The second is the one that seems to work for me. If no-one's working on it, it should be the same, right? ;) ::)

One should think so…
...but since you were quoting, here's one of my favorite ones:

The difference between theory and practice is, that in theory there is no such difference, but in practice, there is.

@daplumber:

If you have a lot of traffic, have you checked to see if racoon is running out of some resource or maybe timing out somewhere? Generically speaking code under load can cause bugs to crawl out of the woodwork that may not otherwise show up, especially timing issues, resource allocation and cleanup, locks, and race conditions. One of the first test activities in a former life of mine as a tester is to ramp up the usage until something breaks. IMHO well-written "defensive" code should degrade gracefully, then refuse to service more requests and/or abort with a meaningful message about what resource was exhausted or error occurred.

I understand what you say, I did a reasonable bit of software testing too, and I'm obviously still good at using things in a way that they break :D

I should explain, though, what I mean by "lots of traffic". Most of the day, the internet connection just sits there idle: here a page load on a web site, there an e-mail trickling in. We're talking about a few hundred e-mail messages per day, maybe a few hundred or low thousands of web pages visited. "Heavy traffic" is something like downloading a Mac OS X update disk image of e.g. 1.4GB, or streaming a Netflix movie.
So it's the sheer data volume that's somewhat heavy, not the number of requests to racoon or such.
The "beauty" of it is that you look at the IPSec status page, the dashboard, etc. and everything looks fine and dandy. Just nothing is happening. Anything that brings down the tunnel and restarts it works; the easiest way is to toggle the "enable IPSec" checkbox on the IPsec page.

I guess I just have to keep my eyes peeled, and hopefully sooner or later I find that fried moth…

wallabybob

@rcfa:

when I pass massive amounts of traffic through the tunnel (easy, given that all my IPv4 traffic passes through that tunnel), it silently stops working: tunnel shows as up, gateways are up, etc. just traffic stops flowing (still trying to figure out how to debug that one). Things on the dashboard are indistinguishable from the working setup.

Does data transfer recover after (say) 5 to minutes?

It might be worth doing a packet capture on the tunnel interface when it is in this state: maybe there is no traffic at all, maybe the only traffic is the two ends saying to each other "I'm here".

@rcfa:

Bounce the tunnel down and up, back to working condition, until the next time it happens.

Have you tried "less brutal" means such as initiating a ping across the tunnel or starting a new TCP connection across the tunnel (e.g. to access a web page)?

rcfa

@wallabybob:

@rcfa:

when I pass massive amounts of traffic through the tunnel (easy, given that all my IPv4 traffic passes through that tunnel), it silently stops working: tunnel shows as up, gateways are up, etc. just traffic stops flowing (still trying to figure out how to debug that one). Things on the dashboard are indistinguishable from the working setup.

Does data transfer recover after (say) 5 to minutes?

Not that I'm aware of…
Here's a couple of typical modes of failure:
a) I return to my computer after a longer idle period, try to access a web page: nothing happens, eventually it times out with an error page. I notice Skype's off-line, too. So at this point, chances are that I'm catching it after having been in that state for a while...

b) I'm actively browsing the web, downloading something or another, suddenly the downloads "slow down" (which is of course just the effect of the browser calculating average download speed, when in reality the download just plain stops). Since there are speed fluctuations anyway, I may not notice, until I open another web page, or notice that Skype is off-line.
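
To illustrate mode (b) with made-up numbers: if the browser shows the average speed since the download started, a complete stall looks like a gradual slowdown. This is only a sketch; `average_speed` is a hypothetical helper and the figures are not measurements from my setup.

```python
def average_speed(bytes_received, elapsed_seconds):
    """Average throughput since the download started, in bytes per second."""
    return bytes_received / elapsed_seconds


# Hypothetical numbers: 60 MB arrive in the first 60 s, then the tunnel stalls
# and no further bytes arrive.
received = 60_000_000
for elapsed in (60, 90, 120, 240):
    print(elapsed, "s:", average_speed(received, elapsed) / 1e6, "MB/s")
```

The displayed average drops from 1 MB/s at 60 s to 0.25 MB/s at 240 s even though nothing at all has arrived since the stall.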

@wallabybob:

It might be worth doing a packet capture on the tunnel interface when it is in this state: maybe there is no traffic at all, maybe the only traffic is the two ends saying to each other "I'm here".

Well, unless the SA monitor and the Dashboard are plain lying, the tunnel is up, so something is supposed to be working.

@wallabybob:

@rcfa:

Bounce the tunnel down and up, back to working condition, until the next time it happens.

Have you tried "less brutal" means such as initiating a ping across the tunnel or starting a new TCP connection across the tunnel (e.g. to access a web page)?

Yep, see above.

The only thing that's somewhat "abnormal" about my setup, is that this IPSec link is the IPv4 pseudo-default route. Obviously as far as pfSense goes, the default route is something else, i.e. the WAN interface, but since the remote network on the IPSec link is 0.0.0.0/0, it snarfs up all regular traffic. So maybe IPSec, which usually is used just for snarfing up a specific subnet's traffic has some glitches that are exposed by my somewhat different use.

What makes things since yesterday a bit more difficult, is that I now also have an IPv6 tunnelbroker interface, which is the default route for IPv6 traffic. So now I can have a mixed-mode situation, where IPv6 works, and IPv4 doesn't, so I'm not as quick to catch on with the IPSec tunnel acting up, because some things may continue to work, because they use the IPv6 network…

jimp

Do you control the other side of the IPsec tunnel?

What you describe is a classic symptom of the far side dropping the P1 but not informing you that it did so. pfSense, without DPD, has no way to know it's down, so it keeps trying to talk on the SA it has.

If you can enter a keep-alive IP on the far side for an IP in your LAN, that would make their end re-establish a P1 when it fails and maintain connectivity.
Otherwise, make sure both sides support DPD.
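
The idea behind both suggestions is the same: probe the peer regularly and treat a run of missed replies as a dead tunnel, rather than trusting an SA that still looks up. Below is a minimal sketch of that logic, not racoon's actual DPD code. `PeerMonitor`, `probe`, and `max_fail` are made-up names, the threshold is arbitrary, and the probe is simulated rather than a real ping.

```python
class PeerMonitor:
    """Declare a peer dead after max_fail consecutive failed probes."""

    def __init__(self, probe, max_fail=5):
        self.probe = probe        # callable returning True if the peer answered
        self.max_fail = max_fail  # hypothetical threshold, not a pfSense default
        self.failures = 0

    def check(self):
        """Run one probe and return 'alive', 'suspect', or 'dead'."""
        if self.probe():
            self.failures = 0
            return "alive"
        self.failures += 1
        if self.failures >= self.max_fail:
            return "dead"
        return "suspect"


# Simulated probe results: the peer answers twice, then goes silent.
answers = iter([True, True, False, False, False])
monitor = PeerMonitor(probe=lambda: next(answers), max_fail=3)
states = [monitor.check() for _ in range(5)]
print(states)
```

In real use the probe would be something like a ping to an address across the tunnel, and a "dead" result would be the cue to renegotiate the P1 instead of continuing to send on the stale SA.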

rcfa

@jimp:

Do you control the other side of the IPsec tunnel?

What you describe is a classic symptom of the far side dropping the P1 but not informing you that it did so. pfSense, without DPD, has no way to know it's down, so it keeps trying to talk on the SA it has.

If you can enter a keep-alive IP on the far side for an IP in your LAN, that would make their end re-establish a P1 when it fails and maintain connectivity.
Otherwise, make sure both sides support DPD.

The other side is a ZyWALL unit, and the link is marked as "nailed up", i.e. permanent/auto-reestablish.
It also has a remote monitor IP that's the LAN address of the pfSense box that it is supposed to ping regularly.

And DPD is turned on on the pfSense box, too.

rcfa

I hope I'm not jinxing myself by posting this, but, things seem to have remained stable since I turned off NAT Traversal on both sides.

Strictly speaking, right now things don't go through NAT, but there are/were cases when I had to put a VoIP appliance between the WAN and the firewall, at which point there would be NAT even though the firewall would be an "exposed host". So for such cases, I always had NAT traversal turned on, and during link negotiation the systems notice that it's not needed and then don't use it.

This was the same with pfSense according to the logs, so I figured it was fine. For shits and giggles, I turned NAT-T off on both sides, and since then things have been up. (Of course, maybe I was just lucky and in a few hours I'll have to say: "Oops, back to the same old…")

Still, while it seems like I might have found a cure, why would it negotiate a NAT-T free connection, and then later fail?

Well, I'll keep an eye on things, to see if it now stays up reliably, which would be great.

Or have there been other recent changes that could have had an influence on this issue?