VPN Client Cannot Connect Through pfSense

kc8apf

I've been pulling my hair out over the exact same issue. This worked perfectly fine under 1.2.2. I'm not sure what changed in 1.2.3 that caused it to break.

For my setup, I see the entire key exchange happen successfully on port 500 and then my laptop sends a fragmented UDP packet on port 4500. The remote side never seems to acknowledge that packet. If I connect directly to the WAN, everything seems to work fine.

I also saw the lack of NAT if scrub is disabled. I believe this is just part of how pf works: in order for NAT to be applied to a fragmented packet, pf needs to reassemble it first. Scrub has multiple options, one of which reassembles fragmented packets. pfSense seems to turn on 'fragment reassemble' and 'random-id' by default.
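As a rough illustration of why reassembly matters for NAT, here is a minimal Python sketch. It assumes a NAT-T style datagram on UDP port 4500 and illustrative sizes; the helper names are hypothetical and this is not how pf is implemented. It only shows that the UDP header, and therefore the ports a NAT rule would match or rewrite, lives in the first fragment alone.

```python
import struct

def build_udp_datagram(src_port, dst_port, payload):
    """Return UDP header + payload (checksum left as 0 for brevity)."""
    length = 8 + len(payload)
    return struct.pack("!HHHH", src_port, dst_port, length, 0) + payload

def fragment(ip_payload, max_fragment_payload):
    """Split an IP payload into (offset, bytes) fragments.

    Every fragment except the last must carry a multiple of 8 bytes.
    """
    step = max_fragment_payload - (max_fragment_payload % 8)
    return [(off, ip_payload[off:off + step])
            for off in range(0, len(ip_payload), step)]

def reassemble(fragments):
    """Rebuild the original IP payload from (offset, bytes) fragments."""
    return b"".join(data for _, data in sorted(fragments))

# Illustrative sizes only: 8-byte UDP header + 2212 bytes of payload.
datagram = build_udp_datagram(4500, 4500, b"x" * 2212)
frags = fragment(datagram, 1280)

# Only the first fragment starts with the UDP header, so only it exposes ports.
first_ports = struct.unpack("!HH", frags[0][1][:4])
print(first_ports)
print([len(data) for _, data in frags])
```

The second fragment is just a run of payload bytes with no ports in it, so a translator that wants to treat the datagram consistently has to see the pieces together.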

kc8apf

FWIW, I happen to have a setup and enough gear such that I could set up a parallel pfSense installation for testing. Modifying the config for slight differences in hardware (fxp instead of em, more VLANs since only 2 NICs instead of 5), I found that 1.2.3 does work correctly. So it seems to be related to my hardware or possibly the em driver. It looks like there are some reports of problems with checksum offloading in the em driver in FreeBSD 8.0, but it isn't clear if they apply to earlier versions as well.

After a very hectic night of upgrading my live system to pfSense 2.0-beta1 and then reinstalling 1.2.3-RC1 from scratch to rule out driver problems, it seems that I might have a hardware issue. With 2.0, I was able to connect to my work VPN, but I saw lots of other odd issues such as corrupt packets and the inability of one of my hosts to get a single packet through pf even though the rules allowed it.

I remember the system being 1.2.2 before I did the upgrade to 1.2.3, but 1.2.2 doesn't recognize the em devices in my system. I remember having that experience before, so it's very possible that I was using 1.2.3-RC1 originally.

With 1.2.3-RC1, I'm still seeing some corrupt packets that are blocked by pf due to bad TCP headers. It appears that the packet is being misparsed, so the header data ends up being completely wrong, which leads to a few corrupt packet entries for a single packet. I did notice that my system was a bit hot to the touch (it's a fanless design whose heat dissipation I've had some concerns about), so I enabled powerd. That seemed to lower the temperature (at least by feel), but I still get occasional corrupt packets. I had previously experimented with powerd, but I don't believe it would have been active after a reboot in the original configuration. I have been unable to VPN successfully with this setup, which is very odd since my original configuration was able to.

I've ordered another unit with the same hardware config so I can rule out flaky hardware (and it'd be handy to have as a spare anyway). It should arrive in a few days. I'll see what happens in that case. I'll also be trying to figure out a better heat dissipation system given the machine's installed location. More to follow.

danswartz

Interesting. Let us know what shakes out of this.

kc8apf

I'm still waiting for the spare machine to arrive. In the meantime, I happened to have a fan handy so I tried cooling the live machine to see if that had any effect. The machine is now cool to the touch. I still get the occasional corrupt packets with bad TCP headers and cannot get VPN to connect.

A few other interesting observations:

- The packet corruption seems to occur only on subnets that are bridged. Specifically, em0 is bridged with VLAN 2 on em3, and em1 is bridged with VLAN 2 on em2.
- Turning off txcsum, rxcsum, and vlanhwtag on all interfaces has no effect.
- netstat -s and the driver stats don't show any problems other than the bad headers.

The spare machine should arrive tomorrow. I need to find a hub or another machine with 2 NICs so I can run tcpdump from a machine other than the pfSense box, since it's possible that the problem is happening in the NIC.

kc8apf

Looks like the corrupt packets aren't really corrupt after all. Just a mistake on my part. The default snap length for tcpdump is 64 bytes, which isn't necessarily enough for the packets that show up on pflog0. Adding -s 256 seems to have cleaned those up.
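To see how a short snap length can make intact packets look broken, here is a small Python sketch. The header sizes are hypothetical (the pflog header length in particular is an assumption and differs between versions); the point is only that a capture cut at 64 bytes can end before the TCP header begins, so whatever a decoder prints for that header is not real data.

```python
def visible_bytes(snaplen, layer_lengths):
    """Return how many bytes of each layer survive a capture cut at snaplen."""
    remaining = snaplen
    visible = {}
    for name, length in layer_lengths:
        kept = max(0, min(length, remaining))
        visible[name] = kept
        remaining -= kept
    return visible

# Hypothetical layer sizes for one packet logged on pflog0 (assumptions).
layers = [("pflog", 48), ("ip", 20), ("tcp", 32)]

short_capture = visible_bytes(64, layers)
long_capture = visible_bytes(256, layers)
print(short_capture)
print(long_capture)
```

Under these assumed sizes, the 64-byte capture keeps none of the TCP header, while 256 bytes keeps all of it.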

So, I've done all I can with the live system without perturbing the family.

EddieA

I really haven't had any time to progress on this, unfortunately. However, I'm hoping that this weekend, I can try again.

But I think I now know what the issue is, and why I can make the connection using Linux iptables but cannot with pfSense. I want to confirm it with some traces and "experimentation".

I understand that without scrub reassembling the fragmented packets, pf can't do NAT correctly. Looking at the traces for iptables, I'm assuming that it also has to do reassembly, in order to recalculate the checksums after changing the source IP.

However, it's what happens after this reassembly where iptables and pfSense differ. I know I say pfSense, but I also know it's the underlying FreeBSD that's really in control here, and there's not a lot pfSense can do to influence this.

In my case, the VPN Client sends 2220 bytes of data, split as 1280 and 940. Now, that gives, on the wire, 1300 and 960. Hmmmmmm, 1300 is the "standard" MTU for Cisco VPNs, isn't it?

Anyway, back to the saga. After traversing iptables, the outbound packets are EXACTLY the same size. However, with pfSense, the packets are split, based on the WAN MTU of 1500, or that's what I'm guessing, because what I have is 1500 and 760, on the wire, which gives 1480 and 740 as the data.

So, it looks like iptables "remembers" how the packets were fragmented, and ensures that the outgoing packets are exactly the same size. Obviously, if the WAN MTU is lower, then the packets are fragmented based on that instead, which is what I saw on my initial trace of iptables.
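The sizes in the two traces are consistent with plain IPv4 fragmentation arithmetic. A minimal Python sketch, assuming a 20-byte IPv4 header with no options (the function name is just for illustration):

```python
IP_HEADER = 20  # bytes, assuming an IPv4 header without options

def fragment_sizes(ip_payload_len, mtu):
    """Return (data, on_wire) sizes for each fragment of an IPv4 payload."""
    # Every fragment except the last carries a multiple of 8 data bytes.
    max_data = (mtu - IP_HEADER) // 8 * 8
    sizes = []
    remaining = ip_payload_len
    while remaining > 0:
        data = min(max_data, remaining)
        sizes.append((data, data + IP_HEADER))
        remaining -= data
    return sizes

print(fragment_sizes(2220, 1300))  # matches the client / iptables trace
print(fragment_sizes(2220, 1500))  # matches the pfSense trace
```

With an MTU of 1300 this gives 1280 + 940 bytes of data (1300 and 960 on the wire); with 1500 it gives 1480 + 740 (1500 and 760 on the wire).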

Is there any way I can make pfSense/FreeBSD replicate this behaviour? I'm going to "force" it, by dropping the MTU, on the WAN, to 1300, to see if the VPN will connect. But then this change will affect all packets, not just these particular ones.

I wasn't able to validate this "theory", by using ping, as the target IP does accept a 1472 byte payload, with the "don't fragment" option set. But I don't know where, at the destination, the packets are reassembled, and what the MTU is at the point this happens.
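For reference, the 1472-byte figure is simply a 1500-byte MTU minus the headers. A quick Python sketch, assuming an IPv4 header without options:

```python
IP_HEADER = 20    # IPv4 header without options (assumed)
ICMP_HEADER = 8   # ICMP echo request header

def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented packet."""
    return mtu - IP_HEADER - ICMP_HEADER

def fits_without_fragmenting(payload, mtu):
    """True if a ping with this payload fits in a single packet at this MTU."""
    return payload <= max_ping_payload(mtu)

print(max_ping_payload(1500))
print(max_ping_payload(1300))
print(fits_without_fragmenting(1472, 1300))
```

So a 1472-byte ping getting through with "don't fragment" set only shows that the path carries 1500-byte packets; it says nothing about how the far end treats reassembled fragments.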

I'll post back here, once I've run those tests.

Cheers.

kc8apf

The spare machine showed up today (thanks to Netgate for a very speedy order and delivery). I set it up in parallel to my live machine with the same config (barring the obvious change of the WAN IP) as a first test. VPN failed to connect.

I reset to factory defaults and tried again with a minimum config. Same result.

So, it seems that it isn't bad hardware, but is hardware-specific. I don't have time to hunt this down any further tonight, but I at least know that I can replicate the problem in a test environment with a minimum configuration. Now I need to find a hub or set up a machine as a bridge so I can get impartial tcpdumps.

EddieA

OK, now I am totally confused.

I "adjusted" the MTU, on the WAN interface down to 1300 and re-ran the connection test. It failed yet again.

Now, comparing the traces, on the WAN side between pfSense and iptables, I can see absolutely NO difference, apart, obviously, from the encrypted payload, which neither pfSense nor iptables is going to mess with. The packets are exactly the same sizes, they have exactly the same flags set, everything is the same. >:(

So, does anyone have any other ideas what I can try next?

Cheers.

kc8apf

I set up a spare machine as a bridge so I could get some packet captures from a neutral viewpoint. The resulting captures are at http://www.kc8apf.net/files/. With 1.2.3 and a default config, I was never able to establish a VPN connection. I tried turning off hardware checksumming, TSO, and all of the PF rules other than NAT (edited rules.debug and loaded it with pfctl -f). None seem to have any effect.

I don't have too much experience looking at Wireshark output. I know the various protocols reasonably well, but the display and interpretation are a bit confusing. Anyway, it seems that with 1.2.3 the fragments for the UDP packet for NAT-T have bad header checksums. That prevents the packet from being reassembled. When I had taken captures of the same traffic from pfSense (I don't have the captures anymore), the headers were shown as intact. Between that and my testing with other NICs, this seems to be specific to the 'em' driver and happens after PF has seen the packet.
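For anyone wanting to check a header by hand rather than trusting the Wireshark display, the IPv4 header checksum is the ones' complement of the ones' complement sum of the header's 16-bit words. A minimal Python sketch; the sample header bytes below are a made-up example, not taken from the captures:

```python
import struct

def ipv4_header_checksum(header):
    """Ones' complement of the ones' complement sum of the 16-bit words."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def header_is_valid(header):
    """A header whose checksum field is correct sums to zero."""
    return ipv4_header_checksum(header) == 0

# Example header with the checksum field (bytes 10-11) zeroed out.
zeroed = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = ipv4_header_checksum(zeroed)
filled = zeroed[:10] + struct.pack("!H", checksum) + zeroed[12:]

print(hex(checksum))
print(header_is_valid(filled))
```

A receiver drops any fragment whose header fails this check, which would explain the datagram never being reassembled on the far side.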

I installed the latest 2.0 snapshot with a default config. The VPN connected on the first try. I'm not sure what the exact bug is (there seem to be a few possible problem reports against em), but it seems to be resolved in FreeBSD 8.0. This is good to know, but I'm very hesitant to use the 2.0 snapshots for my live system. I run a number of domains through that router and can't risk having it behave sporadically. At least for the moment, I'll just need to live without a working VPN connection to work.

danswartz

Any chance you can use a NIC with different driver for now?

kc8apf

Considering the machine I'm using is a single-board computer with 4 em ports and 1 fxp, and all of them are in use, no, not really.

danswartz

Ugh :)

EddieA

I've still not had a chance to try my ESXi setup, with a different "virtual" NIC. Hopefully sometime over the weekend. I'll report back once I finally get to it.

I'm also getting an HP T5720 Thin Client that I'm planning on using for pfSense. I'll try with that as well, once it arrives.

Cheers.

EddieA

Ah well, bloody typical. Today is the first day my wife has telecommuted for a couple of weeks, so I was hoping that I could finally prove, or disprove, kc8apf's conjecture that the 'em' driver was the root cause of these issues.

However, between then and now, her company has changed their VPN software. The current software connects, without issue, through my current setup. >:(

So kc8apf, I'm sorry I can no longer check if switching the NIC driver, from 'em', to something else fixes the issue with the fragmented UDP packets. :(

Cheers.

rwalker

I can confirm that the em driver has nothing to do with the issue. I have machines with both fxp and em NICs and they both do the exact same thing. I actually waited a while before upgrading to 1.2.3 from 1.2.2 to avoid this kind of issue. This SUCKS! Has anyone done a downgrade from 1.2.3 to 1.2.2? How did it go?

Roy

danswartz

I would have been surprised. It would be interesting to know what the protocol difference between the old and new VPN software is. That might make it more feasible to come up with a theory. As far as 1.2.2 vs 1.2.3 goes, one thing that would be helpful would be to save /tmp/rules.debug from 1.2.3, do a fresh 1.2.2 install, restore the config, and then compare the two files. It might not be helpful due to lots of pf rules noise, but maybe it would.
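One way to do that comparison is Python's standard difflib. This is only a sketch: the file labels and the placeholder lines are hypothetical, and in practice you would read the two saved copies of rules.debug from disk.

```python
import difflib

def diff_rules(old_lines, new_lines,
               old_name="rules.debug (1.2.2)",
               new_name="rules.debug (1.2.3)"):
    """Return a unified diff of two rule files given as lists of lines."""
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name,
                                     lineterm=""))

def changed_lines(diff):
    """Keep only added/removed lines, dropping the file and hunk headers."""
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]

# Placeholder content standing in for the two saved files (hypothetical).
old = ["shared line", "line only in 1.2.2"]
new = ["shared line", "line only in 1.2.3"]

diff = diff_rules(old, new)
print("\n".join(diff))
```

Filtering down to the added and removed lines with changed_lines() helps cut through the noise from rules that are identical in both versions.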

rwalker

You have to set the static port in the manual NAT for both the ingress and egress interfaces. I.e., if the traffic is coming from the LAN interface and going out the WAN interface, you need 2 NAT rules with static ports on UDP 500, one on each interface, with the same source and destination. 1.2.2 only required the one on the egress interface.

Roy

rwalker

OK, while adding the static port entry on both interfaces got it working, it only stays working for about 2-3 hours. Then you have to reset the state table to get it to connect again. Anyone have an idea why that is? For obvious reasons, resetting the state table is not a viable workaround.

Thanks,
Roy