VPN Client Cannot Connect Through pfSense
-
Well, that didn't help. :(
It still gets part of the way through the negotiation, and then, well, it fails.
What I did do though, was to run a packet capture, on the LAN interface, grabbing everything from the laptop host. Maybe, over the weekend, if I get time, I'll try and resurrect my Linux/iptables firewall, and grab a matching trace, to see if I can spot any anomalies. But, I'll admit that I'm not that good at reading packet traces. ;D
But, unless I can get this resolved, it isn't looking good for pfSense, as she needs to work from home a couple of days a week, and I can't pull everything off the network, to let her connect directly.
Changed NAT and Firewall rules posted.
Because I can't be 100% sure about the order I changed the rules around, the only other test I want to try, first thing tomorrow, is my current rule set, but letting the ICMP packets, either in, or at least logged, as I've seen those cause issues with a different VPN in the past.
Cheers.
-
OK, I set up a second copy of pfSense, to run some tests. The only changes, from the default setup, were as follows:
On the WAN interface, I unchecked both "Block Private Networks", and "Block Bogon Networks".
On the Outbound NAT, I selected "Manual Outbound NAT rule generation (Advanced Outbound NAT (AON))", and then in the generated rule, I changed "Static Port" to YES.
No other rules were added or changed.
My wife's VPN still would not connect. Is there ANYTHING else that pfSense changes as the packet traverses the firewall, apart from what NAT requires, that can be controlled and that might be causing this?
I did a full packet capture, on the LAN interface, of the connection attempt, with pfSense as the firewall.
I then switched back, to my "old" Linux firewall, using iptables, and repeated the same packet capture, when she was able to connect, without any issues.
If anyone, with a little more knowledge of ISAKMP negotiation would like to take a quick look, at the captures, I've posted them both on my ftp server: ftp.BogoLinux.net/pub/pfSense.cap and ftp.BogoLinux.net/pub/iptables.cap
I'd love to get to the bottom of this, as pfSense gives me way more control over my firewall than iptables ever did.
Or, should I wait a couple of days, and if I don't get an answer here, take it to the mailing list.
Cheers.
-
Yuck, that is a pain :( Looked at both traces, but am not an IPSEC expert, so didn't mean a lot to me. Does anything show up as blocked in the filter log?
-
A further thought: I assume the trace was captured on the LAN? Can you try on the WAN and see what is getting through outbound?
-
Yuck, that is a pain :( Looked at both traces, but am not an IPSEC expert, so didn't mean a lot to me. Does anything show up as blocked in the filter log?
Nope, the firewall log was completely blank. That was one of the first things I checked, when I started this exercise.
A further thought: I assume the trace was captured on the LAN? Can you try on the WAN and see what is getting through outbound?
I can, but in order to make the log readable, I'd have to shut down/disconnect the other machines in the house. I guess that might be the next step.
Cheers.
-
Not necessarily. From what I can see, you are never getting the tunnel up, due to a failure in the ISAKMP stuff, which is all related to UDP port 500, so if you captured only based on that…
-
Not necessarily. From what I can see, you are never getting the tunnel up, due to a failure in the ISAKMP stuff, which is all related to UDP port 500, so if you captured only based on that…
True. I might just limit it to UDP though, just in case it's something really funky, and the Static port option isn't working correctly.
The one thing I did notice, going back through the traces, is that the "failure" occurs after the VPN has sent a fragmented packet out, and it doesn't get a reply. Packets 43, 44 for pfSense, and 79, 80, for iptables. Maybe the "scrub" option is breaking things, although the fragment bits appear to be correctly set. It's something else to try tonight, when she gets home again.
Cheers.
-
Ah, I think this is a false alarm. If you are using wireshark (as I was), the fragments are NOT considered UDP, but vanilla IP, so they will not show up. If you remove the UDP filter, you will (I think) see those frames. This is another reason to try sniffing the WAN, I guess.
-
Ah, I think this is a false alarm. If you are using wireshark (as I was), the fragments are NOT considered UDP, but vanilla IP, so they will not show up. If you remove the UDP filter, you will (I think) see those frames. This is another reason to try sniffing the WAN, I guess.
I wasn't filtering, and I did see all the packets.
OK, next experiment. I ran, matching, traces on both the LAN and WAN interfaces, again for both pfSense and iptables. The relevant capture files can be found, on my FTP server, ftp.BogoLinux.net/pub, for those interested.
Concentrating on the packet(s) sent that fail to elicit a response, here's what I see. Note that all interpretation of the packets was done using WireShark.
On the LAN side, a UDP request of 2220 bytes was sent, which was spread over two packets. The first, was identified by WireShark as an IP packet, and contained 1280 bytes of data. The "More fragments" bit is set. The second packet, was identified as UDP/ISAKMP, containing 940 bytes, with no fragmentation bits set. WireShark also shows the reassembled data. Obviously, this is identical for both pfSense and iptables.
On the WAN side, is where things get different, obviously, as one works, and the other doesn't. ;D
For iptables, what I see, are 5 packets. The 1st 4 are all identified as IP packets, each containing 552 bytes of data, and all have the "More fragments" bit set. The 5th packet, is identified as UDP/ISAKMP, with 12 bytes of data, and no fragmentation bits set. Again, WireShark also shows the reassembled data.
Now, for pfSense. Here there are just 2 packets. The 1st one is identified as UDP/ISAKMP, with 1480 bytes of data, and the "More fragments" bit set. The 2nd packet is identified as IP, with 740 bytes of data, and no fragmentation bits set. WireShark does not show reassembled data.
So, why has WireShark interpreted the 2 packets on the WAN side, for pfSense, the opposite way around? Because of this, it appears not to realise that they are fragments, and should be reassembled. I checked through all the headers, but was unable to see anything different that might cause this behaviour. Now, if WireShark has mis-interpreted the packets, is it also possible that the receiving VPN server has made the same "mistake"?
As a final test, I set "Disable firewall scrub", as it looks like the "scrub" also tries to re-assemble fragments, which reloaded the rules, cleared the State Table, for the IP, on the LAN, trying to connect, and tried again. This still didn't work. Would that be sufficient to reset the "scrub" rules, or should I really have re-booted, to make sure everything was "clean". Unfortunately, I didn't have a trace running for that last attempt.
Since then, I have rebooted, so guess the only test left, is to repeat the "Disabled scrub" one, this time with a trace running. Unless anyone has any other ideas to try.
Cheers.
-
OK, because I suspected that "scrub" wasn't completely removed in the tests I did previously, I re-booted pfSense and re-ran the traces. This still failed.
BUT, after disabling "scrub", the fragmented packets are no longer NATted. ??? Yeah, that's right. Any packets passing through pfSense, that are not fragmented, are NATted. The fragmented packets are NOT. Take a look at the screenshot, from Wireshark. Packets 20 ->23 and 27 -> 31. WTF
Should I pursue this on this thread, or start a new one to deal with this no-NAT?
Cheers.
-
I've been pulling my hair out over the exact same issue. This worked perfectly fine under 1.2.2. I'm not sure what changed in 1.2.3 that caused it to break.
For my setup, I see the entire key exchange happen successfully on port 500 and then my laptop sends a fragmented udp packet on port 4500. The remote side never seems to acknowledge that packet. If I connect directly to the WAN, everything seems to work fine.
I also saw the lack of NAT if scrub is disabled. I believe this is just part of how pf works. In order for NAT to be applied to the packet, it needs to reassemble the fragmented packet first. Scrub has multiple options, one of which is for reassembling fragmented packets. pfsense seems to turn on 'fragment reassemble' and 'random-id' by default.
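To illustrate why a port-based NAT rule has nothing to match on in a later fragment, here is a minimal Python sketch. It is not pf's code; the addresses, the IP ID and the helper names (make_fragment, fragment_info) are made up for the example. It builds the two fragments of a 2220-byte UDP datagram and shows that only the fragment at offset 0 carries the UDP header with the port numbers.

```python
import struct

UDP = 17  # IP protocol number for UDP


def make_fragment(ident, offset, more_fragments, data):
    """Build a bare IPv4 header (checksum left at zero) followed by data.

    The addresses are placeholders chosen for this example only.
    """
    # Flags/offset field: bit 0x2000 is "More fragments", the low 13 bits
    # hold the offset in units of 8 bytes.
    flags_frag = (0x2000 if more_fragments else 0) | (offset // 8)
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(data), ident, flags_frag,
        64, UDP, 0,
        bytes([192, 168, 1, 10]), bytes([203, 0, 113, 5]),
    )
    return header + data


def fragment_info(packet):
    """Decode the fragmentation fields, plus the UDP ports if present."""
    ihl = (packet[0] & 0x0F) * 4
    flags_frag = struct.unpack("!H", packet[6:8])[0]
    offset = (flags_frag & 0x1FFF) * 8
    info = {
        "more_fragments": bool(flags_frag & 0x2000),
        "offset": offset,
        "ports": None,
    }
    # The UDP header sits at the start of the datagram, so only the
    # fragment at offset 0 contains source and destination ports.
    if packet[9] == UDP and offset == 0:
        info["ports"] = struct.unpack("!HH", packet[ihl:ihl + 4])
    return info


# A 2220-byte UDP datagram (8-byte header + 2212 bytes of dummy payload).
udp_datagram = struct.pack("!HHHH", 4500, 4500, 2220, 0) + bytes(2212)
first = make_fragment(0x1234, 0, True, udp_datagram[:1480])
second = make_fragment(0x1234, 1480, False, udp_datagram[1480:])

print(fragment_info(first))
print(fragment_info(second))
```

The second fragment reports no ports at all, which fits the observation that, without reassembly, the fragments were not NATted.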
-
FWIW, I happen to have a setup and enough gear such that I could setup a parallel pfsense installation for testing. Modifying the config for slight differences in hardware (fxp instead of em, more VLANs since only 2 NICs instead of 5), I found that 1.2.3 does work correctly. So it seems to be related to my hardware or possibly the em driver. It looks like there are some reports of problems with checksum offloading in the em driver in FreeBSD 8.0, but it isn't clear if they apply to earlier versions as well.
After a very hectic night of upgrading my live system to pfSense 2.0-beta1 and then reinstalling 1.2.3-RC1 from scratch to rule out driver problems, it seems that I might have a hardware issue. With 2.0, I was able to connect to my work VPN, but I saw lots of other odd issues such as corrupt packets and the inability for one of my hosts to get a single packet through pf even though the rules allowed it.
I remember the system being 1.2.2 before I did the upgrade to 1.2.3, but 1.2.2 doesn't recognize the em devices in my system. I remember having that experience before, so it's very possible that I was using 1.2.3-RC1 originally.
With 1.2.3-RC1, I'm still seeing some corrupt packets that are blocked by pf due to bad TCP headers. It appears that the packet is being misparsed, so the header data ends up being completely wrong, which leads to a few corrupt packet entries for a single packet. I did notice that my system was a bit hot to the touch (it's a fanless design, and I've had some concerns about its heat dissipation), so I enabled powerd. That seemed to lower the temperature (at least by feel), but I still get occasional corrupt packets. I had previously experimented with powerd, but I don't believe it would have been active after a reboot in the original configuration. I have been unable to VPN successfully with this setup, which is very odd since my original configuration was able to.
I've ordered another of the same hardware config so I can rule out flakey hardware (and it'd be handy to have as a spare anyway). It should arrive in a few days. I'll see what happens in that case. I'll also be trying to figure out a better heat dissipation system given the machine's installed location. More to follow.
-
Interesting. Let us know what shakes out of this.
-
I'm still waiting for the spare machine to arrive. In the meantime, I happened to have a fan handy so I tried cooling the live machine to see if that had any effect. The machine is now cool to the touch. I still get the occasional corrupt packets with bad TCP headers and cannot get VPN to connect.
A few other interesting observations:
- The packet corruption seems to only occur on subnets that are bridged
  - specifically, em0 is bridged with vlan 2 on em3 and em1 is bridged with vlan 2 on em2
- turning off txcsum, rxcsum, and vlanhwtag on all interfaces has no effect
- netstat -s and the driver stats don't show any problems other than the bad hdrs
The spare machine should arrive tomorrow. I need to find a hub or another machine with 2 nics to do tcpdump from a machine other than the pfsense machine, since it's possible that the problem is happening in the nic.
-
Looks like the corrupt packets aren't really corrupt after all. Just a mistake on my part. The default snaplength for tcpdump is small (68 bytes on older builds, 96 with IPv6 support), which isn't necessarily enough for the packets that show up on pflog0. Adding -s 256 seems to have cleaned those up.
So, I've done all I can with the live system without perturbing the family.
-
I really haven't had any time to progress on this, unfortunately. However, I'm hoping that this weekend, I can try again.
But, I think I now know what the issue is, why I can make the connection using Linux iptables, and cannot with pfSense, which I want to confirm with some traces, and "experimentation".
I understand that without scrub reassembling the fragmented packets, it can't do NAT correctly, and looking at the traces for iptables, I'm assuming that it has to do reassembly in order to recalculate the checksums after changing the source IP.
However, it's what happens after this reassembly where iptables and pfSense differ. I know I say pfSense, but I also know it's the underlying FreeBSD that's really in control here, and there's not a lot pfSense can do to influence this.
In my case, the VPN Client sends 2220 bytes of data, split as 1280 and 940. Now, that gives, on the wire, 1300 and 960. Hmmmmmm, 1300 is the "standard" MTU for Cisco VPNs, isn't it.
Anyway, back to the saga. After traversing iptables, the outbound packets are EXACTLY the same size. However, with pfSense, the packets are split, based on the WAN MTU of 1500, or that's what I'm guessing, because what I have is 1500 and 760, on the wire, which gives 1480 and 740 as the data.
So, it looks like iptables "remembers" how the packets were fragmented, and ensures that the outgoing packets are exactly the same size. Obviously, if the WAN MTU is lower, then the packets are fragmented based on that instead, which is what I saw on my initial trace of iptables.
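The fragment sizes quoted in this thread can be reproduced with a little arithmetic. The Python sketch below assumes a 20-byte IP header with no options; the function name fragment_sizes is made up for the example, and the MTU of 572 for the first iptables trace is only inferred from the 552-byte fragments seen there (552 + 20).

```python
def fragment_sizes(payload_len, mtu, ip_header_len=20):
    """Return the amount of IP payload carried by each fragment."""
    # Every fragment except the last must carry a multiple of 8 bytes,
    # because the fragment offset field counts in 8-byte units.
    max_data = (mtu - ip_header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any payload")
    sizes = []
    remaining = payload_len
    while remaining > max_data:
        sizes.append(max_data)
        remaining -= max_data
    sizes.append(remaining)
    return sizes


# 2220 bytes is the size of the UDP request seen in the traces.
for mtu in (1300, 1500, 572):
    data = fragment_sizes(2220, mtu)
    on_wire = [n + 20 for n in data]
    print(mtu, data, on_wire)
```

An MTU of 1300 gives the 1280 and 940 split the VPN client sends, 1500 gives the 1480 and 740 split seen leaving pfSense, and 572 gives four fragments of 552 plus one of 12.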
Is there any way I can make pfSense/FreeBSD replicate this behaviour? I'm going to "force" it, by dropping the MTU on the WAN to 1300, to see if the VPN will connect. But then this change will affect all packets, not just these particular ones.
I wasn't able to validate this "theory", by using ping, as the target IP does accept a 1472 byte payload, with the "don't fragment" option set. But I don't know where, at the destination, the packets are reassembled, and what the MTU is at the point this happens.
I'll post back here, once I've run those tests.
Cheers.
-
The spare machine showed up today (thanks to Netgate for a very speedy order and delivery). I set it up in parallel to my live machine with the same config (barring the obvious change of the WAN IP) as a first test. VPN failed to connect.
I reset to factory defaults and tried again with a minimum config. Same result.
So, it seems that it isn't bad hardware, but is hardware-specific. I don't have time to hunt this down any further tonight, but I at least know that I can replicate the problem in a test environment with a minimum configuration. Now I need to find a hub or setup a machine as a bridge so I can get impartial tcpdumps.
-
OK, now I am totally confused.
I "adjusted" the MTU, on the WAN interface down to 1300 and re-ran the connection test. It failed yet again.
Now, comparing the traces on the WAN side between pfSense and iptables, I can see absolutely NO difference, apart, obviously, from the encrypted payload, which neither pfSense nor iptables is going to mess with. The packets are exactly the same sizes, they have exactly the same flags set, everything is the same. >:(
So, does anyone have any other ideas what I can try next?
Cheers.
-
I setup a spare machine as a bridge so I could get some packet captures from a neutral viewpoint. The resulting captures are at http://www.kc8apf.net/files/. With 1.2.3 and a default config, I was never able to establish a VPN connection. I tried turning off hardware checksumming, TSO, and all of the PF rules other than NAT (edited rules.debug and loaded it with pfctl -f). None seem to have any effect.
I don't have too much experience looking at Wireshark output. I know the various protocols reasonably well, but the display and interpretation is a bit confusing. Anyway, it seems that with 1.2.3 the fragments for the UDP packet for NAT-T have bad header checksums. That prevents the packet from being reassembled. When I had taken captures of the same traffic from pfSense (I don't have the captures anymore), the headers were shown as intact. Between that and my testing with other NICs, this seems to be specific to the 'em' driver and happens after PF has seen the packet.
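For anyone who wants to check a header checksum by hand rather than trust the Wireshark display, here is a minimal Python sketch of the standard IPv4 header checksum. The addresses and helper names are made up for the example, and it says nothing about what the em driver actually does. It only shows that a header whose source address is rewritten, as NAT does, fails verification unless the checksum is recomputed afterwards.

```python
import struct


def internet_checksum(data):
    """One's complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_header(src, dst, payload_len, ident=0x1234, flags_frag=0x2000):
    """Build a 20-byte IPv4 header for a UDP fragment with a valid checksum."""
    fields = [0x45, 0, 20 + payload_len, ident, flags_frag,
              64, 17, 0, bytes(src), bytes(dst)]
    header = struct.pack("!BBHHHBBH4s4s", *fields)
    fields[7] = internet_checksum(header)  # fill in the checksum field
    return struct.pack("!BBHHHBBH4s4s", *fields)


def header_checksum_ok(header):
    # Summing a header together with its own checksum field gives zero
    # after complementing, if the header is intact.
    return internet_checksum(header) == 0


good = build_header((192, 168, 1, 10), (203, 0, 113, 5), 1480)
# Rewrite the source address (bytes 12..15) without touching the checksum.
rewritten = good[:12] + bytes((203, 0, 113, 9)) + good[16:]

print(header_checksum_ok(good))
print(header_checksum_ok(rewritten))
```

The same function can be run over the first 20 bytes of each fragment exported from a capture, assuming the header carries no IP options.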
I installed the latest 2.0 snapshot with a default config. The VPN connected on the first try. I'm not sure what the exact bug is (there seem to be a few possible problem reports against em), but it seems to be resolved in FreeBSD 8.0. This is good to know, but I'm very hesitant to use the 2.0 snapshots for my live system. I run a number of domains through that router and can't risk having it behave sporadically. At least for the moment, I'll just need to live without a working VPN connection to work.
-
Any chance you can use a NIC with different driver for now?
-
Considering the machine I'm using is a single-board computer with 4 ems and 1 fxp, and all of them are in use, no, not really.
-
Ugh :)
-
I've still not had chance to try my ESXi setup, with a different "virtual" NIC. Hopefully sometime over the weekend. I'll report back once I finally get to it.
I'm also getting an HP T5720 Thin Client that I'm planning on using for pfSense. I'll try with that as well, once it arrives.
Cheers.
-
Ah well, bloody typical. Today is the first day my wife has telecommuted for a couple of weeks, so I was hoping that I could finally prove, or disprove, kc8apf's conjecture that the 'em' driver was the root cause of these issues.
However, between then and now, her company has changed their VPN software. The current software connects, without issue, through my current setup. >:(
So kc8apf, I'm sorry I can no longer check if switching the NIC driver, from 'em', to something else fixes the issue with the fragmented UDP packets. :(
Cheers.
-
I can confirm that the em driver has nothing to do with the issue. I have machines with both fxp and em nics and they both do the exact same thing. I actually waited a while before upgrading to 1.2.3 from 1.2.2 to avoid this kind of issue. This SUCKS! Anyone done a downgrade from 1.2.3 to 1.2.2? How did it go?
Roy
-
I would have been surprised. It would be interesting to know what the protocol difference between the old and new VPN software is. That might make it more feasible to come up with a theory. As far as 1.2.2 vs 1.2.3, one thing that would be helpful would be to save /tmp/rules.debug from 1.2.3 - do a fresh 1.2.2 install, restoring the config and then compare the two files. Might not be helpful due to lots of pf rules noise, but maybe it would.
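One way to cut through the rules noise when comparing the two files is to strip comments and blank lines and keep only the lines that differ. Here is a minimal Python sketch using difflib; the function name changed_rules is made up, and the two rule snippets are invented placeholders, not real output from either version.

```python
import difflib


def changed_rules(old_text, new_text):
    """Return only the rule lines that differ between two rule sets."""
    def clean(text):
        # Drop blank lines and comments so only real rules are compared.
        return [line.strip() for line in text.splitlines()
                if line.strip() and not line.strip().startswith("#")]

    diff = difflib.unified_diff(clean(old_text), clean(new_text),
                                fromfile="old", tofile="new",
                                lineterm="", n=0)
    # Keep the "-" and "+" lines, skip the file and hunk headers.
    return [line for line in diff
            if line[:1] in "+-" and not line.startswith(("---", "+++"))]


# Invented placeholder rules, for illustration only.
old_rules = """
# generated rules
scrub in all fragment reassemble
pass out all keep state
"""
new_rules = """
scrub in all random-id fragment reassemble
pass out all keep state
"""

for line in changed_rules(old_rules, new_rules):
    print(line)
```

In practice the two strings would be read from the saved copies of /tmp/rules.debug, for example with open(path).read().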
-
You have to set the static port in the manual NAT for both the ingress and egress interface. I.e., if the traffic is coming from the LAN interface out the WAN interface, you need 2 NAT rules with static ports on UDP 500, one on each interface, with the same source and destination. 1.2.2 only required the one on the egress interface.
Roy
-
OK, while adding the static port entry on both interfaces got it working, it only stays working for about 2-3 hours. Then you have to reset the state table to get it to connect again. Anyone have an idea why that is? For obvious reasons, resetting the state table is not a viable workaround.
Thanks,
Roy