Upgrade to 2.4.5 broke 802.1x RADIUS WiFi over VPN
-
Updating my OpenVPN host from pfSense 2.4.4-p3 to 2.4.5 broke 802.1x WPA2-Enterprise WiFi at the remote sites. The problem appears to be related to the RADIUS handshake or connectivity. Reverting the OpenVPN host (main site) to 2.4.4-p3 restores functionality. A remote site can remain on 2.4.5 and it works again, so long as the main site is on 2.4.4-p3 or older.
The setup is as follows:
- OpenVPN is set up as a site-to-site tunnel, routable between sites. I can connect directly between PCs at different sites.
- Firewalls set to allow all traffic over OpenVPN tunnel.
- All sites have UniFi UAP access points, talking to single RADIUS server at main site.
- RADIUS server is a Windows Server 2012R2 domain controller + DNS + NPS (etc.).
- Clients are primarily domain-joined Windows PCs, authenticating with a computer certificate. Phones use username/password and that seems to break too.
I have 2 remote sites, one running 2.4.4-p3 and the other 2.4.5. Both exhibit the same behavior, and only the main site (host) pfSense version seems to matter. I can provide config specifics as needed.
When the host is on 2.4.4-p3, everything works fine. When I update it to 2.4.5, WiFi authentication fails, and laptops try to connect over and over with no logged error (thanks, Microsoft). I do see RADIUS connectivity in the state tables of both the host and remote pfSense. I also see RADIUS activity start (but never succeed or fail) in the server log. I can SSH into the AP and ping the RADIUS server, and ping the AP from the RADIUS server, regardless of pfSense version. I suspect some packets are being routed differently, dropped, or modified on the latest version in a way the previous version didn't touch. Or vice versa??
-
@DAVe3283 I'd do packet captures on both main and remote sites for traffic from the APs
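Something like this on each firewall would catch both the RADIUS datagrams and any fragments (interface names are just examples; the second clause matters because only the first fragment of a fragmented datagram still carries the UDP header):

tcpdump -ni igb1 -w /tmp/radius-lan.pcap 'udp port 1812 or (ip[6:2] & 0x3fff != 0)'
tcpdump -ni ovpns1 -w /tmp/radius-vpn.pcap 'udp port 1812 or (ip[6:2] & 0x3fff != 0)'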
-
I was hoping someone had seen this already and it was either a known issue or an easy fix.
I captured the traffic with the main site rolled back to 2.4.4-p3 (working WiFi join). I'll re-update to 2.4.5 this weekend and get a pair of captures with it in the broken state, and go from there. I don't want to break my family's WiFi while they are all trying to work from home... again
If anyone has any insight that might save me from learning the low-level protocol behind 802.1x, please chime in!
-
Nope, that's not anything we are aware of. I doubt it's 802.1x specific, that just happens to be hitting some other issue.
I would normally look at a NAT issue here but you say it's all routed?
Then check for a fragmentation problem; I've seen that in similar situations. I'm not aware of anything relevant that changed going to 2.4.5, though. A packet capture should show it if that's what is happening.
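To pick the fragments out of a capture quickly, a display filter along these lines works (assuming you open the file with tshark/Wireshark):

tshark -r radius-lan.pcap -Y 'ip.flags.mf == 1 || ip.frag_offset > 0'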
Steve
-
Correct, it is all routed. Each site is on a different subnet:
- 10.0.0.0/23 main site LAN
- 10.0.7.0/24 VPN tunnel
- 10.0.6.0/24 site "H" LAN
- 10.0.8.0/24 site "C" LAN
Traceroute from the Domain Controller / NPS / RADIUS server:
PS C:\Users\DAVe3283> tracert UAP-AC-LR.<site C>.<domain>

Tracing route to UAP-AC-LR.<site C>.<domain> [10.0.8.2]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  pfsense1.<domain> [10.0.1.100]
  2   103 ms   112 ms   114 ms  pfsense.<site C>.<domain> [10.0.7.8]
  3    82 ms    77 ms    99 ms  uap-ac-lr.<site C>.<domain> [10.0.8.2]

Trace complete.
Traceroute from UniFi access point:
BZ.v3.9.15# traceroute DC.<domain>
traceroute to DC.<domain> (10.0.1.1), 30 hops max, 38 byte packets
 1  pfSense.<site C>.<domain> (10.0.8.1)  0.108 ms  0.134 ms  0.151 ms
 2  pfsense1.<domain> (10.0.7.1)  152.666 ms  135.819 ms  101.842 ms
 3  dc.<domain> (10.0.1.1)  95.220 ms  81.191 ms  69.084 ms
Anyone see anything wrong with that general setup?
I did notice in the working capture that some of the RADIUS messages are large enough to fragment, and they carry all the way through the tunnel to the RADIUS server still fragmented. I will pay close attention to those fragmented packets when I redo the test on the new version. If those are getting dropped or mis-assembled rather than just passed through, that would do it.
-
If traffic is being fragmented I would bet that is somehow causing this.
Make sure you don't have pfscrub disabled anywhere.
Is it failing from all sites? Are they all sending fragmented packets?
Assigning the OpenVPN interface will change the way fragmented packets are handled. We have seen it work correctly with the interface assigned at the receiving end when it failed without that.
Steve
-
@stephenw10 Good points, thanks!
I can post the working packet capture if needed, but from what I can tell in Wireshark, some of the RADIUS messages are larger than a single packet, so they get fragmented, probably due to the size of the public certificates being sent. And with the VPN host on 2.4.4-p3, those packets get passed straight through untouched, and everything works.
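For a rough sense of scale (the certificate chain size here is an assumption for illustration; the datagram limit is from RFC 2865):

# RADIUS permits datagrams up to 4096 bytes (RFC 2865), so NPS can build
# Access-Challenges carrying certificate data well past the 1500-byte MTU.
# A 4096-byte datagram split into ~1480-byte fragment payloads
# (1500 minus the 20-byte IP header) comes out to 3 fragments on the wire.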
All sites were failing when the main site was upgraded to 2.4.5. One remote site is 2.4.5, the other is 2.4.4-p3, so it seems the deciding factor is the host/receiving end of the VPN.
This has been my config for as long as I can remember:
[screenshot: OpenVPN server configuration]
That look OK? The remote sites use the same settings, except one site has the Firewall Optimization Options set to "High-latency" due to their internet connection type. The OpenVPN interface is not assigned on either end currently. I can try that if it turns out to be a fragmentation issue.
My plan is to try and re-upgrade the main site to 2.4.5 this Saturday and redo the packet captures once things stop working. Hopefully it is as easy as assigning an interface. Is that the recommended configuration? Should I be assigning the OpenVPN interface at all sites, or just the main/host?
-
Yes, that looks fine.
The packet capture should show what's happening. I would guess that the packet fragments are being dropped somewhere. Where we've seen that before, it was fragments never leaving the internal interface at the receiving end.
Steve
-
@stephenw10 So far I've been using tcpdump on the pfSense box to capture what it is passing through the various interfaces. Is that reliable for catching the packets being dropped in this situation? Or do I need to find a way to capture the LAN port with an external tool? I hope not...
-
Yes, that is how I've seen it previously. I could see packets (or fragments) coming in on the OpenVPN interface and not leaving the internal interface as I expected them to.
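Running a capture on both interfaces at once makes the drop point obvious, e.g. (interface names are examples):

tcpdump -ni ovpns1 'ip[6:2] & 0x3fff != 0'
tcpdump -ni igb1 'ip[6:2] & 0x3fff != 0'

If fragments show up on the first and never on the second, that's where they are being eaten.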
Still not sure what might have induced that in 2.4.5, if that is what you're seeing.
Steve
-
I updated to pfSense 2.4.5 again, and as soon as a client tried to connect where the RADIUS packets were fragmented, it wouldn't work. So something with fragmented packet handling changed from 2.4.4-p3 to 2.4.5, and that is breaking RADIUS in this case.
For now, I assigned the OpenVPN interface, and enabled it. It was not enabled after assignment. Should I have left it disabled?
[screenshot: the assigned OpenVPN interface]
I then had to restart the OpenVPN service, but things are working again! Unfortunately, not all connection attempts result in fragments, so that could just be coincidence; I will watch for any future failures.
Hopefully it will just keep working, and I can go back to my regularly scheduled weekend! If not, I guess I will start a deep dive into the packet captures. I grabbed them at every interface along the entire route, so I have plenty to look at.
-
Nice, that does sound like what you were hitting then.
When I've seen it before it didn't take much analysis: the packets were simply not being sent from the internal interface, which is where the RADIUS server is in this case, I'd guess. I have no idea why that behaviour might have changed. If you confirm that was the issue we will have a look into it.
Steve
-
@stephenw10 This does seem to be the fix. I now have a packet capture of a working exchange with fragmented packets after adding the OpenVPN interface, and a failure before adding the interface. I can post the captures and full network topo if that would help, but I would rather share it privately if possible.
-
Really we would need to reproduce it here. Can you say how and where it was failing before assigning the interface?
Like packet fragments were arriving on the OpenVPN interface pcap but never leaving on the internal pcap? That's what I expect to see if anything.
Steve
-
@stephenw10 Sure. I dug in, and found that this appears to be several things coming together to cause the failure.
In all cases, packets are fragmented at 1504 bytes inside the OpenVPN tunnel.
The twist: on my local (main) LAN, I have jumbo frames turned on (MTU of 9000 bytes). But not on the RADIUS server; I forgot, so it had the default MTU of 1558 bytes. This... was not helping anything.
With 2.4.5 & no OpenVPN interface, the LAN interface appears to reassemble some packets as large as 1763 bytes on the LAN side. But some remain fragmented at 1566 bytes.
So the packets were passing through the LAN interface (yay), but were being dropped by the RADIUS server because its NIC didn't have jumbo frames enabled (boo). I bet if jumbo frames were enabled, it would have worked fine.
I would expect packets to be either consistently reassembled or consistently left fragmented as they pass from the OpenVPN loopback to the LAN interface. So there is probably some undesired behavior in pfSense 2.4.5 with this config, and this is probably where the difference between 2.4.4-p3 and 2.4.5 lies.
After assigning OpenVPN an interface on 2.4.5, fragments do not appear to be reassembled at all. This avoided the jumbo frame mis-match on the LAN, and everything works.
Let me know if you need any more information.
Edit: weirder and weirder. pfSense did not have anything entered for the MTU on the LAN interface (or any interface), so it should have been using the default MTU of 1500. In fact, it drops anything larger than 1500 coming in. So why in the world was it assembling packets well in excess of that MTU? I would say that is a bug as well.
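For anyone checking for the same mismatch, the effective MTUs are quick to verify (interface name is an example):

On pfSense (FreeBSD): ifconfig igb1 | grep mtu
On the Windows server: netsh interface ipv4 show subinterfaces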
-
Hmm, that is weird. What was it doing on 2.4.4-p3? Not reassembling the packets even without the interface assigned?
Let me see if I can find anything that might explain that in 2.4.5.
Steve
-
@stephenw10 Exactly. 2.4.4-p3 was not reassembling packets even without the interface assigned.
I still find it weird that 2.4.5 reassembles some packets, seemingly at random, without the interface assigned. And assembling packets that exceed the set MTU of the adapter, at that.
- 19 days later
-
Hello,
I've been dealing with this issue for the past couple of days and stumbled upon this forum post. The two remote sites that I have updated to 2.4.5 no longer work with our RADIUS WPA2/AES SSID.
Has this been identified as a bug? Is there a fix in the works?
If we create an Interface for OpenVPN and assign it to the client, then our gateway group for failover will no longer be active.
Please assist. :)
Thanks,
Chris
-
The gateway group that the client is using? Or having the OpenVPN gateway as part of the group?
-
Yes, the gateway group that the OpenVPN client is using. It is currently set as a TRI WAN Failover.
If we create a new OpenVPN interface, do we have to associate it in any way with the OpenVPN client?
Maybe I'm misunderstanding the solution stated above..
What is the purpose of creating an interface for OpenVPN? What changed in version 2.4.5?
Thanks for your help!
-
No, the failover group should work exactly the same. You will have to restart the OpenVPN client once it's been assigned as an interface.
The purpose of assigning it as an interface is that it is then treated differently by pf. You should make sure any pass rules are on the assigned interface only, rather than the general OpenVPN tab, so states are created there.
Specifically pfscrub, which reassembles fragmented packets, seems to be applied differently. We have seen it fix exactly this sort of situation.
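For background, in raw pf.conf syntax the reassembly comes from a scrub rule roughly like this (illustrative, not the exact line pfSense generates):

scrub in on igb1 all fragment reassemble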
I don't know what might have changed in 2.4.5 to change the behaviour though.
Steve
- about a month later
-
Hello,
I'm so glad to have found this thread!! I have been investigating the same issue for a few days and for some reason did not find this thread until now!!
I have a very similar environment:
- Main site connecting to multiple sites using OpenVPN site-to-site (Peer to Peer (SSL/TLS) || UDP || TUN) tunnels
- Main site hosting Windows NPS (RADIUS) server
- Branch sites with UniFi UAP access points, authenticating domain joined Windows clients against RADIUS server at main site, using WPA2-Enterprise (computer authentication)
As in the initial post, it is possible to connect to servers/PCs at the main site and between sites, and firewalls are set to allow all traffic over the OpenVPN tunnels.
I first experienced a problem after upgrading sites from pfSense 2.4.4-p3 to 2.4.5: after the upgrades, wireless clients were unable to authenticate against the RADIUS server over the S2S OpenVPN tunnels.
At the time I was able to get clients to authenticate by changing Framed-MTU on Windows NPS server as per this article:
https://support.microsoft.com/en-us/help/883389/how-to-reduce-the-eap-packet-size-by-using-the-framed-mtu-attribute-in
I used the suggested Framed-MTU = 1344
pfSense on the main site is hosted on Hyper-V and was not upgraded from 2.4.4-p3 due to performance problems reported for pfSense 2.4.5 running on hypervisors.
With performance issues resolved with 2.4.5-p1 release, I upgraded main site (and all branch sites) to 2.4.5-p1, and now wireless clients are not able to authenticate.
I should add, I had OpenVPN custom options in place:
OpenVPN custom options used on pfSense 2.4.4:
fragment 1400; mssfix;
OpenVPN custom options on pfSense 2.4.5:
tun-mtu 1500; tun-mtu-extra 32; mssfix 1450;
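For what it's worth, my understanding of those options (corrections welcome):

fragment 1400     # OpenVPN-internal fragmentation of tunnel packets (UDP transport only)
mssfix 1450       # clamps TCP MSS so TCP flows avoid fragmentation; no effect on UDP such as RADIUS
tun-mtu 1500      # MTU of the tun device itself
tun-mtu-extra 32  # extra read headroom for the tun device (mostly relevant to TAP)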
I have tried several things, including different custom options for OpenVPN for S2S server/clients and lower values for Framed-MTU on Windows NPS server, but have not been able to get wireless clients to authenticate. Clients on the same site and the RADIUS server continue to work as expected and the only change to the environment was the pfSense upgrades to 2.4.5-p1.
I will assign OpenVPN interfaces and revert back.
-
@wdup said in Upgrade to 2.4.5 broke 802.1x RADIUS WiFi over VPN:
...
I will assign OpenVPN interfaces and revert back.

You can either assign OpenVPN interfaces or revert back. No need to do both.
Actually, from what I have seen, you only need to assign an OpenVPN interface on the main site (with the NPS), not at the remote sites. I am running pfSense 2.4.5-p1 on Hyper-V, so it should work the same way for you.
IMO Netgate should mention this in the Assigning OpenVPN Interfaces documentation, as there is currently no indication that it is necessary for proper fragmentation handling. Also, IMO it should not be necessary at all: proper handling of fragmented packets should be a baseline for the VPN to be considered working. But at least a note in the docs would be nice.
-
@DAVe3283 LOL... apologies, my previous comment was ambiguous. I meant I will revert back with feedback after assigning the OpenVPN interfaces.
I have indeed assigned the OpenVPN interfaces and I'm happy to confirm the problem is resolved.
To confirm, I have also only assigned OpenVPN interfaces on the main site where the Windows NPS server is hosted, and clients can authenticate again.
However, my concern is that we may be in a unique situation in experiencing this "problem". I would like to understand what has in fact changed from previous pfSense versions, and what the underlying cause of the "problem" is.
Even though the "problem" is solved by only assigning OpenVPN interfaces at the main site, I feel it might be best to assign ALL OpenVPN interfaces at ALL sites to avoid similar "problems" going forward - what do you think?
I agree a note in the documentation would be great!
-
It's probably this you're hitting: https://redmine.pfsense.org/issues/7779
You could confirm it by checking the packet size and if they are fragmented in a packet capture.
If you are using RADIUS with UDP this is more likely to be an issue. If it's using TLS, and therefore TCP, I expect it to detect the route MTU and use packets that do not fragment. If it is not doing so you should investigate that.
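A quick way to probe the usable MTU across the tunnel is a don't-fragment ping, stepping the size down until it passes (1472 = 1500 minus 28 bytes of IP+ICMP overhead):

From a Windows host:   ping -f -l 1472 <remote IP>
From a pfSense shell:  ping -D -s 1472 <remote IP>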
Steve
-
@stephenw10 Thank you for the reply.
If I may ask the question differently, is there any harm in assigning ALL OpenVPN interfaces?
-
Not really. You will need to restart any OpenVPN servers after assigning them as an interface though.
Also, to actually make use of it, make sure traffic is passed by rules on the assigned interface and not the 'OpenVPN' rules.
Steve
- about a year later
-
Hi there, I am running 2.5.1 at 2 sites, with a site-to-site OpenVPN.
I would like to get RADIUS to work in both directions in order to have a fallback NPS for WiFi.
Right now there is a rule on the OpenVPN interface which allows all.
There is also one for OPT3, which is not handling any traffic though.
Would it be enough to disable the rule on the OpenVPN interface to get the traffic handled by the OPT3 interface?
I would need to do that on both sites, I guess, to have it working in both directions?
-
Yes, if you disable a rule on the group OpenVPN interface, traffic will hit the rules on the assigned interfaces and get the required reply-to tags.
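You can see the difference in the generated ruleset: a rule on an assigned interface picks up a reply-to, roughly like this (illustrative interface name and gateway), while rules on the group OpenVPN tab don't:

pass in quick on ovpnc1 reply-to (ovpnc1 10.0.7.1) inet from any to any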
Steve
-
@stephenw10 Hey there, I was brave and tested changing those settings remotely :)
Nothing broke. Traffic is being handled by the interface specific rule now.
But still I don't get any requests on the RADIUS server at the other tunnel end. Always bad UDP checksum...
-
@ogghi said in Upgrade to 2.4.5 broke 802.1x RADIUS WiFi over VPN:
Always bad UDP checksum...
In a packet capture?
That's expected if you have checksum offloading enabled on the capture interface.
You're not seeing the RADIUS traffic arrive at the server at all?
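If you want to rule it out, offload can be toggled off on the capture NIC from the shell (interface name is an example; there's also a checkbox for it under System > Advanced > Networking):

ifconfig igb1 -txcsum -rxcsum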
Steve
-
@stephenw10 Nothing arrives on the RADIUS server from over the VPN connection.
That's the weird thing. At least nothing is logged in the Windows service...
-
Hmm, well I'd pcap on the server to be sure. I'd also pcap at each interface along the route to see where it's failing.
We have seen issues with large UDP packets not fragmenting correctly across the tunnel. You would see that in a pcap if you are hitting that or something similar.
Steve
-
@stephenw10 Just did some packet captures. On the ADC on the other tunnel side:
[screenshot: capture on the remote-side ADC]
On the one where it's working:
[screenshot: capture on the working side]
I am wondering why the length seems to be capped at 190 bytes for the one going through the tunnel...?
-
190B may just be the size of that request.
Where, specifically did you capture there?
I would check on the OpenVPN and internal interfaces at both ends of the tunnel. The traffic should appear in all 4 places, but since something is failing it may not. You need to determine where it's failing.
Steve
-
@stephenw10 thanks for your help! :)
So I captured traffic. It seems there is just no reply from the RADIUS server. Traffic gets to the server, but no packet is ever sent back.
So it seems like debugging this Windows NPS is due here!

EDIT: It seems it must be some issue with the Windows firewall? The NPS server logs nothing at all. If I run the NTRadPing tool locally, it shows at least some log entries. But other than opening port 1812 UDP on the firewall... what else could I do here?
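For reference, opening the port from an elevated prompt looks like this (the GUI rule should be equivalent; UDP 1813 would also be needed for accounting):

netsh advfirewall firewall add rule name="RADIUS auth" dir=in action=allow protocol=udp localport=1812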
-
Does it log something if there is a bad request? An incorrect shared secret, for example.
You might be able to see some difference in Wireshark in the RADIUS requests that fail. They are smaller packets, as you noted.
I don't think that's a problem in the pfSense config though, if traffic arrives at the server and looks the same as when it arrives at the remote firewall.
Steve
-
@stephenw10 Hi there!
I just checked the RADIUS config for the auth servers in pfSense again. Actually, I reconfigured it. Now the packet sizes are identical.
I get the bad UDP checksum also for the RADIUS on the ADC without VPN, where it's working.
So my current thought is that there might be an issue with the NPS itself. I'll try to uninstall/reinstall the role there. Who knows...
-
@ogghi I think I'll try and debug on the Windows server/NPS side. The packets arrive at the Windows server, as seen in Wireshark, but nothing is ever logged in NPS. So it might be some really stupid bug here..