IPSec PMTU



  • Hiya,

    I've been running an IPSec tunnel for a while between two Ubuntu boxes using strongSwan. I have now replaced one side with a PfSense box, but I'm running into a problem.

    The Ubuntu box calculates the tunnel overhead properly and sends a "frag required" ICMP when the packet does not fit into the tunnel. The PfSense box does not do this; it just fragments the inner payload and reassembles it on the other side. I do not want this because the performance is quite bad.

    I'm now using MSS clamping as a workaround but my UDP connections (rsync backups and such) run at about 50% of the line speed. A tcpdump shows a lot of fragmentation.
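
    For reference, the clamp value follows from the tunnel MTU. A sketch of the arithmetic, assuming the 1422-byte tunnel MTU measured in the ping test below (the iptables rule in the comment is a hypothetical Linux-side equivalent, not the pfSense setting):

```shell
# MSS clamping sizes TCP segments to fit the tunnel, but only helps TCP.
# Sketch assuming a 1422-byte tunnel MTU (taken from the ping test below):
tunnel_mtu=1422
mss=$((tunnel_mtu - 20 - 20))   # subtract IPv4 header and base TCP header
echo "$mss"                     # 1382
# A hypothetical Linux-side equivalent of the clamp (not the pfSense GUI setting):
#   iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1382
```

    UDP traffic has no MSS to clamp, which is why the rsync-over-UDP transfers still fragment.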

    Ubuntu 15.04 + Linux strongSwan U5.3.3/K3.19.0-28-generic:

    root@hlv-us00:/home/user# ping -M do -s 1394 192.168.10.10
    PING 192.168.10.10 (192.168.10.10) 1394(1422) bytes of data.
    1402 bytes from 192.168.10.10: icmp_seq=1 ttl=63 time=26.1 ms
    
    root@hlv-us00:/home/user# ping -M do -s 1395 192.168.10.10
    PING 192.168.10.10 (192.168.10.10) 1395(1423) bytes of data.
    ping: local error: Message too long, mtu=1422
    

    As you can see, Linux + strongSwan calculates the overhead correctly and just sends a FRAG_REQ when the inner packet no longer fits the tunnel. PfSense, however, shows different behaviour.

    PfSense 2.2.4 amd64:

    C:\Users\pakjebakmeel>ping -f -l 1394 192.168.178.202
    
    Pinging 192.168.178.202 with 1394 bytes of data:
    Reply from 192.168.178.202: bytes=1394 time=65ms TTL=63
    Reply from 192.168.178.202: bytes=1394 time=43ms TTL=63
    Reply from 192.168.178.202: bytes=1394 time=48ms TTL=63
    Reply from 192.168.178.202: bytes=1394 time=46ms TTL=63
    
    C:\Users\pakjebakmeel>ping -f -l 1472 192.168.178.202
    
    Pinging 192.168.178.202 with 1472 bytes of data:
    Reply from 192.168.178.202: bytes=1472 time=71ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=86ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=212ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=80ms TTL=63
    
    

    WHY U NO SEND FRAG_REQ??

    I have "Clear invalid DF bits instead of dropping the packets" DISABLED.

    How can I enable this behaviour on PfSense, like it works on Linux + strongSwan? This out-of-the-box behaviour breaks PMTU discovery over IPSec, which is bad. I consider MSS clamping a workaround for broken PMTUD, and it doesn't work for any protocol other than TCP.

    Is this a bug?
    Is this a limitation of BSD?
    Is this intentional?

    What gives? Anyone have any ideas? Thanks.



  • I have now migrated my tunnels back to a strongSwan installation on an Ubuntu 15.04 virtual machine. PMTU discovery is now working as expected:

    C:\Users\wsmeltekop>ping 192.168.178.1 -l 1394 -f
    
    Pinging 192.168.178.1 with 1394 bytes of data:
    Reply from 192.168.178.1: bytes=1394 time=19ms TTL=61
    Reply from 192.168.178.1: bytes=1394 time=19ms TTL=61
    Reply from 192.168.178.1: bytes=1394 time=17ms TTL=61
    Reply from 192.168.178.1: bytes=1394 time=17ms TTL=61
    
    Ping statistics for 192.168.178.1:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 17ms, Maximum = 19ms, Average = 18ms
    
    C:\Users\wsmeltekop>ping 192.168.178.1 -l 1395 -f
    
    Pinging 192.168.178.1 with 1395 bytes of data:
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    Packet needs to be fragmented but DF set.
    
    Ping statistics for 192.168.178.1:
        Packets: Sent = 4, Received = 0, Lost = 4 (100% loss),
    

    I have done some reading, and it seems that strongSwan only sets up the tunnels; the kernel is responsible for encryption and 'grabbing' the packets for encapsulation. The kernel should also calculate the MTU overhead and handle it properly. When fragmentation=no is set in the config, it should return an ICMP FRAG_REQ if the resulting ESP packet is bigger than the WAN MTU.
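
    For what it's worth, the 1422 figure the kernel reports can be reproduced by hand. A sketch of the tunnel-MTU arithmetic, assuming ESP tunnel mode with AES-CBC-128, HMAC-SHA1 and NAT-T UDP encapsulation (these transform choices are assumptions; the exact overhead depends on what the tunnel actually negotiated):

```shell
# Tunnel MTU arithmetic (sketch; transform choices are assumptions):
wan_mtu=1500
overhead=$((20 + 8 + 8 + 16 + 12))   # outer IP + NAT-T UDP + ESP hdr + AES IV + truncated SHA1 ICV
block=16                              # AES-CBC output is padded to 16-byte blocks
aligned=$(( ((wan_mtu - overhead) / block) * block ))
tunnel_mtu=$((aligned - 2))           # minus ESP pad-length and next-header bytes
echo "$tunnel_mtu"                    # 1422, matching what the Linux kernel reports
```
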

    Why does this work fine on Ubuntu whilst FreeBSD seems to have an issue with this? Am I doing something wrong? Is this working for anyone else?



  • I would very much like to move my IPSec tunnels back to PfSense, they are now terminated on the Ubuntu box behind NAT-T.

    Has anyone got any ideas about this? I found that PfSense adds 'fragmentation=yes' to ipsec.conf by default. I have modified the vpn.inc file to disable fragmentation, and I can confirm the setting is now negated in the generated ipsec.conf file.

    Now it drops packets that are too big, but there is still no FRAG_REQ ICMP packet.

    1. Why does PfSense not send this? It is vital to a functional connection on non-default MTUs.
    2. Am I the only one bothered by this behaviour? Can anyone confirm it?



    The fragmentation setting in ipsec.conf only affects IKE fragmentation, i.e. the negotiation, not anything that goes inside the tunnel. Something else must have changed if you saw a change in the behavior of traffic in the tunnel after changing that.

    There does seem to be an issue in FreeBSD in that it'll fragment traffic with DF set if it's traversing IPsec. Or at least that seems to be the case at a quick review. Needs more investigation.


  • Netgate

    cmb should open a bug after verification of the issue.


  • Netgate

    Can you try:

    sysctl -w net.inet.ipsec.dfbit=1

    on both boxes, and report back?



    Thanks for the suggestion. The tunnel is now terminated on an Ubuntu VM, so I'll need to find time this evening to move it back to PfSense and give this a try.

    Currently the value is:

    net.inet.ipsec.dfbit = 0

    If set to 0, the DF bit on the outer IPv4 header is cleared; 1 means the outer DF bit is set regardless of the inner DF bit; and 2 means the DF bit is copied from the inner header to the outer one.

    I would suggest setting the value to '2'. But yes, the current value of 0 implies that the DF bit on my pings gets cleared on the outer header, which would indeed explain the behaviour I am observing.
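
    If the value does turn out to help, it can be made persistent. A sketch for stock FreeBSD (on pfSense the equivalent would be a System Tunables entry in the GUI rather than editing files; treating /etc/sysctl.conf as the right place here is an assumption about this setup):

```shell
# Persist the tunable across reboots (stock FreeBSD; on pfSense use
# System > Advanced > System Tunables instead of editing files directly):
echo 'net.inet.ipsec.dfbit=2' >> /etc/sysctl.conf
sysctl net.inet.ipsec.dfbit   # verify the currently running value
```
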


  • Netgate

    We've seen some trouble when setting it to 2.

    It's on the list to investigate, as both cmb and I think '2' makes the most sense.



  • I have moved the tunnel back to PfSense. The tunnel is up and passing data. The DF bit is being cleared as expected and the traffic is getting fragmented.

    As soon as I set the value to 1, the traffic starts dropping. When I set it to 2 there is no change; traffic still fragments.

    sysctl -w net.inet.ipsec.dfbit=0

    net.inet.ipsec.dfbit: 0 -> 0

    Result:

    ping 192.168.178.202 -f -l 1472
    
    Pinging 192.168.178.202 with 1472 bytes of data:
    Reply from 192.168.178.202: bytes=1472 time=17ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=16ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=17ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=16ms TTL=63
    
    Ping statistics for 192.168.178.202:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 16ms, Maximum = 17ms, Average = 16ms
    

    sysctl -w net.inet.ipsec.dfbit=1

    net.inet.ipsec.dfbit: 0 -> 1

    Result:

    ping 192.168.178.202 -f -l 1472
    
    Pinging 192.168.178.202 with 1472 bytes of data:
    Request timed out.
    Request timed out.
    Request timed out.
    Request timed out.
    
    Ping statistics for 192.168.178.202:
        Packets: Sent = 4, Received = 0, Lost = 4 (100% loss)
    

    sysctl -w net.inet.ipsec.dfbit=2

    net.inet.ipsec.dfbit: 1 -> 2

    Result:

    ping 192.168.178.202 -f -l 1472
    
    Pinging 192.168.178.202 with 1472 bytes of data:
    Reply from 192.168.178.202: bytes=1472 time=17ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=31ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=15ms TTL=63
    Reply from 192.168.178.202: bytes=1472 time=25ms TTL=63
    
    Ping statistics for 192.168.178.202:
        Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
    Approximate round trip times in milli-seconds:
        Minimum = 15ms, Maximum = 31ms, Average = 22ms
    

    So, unfortunately, not the expected results.


  • Netgate

    So many things could be wrong with your setup (blocking ICMP 'frag needed' messages, needing to set dfbit on both ends, etc.).

    We'll work it out in the lab, likely after 2.3.



  • Are there any updates on this issue? I can reproduce on 2.3.

    Also, I can't find any related bug filed on Redmine.



  • There's some kind of issue there in FreeBSD. It needs to be duplicated on stock 11-CURRENT, quantified, and reported upstream. Still on my to-do list.



  • I can confirm this issue still exists in 2.3.1-Release-p5.

    After placing pfSense/strongSwan in place of Ubuntu/strongSwan, the IPSEC tunnel accepts pings with payloads up to 1472 bytes (1500 bytes overall) instead of the "correct" non-fragmenting size that strongSwan/Ubuntu 14.04 calculates: 1410 (1438 overall).

    The issue has probably gone unnoticed by many users, as the fragmentation is transparent to the application using the tunnel, so it requires a high level of networking expertise even to detect that it is occurring.

    I do not see a bug for this in upstream 11. Given that 11-Alpha4 is out now, this needs to be addressed quickly if it is to be fixed in 11-Release. cmb, perhaps you have some leverage with the upstream dev team and could kick this along into their bug tracker?

    I'm building up a vanilla 11-Alpha4 with the latest strongSwan to test functionality independent of pfSense, and on the latest possible version, to confirm the issue really is an "upstream" one. I'll report back here in the next few days with the results. I'm having a little trouble, though: the default kernel doesn't include the IPSEC option that strongSwan needs, so learning how to compile a kernel in FreeBSD became a prerequisite to testing strongSwan!

    Thanks!

    -Ben



  • Just finished testing strongSwan 5.4.0 on FreeBSD-11-Alpha4 to check whether the "upstream" components have the same issue. And they do… the same results as the OP's, and as my experience on PFSense 2.3.1_5. So this is clearly not a PFSense-specific issue, and it hasn't been fixed in the upcoming FreeBSD 11 release either.

    Stuck with Ubuntu/StrongSwan for a bit longer...



  • Hi,

    I have a site-to-site IPSec VPN tunnel between two pfSense 4.2.2p2 instances and have run into the same issue described by the OP in this old thread.

    My VPN tunnel has the following properties:
    Phase 1:
    Encryption Algorithm: AES-128
    Authentication: SHA1
    PFS: DH Group 2 (1024 bit)

    Phase 2:
    Protocol: ESP
    Encryption Algorithm: AES-128
    Hash Algorithm: SHA1
    PFS: DH Group 2 (1024 bit)

    If the net.inet.ipsec.dfbit parameter is set to 0 in pfSense, when I issue ICMP echo requests with the DF flag set (DF=1) and a payload size up to and including 1472 bytes to a virtual IP on the pfSense box at the other end of the site-to-site tunnel, I get ICMP replies. This means that the near-end pfSense instance happily encrypts ICMP packets whose length (including the IPSec overhead) exceeds 1500 bytes, and then fragments the encrypted IPSec packets. This behavior is certainly undesirable and one of the worst ways to handle oversized packets.

    Instead, I need the pfSense box to drop packets that have the DF bit set (DF=1) and reply with an ICMP Destination Unreachable (type=3, code=4) back to the end host, informing it of the next-hop MTU. The end host should then lower its Path MTU (PMTU) and re-send the packet at the newly learned PMTU. The re-sent packet length then accommodates the IPSec overhead, so that when the near-end pfSense encapsulates the packet in IPSec, the packet size does not exceed the MTU of the pfSense egress interface and no fragmentation of IPSec-encrypted packets is needed.

    This process is called Path MTU Discovery (PMTUD).
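
    The next-hop MTU that the type=3/code=4 message should carry can be worked out from the tunnel overhead. A sketch, assuming ESP tunnel mode with AES-128-CBC and HMAC-SHA1 and no NAT-T (an assumption that happens to reproduce the 1410-byte payload limit reported in this thread):

```shell
# Next-hop MTU the ICMP type=3/code=4 message should advertise (sketch;
# assumes ESP tunnel mode, AES-128-CBC + HMAC-SHA1, no NAT-T):
wan_mtu=1500
overhead=$((20 + 8 + 16 + 12))   # outer IP + ESP hdr + AES IV + truncated SHA1 ICV
block=16                          # AES-CBC pads to 16-byte blocks
aligned=$(( ((wan_mtu - overhead) / block) * block ))
inner_mtu=$((aligned - 2))        # minus ESP pad-length and next-header bytes
echo "$inner_mtu"                 # 1438: the inner IP MTU the host should adopt
echo $((inner_mtu - 20 - 8))      # 1410: max ICMP payload after IP + ICMP headers
```
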

    So, I assume the reason PMTUD is not working by default is that the parameter net.inet.ipsec.dfbit=0, which is the default in pfSense. I don't see any way to change this in the pfSense GUI, so I changed the value from the pfSense shell:
    sysctl -w net.inet.ipsec.dfbit=1

    Then I repeated the ICMP echo request test (with the DF bit set) to the virtual interface on the far-end pfSense across the site-to-site VPN tunnel. This time, if I specify an ICMP payload size up to and including 1410 bytes, I get the echo replies back. As soon as the payload size exceeds 1410 bytes, I see the Request timeout message instead of Message too long.

    Therefore, with the parameter net.inet.ipsec.dfbit=1, pfSense drops the packets that exceed the egress interface MTU when the IPSec encapsulation is factored in, but pfSense does not send the ICMP Unreachable (type=3, code=4) message back to the host that sent the echo request. Hence, PMTUD is not functioning.

    I also see a mention in this thread of the parameter net.inet.ipsec.dfbit being set to 2, which I also tried, but there is no difference from it being set to 1. I don't know what the expected behavior is for the value 2, but I assume it means "copy the DF bit from the inner IP header to the outer IP header".

    It's been a few years since this issue was reported by the OP, so I would like to understand why it is still not fixed. Is this an upstream issue? It's hard to believe that such an obvious bug has not been squashed yet.


    This PMTUD behavior is only broken when the destination is through the IPSec tunnel. When the destination is outside the IPSec tunnel, PMTUD works properly. For example, if I ping google.com with the DF bit set (DF=1) and an ICMP payload of 1473, I get the Message too long response, which means that pfSense drops the packet and sends ICMP Destination Unreachable (type=3, code=4) back to the host that issued the ping.

    Thank you.



  • There has been a bug open for a long time on this issue now:

    https://redmine.pfsense.org/issues/7801

