Slow traffic over IPsec tunnel after a move but public traffic still fast

SpaceBass

Hey folks! I'm in the process of moving one of the boxes mentioned below. It went from a 1 Gbps WAN to a temporary 200/50 Mbps pipe (and is going back to 1 Gbps this week). It also moved geographically, from the east coast to the west coast of the USA…

I share this because I'd written this problem off as a side effect of changing ISPs, peering, and so on. But today, when I dug in, I realized it's a problem with the IPsec tunnel itself, not the ISPs or geography.

Here's the topology:
remote[n].canada |--- pfsense.canada <--- IPsec tunnel ---> pfsense.california ---| local[n].california

I have an IPsec site-to-site tunnel between two pfSense installs. One is an SG-4860 and the other is on generic Xeon hardware. Both have AES-NI support enabled.

Here's the config:

Here's how I've tested things:

iperf between public IPs (i.e., not across the tunnel):
[SUM] 0.0-10.1 sec 140 MBytes 117 Mbits/sec

iperf from the remote box in Canada [10.75.1.1] to a public iperf server on the west coast of the US:
[SUM] 0.0-10.0 sec 219 MBytes 183 Mbits/sec

iperf over the tunnel from a host to a host:
[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

iperf over the tunnel from pfSense to pfSense box:
[SUM] 0.0-10.1 sec 102 MBytes 85.2 Mbits/sec

iperf from a remote host to the pfSense box on that LAN:
[SUM] 0.0-10.0 sec 1.39 GBytes 1.19 Gbits/sec

iperf from a local machine to the pfSense box on my local LAN (over wifi):
[SUM] 0.0-10.0 sec 216 MBytes 181 Mbits/sec

So… it seems like there's a bottleneck between the remote host (let's call it remote1.canada) and my local host, local1.california.

It seems like pfSense.canada and pfSense.california have a reasonable connection at 50 Mbps...
remote1.canada to pfSense.canada gets a solid 1 Gbps, so it's not a bottleneck there.

local1.california to pfSense.california gets as good as can be expected over wifi; I'm OK with that being 150-230 Mbps.

First, does it seem reasonable to be troubleshooting the IPsec tunnel? Or should I be looking elsewhere? Bottom line is that local[n].california and remote[n].canada only get about 5-6 Mbps between them.

If it's the IPsec tunnel, does anyone have tips for different settings I might try?

Any other troubleshooting tips you'd suggest?

Derelict

Try forcing NAT-T on both sides. New ISP might do something strange with ESP. That has been encountered before. I have seen something in the path make that sort of change overnight.

You do not need any Phase 2 hashing method when you use AESGCM as the transport. It is an authenticated cipher and the hashing method is just wasted (unnecessary) CPU load. Whatever you use for the P1 hash won't really affect throughput because the "Phase 2" connections are where the actual transforms/transfers take place.

Just disable Hardware crypto in System > Advanced, Miscellaneous. IPsec will use it regardless and disabling aesni.ko eliminates the chance that something will want to thunk through the kernel to get AES. Disabling requires a reboot or a kldunload aesni.ko from the shell.

I would make those three changes, one at a time, and undo them before going to the next if it does not correct the problem. The only one I see as a possibility for what you are seeing is the first one, though. The other two would be more like a percentage decrease in throughput, not complete decimation of it.

I would do a PMTU test to see if you really need that MSS clamping, though it will not result in 1Mbit throughput like you are describing.

ETA:

iperf over the tunnel from pfSense to pfSense box:
[SUM] 0.0-10.1 sec 102 MBytes 85.2 Mbits/sec

How did you test this?

SpaceBass

thanks, as always, Derelict! Hugely helpful from an educational process. I don't know nearly as much about IPsec as I do about OpenVPN (and what I know about OVPN is laughably small).

I made the changes you suggested in order. I had high hopes that disabling hardware crypto and/or removing the P2 hash would help… regrettably I'm still stuck at 5-6 Mbps.

I wasn't able to find the NAT-T settings in 2.4... I thought that toggle was in P1. Did it move, or am I a dingbat?

Regarding your question, I measured traffic over the tunnel between the two pfSense boxes using iperf from the CLI. The only route between them, using bogon local IPs, would be over the tunnel.

Derelict

You have to specifically set source and destination addresses in iperf to be sure the traffic is going over the tunnel. It would be better to run iperf on hosts on both sides, rather than on the firewall itself.
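To illustrate why the source address matters: with a policy-based tunnel, only traffic whose source and destination both match the Phase 2 networks is sent through IPsec. The sketch below checks that for a pair of addresses. The two /24 networks are assumptions guessed from addresses seen in this thread (not the actual Phase 2 config), and `crosses_tunnel` and the 203.0.113.10 WAN address are hypothetical. On the firewall itself, iperf 2's `-B` option is one way to bind the test to a LAN-side source address.

```python
import ipaddress

# Assumed Phase 2 networks; the /24 masks are hypothetical, not taken from the real config.
LOCAL_NET = ipaddress.ip_network("10.15.1.0/24")   # California side (assumed)
REMOTE_NET = ipaddress.ip_network("10.75.1.0/24")  # Canada side (assumed)


def crosses_tunnel(src: str, dst: str) -> bool:
    """Return True if src and dst sit in opposite Phase 2 networks."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    return (s in LOCAL_NET and d in REMOTE_NET) or (s in REMOTE_NET and d in LOCAL_NET)


# Host-to-host test: both ends are inside the tunnel networks.
print(crosses_tunnel("10.15.1.156", "10.75.1.20"))
# Test sourced from a hypothetical WAN address: does not match the tunnel policy.
print(crosses_tunnel("203.0.113.10", "10.75.1.20"))
```

If the second case describes how the firewall-to-firewall test was run, that traffic would have gone over the public path rather than the tunnel.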

iperf over the tunnel from a host to a host:
[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

iperf over the tunnel from pfSense to pfSense box:
[SUM] 0.0-10.1 sec 102 MBytes 85.2 Mbits/sec

That indicates something strange locally, if that iperf is really running over the tunnel. You didn't enable jumbo frames or something else weird somewhere?

Yeah if IKEv2 there is no forcing NAT-T.

iperf over the tunnel from a host to a host:
[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

That is far, far less than 6-8Mbits/sec. What are you really seeing?

SpaceBass

@Derelict:

You have to specifically set source and destination addresses in iperf to be sure the traffic is going over the tunnel. It would be better to run iperf on hosts on both sides, rather than on the firewall itself.

That indicates something strange locally, if that iperf is really running over the tunnel. You didn't enable jumbo frames or something else weird somewhere?

I think I'm tracking with you. I'm using IP addresses for iperf. EG:
iperf -c 10.75.1.80 -w 1MB -P3

Yeah if IKEv2 there is no forcing NAT-T.

iperf over the tunnel from a host to a host:
[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

That is far, far less than 6-8Mbits/sec. What are you really seeing?

Yep, that's what I'm seeing over the tunnel. I may have gotten my Ms and ms wrong in my post, but the [SUM] lines are copy/paste.

That said, as much as I'd like to learn and understand what's going on, this may be short-lived. I'm due to get two 1 Gbps lines from Sonic in a week and, fingers crossed, that might fix things. At least it will be a new ISP with different peering and different routes.

Derelict

[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

That says that during the test it transmitted 5.50 MBytes of total data at 932 Kbits/sec

So it's actually worse than you think.
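The arithmetic behind that [SUM] line can be reproduced with a few lines of Python. This is a rough check, assuming iperf 2's convention of 1024-based units for bytes and 1000-based units for bits; the variable names are just for illustration.

```python
# Values copied from the [SUM] line: 5.50 MBytes in 49.5 seconds.
transfer_mbytes = 5.50
interval_sec = 49.5

# Assumption: "MBytes" means 1024 * 1024 bytes, "Kbits" means 1000 bits.
bits_sent = transfer_mbytes * 1024 * 1024 * 8
kbits_per_sec = bits_sent / interval_sec / 1000
mbits_per_sec = kbits_per_sec / 1000

print(f"{kbits_per_sec:.0f} Kbits/sec")  # prints "932 Kbits/sec"
print(f"{mbits_per_sec:.2f} Mbits/sec")  # prints "0.93 Mbits/sec"
```

So the host-to-host test ran at under 1 Mbit/sec, well below the 5-6 Mbits/sec mentioned earlier.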

SpaceBass

@Derelict:

[SUM] 0.0-49.5 sec 5.50 MBytes 932 Kbits/sec

That says that during the test it transmitted 5.50 MBytes of total data at 932 Kbits/sec

So it's actually worse than you think.

Well, to make things even more bizarre…after making those hash changes to P2 and disabling all hardware acceleration (no reboots yet)...an hour later, I get this:

Client connecting to prima, TCP port 5001
TCP window size: 1.00 MByte (WARNING: requested 1.00 MByte)
------------------------------------------------------------
[  7] local 10.15.1.156 port 49770 connected with 10.75.1.20 port 5001
[  6] local 10.15.1.156 port 49769 connected with 10.75.1.20 port 5001
[  8] local 10.15.1.156 port 49771 connected with 10.75.1.20 port 5001
[  9] local 10.15.1.156 port 49772 connected with 10.75.1.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[  9]  0.0-10.0 sec  17.9 MBytes  14.9 Mbits/sec
[  7]  0.0-10.1 sec  18.0 MBytes  15.0 Mbits/sec
[  6]  0.0-10.1 sec  29.0 MBytes  24.1 Mbits/sec
[  8]  0.0-10.2 sec  9.12 MBytes  7.52 Mbits/sec
[SUM]  0.0-10.2 sec  74.0 MBytes  61.0 Mbits/sec

I'm not complaining! I'll take it… But who knows what change worked...or is it just time of day, or did someone wave a rubber chicken over the right port on a switch at some Level 3 center?

SpaceBass

well….drat!
Good news: got new gig WAN connection and it's glorious! :) I can easily get ~ 800 Mbits/sec to public servers ... but I'm still only getting ~20 Mbits/sec over the IPsec tunnel.

I've tried with and without clamping. I'm still using AESGCM as the transport.

Any troubleshooting tips?

Derelict

Packet capture and see what wireshark tells you.

There is obviously something misconfigured/wrong somewhere.

You need to be testing from something inside on one side to something inside on the other.

You should not be running iperf anywhere on the firewalls themselves.

GPz1100

I was troubleshooting similar issues with a Sophos UTM box a few weeks back. Turned out I had to reduce the MTU. It was set to 1500; dropping it down to 1472 resolved* the issue. I made this change on the AT&T gateway and on both the LAN and WAN interfaces in UTM.

I say that with a caveat, as the fastest I've been able to test it with was a Comcast 175 Mbps down/25 up account. I too have fiber and can upload at full speed. There are two limitations at play: first, whether or not the 1472-byte MTU is correct, and second, the encryption limit of the box UTM runs on. Not to mention, the UTM is virtualized under ESXi 6.5 on an i5 5250 box. Still, from what I've read it should be capable of 250-300 Mbps over the VPN.

SpaceBass

Thanks everyone!
I'm still troubleshooting. I tried running Wireshark on a remote host and then passing some traffic from a host on the other side of the tunnel. I didn't see anything glaring, but then again I'm not extremely fluent in analyzing things at the packet level.

I also tried playing around with changing my MTU on my WANs and MSS clamping - 1472 seems to be the largest ping I can send without fragmentation. But an MTU of 1472 vs 1500 didn't make a difference in tunnel speeds. Ditto with MSS clamping.

Tonight I'm going to try, just for giggles, recreating the tunnel from scratch. I can't see a config error, but maybe doing it fresh and new will clean out any gremlins.

I'm grateful for the tips - if you have any more troubleshooting steps, please keep 'em coming! :)

Derelict

I also tried playing around with changing my MTU on my WANs and MSS clamping - 1472 seems to be the largest ping I can send without fragmentation.

That is what you should be able to send with a normal 1500 ethernet MTU so you do not have an MTU path problem. I would stop messing about with MSS and MTU as you are likely just wasting your time and possibly making things worse.
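For reference, 1472 is the expected ping payload for a 1500-byte MTU once the IPv4 and ICMP headers are subtracted. A minimal sketch of that arithmetic, assuming an IPv4 header with no options:

```python
ETHERNET_MTU = 1500     # standard Ethernet MTU in bytes
IP_HEADER_BYTES = 20    # IPv4 header, assuming no options
ICMP_HEADER_BYTES = 8   # ICMP echo request header

# Largest ICMP payload that fits in one unfragmented packet.
max_ping_payload = ETHERNET_MTU - IP_HEADER_BYTES - ICMP_HEADER_BYTES
print(max_ping_payload)  # prints 1472
```

A largest unfragmented ping of 1472 bytes therefore points to a full 1500-byte path MTU, not a reduced one.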

SpaceBass

Well….I'm at a loss.

I'm now testing from hosts behind pfSense (vs between pfSense boxes themselves).

I thought I had a breakthrough when I found AES-NI disabled in Advanced, but then realized that was a troubleshooting tip from here :)

MTU is back to defaults, no MSS clamping, using IKEv2...

Both boxes also have OpenVPN tunnels to other boxes, but the average load is around 1 Mbps.

Without the tunnel, I easily get 230-250 Mbps. With the tunnel (and the new gig WAN line I've gotten since my original post) I get 30-50 Mbps. Xeon on one side* and SG-4860 on the other. Neither CPU spikes above 30-40%.

I tried recreating the P1 and P2 tunnels - no change.

I failed to mention... the Xeon is pfSense running as a VM on Proxmox 5. It's the only VM, the CPU type is host, and it has 16 GB of RAM allocated and direct disk access. So it's basically as close to bare metal as it can be. But if anyone has any tips related to Proxmox and AES performance, lay 'em on me!