MTU/MSS and Chrome problem

matsan

We have a weird problem that has been getting worse on our network, and I don't know how to continue troubleshooting. We have been using the same ISP for more than two years, but lately we have started to see errors in Chrome where pages relying on CDNs (for example cloudfront.net, or Akamai-hosted sites like images.apple.com) never finish loading and instead keep spinning for minutes. The same pages work perfectly fine in Firefox on the same machines. We have tried loading the page first in Firefox and then in Chrome, and vice versa, to rule out any name-resolution caching. The problem is seen on both Windows and Linux machines running Chrome M64.
I have tried the steps from:
https://doc.pfsense.org/index.php/Unable_to_Access_Some_Websites
without any luck, and we are now stuck. (I'm a bit worried about steps 9 and 10 in this guide, as we actually have a lower MTU than normal, but I kept these options checked anyway.)
We also tried enabling tcp_mtu_probing on the Linux systems, but still no go.

We have pfSense 2.4.2-p1 running on an APU box. The WAN is connected to the ISP's router, which tunnels our traffic to the Internet (to give us a static IP and LTE backup). The ISP has instructed us to use MTU=1400 for this link. I have also configured this on the LAN port. Clients are still on 1500, which as I understand it should be handled by the 1400 on the LAN interface of the APU.
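
For reference, the MSS arithmetic behind this setup can be sketched roughly as below. This is only an illustration, assuming IPv4 with 20-byte IP and TCP headers and no options; the function name `mss_for_mtu` is made up for the example and is not anything pfSense provides.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options

def mss_for_mtu(mtu: int) -> int:
    """Largest TCP payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER

client_mss = mss_for_mtu(1500)  # what a client on MTU 1500 would advertise
link_mss = mss_for_mtu(1400)    # what fits through the 1400-byte link
print(client_mss, link_mss, client_mss - link_mss)
```

Under these assumptions, any full-size segment a remote server sends based on the client's advertised MSS is 100 bytes too large for the link unless something along the path lowers the MSS or PMTUD works.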

Name resolution uses unbound with Google's servers as forwarders (we have also tried the ISP's, with the same problem). We reduced the EDNS Buffer Size to 512, since this was the only way to get resolving to work reliably. A packet capture shows that most DNS lookups are using TCP fallback.
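
To illustrate why a 512-byte buffer size would push lookups onto TCP: when an answer does not fit in the UDP size the resolver advertised, the server sets the truncation (TC) flag and the resolver retries over TCP. A rough sketch of that decision follows; the function name and the response sizes are made up for illustration and are not measured values from this network.

```python
def transport_for_response(response_size: int, advertised_udp_size: int = 512) -> str:
    """Return which transport ends up carrying a DNS answer.

    If the answer fits in the advertised UDP size it goes over UDP;
    otherwise the server sets the TC (truncated) flag and the resolver
    retries the query over TCP.
    """
    if response_size <= advertised_udp_size:
        return "udp"
    return "tcp"

# Illustrative sizes only.
for size in (120, 512, 900, 1400):
    print(size, transport_for_response(size, advertised_udp_size=512))
```

DNSSEC-signed answers are often larger than 512 bytes, which would be consistent with most lookups falling back to TCP in the capture.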

No IPv6 configuration in the box or on clients.

Any help greatly appreciated.

A Former User

Have you done some testing using ping with the do-not-fragment bit set to find out what your link MTU really is?
That would be the first step - it sounds like 1400 might not be the actual MTU.
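
If you want to automate that test, the search for the largest working size can be sketched like this. It is a minimal sketch, not a finished tool: `probe` is a hypothetical callback that would send one packet of the given total size with the do-not-fragment bit set and return True if a reply came back. Here it is replaced by a stand-in that simulates a 1400-byte path.

```python
from typing import Callable

def find_path_mtu(probe: Callable[[int], bool], low: int = 576, high: int = 1500) -> int:
    """Binary-search the largest packet size for which probe(size) succeeds.

    Assumes probe(low) succeeds and that success is monotonic in size
    (if a size fails, every larger size fails too).
    """
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid       # mid fits, so the answer is at least mid
        else:
            high = mid - 1  # mid is too big
    return low

# Stand-in probe simulating a path with a 1400-byte MTU (assumption for the demo).
simulated = lambda size: size <= 1400
print(find_path_mtu(simulated))
```

A real probe would have to tell "too big" apart from ordinary packet loss, for example by retrying each size a few times before giving up on it.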

JKnott

@muppet:

Have you done some testing using ping with the do-not-fragment bit set to find out what your link MTU really is?
That would be the first step - it sounds like 1400 might not be the actual MTU.

That wouldn't explain why Firefox works but Chrome doesn't. Path MTU discovery should already be happening in the router with the 1400 byte MTU. Are there ICMP "too big" messages coming back from beyond that point, with a smaller MTU?

Of course, a simple test is to just use a smaller MTU on a computer and see what works or doesn't.

awebster

See mturoute http://www.elifulkerson.com/projects/mturoute.php on Windows or tracepath http://linux.die.net/man/8/tracepath on Linux. That should enable you to determine what the MTU is along the way.
However, in my experience usually some misguided admin has disabled ICMP packets on a router along the way (I'm looking at you, ISPs), and this fundamentally breaks Path MTU discovery.

JKnott

Even if PMTUD is broken by blocked ICMP, there is still a mechanism to work around it. Since TCP was able to set up a connection and therefore has a path, it assumes that lost packets are due to a too-large MTU and backs off.

Regardless, we need more info. The OP should do some of the tests mentioned and share the results.

matsan

Sorry for the delay - I never get any notifications from the forum of new posts…

Ran some of the tests suggested:


root@ns:~# ping -M do -s 1372 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1372(1400) bytes of data.
1380 bytes from 8.8.8.8: icmp_seq=1 ttl=57 time=14.6 ms
1380 bytes from 8.8.8.8: icmp_seq=2 ttl=57 time=9.56 ms
1380 bytes from 8.8.8.8: icmp_seq=3 ttl=57 time=10.2 ms
^C
--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 9.563/11.478/14.664/2.268 ms
root@ns:~# ping -M do -s 1373 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 1373(1401) bytes of data.
ping: local error: Message too long, mtu=1400
ping: local error: Message too long, mtu=1400
^C
--- 8.8.8.8 ping statistics ---
2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1023ms

tracepath seems to be badly broken upstream, but the client (on MTU=1500) manages to get pmtu=1400 from the router, so that should be good, right? I really don't want to change the MTU on all clients, and it shouldn't be needed anyway, right?
I have all ICMP allowed in the rules, but I don't see anything coming back from the upstream router in a capture (upstream is a Huawei box managed by the ISP).


root@ns:~# tracepath 8.8.8.8
 1?: [LOCALHOST]                      pmtu 1400
 1:  no reply
 2:  no reply
 3:  no reply

MTU is set to 1372 + 28 = 1400 on the WAN interface. The MSS field is blank, as I understand pfSense will figure that out. The MTU on both the IPsec and OpenVPN tunnels is capped at 1300, and they are working just fine.
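
The 28 bytes in that sum are the IPv4 header plus the ICMP echo header, which is how the `-s` payload size in the ping tests maps to the packet size on the wire. A small sketch of the arithmetic, assuming IPv4 without IP options (the helper name is made up):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header

def packet_size(ping_payload: int) -> int:
    """Total IPv4 packet size for a ping with the given payload size."""
    return ping_payload + IPV4_HEADER + ICMP_HEADER

print(packet_size(1372))  # matches the "1372(1400)" shown by ping
print(packet_size(1373))  # one byte over the 1400-byte MTU
```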

JKnott

I've never seen notifications either.

Ideally you shouldn't have to change the MTU. However, assuming the devices use DHCP, then you only have to change the MTU on the DHCP server.

A Former User

To get notifications, you have to either change your settings to be notified for all topics you reply to, or click the "Notify" button in the top right corner of this thread.

@matsan: Do you have IPv6 enabled for the clients having problems?

I agree that if you have the MTU set correctly (and your ping tests show you do), TCP MSS should be getting clamped correctly.

The only other thing I can think of that might be a problem is sites that use Google's QUIC, which uses UDP 443. But I don't know if any CDNs use that.

matsan

@muppet:

@matsan: Do you have IPv6 enabled for the clients having problems?

IPv6 is enabled on the clients, but they auto-configure and we have disabled it on our servers (Windows 2012 AD Controller).
IPv6 is disabled on the pfsense.

matsan

OK - finally gave up on trying to solve this. Instead we cut the Gordian knot and worked around it like this:
We moved away from using unbound on pfSense and instead configured a VM running bind9 on the internal network.
We configured Google's DNS servers as forwarders but added "edns no" statements for them:

server 8.8.8.8 {
    edns no;
};
server 8.8.4.4 {
    edns no;
};

Why this works I don't know, other than that bind doesn't use EDNS at all with these servers. The EDNS Buffer Size was set to 512 in unbound, but this clearly didn't solve the problem.
DNSSEC is enabled in both configurations, and we get AAAA responses from bind as well. For some reason, name resolution is now quicker in general.