Unpredictable connection timeouts
I have a modest home network setup running pfSense. With my previous ISP, I was using double NAT, as it was not possible to obtain my PPPoE configuration settings (i.e. the ISP's device was giving a static IP address to my pfSense box's WAN).
I never had any problems with this setup.
I have since changed ISPs, so I have removed the double NAT and now have a direct connection on the WAN side using PPPoE. Ever since making this change, I have had strange and unpredictable connectivity issues, mostly when connecting to large websites that use a CDN (for example Netflix or Amazon). Sometimes the connection will consistently fail and time out from one device (for example my phone), but it will connect and work as expected from another device on the same network (for example my laptop). This is consistent in the sense that I can refresh multiple times on the "broken" device and nothing will change, while the service continues to work on another device.
Furthermore, when this happens, if I curl directly from my pfsense box, I always receive a response from the server. Unfortunately, these connectivity issues are unpredictable and I have not been able to debug or identify the cause (or even a predictable way to make this problem occur).
My pfSense machine (with the web GUI open) runs at about 6% CPU, 0% state table size, 3% MBUF usage and 28% memory. I also tried disabling "heavy" services such as Suricata and Squid, but the problem persists.
The only real change I can see here is that I am directly on the Internet, so my pfSense box is getting hit with a lot more external connections (which were previously filtered by the ISP's device).
Does anybody have any ideas for what could be causing this?
Any suggestions to fix the issue or debug the cause?
Thanks and apologies if this is not the right section to ask this question.
The symptoms sound more like a DNS issue or possibly MTU related.
You need to test from a failing client and see what's actually failing. Try running packet capture for that traffic to see what is being sent and whether it appears on both LAN and WAN in pfSense.
I have done this previously and what I saw with wireshark was that the TCP connections were being RST on the LAN side, but not on the WAN side. I couldn't identify any reason for this RST though.
I don't think it's a DNS issue, because I have the same problem if I curl the IP address directly (and external hostnames resolve without an issue).
How would I go about debugging the MTU? Or are there some variables/settings somewhere I can tinker with?
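One common way to check the MTU is to send pings with the Don't Fragment bit set and find the largest payload that still gets a reply. Below is a minimal Python sketch of that idea, not a definitive tool. The function names are made up for illustration, and `ping_df` assumes Linux iputils `ping` (`-M do`); on FreeBSD/pfSense the equivalent flag is `-D`, and on Windows it is `-f -l`. It also assumes the target answers ICMP echo at all.

```python
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_df(host, payload):
    """Send one ping with Don't Fragment set; True if a reply came back.

    Assumption: Linux iputils ping syntax (-M do).
    """
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def largest_payload(probe, low=0, high=1472):
    """Binary-search the largest payload size for which probe(size) is True.

    Returns None if even the smallest payload fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, look for something larger
        else:
            high = mid - 1  # mid is too big
    return low


def path_mtu(probe):
    """Convert the largest passing ICMP payload into an IPv4 path MTU."""
    payload = largest_payload(probe)
    return None if payload is None else payload + IP_ICMP_OVERHEAD
```

It would be called from a failing client as something like `path_mtu(lambda size: ping_df("192.0.2.1", size))`, with the placeholder address replaced by the destination that fails. A result below the WAN MTU would suggest something smaller along that path.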
The TCP connections are set up from the client to the server directly. There is no TCP termination on the firewall. Any TCP RST packets would have to be coming from the remote server. It's hard to see how those could be on LAN but not WAN.
The only exception to that would be if you are running Squid in pfSense.
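To check which end is actually sending the resets, it can help to tally RST packets by source address in each capture and compare LAN against WAN. Here is a rough Python sketch under the assumption that the captures were saved as text from `tcpdump -n` (so each line looks like `IP src.port > dst.port: Flags [...]`); the function name is just for illustration.

```python
import re
from collections import Counter

# Matches the "IP src.port > dst.port: Flags [..]" part of a tcpdump -n line.
LINE_RE = re.compile(r"\bIP6? (\S+) > (\S+): Flags \[([^\]]*)\]")


def rst_senders(tcpdump_lines):
    """Count TCP RST packets per source address in tcpdump -n text output."""
    counts = Counter()
    for line in tcpdump_lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a TCP line in the expected format
        src, _dst, flags = match.groups()
        if "R" in flags:
            # tcpdump appends the port as the last dot-separated field
            host, _, _port = src.rpartition(".")
            counts[host] += 1
    return counts
```

If the LAN capture shows resets from the server's address while the WAN capture shows none, the resets were generated locally on behalf of the server, which is what a proxy on the firewall would do.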
You said you only see this on CDN destinations. Do clients that are failing resolve to a different IP than those that can connect? They may have a different route if that's the case and then MTU size might come into play.
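A quick way to compare is to resolve the same hostname on a working and a failing client and diff the two sets of addresses. The following is a small Python sketch of that comparison; the function names are illustrative, and `resolve_ipv4` simply uses whatever resolver the client is configured with.

```python
import socket


def resolve_ipv4(hostname):
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, (address, port))
    return {info[4][0] for info in infos}


def compare_answers(working, failing):
    """Split two sets of resolved addresses into unique and shared parts."""
    working, failing = set(working), set(failing)
    return {
        "only_working": sorted(working - failing),
        "only_failing": sorted(failing - working),
        "shared": sorted(working & failing),
    }
```

Run `resolve_ipv4` for the affected hostname on each client, then feed both results to `compare_answers`. Addresses that only the failing client receives are the ones worth testing directly from a working client.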
Seems you are correct.
When I nslookup from a machine which works, I receive a different result than from a machine which doesn't work.
I tried playing around with the MSS clamping settings. The default MTU on the WAN is 1492 (PPPoE). I tried setting the MSS to 1452, but it didn't make any difference.
Should I be looking to change these settings on the WAN, or the VLANs?
Do you have any values you would suggest to try?
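For reference, the arithmetic behind those numbers can be written out as a small Python sketch. It assumes a standard 1500-byte Ethernet MTU underneath PPPoE and no IP or TCP options; the function names are just for illustration.

```python
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20     # without options
IPV6_HEADER = 40     # fixed header, no extension headers
TCP_HEADER = 20      # without options


def pppoe_mtu(ethernet_mtu=1500):
    """MTU available to IP once PPPoE encapsulation is subtracted."""
    return ethernet_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu, ipv6=False):
    """Largest TCP payload that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


print(pppoe_mtu())                  # 1492
print(tcp_mss(1492))                # 1452
print(tcp_mss(1492, ipv6=True))     # 1432
```

So 1452 is the expected IPv4 MSS for a 1492-byte PPPoE link, and IPv6 traffic would need 1432.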
This is what I see in Wireshark when the connection is failing:
Hmm, all of those packets are tiny. Unlikely to be an MTU issue.
You are seeing traffic back from the target too so the route is good.
Hard to say why it's failing then. I don't see any RST packets there.
Any suggestions on what I could look into in order to debug this?
I'm at a loss on what to look for to identify the problem.
I would still be looking at packet captures to see what happens when it fails. Does the remote end just send a RST?
Are all devices using the same DNS servers?
You have any sort of VPN involved here?
They are all using the same DNS servers (received via DHCP from pfsense).
There is no VPN, direct Internet connection from my ISP.
The error I see over the wire is the error I posted from wireshark.
I have nothing else to go on to debug.
The only actual issues I see there are two retransmissions, but that may be normal packet loss. Not really something that should kill the connection. You are seeing traffic in both directions there.
Was that pcap on the WAN? How was it filtered? Do you see anything different on the internal interface?