Intermittent website timeout

dansherman

I've been running pfSense on a Supermicro C2558 board for a little over three years. In the last week we've started seeing lots of websites timing out while negotiating SSL connections. (I've also tried a few plain HTTP sites and ping; they time out too.)

Nothing useful shows up in pfSense's logs.
Wireshark reports lots of TCP Retransmissions and Dup Acks.

This happens for all clients across VLANs (including from the router directly). It also happens on both of the WAN connections (DSL and cable). The only common elements among all the variants are the router and the TP-Link switches (everything is updated to the latest versions). I rebooted the router and switches a few times, but it doesn't seem to help.

I had a situation like this a few years ago and something with the IPv6 was causing the problem, but this time the IPv6 system has been working trouble free for months.

Any thoughts?

johnpoz

@dansherman said in Intermittent website timeout:

Wireshark reports lots of TCP Retransmissions and Dup Acks.

Where does it report that?

Derelict

Sounds like something might have changed with the path MTU somewhere. Does everything seem to work until the packet size gets large?

If it's only SSL are there games being played there? Proxies, caches, etc?

dansherman

@johnpoz I ran a packet trace on my computer when the timeouts were happening.

johnpoz

Ok, yeah, that is how TCP works. When there is no answer, it sends retransmissions.

So sniff on your WAN: did pfSense send the traffic like your client asked it to, and was there an answer? If pfSense sent it and got no answer, your problem is upstream.

Is your monitor on pfSense showing any packet loss? More than likely your ISP is having an issue, or for that matter the site you're trying to talk to.

dansherman

@Derelict I did some MTU ping testing: up to 1472 works fine, but 1473 and over times out. 10.0.10.1 is the IP for pfSense, but I got the same results at other sites.

ds01 [~] » ping 10.0.10.1 -s 1473 -d -o -c1 -t10
PING 10.0.10.1 (10.0.10.1): 1473 data bytes

--- 10.0.10.1 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss


ds01 [~] » ping 10.0.10.1 -s 1472 -d -o -c1 -t10
PING 10.0.10.1 (10.0.10.1): 1472 data bytes
1480 bytes from 10.0.10.1: icmp_seq=0 ttl=64 time=0.372 ms

--- 10.0.10.1 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.372/0.372/0.372/0.000 ms

dansherman

@johnpoz I thought it seemed like an ISP thing, but it seems weird that it would be both my DSL and cable connections. I marked the cable gateway as down to force everything over the DSL, and the problem didn't go away any faster than just waiting for it to clear up on its own.

johnpoz

1472 is normal. And yeah, if you set don't-fragment, 1473 would fail on a normal 1500 MTU connection. You have the overhead: 20 bytes of IP header plus 8 bytes of ICMP header.

Let's go over this again: if you just sniff on the client and you see retransmissions, you don't know where the packet got lost. pfSense maybe never got it. You have to sniff upstream of your client to find out why it's retransmitting. It only retransmits because it didn't get an answer.
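
To make the "where did it get lost" reasoning concrete, here is a rough sketch. It assumes you have captures from several points along the path (for example the client, the pfSense LAN side, and the pfSense WAN side) and have already reduced each one to a set of packet keys. The key format (destination IP, destination port, TCP sequence number) and the function name are made up for illustration, and the sketch assumes the firewall does not rewrite sequence numbers.

```python
from typing import Hashable, Optional, Sequence, Set, Tuple

# A capture point is a (name, keys_seen) pair. Points are listed in path
# order, starting closest to the client and moving upstream.
CapturePoint = Tuple[str, Set[Hashable]]


def first_point_missing(
    packet_key: Hashable, capture_points: Sequence[CapturePoint]
) -> Optional[str]:
    """Return the name of the first capture point that never saw the packet.

    None means every capture point saw it, so the packet made it past the
    last point and the loss (or the lost reply) is further upstream.
    """
    for name, keys_seen in capture_points:
        if packet_key not in keys_seen:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical keys: (destination IP, destination port, TCP sequence number).
    first = ("203.0.113.10", 443, 1001)
    second = ("203.0.113.10", 443, 2461)

    points = [
        ("client", {first, second}),
        ("pfSense LAN", {first, second}),
        ("pfSense WAN", {first}),  # the second segment never left the WAN side
    ]
    print(first_point_missing(first, points))
    print(first_point_missing(second, points))
```

If a retransmitted segment shows up at every capture point, the firewall forwarded it and the missing answer is an upstream question. If it disappears between two points, that hop is where to look.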

Derelict

What are you pinging there? If it's an intermittent PMTU problem it could be anywhere in the path between you and the web server.

Look at the pcap you referenced before. Are the retransmissions only when the packets get large?

1472 ICMP payload passing and 1473 being rejected as too large is exactly what one would expect with a normal ethernet MTU of 1500.
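
For anyone following along, the arithmetic behind that 1472/1473 boundary can be sketched as below. This is a minimal illustration assuming IPv4 with no header options; the function and constant names are invented for the example.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header with no options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request/reply header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload (the -s value) that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def fits_without_fragmenting(payload: int, mtu: int = 1500) -> bool:
    """True if a ping with this payload size fits within the given MTU."""
    return payload <= max_icmp_payload(mtu)


if __name__ == "__main__":
    print(max_icmp_payload(1500))          # 1500 - 20 - 8
    print(fits_without_fragmenting(1472))  # at the limit
    print(fits_without_fragmenting(1473))  # one byte over
```

A real path MTU problem would show up as the working payload size topping out below 1472 for some destinations, which is not what the ping test above shows.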

dansherman

I'll sniff on the router next time it happens to see what I can find and try some more MTU tests too.

Derelict

You could look at the capture you already took if you still have it.

dansherman

Thanks for the help @Derelict and @johnpoz.

I have a packet capture from the router during an episode and I spent some time looking over it. Some traffic gets through, but there are lots of retransmissions, resets, "ACKed unseen segments", etc. The errors show up for things going out to the internet, but also for some traffic between VLANs.

For the local traffic, DNS has almost no problems, but things like NFS traffic have lots of errors.

Packet size doesn't seem to make a difference.

johnpoz

So you're using NFS over TCP?

Starting a sniff in the middle of any traffic is going to show stuff like ACKed unseen segments, because the sniff missed the first part of the conversation.
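
As a rough illustration of why a capture started mid-stream produces those warnings, here is a simplified sketch. It is not Wireshark's actual analysis code; it just flags any ACK that acknowledges more data than the capture has seen from the other side. The `Segment` type and its field names are hypothetical, and the sketch ignores sequence number wraparound and SYN/FIN flags.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    sender: str  # "A" or "B"
    seq: int     # relative sequence number of the first payload byte
    length: int  # payload bytes carried by this segment
    ack: int     # relative acknowledgment number


def acked_unseen(segments: List[Segment]) -> List[int]:
    """Return indexes of segments that ACK data the capture never saw."""
    highest_seen: Dict[str, int] = {}  # sender -> highest (seq + length) captured
    flagged: List[int] = []
    for i, seg in enumerate(segments):
        peer = "B" if seg.sender == "A" else "A"
        # An ACK beyond anything captured from the peer means we missed data.
        if seg.ack > highest_seen.get(peer, 0):
            flagged.append(i)
        end = seg.seq + seg.length
        highest_seen[seg.sender] = max(highest_seen.get(seg.sender, 0), end)
    return flagged


if __name__ == "__main__":
    full = [
        Segment("A", 0, 100, 0),    # A sends 100 bytes
        Segment("B", 0, 0, 100),    # B acknowledges them
        Segment("A", 100, 200, 0),  # A sends 200 more
        Segment("B", 0, 0, 300),    # B acknowledges them
    ]
    print(acked_unseen(full))      # complete capture
    print(acked_unseen(full[1:]))  # capture that missed the first segment
```

The same conversation gets flagged or not depending only on where the capture began, which is why these warnings on their own do not prove packet loss.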

If you see loads of retransmissions in your conversations even between your own local networks, you might want to dig into this.

You will want to catch the full conversation. Start with, say, sniffing on the client and the server at the same time, and do, say, an NFS transfer. In this transfer, are you seeing lots of retransmissions? Did the server not ACK in a timely manner? Did the server send an ACK for that sequence number, but the client never saw it, or saw it delayed?

dansherman

Well, I ran out of time trying to figure this one out and just replaced the 2558 board with a 3558 one. We've been running for several hours now with no problems. Previously we would have had several.

I'm hoping it was just something funky with the old motherboard or its NICs.

Thanks for the assistance @Derelict and @johnpoz.

dansherman

@dansherman

Follow up post

Replacing the router didn't solve the problem. It ran smoothly for several hours, then we started having lots of dropped TCP packets again. I realized that when I marked the cable gateway as down to force traffic over our DSL line, it didn't actually change the traffic flow. I think that's because I used failover rules in gateway groups instead of sending the traffic directly to the interface (I should look into this more). This led me to think that the traffic problem was happening with both IPs. It turns out the only problem was with the cable connection. The ISP came out yesterday and replaced the modem. Now it seems fixed (12+ hours), but we'll see what happens.

For reference, here is a clean packet capture from the router showing the problem. The TCP connection would handshake properly, but after the Client Hello and its ACK, only a handful of packets make it through, not nearly enough to establish a connection.
[Screenshot: Screen Shot 2019-05-03 at 9.10.26 AM.png]

johnpoz

@dansherman said in Intermittent website timeout:

The ISP came out yesterday and replaced the modem

That would have ZERO to do with problems on your own local network.

For the local traffic, DNS has almost no problems, but things like NFS traffic have lots of errors.

dansherman

That would have ZERO to do with problems on your own local network.

Yes. The main problem was the internet access timing out; the internal problem only surfaced when I was looking into the packet dumps. There still might be an issue there, but I think it's more likely that I wasn't looking at a full conversation.

We've had zero issues with the NFS uses, so I'm chalking it up to my lack of experience with reading packet captures.

Thanks for the help!