TCP Retransmission flooding LAN network

brookheather

I wonder if someone can advise on a problem I've been having with TCP Retransmission packets intermittently flooding my LAN. I have been running Wireshark and have observed that sometimes, when a wireless device leaves the network (e.g. a Samsung S9 phone leaving the house or an LG TV being switched off), I get a huge number (hundreds of thousands) of TCP Retransmission packets with the same sequence number going from my pfSense LAN port to the device. I would expect only a few retransmission attempts, not this huge number, which causes internet timeouts because the LAN port is fully occupied. After a few minutes the flood stops and the internet is available again.
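To put a rough number on "only a few retransmission attempts", here is a minimal sketch. It assumes a sender that follows the usual exponential backoff: a retransmission timeout (RTO) starting at 1 second, doubling after each attempt and capped at 60 seconds, in line with the values suggested in RFC 6298. The function name and defaults are illustrative only; real stacks derive the RTO from the measured round-trip time and give up after a limited number of retries.

```python
def retransmit_times(window, initial_rto=1.0, max_rto=60.0):
    """Times (seconds after the original send) at which a segment would be
    retransmitted within `window` seconds, under plain exponential backoff."""
    times = []
    elapsed, rto = 0.0, initial_rto
    while True:
        elapsed += rto
        if elapsed > window:
            return times
        times.append(elapsed)
        rto = min(rto * 2, max_rto)  # back off, but never beyond the cap


# Retransmissions expected during a 5 minute outage under these assumptions.
schedule = retransmit_times(300)
print(len(schedule), schedule)
```

Under these assumptions a single well-behaved sender would retransmit fewer than ten times in five minutes, which is why hundreds of thousands of identical segments look abnormal.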

The pfSense router is connected through a Netgear GS116Ev2 smart switch to a number of Asus RT-AC67U AiMesh wireless access points. I have uploaded the Wireshark dump to http://brooktech.ltd:8080/Wireshark.7z and the problem first occurs at timecode 5499.551527. The packet is going from the pfSense Fujitsu i350 LAN port 90:1b:0e:25:5b:24 to the LG TV 20:3d:bd:b1:3b:2c.

I have tried changing the Intel network card and reinstalled pfSense (currently running latest 2.4.5-p1). Is this a bug with the pfSense network stack or could it be a problem with the Asus AiMesh wireless access points?

Frame 220835: 97 bytes on wire (776 bits), 97 bytes captured (776 bits) on interface \Device\NPF_{9C474FB4-2A3E-4FF3-90F2-D39BFF683E8F}, id 0
Ethernet II, Src: FujitsuT_25:5b:24 (90:1b:0e:25:5b:24), Dst: LGInnote_b1:3b:2c (20:3d:bd:b1:3b:2c)
    Destination: LGInnote_b1:3b:2c (20:3d:bd:b1:3b:2c)
    Source: FujitsuT_25:5b:24 (90:1b:0e:25:5b:24)
    Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 92.122.188.63, Dst: 192.168.1.112
Transmission Control Protocol, Src Port: 443, Dst Port: 50582, Seq: 1, Ack: 1, Len: 31
    Source Port: 443
    Destination Port: 50582
    [Stream index: 9456]
    [TCP Segment Len: 31]
    Sequence number: 1 (relative sequence number)
    Sequence number (raw): 226583085
    [Next sequence number: 32 (relative sequence number)]
    Acknowledgment number: 1 (relative ack number)
    Acknowledgment number (raw): 443703169
    1000 .... = Header Length: 32 bytes (8)
    Flags: 0x018 (PSH, ACK)
    Window size value: 1392
    [Calculated window size: 1392]
    [Window size scaling factor: -1 (unknown)]
    Checksum: 0x4893 [unverified]
    [Checksum Status: Unverified]
    Urgent pointer: 0
    Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
    [SEQ/ACK analysis]
    [Timestamps]
    TCP payload (31 bytes)
    Retransmitted TCP segment data (31 bytes)
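
For anyone wanting to look for the same pattern in their own capture, here is a minimal sketch of how the repeats could be counted. It assumes the capture has been exported to CSV with columns named time, stream, seq_raw and ack_raw. These column names, the function name and the threshold are all hypothetical, and the sample rows are synthetic (only the stream index and raw sequence/ack numbers are taken from the frame above).

```python
import csv
import io
from collections import Counter


def find_floods(csv_text, threshold=1000):
    """Return {(stream, seq_raw, ack_raw): count} for every segment that
    appears at least `threshold` times in the exported capture."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[(row["stream"], row["seq_raw"], row["ack_raw"])] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


# Synthetic sample: five copies of the same segment plus one unrelated packet.
sample = "time,stream,seq_raw,ack_raw\n"
sample += "".join(
    f"{5499.55 + i * 0.0001:.4f},9456,226583085,443703169\n" for i in range(5)
)
sample += "5500.0000,9457,1000,2000\n"

print(find_floods(sample, threshold=3))
```

A segment that shows up thousands of times with identical raw sequence and acknowledgment numbers is the flood described above; a handful of repeats would be ordinary retransmission.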

DaddyGo

@brookheather

Hi,

I only have one question...

Why are the wireless devices (mobile phones and other WiFi devices) on the LAN?
You could monitor this behavior more easily, and your system would be more secure, if these devices were moved to a separate VLAN (e.g. a WLAN VLAN for WiFi devices, etc.)

I know this doesn't solve your problem yet, but at least it segments the network and allows for more accurate debugging

Actually, make that two questions:
(it's just a guess because you haven't provided a lot of data)
Maybe you sometimes have some kind of loop in your network?

PS: Do you use STP, RSTP, etc.?

brookheather

@DaddyGo I'm not sure how a VLAN would help as lots of the wireless devices need to see other wired devices on the network such as printers, servers - if they were on a separate VLAN they wouldn't have access I believe? Also each of the three Asus wireless access points has multiple wired devices attached which also need to connect to other devices on the network. I would need to buy more smart switches to segregate the wired and wireless devices.

DaddyGo

@brookheather

The LAN is at best your escape interface for management purposes.
It would be very dangerous for any device to "break loose" on this interface.

Plus, inter-VLAN routing is your best friend.

edit: otherwise, if you use a lot of APs, the standard equipment is a reliable multi-port managed switch, like the Cisco SG350 series SMB switches
(an affordable category for SOHO use)

brookheather

I have another simple example of this TCP Retransmission flooding. I was watching Netflix on an LG TV that is wired to my switch. At 23:38 I turned off the TV, and ten minutes later I noticed the internet wasn't working for a minute. Looking at the Wireshark log for that time period, I can see a data packet arriving at 23:47 (nearly ten minutes later) for the TV's IP address 192.168.1.126, which immediately causes thousands of TCP Retransmission packets to be sent from the pfSense router, all with the same sequence and acknowledgement numbers as the original data packet.

This only happens occasionally. When I've tried to reproduce it by watching Netflix and then turning off the TV, I can see the data packet arrive 10 minutes later, but the router usually just sends a single TCP Retransmission packet, not a flood, so I think there is some bug in the router's network implementation?

As a workaround I have reduced the ARP table timeout from 20 minutes to 5 minutes so the disconnected devices are now removed from the router ARP table before this data packet arrives. This seems to have stopped the flooding so far as the data packet can't be sent to the IP address of the TV. I added a system tunable for net.link.ether.inet.max_age with a value of 300 seconds.
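
The timing behind this workaround can be sketched as follows. This is a deliberately simplified model, not how FreeBSD actually expires entries: it assumes the ARP entry was last refreshed when the TV was switched off (23:38) and that the late packet arrives about 9 minutes later (23:47). The function name is hypothetical.

```python
def arp_entry_present(seconds_since_refresh, max_age):
    """Simplified model: the entry exists until it has been idle for max_age seconds."""
    return seconds_since_refresh < max_age


late_packet_delay = 9 * 60  # TV off at 23:38, late packet at 23:47

for max_age in (1200, 300):  # 20 minute timeout vs. the 5 minute tunable
    present = arp_entry_present(late_packet_delay, max_age)
    print(f"max_age={max_age}s: entry still present = {present}")
```

With the 20 minute timeout the router still holds a MAC address for the TV when the late packet arrives and can transmit to it; with 300 seconds the entry is gone by then, which matches the flooding having stopped.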

DaddyGo

@brookheather

I'm glad you were able to move forward with this issue...

If I remember correctly, for a long time there were serious problems with smart TVs from the big manufacturers (LG, Samsung, etc.)
(it was a disaster until LG webOS 2, and I don't think newer versions are much better either)

What I remember is that it was not possible to turn off the WiFi from the TV's menu, and all kinds of unmanageable packets raced around the network, even when the TV was in standby mode.

Since I don't need such smart TV features, I solved this by connecting the TV via ethernet to an otherwise empty switch port. This turns off the WiFi, and the switch port doesn't lead anywhere, but it keeps the ethernet link up.
So I tricked the TV with this switch trick.

I still need to tell you - a segmented network is a secure one

I receive and distribute such streaming video channels in a separate VM environment, onto the coaxial network of the TVs, with a DVB-T / T2 modulator

PS:

your ASUS devices are also SOHO routers, which also work in AP mode
a better and more configurable choice would be, for example, UBNT WiFi APs
the APs are also capable of generating strange traffic, so they also need to be separated

https://forum.netgate.com/topic/128481/best-wireless-ap

johnpoz

@brookheather said in TCP Retransmission flooding LAN network:

TCP Retransmission packet not a flood so I think there is some bug in the router network implementation?

pfSense doesn't duplicate packets or send them on its own. If you're seeing a retransmission, it came from the client sending it. pfSense just passes it on.

If you're seeing a flood of traffic towards your TV... Who is 192.168.1.10 in this sniff? pfSense wouldn't even see traffic between two devices on the LAN. Is pfSense 192.168.1.10? Do you have some sort of bridge setup?

Are you doing some sort of NAT reflection with source NAT?

brookheather

@johnpoz The 192.168.1.10 IP is my Windows 10 server running the Wireshark logging plus Plex and other media services. The pfSense router is on 192.168.1.254. There is no bridge setup - not sure what "nat reflection with source natting" means? The setup is pretty vanilla - an FTTP ONT connected to a pfSense router which is connected to a Netgear managed gigabit switch which has three Asus wireless access points attached - each of which has multiple other wired devices attached.

Are you saying the flood of TCP Retransmission packets is actually coming from the external IP address? I have an external ping running every second to my WAN port and it shows no sign of any increased internet traffic when the flooding is happening - the latency remains low - if there was a flood of packets that saturated my download then I would expect to see an uptick in the latency of the pings.

ts_itops

Hi, did you guys find a solution for this? We have the same problem with Aruba APs: randomly, when some clients disconnect, after 10 minutes the network gets flooded with TCP retransmissions. We already have different VLANs. The network architecture is Juniper.

brookheather

@ts_itops Reducing the ARP table timeout from 20 minutes to 5 minutes fixed it for me - I have since changed my setup to Unifi switches and wireless access points but haven't bothered to retest with the default ARP timeout. Have you tried a 5 minute ARP timeout?

ts_itops

@brookheather It's already at 5 minutes in our configuration, but we will try to reduce it further. This is a huge problem here at the moment and we've been searching for a solution for days.

johnpoz

@ts_itops If you have something on your network that really wants to talk to 192.168.1.100, for example, and its MAC changes to something else, then yeah, that could cause a lot of traffic to the wrong MAC.

But normally IP and MAC combos don't change very often. The only thing lowering the ARP cache timeout from 20 to 5 minutes would do is point IP X to whatever the new MAC is a bit faster.

But normally when a device gets a new IP it would send a gratuitous ARP, which should update the cache, saying "hey, 192.168.1.100 is at MAC xyz".

If the device went away, then a short ARP cache timeout would prevent traffic from being sent, because there would be no MAC for it. But if some device on your network is sending the traffic, the ARP cache timeout on pfSense would have nothing to do with that. Devices like Windows normally have an ARP cache timeout of only about 30 seconds.

ts_itops

@johnpoz Yeah, this makes sense. But I cannot explain how the device sends packets when it has left the network. For example, I left our campus in the evening and drove home, and the next day I saw on our Wireshark tracker that, 15 minutes after I left the building, my phone sent the TCP retransmission storm. This lasted for about 5 minutes and then stopped. All CPUs on the switches went to 100% for the duration of the storm. But it doesn't happen with every device, and not even with mine all the time.

johnpoz

@ts_itops So you take your phone from network X to network Y, and then on Y you see a storm of retransmissions still trying to talk to IP 192.168.1.100, even when you're now on, say, 192.168.2/24?

Or are you seeing this traffic on network X, even though your phone is no longer on the X network?

And does this traffic come from pfSense, or go through pfSense? If it comes through or from pfSense, then yeah, the ARP cache on pfSense would still think the phone's IP with MAC xyz is there, and it could continue to send traffic even if the phone is no longer on the network. In that case, lowering the ARP cache time on pfSense would reduce the amount of time such traffic could be sent.