Periodic connection drops for specific client

salousama

Hello all, I'm posting about an issue I've been having with pfSense. I have pfSense (2.4.4-RELEASE-p1) installed on a single-NIC host with VLANs. I use several VLANs to separate specific wired LANs and wireless LANs; they are configured on a managed switch and a wireless AP respectively. The WAN connection is PPPoE through an ISP router (the connection has been very reliable so far).

The issue is that one specific Linux client loses all traffic every 15 to 30 minutes or so: DNS queries time out, and even traffic to plain IP addresses gets dropped.

I've tried to pinpoint the issue, but my efforts haven't led me to a conclusion. I ran a very basic test to see whether it's specifically related to DNS: a curl to a URI every X seconds and a curl to a specific IP every X seconds. All is good, but at some point all requests time out for 30 seconds to 1 minute. After that everything works again. I ran tcpdump on the involved interfaces (bg0.30 and pppoe), but I couldn't get anything more than some retransmissions of SYN packets, and at some point I saw some "ICMP host unreachable" errors. From the attached screenshots, it seems that the pppoe interface never sees the ACK packet and the VLAN interface does not see the SYN-ACK packet.
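A test like the one described above can be approximated with a small shell loop. This is only a minimal sketch, not the script used in the post: the URL, the IP address, the 5-second interval and the 3-second timeout are placeholder assumptions.

```shell
#!/bin/sh
# Placeholder targets (assumptions, not from the post); substitute your own.
URL="https://example.com/"     # exercises DNS resolution plus HTTP
IP_URL="http://192.0.2.1/"     # plain IP, no DNS lookup involved
INTERVAL=5                     # seconds between rounds

# Run one command and print a timestamped ok/FAIL line for it.
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        status=ok
    else
        status=FAIL
    fi
    printf '%s %s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$label" "$status"
}

# Probe both targets forever; stop with Ctrl-C.
monitor() {
    while :; do
        probe dns+http curl -sS --max-time 3 -o /dev/null "$URL"
        probe ip-only  curl -sS --max-time 3 -o /dev/null "$IP_URL"
        sleep "$INTERVAL"
    done
}

# Start the loop by calling: monitor
```

If both probes fail at the same moments, DNS alone is unlikely to be the cause, which matches what the captures suggested.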

At this point, only a Linux laptop has this issue; a Mac laptop and a Windows laptop have none. All clients connect wirelessly to the AP. One pending test I want to do is a wired connection directly to the switch. Also, the Linux laptop has no issues when connected to a regular Wi-Fi network (e.g. the ISP router or a different office network), which leads me to suspect the firewall configuration.

Any thoughts or ideas how to troubleshoot this further are more than welcome. I can provide any information that might be useful.

salousama

Alright, I found the culprit. I'm leaving this as a reference in case anyone comes across the same thing. The interval between connection interruptions is 20 minutes, and it's pretty consistent. Since I could not get more from packet captures, I tried to dig into L2. The 20-minute period matches the ARP cache timeout, which defaults to 1200 seconds in FreeBSD.

The issue can be solved by either setting static ARP entries when defining static DHCP mappings (in the DHCP server settings) or by increasing the default cache timeout on the firewall.

On FreeBSD, this can be done by tweaking the following parameter:

net.link.ether.inet.max_age: 1200
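For reference, a sketch of how the value can be inspected and changed from a shell on the firewall. The value 3600 is only an illustrative assumption. A runtime change like this does not survive a reboot; on pfSense the persistent place for it is System > Advanced > System Tunables.

```shell
# Show the current ARP cache lifetime in seconds (FreeBSD / pfSense shell).
sysctl net.link.ether.inet.max_age

# Raise it at runtime; 3600 is an arbitrary example value, not a recommendation.
sysctl net.link.ether.inet.max_age=3600
```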

This is different from Linux, where cache entries get refreshed and garbage collected. Also, it turns out all clients (e.g. Windows) were experiencing this traffic loss, but it was not as noticeable as on the Linux client.

One note: even if one raises the cache timeout on the firewall instead of creating static entries, the same behavior will still be observed, just less frequently. One "hacky" way to avoid this might be to generate ARP requests from the client before the expiration interval, but I haven't tried this.
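A rough sketch of that untried workaround, for a Linux client. It assumes the iputils version of arping (which needs root or CAP_NET_RAW), a hypothetical interface name and gateway address, and that the firewall refreshes its cache entry when it sees the client's request. It works around the symptom only.

```shell
#!/bin/sh
# Hypothetical values (assumptions); adjust to your network.
IFACE="wlan0"             # client wireless interface
GATEWAY="192.168.60.1"    # firewall address on the client's VLAN
MAX_AGE=1200              # firewall ARP cache lifetime in seconds
INTERVAL=$((MAX_AGE / 2)) # send a request well before the entry expires

# Send one ARP request to the gateway every INTERVAL seconds.
arp_keepalive() {
    while :; do
        arping -q -c 1 -I "$IFACE" "$GATEWAY" ||
            echo "$(date '+%H:%M:%S') no ARP reply from $GATEWAY" >&2
        sleep "$INTERVAL"
    done
}

# Start the loop by calling: arp_keepalive
```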

Derelict

If you need things like that there is something else wrong with your network. As happens millions or maybe even billions of times per day on networks behind pfSense installations, when the ARP times out FreeBSD will request it again. There should be an answer. If there is not, you need to find out why not.

I have been doing this a long time and, while I have had to manually expire ARP entries to get them to renew, I have never had to use static ARP or adjust timeouts to solve a problem on a normal "LAN" network. I would try to find what is breaking ARP instead of papering over whatever it is.

johnpoz

Yeah, with Derelict here. While there are times that you might want to adjust the ARP cache time or set a static ARP entry, this is not any sort of fix for why you're having issues with ARP in the first place.

salousama

Thanks for the responses guys. Indeed, even though static entries seem to eradicate the issue, there is no actual justification why this would be required.

To be honest, I was thinking of digging into the ARP resolution/traffic on specific interfaces on both the client and firewall. I will keep this post updated.

The most "annoying" thing with this timeout is that the client experiences total connection loss for about a minute or so, although a new entry is eventually registered on the firewall side.

Any thoughts or ideas on what to look for specifically (if that rings any bell) and/or how to troubleshoot this further are more than welcome!

johnpoz

Well, I would probably set the ARP cache time to something much lower than 20 minutes, so you can get more info to work with without having to wait 20 minutes between tests.

Then sniff the traffic on both the firewall interface and your client. Are you seeing the ARPs go out and arrive on the other end, but not getting answered? Or are they answered, but the sender never gets the reply? Or are the devices not even sending ARP out?

Do you have an issue with a duplicate IP or something using the same MAC? Do you have something flapping between interfaces, moving a virtual MAC around?
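One way to do that sniffing is to run the same ARP-only capture on both ends at once and compare them. A sketch, with the interface names left as placeholders:

```shell
# Print ARP frames only: -e shows the Ethernet source/destination MACs,
# -n disables name resolution, -l makes the output line-buffered.
capture_arp() {
    iface=$1
    tcpdump -e -n -l -i "$iface" arp
}

# On the firewall:      capture_arp <vlan-interface>
# On the Linux client:  capture_arp <wireless-interface>
```

The source MAC on each line also helps with the duplicate-IP question: two different MACs answering for the same IP would show up here.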

salousama

OK so what I have seen so far is the following:

I set the cache timeout to 120sec for troubleshooting.
5 seconds before the timeout, pfsense generates some ARP requests (broadcast) to ff:ff:ff:ff:ff:ff. However, no requests are seen on the client interfaces.
At some point, depending on the client, each makes an ARP request to pfsense to find out where the gateway is. When this happens pfsense updates the cache and everything is good.
On macOS clients, no issue is observed, apparently because they generate an ARP request every 1m30s (at least the clients I have tested with).
On Linux and Windows clients the same behavior is observed, i.e. pfsense is trying to find out their MAC addresses, but the Windows client generates its own ARP request much sooner than the Linux client (that's why the Linux client experiences losses).
On a wired connection on VLANx, the Linux client sends an ARP request for the gateway every 20 seconds. On a different VLANy (again wired), the client responds properly when pfsense sends a broadcast, and it does not initiate ARP requests itself.

I hope some of it makes sense. I'm still testing some other scenarios to verify the setup.

Derelict

If the broadcast ARP request is not making it to the client it obviously cannot respond.

The problem is in your Layer 2 (switches, wifi, etc)

johnpoz

@salousama said in Periodic connection drops for specific client:

5 seconds before the timeout, pfsense generates some ARP requests (broadcast) to ff:ff:ff:ff:ff:ff. However, no requests are seen on the client interfaces.

And the client is wireless? Your AP could be preventing the ARP from going out on the wifi, while still allowing ARP from wireless clients to be sent to your wired network.

UniFi, for example, has a setting to block broadcast/multicast from LAN to WLAN, because broadcast is sent at a low data rate. What specific wifi AP are you using?

Such a scenario would explain the point you made, if the client having the issue is wireless.

salousama

@johnpoz Thanks for the response. Yes, the client is wireless, so it only makes sense that the AP is doing something. Unfortunately, I have a TL-WA901ND in multi-SSID mode (connecting to a D-Link 1100-08).

From what I'm seeing, I'm going to replace it soon. For example, its LAN interface broadcasts to the VLANs it controls (the multi-SSID mode) :) I have a client on VLANx and it captured the native VLAN broadcast ...

johnpoz

@salousama said in Periodic connection drops for specific client:

TL-WA901ND

Yeah, TP-Link and how VLANs work seems to be an issue, with their budget gear at least. Their entry-level smart switches wouldn't allow you to remove VLAN 1 from ports.

So yeah, my suggestion would be to get a real AP if you want to do VLANs.

salousama

That's probably the next step indeed. Apparently the AP does not understand the tagging from the firewall; one example is the following:

port 1 on the switch is the trunk port for pfsense.
port 8 is the trunk port for the AP.
VLAN99 has port 1 (tagged) and 7, 8 untagged (typically it should be the native VLAN, but without port 1 tagged one cannot get DHCP from port 7).
VLAN60 has port 1 and port 8 tagged.
Management interface of the AP receives an IP from VLAN99 (for example 192.168.99.x).

From the firewall, if I run arping on vlan60 and vlan99, e.g.:

arping -i iface.60 192.168.60.x
arping -i iface.99 192.168.60.x

In the first case, no ARP requests reach the clients' wireless interfaces, as we discussed above; in the second case, however, all requests are visible to the clients :)

# Firewall
arping -i iface.60 192.168.60.x
ARPING 192.168.60.x
Timeout
Timeout
...

# Client sees no traffic

# Firewall
arping -i iface.99 192.168.60.x
ARPING 192.168.60.x
Timeout
Timeout

# Client sees the requests, however the firewall reports Timeouts above, the ARP table is refreshed though.
xxx FW_MAC > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 192.168.60.x tell 192.168.99.1, length 46
xxx CLIENT_MAC > FW_MAC, ethertype ARP (0x0806), length 42: Reply 192.168.60.x is-at CLIENT_MAC, length 28
xxx FW_MAC > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 192.168.60.x tell 192.168.99.1, length 46
xxx CLIENT_MAC > FW_MAC, ethertype ARP (0x0806), length 42: Reply 192.168.60.x is-at CLIENT_MAC, length 28
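Capture lines in this format can be tallied per target IP to see at a glance whether requests are being answered. A small sketch, assuming `tcpdump -e -n` style output on standard input; the function name is made up for illustration.

```shell
# Count ARP requests and replies per IP from tcpdump text on stdin.
arp_summary() {
    awk '
        # "Request who-has <ip> tell <ip>": the target IP follows "who-has".
        / Request who-has / {
            for (i = 1; i < NF; i++) if ($i == "who-has") req[$(i + 1)]++
        }
        # "Reply <ip> is-at <mac>": the answered IP follows "Reply".
        / Reply / {
            for (i = 1; i < NF; i++) if ($i == "Reply") rep[$(i + 1)]++
        }
        END {
            for (ip in req)
                printf "%s requests=%d replies=%d\n", ip, req[ip], rep[ip] + 0
        }
    '
}

# Usage: tcpdump -e -n -l -i <interface> arp | arp_summary   (stop with Ctrl-C)
```

An IP with many requests and zero replies on the firewall side, while the client capture shows no requests at all, points at something in between dropping the broadcast.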

I was thinking maybe the switch was misconfigured, but I can confirm the AP shows the same issue with multiple configs...

Anyway, I'm just providing some context for completeness. Again thanks @johnpoz for the support.

awood

I'm seeing the same behavior on my network for both wired and wireless clients.
https://forum.netgate.com/topic/157090/periodic-drops/4

Thank you for this post, it let me isolate the cause of the network disruption. I'm still not sure what the root cause is, but at least I have a starting point.