Since upgrading to pfSense 2.4.5, I'm seeing periodic packet drops. This is highly disruptive to latency-sensitive TCP sessions such as voice chat and games. However, it also appears to affect non-TCP sessions.
I'm at a loss for how to even start diagnosing the issue. I've isolated the problem to either the firewall or my ISP, but Occam's razor says the firewall is the more likely culprit.
The situation is the following:
- I'll be on a voice chat and I'll stop hearing the other participants. During this time, they can hear me fine. This typically lasts between 20 seconds and 1 minute, then the audio comes back and I'm fine for another 20-40 minutes.
- When this happens, I've noticed that pings are not returned either.
A few times when I've been monitoring and pings go "dark" like this, I've been able to find the state in the state table and delete it. This immediately restores routing of the traffic and I see my ping responses again.
As with the TCP disconnects, the pings also recover on their own after an arbitrary number of seconds.
Any thoughts about how to diagnose what I have set up incorrectly in my firewall? How do I verify my assumption above (that the state table is to blame)?
If I were to set up one of the interfaces as a DMZ (no outbound NAT), would that bypass the state table and isolate that machine so I can see if it still happens, or would it still go through the routing logic and state table? (I'm on a cable modem that only allows one computer to get an IP address, so I can't just put a switch on the other side and connect a machine there.)
Thanks for any advice, input, etc. that you can provide.
Btw, I found this article from a while ago that seems very similar to my issue, but has no resolution:
- Happens on wireless and wired connections
- Traffic inside the LAN (switches only) does not appear to be disrupted
I have VLANs configured (VLAN 1, 3, 4, 5, 6, 7, 8). VLAN 3 is the maintenance LAN with most of my switching equipment, and VLAN 1 is typically untagged on the layer 2 switches and pfSense.
Cable Modem -> pfSense (Physical Machine) -(Trunk)-> Ubiquiti EdgeSwitch 24 Lite -(Trunk)-> Ubiquiti AP NanoHD
My 2 testing computers:
The wireless computer is connected to the NanoHD on VLAN 6
The wired computer is connected via a trunk connection to the 24 port switch on VLAN 3 or VLAN 5 depending on which tagged interface I enable.
I also found this, which looks promising since it would explain why the firewall couldn't send data back to the computer in question:
Drops caused by ARP cache
I'm going to monitor the ARP cache today to see if it is related.
I confirmed that the issue is related to the ARP expiration period.
I ran this loop on the pfSense firewall:
while true; do date; arp <hostname that is dropping>; sleep 1; done
It shows the expiration timer counting down.
When it hits 0 seconds, the network is disrupted, and then after some number of seconds the ARP entry is refreshed and the network starts to work again.
This corresponds to the behavior I'm seeing (my packets leave the network, but incoming packets are not delivered).
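To avoid staring at the one-second loop, the same check can log only the interesting moments. This is a minimal sketch, assuming FreeBSD's arp prints lines containing "expires in N seconds" (worth verifying on your own box). The function names arp_expiry_seconds and watch_arp_expiry, the 5-second default threshold, and the MAC address in the comment are made up for illustration.

```shell
# Extract the remaining lifetime (in seconds) from `arp` output on stdin.
# Assumed line format (example values, not from a real capture):
#   ? (192.168.60.104) at aa:bb:cc:dd:ee:ff on igb1.60 expires in 42 seconds [vlan]
# Prints nothing for permanent or incomplete entries.
arp_expiry_seconds() {
    sed -n 's/.* expires in \([0-9][0-9]*\) seconds.*/\1/p'
}

# Poll one host once per second and log when its ARP entry is close to
# expiring or is not present. Hypothetical helper; runs until interrupted.
watch_arp_expiry() {
    host="$1"
    threshold="${2:-5}"   # seconds; illustrative default
    while true; do
        secs="$(arp -n "$host" 2>/dev/null | arp_expiry_seconds | head -n 1)"
        if [ -z "$secs" ]; then
            echo "$(date) $host: no expiring ARP entry (missing, incomplete, or permanent)"
        elif [ "$secs" -le "$threshold" ]; then
            echo "$(date) $host: ARP entry expires in ${secs}s"
        fi
        sleep 1
    done
}
```

Running something like watch_arp_expiry 192.168.60.104 10 in one window while a ping runs in another should make it easy to line up the timestamps of the outage with the moment the entry expires.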
Any idea what could be wrong or how to get from here to root cause?
Again, this seems to have started after upgrading to pfSense 2.4.5. Did anything change about ARP in BSD or pfSense around that version?
I'm running a tcpdump arp command in another window.
Interestingly enough, I see a flood of ARP requests for each IP address in the subnet. However, it seems to skip over the computer in question. (Mine is on IP 192.168.60.104; I see ARP requests for 103 and 105.)
Is that because it is sending those requests out only for unknown or expired ARP entries?
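To check whether the firewall ever asks for the affected host at all, the tcpdump output can be counted per target address. This is a rough sketch, assuming tcpdump's usual text format for ARP ("ARP, Request who-has <ip> tell <ip>") when run with -n and without -e. count_who_has is a made-up helper name, and the timestamps in the comment are example values.

```shell
# Count ARP requests for one target IP in tcpdump text output read from stdin.
# Assumed line format (as printed by `tcpdump -n arp`, without -e):
#   12:00:01.000000 ARP, Request who-has 192.168.60.104 tell 192.168.60.1, length 28
count_who_has() {
    target="$1"
    awk -v ip="$target" '
        $2 == "ARP," && $3 == "Request" && $4 == "who-has" && $5 == ip { n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    '
}
```

A possible way to use it is to capture a fixed number of packets and count, for example tcpdump -n -c 200 -i <interface> arp | count_who_has 192.168.60.104, then compare against the count for a neighbouring address such as 192.168.60.103. A count of 0 for the affected host while its neighbours show requests would back up the observation that it is being skipped.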
@awood Did you ever figure out what the cause was, and a solution?
For my wireless setup, I altered the wifi access points to allow broadcast traffic from the pfSense box. In most cases, the wireless drops broadcast packets from the LAN side.
I also increased the time for the ARP expiration.
Since doing the above my network seems a bit more stable, but I still don't feel like we've gotten to "root" cause.