(2.2.4) Loss of WAN link brings VLAN interfaces down temporarily
Couldn't find a suitable place to post, so I've stuck it here.
Today, whilst messing about with a test rig, I noticed that pulling the WAN cable from the box caused all interfaces to stop passing traffic for a few tens of seconds.
During this time, ifconfig showed the NIC as still having an IP and an active link.
When the system eventually noticed, the IP and link were cleared, and internal network traffic started flowing again.
Reconnecting the cable had no effect at first, but again, after a few tens of seconds, internal traffic stopped for a few seconds, then started again once a WAN DHCP lease had been taken.
Has anyone seen this before, or got any ideas? No fancy dual-WAN or CARP/HA setup, and curiously, no output at all to the local console while this happens.
I'm gonna trawl the logs tomorrow while I fiddle around with cables, but I just wondered if this has already been seen.
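For anyone wanting to line up the cable pull against what ifconfig reports, here is a rough sketch of a link-state logger. It is not something from the original rig: it assumes Python 3.7+ is available on the box (or wherever ifconfig can be run), and `parse_ifconfig` and `watch` are made-up helper names. It relies only on the `status:` and `inet` lines that FreeBSD's ifconfig prints.

```python
import re
import subprocess
import time

# FreeBSD ifconfig prints lines such as "status: active" or "status: no carrier".
STATUS_RE = re.compile(r"^\s*status:\s*(.+?)\s*$", re.MULTILINE)
# "inet" followed by whitespace matches IPv4 lines only, not "inet6".
INET_RE = re.compile(r"^\s*inet\s+(\S+)", re.MULTILINE)


def parse_ifconfig(text):
    """Return (status, ipv4) for one interface; either item is None if absent."""
    status = STATUS_RE.search(text)
    inet = INET_RE.search(text)
    return (
        status.group(1) if status else None,
        inet.group(1) if inet else None,
    )


def watch(iface="igb0", interval=1.0):
    """Hypothetical helper: print a timestamped line whenever the state changes."""
    last = None
    while True:
        out = subprocess.run(
            ["ifconfig", iface], capture_output=True, text=True
        ).stdout
        state = parse_ifconfig(out)
        if state != last:
            print(time.strftime("%H:%M:%S"), iface, state, flush=True)
            last = state
        time.sleep(interval)
```

Calling `watch("igb0")` from a serial console session and then pulling the cable should show how long the interface keeps reporting an active link and an address after the physical link has gone.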
Quick system specs:
Pentium G 3.5GHz Dual Core CPU
16GB ECC RAM
Dual 1TB SATA in GEOM Mirror
Quad Intel onboard NIC
MBUF system tunable set to 1 million
Packages: Avahi, Captive Portal, Cron, DHCP Server, DNS Resolver, FreeRADIUS, FTP Client Proxy, Squid3+SquidGuard, Service Watchdog, Shellcmd, Snort
Typically sits around 25% CPU usage (but shows ~52% thanks to idlepoll), 9GB RAM usage. Plenty of headroom - or so it seems.
I had initially wondered if Snort was loading the system when interfaces changed, but top (via serial port) showed no such event.
igb0 WAN - Straight to ISP
igb1 LAN - Untagged to OOB switch (only used for management) - this loses the ability to pass traffic
igb2+igb3 OPT1-100 - Tagged VLANs via LACP - these lose the ability to pass traffic
So, as a simple simulation of what it sounds like you did, I got a few side-by-side pings going from my workstation (LAN, 192.168.9.100/24, em1 on pfSense): one to a public IP, one to another segment (printer, 192.168.2.50/24, em2 on pfSense), and one to a WLAN guest (192.168.6.101/24, em2_vlan300; that VLAN is on the same pfSense interface as the printer). Then I pulled the Ethernet cable out of my cable modem, which goes to the pfSense WAN (em0).
While I didn't want to leave it disconnected for long, you can see the ping to 18.104.22.168 time out, without any blips to the other segments.
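If you want to repeat this kind of test and get numbers rather than eyeballing ping windows, something along these lines might do. It is only a sketch under a few assumptions: a Unix-style `ping` that accepts `-c 1`, Python 3 on the workstation, and the helper names (`ping_once`, `outage_windows`, `probe`) are made up for illustration. It also pings the hosts one after another each round rather than truly in parallel, which is good enough for outages lasting tens of seconds.

```python
import subprocess
import time


def ping_once(host, timeout_s=2.0):
    """Return True if a single echo request succeeds (assumes a Unix-style ping)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def outage_windows(samples):
    """Turn time-ordered (timestamp, ok) samples into (start, end) outage windows.

    start is the first failed sample, end is the first successful sample
    after it, or None if the host was still down at the last sample.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts
        elif ok and start is not None:
            windows.append((start, ts))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


def probe(hosts, duration_s=120, interval_s=1.0):
    """Ping each host once per round and return the outage windows per host."""
    history = {h: [] for h in hosts}
    t0 = time.monotonic()
    while time.monotonic() - t0 < duration_s:
        now = round(time.monotonic() - t0, 1)
        for h in hosts:
            history[h].append((now, ping_once(h)))
        time.sleep(interval_s)
    return {h: outage_windows(s) for h, s in history.items()}
```

Running `probe(["192.168.2.50", "192.168.6.101", some_public_ip])` (with `some_public_ip` being whatever external address you normally ping) while pulling the WAN cable gives one list of outage windows per target, so any blip on the internal segments shows up next to the expected WAN timeout.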
Do you have pfSense set to reset states on loss of gateway? I guess I could turn that on and try the same test. That would be my guess; it's under Advanced, Misc. But it should only kill states of traffic going to that gateway.
State Killing on Gateway Failure
The monitoring process will flush states for a gateway that goes down if this box is not checked. Check this box to disable this behavior.
So I removed the check mark there and saved, then did the test again. No loss of connectivity between LAN interfaces when the WAN goes away, as far as I can see.
Thanks for trying that out! It's an issue I've never seen (or at least noticed) on this rig before.
Yeah, I do have state killing turned on. During one of the times I tried this, I did quit the pings and restart them several times - no dice… So, I don't think it's a state issue (but I'll switch it off and try again).
Having thought about it more (but not being near the box at the moment), I am wondering if it's got something to do with device polling. Not sure how the Intel drivers are built, but could it be feasible that link detection/notifications are interrupt based?
I'll turn device polling off and give it another go tomorrow - and report back :)
Tried disabling state killing - no difference.
Tried disabling device polling (and rebooting) - no difference.
Bizarre. I'll be messing about more tomorrow - been a busy day, today!
Finally solved this problem - seems like the onboard NICs (Intel) had some fault or pathology.
Disabled the onboard NICs, installed a four port Intel server card, and it's working fine now.