Problem with NICs flapping at intervals of 5 mins
-
This might be a long post and it's not necessarily a cry for help, more of a weird mystery.
Current situation:
I have an HP slim PC (10th gen Intel) that I use as my main firewall in my home network / lab. The main NIC is a Dell X710-T4L cross-flashed to Intel firmware. My LAN-type interfaces have, for a long time and across multiple NICs and switches, been tagged VLAN interfaces with a LAGG as the parent.
The switch they are connected to is a Trendnet TEG-7124WS with hardware Rev 1.0. To save you from looking it up, this is a 12 port managed switch with 8 ports capable of 100M/1G/2.5G/5G/10G and 4 SFP+ ports.
Here's the thing I'm seeing. Seemingly randomly, the NICs that are connected to that switch from pfSense go down and then up with about a 5 second interval in between. All of the ports in the LAGG do it, but some significantly more often than others. What really stumps me is that it's always at a time that is a multiple of 5 minutes. I log both the switch and pfSense to a graylog server and can show the events from either, but here's one from the switch for the last 24 hours:
2024 Oct 13 20:10:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 20:10:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 18:25:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 18:25:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 17:35:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 17:35:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 16:40:10 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 16:40:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 15:40:10 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 15:40:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 14:30:10 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 14:30:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 13:35:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 13:35:03 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 13:00:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 13:00:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 12:20:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 12:20:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 12:00:09 TEG-7124WS CFA Slot0/8 Link Status [UP]
2024 Oct 13 12:00:04 TEG-7124WS CFA Slot0/8 Link Status [DOWN]
2024 Oct 13 11:20:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 11:20:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 10:45:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 10:45:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 10:15:09 TEG-7124WS CFA Slot0/6 Link Status [UP]
2024 Oct 13 10:15:03 TEG-7124WS CFA Slot0/6 Link Status [DOWN]
2024 Oct 13 09:45:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 09:45:03 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 08:45:08 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 08:45:03 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 06:45:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 06:45:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 04:55:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 04:55:03 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 04:25:08 TEG-7124WS CFA Slot0/8 Link Status [UP]
2024 Oct 13 04:25:03 TEG-7124WS CFA Slot0/8 Link Status [DOWN]
2024 Oct 13 03:55:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 03:55:03 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
Note the time stamps. Always at a multiple of 5 minutes, and always around 4-5 seconds after the minute mark and then back up around 9-10 seconds after the minute mark.
I have months of logs showing this.
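For anyone who wants to check the timing claim against their own export, here is a minimal Python sketch. It assumes the switch log has been saved with one event per line in the format shown above; the sample lines are copied from that log, and the helper names (`parse_line`, `offset_from_5min_mark`) are made up for illustration.

```python
from datetime import datetime

# A few lines copied from the switch log above, one event per line.
SAMPLE = """\
2024 Oct 13 20:10:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 13 20:10:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
2024 Oct 13 12:00:09 TEG-7124WS CFA Slot0/8 Link Status [UP]
2024 Oct 13 12:00:04 TEG-7124WS CFA Slot0/8 Link Status [DOWN]
2024 Oct 13 10:15:09 TEG-7124WS CFA Slot0/6 Link Status [UP]
2024 Oct 13 10:15:03 TEG-7124WS CFA Slot0/6 Link Status [DOWN]
"""


def parse_line(line):
    """Split one switch log line into (timestamp, port, state)."""
    parts = line.split()
    # Fields: year, month, day, time, hostname, 'CFA', port, 'Link', 'Status', '[STATE]'
    # Note: %b expects English month abbreviations (default C locale).
    ts = datetime.strptime(" ".join(parts[:4]), "%Y %b %d %H:%M:%S")
    port = parts[6]
    state = parts[9].strip("[]")
    return ts, port, state


def offset_from_5min_mark(ts):
    """Seconds elapsed since the most recent multiple-of-5-minutes mark."""
    return (ts.minute % 5) * 60 + ts.second


events = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
down_offsets = [offset_from_5min_mark(ts) for ts, _, st in events if st == "DOWN"]
up_offsets = [offset_from_5min_mark(ts) for ts, _, st in events if st == "UP"]

print("DOWN offsets (s):", down_offsets)
print("UP offsets (s):  ", up_offsets)
```

If the flaps were unrelated to a 5-minute timer, the offsets would be spread across 0-299 seconds rather than clustering in a narrow band.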
My first impulse was to blame the cabling, so I replaced it. No change. So I then swapped out the X710-T4L NIC for an X540-T2.
That resulted in far fewer events, but when they happen, they still happen with the same time stamps: multiples of 5 minutes plus 4-5 seconds.
Ok, so it's a bad switch, right? Well, I have other devices connected to this switch, some using the 10Gbps copper ports at the full 10Gbps speed.
The only other device that shows any messages in the logs is a TrueNAS system (still running Core, so also FreeBSD) with a 2 port LAGG on an X540-T2. BUT that device doesn't show the same timing, and its events are far less frequent; they don't line up with the 5 min marks mentioned above.
I'm currently running a 3 port LAGG on pfSense and I've only gotten events on two ports at a time so I don't have any operational issues; so far at least one port is always up.
I don't intend to keep using this config indefinitely but I'm really curious about the 5 min timing thing.
Any ideas?
Edit: The one thing that I can come up with that has a 5 min interval is SNMP polling from a Cacti server. That polls the switch as well as pfSense and TrueNAS and almost everything else on my network. pfSense is the only device that shows timing correlated at 5 mins. I also disabled SNMP on the switch long enough to observe the events happening while it remained disabled.
-
@whosmatt if you take the laggs out of the equation - do these interfaces still go up down?
I recently ran into a sort of similar issue with an interface going up/down on a regular basis.. Took me a bit to track down what was causing it.. It was happening on a port I had connected to a PoE port on the back of an NVR..
It would only be down for a couple of seconds - but it was causing me minor packet loss when it would up/down.. Putting a different switch between fixed it.
You don't mention anything about PoE, but sometimes it can be something you're not thinking about that is causing the problem.. Normally I wouldn't use a PoE port as an uplink to another switch, so this for sure wasn't one of the things I thought could cause a problem - but there clearly was a pattern to it, and it was quick..
I have yet to see a log entry for this since putting the switch between, and have run an extended ping of like 7000 pings without a single lost packet.
-
Hmm, feels like a link negotiation issue. Does it do it for all 3 ports in the lagg?
Can you set both sides to a fixed speed?
-
@stephenw10 seems it only happens on two of the 3..
"I'm currently running a 3 port LAGG on pfSense and I've only gotten events on two ports at a time:"
but not sure if it does happen on all 3 but never at the same time, A and B, say, one time and then B and C the next time?
For testing I would take the lagg out of the equation - while this could cause a brief outage.. seems it should only be a few seconds.. And if it doesn't happen then you know for sure it's something related to the lagg.
-
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
if you take the laggs out of the equation - do these interfaces still go up down?
I have taken port 7 on the switch and ixl1 on pfSense out of the lagg but have left them physically connected to see what happens.
-
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
but not sure if it does happen on all 3 but never at the same time, A and B, say, one time and then B and C the next time?
That's correct. The logs in my original post show at least one event for each of the three ports in the lagg. It rarely happens on two at a time, and never (so far) on all three at the same time.
Oh, and sometimes it will go a day or two without happening at all! Fun one to troubleshoot.
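A rough way to answer the "which ports, and ever at the same time?" question from the logs rather than by eye is sketched below in Python. The DOWN events are a handful copied from the switch log in the first post; the variable names are only illustrative, and "same time" is assumed to mean "within the same minute".

```python
from collections import Counter, defaultdict

# (timestamp, port) of DOWN events, copied from the switch log in the first post.
DOWN_EVENTS = [
    ("2024 Oct 13 20:10:04", "Slot0/7"),
    ("2024 Oct 13 12:20:04", "Slot0/7"),
    ("2024 Oct 13 12:00:04", "Slot0/8"),
    ("2024 Oct 13 10:15:03", "Slot0/6"),
    ("2024 Oct 13 04:25:03", "Slot0/8"),
]

# How many times each port dropped.
flaps_per_port = Counter(port for _, port in DOWN_EVENTS)

# Group ports by the minute in which they dropped (strip the ":SS" suffix).
ports_by_minute = defaultdict(set)
for stamp, port in DOWN_EVENTS:
    ports_by_minute[stamp[:-3]].add(port)

# Minutes in which more than one port went down together.
simultaneous = {minute: ports for minute, ports in ports_by_minute.items() if len(ports) > 1}

print(dict(flaps_per_port))
print("simultaneous drops:", simultaneous)
```

Run over the full export, `simultaneous` would list the rare two-port events, and a three-port entry would mean the whole lagg went dark.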
-
@whosmatt said in Problem with NICs flapping at intervals of 5 mins:
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
if you take the laggs out of the equation - do these interfaces still go up down?
I have taken port 7 on the switch and ixl1 on pfSense out of the lagg but have left them physically connected to see what happens.
It has already happened outside of the lagg:
From switch:
2024 Oct 14 17:30:09 TEG-7124WS CFA Slot0/7 Link Status [UP]
2024 Oct 14 17:30:04 TEG-7124WS CFA Slot0/7 Link Status [DOWN]
From pfSense:
2024-10-14 17:30:09.000 kernel: kernel: ixl1: link state changed to UP
2024-10-14 17:30:09.000 kernel: kernel: ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: CL74 FC-FEC/BASE-R, Autoneg: True, Flow Control: None
2024-10-14 17:30:09.000 check_reload_status[430]: check_reload_status[430]: Linkup starting ixl1
2024-10-14 17:30:05.000 kernel: kernel: ixl1: link state changed to DOWN
2024-10-14 17:30:05.000 check_reload_status[430]: check_reload_status[430]: Linkup starting ixl1
Guess I'll force speed and duplex on both sides and see what happens.
-
@stephenw10 said in Problem with NICs flapping at intervals of 5 mins:
Can you set both sides to a fixed speed?
Curious how to do this on the pfSense side with interfaces that are part of a lagg, or whether it's possible.
-
Yeah, not easily since you can't assign the member interfaces separately. You can add shell cmds to set them at boot.
But you can test it with the NIC you removed from the lagg first.
-
@stephenw10 said in Problem with NICs flapping at intervals of 5 mins:
But you can test it with the NIC you removed from the lagg first.
Yep, that's what I'm doing currently. Thanks!
-
Well, I was about to post that it happened again with forced speed and duplex but then I saw this in the pfSense log:
kernel: ixl1: Link is up, 10 Gbps Full Duplex, Requested FEC: None, Negotiated FEC: CL74 FC-FEC/BASE-R, Autoneg: True, Flow Control: None
Which is odd because I definitely set 10G full. But then I realized I didn't enable the interface I assigned to ixl1. So I enabled the interface and will wait and see what happens.
Edit:
Actually, it appears that the settings aren't being applied correctly, at least going by the output of ifconfig. I also set up a second, unassigned NIC and forced its speed and duplex in the UI just to see the difference:
[2.7.2-RELEASE][root@pfsense]/root: ifconfig ixl1
ixl1: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: OPT10
        options=48100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,HWSTATS,MEXTPG>
        ether b4:96:91:b6:27:b5
        inet6 fe80::b696:91ff:feb6:27b5%ixl1 prefixlen 64 scopeid 0x2
        media: Ethernet autoselect (10Gbase-T <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
[2.7.2-RELEASE][root@pfsense]/root: ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        description: OPT12
        options=80098<VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
        ether 00:0a:f7:8f:51:89
        inet6 fe80::20a:f7ff:fe8f:5189%bge1 prefixlen 64 scopeid 0x7
        media: Ethernet 1000baseT <full-duplex> (none)
        status: no carrier
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
The media for ixl1 still shows 'autoselect' even when set to 10G full in the UI.
-
@whosmatt Can you even set manual 10ge? I thought auto was part of the spec. You can set the speed down manually.. I have to call up the spec.. And we really don't run much copper 10ge at work.. I believe we do have some.. I will have to tool around tomorrow and see..
Might be able to set it 5 or 2.5, etc.
-
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
Can you even set manual 10ge
It's an option in the UI, yes. It's also an option on the switch side.
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
And we really don't run much copper 10ge at work
I'm beginning to understand why.
-
@johnpoz said in Problem with NICs flapping at intervals of 5 mins:
Might be able to set it 5 or 2.5, etc.
I tried setting 5000Base-T and ifconfig still shows "media: Ethernet autoselect (10Gbase-T <full-duplex>)"
If I go through the various settings on the switch the NIC follows along, down as far as 1Gbps. There's also a 100M Full setting on the switch but the NIC won't link at that speed.
-
The options offered in the gui are what ifconfig -m returns. For example:
[admin@7100.stevew.lan]/root: ifconfig -vvm ixl0
ixl0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        options=48100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,HWSTATS,MEXTPG>
        capabilities=4f507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether 00:e0:ed:86:a6:8c
        inet6 fe80::208:a2ff:fe0e:a591%ixl0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect (10GBase-AOC <full-duplex>)
        status: active
        supported media:
                media autoselect
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        drivername: ixl0
        plugged: SFP/SFP+/SFP28 1X Copper Active (Copper pigtail)
        vendor: BROCADE PN: 58-1000026-01 SN: CAX116410001093 DATE: 2016-10-07
DACs like that usually don't offer more than one speed.
-
@stephenw10 so yeah that doesn't show any options.. My igb0 on the other hand does
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 1000baseT
                media 1000baseT mediaopt full-duplex
                media 100baseTX mediaopt full-duplex
                media 100baseTX
                media 10baseT/UTP mediaopt full-duplex
                media 10baseT/UTP
-
Yeah I can see the list of supported speed / duplex, it's just that setting any of them in the UI doesn't seem to change the media from autoselect:
ixl1: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
        description: OPT10
        options=48100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,HWSTATS,MEXTPG>
        capabilities=4f507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
        ether b4:96:91:b6:27:b5
        inet6 fe80::b696:91ff:feb6:27b5%ixl1 prefixlen 64 scopeid 0x2
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        supported media:
                media autoselect
                media 10Gbase-T
                media 5000Base-T
                media 2500Base-T
                media 1000baseT
                media 100baseTX
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
I've got it running at 1000Mbps right now because the port is forced to that speed on the switch. Incidentally, it hasn't gone down since I set that about 12 hours ago or so.
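To make the "configured vs. actually negotiated" distinction easier to watch over time, here is a small Python sketch that pulls both values out of the `media:` line. It only assumes the line looks like the ones pasted in this thread (`media: Ethernet <configured> (<active>)`); `parse_media` and `is_forced` are hypothetical helper names, not part of pfSense or FreeBSD.

```python
import re

# Matches lines such as:
#   media: Ethernet autoselect (10Gbase-T <full-duplex>)
#   media: Ethernet 1000baseT <full-duplex> (none)
MEDIA_RE = re.compile(
    r"^\s*media:\s+Ethernet\s+(.*?)\s*(?:\((.*)\))?\s*$",
    re.MULTILINE,
)


def parse_media(ifconfig_text):
    """Return (configured, active) from the 'media:' line, or None if absent.

    'configured' is the requested media (e.g. 'autoselect'); 'active' is the
    part in parentheses, i.e. what the link is currently running at.
    """
    match = MEDIA_RE.search(ifconfig_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def is_forced(configured):
    """True if the configured media is anything other than autoselect."""
    return configured != "autoselect"


ixl1 = "\tmedia: Ethernet autoselect (1000baseT <full-duplex>)\n\tstatus: active\n"
bge1 = "\tmedia: Ethernet 1000baseT <full-duplex> (none)\n\tstatus: no carrier\n"

for name, text in (("ixl1", ixl1), ("bge1", bge1)):
    configured, active = parse_media(text)
    print(name, "configured:", configured, "| active:", active, "| forced:", is_forced(configured))
```

On the two samples above, ixl1 comes back as autoselect while bge1 comes back as forced, which matches what the UI setting did (and did not) change.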
-
Hmm, I wonder if it's misreporting 'autoselect' there. If the switch side is set to 1G fixed the NIC should not be able to negotiate with it.
I guess we'll see if it makes any difference anyway.
-
@stephenw10 said in Problem with NICs flapping at intervals of 5 mins:
Hmm, I wonder if it's misreporting 'autoselect' there. If the switch side is set to 1G fixed the NIC should not be able to negotiate with it.
I guess we'll see if it makes any difference anyway.
I'm wondering if the setting on the switch side is really forcing speed/duplex or just forcing it to auto negotiate to a predetermined speed. If that makes sense. In other words it's still autoselect, but the list of possible values has been narrowed.
-
Yes, that could certainly be the case.
You can set that in pfSense using sysctl:
[admin@7100.stevew.lan]/root: sysctl -d dev.ixl.0.advertise_speed
dev.ixl.0.advertise_speed: Control advertised link speed.
Flags:
        0x1 - advertise 100M
        0x2 - advertise 1G
        0x4 - advertise 10G
        0x8 - advertise 20G
        0x10 - advertise 25G
        0x20 - advertise 40G
        0x40 - advertise 2.5G
        0x80 - advertise 5G
Set to 0 to disable link.
Use "sysctl -x" to view flags properly.
But if it is still negotiating and doesn't lose link at 1G that might also be a clue.
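Since the sysctl takes a bitmask, a quick Python sketch for working out the value may help. The flag table is taken directly from the `sysctl -d` output above; the `encode`/`decode` helpers are just illustrative names, and using `dev.ixl.1.advertise_speed` for ixl1 is an assumption based on the usual unit numbering.

```python
# Flag values as listed by `sysctl -d dev.ixl.0.advertise_speed` above.
ADVERTISE_FLAGS = {
    0x1: "100M",
    0x2: "1G",
    0x4: "10G",
    0x8: "20G",
    0x10: "25G",
    0x20: "40G",
    0x40: "2.5G",
    0x80: "5G",
}


def decode(value):
    """List the speeds advertised by a given bitmask value."""
    return [name for bit, name in sorted(ADVERTISE_FLAGS.items()) if value & bit]


def encode(speeds):
    """Build the bitmask that advertises exactly the given speeds."""
    by_name = {name: bit for bit, name in ADVERTISE_FLAGS.items()}
    value = 0
    for speed in speeds:
        value |= by_name[speed]  # raises KeyError for an unknown speed
    return value


# Advertise only 1G, or only 1G and 10G (assumed OID for ixl1: dev.ixl.1.advertise_speed).
print(hex(encode(["1G"])))
print(hex(encode(["1G", "10G"])))
print(decode(0x6))
```

Narrowing the advertised set this way still leaves autonegotiation on, which would match the "autoselect, but with a shorter list" theory above.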