Help, Random "Hot Plug" Events!
-
Hey All,
I’ve been really struggling to find the source of my random network disconnects for a while and I’m hoping to get some advice on how to troubleshoot further. I'd really appreciate any advice or suggestions.
My Setup:
I have one of those Fanless Chinese PCs with 4 X i225 (v3) 2.5gbe ports running pfSense Plus v23.05.1 (vanilla install, no apps). It’s connected to a 2.5gbe managed QNAP switch (qsw-m2116p-2t2s) at a link speed of 2.5gbe.
The Problem:
Throughout the day I get random “hot plug” events (disconnects) on the LAN interface that connects the router to the QNAP switch. After reviewing the pfSense logs, the drops seem completely random. I can’t see anything suspicious leading up to a drop, and the drops don't happen at the same time of day or at a constant interval (hence, random). See the "hot plug" event that happened this morning for an example.
What Might Be Going On:
- I’ve heard that the i225(v1/v2)/i226 chipset has been known to suffer from network disconnects. After doing a little research, though, it seems largely fixed. In fact, Netgate uses the i225 NICs in their Intel-based appliances and doesn't seem to have an issue.
- Perhaps there is a compatibility issue between the i225/i226 nics on the router and the 2.5gbe nics on the QNAP switch?
Things I’ve Tried:
- I tried rebooting the switch and the router
- I tried replacing the Ethernet cable with several others
- I tried using another 2.5gbe port on the switch
- I tried setting the port speed on the switch to 2.5gbe instead of auto negotiate
- I made sure EEE was disabled in pfSense (which I discovered it is by default)
- I reinstalled a fresh copy of pfSense (no apps, vanilla install)
- I exchanged the switch for a new one (yes, exact same model)
- I tried a NEW Fanless Chinese PC, but this one had the newer i226 chipset
Things I Could Try:
- I could try reducing the port speed to 1gbe instead of 2.5gbe on the switch
  - Even if this works, I really don’t want to do this for obvious reasons
- I could try using one of the 10gbe ports on the switch instead of one of the 2.5gbe ports
  - Even if this works, I don’t really want to sacrifice one of my 10gbe ports
- I could try another 2.5gbe switch (different brand and/or model)
- Try another router not running intel i225/i226 nics?
- Maybe there is some switch setting to check?
- Maybe some pfSense setting to check? On/Off hardware offloading?
- Maybe some setting in the bios or bios update?
I'd love to hear your thoughts or suggestions on anything else to consider. On a side note: Is there anyone else out there that's running a 2.5gbe pfsense router connecting to a 2.5gbe switch that has a stable setup? I'd be curious as to what your setup/gear consists of.
Thanks in advance !
pfSense Logs:
Aug 23 08:02:31 check_reload_status 436 Linkup starting igc1.40
Aug 23 08:02:31 check_reload_status 436 Linkup starting igc1.127
Aug 23 08:02:31 check_reload_status 436 Linkup starting igc1.20
Aug 23 08:02:31 check_reload_status 436 Linkup starting igc1.100
Aug 23 08:02:31 kernel igc1.30: link state changed to UP
Aug 23 08:02:31 kernel igc1.40: link state changed to UP
Aug 23 08:02:31 kernel igc1.127: link state changed to UP
Aug 23 08:02:31 kernel igc1.20: link state changed to UP
Aug 23 08:02:31 kernel igc1.100: link state changed to UP
Aug 23 08:02:31 kernel igc1: link state changed to UP
Aug 23 08:02:31 check_reload_status 436 Linkup starting igc1
Aug 23 08:02:28 php-fpm 21392 /rc.linkup: DEVD Ethernet detached event for opt5
Aug 23 08:02:28 php-fpm 21392 /rc.linkup: Hotplug event detected for GUEST(opt5) static IP address (4: 192.168.40.1)
Aug 23 08:02:28 php-fpm 10095 /rc.linkup: DEVD Ethernet detached event for opt7
Aug 23 08:02:28 php-fpm 10095 /rc.linkup: Hotplug event detected for VPN(opt7) static IP address (4: 192.168.127.1)
Aug 23 08:02:28 php-fpm 80452 /rc.linkup: DEVD Ethernet detached event for lan
Aug 23 08:02:28 php-fpm 80452 /rc.linkup: Hotplug event detected for LAN(lan) dynamic IP address (4: 192.168.10.1, 6: track6)
Aug 23 08:02:28 php-fpm 38979 /rc.linkup: DEVD Ethernet detached event for opt3
Aug 23 08:02:28 php-fpm 38979 /rc.linkup: Hotplug event detected for IOT(opt3) static IP address (4: 192.168.20.1)
Aug 23 08:02:28 php-fpm 70223 /rc.linkup: DEVD Ethernet detached event for opt6
Aug 23 08:02:28 php-fpm 29521 /rc.linkup: DEVD Ethernet detached event for opt4
Aug 23 08:02:28 php-fpm 29521 /rc.linkup: Hotplug event detected for WORK(opt4) static IP address (4: 192.168.30.1)
Aug 23 08:02:28 check_reload_status 436 Reloading filter
Aug 23 08:02:28 check_reload_status 436 Reloading filter
Aug 23 08:02:28 php-fpm 70223 /rc.linkup: Hotplug event detected for MGMT(opt6) static IP address (4: 192.168.100.1)
Aug 23 08:02:27 check_reload_status 436 Linkup starting igc1.30
Aug 23 08:02:27 check_reload_status 436 Linkup starting igc1.40
Aug 23 08:02:27 check_reload_status 436 Linkup starting igc1.127
Aug 23 08:02:27 check_reload_status 436 Linkup starting igc1.20
Aug 23 08:02:27 kernel igc1.30: link state changed to DOWN
Aug 23 08:02:27 check_reload_status 436 Linkup starting igc1.100
Aug 23 08:02:27 kernel igc1.40: link state changed to DOWN
Aug 23 08:02:27 kernel igc1.127: link state changed to DOWN
Aug 23 08:02:27 kernel igc1.20: link state changed to DOWN
Aug 23 08:02:27 kernel igc1.100: link state changed to DOWN
Aug 23 08:02:27 kernel igc1: link state changed to DOWN
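In case it helps anyone sifting through a longer log, here is a minimal Python sketch of how the kernel "link state changed" lines could be paired up into outages. It is only an illustration: the sample lines are copied from the excerpt above, the function names (`link_events`, `outages`) are my own, and the year is an assumption because the log timestamps don't include one.

```python
import re
from datetime import datetime

# A few lines copied from the log excerpt above (newest first, as pfSense shows them).
LOG = """\
Aug 23 08:02:31 kernel igc1.30: link state changed to UP
Aug 23 08:02:31 kernel igc1: link state changed to UP
Aug 23 08:02:28 php-fpm 80452 /rc.linkup: DEVD Ethernet detached event for lan
Aug 23 08:02:27 kernel igc1.30: link state changed to DOWN
Aug 23 08:02:27 kernel igc1: link state changed to DOWN
"""

# Matches only the kernel link-state lines; everything else is ignored.
LINK_RE = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) kernel (\S+): link state changed to (UP|DOWN)$"
)


def link_events(text, year=2023):
    """Return (timestamp, interface, state) tuples in chronological order.

    The year is an assumption: these log lines do not carry one.
    """
    events = []
    # The log is newest-first, so walk it backwards to keep same-second order.
    for line in reversed(text.splitlines()):
        m = LINK_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3)))
    return sorted(events, key=lambda e: e[0])  # stable sort


def outages(events):
    """Pair each DOWN with the next UP on the same interface."""
    down_since = {}
    result = []
    for ts, iface, state in events:
        if state == "DOWN":
            down_since.setdefault(iface, ts)
        elif iface in down_since:
            start = down_since.pop(iface)
            result.append((iface, start, (ts - start).total_seconds()))
    return result


for iface, start, seconds in outages(link_events(LOG)):
    print(f"{iface}: down at {start:%b %d %H:%M:%S} for {seconds:.0f}s")
```

Run over a full day of logs, the list of start times makes it easier to check whether the drops cluster at particular hours or intervals.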
-
You tried a completely different device as the router and still behaves exactly the same?
That looks like some low level incompatibility between the switch and i225/226 to me. Trying a different switch there would be a good test.
You don't mention it but I assume you tried different igc ports assigned as LAN?
It's only ever the LAN port (connected to that switch) that goes down?
Can you try a lagg of two igc ports to the switch?
Steve
-
Thanks for the reply. Here are some thoughts on your suggestions (which were really good by the way):
You tried a completely different device as the router and still behaves exactly the same?
Yep, I tried a completely different fanless Chinese box with different NICs (the newer i226 chipset). I don't have anything else lying around with multiple 2.5gbe NICs to try as a router.
That looks like some low level incompatibility between the switch and i225/226 to me. Trying a different switch there would be a good test.
Trying a different 2.5gbe switch is on my list of things to try. I have access to a different model of QNAP 2.5gbe managed switch, but it will take a day or so to set up. I wish I had access to another brand of 2.5gbe managed switch, but there really isn't that much selection out there.
You don't mention it but I assume you tried different igc ports assigned as LAN?
You're right, I didn't try another port on the router side. I figured when I tried a whole new router, it was in effect trying a new router port as well. But now you've got me curious, so I've just plugged igc2 into the switch to see what happens overnight.
It's only ever the LAN port (connected to that switch) that goes down?
Yeah, I only have igc0 (WAN) and igc1 (LAN) in use right now. Obviously, between the two, only one (LAN) is plugged into the switch, so it's the only one that goes down. My WAN never does.
Can you try a lagg of two igc ports to the switch?
Yeah, I can try that. I'm guessing the thought behind that is that it would add some redundancy if one of the links went down? That may solve my problem, but somehow feels like a band aid. It would be really interesting if both links went down while lagged. Although, I'm not sure exactly what that would mean, maybe that "low-level incompatibility" you mentioned?
Well, I'll give that all a try and report back as soon as I can.
Thanks!
-
Yes, I'm assuming some low level issue that a lagg might work around. And, yes, if both links went down at the same time that might tell us more. It's also cheaper than replacing the switch.
-
Just an update:
I had a few network disconnects last night and this morning on BOTH ports on the router (igc1 and igc2) at the same time for every disconnect. I'm not sure exactly what that behavior could indicate, but I'm guessing LAGG isn't going to work if both ports like to disconnect at the same time. Maybe there is a way to see more verbose logging in pfSense outside of the Status>System Logs>System>General logs? Just a thought...
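To put a number on "at the same time", something like the following Python sketch could group DOWN events from several ports into shared drops. The timestamps below are made up purely for illustration (they are not from my logs), and `group_simultaneous` and the 2-second window are my own assumptions.

```python
from datetime import datetime, timedelta

# Hypothetical DOWN events (interface, timestamp); invented values for illustration only.
DOWN_EVENTS = [
    ("igc1", datetime(2023, 8, 24, 1, 15, 2)),
    ("igc2", datetime(2023, 8, 24, 1, 15, 2)),
    ("igc1", datetime(2023, 8, 24, 6, 40, 51)),
    ("igc2", datetime(2023, 8, 24, 6, 40, 52)),
]


def group_simultaneous(events, window=timedelta(seconds=2)):
    """Group DOWN events that start within `window` of the first event in a group."""
    groups = []
    for iface, ts in sorted(events, key=lambda e: e[1]):
        if groups and ts - groups[-1]["start"] <= window:
            groups[-1]["ifaces"].add(iface)
        else:
            groups.append({"start": ts, "ifaces": {iface}})
    return groups


groups = group_simultaneous(DOWN_EVENTS)
shared = [g for g in groups if len(g["ifaces"]) > 1]
print(f"{len(shared)} of {len(groups)} drops hit more than one port")
```

If nearly every drop hits both ports together, that would point away from a single port or cable and toward something the two links have in common.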
I could still try:
- Replacing the switch with another 2.5gbe switch (different model, but still QNAP). I'll have to do this over the weekend.
- Explicitly set the port speed to 2.5gbe. When I tried this before, I only set the port speed on the switch end. I just realized I didn't set it on the router end. I'll try setting the port speed on both ends and see what happens. Probably won't help, but I'll try anything at this point.
Thanks again, for all of the suggestions.
-
Hmm, is there any logging in the switch? Anything on any other port on the switch show the link bounce?
Unfortunately there really isn't any additional logging to be had there. It sees the link go down. Nothing is logged in pfSense that might be a cause of it.
-
Yeah, the current QNAP switch I have does have logs, but it only logs when a port goes down/up and what time. The other QNAP switch doesn't even have logs (at least in their GUI).
Well, I guess I'll keep trying some of those suggestions and report back. I figure documenting my journey trying to get this exact 2.5gbe router/switch combo running might be helpful in case someone else runs into the same issue. Sometimes it's nice to know you're not alone when stuff like this happens
Thanks!
-
UPDATE:
Hey all, I tried the following over the weekend to no avail:
- Replaced my current 2.5gbe QNAP switch with another (different model) 2.5gbe QNAP managed switch. I still saw random "hot plug" events even with a different switch. Granted, this wasn't the best test because there is a high chance that QNAP uses the same NICs and software in all of their 2.5gbe managed switches.
- Explicitly set the port speed on both the router and the switch. Unfortunately, this didn't help any.
Things I could still try (and thoughts):
- I'm tempted to get my hands on a Netgate 4100/6100 to see if it will work with my QNAP switch. Although, both the 4100/6100 have the same NIC chipset (i225) as the fanless Chinese boxes I'm using. I suppose if the 4100/6100 works, then that would mean there is probably some low level incompatibility with the fanless Chinese boxes that I'm using (e.g. BIOS FW, etc.)?
- I'm tempted to try a new 2.5gbe managed switch by another brand. Although, I'm not sure at the moment which brand other than QNAP has an affordable 2.5gbe managed switch. I suppose if a new switch works, then that probably means a low level incompatibility with the QNAP switches I'm using.
I really wish I could find someone out there running a stable 2.5gbe router/switch setup to see what gear they're running. Anyways, I'll try to post back when I have more updates.
-
We actually use a QNAP switch for testing 2.5G NICs here. It's a QSW-M2108-2C. No problems between that and the i225 or i226 NICs in the Netgate hardware.
-
@uplink Just wondering if you figured this out. I have just set up a Trigkey mini PC (i225-v) with dual NICs. I also get the exact same issue on my LAN interface. It's currently hooked up to a new D-Link switch that supports 2.5 Gb. Debating putting my old switch back in, but it's only gigabit.....
-
It's a good test even if you don't keep it there permanently.
-
Hey @mark77ap
It seems I forgot to give an update on where I ended up, thanks for the reminder!
Let's see... When I last posted I was going to try a Netgate appliance and/or try another 2.5gbe switch (non-QNAP). Unfortunately, I didn't do either. However, I did find a quasi-solution that seemed to work. I ended up buying an SFP+ RJ45 module for the switch and have the Chinese fanless router plugged into that. So far, it's been about a month and a half without a single drop. I'm glad it works, but it's not ideal. I don't like that it's occupying one of my 10gbe SFP+ ports when it could be using one of the 16 2.5gbe ports. I also don't like that I kinda have to run a "router on a stick" setup because the other 2 available LAN ports on the router are essentially useless with my switch.
Hey @stephenw10 It's interesting to know that you've used the QNAP QSW-M2108-2C in your testing. That switch is very close in spec to the 2 QNAP switches that I have. I assume that you were testing with Netgate hardware? This makes me think the issue might be on my router end. I may have to pick up a 4100 someday and give that a try :)
@mark77ap - That's a good test, I'd be curious to know if you have better luck on your old gigabit switch.
-
@uplink Thanks for the update.
The switch I have (D-Link DMS-107) has 2 x 2.5 Gb ports and 5 x 1 Gb ports. I was getting nonstop drops when my LAN connection was plugged into the 2.5 Gb port. I moved it to a 1 Gb port and it has been stable so far for 12 hours (but at 1 Gb :( ). My router is plugged into my modem's 2.5 Gb port and has not seen any drops, which is odd.
I have since reinstalled pfSense and upgraded to Plus, and I'm retrying the 2.5 Gb ports. Fingers crossed, but it seems unlikely a re-install is going to fix this.
I did check in my BIOS and the i225 is the third revision, so rumour is it should be OK, but based on Google results these NICs seem to be plagued with issues.
Was the SFP module running at 2.5 Gb or at 10 Gb?
-
If you have any power saving options in the BIOS for the NICs or PCIe bus, like ASPM, I would try disabling that. I have seen that resolve link issues in some NICs.
-
Yeah, I tried the same thing (upgrading to Plus) and it didn't work for me. Hope you have better luck than I did. If I remember correctly, I think I also tried a 1Gbe switch and had success there too, so that doesn't surprise me. Of course that's not ideal, since it wastes the 2.5gbe port on the router.
My router is plugged into my modems 2.5GB port and has not seen any drops which is odd.
Were you getting drops on the WAN to your modem too? Is your router WAN 2.5Gbe and your modem 2.5Gbe? I thought the drops were only on the LAN interface on the router?
So, my switch is reporting that my SFP+ RJ45 module is connected at 10Gbe and pfsense is reporting 2.5Gbe. My SFP+ module is capable of negotiating down to 2.5Gbe so I think it's just the switch reporting incorrectly (which is common). I also tested the throughput and it's indeed 2.5gbe.
@stephenw10
Good idea, I might take a look for that in the BIOS later today and see if I have any power saving options like that.
-
If you have any power saving options in the BIOS for the NICs or PCIe bus, like ASPM, I would try disabling that. I have seen that resolve link issues in some NICs.
Surprisingly, I do not have any options in the BIOS for the onboard ethernet. Nothing, I can't even see a place to disable the adapters let alone any energy saving options.
As for the PCI bus options, I did see 4 "PCI Express Root Port" options which I presume are my 4x 2.5Gbe NICs. I checked each one and they all have ASPM disabled already. However, I did see a "DMI Link ASPM Control" option set to "L1". If I understand this PCI stuff correctly, that "DMI Link" is the link between the Southbridge and the CPU and the "PCI Express Root Port" is the link between the Southbridge and the device.
Maybe I'll try setting the DMI Link to "disable" and see if that helps? Haha, I'll try anything, I can always change it back
-
Yeah, it likely wouldn't be a setting for the NIC(s) specifically but rather one for the PCIe bus/lanes, if it is exposed in that BIOS at all.
-
Same here, the BIOS is definitely not the same as a normal PC lol. I did manage to change the turbo efficient mode to off, but I'm not sure that is really going to do much other than give me some more CPU cycles.
Even though I had fewer drops yesterday with pfSense+, I did still get some overnight. I have switched my Ethernet cable (I really have my doubts this is it) and will see how it goes.
Only thing I have left to try is to buy another 2.5GB switch and try that.
Mark
-
Well, it has been 3 days with the new cable and no connection drops. Not sure how this cable just went bad, or maybe it always was bad and would work at 1 Gb but not at 2.5 Gb.
Mark
-
@mark77ap That's exactly why cables are rated at different "cats". Lots of esoteric sounding calculations relating to transmission lines come into play at different frequencies, so yes a cable may be perfectly fine at 1G and fail at 2.5G or higher.
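As a rough illustration of why the jump matters, here is a small Python sketch comparing the nominal per-pair symbol rates of the common BASE-T speeds. The figures are the ones I recall from the standards, so treat them as approximate and double-check before relying on them; the dictionary and function names are just for this example.

```python
# Nominal per-pair symbol rates (MBd) for BASE-T Ethernet over four twisted pairs.
# Values quoted from memory of the standards; treat as approximate.
SYMBOL_RATE_MBAUD = {
    "1000BASE-T": 125,
    "2.5GBASE-T": 200,
    "5GBASE-T": 400,
    "10GBASE-T": 800,
}


def relative_to_gigabit(standard):
    """How much faster the line signalling is compared with 1000BASE-T."""
    return SYMBOL_RATE_MBAUD[standard] / SYMBOL_RATE_MBAUD["1000BASE-T"]


for name, rate in SYMBOL_RATE_MBAUD.items():
    print(f"{name}: {rate} MBd per pair, {relative_to_gigabit(name):.1f}x gigabit")
```

A faster symbol rate means the signal occupies higher frequencies, where attenuation and crosstalk in a marginal cable or connector are worse, so a link can hold at 1G and still bounce at 2.5G.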