Upgrade from 23.09.1 to 24.03 Completes Successfully, But NIC Will No Longer Pass Traffic
-
Howdy! First - THANK GOD FOR BOOT ENVIRONMENTS!!
My system is a whitebox (SuperMicro) running pfSense Plus (initially 23.09.1), and has been rock stable for years (on this and previous releases). I finally got around to upgrading to 24.03 this afternoon, and the update finished just fine. I was using an IPMI connection to monitor the update/reboot and after it came up, everything looked just fine, BUT, I couldn't ping my LAN address, and none of my devices could actually access the internet over that interface. Other systems on other interfaces/VLANs were fine, but my LAN is a 25G connection between a Unifi Switch and the SuperMicro pfSense, and while the link was up, and everything LOOKED ok, no traffic would pass.
I left it that way for about 30 minutes, figuring it was just all my packages reinstalling, etc. When it still didn't work, I rebooted from the console, and I still couldn't ping the LAN interface, even though the command line says it was UP. pfSense also couldn't ping OUT from that interface either.
I reverted to 23.09.1, and interestingly enough, it STILL didn't come back. Thinking I was going crazy, I rebooted again... And then everything came up fine... Investigation started...
I let it stabilize for another 30 min, working perfectly fine. Rebooted into the 24.03 Boot Environment, successful boot, waited for another 10-15 min... Still no LAN interface. Reboot again into 24.03, wait 10-15... No traffic.
Rebooted into 23.09.1, booted successfully, no traffic on LAN... Wait 5 min, reboot again, into 23.09.1, traffic is back to normal...
I did this whole iteration twice to see if it was repeatable. It is.
So... Something is messing with my interface in 24.03 such that it requires 2 reboots in 23.09.1 to resolve the problem.
So I'm back in 23.09.1, hoping someone here knows what MIGHT be going on.
Thanks!!
Device Specifics:
System: Supermicro SYS-E300-8D
Problem NIC:
ixl0@pci0:7:0:0: class=0x020000 rev=0x02 hdr=0x00 vendor=0x8086 device=0x158b subvendor=0x8086 subdevice=0x0002
    vendor = 'Intel Corporation'
    device = 'Ethernet Controller XXV710 for 25GbE SFP28'
    class = network
    subclass = ethernet
    bar [10] = type Prefetchable Memory, range 64, base 0xf7000000, size 16777216, enabled
    bar [1c] = type Prefetchable Memory, range 64, base 0xf8808000, size 32768, enabled
    cap 01[40] = powerspec 3 supports D0 D3 current D0
    cap 05[50] = MSI supports 1 message, 64 bit, vector masks
    cap 11[70] = MSI-X supports 129 messages, enabled
        Table in map 0x1c[0x0], PBA in map 0x1c[0x1000]
    cap 10[a0] = PCI-Express 2 endpoint max data 256(2048) FLR
        max read 4096
        link x8(x8) speed 8.0(8.0) ASPM disabled(L1)
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[140] = Serial 1 103babfffffefd3c
    ecap 000e[150] = ARI 1
    ecap 0010[160] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled
        0 VFs configured out of 64 supported
        First VF RID Offset 0x0110, VF RID Stride 0x0001
        VF Device ID 0x154c
        Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304
    ecap 0017[1a0] = TPH Requester 1
    ecap 000d[1b0] = ACS 1 Source Validation unavailable, Translation Blocking unavailable
        P2P Req Redirect unavailable, P2P Cmpl Redirect unavailable
        P2P Upstream Forwarding unavailable, P2P Egress Control unavailable
        P2P Direct Translated unavailable, Enhanced Capability unavailable
    ecap 0019[1d0] = PCIe Sec 1 lane errors 0
    PCI-e errors = Correctable Error Detected
        Unsupported Request Detected
Edit: Sorry!! I thought I was on the General Forum! Please move post if necessary.
-
-
Do you see a difference in the output of:
ifconfig -vvvm ixl0
between 23.09.1 and 24.03? Is it even showing as linked in 24.03?
Steve
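One way to make that comparison concrete is to save the command output under each release to a text file and diff the two captures on any workstation. A minimal sketch in Python, assuming the captures were saved by hand; the sample text below is made up for illustration and is not output from this system:

```python
import difflib


def diff_captures(old_text, new_text, old_label="23.09.1", new_label="24.03"):
    """Return unified-diff lines between two saved `ifconfig -vvvm ixl0` captures."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
            n=0,  # no context lines: show only what changed
        )
    )


# Hypothetical, heavily shortened captures (192.0.2.1 is a documentation address).
old_capture = "ixl0: flags=<UP,RUNNING>\n\tinet 192.0.2.1 netmask 0xffffff00\n\tstatus: active\n"
new_capture = "ixl0: flags=<UP,RUNNING>\n\tstatus: active\n"

for line in diff_captures(old_capture, new_capture):
    print(line)
```

Lines starting with a single "-" exist only in the older capture, and lines starting with a single "+" exist only in the newer one.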
-
@stephenw10 -- I won't be able to perform the reboot-two-step until later tonight, but I will get outputs of both.
For now, here is what it says on 23.09.1:
ixl0: flags=1008b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
    description: LAN
    options=48100b8<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,VLAN_HWFILTER,HWSTATS,MEXTPG>
    capabilities=4f507bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWFILTER,VLAN_HWTSO,NETMAP,RXCSUM_IPV6,TXCSUM_IPV6,HWSTATS,MEXTPG>
    ether 3c:fd:fe:ab:3b:10
    inet REDACTED
    inet6 REDACTED
    media: Ethernet autoselect (25GBase-CR <full-duplex>)
    status: active
    supported media:
        media autoselect
        media 25GBase-LR
        media 25GBase-SR
        media 25GBase-CR
        media 10GBase-KR
        media 1000Base-KX
        media 10Gbase-LR
        media 10Gbase-SR
        media 10Gbase-Twinax
        media 1000baseLX
        media 1000baseSX
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
Note, I don't THINK the media would matter, but this is a DAC connection, not Fiber or Copper. I did run a similar permutation of the ifconfig command and I do remember it saying "status: active".
I will reboot tonight to get the exact output for 24.03.
-
Hmm, I agree I wouldn't expect the actual link media reported to matter as long as the speed is correct.
-
H'okay... Got a chance to reboot and test. Here is the screenshot (the only way I could get the information because I didn't have SSH and had to use IPMI).
Interesting here, BUT, the inet line is completely missing... Yet the main page says that the IP is assigned. I also selected the command line option to assign an IP to the interface to re-assign the LAN IP, and it made no difference. The output of the command was identical.
I then manually tried to add an IP to the interface, and it didn't like that either:
Took 2 reboots back into 23.09.1 to get the interface back. Interestingly enough, that FIRST reboot back, the output of the command is identical to what is shown in the above screenshot...
Weird.
Another thing I noticed, somehow my ntop-ng got hosed and even in the old boot environment it won't work. This is a different issue, so I will wait to fix that one until we get an idea on what might be going on here.
-
Hmm, curious. I assume specifying the subnet using CIDR notation also fails?
What firmware version does that NIC have? It could be the newer driver trying to use some API update perhaps.
sysctl dev.ixl.0.fw_version
-
@stephenw10 -- Interesting point... I never upgraded the firmware on this NIC. I've had real bad luck with NIC firmwares on some Intel Atom chips, so I avoided it.
I've never updated firmware on BSD... Guess I could crack the case open and see where the NIC came from to get updated firmware.
Output from the command:
dev.ixl.0.fw_version: fw 5.50.47059 api 1.5 nvm 5.51 etid 80002bca oem 1.262.0
Edit: Apparently I got the card on eBay... This is the card:
Intel XXV710-DA2 25GbE Dual-Port Ethernet Network Adapter XXV710DA2BLK
-
Oh that is an old firmware version. Importantly an old API version. I would try upgrading it. You'll probably have to do that from Windows or Linux though.
-
@stephenw10 - H'okay. Believe it or not they have a BSD & EFI version for the latest firmware. I'm going to try the EFI version tonight... I will let you know how it goes.
-
@stephenw10 - Ok... So... Bear with me... I just spent about 6 straight hours troubleshooting and didn't really accomplish much.
Upgraded the firmware on the card, that was a breeze. Took about 15 min through UEFI. Here is the new firmware information:
dev.ixl.0.fw_version: fw 9.140.76856 api 1.15 nvm 9.40 etid 8000ed12 oem 1.269.0
Appears to be much newer, and BONUS, it STILL works with 23.09.1.
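To check that the new firmware really is newer, the dotted fields have to be compared component by component, since "api 1.15" is later than "api 1.5" even though 1.15 is smaller when read as a decimal. A small sketch in Python that parses the two sysctl lines quoted in this thread; the function name and the choice of fields are only for illustration:

```python
import re


def parse_fw_version(sysctl_line):
    """Extract the dotted numeric fields (fw, api, nvm, oem) as tuples of ints."""
    fields = {}
    for name, value in re.findall(r"\b(fw|api|nvm|oem) (\d+(?:\.\d+)*)", sysctl_line):
        fields[name] = tuple(int(part) for part in value.split("."))
    return fields


old = parse_fw_version(
    "dev.ixl.0.fw_version: fw 5.50.47059 api 1.5 nvm 5.51 etid 80002bca oem 1.262.0"
)
new = parse_fw_version(
    "dev.ixl.0.fw_version: fw 9.140.76856 api 1.15 nvm 9.40 etid 8000ed12 oem 1.269.0"
)

# Tuples compare element by element, so (1, 15) sorts after (1, 5).
for field in ("fw", "api", "nvm", "oem"):
    print(field, old[field], "->", new[field], "newer:", new[field] > old[field])
```

Which API version the 24.03 driver actually expects is not stated in this thread, so this only confirms the ordering of the two reported versions.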
Long story short, SAME exact symptoms with 24.03.
So, I decided to factory reset the configuration. After the reboot, I manually reassigned the interfaces to be the correct ones for at least my LAN and WAN, manually set the IP address for the LAN and.... NOTHING. I performed a reboot just for giggles, and, wouldn't you know it, it WORKED. And it was repeatable. 3 reboots later and I was confident that it was 'stable'.
I took my backup config (downloaded from the working 23.09.1), loaded it on the GUI, and it... kinda worked after a reboot. The LAN interface came back, and all the other settings came back, but I got an error for EVERY package that basically said, "Package ABCDEF does not exist in current Netgate pfSense Plus version and it has been removed.", for all 22 of my packages.
I rebooted a couple times, and it again seemed 'stable' on the LAN interface.
I started adding my packages, they all came up and worked no problem... Then I rebooted again... LAN interface was dead again. Reboot 3 more times... still dead. Reboot into the 23.09.1 boot environment, everything is hunky dory again.
So... Maybe it's related to SOMETHING in my configuration related to the packages I have installed?
Here is the list of all 22 packages that I use:
mailreport iperf nmap mtr-nox11 openvpn-client-export acme bandwidthd Cron Status_Traffic_Totals syslog-ng Service_Watchdog System_Patches avahi-daemon arpwatch pimd pfBlockerNG zabbix-agent64 nut WireGuard suricata ntopng
If I had to make a guess, I would suspect that MAYBE it has something to do with either bandwidthd, Status_Traffic_Totals, avahi-daemon, or pimd, because I believe those actually have the ability to muck with the interfaces at a more substantial level than the rest of the packages. I'm reasonably certain that I never actually got media casting across VLANs to work successfully, so I think I can ditch avahi-daemon and pimd. Might be able to nix the others too, but I'm not actually sure if that's really needed.
I'm willing to share my config privately with support if you think there's something in there that might help.
Back on 23.09.1 for now...
-
Ah, well some progress at least. And always good to prove a theory incorrect. Also good to know the new firmware doesn't hurt.
I'd guess bandwidthd or, more likely, Suricata if it's running in in-line mode which uses the NIC in netmap mode and can break everything!
-
@stephenw10 said in Upgrade from 23.09.1 to 24.03 Completes Successfully, But NIC Will No Longer Pass Traffic:
I'd guess bandwidthd or, more likely, Suricata if it's running in in-line mode which uses the NIC in netmap mode and can break everything!
Good to know. My suricata is IDS only, so it shouldn't be mucking with the interface. Tonight I'm hoping to go through this again, reload my config (hoping that it also 'fails' to load the packages), and then I will install one and reboot, rinse and repeat until I find the cranky package.
-
@stephenw10 - Ok... So, I'm at a loss. It HAS to be something with my config, but it's somewhat complex, and I really don't want to create everything by hand.
I reset 24.03 back to factory defaults, configured WAN and LAN, set the IPs, rebooted (working). Rebooted again (working)...
I installed the acme, zabbix, and WireGuard packages... Really low impact, right, and should be completely unrelated to the LAN interface. Install works, reboot... Dead. Reboot. Still dead.
Back to 23.09 I go...
I'm not above getting another NIC with another chipset entirely to try it, BUT this SHOULD work without an issue, and swapping out a NIC is going to kill my Netgate ID, which will kill my paid plus subscription, and to be honest, that whole implementation seems flakey to me, so I don't want to introduce yet another wrinkle.
Kinda at a loss... Really want to upgrade, but I now have NO idea what it could be, without manually recreating my config (consisting of almost a dozen interfaces, 6 VLANs, countless rules, and a ton of Suricata & pfBlocker-NG configurations). That would take a SIGNIFICANT amount of time to re-create and the risk of screwing something up in the details is REALLY a possibility.
Thoughts? I mean... This should work. So what else can I do?
-
Hmm, well of those 3 I'd have to suspect WireGuard. That can at least add an interface. Zabbix and ACME really could not prevent traffic.
-
@stephenw10 - I will try just WireGuard and see what happens. It worked on one of my previous attempts and reboots, so I figured it was safe.
It still leaves me in a pretty crappy situation. I can't swap hardware, because I lose my Plus (different MAC), I can't actually upgrade because, well, it doesn't work.
Anyone else there at Netgate have any ideas? This one happens to be my main router in my home lab, so it's kinda the lynchpin in everything. I DEFINITELY need WireGuard to work.
I guess I can wait until there's another release, but that leaves me in 23.09.1 for a long time without any security enhancements.
I really think it might be something latent in my config. Is there anyone at Netgate who would take a look at the XML? Perhaps there's something I'm not seeing? Maybe you guys have better debug tools?
I'll try to do more testing tomorrow...
-
It does seem like something in your config, I agree. If it's not some package putting the NIC in an odd mode it could be a system tunable you have added.
Are you able to upload the config for us to review here: https://nc.netgate.com/nextcloud/s/fcTw2Dy3FKD7bCK
Steve
-
@stephenw10 - Config uploaded.
Note, specifically about tunables. I've never actually added any, and there are likely some in there from considerably different hardware, IF, that stuff carries forward. I'm not sure what should be there from default, or how to "safely" reset them back to "default", but I'd definitely be willing to try that too.
-
@stephenw10 - HOLY CRAP I think I figured it out. Performing more testing. Will know in a few more reboots once I get the rest of the packages installed.
It looks like it WAS WireGuard in a "this should never have worked" type of scenario...
Will edit post shortly...
Edit: YES!
The issue was with a WireGuard Gateway Monitor IP. It just so happens that the LAN IP of my router and the LAN IP of the router on the other side of the WG Gateway are flipped, (think 192.168.1.1 and 192.1.168.1). Apparently, 23.09.1 didn't care that I had the LAN IP entered in there and was happy to just status something that was always up... 24.03 was none-too-happy with that config, however, and broke the LAN interface because of it.
Troubleshooting:
I enabled access to the webConfigurator through another interface so that I could actually see what was going on, and noticed that there was an issue with that ONE WireGuard gateway, and when I looked into why, I saw it immediately.
Problem. SOLVED. Awesome news on a Friday night, and dare I say it, this one was kinda fun!
-
Wow nice catch! Interesting that worked in 23.09.1. Hmm.
-
@stephenw10 - Right?
Thanks for all the help!!