Epyc 3251 and Wireguard
-
Hmm, is it actually panicking and rebooting when that happens?
You have any of the Chelsio hardware off-loading enabled?
-
@stephenw10 Doesn't reboot, just keeps scrolling lines of errors (I assume). Let it go for 10 minutes once, then I rebooted it.
Anywhere I can find those lines or are they not saved? Both LAN and WAN are on the Chelsio card.
ifconfig cxl3
cxl3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN
    options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>
    ether 00:07:43:2c:e5:38
    inet6 fe80::207:43ff:fe2c:e538%cxl3 prefixlen 64 scopeid 0x8
    inet 32.219.x.x netmask 0xfffff800 broadcast 32.219.239.255
    media: Ethernet 10Gbase-LR <full-duplex,rxpause,txpause>
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

ifconfig cxl2
cxl2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: LAN
    options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>
    ether 00:07:43:2c:e5:30
    inet6 fe80::207:43ff:fe2c:e530%cxl2 prefixlen 64 scopeid 0x7
    inet 10.12.8.1 netmask 0xffffffc0 broadcast 10.12.8.63
    inet 10.255.255.1 netmask 0xffffffff broadcast 10.255.255.1
    media: Ethernet 10Gbase-LRM <full-duplex,rxpause,txpause>
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
Also, it's been running fine for a few hours with WireGuard disabled. I enabled WG, and as soon as I tried to connect to the pfSense WebGUI on the other side it went down again.
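On the hardware off-loading question, the options=...<...> field in the ifconfig output above is where the enabled capabilities show up. Below is a minimal Python sketch that pulls the offload-related flags out of that field. The OFFLOAD_FLAGS set and the enabled_offloads helper are illustrative names chosen here, and the set only covers a handful of common FreeBSD capability names, so treat it as a starting point rather than a complete list.

```python
import re

# Assumption: a hand-picked subset of FreeBSD ifconfig capability names
# that relate to hardware offload. Not an exhaustive list.
OFFLOAD_FLAGS = {
    "RXCSUM", "TXCSUM", "RXCSUM_IPV6", "TXCSUM_IPV6",
    "VLAN_HWCSUM", "VLAN_HWTSO", "TSO4", "TSO6", "LRO",
}

def enabled_offloads(ifconfig_text):
    """Return offload flags from the first options=<hex><...> field, in order."""
    match = re.search(r"options=[0-9a-f]+<([^>]*)>", ifconfig_text)
    if match is None:
        return []
    return [flag for flag in match.group(1).split(",") if flag in OFFLOAD_FLAGS]

# The options line as posted for cxl3 (cxl2 shows the same value).
sample = (
    "options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
    "VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>"
)

offloads = enabled_offloads(sample)
print(offloads)
```

Run against the posted line, this lists only the checksum-related flags; TSO4, TSO6 and LRO do not appear in the output for either interface.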
-
I would expect to see something in the system log when that happens.
That combination of things is not something I've seen before though. I'll run it past the devs tomorrow and see if any of them have.
Steve
-
@stephenw10 Will have an update in a few minutes.
Disconnected the Chelsio card and put WAN and LAN on the gig ports. It did the same thing.
Took a look at the WireGuard config and found the gateways were reversed. Started thinking: if that got screwy in the config restore, what else did?
So I completely removed WG and all config from it.
Rebooted, did a backup, removed all traces of WG from it and restored.
Just came back up now and waiting for the package reinstall.
Once done, I'll reinstall WG, recreate all tunnels and see what happens. I did let it go through the whole process on the last crash and got the dump files if needed.
Will let you know how it goes.
-
@stephenw10
Still no good.
Just created 1 tunnel. It comes up fine but as soon as I try to use it, gone.
-
Hmm, still showing issues in the Chelsio driver.
Panic:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 12; apic id = 0c
instruction pointer = 0x20:0xffffffff8065f3d9
stack pointer = 0x28:0xfffffe009a0fb540
frame pointer = 0x28:0xfffffe009a0fb570
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq323: t5nex0:3a2)
trap number = 9
panic: general protection fault
cpuid = 12
time = 1661734572
KDB: enter: panic
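Two fields in the panic message are easy to overlook: "time" is in Unix epoch seconds, and "current process" names the kernel thread that was running when the trap hit. A small Python sketch for decoding both follows. The variable names are illustrative, and the sample string is a trimmed copy of the panic text above.

```python
import re
from datetime import datetime, timezone

# Trimmed copy of the panic message posted above.
panic_text = (
    "Fatal trap 9: general protection fault while in kernel mode "
    "cpuid = 12; apic id = 0c "
    "current process = 12 (irq323: t5nex0:3a2) "
    "trap number = 9 "
    "panic: general protection fault "
    "cpuid = 12 time = 1661734572"
)

# "time = N" is seconds since the Unix epoch; render it as UTC.
time_match = re.search(r"\btime = (\d+)", panic_text)
panic_time = datetime.fromtimestamp(int(time_match.group(1)), tz=timezone.utc)

# "current process = PID (name)" identifies the running kernel thread.
proc_match = re.search(r"current process\s*=\s*(\d+) \(([^)]*)\)", panic_text)
pid = int(proc_match.group(1))
thread_name = proc_match.group(2)

print(panic_time.isoformat())  # 2022-08-29T00:56:12+00:00
print(pid, thread_name)
```

The thread name here is an interrupt thread on t5nex0, which matches the Chelsio frames (t4_intr, cxgbe_transmit) in the backtrace.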
Backtrace:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100213 td 0xfffff80005df0000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009a0fb250
vpanic() at vpanic+0x197/frame 0xfffffe009a0fb2a0
panic() at panic+0x43/frame 0xfffffe009a0fb300
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009a0fb360
trap() at trap+0x67/frame 0xfffffe009a0fb470
calltrap() at calltrap+0x8/frame 0xfffffe009a0fb470
--- trap 0x9, rip = 0xffffffff8065f3d9, rsp = 0xfffffe009a0fb540, rbp = 0xfffffe009a0fb570 ---
cxgbe_transmit() at cxgbe_transmit+0x19/frame 0xfffffe009a0fb570
ether_output_frame() at ether_output_frame+0xb4/frame 0xfffffe009a0fb5a0
ether_output() at ether_output+0x676/frame 0xfffffe009a0fb620
ip_output() at ip_output+0x136c/frame 0xfffffe009a0fb770
ip_forward() at ip_forward+0x39e/frame 0xfffffe009a0fb840
ip_input() at ip_input+0x850/frame 0xfffffe009a0fb8f0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe009a0fb940
ether_demux() at ether_demux+0x16a/frame 0xfffffe009a0fb970
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe009a0fb9d0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe009a0fba20
ether_input() at ether_input+0x89/frame 0xfffffe009a0fba80
service_iq_fl() at service_iq_fl+0x5d2/frame 0xfffffe009a0fbb30
t4_intr() at t4_intr+0x2d/frame 0xfffffe009a0fbb50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe009a0fbbb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe009a0fbbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009a0fbbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
But wireguard was no longer running on it when happened?
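For anyone comparing backtraces across crashes (for example the Chelsio run against an igb0 run), the frame right after the first "--- trap" marker is the function that faulted, and the frames below it are its callers. Here is a rough Python sketch of that reading. FRAME_RE and frames_after_trap are names made up for this example, and the sample is a shortened excerpt of the backtrace above.

```python
import re

# Shortened excerpt of the DDB backtrace posted above.
backtrace = """\
calltrap() at calltrap+0x8/frame 0xfffffe009a0fb470
--- trap 0x9, rip = 0xffffffff8065f3d9, rsp = 0xfffffe009a0fb540, rbp = 0xfffffe009a0fb570 ---
cxgbe_transmit() at cxgbe_transmit+0x19/frame 0xfffffe009a0fb570
ether_output_frame() at ether_output_frame+0xb4/frame 0xfffffe009a0fb5a0
ether_output() at ether_output+0x676/frame 0xfffffe009a0fb620
ip_output() at ip_output+0x136c/frame 0xfffffe009a0fb770
ip_forward() at ip_forward+0x39e/frame 0xfffffe009a0fb840
"""

# Matches lines of the form: name() at name+0xOFF/frame 0xADDR
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+$")

def frames_after_trap(text):
    """Return function names between the first '--- trap' marker and the next one."""
    names = []
    seen_trap = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("--- trap"):
            if seen_trap:
                break  # a second marker ends the section of interest
            seen_trap = True
            continue
        frame = FRAME_RE.match(line)
        if seen_trap and frame:
            names.append(frame.group(1))
    return names

frames = frames_after_trap(backtrace)
faulting = frames[0]
print(faulting)
print(" <- ".join(frames))
```

On this excerpt the faulting function is cxgbe_transmit, reached through ip_forward, which suggests the kernel was forwarding a packet out a Chelsio port when it trapped. Running the same check on a dump taken with igb0 as WAN would show whether the faulting frame moves to a different driver.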
-
What's the WG tunnel connected to there? Another pfSense install?
-
@stephenw10 said in Epyc 3251 and Wireguard:
Hmm, still showing issues in the Chelsio driver.
I assume that's only because my WAN is on the chelsio at the time. I didn't check when I disconnected the chelsio card but I would also assume it would've shown as the igb0 at that time.
But wireguard was no longer running on it when happened?
Probably the cause right there. WG shutting down when I try to use it?
-
@stephenw10 said in Epyc 3251 and Wireguard:
What's the WG tunnel connected to there? Another pfSense install?
Unfortunately not.
That tunnel goes to an OPNsense box. At least until the vlan0 is fixed.
-
Hmm, so the encrypted WG traffic still runs over the Chelsio NIC, the WAN?
-
@stephenw10 Not really sure what you're asking there.
My WAN is on the Chelsio card (cxl3), the WG tunnel comes up with handshakes, but as soon as I try to access the other side it crashes.
-
Mmm, I'm unsure what you moved to igb0. I would have expected that to have to be the WAN for the WG interface to be running on it.
-
@stephenw10 I moved the WAN to igb0 and disconnected the chelsio card from the motherboard as a test.
The trouble still happened.
So I don't think focusing on the chelsio is the way to go.
It happens with the onboard nics also.
Because it still happened with the onboard NICs, I reinserted the Chelsio and moved WAN back to it.
-
Right, I would agree except that it appeared the error was still on the Chelsio NIC even when it was not carrying WG traffic as I understand it.
It would be good to get a crash report from the igb0 as WAN setup if that's possible. It would be very surprising to see the same error on igb since many people are running WG with an igb parent.
-
@stephenw10 said in Epyc 3251 and Wireguard:
Right, I would agree except that it appeared the error was still on the Chelsio NIC even when it was not carrying WG traffic as I understand it.
How are you coming up with that?
-
@jarhead said in Epyc 3251 and Wireguard:
But wireguard was no longer running on it when happened?
Probably the cause right there. WG shutting down when I try to use it?
I may have read that wrong. But what I meant to ask there was: was WG running on the Chelsio NIC when that crash report was generated?
-
@stephenw10 I'll go through the whole thing again, trying to be more clear.
New router. Backed up old, restored on new changing interfaces as needed.
Wireguard would crash.
Moved WAN and LAN to the onboard igb NICs.
Wireguard would crash.
Since this proves it's not related to the chelsio card, as it wasn't even plugged in to the motherboard, I reinstalled the chelsio and moved WAN and LAN back to it.
Wireguard would crash.
I found some weird errors in my gateways: network 1 was using gateway 2 and network 2 was using gateway 1, when they should be 1 to 1 and 2 to 2. So I uninstalled WireGuard, reinstalled it and recreated one tunnel.
WireGuard crashed and that's the dump I posted here. So focusing on the Chelsio card doesn't seem to be the way to go.
Have you guys used an Epyc 3251 in the office for testing at all?
-
Ah, OK. Sorry I misinterpreted the responses there then.
Is it possible to switch back to igb and try to generate a crash report?
The crash you saw there in cxgbe looks very similar to some we have seen in other drivers, but I would expect that to be fixed in igb. I'm not aware of any testing that has been done on an Epyc platform device.
Steve
-
@stephenw10 Definitely possible, just don't know when I can get to it. Got a lot of painting to do tonight so maybe I can play around as I watch the paint dry.
-
Ha, sounds like a good option!
I'll keep digging here, see if anyone has any suggestions.