Epyc 3251 and Wireguard
-
@stephenw10
Can you tell me what's going on here?
Aug 28 00:12:54 pfsense kernel:
Aug 28 00:12:54 pfsense kernel:
Aug 28 00:12:54 pfsense kernel: Fatal trap 9: general protection fault while in kernel mode
Aug 28 00:12:54 pfsense kernel:
Aug 28 00:12:54 pfsense kernel: cpuid = 0;
Aug 28 00:12:54 pfsense kernel: Fatal trap 9: general protection fault while in kernel mode
Aug 28 00:12:54 pfsense kernel: cpuid = 2; apic id = 00
Aug 28 00:12:54 pfsense kernel: apic id = 02
Aug 28 00:12:54 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 00:12:54 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 00:12:54 pfsense kernel: stack pointer = 0x28:0xfffffe009a0dd620
Aug 28 00:12:54 pfsense kernel: stack pointer = 0x0:0xfffffe000055c660
Aug 28 00:12:54 pfsense kernel: frame pointer = 0x28:0xfffffe009a0dd650
Aug 28 00:12:54 pfsense kernel: frame pointer = 0x0:0xfffffe000055c690
Aug 28 00:12:54 pfsense kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 00:12:54 pfsense kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 00:12:54 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 00:12:54 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 00:12:54 pfsense kernel: processor eflags = processor eflags = interrupt enabled, interrupt enabled, resume, resume, IOPL = 0
Aug 28 00:12:54 pfsense kernel: IOPL = 0
Aug 28 00:12:54 pfsense kernel: current process = 0 (if_io_tqg_2)
Aug 28 00:12:54 pfsense kernel: current process = 12 (irq317: t5nex0:2a0)
Aug 28 08:09:41 pfsense kernel:
Aug 28 08:09:41 pfsense kernel:
Aug 28 08:09:41 pfsense kernel: Fatal trap 9: general protection fault while in kernel mode
Aug 28 08:09:41 pfsense kernel: cpuid = 10; apic id = 0a
Aug 28 08:09:41 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:09:41 pfsense kernel: stack pointer = 0x28:0xfffffe009a11e540
Aug 28 08:09:41 pfsense kernel: frame pointer = 0x28:0xfffffe009a11e570
Aug 28 08:09:41 pfsense kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 08:09:41 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 08:09:41 pfsense kernel: processor eflags = interrupt enabled, resume,
Aug 28 08:09:41 pfsense kernel: IOPL = 0
Aug 28 08:09:41 pfsense kernel: current process = 12 (irq330: t5nex0:3a3)
Aug 28 08:09:41 pfsense kernel: trap number = 9
Aug 28 08:17:47 pfsense kernel:
Aug 28 08:17:47 pfsense kernel:
Aug 28 08:17:47 pfsense kernel:
Aug 28 08:17:47 pfsense kernel: Fatal trap 9: general protection fault while in kernel mode
Aug 28 08:17:47 pfsense kernel:
Aug 28 08:17:47 pfsense kernel: cpuid = 10;
Aug 28 08:17:47 pfsense kernel:
Aug 28 08:17:47 pfsense kernel: Fatal trap 9: general protection fault while in kernel mode
Aug 28 08:17:47 pfsense kernel: cpuid = 4; Fatal trap 9: general protection fault while in kernel mode
Aug 28 08:17:47 pfsense kernel: apic id = 0a
Aug 28 08:17:47 pfsense kernel: cpuid = 0;
Aug 28 08:17:47 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:17:47 pfsense kernel: apic id = 00
Aug 28 08:17:47 pfsense kernel: apic id = 04
Aug 28 08:17:47 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:17:47 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:17:47 pfsense kernel: stack pointer = 0x28:0xfffffe009a12d540
Aug 28 08:17:47 pfsense kernel: stack pointer = 0x28:0xfffffe009a11e540
Aug 28 08:17:47 pfsense kernel: stack pointer = 0x28:0xfffffe009a10f540
Aug 28 08:17:47 pfsense kernel: frame pointer = 0x28:0xfffffe009a12d570
Aug 28 08:17:47 pfsense kernel: frame pointer = 0x28:0xfffffe009a10f570
Aug 28 08:17:47 pfsense kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 08:17:47 pfsense kernel: frame pointer = 0x28:0xfffffe009a11e570
Aug 28 08:17:47 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 08:17:47 pfsense kernel: code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 08:17:47 pfsense kernel: processor eflags =
Aug 28 08:17:47 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 08:17:47 pfsense kernel: interrupt enabled, processor eflags = code segment = base 0x0, limit 0xfffff, type 0x1b
Aug 28 08:17:47 pfsense kernel: interrupt enabled, resume,
Aug 28 08:17:47 pfsense kernel: = DPL 0, pres 1, long 1, def32 0, gran 1
Aug 28 08:17:48 pfsense kernel: IOPL = 0
Aug 28 08:17:48 pfsense kernel: resume, processor eflags = IOPL = 0
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq333: t5nex0:3a6)
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq327: t5nex0:3a0)
Aug 28 08:17:48 pfsense kernel: trap number = 9
Just got a Supermicro AS-5019D-FTN4 and it's been crashing constantly.
I think I've narrowed it down to Wireguard; at least it looks that way, since it's been running fine with WG disabled.
The weird thing is that it runs fine with WG enabled until something tries to access something across the tunnel.
Any ideas?
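For anyone trying to read the interleaved log above, here is a minimal Python sketch (not part of the original post) that pulls out the faulting instruction pointers and the "current process" names. The sample lines are copied from the log; the names `summarize`, `IP_RE` and `PROC_RE` are made up for illustration.

```python
import re
from collections import Counter

# A few lines copied from the kernel log above (shortened for illustration).
LOG = """\
Aug 28 00:12:54 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 00:12:54 pfsense kernel: current process = 0 (if_io_tqg_2)
Aug 28 00:12:54 pfsense kernel: current process = 12 (irq317: t5nex0:2a0)
Aug 28 08:09:41 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:09:41 pfsense kernel: current process = 12 (irq330: t5nex0:3a3)
Aug 28 08:17:47 pfsense kernel: instruction pointer = 0x20:0xffffffff8065f3d9
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq333: t5nex0:3a6)
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq327: t5nex0:3a0)
"""

# "instruction pointer = <segment>:<address>"; only the address is captured.
IP_RE = re.compile(r"instruction pointer\s*=\s*0x[0-9a-f]+:(0x[0-9a-f]+)")
# "current process = <pid> (<thread name>)"
PROC_RE = re.compile(r"current process\s*=\s*(\d+) \(([^)]*)\)")


def summarize(log_text):
    """Return (Counter of faulting addresses, list of process names)."""
    addresses = Counter(IP_RE.findall(log_text))
    processes = [name for _pid, name in PROC_RE.findall(log_text)]
    return addresses, processes


addresses, processes = summarize(LOG)
print(addresses)
print(processes)
```

Because it uses `findall`, the same function works on the log even when it is pasted as one long line. On the sample above, every trap reports the same instruction pointer.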
-
@jarhead said in Epyc 3251 and Wireguard:
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq333: t5nex0:3a6)
Aug 28 08:17:48 pfsense kernel: current process = 12 (irq327: t5nex0:3a0)
Is that a Chelsio NIC? t5nex0?
That's where it appears to be failing, which is odd.
Do you have a full crash report with the backtrace?
Steve
-
@stephenw10 I have a Chelsio T540-CR installed.
/var/crash has nothing. Where else would I look?
-
Hmm, is it actually panicking and rebooting when that happens?
Do you have any of the Chelsio hardware offloading enabled?
-
@stephenw10 It doesn't reboot, it just keeps scrolling lines of errors (I assume). I let it go for 10 minutes once, then rebooted it.
Is there anywhere I can find those lines, or are they not saved?
Both LAN and WAN are on the Chelsio card.
ifconfig cxl3
cxl3: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: WAN
    options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>
    ether 00:07:43:2c:e5:38
    inet6 fe80::207:43ff:fe2c:e538%cxl3 prefixlen 64 scopeid 0x8
    inet 32.219.x.x netmask 0xfffff800 broadcast 32.219.239.255
    media: Ethernet 10Gbase-LR <full-duplex,rxpause,txpause>
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
ifconfig cxl2
cxl2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    description: LAN
    options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>
    ether 00:07:43:2c:e5:30
    inet6 fe80::207:43ff:fe2c:e530%cxl2 prefixlen 64 scopeid 0x7
    inet 10.12.8.1 netmask 0xffffffc0 broadcast 10.12.8.63
    inet 10.255.255.1 netmask 0xffffffff broadcast 10.255.255.1
    media: Ethernet 10Gbase-LRM <full-duplex,rxpause,txpause>
    status: active
    nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
Also, it's been running fine for a few hours with Wireguard disabled. I enabled WG, and once I tried to connect to the pfSense WebGUI on the other side it went down again.
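On the offloading question, a rough way to read it off the ifconfig output above is to parse the `options=` line. This is only a sketch in Python: the options string is copied from the cxl3 output, while the helper names and the choice of which capability prefixes count as "offload" are assumptions made for illustration.

```python
import re

# options line copied from the cxl3 / cxl2 ifconfig output above
OPTIONS_LINE = (
    "options=3e800bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,"
    "VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6,TXRTLMT,HWRXTSTMP>"
)

# Assumption: capability names starting with these prefixes are treated as
# hardware offloads here (checksum, TSO, LRO, VLAN hardware features).
OFFLOAD_PREFIXES = ("RXCSUM", "TXCSUM", "TSO", "LRO", "VLAN_HW")


def enabled_caps(line):
    """Return the capability names listed between < and > on an options line."""
    match = re.search(r"options=[0-9a-f]+<([^>]*)>", line)
    return match.group(1).split(",") if match else []


def offload_caps(line):
    """Return only the capabilities that look like hardware offloads."""
    return [cap for cap in enabled_caps(line) if cap.startswith(OFFLOAD_PREFIXES)]


print(offload_caps(OPTIONS_LINE))
```

Going by that list, checksum offload and VLAN hardware tagging are enabled on both ports, while no TSO or LRO capability appears in the options shown.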
-
I would expect to see something in the system log when that happens.
That combination of things is not something I've seen before though. I'll run it past the devs tomorrow and see if any of them have.
Steve
-
@stephenw10 Will have an update in a few minutes.
Disconnected the Chelsio card and put WAN and LAN on the gigabit ports. It did the same thing.
Took a look at the Wireguard config and found the gateways were reversed. That got me wondering: if that got scrambled in the config restore, what else did?
So I completely removed WG and all of its config.
Rebooted, did a backup, removed all traces of WG from it and restored.
It just came back up and I'm waiting for the package reinstall.
Once done, I'll reinstall WG, recreate all tunnels and see what happens.
I did let it go through the whole process on the last crash and got the dump files if needed.
Will let you know how it goes.
-
@stephenw10
Still no good.
Just created 1 tunnel. It comes up fine but as soon as I try to use it, gone.
-
Hmm, still showing issues in the Chelsio driver.
Panic:
Fatal trap 9: general protection fault while in kernel mode
cpuid = 12; apic id = 0c
instruction pointer = 0x20:0xffffffff8065f3d9
stack pointer = 0x28:0xfffffe009a0fb540
frame pointer = 0x28:0xfffffe009a0fb570
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (irq323: t5nex0:3a2)
trap number = 9
panic: general protection fault
cpuid = 12
time = 1661734572
KDB: enter: panic
Backtrace:
db:0:kdb.enter.default> bt
Tracing pid 12 tid 100213 td 0xfffff80005df0000
kdb_enter() at kdb_enter+0x37/frame 0xfffffe009a0fb250
vpanic() at vpanic+0x197/frame 0xfffffe009a0fb2a0
panic() at panic+0x43/frame 0xfffffe009a0fb300
trap_fatal() at trap_fatal+0x391/frame 0xfffffe009a0fb360
trap() at trap+0x67/frame 0xfffffe009a0fb470
calltrap() at calltrap+0x8/frame 0xfffffe009a0fb470
--- trap 0x9, rip = 0xffffffff8065f3d9, rsp = 0xfffffe009a0fb540, rbp = 0xfffffe009a0fb570 ---
cxgbe_transmit() at cxgbe_transmit+0x19/frame 0xfffffe009a0fb570
ether_output_frame() at ether_output_frame+0xb4/frame 0xfffffe009a0fb5a0
ether_output() at ether_output+0x676/frame 0xfffffe009a0fb620
ip_output() at ip_output+0x136c/frame 0xfffffe009a0fb770
ip_forward() at ip_forward+0x39e/frame 0xfffffe009a0fb840
ip_input() at ip_input+0x850/frame 0xfffffe009a0fb8f0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe009a0fb940
ether_demux() at ether_demux+0x16a/frame 0xfffffe009a0fb970
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe009a0fb9d0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe009a0fba20
ether_input() at ether_input+0x89/frame 0xfffffe009a0fba80
service_iq_fl() at service_iq_fl+0x5d2/frame 0xfffffe009a0fbb30
t4_intr() at t4_intr+0x2d/frame 0xfffffe009a0fbb50
ithread_loop() at ithread_loop+0x23c/frame 0xfffffe009a0fbbb0
fork_exit() at fork_exit+0x7e/frame 0xfffffe009a0fbbf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe009a0fbbf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
But wireguard was no longer running on it when that happened?
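To show how the backtrace above points at the driver, here is a small Python sketch that finds the trap marker and lists the frames that were executing when the fault hit. The excerpt is copied from the backtrace; `frames_after_trap` and the regex names are hypothetical helpers, not anything pfSense or FreeBSD provides.

```python
import re

# Excerpt copied from the backtrace above, still in its one-line pasted form.
BACKTRACE = (
    "trap() at trap+0x67/frame 0xfffffe009a0fb470 "
    "calltrap() at calltrap+0x8/frame 0xfffffe009a0fb470 "
    "--- trap 0x9, rip = 0xffffffff8065f3d9, rsp = 0xfffffe009a0fb540, "
    "rbp = 0xfffffe009a0fb570 --- "
    "cxgbe_transmit() at cxgbe_transmit+0x19/frame 0xfffffe009a0fb570 "
    "ether_output_frame() at ether_output_frame+0xb4/frame 0xfffffe009a0fb5a0 "
    "ether_output() at ether_output+0x676/frame 0xfffffe009a0fb620 "
    "ip_output() at ip_output+0x136c/frame 0xfffffe009a0fb770 "
    "ip_forward() at ip_forward+0x39e/frame 0xfffffe009a0fb840"
)

# Marker line that separates the panic handling frames from the faulting code.
TRAP_RE = re.compile(r"--- trap (0x[0-9a-f]+), rip = (0x[0-9a-f]+)")
# "name() at name+0xOFFSET/frame 0xADDR"; captures the name and the offset.
FRAME_RE = re.compile(r"(\w+)\(\) at \1\+(0x[0-9a-f]+)/frame 0x[0-9a-f]+")


def frames_after_trap(text):
    """Return (faulting rip, [(function, offset), ...]) below the trap marker."""
    trap = TRAP_RE.search(text)
    if trap is None:
        return None, []
    return trap.group(2), FRAME_RE.findall(text, trap.end())


rip, frames = frames_after_trap(BACKTRACE)
print("faulting rip:", rip)
for name, offset in frames:
    print(f"{name}+{offset}")
```

The first frame after the trap marker is cxgbe_transmit, and the rip matches the instruction pointer reported in the earlier log, with ip_forward further down the call chain.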
-
What's the WG tunnel connected to there? Another pfSense install?
-
@stephenw10 said in Epyc 3251 and Wireguard:
Hmm, still showing issues in the Chelsio driver.
I assume that's only because my WAN was on the Chelsio at the time. I didn't check when I disconnected the Chelsio card, but I would assume it would have shown igb0 then.
But wireguard was no longer running on it when that happened?
Probably the cause right there. WG shutting down when I try to use it?
-
@stephenw10 said in Epyc 3251 and Wireguard:
What's the WG tunnel connected to there? Another pfSense install?
Unfortunately not.
That tunnel goes to an OPNsense box. At least until the vlan0 is fixed.
-
Hmm, so the encrypted WG traffic still runs over the Chelsio NIC, the WAN?
-
@stephenw10 Not really sure what you're asking there.
My WAN is on the Chelsio card (cxl3), the WG tunnel comes up with handshakes, but as soon as I try to access the other side it crashes.
-
Mmm, I'm unsure what you moved to igb0. I would have expected that to have to be the WAN for the WG interface to be running on it.
-
@stephenw10 I moved the WAN to igb0 and disconnected the Chelsio card from the motherboard as a test.
The trouble still happened.
So I don't think focusing on the Chelsio is the way to go.
It happens with the onboard NICs also.
Because it still happened with the onboard NICs, I reinserted the Chelsio and moved WAN back to it.
-
Right, I would agree except that it appeared the error was still on the Chelsio NIC even when it was not carrying WG traffic as I understand it.
It would be good to get a crash report from the igb0-as-WAN setup if that's possible. It would be very surprising to see the same error on igb, since many people are running WG with an igb parent.
-
@stephenw10 said in Epyc 3251 and Wireguard:
Right, I would agree except that it appeared the error was still on the Chelsio NIC even when it was not carrying WG traffic as I understand it.
How are you coming up with that?
-
@jarhead said in Epyc 3251 and Wireguard:
But wireguard was no longer running on it when that happened?
Probably the cause right there. WG shutting down when I try to use it?
I may have read that wrong. What I meant to ask there was: was WG running on the Chelsio NIC when that crash report was generated?
-
@stephenw10 I'll go through the whole thing again, trying to be more clear.
New router. Backed up old, restored on new changing interfaces as needed.
Wireguard would crash.
Moved WAN and LAN to the onboard igb NICs.
Wireguard would crash.
Since this shows it's not related to the Chelsio card, as it wasn't even plugged into the motherboard, I reinstalled the Chelsio and moved WAN and LAN back to it.
Wireguard would crash.
I found some weird errors in my gateways: network 1 was using gateway 2 and network 2 was using gateway 1, when they should be 1 to 1 and 2 to 2. So I uninstalled Wireguard, reinstalled it and recreated one tunnel.
Wireguard crashed and that's the dump I posted here.
So focusing on the Chelsio card doesn't seem to be the way to go.
Have you guys used an Epyc 3251 in the office for testing at all?