Crashing on startup (23.09 beta)
-
Hi, I've got a pair of Plus VMs in HA mode. My slave VM was happy running the beta over the weekend. This morning I upgraded my master VM and it crashed three times immediately after the upgrade. It was then happy for 30 minutes or so, until I rebooted again to see what would happen and boom... crashing again during startup: once, twice, three, four times. I was ready to just revert... oh wait, the fifth reboot succeeded! I'll leave it here for now. :)
Looks like a page fault right around the time of bringing up a WireGuard tunnel.
The slave is running a similar WireGuard config but has never crashed.
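In case it helps anyone following along, here's roughly how I'm sanity-checking the tunnels after each boot (just a sketch; 'wg' is the wireguard-tools utility the WireGuard package installs, and the tun_wg* names are from my own config):

wg show              # peers and last-handshake times
ifconfig tun_wg0     # interface up/down and link state
ifconfig tun_wg1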
How can I be of further assistance? Please reach out if you want the dump files.
Version info:
amd64 14.0-CURRENT FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256151-106588946ac: Mon Oct 16 03:08:01 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/nZsGV1oA/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/
-
If available, share the panic/backtrace.
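If you'd rather not post the whole tar, the relevant bits normally land in /var/crash on a stock install (paths may differ):

ls /var/crash
# info.N          - short summary of the panic message and backtrace
# textdump.tar.N  - the full ddb textdump (message buffer, trace, etc.)

The crash reporter page in the GUI shows the same data.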
-
Hi @marcosm, I suspect the full tar file contains more info than I want to post in a public forum, but here's a snippet of the crash in progress:
...
<118>Starting syslog...done.
<6>carp: 4@vtnet2: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 720 (interface up)
<6>carp: 3@vtnet1: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 480 (interface up)
<6>carp: 1@vtnet0.32: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 240 (interface up)
<6>carp: 2@vtnet0: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 0 (interface up)
<118>Starting CRON... done.
<118>>>> Upgrading necessary packages...
<6>carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
<6>carp: 1@vtnet0.32: BACKUP -> MASTER (preempting a slower master)
<6>carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
<6>carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
<118>Checking for upgrades (2 candidates): .. done
<118>Processing candidates (2 candidates): .. done
<118>Checking integrity... done (0 conflicting)
<118>The following 2 package(s) will be affected (of 0 checked):
<118>
<118>Installed packages to be UPGRADED:
<118>	pfSense-pkg-WireGuard: 0.2.0_2 -> 0.2.1 [pfSense]
<118>
<118>Installed packages to be REINSTALLED:
<118>	pfSense-pkg-node_exporter-0.18.1_3 [pfSense] (direct dependency changed: python311)
<118>
<118>Number of packages to be upgraded: 1
<118>Number of packages to be reinstalled: 1
<118>[1/2] Upgrading pfSense-pkg-WireGuard from 0.2.0_2 to 0.2.1...
<118>[1/2] Extracting pfSense-pkg-WireGuard-0.2.1: .......... done
<118>Removing WireGuard components...
<118>Menu items... done.
<118>Services... done.
<118>Loading package instructions...
<118>  Removing WireGuard early shell commands...done.
<118>  Removing WireGuard interface group...done.
<118>  Removing WireGuard temporary files...done.
<118>  Keeping WireGuard configuration settings...done.
<118>  Removing WireGuard Unbound access list...done.
<118>  Destroying WireGuard tunnels...
<6>tun_wg0: link state changed to DOWN
<6>tun_wg1: link state changed to DOWN
<118>done.
<118>  Stopping and removing the WireGuard service...

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x50
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80fb883c
stack pointer		= 0x28:0xfffffe00595c2810
frame pointer		= 0x28:0xfffffe00595c28d0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq25: virtio_pci0)
rdi: fffff8008d1ee01a rsi: fffffe0005bb082d rdx: 0000000000000000
rcx: fffffe0005bb0370  r8: fffffe00595c2930  r9: 0000000000000000
rax: 0000000000000000 rbx: fffff800aa6248b8 rbp: fffffe00595c28d0
r10: 0000000000000000 r11: fffffe0005bb0370 r12: fffffe0005bb0370
r13: 0000000000000002 r14: fffff800bbd5a160 r15: fffffe00595c2b08
trap number		= 12
panic: page fault
cpuid = 2
time = 1697479957
KDB: enter: panic
and here's a bit of the trace:
db:0:kdb.enter.default> run pfs
db:1:pfs> bt
Tracing pid 12 tid 100070 td 0xfffffe0005d78020
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00595c24f0
vpanic() at vpanic+0x163/frame 0xfffffe00595c2620
panic() at panic+0x43/frame 0xfffffe00595c2680
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00595c26e0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00595c2740
calltrap() at calltrap+0x8/frame 0xfffffe00595c2740
--- trap 0xc, rip = 0xffffffff80fb883c, rsp = 0xfffffe00595c2810, rbp = 0xfffffe00595c28d0 ---
pf_route() at pf_route+0x56c/frame 0xfffffe00595c28d0
pf_test() at pf_test+0xd7b/frame 0xfffffe00595c2a70
pf_check_out() at pf_check_out+0x22/frame 0xfffffe00595c2a90
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00595c2ac0
ip_tryforward() at ip_tryforward+0x29d/frame 0xfffffe00595c2b70
ip_input() at ip_input+0x350/frame 0xfffffe00595c2bd0
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00595c2c20
ether_demux() at ether_demux+0x149/frame 0xfffffe00595c2c50
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00595c2cb0
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00595c2d00
ether_input() at ether_input+0x69/frame 0xfffffe00595c2d60
vtnet_rxq_eof() at vtnet_rxq_eof+0x6d2/frame 0xfffffe00595c2e20
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xbc/frame 0xfffffe00595c2e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe00595c2ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00595c2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00595c2f30
--- trap 0x7c9ebd07, rip = 0x8000065320cca1bc, rsp = 0xffffffff00000007, rbp = 0xbaa558 ---
db:1:pfs> show registers
cs	0x20
ds	0x3b
es	0x3b
fs	0x13
gs	0x1b
ss	0x28
rax	0x12
rcx	0xffffffff814589e2
rdx	0x3f8
rbx	0x100
rsp	0xfffffe00595c24f0
rbp	0xfffffe00595c24f0
rsi	0xa
rdi	0xffffffff82d4e850	gdb_consdev
r8	0
r9	0
r10	0
r11	0
r12	0
r13	0
r14	0xffffffff813dbf5b
r15	0xfffffe0005d78020
rip	0xffffffff80d388c2	kdb_enter+0x32
rflags	0x82
kdb_enter+0x32:	movq	$0,0x2344f43(%rip)
db:1:pfs> show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe0082c79f00
curthread    = 0xfffffe0005d78020: pid 12 tid 100070 critnest 1 "irq25: virtio_pci0"
curpcb       = 0xfffffe0005d78540
fpcurthread  = none
idlethread   = 0xfffffe0005c67560: tid 100005 "idle: cpu2"
self         = 0xffffffff84012000
curpmap      = 0xffffffff83021ab0
tssp         = 0xffffffff84012384
rsp0         = 0xfffffe00595c3000
kcr3         = 0x80000000b7900007
ucr3         = 0xffffffffffffffff
scr3         = 0xa5b7a9e4
gs32p        = 0xffffffff84012404
ldt          = 0xffffffff84012444
tss          = 0xffffffff84012434
curvnet      = 0xfffff800011c7580
db:1:pfs> run lockinfo
db:2:lockinfo> show locks
No such command; use "help" to list available commands
db:2:lockinfo> show alllocks
No such command; use "help" to list available commands
db:2:lockinfo> show lockedvnods
Locked vnodes
db:1:pfs> acttrace
Tracing command idle pid 11 tid 100003 td 0xfffffe0005c683a0 (CPU 0)
cpustop_handler() at cpustop_handler+0x27/frame 0xffffffff83105680
ipi_nmi_handler() at ipi_nmi_handler+0x3a/frame 0xffffffff83105690
trap() at trap+0x3f/frame 0xffffffff831057b0
nmi_calltrap() at nmi_calltrap+0x8/frame 0xffffffff831057b0
--- trap 0x13, rip = 0xffffffff8125c526, rsp = 0xfffffe0005176d70, rbp = 0xfffffe0005176d70 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0005176d70
acpi_cpu_idle() at acpi_cpu_idle+0x2fe/frame 0xfffffe0005176db0
cpu_idle_acpi() at cpu_idle_acpi+0x46/frame 0xfffffe0005176dd0
cpu_idle() at cpu_idle+0x9d/frame 0xfffffe0005176df0
sched_idletd() at sched_idletd+0x576/frame 0xfffffe0005176ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0005176f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0005176f30
--- trap 0x9824848b, rip = 0x5ede158d48564100, rsp = 0x894c78ec8348db31, rbp = 0x2464894cd23115eb ---
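I haven't tried this myself, but resolving the faulting instruction pointer should pin down the exact source line. Something like this is the usual approach (a sketch; it assumes kgdb and kernel debug symbols are available, which may not be the case on a stock install):

kgdb /boot/kernel/kernel
(kgdb) info line *0xffffffff80fb883c   # the rip from the trap frame above

Given fault virtual address = 0x50 with rax and rdx both zero at the time of the trap, it reads like a NULL struct pointer being dereferenced at a small member offset.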
This is totally not my specialty, but the panicking thread is the virtio NIC's receive interrupt handler (irq25: virtio_pci0), so it looks like it could be virtio-related. FWIW the host OS is TrueNAS SCALE Cobia RC.1, the adapter type is 'VirtIO', and it attaches to a software bridge on the Linux side.
I can send the tar file directly if you wish.
Mike

p.s. It's been running solid for two hours now, so it's probably fairly stable once everything is up. I haven't tried rebooting again.
-
Looks like it's this: https://redmine.pfsense.org/issues/14804
If you look at the comments there, it should be fixed in tomorrow's build.
Steve
-
Hi @stephenw10,
Based on that theory, I disabled CARP on the master to force a failover to the slave, waited a minute, then re-enabled CARP on the master. Both failovers went smoothly, and this time the master did not crash when taking over.
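For anyone wanting to try this from a shell instead of the GUI's Temporarily Disable CARP button (under Status > CARP (failover)), something like this should be roughly equivalent (a sketch, untested; interface and vhid names taken from the boot log above):

ifconfig vtnet0 vhid 2 state backup      # forcibly demote this node for vhid 2
ifconfig vtnet0.32 vhid 1 state backup   # likewise for the VLAN vhid
# wait for the peer to take over; with preemption enabled this node
# may promote itself back to MASTER once it is healthy again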
Further anecdotal evidence: I did two more reboots (after the above test) and both succeeded. Most likely that just means I don't have a 100% reproducible test case.
In any case, I'll install tonight's snapshot build once it's up.
Mike
-
The backtrace you posted is pretty much identical to the one in that ticket, so I'd be surprised if it wasn't the same bug.
-
I installed the latest beta snapshot on both VMs this morning and so far no crashes. If anything comes up I'll report back here, but so far it's looking good.