Crashing on startup (23.09 beta)
-
Hi, I've got a pair of Plus VMs in HA mode. My slave VM was happy running the beta over the weekend. This morning I upgraded my master VM and it crashed three times immediately after the upgrade. It was then happy for 30 minutes or so, until I rebooted again to see what would happen and boom... crashing again during startup: once, twice, three, four times. I was ready to just revert... oh wait, the fifth reboot succeeded! I'll leave it here for now. :)
Looks like a page fault right around the time of bringing up a WireGuard tunnel.
The slave is running a similar WireGuard config but has never crashed.
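In case it helps anyone following along, here's roughly how I'm sanity-checking the tunnels after each boot (just a sketch; 'wg' is the wireguard-tools utility the WireGuard package installs, and the tun_wg* names are from my own config):

wg show              # peers and last-handshake times
ifconfig tun_wg0     # interface up/down and link state
ifconfig tun_wg1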
How can I be of further assistance? Please reach out if you want the dump files.
Version info:
amd64 14.0-CURRENT FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256151-106588946ac: Mon Oct 16 03:08:01 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/nZsGV1oA/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/
-
If available, share the panic/backtrace.
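If you'd rather not post the whole tar, the relevant bits normally land in /var/crash on a stock install (paths may differ):

ls /var/crash
# info.N          - short summary of the panic message and backtrace
# textdump.tar.N  - the full ddb textdump (message buffer, trace, etc.)

The crash reporter page in the GUI shows the same data.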
-
Hi @marcosm, I suspect the full tar file contains more info than I want to post in a public forum, but here's a snippet of the crash in progress:
...
<118>Starting syslog...done.
<6>carp: 4@vtnet2: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 720 (interface up)
<6>carp: 3@vtnet1: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 480 (interface up)
<6>carp: 1@vtnet0.32: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 240 (interface up)
<6>carp: 2@vtnet0: INIT -> BACKUP (initialization complete)
<6>carp: demoted by -240 to 0 (interface up)
<118>Starting CRON... done.
<118>>>> Upgrading necessary packages...
<6>carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
<6>carp: 1@vtnet0.32: BACKUP -> MASTER (preempting a slower master)
<6>carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
<6>carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
<118>Checking for upgrades (2 candidates): .. done
<118>Processing candidates (2 candidates): .. done
<118>Checking integrity... done (0 conflicting)
<118>The following 2 package(s) will be affected (of 0 checked):
<118>
<118>Installed packages to be UPGRADED:
<118>	pfSense-pkg-WireGuard: 0.2.0_2 -> 0.2.1 [pfSense]
<118>
<118>Installed packages to be REINSTALLED:
<118>	pfSense-pkg-node_exporter-0.18.1_3 [pfSense] (direct dependency changed: python311)
<118>
<118>Number of packages to be upgraded: 1
<118>Number of packages to be reinstalled: 1
<118>[1/2] Upgrading pfSense-pkg-WireGuard from 0.2.0_2 to 0.2.1...
<118>[1/2] Extracting pfSense-pkg-WireGuard-0.2.1: .......... done
<118>Removing WireGuard components...
<118>Menu items... done.
<118>Services... done.
<118>Loading package instructions...
<118>  Removing WireGuard early shell commands...done.
<118>  Removing WireGuard interface group...done.
<118>  Removing WireGuard temporary files...done.
<118>  Keeping WireGuard configuration settings...done.
<118>  Removing WireGuard Unbound access list...done.
<118>  Destroying WireGuard tunnels...
<6>tun_wg0: link state changed to DOWN
<6>tun_wg1: link state changed to DOWN
<118>done.
<118>  Stopping and removing the WireGuard service...

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x50
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80fb883c
stack pointer		= 0x28:0xfffffe00595c2810
frame pointer		= 0x28:0xfffffe00595c28d0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 12 (irq25: virtio_pci0)
rdi: fffff8008d1ee01a rsi: fffffe0005bb082d rdx: 0000000000000000
rcx: fffffe0005bb0370  r8: fffffe00595c2930  r9: 0000000000000000
rax: 0000000000000000 rbx: fffff800aa6248b8 rbp: fffffe00595c28d0
r10: 0000000000000000 r11: fffffe0005bb0370 r12: fffffe0005bb0370
r13: 0000000000000002 r14: fffff800bbd5a160 r15: fffffe00595c2b08
trap number		= 12
panic: page fault
cpuid = 2
time = 1697479957
KDB: enter: panic
and here's a bit of the trace:
db:0:kdb.enter.default> run pfs
db:1:pfs> bt
Tracing pid 12 tid 100070 td 0xfffffe0005d78020
kdb_enter() at kdb_enter+0x32/frame 0xfffffe00595c24f0
vpanic() at vpanic+0x163/frame 0xfffffe00595c2620
panic() at panic+0x43/frame 0xfffffe00595c2680
trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00595c26e0
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00595c2740
calltrap() at calltrap+0x8/frame 0xfffffe00595c2740
--- trap 0xc, rip = 0xffffffff80fb883c, rsp = 0xfffffe00595c2810, rbp = 0xfffffe00595c28d0 ---
pf_route() at pf_route+0x56c/frame 0xfffffe00595c28d0
pf_test() at pf_test+0xd7b/frame 0xfffffe00595c2a70
pf_check_out() at pf_check_out+0x22/frame 0xfffffe00595c2a90
pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00595c2ac0
ip_tryforward() at ip_tryforward+0x29d/frame 0xfffffe00595c2b70
ip_input() at ip_input+0x350/frame 0xfffffe00595c2bd0
netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00595c2c20
ether_demux() at ether_demux+0x149/frame 0xfffffe00595c2c50
ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00595c2cb0
netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00595c2d00
ether_input() at ether_input+0x69/frame 0xfffffe00595c2d60
vtnet_rxq_eof() at vtnet_rxq_eof+0x6d2/frame 0xfffffe00595c2e20
vtnet_rx_vq_process() at vtnet_rx_vq_process+0xbc/frame 0xfffffe00595c2e60
ithread_loop() at ithread_loop+0x257/frame 0xfffffe00595c2ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe00595c2f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00595c2f30
--- trap 0x7c9ebd07, rip = 0x8000065320cca1bc, rsp = 0xffffffff00000007, rbp = 0xbaa558 ---
db:1:pfs> show registers
cs	0x20
ds	0x3b
es	0x3b
fs	0x13
gs	0x1b
ss	0x28
rax	0x12
rcx	0xffffffff814589e2
rdx	0x3f8
rbx	0x100
rsp	0xfffffe00595c24f0
rbp	0xfffffe00595c24f0
rsi	0xa
rdi	0xffffffff82d4e850	gdb_consdev
r8	0
r9	0
r10	0
r11	0
r12	0
r13	0
r14	0xffffffff813dbf5b
r15	0xfffffe0005d78020
rip	0xffffffff80d388c2	kdb_enter+0x32
rflags	0x82
kdb_enter+0x32:	movq	$0,0x2344f43(%rip)
db:1:pfs> show pcpu
cpuid        = 2
dynamic pcpu = 0xfffffe0082c79f00
curthread    = 0xfffffe0005d78020: pid 12 tid 100070 critnest 1 "irq25: virtio_pci0"
curpcb       = 0xfffffe0005d78540
fpcurthread  = none
idlethread   = 0xfffffe0005c67560: tid 100005 "idle: cpu2"
self         = 0xffffffff84012000
curpmap      = 0xffffffff83021ab0
tssp         = 0xffffffff84012384
rsp0         = 0xfffffe00595c3000
kcr3         = 0x80000000b7900007
ucr3         = 0xffffffffffffffff
scr3         = 0xa5b7a9e4
gs32p        = 0xffffffff84012404
ldt          = 0xffffffff84012444
tss          = 0xffffffff84012434
curvnet      = 0xfffff800011c7580
db:1:pfs> run lockinfo
db:2:lockinfo> show locks
No such command; use "help" to list available commands
db:2:lockinfo> show alllocks
No such command; use "help" to list available commands
db:2:lockinfo> show lockedvnods
Locked vnodes
db:1:pfs> acttrace
Tracing command idle pid 11 tid 100003 td 0xfffffe0005c683a0 (CPU 0)
cpustop_handler() at cpustop_handler+0x27/frame 0xffffffff83105680
ipi_nmi_handler() at ipi_nmi_handler+0x3a/frame 0xffffffff83105690
trap() at trap+0x3f/frame 0xffffffff831057b0
nmi_calltrap() at nmi_calltrap+0x8/frame 0xffffffff831057b0
--- trap 0x13, rip = 0xffffffff8125c526, rsp = 0xfffffe0005176d70, rbp = 0xfffffe0005176d70 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0005176d70
acpi_cpu_idle() at acpi_cpu_idle+0x2fe/frame 0xfffffe0005176db0
cpu_idle_acpi() at cpu_idle_acpi+0x46/frame 0xfffffe0005176dd0
cpu_idle() at cpu_idle+0x9d/frame 0xfffffe0005176df0
sched_idletd() at sched_idletd+0x576/frame 0xfffffe0005176ef0
fork_exit() at fork_exit+0x7f/frame 0xfffffe0005176f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0005176f30
--- trap 0x9824848b, rip = 0x5ede158d48564100, rsp = 0x894c78ec8348db31, rbp = 0x2464894cd23115eb ---
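I haven't tried this myself, but resolving the faulting instruction pointer should pin down the exact source line. Something like this is the usual approach (a sketch; it assumes kgdb and kernel debug symbols are available, which may not be the case on a stock install):

kgdb /boot/kernel/kernel
(kgdb) info line *0xffffffff80fb883c   # the rip from the trap frame above

Given fault virtual address = 0x50 with rax and rdx both zero at the time of the trap, it reads like a NULL struct pointer being dereferenced at a small member offset.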
This is totally not my specialty, but the panicking thread is the virtio NIC's receive interrupt handler (irq25: virtio_pci0), so it looks like it could be virtio-related. FWIW the host OS is TrueNAS SCALE Cobia RC.1, the adapter type is 'VirtIO', and it attaches to a software bridge on the Linux side.
I can send the tar file directly if you wish.
Mike

p.s. It's been running solid for two hours now, so it's probably fairly stable once everything is up. I haven't tried rebooting again.
-
Looks like it's this: https://redmine.pfsense.org/issues/14804
If you look at the comments there, it should be fixed in tomorrow's build.
Steve
-
Hi @stephenw10,
Based on that theory, I disabled CARP on the master to force a failover to the slave, waited a minute, then re-enabled CARP on the master. Both failovers went smoothly, and this time the master did not crash when taking over.
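For anyone wanting to try this from a shell instead of the GUI's Temporarily Disable CARP button (under Status > CARP (failover)), something like this should be roughly equivalent (a sketch, untested; interface and vhid names taken from the boot log above):

ifconfig vtnet0 vhid 2 state backup      # forcibly demote this node for vhid 2
ifconfig vtnet0.32 vhid 1 state backup   # likewise for the VLAN vhid
# wait for the peer to take over; with preemption enabled this node
# may promote itself back to MASTER once it is healthy again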
Further anecdotal evidence: I did two more reboots (after the above test) and both succeeded. Most likely that just means I don't have a 100% reproducible test case.
In any case, I'll install tonight's snapshot build once it's up.
Mike
-
The backtrace you posted is pretty much identical to the one in that ticket, so I'd be surprised if it wasn't the same bug.
-
I installed the latest beta snapshot on both VMs this morning and so far no crashes. If anything comes up I'll report back here, but so far it's looking good.