Netgate Discussion Forum

    Crashing on startup (23.09 beta)

    Plus 23.09 Development Snapshots (Retired)
    7 Posts · 3 Posters · 517 Views
    • M
      mikebenna
      last edited by

      Hi, I've got a pair of Plus VMs in HA mode. My slave VM ran the beta happily over the weekend. This morning I upgraded my master VM and it crashed three times immediately after the upgrade. It then ran fine for 30 minutes or so, so I rebooted again to see what would happen and boom: crashing again during startup, once, twice, three, four times. I was about to simply revert, but then the fifth reboot succeeded! I'll leave it here for now. :)

      Looks like a page fault right around the time a WireGuard tunnel is brought up.

      The slave is running a similar WireGuard config but has never crashed.
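
      (For anyone comparing the two nodes, the tunnels can be eyeballed from a shell roughly as below — a sketch assuming the wg(8) utility installed by the pfSense WireGuard package and the tun_wg interface names the package creates:)

      wg show            # peers, endpoints, and last-handshake times for every tunnel
      ifconfig tun_wg0   # link state and addresses on an individual tunnel interface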

      How can I be of further assistance? Please reach out if you want the dump files.

      Version info:

      amd64
      14.0-CURRENT
      FreeBSD 14.0-CURRENT amd64 1400094 #1 plus-RELENG_23_09-n256151-106588946ac: Mon Oct 16 03:08:01 UTC 2023     root@freebsd:/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/obj/amd64/nZsGV1oA/var/jenkins/workspace/pfSense-Plus-snapshots-23_09-main/
      
      • M
        marcosm Netgate
        last edited by

        If available, share the panic/backtrace.
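
        (For reference, on a stock install the crash reporter keeps its data under /var/crash; below is a minimal sketch for pulling out the panic summary and ddb backtrace, assuming FreeBSD's default textdump format — exact file names may vary:)

        mkdir -p /tmp/crashcopy
        tar -xf /var/crash/textdump.tar.0 -C /tmp/crashcopy
        cat /tmp/crashcopy/panic.txt    # one-line panic reason
        less /tmp/crashcopy/ddb.txt     # captured ddb session, including the backtrace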

        • M
          mikebenna @marcosm
          last edited by

          Hi @marcosm, I suspect the entire tar file contains more info than I wish to post in a public forum, but here's a snippet showing the crash in progress:

          ...
          <118>Starting syslog...done.
          <6>carp: 4@vtnet2: INIT -> BACKUP (initialization complete)
          <6>carp: demoted by -240 to 720 (interface up)
          <6>carp: 3@vtnet1: INIT -> BACKUP (initialization complete)
          <6>carp: demoted by -240 to 480 (interface up)
          <6>carp: 1@vtnet0.32: INIT -> BACKUP (initialization complete)
          <6>carp: demoted by -240 to 240 (interface up)
          <6>carp: 2@vtnet0: INIT -> BACKUP (initialization complete)
          <6>carp: demoted by -240 to 0 (interface up)
          <118>Starting CRON... done.
          <118>>>> Upgrading necessary packages... 
          <6>carp: 2@vtnet0: BACKUP -> MASTER (preempting a slower master)
          <6>carp: 1@vtnet0.32: BACKUP -> MASTER (preempting a slower master)
          <6>carp: 3@vtnet1: BACKUP -> MASTER (preempting a slower master)
          <6>carp: 4@vtnet2: BACKUP -> MASTER (preempting a slower master)
          <118>Checking for upgrades (2 candidates): .. done
          <118>Processing candidates (2 candidates): .. done
          <118>Checking integrity... done (0 conflicting)
          <118>The following 2 package(s) will be affected (of 0 checked):
          <118>
          <118>Installed packages to be UPGRADED:
          <118>	pfSense-pkg-WireGuard: 0.2.0_2 -> 0.2.1 [pfSense]
          <118>
          <118>Installed packages to be REINSTALLED:
          <118>	pfSense-pkg-node_exporter-0.18.1_3 [pfSense] (direct dependency changed: python311)
          <118>
          <118>Number of packages to be upgraded: 1
          <118>Number of packages to be reinstalled: 1
          <118>[1/2] Upgrading pfSense-pkg-WireGuard from 0.2.0_2 to 0.2.1...
          <118>[1/2] Extracting pfSense-pkg-WireGuard-0.2.1: .......... done
          <118>Removing WireGuard components...
          <118>Menu items... done.
          <118>Services... done.
          <118>Loading package instructions...
          <118>  Removing WireGuard early shell commands...done.
          <118>  Removing WireGuard interface group...done.
          <118>  Removing WireGuard temporary files...done.
          <118>  Keeping WireGuard configuration settings...done.
          <118>  Removing WireGuard Unbound access list...done.
          <118>  Destroying WireGuard tunnels...
          <6>tun_wg0: link state changed to DOWN
          <6>tun_wg1: link state changed to DOWN
          <118>done.
          <118>  Stopping and removing the WireGuard service...
          
          
          Fatal trap 12: page fault while in kernel mode
          cpuid = 2; apic id = 02
          fault virtual address	= 0x50
          fault code		= supervisor read data, page not present
          instruction pointer	= 0x20:0xffffffff80fb883c
          stack pointer	        = 0x28:0xfffffe00595c2810
          frame pointer	        = 0x28:0xfffffe00595c28d0
          code segment		= base 0x0, limit 0xfffff, type 0x1b
          			= DPL 0, pres 1, long 1, def32 0, gran 1
          processor eflags	= interrupt enabled, resume, IOPL = 0
          current process		= 12 (irq25: virtio_pci0)
          rdi: fffff8008d1ee01a rsi: fffffe0005bb082d rdx: 0000000000000000
          rcx: fffffe0005bb0370  r8: fffffe00595c2930  r9: 0000000000000000
          rax: 0000000000000000 rbx: fffff800aa6248b8 rbp: fffffe00595c28d0
          r10: 0000000000000000 r11: fffffe0005bb0370 r12: fffffe0005bb0370
          r13: 0000000000000002 r14: fffff800bbd5a160 r15: fffffe00595c2b08
          trap number		= 12
          panic: page fault
          cpuid = 2
          time = 1697479957
          KDB: enter: panic
          

          and here's a bit of the trace:

          db:0:kdb.enter.default>  run pfs
          db:1:pfs> bt
          Tracing pid 12 tid 100070 td 0xfffffe0005d78020
          kdb_enter() at kdb_enter+0x32/frame 0xfffffe00595c24f0
          vpanic() at vpanic+0x163/frame 0xfffffe00595c2620
          panic() at panic+0x43/frame 0xfffffe00595c2680
          trap_fatal() at trap_fatal+0x40c/frame 0xfffffe00595c26e0
          trap_pfault() at trap_pfault+0x4f/frame 0xfffffe00595c2740
          calltrap() at calltrap+0x8/frame 0xfffffe00595c2740
          --- trap 0xc, rip = 0xffffffff80fb883c, rsp = 0xfffffe00595c2810, rbp = 0xfffffe00595c28d0 ---
          pf_route() at pf_route+0x56c/frame 0xfffffe00595c28d0
          pf_test() at pf_test+0xd7b/frame 0xfffffe00595c2a70
          pf_check_out() at pf_check_out+0x22/frame 0xfffffe00595c2a90
          pfil_mbuf_out() at pfil_mbuf_out+0x38/frame 0xfffffe00595c2ac0
          ip_tryforward() at ip_tryforward+0x29d/frame 0xfffffe00595c2b70
          ip_input() at ip_input+0x350/frame 0xfffffe00595c2bd0
          netisr_dispatch_src() at netisr_dispatch_src+0x22c/frame 0xfffffe00595c2c20
          ether_demux() at ether_demux+0x149/frame 0xfffffe00595c2c50
          ether_nh_input() at ether_nh_input+0x36e/frame 0xfffffe00595c2cb0
          netisr_dispatch_src() at netisr_dispatch_src+0xaf/frame 0xfffffe00595c2d00
          ether_input() at ether_input+0x69/frame 0xfffffe00595c2d60
          vtnet_rxq_eof() at vtnet_rxq_eof+0x6d2/frame 0xfffffe00595c2e20
          vtnet_rx_vq_process() at vtnet_rx_vq_process+0xbc/frame 0xfffffe00595c2e60
          ithread_loop() at ithread_loop+0x257/frame 0xfffffe00595c2ef0
          fork_exit() at fork_exit+0x7f/frame 0xfffffe00595c2f30
          fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00595c2f30
          --- trap 0x7c9ebd07, rip = 0x8000065320cca1bc, rsp = 0xffffffff00000007, rbp = 0xbaa558 ---
          db:1:pfs>  show registers
          cs                        0x20
          ds                        0x3b
          es                        0x3b
          fs                        0x13
          gs                        0x1b
          ss                        0x28
          rax                       0x12
          rcx         0xffffffff814589e2
          rdx                      0x3f8
          rbx                      0x100
          rsp         0xfffffe00595c24f0
          rbp         0xfffffe00595c24f0
          rsi                        0xa
          rdi         0xffffffff82d4e850  gdb_consdev
          r8                           0
          r9                           0
          r10                          0
          r11                          0
          r12                          0
          r13                          0
          r14         0xffffffff813dbf5b
          r15         0xfffffe0005d78020
          rip         0xffffffff80d388c2  kdb_enter+0x32
          rflags                    0x82
          kdb_enter+0x32: movq    $0,0x2344f43(%rip)
          db:1:pfs>  show pcpu
          cpuid        = 2
          dynamic pcpu = 0xfffffe0082c79f00
          curthread    = 0xfffffe0005d78020: pid 12 tid 100070 critnest 1 "irq25: virtio_pci0"
          curpcb       = 0xfffffe0005d78540
          fpcurthread  = none
          idlethread   = 0xfffffe0005c67560: tid 100005 "idle: cpu2"
          self         = 0xffffffff84012000
          curpmap      = 0xffffffff83021ab0
          tssp         = 0xffffffff84012384
          rsp0         = 0xfffffe00595c3000
          kcr3         = 0x80000000b7900007
          ucr3         = 0xffffffffffffffff
          scr3         = 0xa5b7a9e4
          gs32p        = 0xffffffff84012404
          ldt          = 0xffffffff84012444
          tss          = 0xffffffff84012434
          curvnet      = 0xfffff800011c7580
          db:1:pfs>  run lockinfo
          db:2:lockinfo> show locks
          No such command; use "help" to list available commands
          db:2:lockinfo>  show alllocks
          No such command; use "help" to list available commands
          db:2:lockinfo>  show lockedvnods
          Locked vnodes
          db:1:pfs>  acttrace
          
          Tracing command idle pid 11 tid 100003 td 0xfffffe0005c683a0 (CPU 0)
          cpustop_handler() at cpustop_handler+0x27/frame 0xffffffff83105680
          ipi_nmi_handler() at ipi_nmi_handler+0x3a/frame 0xffffffff83105690
          trap() at trap+0x3f/frame 0xffffffff831057b0
          nmi_calltrap() at nmi_calltrap+0x8/frame 0xffffffff831057b0
          --- trap 0x13, rip = 0xffffffff8125c526, rsp = 0xfffffe0005176d70, rbp = 0xfffffe0005176d70 ---
          acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0005176d70
          acpi_cpu_idle() at acpi_cpu_idle+0x2fe/frame 0xfffffe0005176db0
          cpu_idle_acpi() at cpu_idle_acpi+0x46/frame 0xfffffe0005176dd0
          cpu_idle() at cpu_idle+0x9d/frame 0xfffffe0005176df0
          sched_idletd() at sched_idletd+0x576/frame 0xfffffe0005176ef0
          fork_exit() at fork_exit+0x7f/frame 0xfffffe0005176f30
          fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0005176f30
          --- trap 0x9824848b, rip = 0x5ede158d48564100, rsp = 0x894c78ec8348db31, rbp = 0x2464894cd23115eb ---
          

          This is totally not my specialty, but from the backtrace it looks like the fault hits in pf_route() while the virtio driver's receive interrupt is being serviced. FWIW, the host OS is TrueNAS SCALE Cobia RC.1, the adapter type is 'VirtIO', and it attaches to a software bridge on the Linux side.
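
          (If it helps narrow things down, here's a rough way to see what the guest actually negotiated on the VirtIO side — a sketch assuming the stock vtnet(4) driver and the interface names from the log above:)

          ifconfig vtnet0            # link state and enabled offload capabilities
          sysctl dev.vtnet.0.%desc   # adapter description as seen by the driver
          dmesg | grep -i vtnet      # features negotiated at attach time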

          I can send the tar file directly if you wish.
          Mike

          P.S. It's been running solid for two hours now, so it's likely fairly stable once everything comes up. I haven't tried rebooting again.

           • S
            stephenw10 Netgate Administrator
            last edited by

            Looks like it's this: https://redmine.pfsense.org/issues/14804

            If you look at the comments there, it should be fixed in tomorrow's build.

            Steve

            • M
              mikebenna @stephenw10
              last edited by

              Hi @stephenw10,

              Based on that theory, I disabled CARP on the master to force a failover to the slave, waited a minute, then re-enabled CARP on the master. Both failovers went smoothly, and this time the master did not crash when taking over.
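
              (For anyone wanting to repeat this from a shell, the rough equivalent of the GUI toggle is sketched below — assuming the stock pfSsh.php playback scripts; the Status > CARP (failover) page does the same thing from the GUI:)

              pfSsh.php playback disablecarp   # demote this node so the peer takes over
              sleep 60                         # give the peer time to assume MASTER
              pfSsh.php playback enablecarp    # re-enable CARP; this node preempts again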

              Further anecdotal evidence: I did two more reboots (after the above test) and both succeeded. Most likely that just means I don't have a 100% reproducible test case.

              In any case, I'll install tonight's snapshot build once it's up.
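
              (Pulling the new snapshot from a shell is roughly the sketch below — assuming the standard pfSense-upgrade tool; the System > Update page or console menu option 13 does the same:)

              pfSense-upgrade -d -c   # check whether a newer build is available (no changes made)
              pfSense-upgrade -d      # fetch and apply the latest build, rebooting if required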

              Mike

               • S
                stephenw10 Netgate Administrator
                last edited by

                The backtrace you saw is pretty much identical to the one in that report, so I'd be surprised if it wasn't that. 👍

                • M
                  mikebenna @stephenw10
                  last edited by

                  I installed the latest beta snapshot this morning on both VMs and so far no crashes. If anything comes up I'll report back here, but so far it's looking good.
