Netgate Discussion Forum
    • Categories
    • Recent
    • Tags
    • Popular
    • Users
    • Search
    • Register
    • Login

    25.03.b.20250507.1611 crash

    Scheduled Pinned Locked Moved Plus 25.03 Develoment Snapshots
    30 Posts 3 Posters 967 Views
    Loading More Posts
    • Oldest to Newest
    • Newest to Oldest
    • Most Votes
    Reply
    • Reply as topic
    Log in to reply
    This topic has been deleted. Only users with topic management privileges can see it.
    • P
      pst
      last edited by

      I just had a 25.03.b.20250507.1611 crash. However, it might not be a beta specific issue as I get these crashes once or twice a year (out of the blue), and there isn't normally much useful to go on:

      db:1:pfs> bt
      Tracing pid 37021 tid 128707 td 0xfffff80104540000
      kdb_enter() at kdb_enter+0x33/frame 0xfffffe00cdb99a80
      panic() at panic+0x43/frame 0xfffffe00cdb99ae0
      trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00cdb99b40
      calltrap() at calltrap+0x8/frame 0xfffffe00cdb99b40
      --- trap 0x9, rip = 0xffffffff812e4c2a, rsp = 0xfffffe00cdb99c10, rbp = 0xfffffe00cdb99d60 ---
      pmap_remove_pages() at pmap_remove_pages+0x6ba/frame 0xfffffe00cdb99d60
      vmspace_exit() at vmspace_exit+0x80/frame 0xfffffe00cdb99d90
      exit1() at exit1+0x53a/frame 0xfffffe00cdb99df0
      sys_exit() at sys_exit+0xd/frame 0xfffffe00cdb99e00
      amd64_syscall() at amd64_syscall+0x115/frame 0xfffffe00cdb99f30
      fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00cdb99f30
      --- syscall (1, FreeBSD ELF64, exit), rip = 0x3994995318ea, rsp = 0x399496fb4868, rbp = 0x399496fb4960 ---
      db:1:pfs>  show registers
      cs                        0x20
      ds                        0x3b
      es                        0x3b
      fs                        0x13
      gs                        0x1b
      ss                        0x28
      rax                       0x12
      rcx         0x9a55bef9e544a743
      rdx         0xfffffe00cdb995a0
      rbx                      0x100
      rsp         0xfffffe00cdb99958
      rbp         0xfffffe00cdb99a80
      rsi         0xfffffe00cdb99810
      rdi         0xffffffff82750a00  gdb_consdev
      r8                        0x25
      r9                        0x25
      r10                          0
      r11                          0
      r12                          0
      r13                          0
      r14         0xffffffff814688b4
      r15         0xfffff80104540000
      rip         0xffffffff80d4e3d3  kdb_enter+0x33
      rflags                    0x82
      kdb_enter+0x33: movq    $0,0x1d70132(%rip)
      db:1:pfs>  show pcpu
      cpuid        = 6
      dynamic pcpu = 0xfffffe008efeb540
      curthread    = 0xfffff80104540000: pid 37021 tid 128707 critnest 1 "sh"
      curpcb       = 0xfffff80104540520
      fpcurthread  = 0xfffff80104540000: pid 37021 "sh"
      idlethread   = 0xfffff80002619740: tid 100009 "idle: cpu6"
      self         = 0xffffffff83a16000
      curpmap      = 0xfffff8010450b7e8
      tssp         = 0xffffffff83a16384
      rsp0         = 0xfffffe00cdb9a000
      kcr3         = 0xffffffffffffffff
      ucr3         = 0xffffffffffffffff
      scr3         = 0x0
      gs32p        = 0xffffffff83a16404
      ldt          = 0xffffffff83a16444
      tss          = 0xffffffff83a16434
      curvnet      = 0
      

      Is there an interest in looking at the complete dump?

      1 Reply Last reply Reply Quote 0
      • stephenw10S
        stephenw10 Netgate Administrator
        last edited by

        Yes, please upload it here: https://nc.netgate.com/nextcloud/s/rs8eJHAETGniYGt

        P 1 Reply Last reply Reply Quote 0
        • P
          pst @stephenw10
          last edited by

          @stephenw10 done

          1 Reply Last reply Reply Quote 1
          • stephenw10S
            stephenw10 Netgate Administrator
            last edited by

            Hmm, well these are not good:

            <6>pid 3612 (pftop), jid 0, uid 0: exited on signal 6 (core dumped)
            <6>pid 9368 (pftop), jid 0, uid 0: exited on signal 6 (core dumped)
            <6>pid 65518 (pftop), jid 0, uid 0: exited on signal 6 (core dumped)
            <5>gif0: loop detected
            <5>gif0: loop detected
            <5>gif0: loop detected
            

            There's no timestamps on those so I don't know how close to the crash that is.

            Otherwise it's all pretty generic unfortunately.

            Do you know what version it started happening in? After enabling something?

            P 2 Replies Last reply Reply Quote 0
            • P
              pst @stephenw10
              last edited by

              @stephenw10 I saw the pftop crash first in the 1May beta (I mentioned it here https://forum.netgate.com/post/1213946). I hadn't been using pftop that much prior to that so it was my first sighting looking at the /status.php output.

              I ran pftop a few times today, and the log entries are related to those, not close to the crash if I can recall todays events accurately...

              I get a few gif0 loop detected occasionally. I only enabled a gif interface last weekend so it's also new, could be something misconfigured. I'll look into that over the weekend.

              1 Reply Last reply Reply Quote 0
              • stephenw10S
                stephenw10 Netgate Administrator
                last edited by stephenw10

                Ah, OK. That should be fixed in the next pftop version

                Have you seen the pmap crash in older pfSense versions? I thought I recalled a previous thread....

                Edit: Ah this I guess: https://forum.netgate.com/topic/187067/rare-kernel-panic-on-23-09-1-release-amd64-non-netgate-hw

                P 2 Replies Last reply Reply Quote 0
                • P
                  pst @stephenw10
                  last edited by

                  @stephenw10 said in 25.03.b.20250507.1611 crash:

                  I thought I recalled a previous thread....

                  Indeed, it was mentioned way back in https://forum.netgate.com/topic/187067/rare-kernel-panic-on-23-09-1-release-amd64-non-netgate-hw?_=1746825894764

                  I have had the infrequent ones over the years, normally at night but there hasn't been much to go on so I haven't pushed the issue further.

                  1 Reply Last reply Reply Quote 0
                  • stephenw10S
                    stephenw10 Netgate Administrator
                    last edited by stephenw10

                    Hmm, I'll see if our devs have any insight there but something spanning versions like that seems to point at something specific to your system.

                    If you have any further crash reports that might help.

                    P 1 Reply Last reply Reply Quote 0
                    • P
                      pst @stephenw10
                      last edited by

                      @stephenw10 said in 25.03.b.20250507.1611 crash:

                      If you have any further crash reports that might help.

                      I just realised I do have a bunch, could be every crash between Oct23 and now. I'll zip them up to preserve time stamps. You tell me where to put them.

                      M 1 Reply Last reply Reply Quote 1
                      • M
                        marcosm Netgate @pst
                        last edited by

                        @pst You may use the same link as before.

                        P 1 Reply Last reply Reply Quote 0
                        • P
                          pst @marcosm
                          last edited by

                          @marcosm Ah, done.

                          1 Reply Last reply Reply Quote 0
                          • stephenw10S
                            stephenw10 Netgate Administrator
                            last edited by

                            Thanks!

                            P 1 Reply Last reply Reply Quote 0
                            • P
                              pst @stephenw10
                              last edited by

                              @stephenw10
                              I had a look at the crash dumps, ddb.txt especially, to see if I could see a pattern. I have a growing suspicion that they could be caused by the RRD_Summary package, and related to /var/db/rrd/updaterrd.sh especially.

                              I base this theory on the following observations:

                              The processes active at the time of the crashes, either "sh" running some script, rrdtool, or sysctl (and both are called from updaterrd.sh):

                              $ for f in textdump.tar*/; do ls -l $f/ddb.txt; grep curthread $f/ddb.txt; done
                              -rw-r--r-- 1 ps 197609 401246 Nov 28  2023 'textdump.tar(1)0//ddb.txt'
                              curthread    = 0xfffffe00848f1720: pid 11 tid 100007 critnest 1 "idle: cpu4"
                              fpcurthread  = none
                              -rw-r--r-- 1 ps 197609 419097 Nov 29  2023 'textdump.tar(1)1//ddb.txt'
                              curthread    = 0xfffffe00cdcbb720: pid 82157 tid 112839 critnest 1 "sh"
                              fpcurthread  = 0xfffffe00cdcbb720: pid 82157 "sh"
                              -rw-r--r-- 1 ps 197609 455061 Mar 30  2024 'textdump.tar(2)//ddb.txt'
                              curthread    = 0xfffffe00cbd8b720: pid 73857 tid 113360 critnest 1 "sysctl"
                              fpcurthread  = 0xfffffe00cbd8b720: pid 73857 "sysctl"
                              -rw-r--r-- 1 ps 197609 439927 Jun 16  2024 'textdump.tar(3)//ddb.txt'
                              curthread    = 0xfffff80001a22740: pid 8252 tid 100331 critnest 1 "sh"
                              fpcurthread  = 0xfffff80001a22740: pid 8252 "sh"
                              -rw-r--r-- 1 ps 197609 441834 Nov 19 02:53 'textdump.tar(4)//ddb.txt'
                              curthread    = 0xfffff80088a12740: pid 4364 tid 104673 critnest 1 "sh"
                              fpcurthread  = 0xfffff80088a12740: pid 4364 "sh"
                              -rw-r--r-- 1 ps 197609 454361 Jan 23 02:50 'textdump.tar(5)//ddb.txt'
                              curthread    = 0xfffff8021b38d740: pid 28810 tid 100633 critnest 1 "sh"
                              fpcurthread  = 0xfffff8021b38d740: pid 28810 "sh"
                              -rw-r--r-- 1 ps 197609 336991 May  9 16:27 'textdump.tar(6)//ddb.txt'
                              curthread    = 0xfffff80104540000: pid 37021 tid 128707 critnest 1 "sh"
                              fpcurthread  = 0xfffff80104540000: pid 37021 "sh"
                              -rw-r--r-- 1 ps 197609 424254 Sep 28  2023 textdump.tar//ddb.txt
                              curthread    = 0xfffffe00cb1783a0: pid 73702 tid 101438 critnest 1 "rrdtool"
                              fpcurthread  = 0xfffffe00cb1783a0: pid 73702 "rrdtool"
                              

                              In the latest crash textdump.tar(6)/ddb.txt, the "sh" that crashed while trying to exit (state RE)

                              db:1:pfs>  ps
                                pid  ppid  pgrp   uid  state   wmesg   wchan               cmd
                              37021 68125    22     0  RE      CPU 6                       sh
                              

                              has a parent (68125) that is also a shell

                              68125     1    22     0  S+      wait    0xfffffe00ca219b00  sh
                              

                              whose parent is "1" i.e the system init. If I check the running system there aren't many shell processes started and running from init.

                              [25.03-BETA][admin@pfsense.local.lan]/root: ps -lx | awk '{ if ( $3 == 1 ) { print $0 } }' | grep "/bin/sh"
                                0 98160     1 0  68 20  14644  3208 wait     SN   u0-    1:10.95 /bin/sh /var/db/rrd/updaterrd.sh
                              

                              Now, this doesn't really take me any closer to understanding why the crashes occur on my system, but if you agree to my reasoning we might at least have narrowed the problem down a bit. I could uninstall RRD_Summary, but due to the infrequency of the crashes we wouldn't know for at least six months if that was the cause, and it wouldn't solve the actual problem either.

                              1 Reply Last reply Reply Quote 0
                              • stephenw10S
                                stephenw10 Netgate Administrator
                                last edited by

                                Right, that could well be the case but it shouldn't cause a kernel panic! I run that package here without issue. One of our devs is meditating on it. 😉

                                1 Reply Last reply Reply Quote 3
                                • P
                                  pst @stephenw10
                                  last edited by

                                  @stephenw10 said in 25.03.b.20250507.1611 crash:

                                  <5>gif0: loop detected
                                  <5>gif0: loop detected
                                  <5>gif0: loop detected

                                  and regarding these, I managed to track down the root cause...

                                  The loop detected indications appeared when I put my computer to sleep... What? 🤔

                                  Packet tracing on the gif interface revealed this in wireshark:

                                  7e5520d1-310c-4fab-92bb-ded35b1b5bf7-image.png

                                  After further digging I realised that one of my hyper-v guests (an Ubuntu instance) seems to be calling home every time it gets a suspend indication - but using the link-local address. I'm not sure if that's an issue in the hyper-v guest or the hyper-v server sending packets using the link-local address.

                                  After updating the LAN rules to filter out _private6_ addresses I no longer see gif0 screaming about loop detections.

                                  All is well, for now...

                                  P 1 Reply Last reply Reply Quote 1
                                  • P
                                    pst @pst
                                    last edited by

                                    @stephenw10 but obviously there had to be multiple reasons for these "loop detected", I discovered one additional cause which seems more related to the inner workings of pfSense.

                                    The scenario is as follows

                                    1. I put the computer to sleep
                                    2. a TCP retransmission is received on the gif interface aimed for the now sleeping computer
                                    3. after three seconds (timer expiry?) a ICMPv6 Destination Unreachable (Address Unreachable) is generated by pfSense
                                    4. this ICMPv6 packet is what triggers the "gif0: loop detected" (the timing in syslog and packet trace matches)

                                    The information in the "destination unreachable" looks fine to me, so there is no obvious reason why it could be interpreted as "looped". I can upload the pcap if anyone is interested?

                                    1 Reply Last reply Reply Quote 0
                                    • stephenw10S
                                      stephenw10 Netgate Administrator
                                      last edited by

                                      Hmm, how exactly are you using the gif tunnel(s) there?

                                      P 1 Reply Last reply Reply Quote 0
                                      • P
                                        pst @stephenw10
                                        last edited by

                                        @stephenw10 gif0 is the only gif tunnel I have, it is a tunnelbroker.net connection that provides IPv6 to the LAN (where the sleepy computer resides) and a number of VLANs.

                                        Those LAN/VLANs all have static IPv6 configuration (/64) in the routed/48 tunnelbroker subnet.The router mode is set to Assisted with DHCPv6 servers running.

                                        In addition to tunnelbroker.net I also have one VLAN that is configured with IPv6 by Tracking my ISP WAN which uses DHCPv6.

                                        1 Reply Last reply Reply Quote 0
                                        • stephenw10S
                                          stephenw10 Netgate Administrator
                                          last edited by

                                          Hmm, curious. Yes, hard to see how that could create any sort of loop on any interface.

                                          Is that client that goes to sleep attached directly to pfSense? Such that the link state could change when it goes into standby?

                                          P 2 Replies Last reply Reply Quote 0
                                          • P
                                            pst @stephenw10
                                            last edited by

                                            @stephenw10 said in 25.03.b.20250507.1611 crash:

                                            Is that client that goes to sleep attached directly to pfSense?

                                            No, there's an unmanaged switch inbetween

                                            1 Reply Last reply Reply Quote 0
                                            • First post
                                              Last post
                                            Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.