Netgate Discussion Forum

    25.03.b.20250507.1611 crash

    Plus 25.03 Development Snapshots
    27 Posts 3 Posters 618 Views
    • M
      marcosm Netgate @pst

      @pst You may use the same link as before.

      • P
        pst @marcosm

        @marcosm Ah, done.

        • stephenw10S
          stephenw10 Netgate Administrator

          Thanks!

          • P
            pst @stephenw10

            @stephenw10
            I had a look at the crash dumps, ddb.txt especially, to see if I could see a pattern. I have a growing suspicion that they could be caused by the RRD_Summary package, and related to /var/db/rrd/updaterrd.sh especially.

            I base this theory on the following observations:

            The process active at the time of each crash was either "sh" running some script, rrdtool, or sysctl (the latter two are both called from updaterrd.sh):

            $ for f in textdump.tar*/; do ls -l $f/ddb.txt; grep curthread $f/ddb.txt; done
            -rw-r--r-- 1 ps 197609 401246 Nov 28  2023 'textdump.tar(1)0//ddb.txt'
            curthread    = 0xfffffe00848f1720: pid 11 tid 100007 critnest 1 "idle: cpu4"
            fpcurthread  = none
            -rw-r--r-- 1 ps 197609 419097 Nov 29  2023 'textdump.tar(1)1//ddb.txt'
            curthread    = 0xfffffe00cdcbb720: pid 82157 tid 112839 critnest 1 "sh"
            fpcurthread  = 0xfffffe00cdcbb720: pid 82157 "sh"
            -rw-r--r-- 1 ps 197609 455061 Mar 30  2024 'textdump.tar(2)//ddb.txt'
            curthread    = 0xfffffe00cbd8b720: pid 73857 tid 113360 critnest 1 "sysctl"
            fpcurthread  = 0xfffffe00cbd8b720: pid 73857 "sysctl"
            -rw-r--r-- 1 ps 197609 439927 Jun 16  2024 'textdump.tar(3)//ddb.txt'
            curthread    = 0xfffff80001a22740: pid 8252 tid 100331 critnest 1 "sh"
            fpcurthread  = 0xfffff80001a22740: pid 8252 "sh"
            -rw-r--r-- 1 ps 197609 441834 Nov 19 02:53 'textdump.tar(4)//ddb.txt'
            curthread    = 0xfffff80088a12740: pid 4364 tid 104673 critnest 1 "sh"
            fpcurthread  = 0xfffff80088a12740: pid 4364 "sh"
            -rw-r--r-- 1 ps 197609 454361 Jan 23 02:50 'textdump.tar(5)//ddb.txt'
            curthread    = 0xfffff8021b38d740: pid 28810 tid 100633 critnest 1 "sh"
            fpcurthread  = 0xfffff8021b38d740: pid 28810 "sh"
            -rw-r--r-- 1 ps 197609 336991 May  9 16:27 'textdump.tar(6)//ddb.txt'
            curthread    = 0xfffff80104540000: pid 37021 tid 128707 critnest 1 "sh"
            fpcurthread  = 0xfffff80104540000: pid 37021 "sh"
            -rw-r--r-- 1 ps 197609 424254 Sep 28  2023 textdump.tar//ddb.txt
            curthread    = 0xfffffe00cb1783a0: pid 73702 tid 101438 critnest 1 "rrdtool"
            fpcurthread  = 0xfffffe00cb1783a0: pid 73702 "rrdtool"
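To make the pattern easier to see, the curthread lines can be tallied by process name. This is a minimal sketch, not part of the original analysis: the function name `tally_curthread` is hypothetical, and it assumes each ddb.txt has a `curthread` line that ends with the process name in double quotes, as in the listing above.

```sh
# Tally the process names found on "curthread" lines of one or more ddb.txt files.
# Hypothetical helper; assumes the name is the last double-quoted field on the line.
tally_curthread() {
    # -h suppresses file names; the anchored pattern skips the "fpcurthread" lines.
    grep -h '^[[:space:]]*curthread' "$@" |
        sed -n 's/.*"\([^"]*\)"[[:space:]]*$/\1/p' |
        sort | uniq -c | sort -rn
}
```

It could be called as `tally_curthread textdump.tar*/ddb.txt`, which prints one line per process name with its count, most frequent first.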
            

            In the latest crash textdump.tar(6)/ddb.txt, the "sh" that crashed while trying to exit (state RE)

            db:1:pfs>  ps
              pid  ppid  pgrp   uid  state   wmesg   wchan               cmd
            37021 68125    22     0  RE      CPU 6                       sh
            

            has a parent (68125) that is also a shell

            68125     1    22     0  S+      wait    0xfffffe00ca219b00  sh
            

            whose parent is "1", i.e. the system init. If I check the running system, there aren't many shell processes started by and still running under init:

            [25.03-BETA][admin@pfsense.local.lan]/root: ps -lx | awk '{ if ( $3 == 1 ) { print $0 } }' | grep "/bin/sh"
              0 98160     1 0  68 20  14644  3208 wait     SN   u0-    1:10.95 /bin/sh /var/db/rrd/updaterrd.sh
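The same parent-chain walk can be done mechanically on the `ps` section of a ddb.txt. The following is only a sketch under assumptions: `ddb_ancestry` is a hypothetical name, process rows are assumed to start with numeric pid and ppid columns as in the excerpts above, and the command is taken from the last field, so names containing spaces are truncated.

```sh
# Print "pid cmd" for a process and each of its ancestors, read from a
# DDB "ps" listing on stdin. Hypothetical helper, not a pfSense tool.
ddb_ancestry() {
    awk -v pid="$1" '
        # Process rows have numeric pid and ppid columns; header and
        # per-thread rows do not and are skipped.
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { ppid[$1] = $2; cmd[$1] = $NF }
        END {
            # Follow ppid links until a process is missing from the listing
            # or is its own parent; the step cap guards against odd data.
            for (n = 0; (pid in ppid) && n < 64; n++) {
                printf "%s %s\n", pid, cmd[pid]
                if (ppid[pid] == pid) break
                pid = ppid[pid]
            }
        }'
}
```

For example, `ddb_ancestry 37021 < ps_listing.txt`, where `ps_listing.txt` is assumed to hold the `ps` output copied from ddb.txt.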
            

            Now, this doesn't really take me any closer to understanding why the crashes occur on my system, but if you agree with my reasoning we might at least have narrowed the problem down a bit. I could uninstall RRD_Summary, but due to the infrequency of the crashes we wouldn't know for at least six months if that was the cause, and it wouldn't solve the actual problem either.

            • stephenw10S
              stephenw10 Netgate Administrator

              Right, that could well be the case but it shouldn't cause a kernel panic! I run that package here without issue. One of our devs is meditating on it. 😉

              • P
                pst @stephenw10

                @stephenw10 said in 25.03.b.20250507.1611 crash:

                <5>gif0: loop detected
                <5>gif0: loop detected
                <5>gif0: loop detected

                and regarding these, I managed to track down the root cause...

                The "loop detected" indications appeared when I put my computer to sleep... What? 🤔

                Packet tracing on the gif interface revealed this in wireshark:

                [Image: Wireshark packet capture from the gif interface]

                After further digging I realised that one of my Hyper-V guests (an Ubuntu instance) seems to be calling home every time it gets a suspend indication, but using the link-local address. I'm not sure whether that's an issue in the Hyper-V guest or the Hyper-V server sending packets using the link-local address.

                After updating the LAN rules to filter out _private6_ addresses I no longer see gif0 screaming about loop detections.
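When going through a packet trace like this, link-local sources can be picked out by prefix, since IPv6 link-local addresses fall within fe80::/10. A minimal sketch for illustration only: `is_link_local6` is a hypothetical helper and says nothing about how pfSense or the _private6_ alias matches addresses.

```sh
# Succeed (exit status 0) if the argument looks like an IPv6 link-local
# address, i.e. its first hextet is in the range fe80..febf (fe80::/10).
is_link_local6() {
    # Lower-case the hex digits so one pattern covers both spellings.
    addr=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$addr" in
        fe[89ab][0-9a-f]:*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_link_local6 fe80::1 && echo link-local` prints `link-local`, while a global address such as 2001:db8::1 does not match.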

                All is well, for now...

                • P
                  pst @pst

                  @stephenw10 Of course, there turned out to be more than one reason for these "loop detected" messages. I discovered one additional cause, which seems more related to the inner workings of pfSense.

                  The scenario is as follows:

                  1. I put the computer to sleep
                  2. a TCP retransmission is received on the gif interface aimed for the now sleeping computer
                  3. after three seconds (timer expiry?) an ICMPv6 Destination Unreachable (Address Unreachable) is generated by pfSense
                  4. this ICMPv6 packet is what triggers the "gif0: loop detected" (the timing in syslog and packet trace matches)

                  The information in the "destination unreachable" looks fine to me, so there is no obvious reason why it could be interpreted as "looped". I can upload the pcap if anyone is interested?

                  • stephenw10S
                    stephenw10 Netgate Administrator

                    Hmm, how exactly are you using the gif tunnel(s) there?

                    • P
                      pst @stephenw10

                      @stephenw10 gif0 is the only gif tunnel I have, it is a tunnelbroker.net connection that provides IPv6 to the LAN (where the sleepy computer resides) and a number of VLANs.

                      Those LAN/VLANs all have static IPv6 configuration (/64) in the routed /48 tunnelbroker subnet. The router mode is set to Assisted with DHCPv6 servers running.

                      In addition to tunnelbroker.net I also have one VLAN that is configured with IPv6 by Tracking my ISP WAN which uses DHCPv6.

                      • stephenw10S
                        stephenw10 Netgate Administrator

                        Hmm, curious. Yes, hard to see how that could create any sort of loop on any interface.

                        Is that client that goes to sleep attached directly to pfSense? Such that the link state could change when it goes into standby?

                        • P
                          pst @stephenw10

                          @stephenw10 said in 25.03.b.20250507.1611 crash:

                          Is that client that goes to sleep attached directly to pfSense?

                          No, there's an unmanaged switch in between.

                          • P
                            pst @stephenw10

                            @stephenw10 said in 25.03.b.20250507.1611 crash:

                            Is that client that goes to sleep attached directly to pfSense?

                            I changed my network setup and have now tested with a direct connection between Sleepyhead and the pfSense and the pfSense behaviour is different for the same scenario:

                            I can still see the reception of the TCP retransmissions, but pfSense does not respond with ICMPv6 Destination Unreachable after a timeout like previously; it just seems to drop the packet, which eventually leads to a TCP reset from the other end. No ICMP == no gif0: loop detected in this scenario.

                            This all makes sense I guess, considering the amount of work pfSense does when it detects the LAN/igb1 going down at the point of going to sleep. It knows the LAN client is unavailable and acts accordingly.

                            So, a switch is required between pfSense and the LAN client to trigger the "loop detected" scenario.

                            • P
                              pst @pst

                              I found an easier way of recreating the issue, saving me having to put the computer to sleep: all I need to do is start a file transfer (wget for example) in one of the Hyper-V guests that resides on the LAN (and therefore also exists over the gif interface), and then pause the Hyper-V guest. After a short while pfSense will answer incoming packets on the gif destined for the paused guest with ICMPv6 Destination Unreachable, and the corresponding "gif0: loop detected" is added to the syslog.

                              As I monitor the situation I can also see that the remaining "loop detected" messages are triggered by Wi-Fi attached phones, which have a tendency to wake up and go back to sleep much more regularly.
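One way to keep an eye on this could be to count the messages per interface from the kernel log. A rough sketch, assuming the message text has the form `<ifname>: loop detected` as in the excerpt quoted earlier in the thread; `count_loop_detected` is a hypothetical name.

```sh
# Count "loop detected" kernel messages per interface, reading log lines
# from stdin. Hypothetical helper; assumes "<ifname>: loop detected" text.
count_loop_detected() {
    awk '
        match($0, /[a-z]+[0-9]+: loop detected/) {
            # Keep only the interface name in front of the colon.
            ifname = substr($0, RSTART, RLENGTH)
            sub(/:.*/, "", ifname)
            count[ifname]++
        }
        END { for (i in count) printf "%s %d\n", i, count[i] }' | sort
}
```

It could be fed from the message buffer, for example `dmesg | count_loop_detected`.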

                              I think I have now found all scenarios that trigger the "gif0: loop detected". One was down to my misconfiguration of firewall rules; I leave it to you to find out why the pfSense-generated "ICMPv6 Destination Unreachable" packets are regarded as "looped".

                              • stephenw10S
                                stephenw10 Netgate Administrator

                                Hmm, interesting. I'd assume you still saw those loop warnings in 24.11?

                                I don't believe it's actually related to the crash though TBH. Have to wait on that.

                                • P
                                  pst @stephenw10

                                  @stephenw10 said in 25.03.b.20250507.1611 crash:

                                  I don't believe it's actually related to the crash though

                                  I agree, it is a side track, and I doubt it is actually 25.03-related either. I didn't run the gif in 24.11 so I have no history. If the beta config.xml is compatible with 24.11 I could try and load it and see if the pfSense behaviour has changed.

                                  • stephenw10S
                                    stephenw10 Netgate Administrator

                                    The config version has changed so it will complain if you try to load a 25.03 config into 24.11. It might work. 😉 It depends what you actually have configured.

                                    • P
                                      pst @stephenw10

                                      @stephenw10 yes, I loaded the current 25.03 config into 24.11 earlier and there were some warnings and a few errors. Most things seemed to work well though, including the GIF tunnel, but there were a lot more "loop detected" messages than in the beta. They seemed to be triggered by other scenarios than just the beta's "ICMPv6 Destination Unreachable (Address Unreachable)". Not sure how much can be read into that considering the "invalid" config file (and I really don't feel like manually setting up the current config in 24.11!)

                                      • stephenw10S
                                        stephenw10 Netgate Administrator

                                        Ah, OK. So not a regression then. Yup, I think we can safely say it's unrelated to the crash.

                                        Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.