25.03.b.20250507.1611 crash

marcosm

@pst You may use the same link as before.

pst

stephenw10

Thanks!

pst

@stephenw10
I had a look at the crash dumps, ddb.txt especially, to see if I could see a pattern. I have a growing suspicion that they could be caused by the RRD_Summary package, and related to /var/db/rrd/updaterrd.sh especially.

I base this theory on the following observations:

The process active at the time of each crash is almost always "sh" running some script, rrdtool, or sysctl (the latter two are both called from updaterrd.sh):

$ for f in textdump.tar*/; do ls -l $f/ddb.txt; grep curthread $f/ddb.txt; done
-rw-r--r-- 1 ps 197609 401246 Nov 28  2023 'textdump.tar(1)0//ddb.txt'
curthread    = 0xfffffe00848f1720: pid 11 tid 100007 critnest 1 "idle: cpu4"
fpcurthread  = none
-rw-r--r-- 1 ps 197609 419097 Nov 29  2023 'textdump.tar(1)1//ddb.txt'
curthread    = 0xfffffe00cdcbb720: pid 82157 tid 112839 critnest 1 "sh"
fpcurthread  = 0xfffffe00cdcbb720: pid 82157 "sh"
-rw-r--r-- 1 ps 197609 455061 Mar 30  2024 'textdump.tar(2)//ddb.txt'
curthread    = 0xfffffe00cbd8b720: pid 73857 tid 113360 critnest 1 "sysctl"
fpcurthread  = 0xfffffe00cbd8b720: pid 73857 "sysctl"
-rw-r--r-- 1 ps 197609 439927 Jun 16  2024 'textdump.tar(3)//ddb.txt'
curthread    = 0xfffff80001a22740: pid 8252 tid 100331 critnest 1 "sh"
fpcurthread  = 0xfffff80001a22740: pid 8252 "sh"
-rw-r--r-- 1 ps 197609 441834 Nov 19 02:53 'textdump.tar(4)//ddb.txt'
curthread    = 0xfffff80088a12740: pid 4364 tid 104673 critnest 1 "sh"
fpcurthread  = 0xfffff80088a12740: pid 4364 "sh"
-rw-r--r-- 1 ps 197609 454361 Jan 23 02:50 'textdump.tar(5)//ddb.txt'
curthread    = 0xfffff8021b38d740: pid 28810 tid 100633 critnest 1 "sh"
fpcurthread  = 0xfffff8021b38d740: pid 28810 "sh"
-rw-r--r-- 1 ps 197609 336991 May  9 16:27 'textdump.tar(6)//ddb.txt'
curthread    = 0xfffff80104540000: pid 37021 tid 128707 critnest 1 "sh"
fpcurthread  = 0xfffff80104540000: pid 37021 "sh"
-rw-r--r-- 1 ps 197609 424254 Sep 28  2023 textdump.tar//ddb.txt
curthread    = 0xfffffe00cb1783a0: pid 73702 tid 101438 critnest 1 "rrdtool"
fpcurthread  = 0xfffffe00cb1783a0: pid 73702 "rrdtool"
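
To make the pattern easier to see, the per-dump listing above can be condensed into a count per command name. This is only a minimal sketch: it assumes each ddb.txt has a line starting with `curthread` in the format shown above, and the function name `curthread_summary` is made up for illustration.

```shell
# Hypothetical helper: count which command was the current thread
# in each ddb.txt passed as an argument.
curthread_summary() {
    # -h suppresses file names so only the matching lines are printed;
    # the '^' anchor skips the "fpcurthread" lines.
    grep -h '^curthread' "$@" |
        # Keep only the last double-quoted string on the line (the command name).
        sed 's/.*"\([^"]*\)".*/\1/' |
        sort | uniq -c | sort -rn
}

# Usage, from the directory holding the extracted dumps:
# curthread_summary textdump.tar*/ddb.txt
```

With only eight dumps the counts are hardly statistics, but they do show at a glance how often a shell, rrdtool or sysctl was on the CPU.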

In the latest crash textdump.tar(6)/ddb.txt, the "sh" that crashed while trying to exit (state RE)

db:1:pfs>  ps
  pid  ppid  pgrp   uid  state   wmesg   wchan               cmd
37021 68125    22     0  RE      CPU 6                       sh

has a parent (68125) that is also a shell

68125     1    22     0  S+      wait    0xfffffe00ca219b00  sh

whose parent is "1", i.e. the system init. If I check the running system, there aren't many shell processes that were started from init and are still running.

[25.03-BETA][admin@pfsense.local.lan]/root: ps -lx | awk '{ if ( $3 == 1 ) { print $0 } }' | grep "/bin/sh"
  0 98160     1 0  68 20  14644  3208 wait     SN   u0-    1:10.95 /bin/sh /var/db/rrd/updaterrd.sh

Now, this doesn't really take me any closer to understanding why the crashes occur on my system, but if you agree with my reasoning we might at least have narrowed the problem down a bit. I could uninstall RRD_Summary, but because the crashes are so infrequent we wouldn't know for at least six months whether that was the cause, and it wouldn't solve the actual problem either.

stephenw10

Right, that could well be the case but it shouldn't cause a kernel panic! I run that package here without issue. One of our devs is meditating on it.

pst

@stephenw10 said in 25.03.b.20250507.1611 crash:

<5>gif0: loop detected
<5>gif0: loop detected
<5>gif0: loop detected

and regarding these, I managed to track down the root cause...

The loop detected indications appeared when I put my computer to sleep... What?

Packet tracing on the gif interface revealed this in Wireshark:

After further digging I realised that one of my Hyper-V guests (an Ubuntu instance) seems to be calling home every time it gets a suspend indication, but using the link-local address. I'm not sure whether the issue is in the Hyper-V guest or whether it is the Hyper-V server that sends packets using the link-local address.

After updating the LAN rules to filter out _private6_ addresses I no longer see gif0 screaming about loop detections.

All is well, for now...

pst

@stephenw10 but obviously there had to be multiple reasons for these "loop detected" messages. I discovered one additional cause, which seems more related to the inner workings of pfSense.

The scenario is as follows:

1. I put the computer to sleep.
2. A TCP retransmission aimed at the now sleeping computer is received on the gif interface.
3. After three seconds (timer expiry?) an ICMPv6 Destination Unreachable (Address Unreachable) is generated by pfSense.
4. This ICMPv6 packet is what triggers the "gif0: loop detected" (the timing in syslog and the packet trace matches).

The information in the "destination unreachable" looks fine to me, so there is no obvious reason why it could be interpreted as "looped". I can upload the pcap if anyone is interested?
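
For anyone who wants to watch for the same packets, a capture filter along these lines should isolate them. It is a sketch under stated assumptions: ICMPv6 type 1 / code 3 is Destination Unreachable / Address Unreachable per RFC 4443, the byte offsets only hold for packets without IPv6 extension headers, and the variable and function names are made up for illustration.

```shell
# ip6[40] and ip6[41] are the first two bytes after the fixed 40-byte IPv6
# header, i.e. the ICMPv6 type and code when no extension headers are present.
# Type 1 = Destination Unreachable, code 3 = Address Unreachable.
ICMP6_ADDR_UNREACH='icmp6 and ip6[40] == 1 and ip6[41] == 3'

# Hypothetical wrapper: capture on the given interface (default gif0).
# -n skips name resolution, -tttt prints full date and time for each packet
# so the trace can be lined up with syslog. Needs root.
watch_unreachables() {
    tcpdump -n -tttt -i "${1:-gif0}" "$ICMP6_ADDR_UNREACH"
}

# Usage on the firewall:
# watch_unreachables gif0
```

Running this next to a tail of the system log makes it easier to check whether each "gif0: loop detected" lines up with one of these packets.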

stephenw10

Hmm, how exactly are you using the gif tunnel(s) there?

pst

@stephenw10 gif0 is the only gif tunnel I have, it is a tunnelbroker.net connection that provides IPv6 to the LAN (where the sleepy computer resides) and a number of VLANs.

Those LAN/VLANs all have static IPv6 configuration (/64) in the routed /48 tunnelbroker subnet. The router mode is set to Assisted with DHCPv6 servers running.

In addition to tunnelbroker.net I also have one VLAN that is configured with IPv6 by Tracking my ISP WAN which uses DHCPv6.

stephenw10

Hmm, curious. Yes, hard to see how that could create any sort of loop on any interface.

Is that client that goes to sleep attached directly to pfSense? Such that the link state could change when it goes into standby?

pst

@stephenw10 said in 25.03.b.20250507.1611 crash:

Is that client that goes to sleep attached directly to pfSense?

No, there's an unmanaged switch in between.

pst

@stephenw10 said in 25.03.b.20250507.1611 crash:

Is that client that goes to sleep attached directly to pfSense?

I changed my network setup and have now tested with a direct connection between Sleepyhead and pfSense, and the pfSense behaviour is different for the same scenario:

I can still see the reception of the TCP retransmissions, but pfSense does not respond with ICMPv6 Destination Unreachable after a timeout as it did previously. It just seems to drop the packet, which eventually leads to a TCP reset from the other end. No ICMP == no gif0: loop detected in this scenario.

This all makes sense I guess, considering the amount of work pfSense does when it detects the LAN/igb1 going down at the point of going to sleep. It knows the LAN client is unavailable and acts accordingly.

So, a switch is required between pfSense and the LAN client to trigger the "loop detected" scenario.

pst

I found an easier way of recreating the issue that saves me from having to put the computer to sleep: all I need to do is start a file transfer (wget, for example) in one of the Hyper-V guests that resides on the LAN (and is therefore also reachable over the gif interface), and then pause the Hyper-V guest. After a short while pfSense will answer incoming packets on the gif destined for the paused guest with ICMPv6 Destination Unreachable, and the corresponding "gif0: loop detected" is added to the syslog.

As I monitor the situation I can also see that the remaining "loop detected" messages are triggered by Wi-Fi attached phones, which have a tendency to wake up and go back to sleep much more regularly.

I think I have now found all scenarios which trigger the "gif0: loop detected". One was down to my misconfiguration of firewall rules. I leave it to you to find out why the "ICMPv6 Destination Unreachable" packets generated by pfSense are regarded as "looped".

stephenw10

Hmm, interesting. I'd assume you still saw those loop warnings in 24.11?

I don't believe it's actually related to the crash though TBH. Have to wait on that.

pst

@stephenw10 said in 25.03.b.20250507.1611 crash:

I don't believe it's actually related to the crash though

I agree, it is a side track, and I doubt it is actually 25.03-related either. I didn't run the gif in 24.11 so I have no history. If the beta config.xml is compatible with 24.11 I could try and load it and see if the pfSense behaviour has changed.

stephenw10

The config version has changed so it will complain if you try to load a 25.03 config into 24.11. It might work. It depends what you actually have configured.

pst

@stephenw10 yes, I loaded the current 25.03 config into 24.11 earlier and there were some warnings and a few errors. Most things seemed to work well though, including the GIF tunnel, but there were a lot more "loop detected" messages than in the beta. They seemed to be triggered by other scenarios than just the beta's "ICMPv6 Destination Unreachable (Address Unreachable)". Not sure how much can be read into that considering the "invalid" config file (and I really don't feel like manually setting up the current config in 24.11!)

stephenw10

Ah, OK. So not a regression then. Yup, I think we can safely say it's unrelated to the crash.

pst

@stephenw10 said in 25.03.b.20250507.1611 crash:

That should be fixed in the next pftop version

I'm happy to report that pftop in 25.03.b.20250515.1415 seems rock solid :)

stephenw10

Awesome, thanks for the feedback!