Periodic Panic on CE 2.8.0 - DHCP6 Client (I Think)

davefinster

Hi All

First time posting in the Netgate forums, so let me know if I've done this in the wrong place. For as long as I can remember, my pfSense firewall has been panicking periodically. It runs on a Lenovo M46KT27A (somewhat overkill, sure) in which I've installed a 2 SFP port Intel X520 with the following things plugged in:

E.C.I. NETWORKS PN: ENXGSFPPOMACV2 - SFP/SFP+/SFP28 10G Base-LR (SC)
Ubiquiti Inc. PN: DAC-SFP10-0.5M SN: BA22093023861 DATE: 2022-09-26 - SFP/SFP+/SFP28 1X Copper Passive (No separable connector)

The starting point seems to be the following, which looks DHCP6 related. Beyond that, I'm not familiar with debugging these things. It's been happening regularly (at least once per week) for as long as I can remember, and I've only now decided to dig into it.

db:0:kdb.enter.default>  run pfs
db:1:pfs> bt
Tracing pid 52781 tid 100414 td 0xfffff800126df740
kdb_enter() at kdb_enter+0x33/frame 0xfffffe00d3de67f0
panic() at panic+0x43/frame 0xfffffe00d3de6850
trap_fatal() at trap_fatal+0x40b/frame 0xfffffe00d3de68b0
trap_pfault() at trap_pfault+0x46/frame 0xfffffe00d3de6900
calltrap() at calltrap+0x8/frame 0xfffffe00d3de6900
--- trap 0xc, rip = 0xffffffff80f5b213, rsp = 0xfffffe00d3de69d0, rbp = 0xfffffe00d3de6a20 ---
in6_unlink_ifa() at in6_unlink_ifa+0x53/frame 0xfffffe00d3de6a20
in6_purgeaddr() at in6_purgeaddr+0x366/frame 0xfffffe00d3de6b40
in6_purgeifaddr() at in6_purgeifaddr+0x13/frame 0xfffffe00d3de6b60
in6_control_ioctl() at in6_control_ioctl+0x5e1/frame 0xfffffe00d3de6bd0
ifioctl() at ifioctl+0x8b0/frame 0xfffffe00d3de6cd0
kern_ioctl() at kern_ioctl+0x255/frame 0xfffffe00d3de6d40
sys_ioctl() at sys_ioctl+0x117/frame 0xfffffe00d3de6e00
amd64_syscall() at amd64_syscall+0x115/frame 0xfffffe00d3de6f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d3de6f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x822a2bcca, rsp = 0x820280e58, rbp = 0x820280f50 ---
db:1:pfs>  show registers
cs                        0x20
ds                        0x3b
es                        0x3b
fs                        0x13
gs                        0x1b
ss                        0x28
rax                       0x12
rcx         0xb671471b9956e201
rdx         0xfffffe00d3de6310
rbx                      0x100
rsp         0xfffffe00d3de66c8
rbp         0xfffffe00d3de67f0
rsi         0xfffffe00d3de6580
rdi         0xffffffff82740878  vt_conswindow+0x10
r8                        0x3c
r9                        0x3c
r10                          0
r11                          0
r12                          0
r13                          0
r14         0xffffffff8145d99f
r15         0xfffff800126df740
rip         0xffffffff80d457b3  kdb_enter+0x33
rflags                    0x82
kdb_enter+0x33: movq    $0,0x1d76cd2(%rip)
db:1:pfs>  show pcpu
cpuid        = 6
dynamic pcpu = 0xfffffe009b4325c0
curthread    = 0xfffff800126df740: pid 52781 tid 100414 critnest 1 "dhcp6c"
curpcb       = 0xfffff800126dfc60
fpcurthread  = 0xfffff800126df740: pid 52781 "dhcp6c"
idlethread   = 0xfffff800027e5740: tid 100009 "idle: cpu6"
self         = 0xffffffff83a16000
curpmap      = 0xfffff800126f0358
tssp         = 0xffffffff83a16384
rsp0         = 0xfffffe00d3de7000
kcr3         = 0xffffffffffffffff
ucr3         = 0xffffffffffffffff
scr3         = 0x0
gs32p        = 0xffffffff83a16404
ldt          = 0xffffffff83a16444
tss          = 0xffffffff83a16434
curvnet      = 0xfffff80001288840
db:1:pfs>  run lockinfo
db:2:lockinfo> show locks
No such command; use "help" to list available commands
db:2:lockinfo>  show alllocks
No such command; use "help" to list available commands
db:2:lockinfo>  show lockedvnods
Locked vnodes

info.0.txt
info.1.txt
textdump.0.tar
textdump.1.tar

davefinster

As a follow-up, I've noticed that the gateway monitors are tripping fairly regularly on my AT&T Fiber IPv6 connection. That is probably what causes the DHCPv6 client to jump into action, which occasionally leads to this panic. I've found similar issues from older releases where there was a race between interface reconfiguration and disablement.

I've stopped the IPv6 monitor from taking action (it still logs), so I'll see if that eliminates the panics. The fact that it can happen at all is still concerning.

stephenw10

@davefinster said in Periodic Panic on CE 2.8.0 - DHCP6 Client (I Think):

in6_unlink_ifa

Hmm, that looks like this: https://redmine.pfsense.org/issues/14164 But that should be resolved in 2.8.0.

In both crashes the log is spammed by something trying to use a link-local IPv6 address for public routing, which is not allowed.

I would guess it's an issue with the tailscale interface though since that's the only other thing showing much activity. That has been shown to cause the related bug: https://redmine.pfsense.org/issues/14431

I was never able to replicate that locally but it could be a timing issue that only a fast WAN connection hits. I see you're using ixl NICs, what speed is your WAN that tailscale is using?

davefinster

I see you're using ixl NICs, what speed is your WAN that tailscale is using?

I've got 5Gbps/5Gbps through AT&T Fiber using a WAS-110 in one of the SFP ports as the GPON endpoint. This SFP handles all the network/GPON-specific bits, so pfSense just performs DHCP(v6) over the interface. That is my WAN side; on the LAN side it's just a 10Gbps Twinax into an aggregation switch.

To at least prevent the issue from happening, I've been studying the prefix delegation expectations of the AT&T service. I've now set the DHCPv6 client on the WAN interface to request only a prefix delegation and not an address for itself. When it did ask for an address, the /128 provided by AT&T was non-routable anyway. Requesting it also seemed to cause significant instability in IPv6 networking: the gateway pinging and v6 routing in general would periodically break, which presumably opened up an opportunity for this race. With the /128 no longer requested, the gateway pinger for v6 uses only its link-local address.

The end result is that the WAN interface only ends up with its link-local address, and everything IPv6 related that originates from the router (e.g. Tailscale) now uses the router's IP from the PD'd IPv6 range on the LAN interface, which, unlike the /128 that AT&T provides, is routable. Since making these changes I've not had any issues for 2 days.
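
For anyone wanting to check which scope a given interface address falls into, here is a minimal Python sketch using the standard `ipaddress` module. The `describe` helper and both sample addresses are made up for illustration and are not taken from my config or the crash logs. Note that this only classifies the address scope; it cannot tell you whether the ISP actually routes a given global address, which is the problem with the AT&T /128.

```python
import ipaddress

def describe(addr: str) -> str:
    """Return a coarse scope label for an IPv6 address (hypothetical helper)."""
    ip = ipaddress.IPv6Address(addr)
    if ip.is_link_local:
        return "link-local"   # fe80::/10, only valid on the local link
    if ip.is_global:
        return "global"       # globally scoped address space
    return "other"            # e.g. unique local (fc00::/7), loopback

# Hypothetical sample addresses, chosen only to show the two cases.
print(describe("fe80::1"))        # the kind of address the WAN now keeps
print(describe("2600:1700::1"))   # an address in globally scoped space
```

An address that reports as link-local here cannot be used as a source for traffic leaving the local link, which is why router-originated traffic has to come from the LAN-side address instead.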

stephenw10

Ah, interesting. Yup, AT&T expect to see their own router at the end of GPON/XPON, and pfSense could well be doing something that doesn't play well. Obviously it still shouldn't panic like that.

The panic appears to be caused by a race condition during removal of an IPv6 address. If the WAN was renewing its lease repeatedly, hitting that race seems likely.