pfSense Crash "Fatal trap 12: page fault while in kernel mode"

stephenw10

@dovh said in pfSense Crash "Fatal trap 12: page fault while in kernel mode":

Isn't tailscale listening on all interfaces by default?

Yes it is. And that's a problem because I don't believe there's any way to limit what it listens on.

stephenw10

How often are you seeing this? Can you replicate it on demand?

We don't yet have a way to replicate it locally which makes debugging difficult.

stephenw10

Are you actually using IPv6? Otherwise you can try disabling IPv6 link-local addresses as I outlined above.

dovh

@stephenw10 I have had it happen once now. I have a hunch that it happened when a new user joined our Tailnet from a local LAN network behind pfsense and tried accessing some route advertised by the Tailscale package on pfsense. I will try replicating it, but I think it's rather random as it only happened once, and it has been up and running for some months before.

dovh

@stephenw10 Yes, we do use IPv6, so disabling it is not really an option for us.

stephenw10

You're not using IPv6 inside the tailnet though? Or explicitly as a tunnel endpoint?

dovh

@stephenw10 No, but it assigns IPv6 to a client by default and I'm not really sure you can disable that in tailscale.

stephenw10

You should be able to disable that in tailscale but it shouldn't make any difference since it's tailscale itself that's binding to it.

dovh

@stephenw10 Yeah, I don't believe that is possible for the Tailscale network. Is this an issue with pfSense or with the Tailscale package?

stephenw10

It appears to be a bug in FreeBSD/pfSense. It's just that the tailscale daemon hits it more often than anything else because it always binds to every IP address.

Just to confirm, you saw this randomly at runtime? Not during boot?

dovh

@stephenw10 Yes, this was during runtime.

Also, I'm not sure if that could be connected, but quite a lot of users joined the Tailscale network that day when the crash happened. We also advertise several routes from the pfsense in the Tailscale package to allow some users access to internal services but there is nothing else really special in the configuration.

stephenw10

I don't think new users should trigger this since the daemon doesn't bind to its own internal addresses. More likely this was some address change on another interface locally.

The only other thing we have seen recently was ntpd not starting due to an IPv6 local address being marked as duplicate. But as far as I know that can only happen at boot.

dovh

Would there be anything I can provide to help you find the bug that causes these crashes? Or are there some fixes already being implemented that should mitigate this issue? I'm just trying to find out what my options are right now.

stephenw10

Are you able to trigger this reliably at all?

The biggest issue we have fixing it is that we haven't been able to replicate it locally and users who are seeing it do so only sporadically. So getting data is difficult.

dovh

@stephenw10 It seems I can't replicate it on demand; it has to be something very specific happening since I have only seen it crash like this once more since I reported it originally.

stephenw10

One thing we can try here is to enable a full core dump in the event of a panic. In this particular case it may or may not help but there's a chance it would provide all the answers.

Are you able to set that up on the firewall hitting this? If so do you have a SWAP partition configured and how large is it compared to the RAM?

dovh

@stephenw10 Hello, another crash happened today with an almost identical trace.

We have SWAP configured, and it is a little under 50% of our RAM (3.7 GB swap and 8 GB RAM).

How would I go about setting this up to get a full core dump in case of another crash?

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 03
fault virtual address	= 0xb8
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff80f44300
stack pointer	        = 0x28:0xfffffe00c8e65c80
frame pointer	        = 0x28:0xfffffe00c8e65d00
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 95389 (tailscaled)
rdi: ffffffff82d62a40 rsi: 00000000000034ee rdx: 0000000000000000
rcx: 0000000000000000  r8: fffff8024db2f800  r9: 0000000000000000
rax: 0000000000000030 rbx: fffff801b317c700 rbp: fffffe00c8e65d00
r10: 0000000000000000 r11: fffffe008fc8ac60 r12: fffff800599a3040
r13: 00000000000034ee r14: 0000000000000001 r15: fffff8024db2f800
trap number		= 12
panic: page fault
cpuid = 3
time = 1741684274
KDB: enter: panic
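
For anyone comparing several of these traces, here is a minimal Python sketch that pulls a few fields out of the panic text and decodes them. The `parse_trace` helper is hypothetical (it is not part of pfSense or FreeBSD), and the sample is just three lines copied from the trace above. A fault address as small as 0xb8 usually suggests a field read through a NULL pointer, but that is an interpretation, not something the trace states.

```python
from datetime import datetime, timezone

# Three lines copied from the panic output above (tabs as in the original).
TRACE = (
    "fault virtual address\t= 0xb8\n"
    "current process\t\t= 95389 (tailscaled)\n"
    "time = 1741684274\n"
)

def parse_trace(text):
    """Hypothetical helper: split 'key = value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        key = key.strip()
        if sep and key:  # skip lines without '=' or with an empty key
            fields[key] = value.strip()
    return fields

fields = parse_trace(TRACE)

# The fault address is hex; 0xb8 is only 184 bytes above address zero.
fault_addr = int(fields["fault virtual address"], 16)

# 'time' is seconds since the Unix epoch; decode it as UTC.
panic_time = datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc)

print(fields["current process"], hex(fault_addr), panic_time.isoformat())
```

Having the panic time in UTC makes it easier to line the crash up with system logs or Tailscale events from the same moment.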

stephenw10

To enable full core dumps, edit /etc/pfSense-ddb.conf and set the kdb.enter.default script line to:

script kdb.enter.default=bt ; show registers ; dump ; reset

Reboot.
Check: sysctl debug.ddb.scripting.scripts and make sure it shows the above line.
If you can test a panic: sysctl debug.kdb.panic=1
That will immediately panic the kernel and should generate a full core file.

SWAP is usually double the RAM size, so with a smaller partition you might not have enough space for the dump, depending on the memory usage.
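
As a rough illustration of the space question, here is a small Python sketch using the numbers from this thread (3.7 GB swap, 8 GB RAM). It assumes the worst case, where the dump is as large as the installed RAM. That is an assumption for the arithmetic only; the real dump size depends on how much memory is in use at the time of the panic.

```python
# Figures reported earlier in the thread.
RAM_GB = 8.0
SWAP_GB = 3.7

def swap_to_ram_ratio(swap_gb, ram_gb):
    """Return swap size as a fraction of RAM."""
    return swap_gb / ram_gb

def worst_case_shortfall(swap_gb, ram_gb):
    """GB missing if the dump were as large as RAM (assumed worst case)."""
    return max(0.0, ram_gb - swap_gb)

ratio = swap_to_ram_ratio(SWAP_GB, RAM_GB)
shortfall_gb = worst_case_shortfall(SWAP_GB, RAM_GB)

print(f"swap is {ratio:.0%} of RAM; worst-case shortfall {shortfall_gb:.1f} GB")
```

So the dump only fits here if the memory that actually gets written stays below about 3.7 GB.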