Netgate 4100 - Fatal trap 12: page fault while in kernel mode

sjcjonker · May 6, 2023, 6:18 AM

Hi all,

In less than 24 hours I have now had 3 spontaneous reboots of my 4100 running 23.01, which had worked fine for months. I made some minor config changes to the firewall rule base, but nothing major and no other tweaks in the past few days. The first two reboots left nothing I can find in the local system logs or on the remote syslog server. For the third one, the only thing I can find in the logs is:

May 6 08:00:05	kernel		rdi: 0 rsi: 2 rdx: 1
May 6 08:00:05	kernel		current process = 0 (if_io_tqg_1)
May 6 08:00:05	kernel		processor eflags = interrupt enabled, resume, IOPL = 0
May 6 08:00:05	kernel		= DPL 0, pres 1, long 1, def32 0, gran 1
May 6 08:00:05	kernel		code segment = base 0x0, limit 0xfffff, type 0x1b
May 6 08:00:05	kernel		frame pointer = 0x28:0xfffffe0009f8ed80
May 6 08:00:05	kernel		stack pointer = 0x28:0xfffffe0009f8ed80
May 6 08:00:05	kernel		instruction pointer = 0x20:0xffffffff80eb8606
May 6 08:00:05	kernel		fault code = supervisor read data, page not present
May 6 08:00:05	kernel		fault virtual address = 0x460
May 6 08:00:05	kernel		cpuid = 1; apic id = 18
May 6 08:00:05	kernel		Fatal trap 12: page fault while in kernel mode

Any suggestions on where to find more info on the cause and how to proceed with troubleshooting this?

Thanks,
Stijn

sjcjonker · May 6, 2023, 5:31 PM

Replying to my own post with some more details:

Hi all,

In less than 48 hours I have now had 6 spontaneous reboots of my 4100, initially running 23.01 and now 23.05-BETA. The 4100 had worked fine for months.

Over the past days I made only some minor firewall config changes: mostly expanding an alias from a few entries to 20+, as I had to break an IPv4 /24 and an IPv6 /64 out of a bogon range (100.64.0.0/10 and fc00::/7), plus an internal NAT rule forcing all NTP to the NTP clock. That last one is an IPv4 NAT rule, but the alias it uses contained both the hosts' IPv4 and IPv6 addresses. The first couple of reboots left nothing I can find in the local system logs or on the remote syslog server.

This is the reboot cadence so far. The reboots come in pairs about 20 minutes apart, and the pairs repeat every 8 hours:

May  6 00:02:27 edge-mgmt.sjci.nl root[8707]: Bootup complete
May  6 00:02:27 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 00:22:13 edge-mgmt.sjci.nl root[16602]: Bootup complete
May  6 00:22:13 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 08:02:14 edge-mgmt.sjci.nl root[4881]: Bootup complete
May  6 08:02:14 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 08:22:27 edge-mgmt.sjci.nl root[6060]: Bootup complete
May  6 08:22:27 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 16:02:18 edge-mgmt.sjci.nl root[87706]: Bootup complete
May  6 16:02:18 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 16:22:10 edge-mgmt.sjci.nl root[10286]: Bootup complete
May  6 16:22:10 edge-mgmt.sjci.nl kernel: Bootup complete
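
To make the cadence easier to see, the gaps between boots can be computed from the "Bootup complete" lines. This is a minimal sketch, not something from the original post: the function names `boot_times` and `gaps` are made up for illustration, and the year is an assumption because the syslog timestamps do not carry one.

```python
from datetime import datetime

# The "Bootup complete" lines from the remote syslog server, copied from above.
LOG = """\
May  6 00:02:27 edge-mgmt.sjci.nl root[8707]: Bootup complete
May  6 00:02:27 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 00:22:13 edge-mgmt.sjci.nl root[16602]: Bootup complete
May  6 00:22:13 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 08:02:14 edge-mgmt.sjci.nl root[4881]: Bootup complete
May  6 08:02:14 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 08:22:27 edge-mgmt.sjci.nl root[6060]: Bootup complete
May  6 08:22:27 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 16:02:18 edge-mgmt.sjci.nl root[87706]: Bootup complete
May  6 16:02:18 edge-mgmt.sjci.nl kernel: Bootup complete
May  6 16:22:10 edge-mgmt.sjci.nl root[10286]: Bootup complete
May  6 16:22:10 edge-mgmt.sjci.nl kernel: Bootup complete
"""


def boot_times(log_text, year=2023):
    """Return one sorted datetime per distinct 'Bootup complete' timestamp.

    The year is an assumption: classic syslog timestamps omit it.
    """
    stamps = set()
    for line in log_text.splitlines():
        if "Bootup complete" not in line:
            continue
        month, day, clock = line.split()[:3]
        stamps.add(datetime.strptime(f"{year} {month} {day} {clock}",
                                     "%Y %b %d %H:%M:%S"))
    return sorted(stamps)


def gaps(times):
    """Return the time difference between each pair of consecutive boots."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


times = boot_times(LOG)
for when, gap in zip(times[1:], gaps(times)):
    print(f"{when:%H:%M:%S} booted {gap} after the previous boot")
```

The root and kernel lines share a timestamp, so collecting the timestamps in a set leaves one entry per boot.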

Two of the reboots showed this logging via remote syslog:

May  6 08:00:05 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
May  6 08:00:05 edge-mgmt.sjci.nl kernel: cpuid = 1; apic id = 18
May  6 08:00:05 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460
May  6 08:00:05 edge-mgmt.sjci.nl kernel: fault code#011#011= supervisor read data, page not present
May  6 08:00:05 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606
May  6 08:00:05 edge-mgmt.sjci.nl kernel: stack pointer#011        = 0x28:0xfffffe0009f8ed80
May  6 08:00:05 edge-mgmt.sjci.nl kernel: frame pointer#011        = 0x28:0xfffffe0009f8ed80
May  6 08:00:05 edge-mgmt.sjci.nl kernel: code segment#011#011= base 0x0, limit 0xfffff, type 0x1b
May  6 08:00:05 edge-mgmt.sjci.nl kernel: #011#011#011= DPL 0, pres 1, long 1, def32 0, gran 1
May  6 08:00:05 edge-mgmt.sjci.nl kernel: processor eflags#011= interrupt enabled, resume, IOPL = 0
May  6 08:00:05 edge-mgmt.sjci.nl kernel: current process#011#011= 0 (if_io_tqg_1)
May  6 08:00:05 edge-mgmt.sjci.nl kernel: rdi:                0 rsi:                2 rdx:                1
May  6 08:00:05 edge-mgmt.sjci.nl kernel: rcx:                0  r8:                0  r9: 2b94cbbeab72e4cd
May  6 08:00:05 edge-mgmt.sjci.nl kernel: rax:                2 rbx:                0 rbp: fffffe0009f8ed80


May  6 08:19:58 edge-mgmt.sjci.nl kernel:
May  6 08:19:58 edge-mgmt.sjci.nl kernel:
May  6 08:19:58 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
May  6 08:19:58 edge-mgmt.sjci.nl kernel: cpuid = 1; apic id = 18
May  6 08:19:58 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460
May  6 08:19:58 edge-mgmt.sjci.nl kernel: fault code#011#011= supervisor read data, page not present
May  6 08:19:58 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606

May  6 16:00:06 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
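
The `#011` sequences in these lines are tabs that the syslog receiver stored as a `#` followed by the octal character code. A small sketch for turning them back into tabs and pulling the `name = value` fields out of a trap record follows; the function names are hypothetical, and it assumes the receiver escapes only control characters this way.

```python
import re


def unescape_syslog(text):
    """Replace '#ooo' octal escapes for control characters (e.g. '#011' = TAB)."""
    return re.sub(r"#(0[0-3][0-7])", lambda m: chr(int(m.group(1), 8)), text)


def parse_trap(lines):
    """Collect 'name = value' fields from kernel trap lines into a dict.

    Lines without '=' (the trap headline, the register dump) and
    continuation lines with an empty name are skipped.
    """
    fields = {}
    for line in lines:
        message = unescape_syslog(line.split("kernel:", 1)[-1]).strip()
        name, sep, value = message.partition("=")
        name = " ".join(name.split())  # collapse tabs and runs of spaces
        if sep and name and name not in fields:
            fields[name] = value.strip()
    return fields


sample = [
    "May  6 08:00:05 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode",
    "May  6 08:00:05 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460",
    "May  6 08:00:05 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606",
    "May  6 08:00:05 edge-mgmt.sjci.nl kernel: #011#011#011= DPL 0, pres 1, long 1, def32 0, gran 1",
    "May  6 08:00:05 edge-mgmt.sjci.nl kernel: current process#011#011= 0 (if_io_tqg_1)",
]
trap = parse_trap(sample)
print(trap["fault virtual address"], trap["instruction pointer"], trap["current process"])
```

Comparing the extracted fault address and instruction pointer across crashes is a quick way to check whether every reboot hits the same code path, as the 08:00:05 and 08:19:58 records here suggest.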

What I did was connect a Raspberry Pi to the USB console of the 4100. I hope this works the same as with serial: the output is recorded in a log file on the Raspberry Pi, which will hopefully capture more information.
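
One way to do this kind of console capture is sketched below. It is only an illustration, not the setup actually used: the device path `/dev/ttyUSB0`, the log file name, and the idea that the port speed was already set beforehand (for example with `stty`) are all assumptions, and `capture` is a hypothetical helper name.

```python
from datetime import datetime


def capture(console, logfile, now=datetime.now):
    """Copy console output line by line into logfile with a timestamp prefix.

    `console` is any binary stream with readline(); `logfile` is a text
    stream. The loop ends when readline() returns no more data.
    """
    for raw in iter(console.readline, b""):
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        logfile.write(f"{now().isoformat(timespec='seconds')} {text}\n")
        logfile.flush()  # hand each line to the OS right away


# Hypothetical usage on the Raspberry Pi (paths are assumptions):
#
#   with open("/dev/ttyUSB0", "rb", buffering=0) as console, \
#           open("console.log", "a") as logfile:
#       capture(console, logfile)
```

Timestamping on the capturing side matters here because the panic output itself carries no time, and the firewall's own clock is gone once it reboots.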

Sincerely hoping this is not a HW failure.

Thanks,
Stijn

mfld · May 7, 2023, 3:25 AM

@sjcjonker

I had this.

And it was resolved in 23.05 beta via #14077

I blamed coreboot, then the hardware, but everything passed all checks, and it started happening on some other devices that were completely different hardware and had AMI BIOS. What they all had in common was that they were doing lots of IPv6 stuff :)

Take a look at Redmine #14077 - might be your thing.

sjcjonker · May 7, 2023, 9:33 AM

Hi @mfld,

Thanks for your feedback; it might indeed be the same issue. That said (fingers crossed), it has been stable since I bumped the version to the 23.05 Beta (18+ hours).

I am indeed doing a reasonable amount of IPv6, so it is possibly related. I hope it doesn't, but if it crashes again I should have a full dump, as the console is still connected to the Raspberry Pi to capture the output.

stephenw10 · May 9, 2023, 5:44 AM

You should see a crash report on the dashboard after it has rebooted. That will have the backtrace that would confirm you're hitting that same issue.

sjcjonker · May 9, 2023, 5:44 AM

Hi @stephenw10,

Thanks for your reply, but none of the crashes seem to have produced a crash report. Going by "Installs without Swap Space" and the output below, I believe the 4100 doesn't have any swap:

[23.05-BETA][sjcjonker@edge.sjci.nl]/home/sjcjonker: sudo swapinfo
Password:
Device          512-blocks     Used    Avail Capacity

Nothing is recorded in /var/crash either:

[23.05-BETA][sjcjonker@edge.sjci.nl]/home/sjcjonker: sudo ls -la /var/crash
total 19
drwxr-x---   2 root  wheel   3 May  6 16:33 .
drwxr-xr-x  29 root  wheel  29 May  3 09:02 ..
-rw-r--r--   1 root  wheel   5 May  3 09:02 minfree

That said, since the upgrade to 23.05 it has been stable for 3 days now, so fingers crossed on my side. If it does crash, I should have the console logs this time.

Stijn

Gertjan · May 9, 2023, 11:25 AM

@sjcjonker said in Netgate 4100 - Fatal trap 12: page fault while in kernel mode:

crash

I have a 4100 and, when I got it preloaded with 22.05 (?), the swap was 'not there'.
Have a look here: swap not listed? [solved]

sjcjonker · May 9, 2023, 1:18 PM

Hi @gertjan,

Thanks, so now I have swap :-) I just edited /etc/fstab to use the actual swap partition instead of the GPT-ID, which I'm guessing came out of the installer (image).

At least I can decommission the Raspberry Pi doing console logging.

# cat /etc/fstab
# Device                Mountpoint      FStype  Options         Dump    Pass#
/dev/msdosfs/EFISYS     /boot/efi       msdosfs rw,noatime,noauto       0       0
/dev/mmcsd0p3		none	swap	sw		0	0
# swapinfo
Device          512-blocks     Used    Avail Capacity
/dev/mmcsd0p3      1336520        0  1336520     0%
#
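
Since a kernel crash dump on FreeBSD-based systems is normally written to the swap device and only extracted into /var/crash at the next boot, it is worth re-checking after upgrades that swap is still active. A small sketch that parses `swapinfo` output for this purpose follows; `swap_devices` is a hypothetical helper name, and the sample text is the output shown above.

```python
def swap_devices(swapinfo_output):
    """Return {device: size in 512-byte blocks} from swapinfo's text output.

    The header line and any totals line are ignored because they do not
    start with /dev/.
    """
    devices = {}
    for line in swapinfo_output.splitlines():
        parts = line.split()
        if len(parts) >= 5 and parts[0].startswith("/dev/"):
            devices[parts[0]] = int(parts[1])
    return devices


before = "Device          512-blocks     Used    Avail Capacity\n"
after = before + "/dev/mmcsd0p3      1336520        0  1336520     0%\n"

for label, text in (("before fstab fix", before), ("after fstab fix", after)):
    found = swap_devices(text)
    print(label, "->", found if found else "no swap, crash dumps cannot be saved")
```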

Stijn

Gertjan · May 9, 2023, 1:18 PM

@sjcjonker said in Netgate 4100 - Fatal trap 12: page fault while in kernel mode:

At least I can decommission the Raspberry Pi doing console logging.

Wait

A syslogger is always a nice thing to have. I'm using one: my NAS.
When things go downhill, chances are high that the logging rate goes up.
And when you finally take a look at the "what when who where", you'll notice that the interesting events were just rotated into /dev/null.