Netgate 4100 - Fatal trap 12: page fault while in kernel mode
-
Hi all,
In less than 24 hours I have now had 3 spontaneous reboots of my 4100 running 23.01, which had worked fine for months. I made some minor config changes to the firewall rule base, but nothing major and no other tweaks in the past few days. The first two reboots left nothing I can find in the local system logs or on the remote syslog server. For the third one, the only thing I can find in the logs is:
May 6 08:00:05 kernel rdi: 0 rsi: 2 rdx: 1
May 6 08:00:05 kernel current process = 0 (if_io_tqg_1)
May 6 08:00:05 kernel processor eflags = interrupt enabled, resume, IOPL = 0
May 6 08:00:05 kernel = DPL 0, pres 1, long 1, def32 0, gran 1
May 6 08:00:05 kernel code segment = base 0x0, limit 0xfffff, type 0x1b
May 6 08:00:05 kernel frame pointer = 0x28:0xfffffe0009f8ed80
May 6 08:00:05 kernel stack pointer = 0x28:0xfffffe0009f8ed80
May 6 08:00:05 kernel instruction pointer = 0x20:0xffffffff80eb8606
May 6 08:00:05 kernel fault code = supervisor read data, page not present
May 6 08:00:05 kernel fault virtual address = 0x460
May 6 08:00:05 kernel cpuid = 1; apic id = 18
May 6 08:00:05 kernel Fatal trap 12: page fault while in kernel mode
Any suggestions where to find more info on the cause and where to proceed troubleshooting this?
Thanks,
Stijn -
Reply to my own post, some more details:
Netgate 4100 - Fatal trap 12: page fault while in kernel mode
Hi all,
In less than 48 hours I have now had 6 spontaneous reboots of my 4100, initially running 23.01 and now 23.05-BETA. The 4100 had worked fine for months.
Over the past few days I made only some minor firewall config changes. Mostly I expanded an alias from a few entries to 20+, because I had to break an IPv4 /24 and an IPv6 /64 out of a bogon range (100.64.0.0/10 and fc00::/7), and I added an internal NAT rule forcing all NTP to the NTP clock. The latter was an IPv4 NAT rule, but the alias contained the hosts' IPv4 and IPv6 addresses. The first couple of reboots left nothing I can find in the local system logs or on the remote syslog server.
This is the reboot cadence so far; it seems to repeat every 8 hours:
May 6 00:02:27 edge-mgmt.sjci.nl root[8707]: Bootup complete
May 6 00:02:27 edge-mgmt.sjci.nl kernel: Bootup complete
May 6 00:22:13 edge-mgmt.sjci.nl root[16602]: Bootup complete
May 6 00:22:13 edge-mgmt.sjci.nl kernel: Bootup complete
May 6 08:02:14 edge-mgmt.sjci.nl root[4881]: Bootup complete
May 6 08:02:14 edge-mgmt.sjci.nl kernel: Bootup complete
May 6 08:22:27 edge-mgmt.sjci.nl root[6060]: Bootup complete
May 6 08:22:27 edge-mgmt.sjci.nl kernel: Bootup complete
May 6 16:02:18 edge-mgmt.sjci.nl root[87706]: Bootup complete
May 6 16:02:18 edge-mgmt.sjci.nl kernel: Bootup complete
May 6 16:22:10 edge-mgmt.sjci.nl root[10286]: Bootup complete
May 6 16:22:10 edge-mgmt.sjci.nl kernel: Bootup complete
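To make the cadence easier to see, here is a minimal sketch that computes the gaps between the "Bootup complete" timestamps listed above. The year is an assumption (syslog does not record it), and the helper names are made up for illustration.

```python
from datetime import datetime

# Timestamps copied from the "Bootup complete" lines above (one per reboot).
BOOT_TIMES = [
    "May 6 00:02:27",
    "May 6 00:22:13",
    "May 6 08:02:14",
    "May 6 08:22:27",
    "May 6 16:02:18",
    "May 6 16:22:10",
]


def parse(stamp, year=2023):
    """Parse a syslog-style timestamp; the year is assumed, syslog omits it."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def gaps_in_minutes(stamps):
    """Return the gap between each pair of consecutive boots, rounded to minutes."""
    times = [parse(s) for s in stamps]
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]


gaps = gaps_in_minutes(BOOT_TIMES)
print(gaps)
```

On this data the gaps alternate between roughly 20 minutes and roughly 7 hours 40 minutes, so the reboots appear to come in pairs, with each pair starting about 8 hours after the previous one.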
Two of the reboots showed this logging via remote syslog:
May 6 08:00:05 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
May 6 08:00:05 edge-mgmt.sjci.nl kernel: cpuid = 1; apic id = 18
May 6 08:00:05 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460
May 6 08:00:05 edge-mgmt.sjci.nl kernel: fault code#011#011= supervisor read data, page not present
May 6 08:00:05 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606
May 6 08:00:05 edge-mgmt.sjci.nl kernel: stack pointer#011 = 0x28:0xfffffe0009f8ed80
May 6 08:00:05 edge-mgmt.sjci.nl kernel: frame pointer#011 = 0x28:0xfffffe0009f8ed80
May 6 08:00:05 edge-mgmt.sjci.nl kernel: code segment#011#011= base 0x0, limit 0xfffff, type 0x1b
May 6 08:00:05 edge-mgmt.sjci.nl kernel: #011#011#011= DPL 0, pres 1, long 1, def32 0, gran 1
May 6 08:00:05 edge-mgmt.sjci.nl kernel: processor eflags#011= interrupt enabled, resume, IOPL = 0
May 6 08:00:05 edge-mgmt.sjci.nl kernel: current process#011#011= 0 (if_io_tqg_1)
May 6 08:00:05 edge-mgmt.sjci.nl kernel: rdi: 0 rsi: 2 rdx: 1
May 6 08:00:05 edge-mgmt.sjci.nl kernel: rcx: 0 r8: 0 r9: 2b94cbbeab72e4cd
May 6 08:00:05 edge-mgmt.sjci.nl kernel: rax: 2 rbx: 0 rbp: fffffe0009f8ed80
May 6 08:19:58 edge-mgmt.sjci.nl kernel:
May 6 08:19:58 edge-mgmt.sjci.nl kernel:
May 6 08:19:58 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
May 6 08:19:58 edge-mgmt.sjci.nl kernel: cpuid = 1; apic id = 18
May 6 08:19:58 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460
May 6 08:19:58 edge-mgmt.sjci.nl kernel: fault code#011#011= supervisor read data, page not present
May 6 08:19:58 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606
May 6 16:00:06 edge-mgmt.sjci.nl kernel: Fatal trap 12: page fault while in kernel mode
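The `#011` sequences in the remote syslog output appear to be octal escapes for the tab character, as some syslog daemons write control characters that way. As a rough sketch, assuming that escaping scheme, the trap fields can be pulled out of those lines like this; the function names are hypothetical and the sample lines are copied from the output above.

```python
import re

# Sample lines copied from the remote syslog output above.
SAMPLE = [
    "May 6 08:00:05 edge-mgmt.sjci.nl kernel: fault virtual address#011= 0x460",
    "May 6 08:00:05 edge-mgmt.sjci.nl kernel: fault code#011#011= supervisor read data, page not present",
    "May 6 08:00:05 edge-mgmt.sjci.nl kernel: instruction pointer#011= 0x20:0xffffffff80eb8606",
    "May 6 08:00:05 edge-mgmt.sjci.nl kernel: current process#011#011= 0 (if_io_tqg_1)",
]

# Assumption: '#' followed by three octal digits encodes one control character.
ESCAPE = re.compile(r"#([0-7]{3})")


def unescape(line):
    """Turn '#NNN' octal escapes back into the characters they stand for."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), line)


def trap_fields(lines):
    """Collect 'name = value' pairs from kernel trap lines into a dict."""
    fields = {}
    for line in lines:
        # Drop the syslog prefix (timestamp, host, tag) before the message.
        message = unescape(line).split("kernel: ", 1)[-1]
        if "=" in message:
            key, value = message.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


fields = trap_fields(SAMPLE)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Having the fault address, instruction pointer, and current process as separate fields makes it easier to compare one crash against another, or against the details in a bug report.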
What I did is connect a Raspberry Pi to the USB console of the 4100. I hope this works the same as with serial: the output is recorded in a log file on the Raspberry Pi, which will hopefully capture more information.
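For anyone wanting to do the same, a console logger of this kind can be sketched in a few lines. This is not the setup used here, just an illustration: it copies lines from any binary stream to a text log and prefixes each with a timestamp. The demo feeds it an in-memory stream; the function name and the `clock` parameter are made up for the example.

```python
import io
from datetime import datetime


def log_console(source, sink, clock=datetime.now):
    """Copy lines from a binary console stream to a text log with timestamps."""
    count = 0
    for raw in source:
        # Console output may contain stray bytes, so decode leniently.
        text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
        sink.write(f"{clock().isoformat(timespec='seconds')} {text}\n")
        sink.flush()  # flush per line so little is lost if the logger dies
        count += 1
    return count


# Demo: an in-memory stream stands in for the serial device.
demo_in = io.BytesIO(
    b"Fatal trap 12: page fault while in kernel mode\r\n"
    b"cpuid = 1; apic id = 18\r\n"
)
demo_out = io.StringIO()
lines_written = log_console(
    demo_in, demo_out, clock=lambda: datetime(2023, 5, 6, 8, 0, 5)
)
print(demo_out.getvalue(), end="")
```

On a real logger the source would be the console device opened in binary mode and the sink a file opened for appending. The device path (something like `/dev/ttyUSB0`) and the port settings are assumptions that depend on the hardware and must be configured separately.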
Sincerely hoping this is not a HW failure.
Thanks,
Stijn -
And it was resolved in 23.05 beta via #14077
I blamed coreboot, then the hardware, but everything passed all checks, and it started happening on some other devices that were completely different hardware and had AMI BIOS. What they all had in common was that they were doing lots of IPv6 stuff :)
Take a look in redmine #14077 - might be your thing.
-
Hi @mfld,
Thanks for your feedback, it might indeed be the same issue. That said (fingers crossed), it has been stable since I bumped the version to the 23.05 Beta (18+ hours).
I am indeed doing a reasonable amount of IPv6, so it is possibly related. I hope it doesn't, but if it crashes I should have a full dump, as the console is still connected to the Raspberry Pi to capture the output.
-
You should see a crash report shown on the dashboard after it's rebooted. That will have the backtrace that would confirm you're hitting that same issue.
-
Hi @stephenw10,
Thanks for your reply, but none of the crashes seem to have triggered this. According to "Installs without Swap Space" and the output below, I believe the 4100 doesn't have any swap:
[23.05-BETA][sjcjonker@edge.sjci.nl]/home/sjcjonker: sudo swapinfo
Password:
Device 512-blocks Used Avail Capacity
Nor is anything recorded in /var/crash:
[23.05-BETA][sjcjonker@edge.sjci.nl]/home/sjcjonker: sudo ls -la /var/crash
total 19
drwxr-x--- 2 root wheel 3 May 6 16:33 .
drwxr-xr-x 29 root wheel 29 May 3 09:02 ..
-rw-r--r-- 1 root wheel 5 May 3 09:02 minfree
That said, since the upgrade to 23.05 it has been stable for 3 days now, so fingers crossed on my side. But if it does crash, I should have the console logs this time.
Stijn
-
@sjcjonker said in Netgate 4100 - Fatal trap 12: page fault while in kernel mode:
crash
I have a 4100, and when I got it, preloaded with 22.05 (?), the swap was 'not there'.
Have a look here : swap not listed? [solved] -
Hi @gertjan,
Thanks, so now I have swap :-) I just edited /etc/fstab to use the actual swap partition instead of the GPT ID, which I'm guessing came from the installer (image).
At least I can decommission the Raspberry-PI doing console logging.
# cat /etc/fstab
# Device Mountpoint FStype Options Dump Pass#
/dev/msdosfs/EFISYS /boot/efi msdosfs rw,noatime,noauto 0 0
/dev/mmcsd0p3 none swap sw 0 0
# swapinfo
Device 512-blocks Used Avail Capacity
/dev/mmcsd0p3 1336520 0 1336520 0%
#
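The difference between the two `swapinfo` outputs in this thread (header only versus one device row) is what matters here: with no swap device listed, there is presumably nowhere for a crash dump to be written, which would explain the empty /var/crash. A small sketch of checking that from captured output follows; the function name is hypothetical and the two sample strings are taken from the outputs in this thread.

```python
def swap_devices(swapinfo_output):
    """Return (device, 512-byte blocks) pairs; an empty list means no swap."""
    devices = []
    for line in swapinfo_output.splitlines():
        parts = line.split()
        # Skip the header row and blank lines; data rows start with a /dev path.
        if len(parts) >= 2 and parts[0].startswith("/dev/"):
            devices.append((parts[0], int(parts[1])))
    return devices


# Output before the fstab fix: header only, no devices.
BEFORE = "Device 512-blocks Used Avail Capacity\n"
# Output after the fstab fix: one swap partition listed.
AFTER = (
    "Device 512-blocks Used Avail Capacity\n"
    "/dev/mmcsd0p3 1336520 0 1336520 0%\n"
)

for label, text in (("before", BEFORE), ("after", AFTER)):
    found = swap_devices(text)
    print(label, found if found else "no swap configured")
```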
Stijn
-
@sjcjonker said in Netgate 4100 - Fatal trap 12: page fault while in kernel mode:
At least I can decommission the Raspberry-PI doing console logging.
Wait
A syslogger is always a nice thing to have. I'm using one : my NAS.
When things go downhill, chances are that logging accelerates.
And when you finally take a look at the "what, when, who, where", you'll notice that the interesting events have already been rotated into /dev/null