Crash report
-
I have run pfSense on a relatively high-spec Supermicro server for years, and shortly after testing with IPv6, crash reports started being generated. Luckily, the machine came back online by itself. I upgraded to the latest version weeks ago.
What could be the cause of this, and how can I find out? I have removed the IPv6 config, but I got another crash report shortly after that.
Crash report begins.
amd64
12.2-STABLE
FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense

Crash report details:
No PHP errors found.
Filename: /var/crash/info.0
Dump header from device: /dev/mirror/pfSenseMirrorp3
Architecture: amd64
Architecture Version: 4
Dump Length: 157696
Blocksize: 512
Compression: none
Dumptime: Sat Apr 10 04:29:57 2021
Hostname:
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense
Panic String: page fault
Dump Parity: 2148879230
Bounds: 0
Dump Status: good
<6>arp: moved from ac:1f:6b:6f:f2:8a to ac:1f:6b:6f:f2:8b on lagg0
kernel trap 12 with interrupts disabled

Fatal trap 12: page fault while in kernel mode
cpuid = 3; apic id = 06
fault virtual address = 0x20
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80e024b5
stack pointer = 0x28:0xfffffe00005693d0
frame pointer = 0x28:0xfffffe00005693e0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = resume, IOPL = 0
current process = 0 (if_io_tqg_3)
trap number = 12
panic: page fault
cpuid = 3
time = 1618021797
KDB: enter: panic
-
Do you have the backtrace from that report? The section at the top labelled:
bt>
Steve
-
@stephenw10 This happened again just now. I can't see anything labelled bt>.
It's so strange that this started happening now; it has been running stable for years.
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x20
fault code = supervisor read data, page not present
-
Can we see the full crash report then?
-
@stephenw10 Only issue is that it has a lot of IPs (including public ones) in it, so I didn't want to post it here. But I removed the sensitive stuff; here is the first dump file:
And here is 2nd:
Dump header from device: /dev/mirror/pfSenseMirrorp3
Architecture: amd64
Architecture Version: 4
Dump Length: 157696
Blocksize: 512
Compression: none
Dumptime: Sun Apr 25 00:14:21 2021
Hostname: XX
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense
Panic String: page fault
Dump Parity: 411090558
Bounds: 0
Dump Status: good
-
Ok, so we can see the backtrace in that here:
db:0:kdb.enter.default> bt
Tracing pid 0 tid 100046 td 0xfffff8000461d740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000055a140
vpanic() at vpanic+0x197/frame 0xfffffe000055a190
panic() at panic+0x43/frame 0xfffffe000055a1f0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000055a250
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000055a2a0
trap() at trap+0x286/frame 0xfffffe000055a3b0
calltrap() at calltrap+0x8/frame 0xfffffe000055a3b0
--- trap 0xc, rip = 0xffffffff80e024b5, rsp = 0xfffffe000055a480, rbp = 0xfffffe000055a490 ---
turnstile_broadcast() at turnstile_broadcast+0x45/frame 0xfffffe000055a490
__mtx_unlock_sleep() at __mtx_unlock_sleep+0x7f/frame 0xfffffe000055a4c0
pf_find_state() at pf_find_state+0x21c/frame 0xfffffe000055a500
pf_test_state_tcp() at pf_test_state_tcp+0x1b6/frame 0xfffffe000055a620
pf_test() at pf_test+0x1f64/frame 0xfffffe000055a870
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe000055a890
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe000055a930
ip_input() at ip_input+0x475/frame 0xfffffe000055a9e0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000055aa30
ether_demux() at ether_demux+0x16a/frame 0xfffffe000055aa60
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe000055aac0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000055ab10
ether_input() at ether_input+0x4b/frame 0xfffffe000055ab40
iflib_rxeof() at iflib_rxeof+0xae6/frame 0xfffffe000055ac20
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000055ac60
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe000055acc0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe000055acf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000055ad30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000055ad30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Looks like the message buffer has been removed.
The first thing to do is compare that backtrace with one from another crash report. If they are all identical or very similar it's probably a software issue at least.
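That comparison can be scripted as a quick sanity check. The sketch below normalizes two backtraces by stripping the offsets and frame pointers (which differ between otherwise identical crashes) and diffs only the call chains; the two inline samples and `/tmp` paths are illustrative, since the real backtrace sits in the text dump under /var/crash:

```shell
#!/bin/sh
# Sketch: compare the call chains of two panic backtraces, ignoring the
# "+0x.../frame 0x..." addresses that vary between otherwise identical crashes.
# Two short inline samples stand in for the real /var/crash text dumps here.
cat > /tmp/bt_a.txt <<'EOF'
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000055a140
vpanic() at vpanic+0x197/frame 0xfffffe000055a190
pf_find_state() at pf_find_state+0x21c/frame 0xfffffe000055a500
EOF
cat > /tmp/bt_b.txt <<'EOF'
kdb_enter() at kdb_enter+0x37/frame 0xfffffe0000569140
vpanic() at vpanic+0x197/frame 0xfffffe0000569190
pf_find_state() at pf_find_state+0x21c/frame 0xfffffe0000569500
EOF

# Strip offsets and frame pointers so only the function names are compared.
sed -E 's|\+0x[0-9a-f]+/frame 0x[0-9a-f]+||' /tmp/bt_a.txt > /tmp/bt_a.norm
sed -E 's|\+0x[0-9a-f]+/frame 0x[0-9a-f]+||' /tmp/bt_b.txt > /tmp/bt_b.norm

if diff /tmp/bt_a.norm /tmp/bt_b.norm >/dev/null; then
  echo "backtraces match"
else
  echo "backtraces differ"
fi
```

If the normalized traces match across several crash reports, that points toward a software bug rather than failing hardware.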
Steve
-
That is the dump from two weeks earlier.
Dump header from device: /dev/mirror/pfSenseMirrorp3
Architecture: amd64
Architecture Version: 4
Dump Length: 157696
Blocksize: 512
Compression: none
Dumptime: Sat Apr 10 04:29:57 2021
Hostname:
Magic: FreeBSD Text Dump
Version String: FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense
Panic String: page fault
Dump Parity: 2148879230
Bounds: 0
Dump Status: good
-
Ok, so virtually identical.
That is 2.5.1 yes? It looks a lot like an old crash that should be fixed in 2.5.1.
Steve
-
@stephenw10 2.5.1-RELEASE (amd64)
built on Mon Apr 12 07:50:14 EDT 2021
I was hoping it was just something fixed in 2.5.1, so I upgraded (from 2.5.0) just after the previous report (2 days later). So this is the 2nd crash, last night on 2.5.1.
-
Hmm, are you able to test a 2.6 snapshot?
Though I'm not aware of anything specific that has gone in to address that.
Steve
-
@stephenw10 It is in production, so it's a bit scary to upgrade since it seems to work for most usage (except one LAN network, but I'm not sure if that's related). I have a 2nd machine with the same config standing by offline, ready for years now, so in theory I can just fire it up and load the backup when I'm on-site, but...
In the log, there is weird stuff like the below, many hundreds of entries. It is correct that it is not a host; it is an alias for hosts and ports that are valid. Maybe this causes overload? I haven't changed the aliases for months; these started appearing just now. It doesn't seem to cause any problems, but it's strange that it suggests the alias names are hosts.
-
It shouldn't ever cause a crash, but you should remove unresolvable entries from aliases and rules.
It can cause delays in updating the ruleset, which can cause other issues if there are enough of them.
Steve
-
@stephenw10 There were 6-7 aliases that were no longer in use. Meaning that I had earlier deleted one or more hosts behind the alias (from the GUI), but the alias they were part of had been left behind or had other valid entries. Now there is only one left in the logs and I can't find it..
-
I usually search the config file directly in that situation.
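For reference, that kind of search can look like the sketch below. The alias name "OldServers" and the sample file are hypothetical; on a real pfSense box the live config is /cf/conf/config.xml:

```shell
#!/bin/sh
# Sketch: grep the pfSense config for every place an alias name still appears.
# "OldServers" is a hypothetical alias name; a small sample config stands in
# for the real /cf/conf/config.xml here.
CONF=/tmp/config_sample.xml
cat > "$CONF" <<'EOF'
<pfsense>
  <aliases>
    <alias>
      <name>OldServers</name>
      <type>host</type>
      <address>10.0.0.5</address>
    </alias>
  </aliases>
  <filter>
    <rule>
      <source><address>OldServers</address></source>
    </rule>
  </filter>
</pfsense>
EOF

# Every line that still references the alias, with line numbers:
grep -n 'OldServers' "$CONF"
```

On a real system you would point grep at /cf/conf/config.xml (or a downloaded config backup) instead; each hit shows whether the name survives as an alias definition or as a stale reference in a rule.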
-
I have had the same problem since I moved to pfSense 2.5. Currently I use pfSense 2.6 and I get one crash per day.
In the report I have: fault code = supervisor read instruction, page not present
What are the differences in configuration files between pfSense 2.4 and 2.5?
Best regards
-
Depends which specific version, but there are a lot:
https://docs.netgate.com/pfsense/en/latest/releases/versions.html
It shouldn't matter though; you can import an older config into the current pfSense version.
Steve
-
Changing hardware didn't help, and neither did removing aliases or IPsec tunnels.
What finally solved it for me, after a year of trouble, was removing the LAN LAGG against two switches. I had redundancy in case one switch failed. All the switches showed the correct properties against the other end (short/long etc.), so I had no reason to suspect any issues. It all started after a pfSense upgrade.
I assume it must have been some kind of network confusion that caused the crash to happen every month. After this change, no problems have appeared.
-
Hmm, that's weird. You never saw any errors relating to the LAGG?
It was LACP I assume. Was the LAN just directly assigned to it? Or VLANs over it?
Steve
-
@stephenw10 LACP, correct. No VLANs at all; LAN directly assigned to it.
Maybe stupid, but the only reason I started suspecting it was this message about one of the servers on the network (from the dump/crash log):
<6>arp: moved from ac:1f:6b:6f:f2:8a to ac:1f:6b:6f:f2:8b on lagg0
I suspected that something wasn't working correctly, as there was no reason for an always-on file server to switch ports. Maybe it is routine, who knows.. And not a single crash after.
-
Hmm, that's the server's MAC address(es)?
That looks like a log message on pfSense showing that the server moved to a different MAC. I assume you omitted the IP address there.
That wouldn't normally be an issue. It might happen if the server itself was connected with a lagg to the switch stack for example.