Crash report

stephenw10

Do you have the backtrace from that report? It's the section at the top labelled bt>.

Steve

fireix

@stephenw10 This happened again now. I can't see anything labeled bt>.

It's strange that this started happening now; it has been running stable for years.

Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x20
fault code = supervisor read data, page not present

stephenw10

Can we see the full crash report then?

fireix

@stephenw10 The only issue is that it has a lot of IPs (including public ones) in it, so I didn't want to post it here. But I removed the sensitive stuff, so here is the first dump file:

textdump-2021.txt
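
For anyone else who needs to scrub a textdump before posting it, here is a minimal Python sketch of one way to mask IPv4 addresses. The function name and the placeholder are made up for illustration, and this is not necessarily how the file above was cleaned. It only covers dotted IPv4 addresses, so IPv6 addresses, hostnames and MAC addresses would still need to be checked by hand.

```python
import re

# Four dot-separated groups of 1-3 digits. This is deliberately loose: it does
# not check that each group is <= 255, which is fine for redaction purposes.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact_ipv4(text, placeholder="x.x.x.x"):
    """Return text with every IPv4-looking token replaced by placeholder."""
    return IPV4_RE.sub(placeholder, text)
```

Version strings such as "12.2-STABLE" are left alone because they do not have four dotted groups.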

And here is 2nd:

Dump header from device: /dev/mirror/pfSenseMirrorp3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 157696
  Blocksize: 512
  Compression: none
  Dumptime: Sun Apr 25 00:14:21 2021
  Hostname: XX
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense
  Panic String: page fault
  Dump Parity: 411090558
  Bounds: 0
  Dump Status: good

stephenw10

Ok, so we can see the backtrace in that here:

db:0:kdb.enter.default>  bt
Tracing pid 0 tid 100046 td 0xfffff8000461d740
kdb_enter() at kdb_enter+0x37/frame 0xfffffe000055a140
vpanic() at vpanic+0x197/frame 0xfffffe000055a190
panic() at panic+0x43/frame 0xfffffe000055a1f0
trap_fatal() at trap_fatal+0x391/frame 0xfffffe000055a250
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe000055a2a0
trap() at trap+0x286/frame 0xfffffe000055a3b0
calltrap() at calltrap+0x8/frame 0xfffffe000055a3b0
--- trap 0xc, rip = 0xffffffff80e024b5, rsp = 0xfffffe000055a480, rbp = 0xfffffe000055a490 ---
turnstile_broadcast() at turnstile_broadcast+0x45/frame 0xfffffe000055a490
__mtx_unlock_sleep() at __mtx_unlock_sleep+0x7f/frame 0xfffffe000055a4c0
pf_find_state() at pf_find_state+0x21c/frame 0xfffffe000055a500
pf_test_state_tcp() at pf_test_state_tcp+0x1b6/frame 0xfffffe000055a620
pf_test() at pf_test+0x1f64/frame 0xfffffe000055a870
pf_check_in() at pf_check_in+0x1d/frame 0xfffffe000055a890
pfil_run_hooks() at pfil_run_hooks+0xa1/frame 0xfffffe000055a930
ip_input() at ip_input+0x475/frame 0xfffffe000055a9e0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000055aa30
ether_demux() at ether_demux+0x16a/frame 0xfffffe000055aa60
ether_nh_input() at ether_nh_input+0x330/frame 0xfffffe000055aac0
netisr_dispatch_src() at netisr_dispatch_src+0xca/frame 0xfffffe000055ab10
ether_input() at ether_input+0x4b/frame 0xfffffe000055ab40
iflib_rxeof() at iflib_rxeof+0xae6/frame 0xfffffe000055ac20
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe000055ac60
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x121/frame 0xfffffe000055acc0
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xb6/frame 0xfffffe000055acf0
fork_exit() at fork_exit+0x7e/frame 0xfffffe000055ad30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe000055ad30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

Looks like the message buffer has been removed.

The first thing to do is compare that backtrace with one from another crash report. If they are all identical or very similar it's probably a software issue at least.
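
If you have several reports, that comparison can be scripted. The following is a rough Python sketch, assuming the frames look like the "name() at name+0xNN/frame 0x..." lines above. The function names are invented for this example. It ignores offsets and frame addresses and compares only the sequence of function names.

```python
import re

# Matches frame lines such as
#   pf_test() at pf_test+0x1f64/frame 0xfffffe000055a870
# Trap separator lines ("--- trap 0xc, ... ---") do not match and are skipped.
FRAME_RE = re.compile(r"^(\w+)\(\) at \w+\+0x[0-9a-f]+/frame 0x[0-9a-f]+")


def frame_functions(backtrace_text):
    """Return the function names of a backtrace, in the order printed."""
    names = []
    for line in backtrace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def same_call_path(bt_a, bt_b):
    """True if both backtraces go through the same functions in the same order."""
    return frame_functions(bt_a) == frame_functions(bt_b)
```

Paste the bt section of each report into a string (or read it from the textdump file) and pass both to same_call_path(). If it returns False, printing the two frame_functions() lists side by side shows where they diverge.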

Steve

fireix

@stephenw10 textdump-old.txt

That is the dump from two weeks earlier.

Dump header from device: /dev/mirror/pfSenseMirrorp3
  Architecture: amd64
  Architecture Version: 4
  Dump Length: 157696
  Blocksize: 512
  Compression: none
  Dumptime: Sat Apr 10 04:29:57 2021
  Hostname: 
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 12.2-STABLE d48fb226319(devel-12) pfSense
  Panic String: page fault
  Dump Parity: 2148879230
  Bounds: 0
  Dump Status: good

stephenw10

Ok, so virtually identical.

That is 2.5.1 yes? It looks a lot like an old crash that should be fixed in 2.5.1.

Steve

fireix

@stephenw10 2.5.1-RELEASE (amd64)
built on Mon Apr 12 07:50:14 EDT 2021

I was hoping it was just something fixed in 2.5.1, so I upgraded (from 2.5.0) just after the previous report (2 days later). So 2nd crash last night on 2.5.1.

stephenw10

Hmm, are you able to test a 2.6 snapshot?

Though I'm not aware of anything specific that has gone in to address that.

Steve

fireix

@stephenw10 It is in production, so it's a bit scary to upgrade since it seems to work for most usage (except one LAN network, but I'm not sure if that is related). I have a 2nd machine with the same config that has been standing ready offline for years now, so in theory I can just fire it up and load the backup when I'm onsite, but...

In the log, there is weird stuff like the below, many hundreds of entries. It is correct that it is not a host; it is an alias for hosts and ports that are valid. Maybe this causes overload? I haven't changed the aliases for months, and this started appearing just now. It doesn't seem to cause any problems, but it's strange that it suggests that the alias names are hosts.

stephenw10

It shouldn't ever cause a crash but you should remove unresolvable entries from aliases and rules.
It can cause delays in updating the ruleset that can cause other issues if there are enough.

Steve

fireix

@stephenw10 There were 6-7 aliases that were no longer in use, meaning that I had earlier deleted one or more hosts behind the alias (from the GUI), but the alias they were part of had been left behind or had other valid entries. Now there is only one left in the logs and I can't find it.

stephenw10

I usually search the config file directly in that situation.
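
As a rough illustration of that kind of search, here is a small Python sketch that lists every line containing a given name. The helper name is made up. Run it against a downloaded backup of the config, or against /conf/config.xml on the firewall if that is where your config lives, which is an assumption about your install.

```python
def find_lines(text, needle):
    """Return (line_number, stripped_line) for every line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]
```

For example, find_lines(open("config-backup.xml").read(), "my_old_alias") would show every rule or alias that still mentions "my_old_alias". Both the file name and the alias name are placeholders.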

dcugy

I have had the same problem since I moved to pfSense 2.5. I currently use pfSense 2.6 and I get one crash per day.

The report shows: fault code = supervisor read instruction, page not present

What are the differences in the configuration files between pfSense 2.4 and 2.5?

Best regards

stephenw10

Depends which specific version but there are a lot:
https://docs.netgate.com/pfsense/en/latest/releases/versions.html

It shouldn't matter though, you can import an older config into the current pfSense version.

Steve

fireix

Changing hardware didn't help, nor did removing aliases or IPsec tunnels.

What finally solved it for me, after a year of trouble, was removing the LAN LAGG to two switches. I had it for redundancy in case one switch failed. All the switches showed the correct properties for the other end (short/long etc.), so I had no reason to suspect any issues. It all started after a pfSense upgrade.

I assume it must have been some kind of network confusion that caused the crash to happen every month. After this change, no problems have appeared.

stephenw10

Hmm, that's weird. You never saw any errors relating to the LAGG?

It was LACP I assume. Was the LAN just directly assigned to it? Or VLANs over it?

Steve

fireix

@stephenw10 LACP, correct. No VLANs at all, LAN directly assigned to it.

Maybe stupid, but the only reason I started suspecting it was this message about one of the servers on the network (from the dump/crash log):

<6>arp: moved from ac:1f:6b:6f:f2:8a to ac:1f:6b:6f:f2:8b on lagg0

I suspected that something wasn't working correctly, as there was no reason for an always-on file server to switch ports. Maybe it is routine, who knows. And not a single crash since.
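
In case it helps someone else check how often this happens, here is a rough Python sketch that counts "moved from ... to ..." ARP messages in a log. The regex is based only on the line quoted above, with the IP address treated as optional since it was left out there. The names are made up for this example.

```python
import re
from collections import Counter

MAC = r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"

# Based on the quoted line:
#   arp: moved from ac:1f:6b:6f:f2:8a to ac:1f:6b:6f:f2:8b on lagg0
# An address between "arp:" and "moved" is allowed but not required.
ARP_MOVED_RE = re.compile(
    r"arp: (?:(?P<ip>\S+) )?moved from (?P<old>" + MAC + r") "
    r"to (?P<new>" + MAC + r") on (?P<ifname>\S+)",
    re.IGNORECASE,
)


def count_arp_moves(log_text):
    """Count ARP 'moved' messages per (old MAC, new MAC, interface)."""
    moves = Counter()
    for line in log_text.splitlines():
        match = ARP_MOVED_RE.search(line)
        if match:
            key = (match.group("old"), match.group("new"), match.group("ifname"))
            moves[key] += 1
    return moves
```

A pair of MACs that flips back and forth many times would show up as high counts for both directions.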

stephenw10

Hmm, that's the server's MAC address(es)?

That looks like a log message on pfSense showing that the server moved to a different MAC. I assume you omitted the IP address there.

That wouldn't normally be an issue. It might happen if the server itself was connected with a lagg to the switch stack for example.

fireix

@stephenw10 Yes, the server's MAC address. The server (all servers, not only this one) was connected to the switch through a LAGG setup in the same way. I didn't really think it should be a big problem, just a tiny bit weird that only one had the "problem" (the QNAP server).