Trying to understand crash report.
-
I hope someone may be able to help?
My pfSense box seems to crash after around 4 days of use. It's a fresh build, so any help is appreciated, thanks.
I do have the full dump files if needed.
Thanks.
amd64
14.0-CURRENT
FreeBSD 14.0-CURRENT amd64 1400094 #1 RELENG_2_7_2-n255948-8d2b56da39c: Wed Dec 6 20:45:47 UTC 2023 root@freebsd:/var/jenkins/workspace/pfSense-CE-snapshots-2_7_2-main/obj/amd64/StdASW5b/var/jenkins/workspace/pfSense-CE-snapshots-2_7_2-main/sources/F
Crash report details:
No PHP errors found.
Filename: /var/crash/info.0
Dump header from device: /dev/nda0p3
Architecture: amd64
Architecture Version: 4
Dump Length: 76800
Blocksize: 512
Compression: none
Dumptime: 2024-08-29 07:39:59 +0100
Hostname: pfSense.home.arpa
Magic: FreeBSD Text Dump
Version String: FreeBSD 14.0-CURRENT amd64 1400094 #1 RELENG_2_7_2-n255948-8d2b56da39c: Wed Dec 6 20:45:47 UTC 2023
root@freebsd:/var/jenkins/workspace/pfSense-CE-snapshots-2_7_2-main/obj/amd64/StdASW5
Panic String: Unknown caching mode 1632072657
Dump Parity: 3375348253
Bounds: 0
Dump Status: good
Filename: /var/crash/textdump.tar.0
-
Yup need to see the full crash report. You can upload it here: https://nc.netgate.com/nextcloud/s/98ScTHcok475QZH
-
-
Backtrace is unhelpful:
db:0:kdb.enter.default> bt
Tracing pid 89416 tid 100574 td 0xfffffe0152620740
kdb_enter() at kdb_enter+0x32/frame 0xfffffe015139d9f0
vpanic() at vpanic+0x163/frame 0xfffffe015139db20
panic() at panic+0x43/frame 0xfffffe015139db80
pmap_enter() at pmap_enter+0xfee/frame 0xfffffe015139dc50
vm_fault() at vm_fault+0x134b/frame 0xfffffe015139dd60
vm_fault_trap() at vm_fault_trap+0x6b/frame 0xfffffe015139ddb0
trap_pfault() at trap_pfault+0x1d9/frame 0xfffffe015139de10
trap() at trap+0x442/frame 0xfffffe015139df30
calltrap() at calltrap+0x8/frame 0xfffffe015139df30
--- trap 0xc, rip = 0x22047c0f1975, rsp = 0x220476bb51d8, rbp = 0x220476bb5220 ---
That's very generic.
The panic doesn't tell us much either:
<118>pfSense 2.7.2-RELEASE amd64 20231206-2010
<118>Bootup complete
panic: Unknown caching mode 1632072657
cpuid = 3
time = 1724913599
KDB: enter: panic
Is this the first time you've seen it?
Did anything appear to trigger it?
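
As a side note on reading the panic message above: the `time = 1724913599` field looks like a Unix epoch timestamp, and it can be cross-checked against the `Dumptime` line in the dump header. The following is a minimal Python sketch of that check. The variable names are illustrative, and the +0100 offset is taken from the dump header rather than from the panic message itself.

```python
from datetime import datetime, timedelta, timezone

# "time = 1724913599" from the panic message, treated as Unix epoch seconds.
panic_epoch = 1724913599

# The dump header prints Dumptime with a +0100 offset, so use the same offset.
dump_tz = timezone(timedelta(hours=1))

# Convert the epoch value to a wall-clock time in that offset.
panic_local = datetime.fromtimestamp(panic_epoch, tz=dump_tz)

# Print in the same layout as the "Dumptime:" line for easy comparison.
print(panic_local.strftime("%Y-%m-%d %H:%M:%S %z"))
```

If the printed value matches the `Dumptime` line, the text dump belongs to that panic and not to an earlier crash.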
-
2nd time now. I wasn't even home at the time. I accessed the box remotely and was notified of the crash report etc.
I rebuilt it last weekend following the first crash. I'm just looking to see if I still have those files.
-
I have uploaded the first ones from when it crashed before in case there is anything helpful in there.
-
Hmm, pretty much identical panic. Let me see what I can find....
-
This doesn't appear to be anything known. You are running relatively new hardware for the kernel version. It's possible something is not set up to handle it.
The caching mode value there is completely invalid; something is passing it a bad value or it's reporting it incorrectly.
Is this something that just started happening or has it only just been installed 4 days ago?
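
To illustrate why the caching mode value looks invalid, here is a rough Python sketch that compares it against the handful of small memory-type codes a kernel would normally use. The table of known modes is an assumption based on the x86 PAT encodings as I understand them from FreeBSD's headers; check `sys/x86/include/specialreg.h` for your kernel before relying on it. `KNOWN_MODES` and `describe_mode` are hypothetical names used only for this example.

```python
# ASSUMPTION: x86 PAT memory-type encodings as found in FreeBSD's headers.
# Verify these against the source tree for the kernel you are running.
KNOWN_MODES = {
    0x00: "uncacheable",
    0x01: "write-combining",
    0x04: "write-through",
    0x05: "write-protected",
    0x06: "write-back",
    0x07: "uncached (weak)",
}


def describe_mode(value: int) -> str:
    """Return a short description of a caching mode value (hypothetical helper)."""
    name = KNOWN_MODES.get(value)
    if name is not None:
        return f"{value} is a known mode: {name}"
    # Show the value in hex and as raw little-endian bytes to make it
    # obvious how far it is from any small mode code.
    raw = value.to_bytes(4, "little")
    return f"{value} (0x{value:08x}) is not a known mode; raw bytes {raw.hex(' ')}"


print(describe_mode(6))
print(describe_mode(1632072657))
```

The valid codes are all single-digit values, so a number in the billions cannot be a legitimate mode. That fits the point above that a bad value is being passed in or reported, but it does not by itself say where the bad value comes from.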
-
@stephenw10
Yes, it's new hardware and I have had some panics since it was built (it has been built for more than 4 days...). I have rebuilt it a few times in between. I am probably going to move it back to my bench and run a RAM test, even though the RAM was new, as were the rest of the components.
Is it anything to do with the networking side of things? Currently I'm using one of the x710 sfp ports on the lan side (latest intel firmware) with an rj45 module and the wan side is using a pcie 10gb rj45 nic as the isp ont is 10gb rj45.
Thanks
-
Nothing there looks like a network issue specifically, no.
I would try to disable anything unused in the hardware. So the on board sound device etc.
-
Hi. So trying a few things to see if it helps at all.
1) Several runs of memory testing. No faults, though I did notice I had my single 16gb ddr5 in the second slot, so I moved it to slot 1 and retested, all OK.
2) Disabled anything in the bios not required. There was a section for the realtek audio you mentioned, so that's off now too. Bios is the current version.
3) Checked the x710 sfp ports' firmware. I updated it recently and it's still on the current version.
4) For what it was worth, I repasted the cpu heatsink etc. It was okish but used, so good quality paste. Idling temp seems a bit lower.
Rebuilt and it's back on as my live unit, so let's see what happens.
One quick question though if you don't mind.
With regard to "hardware checksum offload", does this need to be disabled if using my x710s? Currently it's not, but advice on this would be appreciated. Thanks!
-
That should be fine on the ixl (x710) NICs. However I would leave it disabled for now. It's unlikely to help much and could cause an issue.
-
@stephenw10
OK great. Thanks, it's off for now, so let's see how it goes.
I have noted lan in and wan in errors using both sfp ports with rj45 modules (additional cooling already fitted), 4 each. I know we talked about them before, so for now I'm more interested in hoping it doesn't crash. 57 GB on wan in (4 wan in errors). 42 GB on lan in with 4 errors. I'm assuming this isn't too much to be of concern?... Thank you.
-
No that would not be a concern to me. A few errors like that could be anything and is very unlikely to cause any sort of problem.
-
@stephenw10 thanks. Hope you won't mind but I'll post back if it panics... cheers
-
Hi. So I thought I would update my post. Since carrying out those tasks it hasn't had a panic. Admittedly I configured the wan/lan using the 2 internal i-226 ports and so far it seems OK.
Given the panic-free spell, I reintroduced a pci 10gb x540 t1 for the lan side only. How many lan "in" errors are deemed acceptable?
Thank you
-
I don't really have a specific number. I expect it to be pretty low in normal operation. Errors like that usually happen when some event in the network creates them. So maybe when the NIC links or the switch reloads etc. Thus you might see a bunch of errors initially and then a much smaller rate over time.
-
@stephenw10 OK thanks.
I am sure it's coming from an intel ax201 wifi adapter in a laptop. When I login on it (win11) I see an uptick in lan in errors. Other devices were fine yesterday including other devices on WiFi via the ap but as soon as that laptop is used they go up. I updated everything I could think of but it still seems to generate them... ho hum prolly have to live with it.
-
Hmm, interesting. Well if that really is bad traffic being generated from that wifi device it probably should be dropped and that's OK. Interesting that the Access Point (and switch?) pass it through.
-
@stephenw10
No further crashes since carrying out those repairs etc so that's great.
36 lan in errors for 1.8 TiB of data of which the majority have come from that annoying intel wifi adapter!....
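
To put the error counts quoted in this thread into proportion, here is a small Python sketch that turns each count into errors per GiB and a rough per-packet ratio. Two assumptions are mine and not from the interface counters: the "gb" figures are treated as GiB, and the average packet size of 1000 bytes is a purely hypothetical placeholder. The helper names are illustrative only.

```python
GIB = 1024 ** 3

# ASSUMPTION: hypothetical average packet size, used only for a rough ratio.
ASSUMED_AVG_PACKET_BYTES = 1000


def errors_per_gib(errors: int, gib: float) -> float:
    """Input errors per GiB of traffic received."""
    return errors / gib


def approx_error_ratio(errors: int, gib: float) -> float:
    """Very rough errors-per-packet ratio under the assumed packet size."""
    packets = gib * GIB / ASSUMED_AVG_PACKET_BYTES
    return errors / packets


# (errors, traffic in GiB) as quoted in the thread; 1.8 TiB = 1.8 * 1024 GiB.
samples = {
    "wan in, x710": (4, 57.0),
    "lan in, x710": (4, 42.0),
    "lan in, x540": (36, 1.8 * 1024),
}

for name, (errors, gib) in samples.items():
    rate = errors_per_gib(errors, gib)
    ratio = approx_error_ratio(errors, gib)
    print(f"{name}: {rate:.4f} errors/GiB, roughly {ratio:.1e} errors/packet")
```

Comparing the rate between two readings of the counters, rather than the raw totals, is one way to see whether errors are still accumulating or were a one-off burst.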