Recurring crashes in the last few weeks

inworksit

Good morning developers,

In the last few weeks we have had a couple of crashes of our pfSense 2.3 box.
Yesterday morning the box crashed, leaving the RAID degraded. This morning it crashed again, with the same result.
Both crash reports have been submitted. The time of submission of the last report should be around 2017-08-16 10:27 CET.

Could you tell us whether we should take a closer look at the hardware or whether it is a software bug?

Regards,
Juergen Nagel
Inworks GmbH

jimp

I don't see any crashes on the crash reporter server from the IP address on your post, and we can't go by submission time alone. If you could give at least the first two octets of the IPv4 address, or the first 2-3 sections of an IPv6 address, that together with the time should help narrow it down.

inworksit

Do you see reports from the 92.198.54.104/29 range?
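
For reference, a /29 covers eight addresses. A minimal sketch, using Python's standard `ipaddress` module, of how a report's source address could be checked against the range quoted above (the `in_range` helper is a hypothetical name for illustration, not part of any pfSense tooling):

```python
import ipaddress

# The range quoted in the post above.
NETWORK = ipaddress.ip_network("92.198.54.104/29")


def in_range(addr: str) -> bool:
    """Return True if addr falls inside the /29 (hypothetical helper)."""
    return ipaddress.ip_address(addr) in NETWORK


# First and last address of the block, plus its size.
print(NETWORK[0], "-", NETWORK[-1], f"({NETWORK.num_addresses} addresses)")
```

So a report from any address between 92.198.54.104 and 92.198.54.111 would belong to this range.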

jimp

Yes, there are recent ones from yesterday and one from today. Both were the same.

Fatal double fault:
eip = 0xc12d2498
esp = 0xe4767000
ebp = 0xe4767b70
cpuid = 1; apic id = 01
panic: double fault
cpuid = 1
KDB: enter: panic

db:0:kdb.enter.default>  bt
Tracing pid 11 tid 100004 td 0xc8715c80
kdb_enter(c147cb56,c147cb56,c1643e27,c1fb7994,1,...) at kdb_enter+0x3d/frame 0xc1fb7940
vpanic(c1643e27,c1fb7994,c1fb7994,c1fb79ac,c12e7f2b,...) at vpanic+0x13b/frame 0xc1fb7974
panic(c1643e27,1,1,1,e4767b70,...) at panic+0x1b/frame 0xc1fb7988
dblfault_handler() at dblfault_handler+0xab/frame 0xc1fb7988
--- trap 0x17, eip = 0xc12d2498, esp = 0xe4767000, ebp = 0xe4767b70 ---
Xpage(8,28,28,c87db000,0,...) at Xpage/frame 0xe4767b70
Xinvlrng(e4767c28,c0d3d01e,c1f96f58,103f3,c8715c80,...) at Xinvlrng+0x2d/frame 0xe4767bb8
acpi_cpu_idle(18199824,0,18199824,e4767c28,c12d671a,...) at acpi_cpu_idle+0x15a/frame 0xe4767bf8
cpu_idle_acpi(18199824,0,c1f87404,c1f87408,c1f87414,...) at cpu_idle_acpi+0x3f/frame 0xe4767c0c
cpu_idle(0,e4767c78,c147e4f3,a3d,0,...) at cpu_idle+0x9a/frame 0xe4767c28
sched_idletd(0,e4767ce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe4767ca4
fork_exit(c0d3fd30,0,e4767ce8) at fork_exit+0xa3/frame 0xe4767cd4
fork_trampoline() at fork_trampoline+0x8/frame 0xe4767cd4
--- trap 0, eip = 0, esp = 0xe4767d20, ebp = 0 ---

Usually a double fault is from a driver or hardware issue. There is not much that is helpful in the backtrace, though. The idle process was active at the time; it looks like it was literally just sitting there idling and crashed somehow. To me, that screams hardware, but it's not definitive.
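
To make the "idle process" reading of the backtrace concrete, here is a rough sketch in Python that pulls the function names out of the frames listed after the first `--- trap` marker, i.e. the context that was interrupted by the fault. The regular expression and the `interrupted_frames` helper are assumptions made for this illustration; they are not part of DDB or the crash reporter.

```python
import re

# Frames copied from the backtrace above, starting at the fault handler.
BACKTRACE = """\
dblfault_handler() at dblfault_handler+0xab/frame 0xc1fb7988
--- trap 0x17, eip = 0xc12d2498, esp = 0xe4767000, ebp = 0xe4767b70 ---
Xpage(8,28,28,c87db000,0,...) at Xpage/frame 0xe4767b70
Xinvlrng(e4767c28,c0d3d01e,c1f96f58,103f3,c8715c80,...) at Xinvlrng+0x2d/frame 0xe4767bb8
acpi_cpu_idle(18199824,0,18199824,e4767c28,c12d671a,...) at acpi_cpu_idle+0x15a/frame 0xe4767bf8
cpu_idle_acpi(18199824,0,c1f87404,c1f87408,c1f87414,...) at cpu_idle_acpi+0x3f/frame 0xe4767c0c
cpu_idle(0,e4767c78,c147e4f3,a3d,0,...) at cpu_idle+0x9a/frame 0xe4767c28
sched_idletd(0,e4767ce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe4767ca4
fork_exit(c0d3fd30,0,e4767ce8) at fork_exit+0xa3/frame 0xe4767cd4
fork_trampoline() at fork_trampoline+0x8/frame 0xe4767cd4
--- trap 0, eip = 0, esp = 0xe4767d20, ebp = 0 ---
"""

# Matches "name(args) at name[+0xoffset]/frame 0xaddr" (assumed line shape).
FRAME_RE = re.compile(r"^(\w+)\(.*?\) at \1(?:\+0x[0-9a-f]+)?/frame 0x[0-9a-f]+$")


def interrupted_frames(backtrace):
    """Return function names between the first and second trap markers."""
    names = []
    seen_trap = False
    for line in backtrace.splitlines():
        line = line.strip()
        if line.startswith("--- trap"):
            if seen_trap:
                break  # second marker: end of the interrupted context
            seen_trap = True
            continue
        match = FRAME_RE.match(line)
        if seen_trap and match:
            names.append(match.group(1))
    return names


frames = interrupted_frames(BACKTRACE)
print(" <- ".join(frames))
print("idle thread:", "sched_idletd" in frames)
```

The presence of `sched_idletd` and `cpu_idle` in that chain is what indicates the CPU was in the scheduler's idle thread when the fault hit.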

The degraded RAID is just a consequence of the crash; it's not directly related to the cause. That would happen with gmirror after any panic/crash.

It might be worth checking for a BIOS update. There are some other ACPI errors in the message buffer of the crash that look out of place:

ACPI Error: [GPMN] Namespace lookup failure, AE_NOT_FOUND (20150515/psargs-391)
ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPC0.MBRD._CRS] (Node 0xc887bb80), AE_NOT_FOUND (20150515/psparse-552)

Those don't look especially harmful, but they're still noteworthy.

If you can keep it down for a bit, run memtest86+ and any OEM/other hardware diagnostics you have access to. Those may not necessarily draw out a problem even if one is there, but if they do find something, it's a good indicator that you have a hardware problem.

inworksit

Thanks for the fast analysis!
We'll run a memtest on the machine and look into replacing the box with modern hardware in the foreseeable future.