Recurring crashes in the last few weeks

  • Good morning developers,

    In the last few weeks our pfSense 2.3 box has crashed a couple of times.
    Yesterday morning it crashed, leaving the RAID degraded. This morning it crashed again, with the same result.
    Both crash reports have been submitted. The time of submission of the last report should be around 2017-08-16 10:27 CET.

    Could you tell us whether we should take a closer look at the hardware, or whether it is a software bug?

    Juergen Nagel
    Inworks GmbH

  • Rebel Alliance Developer Netgate

    I don't see any crashes in the crash reporter server from the IP address on your post. We can't just go by submitted time. If you could at least give the first two octets of the IPv4 address, or first 2-3 sections of an IPv6 address, that should help narrow it down along with the time.
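
For anyone who wants to share only the leading part of an address as described above, here is a minimal Python sketch. The helper name `partial_address` and the sample addresses (taken from the documentation ranges) are illustrative assumptions, not anything the crash reporter requires.

```python
import ipaddress


def partial_address(addr):
    """Return only the leading part of an IP address.

    IPv4: the first two octets. IPv6: the first three 16-bit groups.
    The rest is replaced with a placeholder so the full address is not disclosed.
    """
    ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
    if ip.version == 4:
        octets = str(ip).split(".")
        return ".".join(octets[:2]) + ".x.x"
    # The exploded form always has eight zero-padded groups.
    groups = ip.exploded.split(":")
    return ":".join(groups[:3]) + ":..."


if __name__ == "__main__":
    # Hypothetical addresses from the documentation ranges.
    print(partial_address("192.0.2.17"))
    print(partial_address("2001:db8:1234:5678::1"))
```

Posting the shortened form together with the approximate submission time should be enough to narrow the search without publishing the full address.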

  • Do you see reports from the range?

  • Rebel Alliance Developer Netgate

    Yes, there are recent ones from yesterday and one from today. Both were the same.

    Fatal double fault:
    eip = 0xc12d2498
    esp = 0xe4767000
    ebp = 0xe4767b70
    cpuid = 1; apic id = 01
    panic: double fault
    cpuid = 1
    KDB: enter: panic
    db:0:kdb.enter.default>  bt
    Tracing pid 11 tid 100004 td 0xc8715c80
    kdb_enter(c147cb56,c147cb56,c1643e27,c1fb7994,1,...) at kdb_enter+0x3d/frame 0xc1fb7940
    vpanic(c1643e27,c1fb7994,c1fb7994,c1fb79ac,c12e7f2b,...) at vpanic+0x13b/frame 0xc1fb7974
    panic(c1643e27,1,1,1,e4767b70,...) at panic+0x1b/frame 0xc1fb7988
    dblfault_handler() at dblfault_handler+0xab/frame 0xc1fb7988
    --- trap 0x17, eip = 0xc12d2498, esp = 0xe4767000, ebp = 0xe4767b70 ---
    Xpage(8,28,28,c87db000,0,...) at Xpage/frame 0xe4767b70
    Xinvlrng(e4767c28,c0d3d01e,c1f96f58,103f3,c8715c80,...) at Xinvlrng+0x2d/frame 0xe4767bb8
    acpi_cpu_idle(18199824,0,18199824,e4767c28,c12d671a,...) at acpi_cpu_idle+0x15a/frame 0xe4767bf8
    cpu_idle_acpi(18199824,0,c1f87404,c1f87408,c1f87414,...) at cpu_idle_acpi+0x3f/frame 0xe4767c0c
    cpu_idle(0,e4767c78,c147e4f3,a3d,0,...) at cpu_idle+0x9a/frame 0xe4767c28
    sched_idletd(0,e4767ce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe4767ca4
    fork_exit(c0d3fd30,0,e4767ce8) at fork_exit+0xa3/frame 0xe4767cd4
    fork_trampoline() at fork_trampoline+0x8/frame 0xe4767cd4
    --- trap 0, eip = 0, esp = 0xe4767d20, ebp = 0 ---

    A double fault usually comes from a driver or hardware issue. There is not much that helps in the backtrace, though. The idle process was active at the time; it looks like the box was literally just sitting there idling and crashed somehow. To me, that screams hardware, but it's not definitive.
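
As a rough illustration of how the backtrace was read, the following Python sketch pulls the function names out of DDB `bt` output and checks whether the idle thread (`sched_idletd`) is on the stack. The regular expression and helper names are assumptions written for the frame format shown in this thread; other platforms or FreeBSD versions may print frames differently.

```python
import re

# Matches frame lines such as:
#   cpu_idle(0,e4767c78,...) at cpu_idle+0x9a/frame 0xe4767c28
# The "+0x..." offset is optional (see the Xpage frame).
FRAME_RE = re.compile(
    r"^\s*(\w+)\(.*?\) at \w+(?:\+0x[0-9a-f]+)?/frame 0x[0-9a-f]+\s*$"
)


def frame_names(backtrace):
    """Return the function names of all stack frames, innermost first."""
    names = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match:  # trap markers ("--- trap ... ---") do not match
            names.append(match.group(1))
    return names


def was_idling(backtrace):
    """True if the idle thread's main loop appears in the backtrace."""
    return "sched_idletd" in frame_names(backtrace)


# A few lines copied from the backtrace above.
SAMPLE = """\
dblfault_handler() at dblfault_handler+0xab/frame 0xc1fb7988
--- trap 0x17, eip = 0xc12d2498, esp = 0xe4767000, ebp = 0xe4767b70 ---
Xpage(8,28,28,c87db000,0,...) at Xpage/frame 0xe4767b70
acpi_cpu_idle(18199824,0,18199824,e4767c28,c12d671a,...) at acpi_cpu_idle+0x15a/frame 0xe4767bf8
sched_idletd(0,e4767ce8,0,0,0,...) at sched_idletd+0x1dd/frame 0xe4767ca4
fork_exit(c0d3fd30,0,e4767ce8) at fork_exit+0xa3/frame 0xe4767cd4
"""

if __name__ == "__main__":
    print(frame_names(SAMPLE))
    print(was_idling(SAMPLE))
```

Seeing `sched_idletd` and `acpi_cpu_idle` below the trap marker is what suggests the CPU was idle when the fault hit; it is a hint, not proof of a hardware fault.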

    The broken RAID is just a consequence of the crash; it's not directly related to the cause. That would happen with gmirror after any panic/crash.

    It might be worth checking for a BIOS update; there are some other ACPI errors in the message buffer of the crash that look out of place:

    ACPI Error: [GPMN] Namespace lookup failure, AE_NOT_FOUND (20150515/psargs-391)
    ACPI Error: Method parse/execution failed [\_SB_.PCI0.LPC0.MBRD._CRS] (Node 0xc887bb80), AE_NOT_FOUND (20150515/psparse-552)

    That doesn't look especially harmful but it's still noteworthy.

    If you can keep it down for a bit, run memtest86+ and any OEM/other hardware diagnostics you have access to. While those may not necessarily draw a problem out if it's there, if they do find something it's a good indicator that you have a hardware problem.

  • Thanks for the fast analysis!
    We'll run a memtest on the machine and look into replacing the box with modern hardware in the foreseeable future.
