HA - Crash report - Need help to understand why

  • Hello,

    I would like to know if you can analyse the crash report and help us understand why the secondary pfSense crashed, and why our primary pfSense went down and then stayed unstable for about 30 minutes.

    To explain: we have two pfSense instances configured in HA, running version 2.1.5 (I know this is an old version; we have a project to upgrade). Last week we had a production outage and our internet lines went down (fiber, VPN, VDSL): the primary pfSense had a high load average (~13) and the secondary pfSense had crashed with this crash report. We shut down the secondary and disabled the SYNC (HA - pfsync) interface to bring the primary pfSense back to life.

    Currently, these pfSense instances are virtualized on Proxmox with Intel e1000 network cards (we would like to move to physical hardware with the newest version, but I have tested it and we have a problem with IPsec and FTP).

    So, can you help us? Do you need more information?



  • Rebel Alliance Developer Netgate

    Your disk and/or disk controller is shot.

    A wipe and reload might help but it looks more like hardware to me because of the NMI trap there – that signal can only be generated from hardware.

    If it was just a corrupted filesystem it would only have crashed in filesystem functions and it wouldn't have the NMI bits in the trace.

    db:0:kdb.enter.default>  bt
    Tracing pid 24734 tid 100230 td 0xc891e5c0
    bcopy(2,eeb32924,c0e8f7ba,c62ee600,0,...) at bcopy+0x1a
    ipi_nmi_handler(c62ee600,0,c0f92f98,eeb32a40,c891a000,...) at ipi_nmi_handler+0x2c
    trap(eeb32930) at trap+0x26a
    calltrap() at calltrap+0x6
    --- trap 0x13, eip = 0xc0eaded0, esp = 0xeeb32970, ebp = 0xeeb32970 ---
    VOP_ISLOCKED_APV(c1502c60,eeb329e0,c0fa12dd,1f8,eeb329c0,...) at VOP_ISLOCKED_APV+0x20
    lookup(eeb32b8c,c62d1000,400,eeb32bac,c0d48dd6,...) at lookup+0x3fa
    namei(eeb32b8c,c14eca80,eeb32af8,0,eeb32ac4,...) at namei+0x5b8
    vn_open_cred(eeb32b8c,eeb32c40,1a4,0,c5d8f700,...) at vn_open_cred+0xc0
    vn_open(eeb32b8c,eeb32c40,1a4,c8935620,c1d8aaf8,...) at vn_open+0x3b
    kern_openat(c891e5c0,ffffff9c,2ccc05ec,0,602,...) at kern_openat+0x11e
    kern_open(c891e5c0,2ccc05ec,0,601,1b6,...) at kern_open+0x35
    open(c891e5c0,eeb32cec,eeb32cc0,c0ac9a76,c155c734,...) at open+0x30
    syscall(eeb32d28) at syscall+0x1fb
    Xint0x80_syscall() at Xint0x80_syscall+0x21
    ata1: WARNING - READ_TOC read data overrun 18>12
    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address       = 0x1f4
    fault code                           = supervisor read, page not present
    instruction pointer          = 0x20:0xc0a93746
    stack pointer             = 0x28:0xc5a2abbc
    frame pointer           = 0x28:0xc5a2abd4
    code segment                   = base 0x0, limit 0xfffff, type 0x1b
                                                   = DPL 0, pres 1, def32 1, gran 1
    processor eflags               = interrupt enabled, resume, IOPL = 0
    current process                = 12 (swi6: task queue)
    0xc680a860: tag ufs, type VDIR
        usecount 1, writecount 0, refcount 4 mountedhere 0
        flags ()
        v_object 0xc6752770 ref 0 pages 1
        lock type ufs: EXCL by thread 0xc85322e0 (pid 53831)
                    ino 3933184, on dev ad0s1a
    0xc8676000: tag ufs, type VREG
        usecount 1, writecount 0, refcount 1 mountedhere 0
        flags ()
        lock type ufs: EXCL by thread 0xc85322e0 (pid 53831)
                    ino 3933374, on dev ad0s1a
    version.txt06000021612773423343  7622 ustarrootwheelFreeBSD 8.3-RELEASE-p16 #0: Mon Aug 25 08:25:41 EDT 2014
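
    The reasoning above (an NMI frame in the trace points at hardware rather than a purely corrupted filesystem) can be checked mechanically against a pasted backtrace. The following is only a minimal sketch: the function name `summarize_backtrace` and the regular expressions are hypothetical helpers written for this example, and it assumes that trap number 19 (0x13) is `T_NMI` in FreeBSD's i386 trap numbering. It only parses the text of the report; it does not diagnose the disk or the hypervisor.

```python
import re

# Matches kernel backtrace frames such as:
#   ipi_nmi_handler(c62ee600,0,...) at ipi_nmi_handler+0x2c
FRAME_RE = re.compile(r"^\s*(\w+)\(.*\) at \w+\+0x[0-9a-f]+\s*$")

# Matches trap marker lines such as:
#   --- trap 0x13, eip = 0xc0eaded0, ...
TRAP_RE = re.compile(r"--- trap (0x[0-9a-f]+|\d+),")

# Assumption: 19 (0x13) is T_NMI on FreeBSD i386.
T_NMI = 19


def summarize_backtrace(text):
    """Return frame names, trap numbers, and whether an NMI shows up."""
    frames = []
    traps = []
    for line in text.splitlines():
        frame = FRAME_RE.match(line)
        if frame:
            frames.append(frame.group(1))
            continue
        trap = TRAP_RE.search(line)
        if trap:
            # Base 0 lets int() accept both "0x13" and "12".
            traps.append(int(trap.group(1), 0))
    nmi_seen = T_NMI in traps or "ipi_nmi_handler" in frames
    return {"frames": frames, "trap_numbers": traps, "nmi_seen": nmi_seen}


# A few lines copied from the backtrace in this thread.
SAMPLE = """\
bcopy(2,eeb32924,c0e8f7ba,c62ee600,0,...) at bcopy+0x1a
ipi_nmi_handler(c62ee600,0,c0f92f98,eeb32a40,c891a000,...) at ipi_nmi_handler+0x2c
trap(eeb32930) at trap+0x26a
calltrap() at calltrap+0x6
--- trap 0x13, eip = 0xc0eaded0, esp = 0xeeb32970, ebp = 0xeeb32970 ---
VOP_ISLOCKED_APV(c1502c60,eeb329e0,c0fa12dd,1f8,eeb329c0,...) at VOP_ISLOCKED_APV+0x20
lookup(eeb32b8c,c62d1000,400,eeb32bac,c0d48dd6,...) at lookup+0x3fa
"""

result = summarize_backtrace(SAMPLE)
print(result)
```

    If `nmi_seen` is true, the trace contains the NMI path described above. A trace that shows only filesystem functions such as `lookup` and `namei`, with no NMI, would fit the corrupted-filesystem explanation better.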

  • Sorry, I don't really understand your answer (English isn't my native language). Is there a problem with the hard drive? Should I check it?

  • Rebel Alliance Developer Netgate

    A problem with the hard drive or possibly the disk controller itself on the motherboard (where the drive is plugged in).

    I'm not sure if Proxmox is smart enough to generate an NMI on its own for things like that, so it may be passed through from the actual hardware.

    There is a chance it's something in Proxmox or the host itself, but someone more familiar with Proxmox would have to chime in and answer that part.