HA - Crash report - Need help to understand why
Could you analyse the crash report below and help us understand why the secondary pfSense crashed, and why our primary pfSense went down and then stayed unstable for about 30 minutes?
To explain: we have two pfSense instances configured in HA, running version 2.1.5 (I know this is an old version; we have a project to upgrade). Last week we had a production outage and our internet lines went down (fiber, VPN, VDSL): the primary pfSense had a high load average (~13) and the secondary pfSense had crashed with this crash report. We shut down the secondary and disabled the SYNC (HA - pfsync) interface to bring the primary pfSense back to life.
Currently, these pfSense instances are virtualized on Proxmox with Intel e1000 network cards (we would like to move to physical hardware with the newest version, but I have tested it and we have a problem with IPsec and FTP).
So, can you help us? Do you need more information?
Your disk and/or disk controller is shot.
A wipe and reload might help but it looks more like hardware to me because of the NMI trap there – that signal can only be generated from hardware.
If it was just a corrupted filesystem it would only have crashed in filesystem functions and it wouldn't have the NMI bits in the trace.
db:0:kdb.enter.default> bt
Tracing pid 24734 tid 100230 td 0xc891e5c0
bcopy(2,eeb32924,c0e8f7ba,c62ee600,0,...) at bcopy+0x1a
ipi_nmi_handler(c62ee600,0,c0f92f98,eeb32a40,c891a000,...) at ipi_nmi_handler+0x2c
trap(eeb32930) at trap+0x26a
calltrap() at calltrap+0x6
--- trap 0x13, eip = 0xc0eaded0, esp = 0xeeb32970, ebp = 0xeeb32970 ---
VOP_ISLOCKED_APV(c1502c60,eeb329e0,c0fa12dd,1f8,eeb329c0,...) at VOP_ISLOCKED_APV+0x20
lookup(eeb32b8c,c62d1000,400,eeb32bac,c0d48dd6,...) at lookup+0x3fa
namei(eeb32b8c,c14eca80,eeb32af8,0,eeb32ac4,...) at namei+0x5b8
vn_open_cred(eeb32b8c,eeb32c40,1a4,0,c5d8f700,...) at vn_open_cred+0xc0
vn_open(eeb32b8c,eeb32c40,1a4,c8935620,c1d8aaf8,...) at vn_open+0x3b
kern_openat(c891e5c0,ffffff9c,2ccc05ec,0,602,...) at kern_openat+0x11e
kern_open(c891e5c0,2ccc05ec,0,601,1b6,...) at kern_open+0x35
open(c891e5c0,eeb32cec,eeb32cc0,c0ac9a76,c155c734,...) at open+0x30
syscall(eeb32d28) at syscall+0x1fb
Xint0x80_syscall() at Xint0x80_syscall+0x21
ata1: WARNING - READ_TOC read data overrun 18>12
Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address = 0x1f4
fault code = supervisor read, page not present
instruction pointer = 0x20:0xc0a93746
stack pointer = 0x28:0xc5a2abbc
frame pointer = 0x28:0xc5a2abd4
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi6: task queue)
0xc680a860: tag ufs, type VDIR
usecount 1, writecount 0, refcount 4 mountedhere 0
flags ()
v_object 0xc6752770 ref 0 pages 1
lock type ufs: EXCL by thread 0xc85322e0 (pid 53831)
ino 3933184, on dev ad0s1a
0xc8676000: tag ufs, type VREG
usecount 1, writecount 0, refcount 1 mountedhere 0
flags ()
lock type ufs: EXCL by thread 0xc85322e0 (pid 53831)
ino 3933374, on dev ad0s1a
version.txt06000021612773423343 7622 ustarrootwheel
FreeBSD 8.3-RELEASE-p16 #0: Mon Aug 25 08:25:41 EDT 2014 root@pf2_1_1_i386.pfsense.org:/usr/obj.i386/usr/pfSensesrc/src/sys/pfSense_SMP.8
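For anyone who wants to check the "NMI bits in the trace" argument mechanically, here is a minimal Python sketch. It is not part of pfSense or FreeBSD tooling; the names `FRAME_RE`, `NMI_FRAMES`, `VFS_FRAMES`, `parse_frames` and `summarize` are hypothetical, and the two classification sets are assumptions picked by hand from the backtrace in this thread, not an authoritative list of kernel symbols.

```python
import re

# One DDB frame looks like "lookup(eeb32b8c,...) at lookup+0x3fa".
# Capture the symbol name that follows " at " and precedes "+0x<offset>".
FRAME_RE = re.compile(r"\bat\s+([A-Za-z_][A-Za-z0-9_]*)\+0x[0-9a-fA-F]+")

# Assumption: hand-picked sets for illustration only.
NMI_FRAMES = {"ipi_nmi_handler"}
VFS_FRAMES = {"VOP_ISLOCKED_APV", "lookup", "namei", "vn_open_cred",
              "vn_open", "kern_openat", "kern_open", "open"}

# Backtrace from this thread, with arguments shortened to "...".
SAMPLE_TRACE = """
bcopy(...) at bcopy+0x1a
ipi_nmi_handler(...) at ipi_nmi_handler+0x2c
trap(eeb32930) at trap+0x26a
calltrap() at calltrap+0x6
--- trap 0x13, eip = 0xc0eaded0, esp = 0xeeb32970, ebp = 0xeeb32970 ---
VOP_ISLOCKED_APV(...) at VOP_ISLOCKED_APV+0x20
lookup(...) at lookup+0x3fa
namei(...) at namei+0x5b8
vn_open_cred(...) at vn_open_cred+0xc0
vn_open(...) at vn_open+0x3b
kern_openat(...) at kern_openat+0x11e
kern_open(...) at kern_open+0x35
open(...) at open+0x30
syscall(eeb32d28) at syscall+0x1fb
Xint0x80_syscall() at Xint0x80_syscall+0x21
"""

def parse_frames(trace):
    """Return the function names of all frames, innermost first."""
    return [m.group(1) for m in FRAME_RE.finditer(trace)]

def summarize(trace):
    """Split the frames into NMI-related and filesystem-related ones."""
    frames = parse_frames(trace)
    return {
        "frames": frames,
        "nmi": [f for f in frames if f in NMI_FRAMES],
        "vfs": [f for f in frames if f in VFS_FRAMES],
    }

if __name__ == "__main__":
    result = summarize(SAMPLE_TRACE)
    print("NMI frames:", result["nmi"])
    print("Filesystem frames:", result["vfs"])
```

On this trace the summary contains both filesystem frames and `ipi_nmi_handler`, which is the combination the reply above points to: an NMI arrived while the kernel was inside an ordinary `open` path.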
Sorry, I don't really understand your answer (English isn't my native language). Is there a problem with the hard drive? Should I check it?
A problem with the hard drive, or possibly with the disk controller itself on the motherboard (where the drive is plugged in).
I'm not sure if Proxmox is smart enough to generate an NMI on its own for things like that, so it may be passed through from the actual hardware.
There is a chance it's something in Proxmox or the host itself, but someone more familiar with Proxmox would have to chime in and answer that part.