Could you help me analyze these crashdumps?



  • I am having regular crashes. They used to happen about once a month, but recently they have increased to about once a week. I also see the following entries more often in the syslog of the pfSense interface:

    Sep 14 03:05:16	kernel		MCA: Bank 1, Status 0x9400000000000151
    Sep 14 03:05:16	kernel		MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
    Sep 14 03:05:16	kernel		MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
    Sep 14 03:05:16	kernel		MCA: CPU 0 COR ICACHE L1 IRD error
    Sep 14 03:05:16	kernel		MCA: Address 0xffff81250360
    

    This is the last crashdump (ddl.txt):

    db:1:lockinfo> show locks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show alllocks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 0
    dynamic pcpu = 0x860580
    curthread    = 0xfffff80047c3c000: pid 95879 "sh"
    curpcb       = 0xfffffe011f55cb80
    fpcurthread  = 0xfffff80047c3c000: pid 95879 "sh"
    idlethread   = 0xfffff8000496e000: tid 100003 "idle: cpu0"
    curpmap      = 0xfffff8005ec5c138
    tssp         = 0xffffffff835a32d0
    commontssp   = 0xffffffff835a32d0
    rsp0         = 0xfffffe011f55cb80
    gs32p        = 0xffffffff835a9f28
    ldt          = 0xffffffff835a9f68
    tss          = 0xffffffff835a9f58
    tlb gen      = 494627
    db:0:kdb.enter.default>  bt
    Tracing pid 95879 tid 100170 td 0xfffff80047c3c000
    kdb_enter() at kdb_enter+0x3b/frame 0xfffffe011f55c730
    vpanic() at vpanic+0x19b/frame 0xfffffe011f55c790
    panic() at panic+0x43/frame 0xfffffe011f55c7f0
    pmap_remove_pages() at pmap_remove_pages+0x791/frame 0xfffffe011f55c8d0
    vmspace_exit() at vmspace_exit+0x9c/frame 0xfffffe011f55c910
    exit1() at exit1+0x5e9/frame 0xfffffe011f55c970
    sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe011f55c980
    amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe011f55cab0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe011f55cab0
    --- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x800b62caa, rsp = 0x7fffffffeb38, rbp = 0x7fffffffec20 ---
    [shortened]
    <118>Bootup complete
    MCA: Bank 1, Status 0x9400000000000151
    MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
    MCA: CPU 0 COR ICACHE L1 IRD error
    MCA: Address 0xffff80d18660
    MCA: Bank 1, Status 0x9400000000000151
    MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
    MCA: CPU 0 COR ICACHE L1 IRD error
    MCA: Address 0xffff812503c0
    <6>pid 78627 (unbound), jid 0, uid 59: exited on signal 10
    MCA: Bank 1, Status 0x9400000000000151
    MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 0
    MCA: CPU 0 COR ICACHE L1 IRD error
    MCA: Address 0xffff80cfdbd0
    panic: bad pte va 800bb3000 pte 0
    cpuid = 0
    KDB: enter: panic
    

    And these are the start lines of a couple of other ones:

    db:1:lockinfo> show locks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show alllocks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 2
    dynamic pcpu = 0xfffffe01961fb580
    curthread    = 0xfffff80008877620: pid 70097 "cat"
    curpcb       = 0xfffffe011fc63b80
    fpcurthread  = 0xfffff80008877620: pid 70097 "cat"
    idlethread   = 0xfffff80004970000: tid 100005 "idle: cpu2"
    curpmap      = 0xfffff800601ff138
    tssp         = 0xffffffff835a33a0
    commontssp   = 0xffffffff835a33a0
    rsp0         = 0xfffffe011fc63b80
    gs32p        = 0xffffffff835a9ff8
    ldt          = 0xffffffff835aa038
    tss          = 0xffffffff835aa028
    tlb gen      = 1145354
    db:0:kdb.enter.default>  bt
    Tracing pid 70097 tid 100130 td 0xfffff80008877620
    kdb_enter() at kdb_enter+0x3b/frame 0xfffffe011fc63730
    vpanic() at vpanic+0x19b/frame 0xfffffe011fc63790
    panic() at panic+0x43/frame 0xfffffe011fc637f0
    pmap_remove_pages() at pmap_remove_pages+0x791/frame 0xfffffe011fc638d0
    vmspace_exit() at vmspace_exit+0x9c/frame 0xfffffe011fc63910
    exit1() at exit1+0x5e9/frame 0xfffffe011fc63970
    sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe011fc63980
    amd64_syscall() at amd64_syscall+0xa86/frame 0xfffffe011fc63ab0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe011fc63ab0
    --- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x800903caa, rsp = 0x7fffffffec28, rbp = 0x7fffffffec40 ---
    db:0:kdb.enter.default>  ps
    
    db:1:lockinfo> show locks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show alllocks
    No such command; use "help" to list available commands
    db:1:lockinfo>  show lockedvnods
    Locked vnodes
    db:0:kdb.enter.default>  show pcpu
    cpuid        = 0
    dynamic pcpu = 0x898380
    curthread    = 0xfffff80058533000: pid 67822 "sh"
    curpcb       = 0xfffffe0120575b80
    fpcurthread  = 0xfffff80058533000: pid 67822 "sh"
    idlethread   = 0xfffff80004975000: tid 100003 "idle: cpu0"
    curpmap      = 0xfffff800ced5c138
    tssp         = 0xffffffff82bb6810
    commontssp   = 0xffffffff82bb6810
    rsp0         = 0xfffffe0120575b80
    gs32p        = 0xffffffff82bbd068
    ldt          = 0xffffffff82bbd0a8
    tss          = 0xffffffff82bbd098
    db:0:kdb.enter.default>  bt
    Tracing pid 67822 tid 100156 td 0xfffff80058533000
    kdb_enter() at kdb_enter+0x3b/frame 0xfffffe01205752b0
    vpanic() at vpanic+0x194/frame 0xfffffe0120575310
    panic() at panic+0x43/frame 0xfffffe0120575370
    pmap_remove_pages() at pmap_remove_pages+0x7fc/frame 0xfffffe0120575450
    exec_new_vmspace() at exec_new_vmspace+0x1b5/frame 0xfffffe01205754c0
    exec_elf64_imgact() at exec_elf64_imgact+0x931/frame 0xfffffe01205755b0
    kern_execve() at kern_execve+0x77c/frame 0xfffffe0120575900
    sys_execve() at sys_execve+0x4a/frame 0xfffffe0120575980
    amd64_syscall() at amd64_syscall+0xa38/frame 0xfffffe0120575ab0
    fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0120575ab0
    --- syscall (59, FreeBSD ELF64, sys_execve), rip = 0x800b4664a, rsp = 0x7fffffffe218, rbp = 0x7fffffffe360 ---
    db:0:kdb.enter.default>  ps
    

    Many thanks for any hints!


  • Netgate Administrator

    MCA faults can only be hardware. Check the RAM if it just started happening spontaneously.

    Steve
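
For reference, the `Status` word in the MCA lines above can be decoded by hand. The following is a minimal sketch, assuming the standard x86 machine-check layout of the status register (flag bits near the top, a 16-bit error code in the low bits) and handling only the cache ("memory hierarchy") error-code format. The function name `decode_mca_status` is made up for illustration; it is not part of pfSense or FreeBSD.

```python
# Sketch: decode an x86 machine-check status word as printed in
# "MCA: Bank N, Status 0x..." kernel log lines.
# Assumption: only the cache ("memory hierarchy") error-code format
# 0000 0001 RRRR TTLL is handled; other formats are returned undecoded.

REQUESTS = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
            5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTIONS = {0: "instruction", 1: "data", 2: "generic"}
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}


def decode_mca_status(status: int) -> dict:
    """Return the main flags and, for cache errors, the decoded error code."""
    result = {
        "valid": bool((status >> 63) & 1),            # register holds a logged error
        "overflow": bool((status >> 62) & 1),         # an earlier error was overwritten
        "uncorrected": bool((status >> 61) & 1),      # False means hardware corrected it
        "addr_valid": bool((status >> 58) & 1),       # an address was recorded
        "context_corrupt": bool((status >> 57) & 1),  # processor context corrupt
    }
    code = status & 0xFFFF
    result["error_code"] = hex(code)
    if code >> 8 == 0x01:  # cache error format
        result["request"] = REQUESTS.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTIONS.get((code >> 2) & 0x3, "unknown")
        result["level"] = LEVELS[code & 0x3]
    return result


if __name__ == "__main__":
    # Status value taken from the log lines in this thread.
    for key, value in decode_mca_status(0x9400000000000151).items():
        print(f"{key}: {value}")
```

For the status value in this thread, the sketch reports a valid, corrected error with a recorded address, caused by an instruction fetch (IRD) in the L1 instruction cache. That matches the kernel's own summary line, "CPU 0 COR ICACHE L1 IRD error".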



  • Many thanks! Unfortunately, the firewall sits at a remote location, so I will need to drive 2 hours to do the memtest. Still, it is good to know that it is unambiguously the hardware.



  • @Helmut101 Disable "Core Performance Boost" in the BIOS and see if it still crashes. I have the same problem, and this is what solves it for me.




  • Many thanks! Unfortunately (or fortunately), I returned the APU.C2 to the manufacturer and got a replacement. They said they had never seen this occur, but the kernel panic was pretty unambiguously hardware-related, which is why they exchanged it without a problem.

    pfSense has been running on the new APU for 4 weeks with the same configuration, and there have been no more problems.

    What finally made me accept that this must be hardware/memory is that errors increased over the last couple of months. First it was a crash once a month. In September, it increased to once a week. In the last week, it was several crashes per day.

    I am not sure, but one possible cause I considered is that the pfblocker_ng extension raised the temperature to 50-55°C on a permanent basis. This is well within an acceptable range, but below 50°C would be preferable, I think.
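
To see whether corrected errors are becoming more frequent before the box starts to panic, the MCA entries in the system log can be counted per day. A minimal sketch, assuming the log lines look like the syslog excerpt in the first post (month, day, time, then the `MCA:` message). The function name `count_mca_events_per_day` is hypothetical, and each "MCA: Bank" line is treated as one event.

```python
import re
from collections import Counter

# Assumed line shape, taken from the syslog excerpt in this thread:
#   "Sep 14 03:05:16  kernel  MCA: Bank 1, Status 0x9400000000000151"
# Every machine-check report starts with one "MCA: Bank" line, so that
# line is counted as a single event.
MCA_BANK_LINE = re.compile(
    r"^\s*(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+\d{2}:\d{2}:\d{2}\b"
    r".*\bMCA: Bank \d+"
)


def count_mca_events_per_day(lines):
    """Count MCA reports per calendar day (keyed like 'Sep 14')."""
    counts = Counter()
    for line in lines:
        match = MCA_BANK_LINE.match(line)
        if match:
            counts[f"{match['month']} {int(match['day'])}"] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the syslog excerpt in the first post.
    sample = [
        "Sep 14 03:05:16\tkernel\t\tMCA: Bank 1, Status 0x9400000000000151",
        "Sep 14 03:05:16\tkernel\t\tMCA: CPU 0 COR ICACHE L1 IRD error",
    ]
    for day, count in count_mca_events_per_day(sample).items():
        print(day, count)
```

A count that climbs from day to day would fit the pattern described above, where the crashes went from monthly to weekly to several per day.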


  • LAYER 8

    Nowadays those CPUs should be able to work smoothly up to 75-80°C.
    50-55°C is nothing to worry about.

    pi@raspberrypi2:~ $ vcgencmd measure_temp
    temp=58.0'C
    

    That one is running without any problem.

    It is more likely that your unit was simply faulty.


  • Netgate Administrator

    @Helmut101 said in Could you help me analyze these crashdumps?:

    This is totally within an acceptable range, but below 50°C would be preferable I think

    Yeah lower is always preferable but that is within the expected temperature range. You should not expect it to fail unreasonably early at that.

    Steve

