Random kernel panic once or more a day :( fresh build 2.4.3



  • greetings everyone!

    i used: pfSense-CE-memstick-2.4.3-RELEASE-amd64.img

    to install on a fresh bare metal box: Asus Q170T CSM running latest BIOS via UEFI mode
    i5-6600T
    32GB RAM
    Samsung 830 SSD

    after fresh install i'm getting a persistent crash, logs uploaded, looks like some sort of kernel panic around this:
    Crash report details:

    No PHP errors found.

    Filename: /var/crash/bounds
    1

    Filename: /var/crash/info.0
    Dump header from device: /dev/gptid/f90e4105-42e3-11e8-ae07-708bcdbdf2bc
      Architecture: amd64
      Architecture Version: 1
      Dump Length: 72704
      Blocksize: 512
      Dumptime: Fri Apr 20 10:48:23 2018
      Hostname: router.localdomain
      Magic: FreeBSD Text Dump
      Version String: FreeBSD 11.1-RELEASE-p7 #10 r313908+986837ba7e9(RELENG_2_4): Mon Mar 26 18:08:25 CDT 2018
        root@buildbot2.netgate.com:/builder/ce-243/tmp/obj/builder/ce-243/tmp/FreeBSD-src/sys/pfSense
      Panic String: spin lock held too long
      Dump Parity: 4159356250
      Bounds: 0
      Dump Status: good

    Filename: /var/crash/info.last
    Dump header from device: /dev/gptid/f90e4105-42e3-11e8-ae07-708bcdbdf2bc
      Architecture: amd64
      Architecture Version: 1
      Dump Length: 72704
      Blocksize: 512
      Dumptime: Fri Apr 20 10:48:23 2018
      Hostname: router.localdomain
      Magic: FreeBSD Text Dump
      Version String: FreeBSD 11.1-RELEASE-p7 #10 r313908+986837ba7e9(RELENG_2_4): Mon Mar 26 18:08:25 CDT 2018
        root@buildbot2.netgate.com:/builder/ce-243/tmp/obj/builder/ce-243/tmp/FreeBSD-src/sys/pfSense
      Panic String: spin lock held too long
      Dump Parity: 4159356250
      Bounds: 0
      Dump Status: good

    ….

    MCA: Bank 6, Status 0xbe00000000801152
    spin lock 0xffffffff82a16f98 (mca) held by 0xfffff8000bf6a000 (tid 100072) too long
    spin lock 0xffffffff82a16f98 (mca) held by 0xfffff8000bf6a000 (tid 100072) too long
    spin lock 0xffffffff82a16f98 (mca) held by 0xfffff8000bf6a000 (tid 100072) too long
    spin lock 0xffffffff82a3d780 (callout) held by 0xfffff800081415c0 (tid 100008) too long
    panic: spin lock held too long
    cpuid = 0
    KDB: enter: panic
    panic.txt0600002713266376667  7165 ustarrootwheelspin lock held too longversion.txt06000027413266376667  7644 ustarrootwheelFreeBSD 11.1-RELEASE-p7 #10 r313908+986837ba7e9(RELENG_2_4): Mon Mar 26 18:08:25 CDT 2018
        root@buildbot2.netgate.com:/builder/ce-243/tmp/obj/builder/ce-243/tmp/FreeBSD-src/sys/pfSense

    ...

    panic: spin lock held too long
    cpuid = 0
    KDB: enter: panic
    panic.txt0600002713266376667  7165 ustarrootwheelspin lock held too longversion.txt06000027413266376667  7644 ustarrootwheelFreeBSD 11.1-RELEASE-p7 #10 r313908+986837ba7e9(RELENG_2_4): Mon Mar 26 18:08:25 CDT 2018
        root@buildbot2.netgate.com:/builder/ce-243/tmp/obj/builder/ce-243/tmp/FreeBSD-src/sys/pfSense

    what am i missing here... i did look up this:
    https://forum.pfsense.org/index.php?topic=42890.0

    I did also have epu power saving mode enabled... but i just disabled it and rebooted... hopefully that's it?

    anyone else see anything in the logs i'm missing here?
    20180420crashreporter.zip


  • Rebel Alliance Developer Netgate

    MCA: Bank 6, Status 0xbe00000000801152
    

    An MCA/MCE can only ever be a hardware problem. Usually there is more to the MCA messages than just that; the panic might have been from the hardware failing to even report the entire error message.

    tl;dr version is that your BIOS detected a hardware problem and tried to inform the OS of a fault, and the OS is relaying that message to you.

    Could be anything from bad RAM to bad power to a flaky MB/CPU. Need to run diags on the hardware to find out.
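
For anyone who wants to look at the status word itself before running diagnostics, here is a minimal sketch that splits the `MCA: Bank 6, Status 0xbe00000000801152` value from the crash log into its generic fields. It assumes the standard x86 machine-check status register layout (flag bits 63 down to 57, error codes in the low 32 bits). The function and field names are made up for illustration and are not part of pfSense or FreeBSD, and the sketch does not try to interpret the error codes, which are model specific.

```python
# Sketch: split an x86 machine-check status word (IA32_MCi_STATUS) into its
# generic fields. Assumption: standard architectural bit layout. The meaning
# of the two error-code fields is CPU specific and is not decoded here.

FLAG_BITS = {
    "valid": 63,                      # register holds a logged error
    "overflow": 62,                   # an earlier error was overwritten
    "uncorrected": 61,                # hardware could not correct the error
    "enabled": 60,                    # reporting for this error was enabled
    "misc_valid": 59,                 # companion MISC register has data
    "addr_valid": 58,                 # companion ADDR register has data
    "processor_context_corrupt": 57,  # CPU state may be unreliable
}


def decode_mca_status(status):
    """Return the flag bits and raw error codes of a 64-bit status value."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["mca_error_code"] = status & 0xFFFF
    fields["model_specific_code"] = (status >> 16) & 0xFFFF
    return fields


if __name__ == "__main__":
    # Status value copied from the crash report in this thread.
    for name, value in decode_mca_status(0xBE00000000801152).items():
        shown = hex(value) if isinstance(value, int) and not isinstance(value, bool) else value
        print(f"{name}: {shown}")
```

With the value from this thread, the "uncorrected" and "processor context corrupt" flags come out set, which fits an error the OS cannot recover from. It does not say which component is at fault, so hardware diagnostics are still needed.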



  • thank you, it was all brand new =)

    i was running w10pro on it for weeks… without any BSOD on windoze side.

    the only thing i remembered changing was that epu power saving mode... i've just disabled that... hopefully that was it?

    if that is the case... would i need to re-upload a fresh log to compare against the old one for possible defect resolution in the next build?

    i'm not sure if free/community builds look for defect resolution compared to the paid editions.... :)

    sorry i'm new to the product.

    other than it's freakin brilliant! wished i had gone there sooner!



  • Rebel Alliance Developer Netgate

    Sorry but running windows means nothing. The OS cannot trigger an MCE/MCA, those come straight from hardware. Brand new also doesn't mean it's good. It might be defective.



  • memtest is clean, ran it for about 48hrs

    HDD is clean, i've got other CPU/RAM/HDD parts… all swapped and tested and vetted.

    seems fine to me... with the exception of that 1 BIOS setting... i cleared the panic for now... but hopefully that one little change was it. perhaps the sw doesn't deal with super low power modes?



  • Rebel Alliance Developer Netgate

    Again, that type of error cannot be from software. The hardware may not like what the OS set, but that is a pure hardware error. It's not an OS or software issue.



  • is there a list of recommended mobos bare-metal side that pfsense likes to be installed on?

    i might just virtualize it then… that way i might be able to make snapshot backups.




  • @JediFonger:

    is there a list of recommended mobos bare-metal side that pfsense likes to be installed on?

    i might just virtualize it then… that way i might be able to make snapshot backups.

    I see absolutely no reason not to virtualize, as you've got 32 GB of RAM in the machine.
    But try to find the defective hardware first.



  • disabling EPU power saving mode didn't resolve the issue btw. every component of HW has been replaced… except the AC adapter... gonna try that next.

    edit: power brick replaced, that wasn't it.

    so FINALLY after all is said and done i figured out what was causing the panic: the case expansion plugs like USB ports, audio plugs, eSATA port. i unplugged all those fancy things and the system has been rock solid since that time!

    yay! just wanted to toss that in here for future reference!