"Page fault while in kernel mode" on APU2 after bios/coreboot upgrade



  • Hey all,

    I recently upgraded my BIOS on an APU2 board from Coreboot v.4.0.X to v4.12.0.4. That's the latest version downloaded from https://pcengines.github.io. Since then, pfSense 2.4.5-RELEASE-p1 crashes about once a day at a random time and automatically reboots.

    According to the last two crash reports, there is a "page fault while in kernel mode".

    Let me know if you have experienced something similar and if you have any ideas about how to troubleshoot and fix this issue.

    Dump info:

    Architecture: amd64
    Architecture Version: 1
    Dump Length: 72704
    Blocksize: 512
    Magic: FreeBSD Text Dump
    Version String: FreeBSD 11.3-STABLE #243 abf8cba50ce(RELENG_2_4_5): Tue Jun  2 17:53:37 EDT 2020
      root@buildbot1-nyi.netgate.com:/build/ce-crossbuild-245/obj/amd64/YNx4Qq3j/build/ce-crossbuild-245/source
    Panic String: page fault
    Dump Parity: 447612467
    Bounds: 0
    Dump Status: good
    

    Crash 1:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address	= 0x0
    fault code		= supervisor read instruction, page not present
    instruction pointer	= 0x20:0x0
    stack pointer	        = 0x28:0xfffffe011f4ee800
    frame pointer	        = 0x28:0xfffffe011f4ee8d0
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= interrupt enabled, resume, IOPL = 0
    current process		= 54266 (sh)
    trap number		= 12
    panic: page fault
    cpuid = 0
    KDB: enter: panic
    

    Crash 2:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address	= 0xffffffffffffffff
    fault code		= supervisor write data, page not present
    instruction pointer	= 0x20:0xffffffff812504ca
    stack pointer	        = 0x28:0xfffffe011f5c0800
    frame pointer	        = 0x28:0xfffffe011f5c08d0
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= interrupt enabled, resume, IOPL = 0
    current process		= 86485 (ls)
    trap number		= 12
    panic: page fault
    cpuid = 1
    KDB: enter: panic
    

  • Netgate Administrator

    Need to see the backtrace to compare, but since those faults are in different processes the backtraces will be different. That implies a high likelihood of a hardware issue, probably RAM. Did that BIOS update change the memory handling at all?
    Does going back to an earlier version correct it?

    Steve
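To make the comparison between crash reports easier, here is a minimal Python sketch that pulls the key fields out of two "Fatal trap" reports and lists which ones differ. The field names and sample values come from the dumps quoted in this thread; the helper names (`summarize_trap`, `differing_fields`) are hypothetical and only for illustration.

```python
import re

# Fields taken from the FreeBSD "Fatal trap 12" output quoted in this thread.
FIELDS = (
    "fault virtual address",
    "fault code",
    "instruction pointer",
    "current process",
)
FIELD_RE = re.compile(
    r"^\s*(" + "|".join(FIELDS) + r")\s*=\s*(.+?)\s*$", re.MULTILINE
)


def summarize_trap(text):
    """Return the interesting fields of one trap report as a dict."""
    return dict(FIELD_RE.findall(text))


def differing_fields(a, b):
    """List the fields whose values differ between two trap summaries."""
    return sorted(k for k in FIELDS if a.get(k) != b.get(k))


# Values copied from "Crash 1" and "Crash 2" in the first post.
CRASH_1 = """
fault virtual address = 0x0
fault code            = supervisor read instruction, page not present
instruction pointer   = 0x20:0x0
current process       = 54266 (sh)
"""

CRASH_2 = """
fault virtual address = 0xffffffffffffffff
fault code            = supervisor write data, page not present
instruction pointer   = 0x20:0xffffffff812504ca
current process       = 86485 (ls)
"""

if __name__ == "__main__":
    one, two = summarize_trap(CRASH_1), summarize_trap(CRASH_2)
    print(differing_fields(one, two))
```

If every field differs from one crash to the next, as it does for these two reports, that fits the hardware suspicion better than a single repeatable software bug would.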



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    I recently upgraded my BIOS on an APU2 board from Coreboot v.4.0.X to v4.12.0.4.

    Besides the many other changes between the legacy BIOS line (4.0.x) and mainline (4.12.0.x), mainline has Core Performance Boost enabled by default. This COULD be what makes slightly faulty RAM act up.
    You could deactivate it in the BIOS and see what happens.
    coreboot-apuspare.png

    Regards,
    fireodo



  • Thank you @stephenw10 and @fireodo !

    I deactivated the "Core Performance Boost" option and I'm waiting to see what happens.

    Today it crashed several times before I made the BIOS change, and the faults are in a different process every time. I also got the following "spin lock held too long" error twice:

    MCA: Bank 1, Status 0x9400000000000151
    MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
    MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 1
    MCA: CPU 1 COR ICACHE L1 IRD error
    MCA: Address 0x4eb660
    spin lock 0xffffffff83517de8 (smp rendezvous) held by 0xfffff80008b65620 (tid 100132) too long
    timeout stopping cpus
    panic: spin lock held too long
    cpuid = 1
    KDB: enter: panic
    

    and

    spin lock 0xffffffff83517de8 (smp rendezvous) held by 0xfffff80004acd000 (tid 100059) too long
    timeout stopping cpus
    panic: spin lock held too long
    cpuid = 1
    KDB: enter: panic
    


  • If it fails again I'll run a memtest and possibly downgrade to an older version of coreboot. By the way, my pfSense config is an old one that I have kept while upgrading to newer versions.



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    If it fails again I'll run a memtest and possibly downgrade to an older version of coreboot. By the way, my pfSense config is an old one that I have kept while upgrading to newer versions.

    Memtest is a good idea and maybe a checkdisk too!

    Good Weekend,
    fireodo


  • Netgate Administrator

    Yup, definitely try memtest if you can. That MCA error can only be hardware related so I would guess it is something to do with the core boost if it doesn't happen on legacy BIOS versions. I haven't dug deep enough here to find out if that changes the ram clock. I don't have an APU new enough to support that.

    Steve



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    possibly downgrade to an older version of coreboot.

    Hi,

    A downgrade is an unnecessary step back; APU2-based boxes work perfectly with the new BIOS.


    Maybe the problem is that you stayed on a "legacy BIOS" version for a long time (I don't understand why) and have now taken a big step forward on an old pfSense install.

    My suggestion is a full backup followed by a fresh pfSense installation with the latest BIOS😉 (v4.12.0.4)

    Important:
    After installing the BIOS, the APU boards require a complete power outage (60-120 sec); a warm or cold reboot is not enough!



  • Uptime: 16 hours with no crash yet, fingers-crossed. :)

    Thanks @DaddyGo , I had done the complete power outage so that shouldn't be an issue here.

    I agree that a fresh pfSense with the latest BIOS would be ideal but I keep this as my last option right now. Ideally I wouldn't even restore my config and do everything from scratch but I'm not sure if I'll have the time and patience to do that.

    In regards to your comment about the legacy BIOS version, honestly I didn't have a good reason to keep upgrading the BIOS when the device works flawlessly with the latest pfSense releases. Sometimes the BIOS upgrades might cause issues and I didn't have time to deal with these. I upgraded now because the device relocated and it's always a good opportunity to start fresh with the latest versions.



  • @DaddyGo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    A downgrade is an unnecessary step back; APU2-based boxes work perfectly with the new BIOS.

    @DaddyGo can you please confirm if "Core Performance Boost" is currently enabled or disabled in your BIOS? For the record, I have Coreboot v4.12.0.4, not v4.12.0.3. Let me know how it goes when you upgrade.



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    can you please confirm if "Core Performance Boost" is currently enabled or disabled in your BIOS?

    We have been using APU boards for many years, so we have a lot of experience with these MOBOs.
    We’ve been through a lot of BIOS versions already. 😉

    We have long deviated from the legacy BIOS line at the suggestion of pcEngines and 3mdeb.

    CPB has been in use for a long time; with it the first CPU core runs at 1,400 MHz, which is good for OpenVPN stuff.

    CPB has been enabled since V4.9.0.2


    with this you can check: sysctl dev.cpu.0.freq_levels

    Updating the BIOS is quite difficult due to known USB flash drive problems; almost only the Kingston DT100 G3 can be used to update the BIOS. I quickly bought the 16 GB and 32 GB models, as they are no longer available.

    The sequence of operations is well described here, if you need help I am happy to be at your disposal.
    https://pcengines.ch/howto.htm#TinyCoreLinux

    register for BIOS information here:
    https://pcengines.github.io/
    (you will receive a first-hand update via email)


    btw:

    Also, don’t forget about Intel tweaks and the correct configuration of your NIC
    loader.conf.local....
    like:

    legal.intel_ipw.license_ack=1
    legal.intel_iwi.license_ack=1
    hw.igb.rx_process_limit=-1
    hw.igb.tx_process_limit=-1
    hw.igb.rxd=1024
    hw.igb.txd=1024
    hw.igb.max_interrupt_rate=64000

    and etc......

    system tunables...
    disable EEE,
    disable flow control
    kern.ipc.nmbclusters
    set net.inet.ip.redirect (enable tryforward routing path ipv4)

    and similar things....



  • @DaddyGo thanks a lot for your response.

    For the record, the device has been working smoothly without any crashes for about 6 days after I disabled CPB. So that was definitely what caused the issue. I'll try to re-enable it and do some tuning in case this can be solved without having to keep CPB disabled or re-install pfSense from scratch. I'll provide updates about my progress on this thread for future reference.



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    the device has been working smoothly without any crashes for about 6 days after I disabled CPB

    This means that your problem is CPB dependent, but I really have not heard of anyone else having this problem in the long run.

    CPB is not a required feature, but if it already exists and can be enabled, why not use it.
    For us, it caused a significant improvement in ExpVPN connections

    These links can also be useful:

    https://teklager.se/en/knowledge-base/apu2-vpn-performance/
    https://teklager.se/en/knowledge-base/apu2-1-gigabit-throughput-pfsense/
    https://teklager.se/en/knowledge-base/

    btw:
    99% of pcEngines users use CPB, the forum is full of APU board descriptions, I think it's a good thing



  • @DaddyGo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    btw:
    99% of pcEngines users use CPB, the forum is full of APU board descriptions, I think it's a good thing

    I have CPB too, and I have tested with and without it; there was no difference in pfSense's behavior (besides the speed increase). But I think that the original poster's APU has RAM that is on the "limit", and the increase in speed makes that RAM produce errors.
    That's what I suppose.

    Fine Weekend,
    fireodo



  • @fireodo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    but I think that the original poster's APU has RAM that is on the "limit", and the increase in speed makes that RAM produce errors.

    This is very possible....exhausted RAM

    no matter how good the APU stuff is, 4GB of RAM was often on the "verge" for me

    Don't forget @fireodo that 3mdeb (BIOS developers) has been activating RAM ECC for some time

    so this should help with RAM errors



  • @DaddyGo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    Don't forget @fireodo that 3mdeb (BIOS developers) has been activating RAM ECC for some time

    so this should help with RAM errors

    I know - but if the Hardware is not OK (the RAM-Chips) then even ECC cannot compensate that!



  • @fireodo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    Hardware is not OK

    That’s really true, and then we’ll see what the OP gets



  • @DaddyGo, @fireodo , @stephenw10

    Hey folks, let me provide an update here:

    • Memtest was completed without errors but pfSense kept crashing.
    • I upgraded coreboot to v4.12.0.5 but it kept crashing.
    • I reinstalled pfSense 2.4.5-RELEASE-p1 and restored my config but it kept crashing, which is something I was not expecting.
    • I kept the CPU Boost config option in my loader.conf.local and disabled again the option "Core Performance Boost" in Bios. It stopped crashing and CPU Boost is still active:
    dev.cpu.0.temperature: 62.7C
    dev.cpu.0.cx_method: C1/hlt C2/io
    dev.cpu.0.cx_usage_counters: 24303377 0
    dev.cpu.0.cx_usage: 100.00% 0.00% last 1981us
    dev.cpu.0.cx_lowest: C1
    dev.cpu.0.cx_supported: C1/1/0 C2/2/400
    dev.cpu.0.freq_levels: 1400/-1 1200/-1 1000/-1
    dev.cpu.0.freq: 1400
    dev.cpu.0.%parent: acpi0
    dev.cpu.0.%pnpinfo: _HID=none _UID=0
    dev.cpu.0.%location: handle=\_PR_.P000
    dev.cpu.0.%driver: cpu
    dev.cpu.0.%desc: ACPI CPU
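As a rough way to check the same thing from a script, here is a minimal Python sketch that parses the `dev.cpu.0.freq_levels` value shown above. FreeBSD reports each level as frequency/power (MHz and milliwatts, with -1 meaning unknown). The 1000 MHz nominal clock is taken from this thread ("1,400 instead of 1,000"); the function names are hypothetical.

```python
import subprocess

NOMINAL_MHZ = 1000  # assumption: non-boosted clock, as stated in this thread


def parse_freq_levels(value):
    """Parse 'MHz/mW' pairs such as '1400/-1 1200/-1 1000/-1'."""
    levels = []
    for pair in value.split():
        mhz, _, milliwatts = pair.partition("/")
        levels.append((int(mhz), int(milliwatts)))
    return levels


def boost_visible(value, nominal_mhz=NOMINAL_MHZ):
    """True if any advertised level is above the nominal clock."""
    return max(mhz for mhz, _ in parse_freq_levels(value)) > nominal_mhz


def read_freq_levels(cpu=0):
    """Read the sysctl on a live FreeBSD/pfSense box (not called here)."""
    result = subprocess.run(
        ["sysctl", "-n", f"dev.cpu.{cpu}.freq_levels"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Value copied from the sysctl output above.
    print(boost_visible("1400/-1 1200/-1 1000/-1"))
```

This only tells you what the OS advertises, not which BIOS or loader setting made the 1400 MHz level appear.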
    

    Core Performance Boost is triggering this for some reason. It was crashing randomly, not when it was under load.
    Could anyone share their APU2 loader.conf.local file for reference? I'm wondering if I'm missing something obvious; I haven't done any tuning for years because it has been running smoothly with no issues.


  • Netgate Administrator

    The fact it threw an MCA error implies it was hitting some hardware issue and it looked to be in the RAM.

    I'm not entirely sure what the Core Performance Boost setting does, but I could well believe it pushes the RAM or bus speed up with the CPU. Your RAM appears to be incapable of running stably at that new rate. Or something similar to that.

    Steve


  • LAYER 8

    are you sure it's ram?

    to me it can be overclocked cpu or burned cpu

    MCA: Vendor "AuthenticAMD", ID 0x730f01, APIC ID 1
    MCA: CPU 1 COR ICACHE L1 IRD error
    

    Machine Check Architecture

    CPU 1
    COR = Corrected
    ICACHE = Instruction Cache
    L1 = L1 Cache (On Chip)
    IRD = Instruction Fetch
    error is self explanatory.


  • Netgate Administrator

    Nope I'm not sure. And your explanation looks better!

    Pretty much the only thing that made me think it might be RAM was:

    MCA: Bank 1, Status 0x9400000000000151
    

    Which I assume to be a RAM bank but it could be cache or some other terminology.

    Steve



  • @kiokoman thanks, that's a good point. I have seen crashes with CPU ID 0 and CPU ID 1.

    Last three dumps:

    Fatal trap 12: page fault while in kernel mode
    cpuid = 0; apic id = 00
    fault virtual address	= 0x1af
    fault code		= supervisor read instruction, page not present
    instruction pointer	= 0x20:0x1af
    stack pointer	        = 0x28:0xfffffe0118ce1890
    frame pointer	        = 0x28:0xfffffe0118ce18f0
    code segment		= base 0x0, limit 0xfffff, type 0x1b
    			= DPL 0, pres 1, long 1, def32 0, gran 1
    processor eflags	= resume, IOPL = 0
    current process		= 11 (idle: cpu0)
    trap number		= 12
    panic: page fault
    cpuid = 0
    KDB: enter: panic
    
    spin lock 0xffffffff83517de8 (smp rendezvous) held by 0xfffff8009ddbf000 (tid 100206) too long
    timeout stopping cpus
    panic: spin lock held too long
    cpuid = 1
    KDB: enter: panic
    
    spin lock 0xffffffff83517de8 (smp rendezvous) held by 0xfffff8008b216620 (tid 100197) too long
    timeout stopping cpus
    panic: spin lock held too long
    cpuid = 1
    KDB: enter: panic
    

  • LAYER 8

    it can be useful for others with this kind of errors but

    it's the MCI status register, not the RAM bank

    ECC error (ADDR valid) 0x9426c0010b000813
    ECC error overflow (ADDR valid) 0xd426c0010b000813
    ECC error (ADDR invalid) 0x9026c0010b000813
    ECC error overflow (ADDR invalid) 0xd026c0010b000813
    L1 Cache Data Store error (UE) 0xb600200000000145
    **L1 Instruction Cache (Instruction Fetch) error (ADDR valid) 0x9400000000000151**
    L1 Instruction Cache (Instruction Fetch) error overflow (ADDR valid) 0xd400000000000151
    Bus Unit (L2 Cache) error (UE) 0xb600000000020136
    L2 Data Cache (Line Fill) error (ADDR valid) 0x9400400000000136
    L2 Data Cache (Line Fill) error overflow (ADDR valid) 0xd400400000000136
    

    this is specific for this CPU:

    The error-reporting machine check register banks supported in this processor are:
    • MC0: Data cache (DC).
    • MC1: Instruction cache (IC). <- "MCA bank 1"
    • MC2: Bus unit (BU), including L2 cache.
    • MC3: Reserved.
    • MC4: Northbridge (NB), including the IO link. These MSRs are also accessible from configuration
    space. There is only one NB error-reporting bank, independent of the number of cores.
    • MC5: Fixed-issue reorder buffer (FR) machine check registers.
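To illustrate how the status values above break down, here is a minimal Python sketch that decodes an MCi_STATUS value. It assumes the architectural bit layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57) and the memory-hierarchy error code format `0000 0001 RRRR TTLL`; treat it as an approximation and check the processor's BKDG for model-specific fields. The function name is hypothetical.

```python
# Memory-hierarchy error code fields (low 16 bits: 0000 0001 RRRR TTLL).
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "I", 1: "D", 2: "G"}  # instruction, data, generic
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_mc_status(status):
    """Split an MCi_STATUS value into flags and, if possible, cache fields."""
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF
    info = {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),
        "error_code": code,
    }
    # Only decode further when the code matches the memory-hierarchy pattern.
    if (code & 0xFF00) == 0x0100:
        info["request"] = REQUEST.get((code >> 4) & 0xF, "?")
        info["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "?")
        info["level"] = LEVEL[code & 0x3]
    return info


if __name__ == "__main__":
    # Status value from the MCA report earlier in this thread.
    for key, value in decode_mc_status(0x9400000000000151).items():
        print(f"{key}: {value}")
```

For the value from this thread the sketch reports a valid, corrected error with a valid address, and an IRD request on the instruction side at L1, which lines up with the "COR ICACHE L1 IRD" text printed by the kernel.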
    

  • LAYER 8

    @CS
    CPU ID 0 and CPU ID 1 it's probably a dual core cpu ?
    timeout stopping CPUs, it was unable to speak with the CPU
    with spin lock held too long, it's basically telling you: "I can't wait forever here, so I guess I'll stop and panic"
    based on what you had before I would check CPU settings like overclock / voltage / frequency, overheat, and dust on the fan if there is one

    Does it seem to be a common problem for Apu2 ? https://forum.netgate.com/topic/156830/could-you-help-me-analyze-these-crashdumps?_=1602587866619



  • @kiokoman APU2 has a single AMD Embedded G series GX-412TC, 4 CPUs: 1 package x 4 cores.
    No overclocking and no active cooling in place for these boards.

    Reference: https://pcengines.ch/apu2.htm


  • LAYER 8

    ah i didn't understand that the problem was solved
    so it was Core Performance Boost
    it was probably overclocking the cpu



  • @kiokoman correct, "Core Performance Boost" was causing it and we were trying to find out why considering that other folks have it enabled on APU2 without experiencing any issues.


  • LAYER 8

    we have a saying in Italy, literally translated as ‘not all donuts come out with a hole’ meaning ‘not everything turns out as planned’ 😂
    it's called the "silicon lottery": not all CPUs are the same, and there is ample opportunity for some microscopic part of a CPU, which works fine at a certain speed/voltage combination, to not work if the speed or voltage is increased.



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    @DaddyGo, @fireodo , @stephenw10

    Could anyone share their APU2 loader.conf.local file for reference? I'm wondering if I'm missing something obvious; I haven't done any tuning for years because it has been running smoothly with no issues.

    Hi, here is the content of my loader.conf.local:

    legal.intel_ipw.license_ack=1
    legal.intel_iwi.license_ack=1
    debug.acpi.avoid="_SB_.PCI0.GPIO"  # necessary for loading apuled.ko

    If you still have "hint.acpi_perf.0.disabled=1" in your loader.conf.local, you will see those increased frequencies in sysctl dev.cpu even when you have disabled CPB in the BIOS.

    Regards,
    fireodo
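For anyone who wants to check their own file for that hint, here is a minimal Python sketch that parses loader.conf-style `key=value` lines. The sample text reuses the settings quoted in this thread; the parser is deliberately naive (it treats any `#` as the start of a comment, so it would mis-handle a `#` inside a quoted value), and the function name is hypothetical.

```python
def parse_loader_conf(text):
    """Parse simple key=value lines, ignoring blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # naive comment stripping
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


# Sample contents assembled from the settings mentioned in this thread.
SAMPLE = """
legal.intel_ipw.license_ack=1
legal.intel_iwi.license_ack=1
debug.acpi.avoid="_SB_.PCI0.GPIO"  # needed for apuled.ko
hint.acpi_perf.0.disabled=1
"""

if __name__ == "__main__":
    conf = parse_loader_conf(SAMPLE)
    if conf.get("hint.acpi_perf.0.disabled") == "1":
        print("acpi_perf hint is set; boosted levels may show up in sysctl")
```

On a real box you would pass the contents of /boot/loader.conf.local instead of `SAMPLE`.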



  • @CS said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    other folks have it enabled on APU2 without experiencing any issues.

    I confirm this 😉

    we have lot of such units at end users, they are "run" with CPB without any problems
    we basically configure these "routers / NGFWs" + pfSense with CPB

    As I wrote above, CPB has been enabled in the Coreboot BIOS for some time, but it only takes effect on one core, which runs at 1,400 MHz instead of 1,000 MHz. This is good for OpenVPN stuff, for example...

    @CS I think you shouldn't look for the rabbit in the bush...
    this is not an issue which is caused by CPB or pfSense

    I think the APU2 MOBO is damaged somewhere, cold soldering or something like that

    which causes a malfunction in the BUS or RAM operation due to the elevated clock....???

    maybe try a CPU shock test under linux and insulate the APU2 housing to warm up .....Voilà, maybe there will be results

    @kiokoman anyway, this is an AMD embedded-series CPU that cannot really be overclocked; it is designed for low-power devices.
    Either it works or it doesn't. There is no overclocking; only CPB allows a small amount of tuning...



  • @DaddyGo @fireodo I won't continue troubleshooting this honestly, the board works fine for me with CPB disabled and I still get the boosted CPU frequency by having the right settings in my loader.conf.local. I'm not even sure if my performance would get any better! Actually, I'm now wondering, just out of curiosity, if this happens when you have both, the CPU boost settings in loader.conf.local and the BIOS setting enabled.



  • @CS , every hardware has a different outcome in Q.C. Even with the same parts, but a different batch.
    Rule of thumb, 4 years lifespan (4 is death in Chinese). Nowadays you should be happy if your electronic works for more than 4 years.

    I am not sure how handy you are but you could try heating up the cpu (without thermal paste) with the heat gun on a flat surface. Keep a distance of around 10 cm and use a circular motion for around 10-15 mins. But be warned, you could burn the CPU.



  • @AKEGEC said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    but you could try heating up the cpu (without thermal paste) with the heat gun on a flat surface.

    It's a very bad idea.
    This AMD CPU reaches its maximum TDP in about 40 seconds without a cooling surface (heat sink) and dies...
    (moreover as I wrote it is an embedded CPU, soldered to the PCB)
    (earlier than the said Chinese 4-year death)

    the pcEngines stuff is stable and we have several pieces of it that has been working for 6 years (from ALIX and APU series)

    The ALIXs works as a radio os WISP PtP and AP and is constantly exposed to the weather.
    So these are not subject to your Chinese rule 😉



  • @DaddyGo said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    It's a very bad idea.
    This AMD CPU reaches its maximum TDP in about 40 seconds without a cooling surface (heat sink) and dies...
    (moreover as I wrote it is an embedded CPU, soldered to the PCB)
    (earlier than the said Chinese 4-year death)

    the pcEngines stuff is stable and we have several pieces of it that has been working for 6 years (from ALIX and APU series)

    The ALIXs works as a radio os WISP PtP and AP and is constantly exposed to the weather.
    So these are not subject to your Chinese rule 😉

    Well to solder embedded cpu you need a temperature between 200-400°c.
    Anyway I was talking about heating it up a bit. As long you are not reaching 90°c you will be fine. But if it already passed the 4 years mark, then I would leave it as it is. 😏
    I don't know why manufacturers are shortening their products' lifespan. It used to be a 15-30 year quality guarantee.



  • @AKEGEC said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    Well to solder embedded cpu you need a temperature between 200-400°c.

    Yes, the soldering temperature is a separate figure in the catalog, and the automatic production lines (soldering machines) solder only for the permitted time (2-3 s).
    The inner silicon layer of the CPU does not tolerate that temperature (ca. 150 °C max).

    I don't know, if you already had an APU MOBO in your hand?
    So this is what it looks like...
    (the metal router housing itself cools the CPU and connects to the metal surface with a bit of heat transfer)

    fc118a87-03a7-4d5d-9522-c1531c604811-image.png

    https://www.pcengines.ch/apucool.htm

    As I wrote above, one possible test method is a CPU shock test under, for example, Linux while covering the housing.

    Once we launched such a MOBO without its metal housing for testing and it "boiled" quickly under load.

    You're right, half of today's stuff won't last 25-30 years. 😞
    Back in the day, I even repaired 30-year-old cathode ray tubes and black-and-white televisions, and they continued to operate for another 10 years.

    Welcome to today's money-hungry world. hahahahaha 😉



  • @DaddyGo , did we just reveal our age? hahahahahaa
    Oh well age is just a number. 😊



  • @AKEGEC said in "Page fault while in kernel mode" on APU2 after bios/coreboot upgrade:

    did we just reveal our age?

    Not a shame ...
    the age, I think, brings wisdom

    yea and everyone is as old as he/she feel 😉

    droll - quadragenarian, hihihihi

