pfSense 2.4.2 crashing on PC Engines APU2 at random times



  • My pfSense (2.4.2) box has started crashing at random times, roughly once a week. There has been no recent configuration change, nothing is written to the logs, there is no console output, and the box loses all connectivity when it happens. It works fine again after a reboot.

    I have already upgraded the PC Engines APU firmware/BIOS and ran memtest, which came back clean without any errors.

    Any ideas how to troubleshoot this and identify the root cause? It seems hardware related and I'd like to find out what's happening. I tried booting from an SD card and from a USB drive, but it wasn't stable enough to tell whether my mSATA SSD is causing the problem.

    Thanks,
    CS


  • Rebel Alliance Developer Netgate

    If there is no console output, it is most likely hardware related. The first two things to check are cooling and the power supply. Your power supply could be failing, which would also explain the trouble booting from USB, since a USB drive draws a little more power on top of the base system.

    If you have another power supply, swap it out and test it that way.



  • I have a similar problem with mine. It seems to lock up maybe once a month. What temperature is yours running at?



  • @acascianelli, it's running at around 50-55 °C, and it happens every 10 days or so.

    @jimp, thanks for the hint. I'll look for another power supply and give it a try. I also think it's hardware related; I hope it's the drive or the power supply, which I can easily replace, and not the mainboard.



  • I finally managed to capture some errors on the console, though nothing was written to the system logs.

    
    ahcich0: Timeout on slot 10 port 0
    ahcich0: is 00000008 cs 00000000 ss 00000000 rs ffffe7ff tfd 40 serr 00000000 cmd 00406a17
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 01 e8 c7 00 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 11 port 0
    ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000800 tfd 50 serr 00000000 cmd 00406b17
    (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    (aprobe0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 12 port 0
    ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00001000 tfd 50 serr 00000000 cmd 00406c17
    (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    ahcich0: Timeout on slot 13 port 0
    ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00002000 tfd 50 serr 00000000 cmd 00406d17
    (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    (aprobe0:ahcich0:0:0:0): Error 5, Retry was blocked
    ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
    ada0: <ts16gmsa310 20120703="">s/n 20121222A55XXXXXX detached
    ...
    db:0:kdb.enter.default> textdump set
    ...
    db:0:kdb.enter.default>  capture on
    ...
    db:0:kdb.enter.default>  run lockinfo
    ...
    db:0:kdb.enter.default>  show pcpu
    ...
    db:0:kdb.enter.default>  bt
    ...
    db:0:kdb.enter.default>  ps
    ...
    db:0:kdb.enter.default>  alltrace
    ...
    db:0:kdb.enter.default>  capture off
    ...
    db:0:kdb.enter.default>  textdump dump
    ...
    ...
    Tracing command kernel pid 0 tid 100099 td 0xfffff800144fe000
    sched_switch() at sched_switch+0x4aa/frame 0xfffffe011fdb9960
    mi_switch() at mi_switch+0xe5/frame 0xfffffe011fdb9990
    sleepq_wait() at sleepq_wait+0x3a/frame 0xfffffe011fdb99c0
    _sleep() at _sleep+0x255/frame 0xfffffe011fdb9a40
    taskqueue_thread_loop() at taskqueue_thread_loop+0x121/frame 0xfffffe011fdb9a70
    fork_exit() at fork_exit+0x85/frame 0xfffffe011fdb9ab0
    fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe011fdb9ab0
    --- trap 0, rip = 0, rsp = 0, rbp = 0 ---
    db:0:kdb.enter.default>  capture off
    db:0:kdb.enter.default>  textdump dump
    textdump_writeblock: offset 801111552, error 6
    Textdump: Error 6 writing dump
    db:0:kdb.enter.default>  reset
    cpu_reset: Restarting BSP
    cpu_reset_proxy: Stopped CPU 2
    PC Engines apu2
    coreboot build 07/24/2017
    BIOS version v4.6.0
    

    This is where it gets stuck and doesn’t boot until I manually reset it.
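
    For reference, the `ahcich0: is ... cmd ...` line is the FreeBSD AHCI driver dumping a few port registers when a command times out. The Python sketch below decodes it under stated assumptions: the standard AHCI register layout (current command slot, CCS, in bits 12:8 of the port command register printed as `cmd`), `rs` read as a bitmask of slots the driver still had outstanding, and the low byte of `tfd` read as the ATA status register. `decode_ahci_timeout` is a hypothetical helper, not a pfSense or FreeBSD tool.

```python
import re

# ATA status register bits (assumed to be the low byte of the printed "tfd").
ATA_BSY = 0x80  # device busy
ATA_DRQ = 0x08  # data request pending
ATA_ERR = 0x01  # error reported

# Register dump that follows "ahcich0: Timeout on slot N port 0".
# "cm\s*d" also accepts the "cm d" split that appears in some pasted copies.
DUMP_RE = re.compile(
    r"is ([0-9a-f]{8}) cs ([0-9a-f]{8}) ss ([0-9a-f]{8}) rs ([0-9a-f]{8}) "
    r"tfd ([0-9a-f]+) serr ([0-9a-f]{8}) cm\s*d ([0-9a-f]{8})"
)
NAMES = ("is", "cs", "ss", "rs", "tfd", "serr", "cmd")


def decode_ahci_timeout(line):
    """Decode one ahcich register dump (hypothetical helper)."""
    m = DUMP_RE.search(line)
    if m is None:
        raise ValueError("not an ahcich register dump: %r" % line)
    regs = {name: int(value, 16) for name, value in zip(NAMES, m.groups())}
    status = regs["tfd"] & 0xFF
    return {
        # Bits 12:8 of the port command register: slot the HBA was working on.
        "current_slot": (regs["cmd"] >> 8) & 0x1F,
        # Every bit set in "rs" is a slot the driver still considered running.
        "running_slots": [s for s in range(32) if regs["rs"] & (1 << s)],
        "busy": bool(status & ATA_BSY),
        "drq": bool(status & ATA_DRQ),
        "error": bool(status & ATA_ERR),
        "serr": regs["serr"],
    }


dump = ("ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00008000 "
        "tfd 50 serr 00000000 cmd 00406f17")
info = decode_ahci_timeout(dump)
print(info["current_slot"], info["running_slots"], info["busy"], info["error"])
# 15 [15] False False
```

    In every dump in this thread where the preceding `Timeout on slot N` line is shown, the decoded current slot matches N, and both `tfd 40` and `tfd 50` have BSY, DRQ and ERR clear. One reading of that is a drive that reports itself idle and error-free but never completes the command; treat it as a hint, not a diagnosis.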



  • Could a moderator move this topic under “Hardware” please?



  • Hi,

    Same issue on an APU3. I noticed the same behaviour a few days ago:

    
    ahcich0: Timeout on slot 10 port 0
    ahcich0: is 00000008 cs 00000000 ss 00000000 rs ffffe7ff tfd 40 serr 00000000 cmd 00406a17
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 01 e8 c7 00 40 00 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 11 port 0
    
    

    Temperature is around 53-57 °C.

    Running coreboot v4.6.0



  • That’s interesting. Let me also highlight that the crashes do NOT happen when the system is under load.

    PC Engines apu2
    Coreboot: build 07/24/2017
    BIOS: version v4.6.0
    pfSense: 2.4.2-RELEASE-p1 (amd64)
    OS: FreeBSD 11.1-RELEASE-p6
    mSATA SSD: Transcend TS16GMSA310 16 GB - https://www.amazon.co.uk/Transcend-TS16GMSA310-16-GB-Internal/dp/B007DIS8Y2

    @software, what kind of storage do you use?


  • Rebel Alliance

    I run a number of APU2s with coreboot 4.0.7 and have not seen this issue.



  • No problems:

    System PC Engines APU2B2
    BIOS Vendor: coreboot
    Version: 88a4f96
    Release Date: Mon Mar 7 2016
    Version 2.4.2-RELEASE-p1 (amd64)
    built on Tue Dec 12 13:45:26 CST 2017
    FreeBSD 11.1-RELEASE-p6
    &
    mSATA SSD: Transcend TS16GMSA370 16 GB



  • @/CS:

    @software, what kind of storage do you use?

    It's a no-name Chinese mSATA SSD. I think the cheap SSD is now biting me in the ass.
    Same here: it happened during the night, with almost no load except some OpenVPN ping traffic and some home automation.

    
    Jan 23 23:33:46	kernel		(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    Jan 23 23:33:46	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:33:46	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:33:46	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00406f17
    Jan 23 23:33:46	kernel		ahcich0: Timeout on slot 15 port 0
    Jan 23 23:33:16	kernel		(aprobe0:ahcich0:0:0:0): Retrying command
    Jan 23 23:33:16	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:33:16	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:33:16	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00004000 tfd 50 serr 00000000 cmd 00406e17
    Jan 23 23:33:16	kernel		ahcich0: Timeout on slot 14 port 0
    Jan 23 23:32:46	kernel		(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    Jan 23 23:32:46	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:32:46	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:32:46	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000800 tfd 50 serr 00000000 cmd 00406b17
    Jan 23 23:32:46	kernel		ahcich0: Timeout on slot 11 port 0
    Jan 23 23:32:16	kernel		(aprobe0:ahcich0:0:0:0): Retrying command
    Jan 23 23:32:16	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:32:16	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:32:16	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000400 tfd 50 serr 00000000 cmd 00406a17
    Jan 23 23:32:16	kernel		ahcich0: Timeout on slot 10 port 0
    Jan 23 23:31:46	kernel		(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    Jan 23 23:31:46	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:31:46	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:31:46	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000020 tfd 50 serr 00000000 cmd 00406517
    Jan 23 23:31:46	kernel		ahcich0: Timeout on slot 5 port 0
    Jan 23 23:31:16	kernel		(aprobe0:ahcich0:0:0:0): Retrying command
    Jan 23 23:31:16	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:31:16	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:31:16	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000010 tfd 50 serr 00000000 cmd 00406417
    Jan 23 23:31:16	kernel		ahcich0: Timeout on slot 4 port 0
    Jan 23 23:30:46	kernel		(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    Jan 23 23:30:46	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:30:46	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:30:46	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000002 tfd 50 serr 00000000 cmd 00406117
    Jan 23 23:30:46	kernel		ahcich0: Timeout on slot 1 port 0
    Jan 23 23:30:16	kernel		(aprobe0:ahcich0:0:0:0): Retrying command
    Jan 23 23:30:16	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:30:16	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:30:16	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00000001 tfd 50 serr 00000000 cmd 00406017
    Jan 23 23:30:16	kernel		ahcich0: Timeout on slot 0 port 0
    Jan 23 23:29:46	kernel		(aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted
    Jan 23 23:29:46	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:29:46	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:29:46	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 08000000 tfd 50 serr 00000000 cmd 00407b17
    Jan 23 23:29:46	kernel		ahcich0: Timeout on slot 27 port 0
    Jan 23 23:29:16	kernel		(aprobe0:ahcich0:0:0:0): Retrying command
    Jan 23 23:29:16	kernel		(aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Jan 23 23:29:16	kernel		(aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Jan 23 23:29:16	kernel		ahcich0: is 00000002 cs 00000000 ss 00000000 rs 04000000 tfd 50 serr 00000000 cmd 00407a17
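
    Note the spacing in this log: each `Timeout on slot` entry arrives 30 seconds after the previous one. That matches FreeBSD's default `ada` command timeout (the `kern.cam.ada.default_timeout` sysctl, 30 seconds unless changed), which suggests the drive is not answering at all rather than answering slowly. A small Python sketch to measure the gaps, assuming the `Mon DD HH:MM:SS` prefix shown here and that all lines fall on the same day; `timeout_gaps` is a hypothetical name.

```python
from datetime import datetime


def timeout_gaps(lines):
    """Seconds between consecutive 'Timeout on slot' lines (hypothetical helper).

    Assumes every line starts with a 15-character syslog stamp such as
    "Jan 23 23:33:46" or "Apr  8 11:12:10" and that all lines fall on the
    same day; there is no midnight or year rollover handling.
    """
    stamps = []
    for line in lines:
        if "Timeout on slot" not in line:
            continue
        # Syslog stamps carry no year, so a dummy (leap) year is prepended;
        # only the differences between stamps matter here.
        stamps.append(datetime.strptime("2000 " + line[:15], "%Y %b %d %H:%M:%S"))
    stamps.sort()  # the GUI log view lists the newest entry first
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


log = [
    "Jan 23 23:33:46\tkernel\t\tahcich0: Timeout on slot 15 port 0",
    "Jan 23 23:33:16\tkernel\t\tahcich0: Timeout on slot 14 port 0",
    "Jan 23 23:32:46\tkernel\t\tahcich0: Timeout on slot 11 port 0",
]
print(timeout_gaps(log))  # [30.0, 30.0]
```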
    
    


  • Looks like crappy disks to me.



  • @KOM:

    Looks like crappy disks to me.

    I think so too.

    I have ordered a new “Transcend 32GB mSATA SSD (TS32GMSA370)” and I’ll keep you posted.



  • Did your system crash before you upgraded the BIOS to 4.6.0? Which BIOS version were you running then? 4.0.x or 4.5.x?

    I ask because the version you are currently running is not recommended for pfSense or FreeBSD. PC Engines has a warning on their BIOS/Howto page:

    For FreeBSD based OS like OPNSense and pfSense please use the legacy versions.

    Note: “legacy” versions are 4.0.x
    See: http://pcengines.ch/howto.htm#bios

    There have been several reports about issues with the newer coreboot releases (afaik not this issue though), so I wouldn’t be surprised if this is caused by the firmware and not the disk itself.

    That being said, if you have seen this on the older 4.0.x firmware as well, then it's likely a disk issue.
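
    The PC Engines guidance quoted above comes down to one check: is the coreboot version in the 4.0.x ("legacy") series? A trivial Python sketch of that check, run against the version strings reported in this thread. `is_legacy_coreboot` is a hypothetical helper; it assumes a version is an optional `v` followed by dot-separated numbers and returns `None` for anything else, such as the `88a4f96` git hash one poster reported.

```python
import re

# "v4.6.0", "4.0.12", "4.0.7.1": optional "v", then dot-separated numbers.
VERSION_RE = re.compile(r"v?(\d+)\.(\d+)(?:\.\d+)*")


def is_legacy_coreboot(version):
    """True for a 4.0.x ("legacy") build, False for any other dotted version,
    None if the string is not a dotted version at all (e.g. a git hash).
    Hypothetical helper."""
    m = VERSION_RE.fullmatch(version.strip())
    if m is None:
        return None
    return (int(m.group(1)), int(m.group(2))) == (4, 0)


# Versions mentioned in this thread.
for v in ("v4.6.0", "4.0.7", "4.0.12", "4.0.7.1", "88a4f96"):
    print(v, is_legacy_coreboot(v))
```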



  • @silentcreek:

    Did your system crash before you upgraded the BIOS to 4.6.0? Which BIOS version were you running then? 4.0.x or 4.5.x?

    That being said, if you have seen this on the older firmware 4.0x. as well, then it’s likely a disk issue.

    It actually happened when I was on the older firmware, though I don't recall the version. I just thought it was a good opportunity to upgrade to the latest one in the hope that it might help.
    I’ll find out soon.



  • I bought a brand new SSD (Transcend 32GB SATA III 6Gb/s MSA370 mSATA SSD - TS32GMSA370) and tried to reinstall pfSense by booting the installer from one of the USB ports, but I get similar errors during the installation and it eventually fails. I tried multiple times, and it always fails while extracting the distribution files. Any ideas what's going on? The board itself, maybe? I hope not…

    
     pfSense Installer
     ──────────────────────────────────────────────────────────────────────────────
    ahcich0: Timeout on slot 15 port 0
    ahcich0: is 00000008 cs 00000000 ss 00000000 rs 0000ffc0 tfd 40 serr 00000000 cmd 00406f17
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 a8 60 62 40 03 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    
                          ┌───────Archive Extraction─────────┐
                          │ Extracting distribution files... │
                          │                                  │
                          │                                  │
                          │   Overall Progress:              │
                          │  ┌────────────────────────────┐  │
                          │  │             23%            │  │
                          │  └────────────────────────────┘  │
                          └──────────────────────────────────┘
    
     pfSense Installer
     ──────────────────────────────────────────────────────────────────────────────
    ahcich0: Timeout on slot 15 port 0
    ahcich0: is 00000008 cs 00000000 ss 00000000 rs 0000ffc0 tfd 40 serr 00000000 cmd 00406f17
    (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 40 a8 60 62 40 03 00 00 00 00 00
    (ada0:ahcich0:0:0:0): CAM status: Command timeout
    (ada0:ahcich0:0:0:0): Retrying command
    ahcich0: Timeout on slot 16 port 0
    ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00010000 tfd 50 serr 00000000 cmd 00407017
    (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    (aprobe0:ahcich0:0:0:0): CAM status: Command timeout     │
    (aprobe0:ahcich0:0:0:0): Retrying command                │
                          │   Overall Progress:              │
                          │  ┌────────────────────────────┐  │
                          │  │             23%            │  │
                          │  └────────────────────────────┘  │
                          └──────────────────────────────────┘
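
    The `ACB:` bytes in the `WRITE_FPDMA_QUEUED` lines can be decoded to see which write was stuck. A Python sketch, assuming CAM prints the twelve bytes in the order command, features, LBA low/mid/high, device, extended LBA low/mid/high, extended features, sector count, extended sector count (consistent with the `ATA_IDENTIFY` lines, where 0xec comes first and the device byte 0x40 sixth; check FreeBSD's `sys/cam/ata/ata_all.c` before relying on it), and that, as for NCQ commands generally, the transfer length is carried in the two features bytes. `decode_fpdma_acb` is a hypothetical name.

```python
def decode_fpdma_acb(acb):
    """Decode a CAM 'ACB:' byte string for READ/WRITE FPDMA QUEUED (hypothetical helper).

    Assumed byte order: command, features, lba_low, lba_mid, lba_high, device,
    lba_low_exp, lba_mid_exp, lba_high_exp, features_exp, sector_count,
    sector_count_exp.
    """
    b = [int(x, 16) for x in acb.split()]
    if len(b) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(b))
    command = b[0]
    if command not in (0x60, 0x61):  # 0x60 READ, 0x61 WRITE FPDMA QUEUED
        raise ValueError("not an FPDMA QUEUED command: 0x%02x" % command)
    # 48-bit LBA assembled from the six LBA bytes.
    lba = (b[2] | b[3] << 8 | b[4] << 16 |
           b[6] << 24 | b[7] << 32 | b[8] << 40)
    # For NCQ commands the sector count sits in the features fields;
    # a value of 0 means 65536 sectors.
    count = b[1] | b[9] << 8
    if count == 0:
        count = 65536
    return {"op": "WRITE" if command == 0x61 else "READ",
            "lba": lba, "sectors": count}


for acb in ("61 01 e8 c7 00 40 00 00 00 00 00 00",
            "61 40 a8 60 62 40 03 00 00 00 00 00"):
    d = decode_fpdma_acb(acb)
    print(d["op"], d["lba"], d["sectors"])
```

    Under these assumptions the installer's stuck write is 64 sectors (32 KiB) at LBA 56778920, roughly 29 GB into the new 32 GB drive: an ordinary in-range write, not an access past the end of the disk.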
    
    


  • For FreeBSD based OS like OPNSense and pfSense please use the legacy versions.

    Reference: http://pcengines.ch/howto.htm#bios

    The comment above was the reason I went to https://github.com/pcengines/apu2-documentation#binary-releases and downloaded http://pcengines.ch/file/apu2_v4.0.12.rom.tar.gz, which is the latest "Legacy" version. After downgrading to 4.0.12, the installation of pfSense on my new SSD completed smoothly and without any errors.

    Let's wait a week or so now to see whether the system crashes again.



  • Uptime: 22 days

    I have had no crashes since I downgraded to the stable firmware and replaced my SSD.  😉



  • @/CS:

    Thanks! I have a similar problem and I will try the downgrade and report back.

    Apr  8 11:12:10 pfsense kernel: ahcich0: Timeout on slot 14 port 0
    Apr  8 11:12:10 pfsense kernel: ahcich0: is 00000008 cs 00000000 ss 00000000 rs 00007800 tfd 40 serr 00000000 cmd 00406e17
    Apr  8 11:12:10 pfsense kernel: (ada0:ahcich0:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 01 0f c3 00 40 00 00 00 00 00 00
    Apr  8 11:12:10 pfsense kernel: (ada0:ahcich0:0:0:0): CAM status: Command timeout
    Apr  8 11:12:10 pfsense kernel: (ada0:ahcich0:0:0:0): Retrying command
    Apr  8 11:12:40 pfsense kernel: ahcich0: Timeout on slot 15 port 0
    Apr  8 11:12:40 pfsense kernel: ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00008000 tfd 50 serr 00000000 cmd 00406f17
    Apr  8 11:12:40 pfsense kernel: (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Apr  8 11:12:40 pfsense kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Apr  8 11:12:40 pfsense kernel: (aprobe0:ahcich0:0:0:0): Retrying command
    Apr  8 11:13:10 pfsense kernel: ahcich0: Timeout on slot 16 port 0
    Apr  8 11:13:10 pfsense kernel: ahcich0: is 00000002 cs 00000000 ss 00000000 rs 00010000 tfd 50 serr 00000000 cmd 00407017
    Apr  8 11:13:10 pfsense kernel: (aprobe0:ahcich0:0:0:0): ATA_IDENTIFY. ACB: ec 00 00 00 00 40 00 00 00 00 00 00
    Apr  8 11:13:10 pfsense kernel: (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
    Apr  8 11:13:10 pfsense kernel: (aprobe0:ahcich0:0:0:0): Error 5, Retries exhausted



  • Everything works absolutely fine after the BIOS downgrade. I am now running coreboot 4.0.7.1. Uptime is 1 day and more than 6 hours.


 

© Copyright 2002 - 2018 Rubicon Communications, LLC